Lead Site Reliability Engineer

EPAM Systems
Ruda Śląska, Silesian Voivodeship
2 weeks ago

We are seeking a highly experienced and motivated Lead Site Reliability Engineer to ensure the reliability, scalability, and efficiency of our critical cloud-based infrastructure, primarily built on Azure platforms.

This pivotal role requires in-depth technical expertise, strategic problem-solving skills, and the ability to drive collaboration across cross-functional teams to maximize system uptime and performance.

Responsibilities

  • Drive reliability and scalability across Azure cloud systems, with a focus on AKS and other Azure services
  • Improve proactive monitoring and observability using tools such as Azure Monitor, Log Analytics, Application Insights, Grafana, and Prometheus
  • Optimize performance and uptime of enterprise infrastructure to meet stringent SLA requirements
  • Automate infrastructure deployment and scaling using Terraform and Azure DevOps Pipelines for seamless operations
  • Troubleshoot and resolve complex technical issues across Azure ecosystems, conducting root-cause analysis and postmortems for critical incidents
  • Strengthen monitoring and alerting systems to detect and prevent potential issues before they affect SLAs
  • Collaborate with development, DevOps, and IT operations teams to integrate site reliability principles into workflows
  • Use scripting (Bash, PowerShell, Python) to automate routine tasks, performance reporting, and incident recovery processes
  • Promote secure and scalable practices for managing cloud resources in Azure and AWS environments
  • Lead incident response practices, ensuring real-time resolution and long-term fixes to recurring system issues
  • Establish and refine observability tools and practices to provide deeper insights into system performance
  • Mentor junior engineers and promote a culture of reliability, automation, and continuous improvement in the organization

Requirements

  • 7+ years of experience in site reliability engineering, cloud infrastructure management, or related roles
  • Expertise in Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
  • Strong hands-on experience with Terraform, Azure DevOps, and scripting languages (Bash, PowerShell, Python)
  • Deep knowledge of observability tools such as Grafana, Prometheus, and Azure-native monitoring platforms
  • Competency in incident management, including root-cause analysis and resolution strategies for SLA-critical issues
  • Background in implementing Infrastructure-as-Code (IaC) for automated workload provisioning and scaling
  • Competency in debugging and triaging complex systems, ensuring secure and resilient cloud deployments
  • Proficiency in working within Agile development environments while managing competing priorities
  • Excellent communication skills to articulate technical issues and solutions across diverse teams

Nice to have

  • Experience with AWS services, including EKS, CloudWatch, X-Ray, and RDS (PostgreSQL)
  • Familiarity with distributed logging pipelines and with reducing the resource overhead of monitoring in Kubernetes environments
  • Skills in using AWS CloudWatch, OpenSearch, or other AWS-native monitoring solutions for observability
  • Knowledge of advanced Kubernetes features for networking configurations and scaling across Azure AKS/AWS EKS
  • Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer that demonstrate expertise in cloud solutions

We offer

  • We gather like-minded people:
    • Engineering community of industry professionals
    • Friendly team and enjoyable working environment
    • Flexible schedule and opportunity to work remotely within Poland
    • Chance to work abroad for up to 60 days annually
    • Business-driven relocation opportunities
  • We provide growth opportunities:
    • Outstanding career roadmap
    • Leadership development, career advising, soft skills, and well-being programs
    • Certification (GCP, Azure, AWS)
    • Unlimited access to LinkedIn Learning, getAbstract, and A Cloud Guru
    • English classes
  • We cover it all:
    • Stable income (Employment Contract or B2B)
    • Participation in the Employee Stock Purchase Plan
    • Benefits package (health insurance, multisport, shopping vouchers)
    • Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more
    • Referral bonuses
    • Corporate, social and well-being events
  • Please note:
    • The set of bonuses may vary based on the role you apply for; specifics will be discussed with our recruiter during the general interview.
    • We will contact selected candidates only.

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
