We are seeking a highly experienced and motivated Lead Site Reliability Engineer to ensure the reliability, scalability, and efficiency of our critical cloud-based infrastructure, primarily built on Azure platforms.

This pivotal role requires in-depth technical expertise, strategic problem-solving capabilities, and the ability to drive collaboration across cross-functional teams to deliver maximum system uptime and performance.

Responsibilities

Drive reliability and scalability across Azure cloud systems, with a focus on AKS and other Azure services
Ensure proactive monitoring and observability enhancements using tools like Azure Monitor, Log Analytics, Application Insights, Grafana, and Prometheus
Optimize performance and uptime of enterprise infrastructure to meet stringent SLA requirements
Automate infrastructure deployment and scaling using Terraform and Azure DevOps Pipelines for seamless operations
Troubleshoot and resolve complex technical issues across Azure ecosystems, conducting root-cause analysis and postmortems for critical incidents
Strengthen monitoring and alerting systems to detect and prevent potential issues before they affect SLAs
Collaborate with development, DevOps, and IT operations teams to integrate site reliability principles into workflows
Use scripting (Bash, PowerShell, Python) to automate routine tasks, performance reporting, and incident recovery processes
Promote secure and scalable solutions in managing cloud resources within Azure and AWS environments
Lead incident response practices, ensuring real-time resolution and long-term fixes to recurring system issues
Establish and refine observability tools and practices to provide deeper insights into system performance
Mentor junior engineers and promote a culture of reliability, automation, and continuous improvement in the organization

Requirements

7+ years of experience in site reliability engineering, cloud infrastructure management, or related roles
Expertise in Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
Strong hands-on experience with Terraform, Azure DevOps, and scripting languages (Bash, PowerShell, Python)
Deep knowledge of observability tools such as Grafana, Prometheus, and Azure-native monitoring platforms
Competency in incident management, including the use of root-cause analysis and resolution strategies for SLA-critical issues
Background in implementing Infrastructure-as-Code (IaC) for automated workload provisioning and scaling
Competency in debugging and triaging complex systems, ensuring secure and resilient cloud deployments
Proficiency in working within Agile development environments while managing competing priorities
Excellent communication skills to articulate technical issues and solutions across diverse teams

Nice to have

Experience with AWS services, including EKS, CloudWatch, X-Ray, and RDS (PostgreSQL)
Familiarity with distributed logging pipelines and optimization of monitoring resource impacts in Kubernetes environments
Skills in using AWS CloudWatch, OpenSearch, or other AWS-native monitoring solutions for observability
Knowledge of advanced Kubernetes features for networking configurations and scaling across Azure AKS/AWS EKS
Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer that demonstrate expertise in cloud solutions

We offer

We gather like-minded people:
- Engineering community of industry professionals
- Friendly team and enjoyable working environment
- Flexible schedule and opportunity to work remotely within Poland
- Chance to work abroad for up to 60 days annually
- Business-driven relocation opportunities
We provide growth opportunities:
- Outstanding career roadmap
- Leadership development, career advising, soft skills, and well-being programs
- Certification (GCP, Azure, AWS)
- Unlimited access to LinkedIn Learning, getAbstract, A Cloud Guru
- English classes
We cover it all:
- Stable income (Employment Contract or B2B)
- Participation in the Employee Stock Purchase Plan
- Benefits package (health insurance, multisport, shopping vouchers)
- Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more
- Referral bonuses
- Corporate, social and well-being events
Please note:
- The set of bonuses might vary based on the role you apply for – specifics will be discussed with our recruiter during the general interview.
- We will contact selected candidates only.

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
