We are seeking a highly experienced and motivated Lead Site Reliability Engineer to ensure the reliability, scalability, and efficiency of our critical cloud-based infrastructure, primarily built on Azure platforms.
This pivotal role requires in-depth technical expertise, strategic problem-solving capabilities, and the ability to drive collaboration across cross-functional teams to deliver maximum system uptime and performance.
Responsibilities
- Drive reliability and scalability across Azure cloud systems, with a focus on AKS and other Azure services
- Ensure proactive monitoring and observability enhancements using tools like Azure Monitor, Log Analytics, Application Insights, Grafana, and Prometheus
- Optimize performance and uptime of enterprise infrastructure to meet stringent SLA requirements
- Automate infrastructure deployment and scaling using Terraform and Azure DevOps Pipelines for seamless operations
- Troubleshoot and resolve complex technical issues across Azure ecosystems, conducting root-cause analysis and postmortems for critical incidents
- Strengthen monitoring and alerting systems to detect and prevent potential issues before they affect SLAs
- Collaborate with development, DevOps, and IT operations teams to integrate site reliability principles into workflows
- Use scripting (Bash, PowerShell, Python) to automate routine tasks, performance reporting, and incident recovery processes
- Promote secure and scalable solutions for managing cloud resources within Azure and AWS environments
- Lead incident response practices, ensuring real-time resolution and long-term fixes to recurring system issues
- Establish and refine observability tools and practices to provide deeper insights into system performance
- Mentor junior engineers and promote a culture of reliability, automation, and continuous improvement in the organization
Requirements
- 7+ years of experience in site reliability engineering, cloud infrastructure management, or related roles
- Expertise in Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, PostgreSQL
- Strong hands-on experience with Terraform, Azure DevOps, and scripting languages (Bash, PowerShell, Python)
- Deep knowledge of observability tools such as Grafana, Prometheus, and Azure-native monitoring platforms
- Competency in incident management, including the use of root-cause analysis and resolution strategies for SLA-critical issues
- Background in implementing Infrastructure-as-Code (IaC) for automated workload provisioning and scaling
- Competency in debugging and triaging complex systems, ensuring secure and resilient cloud deployments
- Proficiency in working within Agile development environments while managing competing priorities
- Excellent communication skills to articulate technical issues and solutions across diverse teams
Nice to have
- Experience with AWS services, including EKS, CloudWatch, X-Ray, RDS (PostgreSQL)
- Familiarity with distributed logging pipelines and optimization of monitoring resource impacts in Kubernetes environments
- Skills in using AWS CloudWatch, OpenSearch, or other AWS-native monitoring solutions for observability
- Knowledge of advanced Kubernetes features for networking configurations and scaling across Azure AKS/AWS EKS
- Certifications like Microsoft Azure Administrator or AWS Certified DevOps Engineer demonstrating expertise in cloud solutions
We offer
We gather like-minded people:
- Engineering community of industry professionals
- Friendly team and enjoyable working environment
- Flexible schedule and opportunity to work remotely within Poland
- Chance to work abroad for up to 60 days annually
- Business-driven relocation opportunities
We provide growth opportunities:
- Outstanding career roadmap
- Leadership development, career advising, soft skills, and well-being programs
- Certification (GCP, Azure, AWS)
- Unlimited access to LinkedIn Learning, Get Abstract, Cloud Guru
- English classes
We cover it all:
- Stable income (Employment Contract or B2B)
- Participation in the Employee Stock Purchase Plan
- Benefits package (health insurance, multisport, shopping vouchers)
- Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more
- Referral bonuses
- Corporate, social and well-being events
Please note:
- The set of bonuses might vary based on the role you apply for – specifics will be discussed with our recruiter during the general interview.
- We will contact only selected candidates.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.