Lead Site Reliability Engineer

EPAM Systems
Ruda Śląska, Silesian Voivodeship
3 weeks ago

We are seeking a highly experienced and motivated Lead Site Reliability Engineer to ensure the reliability, scalability, and efficiency of our critical cloud-based infrastructure, primarily built on Azure platforms.

This pivotal role requires in-depth technical expertise, strategic problem-solving skills, and the ability to drive collaboration across cross-functional teams to maximize system uptime and performance.

Responsibilities

  • Drive reliability and scalability across Azure cloud systems, with a focus on AKS and other Azure services
  • Ensure proactive monitoring and observability enhancements using tools like Azure Monitor, Log Analytics, Application Insights, Grafana, and Prometheus
  • Optimize performance and uptime of enterprise infrastructure to meet stringent SLA requirements
  • Automate infrastructure deployment and scaling using Terraform and Azure DevOps Pipelines for seamless operations
  • Troubleshoot and resolve complex technical issues across Azure ecosystems, conducting root-cause analysis and postmortems for critical incidents
  • Strengthen monitoring and alert systems to detect and prevent potential issues before SLA impact
  • Collaborate with development, DevOps, and IT operations teams to integrate site reliability principles into workflows
  • Use scripting (Bash, PowerShell, Python) to automate routine tasks, performance reporting, and incident recovery processes
  • Promote secure and scalable practices for managing cloud resources in Azure and AWS environments
  • Lead incident response, ensuring prompt resolution of incidents and long-term fixes for recurring system issues
  • Establish and refine observability tools and practices to provide deeper insights into system performance
  • Mentor junior engineers and promote a culture of reliability, automation, and continuous improvement in the organization

Requirements

  • 7+ years of experience in site reliability engineering, cloud infrastructure management, or related roles
  • Expertise in Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
  • Strong hands-on experience with Terraform, Azure DevOps, and scripting languages (Bash, PowerShell, Python)
  • Deep knowledge of observability tools such as Grafana, Prometheus, and Azure-native monitoring platforms
  • Competency in incident management, including root-cause analysis and resolution strategies for SLA-critical issues
  • Background in implementing Infrastructure-as-Code (IaC) for automated workload provisioning and scaling
  • Competency in debugging and triaging complex systems, ensuring secure and resilient cloud deployments
  • Proficiency in working within Agile development environments while managing competing priorities
  • Excellent communication skills to articulate technical issues and solutions across diverse teams

Nice to have

  • Experience with AWS services, including EKS, CloudWatch, X-Ray, and RDS (PostgreSQL)
  • Familiarity with distributed logging pipelines and optimization of monitoring resource impacts in Kubernetes environments
  • Skills in using AWS CloudWatch, OpenSearch, or other AWS-native monitoring solutions for observability
  • Knowledge of advanced Kubernetes features for networking configurations and scaling across Azure AKS/AWS EKS
  • Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer that demonstrate expertise in cloud solutions

We offer

  • We gather like-minded people:
    • Engineering community of industry professionals
    • Friendly team and enjoyable working environment
    • Flexible schedule and opportunity to work remotely within Poland
    • Chance to work abroad for up to 60 days annually
    • Business-driven relocation opportunities
  • We provide growth opportunities:
    • Outstanding career roadmap
    • Leadership development, career advising, soft skills, and well-being programs
    • Certification (GCP, Azure, AWS)
    • Unlimited access to LinkedIn Learning, getAbstract, and A Cloud Guru
    • English classes
  • We cover it all:
    • Stable income (Employment Contract or B2B)
    • Participation in the Employee Stock Purchase Plan
    • Benefits package (health insurance, multisport, shopping vouchers)
    • Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more
    • Referral bonuses
    • Corporate, social and well-being events
  • Please note:
    • The set of bonuses might vary based on the role you apply for – specifics will be discussed with our recruiter during the general interview.
    • We will reach out to selected candidates exclusively.

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
