We are looking for a Senior Site Reliability Engineer to join our Reliability Tooling team, where you'll play a pivotal role in designing, building, and improving scalable solutions that enhance system reliability.
As a respected expert within the team, you will contribute to technical strategy, mentor engineers, and advocate for SRE best practices to optimize service delivery and operational efficiency.
Responsibilities
- Build tools to enable your team to identify and resolve infrastructure, platform, and application issues
- Use Chaos Engineering methodologies to test reliability of systems under real-world conditions
- Deploy and manage modern cloud technologies leveraging Infrastructure as Code and self-healing patterns
- Develop effective telemetry, alerts, and automated responses to minimize Mean Time to Recovery (MTTR)
- Provide technical guidance and expertise in cross-team collaborations
- Develop frameworks and practices for sustainable incident response via blameless postmortems and SRE methods
- Identify reliability and operational inefficiencies to promote continuous improvement
- Write code that enhances scalability, security, and maintainability of critical systems
- Foster team involvement in delivering thoughtful and high-quality software solutions
- Mentor team members in core SRE principles to support professional development
Requirements
- 3+ years of hands-on experience in SRE, DevOps, systems engineering, or software engineering
- Strong communication skills, both written and verbal
- Enthusiasm for learning and exploring new technologies
- Expertise in Cloud/PaaS/SaaS tools and platforms (e.g. AWS, Azure, GCP)
- Proficiency in container technologies within enterprise environments (e.g. Docker, Kubernetes, AWS ECS and EKS)
- Programming experience in Python, Go, Rust, or a similar language
Nice to have
- Familiarity with DevOps methodologies and SRE principles
- Background in monitoring and observability solutions
- Experience with automation and Infrastructure as Code tools such as Terraform, CloudFormation, or Ansible
- Understanding of Service Level Objectives and error budgets
- Experience with scalable software development in languages such as Java or Scala
We offer
We gather like-minded people:
- Engineering community of industry professionals
- Friendly team and enjoyable working environment
- Flexible schedule and opportunity to work remotely within Poland
- Chance to work abroad for up to 60 days annually
- Business-driven relocation opportunities
We provide growth opportunities:
- Outstanding career roadmap
- Leadership development, career advising, soft skills, and well-being programs
- Certification (GCP, Azure, AWS)
- Unlimited access to LinkedIn Learning, getAbstract, and A Cloud Guru
- English classes
We cover it all:
- Stable income (Employment Contract or B2B)
- Participation in the Employee Stock Purchase Plan
- Benefits package (health insurance, multisport, shopping vouchers)
- Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more
- Referral bonuses
- Corporate, social and well-being events
Please note:
- The set of benefits may vary depending on the role you apply for; our recruiter will discuss the specifics during the general interview.
- We will contact selected candidates only.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a wide range of innovative projects that deliver creative, cutting-edge solutions, and have the opportunity to learn and grow continuously. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.