About Graphcore
Graphcore is one of the world's leading innovators in Artificial Intelligence compute.
It is developing hardware, software and systems infrastructure that will unlock the next generation of AI breakthroughs and power the widespread adoption of AI solutions across every industry.
As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world's most transformative technologies. Together, they share a bold vision: to enable Artificial Super Intelligence and ensure its benefits are accessible to everyone.
Graphcore's teams are drawn from diverse backgrounds and bring a broad range of skills and perspectives. A melting pot of AI research specialists, silicon designers, software engineers and systems architects, Graphcore enjoys a culture of continuous learning and constant innovation.
Job Summary
As a Senior Software Engineer in the ML Software Performance Analysis team, you will play a critical role in ensuring end-to-end performance excellence of our proprietary AI hardware and software stack. You will report directly to the Performance Analysis Team Lead and collaborate closely with component teams, including ML Framework developers, Compiler and Runtime teams, Infrastructure engineers, and Product Management. Your work will shape the efficiency and scalability of our ML software solutions, significantly impacting our business by enabling reliable and performant AI solutions for customers.
The Team
The ML Software Performance Analysis team is part of the wider ML Software Engineering organisation, responsible for delivering optimised, proprietary machine learning solutions. Our team consists of experienced engineers and domain experts focused on rigorous performance benchmarking, in-depth analysis, and cross-layer optimisation, from a single chip to large-scale distributed systems.
We work closely with both internal partners and external collaborators to ensure our solutions meet the highest standards of performance, efficiency, and scalability.
Our core responsibilities include:
- ML Software Stack Performance Reports – We publish regular reports that provide a comprehensive view of the performance status of the ML software stack
- End-to-End Performance Optimization – We take a holistic approach to performance, ensuring that local optimizations do not lead to global regressions. Our work spans component boundaries, enabling balanced and efficient performance across the entire stack
Responsibilities and Duties
- Conduct in-depth analysis of performance metrics to identify bottlenecks, inefficiencies, and regression trends across the ML stack
- Collaborate with cross-functional teams to drive end-to-end performance improvements across software components
- Prepare and deliver performance reports, summarizing key findings, trends, and recommendations
- Design, implement, and maintain performance benchmarking tools and infrastructure for large-scale ML software systems
- Investigate and resolve performance-related issues, including CPU utilization, memory usage, and network overhead
- Ensure that local optimizations do not negatively impact overall system performance, applying a global performance perspective
- Provide actionable feedback and guidance to engineering teams to support continuous performance optimization
Candidate Profile
Essential:
- A passion for your work and the ability to thrive in uncertain and complex environments
- Strong programming skills in Python/C/C++, with a focus on performance-sensitive applications
- Solid understanding of computer architecture, performance profiling, and low-level system behaviour (CPU, memory, I/O)
- Experience with benchmarking and analysing complex, distributed systems
- Familiarity with Linux-based development environments and tools
- Strong problem-solving skills and ability to interpret and communicate performance data clearly
Desirable:
- Knowledge of ML frameworks (ideally PyTorch) and their performance characteristics
- Experience with performance analysis in GPU-accelerated environments (CUDA, ROCm, etc.)
- Familiarity with hardware performance characteristics, especially in an ML context, including high-speed networking (e.g., RoCE, RDMA)
- Familiarity with distributed computing frameworks (ideally with experience of collective communication operations)
- Experience building dashboards or visualizations for performance monitoring (e.g., Grafana, Prometheus, or custom tooling)
- Exposure to performance regression tracking and CI pipelines for performance validation
Benefits
In addition to a competitive salary, Graphcore offers an annual leave policy, medical and dental health plans, a gym card, and an employee pension (matched up to 4%). We review our benefits yearly to ensure we offer a valuable and rewarding benefits programme to our employees.
We welcome people of different backgrounds and experiences; we're committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interviews and encourage you to talk to us if you require any reasonable adjustments.