Senior Infrastructure Engineer - Supercomputing in Sunnyvale

Energy Jobline is the largest and fastest growing global Energy Job Board and Energy Hub. We have an audience reach of over 7 million energy professionals, 400,000+ monthly advertised global energy and engineering jobs, and work with the leading energy companies worldwide.

We focus on the Oil & Gas, Renewables, Engineering, Power, and Nuclear markets as well as emerging technologies in EV, Battery, and Fusion. We are committed to ensuring that we offer the most exciting career opportunities from around the world for our jobseekers.

Job Description

About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.
As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.


The Role
We are operating some of the world’s largest GPU supercomputing clusters to support cutting-edge AI research and large-scale model deployment. We’re looking for an Infrastructure Engineer to join our core platform team to help build, operate, and scale our hybrid infrastructure across both on-prem and cloud environments.
This role is ideal for engineers who thrive at the intersection of distributed systems, cloud automation, and high-performance computing.

Key Responsibilities

  • Operate and scale high-performance GPU clusters used for AI training and production inference.
  • Manage infrastructure across on-premise (Slurm-based) HPC environments and cloud providers like AWS and Azure.
  • Implement and maintain Infrastructure as Code using Pulumi, Terraform, or Ansible.
  • Enhance and secure deployment pipelines using Kubernetes, Flux, and ArgoCD.
  • Help define and enforce security best practices for internal researchers and production services.
  • Continuously improve observability, resiliency, and operational tooling across environments.

Tech Stack

  • Kubernetes, Slurm
  • Pulumi, Terraform, Ansible
  • Rust and Go
  • Flux, ArgoCD
  • AWS, Azure

Professional Experience

  • Strong experience managing compute infrastructure in hybrid environments (on-prem and cloud).
  • Hands-on experience operating Slurm clusters at scale.
  • Proficiency in deploying and managing containerized applications, ideally written in Rust or Go.
  • Solid background in IaC and CI/CD best practices.
  • Experience working with GPU workloads or HPC infrastructure is a strong plus.
  • Familiarity with securing and monitoring multi-tenant compute environments.

Salary depends on level.

Visa Sponsorship
This position is eligible for visa sponsorship.
Benefits Include

  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401K Plan
  • Generous paid time off, sick leave and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life insurance and

If you are interested in applying for this job please press the Apply Button and follow the application process. Energy Jobline wishes you the very best of luck in your next career move.

Sunnyvale, CA
Full time

Published on 10/27/2025
