Job Description

Location: Berkeley, CA (On-site at Lawrence Berkeley National Laboratory)
Employment Type: 5–6 Month Contract (Extension Possible)
Pay Rate: $80/hr + Full Benefits (Medical, Dental, Vision, 401k)
Employer: Bay Systems Consulting

About the Role

Bay Systems Consulting is seeking a Site Reliability Engineer (SRE) to support the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. NERSC’s mission is to accelerate scientific discovery through high-performance computing and data analysis for the U.S. Department of Energy’s Office of Science.

As an SRE in the Operations Group, you will help ensure the accessibility, reliability, security, and availability of world-class HPC systems that support over 10,000 scientific users. You will work with state-of-the-art monitoring systems (such as OMNI), respond to real-time alerts, automate processes, and improve the reliability of mission-critical infrastructure.

Key Responsibilities

Monitor and support NERSC’s HPC facility as part of a 24x7 operations team (including some overnight “OWL” shifts).
Respond to alerts from computer systems, storage, networks, and data center infrastructure by triaging issues or engaging on-call staff.
Develop automation to handle routine service conditions and improve system efficiency.
Maintain and enhance monitoring tools, pipelines, and alerting systems.
Create and maintain scripts and software to integrate HPC system APIs into monitoring pipelines.
Collaborate with cross-functional NERSC groups to coordinate maintenance activities and manage diagnostic software.
Document and track outages, incidents, and maintenance in the ticketing system.
Troubleshoot and resolve diverse technical issues involving HPC, networking, and infrastructure.

Qualifications

Required (Level 2):

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent work experience).
5+ years of related experience (or 3+ years with a Master’s).
Strong Linux/Unix administration and command-line skills.
Proficiency with programming/scripting (Python, C/C++, Perl, Java, or similar).
Experience supporting highly available systems in large-scale data centers.
Familiarity with networking, firewalls, ACLs, and network protocols.
Knowledge of automation and monitoring tools (e.g., Kubernetes, Prometheus, Alertmanager).
Strong troubleshooting and communication skills.

(Level 3):

8+ years of relevant experience (or 6+ with a Master’s).
Expertise in software development and monitoring pipeline design.
Experience leading technical projects and mentoring junior staff.
Advanced knowledge of data center management technologies.
