Lead Associate Principal, Software Engineering: DevOps in Chicago
Job Description
About the Role
We are seeking an experienced Site Reliability Engineer / DevOps Infrastructure Lead to support a highly scalable, cloud-based technology platform. This individual will collaborate closely with product, infrastructure, operations, security, architecture, network, testing, and production control teams to gather technical requirements, improve platform reliability, and drive operational excellence.
The ideal candidate will bring strong experience in AWS, Kubernetes, Kafka, CI/CD, Terraform, Ansible, observability, incident management, and large-scale distributed systems. This role requires a hands-on technical leader who can guide infrastructure implementation, promote reliability best practices, improve system observability, and support high-performance, multi-region cloud environments.
Key Responsibilities
- Guide the implementation of CI/CD pipelines within a Kubernetes environment.
- Review, configure, and support execution of Terraform and Ansible automation pipelines delivered by product teams.
- Support the setup of shared infrastructure platforms, including multi-region Kubernetes and Kafka clusters.
- Gather application deployment and sizing requirements to support expected workloads.
- Define and enforce Service Level Objectives, Service Level Indicators, and Error Budgets in partnership with product teams.
- Lead blameless post-mortems and drive resolution of action items to reduce repeat incidents.
- Design and implement observability frameworks covering metrics, logs, and distributed tracing across platform services.
- Identify and automate repetitive operational work to reduce toil and improve efficiency.
- Partner with product teams to embed reliability requirements and non-functional requirements early in the software development lifecycle.
- Monitor application performance and partner with product teams to tune systems.
- Work with product team leads and technical practitioners to create deployment and reliability plans.
- Collaborate with Enterprise Architecture and Renaissance architecture teams to define implementation architecture.
- Promote application configuration standards that support a strong security posture.
- Partner with access management and security teams to establish roles and permissions using least-privilege strategies.
- Collaborate with integration and performance testing teams to leverage integrated release testing in the Release Acceptance environment.
- Work with production control teams on monitoring, failover, logging, and alerting strategies.
- Own and continuously improve incident response runbooks, on-call rotations, and escalation procedures.
- Conduct capacity planning and load forecasting to proactively address scalability requirements.
- Implement and validate infrastructure failover scenarios.
- Partner with network teams on connectivity planning and issue resolution, including connectivity between on-premises environments and AWS.
- Follow and support program-level Agile practices to improve collaboration and delivery.
- Develop documentation for technical infrastructure, architecture, and reliability support.
Required Qualifications
- Bachelor’s degree in Computer Science, a related technical field, or equivalent professional experience.
- 7+ years of experience building large-scale, data-centric technology solutions.
- 7+ years of recent experience as a member of a DevOps or SRE team, or as a product owner for a DevOps/SRE team.
- Strong understanding of Kanban and/or Agile methodologies.
- Familiarity with SRE principles as described in Google's SRE practices, including error budgets, toil elimination, and the service reliability hierarchy.
- Ability to succeed in a fast-paced environment with frequent changes.
- Strong communication skills with the ability to engage both technical and non-technical audiences.
- Self-starter who takes initiative to research, learn, and deliver solutions.
- Collaborative team player with a humble, team-first mindset.
Required Technical Skills
- Strong experience with AWS EC2, Kubernetes, Kafka, Jenkins, Terraform, Ansible, and HashiCorp Vault.
- Experience with observability tools such as Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent platforms.
- Experience with incident management and on-call tooling such as PagerDuty, Opsgenie, or similar tools.
- Strong knowledge of microservices and streaming data-intensive application architecture.
- Experience with application architecture, networking, and cloud security.
- Experience setting up AWS platforms for high-performance requirements.
- Broad experience with API-based development.
- Experience using Git and Artifactory for source control and artifact management.
- Strong knowledge of multi-AZ and multi-region failover architecture.
- Familiarity with chaos engineering principles and tooling such as Chaos Monkey, Gremlin, or LitmusChaos.
- Fluency with data formats and structures including JSON, Protobuf, and Avro.
- Experience with SQL and NoSQL databases, as well as in-memory data stores.
- Software development experience with Java, Python, Scala, and/or Golang.
- Experience with at least two of the following:
  - Web or mobile application development
  - Unix/Linux environments
  - Event-driven systems
  - Transaction processing systems
  - Distributed and parallel systems
  - Large-scale software system development
  - Security software development
  - Public cloud platforms
- Strong understanding of industry best practices, software design patterns, and architecture principles.
- Knowledge of enterprise architecture frameworks such as TOGAF.
- Ability to define and document architecture strategies, technical designs, and requirements across enterprise architecture domains.
- Ability to define service-based and component-based architectures and visually communicate enterprise architecture concepts.
Certifications
- AWS Certified Solutions Architect and/or AWS DevOps Engineer certification.
- Kubernetes and/or Kafka certification.
- Google Cloud Professional Site Reliability Engineer certification or equivalent SRE-focused certification.
- Project or program management certification.
Ideal Candidate Profile
The ideal candidate is a senior technical professional with deep experience in cloud infrastructure, DevOps, SRE, and enterprise-scale distributed systems. This person should be comfortable partnering across multiple technical teams, driving reliability standards, improving observability, supporting incident response, and helping build resilient cloud platforms that can scale across multi-region environments.
Reasonable Accommodation Statement
Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions of this role.
Company Description
We care about the success of each member of our team! We strive for long-lasting partnerships where you can grow and expand your career.