Job Description

Site Reliability Engineer

100% remote

Good understanding of SRE principles.
Experience supporting Kafka from an SRE perspective (e.g., tuning, fault tolerance, operational readiness).
Experience creating SLO-based alerts or dashboards, which are essential for observability in this role.
Experience building infrastructure-level dashboards that provide meaningful insights into system health and performance.
Strong troubleshooting skills and/or a structured approach to diagnosing and resolving complex production issues.
Understanding of fault-tolerant system design, especially in the context of Kafka.
Work with the SRE team to educate its members on best practices and the best path forward.

The client is going through a modernization effort for their equity trading platform. They have asked us to bring in a senior SRE who ideally has experience supporting large-scale deployments.

Desired Qualifications

10+ years' experience in information technology and/or professional services, with emphasis on subject matter expertise.
At least 4 years of experience in a Site Reliability Engineering or equivalent role.
Strong track record of delivering projects of demonstrable complexity and scale.
Experience with data visualization and monitoring tools such as Splunk, Grafana, Dynatrace, Datadog, New Relic, Oracle Enterprise Manager, etc.
Experience with telemetry frameworks and tools, including but not limited to OpenTelemetry, Prometheus, Loki, Tempo, Fluent, and Jaeger.
Prior success in automating real-world production environments using Chef, Puppet, Salt, Ansible, or cloud-native equivalents.
Ability to lead adoption and mentor team members on modern Site Reliability Engineering and architectural concepts.
Proficient in designing and building highly available, resilient large-scale distributed systems.
Demonstrated leadership in an engineering environment, driving operational excellence and best practices.
Demonstrated ability to achieve stretch goals in a highly innovative and fast-paced environment.
Subject matter expertise with the following cloud service providers: Amazon Web Services (AWS) and Microsoft Azure (experience with GCP is a plus).
Expertise in Site Reliability Engineering technologies including but not limited to CI/CD frameworks, Infrastructure as Code, and monitoring and logging frameworks.
Familiarity with container technologies such as Docker and Kubernetes, and with cloud-native frameworks.
Excellent technical communication with both technical peers and non-technical colleagues.
Sharp analytical abilities and proven design skills.
Strong sense of ownership, urgency, and drive.

Company Description

Please visit our site: nobletechies.com.

SRE Architect