SRE Architect
Job Description
Site Reliability Engineer
100% remote
- Good understanding of SRE principles
- Supporting Kafka from an SRE perspective (e.g., tuning, fault tolerance, operational readiness).
- Has created SLO-based alerts or dashboards, which are essential for observability in this role.
- Experience building infrastructure-level dashboards that provide meaningful insights into system health and performance.
- Strong troubleshooting skills and/or a structured approach to diagnosing and resolving complex production issues.
- Understanding of fault-tolerant system design, especially in the context of Kafka.
- Works with the SRE team to educate them on best practices and the best path forward.
The client is going through a modernization effort for their equity trading platform. They have asked us to bring in a senior SRE who ideally has experience supporting large-scale deployments.
Desired Qualifications
- 10+ years' experience in information technology and/or professional services, with emphasis on subject matter expertise.
- At least 4 years of experience as a Site Reliability Engineer or in an equivalent role.
- Strong track record of delivering projects of demonstrable complexity and scale.
- Experience with data visualization and monitoring tools such as Splunk, Grafana, Dynatrace, Datadog, New Relic, Oracle Enterprise Manager, etc.
- Experience with telemetry frameworks and tools, including but not limited to OpenTelemetry, Prometheus, Loki, Tempo, Fluent, Jaeger, etc.
- Prior success in automating real-world production environments using Chef, Puppet, Salt, Ansible, or cloud equivalents.
- Ability to lead adoption and mentor team members on modern Site Reliability Engineering and architectural concepts.
- Proficient in designing and building highly available, resilient large-scale distributed systems.
- Demonstrated leadership abilities in an engineering environment in driving operational excellence and best practices.
- Demonstrated ability to achieve stretch goals in a highly innovative and fast-paced environment.
- Subject matter expertise with the following Cloud Service Providers: Amazon Web Services (AWS) and Microsoft Azure (experience with GCP is a plus).
- Expertise in Site Reliability Engineering technologies including but not limited to CI/CD frameworks, Infrastructure as Code, and monitoring and logging frameworks.
- Familiarity with container technologies like Docker, Kubernetes, and cloud frameworks.
- Excellence in technical communication with peers and non-technical cohorts.
- Sharp analytical abilities and proven design skills.
- Strong sense of ownership, urgency, and drive.
Company Description
Please visit our site nobletechies.com.