Original listing text, shown exactly as published by the company.
About the Role
We are looking for a Site Reliability Engineer (SRE) to join our team and help ensure the availability, performance, and scalability of our critical systems. You will work closely with development and operations teams to automate processes, enhance system reliability, and improve observability.
Responsibilities
- Develop and improve observability using monitoring, logging, tracing, and alerting tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.).
- Optimize system performance, troubleshoot incidents, and conduct post-mortems/RCA to prevent future issues.
- Collaborate with developers to enhance application reliability, scalability, and performance.
- Drive cost optimization efforts in cloud environments.
- Work with multiple data stores: MongoDB, Redis, Elasticsearch (ES), queue-based systems, etc.
Requirements
- Experience: 7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
- Cloud Expertise: Hands-on experience with GCP and AWS.
- Infrastructure as Code (IaC): Terraform, Helm, or equivalent tools.
- Containerization & Orchestration: Docker, Kubernetes (GKE).
- Observability: Experience with Prometheus, Grafana, ELK, OpenTelemetry, or similar monitoring/logging tools.
- Programming/Scripting: Proficiency in Python, Bash, or Shell scripting. Basic understanding of API parsing and JSON manipulation.
- CI/CD Pipelines: Hands-on experience with Jenkins, GitHub Actions, ArgoCD, or similar tools.…