Original listing text, shown exactly as published by the company.
In this role, you will
- Own and evolve the observability platform (e.g., New Relic) to provide end-to-end visibility across applications and infrastructure
- Establish standards for monitoring, alerting, dashboards, and telemetry (logs, metrics, traces)
- Leverage AIOps capabilities to improve anomaly detection, reduce noise, and accelerate root cause analysis
- Drive automation and self-healing workflows to minimize manual intervention and improve system resilience
- Collaborate across teams to ensure systems are observable by design and aligned with reliability goals
- Continuously analyze system behavior and incident patterns to improve performance, scalability, and uptime
You will be part of a team focused on building a highly reliable, data-driven, and scalable operational ecosystem, where observability is a core foundation for engineering excellence.
How you will make an impact
- Lead the observability strategy and execution, ensuring comprehensive visibility across all production and delivery environments.
- Own and govern the enterprise observability platform (New Relic or equivalent tools such as Datadog or Dynatrace) and ensure consistent monitoring standards across systems.
- Explore and adopt AI-driven monitoring capabilities (AIOps) to automate anomaly detection, reduce alert fatigue, and enable predictive problem management.
- Collaborate closely with Production Support (L1/L2), DevOps, CloudOps, Software Engineering, and Database teams to triage complex production issues and accelerate incident resolution.
- Act as the operational coordinator during service-impacting events, organizing workflows, managing cross-team dependencies, and providing structured updates to leadership.
- Design and implement automated remediation workflows and self-healing mechanisms for recurring incidents.
- Analyze telemetry data (logs, metrics, traces) to identify incident patterns and systemic anomalies, and continuously refine alert thresholds and routing logic.
- Develop and maintain dynamic dashboards that reflect real-time system health, application performance, and infrastructure behavior.
- Define and track reliability metrics such as SLOs, SLIs, MTTD, and MTTR to improve service reliability.
- Ensure clear, timely communication with stakeholders during incidents and operational events.
- Drive organization-wide adoption of observability best practices through documentation, training, and knowledge sharing.
What you need to be successful
- 8–10+ years of experience in observability, site reliability engineering (SRE), DevOps, or advanced production operations in large-scale enterprise environments.
- Expert-level hands-on experience implementing and optimizing observability platforms such as New Relic, Datadog, Dynatrace, or Splunk.
- Strong understanding of monitoring fundamentals including logs, metrics, traces, and alerting strategies.
- Experience working with cloud-native architectures (AWS preferred).
- Familiarity with containerized environments and orchestration platforms such as Kubernetes.
- Experience integrating observability practices into CI/CD pipelines to ensure applications are observable by design.
- Strong understanding of incident management, problem management, and change management practices (ITIL concepts).
- Demonstrated ability to analyze telemetry data to identify patterns, detect anomalies, and improve operational reliability.
- Strong leadership and collaboration skills with the ability to coordinate across engineering, DevOps, and operations teams.
- Excellent communication skills and a strong focus on operational excellence and continuous improvement.
Nice to have
- Experience implementing AI/ML capabilities within observability tools for anomaly detection and predictive monitoring.
- Familiarity with AIOps platforms and automated remediation workflows.
- Experience with event streaming platforms such as Kafka for telemetry ingestion or real-time data processing.
- Basic understanding of application architecture and troubleshooting distributed systems.
- Experience with automation frameworks or serverless workflows (e.g., AWS Lambda, scripting, or infrastructure automation).