Original listing text, shown exactly as published by the company.
Key Responsibilities:
Infrastructure & Platform Ownership
- Design, implement, and maintain scalable infrastructure on Google Cloud Platform to support CodeRabbit's growing user base and processing demands
- Own and operate critical platform services
- Build and maintain Infrastructure as Code using Terraform to ensure consistent, reproducible, and version-controlled infrastructure deployments
Reliability & Performance Engineering
- Establish and maintain SLI/SLO frameworks for all critical services, ensuring we meet our reliability commitments to users
- Implement comprehensive monitoring, alerting, and observability solutions using Datadog and custom instrumentation
- Conduct thorough incident response, root cause analysis, and post-mortem processes to continuously improve system reliability
- Optimize application and infrastructure performance to handle millions of pull request analyses with minimal latency
- Design and implement chaos engineering practices to proactively identify and resolve system weaknesses
Automation & Developer Experience
- Develop self-service platforms and tooling that empower engineering teams to deploy, monitor, and troubleshoot their services independently
- Automate operational tasks including scaling, backup/recovery, security patching, and routine maintenance
- Create and maintain infrastructure APIs and abstractions that simplify complex operations for development teams
Security & Compliance
- Integrate security best practices into all infrastructure and platform services
- Implement and maintain security monitoring, vulnerability scanning, and compliance reporting
- Design secure network architectures including VPC configuration, firewall rules, and access control systems
- Establish and maintain disaster recovery procedures and business continuity planning
Required Qualifications
- 7+ years of hands-on experience in Site Reliability Engineering, Platform Engineering, or DevOps Engineering roles
- Proven track record of managing production systems at scale, preferably in high-growth technology companies
- Experience with cloud platforms, particularly AWS or Google Cloud Platform (GCP), including compute, storage, networking, and managed services
- Strong background in containerization and orchestration platforms (Kubernetes, Docker)
Technical Skills
- Programming Languages: Proficiency in Node.js and TypeScript for building automation tools, monitoring solutions, and platform services
- Infrastructure as Code: Advanced experience with Terraform for infrastructure provisioning and management
- Monitoring & Observability: Hands-on experience with Datadog or similar platforms (Prometheus, Grafana, ELK stack) for observability
- Cloud Platforms: Comprehensive experience with GCP services including Compute Engine, GKE, Cloud Run, Cloud SQL, Cloud Storage, Load Balancing, and IAM
Systems & Operations
- Strong Linux/Unix systems skills
- Experience with network protocols, load balancing, and CDN technologies
- Knowledge of security principles and best practices for cloud infrastructure
- Familiarity with CI/CD tools and practices (Jenkins, GitLab CI, GitHub Actions)
- Understanding of microservices architecture and distributed systems principles
Bonus Points
- Experience with AI/ML infrastructure and tools
- Background in managing high-traffic web applications and API services
- Experience with disaster recovery planning and execution
- Familiarity with compliance frameworks (SOC 2, ISO 27001)
- Contributions to open-source infrastructure or SRE tooling projects
- Experience with cost optimization and FinOps practices
- Knowledge of performance testing and capacity planning methodologies