Original listing text, shown exactly as published by the company.
What You’ll Do
- Design and build scalable, fault-tolerant infrastructure systems that power Harvey's AI platform across multiple cloud regions
- Own and evolve our multi-cloud infrastructure (Azure, GCP), including Kubernetes orchestration, networking, and container management
- Lead technical initiatives around observability, incident response, and operational excellence — building systems that enable rapid detection and resolution of issues
- Architect and optimize our distributed systems for reliability, including load balancing, quota management, and failover mechanisms
- Partner with Product Engineering and Security teams to ensure our infrastructure is an accelerant, not a constraint
- Drive infrastructure-as-code practices using tools like Terraform and Pulumi to enable reproducible, auditable deployments
- Mentor junior engineers and interns, and raise the technical bar across the organization through code reviews, design reviews, and technical leadership
Representative Projects
- Design and implement a next-generation model proxy architecture that routes millions of daily inference requests while maintaining model API compatibility and enabling seamless model integration
- Build distributed rate limiting and quota management systems using Redis-backed algorithms to handle bursty traffic patterns without degrading user experience
- Architect multi-region deployment strategies that meet strict data residency requirements for global enterprise customers
- Develop comprehensive observability infrastructure with granular SLA monitoring, burn rate alerts, and detailed token attribution for cost tracking
- Lead the evolution of our CI/CD pipelines to improve developer velocity while maintaining production stability
What You Have
- 4+ years of experience in Infrastructure Engineering or Platform Engineering in a production environment
- Long track record of building and scaling complex, large-scale distributed systems
- Deep proficiency with cloud infrastructure platforms (Azure preferred; GCP or AWS experience transfers well)
- Strong fluency in Infrastructure as Code (IaC) tools — Terraform, Pulumi, or CloudFormation
- Solid understanding of Kubernetes, container orchestration, networking, and cloud security at scale
- Experience with observability tools (Datadog, Sentry) and incident response practices (PagerDuty, Incident.io)
- Strong programming skills in Python, Go, or similar languages
- Excellent problem-solving skills, a "spidey sense" of where things could go wrong, and a commitment to operational excellence
Nice to Have
- Experience building infrastructure for AI/ML workloads or high-throughput inference systems
- Background with distributed rate limiting, load balancing, or quota management systems
- Experience operating multi-tenant platforms with strict security and compliance requirements
- Track record of leading complex cross-functional projects and delivering measurable impact
Compensation Range
$200,000 - $250,000 USD
Depending on your location, an Applicant Privacy Notice may apply to you. You can find all of our Applicant Privacy Notices [here].