Original listing text, shown exactly as published by the company.
The Opportunity
Deepgram's infrastructure spans bare metal GPU clusters, multi-cloud deployments, and global edge presence -- all serving real-time voice AI at massive scale while simultaneously powering large-scale model training. As a Systems Architect, you will own the end-to-end infrastructure architecture that makes this possible. You will design the compute, storage, and networking systems that serve both production inference and research training workloads, build multi-cloud strategies that balance performance with cost, and create burstable infrastructure that scales with Deepgram's rapidly growing demands. This is a senior technical leadership role where your architectural decisions shape the foundation everything at Deepgram runs on.
What You'll Do
- Define and drive the end-to-end infrastructure architecture for Deepgram's AI/ML workloads across production inference and research training
- Design multi-cloud and hybrid infrastructure strategies that balance performance, reliability, cost, and vendor flexibility
- Architect compute orchestration systems that efficiently schedule and manage GPU and CPU workloads across heterogeneous infrastructure
- Design storage architectures that handle the massive datasets required for speech and audio ML -- from high-throughput training data pipelines to low-latency model serving
- Lead capacity planning across all infrastructure dimensions, modeling growth and ensuring Deepgram can scale ahead of demand
- Drive cost optimization and FinOps practices, identifying opportunities to reduce infrastructure spend without compromising performance or reliability
- Design burstable, elastic training infrastructure that can scale up for large training runs and scale down to minimize idle cost
- Architect research compute infrastructure that gives ML teams the resources they need while maintaining operational efficiency
- Establish architectural standards, design review processes, and technical documentation practices for infrastructure decisions
- Collaborate with engineering leadership to align infrastructure strategy with product roadmap and business objectives
- Evaluate emerging hardware, cloud services, and infrastructure technologies for potential adoption
You'll Love This Role If You
- Think in systems -- you naturally see the connections between compute, storage, network, and how they interact under load
- Are motivated by designing infrastructure that operates at the intersection of real-time production systems and large-scale ML training
- Enjoy making architectural trade-offs where cost, performance, reliability, and velocity are all in tension
- Want to work across the full infrastructure stack -- from bare metal and GPUs to cloud services and container orchestration
- Are excited about building cost-effective, burstable infrastructure that enables world-class AI research
- Like operating at a strategic level while staying technically deep enough to validate designs and debug complex issues
It's Important To Us That You Have
- 7+ years of experience in infrastructure engineering, systems architecture, or a senior technical role focused on large-scale infrastructure
- Proven experience designing multi-cloud architectures spanning AWS and at least one other major cloud provider or on-premises environment
- Deep expertise in storage system design -- block, object, and file storage, including performance tuning for large-scale data workloads
- Strong experience with compute orchestration using Kubernetes, and an understanding of how to schedule diverse workloads efficiently
- Hands-on experience with GPU infrastructure -- procurement considerations, cluster design, driver and runtime management
- Track record of capacity planning and infrastructure scaling for high-growth environments
- Ability to communicate complex architectural decisions clearly to both technical and non-technical stakeholders
- Strong understanding of networking fundamentals as they relate to infrastructure architecture (see our Network Engineer role for the deep specialist)
It Would Be Great If You Had
- Direct experience architecting infrastructure for ML training workloads -- distributed training, large dataset management, experiment infrastructure
- Background in cost optimization and FinOps practices for large-scale cloud and bare metal infrastructure
- Experience operating and managing bare metal infrastructure in colocation facilities
- Expertise in network architecture design, including high-bandwidth GPU interconnects and global traffic routing
- Experience with infrastructure modeling and simulation for capacity planning
- Familiarity with Slurm, Ray, or other HPC/ML job scheduling systems
- Understanding of power, cooling, and physical infrastructure considerations for GPU-dense deployments