Original listing text, shown exactly as published by the company.
Core Technical Responsibilities
This customer-facing role combines deep technical expertise with hands-on implementation. You'll be instrumental in:
Customer Architecture & Design
- Partner with clients to understand workload requirements and design optimal GPU cluster architectures
- Create technical proposals and capacity plans for clusters ranging from 100 to 10,000+ GPUs
- Develop deployment strategies for LLM training, inference, and HPC workloads
- Present architectural recommendations to technical and executive stakeholders
Infrastructure Deployment & Optimization
- Deploy and configure orchestration systems including SLURM and Kubernetes for distributed workloads
- Implement high-performance networking with InfiniBand, RoCE, and NVLink interconnects
- Optimize GPU utilization, memory management, and inter-node communication
- Configure parallel filesystems (Lustre, BeeGFS, GPFS) for optimal I/O performance
- Tune system performance from kernel parameters to CUDA configurations
Production Operations & Support
- Serve as primary technical escalation point for customer infrastructure issues
- Diagnose and resolve complex problems across the full stack: hardware, drivers, networking, and software
- Implement monitoring, alerting, and automated remediation systems
- Provide 24/7 on-call support for critical customer deployments
- Create runbooks and documentation for customer operations teams
Technical Requirements
Required Experience
- 3+ years of hands-on experience with GPU clusters and HPC environments
- Deep expertise with SLURM and Kubernetes in production GPU settings
- Proven experience with InfiniBand configuration and troubleshooting
- Strong understanding of NVIDIA GPU architecture, CUDA ecosystem, and driver stack
- Experience with infrastructure automation tools (Ansible, Terraform)
- Proficiency in Python, Bash, and systems programming
- Track record of customer-facing technical leadership
Infrastructure Skills
- NVIDIA driver installation and troubleshooting (CUDA, Fabric Manager, DCGM)
- Container runtime configuration for GPUs (Docker, containerd, Enroot)
- Linux kernel tuning and performance optimization
- Network topology design for AI workloads
- Power and cooling requirements for high-density GPU deployments
Nice to Have
- Experience with 1000+ GPU deployments
- NVIDIA DGX, HGX, or SuperPOD certification
- Distributed training frameworks (PyTorch FSDP, DeepSpeed, Megatron-LM)
- ML framework optimization and profiling
- Experience with AMD MI300 or Intel Gaudi accelerators
- Contributions to open-source HPC/AI infrastructure projects
Growth Opportunity
You'll work directly with customers pushing the boundaries of AI, from startups training foundation models to enterprises deploying massive inference infrastructure. You'll collaborate with our world-class engineering team while having a direct impact on the systems powering the next generation of AI breakthroughs.
We value expertise and customer obsession. If you're passionate about building reliable, high-performance GPU infrastructure and have a track record of successful large-scale deployments, we want to talk to you.