A remote DevOps & Infrastructure role at Prime Intellect: partner with clients to understand workload requirements and design optimal GPU cluster architectures.

Keywords this role’s ATS scans for

LLMs, Stakeholder mgmt

Level: Lead / Exec

Work: Remote

Focus: DevOps & Infrastructure

Pay: Est. $176k-$240k/yr

Original listing text, shown exactly as published by the company.

Core Technical Responsibilities

This customer-facing role combines deep technical expertise with hands-on implementation. You'll be instrumental in:

Customer Architecture & Design

Partner with clients to understand workload requirements and design optimal GPU cluster architectures
Create technical proposals and capacity planning for clusters ranging from 100 to 10,000+ GPUs
Develop deployment strategies for LLM training, inference, and HPC workloads
Present architectural recommendations to technical and executive stakeholders

Infrastructure Deployment & Optimization

Deploy and configure orchestration systems including SLURM and Kubernetes for distributed workloads
Implement high-performance networking with InfiniBand, RoCE, and NVLink interconnects
Optimize GPU utilization, memory management, and inter-node communication
Configure parallel filesystems (Lustre, BeeGFS, GPFS) for optimal I/O performance
Tune system performance from kernel parameters to CUDA configurations

Production Operations & Support

Serve as primary technical escalation point for customer infrastructure issues
Diagnose and resolve complex problems across the full stack - hardware, drivers, networking, and software
Implement monitoring, alerting, and automated remediation systems
Provide 24/7 on-call support for critical customer deployments
Create runbooks and documentation for customer operations teams

Technical Requirements

Required Experience

3+ years of hands-on experience with GPU clusters and HPC environments
Deep expertise with SLURM and Kubernetes in production GPU settings
Proven experience with InfiniBand configuration and troubleshooting
Strong understanding of NVIDIA GPU architecture, CUDA ecosystem, and driver stack
Experience with infrastructure automation tools (Ansible, Terraform)
Proficiency in Python, Bash, and systems programming
Track record of customer-facing technical leadership

Infrastructure Skills

NVIDIA driver installation and troubleshooting (CUDA, Fabric Manager, DCGM)
Container runtime configuration for GPUs (Docker, containerd, Enroot)
Linux kernel tuning and performance optimization
Network topology design for AI workloads
Power and cooling requirements for high-density GPU deployments

Nice to Have

Experience with 1000+ GPU deployments
NVIDIA DGX, HGX, or SuperPOD certification
Distributed training frameworks (PyTorch FSDP, DeepSpeed, Megatron-LM)
ML framework optimization and profiling
Experience with AMD MI300 or Intel Gaudi accelerators
Contributions to open-source HPC/AI infrastructure projects

Growth Opportunity

You'll work directly with customers pushing the boundaries of AI, from startups training foundation models to enterprises deploying massive inference infrastructure. You'll collaborate with our world-class engineering team while having a direct impact on systems powering the next generation of AI breakthroughs.

We value expertise and customer obsession - if you're passionate about building reliable, high-performance GPU infrastructure and have a track record of successful large-scale deployments, we want to talk to you.

About Prime Intellect

Prime Intellect

DevOps & Infrastructure

Member of Technical Staff - GPU Infrastructure
