Original listing text, shown exactly as published by the company.
THE OPPORTUNITY
Networking and compute are no longer separate disciplines; they are converging. The massive throughput of H100, B200, and NVL72 architectures enables and demands a new approach where communication is co-optimized alongside computation. We are entering an era where the network is an active accelerator, leveraging smart hardware offloads and direct interconnects to ensure that data movement operates at wire-speed.
In this role, you will go beyond network configuration to architect the software fabric that unifies thousands of GPUs into a cohesive operating system. While you will leverage the best of the open-source ecosystem, you won't be limited by it. Where off-the-shelf solutions stop, you will build from scratch, engineering the primitives required to co-optimize communication and compute for Disaggregated Serving, Wide Expert Parallelism (WideEP), and lightning-fast cold starts.
WHAT YOU'LL DO
- Make RDMA First-Class: You will work on integrating RDMA/RoCE/InfiniBand capabilities directly into our inference stack, helping us move beyond TCP/IP to unlock order-of-magnitude improvements in bandwidth and latency.
- Optimize Distributed Inference: You will implement and tune the networking layers necessary for efficient Disaggregated KV Cache Offload and WideEP, ensuring seamless communication across NVLink and InfiniBand for our MoE models.
- Enable Serverless-Grade Startup Speeds for LLMs: You will work deeply with checkpointing and storage mechanisms to enable sub-10-second startup for trillion-parameter models.
- Deep-Dive into Hardware: You will characterize and validate networking performance on bleeding-edge clusters (H100/H200, B200/B300, GB200/300 NVL72), writing the acceptance tests that ensure our hardware delivers peak achievable throughput and minimal latency.
- Build Observability: You will design the tools that let us visualize packet flow, congestion, and effective bandwidth across the GPU interconnects, helping us diagnose complex distributed system behaviors.
- Optimize Kernels: You will work with communication libraries (NCCL, NVSHMEM) and potentially write custom communication kernels to overlap compute and data transfer.
WHO YOU ARE
- You have deep experience with high-performance networking protocols (InfiniBand, RoCE v2) and understand the physics of data movement.
- You are fluent in C++ or Python, with the ability to bridge the gap between high-level logic and hardware. You have a deep understanding of the memory hierarchy in modern NVIDIA architectures (H100/Blackwell) and know how to optimize for it.
- You like going deep. You aren't afraid to dive into TensorRT-LLM source code, write custom C++ / Python bindings, or debug NVLink topology issues.
- You know when to use an off-the-shelf solution and when we need to build a custom solution because the upstream tools (like standard Kubernetes networking) are too slow for our needs.
HIGHLY PREFERRED
- Deep knowledge of NCCL, NVSHMEM, and UCX.
- Experience with GPUDirect Storage (GDS) or high-performance filesystems like Weka or 3FS.
- Familiarity with TensorRT-LLM, vLLM, or SGLang.
- Experience running low-level benchmarks to "qualify" new hardware clusters.