Sydicom insights

Sydicom overview

A hybrid DevOps & Infrastructure role at Saviynt.

Level: Mid-level
Work: Hybrid
Focus: DevOps & Infrastructure
Pay: $5k-$8k/mo

How Sydicom helps: we read this listing’s requirements and tune your CV and cover letter to the keywords its ATS (Lever) is scanning for, wherever you are, then help you apply.

Original listing text, shown exactly as published by the company.

What You Will Be Doing

Own the Ray ecosystem end-to-end: manage KubeRay on GKE, tune Ray Core Task/Actor scheduling, operate the Plasma distributed object store, and configure Ray Data for GPU-direct streaming from GCS/S3
Operate distributed training with Ray Train: configure TorchTrainer + DDP/NCCL for multi-node H100 clusters, manage checkpoint lifecycle, implement spot-preemption recovery, and integrate warm-start fine-tuning for retrain pipelines
Build and operate the LLM inference mesh with Ray Serve: compose vLLM (PagedAttention), SGLang (RadixAttention), and NVIDIA Triton (TensorRT/ONNX) as a unified deployment graph with Plasma zero-copy memory sharing
Optimise inference performance: configure fractional GPU allocation, enable continuous batching, implement per-engine autoscaling based on request queue depth, and tune KV-cache block sizes
Design and operate the model routing layer: capability-based, version-based, and tenant-based routing with cost-aware fallback between self-hosted SLMs and cloud LLMs
Build RL training infrastructure: define Flyte workflows for RL pipelines (rollout, reward shaping, policy update, evaluation), integrate Ray RLlib or custom PPO/GRPO loops with Ray Train, and manage replay buffer persistence on GCS
Operate the full model promotion lifecycle: quality gate → integration tests → load tests (k6) → shadow mode → A/B gate → canary (10%→100%) with golden-signal auto-rollback
Operate the retrain pipeline: drift detection triggers, warm-start retraining, relative quality gates (V2 >= V1 − 2%), and automated Flyte DAG through to canary
Integrate RAG retrieval into the inference mesh: vector similarity search, context assembly, and prompt construction before LLM inference
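For readers unfamiliar with the "relative quality gate" named in the retrain pipeline item above, here is a minimal sketch of one way such a check could look. It is not Saviynt's implementation. The function name, the `GateResult` type, and the reading of "2%" as two absolute points on a metric scaled to [0, 1] are all assumptions made for illustration; the listing does not say which metric is used or whether the margin is absolute or proportional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """Outcome of comparing a retrained model (V2) against the current one (V1)."""
    passed: bool
    threshold: float


def relative_quality_gate(candidate_score: float,
                          baseline_score: float,
                          tolerance: float = 0.02) -> GateResult:
    """Hypothetical gate: pass when V2 >= V1 - tolerance.

    Assumes both scores are the same metric on a [0, 1] scale and that the
    tolerance is an absolute margin (0.02 = two points). Both are assumptions.
    """
    threshold = baseline_score - tolerance
    return GateResult(passed=candidate_score >= threshold, threshold=threshold)


if __name__ == "__main__":
    # V1 scores 0.90; a retrained V2 at 0.89 is inside the margin,
    # while a V2 at 0.87 is below it and would be blocked.
    print(relative_quality_gate(0.89, 0.90).passed)
    print(relative_quality_gate(0.87, 0.90).passed)
```

A gate of this shape lets a retrained model proceed to later promotion stages even if it is marginally worse than its predecessor, which can be useful when the retrain was triggered by drift and the old score was measured on stale data.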

What You Bring

Experience in ML engineering with time in an ML platform or MLOps role
Production Ray depth: Ray Train, Serve, Core, and Data — debugged real production failures including NCCL timeouts, Plasma OOM, and Serve autoscaling lag
LLM serving engines: hands-on with vLLM, SGLang, or NVIDIA Triton — PagedAttention, prefix caching, and continuous batching tuned for latency/throughput targets
Distributed training: DDP, FSDP, NCCL collectives, gradient checkpointing, and mixed precision (BF16/FP8)
RL working knowledge: PPO, policy gradient, or RLHF — able to translate an algorithm into distributed compute primitives
Model lifecycle operations: MLflow registry, shadow/A/B/canary patterns, and auto-rollback on golden signal degradation

Vector databases: pgvector or Qdrant, including ANN index strategies, embedding upsert, and query latency tuning under inference load
Strong Python and PyTorch; Flyte or equivalent ML orchestrator
Quantization (nice to have): INT8/INT4/FP8 post-training quantization (GPTQ, AWQ, or bitsandbytes)
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience or equivalent military experience

We offer you a competitive total rewards package, learning opportunities, and tremendous opportunities to grow and advance in your career. At Saviynt, it is not typical for an individual to be hired at or near the top of the range for their role, and final compensation decisions depend on many factors including, but not limited to, location; skill sets; experience and training; licensure and certifications; and other relevant business and organizational needs.

You may also be eligible to participate in a Saviynt discretionary bonus plan, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance.

About Saviynt

Saviynt

DevOps & Infrastructure

153 open roles on Sydicom

Saviynt provides a cloud-native identity security platform that converges identity governance, privileged access, and cloud security solutions. Their offerings help enterprises manage digital identities, enforce access controls, and ensure compliance across complex environments.

Generated by Sydicom AI

Open source

github.com/Saviynt · 3 public repos · 9 stars · Go

AI Platform Engineer, Training and Inference
