Rhoda-ai

Machine Learning Engineer - Training Platform

Company

Rhoda-ai

Role

Machine Learning Engineer - Training Platform

Job type

Full-time

Posted

Yesterday

Salary

Not disclosed by employer

Job description

At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots — from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work — and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We're looking for a Staff / Principal ML Engineer to build and own our training platform — the system that makes large-scale training reliable, reproducible, and easy to run. You will define how training jobs are launched, tracked, recovered, and debugged across the cluster. Your work ensures that researchers can move fast without fighting infrastructure.

This role sits at the core of research velocity: when training fails, you make it recover automatically; when experiments are hard to reproduce, you fix the system; when GPU-hours are wasted, you make the waste visible and preventable.

What You'll Do

Own the training job lifecycle

  • Design and build systems for job launch and configuration; monitoring and state tracking; automatic retry and resume; and failure handling and recovery
  • Define clean, scalable interfaces for running distributed training: CLI / SDK / config systems and standardized launch templates across model families

Build robust checkpointing and recovery systems

  • Develop checkpointing systems that are reliable (no silent corruption or mismatch), efficient (fast save/load at scale), and flexible (support sharded and distributed models)
  • Enable seamless resume from failures, partial recovery (e.g., node/rank failures), and consistent state across distributed jobs

Make training reproducible and debuggable

  • Build systems for experiment configuration and versioning; tracking of training state, metrics, and lineage; and reproducible "golden runs" and configs
  • Ensure runs can be reliably reproduced and differences between runs are explainable

Make performance and failures observable

  • Create unified visibility into per-job behavior (failures, slowdowns, anomalies) and fleet-wide trends (GPU utilization, failure modes, wasted compute)
  • Partner with training systems engineers to surface step-time breakdowns, resource inefficiencies, and failure patterns across jobs

Reduce operational burden on researchers

  • Eliminate manual debugging and babysitting of training jobs
  • Provide clean abstractions so researchers don't need to think about cluster quirks, retry logic, or distributed setup details
  • Goal: make large-scale training feel simple and reliable

Collaborate with infra / SRE on cluster reliability

  • Work with infrastructure teams to reduce GPU waste from node failures, network instability, checkpointing/storage bottlenecks, and scheduler placement issues

What We're Looking For

  • Strong experience building distributed systems or ML infrastructure
  • Experience with large-scale training environments (preferred but not required)
  • Hands-on experience with modern ML stacks (e.g., PyTorch; JAX a plus)
  • Solid understanding of distributed systems fundamentals (fault tolerance, state management, retries), training workflows and failure modes, and checkpointing and data consistency challenges
  • Strong product / systems instincts — you build tools people actually want to use and simplify complex workflows into clean abstractions
  • High ownership mindset and comfort in a fast-moving environment

Nice to Have (But Not Required)

  • Experience with checkpointing for large distributed models (FSDP / ZeRO / sharded states)
  • Experience with cluster schedulers (Slurm, Kubernetes, Ray, etc.)
  • Experience building experiment tracking or ML observability systems
  • Familiarity with large-scale storage systems and I/O bottlenecks

Why This Role

  • Own the reliability layer that every training run in the company depends on — your systems are the foundation research velocity is built on
  • Direct impact on developer experience and research throughput at a company building real-world embodied intelligence, not toy ML pipelines
  • High ownership in a small, elite team where your infrastructure decisions compound across every model the research team trains