Training Infrastructure Engineer

SID is a research lab for search. We train AI that can retrieve and reason over any data source. We are backed by Y Combinator, Canaan, Rebel, and General Catalyst, as well as by Jeff Dean and researchers from Anthropic, DeepMind, OpenAI, MIT, Cognition, Cursor, Applied Compute, Prime Intellect, and Standard Intelligence.

If you don't match all of the requirements, we still encourage you to apply. We care much more about potential and the rate of improvement than achievements. We train and invest in our people!

There is no playbook. In pretraining, all GPUs are doing mostly the same thing. In post-training, tasks are heterogeneous: You need to produce rollouts, which is memory-bandwidth-bound. You need to train on those rollouts, which is compute-bound. You also need reference-policy and reward workers, environments, etc.
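A back-of-the-envelope way to see the split between the two regimes is a roofline check on a single dense layer. The sketch below is a simplification: it counts only weight reads (no KV cache or activations), and the function names and hardware figures are hypothetical placeholders, not a description of any particular chip.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read for a dense layer.

    Each weight contributes one multiply and one add per token (2 FLOPs)
    and is read from memory once per pass.
    """
    return 2 * tokens_per_pass / bytes_per_param


def limiting_resource(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Return which resource bounds the layer under a simple roofline model."""
    # Machine balance: FLOPs the chip can perform per byte it can load.
    balance = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= balance else "memory bandwidth"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

# Decoding 32 sequences produces 32 tokens per forward pass.
decode = limiting_resource(32, PEAK_FLOPS, MEM_BANDWIDTH)
# A training step pushes thousands of tokens through the same weights.
train = limiting_resource(8192, PEAK_FLOPS, MEM_BANDWIDTH)
```

Under these assumed numbers, the decode pass sits far below the machine balance and the training pass far above it, which is why the two kinds of worker want different hardware and different batching.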

You can't just optimize each of these in isolation. For example, it is tempting to make inference more efficient. But this makes the rollouts slightly different from what a full-precision model would produce, leading to very-low-probability tokens that dominate the gradient and collapse the model over time. Ask us how we know. Can you correct for this mismatch? Kinda, but we know you can do even better.
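One commonly discussed partial correction is truncated importance sampling: weight each token's loss by the ratio of its probability under the trainer to its probability under the sampler, capped at a constant. The sketch below is only an illustration of that idea in plain Python. The function names and the cap of 2.0 are made up for the example, and it is not a description of our stack.

```python
import math


def truncated_is_weights(trainer_logprobs, sampler_logprobs, cap=2.0):
    """Per-token weights min(pi_trainer / pi_sampler, cap).

    Both inputs are log-probabilities of the tokens that were actually sampled.
    The cap value is an illustrative assumption.
    """
    if len(trainer_logprobs) != len(sampler_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    weights = []
    for lp_trainer, lp_sampler in zip(trainer_logprobs, sampler_logprobs):
        ratio = math.exp(lp_trainer - lp_sampler)
        weights.append(min(ratio, cap))
    return weights


def weighted_pg_loss(trainer_logprobs, sampler_logprobs, advantages, cap=2.0):
    """Mean policy-gradient loss with truncated importance weights.

    In an autodiff framework the weights would be treated as constants,
    so gradients flow only through the trainer log-probabilities.
    """
    weights = truncated_is_weights(trainer_logprobs, sampler_logprobs, cap)
    terms = [
        -w * adv * lp
        for w, adv, lp in zip(weights, advantages, trainer_logprobs)
    ]
    return sum(terms) / len(terms)
```

A token the sampler emitted with probability 0.1 but the trainer scores at 0.001 gets a weight near 0.01, so it no longer dominates the update. The cap bounds the opposite case, but it also introduces bias, which is part of why the answer above is "kinda".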

At the architecture level, there are no simple answers either. Making everything async sounds compelling but introduces its own challenges: Rollouts are not fixed-length, so short responses finish first and you will overtrain on them, and thus on easy questions. Is this unfairness an issue? Maybe.

Most people expect RL post-training to consume more than 90% of all training compute soon. Right now, it's still the Wild West. There isn't even an MFU equivalent yet! (Come help design one.)
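For reference, pretraining MFU (model FLOPs utilization) is usually estimated as the FLOPs the model needs per second of observed throughput, divided by the hardware's peak. The sketch below uses the common approximation of 6 FLOPs per parameter per trained token for a dense transformer and ignores attention FLOPs. The function name and all numbers are hypothetical.

```python
def pretraining_mfu(num_params, tokens_per_second, peak_flops_per_second):
    """Approximate model FLOPs utilization for dense-transformer training.

    Uses ~6 FLOPs per parameter per token (forward plus backward pass),
    which ignores attention and any recomputation.
    """
    model_flops_per_second = 6 * num_params * tokens_per_second
    return model_flops_per_second / peak_flops_per_second


# Hypothetical run: a 1B-parameter model training at 100k tokens/s
# on hardware with an aggregate peak of 1e15 FLOP/s.
example_mfu = pretraining_mfu(1e9, 1e5, 1e15)
```

The formula assumes every device is doing the same forward-plus-backward work. In RL post-training, a large share of the fleet is generating rollouts or scoring them, so a single ratio like this says little about how well the whole system is being used.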

Responsibilities

  • Design post-training infrastructure that scales to thousands of GPUs.
  • Own the entire training stack, from NCCL to worker orchestration.
  • The ideal model architecture for search is very different from one for language modelling, which means writing custom CUDA, etc.
  • Future: Manage an infra team that you get to help pick.

Perks

  • Extreme freedom. We have built great RL infrastructure, but know the ceiling is much higher.
  • Work on frontier methods that scale. No weird old-school AI.
  • Be part of a small but highly leveraged team. We believe great research teams don't scale beyond 15-20 people.
  • There is no playbook. See above. We value clean sheet designs.
  • Everyone on the team can code. This might change in the future of course.
  • Competitive compensation with generous early-stage equity, plus full medical and vision coverage.

Requirements

  • Not afraid of CUDA – a BSc/MSc/PhD is an indicator of this (but isn't the only one).
  • Thinks they can learn anything in 2 weeks, but isn't arrogant about it.
  • Familiar with vLLM/SGLang/TensorRT-LLM.
  • Can reason deeply across the stack. This role will require considering tradeoffs from the chip level.
  • Proactive. Small teams win because new learnings can diffuse faster.
  • Familiar with Richard Hamming's 'You and Your Research'. Understands what it takes to do significant work.

Things you should know

  • Startup work is always intense and sometimes frustrating: The nature of working on novel ideas is that not all of them pan out. It can be that you put blood, sweat, and tears into a feature or model and it just ends up not working through no fault of your own.
  • We might publish, but cannot guarantee that we will.
  • The role is in person only, at our offices in SF and Zürich. If you do not have US work authorization, we can help with that.