Research Infrastructure Engineer, Training Systems

OpenAI
Washington, US
Remote

Job Description

Research Infrastructure Engineer, Training Systems. Build and maintain infrastructure for large-scale model training and experimentation. Design APIs and interfaces for complex training workflows. Improve reliability, debuggability, and performance across training and data pipelines.

Requirements

  • Systems engineering role focused on ML training infrastructure
  • Build and maintain infrastructure for large-scale model training and experimentation
  • Design APIs and interfaces that make complex training workflows easier to express and harder to misuse
  • Improve reliability, debuggability, and performance across training and data pipelines
  • Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage
  • Write tests, benchmarks, and diagnostics that catch meaningful regressions

Benefits

  • Diversity and inclusion policy
  • Equal employment opportunity
  • Background checks administered in accordance with applicable law
  • Fair chance hiring policy

Skills & Requirements

Technical Skills

Ml training infrastructureApisInterfacesPythonPytorchDistributed systemsGpusNetworkingStorageDebug issuesWrite testsBenchmarksDiagnosticsAiMl

Employment Type

FULL TIME

Level

mid

Posted

4/28/2026

Apply Now

You will be redirected to OpenAI's application portal.

Sign in and we'll score your resume against this role.