Research Engineer, GPU Performance

Harnham
Washington, US
Remote

Job Description

Title: Research Engineer, GPU Performance

Location: USA Remote

Compensation: Up to $400k + Equity

We’re partnered with a well-funded AI research company focused on building next-generation multimodal models for media and interactive experiences. Their work spans cutting-edge generative systems and is increasingly moving toward real-time, interactive environments, pushing beyond static outputs into dynamic, AI-driven applications.

This is a deeply technical, high-impact role focused on making large-scale AI systems faster, more efficient, and capable of running in real time. You’ll work across the stack, from low-level GPU kernels to distributed training systems, directly influencing what is computationally possible for next-generation AI models.

What You’ll Do

  • Optimize training throughput across large GPU clusters, improving efficiency and utilization
  • Implement techniques such as mixed precision (FP8, BF16), memory-efficient attention, and activation checkpointing
  • Design and scale distributed training systems (tensor parallelism, FSDP, multi-node setups)
  • Profile and optimize inference pipelines for real-time multimodal generation
  • Improve latency through CUDA Graphs, KV cache optimization, and operator fusion
  • Contribute across the stack, from kernel-level optimization to system-level architecture

Requirements

  • 4+ years of experience in systems engineering, ML infrastructure, or performance optimization
  • Strong experience with GPU programming (CUDA, Triton, or similar)
  • Experience with distributed systems and large-scale training (NCCL, model parallelism)
  • Familiarity with ML framework internals such as PyTorch or JAX
  • Experience with mixed or low-precision techniques (FP8, INT8, BF16)
  • Proven experience building and operating scalable, fault-tolerant training systems
  • Strong interest in pushing the limits of performance for cutting-edge AI systems

Nice to Have

  • Experience with compiler optimizations or model compilation (e.g., torch.compile)
  • Background working on large multimodal or generative models
  • Exposure to real-time inference systems

If you're interested in working on the systems that enable next-generation AI models to train faster and run in real time, this is a rare opportunity to operate at the cutting edge of research and infrastructure.

Skills & Requirements

Technical Skills

CUDA, Triton, NCCL, PyTorch, JAX, FP8, BF16, FSDP, multi-node setups, torch.compile, leadership, problem-solving, communication, AI, ML, GPU, distributed systems, large-scale training

Salary

Up to $400,000 per year

Employment Type

Full-time

Level

Senior

Posted

4/23/2026
