AI ML Performance Engineer

VDart
Bellevue, Washington, US
Hybrid

Job Description

Role: AI ML Performance Engineer

Location: Bellevue, WA (Hybrid)

Employment Type: Contract

About the Role

We are looking for an experienced AI/ML Performance Engineer to design and execute high-intensity stress workloads for next-generation AI platforms. This role focuses on identifying performance bottlenecks, improving system stability, and enabling scalable, production-ready AI infrastructure.

Key Responsibilities

  • Design and implement high-intensity stress workloads using PyTorch and Triton
  • Analyze system performance to identify bottlenecks, stability issues, and performance cliffs
  • Develop workloads targeting large GEMMs, attention mechanisms, MoE-like architectures, mixed precision, and long-running executions
  • Build custom Triton kernels to stress hardware execution units, memory hierarchies, and synchronization paths
  • Create test harnesses that scale across problem size, device count, and runtime duration
  • Integrate workloads with profiling, monitoring, and failure triage tools
  • Collaborate with platform, firmware, and SDK teams
  • Provide documentation and reproducible scripts for lab and CI environments

Required Skills

  • Strong experience in performance testing and analysis (test result analysis, server stats, bottleneck identification, tuning, and recommendations)
  • Proficiency in Python
  • Scripting experience using Shell or PowerShell
  • Experience with PyTorch and/or Triton

Nice to Have

  • Experience with AI hardware platforms or simulators
  • Exposure to distributed systems and multi-device workloads

Skills & Requirements

Technical Skills

Python, PyTorch, Triton, Shell, PowerShell, AI hardware platforms, Distributed systems, Multi-device workloads

Employment Type

Contract

Level

Mid-Level

Posted

5/5/2026
