AI Engineer - Model Performance

Fathom

Washington, US

Remote

Job Description

ABOUT FATHOM

We created Fathom to eliminate the needless overhead of meetings. Our AI assistant captures, summarizes, and organizes the key moments of your calls, so you and your team can stay fully present without sacrificing context or clarity. From instant, searchable call summaries to seamless CRM updates and team-wide sharing, Fathom transforms meetings from a source of friction into a place for alignment and momentum.

We’re a small company that creates magical experiences through the hard work of focused builders. We try to live our values - Care Deeply, Seek Leverage, Share Ownership, Sustain Urgency, and Be Tenacious - in everything we do, every day.

We started Fathom to rid us all of the tyranny of note-taking, and people seem to really love what we've built so far:

🥇 #1 Most Used App of the Year on HubSpot for 2025

🔥 #1 Rated on G2 with 4,500+ reviews and a perfect 5/5 rating

🥇 #1 Product of the Day and #2 AI Product of the Year

🚀 Most installed AI meeting assistant on both the Zoom and HubSpot marketplaces

📈 We’re hitting revenue and usage records every week

We think you’ll be pretty excited about Fathom too if you give it a try. Sign up today (it’s free)!

ROLE OVERVIEW

We're hiring a Model Performance Engineer to own the speed, cost, and reliability of our model inference stack, and to build the fine-tuning infrastructure that makes the rest of the AI team faster.

This is not a research role. You'll be optimizing real systems serving millions of meetings — choosing between quantization trade-offs, debugging speculative decoding, or figuring out why one GPU family's tail latency explodes at high concurrency while another stays stable.

You'll own two things:

HOW YOU’LL HELP US WIN

Benchmark FP8 quantization across GPU families, find that FP8 KV cache causes catastrophic repetition loops, identify static quantization as 6% faster than dynamic on certain hardware, and ship a production config that gets 1.3x speedup with <1% quality degradation
Evaluate serving frameworks (vLLM vs SGLang) with speculative decoding — discover that ngram speculation degrades ASR quality while EAGLE3 draft models don't, and that torch.compile makes certain GPUs 7% slower
Build a fine-tuning pipeline that takes a JSONL dataset and produces an optimized tune ready for serving, so a teammate can train a small classifier in an afternoon instead of a week
Optimize GPU spend — know which GPU families are best for batch workloads (stable under high concurrency) vs latency-sensitive paths (40% faster, but tail latency blows up under load), and when a 30% cost premium isn't worth it
Debug production inference issues — trace a quality regression to a serving framework upgrade that changed the default attention backend, or find that audio format handling in the multimodal pipeline silently drops segments

REQUIREMENTS

Hard Skills:

Deep experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) — not just deploying them, but tuning them: attention backends, scheduling strategies, CUDA graph warmup, prefix caching
Hands-on quantization experience — you've gone beyond "apply FP8 and hope." You understand weight vs activation quantization, per-channel vs per-tensor scaling, and when dynamic quantization introduces more overhead than it saves
Production fine-tuning experience — LoRA/QLoRA SFT, familiarity with training frameworks (ms-swift, Axolotl, torchtune, or similar), understanding of data formatting, learning rate schedules, and how to diagnose training failures
Strong Python. You'll write serving infrastructure, benchmarking harnesses, and training pipelines — not notebooks
Comfort with GPU profiling and performance analysis. You should be able to look at a benchmark result and know whether the bottleneck is compute, memory bandwidth, or scheduling overhead

Not Required:

ML research background or publications
Prompt engineering expertise (we have a team for that)
Frontend or full-stack experience
Master's/PhD (though it's fine if you have one)

WHAT'S IN IT FOR YOU

The opportunity to shape the foundational software services of a growing company
A role that balances innovation and incremental improvement
A dynamic and collaborative engi

Skills & Requirements

Technical Skills

AI, Model performance, Model inference stack, Fine-tuning infrastructure, Quantization, Speculative decoding, GPU selection, Batching strategies, Cold start mitigation, Adapter swapping, Throughput curves, FP8 quantization, Serving frameworks, Fine-tuning pipelines, JSONL dataset, Optimized tune, GPU spend, Production inference issues, Benchmark result, Compute, Memory bandwidth, Scheduling overhead, ML research background, Prompt engineering expertise, Frontend or full-stack experience, Master's/PhD, Ability to map out a process flow, Understanding how data moves from point A to point B, Ability to benchmark FP8 quantization across GPU families, Ability to evaluate serving frameworks, Ability to build a fine-tuning pipeline, Ability to optimize GPU spend, Ability to debug production inference issues, Ability to trace a quality regression, Ability to identify the bottleneck

Salary

$4,500+ per week

Employment Type

FULL TIME

Level

senior

Posted

4/30/2026
