Engineering Manager (AI Inference)

Perplexity
Remote

Job Description

ABOUT THE ROLE

We are looking for an Inference Engineering Manager to lead our AI Inference team. This is a unique opportunity to build and scale the infrastructure that powers Perplexity's products and APIs, serving millions of users with state-of-the-art AI capabilities.

You will own the technical direction and execution of our inference systems while building and leading a world-class team of inference engineers. Our current stack includes Python, PyTorch, Rust, C++, and Kubernetes. You will help architect and scale the deployment of the machine learning models behind Perplexity's Comet, Sonar, Search, and Deep Research products.

WHY PERPLEXITY?

  • Build state-of-the-art systems, the fastest in the industry, using cutting-edge technology
  • High-impact work on a smaller team with significant ownership and autonomy
  • Opportunity to build 0-to-1 infrastructure from scratch rather than maintaining legacy systems
  • Work on the full spectrum: reducing cost, scaling traffic, and pushing the boundaries of inference
  • Direct influence on technical roadmap and team culture at a rapidly growing company

RESPONSIBILITIES

  • Lead and grow a high-performing team of AI inference engineers
  • Develop APIs for AI inference used by both internal and external customers
  • Architect and scale our inference infrastructure for reliability and efficiency
  • Benchmark and eliminate bottlenecks throughout our inference stack
  • Drive large sparse/MoE model inference at rack scale, including sharding strategies for massive models
  • Push the frontier by building inference systems that support sparse attention, disaggregated prefill/decode serving, and similar techniques
  • Improve the reliability and observability of our systems and lead incident response
  • Own technical decisions around batching, throughput, latency, and GPU utilization
  • Partner with ML research teams on model optimization and deployment
  • Recruit, mentor, and develop engineering talent
  • Establish team processes, engineering standards, and operational excellence

QUALIFICATIONS

  • 5+ years of engineering experience, with 2+ years in a technical leadership or management role
  • Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)
  • Strong understanding of LLM architecture: Multi-Head Attention, Multi/Grouped-Query Attention, and common layers
  • Experience with inference optimizations: batching, quantization, kernel fusion, FlashAttention
  • Familiarity with GPU characteristics, roofline models, and performance analysis
  • Experience deploying reliable, distributed, real-time systems at scale
  • Track record of building and leading high-performing engineering teams
  • Experience with parallelism strategies: tensor parallelism, pipeline parallelism, expert parallelism
  • Strong technical communication and cross-functional collaboration skills

NICE TO HAVE

  • Experience with CUDA, Triton, or custom kernel development
  • Background in training infrastructure and RL workloads
  • Experience with Kubernetes and container orchestration at scale
  • Published work or contributions to inference optimization research

Skills & Requirements

Technical Skills

Python; PyTorch; Rust; C++; Kubernetes; ML systems; Inference frameworks; LLM architecture; Inference optimizations; GPU characteristics; Roofline models; Performance analysis; Deploying reliable, distributed, real-time systems; Parallelism strategies; CUDA; Triton; Custom kernel development; Training infrastructure; RL workloads; Container orchestration; Technical communication; Cross-functional collaboration; AI inference; Machine learning

Level

mid

Posted

4/13/2026
