Senior AI Inference Performance Engineer

Cango Inc.
Dallas, US

Job Description

About the Role

We are building a GPU-native AI platform that provides model inference APIs, dedicated inference instances, and GPU infrastructure services for AI applications and agent workloads. Our platform supports multiple model categories, including:

  • large language models (LLMs)
  • speech models, including ASR and TTS
  • image generation and diffusion models

We are looking for a Senior AI Inference Performance Engineer to help us optimize model serving performance across these workloads on our GPU infrastructure. This role sits at the intersection of machine learning systems, GPU architecture, inference engines, CUDA optimization, and production serving infrastructure.

You will be responsible for improving the throughput, latency, stability, and cost efficiency of model inference workloads running on our platform. This includes tuning model serving stacks, profiling bottlenecks, optimizing GPU utilization, and working across both software and system layers to achieve best-in-class inference performance.

Responsibilities

  • Core Inference Optimization: Optimize performance for LLMs, speech, and image models by benchmarking and fine-tuning serving frameworks (e.g., vLLM, TensorRT-LLM, Triton) to maximize throughput, minimize latency, and reduce cost per inference.
  • Deep Profiling & Hardware Tuning: Analyze and resolve GPU bottlenecks across multiple layers, including CUDA kernel efficiency, KV cache behavior, and data movement, using profiling tools such as Nsight Systems and the PyTorch Profiler.
  • System Architecture & Scalability: Elevate system-level performance across diverse deployment patterns (low-latency, high-throughput, multi-tenant) by refining model loading times, autoscaling behavior, request routing, and memory management.
  • Cross-Functional Collaboration: Partner with platform, infrastructure, and product teams to architect efficient serving pipelines for production APIs, establishing robust performance targets and capacity models.
  • Continuous Innovation: Stay at the forefront of the industry by actively integrating the latest advances in AI inference optimization, CUDA techniques, and open-source serving systems into production environments.

Required Qualifications

  • Education & Core Experience: Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, Machine Learning, or a related field, accompanied by 5+ years of experience in ML systems, GPU software, inference optimization, high-performance computing, or large-scale model serving.
  • Deep Learning & GPU Architecture: Deep understanding of transformer-based models and modern generative AI workloads, paired with strong hands-on expertise in NVIDIA GPU architecture, CUDA, and multi-level performance tuning (kernel, framework, and system).
  • Inference Frameworks & Tooling: Extensive experience with leading inference stacks—such as TensorRT, TensorRT-LLM, Triton Inference Server, vLLM, ONNX Runtime, PyTorch, SGLang, and TGI—alongside proficiency in modern GPU profiling and debugging tools.
  • Programming & Engineering Fundamentals: Strong programming skills in Python and C++, backed by solid software engineering principles and the ability to navigate complex model serving tradeoffs involving latency vs. throughput, memory footprint vs. concurrency, and precision vs. quality.
  • Production Infrastructure: Proven familiarity with deploying and scaling AI workloads in containerized and distributed production environments, including Kubernetes, Docker, and cloud or on-prem GPU clusters.

Skills & Requirements

Technical Skills

Python, C++, CUDA, TensorRT, Triton Inference Server, vLLM, ONNX Runtime, PyTorch, SGLang, TGI, AI, GPU, inference performance, model serving, machine learning systems

Level

Senior

Posted

3/23/2026
