Lead Machine Learning Engineer, LLM Infrastructure

Salesforce, Inc.

Washington, US

Remote

Job Description

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.

Job Category

Software Engineering

Job Details

About Salesforce

Salesforce is the #1 AI CRM, where humans with agents drive customer success together. Here, ambition meets action. Tech meets trust. And innovation isn’t a buzzword; it’s a way of life. The world of work as we know it is changing and we're looking for Trailblazers who are passionate about bettering business and the world through AI, driving innovation, and keeping Salesforce's core values at the heart of it all.

Ready to level up your career at the company leading workforce transformation in the agentic era? You’re in the right place! Agentforce is the future of AI, and you are the future of Salesforce.

About the Role

We are seeking a Lead ML Engineer, LLM Post-Training Infrastructure to join the Salesforce AI Research Incubation Team. In this role, you will own the infrastructure and engineering systems that support LLM post-training, large-scale evaluation, and model deployment. You will build scalable, reliable pipelines for training orchestration, rollout generation, reward and feedback pipelines, experiment management, and model iteration, helping translate research ideas into production-grade systems.

This is an engineering-first role focused on ML infrastructure, distributed systems, and training/evaluation workflows rather than developing new model architectures or algorithms. You will work closely with research scientists, agent engineers, and platform teams to operationalize post-training and feedback-driven learning methods into robust, reusable systems. This is a lead-level individual contributor role with deep ownership of model-facing infrastructure and strong cross-functional influence.

Key Responsibilities:

● Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.
● Own scalable pipelines for training orchestration, rollout generation, reward and feedback processing, checkpointing, and experiment management.
● Build reliable systems for feedback-driven model improvement, including human or AI feedback loops, large-scale offline evaluation, and regression detection.
● Partner closely with research scientists to turn new post-training methods into reusable engineering workflows.
● Collaborate with agent engineers and platform teams to integrate training and evaluation systems with production model and agent stacks.
● Optimize distributed training and inference workloads for reliability, throughput, cost efficiency, and observability.
● Drive best practices for reproducibility, versioning, monitoring, deployment, and operational excellence across ML systems.

Required Qualifications:

● 5+ years of experience in software engineering, ML systems, or distributed infrastructure.
● Strong proficiency in Python and experience building production systems or large-scale ML pipelines.
● Hands-on experience building infrastructure for model training, post-training, evaluation, or serving.
● Experience designing reliable, scalable systems for distributed and GPU-based workloads.
● Strong debugging skills across systems, pipelines, and model-facing failures.
● Experience building infrastructure for LLM post-training, including RLHF, preference optimization, reward modeling, or related feedback-driven training workflows.
● Experience working cross-functionally with research scientists and engineers.
● Familiarity with cloud platforms (AWS, GCP) and containerized environments (Docker, Kubernetes).

Preferred Qualifications:

● Experience with rollout systems, large-scale evaluation loops, or training data/feedback pipelines.
● Familiarity with distributed training frameworks and modern ML infrastructure stacks.
● Experience supporting agent-based learning, simulation environments, or iterative model improvement systems.
● Prior experience working closely with AI research or incubation teams.

Why Join Us?

● Own the systems that turn research models into production AI capabilities.
● Work at the intersection of AI research and large-scale engineering systems.
● Shape how models are trained, deployed, evaluated, and evolved.
● Competitive compensation, benefits, and strong long-term growth opportunities.

Unleash Your Potential

When you join Salesforce, you’ll be limitless in all areas of your life. Our benefits and resources support you to find balance and be your best, and our AI agents accelerate your impact so you can do your best. Together, we’ll bring the power of Agentforce to organizations of all sizes and deliver amazing experiences that customers love. Apply today to not only shape the future but to redefine what’s possible for yourself, for AI, and the world.

Accommodations

If you need a reasonable accommodation during the application or the recruiting process, please submit a request via this Accommodations Request Form. Please note that Salesforce uses

Skills & Requirements

Technical Skills

Python, ML infrastructure, Distributed systems, Training/evaluation workflows, Collaboration, Debugging, Cross-functional influence, Machine learning, LLM infrastructure, AI

Domain Knowledge

Technology

Employment Type

FULL TIME

Level

senior

Posted

4/27/2026
