Senior/Staff Software Engineer - Machine Learning Infrastructure

Salesforce.Com Inc
Austin, US

Job Description

To enhance your candidate experience, please consider applying for a maximum of 3 roles within a 12-month period to avoid duplication in your efforts.

Job Category

Software Engineering

About Salesforce

Salesforce is at the forefront of AI-driven customer relationship management. We merge ambition with action and technology with trust, making innovation part of our everyday life. As the landscape of work evolves, we seek Trailblazers passionate about improving businesses and the world through AI, while upholding Salesforce's core values.

Are you ready to elevate your career at a company leading a workforce transformation? Welcome to Agentforce, where you can shape the future of AI and define the next era of business.

About Slack AI

Slack AI's mission is to revolutionize work by transforming Slack into an AI-powered operating system. We are committed to addressing key challenges such as unlocking collective knowledge and reducing overwhelm, all while crafting an intuitive, consumer-grade AI experience integrated into users' workflows. Join us in redefining the work landscape through AI.

About the Team

The AI and ML Infrastructure team is part of Slack's Core Infrastructure organization and is responsible for the fundamental systems that enable machine learning and AI across the platform. Our focus is on building robust, scalable, and high-performance systems that empower both product and ML teams to develop, deploy, and manage AI-driven functionality with confidence. As Slack AI expands, we are evolving from traditional ML deployments to highly distributed systems. That shift raises intricate architectural challenges: scalable model deployment strategies, real-time feature serving at high throughput, and robust model training on sensitive data that adheres to privacy and safety standards.

Core Focus Areas

  • ML Infrastructure: Focuses on the underlying systems driving training and inference at scale, architecting and maintaining distributed systems using Kubernetes-based platforms, GPU infrastructure, and open-source ML stacks like KubeRay and vLLM.
  • AI Platform: Develops the tooling and platform layers facilitating AI development across Slack, including developer-friendly tools, SDKs, and workflows for seamless AI integration into Slack features.

About the Role

We are seeking a Senior or Staff Software Engineer to join our ML Infrastructure focus area and lead the architecture and operation of essential systems powering AI at Slack. In this role, you will create foundational infrastructure for large-scale model training and inference, and evolve it into a secure, reliable, self-service platform for use company-wide.

As a key player, you will tackle complex challenges at the intersection of distributed systems, GPU infrastructure, and modern ML stacks.

What You Will Be Doing

  • Design, build, and manage systems for training, serving, and deploying machine learning models at scale, emphasizing reliability and performance.
  • Enhance GPU-backed inference infrastructure to handle high-throughput, latency-sensitive workloads effectively.
  • Architect and optimize distributed training and data processing systems with technologies like Ray, Airflow, or Spark.
  • Construct and maintain Kubernetes-based orchestration layers using tools such as KubeRay, vLLM, and proprietary services.
  • Develop solutions bridging legacy systems with cutting-edge technologies while ensuring stability in monolithic applications.
  • Create robust monitoring, observability, and alerting systems for production ML workloads, ensuring operational excellence.
  • Collaborate closely with AI Platform, ML modeling, security, and product engineering teams to design infrastructure supporting evolving AI use cases.
  • Provide technical leadership through design reviews and mentorship while establishing engineering standards and architectural direction for ML infrastructure.
  • Document technical designs and architectural solutions, and contribute thought leadership through blog posts.

What You Should Have

  • Extensive experience in software engineering with a focus on infrastructure, backend systems, platform engineering, or MLOps.
  • Expertise in building and managing distributed systems, particularly with Kubernetes and container-based platforms.
  • Hands-on experience with modern ML infrastructure and orchestration stacks like Ray, KubeRay, vLLM, or similar.
  • Practical experience with GPU infrastructure, including performance optimization at scale.
  • Strong background in data infrastructure and orchestration technologies such as Airflow and Spark.
  • Experience operating cloud-native systems on platforms like AWS, GCP, or Azure, including infrastructure as code.
  • Ability to guide technical direction for complex systems while balancing immediate deliverables with long-term architectural goals.
  • Exceptional written communication skills and ability to excel in a globally distributed team.
  • A related technical degree is required.

Skills & Requirements

Technical Skills

Kubernetes, GPU infrastructure, KubeRay, vLLM, Ray, Airflow, Spark, AWS, GCP, Azure, Infrastructure as code, Distributed systems, Machine learning, AI, Communication, Leadership, Cloud computing, Data infrastructure

Level

mid

Posted

4/3/2026
