Staff Production Engineer, Managed AI

Crusoe
Remote

Job Description

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About the Role:

Crusoe’s Production Engineering team ensures the reliability, scalability, and operational excellence of our AI-optimized cloud platform. We’re looking for a Staff Production Engineer with deep experience in distributed systems and hands-on exposure to large language models to help build and operate managed AI services at scale.

This role sits at the intersection of software engineering and infrastructure, focusing on designing, operating, and improving the production systems that power Crusoe’s managed AI platform. You will help ensure highly available, performant, and cost-efficient infrastructure capable of supporting compute-intensive, latency-sensitive AI workloads for customers running large-scale training and inference.

What You’ll Work On:

  • Design and operate reliable production systems for managed AI services, with a focus on serving and scaling LLM workloads
  • Build automation, tooling, and reliability systems to support distributed AI pipelines and inference platforms
  • Define, measure, and improve SLIs and SLOs across AI workloads to ensure performance and reliability targets are consistently met
  • Partner with AI, platform, and infrastructure teams to improve reliability, efficiency, and scaling of large-scale training and inference clusters
  • Build observability and telemetry systems to monitor latency-sensitive AI services and identify performance bottlenecks
  • Investigate and resolve reliability issues in distributed production environments using logs, metrics, tracing, and profiling
  • Contribute to the architecture of next-generation AI infrastructure and distributed systems designed for large-scale production environments
  • Drive improvements in operational automation, incident response, and system resiliency across Crusoe’s AI platform

What You’ll Bring:

  • Strong software engineering background, with experience building and operating production-grade systems beyond scripting or basic automation
  • Demonstrated experience designing and operating large-scale distributed systems
  • Hands-on experience working with LLMs or AI/ML infrastructure, including training or inference systems
  • A Production Engineering / SRE mindset, including experience with:
      • Defining and measuring SLIs and SLOs
      • Building monitoring and observability systems
      • Driving performance and reliability improvements in production environments
      • Designing fault-tolerant systems and automated testing strategies
  • Proficiency in at least one modern programming language such as Python, Go, Java, or C++
  • Experience working with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills across engineering teams
  • Ability to thrive in a fast-moving, mission-driven environment

Bonus Points:

  • Experience scaling LLM training or inference workloads in production environments
  • Experience building or operating AI platforms or managed AI services

Benefits:

  • Competitive compensation
  • Restricted Stock Units
  • Paid time off & paid holidays
  • Comprehensive health, dental & vision insurance
  • Employer HSA contributions
  • Paid parental leave
  • Paid life insurance, short-term and long-term disability
  • Professional development & tuition reimbursement
  • Mental health & wellness support
  • Commuter benefits (parking & transit)
  • Cell phone stipend
  • 401(k) Retirement plan with company match up to 4% of salary
  • Volunteer time off

Compensation:

Compensation will be paid in the range of $204,000 – $247,000 + bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

Skills & Requirements

Technical Skills

Distributed systems, Large language models, SLIs and SLOs, Observability and telemetry systems, AI infrastructure, Cloud services, Data center construction, Energy, Manufacturing, Problem-solving, Opportunity-finding, Sense of urgency, Teamwork, Leadership, AI, Infrastructure, Cloud, Data center

Soft Skills

Leadership, Communication

Domain Knowledge

AI, ML

Salary

$204,000 - $247,000 per year

Employment Type

Full-time

Level

senior

Posted

1/24/2026
