GPU / AI Infrastructure Engineer

KubeRox Technologies
Singapore, SG
On-site

Job Description

Job Summary

We are looking for a GPU / AI Infrastructure Engineer with 5–7 years of experience to build, optimize, and support scalable AI/ML and HPC environments. The ideal candidate will have strong expertise in GPU acceleration, containerized workloads, and MLOps pipelines, along with hands-on experience managing AI infrastructure across on-prem or cloud platforms.

Key Responsibilities

  • Design, deploy, and manage GPU-enabled infrastructure for AI/ML and HPC workloads.
  • Install, configure, and optimize GPU software stacks, including NVIDIA AI Enterprise, CUDA, ROCm, OpenCL, and NVIDIA NIM.
  • Support GPU acceleration for machine learning frameworks and scientific applications.
  • Build and manage containerized environments using Docker, Kubernetes (K8s), and Singularity.
  • Deploy and manage Kubernetes GPU workloads using the NVIDIA GPU Operator and related ecosystem tools.
  • Support ML frameworks such as TensorFlow, PyTorch, Scikit-learn, and MXNet.
  • Develop and maintain MLOps pipelines using MLflow and Kubeflow.
  • Design and implement Infrastructure as Code (IaC) solutions for AI/ML pipelines.
  • Automate infrastructure provisioning using Terraform, Pulumi, and CloudFormation.
  • Build and maintain CI/CD pipelines for ML model deployment and infrastructure automation.
  • Collaborate with data scientists and engineers to optimize model performance and resource utilization.
  • Monitor GPU utilization and system performance, and troubleshoot issues across the stack.
  • Ensure scalability, reliability, and security of AI infrastructure environments.

Required Skills & Qualifications

  • 5–7 years of experience in AI/ML infrastructure, HPC, or DevOps engineering roles.
  • Strong experience with GPU technologies and acceleration frameworks (CUDA, ROCm, OpenCL).
  • Hands-on experience with the NVIDIA AI Enterprise stack and GPU ecosystem tools (e.g., NVIDIA NIM, GPU Operator).
  • Proficiency in container technologies: Docker, Kubernetes, and Singularity.
  • Experience working with ML frameworks: TensorFlow, PyTorch, Scikit-learn, MXNet.
  • Solid understanding of MLOps tools such as MLflow and Kubeflow.
  • Expertise in Infrastructure as Code (Terraform, Pulumi, CloudFormation).
  • Experience building and managing CI/CD pipelines for ML or infrastructure workflows.
  • Strong scripting skills (Python, Bash, or similar).
  • Familiarity with Linux-based environments.

Skills & Requirements

Technical Skills

GPU acceleration, Containerized workloads, MLOps pipelines, NVIDIA AI Enterprise, CUDA, ROCm, OpenCL, NIM, GPU Operator, Docker, Kubernetes, Singularity, TensorFlow, PyTorch, Scikit-learn, MXNet, MLflow, Kubeflow, Infrastructure as Code, Terraform, Pulumi, CloudFormation, CI/CD pipelines, Python, Bash, Linux, AI/ML, HPC, DevOps

Employment Type

Full-time

Level

Senior

Posted

4/30/2026
