GPU / AI Infrastructure Engineer

KubeRox Technologies

Singapore, SG

On-site

Job Description

Job Summary

We are looking for a GPU / AI Infrastructure Engineer with 5–7 years of experience to build, optimize, and support scalable AI/ML and HPC environments. The ideal candidate will have strong expertise in GPU acceleration, containerized workloads, and MLOps pipelines, along with hands-on experience managing AI infrastructure on on-premises or cloud platforms.

Key Responsibilities

Design, deploy, and manage GPU-enabled infrastructure for AI/ML and HPC workloads.
Install, configure, and optimize GPU software stacks including NVIDIA AI Enterprise, CUDA, ROCm, OpenCL, and NVIDIA NIM.
Support GPU acceleration for machine learning frameworks and scientific applications.
Build and manage containerized environments using Docker, Kubernetes (K8s), and Singularity.
Deploy and manage Kubernetes GPU workloads using the NVIDIA GPU Operator and related ecosystem tools.
Support ML frameworks such as TensorFlow, PyTorch, scikit-learn, and MXNet.
Develop and maintain MLOps pipelines using MLflow and Kubeflow.
Design and implement Infrastructure as Code (IaC) solutions for AI/ML pipelines.
Automate infrastructure provisioning using Terraform, Pulumi, and CloudFormation.
Build and maintain CI/CD pipelines for ML model deployment and infrastructure automation.
Collaborate with data scientists and engineers to optimize model performance and resource utilization.
Monitor GPU utilization and system performance, and troubleshoot issues across the stack.
Ensure scalability, reliability, and security of AI infrastructure environments.

Required Skills & Qualifications

5–7 years of experience in AI/ML infrastructure, HPC, or DevOps engineering roles.
Strong experience with GPU technologies and acceleration frameworks (CUDA, ROCm, OpenCL).
Hands-on experience with the NVIDIA AI Enterprise stack and GPU ecosystem tools (e.g., NVIDIA NIM, GPU Operator).
Proficiency in container technologies: Docker, Kubernetes, and Singularity.
Experience working with ML frameworks: TensorFlow, PyTorch, scikit-learn, MXNet.
Solid understanding of MLOps tools such as MLflow and Kubeflow.
Expertise in Infrastructure as Code (Terraform, Pulumi, CloudFormation).
Experience building and managing CI/CD pipelines for ML or infrastructure workflows.
Strong scripting skills (Python, Bash, or similar).
Familiarity with Linux-based environments.

Skills & Requirements

Technical Skills

GPU acceleration, Containerized workloads, MLOps pipelines, NVIDIA AI Enterprise, CUDA, ROCm, OpenCL, NIM, GPU Operator, Docker, Kubernetes, Singularity, TensorFlow, PyTorch, scikit-learn, MXNet, MLflow, Kubeflow, Infrastructure as Code, Terraform, Pulumi, CloudFormation, CI/CD pipelines, Python, Bash, Linux, AI/ML, HPC, DevOps

Employment Type

Full-time

Level

Senior

Posted

4/30/2026
