Job Summary
We are looking for a GPU / AI Infrastructure Engineer with 5–7 years of experience to build, optimize, and support scalable AI/ML and HPC environments. The ideal candidate will have strong expertise in GPU acceleration, containerized workloads, and MLOps pipelines, along with hands-on experience managing AI infrastructure on on-premises or cloud platforms.
Key Responsibilities
- Design, deploy, and manage GPU-enabled infrastructure for AI/ML and HPC workloads.
- Install, configure, and optimize GPU software stacks, including NVIDIA AI Enterprise, NVIDIA NIM, CUDA, ROCm, and OpenCL.
- Support GPU acceleration for machine learning frameworks and scientific applications.
- Build and manage containerized environments using Docker, Kubernetes (K8s), and Singularity.
- Deploy and manage Kubernetes GPU workloads using the NVIDIA GPU Operator and related ecosystem tools.
- Support ML frameworks such as TensorFlow, PyTorch, scikit-learn, and MXNet.
- Develop and maintain MLOps pipelines using MLflow and Kubeflow.
- Design and implement Infrastructure as Code (IaC) solutions for AI/ML pipelines.
- Automate infrastructure provisioning using Terraform, Pulumi, and CloudFormation.
- Build and maintain CI/CD pipelines for ML model deployment and infrastructure automation.
- Collaborate with data scientists and engineers to optimize model performance and resource utilization.
- Monitor GPU utilization and system performance, and troubleshoot issues across the stack.
- Ensure scalability, reliability, and security of AI infrastructure environments.
Required Skills & Qualifications
- 5–7 years of experience in AI/ML infrastructure, HPC, or DevOps engineering roles.
- Strong experience with GPU technologies and acceleration frameworks (CUDA, ROCm, OpenCL).
- Hands-on experience with the NVIDIA AI Enterprise stack and GPU ecosystem tools (e.g., NVIDIA NIM, the NVIDIA GPU Operator).
- Proficiency in container technologies: Docker, Kubernetes, and Singularity.
- Experience working with ML frameworks: TensorFlow, PyTorch, scikit-learn, MXNet.
- Solid understanding of MLOps tools such as MLflow and Kubeflow.
- Expertise in Infrastructure as Code (Terraform, Pulumi, CloudFormation).
- Experience building and managing CI/CD pipelines for ML or infrastructure workflows.
- Strong scripting skills (Python, Bash, or similar).
- Familiarity with Linux-based environments.