AI/HPC Engineer (Senior Technical Role)

AIHostingHub

On-site

Job Description

Company Description

AIHostingHub is the UAE's leading provider of cutting-edge AI and High-Performance Computing (HPC) infrastructure. We specialize in building large-scale AI data centers and delivering GPU-as-a-Service, from nimble deployments to massive clusters. As a trusted professional services partner for industry giants like Supermicro and VAST Data in the GCC, we provide the technology, expertise, and support to fuel your most ambitious projects.

Our Services

AI/HPC Data Centers: Custom-built, scalable environments optimized for the most demanding AI workloads.
GPU as a Service: On-demand access to massive GPU clusters, from 2,048 GPUs to over 16,384 GPUs per cluster.
Cybersecurity MSSP: Fortinet- and AttackIQ-powered 24/7 managed security to protect your critical infrastructure and data.
Expert Professional Services: End-to-end support from design and deployment to optimization, directly from GCC-based partners.

AIHostingHub prides itself on delivering customized security solutions, dedicated support, and strategic guidance, ensuring that clients can operate confidently in the digital landscape. Explore the future of cybersecurity with AIHostingHub, where protection is the top priority.

Role Description

We are seeking an experienced AI/HPC Engineer for a full-time, on-site position based in Dubai. In this senior technical role, you will design, implement, and optimize AI and High-Performance Computing (HPC) solutions.

Responsibilities include developing and deploying GPU-based solutions, integrating container orchestration tools, advancing neural network architectures, and collaborating with cross-functional teams to accelerate AI capabilities. You will also conduct research and implement best practices for handling large-scale workloads.

The AI/HPC Engineer will manage and optimize large‑scale GPU clusters (HGX H100/H200) within a critical datacenter environment. You will be responsible for cluster health, firmware updates, fabric diagnostics, RMA coordination, and performance validation. This role directly supports AI training and inference workloads.

Key Responsibilities

Deploy, monitor, and maintain GPU compute nodes (HGX H100/H200), InfiniBand (NDR) fabric, and Ethernet management/storage networks.

Perform post‑repair acceptance testing using CUDA P2P, NCCL, HPL, DCGMI, STREAM, IOR, and LLM validation.

Diagnose and resolve hardware failures (GPU ECC errors, NIC flapping, overheating, memory faults) in coordination with OEMs (Nvidia, Supermicro, Weka).

Manage RMA processes, spare parts, and vendor warranties.

Support BIOS/firmware updates and maintenance windows.

Collaborate with remote AI engineering teams for physical troubleshooting, cabling, and fabric health assurance.

Contribute to root cause analysis and service improvement plans.

Required Qualifications

3+ years in HPC, AI infrastructure, or datacenter engineering.

Deep experience with Nvidia GPUs (H100/H200), InfiniBand (NDR), and RoCE/Ethernet fabrics.

Proficiency in Linux system administration, GPU driver/firmware management, and performance benchmarking (NCCL, HPL, DCGM).

Familiarity with Slurm, Kubernetes, and AI frameworks (PyTorch/TensorFlow) is a plus.

Strong understanding of liquid cooling, power/capacity management, and environmental monitoring.

Ability to follow strict change control, security, and incident management processes.

UAE on‑site availability; DIAC/DMCA knowledge is an advantage.

We Offer

Competitive compensation + service credits/penalty structures per MSA.

Work on a 256+ GPU node cluster.

Direct impact on AI production uptime (99% availability targets).

Skills & Requirements

Technical Skills

AI, HPC, GPU, Container orchestration, Neural network architectures, Linux system administration, GPU driver/firmware management, Performance benchmarking, Slurm, Kubernetes, AI frameworks, Liquid cooling, Power/capacity management, Environmental monitoring, Change control, Security, Incident management, Collaboration, Adaptability, Leadership, AI data centers, GPU-as-a-Service, HGX H100/H200, InfiniBand, RoCE/Ethernet fabrics, AI training, AI inference, AI production uptime

Salary

$16,384+ per year

Employment Type

FULL TIME

Level

senior

Posted

4/20/2026
