Senior Network Engineer – Data Center / HPC Infrastructure

GTN Technical Staffing
Dallas, US
Hybrid

Job Description

Location: Dallas, TX (Hybrid)

Type: Direct Hire

  • Competitive base salary + performance bonus
  • 100% company-paid benefits
  • Relocation available

Overview

We are seeking a Senior Network Engineer to design, build, and operate high-performance data center networks supporting HPC, AI/ML workloads, and next-generation CaaS / GPUaaS platforms.

This role focuses on delivering ultra-low-latency, high-throughput network infrastructure optimized for GPU- and CPU-intensive compute environments. You will play a critical role in enabling scalable, multi-tenant AI infrastructure by ensuring network performance, reliability, and efficiency across distributed data center environments.

The ideal candidate brings deep expertise in modern data center networking, hands-on experience with high-performance fabrics, and a strong understanding of networking requirements for GPU-accelerated and containerized platforms at scale.

Key Responsibilities

Data Center & HPC Network Engineering

  • Design, implement, and operate high-performance data center networks supporting HPC, AI/ML, and GPUaaS / CaaS environments
  • Optimize architectures for east-west traffic, low latency, and high throughput across large-scale compute clusters
  • Support distributed GPU and CPU workloads, ensuring consistent performance under heavy parallel processing demands

Network Architecture & Multi-Tenant Design

  • Design and manage leaf-spine / Clos architectures using EVPN-VXLAN overlays
  • Build scalable, multi-tenant network architectures supporting workload isolation and segmentation for CaaS / GPUaaS platforms
  • Support DCI, backbone connectivity, and hybrid/cloud on-ramp strategies

Performance, Reliability & Optimization

  • Monitor and tune network performance for latency, throughput, and congestion across HPC environments
  • Perform deep packet inspection, traffic flow analysis, and root cause troubleshooting
  • Drive capacity planning and scaling strategies aligned with compute and GPU cluster growth
  • Ensure high availability through redundancy, failover validation, and operational rigor

Automation & Infrastructure Engineering

  • Develop network automation frameworks using Python, Ansible, Git, and Jinja2
  • Implement Infrastructure-as-Code (IaC) and CI/CD pipelines for network provisioning and changes
  • Standardize and scale network deployments across environments

Observability & Telemetry

  • Implement telemetry and monitoring solutions to provide real-time visibility into network performance
  • Analyze metrics to proactively identify risks and optimize system behavior
  • Integrate network observability into broader platform monitoring ecosystems

Cross-Functional Collaboration

  • Partner with HPC platform, compute, storage, and infrastructure teams to align network architecture with workload demands
  • Collaborate with architecture and engineering teams on new environment design and deployment
  • Work closely with vendors to validate performance, interoperability, and scalability

Technical Leadership

  • Serve as a senior escalation point for network incidents and complex troubleshooting
  • Mentor junior engineers and contribute to documentation, standards, and best practices
  • Drive continuous improvement across network architecture, operations, and tooling

Required Experience

  • 5–8+ years of experience designing and supporting large-scale data center networks
  • Experience supporting HPC, AI/ML, or GPU-accelerated infrastructure environments
  • Experience working with or supporting CaaS, GPUaaS, or multi-tenant platform architectures
  • Strong expertise with:
      • Leaf-spine / Clos architectures
      • EVPN, VXLAN, BGP, MPLS
      • Cisco and/or Arista platforms (NX-OS, EOS, IOS-XR)
  • Strong understanding of low-latency, high-throughput network optimization
  • Proven troubleshooting experience in complex distributed environments

Technical Skills

  • Network automation: Python, Ansible, Jinja2, Git
  • Infrastructure-as-Code (IaC) and CI/CD pipelines
  • Network observability, telemetry, and monitoring tools
  • Packet analysis and traffic flow diagnostics

Preferred Experience

  • Experience with HPC networking concepts (GPU clusters, distributed training environments)
  • Familiarity with InfiniBand, RDMA, or RoCE networking
  • Experience in hyperscale or AI-focused data center environments
  • CCNP or equivalent certification; CCIE or other advanced certifications a plus

Additional Requirements

  • This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
  • We are unable to sponsor or take over sponsorship of employment visas at this time.

Skills & Requirements

Technical Skills

Python, Ansible, Git, Jinja2, EVPN-VXLAN, InfiniBand, RDMA, RoCE, GPU clusters, Distributed training environments, CCNP, CCIE, Data center, HPC, AI/ML, GPUaaS, CaaS

Employment Type

Full-time

Level

Senior

Posted

4/14/2026
