Senior Network Engineer – Data Center / HPC Infrastructure

GTN Technical Staffing

Dallas, US

Hybrid

Job Description

Location: Dallas, TX (Hybrid)

Type: Direct Hire

Competitive base salary + performance bonus

100% company-paid benefits

Relocation available

Overview

We are seeking a Senior Network Engineer to design, build, and operate high-performance data center networks supporting HPC, AI/ML workloads, and next-generation CaaS / GPUaaS platforms.

This role focuses on delivering ultra-low-latency, high-throughput network infrastructure optimized for GPU- and CPU-intensive compute environments. You will play a critical role in enabling scalable, multi-tenant AI infrastructure by ensuring network performance, reliability, and efficiency across distributed data center environments.

The ideal candidate brings deep expertise in modern data center networking, hands-on experience with high-performance fabrics, and a strong understanding of networking requirements for GPU-accelerated and containerized platforms at scale.

Key Responsibilities

Data Center & HPC Network Engineering

Design, implement, and operate high-performance data center networks supporting HPC, AI/ML, and GPUaaS / CaaS environments
Optimize architectures for east-west traffic, low latency, and high throughput across large-scale compute clusters
Support distributed GPU and CPU workloads, ensuring consistent performance under heavy parallel processing demands

Network Architecture & Multi-Tenant Design

Design and manage leaf-spine / Clos architectures using EVPN-VXLAN overlays
Build scalable, multi-tenant network architectures supporting workload isolation and segmentation for CaaS / GPUaaS platforms
Support DCI, backbone connectivity, and hybrid/cloud on-ramp strategies

Performance, Reliability & Optimization

Monitor and tune network performance for latency, throughput, and congestion across HPC environments
Perform deep packet inspection, traffic flow analysis, and root cause troubleshooting
Drive capacity planning and scaling strategies aligned with compute and GPU cluster growth
Ensure high availability through redundancy, failover validation, and operational rigor

Automation & Infrastructure Engineering

Develop network automation frameworks using Python, Ansible, Git, and Jinja2
Implement Infrastructure-as-Code (IaC) and CI/CD pipelines for network provisioning and changes
Standardize and scale network deployments across environments

Observability & Telemetry

Implement telemetry and monitoring solutions to provide real-time visibility into network performance
Analyze metrics to proactively identify risks and optimize system behavior
Integrate network observability into broader platform monitoring ecosystems

Cross-Functional Collaboration

Partner with HPC platform, compute, storage, and infrastructure teams to align network architecture with workload demands
Collaborate with architecture and engineering teams on new environment design and deployment
Work closely with vendors to validate performance, interoperability, and scalability

Technical Leadership

Serve as a senior escalation point for network incidents and complex troubleshooting
Mentor junior engineers and contribute to documentation, standards, and best practices
Drive continuous improvement across network architecture, operations, and tooling

Required Experience

5+ years (ideally 8+) of experience designing and supporting large-scale data center networks
Experience supporting HPC, AI/ML, or GPU-accelerated infrastructure environments
Experience working with or supporting CaaS, GPUaaS, or multi-tenant platform architectures
Strong expertise with leaf-spine / Clos architectures; EVPN, VXLAN, BGP, and MPLS; and Cisco and/or Arista platforms (NX-OS, EOS, IOS-XR)
Strong understanding of low-latency, high-throughput network optimization
Proven troubleshooting experience in complex distributed environments

Technical Skills

Network automation: Python, Ansible, Jinja2, Git
Infrastructure-as-Code (IaC) and CI/CD pipelines
Network observability, telemetry, and monitoring tools
Packet analysis and traffic flow diagnostics

Preferred Experience

Experience with HPC networking concepts (GPU clusters, distributed training environments)
Familiarity with InfiniBand, RDMA, or RoCE networking
Experience in hyperscale or AI-focused data center environments
CCNP or equivalent certification; CCIE or other advanced certifications a plus

Additional Requirements

This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
We are unable to sponsor or take over sponsorship of employment visas at this time.

Skills & Requirements

Technical Skills

Python, Ansible, Git, Jinja2, EVPN-VXLAN, InfiniBand, RDMA, RoCE, GPU clusters, Distributed training environments, CCNP, CCIE, Data center, HPC, AI/ML, GPUaaS, CaaS

Employment Type

Full-time

Level

Senior

Posted

4/14/2026
