Senior Network Engineer – Data Center / HPC Infrastructure
Location: Dallas, TX (Hybrid)
Type: Direct Hire
Compensation & Benefits
- Competitive base salary + performance bonus
- 100% company-paid benefits
Overview
We are seeking a Senior Network Engineer to design, build, and operate high-performance data center networks supporting HPC, AI/ML workloads, and next-generation CaaS / GPUaaS platforms.
This role focuses on delivering ultra-low-latency, high-throughput network infrastructure optimized for GPU- and CPU-intensive compute environments. You will play a critical role in enabling scalable, multi-tenant AI infrastructure by ensuring network performance, reliability, and efficiency across distributed data center environments.
The ideal candidate brings deep expertise in modern data center networking, hands-on experience with high-performance fabrics, and a strong understanding of networking requirements for GPU-accelerated and containerized platforms at scale.
Key Responsibilities
Data Center & HPC Network Engineering
- Design, implement, and operate high-performance data center networks supporting HPC, AI/ML, and GPUaaS / CaaS environments
- Optimize architectures for east-west traffic, low latency, and high throughput across large-scale compute clusters
- Support distributed GPU and CPU workloads, ensuring consistent performance under heavy parallel processing demands
Network Architecture & Multi-Tenant Design
- Design and manage leaf-spine / Clos architectures using EVPN-VXLAN overlays
- Build scalable, multi-tenant network architectures supporting workload isolation and segmentation for CaaS / GPUaaS platforms
- Support DCI, backbone connectivity, and hybrid/cloud on-ramp strategies
Performance, Reliability & Optimization
- Monitor and tune network performance for latency, throughput, and congestion across HPC environments
- Perform packet-level analysis, traffic flow analysis, and root cause troubleshooting
- Drive capacity planning and scaling strategies aligned with compute and GPU cluster growth
- Ensure high availability through redundancy, failover validation, and operational rigor
Automation & Infrastructure Engineering
- Develop network automation frameworks using Python, Ansible, Git, and Jinja2
- Implement Infrastructure-as-Code (IaC) and CI/CD pipelines for network provisioning and changes
- Standardize and scale network deployments across environments
Observability & Telemetry
- Implement telemetry and monitoring solutions to provide real-time visibility into network performance
- Analyze metrics to proactively identify risks and optimize system behavior
- Integrate network observability into broader platform monitoring ecosystems
Cross-Functional Collaboration
- Partner with HPC platform, compute, storage, and infrastructure teams to align network architecture with workload demands
- Collaborate with architecture and engineering teams on new environment design and deployment
- Work closely with vendors to validate performance, interoperability, and scalability
Technical Leadership
- Serve as a senior escalation point for network incidents and complex troubleshooting
- Mentor junior engineers and contribute to documentation, standards, and best practices
- Drive continuous improvement across network architecture, operations, and tooling
Required Experience
- 5–8+ years of experience designing and supporting large-scale data center networks
- Experience supporting HPC, AI/ML, or GPU-accelerated infrastructure environments
- Experience working with or supporting CaaS, GPUaaS, or multi-tenant platform architectures
- Strong expertise with:
  - Leaf-spine / Clos architectures
  - EVPN, VXLAN, BGP, MPLS
  - Cisco and/or Arista platforms (NX-OS, EOS, IOS-XR)
- Strong understanding of low-latency, high-throughput network optimization
- Proven troubleshooting experience in complex distributed environments
Technical Skills
- Network automation: Python, Ansible, Jinja2, Git
- Infrastructure-as-Code (IaC) and CI/CD pipelines
- Network observability, telemetry, and monitoring tools
- Packet analysis and traffic flow diagnostics
Preferred Experience
- Experience with HPC networking concepts (GPU clusters, distributed training environments)
- Familiarity with InfiniBand, RDMA, or RoCE networking
- Experience in hyperscale or AI-focused data center environments
- CCNP or equivalent certification; CCIE or other advanced certifications a plus
Additional Requirements
- This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
- We are unable to sponsor or take over sponsorship of employment visas at this time.