Derived from job-description analysis by Serendipath's career intelligence engine.
Original posting from Fintal Partners via LinkedIn
A leading high-frequency trading firm is building out a world-class machine learning platform team focused on large-scale model training and ultra-low-latency inference. This team owns the infrastructure powering next-generation AI research and production systems across the business.
They are looking for Senior Machine Learning Engineers with deep experience building and scaling distributed training and inference systems for large models. The role sits at the intersection of ML systems, distributed computing, and high-performance infrastructure.
You will design and optimize large-scale PyTorch training pipelines, improve GPU cluster utilization, and build highly reliable inference infrastructure capable of operating at massive scale with extremely low latency. The environment is highly technical, fast-paced, and engineering-driven.
Key responsibilities:
- Build and scale distributed training systems for large deep learning models
- Optimize PyTorch-based training and inference performance across GPU clusters
- Design high-throughput, low-latency inference infrastructure for production workloads
- Improve scheduling, orchestration, checkpointing, and data pipeline efficiency
- Work closely with researchers and infrastructure engineers to productionize models
- Drive performance improvements across networking, memory, storage, and compute layers
Requirements:
- Strong experience with large-scale ML systems and distributed training
- Deep expertise in PyTorch and modern deep learning infrastructure
- Experience with technologies such as Kubernetes, Ray, Slurm, NCCL, Triton, or similar
- Strong Python engineering skills with solid systems knowledge
- Experience scaling GPU infrastructure in production environments
- Background in performance optimization and reliability engineering
- Computer Science, Mathematics, Physics, or related technical degree preferred
The firm offers exceptional compensation, access to cutting-edge compute infrastructure, and the opportunity to work alongside some of the strongest engineers and researchers in the industry.