Senior Machine Learning Engineer – GPU Optimization & CUDA Systems

Fintal Partners

Chicago, US

On-site

Why this role

Pace

Fast-paced

Collaboration

Medium

Autonomy

High

Decision Impact

Individual

Role Level

Individual Contributor

Derived from job-description analysis by Serendipath's career intelligence engine.

What success looks like

Developing and optimizing CUDA kernels
Improving GPU utilization
Profiling and debugging GPU workloads

Typical background

Machine learning
Systems engineering
High-performance computing

Transferable backgrounds

Coming from Systems Engineering
Coming from High-Performance Computing

Skills & requirements

Required

C++
CUDA
GPU Optimization
PyTorch
Triton
NCCL
CUTLASS

Preferred

Kubernetes
Docker

Stack & domain

C++
CUDA
PyTorch
Triton
NCCL
CUTLASS
High-frequency Trading

About the role

Original posting from Fintal Partners via LinkedIn

A market-leading high-frequency trading firm is seeking Senior Machine Learning Engineers to join a specialist performance engineering team focused on low-level optimization for large-scale AI workloads.
This role is heavily focused on GPU performance, CUDA kernel optimization, and systems-level acceleration work in the later stages of the ML pipeline. The team works on extracting maximum performance from modern hardware architectures to support highly demanding training and inference workloads.
You will work close to the metal, optimizing critical components across CUDA, C++, memory management, and GPU execution paths. The work combines deep systems engineering with cutting-edge machine learning infrastructure.
Key responsibilities:
Develop and optimize CUDA kernels for high-performance ML workloads
Improve GPU utilization, memory efficiency, and execution performance
Profile and optimize bottlenecks across training and inference pipelines
Work on compiler/runtime-level optimizations and kernel fusion strategies
Collaborate with ML systems and infrastructure teams on end-to-end acceleration
Build highly optimized C++ components for latency- and throughput-sensitive systems
Requirements:
Strong C++ and CUDA development experience
Deep understanding of GPU architecture and performance optimization
Experience profiling and debugging GPU workloads using tools such as Nsight
Knowledge of PyTorch internals, Triton, NCCL, CUTLASS, or similar frameworks
Strong systems programming background with a focus on performance engineering
Experience working on high-throughput or low-latency distributed systems
Computer Science, Mathematics, Physics, Engineering, or related technical degree preferred
This is an opportunity to work on some of the most technically challenging AI infrastructure problems in the industry, within an environment that values engineering excellence, autonomy, and performance.

Source: Fintal Partners careers (LinkedIn)