ML Platform Engineer: Scalable Research Pipelines

Leadingnation

Hong Kong, HK

Remote

Job Description

Responsibilities

Lead the design and development of a scalable, reliable, and reproducible machine learning research platform.
Build infrastructure to support large-scale experimentation, model training, and simulation across both on‑premises high‑performance compute environments and multi‑cloud setups.
Work closely with researchers to understand evolving workflows and translate those needs into robust platform capabilities.
Architect and optimize distributed training pipelines for high-throughput, GPU‑accelerated workloads.
Enhance experiment management, model versioning, artifact tracking, and data lineage to ensure transparent and repeatable research processes.
Develop tools and frameworks that improve feature engineering, dataset creation, and large-scale backtesting.
Drive initiatives to improve compute efficiency, resource allocation, and workload isolation across heterogeneous environments.
Enhance platform observability with improved metrics, logging, tracing, and debugging capabilities tailored to ML and distributed systems.
Support rapid iteration by delivering features and fixes quickly while maintaining strong engineering standards.
Contribute to long-term architectural planning to ensure the platform scales with growing data volumes and model complexity.

Qualifications

2+ years of experience designing and building distributed systems at scale, ideally supporting research or data-heavy workloads.
Strong programming skills in Python with a focus on clean, maintainable, high-performance code.
Experience running applications on Linux-based HPC clusters and/or cloud computing platforms.
Solid understanding of distributed computing, parallel processing, and resource management.
Hands-on experience with GPU workloads and familiarity with modern ML frameworks such as PyTorch, TensorFlow, or JAX.
Experience optimizing data pipelines and handling large structured and unstructured datasets.
Strong debugging skills with the ability to diagnose issues across multiple layers of the stack.
Comfortable working independently in a fast-paced, research-oriented environment.
Strong communication skills and experience collaborating directly with researchers or data-focused teams.

Preferred Attributes

Experience building internal ML platforms or research tooling at scale.
Familiarity with experiment‑tracking tools, workflow orchestration systems, and model lifecycle management.
Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
Exposure to high-performance or latency-sensitive domains such as quantitative research, simulation systems, or large‑scale distributed compute.

Python, Linux, Docker, Kubernetes, PyTorch, TensorFlow, JAX, Communication

Full-time

Mid-level

4/14/2026
