Lead the design and development of a scalable, reliable, and reproducible machine learning research platform.
Build infrastructure to support large-scale experimentation, model training, and simulation across both on-premises high-performance computing environments and multi-cloud setups.
Work closely with researchers to understand evolving workflows and translate those needs into robust platform capabilities.
Architect and optimize distributed training pipelines for high-throughput, GPU-accelerated workloads.
Enhance experiment management, model versioning, artifact tracking, and data lineage to ensure transparent and repeatable research processes.
Develop tools and frameworks that improve feature engineering, dataset creation, and large-scale backtesting.
Drive initiatives to improve compute efficiency, resource allocation, and workload isolation across heterogeneous environments.
Strengthen platform observability with metrics, logging, tracing, and debugging capabilities tailored to ML and distributed systems.
Support rapid iteration by delivering features and fixes quickly while maintaining strong engineering standards.
Contribute to long-term architectural planning to ensure the platform scales with growing data volumes and model complexity.
Qualifications
2+ years of experience designing and building distributed systems at scale, ideally supporting research or data-heavy workloads.
Strong programming skills in Python with a focus on clean, maintainable, high-performance code.