Sr. Applied ML Specialist, Research Eng.

Vector Institute

Toronto, CA; US

On-site

Job Description

Senior Applied Machine Learning Specialist, Research Engineering

POSITION SUMMARY

As a Senior Applied Machine Learning Specialist, Research Engineering, you will build and scale the tools, infrastructure, and systems that accelerate applied ML research at Vector and across its partner ecosystem. Working closely with applied ML scientists and researchers, you will implement research ideas in code, extend them to broader datasets, model families, and compute regimes, and develop the research engineering foundations that turn research prototypes into reproducible, scalable capabilities.

KEY RESPONSIBILITIES

Design, build, and maintain scalable ML research infrastructure, including training pipelines, experiment orchestration, evaluation harnesses, and data processing systems, that enable researchers to iterate faster across models, datasets, and compute configurations;
Implement and extend ML research prototypes from papers and internal work, taking them from proof-of-concept to robust, reproducible systems capable of scaling across hardware and data regimes;
Develop internal tooling and libraries that reduce friction in the research lifecycle, covering data ingestion and preprocessing, model training and fine-tuning, benchmarking, and results tracking;
Scale applied research efforts by engineering efficient pipelines for multi-dataset, multi-model, and distributed compute workloads, optimizing for both researcher productivity and resource efficiency;
Build and ship open-source research software, reference implementations, and model toolkits following engineering best practices (testing, versioning, documentation, CI/CD);
Collaborate with Applied ML Scientists and researchers to translate research requirements into concrete AI engineering specifications, ensuring systems are designed for extensibility as research directions evolve;
Take ownership of complex, high-effort research engineering initiatives, defining system architecture, leading implementation, and driving delivery end-to-end for large-scale projects that require significant engineering depth and coordination across research and engineering teams;
Communicate engineering progress, system design decisions, and tooling capabilities through technical documentation, demos, and internal presentations; and,
Other related duties as assigned from time to time.

KEY SUCCESS MEASURES

Measurable improvement in research throughput, i.e. researchers running more experiments, across more models and datasets, with less engineering overhead;
Delivery of reliable, well-documented research tooling and infrastructure that becomes a shared foundation for applied research efforts;
Successful scaling of research prototypes to broader compute, data, and model configurations with reproducible results; and,
Active contribution to research engineering culture through code quality, documentation standards, and knowledge-sharing with the broader team.

PROFILE OF THE IDEAL CANDIDATE

Bachelor's degree in computer science, mathematics, electrical engineering, or a related discipline; MSc/MEng preferred, particularly in a machine learning or systems-adjacent field;
Minimum of four years of experience in research engineering, ML infrastructure, or applied ML, with a track record of building systems that directly accelerate research or experimentation workflows;
Demonstrated experience as a technical lead on research engineering or applied ML projects, including owning system architecture, tooling decisions, and delivery from prototype to scalable implementation;
Experience mentoring or leading a team of engineers or researchers is an asset;
Strong proficiency in Python, with emphasis on writing clean, well-tested, and reusable research code;
Hands-on experience building and maintaining ML training and evaluation pipelines, including handling large-scale, heterogeneous, and real-world datasets;
Deep familiarity with leading ML frameworks such as PyTorch, HuggingFace Transformers, and JAX; experience with CUDA or low-level GPU optimization is a strong asset;
Strong command of the ML tooling ecosystem, spanning experiment tracking (e.g., MLflow, W&B), model evaluation and benchmarking, dataset versioning, and model registries;
Experience with distributed training, multi-GPU/multi-node compute orchestration, and cloud-native infrastructure including Kubernetes, Docker, and managed cloud services (GCP/AWS/Azure); familiarity with job schedulers (e.g., SLURM) is an asset;
Familiarity with the full ML research lifecycle, from problem formulation and data curation through training, evaluation, scaling, and reproducibility;
Experience contributing to or maintaining open-source ML libraries, research codebases, or shared internal tooling is strongly preferred; and,
Strong written and verbal communication skills, with the ability to translate research requirements into engineering specifications and document systems clearly for both technica

Skills & Requirements

Technical Skills

Machine learning, Research engineering, ML infrastructure, Applied ML, Training pipelines, Experiment orchestration, Evaluation harnesses, Data processing systems, Open-source research software, Model toolkits, Testing, Versioning, Documentation, CI/CD, Data ingestion, Preprocessing, Model training, Fine-tuning, Benchmarking, Results tracking, Multi-dataset, Multi-model, Distributed compute workloads, AI engineering, System architecture, Engineering best practices, Communication, Collaboration, Leadership, Teamwork, Research, Engineering

Employment Type

FULL TIME

Level

senior

Posted

4/6/2026
