Research Scientist - Post-Training

Sentiro Partners

Miami, US

On-site

Job Description

"I left a place where the models were bigger but the questions were smaller.

Here, I own the pipeline. I see the failure modes directly. When something breaks at scale, I find out why, and I get to fix it. Different arena, same class of problem." - your next colleague.

Scope

Sentiro Partners is working with a world-leading research operation on a focused mandate: building and aligning large-scale models under real constraints, not academic ones. This is one of the most prestigious research environments in the world.
This is not a frontier lab in the conventional sense. It is a research environment with frontier-lab ambition, frontier-lab compute, and a more demanding feedback loop. Models are trained against live signals. Failure is observable. Progress is unambiguous.
This role sits inside an elite, senior post-training research group led by one of the field's recognised practitioners.
The team works on how large models learn from human feedback, how reward signals shape behaviour at scale, and how training dynamics evolve across the alignment pipeline.
You will have direct ownership over post-training infrastructure, RLHF loop design, and the failure modes that emerge when optimisation meets real-world distribution.
This is not a product ML role. This is not post-hoc evaluation. This is not incremental fine-tuning work.
If your background is in post-training, alignment, or reward modelling at a frontier lab, this role is designed to feel familiar and likely more demanding.
You have likely published at NeurIPS, ICML, ICLR, or ACL. More importantly, you have built systems that run in production and broken them enough times to understand why.

What you will work on

Why a reward model that scores well in offline evaluation produces degenerate behaviour at scale.
How preference data construction choices propagate into capability gaps three training stages later.
Where the optimisation dynamics of a large policy model diverge from what theory predicts, and what that tells you about the objective.
How to build an RLHF pipeline that is robust to distribution shift without sacrificing the signal that made it work in the first place.
What it actually means for a model to generalise from human feedback versus overfit to the annotators who generated it.
Et cetera ;-).

What makes this different

You own entire research threads, end to end.
You can change how models are trained and aligned, not just how they are evaluated. Feedback cycles are tight and the signal is real.
Success is defined by robustness and alignment quality, not publication count.

Who this is for

You may be a strong fit if you have worked on:

post-training, RLHF, DPO, or preference optimisation at scale.
reward modelling, constitutional approaches, or process-based supervision.
training dynamics and optimisation behaviour in large language models.
systems-aware ML where infrastructure decisions and modelling choices are inseparable.

We are not optimising for narrow domain specialists, pure infrastructure engineers, or product-driven ML roles.

Environment

Small, elite research team. The majority hold PhDs from leading programmes and have trained under highly cited researchers.
Deep technical autonomy. In-office collaboration on the East Coast.
Compensation and resources competitive with top-tier research labs.
Unlimited compute.
World-class reward packages and meaningful upside.

Requirements

Post-graduate degree in machine learning, computer science, NLP, or a related discipline, or equivalent commercial experience developing novel ML algorithms.
Demonstrable research depth through internships or employment in leading research or frontier lab environments.
Three to eight years of post-graduate experience.
NB: Exceptional problem-solving skills and extreme curiosity. Being a good colleague with ambition helps.

Location

East Coast USA. Full visa sponsorship provided for international talent.

Reward

Top-tier packages and additional incentives. Calibre is everything.

About | Sentiro Partners | Leadership for the Augmentation Era

We scout the frontier to secure the transformational leaders, experts and mavericks who will define the future of the human & agentic workforce.

Providing Executive Search, Talent Augmentation & Embedded (fractional) Executive Search Services globally.

Sentiro Partners is a frontier search firm specializing in the intersection of advanced technology, data science, and quantitative research as applied to a breadth of industries. Operating internationally, we focus exclusively on identifying and embedding exceptional talent into frontier research environments: the organizations working on genuinely challenging problems that require both intellectual depth and world-class engineering and execution. We work with organizations with the highest barriers to entry.

W.: https://sentiropartners.com/frontierailabtalent

Skills & Requirements

Technical Skills

post-training, RLHF, DPO, reward modelling, alignment, optimization, research, problem-solving, collaboration, AI, ML, large language models

Employment Type

FULL TIME

Level

senior

Posted

3/27/2026
