Applied Data Scientist, LLM Evaluation
Introduction
At Driver, we're building systems that turn source code into human language. The tech stack includes a core compiler-like engine, a heavily asynchronous/distributed backend server, and a frontend web application that provides a rich user experience.
About Driver
We're an early-stage startup backed by Y Combinator and Google Ventures that combines first-principles technical approaches with applied LLM expertise to tackle context engineering at scale. Driver builds the context layer that employees and AI agents alike use to develop software.
Working at Driver
Driver is an early-stage but fast-growing startup. As such, we take advantage of what startups do best: delivery speed, flexibility, and the enjoyment of working with a small, close-knit team.
Organizational and engineering values at Driver include first-principles thinking, correct by construction, writing things down, experimentation and iteration, pragmatism, commitment to effective communication and transparency, autonomy, and ambition.
Job Overview
Title: Applied Data Scientist, LLM Evaluation
Location: Remote or Austin, TX
Our value is directly tied to the quality of our content at scale. The platform generates technical documentation across a complex, multi-stage pipeline - producing multiple content types at different levels of abstraction, from individual code elements up to high-level summaries. Today, changes to models, context strategies, or pipeline architecture are evaluated largely through manual review and intuition. There is no systematic way to answer: "Did this change make our output better, worse, or the same - and for which languages, repo sizes, and content types?"
This is a hard problem. LLM outputs are non-deterministic - identical inputs produce different outputs across runs, and small variations at early pipeline stages compound into meaningfully different end-user content downstream. Evaluating quality requires a methodology that accounts for this: statistical reasoning over multiple runs, understanding of cascade effects through the pipeline, and rubrics that balance human judgment with automated signals.
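One concrete way to handle that non-determinism is to score each pipeline variant over several runs and compare the score distributions rather than single outputs. The sketch below is illustrative only (the function name and scores are hypothetical, not Driver's actual tooling): it bootstraps a confidence interval for the difference in mean rubric scores between two variants.

```python
import random
import statistics

def bootstrap_diff_ci(scores_a, scores_b, n_boot=5000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the difference in mean quality
    scores between two pipeline variants, each scored over multiple runs.
    Returns (low, high) bounds of the (1 - alpha) interval."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each variant's run scores with replacement.
        resample_a = [rng.choice(scores_a) for _ in scores_a]
        resample_b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(statistics.mean(resample_b) - statistics.mean(resample_a))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Hypothetical rubric scores (0-1) from five runs of each variant.
baseline = [0.72, 0.68, 0.75, 0.70, 0.71]
candidate = [0.78, 0.81, 0.76, 0.80, 0.79]
low, high = bootstrap_diff_ci(baseline, candidate)
# If the interval excludes 0, the change likely moved quality rather
# than being noise from run-to-run variation.
```

The same pattern extends to per-language or per-content-type slices by running the comparison within each slice.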
This role builds the evaluation function from scratch. You'll define what "good" means for our generated content, build the infrastructure to measure it, and create the experimental framework that lets the team ship changes with confidence.
What You'll Do
You'll own the LLM evaluation strategy at Driver - from first principles to production infrastructure. This is a foundational role: you're not joining an existing eval team; you're building it. As the function matures, you'll seed and grow a team around it.
Define quality metrics and build evaluation datasets. Establish what "good" looks like for each content type across the pipeline. Build and curate gold-standard evaluation datasets across languages and repo archetypes (monorepos, microservices, libraries, applications). Design rubrics that capture accuracy, completeness, usefulness, and readability.
Build benchmarking and experimentation infrastructure. Create automated evaluation pipelines that score output against reference datasets. Instrument the content generation pipeline to support A/B comparisons - run the same codebase through two strategies and compare results. Build tooling for LLM-as-judge evaluation and regression detection. Integrate evaluation into CI so pipeline changes come with quality evidence.
Develop automated quality signals at scale. Build quality checks that flag degraded output without requiring human review of every document. Monitor content quality trends over time. Design sampling strategies for human review that maximize signal with minimal annotation effort.
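As an illustration of the sampling idea above, a minimal stratified sampler (field names are hypothetical; it assumes each generated document is tagged with its stratum, e.g. source language) caps the annotation budget per stratum so rare categories still get human review:

```python
import random
from collections import defaultdict

def stratified_sample(docs, key, per_stratum, seed=0):
    """Pick up to `per_stratum` documents from each stratum so human
    review covers the whole output space instead of whatever dominates
    the queue."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)
    sample = []
    for group in strata.values():
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical generated-doc records tagged by language.
docs = [{"id": i, "lang": lang} for i, lang in
        enumerate(["python"] * 50 + ["go"] * 30 + ["rust"] * 5)]
review_batch = stratified_sample(docs, key=lambda d: d["lang"], per_stratum=3)
# Each language contributes three documents, so the rare "rust" slice
# is reviewed despite making up ~6% of the corpus.
```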
Quantify tradeoffs and inform decisions. Run experiments on model selection, context strategies, and pipeline architecture changes. Quantify cost/quality/latency tradeoffs. Partner with the engineering team to turn evaluation insights into shipped improvements.
Qualifications
Education: Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative field.
Experience: 3-5 years minimum in applied science, ML engineering, or data science roles with a focus on evaluation, NLP, or generative AI; 7+ years of experience preferred.
Required Technical Skills
Employment type: Full-time | Seniority: Mid | Date: 4/24/2026