ML Engineer - Automated Evaluation and Adversarial Design

Apple

San Diego, US

On-site

Job Description

About the position

The Productivity and Machine Learning Evaluation team ensures the quality of AI-powered features across a suite of productivity and creative applications - including Creator Studio - used by hundreds of millions of people. This team serves as the primary evaluation function, providing critical quality signals that directly influence model development decisions and product launches. This role focuses on building and scaling automated evaluation systems and designing adversarial and stress-testing methodologies across multiple AI features. The work requires a deep understanding of how AI systems fail and how to measure quality rigorously. This is an opportunity to shape the evaluation infrastructure that determines whether AI features meet the bar for hundreds of millions of users.

DESCRIPTION

Day-to-day work involves designing, building, and maintaining automated evaluation systems that assess AI feature quality at scale. This includes creating adversarial test suites that probe model weaknesses and running stress tests to ensure features perform under demanding conditions. The role requires close collaboration with cross-functional partners to ensure evaluation methods are well-calibrated and integrated into development workflows. Typical deliverables include: evaluation frameworks and rubrics, quality assessment reports, adversarial test case libraries, and recommendations on model readiness.

Responsibilities

Design, build, and maintain automated evaluation systems that assess AI feature quality at scale
Create adversarial test suites that probe model weaknesses
Run stress tests to ensure features perform under demanding conditions
Collaborate closely with cross-functional partners to ensure evaluation methods are well-calibrated and integrated into development workflows
Deliver evaluation frameworks and rubrics, quality assessment reports, adversarial test case libraries, and recommendations on model readiness

Requirements

Bachelor's degree in Computer Science, Machine Learning, Statistics, or a related field
4+ years of experience building or significantly extending ML evaluation systems, including designing evaluation benchmarks or quality assessment frameworks
Experience independently defining evaluation architecture and methodology for AI or ML systems
Experience designing adversarial or red-teaming test methodologies for ML models or AI-powered features
Experience with Python and ML frameworks (PyTorch, TensorFlow, or equivalent) in production or near-production settings
Track record of owning technical direction for evaluation efforts across multiple features or product areas

Nice-to-haves

Experience evaluating user-facing AI features in consumer applications, with an understanding of how technical metrics connect to user-perceived quality
Familiarity with productivity software or creative tools, with the ability to assess output quality from a user workflow perspective
Experience ensuring alignment between automated and human evaluation methods, including inter-annotator agreement analysis and bias detection
Track record of designing evaluation systems that scale across multiple features or product areas without requiring bespoke solutions for each
Experience evaluating different types of AI systems, including API-based and custom-trained models
Demonstrated ability to communicate evaluation findings and readiness assessments to cross-functional partners
Experience leveraging automation to scale evaluation data generation and analysis
Graduate degree in a relevant field

Skills & Requirements

Technical Skills

Python, PyTorch, TensorFlow, ML frameworks, adversarial test methodologies, stress testing, evaluation systems, quality assessment frameworks, collaboration, communication, AI, ML, evaluation infrastructure, productivity and creative applications

Employment Type

Full-time

Level

Mid

Posted

3/26/2026
