Senior AI Interaction Evaluator (Codex / Claude Code)
Contract | $100–$200/hour | 10–20 hrs/week | Start ASAP (through early May)
Check out this Loom video for more details!
We’re looking for a highly experienced software engineer (Senior+) to help evaluate the quality of interactions with modern coding agents such as OpenAI Codex and Claude Code.
This is not a traditional engineering role.
You won’t be writing production code.
You’ll be evaluating something harder: whether the model thinks like a great engineer.
What This Role Actually Is
You will assess how AI coding agents behave in real-world scenarios — focusing on:
• Whether the response makes sense
• Whether the preamble and reasoning are useful
• Whether the output reflects strong engineering judgment
• Whether the interaction feels right to an experienced developer
This role is about engineering taste — not syntax correctness.
What You’ll Be Doing
• Evaluate AI-generated coding interactions end-to-end
• Judge whether outputs are:
  • Useful
  • Correct (at a high level)
  • Aligned with how a strong engineer would think
• Assess the quality of explanations and reasoning, not just code
• Distinguish between different levels of response quality (e.g. what makes something a 2 vs. a 4)
• Provide clear, opinionated feedback on:
  • What worked
  • What didn’t
  • What felt “off” or misleading
• Help define what great looks like when interacting with tools like Cursor
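As a purely illustrative sketch, the feedback dimensions above could be captured in a small record like the one below. The class name, field names, and 1–5 scale are our own assumptions for illustration, not a format this role actually uses:

```python
from dataclasses import dataclass, field

# Hypothetical record for reviewing one AI coding interaction.
# All names and the 1-5 scale are illustrative assumptions,
# not a rubric supplied by this role.
@dataclass
class InteractionReview:
    interaction_id: str
    score: int  # 1 (poor) to 5 (excellent); a subjective overall judgment
    what_worked: list[str] = field(default_factory=list)
    what_did_not: list[str] = field(default_factory=list)
    felt_off: list[str] = field(default_factory=list)

    def is_valid(self) -> bool:
        # A useful review carries an in-range score and at least one concrete note.
        has_notes = bool(self.what_worked or self.what_did_not or self.felt_off)
        return 1 <= self.score <= 5 and has_notes
```

The point of a structure like this is that a bare number is never enough: the opinionated notes ("what worked", "what felt off") are what make a 2-vs-4 distinction defensible.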
What We Mean by “Taste”
We’re specifically looking for engineers who can answer questions like:
• Does this feel like something a strong engineer would actually say?
• Is this explanation helpful, or just technically correct?
• Is the model guiding the user well, or just dumping output?
• Would this interaction build or erode trust?
You should be comfortable making subjective but rigorous judgments.
Who You Are
• Staff / Principal-level engineer (or equivalent experience)
• Strong background in at least one of:
  • TypeScript / JavaScript
  • Python
• Hands-on experience using:
  • OpenAI Codex
  • Claude Code
  • Cursor
• Deep familiarity with modern AI-assisted dev workflows
• Able to evaluate code without needing to fully execute or deeply review every line
• Comfortable giving direct, opinionated feedback
• A high bar for what “good engineering” looks like
Nice to Have
• Experience with tools like Cursor or similar AI-first IDEs
• Prior exposure to prompt design or evaluation workflows
• Experience mentoring senior engineers or defining engineering standards
Engagement Details
• Rate: $100–$200/hour
• Hours: ~10–20 hours/week
• Duration: Through early May (with possible extension)
• Start: ASAP
• Process:
  • Take-home evaluation exercise
  • One behavioral interview