About Boon
Boon is the professional AI platform built specifically for construction. Founded in the San Francisco Bay Area in 2023 by product and engineering leaders from Samsara, Apple, Google, and DoorDash, Boon is backed by leading Silicon Valley venture capitalists.
Our AI agents embed directly into existing workflows, from preconstruction estimating to bid management. They automate the repetitive tasks that drain time and margins while surfacing the insights leaders need to make faster and more confident decisions.
The result is measurable impact. Teams move faster, bids are submitted sooner, win rates increase, and costs are reduced. Boon enables construction companies to build more, generate more revenue, and grow with confidence.
About the Role
We are building the first foundation model for construction drawings — a unified multi-modal vision system that reads, understands, and reasons about architectural, mechanical, electrical, plumbing, and structural plans the way a human estimator does.
As a Computer Vision Applied Research Scientist at Boon, you will own end-to-end experiments on our foundation model, from architecture design through self-supervised pretraining, supervised fine-tuning, and shipping production models into our inference pipeline.
This is a 50/50 research-to-production role. You will propose new architectures, run the experiments that prove or disprove them, and ship the winning models to real customers. You will have autonomy over direction and experimental ideas, staying aligned with the team and the company's research focus. This is not a role for someone who wants to be told what to build.
What Success Looks Like
Within your first 12-18 months, you will:
- Push our production model to ≥95% accuracy across multiple trades and scopes
- Design a genre-defining, novel architecture for construction drawing understanding
- Publish a paper on the work at a top venue (CVPR, ICCV, ECCV, NeurIPS, or ICLR). We are committed to publishing, though we may selectively withhold weights or code
What You Will Do
Research & Architecture
- Design and evaluate novel multi-stage vision architectures for construction drawing understanding — perception, text-object association, and relational reasoning across elements
- Drive architecture decisions: backbones, decoders, fusion strategies, loss functions, training regimes
- Run rigorous experiments with clean baselines, ablations, and held-out evaluation on real construction drawings
- Own supervised training and self-supervised pretraining strategies
- Pursue research directions that compound accuracy across trades and scopes
Production Shipping
- Take models from experimental notebooks to the production inference pipeline
- Work hands-on with PyTorch, YOLO, SAM, DINO, and other modern CV stacks
- Collaborate with ML engineers on deployment, quantization, and serving
- Debug real failures on real customer drawings and close the loop into the next training run
Cross-Functional Work
- Collaborate with the synthetic data, annotation, and infrastructure teams to make sure experiments have the data and compute they need
- Partner with engineering leadership on the accuracy roadmap and strategic direction
- Write clean internal research reports so the broader team can learn from your work
- Present findings, trade-offs, and recommendations to engineering leadership
Data & Evaluation
- Help shape what data we acquire and annotate, based on what the model actually needs
- Define evaluation datasets and metrics that track progress honestly — not Kaggle-style leaderboard chasing
- Identify failure modes on real customer drawings and design experiments that address them
You Are a Great Fit If
- You have 3-7+ years of computer vision research experience, ideally with a track record of published papers, open-source work, or production CV models
- You have deep hands-on experience with multi-modal/vision transformers — segmentation, detection, or joint text+vision tasks
- You have worked with modern vision transformer architectures like SAM, DINO, or similar foundation vision models
- You can move from a research idea to a trained model to a production-shipped system with minimal hand-holding
- You think about experiments rigorously — clean baselines, meaningful ablations, honest evaluation on real data
- You have a point of view on architecture decisions and can defend it with reasoning and experimental evidence
- You thrive on autonomy and set your own direction while staying aligned with team goals
- You communicate clearly in English (written and verbal) and can collaborate during California business hours
Requirements
- 3-7+ years in computer vision research (industry research lab, applied science team, PhD research + industry, or equivalent)
- Strong track record of published CV research OR trained production CV models that shipped at scale
- Hands-on expertise in multi-modal dense prediction (segmentation, detection)