ML Ops Engineer

Upgraid
Boston, US
On-site

Job Description

Why Upgraid exists

Buildings are the world’s largest asset class, consume ~40% of energy globally (and generate the same share of greenhouse gas emissions), and shape the way we live, work, play, and interact. They are foundational to human societies. There are billions of them, from skyscrapers to data centers, malls, warehouses, and single-family homes.

Hundreds of millions - perhaps billions - of these buildings would benefit from upgrades. These upgrades would reduce energy costs, improve health, and create more attractive spaces for residents, consumers, students, patients, and more. But the way building upgrades are done today is archaic. Physical inspections, owners with no understanding of the systems in their buildings, and expensive manual energy audits of variable quality make the old way of doing things untenable.

Instead, imagine if every building could tell you exactly how it wants to be upgraded — what to fix, what it would save, and how fast it pays back. That’s what we’re building. Our AI model reads the built environment from space, runs advanced energy simulations, and delivers a ready-to-pitch upgrade proposal for every property. It’s how we will accelerate building upgrades globally.

Who we are and where we’re at

Our experienced founding team includes a former McKinsey partner and leader of built environment sustainability, an experienced product leader, and an MIT building scientist. We closed our funding round quickly, and our advisors include founders who have built companies from zero to IPO as well as senior industry leaders. We were recently accepted to Greentown Labs, the world’s leading climate tech incubator. We have paying customers who consider our product a quantum leap in how building upgrades are done. We are going places fast and would like incredibly bright and talented people to join us.

What you’ll help build

● Global Data Plane & Model Registry

○ Design a centralized data lakehouse and API schema that is stable, versioned, and strictly typed.

○ Establish a multi-tenant data architecture with clear governance and isolation.

○ Implement a model registry to manage global model versions, artifacts, and lineage.

● Federated Multi-Tenant Engine

○ Architect a cost-efficient serving layer to support client-specific customization.

○ Implement dynamic serving to hot-swap client-specific adapters (PEFT/LoRA) on top of a global base model at runtime.

○ Ensure strict data isolation and privacy boundaries without maintaining separate infrastructure stacks for every tenant.

● Async Job Orchestration

○ Build the queuing architecture required for portfolio-scale analysis workloads.

○ Design the asynchronous POST /jobs API to manage long-running inference tasks and state management.

○ Implement robust failure handling (retries, dead letter queues) and event-driven notifications (webhooks/email).

Day-to-day (Your First 90 Days)

● Month 1: Foundation & Infrastructure

○ Establish the core cloud environment using Infrastructure as Code (Terraform/Pulumi).

○ Configure VPCs, IAM roles, secrets management, and secure CI/CD pipelines.

○ Set up basic observability (logs, metrics) and deploy a "Hello World" service securely.

● Month 2: The Data & API Backbone

○ Define the initial database schema (PostgreSQL) with strict tenant isolation logic.

○ Build the skeleton of the Async API (FastAPI).

○ Set up the message queue infrastructure (SQS/Kafka) to handle a basic job flow.

● Month 3: The MVP Loop

○ Deploy the global base model to a production inference endpoint.

○ Implement a v1 client feedback loop: ingestion of feedback data → storage → manual trigger of a fine-tuning job.

○ Deliver a working end-to-end flow where a user can submit a job and receive results.

In 6 months, success looks like

● We can onboard a new client without code changes, configure their local adapter, and push them into production in days, not months.

● Models are versioned, reproducible, and observable; you can compare global vs. local performance at a glance and roll back safely.

● Engineering velocity is high because infra is predictable, typed, and automated.

The kind of problems you’ll enjoy

● Multi-tenant global → local model architectures (shared schemas, tenant overrides, RBAC).

● Geospatial pipelines and indexing for at-scale queries.

● Asynchronous job orchestration, fan-out/fan-in, idempotency, and cost-aware scaling.

● Turning fuzzy real-world building data into opinionated, API-ready insights that sellers can act on.

You might have done some of this

● Designed data platforms with Postgres/PostGIS, BigQuery/Snowflake, object storage, and a feature store; comfortable with schema evolution and backfills.

● Ran ML ops in production (model registry, evals, canary deploys, drift detection, feedback loops).

● Used tools like LoRA/Adapters or dynamic model loading.

● Built job queues (Celery, BullMQ), worked with message brokers (Kafka/SQS), and handled state management for long-running processes.

● Built secure APIs (O

Skills & Requirements

Technical Skills

Data platforms, Postgres/PostGIS, BigQuery/Snowflake, Object storage, Feature store, Schema evolution, Backfills, ML ops, Model registry, Evals, Canary deploys, Drift detection, Feedback loops, LoRA/Adapters, Dynamic model loading, Job queues, Celery, BullMQ, Message brokers, Kafka, SQS, State management, AI, Machine learning, Building upgrades, Energy simulations

Employment Type

Full-time

Level

Senior

Posted

4/4/2026
