AI Research Engineer: Vision AI / VLM / Physical AI

Centific
Seattle, Washington, US
Remote

Job Description

AI Research Engineer: Vision AI / VLM / Physical AI

Company:

Centific

Location:

Seattle, WA (or Remote)

Type:

Full-time

Build the Future of Perception & Embodied Intelligence

Are you pushing the frontier of computer vision, multimodal large models, and embodied/physical AI, and do you have the publications to show it? Join us to translate cutting-edge research into production systems that perceive, reason, and act in the real world.

The Mission

We are building state-of-the-art Vision AI across 2D/3D perception, egocentric/360° understanding, and multimodal reasoning. As an AI Research Engineer, you will own high-leverage experiments from paper → prototype → deployable module in our platform.

We are seeking passionate engineers to join our cutting-edge labs. You could be part of:

  • The Computer Vision team, as a Research Engineer diving into the world of 3D reconstruction, scene understanding, and visual AI. You'll explore innovative techniques like those used to transform real-world spaces into immersive 3D models (such as our 3D reconstruction projects) and work with cutting-edge architectures like VGG-T (Visual Geometry Grounded Transformers), known for advancing deep learning in vision tasks. This role is perfect for those excited to develop AI systems that interpret, reconstruct, and interact with the visual world, using state-of-the-art tools and methodologies.
  • The Physical AI Robotics team, where you'll work at the intersection of simulation, robotics, and AI. You'll leverage NVIDIA's Omniverse for advanced 3D simulation and collaboration, Isaac Sim for robotics training and testing, and GR00T for foundation models in robotics. Experience with the Holoscan SDK for real-time medical and industrial robotics pipelines, Newton Physics for dynamic simulation, and NVIDIA's NeRD for neural robot dynamics will be a plus. This role is ideal for those eager to push the boundaries of AI-driven robotics using state-of-the-art tools and frameworks.

What You'll Do

  • Advance Visual Perception: Build and fine-tune models for detection, tracking, segmentation (2D/3D), pose & activity recognition, and scene understanding (incl. 360° and multi-view).
  • Multimodal Reasoning with VLMs: Train and evaluate vision-language models (VLMs) for grounding, dense captioning, temporal QA, and tool use; design retrieval-augmented and agentic loops for perception-action tasks.
  • Physical AI & Embodiment: Prototype perception-in-the-loop policies that close the gap from pixels to actions (simulation + real data). Integrate with planners and task graphs for manipulation, navigation, or safety workflows.
  • Data & Evaluation at Scale: Curate datasets, author high-signal evaluation protocols/KPIs, and run ablations that make irreproducible results impossible.
  • Systems & Deployment: Package research into reliable services on a modern stack (Kubernetes, Docker, Ray, FastAPI), with profiling, telemetry, and CI for reproducible science.
  • Agentic Workflows: Orchestrate multi-agent pipelines (e.g., LangGraph-style graphs) that combine perception, reasoning, simulation, and code generation to self-check and self-correct.

Example Problems You Might Tackle

  • Long-horizon video understanding (events, activities, causality) from egocentric or 360° video.
  • 3D scene grounding: linking language queries to objects, affordances, and trajectories.
  • Fast, privacy-preserving perception for on-device or edge inference.
  • Robust multimodal evaluation: temporal consistency, open-set detection, uncertainty.
  • Vision-conditioned policy evaluation in sim (Isaac/MuJoCo) with sim2real stress tests.

Minimum Qualifications

  • Master's/Ph.D. in CS/EE/Robotics (or a related field), actively publishing in CV/ML/Robotics (e.g., CVPR/ICCV/ECCV, NeurIPS/ICML/ICLR, CoRL/RSS).
  • Strong PyTorch (or JAX) and Python; comfort with CUDA profiling and mixed-precision training.
  • Demonstrated research in computer vision and at least one of: VLMs (e.g., LLaVA-style or video-language models), embodied/physical AI, 3D perception.
  • Proven ability to move from paper → code → ablation → result with rigorous experiment tracking.

Preferred Qualifications

  • Experience with video models (e.g., TimeSformer/MViT/VideoMAE), diffusion or 3D GS/NeRF pipelines, or SLAM/scene reconstruction.
  • Prior work on multimodal grounding (referring expressions, spatial language, affordances) or temporal reasoning.
  • Familiarity with ROS2, DeepStream/TAO, or edge inference optimizations (TensorRT, ONNX).
  • Scalable training: Ray, distributed data loaders, sharded checkpoints.
  • Strong software craft: testing, linting, profiling, containers, and reproducibility.

  • Public code artifacts (GitHub) and first-author publications or strong open-source impact.

Our Stack (you'll touch a subset)

  • Modeling: PyTorch, torchvision/lightning, Hugging Face, OpenMMLab, xFormers
  • Perception: YOLO/Detectron/MMDet, SAM/Mask2Former, CLIP-style backbones, optical flow
  • VLM / LLM: Vision encoders + LLMs, RAG for video, toolformer-/agent loops

Skills & Requirements

Technical Skills

Python, SQL, VBA, MATLAB, Tableau, Qlik Sense, Power BI, PyTorch, Hugging Face, OpenMMLab, xFormers, YOLO, Detectron, MMDet, SAM, Mask2Former, CLIP, Calypso, Numerix

Salary

$62,000+ per year

Employment Type

Full-time

Level

Senior

Posted

5/1/2026

Apply Now

You will be redirected to Centific's application portal.
