Audit, harden, and optimize our existing AWS infrastructure to ensure high availability, fault tolerance, and security for both training and production workloads.
Design and maintain scalable architectures for serving deep learning models (PyTorch/TensorFlow), optimizing for low latency and high throughput when handling complex infrastructure data.
Build and maintain automated pipelines for model testing, validation, deployment, and rollback.
Architect efficient, scalable compute environments for training complex computer vision and time-series models on large datasets.
Implement comprehensive monitoring for model drift, data quality, and system health, ensuring rapid response to performance degradation.
Requirements:
4-6+ years of experience in MLOps, DevOps, or Data Engineering, with a strong emphasis on machine learning workloads.
A security-first and stability-first mindset: you think about edge cases, failure modes, and system hardening by default.
Strong collaborative instincts to work closely with Data Scientists, ensuring smooth handoffs from experimentation to production.
Clear communication skills to articulate architectural decisions and tradeoffs to the broader technical team.
Deep expertise in AWS (e.g., EC2, S3, EKS, SageMaker, Lambda) and cloud security best practices.
Strong experience with Docker and Kubernetes for packaging and scaling ML applications.
Proficiency with infrastructure-as-code tools such as Terraform or AWS CloudFormation.
Experience building robust automated pipelines using GitHub Actions, GitLab CI, or Jenkins.
Strong Python skills with a focus on writing clean, production-grade, and well-tested code.
Familiarity with model registry and tracking tools (e.g., MLflow, Weights & Biases).
Benefits:
Medical, dental, vision, basic life insurance, 401(k), and more
Unlimited PTO
Tools and resources to support success
Competitive compensation with high-growth potential