Overview
Keysight is at the forefront of technology innovation, delivering breakthroughs and trusted insights in electronic design, simulation, prototyping, test, manufacturing, and optimization. Our ~15,000 employees create world-class solutions in communications, 5G, automotive, energy, quantum, aerospace, defense, and semiconductor markets for customers in over 100 countries.
Our award-winning culture embraces a bold vision of where technology can take us and a passion for tackling challenging problems with industry-first solutions. We believe that when people feel a sense of belonging, they are more creative and innovative, and can thrive at every point in their careers.
Responsibilities
We are expanding our engineering team with a dedicated MLOps Engineer specializing in AWS to support the deployment, scaling, and operationalization of machine learning solutions across our manufacturing and semiconductor analytics platforms. This role will serve as a critical bridge between our Machine Learning Engineers—focused on Generative AI and classical ML—and production environments, ensuring seamless, reliable, and efficient ML workflows.
You will collaborate closely with the Senior Machine Learning Engineer (GenAI Platform) and the Machine Learning Engineer (Classical ML and Predictive Analytics) to automate pipelines, monitor model performance, and manage infrastructure for high-stakes applications like test plan generation, anomaly detection, predictive maintenance, and market intelligence. In our AWS-centric ecosystem, you will leverage best-in-class tools to enable rapid iteration while maintaining compliance, security, and cost efficiency in regulated industrial settings.
This position is perfect for a mid-level professional passionate about DevOps in ML contexts who excels at turning complex models into robust, production-ready systems.
Key Responsibilities
- Design, implement, and maintain end-to-end MLOps pipelines on AWS, including CI/CD automation for model training, validation, deployment, and retraining, using services like SageMaker, CodePipeline, CodeBuild, and Step Functions.
- Support the Generative AI platform by operationalizing AWS Bedrock workflows, including RAG pipelines, vector databases (e.g., via OpenSearch or Pinecone integrations), Lambda functions, and agentic systems—ensuring scalability for large-scale data processing like historical test plans and news article summarization.
- Enable classical ML initiatives by deploying and monitoring models built with XGBoost, Scikit-learn, and NLP architectures (e.g., RNNs/LSTMs) on AWS infrastructure, incorporating drift detection for anomaly tracking in sensor data and competitor pricing monitoring.
- Manage infrastructure as code (IaC) using Terraform or CloudFormation to provision and optimize AWS resources, such as EC2 instances, S3 buckets, EMR for Apache Spark-based processing (supporting our PMA product), and ECS/EKS for containerized deployments.
- Implement comprehensive monitoring, logging, and alerting systems with CloudWatch, X-Ray, and third-party tools (e.g., Prometheus/Grafana integrations) to track model performance, detect anomalies, handle concept drift, and ensure high availability for customer-facing tools like Q&A chatbots and predictive maintenance advisors.
- Collaborate in an Agile environment with ML engineers, data scientists, and SRE teams to conduct A/B testing, version models, automate rollbacks, and optimize costs through auto-scaling and spot instances.
- Enforce security and compliance best practices, including IAM roles, VPC configurations, data encryption, and audit logging, to safeguard sensitive manufacturing data and meet industry standards.
- Troubleshoot production issues, perform root-cause analysis, and drive continuous improvements in ML operations, staying ahead of AWS innovations to enhance platform reliability and efficiency.
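As one concrete illustration of the drift-detection work described above, the Population Stability Index (PSI) is a common way to flag when a model's live inputs have shifted away from its training distribution. The sketch below is a minimal, self-contained Python example; the function name, bin count, and thresholds are illustrative assumptions, not part of Keysight's actual stack.

```python
import math
from typing import Sequence

def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of a feature or model score using PSI.

    A common rule of thumb (hypothetical thresholds): < 0.1 is stable,
    0.1-0.25 suggests moderate drift, > 0.25 significant drift.
    """
    lo = min(min(baseline), min(current))
    hi = max(max(baseline), max(current))
    width = (hi - lo) / bins or 1.0  # guard against zero-width bins

    def proportions(sample: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)  # clamp the max value
            counts[idx] += 1
        n = len(sample)
        # A small floor keeps log() finite when a bin is empty.
        return [max(c / n, 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical distributions yield a PSI near zero; a shifted
# distribution pushes it well past the 0.25 drift threshold.
baseline = [float(i % 100) for i in range(1000)]
shifted = [float(i % 100) + 50.0 for i in range(1000)]
print(round(population_stability_index(baseline, baseline), 6))
print(population_stability_index(baseline, shifted) > 0.25)
```

In production, a check like this would typically run on a schedule (e.g., a Step Functions or Airflow task) against feature snapshots in S3, emitting the PSI as a custom CloudWatch metric so an alarm can trigger retraining or rollback.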
Qualifications
Must-have qualifications
- Bachelor's or Master's degree in Computer Science, Engineering, Information Systems, or a related technical field.
- 3–5 years of experience in MLOps, DevOps, or cloud engineering roles, with a proven track record of deploying and managing ML models in production environments.
- Deep expertise in AWS services for ML and data workflows, including SageMaker (real-time endpoints, inference components, multi-instance/multi-variant deployments), Bedrock (provisioned throughput, cross-Region inference profiles for scaling and resilience), EMR (for Spark-based PMA workloads), Lambda, S3, ECR, and orchestration tools like Step Functions or Airflow.
- Proven experience with Amazon Elastic Container Registry (ECR): building, scanning for vulnerabilities, tagging, versioning, and pushing custom Docker images for inference containers (including Bring-Your-Own-Container patterns for custom ML frameworks, vLLM, or deep learning environments); managing ECR lifecycle policies, replication across regions, and secure ac