Senior AI Infrastructure Reliability Engineer - Full-time

Oracle

Boston, US

On-site

Job Description

Join Oracle's Health Data Intelligence (HDI) team as a Software Engineer 3, focused on Site Reliability Engineering for large-scale healthcare analytics platforms. In this role, you will design, build, and operate highly reliable, scalable infrastructure and data pipelines that power mission-critical analytics globally.

You will also contribute to the next evolution of cloud operations by advancing automation, observability, and AI-assisted reliability practices. This includes exploring the use of Generative AI and intelligent automation to improve incident response, system resilience, and operational efficiency.

You will work within a collaborative team to deliver robust solutions that handle massive datasets with precision and performance, while continuously improving system reliability and operational excellence.

**_U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire._**

**Required Skills**

**Infrastructure & Reliability**

+ Experience building and operating high-availability, fault-tolerant systems

+ Strong understanding of distributed systems, performance monitoring, and resiliency patterns

+ Experience with incident response, root-cause analysis, and production troubleshooting

**AI-Native Engineering (NEW)**

+ Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to:

  + Infrastructure lifecycle management

  + Observability and anomaly detection

  + Incident response and remediation automation

+ Ability to design or integrate AI-driven workflows for operational efficiency and reliability

+ Familiarity with building or integrating autonomous agents for DevOps/SRE use cases

**Cloud & Multi-Cloud Ecosystems**

+ Strong experience with **multi-cloud environments** (OCI, AWS/Azure)

+ Deep understanding of cloud infrastructure design, deployment, and resource optimization

+ Experience managing hybrid or cross-cloud architectures

**DevOps/SRE Practices**

+ Advanced competency in CI/CD pipelines (Jenkins, Kubernetes)

+ Infrastructure as Code (Terraform)

+ Observability tools (Prometheus, Grafana)

+ Strong focus on **automation-first operations**

**Data Technologies**

+ Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)

+ Experience with ETL frameworks and large-scale data processing

+ Understanding of columnar storage systems

**BI & Reporting**

+ Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)

**Programming & Tools**

+ Strong proficiency in Python, Java, or Go

+ Experience with Docker, Kubernetes, and shell scripting

**Problem-Solving**

+ Strong troubleshooting skills, with the ability to perform root-cause analysis

+ Experience resolving complex production issues in distributed systems

**Responsibilities**

Work with the Site Reliability Engineering (SRE) team to take shared ownership of services and platform components. Develop a strong understanding of end-to-end system architecture, dependencies, and production behavior.

+ Design, build, and operate reliable, scalable, and secure infrastructure supporting large-scale analytics workloads

+ Improve system reliability through automation, monitoring, and performance optimization

+ Contribute to the adoption of AI-assisted approaches for operations, including:

  + Enhancing observability and alerting

  + Supporting automated incident detection and remediation

  + Exploring intelligent automation for infrastructure lifecycle management

+ Partner with development teams to enhance service architecture, scalability, and operability

+ Participate in on-call rotations and act as an escalation point for complex production issues

+ Perform root-cause analysis and implement long-term fixes to prevent recurrence

+ Apply knowledge of distributed systems to troubleshoot issues and optimize system performance

+ Drive continuous improvement in DevOps/SRE practices, including CI/CD, Infrastructure as Code, and automation at scale

**Develop & Maintain**

+ Implement and optimize infrastructure for the Oracle HDI Analytics Platform

+ Ensure system uptime, reliability, and scalability

**AI-Driven Automation (NEW)**

+ Design and implement GenAI-powered or agent-based solutions for:

  + Observability and anomaly detection

  + Incident triage and remediation

  + Infrastructure provisioning and lifecycle management

+ Build tools and frameworks that enable self-service and autonomous operations

**Data Pipeline Execution**

+ Build and optimize scalable data pipelines using Vertica and ETL frameworks

**Operational Excellence**

+ Apply DevOps/SRE practices to automate deployments and operations

+ Enhance observability using Prometheus/Grafana and AI-driven insights

**Cloud Integration**

+ Support multi-cloud initiatives across OCI, AWS, and Azure

+ Optimize cost, performance, and compliance across e

Skills & Requirements

Technical Skills

Python, Java, Go, Docker, Kubernetes, Terraform, Prometheus, Grafana, Vertica, Snowflake, Tableau, Power BI, Oracle Analytics, AI, Cloud, DevOps, SRE

Employment Type

Full-time

Level

Senior

Posted

4/15/2026
