**Job Description**
Join Oracle's Health Data Intelligence (HDI) team as a Software Engineer 3, focused on Site Reliability Engineering for large-scale healthcare analytics platforms. In this role, you will design, build, and operate highly reliable, scalable infrastructure and data pipelines that power mission-critical analytics globally.
You will also contribute to the next evolution of cloud operations by advancing automation, observability, and AI-assisted reliability practices. This includes exploring the use of Generative AI and intelligent automation to improve incident response, system resilience, and operational efficiency.
You will work within a collaborative team to deliver robust solutions that handle massive datasets with precision and performance, while continuously improving system reliability and operational excellence.
**_U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire._**
**Required Skills**
**Infrastructure & Reliability**
+ Experience building and operating high-availability, fault-tolerant systems
+ Strong understanding of distributed systems, performance monitoring, and resiliency patterns
+ Experience with incident response, root-cause analysis, and production troubleshooting
**AI-Native Engineering (NEW)**
+ Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to:
  + Infrastructure lifecycle management
  + Observability and anomaly detection
  + Incident response and remediation automation
+ Ability to design or integrate AI-driven workflows for operational efficiency and reliability
+ Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
**Cloud & Multi-Cloud Ecosystems**
+ Strong experience with **multi-cloud environments** (OCI, AWS/Azure)
+ Deep understanding of cloud infrastructure design, deployment, and resource optimization
+ Experience managing hybrid or cross-cloud architectures
**DevOps/SRE Practices**
+ Advanced competency in CI/CD pipelines (Jenkins, Kubernetes)
+ Infrastructure as Code (Terraform)
+ Observability tools (Prometheus, Grafana)
+ Strong focus on **automation-first operations**
**Data Technologies**
+ Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
+ Experience with ETL frameworks and large-scale data processing
+ Understanding of columnar storage systems
**BI & Reporting**
+ Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)
**Programming & Tools**
+ Strong proficiency in Python, Java, or Go
+ Experience with Docker, Kubernetes, and shell scripting
**Problem-Solving**
+ Strong troubleshooting skills with the ability to perform root-cause analysis
+ Experience resolving complex production issues in distributed systems
**Responsibilities**
Work with the Site Reliability Engineering (SRE) team to take shared ownership of services and platform components. Develop a strong understanding of end-to-end system architecture, dependencies, and production behavior.
+ Design, build, and operate reliable, scalable, and secure infrastructure supporting large-scale analytics workloads
+ Improve system reliability through automation, monitoring, and performance optimization
+ Contribute to the adoption of AI-assisted approaches for operations, including:
  + Enhancing observability and alerting
  + Supporting automated incident detection and remediation
  + Exploring intelligent automation for infrastructure lifecycle management
+ Partner with development teams to enhance service architecture, scalability, and operability
+ Participate in on-call rotations and act as an escalation point for complex production issues
+ Perform root cause analysis and implement long-term fixes to prevent recurrence
+ Apply knowledge of distributed systems to troubleshoot issues and optimize system performance
+ Drive continuous improvement in DevOps/SRE practices, including CI/CD, Infrastructure as Code, and automation at scale
**Develop & Maintain**
+ Implement and optimize infrastructure for the Oracle HDI Analytics Platform
+ Ensure system uptime, reliability, and scalability
**AI-Driven Automation (NEW)**
+ Design and implement GenAI-powered or agent-based solutions for:
  + Observability and anomaly detection
  + Incident triage and remediation
  + Infrastructure provisioning and lifecycle management
+ Build tools and frameworks that enable self-service and autonomous operations
**Data Pipeline Execution**
+ Build and optimize scalable data pipelines using Vertica and ETL frameworks
**Operational Excellence**
+ Apply DevOps/SRE practices to automate deployments and operations
+ Enhance observability using Prometheus/Grafana and AI-driven insights
**Cloud Integration**
+ Support multi-cloud initiatives across OCI, AWS, and Azure
+ Optimize cost, performance, and compliance across e
FULL TIME
senior
4/15/2026