Lead Specialty Software Engineer - Observability & Agentic AI Platforms

Wells Fargo
Phoenix, US
On-site | Visa Sponsorship

Job Description

About this role:

Wells Fargo is seeking a Lead AI Ops Engineer to own and advance the Commercial Observability Platform. This role provides technical leadership across agentic AI systems, AI-powered observability, advanced analytics, and enterprise telemetry platforms, enabling proactive monitoring, faster root cause analysis, and improved operational resilience across critical business applications.

This position is intended for a senior, hands-on AI engineer who will serve as a technical role model and bar raiser, setting standards for engineering excellence in AI-driven observability and operations.

In this role, you will:

  • Design, build, and maintain production-grade AI and agentic systems that reason over observability data, including logs, metrics, traces, events, and digital experience signals
  • Develop LLM-powered workflows to support automated incident analysis, intelligent alerting, operational insights, and root cause analysis (RCA) summaries
  • Architect and implement agentic or multi-agent AI workflows that decompose complex operational problems, analyze telemetry across multiple tools, and coordinate actionable recommendations
  • Apply AIOps and machine learning techniques such as anomaly detection, correlation, pattern recognition, forecasting, noise reduction, and predictive insights
  • Write and maintain Python-based AI services, orchestration logic, and data pipelines deployed in production environments
  • Establish best practices for AI system observability, governance, feedback loops, and continuous improvement
  • Lead the design, implementation, and evolution of enterprise observability platforms supporting commercial applications
  • Own and operate observability tools including Splunk Observability, Splunk (logs, metrics, traces), AppDynamics, and Glassbox
  • Define and enforce standards for telemetry collection, including logging, metrics, distributed tracing, and real user monitoring
  • Perform and lead complex root cause analysis by analyzing application code, logs, metrics, traces, infrastructure signals, and user experience data
  • Act as a senior Splunk query developer, designing highly complex SPL queries that function as analytical programs to correlate large volumes of telemetry data
  • Build and optimize advanced Splunk dashboards using multi-stage SPL pipelines, statistical functions, joins, lookups, and enrichments
  • Develop Splunk analytics that power real-time operational insights, advanced alerting, historical analysis, and AI model inputs
  • Design and develop Beacon / Telemetry APIs to collect custom application, platform, and business signals
  • Build and maintain telemetry ingestion services that normalize, store, and enrich data for analytics and AI/ML solutions
  • Partner closely with application engineering, SRE, and platform teams to improve reliability, performance, and operational maturity
  • Provide technical leadership and mentoring, serving as a role model for strong AI, analytics, and observability engineering practices
  • Influence engineering standards and contribute to long-term observability and AI platform strategy

Required Qualifications:

  • 5+ years of Specialty Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 3+ years of hands-on experience in platform engineering, SRE, or observability
  • 3+ years of hands-on software engineering experience building production services, APIs, data pipelines, or AI systems
  • 2+ years of experience designing or implementing AI, AI Ops, or ML-driven systems in production environments (LLMs, generative AI, or agentic AI systems)
  • 2+ years of experience with Splunk SPL, including writing advanced, multi-stage queries equivalent to programmatic logic and building complex Splunk dashboards and analytics used for operational decision-making
  • 5+ years of experience in distributed systems, microservices, and cloud-native architectures
  • 2+ years of hands-on experience with enterprise observability platforms such as Splunk, Splunk Observability, AppDynamics, Grafana, Prometheus, or equivalent tools

Desired Qualifications:

  • Proven ability to perform deep root cause analysis using application code and telemetry data
  • Experience designing or implementing multi-agent or autonomous AI workflows
  • Familiarity with AI frameworks and tooling (for example: LangChain, LangGraph, AutoGen, CrewAI, or equivalent concepts)
  • Experience designing and building custom telemetry ingestion pipelines or Beacon APIs
  • Familiarity with OpenTelemetry and modern instrumentation standards
  • Experience building internal observability, analytics, or AI platforms used by multiple engineering teams
  • Ability to act as a technical bar raiser, influencing engineering standards across AI, analytics, and observability domains

Job Expectations:

  • This position is not eligible for Visa sponsorship or transfer of visa
  • Ability to work on-site at approved location
  • Relocation ass

Skills & Requirements

Technical Skills

Python, LLM, Splunk, OpenTelemetry, Beacon, Telemetry APIs, Distributed Tracing, Real User Monitoring, Technical Leadership, Mentoring, Collaboration, Problem Solving, AI, Observability, Agentic AI, AIOps, Machine Learning, Advanced Analytics, Enterprise Telemetry

Employment Type

FULL TIME

Level

mid

Posted

4/11/2026
