Senior Associate - Workload Automation Engineer

New York Life Insurance Company
New York, US
Hybrid

Job Description

Location Designation: Hybrid - 3 days per quarter

Role Summary

Serve as the engineering owner for New York Life's enterprise workload automation ecosystem. You'll operate and harden scheduling platforms and calendars, design resilient restart/rerun patterns, and standardize job definitions, logging, and audit evidence across environments. Your work will ensure critical batch chains run predictably, meet SLAs, and support a consistent, automation-first operating model.

What You'll Do:

Run & Harden the Platform

  • Operate and maintain scheduling controllers and agents across environments.
  • Manage calendars and holiday tables; configure SLA jeopardy thresholds, alerting, and escalation paths.
  • Implement platform upgrades, patches, and configuration changes in line with standards and change governance.

Engineer Reliability & Resilience

  • Design restart/rerun patterns (checkpointing, idempotent wrappers) and failure-handling flows for critical batches.
  • Model dependencies and schedules as code (job-as-code) in version control with CI/CD-based promotion.
  • Reduce single points of failure and improve consistency across job chains and environments.

Standardize & Govern

  • Define and maintain standard naming conventions, templates, parameters, and calendars across schedulers.
  • Engineer common audit-evidence and log schemas to support internal and external reviews.
  • Ensure data retention, traceability, and segregation of duties align with policies and regulatory requirements.

Guardrails, Health & Service Readiness

  • Implement pre/post checks, synthetic probes, and health validations for batch workflows.
  • Define and maintain SLIs/SLOs for batch completion, success rates, and recovery times.
  • Build safeguards that detect anomalies and misconfigurations before they impact downstream processes.

Observability & Operational Excellence

  • Integrate schedulers with observability tools (logs, metrics, dashboards) to improve visibility.
  • Tune job concurrency, execution windows, and resource usage for performance and cost efficiency.
  • Reduce noisy alerts and improve the signal-to-noise ratio for incident responders.

Change, Incident & Release Coordination

  • Align scheduler changes, maintenance, and releases with APSO/Change Management processes.
  • Lead incident triage and resolution for batch failures, including rapid root-cause analysis and safe restarts/reruns.
  • Contribute to post-incident reviews and drive remediation actions into platform and pattern improvements.

Partner & Influence Across Teams

  • Collaborate with Application Owners/Developers, DBAs/Data teams, SRE/Observability, Security, and Vendors to keep batch chains healthy and compliant.
  • Provide guidance on best practices for job design, scheduling windows, dependencies, and error handling.
  • Document patterns, playbooks, and standards; mentor peers and junior engineers in workload automation.

What You'll Bring:

  • 5-8+ years of experience in enterprise workload automation, SRE, or production operations supporting mission-critical batch processing.
  • Hands-on experience with Stonebranch or at least one major enterprise scheduler (e.g., ESP, Control-M, AutoSys, IBM Workload Scheduler/TWS, Redwood), including:
      • Operating controllers/agents across environments.
      • Managing calendars/holiday tables and SLA jeopardy configurations.
  • Strong scripting and automation skills in PowerShell, Bash, or Python, plus familiarity with YAML/JSON and REST APIs.
  • Experience with Git-based workflows and CI/CD pipelines for job-as-code and configuration promotion.
  • Proven design and implementation of restart/rerun patterns, dependency modeling, and idempotent batch frameworks.
  • Experience integrating schedulers with observability platforms (logs/metrics/dashboards) and defining SLIs/SLOs.
  • Excellent coordination skills across incident and change processes, with clear, concise communication to technical and non-technical stakeholders.

Nice to Have

  • Experience in financial services or other highly regulated industries.
  • Background standardizing multiple schedulers and creating common audit schemas and evidence-capture patterns.
  • Relevant certifications such as ITIL, cloud architect/operations, DR/BC (e.g., DRII/BCI), or security (e.g., CISSP).

How Success Will Be Measured

  • Reduction in SLA jeopardy and breaches; lower mean time to recover (MTTR) from failed jobs.
  • Percentage of batch chains using standardized templates, restart/rerun patterns, and automated pre/post checks.
  • Completeness, consistency, and time-to-produce logs and evidence for audits and reviews.
  • Reduction in manual interventions and alert noise; improved rate of on-time, successful batch completion.

Working Model

Hybrid role based in New York, NY with periodic on-site participation for key release and batch events. Participation in an on-call rotation for critical batch windows is expected. You'll work within clear governance, established change processes, and close cross-technology collabora

Skills & Requirements

Technical Skills

Python, PowerShell, Bash, YAML, JSON, REST APIs, Git, Leadership, Communication, Finance

Employment Type

Full-time

Level

Senior

Posted

4/14/2026
