Senior Associate - Workload Automation Engineer

New York Life Insurance Company

New York, US

Hybrid

Job Description

Location Designation: Hybrid - 3 days per quarter

Role Summary

Serve as the engineering owner for New York Life's enterprise workload automation ecosystem. You'll operate and harden scheduling platforms and calendars, design resilient restart/rerun patterns, and standardize job definitions, logging, and audit evidence across environments. Your work will ensure critical batch chains run predictably, meet SLAs, and support a consistent, automation-first operating model.

What You'll Do:

Run & Harden the Platform

Operate and maintain scheduling controllers and agents across environments.
Manage calendars and holiday tables; configure SLA jeopardy thresholds, alerting, and escalation paths.
Implement platform upgrades, patches, and configuration changes in line with standards and change governance.

Engineer Reliability & Resilience

Design restart/rerun patterns (checkpointing, idempotent wrappers) and failure-handling flows for critical batches.
Model dependencies and schedules as code (job-as-code) in version control with CI/CD-based promotion.
Reduce single points of failure and improve consistency across job chains and environments.

Standardize & Govern

Define and maintain standard naming conventions, templates, parameters, and calendars across schedulers.
Engineer common audit-evidence and log schemas to support internal and external reviews.
Ensure data retention, traceability, and segregation of duties align with policies and regulatory requirements.

Guardrails, Health & Service Readiness

Implement pre/post checks, synthetic probes, and health validations for batch workflows.
Define and maintain SLIs/SLOs for batch completion, success rates, and recovery times.
Build safeguards that detect anomalies and misconfigurations before they impact downstream processes.

Observability & Operational Excellence

Integrate schedulers with observability tools (logs, metrics, dashboards) to improve visibility.
Tune job concurrency, execution windows, and resource usage for performance and cost efficiency.
Reduce noisy alerts and improve the signal-to-noise ratio for incident responders.

Change, Incident & Release Coordination

Align scheduler changes, maintenance, and releases with APSO/Change Management processes.
Lead incident triage and resolution for batch failures, including rapid root-cause analysis and safe restarts/reruns.
Contribute to post-incident reviews and drive remediation actions into platform and pattern improvements.

Partner & Influence Across Teams

Collaborate with Application Owners/Developers, DBAs/Data teams, SRE/Observability, Security, and Vendors to keep batch chains healthy and compliant.
Provide guidance on best practices for job design, scheduling windows, dependencies, and error handling.
Document patterns, playbooks, and standards; mentor peers and junior engineers in workload automation.

What You'll Bring:

5-8+ years of experience in enterprise workload automation, SRE, or production operations supporting mission-critical batch processing.
Hands-on experience with Stonebranch or at least one major enterprise scheduler (e.g., ESP, Control-M, AutoSys, IBM Workload Scheduler/TWS, Redwood) including:
Operating controllers/agents across environments.
Managing calendars/holiday tables and SLA jeopardy configurations.
Strong scripting and automation skills in PowerShell, Bash, or Python, plus familiarity with YAML/JSON and REST APIs.
Experience with Git-based workflows and CI/CD pipelines for job-as-code and configuration promotion.
Proven design and implementation of restart/rerun patterns, dependency modeling, and idempotent batch frameworks.
Experience integrating schedulers with observability platforms (logs/metrics/dashboards) and defining SLIs/SLOs.
Excellent coordination skills across incident and change processes, with clear, concise communication to technical and non-technical stakeholders.

Nice to Have

Experience in financial services or other highly regulated industries.
Background standardizing multiple schedulers and creating common audit schemas and evidence-capture patterns.
Relevant certifications such as ITIL, cloud architect/operations, DR/BC (e.g., DRII/BCI), or security (e.g., CISSP).

How Success Will Be Measured

Reduction in SLA jeopardy and breaches; lower mean time to recover (MTTR) from failed jobs.
Percentage of batch chains using standardized templates, restart/rerun patterns, and automated pre/post checks.
Completeness, consistency, and time-to-produce logs and evidence for audits and reviews.
Reduction in manual interventions and alert noise; improved rate of on-time, successful batch completion.

Working Model

Hybrid role based in New York, NY with periodic on-site participation for key release and batch events. Participation in an on-call rotation for critical batch windows is expected. You'll work within clear governance, established change processes, and close cross-technology collaboration.

Skills & Requirements

Technical Skills

Python, PowerShell, Bash, YAML, JSON, REST APIs, Git, Leadership, Communication, Finance

Employment Type

Full-time

Level

Senior

Posted

4/14/2026
