Location Designation: Hybrid - 3 days per quarter
Role Summary
Serve as the engineering owner for New York Life's enterprise workload automation ecosystem. You'll operate and harden scheduling platforms and calendars, design resilient restart/rerun patterns, and standardize job definitions, logging, and audit evidence across environments. Your work will ensure critical batch chains run predictably, meet SLAs, and support a consistent, automation-first operating model.
What You'll Do:
Run & Harden the Platform
- Operate and maintain scheduling controllers and agents across environments.
- Manage calendars and holiday tables; configure SLA jeopardy thresholds, alerting, and escalation paths.
- Implement platform upgrades, patches, and configuration changes in line with standards and change governance.
Engineer Reliability & Resilience
- Design restart/rerun patterns (checkpointing, idempotent wrappers) and failure-handling flows for critical batches.
- Model dependencies and schedules as code (job-as-code) in version control with CI/CD-based promotion.
- Reduce single points of failure and improve consistency across job chains and environments.
Standardize & Govern
- Define and maintain standard naming conventions, templates, parameters, and calendars across schedulers.
- Engineer common audit-evidence and log schemas to support internal and external reviews.
- Ensure data retention, traceability, and segregation of duties align with policies and regulatory requirements.
Guardrails, Health & Service Readiness
- Implement pre/post checks, synthetic probes, and health validations for batch workflows.
- Define and maintain SLIs/SLOs for batch completion, success rates, and recovery times.
- Build safeguards that detect anomalies and misconfigurations before they impact downstream processes.
Observability & Operational Excellence
- Integrate schedulers with observability tools (logs, metrics, dashboards) to improve visibility.
- Tune job concurrency, execution windows, and resource usage for performance and cost efficiency.
- Reduce noisy alerts and improve the signal-to-noise ratio for incident responders.
Change, Incident & Release Coordination
- Align scheduler changes, maintenance, and releases with APSO/Change Management processes.
- Lead incident triage and resolution for batch failures, including rapid root-cause analysis and safe restarts/reruns.
- Contribute to post-incident reviews and drive remediation actions into platform and pattern improvements.
Partner & Influence Across Teams
- Collaborate with Application Owners/Developers, DBAs/Data teams, SRE/Observability, Security, and Vendors to keep batch chains healthy and compliant.
- Provide guidance on best practices for job design, scheduling windows, dependencies, and error handling.
- Document patterns, playbooks, and standards; mentor peers and junior engineers in workload automation.
What You'll Bring:
- 5-8+ years of experience in enterprise workload automation, SRE, or production operations supporting mission-critical batch processing.
- Hands-on experience with Stonebranch or at least one other major enterprise scheduler (e.g., ESP, Control-M, AutoSys, IBM Workload Scheduler/TWS, Redwood), including:
  - Operating controllers/agents across environments.
  - Managing calendars/holiday tables and SLA jeopardy configurations.
- Strong scripting and automation skills in PowerShell, Bash, or Python, plus familiarity with YAML/JSON and REST APIs.
- Experience with Git-based workflows and CI/CD pipelines for job-as-code and configuration promotion.
- Proven design and implementation of restart/rerun patterns, dependency modeling, and idempotent batch frameworks.
- Experience integrating schedulers with observability platforms (logs/metrics/dashboards) and defining SLIs/SLOs.
- Excellent coordination skills across incident and change processes, with clear, concise communication to technical and non-technical stakeholders.
Nice to Have
- Experience in financial services or other highly regulated industries.
- Background in standardizing practices across multiple schedulers and creating common audit schemas and evidence-capture patterns.
- Relevant certifications such as ITIL, cloud architect/operations, DR/BC (e.g., DRII/BCI), or security (e.g., CISSP).
How Success Will Be Measured
- Reduction in SLA jeopardy and breaches; lower mean time to recover (MTTR) from failed jobs.
- Percentage of batch chains using standardized templates, restart/rerun patterns, and automated pre/post checks.
- Completeness and consistency of logs and evidence for audits and reviews, and the time required to produce them.
- Reduction in manual interventions and alert noise; improved rate of on-time, successful batch completion.
Working Model
Hybrid role based in New York, NY, with periodic on-site participation for key release and batch events. Participation in an on-call rotation for critical batch windows is expected. You'll work within clear governance, established change processes, and close cross-technology collabora