Infrastructure Site Reliability Engineering (SRE) Manager

New York City
Posted 2 months ago

Location: Chicago or New York


We are looking for an engineering manager to lead our Infrastructure SRE team, including our advanced monitoring & tooling and IT change management teams. In this role, you will be responsible for leading these teams in running and supporting our infrastructure, improving the operational reliability of our systems, and driving the organization towards our next-generation workflows and tools.

If you aspire to:

  • Manage a team of high-performing engineers who run global infrastructure systems (including servers, networking, circuits, storage, operating systems, datacenters, and associated software tools) and resolve issues as they arise.
  • Constantly evaluate new ways to optimize how systems are supported globally, creating solutions that improve the overall supportability and scalability of infrastructure technology.
  • Identify opportunities to automate repetitive tasks and improve service levels with technology solutions.
  • Lead the strategy and tooling for global monitoring systems, alerting systems, deployment automation, test automation, and change management.
  • Create and provide regular reporting to management and stakeholders on support metrics and SLAs, as well as ad hoc reporting related to incident triage and post-mortems.
  • Continually innovate, increasing our operational reliability while reducing technical debt across the organization.

If your qualifications include:

  • 8+ years of overall Infrastructure Engineering, Operations, and/or SRE experience
  • 4+ years of successful engineering management experience, including the support of business-critical systems
  • Experience building and managing a support model that provides effective application, infrastructure, geographical, time zone, and user coverage.
  • A sense of urgency and personal responsibility for incident management and problem resolution.
  • Ability to establish and maintain strong written and verbal communication with both internal technology teams and outside parties and vendors.
  • Programming knowledge in Python; experience with C++, JavaScript, or Java is also helpful.
  • Experience building and deploying comprehensive IT monitoring systems
  • Passion for metrics and KPIs, leading to data-driven decision-making
  • Working knowledge of IT networks including:
    • Network appliances (e.g., routers, load balancers, domain name servers, firewalls)
    • Network services and architecture (BGP, OSPF, DHCP, DNS, TCP/IP, WAN, VPN, VLAN, VRF, etc.)
  • Working knowledge of Operating Systems, including Linux and Windows
  • Experience with cloud platforms such as AWS, GCP, or Azure.
  • Self-motivated individual with excellent organizational, multi-tasking, prioritizing, and teamwork skills.
  • Engineering, Computer Science, or Mathematics degree.
  • Certifications (e.g. CCNA, CCNP, RHCE, etc.) are a plus.

Summary: The successful candidate will be a mature self-starter who has demonstrated the ability to function independently in a fast-paced, dynamic, and demanding environment. This person needs to be able to resolve conflict quickly while remaining highly collaborative and professional. This person will be intellectually curious, intuitive, and trustworthy, and will hold the highest ethical standards. In addition, they will be effective in addressing a range of internal and external audiences in a professional manner. This person will add value by handling several simultaneous tasks with minimal supervision and exemplary follow-through.

Job Features

Job Category: Full Time
