Cloud Reliability Engineer

Hedge Fund

New York City
Posted 3 weeks ago

General Information

Hiring Department/Group: Public Cloud

Job Title: Cloud Reliability Engineer

Office Location: New York, NY

Job Function

The successful candidate will be an engineer with development skills and deep technical expertise in public cloud platforms. The role is part of a cloud engineering team that develops frameworks and tooling for automating and managing the deployment of applications in the cloud. The candidate will focus on ensuring the reliability and resiliency of our cloud offerings. This position offers the opportunity to deliver significant value and drive technical innovation.

The right candidate will have deep experience managing and automating the lifecycle and operations of cloud infrastructure on AWS or Google Cloud using native tools, open-source tools, and third-party products.

The candidate will have experience developing production-ready code in one or more languages, one of which must be Python. They should also be familiar with developing unit and functional tests, and have experience with continuous integration as it applies to infrastructure as code.

The candidate should have experience architecting infrastructure to ensure the availability and resiliency of services and data. The candidate must be experienced with managing, persisting, and replicating data in different formats in the cloud, including databases, file systems, block stores, object stores, machine images, and containers. The ideal candidate will have experience dealing with key management and encrypted data across multiple regions and accounts.

The candidate should be comfortable with Linux systems and containers as well as automating configuration management. The candidate should have a full understanding of systems management concepts such as statelessness, immutability, and idempotence.

Experience with log management and monitoring tools is required, as is the ability to aggregate, correlate, and report on both logs and metrics, and to use them for capacity planning, performance tuning, and triggering automated alerts or actions.

The candidate must be able to work closely with application developers and owners to design and automate meaningful tests that validate functionality, performance, availability, and failover capabilities. Additionally, the candidate must be able to perform load testing and capacity planning on applications and cloud infrastructure.

Any experience building platforms that reliably support large-scale data ingest and analytics in a cloud-based environment is a strong plus.

Principal Responsibilities

  • Designing and building resiliency by default into our cloud-based architecture
  • Designing and automating tests that ensure the reliability of cloud-deployed applications
  • Designing and automating deployment mechanisms such as Blue/Green and Canary
  • Automating systems configuration and orchestration using tools such as Chef, Ansible, or Salt
  • Automating creation of machine images and containers
  • Designing CI/CD pipelines to include infrastructure, application, and security testing and gates
  • Implementing availability, security, and performance monitoring and alerting
  • Implementing load testing and capacity planning
  • Automating data resiliency and replication based on policies

Qualifications/Skills Required

  • Significant experience designing and supporting production cloud environments
  • Strong coding skills in one or more languages, including Python
  • Experience developing collaboratively, including infrastructure as code
  • Experience developing automated tests, preferably in Python, to validate application and infrastructure functionality, security, and performance as part of an SDLC process
  • Experience with cloud templating and automation tools for deploying and managing infrastructure
  • Experience building CI/CD pipelines, including the use of cloud-native tools
  • Experience with data management and protection strategies in the cloud
  • Experience with key management as it pertains to data in cloud environments
  • Experience monitoring applications using cloud-native, open-source, and third-party tools
  • Deep knowledge of cloud platform APIs and automation
  • Excellent written and verbal communication skills, with an ability to summarize and translate between business and technical contexts
  • Excellent troubleshooting and analytical skills
  • Self-starter able to execute independently, with light supervision
  • Degree in a STEM or related field preferred

Job Features

Job Category: Full Time
