Site Reliability Engineer, Platform Engineering

Asset Management Firm #003

Position: Site Reliability Engineer, Platform Engineering
Group: Platform Engineering
Location: New York, Chicago

Purpose of the role:
The Platform SRE team builds and maintains the core services that our distributed systems are built upon.
While you will have exposure to many services, we are looking for engineers to focus on reliability engineering of
our core messaging platform based on Apache Kafka. Additionally, you will have a hand in the design and
development of several core services offered through a PaaS-like experience – large-scale compute and container
runtimes, observability platforms, caching, data-stores, service discovery, secrets, and an integrated development
and deployment pipeline.

Key responsibilities:
• Serve as a primary point of contact responsible for the overall health, performance, and capacity of our business-facing platforms, e.g. globally distributed Kubernetes as well as common data and streaming platforms.
• Gain deep knowledge of our complex platforms, business applications, and use-cases
• Assist in the rollout and deployment of new platforms or features to support rapid iteration and
continuous improvement
• Develop tools to improve our ability to rapidly deploy and effectively monitor and maintain custom
applications or services in a large-scale Linux environment
• Work closely with development teams to ensure that platforms are designed with “operability” and
“usability” in mind
• Function well in a fast-paced, rapidly changing environment
• Participate in a 24×7 rotation for second-tier escalations.

Qualifications:
• B.S. (M.S. preferred, Ph.D. a plus) in Computer Science, Engineering, Physics, or Mathematics
• Developer background with experience in two or more of C++, Go, Python, or Node.js
• 5+ years in a Linux-based large-scale systems role
• Experience managing container orchestration platforms such as Kubernetes
• Experience building self-service APIs and tuning, sharding, and partitioning systems to auto-manage
platforms at scale
• Knowledge of most of these: data structures, relational and non-relational data-stores, networking, Linux
internals, file systems, distributed systems, and related topics
• Experience in containerizing applications and services a plus
• Experience using AWS or GCP at scale a plus
• Experience with random fault injection (Chaos Engineering) and building self-healing capabilities into
platforms a plus
• Commits to well-known open-source projects a huge plus
• Strong interpersonal communication skills and ability to work well in a diverse, team-focused
environment with other SREs, SWEs, product managers, etc.

To apply for this job email your details to