(Staff) Site Reliability Engineer, Kafka
Group: Platform Engineering
Location: New York, San Francisco, Chicago
The Platform Engineering team is one of the most mission-critical engineering teams in the firm and is charged with driving technology innovation across the organization. We design many of the computational and data platforms in use across the firm and tackle its toughest scalability problems. In this role, you’ll be at the center of the team that empowers our business to examine the world and its markets in a way made possible only by your work.
Purpose of the role:
The Platform SRE team builds and maintains the core services distributed systems are built upon. While you will have exposure to many services, we are looking for engineers to focus on reliability engineering of our core messaging platform based on Apache Kafka. Additionally, you will have a hand in the design and development of several core services offered through a PaaS-like experience – large-scale compute and container runtimes, observability platforms, caching, data-stores, service discovery, secrets, and an integrated development and deployment pipeline.
Site Reliability Engineers (SREs) fill the mission-critical role of ensuring that our complex, large-scale systems are healthy, monitored, automated, and designed to scale. You will use your background in software engineering, combined with experience as an operations generalist, to work closely with our development teams from the early stages of design all the way through identifying and resolving production issues. The ideal candidate is passionate about applying a software engineering approach to the operations problem space, building deep knowledge of our platforms as well as our various use cases. You are both a generalist, capable of picking up and working with multiple disparate systems, and an expert, with the ability to dive deep into specific topics and quickly master them.
Responsibilities:
- Serve as a primary owner responsible for the overall health, performance, and capacity of our business-facing platforms, e.g. our Kafka service
- Gain deep knowledge of our complex platforms, business applications, and use cases
- Assist in the roll-out and deployment of new platforms or features to facilitate our rapid iteration and continuous improvements
- Develop tools to improve our ability to rapidly deploy and effectively monitor and maintain custom applications or services in a large-scale Linux environment
- Work closely with development teams to ensure that platforms are designed with “operability” and “usability” in mind
- Function well in a fast-paced, rapidly changing environment
- Participate in a 24×7 on-call rotation for second-tier escalations
Qualifications:
- B.S. (M.S. preferred, Ph.D. a plus) in Computer Science, Engineering, Physics, or Mathematics
- Developer background with experience in two or more of C++, Java, Python, or Node.js
- 5+ years in a Linux-based large-scale systems role
- Experience with Java/J2EE architectures and JVM tuning and configuration
- Experience building self-service APIs and tuning, sharding, and partitioning systems to auto-manage platforms at scale
- Knowledge of most of the following: data structures, relational and non-relational data stores, networking, Linux internals, file systems, distributed systems, and related topics
- Experience in containerizing applications and services a plus
- Experience using AWS or GCP at scale a plus
- Experience with random fault injection (Chaos Engineering) and building self-healing capabilities into platforms a plus
- Contributions to the Apache Kafka source code a huge plus
- Strong interpersonal communication skills and the ability to work well in a diverse, team-focused environment with other SREs, software engineers, product managers, etc.