Cryptocurrency Firm #066
The Department: Platform
Our Platform organization’s purpose is to enable us to scale effectively and to empower our engineering teams to focus on building innovative financial products and experiences for individuals around the world. Platform focuses on building a scalable and secure foundation that enables Engineering to deploy, validate, and operate their services in production, improves service resiliency, increases organizational efficiency by reducing operational toil, and increases system efficiency through architectural evolution.
The Site Reliability Engineering team works directly with our other engineering teams: onboarding them onto our platform systems, reviewing and recommending design and architectural decisions, and guiding them in implementing the tooling that the larger Platform organization provides, so that systems can scale and react to changing conditions through continuous improvement loops.
The Role: Principal Site Reliability Engineer
You will play an integral part in leading our engineering teams towards modern DevOps practices, both by developing and providing modern automation and operational tooling and by working cross-functionally with our engineering teams to influence and shape our development practices and culture.
Responsibilities:
- Provide primary operational support and engineering for various services
- Improve reliability, quality and time-to-market across all services and offerings
- Guide engineering teams onto the various supported services provided by Platform
- Run ongoing performance evaluations and improvements for systems
- Provide architecture recommendations and engagement as part of the SDLC
- Create “Production-ready Scorecards” to evaluate the health of systems pre-launch
- Implement and teach monitoring, alerting, and automated resolution best practices
- Define SLIs and SLOs with Engineering teams
- Educate and guide Engineering teams on reliability and resiliency best practices, such as statelessness, chaos testing, and blue/green deployments
- Design, build, and maintain operational tooling and automation that streamline processes and enhance system reliability
Qualifications:
- 10+ years using monitoring, alerting, and automation tooling to understand and remediate performance and health issues in systems at scale
- Good knowledge of cloud technology providers such as AWS, GCP, or Azure
- Expert in an infrastructure-as-code environment (Terraform), developing automated solutions to support and operational issues
- Experience as a Technical Leader within a team, helping to evaluate and make technical decisions for the team
- Expert in working with containerization and orchestration tools such as Nomad, EKS (k8s), and Docker
- Expert in working with configuration management tools such as Ansible, Chef, or Puppet
- Proficient at writing scripts or CLI tools in high-level languages such as Python or Go to increase developer productivity
- Expert at analyzing system and application performance, identifying bottlenecks, and recommending architectural or systemic improvements
- Experience working with Engineering teams, teaching, training, and mentoring on how to implement best-practice technical solutions