Site Reliability Engineer (Applications)

Global Hedge Fund #060

Responsibilities

Evangelize the SRE mindset and implement best practices across the environment
Understand the business, tease out error budgets and find ways to measure and enhance resilience across our application estate
Eliminate the toil that emerges with complex, distributed systems by automating where possible
Work both as an individual contributor and collaboratively to find new ways of improving the reliability, availability, security and performance of the infrastructure
Accelerate the migration strategy to more cloud-native, distributed applications

Requirements

Expert-level scripting / coding skills in one or more languages (Python / Golang etc.)
Expert knowledge of observability systems (Prometheus / ELK / Jaeger / OpenTelemetry / service meshes etc.)
Experience with configuration management and infrastructure-as-code tools (Ansible / Puppet / Kapitan / Terraform)
Experience with distributed data platforms (Kafka / Flink / Airflow)
Comfortable using cloud-native and containerisation technologies (Kubernetes / Docker)
Good Linux systems knowledge (experience with RHEL desirable)
Broad knowledge across network technologies, server virtualisation and storage
Self-starter, able to quickly pick up concepts, implement new ideas and think outside the box
Focused on improving system reliability, availability, security, and performance through testing, automation, and standardisation
Ability to explain the “why” behind best practices in simple terms
Ability to build positive and collaborative relationships with colleagues across teams and geographies

To apply for this job, email your details to Graham.Gates@TechExecOnline.com

Job Location