Systems Engineer (HPC)

  • Full-Time
  • Hybrid
  • London
  • Posted on April 29, 2024

Hedge Fund #002

We are looking for a Systems Engineer to join our Aligned Infrastructure team. The team is made up of multidisciplinary individuals with unrestricted access across a large environment. We believe that one cannot build a truly great service without the ability to make changes across the stack. We take great care to focus on solving real business problems, reducing operational overhead and working together as a team.

This team is responsible for the following areas, covering both engineering and operations:

1. Data modelling, database tuning & query optimization

2. HPC job scheduling

3. Workflow management and batch processing

4. Container orchestration

5. Service discovery

6. POSIX and object storage systems

On-premises:

  • Bare metal compute (Linux)
  • System tuning
  • Configuration management and drift management
  • Performance tuning
  • Network configuration management
  • Compute, storage, network system purchases / evaluations

Cloud:

  • Environment provisioning and management

Qualifications/Skills Required:

We are looking for individuals with experience in two or more of the following areas:

HPC job scheduling

  • Experience in environments at scale (e.g. billions of jobs per week/month)
  • Understanding of cost metrics, preemption, job types, queuing, scheduling and optimizations
  • Experience with products like HTCondor, Slurm, Spectrum LSF, Nomad, AWS Batch

Container Orchestration (Kubernetes)

  • Experience with: PSPs, Helm, admission/mutation controllers, PVs/PVCs, kube-router, BGP – generally, a demonstrated ability to dig deep into the k8s projects to solve hard problems
  • Experience with Docker & registries (e.g. Harbor, Artifactory, GCP container registry, AWS container registry)
  • Mature approach to dealing with operational complexities and gaps of the Kubernetes platform

Storage Systems

  • Experience deploying and managing petabyte scale systems supporting varied workloads
  • Mature approach to assessing price/performance, tiering and backup requirements
  • Experience with products like GPFS, NetApp, Pure, Lightbits, Ceph, GCP PDs or other NVMe-specific products
  • Familiarity with NVMe over Fabrics, POSIX, object storage and various modes of permissioning data

Linux

  • Experience using configuration management systems (e.g. SaltStack, Ansible)
  • Understanding of Linux kernel components (e.g. VFS, scheduler, memory management, networking)
  • Solid troubleshooting experience using gdb, OS & application tracing/profiling mechanisms
  • Experience with some of Docker, LXD/LXC, Kerberos, eBPF and virtualization technologies

Workflow management and batch processing

  • Experience with the challenges of workflow management in heavily multi-tenant environments
  • Mature approach to handling and avoiding task failure and system failure
  • Experience with products like Airflow, NiFi, GNUbatch, GCP Cloud Composer, AWS SageMaker

Software Engineering

  • Proficient in OO development (we use Python), Git and CI/CD concepts
  • Comfortable contributing to a large code-base with varied technologies

In addition to the above, the following qualifications always apply:

  • Ability to review and/or extend open source platforms to satisfy business requirements
  • A passion for technology and automation, a deep sense of curiosity and a willingness to always question
  • A passion for in-depth understanding of technology and for building large-scale systems
  • Excellent verbal and written communication skills

To apply for this job email your details to Graham.Gates@TechExecOnline.com
