HPC Architect

Hedge Fund #SP09

Business Area: Technology Infrastructure

A highly specialized, fast-paced global team focused on advancing cutting-edge, low-latency solutions and high-performance computing platforms.

This role offers the opportunity to apply your expertise in the design, research, and optimization of high-performance computing systems, while providing in-depth L3 support and producing comprehensive technical documentation. You will collaborate with cross-functional teams (including business stakeholders, application owners, clients, and vendors) to deliver scalable, end-to-end solutions. If you’re passionate about solving complex problems in a dynamic environment and want to push the limits of innovation, this role is for you.

Key Responsibilities:

  • Architect, document, and enhance platform services, with a focus on server, storage, and cloud technologies.
  • Integrate next-generation computing architectures, including GPUs, advanced CPUs, and modern HPC storage systems, to meet performance and scalability requirements.
  • Conduct deep analysis to identify and resolve inefficiencies in compute and storage resource utilization, implementing optimizations at all levels.
  • Provide detailed technical documentation to ensure clarity and reproducibility of solutions.
  • Use quantitative metrics to monitor and optimize HPC system performance, ensuring continuous improvement.
  • Lead the execution of projects, collaborating with internal and external stakeholders to deliver high-impact solutions on time and within scope.
  • Offer L3 technical support, diagnosing and addressing complex performance and availability issues across the platform.
  • Proactively assess and address potential issues before they impact system stability or performance.
  • Develop tailored solutions to meet evolving business and infrastructure requirements.

Required Qualifications:

  • 10+ years of hands-on experience working with Linux-based operating systems (RHEL/Rocky/CentOS/OEL preferred), specializing in system operations, engineering, and performance tuning.
  • Expert knowledge of high-performance computing systems, including memory, CPU, and network optimization in high-bandwidth environments.
  • Proven ability to identify and mitigate performance bottlenecks across various layers of the stack—operating systems, software architecture, HPC storage systems, and networking.
  • In-depth understanding of network protocols (TCP, UDP, RDMA) and advanced techniques for server and network tuning to maximize performance.
  • Expertise in physical server architecture, including an understanding of CPU and chipset architectures (Intel/AMD/ARM) and the ability to select appropriate hardware to optimize system performance.
  • Hands-on experience with HPC Job Schedulers (e.g., Slurm, RunAI, Bright Cluster Manager) and their optimization.
  • Programming proficiency in Python and/or C++, with a strong grasp of software development principles and performance tuning.
  • Exceptional organizational skills, with the ability to manage competing priorities and thrive in a high-pressure, dynamic environment.
  • Strong problem-solving and critical-thinking skills, with the ability to resolve complex, ambiguous issues independently.
  • Excellent communication skills, both written and verbal, with the ability to articulate complex technical concepts clearly to diverse audiences.

Nice to Have:

  • Experience with configuration management tools such as Ansible, Chef, or Terraform to automate infrastructure management.
  • Familiarity with network switch architectures and working knowledge of different switch vendors.
  • Practical experience with KDB (Q) for high-frequency data management.
  • Knowledge of machine learning frameworks such as XGBoost, LightGBM, PyTorch, or TensorFlow, and experience in debugging and enhancing applications built with these tools.
  • Understanding of Kubernetes and integrating HPC workloads into containerized environments.

To apply for this job email your details to Graham.Gates@TechExecOnline.com