
Hedge Fund #SP09
HPC Architect
Business Area: Technology Infrastructure
Highly Specialized, Fast-Paced Global team focused on Advancing Cutting-Edge, Low-Latency Solutions and High-Performance Computing Platforms.
This role offers the opportunity to leverage your expertise in the design, research, and optimization of high-performance computing systems, while engaging in in-depth L3 support and comprehensive technical documentation. You will collaborate with cross-functional teams—including business stakeholders, application owners, clients, and vendors—to deliver scalable, end-to-end solutions. If you’re passionate about solving complex problems in a dynamic environment and want to push the limits of innovation, this role is for you.
Key Responsibilities:
- Architect, Document, and Enhance Platform services, with a focus on Server, Storage, and Cloud Technologies.
- Integrate next-gen computing architectures, including GPUs, advanced CPUs, and modern HPC Storage systems to meet Performance & Scalability requirements.
- Conduct Deep Analysis to identify and resolve inefficiencies in compute and storage resource utilization, implementing optimizations at all levels.
- Provide detailed technical documentation to ensure clarity and reproducibility of solutions.
- Utilize Quantitative Metrics to monitor and optimize HPC System Performance, ensuring continuous improvement.
- Lead the Execution of Projects, Collaborating with both internal and external stakeholders to deliver high-impact solutions on time and within scope.
- Offer L3 technical support, diagnosing and addressing complex performance and availability issues across the platform.
- Proactively assess and address potential issues before they impact system stability or performance.
- Develop Tailored Solutions to meet evolving Business and Infrastructure requirements.
Required Qualifications:
- 10+ years of Hands-on Exp .working with Linux-based operating systems (RHEL/Rocky/CentOS/OEL preferred), specializing in system operations, engineering, and performance tuning.
- Expert knowledge of High-Performance Computing Systems, including memory, CPU, and network optimization in high-bandwidth environments.
- Proven ability to identify and mitigate performance bottlenecks across various layers of the stack—operating systems, software architecture, HPC storage systems, and networking.
- In-depth understanding of Network Protocols (TCP, UDP, RDMA) and Advanced Techniques for server and Network Tuning to maximize performance.
- Expertise in Physical Server Architecture, understanding CPU Chipset Architectures (Intel/AMD/ARM) and selecting appropriate hardware to Optimize System Performance.
- Hands-on experience with HPC Job Schedulers (e.g., Slurm, RunAI, Bright Cluster Manager) and their optimization.
- Programming Proficiency in Python and/or C++, with a Strong grasp of Software Development Principles and Performance Tuning.
- Exceptional organizational skills, with the ability to manage competing priorities and thrive in a high-pressure, dynamic environment.
- Strong Problem-Solving and Critical Thinking skills, with the ability to resolve Complex, Ambiguous issues independently.
- Excellent communication skills, both written and verbal, with the ability to articulate complex technical concepts clearly to diverse audiences.
Nice to Have:
- Experience with Configuration Management Tools such as Ansible, Chef, or Terraform to automate infrastructure management.
- Familiarity with Network Switch Architectures and working knowledge of different switch vendors.
- Practical experience with KDB (Q) for High-Frequency Data management.
- Knowledge of machine learning frameworks such as XGBoost, LightGBM, PyTorch, or TensorFlow, and experience in debugging and enhancing applications built with these tools.
- Understanding of Kubernetes and integrating HPC workloads into containerized environments.