HPC Infrastructure Site Reliability Engineer

radiant · Gloucestershire · Salary negotiable · Full-time

Job Description

About Us

We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.

Role Overview

We are looking for a senior Infrastructure Site Reliability Engineer with deep experience operating large-scale distributed systems and recent hands-on expertise in high-performance computing (HPC) and AI infrastructure. This is an operations-first SRE role, working in a 24/7/365 on-call environment, responsible for ensuring reliability, performance, and continuous improvement of mission-critical infrastructure. This role sits within a cross-functional organisation spanning network engineering, infrastructure SRE, Platform SRE, infrastructure tooling engineers (software) and data centre operations.

The ideal candidate has progressed through large-scale, globally distributed or multi-site infrastructure environments and has more recently specialised in GPU-accelerated HPC systems. This role provides exposure to the latest high-density AI compute platforms, including next-generation GPU infrastructure at significant scale. You will bring strong breadth across bare metal, networking, storage, virtualisation, and orchestration, alongside deep HPC experience including NVIDIA GPU ecosystems, RDMA networking (RoCE and InfiniBand), and performance validation and benchmarking. Strong Linux and distributed systems expertise is essential.

Alongside operational ownership, this is a deeply technical Infrastructure SRE role centred on advanced operational troubleshooting and performance evaluation across large-scale HPC systems. You will investigate complex, cross-layer issues spanning GPU compute, networking, storage, and orchestration, building a clear understanding of system behaviour under real production AI and HPC workloads.

A key responsibility is performance evaluation

Posted on 2026/6/15

Company Information

radiant

Tech