DE Jobs

Job Information

Meta AI/HPC Systems Production Engineer in London, United Kingdom

Summary:

Meta's AI training and inference infrastructure is growing exponentially to support an ever-increasing number of AI use cases. We need to build and evolve the network infrastructure that connects myriad training accelerators, such as GPUs. In addition, we need to ensure that the network runs smoothly and meets the stringent performance, availability and reliability requirements of RDMA workloads, which expect a lossless fabric interconnect. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric and host networking, communication libraries, and scheduling infrastructure.

Required Skills:

AI/HPC Systems Production Engineer Responsibilities:

  1. Responsible for the overall reliability of the communication system, including monitoring, troubleshooting and proactive identification of production issues.

  2. Develop, extend and maintain CI/CD and testing pipelines for host components of the training stack infrastructure, e.g. collective communication libraries (NCCL, RCCL) and RDMA host stack dependencies.

  3. Serve as an active member of a multidisciplinary team developing solutions for large-scale training systems. Work with performance engineers to ensure safe and robust rollout of new features.

Minimum Qualifications:

  1. BS/MS/PhD in a relevant field (EE, CS), with 4+ years of work experience.

  2. Python, C/C++ coding skills

  3. Knowledge of Linux and foundational networking principles

Preferred Qualifications:

  1. Experience with current AI training workload packaging, CI/CD and distribution processes, and containerization principles.

  2. Understanding of RDMA network stack principles and pain points on InfiniBand and RoCE networks. Experience developing systems and applications that use RDMA technologies. Experience using communication libraries such as MPI and the NVIDIA Collective Communication Library (NCCL).

  3. Experience with GPU accelerator development frameworks, such as CUDA or OpenCL

  4. Experience in developing and troubleshooting system level software

Industry: Internet
