Hardware

Overview

The cluster consists of several sections:

MPI section for MPI-intensive applications
MEM section for applications requiring large amounts of memory (within a single node)
ACC section for applications that use accelerators
TEST section for evaluating new hardware

The whole system is located in the HPC building (L5|08) on the Lichtwiese campus and consists of several expansion stages (“phases”) running concurrently.

Phase II of Lichtenberg II is currently in testing.
Phase I of Lichtenberg II became operational in December 2020 (in testing since September 2020).
Phase II of Lichtenberg I became operational in February 2015 (and was expanded at the end of 2015); it was decommissioned in May 2021.
Phase I of Lichtenberg I was in operation from fall 2013 until its decommissioning in April 2020.

  • Each node can be used on its own (“single node”), running either one large or several smaller jobs/programs
  • Several nodes can be used concurrently, communicating via MPI over InfiniBand (see the sketch below)
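To illustrate the multi-node usage model, here is a minimal MPI sketch in C: each process reports its rank and the node it runs on. The compiler wrapper (mpicc) and launchers named in the comment are common defaults and may differ from the modules and batch system actually installed on the cluster.

```c
/* Minimal multi-node sketch: an MPI "hello" in which every process
 * reports its rank and host name. Assumes an MPI library is available;
 * build and launch commands are typical examples, not cluster-specific:
 *
 *   mpicc -o hello_mpi hello_mpi.c
 *   srun ./hello_mpi        (or: mpirun -np <N> ./hello_mpi)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(host, &len);     /* node the process runs on */

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```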

With respect to their interconnect, the distinct phases (expansion stages) of Lichtenberg II are large islands: only compute nodes within the same phase (expansion stage) can reach each other at nearly uniform speed and latency, since the InfiniBand fabric inside one island/phase is “non-blocking”.

In contrast, the bandwidth between distinct phases (islands) is limited.

Lichtenberg II, Phase II: 586 compute nodes and 8 login nodes

  • Processors: ~4.0 PFlop/s in total (Double Precision/FP64, theoretical peak)
    • ~3.1 PFlop/s actually achieved in the Linpack benchmark
  • Accelerators: ~1312 TFlop/s in total (DP/FP64, theoretical peak) and 32.78 Tensor PFlop/s (Half Precision/FP16)
  • Main memory: 312 TByte in total
  • Sections in this Phase II:
    • MPI section: 576 nodes (104 CPU cores and 512 GByte main memory each)
    • ACC section (in preparation): 7 nodes (104 or 128 CPU cores and 1024 or 1536 GByte main memory each)
      • 5 nodes with 4x Intel Ponte Vecchio 100 GPUs each
      • 2 nodes with 4x Nvidia Hopper H100 GPUs each
    • MEM section (in preparation): 3 nodes
      • 2 nodes (104 CPU cores and 2048 GByte Main memory each)
      • 1 node (104 CPU cores and 6144 GByte Main memory)

Processor and accelerator details can be found under “Operations”/“Hardware”.

Lichtenberg II, Phase I: 643 compute nodes and 8 login nodes

  • Processors: ~4.5 PFlop/s in total (Double Precision, theoretical peak)
    • Approx. 3.03 PFlop/s actually achieved in the Linpack benchmark
  • Accelerators: 424 TFlop/s in total (Double Precision/FP64, theoretical peak)
    and ~6.8 Tensor PFlop/s (Half Precision/FP16)
  • Main memory: ~250 TByte in total
  • All compute and accelerator nodes in one large island:
    • MPI section: 630 nodes (each with 96 CPU cores and 384 GByte main memory)
    • ACC section: 8 nodes (each with 96 CPU cores and 384 GByte main memory)
      • 4 nodes with 4x Nvidia V100 GPUs each (total: 16)
      • 4 nodes with 4x Nvidia A100 GPUs each (total: 16)
    • MEM section: 2 nodes (each with 96 CPU cores and 1536 GByte main memory)
  • NVIDIA DGX A100
    • 3 nodes (each with 128 CPU cores, 1024 GByte main memory)
      • 8x NVIDIA A100 Tensor Core GPUs (320 GByte GPU memory in total)
      • Local storage: approx. 19 TByte (flash, NVMe)

Processor and accelerator details can be found under “Operations”/“Hardware”.

Lichtenberg I, Phase II: 632 compute nodes and 8 login nodes (decommissioned 2021-05-31)

  • Processors: ~512 TFlop/s in total (Double Precision, theoretical peak)
    • Approx. 460 TFlop/s actually achieved in the Linpack benchmark
  • Accelerators: 11.54 TFlop/s in total (Double Precision, theoretical peak)
  • Main memory: ~44 TByte in total
  • Compute nodes grouped into 18 islands:
    • 1x MPI island with 84 nodes (2016 CPU cores and 5376 GByte main memory in total)
    • 16x MPI islands, each with 32 nodes (768 CPU cores and 2048 GByte main memory per island)
    • 1x ACC island with 32 nodes (ACC-N), 3 of them with GPU accelerators (29 without)

Lichtenberg I, Phase I: 780 compute nodes and 4 login nodes (decommissioned 2020-04-27)

  • Processors: ~261 TFlop/s in total (Double Precision, theoretical peak)
    • Approx. 216 TFlop/s actually achieved in the Linpack benchmark
  • Accelerators: ~168 TFlop/s in total (Double Precision, theoretical peak)
    • Approx. 119 TFlop/s actually achieved in the Linpack benchmark
  • Main memory: ~32 TByte in total
  • The compute nodes are subdivided into 19 islands:
    • 1x MPI island with 162 nodes (2592 cores, 5184 GByte main memory in total)
    • 2x MPI islands, each with 32 nodes (512 cores and 2048 GByte main memory per island)
    • 15x MPI islands, each with 32 nodes (512 cores and 1024 GByte main memory per island)
    • 1x ACC island with 44 nodes (ACC-G), 26 nodes (ACC-M), and 4 nodes (MEM)

The current storage system is an IBM/Lenovo “Elastic Storage System” (ESS) and was put into operation on 2022-12-20. The ESS consists entirely of NVMe flash drives (576 in total) instead of legacy (magnetic) hard disks. NVMe drives are solid-state disks connected directly via PCI Express to the storage servers' CPUs, rather than through intermediary SAS or SATA controllers with their added latency.

The ESS thus provides substantially more storage bandwidth and throughput, as well as more I/O operations per second, than the former system.

In total, 2.1 PByte of storage capacity is available.

The high-speed parallel file system is “IBM Storage Scale” (formerly known as the General Parallel File System, GPFS), well known for its parallel performance and flexibility.

The stored data is delivered to all cluster nodes via the fast interconnect, allowing all nodes concurrent read and write access.
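As an illustration of this concurrent access, the following sketch uses MPI-IO (part of the MPI standard) to let every rank write its own non-overlapping block of one shared file. The file path is a placeholder only and has to be adapted to the actual directory layout; error handling is omitted for brevity.

```c
/* Sketch: all MPI ranks write concurrently into one shared file on the
 * parallel file system, each into its own non-overlapping block.
 * The path below is a placeholder, not a real cluster path.
 */
#include <mpi.h>

#define BLOCK 1024  /* bytes written per rank */

int main(int argc, char **argv)
{
    int rank;
    char buf[BLOCK];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Fill the buffer with a rank-specific byte pattern. */
    for (int i = 0; i < BLOCK; i++)
        buf[i] = (char)(rank & 0xff);

    /* All ranks open the same file on the parallel file system. */
    MPI_File_open(MPI_COMM_WORLD, "/path/to/scratch/demo.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: each rank writes at its own offset, concurrently. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```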

One of the most notable features of the storage system is that all files and directories are consistently distributed over all available disks and SSDs/NVMe drives. Unlike before, there is almost no performance difference between, e.g., /work/scratch and /home. In addition, any expansion in storage capacity also yields a substantial gain in storage performance.

The former storage system will serve as a second tier in a so-called Information Lifecycle Management solution.

Only the most recent and the most frequently read/written files will remain on the primary NVMe drives, and a policy (driven by available space and access times) controls the movement of less “hot” data to the second, slower tier. “Colder” data in less demand will thus migrate to the legacy magnetic hard disks, freeing capacity and performance on the fastest tier for the jobs' actual I/O.

This is entirely transparent to users and jobs and takes place behind the scenes. A file internally migrated to tier 2 does not need to be treated or handled differently from any other file: from the perspective of users and jobs, it remains available and accessible without any change, as if there were no tiers at all.
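As a small, purely illustrative sketch of this transparency, the following C snippet accesses a file with ordinary POSIX calls, exactly as a job would regardless of which tier the file currently resides on. The file name is a hypothetical placeholder; the printed access time is merely one of the criteria the migration policy is described above as using.

```c
/* Sketch only: a migrated (tier-2) file is opened and read exactly like
 * any other file; no tier-aware code is needed. The file name below is a
 * hypothetical placeholder.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
    const char *path = "results.dat";   /* placeholder file name */
    char buf[4096];
    struct stat st;

    /* Ordinary POSIX open/read, identical for tier-1 and tier-2 files. */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    ssize_t n = read(fd, buf, sizeof buf);
    printf("read %zd bytes from %s\n", n, path);
    close(fd);

    /* Last access time: one of the criteria the ILM policy evaluates. */
    if (stat(path, &st) == 0)
        printf("last accessed: %s", ctime(&st.st_atime));
    return 0;
}
```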