HPC with reduced resources

Lichtenberg HPC available with reduced resources

2024/02/08

After a short circuit in the server room's power distribution system, which rendered the whole cluster inoperative, the Lichtenberg cluster is now active at about 35% capacity.

+++Update 2024-04-2

Through temporary workarounds, the available capacity has been increased to about 62%.

+++Update 2024-03-13

By reactivating more compute nodes, the available capacity has been increased to about 35%.

+++Update 2024-03-08

The cluster is now partly available with about 20% of its nominal compute capacity.

User data & deletion

As there have been questions about your data on the cluster's file systems and the deletion policy: we cannot yet ascertain whether any user data is affected. However, the deletion policy on /work/scratch will be suspended before or during power-up, so that data you could not access during the outage is not deleted.

+++Update 2024-03-07

The gaseous fire extinguisher (GFE) has been refilled and is operational, the power rails have been cleared for use after thorough checks, and we were able to bring up the storage system of the Lichtenberg cluster without data loss.

Using our login nodes, you can now download data from the cluster.

As to the partial restart of the compute nodes, we will give further updates in due course.

+++Update 2024-02-23

Given the current state of the repair and replacement procedures, and provided there are no further findings, we estimate 2024-03-08 as the earliest possible date for cluster availability. Please note that this is a best-case estimate.

+++Update 2024-02-14 13:42

Urgent requests for quotes have been sent out for refilling the GFE and for checking the power distribution infrastructure; both will take some time.

Currently, no reliable estimate is possible as to when the HPC will be restored to normal operation.

Details

The short circuit occurred in a wall box of the non-UPS-backed power distribution, which was destroyed instantly.

The systems primarily affected were thus all those not behind an uninterruptible power supply (UPS), such as the HPC compute nodes.

All racks and computers on UPS-backed power (such as HPC storage and Housing) were not directly affected and initially continued to run.

However, the smoke from the short circuit's flashover triggered the gaseous fire extinguisher (GFE), which duly flooded the server room with nitrogen (N₂).

This discharge inherently causes very high acoustic pressure and air vibrations, which are known to wreak havoc on legacy magnetic hard disks. That is why some systems in Housing suffered damage nonetheless, even though they were not hit by the power problem itself.

Next Steps – Housing

Since operating this server room without a working GFE is considered too risky, we had to switch off all systems (HPC cluster as well as Housing).

It is currently unclear when this GFE will be checked and refilled – and all general operation of computer systems in L5|08 has to cease until then.

Next Steps – HPC

Both the power conductor rail and the distribution installation (400 volts, up to 2000 amperes) have to be inspected for possible damage and checked before reactivation.

It is currently unclear when this (and all necessary repairs, if any) will be finished.

The outage of the HPC cluster will thus take some more time.

Project and user applications

Due to the unavailability of the cluster, applications cannot currently be processed. Processing will resume once the infrastructure is available again.