Failure of Cold-Water Cooling – Emergency Shutdown

Update: sensors replaced, Lichtenberg HPC up & running

2023/06/16

Friday (2023-06-16) and Sunday (2023-06-18): failure of the cold-water cooling / air conditioning system – the Lichtenberg HPC had to be switched off.

Update 2023-06-20

Two sensors have been replaced successfully, full cooling capacity has been restored, and the cluster has been brought up again.

Normal operations have resumed.

Update 2023-06-19

The faulty sensor will be replaced tomorrow morning (Tuesday). Until its function test has succeeded, only the storage and the login nodes will be available. No compute nodes will be available, and thus no jobs will run.

2023-06-16+18

Due to sensor misreadings or faults, the cold-water cooling system (air conditioning) switched itself off, causing the server room's air temperature to rise to as much as 50°C.

Thus, the HPC cluster and storage had to be shut down.

Since this has now happened twice in rapid succession, the HPC, including the login nodes, will remain offline until the problem has definitely been fixed.

Thus, login is currently not possible.

We will inform you as soon as we have new information about the problem.