Late in the evening of 25 January, the cooling system for all cold-water-cooled components failed completely.
To avoid overheating (and to reduce the risk of fire), the HPC cluster and all servers in the adjacent “housing” area had to be shut down without prior notice.
All running jobs were set to requeue wherever possible, so that they can restart cleanly once the cluster is available again.