HPC: new storage system

2022-12-19 8am

For the final migration to a new storage system, all user and job activity must be stopped for about one day.

+++ Update 2022-12-20 5:30pm

The migration has finished successfully, and the new storage system is deployed.

We apologize that the downtime took longer than expected (due to synchronisation and error handling for files with very unusual file names).

+++ Update 2022-12-19 1pm

Due to some hardware quirks detected (and resolved) at the last minute, we expect the downtime to last until Tuesday noon.

-----------------------------

The new storage system consists entirely of enterprise-class SSDs (solid-state disks), with no legacy rotational hard disks. It achieves far better I/O throughput and many more I/O operations per second (IOPS).

It still runs IBM's proven Spectrum Scale (formerly known as GPFS).

Nothing will change with respect to paths or mount points: the new system provides exactly the same file system structure you are used to.

For the past few weeks, we have been preparing the migration by repeatedly synchronising all content from the old to the new system, behind the scenes and during normal operation.

However, for the last and decisive synchronisation to be consistent and comprehensive, all user and job activity needs to cease.

At 8am on 19 December,

we will shut down the whole cluster, including the login nodes and all shared file systems, and run the final synchronisation from the old to the new system. The new system will then supersede the old, and we will reactivate the cluster.

Since most of the large data has already been transferred and only the latest changes remain, we expect this downtime to be no more than a day.

As usual, you do not have to do anything about your jobs: all of them will either finish safely before the downtime, or be scheduled again right after it.