Network problems affecting cluster operations
Queues have been halted
2025-07-19
+++ Update 2025-07-22
The network/routing problem has been solved and the queues have been reactivated.
+++
Since last weekend, we have been seeing the effects of apparently flaky routing in the TUDa network. We have therefore stopped the job queues to prevent pending jobs from failing due to unreachable external resources (e.g. license servers in the institutes).
Thus, no pending jobs will start for the time being, whereas running jobs should be able to complete successfully (as long as they do not need external resources).
Another symptom is that login attempts to the cluster's login nodes may fail from one PC yet succeed from another.
As soon as the routing problems are rectified, we will reactivate the queues, and logins should then succeed reliably again, too.