Compute jobs and the batch system

Computing in the cluster is organised via the Slurm batch system. Slurm determines when and on which of the compute nodes a job is started. Among other things, the job's resource requirements, its waiting time, and the priority of the assigned project are taken into account.

With the current cluster configuration, you normally do not need to specify a queue or partition name* when submitting new compute jobs; Slurm chooses one automatically, depending on the job's properties (e.g. the run time or special resources like accelerators).

Members of multiple projects, however, should make sure to choose the proper project when submitting new jobs (e.g. with the parameter -A <project name>).
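
For example, choosing the project on the command line could look like this (project name and script name are placeholders):

 sbatch -A <project name> <job script>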

  • sbatch <job script>
    This puts a new job into the queue. Please refer to “sbatch parameters” for further important parameters of commands and job scripts.
    More detailed examples of job scripts are available under “script”.
  • squeue
    This shows an overview of all of your own active and waiting jobs in the queue.
  • csqueue
    This shows a global overview of all active and waiting jobs in the queue.
  • sjobs <Job-ID>
    This is a special TU Darmstadt script for showing detailed information about all your pending and running jobs or the job with the given ID.
  • scancel <Job-ID>
    This deletes a job from the queue or terminates an active job.
    • scancel -u $USER
      Deletes/terminates all your own jobs.
  • csreport
    This is a special TU Darmstadt script for showing the resource usage of the last months, for each of your projects (in comparison to the approved value). This command shows all values in core*hours per month. For the current month, you also see your own share of the usage for each of your projects (important for projects with multiple users).
    • sreport
      This is the standard Slurm command and shows the resource usage separated for each of your projects.
      Attention: without “-t hours”, the values are given in core*minutes.
      • In addition, you can get a report for a specific month or any other time period. For that, you need to give the parameters cluster and Account as well as the start and end dates. The following example shows how to get a report (in core hours) for the month of April 2021:
        sreport cluster Account Start=2021-04-01 End=2021-05-01 -t hours
  • csum
    This is a special TU Darmstadt script for showing the resource usage in total for each of your projects (in comparison to the approved value). This command shows all values in core*hours.

* except course / training users
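
A typical submit-and-monitor sequence could look like this (script name and job ID are placeholders; the “Submitted batch job” line is sbatch's usual response):

 sbatch myjob.sh
 Submitted batch job 1234567
 sjobs 1234567
 scancel 1234567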

We recommend providing all parameters inside the job script (instead of using sbatch command line parameters). This way, even rather old job scripts clearly document the conditions under which the job ran.

Examples of different job scripts (MPI, OpenMP, MPI+OpenMP) can be found below.

Here, only the most important pragmas are given. You can find a complete list of parameters using the command ''man sbatch'' on the login nodes.

-A project_name
With this option, you choose the project the core hours used will be accounted on.
Attention: If you omit this pragma, the core hours used will be accounted on your default project (typically your first or main project), which may or may not be intended!
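
Inside a job script, this takes the usual pragma form (project name is a placeholder):

 #SBATCH -A <project_name>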

-J job_name
This gives the job a more descriptive name.
Referable as %x (in #SBATCH … pragmas) and as $SLURM_JOB_NAME (in the job script's payload).
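
A hypothetical example combining -J with the %x placeholder, so that the output file is named after the job:

 #SBATCH -J <job_name>
 #SBATCH -o /work/scratch/<TU-ID>/%x.out.%j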

--mail-type=BEGIN
Send an email when the job begins.
--mail-type=END
Send an email when the job ends or is terminated.
--mail-type=ALL
Send an email on both events (and in some other special cases).

Please note: if you submit a lot of distinct jobs separately, at least the same number of emails will be generated. In the past, this has caused the mail servers of the TU Darmstadt to be blacklisted as “spamming hosts” by several mail and internet service providers, which then refused to receive any further mail from the TU Darmstadt.
The HRZ mail and groupware team had to spend considerable effort to get these blacklistings reverted.

Avoid this by

  • using job arrays (#SBATCH -a 1-100 for 100 similar jobs; see the sketch after this list)
  • --mail-type=NONE – instead, use “squeue” to see which of your jobs are still running or already finished.
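
A minimal job array sketch (names and resource values are placeholders); by default, Slurm should send notification mails for the array as a whole rather than per task:

#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -a 1-100             # 100 similar tasks in one array job
#SBATCH -n 1                 # 1 process per array task
#SBATCH --mem-per-cpu=1750   # main memory in MByte per core
#SBATCH -t 30                # minutes per array task

# each array task selects its own input file via its index:
./<program> input.$SLURM_ARRAY_TASK_ID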

Please also avoid setting “--mail-user=<yourMailAddress>”, as our system determines your mail address automatically based on your TU-ID, and typos in this parameter cause unnecessary workload.

-o /path/to/outfile_name
This writes the standard output (STDOUT) of the whole job script to the designated file.
-e /path/to/errfile_name
This writes the error channel (STDERR) of the whole job script to the designated file.

For both options, we recommend using the full pathname and/or the “%” filename placeholders (e.g. %x for the job name, %j for the job ID), to avoid overwriting other jobs' files.

-n number of tasks
This determines the number of tasks (separate processes) for this job.
For MPI programs, this corresponds to the total number of necessary compute cores for the MPI job.

Processes can be scheduled to different nodes (for that, your program needs to be capable of using MPI).

Also, MPI programs need to be run under the control of either “mpirun” or “srun” (preferred), i.e.

 srun /path/to/my/MPIprogram …

-c cores_per_task (Default: 1)
This gives the number of cores per task/process. For pure multi-threading/OpenMP jobs, -n should be set to 1 and -c to the number of OpenMP threads.

Threads will never be scheduled onto distinct nodes.

Pure multi-threaded/OpenMP programs do not need to be run under “srun”!

If you always want your OpenMP program to use the allotted number of cores requested with “-c”, use the following construct in your job script (before the program's line):
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

--mem-per-cpu=memory
This defines the maximum required main memory per compute core in MByte. For how to get an idea of this value for your program, see the batch system FAQ, heading “How Do I ‘Size’ My Job?”. If you are uncertain, you can start with a default of 3800 on LB2.

-t run time
This sets the run time limit for the job (“wall clock time”). If a job is not completed within this time, it will be terminated automatically by the batch system. The expected run time can be given in minutes or in the format hh:mm:ss (hours:minutes:seconds).
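
Both accepted forms side by side:

 #SBATCH -t 90          # 90 minutes
 #SBATCH -t 01:30:00    # the same, given as hours:minutes:seconds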

-C feature
Requests nodes for this job that have a certain feature, e.g. AVX512 or larger main memory. Features can be combined with “&”. Possible features are, for example:

  • avx512
  • mem or mem1536g
  • mpi (default)
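
For example, combining two of the features listed above (assuming nodes providing both actually exist in the cluster):

 #SBATCH -C "avx512&mem1536g"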

-d dependency
This determines dependencies between different jobs. For details, please see ''man sbatch''.
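
For instance, the common “afterok” dependency type starts the second job only if the first one completed successfully (the job ID is a placeholder taken from sbatch's response):

 sbatch <first job script>
 Submitted batch job 1234567
 sbatch -d afterok:1234567 <second job script>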

--exclusive
This requests a compute node job-exclusively, meaning that none of your other jobs are allowed on this node.

This might be important if you request fewer cores per node than available (96 on our LB 2 phase I nodes). In this case, Slurm could dispatch other jobs of the same user to the node. While permitted in general, this could adversely affect the runtime behaviour of the first job (possibly distorting timing and performance analyses).

Except for GPU nodes, jobs of other users are not permitted anyhow on nodes already running jobs; our Slurm configuration for the default MPI nodes is per se user-exclusive.

--gres=class:type:# accelerator specification, e.g. GPUs
(if not specified, the defaults are type=any and #=1)

  • --gres=gpu
    requests 1 GPU accelerator card of any type
  • --gres=gpu:v100
    requests 1 NVidia “Volta 100” card
  • --gres=gpu:a100:3
    requests 3 NVidia “Ampere 100” cards

To have your job scripts (and programs) adapt automatically to the number of (requested) GPUs, you can use the variable $SLURM_GPUS_ON_NODE wherever your programs expect the number of GPUs to use, i.e.
myCUDAprogram --num-devices=$SLURM_GPUS_ON_NODE

If you need more than one GPU node for distributed Machine/Deep Learning (e.g. using “horovod”), the job needs to request several GPU nodes explicitly using -N # (with # = 2-8). Consequently, the number of tasks requested with -n # needs to be equal to or higher than the number of nodes.

Since “GRes” are per node, you should not exceed --gres=gpu:4 (except when targeting the DGX with 8 GPUs), even when using several 4-GPU nodes.
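
A minimal sketch of such a multi-node GPU request (values are placeholders, assuming nodes with 4 “Ampere 100” cards each):

 #SBATCH -N 2                 # two GPU nodes
 #SBATCH -n 2                 # at least one task per node
 #SBATCH --gres=gpu:a100:4    # 4 GPUs per node, 8 in total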

MPI-Script

#!/bin/bash

#SBATCH -J <Job_Name>
#SBATCH --mail-type=END
# Please check paths (directories have to exist beforehand):
#SBATCH -e /work/scratch/<TU-ID>/<yourWorkingDirectory>/%x.err.%j
#SBATCH -o /work/scratch/<TU-ID>/<yourWorkingDirectory>/%x.out.%j
#
#SBATCH -n 192               # number of processes (= total cores to use, here: 2 nodes à 96 cores)
#SBATCH --mem-per-cpu=1750   # required main memory in MByte per MPI task/process
#SBATCH -t 01:30:00          # in hours, minutes and seconds, or '#SBATCH -t 10' - just minutes

# -------------------------------
# your job's "payload" in the form of commands to execute, e.g.
module purge
module load gcc openmpi
cd /work/scratch/<TU-ID>/<yourWorkingDirectory>
srun  <MPI program>  <parameters>
EXITCODE=$?

# any cleanup and copy commands:
...
# end this job script with precisely the exit status of your scientific program above:
exit $EXITCODE
Please replace anything within <…> with your own values! This script runs an MPI-capable program, enabling it to run in parallel on several compute nodes.

Multi-Threading/OpenMP-Script

#!/bin/bash

#SBATCH -J <job_name>
#SBATCH --mail-type=END
# Please check paths (directories have to exist beforehand):
#SBATCH -e /work/scratch/<TU-ID>/<project_name>/%x.err.%j
#SBATCH -o /work/scratch/<TU-ID>/<project_name>/%x.out.%j
#
#SBATCH -n 1                  # 1 process only
#SBATCH -c 24                 # number of CPU cores per process
#                               can be referenced as $SLURM_CPUS_PER_TASK in your "payload" down below
#SBATCH --mem-per-cpu=1750    # Main memory in MByte for each cpu core
#SBATCH -t 01:30:00           # in hours, minutes and seconds, or '#SBATCH -t 10' - just minutes

# -------------------------------
# your job's "payload" in the form of commands to execute, e.g.
module purge
module load gcc
cd /work/scratch/<TU-ID>/<project_name>

# whether OMP_NUM_THREADS needs to be set depends on your program
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

</path/to/program>  <parameters>
EXITCODE=$?

#  any cleanup and copy commands:
...
# end this job script with precisely the exit status of your scientific program above:
exit $EXITCODE
Please replace anything within <…> with your own values! This script is for a non-MPI program that is capable of using multi-threading (Open Multi-Processing, OpenMP).

Hybrid: MPI + OpenMP-Script

#!/bin/bash

#SBATCH -J <Job_Name>
#SBATCH --mail-type=END
# Please check paths (directories have to exist beforehand):
#SBATCH -e /work/scratch/<TU-ID>/<project_name>/%x.err.%j
#SBATCH -o /work/scratch/<TU-ID>/<project_name>/%x.out.%j
#
#SBATCH -n 4                 # number of processes (here: just 4, but - see next line - each with 96 cores)
#SBATCH -c 96                # number of OpenMP threads or CPU cores per process
#                              can be referenced as $SLURM_CPUS_PER_TASK in your "payload" down below
#SBATCH --mem-per-cpu=1750   # Main memory in MByte for each cpu core
#SBATCH -t 01:30:00          # in hours, minutes and seconds, or '#SBATCH -t 10' - just minutes

# -------------------------------
# your job's "payload" in the form of commands to execute, e.g.
module purge
module load gcc openmpi
cd /work/scratch/<TU-ID>/<yourWorkingDirectory>

# whether OMP_NUM_THREADS needs to be set depends on your program
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun  <MPI program>  <parameters>
EXITCODE=$?

#  any cleanup and copy commands:
...
# end this job script with precisely the exit status of your scientific program above:
exit $EXITCODE
Please replace anything within <…> with your own values! This example runs four MPI processes, each of which then uses 96 CPU cores with multi-threading.

GPU/GRes

#!/bin/bash

#SBATCH -J <Job_Name>
#SBATCH --mail-type=END
# Please check paths (directories have to exist beforehand):
#SBATCH -e /work/scratch/<TU-ID>/<project_name>/%x.err.%j
#SBATCH -o /work/scratch/<TU-ID>/<project_name>/%x.out.%j
#
# CPU specification
#SBATCH -n 1                  # 1 process
#SBATCH -c 24                 # 24 CPU cores per process 
#                               can be referenced as $SLURM_CPUS_PER_TASK in the "payload" part
#SBATCH --mem-per-cpu=1750    # Main memory in MByte for each cpu core
#SBATCH -t 01:30:00           # in hours, minutes and seconds, or '#SBATCH -t 10' - just minutes

# GPU specification
#SBATCH --gres=gpu:v100:2     # 2 GPUs of type NVidia "Volta 100"
#                               can be referenced down below as $SLURM_GPUS_ON_NODE

# -------------------------------
# your job's "payload" in the form of commands to execute, e.g.
module purge
module load gcc cuda
cd /work/scratch/<TU-ID>/<yourWorkingDirectory>

# whether OMP_NUM_THREADS needs to be set depends on your program
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# for checking whether and which GPUs have been allocated
# (output appears in the "#SBATCH -e" file specified above):
nvidia-smi 1>&2

# if your program supports this way of getting told how many GPUs to use:
export CUDA_NUM_DEVICES=$SLURM_GPUS_ON_NODE

./<program>  <parameters>
EXITCODE=$?

#  any cleanup and copy commands:
...
# end this job script with precisely the exit status of your scientific program above:
exit $EXITCODE
Please replace anything within <…> with your own values! If your program has a parameter for how many GPUs to use, you can specify it with the variable $SLURM_GPUS_ON_NODE. Your program will then automatically use exactly the number of GPUs requested above with --gres=.

The request “--gres=Class:Type:Amount” always refers to a single accelerator node, and to GPU cards as a whole. There is no way of requesting separate amounts of GPU cores (e.g. 48 tensor units); you can only ask for between one and four whole GPU cards.

It is also possible to work interactively on compute nodes, though this is not advised for regular work: due to many pending jobs and a mostly fully utilised cluster, resources will not be available immediately.

If interactive work on compute nodes is in fact necessary, it can be requested with the srun command and the --pty /bin/bash option. The three mandatory parameters -t (time), -n (number of tasks) and --mem-per-cpu= (memory per task) need to be supplied to the srun command, too. Optional parameters like features and mail options can also be given on the command line.

Example

srun -t15 -n4 --mem-per-cpu=500 --pty /bin/bash

In general, a direct login to the compute nodes is not possible.

However, while your own job(s) are executing, you are entitled to log in to the executing compute nodes (from a login node, not from the internet).

This allows you to run “top”, “strace” or similar utilities, and in general to observe the behaviour of your job(s) first-hand.

This works only from the login nodes, as the compute nodes are not directly connected to the internet.

To get the (list of) compute node(s) executing your job, run

squeue -t RUNNING

Then hop onto one of them (from the login node only) by executing

ssh <name of compute node as shown in NODELIST>

In case you requested a total of one compute node, you can also achieve the same with

srun --jobid=<yourJobID> --pty bash

without prior extraction of the NODELIST.

In case of multi-node MPI jobs, this command would put you on any one of your compute nodes.

In order to go to the other nodes involved, the “NODELIST” from the squeue output needs to be decomposed into distinct host names. For example, the node list “mpsc0[301,307,412-413]” expands to

mpsc0301
mpsc0307
mpsc0412
mpsc0413

all of which you are entitled to log in to with “ssh mpsc0…” while they are still executing your job.
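
Slurm can also do this expansion for you: the standard command “scontrol show hostnames” prints one host name per line, e.g.

 scontrol show hostnames "mpsc0[301,307,412-413]"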

To have a detailed look at how (and whether) your job is using the assigned resources, you can use commands like “top” or “htop”, or your own favourite Linux tools.

If your job ends while you are still logged into one of its compute nodes, you will be logged out automatically, ending up back on the login node.