Calculation jobs and batch system

Computing in the cluster is organised via the Slurm batch system. Slurm decides when and on which of the compute nodes a job is started. Among other things, the job's resource requirements, its waiting time and the priority of the assigned project are taken into account.

With the current cluster configuration, you normally do not need to specify a queue or partition name* when submitting new compute jobs, as this will be done automatically by Slurm, depending on the job's properties (e.g. the run time or special resources like accelerators).

Members of multiple projects, however, should make sure to choose the proper project when submitting new jobs (e.g. with the parameter -A <project name>).

  • sbatch <job script>
    This puts a new job in the queue. Please refer to sbatch-parameters for further important parameters of commands and job scripts (a few typical invocations are shown after this list).
    More detailed examples of job scripts are available under “script”.
  • squeue
    This shows an overview of all your active and waiting jobs in the job queue.
  • sjobs <Job-ID>
    This is a special TU Darmstadt script for showing detailed information about all your pending and running jobs or the job with the given ID.
  • scancel <Job-ID>
    This deletes a job from the queue or terminates an active job.
    • scancel -u <TU-ID>
      Deletes/terminates all own jobs.
  • csreport
    This is a special TU Darmstadt script for showing the resource usage of the last months for each of your projects (in comparison to the proposed value). This command shows all values in core*hours per month. For the current month, you can also verify your individual user share for each of your projects (important for projects with multiple users).
    • sreport
      This is the standard Slurm command and shows the resource usage separated for each of your projects. Attention: The values are given in core*minutes.
      • In addition, you can get a report for a specific month or any other time period. For that, you need to give the parameters cluster and Account as well as the start and end dates. The following example shows how to get a report (in core minutes) for April 2021:
        sreport cluster Account Start=2021-04-01 End=2021-05-01
  • csum
    This is a special TU Darmstadt script for showing the resource usage in total for each of your projects (in comparison to the approved value). This command shows all values in core*hours.

* except course / training users
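
A few typical invocations of the commands above, assuming a job script called job.sh, the placeholder project 'project12345' from further down, and a made-up job ID:

sbatch -A project12345 job.sh     # submit job.sh, accounting it on project12345
squeue                            # overview of your waiting and running jobs
sjobs 1234567                     # details for job 1234567
scancel 1234567                   # delete/terminate job 1234567
csreport                          # per-project usage of the last months, in core*hours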

We recommend providing all parameters inside the job script (instead of using sbatch command line parameters). You can find examples under the corresponding scripts (MPI, OpenMP, MPI+OpenMP).

Here, only the most important options are given. You can find a complete list of parameters using the command ''man sbatch'' on the login nodes (e.g. lcluster1).

-A project_name
With this option, you choose the project the core hours used will be accounted on. The project name usually consists of the word 'project', followed by a 5-digit number, e.g. 'project12345'.
Attention: If you omit this option, the core hours used will be accounted on your default project (typically your first or main project), which may or may not be what you intend!
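
For example, as a directive inside the job script (using the placeholder project name from above):

#SBATCH -A project12345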

-J job_name
This gives the job a more descriptive name.

--mail-type=BEGIN
Send an email when the job begins.
--mail-type=END
Send an email when the job ends or is terminated.
--mail-type=ALL
Send an email at both events (and in some other special cases).

Please note: if you submit a lot of distinct jobs separately, at least the same number of emails will be generated. In the past, this has caused the mail servers of the TU Darmstadt to be blacklisted as “spamming hosts” by several mail and internet service providers, and the HRZ's mail and groupware team had to spend considerable effort to get this reverted.

Avoid this by

  • using job arrays (#SBATCH -a 1-100 for 100 similar jobs)
  • --mail-type=NONE – instead, use “squeue” to see which of your jobs are still waiting or running (a combined sketch follows this list).
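
A minimal sketch combining both recommendations, assuming 100 similar runs that only differ in their input file (the file naming is purely hypothetical; the array index is available as $SLURM_ARRAY_TASK_ID):

#SBATCH -a 1-100              # one array job instead of 100 separate jobs
#SBATCH --mail-type=NONE      # no emails; check progress with squeue instead
./<program>  input_${SLURM_ARRAY_TASK_ID}.dat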

-o /path/to/outfile_name
This writes the standard output STDOUT of the whole job script in the designated file.
-e /path/to/errfile_name
This writes the error channel STDERR of the whole job script in the designated file.

For both options, we recommend using the full pathname to avoid overwriting other jobs' files.
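
The job scripts further down use Slurm's filename placeholders for exactly this purpose: %x expands to the job name and %j to the job ID, so concurrent jobs never write to the same files.

#SBATCH -o /scratch/<TUID>/<yourWorkingDirectory>/%x.out.%j
#SBATCH -e /scratch/<TUID>/<yourWorkingDirectory>/%x.err.%j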

-n number of tasks
This gives the number of tasks (separate processes) for this job. For MPI programs, this corresponds to the total number of necessary compute cores for the job.

Processes can be scheduled to different nodes (for that, your program needs to be capable of using MPI).

-c cores_per_task
This gives the number of cores per task. For pure multi-threading/OpenMP jobs, -n should be set to 1 and -c to the number of OpenMP threads. Default: 1

Threads will never be scheduled onto distinct nodes.
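
For illustration, two minimal resource requests (the numbers are only placeholders):

# pure MPI job:
#SBATCH -n 96                 # 96 processes, possibly spread over several nodes

# pure OpenMP job:
#SBATCH -n 1                  # a single process ...
#SBATCH -c 24                 # ... running 24 threads, all on the same node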

--mem-per-cpu=memory
This defines the maximum required main memory per compute core in MByte. For how to estimate this value for your program, see the batch system FAQ, heading How Do I “Size” My Job?. If you are uncertain, you can start with a default of 3800 on LB2.
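
Since the limit is per core, the total main memory available to a job is roughly -n times -c times --mem-per-cpu. A small illustrative example (the values are placeholders, not recommendations):

#SBATCH -n 96
#SBATCH --mem-per-cpu=3800    # 96 * 3800 MByte ≈ 365 GByte in total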

-t run time
This sets the run time limit for the job (“wall clock time”). If a job is not completed within this time, it will be terminated automatically by the batch system. The expected run time can be given in minutes or in the format hh:mm:ss (hours:minutes:seconds).
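
For example, both of the following request the same 90-minute limit:

#SBATCH -t 90
#SBATCH -t 01:30:00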

-C feature
Requests nodes with a certain feature for this job, e.g. AVX512 or larger main memory. Features can be combined with “&” (see the example after this list). Possible features are for example:

  • avx512
  • mem or mem1536g
  • mpi (default)
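
For example, to combine two of the features listed above on the command line (quote the expression so the shell does not interpret the '&'):

sbatch -C "avx512&mem1536g" <job script>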

--gres=class:type:# accelerator specification, e.g. GPUs
(if not specified, the defaults are: type=any and #=1)

  • --gres=gpu – requests 1 GPU accelerator card of any type
  • --gres=gpu:v100 – requests 1 NVidia “Volta 100” card
  • --gres=gpu:a100:3 – requests 3 NVidia “Ampere 100” cards

If you need more than one GPU node for distributed Machine/Deep Learning (e.g. using “horovod”), the job needs to request several GPU nodes explicitly using -N # (# = 2-8). Consequently, the number of tasks requested with -n # needs to be equal to or higher than the number of nodes.
Since “GRes” are requested per node, you should not exceed --gres=gpu:4, even when using several 4-GPU nodes.
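
A sketch of such a multi-node GPU request, assuming two fully equipped 4-GPU nodes (all numbers are placeholders):

#SBATCH -N 2                  # two GPU nodes
#SBATCH -n 2                  # at least as many tasks as nodes
#SBATCH --gres=gpu:a100:4     # 4 GPUs per node, i.e. 8 GPUs in total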

-p Partition (important for Lichtenberg 2!!!)
Allowed values are test24, test30m, test7d, and testgpu24

-d dependency
This determines dependencies between different jobs. For details, please see ''man sbatch''.
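
For example, to start a post-processing job only after another job has finished successfully (the job ID and script name are placeholders; 'afterok' is one of several dependency types listed in ''man sbatch''):

sbatch -d afterok:1234567 postprocessing.sh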

Tips

--exclusive
This requests a compute node exclusively for this job, meaning none of your other jobs are allowed on this node.

This might be important if you request fewer cores per node than available (96 on our LB 2 phase I nodes). In this case, Slurm could dispatch other jobs of the same user to the node. While permitted in general, this could adversely affect the runtime behaviour of the first job (possibly distorting timing and performance analyses).

Jobs of other users are in any case not permitted on nodes already running your jobs -- our Slurm configuration is user-exclusive per se.
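
For example, to run a timing measurement on 24 cores without any of your other jobs landing on the same node (a sketch, not a recommendation for production runs):

#SBATCH -n 1
#SBATCH -c 24
#SBATCH --exclusive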

MPI-Script

#!/bin/bash

#SBATCH -J <Job_Name>
#SBATCH --mail-type=END
# Please check paths (directories have to exist beforehand):
#SBATCH -e /scratch/<TUID>/<yourWorkingDirectory>/%x.err.%j
#SBATCH -o /scratch/<TUID>/<yourWorkingDirectory>/%x.out.%j
#
#SBATCH -n 192               # number of processes (= total cores to use, here: 2 nodes à 96 cores)
#SBATCH --mem-per-cpu=1750   # required main memory in MByte per MPI task/process
#SBATCH -t 01:30:00          # in hours, minutes and seconds, or '#SBATCH -t 10' - just minutes

# -------------------------------
# your job's "payload" in the form of commands to execute, e.g.
module purge
module load gcc openmpi
cd /scratch/<TUID>/<yourWorkingDirectory>
srun  <MPI program>  <parameters>
EXITCODE=$?

# any cleanup and copy commands:
...
# end this job script with precisely the exit status of your scientific program above:
exit $EXITCODE
Please replace anything within <…> with your own values!

OpenMP-Script

#!/bin/bash

#SBATCH -J <job_name>
#SBATCH --mail-type=END
# Please check paths (directories have to exist beforehand):
#SBATCH -e /home/<TUID>/<project_name>/<job_name>.err.%j
#SBATCH -o /home/<TUID>/<project_name>/<job_name>.out.%j
#
#SBATCH -n 1                  # 1 process only
#SBATCH -c 24                 # number of CPU cores per process
#                               can be referenced as $SLURM_CPUS_PER_TASK in your "payload" down below
#SBATCH --mem-per-cpu=1750    # Main memory in MByte for each cpu core
#SBATCH -t 01:30:00           # Hours and minutes, or '#SBATCH -t 10' - just minutes

# -------------------------------
# your job's "payload" in the form of commands to execute, e.g.
module purge
module load gcc
cd /scratch/<TUID>/<project_name>

# Whether and how to set OMP_NUM_THREADS depends on your program
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

/home/<TUID>/<path/to/program>  <parameters>
EXITCODE=$?

#  any cleanup and copy commands:
...
# end this job script with precisely the exit status of your scientific program above:
exit $EXITCODE
Please replace anything within <…> with your own values!

MPI + OpenMP-Script

#!/bin/bash

#SBATCH -J <Job_Name>
#SBATCH --mail-type=END
# Please check paths (directories have to exist beforehand):
#SBATCH -e /home/<TUID>/<project_name>/<job_name>.err.%j
#SBATCH -o /home/<TUID>/<project_name>/<job_name>.out.%j
#
#SBATCH -n 4                 # number of MPI processes (together with -c 96 below: 4 nodes à 96 cores)
#SBATCH -c 96                # number of OpenMP threads or CPU cores per process
#                              can be referenced as $SLURM_CPUS_PER_TASK in your "payload" down below
#SBATCH --mem-per-cpu=1750   # Main memory in MByte for each cpu core
#SBATCH -t 01:30:00          # Hours and minutes, or '#SBATCH -t 10' - just minutes

# -------------------------------
# your job's "payload" in the form of commands to execute, e.g.
module purge
module load gcc openmpi
cd /scratch/<TUID>/<yourWorkingDirectory>

# Whether and how to set OMP_NUM_THREADS depends on your program
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun  <MPI program>  <parameters>
EXITCODE=$?

#  any cleanup and copy commands:
...
# end this job script with precisely the exit status of your scientific program above:
exit $EXITCODE
Please replace anything within <…> with your own values!

GPU/GRes

#!/bin/bash

#SBATCH -J <Job_Name>
#SBATCH --mail-type=END
# Please check paths (directories have to exist beforehand):
#SBATCH -e /home/<TUID>/<project_name>/<job_name>.err.%j
#SBATCH -o /home/<TUID>/<project_name>/<job_name>.out.%j
#
# CPU specification
#SBATCH -n 1                  # 1 process
#SBATCH -c 24                 # 24 CPU cores per process 
#                               can be referenced as $SLURM_CPUS_PER_TASK in the "payload" part
#SBATCH --mem-per-cpu=1750    # main memory in MByte per CPU core
#SBATCH -t 01:30:00           # in hours, minutes and seconds, or '#SBATCH -t 10' - just minutes

# GPU specification
#SBATCH --gres=gpu:v100:2     # 2 GPUs of type NVidia "Volta 100"

# -------------------------------
# your job's "payload" in the form of commands to execute, e.g.
module purge
module load gcc cuda
cd /scratch/<TUID>/<yourWorkingDirectory>

# Whether and how to set OMP_NUM_THREADS depends on your program
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# for checking whether and which GPUs have been allocated
# (output appears in the "#SBATCH -e" file specified above):
nvidia-smi 1>&2

./<program>  <parameters>
EXITCODE=$?

#  any cleanup and copy commands:
...
# end this job script with precisely the exit status of your scientific program above:
exit $EXITCODE
Please replace anything within <…> with your own values!

The request “--gres=Class:Type:Amount” always refers to a single accelerator node, and to GPU cards as a whole. There is no way of requesting a certain number of GPU cores (e.g. 48 tensor units) -- you can only ask for between one and four whole GPU cards.

Unless you explicitly specify a certain number of “CPU cores per task” (using “-c #” or “--cpus-per-task=#”), your job will automatically be assigned a quarter of the available CPU cores (96/4 = 24) per requested GPU.

If you request just “--gres=gpu:a100:2” (without specifying “-c #”), your job will find 48 CPU cores available on the node, in addition to the two GPU cards.
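
A sketch of the difference (assuming a GPU node with 96 CPU cores and 4 GPUs, as described above):

#SBATCH --gres=gpu:a100:2     # 2 GPUs; implicitly 2 * 24 = 48 CPU cores

#SBATCH -c 8
#SBATCH --gres=gpu:a100:2     # 2 GPUs, but only the 8 explicitly requested CPU cores per task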

It is also possible to work interactively on compute nodes, though it is not advised for regular work: due to many pending jobs and a fully utilised cluster, resources may not be immediately available.

If interactive work on compute nodes is in fact necessary, it can be requested with the srun command and the --pty /bin/bash parameter. The three parameters that are always required, -t (time), -n (number of tasks) and --mem-per-cpu= (memory per core), need to be supplied to the srun command as well. Optional parameters like features and mail options can also be given on the command line.

Example

srun -t15 -n4 --mem-per-cpu=500 --pty /bin/bash
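
If the interactive session also needs an accelerator, the same GRes syntax as for batch jobs can be appended (a sketch only; whether a GPU node is free immediately is of course not guaranteed):

srun -t15 -n1 --mem-per-cpu=500 --gres=gpu --pty /bin/bash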