Tips and Tricks

Define module loads in the job script

To make job submission easier and more fault-tolerant for you, Slurm by default passes on all environment variables and all loaded modules of the (login) session you submit the job from.

Thus, for better reproducibility, it is recommended to begin each job script with module purge, followed by only those module load … lines necessary for this job. Submitted that way, the job's main program will run with only the required and desired software (versions).

This is especially important if you use, for example, module initadd to load certain modules from ~/.bashrc (because you need them time and again in each login session).
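A minimal job script header following this pattern could look like the sketch below (job name, core count, runtime and the module names/versions are placeholders; substitute whatever your job actually needs):

#!/bin/bash
#SBATCH -J myJob
#SBATCH -n 16
#SBATCH -t 01:00:00

module purge                             # start from a clean environment
module load gcc/11.2.0 openmpi/4.1.4     # load only what this job requires

srun ./myProgram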

Define your own module “collections”

When you have a set of modules optimized for a class of jobs, you can define it as a “collection”, which can then be restored with a single line in your job scripts.

After loading your carefully composed set of modules with “module load mX mY mZ …” (optionally pinning versions with “… mX/versionX mY/versionY mZ/versionZ …”), save it as a “collection” using “module save <myCollectionName>”.

In your job scripts, you can then load and activate this “collection” simply with

module purge
module restore <myCollectionName>

Lmod stores each of your “collections” in a text file $HOME/.lmod.d/<myCollectionName>, where you can also inspect their exact settings.

A list of all your “collections” can be shown with “module savelist”.
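A complete round trip might look like this sketch (the module and collection names are placeholders):

module load gcc openmpi fftw      # compose the set once, e.g. in a login session
module save myMpiCollection       # store it as a collection
module savelist                   # verify it is listed

# later, at the top of a job script:
module purge
module restore myMpiCollection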

Archive decompression in /work/scratch: beware of automatic file cleanup

Extracting archives (e.g. *.zip, *.tar) often preserves the modification timestamps of all files. If a decompressed file's modification time is too old, e.g. older than 8 weeks, the freshly extracted files may be deleted by the scratch area's automatic cleaning policy (which runs daily).

To avoid such cleaning, you can often use an additional tool parameter, e.g. tar's -m option. Alternatively, you can use the touch command to set an updated modification time.
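For example (myArchive.tar and myDataDir are placeholders; this is only relevant as long as cleanup is based on modification time, see the note below):

tar -xmf myArchive.tar                      # -m: do not restore the archived modification times
find myDataDir -type f -exec touch {} +     # alternatively: refresh timestamps after extraction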

Attention: as of April 18th 2017, the scratch cleaning cycle is based on 'creation time' instead of 'modification time' for all files. Since this change, there is no need for a modification time update (via additional archive parameters or touch) any more. In other words: updating a file's modification time is pointless and will no longer prevent your file(s) from being deleted.

Missing Slurm support in MPI applications

Many applications have problems using the correct number of cores within the batch system. This can be caused by missing Slurm support: in general, those applications use their own MPI versions and have to be explicitly given the right number of cores and a hostfile.

First, you have to generate a current hostfile. The following two lines replace the usual call “mpirun <MPI-Program>”:

srun hostname > hostfile.$SLURM_JOB_ID
mpirun -n 64 -hostfile hostfile.$SLURM_JOB_ID <MPI-Program>

The first line (above) generates the hostfile; in addition, the second line gives MPI the number of planned cores (here 64) and the name of the hostfile.
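To avoid hard-coding the core count, you can let Slurm supply it. A sketch, assuming your batch script requests its tasks via #SBATCH -n (SLURM_NTASKS is a standard Slurm variable):

srun hostname > hostfile.$SLURM_JOB_ID
mpirun -n $SLURM_NTASKS -hostfile hostfile.$SLURM_JOB_ID <MPI-Program>
rm -f hostfile.$SLURM_JOB_ID     # clean up the hostfile afterwards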

Job details after completion

After your job has finished, the following command reports the CPU and memory efficiency of the job:

seff <JobID>

Even more details are shown by the following commands:

sacct -l -j <JobID>
tuda-seff <JobID>

Expiry date of your user account

To see the expiry date of your own user account, use the script /shared/bin/account_expire.

Your user account's validity is independent of the term or validity of any projects you might be associated with.

File transfer to and from the Lichtenberg HPC

Before and after calculations, your input data needs to get onto, and your results off of, the Lichtenberg filesystems.

We recommend the following tools:

One-off: scp

As you can log in to the login nodes via ssh, you can also use SSH's scp tool to copy files and directories from or to the Lichtenberg.

Use the login nodes for your scp transfers, as these have high-bandwidth network ports to the TU campus network as well (we do not have any dedicated input/output nodes).

For (large) text/ASCII files, you should use the optional compression (-C) built into the SSH protocol, in order to save network bandwidth and possibly speed up your transfers.
Omit compression when copying already compressed data like JPG images or videos in modern container formats (mp4, OGG).

Example:

tuid@hla0003:~ $ scp -Cpr myResultDir mylocalworkstation.inst.tu-darmstadt.de:/path/to/my/local/resultdir
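The reverse direction works the same way from your local workstation's shell (the login node name below is a placeholder; use the one you normally ssh to):

myuser@mylocalworkstation:~ $ scp -Cpr myInputDir tuid@<loginnode>:/path/to/my/lichtenberg/inputdir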

Details: man scp.

Fault tolerance: none (when interrupted, scp will transfer everything afresh, regardless of what's already in the destination).

Repeatedly: rsync

Some cases, i.e. repeated transfers, are less suitable for scp.

Examples: “I need my calculations' results on my local workstation's hard disk as well, for analysis with graphical tools”, or “My local experiment's raw data needs to hop to the HPC for analysis as soon as it is generated”.

As soon as you have to keep (one of) your Lichtenberg directories “in sync” with one on your institute's (local) infrastructure, running scp repeatedly would be inefficient, as it is not aware of “changes” and would blindly copy the same files over and over again.

That's where rsync steps in. Like scp, it is a command line tool transferring files from any (remote) “SRC” to any other (remote) “DEST”ination. In contrast to scp, however, it has a notion of “changes” and can find out whether a file in “SRC” has been changed and needs to be transferred at all. New as well as small files are simply transmitted; for large files, however, rsync transfers only their changed blocks (safeguarded by checksums).

In essence: unchanged files are not transferred again; new and changed files are, but for large files, only their changed portions (deltas) will be transferred.

Example:

tuid@hla0003:~ $ rsync -aH myResultDir mylocalworkstation.inst.tu-darmstadt.de:/path/to/my/local/resultdir

Details: man rsync.

Fault tolerance: partly (when interrupted, rsync will transfer only what is missing or not complete in the destination).
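For large transfers over unreliable connections, a variant like the following sketch (using rsync's standard --partial and -z options) keeps partially transferred files so an interrupted run can resume where it left off, and compresses data in transit:

tuid@hla0003:~ $ rsync -aH --partial -z myResultDir mylocalworkstation.inst.tu-darmstadt.de:/path/to/my/local/resultdir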

Remember: both scp and rsync are “one way” tools only! If a file is changed in “DEST” between transfers, the next transfer will overwrite it with the (older) version from “SRC”.

Not available on the Lichtenberg:

FTP(S), sFTP, rcp and other older or clear-text protocols.