Tips and Tricks

Define module loads in the job script

To make job submission easier and more fault-tolerant for you, Slurm by default passes on all environment variables and all loaded modules of the (login) session you submit the job from.

Thus, for better reproducibility, it is recommended to begin each job script with module purge, followed by only those module load … lines necessary for this job. Submitted that way, the job's main program will run with only the required and desired software (versions).

This is especially important if you use, for example, module initadd to load certain modules from ~/.bashrc (because you need them time and again in each login session).
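A minimal job script header following this pattern could look like the sketch below (job name, core count, runtime and the module names/versions are placeholders; substitute whatever your job actually needs):

#!/bin/bash
#SBATCH -J myJob
#SBATCH -n 16
#SBATCH -t 01:00:00

module purge                             # start from a clean environment
module load gcc/11.2.0 openmpi/4.1.4     # load only what this job requires

srun ./myProgram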

Define your own module “collections”

When you have a set of modules optimized for a class of jobs, you can define it as a “collection”, which can then be restored with a single line in your job scripts.

After loading your carefully composed set of modules with “module load mX mY mZ …” (optionally pinning versions with “… mX/versionX mY/versionY mZ/versionZ …”), save it as a “collection” using “module save <myCollectionName>”.

In your job scripts, you can then load and activate this “collection” simply with

module purge
module restore <myCollectionName>

Lmod stores each of your “collections” in a text file $HOME/.lmod.d/<myCollectionName>, where you can also inspect their exact settings.

A list of all your “collections” can be shown with “module savelist”.
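A complete round trip might look like this sketch (the module and collection names are placeholders):

module load gcc openmpi fftw      # compose the set once, e.g. in a login session
module save myMpiCollection       # store it as a collection
module savelist                   # verify it is listed

# later, at the top of a job script:
module purge
module restore myMpiCollection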

Archive decompression in /work/scratch: beware of automatic file cleanup

Extracting archives (e.g. *.zip, *.tar) often preserves the modification timestamps of all files. If a decompressed file's modification time is too old, e.g. older than 8 weeks, the freshly extracted files may be deleted by the scratch area's automatic cleaning policy (which runs daily).

To avoid such cleaning, you can often use an additional tool parameter, e.g. tar's -m option. Alternatively, you can use the touch command to set an updated modification time.
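For example (myArchive.tar and myDataDir are placeholders; this is only relevant as long as cleanup is based on modification time, see the note below):

tar -xmf myArchive.tar                      # -m: do not restore the archived modification times
find myDataDir -type f -exec touch {} +     # alternatively: refresh timestamps after extraction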

Attention: as of April 18th 2017, the scratch cleaning cycle is based on 'creation time' instead of 'modification time' for all files. Since this change, there is no need for a modification time update (via additional archive parameters or touch) any more. In other words: updating a file's modification time is pointless and will no longer prevent your file(s) from being deleted.

Missing Slurm support in MPI applications

Many applications have problems using the correct number of cores within the batch system. This can be caused by missing Slurm support: in general, those applications use their own MPI versions and have to be explicitly given the right number of cores and a hostfile.

First, you have to generate a current hostfile. The following two lines replace the usual call “mpirun <MPI-Program>”:

srun hostname > hostfile.$SLURM_JOB_ID
mpirun -n 64 -hostfile hostfile.$SLURM_JOB_ID <MPI-Program>

The first line (above) generates the hostfile; in addition, the second line gives MPI the number of planned cores (here 64) and the name of the hostfile.
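To avoid hard-coding the core count, you can let Slurm supply it. A sketch, assuming your batch script requests its tasks via #SBATCH -n (SLURM_NTASKS is a standard Slurm variable):

srun hostname > hostfile.$SLURM_JOB_ID
mpirun -n $SLURM_NTASKS -hostfile hostfile.$SLURM_JOB_ID <MPI-Program>
rm -f hostfile.$SLURM_JOB_ID     # clean up the hostfile afterwards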

Job details after completion

After your job has finished, the following command reports the CPU and memory efficiency of the job:

seff <JobID>

Even more details are shown by the following commands:

sacct -l -j <JobID>
tuda-seff <JobID>

Expiry date of your user account

To see the expiry date of your own user account, use the script /shared/bin/account_expire.

Your user account's validity is independent of the term or validity of any projects you might be associated with.

File transfer to and from the Lichtenberg HPC

Before and after calculations, your input data needs to get onto, and your results off of, the Lichtenberg filesystems.

We recommend the following tools:

One-off: scp

As you can log in to the login nodes via ssh, you can also use SSH's scp tool to copy files and directories from or to the Lichtenberg.

Use the login nodes for your scp transfers, as these have high-bandwidth network ports to the TU campus network as well (we do not have any dedicated input/output nodes).

For (large) text/ASCII files, you should use the optional compression (-C) built into the SSH protocol, in order to save network bandwidth and possibly speed up your transfers.
Omit compression when copying already compressed data like JPG images or videos in modern container formats (mp4, OGG).

Example:

tuid@hla0003:~ $ scp -Cpr myResultDir mylocalworkstation.inst.tu-darmstadt.de:/path/to/my/local/resultdir
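The reverse direction works the same way from your local workstation's shell (the login node name below is a placeholder; use the one you normally ssh to):

myuser@mylocalworkstation:~ $ scp -Cpr myInputDir tuid@<loginnode>:/path/to/my/lichtenberg/inputdir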

Details: man scp.

Fault tolerance: none (when interrupted, scp will transfer everything afresh, regardless of what's already in the destination).

Repeatedly: rsync

Some cases, i.e. repeated transfers, are less suitable for scp.

Examples: “I need my calculations' results on my local workstation's hard disk as well, for analysis with graphical tools”, or “My local experiment's raw data needs to hop to the HPC for analysis as soon as it is generated”.

As soon as you have to keep (one of) your Lichtenberg directories “in sync” with one on your institute's (local) infrastructure, running scp repeatedly would be inefficient, as it is not aware of “changes” and would blindly copy the same files over and over again.

That's where rsync steps in. Like scp, it is a command line tool transferring files from any (remote) “SRC” to any other (remote) “DEST”ination. In contrast to scp, however, it has a notion of “changes” and can find out whether a file in “SRC” has been changed and needs to be transferred at all. New as well as small files are simply transmitted; for large files, however, rsync transfers only their changed blocks (safeguarded by checksums).

In essence: unchanged files are not transferred again; new and changed files are, but for large files, only their changed portions (deltas) will be transferred.

Example:

tuid@hla0003:~ $ rsync -aH myResultDir mylocalworkstation.inst.tu-darmstadt.de:/path/to/my/local/resultdir

Details: man rsync.

Fault tolerance: partly (when interrupted, rsync will transfer only what is missing or not complete in the destination).
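For large transfers over unreliable connections, a variant like the following sketch (using rsync's standard --partial and -z options) keeps partially transferred files so an interrupted run can resume where it left off, and compresses data in transit:

tuid@hla0003:~ $ rsync -aH --partial -z myResultDir mylocalworkstation.inst.tu-darmstadt.de:/path/to/my/local/resultdir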

Remember: both scp and rsync are “one way” tools only! If a file is changed in “DEST” between transfers, the next transfer will overwrite it with the (older) version from “SRC”.

Not available on the Lichtenberg:

FTP(S), sFTP, rcp and other older or clear-text protocols.