GPU-Cluster
This cluster is not intended for CPU processes but for heavy GPU processes.
The cluster software we use is called Slurm [http://slurm.schedmd.com/].
Mailing List
Before you start using the cluster, please subscribe to the mailing list cs5-cluster [https://lists.fau.de/cgi-bin/listinfo/cs5-cluster/].
(If you run into problems, please let us know at [email protected]).
Help
If you have problems using the cluster, we will help you, of course. :)
But before contacting the cluster admins, please check again whether the answer to your question is already somewhere in this document.
Or maybe your advisor can help you?
If you're still stuck, please contact the cluster admins on [email protected].
Basic Concepts
On a cluster you don't normally work directly with the computers performing your computations (the compute nodes). Instead, you connect to a special node (the submit node), submit
your job there, and the cluster software will schedule it for execution on one of the compute nodes. As soon as the scheduler has found a node with the resources required for your
job (and you haven't exceeded the maximum number of active jobs allowed for your account), the job is executed there.
Our hardware
Nodes: lme49 lme50 lme51 lme52 lme53 lme170 lme171
Data Sets
Please move large data sets to `/cluster/shared_data` so that everybody can use them.
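For example (the directory name my_dataset is just a placeholder):
mv /cluster/$(whoami)/my_dataset /cluster/shared_data/    # move a dataset from your personal cluster directory to the shared location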
Job Submission
You can submit jobs to the cluster from host cluster.i5.informatik.uni-erlangen.de (lme242). Use SSH to connect to this machine.
The most common way to run a job on the cluster is to submit a small shell script containing information about the job. Here is one example:
example.sh
#!/bin/bash
#SBATCH --job-name=MY_EXAMPLE_JOB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=12000
#SBATCH --gres=gpu:1
#SBATCH -o /home/%u/%x-%j-on-%N.out
#SBATCH -e /home/%u/%x-%j-on-%N.err
#SBATCH --mail-type=ALL
#Timelimit format: "hours:minutes:seconds" -- max is 24h
#SBATCH --time=24:00:00
#SBATCH --exclude=lme53
# Small Python packages can be installed in your own home directory. Not recommended for big packages like tensorflow -> follow the pipenv instructions below
# cluster_requirements.txt is a text file listing the required pip packages (one package per line)
pip3 install --user -r cluster_requirements.txt
python3 train.py
Make sure to put all #SBATCH options at the top of the file. This example says that the job consists of only 1 parallel task which needs 2 CPUs, 12000 MiB of RAM and 1 GPU;
the job itself is python3 train.py in this example. There are a number of other interesting options; please refer to the manpage of sbatch [https://slurm.schedmd.com/sbatch.html].
To submit this you would ssh to cluster.i5.informatik.uni-erlangen.de and run sbatch example.sh (if that's your file name).
(You can also get an interactive shell with srun --pty --nodelist=lme49 bash — but please avoid using this command.)
If you need to set a lot of environment variables, it might be advisable to set them in your .bashrc and source your .bashrc in your job scripts. For instance:
export PATH="$HOME/.cargo/bin:$PATH"
Miniconda
Miniconda is installed on the cluster nodes (not on lme242) at /opt/miniconda if you prefer installations using conda or if you need Python 3.7.
Just add export PATH=/opt/miniconda/bin:$PATH to your job script. This will enable python, pip and conda. Be aware that you still have to use the --user option of pip to be able to
install packages.
Alternatively, you can install your own Miniconda in your /cluster directory:
wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
 -O miniconda.sh
chmod +x miniconda.sh && ./miniconda.sh -b -p /cluster/$(whoami)/miniconda
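A sketch of how a self-installed Miniconda could then be used from a job script; the environment name my_env and the Python version are only examples:
export PATH=/cluster/$(whoami)/miniconda/bin:$PATH
conda create -y -n my_env python=3.7                    # one-time setup
source /cluster/$(whoami)/miniconda/bin/activate my_env
python3 train.py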
Create your own Pipenv (to use your own package version)
This will tell pipenv to install the dependencies on our cluster drive:
export WORKON_HOME=/cluster/`whoami`/.python_cache
Then execute the setup on the cluster (TODO: fix link) https://asciinema.org/a/IwL7TBoR1iTBhfwUdJMbcTpD2 [https://asciinema.org/a/IwL7TBoR1iTBhfwUdJMbcTpD2] (cd to the correct
folder in your sbatch script to be on the safe side).
You can also copy the created folder to your PC. Be careful: the TensorFlow version must be compatible with the installed CUDA and cuDNN versions.
But be aware that Python will never allow fully reproducible runs. It's here to surprise you every day!
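A rough sketch of the whole workflow (the project directory myproject and its Pipfile are assumptions, not part of the cluster setup):
export WORKON_HOME=/cluster/$(whoami)/.python_cache
cd /cluster/$(whoami)/myproject
pipenv install                     # one-time: create the virtualenv from the Pipfile
pipenv run python3 train.py        # in the sbatch script: run the job inside the virtualenv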
If something fails and you need to debug your script, it is usually a good idea not to request a GPU in your sbatch file (remove #SBATCH --gres=gpu:1). Then you don't need to wait until a
GPU is available and can run your script immediately. Add a time limit via #SBATCH --time=5 for a maximum runtime of 5 minutes.
You can also launch an interactive session to see your script fail in real time (with or without GPU). Please do not spend too much time in this mode.
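A possible invocation (the requested resources and time limit are only examples):
srun --pty --gres=gpu:1 --cpus-per-task=2 --time=60 bash    # interactive shell with one GPU, two CPUs and a 60-minute limit
srun --pty --cpus-per-task=2 --time=60 bash                 # the same without a GPU, e.g. for quick debugging of non-GPU code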
Data
The cluster has two main directories for storing intermediate data: /cluster and /scratch.
If you have to work with significant amounts of data, please don't use your home directory, as this could impair performance for everybody else!
Important:
Put your data in a sub-directory with the same name as your user account (e.g. /cluster/gropp or /scratch/gropp).
Don't forget to delete your data from the cluster directories when you're done.
Note that there is no backup of the cluster storage.
Don't use your home directory for processing big amounts of data!
Don't store data you cannot afford to lose on the cluster file systems!
The /cluster file system is accessible from non-cluster (Linux) computers at /net/cluster; on Windows computers you can use WinSCP to access /cluster on
cluster.i5.informatik.uni-erlangen.de.
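For example, from a non-cluster Linux machine (the paths are placeholders):
rsync -a ~/datasets/my_dataset/ /net/cluster/$(whoami)/my_dataset/    # copy a local dataset into your personal directory on the cluster file system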
Output
By default, both standard output (stdout) and standard error (stderr) are directed to a file named “slurm-%j.out”, where the “%j” is replaced with the job allocation number.
It is possible to change the file names with the #SBATCH -o and #SBATCH -e options (you can use the variable %j).
If you run into problems with your cluster jobs, first check these output files!
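For example, with the -o pattern from the job script above you could follow a running job's output like this (job id and node name are placeholders):
tail -f ~/MY_EXAMPLE_JOB-12345-on-lme49.out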
Highscore
The following command lists the top 10 GPU users (in GPU hours) since the beginning of the current month:
sreport user top topcount=10 -t hourper --tres=gres/gpu start=$(date --date="$(date +'%Y-%m-01')" +%D)
Job Control
Use squeue to show running jobs, or scancel to abort a job.
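For example (the job id is a placeholder):
squeue -u $(whoami)     # list your own pending and running jobs
scancel 12345           # cancel the job with id 12345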
Advanced Topics
eMail Notifications
Slurm can notify you when the status of a job changes. See notifications.
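A sketch of the relevant options in a job script (the address is a placeholder; see man sbatch for all --mail-type values):
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]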
In case you'd like to have complete control over your Python packages or need a different version of a library than what is installed, you can use virtualenvs (see the pipenv section above).
Cluster Administration
→ admin
A time limit of 24h is enforced on the cluster. You can launch follow-up jobs with dependencies (they will only start if your first job finished successfully):
http://www.vrlab.umu.se/documentation/batchsystem/job-dependencies [http://www.vrlab.umu.se/documentation/batchsystem/job-dependencies]
PyTorch
PyTorch is installed for python3. If you need a different version please see the instructions for creating a virtualenv above.
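To check which PyTorch version a node provides and whether it can see a GPU, you could run something like this inside a job (or an interactive session):
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"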
If the software you need is an official Ubuntu package, there's no problem asking the admins to install it.
Otherwise, install the software to /cluster/$(whoami)/opt or /cluster/$(whoami)/local. You may need to set PATH, CPATH and LD_LIBRARY_PATH
accordingly. E.g.
export LD_LIBRARY_PATH=/cluster/$(whoami)/local/lib:$LD_LIBRARY_PATH
export CPATH=/cluster/$(whoami)/local/include:$CPATH
export PATH=/cluster/$(whoami)/local/bin:$PATH
Then you can simply set CMAKE_INSTALL_PREFIX to /cluster/$(whoami)/local and you will use your self-compiled libraries.
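A sketch of a typical out-of-source CMake build with this prefix (the project itself is assumed):
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/cluster/$(whoami)/local ..
make -j4 && make install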
References
man sbatch [https://slurm.schedmd.com/sbatch.html]
man squeue [https://slurm.schedmd.com/squeue.html]
man scancel [https://slurm.schedmd.com/scancel.html]
man sinfo [https://slurm.schedmd.com/sinfo.html]
man scontrol [https://slurm.schedmd.com/scontrol.html]