Read this very carefully: https://scicomp.ethz.ch/wiki/Getting_started_with_clusters
Pre-installed software packages and libraries: https://scicomp.ethz.ch/wiki/Euler_applications_and_libraries
- Only log into the cluster using a terminal. Do not use VSCode, otherwise your account will be banned.
- Schedule experiments only if you are sure your code runs.
- Make sure that all jobs you schedule can be analyzed and that their outputs are saved correctly.
- When searching for hyperparameters, don't do a lazy grid search; be smart and vary one hyperparameter at a time (see the sketch after this list).
- Only reserve the resources you need: cores, memory, time, GPUs.
- Requesting more resources for a single run is often worse than running two jobs with half the resources each (it depends on the fraction of your code that is fully parallelizable).
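For example, a one-hyperparameter-at-a-time sweep can be submitted from the login node as a loop over bsub calls. This is only a sketch: $HOME/train.sh, its --lr flag, and the candidate values are hypothetical placeholders for your own run script.
#!/bin/bash
# Vary only the learning rate; keep every other hyperparameter fixed.
# Repeat the same pattern for the next hyperparameter once the best value is found.
mkdir -p "$HOME/logs"
for lr in 0.1 0.01 0.001; do
    bsub -n 4 -W 4:00 \
        -R "rusage[mem=3096,ngpus_excl_p=1]" \
        -o "$HOME/logs/lr_${lr}.out" \
        "$HOME/train.sh" --lr "$lr"
done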
First, read the data management section of the cluster documentation.
- /cluster/home: The home directory, limited to 16 GB.
Store your code / GitHub repositories and your conda environment for Python here.
Data can be transferred to the compute nodes extremely fast; even for a lot of small files, the speed is not reduced by the number of copied files.
- /cluster/scratch: Temporary storage (kept for 2 weeks) with 2 TB.
If a lot of small files need to be transferred to the compute node, this can be extremely slow: exceeding the file-count limit triggers a massive reduction in transfer speed, because the storage is attached via a network, which can slow down the whole cluster.
Use it for datasets or your experiment results. Read the usage rules:
cat $SCRATCH/__USAGE_RULES__
- /work/usergroup: Like /cluster/scratch, but persistent.
Use it for datasets or your experiment results.
- $TMPDIR: Local scratch (on each compute node).
Needs to be reserved when submitting a job with bsub.
Provides SSD storage on the compute node itself.
Extremely fast, especially for frequent file access (it does not go through the cluster network).
Therefore good for everything you need to read and write often.
Example: an ImageNet dataset stored as individual .pngs; transfer it to local scratch before starting the training.
WARNING: All directories have file-count limits as well!
It's preferable to transfer one big tar file rather than a lot of individual small files.
Tar all data within a folder without compression (it's recommended to create the tar on your local machine and then transfer the .tar to the cluster):
tar -cvf $HOME/some_folder.tar $HOME/some_folder
In your run script, copy and extract the data to the local SSD on the compute node:
tar -xvf $HOME/some_folder.tar -C $TMPDIR
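Note: GNU tar strips the leading / from the absolute path in the command above, so the data typically ends up under $TMPDIR/cluster/home/username/some_folder. If you prefer the folder to land directly at $TMPDIR/some_folder, one option is to archive it with a relative path instead (a sketch of the alternative):
# Store paths relative to $HOME so extraction into $TMPDIR yields $TMPDIR/some_folder.
tar -cvf $HOME/some_folder.tar -C $HOME some_folder
tar -xvf $HOME/some_folder.tar -C $TMPDIR
ls $TMPDIR/some_folder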
scp -r /home/username/Documents username@euler.ethz.ch:/cluster/home/username
Transfers all contents (-r) of your local Documents folder to your home directory on the cluster.
rsync -r -v --delete --exclude 'something/*' --exclude '__pycache__' --exclude '*.pyc' --exclude '*.ipynb' /home/username/Documents jonfrey@euler:/cluster/home/jonfrey/project
Syncs all files (-r) in your Documents folder to your home/project directory on the cluster.
Deletes everything within your cluster home/project directory that is not in your Documents folder (--delete; be careful not to delete important things on the cluster! See the dry-run sketch below).
Ignores everything in the /home/username/Documents/something folder as well as the listed file extensions.
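Since --delete is destructive, it is worth previewing the sync first: rsync's -n (--dry-run) flag lists what would be copied or deleted without changing anything on the cluster. A sketch with the same source and destination as above:
# Dry run: print the planned transfers and deletions, then exit without modifying anything.
rsync -r -v -n --delete --exclude 'something/*' --exclude '__pycache__' --exclude '*.pyc' --exclude '*.ipynb' /home/username/Documents jonfrey@euler:/cluster/home/jonfrey/project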
Modules to load for Python with GPU support:
module load gcc/6.3.0 cuda/11.3.1 cudnn/8.0.5 python_gpu/3.8.5
Activate the ETH proxy on the compute node (not on the login node):
module load eth_proxy
See active modules:
module list
See available modules:
module avail
Example command:
bsub -n 18 \
-W 4:00 \
-R singularity \
-R "rusage[mem=3096,ngpus_excl_p=1]" \
-o $HOME/results.out \
-R "select[gpu_mtotal0>=10000]" \
-R "rusage[scratch=2000]" \
-R "select[gpu_driver>=470]" \
$HOME/some_script.sh
- -n: number of CPU cores
- mem=3096: memory in MB per core
- -R "rusage[scratch=2000]": scratch space in MB per core on the local SSD
- ngpus_excl_p=1: use a single GPU
- -R "select[gpu_driver>=470]": GPU driver version
- -R "select[gpu_mtotal0>=10000]": GPU memory over 10000 MB
- -o $HOME/results.out: file to store the console output
- -I: get an interactive job
- -Is: additionally gives you a login shell on the compute node
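The example command above submits $HOME/some_script.sh, which is not spelled out here. A minimal sketch of what it could contain, assuming the data was tarred with relative paths as in the sketch above; train.py and its flags are hypothetical placeholders.
#!/bin/bash
# Runs on the compute node with the resources requested by bsub.

# Load the toolchain (versions from the module section above) and the ETH proxy.
module load gcc/6.3.0 cuda/11.3.1 cudnn/8.0.5 python_gpu/3.8.5
module load eth_proxy

# Stage the dataset onto the node-local SSD reserved via -R "rusage[scratch=2000]".
tar -xf "$HOME/some_folder.tar" -C "$TMPDIR"

# Read the data from local scratch; write results to persistent storage,
# since $TMPDIR is wiped when the job ends.
python train.py --data-dir "$TMPDIR/some_folder" --output-dir "$SCRATCH/results"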
Useful commands to check quotas and usage:
my_share_info
lquota $HOME
Shows the quota and current usage of a directory (here, your home).
cd /cluster/work/rsl && du -a | cut -d/ -f2 | sort | uniq -c | sort -nr
Counts the number of entries per top-level folder in the group work directory (useful for keeping an eye on the file-count limits).
To debug on the cluster and get your code running without scheduling a new job for each trial, you can get a shell on the execution node:
bsub -Is ..specify other resources for the job.. bash
Using tmux is even more convenient. It allows you to have multiple terminal windows on the same compute node.
module load tmux
bsub -Is ..specify other resources for the job.. tmux
Append the following to your $HOME/.bashrc on the cluster to get color support.
if [ -n "$force_color_prompt" ]; then
if [ -x /usr/bin/tput ] && tput setaf 1 >&/dev/null; then
# We have color support; assume it's compliant with Ecma-48
# (ISO/IEC-6429). (Lack of such support is extremely rare, and such
# a case would tend to support setf rather than setaf.)
color_prompt=yes
else
color_prompt=
fi
fi
if [ "$color_prompt" = yes ]; then
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
else
PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
fi
unset color_prompt force_color_prompt
Prints all running or scheduled jobs:
bjobs
Provides detailed information about a job. If it reports low resource usage, reduce your core request or write more performant code:
bbjobs JOB_ID
Shows the disk usage per subfolder of the current directory:
du -h --max-depth=1 ./