Submitting Slurm jobs
Slurm allows for both interactive and non-interactive work. Most users will use sbatch or srun to submit non-interactive scripts or commands. In each case, specifying a partition (queue) and a QoS is mandatory.
Warning
Without valid --partition and --qos specified for each job, Slurm won’t accept the job. Please check your associations using entropy_account_info.
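entropy_account_info is a cluster-specific helper; the same information can usually also be obtained with standard Slurm tooling. A minimal sketch, assuming sacctmgr is available to regular users:

$ sacctmgr show assoc user=$USER format=User,Account,Partition,QOS

The QOS column lists the QoS names you may pass to --qos for each partition.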
Time limit
One of the most important parameters is --time. Without setting this flag explicitly, the system will assign the default value for the specified QoS, which will most probably be suboptimal.
Note
Slurm’s scheduling algorithm is quite complex. Properly estimating and setting your job’s running time will result in faster access to resources and better overall system utilization.
The accepted time formats are presented in the table below.
Format | Example | Description
---|---|---
<min> | 30 | 30 minutes
<min>:<sec> | 20:20 | 20 minutes and 20 seconds
<hr>:<min>:<sec> | 1:30:45 | 1 hour, 30 minutes and 45 seconds
<days>-<hr> | 2-0 | 2 days
<days>-<hr> | 2-6 | 2 days and 6 hours
<days>-<hr>:<min>:<sec> | 1-12:30:00 | 1 day, 12 hours and 30 minutes
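For instance, the two commands below request the same resources but different wall times; the partition and QoS names are taken from the examples in this guide and should be replaced with ones from your own associations:

$ srun --partition=common --qos=1gpu1h --time=45 --gres=gpu:1 nvidia-smi -L
$ srun --partition=common --qos=1gpu1d --time=1-0 --gres=gpu:1 nvidia-smi -L

The first requests 45 minutes, the second 1 day.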
Using srun
The srun command allows for running a single job on the cluster. A valid partition and QoS need to be specified each time. If not redirected, all program output will be printed to the standard output.
Running a single command and printing the results to the standard output.
$ srun --partition=common --qos=1gpu1h --time=10 --gres=gpu:1 nvidia-smi -L

GPU 0: TITAN V (UUID: GPU-6426f3d6-4cec-9167-5035-4e9129551d9b)
GPU 1: TITAN V (UUID: GPU-bcaaee86-bd21-4735-edc2-d18b5fed40a7)
GPU 2: TITAN V (UUID: GPU-109e5f3c-c2e8-3a9d-486a-0df29fb6c905)
GPU 3: TITAN V (UUID: GPU-e3d1f883-02b2-1da6-80e1-32efd4ab7453)
Running a single command with a specific node selected and printing the results to the standard output.
$ srun --nodelist=arnold --partition=common --qos=1gpu1h --time=20 --gres=gpu:1 nvidia-smi -L

GPU 0: TITAN V (UUID: GPU-6426f3d6-4cec-9167-5035-4e9129551d9b)
GPU 1: TITAN V (UUID: GPU-bcaaee86-bd21-4735-edc2-d18b5fed40a7)
GPU 2: TITAN V (UUID: GPU-109e5f3c-c2e8-3a9d-486a-0df29fb6c905)
GPU 3: TITAN V (UUID: GPU-e3d1f883-02b2-1da6-80e1-32efd4ab7453)
Running a single command with a specific node and card selected and saving the output to a file.
$ srun --nodelist=arnold --partition=common --qos=1gpu1h --output=username_out.txt --time=1:00 --gres=gpu:titanv:1 nvidia-smi -L
$
$ cat /results/username_out.txt

GPU 0: TITAN V (UUID: GPU-6426f3d6-4cec-9167-5035-4e9129551d9b)
GPU 1: TITAN V (UUID: GPU-bcaaee86-bd21-4735-edc2-d18b5fed40a7)
GPU 2: TITAN V (UUID: GPU-109e5f3c-c2e8-3a9d-486a-0df29fb6c905)
GPU 3: TITAN V (UUID: GPU-e3d1f883-02b2-1da6-80e1-32efd4ab7453)
Default Time, CPU and Memory values
Each partition has predefined memory, CPU and time values for a submitted job. These are set to fill nodes optimally – please do not change them without a reason.
To fully allocate memory and time within the assigned QoS, please use the flags specified in the table below.
Parameter | Flag
---|---
Memory (RAM) | --mem
Time | --time
Warning
The --mem flag should be used with caution! The default values for DefCpuPerGPU and DefMemPerCPU will allocate an optimal amount of resources, so it is recommended not to tinker with the --mem flag without a very specific reason.
For example:
$ srun --partition=common --qos=16gpu14d --output=username_out.txt --time=1-0 --gres=gpu:titanv:1 a_command_to_run
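If you are unsure what the defaults and limits for a partition or QoS are, standard Slurm commands can display them. A sketch; the exact fields shown depend on the cluster’s Slurm version and configuration:

$ scontrol show partition common
$ sacctmgr show qos format=Name,MaxWall,MaxTRES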
Using sbatch
Using sbatch involves writing a script with all the details needed for job submission. Required parameters are passed with #SBATCH directives, similar in spirit to the #define directives known from the C preprocessor. In batch mode, defining the --output file is mandatory.
Running a single command.
$ cat job.sh

#!/bin/bash
#
#SBATCH --job-name=test_job_username
#SBATCH --partition=common
#SBATCH --qos=1gpu1d
#SBATCH --gres=gpu:1
#SBATCH --time=1-0
#SBATCH --output=test_job.txt

nvidia-smi -L

$ sbatch job.sh
Running a single command with a specific node and GPU type selected.
$ cat job.sh

#!/bin/bash
#
#SBATCH --job-name=test_job_n
#SBATCH --partition=research
#SBATCH --qos=lecturer
#SBATCH --gres=gpu:rtx2080ti:8
#SBATCH --output=test_job_n.txt
#SBATCH --time=3-0
#SBATCH --nodelist=asusgpu2

nvidia-smi -L

$ sbatch job.sh
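If the same script is submitted many times, each run will overwrite the previous output file. Slurm’s standard %j filename pattern (replaced with the job ID) avoids this; a small variation of the first script above:

$ cat job.sh

#!/bin/bash
#
#SBATCH --job-name=test_job_username
#SBATCH --partition=common
#SBATCH --qos=1gpu1d
#SBATCH --gres=gpu:1
#SBATCH --time=1-0
# %j below is replaced with the job ID, so every submission writes a separate file
#SBATCH --output=test_job_%j.txt

nvidia-smi -L

$ sbatch job.sh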
Environment variables
By default, Slurm will copy (and as a consequence overwrite) all environment variables from the submission node to the compute nodes. Thus, either full paths to binaries or changes to the PATH variable are required. For example, let us try to run nvcc without any PATH or script modifications:
$ cat job.sh

#!/bin/bash
#
#SBATCH --job-name=test_job_n
#SBATCH --partition=common
#SBATCH --qos=student
#SBATCH --gres=gpu:rtx2080ti:1
#SBATCH --time=30
#SBATCH --output=/results/test_job_n.txt

nvcc --version

$ sbatch job.sh
$
$ cat /results/test_job_n.txt

/var/spool/slurm/d/job00124/slurm_script: line 9: nvcc: command not found
Specifying the full path would work:
/usr/local/cuda/bin/nvcc --version
We could also add the --export option to a batch script:
$ cat job.sh

#!/bin/bash
#
#SBATCH --job-name=test_job_n
#SBATCH --partition=common
#SBATCH --qos=student
#SBATCH --gres=gpu:rtx2080ti:1
#SBATCH --output=/results/test_job_n.txt
#SBATCH --time=0-8
#SBATCH --export=ALL,PATH="/usr/local/cuda/bin:${PATH}"

nvcc --version

$ sbatch job.sh
$
$ cat /results/test_job_n.txt

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
Please read the --export explanation in the manual: https://slurm.schedmd.com/sbatch.html.
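An alternative to --export, not shown above, is to extend PATH in the script body itself before calling the compiler; this is plain bash and keeps the #SBATCH header unchanged. A sketch based on the previous example:

$ cat job.sh

#!/bin/bash
#
#SBATCH --job-name=test_job_n
#SBATCH --partition=common
#SBATCH --qos=student
#SBATCH --gres=gpu:rtx2080ti:1
#SBATCH --output=/results/test_job_n.txt
#SBATCH --time=0-8

# Extend PATH in the job's environment before running the compiler
export PATH=/usr/local/cuda/bin:$PATH

nvcc --version

$ sbatch job.sh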