Submitting Slurm jobs

Slurm supports both interactive and non-interactive work. Most users will use sbatch or srun to submit non-interactive scripts or commands. In each case, specifying a partition (queue) and a QoS is mandatory.

Warning

Without a valid --partition and --qos specified, Slurm will not accept the job. Please check your associations using entropy_account_info.
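For interactive work, srun can start a shell directly on a compute node. The sketch below is only an illustration: it reuses the common partition and the 1gpu1h QoS from the examples further down, which you should replace with an association you actually have.

    $ srun --partition=common --qos=1gpu1h --time=30 --gres=gpu:1 --pty bash

Exiting the shell ends the job and releases the allocation.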

Time limit

One of the most important parameters is --time. If you do not set this flag explicitly, the system will assign the default value for the specified QoS, which will most probably be suboptimal.

Note

Slurm’s scheduling algorithm is quite complex. Properly estimating and setting your job’s running time will result in faster access to the resources and better overall system utilization.

The accepted time formats are presented in the table below.

Format                   Example      Description
<min>                    30           30 minutes
<min>:<sec>              20:20        20 minutes and 20 seconds
<hr>:<min>:<sec>         1:30:45      1 hour, 30 minutes and 45 seconds
<days>-<hr>              2-0          2 days
<days>-<hr>              2-6          2 days and 6 hours
<days>-<hr>:<min>:<sec>  1-12:30:00   1 day, 12 hours and 30 minutes
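For illustration, --time accepts any of these formats on the command line; the partition and QoS below are placeholders borrowed from the examples in the next subsection, and both commands request the same 45-minute limit.

    $ srun --partition=common --qos=1gpu1h --time=45 --gres=gpu:1 nvidia-smi -L        # 45 minutes
    $ srun --partition=common --qos=1gpu1h --time=0:45:00 --gres=gpu:1 nvidia-smi -L   # the same limit, spelled out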

Using srun

The srun command runs a single job on the cluster. A valid partition and QoS need to be specified each time. Unless redirected, all program output is printed to the standard output.

  1. Running a single command and printing the results to the standard output.

     $ srun --partition=common --qos=1gpu1h --time=10 --gres=gpu:1 nvidia-smi -L

     GPU 0: TITAN V (UUID: GPU-6426f3d6-4cec-9167-5035-4e9129551d9b)
     GPU 1: TITAN V (UUID: GPU-bcaaee86-bd21-4735-edc2-d18b5fed40a7)
     GPU 2: TITAN V (UUID: GPU-109e5f3c-c2e8-3a9d-486a-0df29fb6c905)
     GPU 3: TITAN V (UUID: GPU-e3d1f883-02b2-1da6-80e1-32efd4ab7453)

  2. Running a single command with a specific node selected and printing the results to the standard output.

     $ srun --nodelist=arnold --partition=common --qos=1gpu1h --time=20 --gres=gpu:1 nvidia-smi -L

     GPU 0: TITAN V (UUID: GPU-6426f3d6-4cec-9167-5035-4e9129551d9b)
     GPU 1: TITAN V (UUID: GPU-bcaaee86-bd21-4735-edc2-d18b5fed40a7)
     GPU 2: TITAN V (UUID: GPU-109e5f3c-c2e8-3a9d-486a-0df29fb6c905)
     GPU 3: TITAN V (UUID: GPU-e3d1f883-02b2-1da6-80e1-32efd4ab7453)

  3. Running a single command with a specific node and card selected and saving the output to a file.

     $ srun --nodelist=arnold --partition=common --qos=1gpu1h --output=username_out.txt --time=1:00 --gres=gpu:titanv:1 nvidia-smi -L
     $ cat /results/username_out.txt

     GPU 0: TITAN V (UUID: GPU-6426f3d6-4cec-9167-5035-4e9129551d9b)
     GPU 1: TITAN V (UUID: GPU-bcaaee86-bd21-4735-edc2-d18b5fed40a7)
     GPU 2: TITAN V (UUID: GPU-109e5f3c-c2e8-3a9d-486a-0df29fb6c905)
     GPU 3: TITAN V (UUID: GPU-e3d1f883-02b2-1da6-80e1-32efd4ab7453)
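The same pattern scales to larger requests, as long as the chosen QoS allows them. A sketch, assuming the 16gpu14d QoS used later in this document and a hypothetical train.py script:

    $ srun --partition=common --qos=16gpu14d --time=2:00:00 --gres=gpu:2 python train.py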

Default Time, CPU and Memory values

Each partition has predefined memory, CPU and time values for a submitted job. These are set to fill nodes optimally – please do not change them without a reason.

To fully allocate memory and time within the assigned QoS, please use the flags specified in the table below.

Parameter      Flag
Memory (RAM)   --mem
Time           --time

Warning

The --mem flag should be used with caution! The default values of DefCpuPerGPU and DefMemPerCPU allocate an optimal amount of resources, so it is recommended not to tinker with the --mem flag without a very specific reason.

For example:

    $ srun --partition=common --qos=16gpu14d --output=username_out.txt --time=1-0 --gres=gpu:titanv:1 a_command_to_run
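If you do have a specific reason to override the memory default, --mem can be combined with the same options; the 32G value below is arbitrary and serves only to illustrate the syntax.

    $ srun --partition=common --qos=16gpu14d --mem=32G --output=username_out.txt --time=1-0 --gres=gpu:titanv:1 a_command_to_run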

Using sbatch

Using sbatch involves writing a script with all the details needed for job submission. Required parameters are passed in directives similar to the #define preprocessor lines known from the C language; Slurm uses #SBATCH. In batch mode, defining the --output file is mandatory.

  1. Running a single command.

     $ cat job.sh

     #!/bin/bash
     #
     #SBATCH --job-name=test_job_username
     #SBATCH --partition=common
     #SBATCH --qos=1gpu1d
     #SBATCH --gres=gpu:1
     #SBATCH --time=1-0
     #SBATCH --output=test_job.txt

     nvidia-smi -L

     $ sbatch job.sh

  2. Running a single command with a specific node and a GPU selected.

     $ cat job.sh

     #!/bin/bash
     #
     #SBATCH --job-name=test_job_n
     #SBATCH --partition=research
     #SBATCH --qos=lecturer
     #SBATCH --gres=gpu:rtx2080ti:8
     #SBATCH --output=test_job_n.txt
     #SBATCH --time=3-0
     #SBATCH --nodelist=asusgpu2

     nvidia-smi -L

     $ sbatch job.sh
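After submission, sbatch prints only the assigned job ID; the program output goes to the file given by --output. A minimal follow-up might look like this (the job ID is just an example):

    $ sbatch job.sh
    Submitted batch job 12345
    $ squeue -j 12345       # check the state of the submitted job
    $ cat test_job.txt      # read the output once the job has finished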

Environment variables

By default, Slurm copies (and as a consequence overwrites) all environment variables from the submission node to the compute nodes. Thus, using full paths to binaries or changing the PATH variable is required. For example, let us try to run nvcc without any PATH or script modifications:

$ cat job.sh

#!/bin/bash
#
#SBATCH --job-name=test_job_n
#SBATCH --partition=common
#SBATCH --qos=student
#SBATCH --gres=gpu:rtx2080ti:1
#SBATCH --time=30
#SBATCH --output=/results/test_job_n.txt

nvcc --version

$ sbatch job.sh
$ cat /results/test_job_n.txt

/var/spool/slurm/d/job00124/slurm_script: line 9: nvcc: command not found

Specifying the full path would work:

/usr/local/cuda/bin/nvcc --version
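Alternatively, the PATH change mentioned above can be made inside the script body itself, before the binary is called, for example:

export PATH=/usr/local/cuda/bin:${PATH}
nvcc --version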

We could also add the --export option to a batch script:

$ cat job.sh

#!/bin/bash
#
#SBATCH --job-name=test_job_n
#SBATCH --partition=common
#SBATCH --qos=student
#SBATCH --gres=gpu:rtx2080ti:1
#SBATCH --output=/results/test_job_n.txt
#SBATCH --time=0-8
#SBATCH --export=ALL,PATH="/usr/local/cuda/bin:${PATH}"

nvcc --version

$ sbatch job.sh
$ cat /results/test_job_n.txt

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Please read the --export explanation in the manual: https://slurm.schedmd.com/sbatch.html.
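Among the variants described there, --export can also start the job without the submitting shell's environment or pass only selected variables. A brief sketch of two mutually exclusive directives (see the manual for the exact semantics):

#SBATCH --export=NONE          # do not propagate the submission environment (only SLURM_* variables are set)
#SBATCH --export=PATH,HOME     # export only the listed variables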