Queues and resources

Each entropy user can reserve and use a specific amount of resources. That amount is determined by the two most important cluster elements: partitions (called queues) and the QOS (quality of service) assigned to each user on account creation.

Queues (partitions)

In Slurm terminology, a queue (partition) is a named set of machines; a machine can belong to more than one partition. Each queue may serve a different purpose, and every user is assigned to at least one queue, called common. A partition may define its own restrictions, for example a limit on the maximum number of GPUs available to each user.

One can see the defined queues by running the sinfo command:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
common*      up 14-00:00:0      6   idle arnold,asusgpu[1-2,5-6],steven
a6000        up 14-00:00:0      1   idle bruce
a100         up 14-00:00:0      1   idle 4124gs0

The NODELIST column shows the servers assigned to each queue. This is the basic view of the cluster. Apart from the partition time limit in the TIMELIMIT column (14 days here), no other limits are visible in the output of the command. This is because most limits are defined using QOS (quality of service).
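
To inspect a single queue, sinfo can be restricted to one partition and given an explicit column layout. The sketch below is an illustration rather than a site-provided tool: the show_partition helper name is hypothetical, and the --format string only widens the time-limit column so that the value is not cut off as in the listing above. It assumes a POSIX shell and prints a hint when run on a machine without Slurm.

```shell
# Hypothetical helper: show one partition with a wider TIMELIMIT column.
# Format fields: %P partition, %a availability, %l time limit,
# %D node count, %t node state, %N node list.
show_partition() {
    partition="${1:-common}"
    if command -v sinfo >/dev/null 2>&1; then
        sinfo --partition="$partition" --format="%9P %.5a %.11l %.6D %.6t %N"
    else
        echo "sinfo not found: run this on a cluster login node" >&2
        return 1
    fi
}

# Show the default queue; ignore the error when Slurm is not installed.
show_partition common || true
```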

Quality of service (QOS)

A QOS defines a set of limits imposed on a user; it complements the partition limits within a certain hierarchy. Each user is assigned at least one QOS, which determines the resources available to that user.

The QOS defined on the cluster can be displayed using the following command:

clusteradm@asusgpu0:/usr/local/bin$ entropy_show_qos

 __________
< QoS list >
 ----------
  \
   \   \_\_    _/_/
    \      \__/
           (oo)\_______
           (__)\       )\/\
               ||----w |
               ||     ||

      Name                                                Flags                MaxTRESPU          MaxWall
---------- ---------------------------------------------------- ------------------------ ----------------
    normal
   1gpu30m      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         00:30:00
    1gpu1h      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         01:00:00
    1gpu2h      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         02:00:00
    1gpu3h      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         03:00:00
    1gpu4h      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         04:00:00
    ...         ...                                                           ...                ...
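
When comparing QOS entries it can help to express MaxWall in minutes. The helper below is a minimal sketch, not part of the entropy tooling: the function name walltime_to_minutes is made up for illustration, and it assumes the time is written as HH:MM:SS or D-HH:MM:SS, the two forms that appear in the listings on this page. Seconds are rounded up to a full minute.

```shell
# Hypothetical helper: convert a Slurm time value to whole minutes.
# Accepts HH:MM:SS or D-HH:MM:SS; any other form makes it fail.
walltime_to_minutes() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { d = $1; h = $2; m = $3; s = $4 }
        NF == 3 { d = 0;  h = $1; m = $2; s = $3 }
        NF != 3 && NF != 4 { exit 1 }
        {
            total = d * 1440 + h * 60 + m
            if (s + 0 > 0) total++      # round leftover seconds up
            print total
        }'
}

walltime_to_minutes 00:30:00       # MaxWall of 1gpu30m, prints 30
walltime_to_minutes 04:00:00       # MaxWall of 1gpu4h, prints 240
walltime_to_minutes 14-00:00:00    # a 14-day limit, prints 20160
```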

User associations

A QOS and a queue, together with two other immutable parameters, form an association. Associations define the ways a user can use the cluster: they list all available combinations of queue and QOS values. To display the associations available to a user, use the entropy_account_info command:

kmwil@asusgpu0:~$ entropy_account_info
 ______________
< Account Info >
 --------------
  \
   \   \_\_    _/_/
    \      \__/
           (oo)\_______
           (__)\       )\/\
               ||----w |
               ||     ||

# GrpTRESMins is the cumulative limit for the GPU usage.

   Account             User    Partition          QOS          GrpTRESMins
---------- ---------------- ------------ ------------ --------------------
       mim            kmwil       common       3gpu1d       gres/gpu=10000

---

Note

entropy_account_info is the most useful command for determining which resources are available to a user: find your associations with it, then check the limits of each listed QOS using entropy_show_qos.
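
An association is used by requesting its queue and QOS when submitting a job. The batch script below is a sketch based on the example association shown above (partition common, QOS 3gpu1d). The --partition, --qos, --gres and --time options are standard sbatch options, but the GPU count and time limit chosen here are illustrative assumptions and must stay within the limits that entropy_show_qos reports for the chosen QOS.

```shell
#!/bin/sh
# Queue taken from the association.
#SBATCH --partition=common
# QOS taken from the same association.
#SBATCH --qos=3gpu1d
# Assumed GPU count, adjust to your needs.
#SBATCH --gres=gpu:1
# Assumed time limit (HH:MM:SS).
#SBATCH --time=01:00:00

# Slurm sets these variables inside a job; the fallbacks keep the
# script runnable outside the cluster for a dry run.
summary="job=${SLURM_JOB_ID:-none} partition=${SLURM_JOB_PARTITION:-none}"
echo "$summary"
```

Saved under a name such as job.sh (a hypothetical file name), the script would be submitted with sbatch job.sh.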

GPUMinutes

Each user has a number of GPUMinutes available for use on the cluster. Once this resource is depleted, new jobs won’t be accepted. You can check the current usage by running the entropy_usage_report command. The limit is visible in the entropy_account_info command output as GrpTRESMins.

kmwil@asusgpu0:~$ entropy_usage_report

 ______________
< Usage Report >
 --------------
  \
   \   \_\_    _/_/
    \      \__/
           (oo)\_______
           (__)\       )\/\
               ||----w |
               ||     ||

  # Historical GPU and CPU usage report.
  You can run the sreport command to select diffrent filter criteria.
  The most important options are: user start end format.
  See https://slurm.schedmd.com/sreport.html for details.

--------------------------------------------------------------------------------
Top 1 Users 2024-01-05T00:00:00 - 2024-01-07T20:59:59 (248400 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
       Login     Used        TRES Name
------------ -------- ----------------
       kmwil        1         gres/gpu
       kmwil       17              cpu
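
GPU usage is accounted in TRES minutes, so a job holding N GPUs for M minutes consumes N times M GPU minutes. The arithmetic below is a rough sketch of a budget check using the numbers from the listings above (a GrpTRESMins limit of gres/gpu=10000 and 1 GPU minute used). The planned job size is a made-up example, and the calculation assumes that the usage report covers all usage counted against the limit.

```shell
limit=10000          # GrpTRESMins gres/gpu from entropy_account_info
used=1               # gres/gpu minutes from entropy_usage_report
gpus=3               # hypothetical job: 3 GPUs ...
minutes=1440         # ... for one day (24 * 60 minutes)

remaining=$(( limit - used ))
needed=$(( gpus * minutes ))

echo "remaining: $remaining GPU minutes, job needs: $needed"
if [ "$needed" -le "$remaining" ]; then
    echo "the job fits within the budget"
else
    echo "the job exceeds the budget"
fi
```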