Queues and resources
Each entropy user can reserve and use a specific amount of resources. That amount is determined by two cluster elements: partitions (called queues) and the QOS (quality of service) assigned to each user on account creation.
Queues (partitions)
In Slurm terminology, a queue (partition) is a logical grouping of the available
machines into named sets (a machine can belong to more than one partition).
Each queue may serve a different purpose, and each user is assigned to at least
one queue called common. A partition may define specific restrictions,
for example a limit on the maximum number of GPUs available to each user.
One can see the defined queues by running the sinfo command:
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
common*      up 14-00:00:0      6   idle arnold,asusgpu[1-2,5-6],steven
a6000        up 14-00:00:0      1   idle bruce
a100         up 14-00:00:0      1   idle 4124gs0
The NODELIST column shows the servers assigned to each queue. This is the basic
view of the cluster. Apart from the partition-wide TIMELIMIT column, no per-user
limits are visible in the output of the command. This is because most limits
are defined using QOS (quality of service).
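In sinfo output, the partition whose name carries a trailing asterisk is the default one, used when a job does not name a partition. As a minimal sketch, the helper below (default_partition is a hypothetical name, not a cluster command) extracts that name from sinfo output read on standard input:

```shell
#!/bin/sh
# Hypothetical helper: read `sinfo` output on stdin and print the name of
# the default partition, i.e. the one marked with a trailing '*'.
default_partition() {
    awk 'NR > 1 && $1 ~ /\*$/ { sub(/\*$/, "", $1); print $1; exit }'
}

# Only query Slurm where sinfo is actually installed (e.g. a login node).
if command -v sinfo >/dev/null 2>&1; then
    sinfo | default_partition
fi
```

Fed the listing above, the helper would print common.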
Quality of service (QOS)
A QOS defines a set of limits imposed on a user (it complements the partition limits according to a defined hierarchy). Each user is assigned at least one QOS, which determines the resources that user can request.
The QOS defined in the cluster can be displayed using the following command:
clusteradm@asusgpu0:/usr/local/bin$ entropy_show_qos

 __________
< QoS list >
 ----------
  \
   \   \_\_    _/_/
    \      \__/
           (oo)\_______
           (__)\       )\/\
               ||----w |
               ||     ||

      Name                                                Flags                MaxTRESPU          MaxWall
---------- ---------------------------------------------------- ------------------------ ----------------
    normal
   1gpu30m      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         00:30:00
    1gpu1h      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         01:00:00
    1gpu2h      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         02:00:00
    1gpu3h      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         03:00:00
    1gpu4h      DenyOnLimit,OverPartQOS,NoDecay,UsageFactorSafe               gres/gpu=1         04:00:00
       ...                                                  ...                      ...              ...
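The MaxWall column uses Slurm's HH:MM:SS notation (D-HH:MM:SS for limits of a day or more). To compare a planned run time against a QOS, it helps to convert that value to minutes. A minimal sketch follows; walltime_to_minutes and fits_qos are hypothetical helper names, and seconds are simply dropped:

```shell
#!/bin/sh
# Hypothetical helper: convert a Slurm time limit (HH:MM:SS or D-HH:MM:SS)
# to whole minutes. The seconds field is ignored.
walltime_to_minutes() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 1440 + $2 * 60 + $3; next }  # D-HH:MM:SS
        NF == 3 { print $1 * 60 + $2; next }              # HH:MM:SS
        { exit 1 }                                        # unknown format
    '
}

# Hypothetical helper: succeed if a job of $1 minutes fits within MaxWall $2.
fits_qos() {
    [ "$1" -le "$(walltime_to_minutes "$2")" ]
}

# Example: does a 45-minute job fit into the 1gpu30m and 1gpu1h limits?
fits_qos 45 00:30:00 && echo "fits 1gpu30m" || echo "too long for 1gpu30m"
fits_qos 45 01:00:00 && echo "fits 1gpu1h"  || echo "too long for 1gpu1h"
```

With the limits listed above, a 45-minute job is too long for 1gpu30m but fits 1gpu1h.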
User associations
A QOS and a queue (together with two other, immutable parameters) form an
association. Associations define the ways a user can use the cluster
by listing all available combinations of queue and QOS values. To display
the associations available to a user, use the entropy_account_info command:
kmwil@asusgpu0:~$ entropy_account_info
 ______________
< Account Info >
 --------------
  \
   \   \_\_    _/_/
    \      \__/
           (oo)\_______
           (__)\       )\/\
               ||----w |
               ||     ||

# GrpTRESMins is the cumulative limit for the GPU usage.

   Account             User    Partition          QOS          GrpTRESMins
---------- ---------------- ------------ ------------ --------------------
       mim            kmwil       common       3gpu1d       gres/gpu=10000

---
Note
entropy_account_info is the most useful command for determining which
resources are available to a user: find your associations, then check the
corresponding limits using entropy_show_qos.
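An association is typically put to use by passing its partition and QOS to srun or sbatch. The sketch below only prints such a command line rather than running it; build_srun_cmd is a hypothetical helper, while --partition, --qos and --gres are standard Slurm options. The values come from the sample association above and should be replaced with your own:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an srun command line for a
# partition/QOS pair. Arguments: partition, QOS, GPU count, then the command.
build_srun_cmd() {
    partition=$1
    qos=$2
    gpus=$3
    shift 3
    echo "srun --partition=$partition --qos=$qos --gres=gpu:$gpus $*"
}

# Partition and QOS taken from the sample association; use your own values.
build_srun_cmd common 3gpu1d 1 hostname
```

Printing the command first makes it easy to check that the partition and QOS pair matches one of your associations before submitting anything.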
GPUMinutes
Each user has a number of GPUMinutes available for use on the cluster.
Once this resource is depleted, new jobs won't be accepted. You can check
the current usage by running the entropy_usage_report command. The limit
is visible in the entropy_account_info command output as GrpTRESMins.
kmwil@asusgpu0:~$ entropy_usage_report

 ______________
< Usage Report >
 --------------
  \
   \   \_\_    _/_/
    \      \__/
           (oo)\_______
           (__)\       )\/\
               ||----w |
               ||     ||

 # Historical GPU and CPU usage report.
 You can run the sreport command to select diffrent filter criteria.
 The most important options are: user start end format.
 See https://slurm.schedmd.com/sreport.html for details.

--------------------------------------------------------------------------------
Top 1 Users 2024-01-05T00:00:00 - 2024-01-07T20:59:59 (248400 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
       Login     Used        TRES Name
------------ -------- ----------------
       kmwil        1         gres/gpu
       kmwil       17              cpu
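Both the limit and the usage are expressed in TRES minutes, so a rough budget can be worked out by hand. The sketch below assumes a QOS usage factor of 1 (a job holding N GPUs for M minutes is charged N × M GPU minutes) and that the report window covers all of your usage; otherwise the result is only an estimate. The function names are hypothetical, and the numbers come from the sample outputs above:

```shell
#!/bin/sh
# Hypothetical helper: GPU minutes a job would consume (GPUs x minutes),
# assuming a usage factor of 1.
job_gpu_minutes() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: GPU minutes left (GrpTRESMins limit minus usage).
gpu_minutes_left() {
    echo $(( $1 - $2 ))
}

# Sample values: limit gres/gpu=10000, 1 GPU minute used so far.
left=$(gpu_minutes_left 10000 1)
cost=$(job_gpu_minutes 1 240)   # one GPU for four hours
echo "left=$left cost=$cost"
```

With the sample numbers, 9999 GPU minutes remain and a four-hour single-GPU job would use 240 of them.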