Queues and resources
Each entropy user can reserve and use a specific amount of resources, determined by the two most important cluster elements: partitions (called queues) and the QOS (quality of service) assigned to each user on account creation.
Queues (partitions)
In Slurm terminology, a queue (partition) is a named logical set of the
available machines (each machine can be in more than one partition).
Each queue may serve a different purpose, and each user is assigned to at least
one queue, called common. Each partition may define specific restrictions,
for example a limit on the maximum number of GPUs available to each user.
One can see the defined queues by running the sinfo command:
$ sinfo
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
common*    up     14-00:00:0  8      idle   asusgpu[1-6],steven,sylvester
a6000      up     14-00:00:0  1      idle   bruce
a100       up     14-00:00:0  1      idle   a100a
h100       up     14-00:00:0  1      idle   h100a
The NODELIST column shows the servers assigned to each queue. This is the basic
view of the cluster and, as one can see, only basic limits are imposed
on the job length. This is because most limits are defined using QOS (quality of service).
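When the queue list is needed in a script rather than on screen, the sinfo output can be post-processed. The following is a minimal sketch, not part of the cluster tooling: the helper name list_partitions is hypothetical, and it assumes the default sinfo column layout shown above, where the first column is the partition name and the default partition is marked with a trailing asterisk.

```shell
# Hypothetical helper: print each partition name once, reading `sinfo`
# output on stdin. Assumes the default layout (header line first, partition
# name in column 1, default partition marked with a trailing "*").
list_partitions() {
  awk 'NR > 1 {
    name = $1
    # sub() returns 1 when the trailing "*" was present and removed
    is_default = sub(/\*$/, "", name)
    if (!(name in seen)) {
      seen[name] = 1
      suffix = is_default ? " (default)" : ""
      print name suffix
    }
  }'
}

# Usage on the cluster (assumes sinfo is on PATH):
#   sinfo | list_partitions
```

The deduplication matters because sinfo may print one partition on several lines when its nodes are in different states.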
Quality of service (QOS)
A QOS defines a set of limits imposed on each user (it complements the partition limits within a certain hierarchy). Each user is assigned at least one QOS, which defines the user’s capabilities regarding available resources. Each QOS can be used in the context of a specific queue (partition).
Together, a QOS and a queue (plus two other, immutable parameters) form an
association. Associations define the ways a user can use the cluster
by listing all available combinations of queues and QOS values. To display
the associations available to a user, use the entropy_account_info command:
$ entropy_account_info

 ______________
 < Slurm limits >
 --------------
 \ ,-^-.
 \ !oYo!
 \ /./=\.\______
 ## )\/\
 ||-----w||
 || ||

+---------------+-------------------+
| Partition     | Available QoS     |
+---------------+-------------------+
| common        | kmwil_common      |
+---------------+-------------------+
+------------------+----------+----------------------+----------+----------------------+------------+------------------+
| QoS              | GPUs     | Used GPU Minutes     | CPUs     | Used CPU Minutes     | Memory     | Maximum Wall     |
+------------------+----------+----------------------+----------+----------------------+------------+------------------+
| kmwil_common     | 8        | 0 out of 10000       | --       | 1 out of --          | --         | 1-00:00:00       |
+------------------+----------+----------------------+----------+----------------------+------------+------------------+
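An association is what a job request has to match, so a batch script would typically name both the queue and the QOS. The sketch below uses the example values from the output above (partition common, QOS kmwil_common). The job name and the describe_job helper are hypothetical, and the --gres=gpu:1 line assumes GPUs are exposed as a generic resource named gpu, which may differ on this cluster.

```shell
#!/bin/bash
#SBATCH --job-name=qos-demo     # hypothetical job name
#SBATCH --partition=common      # queue from the "Partition" column
#SBATCH --qos=kmwil_common      # QOS from the "Available QoS" column
#SBATCH --gres=gpu:1            # one GPU (assumes a GRES named "gpu")
#SBATCH --time=0-01:00:00       # below the QOS "Maximum Wall" of 1-00:00:00

# Hypothetical helper: report which partition and QOS the job actually got.
# Slurm exports these variables inside a job; outside Slurm they are unset.
describe_job() {
  echo "partition=${SLURM_JOB_PARTITION:-unset} qos=${SLURM_JOB_QOS:-unset}"
}

describe_job
```

Such a script would be submitted with sbatch. Requesting a combination that is not among the user's associations should be expected to fail at submission.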
GPUMinutes
Each user has a number of GPUMinutes available for use on the cluster.
Once this resource is depleted, new jobs won’t be accepted. The limit
is visible in the entropy_account_info command output as GrpTRESMins.
A double dash (--) means that there is currently no limit for a resource.
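To get a feel for how quickly the budget is consumed, one can estimate a job's cost before submitting it. The sketch below assumes that usage is counted as the number of GPUs multiplied by the wall-clock minutes of the job, which is how Slurm generally accounts TRES minutes; the function names are hypothetical and the exact accounting on this cluster may differ.

```shell
# Hypothetical helper: estimated GPU minutes for a job.
# Assumption: cost = number of GPUs * wall time in minutes.
gpu_minutes_needed() {
  gpus=$1
  minutes=$2
  echo $(( gpus * minutes ))
}

# Hypothetical helper: does a planned job fit into the remaining budget?
# Arguments: GPUs, wall minutes, GPU minutes already used, GPU minute limit.
fits_budget() {
  needed=$(gpu_minutes_needed "$1" "$2")
  used=$3
  limit=$4
  if [ $(( used + needed )) -le "$limit" ]; then
    echo "fits"
  else
    echo "exceeds"
  fi
}

# Usage with the example limits above (0 used out of 10000):
#   fits_budget 2 600 0 10000     # 2 GPUs for 10 hours
#   fits_budget 8 1440 0 10000    # 8 GPUs for a full day
```

Under this assumption, a job using all 8 GPUs for the full 1-day maximum wall time would need 8 * 1440 = 11520 GPU minutes, more than the 10000 shown in the example output.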