Queues and resources

Each entropy user can reserve and use a specific amount of resources, determined by the two most important cluster elements: partitions (called queues) and the QOS (quality of service) assigned to each user on account creation.

Queues (partitions)

In Slurm lingo, a queue (partition) is a logical grouping of the available machines into named sets (a machine can belong to more than one partition). Each queue may serve a different purpose, and each user is assigned to at least one queue, called common. A partition may define its own restrictions, for example a limit on the maximum number of GPUs available to each user.

One can see the defined queues by running the sinfo command:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
common*      up 14-00:00:0      8   idle asusgpu[1-6],steven,sylvester
a6000        up 14-00:00:0      1   idle bruce
a100         up 14-00:00:0      1   idle a100a
h100         up 14-00:00:0      1   idle h100a

The NODELIST column shows the servers assigned to each queue. This is the basic view of the cluster and, as one can see, only a basic limit on job length (TIMELIMIT) is imposed at this level. This is because most limits are defined using QOS (quality of service).
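As a small illustration of reading this listing, the sketch below extracts the partition names from sinfo output. The helper name list_partitions is hypothetical, and the sample text is copied from the listing above so that the snippet also runs away from the cluster. It assumes the default column layout, in which the partition name comes first and a trailing asterisk marks the default partition.

```shell
# Print each partition name once, given `sinfo` output on standard input.
# Skips the header row and strips the trailing "*" of the default partition.
list_partitions() {
    awk 'NR > 1 && NF { sub(/\*$/, "", $1); print $1 }' | LC_ALL=C sort -u
}

# Sample copied from the listing above.
sample='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
common*      up 14-00:00:0      8   idle asusgpu[1-6],steven,sylvester
a6000        up 14-00:00:0      1   idle bruce
a100         up 14-00:00:0      1   idle a100a
h100         up 14-00:00:0      1   idle h100a'

printf '%s\n' "$sample" | list_partitions
```

On the cluster itself, the same helper can presumably be fed directly with: sinfo | list_partitions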

Quality of service (QOS)

The QOS defines a set of limits imposed on each user (it complements the partition limits in a certain hierarchy). Each user is assigned at least one QOS, which defines the user’s capabilities regarding available resources. Each QOS can be used in the context of a specific queue (partition).

A QOS and a queue (together with two other, immutable parameters) form an association. Associations define the ways a user can use the cluster by listing all available combinations of queues and QOS values. To display the associations available to a user, use the entropy_account_info command.

$ entropy_account_info

   ______________
  < Slurm limits >
   --------------
          \    ,-^-.
           \   !oYo!
            \ /./=\.\______
                 ##        )\/\
                  ||-----w||
                  ||      ||

+---------------+-------------------+
| Partition     | Available QoS     |
+---------------+-------------------+
| common        | kmwil_common      |
+---------------+-------------------+
+------------------+----------+----------------------+----------+----------------------+------------+------------------+
| QoS              | GPUs     | Used GPU Minutes     | CPUs     | Used CPU Minutes     | Memory     | Maximum Wall     |
+------------------+----------+----------------------+----------+----------------------+------------+------------------+
| kmwil_common     | 8        | 0 out of 10000       | --       | 1 out of --          | --         | 1-00:00:00       |
+------------------+----------+----------------------+----------+----------------------+------------+------------------+
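To show how a partition and QOS pair from this output might be used, here is a minimal batch script sketch. The --partition, --qos, --gres and --time options are standard sbatch options, but the job name, the single-GPU request and the one-hour time limit are illustrative assumptions, as is the hypothetical job_summary helper. The values common and kmwil_common come from the example output above and will differ per user.

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=common
#SBATCH --qos=kmwil_common
#SBATCH --gres=gpu:1
#SBATCH --time=0-01:00:00

# Slurm sets these variables inside a job; the "none" fallbacks only apply
# when the script is run outside Slurm (the #SBATCH lines are then plain comments).
job_summary() {
    echo "partition=${SLURM_JOB_PARTITION:-none} qos=${SLURM_JOB_QOS:-none}"
}

job_summary
```

Saved under a name such as job.sh (a placeholder), the script would be submitted with sbatch job.sh. The requested time stays below the Maximum Wall value of 1-00:00:00 shown for kmwil_common.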

GPUMinutes

Each user has a number of GPUMinutes available for use on the cluster. Once this resource is depleted, new jobs won’t be accepted. The limit corresponds to the Slurm GrpTRESMins setting and is visible in the entropy_account_info command output (the Used GPU Minutes column, 0 out of 10000 in the example above).

A double dash (--) means that there is currently no limit for that resource.
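To get a feel for how quickly the GPUMinutes budget is used up, the sketch below estimates the cost of a job. It assumes the usual Slurm accounting, in which the charge is the number of allocated GPUs multiplied by the elapsed wall-clock minutes; whether this cluster applies any additional weighting or usage decay is not stated here. The gpu_minutes helper is hypothetical, and the limit of 10000 is taken from the example output above.

```shell
# GPU minutes a job would consume: allocated GPUs times wall-clock minutes.
# Arguments: <number of GPUs> <minutes>
gpu_minutes() {
    echo $(( $1 * $2 ))
}

limit=10000   # from "0 out of 10000" in the example output
used=0

# Example: a job holding 2 GPUs for one full day (24 * 60 minutes).
cost=$(gpu_minutes 2 $(( 24 * 60 )))
echo "job cost: ${cost}, remaining after job: $(( limit - used - cost ))"
```

Under these assumptions, a one-day job on 2 GPUs costs 2880 GPU minutes, so the example budget would cover a little over three such jobs.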