
Aura

The server aura.fi.muni.cz is available to FI staff and PhD students for longer, more demanding, or GPU computations. For study or research purposes, FI staff may request that access be granted to other persons via unix@fi.muni.cz. Students are referred to the server Adonis, to which they have access automatically.

Hardware configuration

The Aura server is built on the Asus RS720A-E11-RS24U platform in the following configuration:

  • two 64-core AMD EPYC 7713 2.0 GHz processors (128 physical cores and 256 threads in total)
  • 2 TiB DDR4 RAM 3200 MHz
  • 10 Gbps Ethernet connection
  • 2 SATA SSDs with a capacity of 960 GB in RAID 1
  • 2 NVMe disks with a capacity of 6 TB connected in RAID 1

See also the blog post introducing this server.

How to work on the computational servers

Run long-running processes (an hour or more) with reduced priority (in the range 10-19, where 19 is the lowest), for example nice ./your_program or nice -n 15 ./your_program .

You can change the priority of an already running process with the renice command, but be aware that a process may run multiple threads, and renicing a single process ID may change the priority of only one thread. You can list all threads of your processes, including their priority, as follows:

ps x -Lo pgid,pid,tid,user,nice,tty,time,args
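
Since on Linux renice also accepts thread IDs in place of process IDs, all threads of a process can be lowered at once; a minimal sketch (the PID 12345 is illustrative):

# renice every thread of process 12345 to priority 15
ps -Lo tid= -p 12345 | xargs renice -n 15 -p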

Short-running processes and interactive debugging of your programs may be run with normal priority.

If your process does not comply with the priority rule and consumes a lot of computing power, all your processes will be set to the lowest priority (19) so that other users are not restricted. Repeated or more serious violations of this rule may result in the suspension of your faculty account.

Memory limit using systemd

The upper limit on memory usage in the system can be found with the command below. When this limit is exceeded, the OOM (out-of-memory) mechanism steps in and attempts to terminate the offending process.

systemctl show -p MemoryMax user-$UID.slice

However, you can create your own systemd scope in which a stricter (lower) memory limit can be set:

systemd-run --user --scope -p MemoryMax=9G program

The program can also be a shell (e.g. bash); the memory limit then applies to it and all its descendants together. This differs from the ulimit mechanism, where the limits apply to each process separately.
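
For example, an entire interactive session can be capped this way (a sketch assuming cgroups v2; the 4G value is illustrative):

# start a shell whose whole process tree shares a 4 GiB memory cap
systemd-run --user --scope -p MemoryMax=4G bash
# inside that shell, the effective limit can be read from the cgroup:
cat /sys/fs/cgroup/$(cut -d: -f3 /proc/self/cgroup)/memory.max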

For monitoring both a created scope and your user slice as a whole, the following may come in handy:

# monitoring of the memory and CPU usage of your processes
systemd-cgtop /user.slice/user-$UID.slice

Resource constraints using ulimit

Resource limit commands:

# limit available resources
help ulimit
# cap the size of virtual memory to 20000 kB
ulimit -v 20000
# cap the amount of total CPU time to 3600 seconds
ulimit -t 3600
# cap the number of concurrently running threads/processes
ulimit -u 100

The above commands limit the resources of the shell and all its descendants to the specified values. The limits cannot be raised again; to get an environment without them, you need to start another, separate shell. Be aware that resources set using ulimit apply to each process separately: if you set the memory limit to 20 MB and run 10 processes in such an environment, they can allocate 200 MB of memory in total. If you want to limit the total memory to 20 MB, use systemd-run instead.
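
The per-process behaviour can be illustrated as follows (a sketch; ./worker stands for a hypothetical program):

# the shell and each of the 10 workers get a separate 20000 kB cap,
# so together the workers may still use roughly ten times that amount
ulimit -v 20000
for i in $(seq 10); do ./worker & done
wait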

Specific software

If you need to install libraries or tools for your work, you have (apart from local compilation) several options:

  • if they are part of the distribution (dnf search software-name), you can ask the administrators to install them,
  • you can build a module yourself,
  • for Python packages, you can ask the administrators to install them into the python3 module. You can also install them locally using pip/pip3 install --user . When using virtualenv, conda, etc., we recommend installing the environment into /var/tmp/login (note the limited lifetime of files described below); a sketch follows this list.
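
A virtual environment on the fast local volume might be set up like this (a sketch; we assume your login is available as $USER, and numpy stands for any package):

# create a virtual environment under /var/tmp/<login>
mkdir -p /var/tmp/"$USER"
python3 -m venv /var/tmp/"$USER"/venv
source /var/tmp/"$USER"/venv/bin/activate
pip install numpy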

Disk capacity

There are two directories on the Aura server for temporary data that needs to be quickly accessible locally.

  • The directory /tmp is of type tmpfs. Thanks to its location in RAM, access is very fast, but the data does not persist between server reboots and the capacity is very small.
  • The directory /var/tmp is located on a fast NVMe RAID 1 volume.

To use this space, store your data in a directory named after your login. Data that is not accessed (according to atime) is deleted automatically: in /tmp after several days, in /var/tmp after several months (the exact settings can be found in the file /etc/tmpfiles.d/tmp.conf). Disk quotas do not apply here; however, be considerate of others in your use of the space.

GPU calculations

The Aura server has two GPU cards, namely NVIDIA A100 80 GB PCIe.

Warning: Aura is the first server managed by CVT that is also equipped with GPU cards, so it is possible that not everything will work smoothly from the beginning. If you have any comments or suggestions, do not hesitate to write to us.

GPU computations on Aura are currently not limited by the system, so it is necessary to be considerate of others.

The GPU cards are partitioned using MIG (Multi-Instance GPU) technology, which makes it possible to have several mutually isolated virtual GPUs (instances).

Before starting a computation, we must choose a free MIG instance. From the output of nvidia-smi we can find out which instance IDs exist and which of them are running computations.

In the example below we see that there is 1 MIG instance on GPU 0 and there are 3 MIG instances on GPU 1. Looking at Processes: we see that a computation is running on the single instance of GPU 0 (GI ID 0). On GPU 1, a computation is running on instance 3, so the instances with GI IDs 2 and 4 are free. We therefore choose the instance with GI ID 4 on GPU 1 for our computation, which in this case has MIG Dev 2.

[user@aura ~]$ nvidia-smi
Mon Mar 21 15:34:36 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:21:00.0 Off |                   On |
| N/A   43C    P0    65W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:61:00.0 Off |                   On |
| N/A   41C    P0    70W / 300W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |   2406MiB / 81069MiB | 98      0 |  7   0    5    1    1 |
|                  |      5MiB / 13107... |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    2   0   0  |     19MiB / 40192MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    3   0   1  |   2080MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      2MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    4   0   2  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    0    0     235532      C   python3                          2403MiB |
|    1    3    0     235740      C   python3                          2061MiB |
+-----------------------------------------------------------------------------+

Now we use the command nvidia-smi -L to find out the UUID. We want Device 2 on GPU 1, that is, MIG-64f11db9-b10b-5dd9-97d1-3c46450b9388 :

[user@aura ~]$ nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-309d72fd-b4f8-d6e8-6a66-e3f2253e8540)
  MIG 7g.80gb     Device  0: (UUID: MIG-a5459e6a-b26d-5985-874c-528458a7728b)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-04712e69-7356-4de5-f983-84083131460e)
  MIG 3g.40gb     Device  0: (UUID: MIG-4f7fbfb7-a8a2-553d-875a-d9d56baf97a5)
  MIG 2g.20gb     Device  1: (UUID: MIG-bad562c5-744a-5742-be1a-c735195c52d0)
  MIG 2g.20gb     Device  2: (UUID: MIG-64f11db9-b10b-5dd9-97d1-3c46450b9388)

Now we set the environment variable CUDA_VISIBLE_DEVICES and we can start computing.

[user@aura ~]$ export CUDA_VISIBLE_DEVICES=MIG-64f11db9-b10b-5dd9-97d1-3c46450b9388
[user@aura ~]$ python main.py
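
The variable can also be set for a single command only:

[user@aura ~]$ CUDA_VISIBLE_DEVICES=MIG-64f11db9-b10b-5dd9-97d1-3c46450b9388 python main.py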

We can monitor the computation either with the command nvidia-smi or with the graphical tool nvitop . (Note: due to the use of MIG, monitoring tools cannot display GPU utilization, only used memory.)
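
If nvitop is not available system-wide, it can usually be installed locally (a sketch using the --user installation mentioned above):

pip3 install --user nvitop
nvitop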

If the existing configuration of the MIG instances does not suit you, it can be changed by arrangement via unix@fi.muni.cz, provided the circumstances (other running computations) reasonably allow it.