Appendix

GPU virtualization and fractional GPU allocation

Backend.AI supports GPU virtualization technology that allows a single physical GPU to be divided and shared by multiple users simultaneously. Therefore, if you want to execute a task that does not require much GPU computing power, you can create a compute session by allocating only a portion of a GPU. The amount of GPU resources that 1 fGPU actually represents may vary from system to system depending on administrator settings. For example, if the administrator has set one physical GPU to be divided into five pieces, 5 fGPU corresponds to 1 physical GPU, or equivalently 1 fGPU to 0.2 physical GPU. If you allocate 1 fGPU when creating a compute session under this setting, the session can utilize streaming multiprocessors (SMs) and GPU memory equivalent to 0.2 physical GPU.
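The fGPU-to-physical conversion above is simple arithmetic. As a quick sketch (the function name is ours for illustration, not a Backend.AI API):

```python
def fgpu_to_physical(fgpu, slices_per_gpu):
    """Fraction of a physical GPU represented by an fGPU amount,
    where slices_per_gpu is the number of pieces the administrator
    divides each physical GPU into."""
    return fgpu / slices_per_gpu

# With one physical GPU divided into five pieces:
# fgpu_to_physical(1, 5) -> 0.2 physical GPU
# fgpu_to_physical(5, 5) -> 1.0 physical GPU
```

On the setup used later in this guide, where one physical GPU is divided into two pieces, `fgpu_to_physical(1, 2)` gives 0.5.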

In this section, we will create a compute session by allocating a fraction of a GPU and then check whether the GPU recognized inside the compute container really corresponds to that fraction of the physical GPU.

First, let’s check the type of physical GPU installed in the host node and the amount of its memory. The GPU node used in this guide is equipped with a GPU with 8 GB of memory, as shown in the following figure. Through the administrator settings, 1 fGPU is set to an amount equivalent to 0.5 physical GPU (that is, 1 physical GPU corresponds to 2 fGPU).

../_images/host_gpu.png

Now let’s go to the Sessions page and create a compute session by allocating 0.5 fGPU as follows:

../_images/session_launch_dialog_with_gpu.png

In the Configuration panel of the session list, you can see that 0.5 fGPU is allocated.

../_images/session_list_with_gpu.png

Now, let’s connect directly to the container and check whether the allocated GPU memory really corresponds to 0.5 fGPU (~2 GB). Bring up a web terminal, and when it appears, run the nvidia-smi command. As shown in the following figure, about 2 GB of GPU memory is allocated. This confirms that a quarter of the physical GPU is actually carved out and allocated inside the container for this compute session, which is not possible with approaches like PCI passthrough.

../_images/nvidia_smi_inside_container.png
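The ~2 GB figure can be double-checked with simple arithmetic (the function and parameter names are illustrative, not a Backend.AI API):

```python
def expected_gpu_memory_gb(fgpu_allocated, physical_gpu_per_fgpu, card_memory_gb):
    """GPU memory a container should see for a fractional allocation."""
    return fgpu_allocated * physical_gpu_per_fgpu * card_memory_gb

# 0.5 fGPU, where 1 fGPU = 0.5 physical GPU, on an 8 GB card:
# expected_gpu_memory_gb(0.5, 0.5, 8) -> 2.0 GB
```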

Let’s open up a Jupyter Notebook and run a simple ML training code.

../_images/mnist_train.png

While training is in progress, connect to the shell of the GPU host node and execute the nvidia-smi command. You can see that one GPU is attached to the process and that the process is occupying about 25% of the resources of the physical GPU. (GPU occupancy can vary greatly depending on the training code and GPU model.)

../_images/host_nvidia_smi.png

Alternatively, you can run the nvidia-smi command from the web terminal to query the GPU usage history inside the container.

Automated job scheduling

The Backend.AI server has a built-in, self-developed job scheduler. It automatically checks the available resources of all worker nodes and delegates each compute session creation request to a worker that meets the user’s resource requirements. When resources are insufficient, the request is registered in the job queue in the PENDING state. Later, when resources become available again, the pending request is resumed and the compute session is created.
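The queueing behavior just described can be modeled in a few lines. This is a simplified illustrative sketch, not Backend.AI’s actual scheduler implementation:

```python
from collections import deque

class JobScheduler:
    """Illustrative FIFO job queue with resource-aware scheduling."""

    def __init__(self, total_fgpu):
        self.free_fgpu = total_fgpu
        self.pending = deque()   # queued (name, fgpu) requests
        self.running = {}        # session name -> allocated fGPU

    def request_session(self, name, fgpu):
        # Every request enters the queue first, then we try to schedule.
        self.pending.append((name, fgpu))
        self._schedule()

    def destroy_session(self, name):
        # Freeing resources may let pending requests start.
        self.free_fgpu += self.running.pop(name)
        self._schedule()

    def status(self, name):
        if name in self.running:
            return "RUNNING"
        if any(n == name for n, _ in self.pending):
            return "PENDING"
        return "TERMINATED"

    def _schedule(self):
        # Start queued requests in FIFO order while resources suffice.
        while self.pending and self.pending[0][1] <= self.free_fgpu:
            name, fgpu = self.pending.popleft()
            self.free_fgpu -= fgpu
            self.running[name] = fgpu
```

With 2 fGPUs total and three requests of 1 fGPU each, two sessions end up RUNNING and one PENDING; destroying a running session promotes the pending one, which is exactly the behavior demonstrated in the GUI below.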

You can observe the job scheduler in action from the user GUI console. Assuming the GPU host can allocate up to 2 fGPUs, let’s create 3 compute sessions at the same time, each requesting 1 fGPU. In the Custom allocation section of the session launch dialog, there are GPU and Sessions sliders. If you set Sessions to a value greater than 1 and click the LAUNCH button, that number of sessions will be requested at the same time. Let’s set GPU and Sessions to 1 and 3, respectively. This creates a situation where 3 sessions requesting a total of 3 fGPUs are launched when only 2 fGPUs exist.

../_images/session_launch_dialog_3_sessions.png

Wait for a while and you will see three compute sessions listed. If you look closely at the Status panel, you can see that two of the three compute sessions are in the RUNNING state, while the other remains in the PENDING state. The PENDING session is only registered in the job queue; it has not actually been allocated a container due to insufficient GPU resources.

../_images/pending_session_list.png

Now let’s destroy one of the two sessions in the RUNNING state. You can then see that the compute session in the PENDING state is allocated resources by the job scheduler and soon transitions to the RUNNING state. In this way, the job scheduler uses the job queue to hold users’ compute session requests and automatically processes them when resources become available.

../_images/pending_to_running.png

Multi-version machine learning container support

Backend.AI provides various pre-built ML and HPC kernel images, so users can immediately use major libraries and packages without installing them manually. Here, we’ll walk through an example that takes advantage of multiple versions of multiple ML libraries right away.

Go to the Sessions page and open the session launch dialog. There may be various kernel images depending on the installation settings.

../_images/various_kernel_images.png

Here, let’s select the TensorFlow 2.2 environment and create a session.

../_images/session_launch_dialog_tf22.png

Open the web terminal of the created session and run the following Python command. You can see that TensorFlow version 2.2 is installed.

../_images/tf22_version_print.png
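The screenshots show the usual Python version check. A self-contained sketch that also handles a missing library gracefully (the helper name is ours, not from the guide) works the same way in every environment below:

```python
import importlib

def library_version(name):
    """Return a library's __version__ string, or None if it is not installed."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

# Inside the TensorFlow 2.2 kernel image, this should report a 2.2.x version:
# print(library_version("tensorflow"))
```

The one-liner equivalent from a shell is `python -c 'import tensorflow as tf; print(tf.__version__)'`.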

This time, let’s select the TensorFlow 1.13 environment to create a compute session. If resources are insufficient, delete the previous session.

../_images/session_launch_dialog_tf113.png

Open the web terminal of the created session and run the same Python command as before. You can see that TensorFlow version 1.13(.1) is installed.

../_images/tf113_version_print.png

Finally, create a compute session using PyTorch version 1.5.

../_images/session_launch_dialog_pytorch15.png

Open the web terminal of the created session and run the following Python command. You can see that PyTorch version 1.5 is installed.

../_images/pytorch15_version_print.png

Like this, you can utilize various versions of major libraries such as TensorFlow and PyTorch through Backend.AI without unnecessary effort to install them.

Backend.AI Server Installation Guide

For the Backend.AI server daemons/services, the following hardware specifications should be met. For optimal performance, simply double each amount.

  • Manager: 2 cores, 4 GiB memory
  • Agent: 4 cores, 32 GiB memory, NVIDIA GPU (for GPU workload), > 512 GiB SSD
  • Console-Server: 2 cores, 4 GiB memory
  • WSProxy: 2 cores, 4 GiB memory
  • PostgreSQL DB: 2 cores, 4 GiB memory
  • Redis: 1 core, 2 GiB memory
  • Etcd: 1 core, 2 GiB memory

The essential host dependent packages that must be pre-installed before installing each service are:

  • GUI console: Operating system that can run the latest browsers (Windows, Mac OS, Ubuntu, etc.)
  • Manager: Python (≥3.8), pyenv/pyenv-virtualenv (≥1.2)
  • Agent: docker (≥19.03), CUDA/CUDA Toolkit (≥8, 11 recommended), nvidia-docker v2, Python (≥3.8), pyenv/pyenv-virtualenv (≥1.2)
  • Console-Server: Python (≥3.8), pyenv/pyenv-virtualenv (≥1.2)
  • WSProxy: docker (≥19.03), docker-compose (≥1.24)
  • PostgreSQL DB: docker (≥19.03), docker-compose (≥1.24)
  • Redis: docker (≥19.03), docker-compose (≥1.24)
  • Etcd: docker (≥19.03), docker-compose (≥1.24)

For the Enterprise version, Backend.AI server daemons are installed by the Lablup support team, and the following materials/services are provided after the initial installation:

  • DVD 1 (includes Backend.AI packages)
  • User GUI Guide manual
  • Admin GUI Guide manual
  • Installation report
  • First-time user/admin on-site tutorial (3-5 hours)

Product maintenance and support information: the commercial contract includes a monthly/annual subscription fee for the Enterprise version by default. Initial user/administrator training (1-2 times) and phone/online customer support are provided for about 2 weeks after the initial installation, and minor release update support and customer support through online channels are provided for 3-6 months. Maintenance and support services provided afterwards may differ depending on the terms of the contract.

Users of the open source version can also purchase an installation and support plan separately.

Backend.AI Server Management Guide

Backend.AI is composed of many modules and daemons. Here, we briefly describe each service and provide a basic maintenance guide in case of a specific service failure. Note that the maintenance operations described here are generally applicable but may differ depending on host-specific installation details.

Manager

Gateway server that accepts and handles every user request. If a request is related to a compute session (container), the Manager delegates it to the appropriate Agent and/or the containers on each Agent.

# check status
sudo systemctl status backendai-manager
# start service
sudo systemctl start backendai-manager
# stop service
sudo systemctl stop backendai-manager
# restart service
sudo systemctl restart backendai-manager
# see logs
sudo journalctl --output cat -u backendai-manager

Agent

Worker node which manages the lifecycle of compute sessions (containers).

# check status
sudo systemctl status backendai-agent
# start service
sudo systemctl start backendai-agent
# stop service
sudo systemctl stop backendai-agent
# restart service
sudo systemctl restart backendai-agent
# see logs
sudo journalctl --output cat -u backendai-agent

Console-Server

Serves user GUI Console and provides authentication by email and password.

# check status
sudo systemctl status backendai-console-server
# start service
sudo systemctl start backendai-console-server
# stop service
sudo systemctl stop backendai-console-server
# restart service
sudo systemctl restart backendai-console-server
# see logs
sudo journalctl --output cat -u backendai-console-server

WSProxy

Proxies the connections between user-created web apps (such as the web terminal and Jupyter Notebook) and the Manager, which are then relayed to a specific compute session (container).

cd /home/lablup/halfstack
# check status
docker-compose -f docker-compose.wsproxy-simple.yaml -p <project> ps
# start service
docker-compose -f docker-compose.wsproxy-simple.yaml -p <project> up -d
# stop service
docker-compose -f docker-compose.wsproxy-simple.yaml -p <project> down
# restart service
docker-compose -f docker-compose.wsproxy-simple.yaml -p <project> restart
# see logs
docker-compose -f docker-compose.wsproxy-simple.yaml -p <project> logs

PostgreSQL DB

Database for Manager.

cd /home/lablup/halfstack
# check status
docker-compose -f docker-compose.hs.postgres.yaml -p <project> ps
# start service
docker-compose -f docker-compose.hs.postgres.yaml -p <project> up -d
# stop service
docker-compose -f docker-compose.hs.postgres.yaml -p <project> down
# restart service
docker-compose -f docker-compose.hs.postgres.yaml -p <project> restart
# see logs
docker-compose -f docker-compose.hs.postgres.yaml -p <project> logs

To back up the DB data, you can use the following commands from the DB host. The specific commands may vary depending on the configuration.

# query postgresql container's ID
docker ps | grep halfstack-db
# Connect to the postgresql container via bash
docker exec -it <postgresql-container-id> bash
# Backup DB data. PGPASSWORD may vary depending on the system configuration
PGPASSWORD=develove pg_dumpall -U postgres > /var/lib/postgresql/backup_db_data.sql
# Exit container
exit

To restore the DB from the backup data, you can execute the following commands. Specific options may vary depending on the configuration.

# query postgresql container's ID
docker ps | grep halfstack-db
# Connect to the postgresql container via bash
docker exec -it <postgresql-container-id> bash
# Disconnect all connections first, for safety
psql -U postgres
postgres=# SELECT pg_terminate_backend(pg_stat_activity.pid)
postgres-# FROM pg_stat_activity
postgres-# WHERE pg_stat_activity.datname = 'backend'
postgres-# AND pid <> pg_backend_pid();
# Drop the previous database so the restore starts from a clean state
postgres=# DROP DATABASE backend;
postgres=# \q
# Restore from data
psql -U postgres < backup_db_data.sql

Redis

Cache server used to collect per-session and per-agent usage statistics and to relay heartbeat signals from Agents to the Manager. It also keeps users’ authentication session information.

cd /home/lablup/halfstack
# check status
docker-compose -f docker-compose.hs.redis.yaml -p <project> ps
# start service
docker-compose -f docker-compose.hs.redis.yaml -p <project> up -d
# stop service
docker-compose -f docker-compose.hs.redis.yaml -p <project> down
# restart service
docker-compose -f docker-compose.hs.redis.yaml -p <project> restart
# see logs
docker-compose -f docker-compose.hs.redis.yaml -p <project> logs

Usually, Redis data do not need to be backed up since they contain only temporary cached data, such as users’ login session information and per-container live statistics.

Etcd

Config server, which contains Backend.AI system-wide configuration.

cd /home/lablup/halfstack
# check status
docker-compose -f docker-compose.hs.etcd.yaml -p <project> ps
# start service
docker-compose -f docker-compose.hs.etcd.yaml -p <project> up -d
# stop service
docker-compose -f docker-compose.hs.etcd.yaml -p <project> down
# restart service
docker-compose -f docker-compose.hs.etcd.yaml -p <project> restart
# see logs
docker-compose -f docker-compose.hs.etcd.yaml -p <project> logs

To back up the Etcd config data used by the Manager, go to the folder where the Manager is installed and use the following command.

cd /home/lablup/manager  # paths may vary
backend.ai mgr etcd get --prefix '' > etcd_backup.json

To restore Etcd settings from the backup data, you can run a command like this.

cd /home/lablup/manager  # paths may vary
backend.ai mgr etcd put-json '' etcd_backup.json