Offline Install gpu-operator

The AI platform ships with pre-installed driver images for three operating systems: Ubuntu 22.04, Ubuntu 20.04, and CentOS 7.9. The driver version is 535.104.12. It also includes the Toolkit images required for each operating system, so users no longer need to provide offline toolkit images manually.

This page demonstrates installation on the AMD64 architecture with CentOS 7.9 (kernel 3.10.0-1160). If you need to deploy on Red Hat 8.4, refer to Uploading Red Hat gpu-operator Offline Image to the Bootstrap Node Repository and Building Offline Yum Source for Red Hat 8.4.

Prerequisites

  • The kernel versions of the cluster nodes where the gpu-operator is to be deployed must be identical, and the distribution and GPU model of the nodes must be within the scope of the GPU Support Matrix. You can verify this as shown after this list.
  • When installing the gpu-operator, select v23.9.0+2 or above.
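
To check the kernel prerequisite, you can list the kernel version and OS image reported by each node; this is a plain kubectl query and is not specific to this platform:

    kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,OS:.status.nodeInfo.osImage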

Steps

To install the gpu-operator plugin for your cluster, follow these steps:

  1. Log in to the platform, go to Container Management -> Clusters , and click the target cluster to view its details.

  2. On the Helm Charts page, select All Repositories and search for gpu-operator .

  3. Select gpu-operator and click Install .

  4. Configure the installation parameters for gpu-operator based on the instructions below to complete the installation.
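
For reference, the UI flow above roughly corresponds to the following Helm CLI sketch; the repository alias, release name, and namespace are illustrative, and your platform may use an internal chart mirror instead:

    # Add NVIDIA's public chart repository and refresh the index
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    # Confirm the gpu-operator chart and available versions
    helm search repo nvidia/gpu-operator --versions
    # Install into a dedicated namespace
    helm install gpu-operator nvidia/gpu-operator \
        --namespace gpu-operator --create-namespace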

Configure parameters

  • systemOS : Select the operating system of the host. The current options are Ubuntu 22.04, Ubuntu 20.04, CentOS 7.9, and other. Please choose the correct operating system.

Basic information

  • Name : Enter the plugin name.
  • Namespace : Select the namespace for installing the plugin.
  • Version: The version of the plugin. Here, we use version v23.9.0+2 as an example.
  • Failure Deletion: If enabled and the installation fails, the already-installed associated resources are deleted. Enabling this option also enables Ready Wait by default.
  • Ready Wait: When enabled, the application will be marked as successfully installed only when all associated resources are in a ready state.
  • Detailed Logs: When enabled, detailed logs of the installation process will be recorded.

Advanced settings

Operator parameters

  • InitContainer.image : Configure the CUDA image; the recommended default image is nvidia/cuda .
  • InitContainer.repository : Repository where the CUDA image is located; defaults to the nvcr.m.daocloud.io repository.
  • InitContainer.version : Version of the CUDA image; please use the default value.
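
As a sketch, these UI fields correspond to values in the upstream gpu-operator chart and can also be set on the command line; the value paths below are assumptions, since the platform UI may map them differently:

    # Point the CUDA init container at a mirror repository (illustrative values)
    helm upgrade --install gpu-operator nvidia/gpu-operator \
        --namespace gpu-operator \
        --set operator.initContainer.repository=nvcr.m.daocloud.io/nvidia \
        --set operator.initContainer.image=cuda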

Driver parameters

  • Driver.enable : Configure whether to deploy the NVIDIA driver on the node; enabled by default. If you already deployed the NVIDIA driver on the node before installing the gpu-operator, disable this option.
  • Driver.image : Configure the GPU driver image, recommended default image: nvidia/driver .
  • Driver.repository : Repository where the GPU driver image is located, default is nvidia's nvcr.io repository.
  • Driver.usePrecompiled : Enable the precompiled mode to install the driver.
  • Driver.version : Version of the GPU driver image. For offline deployment, use the default value; configuration is only required for online installation. The Driver image version differs across operating systems; for details, refer to Nvidia GPU Driver Versions. Examples of Driver Version for different operating systems are as follows:

    Note

    When using the built-in operating system version, there is no need to modify the image version. For other operating system versions, refer to Uploading Images to the Bootstrap Node Repository. Note that the version number does not need to include the operating system name, such as Ubuntu, CentOS, or Red Hat; if the official image carries an operating system suffix, remove it manually.

    • For Red Hat systems, for example, 525.105.17
    • For Ubuntu systems, for example, 535-5.15.0-1043-nvidia
    • For CentOS systems, for example, 525.147.05
  • Driver.RepoConfig.ConfigMapName : The name of the ConfigMap that records the offline yum repository configuration for the gpu-operator. When using the pre-packaged offline bundle, refer to the corresponding document for your operating system type. A minimal creation example is sketched after this list.
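
As a minimal sketch, the ConfigMap referenced by Driver.RepoConfig.ConfigMapName can be created from a local yum repository definition; the file name, ConfigMap name, and namespace below are illustrative:

    # Package an offline yum .repo file into a ConfigMap for the driver container
    kubectl create configmap local-repo-config \
        --namespace gpu-operator \
        --from-file=./extend.repo
    # Verify the stored repository definition
    kubectl describe configmap local-repo-config --namespace gpu-operator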

Toolkit parameters

  • Toolkit.enable : Enabled by default. This component allows containerd/docker to support running containers that require GPUs.
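
Once the operator is running, you can check that the toolkit component was rolled out. The DaemonSet name below matches the upstream default and the namespace assumes the install namespace used above; both may vary by version:

    # The toolkit runs as a DaemonSet on every GPU node
    kubectl get daemonset nvidia-container-toolkit-daemonset --namespace gpu-operator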

MIG parameters

For detailed configuration methods, refer to Enabling MIG Functionality.

  • MigManager.Config.name : The name of the MIG split configuration file, used to define the MIG (GI, CI) split policy. The default is default-mig-parted-config . For custom parameters, refer to Enabling MIG Functionality.
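
To review the split profiles available in the default configuration, you can inspect the ConfigMap directly; the namespace here assumes the install namespace used above:

    # List the MIG profiles defined in the default split configuration
    kubectl get configmap default-mig-parted-config --namespace gpu-operator -o yaml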

Next Steps

After completing the configuration and creation of the above parameters:

  • If using full-card mode , GPU resources can be used when creating applications.

  • If using vGPU mode , after completing the above configuration and creation, proceed to vGPU Addon Installation.

  • If using MIG mode and you need a specific split specification for individual GPU nodes, label those nodes as shown below; otherwise, GPUs are split according to the default value in MigManager.Config .

    • For single mode, add a label to the nodes as follows:

      kubectl label nodes {node} nvidia.com/mig.config="all-1g.10gb" --overwrite
      
    • For mixed mode, add a label to the nodes as follows:

      kubectl label nodes {node} nvidia.com/mig.config="custom-config" --overwrite
      

    After splitting, applications can use MIG GPU resources.
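
As a quick check after relabeling, you can confirm that the MIG configuration was applied and that MIG resources are exposed on the node. The label and resource naming below follow upstream gpu-operator conventions and are meant as an illustration:

    # mig-manager reports progress via this label (expect "success" when done)
    kubectl get node {node} -o jsonpath="{.metadata.labels['nvidia\.com/mig\.config\.state']}"
    # MIG devices appear as allocatable extended resources, e.g. nvidia.com/mig-1g.10gb
    kubectl describe node {node} | grep -A 10 'Allocatable'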