
anaqvi
on 22 October 2019

NVIDIA GPU Operator – Simplifying AI/ML Deployments on the Canonical Platform


Leveraging Kubernetes for AI deployments is becoming increasingly popular. Chances are that if your business is involved in AI/ML with Kubernetes, you are using tools like Kubeflow to reduce complexity, costs and deployment time. If not, you may be missing out!

With AI/ML among the most prominent topics in tech, GPUs play a critical role in the space. NVIDIA, a prominent player in the GPU space, is one of the top choices for most stakeholders in the field. NVIDIA takes its commitment to the space a step further with the launch of the GPU Operator open-source project at Mobile World Congress LA.

What is the GPU Operator?

The GPU, being a high-performance compute resource in the cluster, requires a few components to be installed before application workloads can be deployed onto it. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin, the container runtime, etc. With the GPU Operator, you can manage these resources in a Kubernetes cluster and automate the tasks involved in bootstrapping GPU nodes.

Supported Platforms

The NVIDIA GPU Operator currently supports and has been validated with the following:

●     Pascal+ GPUs are supported (incl. Tesla V100 and T4)

●     Kubernetes v1.13+

  • Canonical’s Charmed Kubernetes v1.15 has been tested with and supports the NVIDIA GPU Operator. The GPU Operator works out of the box with Canonical’s Charmed Kubernetes and is supported from day one.

– Note: Helm may fail to initialize in Kubernetes v1.16. The Helm installation steps in the Set-Up section below include a workaround for this. More details can be found in the GitHub issue.

●     Helm 2

●     Ubuntu 18.04.3 LTS

The GPU Operator has been validated with the following component versions:

●     Docker CE 19.03.2

●     NVIDIA Container Toolkit 1.0.5

●     NVIDIA Kubernetes Device Plugin 1.0.0-beta4

●     NVIDIA Tesla Driver 418.87.01

Set-Up

Prerequisites

The GPU Operator has a few prerequisites:

  • It requires a fresh configuration of nodes – nodes must not be pre-configured with NVIDIA components (driver, container runtime, device plugin).
  • The i2c_core and ipmi_msghandler kernel modules need to be loaded.

The following command ensures these modules are loaded:

$ sudo modprobe -a i2c_core ipmi_msghandler

Loading modules with modprobe is not persistent: the modules are not reloaded after a reboot. To make module loading persistent, add the modules to a config file as shown:

$ echo -e "i2c_core\nipmi_msghandler" | sudo tee /etc/modules-load.d/driver.conf
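
To confirm the two modules are actually loaded (for example after a reboot), the kernel's module list can be inspected. The following is a minimal sketch, not part of the operator's tooling: the function name missing_modules is hypothetical, and it only looks at /proc/modules, so a module compiled directly into the kernel would be reported as missing even though it is available. Treat its output as a hint.

```shell
# Print each required module that is absent from a /proc/modules-style listing.
# The optional first argument overrides the file to read (default: /proc/modules).
missing_modules() {
  modules_file="${1:-/proc/modules}"
  for mod in i2c_core ipmi_msghandler; do
    # Each line of /proc/modules begins with the module name followed by a space.
    if ! grep -q "^${mod} " "$modules_file"; then
      echo "$mod"
    fi
  done
}

# Usage on a node:
#   missing_modules    # prints nothing when both modules appear in /proc/modules
```

Any name it prints can then be passed to sudo modprobe as in the command above.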

  • Node Feature Discovery (NFD) is required on each node. By default, the NFD master and worker are deployed automatically.

If NFD is already running in the cluster prior to the deployment of the operator, set the variable nfd.enabled=false at the helm install step:

$ helm install --devel --set nfd.enabled=false nvidia/gpu-operator -n test-operator

See notes on NFD setup
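
One way to tell whether NFD is already active is to look for the feature.node.kubernetes.io/ label prefix that NFD applies to nodes. The sketch below assumes that heuristic is good enough for your cluster; the function name nfd_helm_flag is hypothetical. Labels can linger after NFD has been removed, so the result is a hint rather than proof.

```shell
# Read `kubectl get nodes --show-labels` output on stdin and print the extra
# helm flag to use when NFD-style labels are already present on some node.
nfd_helm_flag() {
  if grep -q 'feature\.node\.kubernetes\.io/'; then
    echo "--set nfd.enabled=false"
  fi
}

# Usage (requires cluster access):
#   kubectl get nodes --show-labels | nfd_helm_flag
```

If the function prints nothing, no NFD labels were found and the default install (which deploys NFD) is the likely choice.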

Install Helm

$ curl -L https://git.io/get_helm.sh | bash

Create a service account for Helm

$ kubectl create serviceaccount -n kube-system tiller

$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller

Initialize Helm

$ helm init --service-account tiller --wait

Note that if you already have Helm deployed in your cluster and you are adding a new node, run this instead:

$ helm init --client-only

 

Install the GPU Operator

Note that after running this command, NFD will be automatically deployed.

$ helm install --devel nvidia/gpu-operator -n test-operator --wait

$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/manifests/cr/sro_cr_sched_none.yaml
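
The helm install command above refers to the chart as nvidia/gpu-operator, which presumes that a Helm repository named nvidia has already been registered. If it has not, something like the following sketch is needed first. The repository URL is an assumption (the same host that serves the notebook example in this article), and the function name add_gpu_operator_repo is hypothetical.

```shell
# Assumed chart repository URL; verify it against the GPU Operator project page.
GPU_OPERATOR_REPO_URL="https://nvidia.github.io/gpu-operator"

# Register the repository under the name "nvidia" and refresh the local index.
add_gpu_operator_repo() {
  helm repo add nvidia "$GPU_OPERATOR_REPO_URL" && helm repo update
}

# Usage (before running helm install):
#   add_gpu_operator_repo
```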

To check the gpu-operator version:

$ helm ls
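
Beyond the chart version, it can be useful to confirm that GPU nodes now advertise GPUs to the scheduler. The NVIDIA device plugin exposes them as the nvidia.com/gpu resource, which appears in the Capacity and Allocatable sections of kubectl describe nodes. A minimal sketch, with a hypothetical function name:

```shell
# Succeed if `kubectl describe nodes` output on stdin lists a non-zero
# nvidia.com/gpu count on at least one node.
has_gpu_resource() {
  grep -Eq 'nvidia\.com/gpu:[[:space:]]+[1-9]'
}

# Usage (requires cluster access):
#   kubectl describe nodes | has_gpu_resource && echo "GPUs are schedulable"
```

It may take a few minutes after installation for the driver and device plugin to come up, so a failed check right after helm install is not necessarily an error.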

Running a Sample GPU Application

Create a TensorFlow notebook example

$ kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml

Grab the token from the pod logs once the pod is created and running

$ kubectl get pod tf-notebook

$ kubectl logs tf-notebook

The pod logs include a login URL containing the token, for use in your browser when you connect for the first time:

http://localhost:8888/?token=MY_TOKEN

You can now access the notebook at http://localhost:30001/?token=MY_TOKEN, replacing MY_TOKEN with the token from the logs.
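
Copying the token out of the logs by hand can be scripted. The sketch below assumes the logs contain a URL of the form shown above with an alphanumeric token; the function name extract_notebook_token is hypothetical.

```shell
# Read log output on stdin and print the first token found in a
# "?token=..." (or "&token=...") URL parameter.
extract_notebook_token() {
  sed -n 's/.*[?&]token=\([A-Za-z0-9]*\).*/\1/p' | head -n 1
}

# Usage (requires cluster access):
#   TOKEN="$(kubectl logs tf-notebook | extract_notebook_token)"
#   echo "http://localhost:30001/?token=${TOKEN}"
```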

What’s next

NVIDIA and Canonical will continue partnering to improve the AI/ML space and enable innovators. One area of interest is extending the GPU Operator to MicroK8s. MicroK8s takes Kubernetes simplification one step further: a lightweight Kubernetes distribution with Kubeflow, GPUs, Helm and the GPU Operator all in one package. Get started in seconds!

Contributing

If you find a bug, have technical issues or would like to contribute to the NVIDIA GPU Operator, please visit the official GitHub page.

To report issues with or contribute to Canonical’s Kubernetes, please visit its GitHub page. You can also reach out to us on Twitter @canonical @ubuntu.

Canonical and NVIDIA look forward to your valuable feedback!
