Custom Resource Naming and Supporting Multiple GPU SKUs on a Single Node in Kubernetes
Kevin Klues <kklues@nvidia.com>
Last Updated:
12-May-2022
Table of Contents
- Overview
- Assumptions, Constraints, Dependencies
- Design Details
Overview
A common question that often gets asked is:
Can I have multiple GPUs with different SKUs on the same node in my Kubernetes cluster?
As of today, the short answer is no.
The underlying problem is that the NVIDIA GPU device plugin has no way of disambiguating cards with different SKUs. It simply advertises all cards under the same resource type of nvidia.com/gpu, giving users no way to direct workloads to a specific card.
The common way of dealing with this limitation is to have dedicated nodes for each type of GPU and use a nodeSelector to direct traffic to a node with a specific type of GPU installed on it. The resource type for all GPUs remains nvidia.com/gpu, but the node selector ensures you land on a node with a specific label, e.g. nvidia.com/gpu.product=A100-SXM4-40GB or nvidia.com/gpu.product=GeForce-GT-710.
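For reference, a minimal sketch of this workaround (pod, container, and image names here are hypothetical; the label value comes from gpu-feature-discovery):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                              # hypothetical name
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-40GB   # label applied by gpu-feature-discovery
  containers:
  - name: cuda-container                     # hypothetical name
    image: my-cuda-workload:latest           # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1                    # generic resource type shared by all GPUs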
However, this is not a viable solution for users with only a few nodes in their cluster or only a handful of GPUs available. Moreover, the introduction of MIG exacerbates this problem: dividing a GPU into MIG devices essentially makes the node appear to have multiple GPUs of different sizes on it. We have worked around this limitation in the case of MIG by providing a set of new, well-defined resource names, but this solution does not scale when considering the vast number of GPU cards on the market.
This document outlines our proposal to address this issue.
Assumptions, Constraints, Dependencies
- For now, only discrete GPUs and MIG devices are supported by this API
- Nothing in the API should prevent other types of devices from being supported in the future (e.g. vGPUs or iGPUs), but it is not yet clear if they should be bundled with the existing abstractions or called out separately
Design Details
The use of a configuration file was recently introduced to configure the k8s-device-plugin and gpu-feature-discovery (see: Using a config file to configure the k8s-device-plugin and gpu-feature-discovery).
To support multiple GPUs on a single node, we extend this configuration file with a resources section common to both components. This section allows multiple resource names to be exposed by the plugin and labeled by gpu-feature-discovery. These resource names are defined by matching patterns against a device's product name (in the case of full GPUs) or MIG profile name (in the case of MIG devices).
For example:
version: v1
resources:
  gpus:
  - pattern: "A100-SXM4-40GB"
    name: a100
  - pattern: "Tesla V100-SXM2-16GB-N"
    name: v100
  mig:
  - pattern: "1g.5gb"
    name: mig-small
  - pattern: "2g.10gb"
    name: mig-medium
  - pattern: "3g.20gb"
    name: mig-large
For any resources listed under resources.gpus, we match the pattern against the output of nvidia-smi --query-gpu=name.
For example:
$ nvidia-smi --query-gpu=name --format=csv,noheader
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
And for MIG devices listed under resources.mig, we match against the canonical MIG device’s profile name (e.g. 1g.5gb, 2g.10gb, 3g.20gb) as seen in the output of nvidia-smi -L.
For example:
$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54/2/0)
  MIG 2g.10gb Device 1: (UUID: MIG-GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54/3/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54/10/0)
Once matched, resources will be advertised according to the name specified in each entry as nvidia.com/<name>. From the example above, this would result in the following resources being advertised on nodes where these patterns were matched:
nvidia.com/a100
nvidia.com/v100
nvidia.com/mig-small
nvidia.com/mig-medium
nvidia.com/mig-large
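With these names in place, a workload can request a specific SKU directly, without relying on a nodeSelector. A minimal sketch, with hypothetical pod, container, and image names:

apiVersion: v1
kind: Pod
metadata:
  name: a100-job                     # hypothetical name
spec:
  containers:
  - name: cuda-container             # hypothetical name
    image: my-cuda-workload:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/a100: 1           # one full A100, as named in the config above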
Note: Patterns can include wildcards (i.e. ‘*’) to match against multiple devices with similar names. Additionally, the order of the entries under resources.gpus and resources.mig matters. Entries earlier in the list will be matched before entries later in the list.
For example:
version: v1
resources:
  gpus:
  - pattern: "A100-*-40GB"
    name: a100-40gb
  - pattern: "*A100*"
    name: a100
This will first attempt to match against the string A100-*-40GB and only move on to the string *A100* if no match is found. In a cluster with a mix of 40GB and 80GB A100 cards, the 40GB cards would be advertised as nvidia.com/a100-40gb and the 80GB cards would be advertised as nvidia.com/a100.
Likewise, for MIG, the following is useful to match only on GPC count and not on memory size:
version: v1
resources:
  mig:
  - pattern: "1g.*"
    name: mig-small
  - pattern: "2g.*"
    name: mig-medium
  - pattern: "3g.*"
    name: mig-large
  - pattern: "4g.*"
    name: mig-large
Note: In this example, both the “3g.*” and “4g.*” patterns map to the same resource name of nvidia.com/mig-large. This is both allowed and encouraged in order to group multiple device types under the same resource name.
Finally, if no pattern matching / resource naming specification is found for a given GPU or MIG profile name, we fall back to a set of defaults.
For full GPUs and MIG devices with the single strategy, the default resource name is nvidia.com/gpu. For MIG devices with the mixed strategy, the default resource name is nvidia.com/mig-<profile-name>.
For full GPUs, this is equivalent to having the following at the end of the pattern matching list:
version: v1
resources:
  gpus:
  ...
  - pattern: "*"
    name: gpu
For the single MIG strategy, this is equivalent to having:
version: v1
resources:
  mig:
  ...
  - pattern: "*"
    name: gpu
And for the mixed MIG strategy, this is equivalent to having (on an A100 40GB device):
version: v1
resources:
  mig:
  ...
  - pattern: "1g.5gb"
    name: mig-1g.5gb
  - pattern: "2g.10gb"
    name: mig-2g.10gb
  - pattern: "3g.20gb"
    name: mig-3g.20gb
When used in conjunction with gpu-feature-discovery, the following labels will remain unchanged, regardless of the new resource types provided:
nvidia.com/cuda.driver.major
nvidia.com/cuda.driver.minor
nvidia.com/cuda.driver.rev
nvidia.com/cuda.runtime.major
nvidia.com/cuda.runtime.minor
nvidia.com/gfd.timestamp
Custom labels will then be generated for the following, based on the resource names chosen (for full GPUs):
nvidia.com/<resource>.compute.major
nvidia.com/<resource>.compute.minor
nvidia.com/<resource>.family
nvidia.com/<resource>.count
nvidia.com/<resource>.machine
nvidia.com/<resource>.memory
nvidia.com/<resource>.product
Likewise, for MIG, the labels will be customized based on the resource name chosen:
nvidia.com/<resource>.count
nvidia.com/<resource>.memory
nvidia.com/<resource>.multiprocessors
nvidia.com/<resource>.slices.ci
nvidia.com/<resource>.slices.gi
nvidia.com/<resource>.engines.copy
nvidia.com/<resource>.engines.decoder
nvidia.com/<resource>.engines.encoder
nvidia.com/<resource>.engines.jpeg
nvidia.com/<resource>.engines.ofa
For example, using the following configuration on a DGX-A100:
version: v1
flags:
  migStrategy: mixed
resources:
  gpus:
  - pattern: "*A100*"
    name: a100
  mig:
  - pattern: "1g.5gb"
    name: mig-small
With the following MIG-parted configuration:
version: v1
mig-configs:
  current:
  - devices: all
    mig-enabled: true
    mig-devices:
      1g.5gb: 7
This would result in the following labels:
nvidia.com/cuda.driver.major: "455"
nvidia.com/cuda.driver.minor: "06"
nvidia.com/cuda.driver.rev: ""
nvidia.com/cuda.runtime.major: "11"
nvidia.com/cuda.runtime.minor: "6"
nvidia.com/a100.compute.major: "8"
nvidia.com/a100.compute.minor: "0"
nvidia.com/a100.count: 8
nvidia.com/a100.family: ampere
nvidia.com/a100.machine: DGXA100-920-23687-2530-000
nvidia.com/a100.memory: "39538"
nvidia.com/a100.product: A100-SXM4-40GB
nvidia.com/mig-small.count: 56
nvidia.com/mig-small.memory: 10240
nvidia.com/mig-small.multiprocessors: 14
nvidia.com/mig-small.slices.ci: 1
nvidia.com/mig-small.slices.gi: 1
nvidia.com/mig-small.engines.copy: 1
nvidia.com/mig-small.engines.decoder: 1
nvidia.com/mig-small.engines.encoder: 1
nvidia.com/mig-small.engines.jpeg: 0
nvidia.com/mig-small.engines.ofa: 0
Note: If multiple device types are bundled under the same resource name on the same node, then only the common labels (across all device types) will be generated for that resource name on the node. We may choose to change this behavior in the future (i.e. by adding another field in the label name to disambiguate which device type the label applies to), but for now non-common labels will simply be omitted.
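Putting it all together, a workload could combine one of the generated labels with one of the custom resource names defined above. A minimal sketch, with hypothetical pod, container, and image names:

apiVersion: v1
kind: Pod
metadata:
  name: mig-small-job                         # hypothetical name
spec:
  nodeSelector:
    nvidia.com/a100.product: A100-SXM4-40GB   # generated label from the example above
  containers:
  - name: cuda-container                      # hypothetical name
    image: my-cuda-workload:latest            # hypothetical image
    resources:
      limits:
        nvidia.com/mig-small: 1               # one 1g.5gb MIG device, as named in the config above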