Custom Resource Naming and Supporting Multiple GPU SKUs on a Single Node in Kubernetes
Kevin Klues <kklues@nvidia.com>
Last Updated:
12-May-2022
Table of Contents
- Overview
- Assumptions, Constraints, Dependencies
- Design Details
Overview
A common question that often gets asked is:
Can I have multiple GPUs with different SKUs on the same node in my Kubernetes cluster?
As of today, the short answer is no.
The underlying problem is that the NVIDIA GPU device plugin has no way of disambiguating cards with different SKUs. It simply advertises all cards under the same resource type of nvidia.com/gpu, giving users no way to direct workloads to a specific card.
The common way of dealing with this limitation is to have dedicated nodes for each type of GPU and use a nodeSelector to direct traffic to a node with a specific type of GPU installed on it. The resource type for all GPUs remains nvidia.com/gpu, but the node selector ensures you land on a node with a specific label, e.g. nvidia.com/gpu.product=A100-SXM4-40GB or nvidia.com/gpu.product=GeForce-GT-710.
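For reference, a minimal sketch of this workaround (pod, container, and image names here are hypothetical; the label value comes from gpu-feature-discovery):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                              # hypothetical name
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-40GB   # label applied by gpu-feature-discovery
  containers:
  - name: cuda-container                     # hypothetical name
    image: my-cuda-workload:latest           # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1                    # generic resource type shared by all GPUs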
However, this is not a viable solution for users with only a few nodes in their cluster or only a handful of GPUs available. Moreover, the introduction of MIG exacerbates this problem: dividing a GPU into MIG devices essentially makes the node appear to have multiple GPUs of different sizes on it. We have worked around this limitation in the case of MIG by providing a set of new, well-defined resource names, but this solution does not scale when considering the vast number of GPU cards on the market.
This document outlines our proposal to address this issue.
Assumptions, Constraints, Dependencies
- For now, only discrete GPUs and MIG devices are supported by this API
- Nothing in the API should prevent other types of devices from being supported in the future (e.g. vGPUs or iGPUs), but it is not yet clear if they should be bundled with the existing abstractions or called out separately
Design Details
The use of a configuration file was recently introduced to configure the k8s-device-plugin and gpu-feature-discovery (see: Using a config file to configure the k8s-device-plugin and gpu-feature-discovery).
To support multiple GPUs on a single node, we extend this configuration file with a resources section common to both components. This section allows multiple resource names to be exposed by the plugin and labeled by gpu-feature-discovery. These resource names are defined by matching patterns against a device's product name (in the case of full GPUs) or MIG profile name (in the case of MIG devices).
For example:
version: v1
resources:
  gpus:
  - pattern: "A100-SXM4-40GB"
    name: a100
  - pattern: "Tesla V100-SXM2-16GB-N"
    name: v100
  mig:
  - pattern: "1g.5gb"
    name: mig-small
  - pattern: "2g.10gb"
    name: mig-medium
  - pattern: "3g.20gb"
    name: mig-large
For any resources listed under resources.gpus, we match the pattern against the output of nvidia-smi --query-gpu=name.
For example:
$ nvidia-smi --query-gpu=name --format=csv,noheader
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
A100-SXM4-40GB
And for MIG devices listed under resources.mig, we match against the canonical MIG device’s profile name (e.g. 1g.5gb, 2g.10gb, 3g.20gb) as seen in the output of nvidia-smi -L.
For example:
$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54/2/0)
  MIG 2g.10gb Device 1: (UUID: MIG-GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54/3/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54/10/0)
Once matched, resources will be advertised according to the name specified in each entry as nvidia.com/<name>. From the example above, this would result in the following resources being advertised on nodes where these patterns were matched:
nvidia.com/a100
nvidia.com/v100
nvidia.com/mig-small
nvidia.com/mig-medium
nvidia.com/mig-large
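With these names in place, a workload can request a specific SKU directly, without relying on a nodeSelector. A minimal sketch, with hypothetical pod, container, and image names:

apiVersion: v1
kind: Pod
metadata:
  name: a100-job                     # hypothetical name
spec:
  containers:
  - name: cuda-container             # hypothetical name
    image: my-cuda-workload:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/a100: 1           # one full A100, as named in the config above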
Note: Patterns can include wildcards (i.e. ‘*’) to match against multiple devices with similar names. Additionally, the order of the entries under resources.gpus and resources.mig matters. Entries earlier in the list will be matched before entries later in the list.
For example:
version: v1
resources:
  gpus:
  - pattern: "A100-*-40GB"
    name: a100-40gb
  - pattern: "*A100*"
    name: a100
This will first attempt to match against the string A100-*-40GB and only move on to the string *A100* if no match is found. In a cluster with a mix of 40GB and 80GB A100 cards, the 40GB cards would be advertised as nvidia.com/a100-40gb and the 80GB cards would be advertised as nvidia.com/a100.
Likewise, for MIG, the following is useful to match only on GPC count and not on memory size:
version: v1
resources:
  mig:
  - pattern: "1g.*"
    name: mig-small
  - pattern: "2g.*"
    name: mig-medium
  - pattern: "3g.*"
    name: mig-large
  - pattern: "4g.*"
    name: mig-large
Note: In this example, both the “3g.*” and “4g.*” patterns map to the same resource name of nvidia.com/mig-large. This is both allowed and encouraged in order to group multiple device types under the same resource name.
Finally, if no pattern matching / resource naming specification is found for a given GPU or MIG profile name, we fall back to a set of defaults.
For full GPUs and MIG devices with the single strategy, the default resource name is nvidia.com/gpu. For MIG devices with the mixed strategy, the default resource name is nvidia.com/mig-<profile-name>.
For full GPUs, this is equivalent to having the following at the end of the pattern matching list:
version: v1
resources:
  gpus:
  ...
  - pattern: "*"
    name: gpu
For the single MIG strategy, this is equivalent to having:
version: v1
resources:
  mig:
  ...
  - pattern: "*"
    name: gpu
And for the mixed MIG strategy, this is equivalent to having (on an A100 40GB device):
version: v1
resources:
  mig:
  ...
  - pattern: "1g.5gb"
    name: mig-1g.5gb
  - pattern: "2g.10gb"
    name: mig-2g.10gb
  - pattern: "3g.20gb"
    name: mig-3g.20gb
When used in conjunction with gpu-feature-discovery, the following labels will remain unchanged, regardless of the new resource types provided:
nvidia.com/cuda.driver.major
nvidia.com/cuda.driver.minor
nvidia.com/cuda.driver.rev
nvidia.com/cuda.runtime.major
nvidia.com/cuda.runtime.minor
nvidia.com/gfd.timestamp
Custom labels will then be generated for the following, based on the resource names chosen (for full GPUs):
nvidia.com/<resource>.compute.major
nvidia.com/<resource>.compute.minor
nvidia.com/<resource>.family
nvidia.com/<resource>.count
nvidia.com/<resource>.machine
nvidia.com/<resource>.memory
nvidia.com/<resource>.product
Likewise, for MIG, the labels will be customized based on the resource name chosen:
nvidia.com/<resource>.count
nvidia.com/<resource>.memory
nvidia.com/<resource>.multiprocessors
nvidia.com/<resource>.slices.ci
nvidia.com/<resource>.slices.gi
nvidia.com/<resource>.engines.copy
nvidia.com/<resource>.engines.decoder
nvidia.com/<resource>.engines.encoder
nvidia.com/<resource>.engines.jpeg
nvidia.com/<resource>.engines.ofa
For example, using the following configuration on a DGX-A100:
version: v1
flags:
  migStrategy: mixed
resources:
  gpus:
  - pattern: "*A100*"
    name: a100
  mig:
  - pattern: "1g.5gb"
    name: mig-small
With the following MIG-parted configuration:
version: v1
mig-configs:
  current:
  - devices: all
    mig-enabled: true
    mig-devices:
      1g.5gb: 7
This would result in the following labels:
nvidia.com/cuda.driver.major: "455"
nvidia.com/cuda.driver.minor: "06"
nvidia.com/cuda.driver.rev: ""
nvidia.com/cuda.runtime.major: "11"
nvidia.com/cuda.runtime.minor: "6"
nvidia.com/a100.compute.major: "8"
nvidia.com/a100.compute.minor: "0"
nvidia.com/a100.count: 8
nvidia.com/a100.family: ampere
nvidia.com/a100.machine: DGXA100-920-23687-2530-000
nvidia.com/a100.memory: "39538"
nvidia.com/a100.product: A100-SXM4-40GB
nvidia.com/mig-small.count: 56
nvidia.com/mig-small.memory: 10240
nvidia.com/mig-small.multiprocessors: 14
nvidia.com/mig-small.slices.ci: 1
nvidia.com/mig-small.slices.gi: 1
nvidia.com/mig-small.engines.copy: 1
nvidia.com/mig-small.engines.decoder: 1
nvidia.com/mig-small.engines.encoder: 1
nvidia.com/mig-small.engines.jpeg: 0
nvidia.com/mig-small.engines.ofa: 0
Note: If multiple device types are bundled under the same resource name on the same node, then only the common labels (across all device types) will be generated for that resource name on the node. We may choose to change this behavior in the future (i.e. by adding another field in the label name to disambiguate which device type the label applies to), but for now non-common labels will simply be omitted.
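Putting it all together, a workload could combine one of the generated labels with one of the custom resource names defined above. A minimal sketch, with hypothetical pod, container, and image names:

apiVersion: v1
kind: Pod
metadata:
  name: mig-small-job                         # hypothetical name
spec:
  nodeSelector:
    nvidia.com/a100.product: A100-SXM4-40GB   # generated label from the example above
  containers:
  - name: cuda-container                      # hypothetical name
    image: my-cuda-workload:latest            # hypothetical image
    resources:
      limits:
        nvidia.com/mig-small: 1               # one 1g.5gb MIG device, as named in the config above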