Kubernetes for RISC‑V + GPU Clusters: Device Plugins, Scheduling and Resource Topology


2026-02-25

Design patterns and sample implementations for Kubernetes device plugins and scheduler plugins that handle NVLink GPU groups and RISC‑V NUMA affinity.

Pain point: You’re designing clusters that mix RISC‑V servers and NVLink‑attached GPUs and need Kubernetes to make placement decisions that respect NVLink topology, NUMA locality and GPU affinity — otherwise your ML jobs will see 2–5x worse latency or underutilize expensive interconnects.

In 2025–2026 the ecosystem shifted: SiFive’s announced integration of Nvidia’s NVLink Fusion with RISC‑V IP and wider interest in RISC‑V server-class silicon mean production clusters must understand GPU interconnect topologies and CPU NUMA layouts. This article gives concrete design patterns, sample device‑plugin and scheduler implementations, kubelet configuration, and testing strategies so you can build reliable topology‑aware orchestration for RISC‑V + GPU deployments.

Executive summary — what you must do first

  • Expose topology from nodes: run a resource‑topology exporter that publishes NUMA, PCIe and NVLink groupings to Kubernetes (or a CRD).
  • Enhance device plugins: device plugins must advertise grouped GPU resources (NVLink peers, MIG instances) and include topology metadata in their Allocate responses.
  • Use topology‑aware kubelet options: enable CPUManager (static), set Topology Manager to single‑numa‑node or restricted depending on SLAs.
  • Implement a scheduler plugin: Scheduler Framework plugin (Filter + Score) that understands NVLink and NUMA hints and enforces co‑placement and anti‑affinity rules for GPU groups.
  • Test and observe: run microbenchmarks that exercise NVLink bandwidth/latency and collect telemetry (DCGM, Prometheus) to validate placement rules.

By 2026, the datacenter landscape is evolving: RISC‑V silicon is moving from niche to mainstream for specialized inference and accelerators, and Nvidia’s NVLink Fusion efforts (announced with partners in late 2025) are making tighter GPU‑to‑CPU and GPU‑to‑GPU coherence possible on heterogeneous platforms. Kubernetes scheduling and device plugin frameworks matured toward topology awareness, and the community maintains projects like resource-topology-exporter and Topology Manager best‑practices. That combination creates both the need and the mechanisms for topology‑aware orchestration.

Key concepts to understand (brief)

  • NVLink topology: high‑speed GPU‑GPU or GPU‑CPU links that form groups (rings/meshes) and make inter‑GPU transfers much cheaper when inside a link domain.
  • NUMA alignment: CPU cores, memory and PCIe/NVLink devices belong to NUMA nodes. Optimal performance requires pinning workloads to matching NUMA regions.
  • Device plugin model: Kubernetes device plugins (Register, ListAndWatch, Allocate) are the mechanism to advertise and hand out GPUs and can include topology metadata.
  • Scheduler Framework: replace older extenders — implement Filter & Score plugins to drive placement decisions based on topology objects or node labels.

Design pattern 1 — Node topology discovery and export

Start by exporting a canonical view of resource topology from each node. Do not rely solely on the device plugin's ephemeral state; use a separate exporter that inspects /sys, the NVIDIA management library (NVML), and platform firmware to publish a topology snapshot.

What to export

  • NUMA nodes and cpuset ranges.
  • Per‑GPU PCI bus IDs, NVLink peer groups and link widths.
  • GPU memory sizes and MIG slice information.
  • Driver compatibility and firmware versions (important for RISC‑V ABI mismatches).

Use or extend resource-topology-exporter (maintained under the Kubernetes topology‑aware scheduling working group) to publish this as NodeResourceTopology custom resources so scheduler plugins can consume it.

Sample topology JSON model (simplified)

{
  "node": "riscv-node-01",
  "numaNodes": [
    { "id": 0, "cpus": "0-15", "memMB": 131072 },
    { "id": 1, "cpus": "16-31", "memMB": 131072 }
  ],
  "gpus": [
    { "id": "GPU-0", "pciBus": "0000:3b:00.0", "numaNode": 0, "nvlinkGroup": "gA" },
    { "id": "GPU-1", "pciBus": "0000:3c:00.0", "numaNode": 1, "nvlinkGroup": "gA" },
    { "id": "GPU-2", "pciBus": "0000:3d:00.0", "numaNode": 1, "nvlinkGroup": "gB" }
  ]
}

Design pattern 2 — Topology‑aware device plugins

Device plugins should do more than advertise counts. They should:

  • Expose logical GPU resources that include NVLink group and NUMA affinity as metadata.
  • Offer grouped resources (e.g., nvidia.com/gpu-group-gA — extended resource names allow only a single slash, so the group is encoded in the name) for applications that must allocate GPUs sharing NVLink.
  • Provide allocation hooks that return CPU pinning hints and env vars exposing NVLink peer IDs.

Implementation notes

Follow the standard Device Plugin gRPC API. In ListAndWatch, return devices with IDs like gpu-0 and include topology annotation in Allocate response (via env var or annotations carried to the pod). For grouping, advertise additional synthetic resources like nvidia.com/gpu-group-gA with integer quantity representing whole groups.

// AllocateResponse contains environment variables and mounts
{"envs": {
  "GPU_IDS": "GPU-0,GPU-1",
  "GPU_NUMA_NODES": "0,1",
  "GPU_NVLINK_GROUPS": "gA"
}}
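The env-var half of that response can be assembled from the selected devices. A sketch, with the Device type and env names mirroring the examples above:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Device is the plugin's view of one allocatable GPU.
type Device struct {
	ID          string
	NUMANode    int
	NVLinkGroup string
}

// allocateEnvs builds the env map a device plugin would return in its
// AllocateResponse so the container can discover its NVLink/NUMA placement.
func allocateEnvs(devs []Device) map[string]string {
	ids := make([]string, 0, len(devs))
	numas := make([]string, 0, len(devs))
	groupSet := map[string]bool{}
	for _, d := range devs {
		ids = append(ids, d.ID)
		numas = append(numas, strconv.Itoa(d.NUMANode))
		groupSet[d.NVLinkGroup] = true
	}
	groups := make([]string, 0, len(groupSet))
	for g := range groupSet {
		groups = append(groups, g)
	}
	sort.Strings(groups) // deterministic output for tests and logs
	return map[string]string{
		"GPU_IDS":           strings.Join(ids, ","),
		"GPU_NUMA_NODES":    strings.Join(numas, ","),
		"GPU_NVLINK_GROUPS": strings.Join(groups, ","),
	}
}

func main() {
	envs := allocateEnvs([]Device{
		{ID: "GPU-0", NUMANode: 0, NVLinkGroup: "gA"},
		{ID: "GPU-1", NUMANode: 1, NVLinkGroup: "gA"},
	})
	fmt.Println(envs["GPU_IDS"], envs["GPU_NVLINK_GROUPS"]) // GPU-0,GPU-1 gA
}
```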

Design pattern 3 — Topology‑aware scheduler plugin

Implement a Scheduler Framework plugin instead of an external scheduler extender. A plugin gives you hooks for PreFilter, Filter, Score and Reserve — enabling precise placement decisions and scoring for multi‑GPU jobs.

Core responsibilities

  • PreFilter: parse pod resource requests and expected topology (e.g., request for 2 GPUs in same NVLink group).
  • Filter: exclude nodes that cannot satisfy both GPU and NUMA alignment constraints.
  • Score: rank nodes by NVLink locality (prefer same NVLink group) and NUMA alignment (prefer same NUMA node for GPU and CPU).
  • Reserve/PreBind: reserve the exact GPU IDs or set node/pod annotations so device plugin and kubelet can complete Allocate with correct CPU pinning.

Scheduler plugin skeleton (Filter + Score, Scheduler Framework signatures)

// Filter rejects nodes whose exported topology cannot satisfy the pod's
// GPU-group and NUMA constraints.
func (p *Plugin) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	req := parsePodRequirements(pod)
	topo := p.getNodeTopology(nodeInfo.Node().Name)
	if !topo.CanSatisfy(req) {
		return framework.NewStatus(framework.Unschedulable, "insufficient NVLink/NUMA topology")
	}
	return nil
}

// Score prefers nodes where the requested GPUs share an NVLink group and sit
// on the same NUMA node as the requested CPUs.
func (p *Plugin) Score(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	topo := p.getNodeTopology(nodeName)
	var score int64
	if topo.GPUGroupMatches(pod) {
		score += 100
	}
	score += topo.NUMAAffinityScore(pod) // prefer same NUMA node
	return score, nil
}

NUMA policies and kubelet configuration

To ensure CPU pinning aligns with GPU placement, configure kubelet with:

# Example kubelet flags (kubeadm or systemd drop-in)
--cpu-manager-policy=static
--topology-manager-policy=single-numa-node  # or restricted for mixed workloads
--kube-reserved=cpu=100m,memory=200Mi
--system-reserved=cpu=100m,memory=200Mi
# Note: the KubeletPodResources feature gate is GA and enabled by default on
# current kubelets, so it no longer needs to be set explicitly.

And set Topology Manager policy based on workload requirements:

  • single-numa-node — highest locality guarantees. Use for latency-sensitive inference or training when strict placement is possible.
  • restricted — balances flexibility and locality. Good for mixed workload clusters.
  • best-effort — prefer locality but allow cross‑NUMA assignments.

Also enable CPUManager & Topology Manager in kubelet config and ensure runtime (containerd) supports cpuset and cgroup v2 features on RISC‑V kernels.

Gang scheduling and multi‑GPU jobs

Machine learning workloads often need multiple GPUs and consistent NVLink fabric between them. Use a gang scheduling pattern so the scheduler treats the pod set as one unit:

  • Use Volcano (successor to the now-archived kube-batch), the kubernetes-sigs scheduler-plugins coscheduling plugin, or a custom scheduler plugin that supports gang semantics.
  • Before binding, ensure a single node can satisfy group requirements (or coordinate cross-node NVLink if hardware supports it — this is rare and must be validated).
  • For ephemeral workloads, create a preemptible reservation (Reserve phase) to avoid partial allocation across NVLink groups.
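With Volcano, gang semantics are declared through a PodGroup; a minimal sketch, where the name, member count and queue are placeholders:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: nccl-train
spec:
  minMember: 4        # gang: bind all four workers or none of them
  queue: default
```

Worker pods reference the group via the scheduling.k8s.io/group-name annotation (Volcano's convention), so a partial allocation never splits an NVLink group across bindings.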

Handling MIG, fractional GPUs and RISC‑V specifics

MIG slices complicate topology: a single GPU may expose multiple MIG devices with different affinities. Device plugins must map MIG instances to parent GPU NVLink groups and advertise per‑MIG NUMA affinity.
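That parent-GPU mapping can be a small pure function; in this sketch the MIG ID format is an assumption for illustration, not the exact NVML form:

```go
package main

import (
	"fmt"
	"strings"
)

// parentGPU extracts the parent GPU ID from a MIG-style device ID such as
// "MIG-GPU-0-1-0" (format assumed for illustration).
func parentGPU(migID string) string {
	parts := strings.Split(migID, "-")
	if len(parts) < 3 || parts[0] != "MIG" {
		return migID // already a full GPU ID
	}
	return parts[1] + "-" + parts[2] // e.g. "GPU-0"
}

// nvlinkGroupFor resolves any device (full GPU or MIG slice) to its NVLink
// group using the exporter's GPU -> group map.
func nvlinkGroupFor(devID string, groups map[string]string) string {
	return groups[parentGPU(devID)]
}

func main() {
	groups := map[string]string{"GPU-0": "gA", "GPU-2": "gB"}
	fmt.Println(nvlinkGroupFor("MIG-GPU-0-1-0", groups)) // gA
}
```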

RISC‑V: expect heterogeneity. Early RISC‑V server designs may provide asymmetric NUMA layouts, custom PCIe root complexes, and driver limitations. Important checks:

  • Verify NVIDIA driver support on RISC‑V (firmware and ABI). Keep a compatibility matrix as part of NodeTopology CRD.
  • Expose CPU ISA and microarchitecture tags via Node Feature Discovery so scheduler can avoid mixing incompatible nodes.
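Placement can then be gated with standard labels: kubernetes.io/arch: riscv64 is set by the kubelet itself, while the NFD feature label below is a hypothetical custom rule you would define:

```yaml
nodeSelector:
  kubernetes.io/arch: riscv64
  # Hypothetical NFD custom label gating on a validated driver/firmware combo:
  feature.node.kubernetes.io/custom-nvlink-ready: "true"
```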

Observability and verification

Instrument every layer of the control path and devices:

  • Node exporters: resource-topology-exporter, Node Feature Discovery.
  • GPU telemetry: NVIDIA DCGM exporter, plus NVLink link status checks (e.g., nvidia-smi topo -m).
  • Scheduler metrics: custom metrics from your plugin (filter failures, scoring distribution).
  • End‑to‑end tests: run microbenchmarks (CUDA/NCCL, microsecond latency tests) across different placements and record throughput and latency.
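As one example, a Prometheus alert on a scheduler-plugin metric; the metric name is an assumption for a counter your own plugin would export:

```yaml
groups:
- name: topology-scheduling
  rules:
  - alert: NVLinkPlacementFailures
    # topology_plugin_filter_failures_total is a hypothetical counter
    # incremented each time the Filter stage rejects a node.
    expr: rate(topology_plugin_filter_failures_total[5m]) > 0.1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pods are repeatedly failing NVLink/NUMA topology filtering"
```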

Sample workflow: from node to Pod

  1. Node boots; resource exporter publishes NodeTopology: NUMA nodes, GPU list, NVLink groups.
  2. Device plugin registers and ListAndWatch returns GPU device list and synthetic resources for NVLink groups.
  3. User submits Pod requesting 2 GPUs in same NVLink group and CPU count 8.
  4. Scheduler PreFilter validates request and consumes NodeTopology.
    • Filter excludes nodes without a matching NVLink group or insufficient CPU NUMA capacity.
  5. Scheduler Score prefers nodes where both requested GPUs are in the same NVLink group and close to the requested CPU NUMA node.
  6. Reserve + Bind: annotations pass selected GPU IDs and NUMA hint to kubelet/device plugin for final Allocate and cpu pinning.
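The hand-off in step 6 can be as simple as pod annotations written during Reserve/PreBind; the keys below are an illustrative convention, not a published standard:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// reserveAnnotations records the scheduler's exact GPU choice and NUMA hint
// so the device plugin and kubelet can complete Allocate with matching
// CPU pinning. Annotation keys are hypothetical.
func reserveAnnotations(gpuIDs []string, numaNode int) map[string]string {
	return map[string]string{
		"topology.example.com/gpu-ids":   strings.Join(gpuIDs, ","),
		"topology.example.com/numa-node": strconv.Itoa(numaNode),
	}
}

func main() {
	a := reserveAnnotations([]string{"GPU-0", "GPU-1"}, 0)
	fmt.Println(a["topology.example.com/gpu-ids"]) // GPU-0,GPU-1
}
```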

Practical implementation checklist

  • Deploy resource-topology-exporter (or similar) on all nodes.
  • Create/extend device plugin to advertise NVLink groups & MIG mapping.
  • Implement a Scheduler Framework plugin or extend an existing one (Volcano integrations are common for gang scheduling).
  • Configure kubelet: CPU manager static policy and Topology Manager policy aligned with your workload.
  • Add observability: DCGM exporter, Prometheus rules, and dashboards to monitor NVLink link health and allocation failures.

Common pitfalls and how to avoid them

  • Assuming PCIe is enough: NVLink changes performance characteristics. Benchmark both PCIe and NVLink paths.
  • Ignoring driver/firmware mismatches on RISC‑V: maintain an image matrix and validation pipeline for drivers on RISC‑V kernels.
  • Relying only on resource counts: advertising “2 gpus” without topology leads to bad placement. Always include topology metadata.
  • Over‑constraining the Topology Manager: overly strict policies cause unnecessary scheduling failures — use restricted for mixed workloads.

Validation & benchmarks (example tests)

  1. NVLink throughput test: run NCCL allreduce across GPUs placed in same NVLink group vs different groups; measure bandwidth and latency.
  2. NUMA latency test: pin CPU threads to local vs remote NUMA node while accessing GPU; measure latency using microbenchmarks.
  3. Stress test scheduler: submit many gang jobs requesting different NVLink group sizes; measure scheduling success rate and time-to-bind.

Security, licensing and governance considerations

Proprietary GPU drivers and vendor firmware are a governance issue on RISC‑V. Track driver licenses, sign binaries, and manage firmware updates via an approved pipeline. Consider isolating driver updates in a staged fleet and use admission controls to prevent incompatible images from scheduling on RISC‑V nodes with GPUs.

Future predictions (2026–2028)

Expect the following:

  • RISC‑V server silicon will move into more inference and edge training roles where tight NVLink integration provides cost/performance wins.
  • Device plugins will standardize topology metadata conventions (CRDs or annotations) and resource-topology-exporter patterns will be common in clouds and private DCs.
  • Scheduler plugins that understand interconnect beyond NUMA (NVLink, CXL, coherent fabrics) will be first-class components in enterprise Kubernetes distributions.
"Topology is the new resource. Counting CPUs and GPUs isn’t enough — orchestration must understand how pieces are wired together."

Actionable starter snippets

apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
  - name: trainer
    image: your-registry/trainer:latest
    resources:
      limits:
        nvidia.com/gpu-group-gA: 1  # requests a whole NVLink group
        cpu: "8"

Kubelet config recommendations (kubelet config or flags)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: "static"
topologyManagerPolicy: "single-numa-node" # or "restricted" for mixed workloads
kubeReserved:
  cpu: "100m"
  memory: "200Mi"
systemReserved:
  cpu: "100m"
  memory: "200Mi"

Where to start next (practical roadmap)

  1. Inventory: run node inspection scripts to discover NUMA and NVLink topology on existing hardware.
  2. Prototype exporter & device plugin on a development node and validate that topology metadata reaches the scheduler.
  3. Build a minimal Scheduler Framework plugin (Filter + Score) and run unit tests for decision logic.
  4. Run bench repeats (NCCL, microbenchmarks) to quantify gains; iterate kubelet topology policy and scheduler heuristics.
  5. Roll out to staging with observability and driver governance gates before production.
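Step 1's NUMA inventory can start from sysfs; NVLink discovery additionally needs nvidia-smi topo -m on nodes with the driver installed. A sketch assuming the standard Linux sysfs layout:

```shell
#!/bin/sh
# Print each NUMA node's cpulist from sysfs. Early RISC-V platforms may
# expose only a single flat node, which is itself useful to record.
if [ -d /sys/devices/system/node ]; then
  for n in /sys/devices/system/node/node[0-9]*; do
    printf '%s: cpus=%s\n' "$(basename "$n")" "$(cat "$n/cpulist" 2>/dev/null)"
  done
else
  echo "no NUMA sysfs topology found"
fi
```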

Closing takeaways

In 2026 the combination of RISC‑V servers and NVLink‑attached GPUs requires Kubernetes clusters to be topology conscious. The winning pattern is a three‑layer approach: (1) robust node topology export, (2) device plugins that advertise grouped topology resources, and (3) Scheduler Framework plugins that enforce and score placements with NUMA and NVLink affinity. This yields predictable performance, higher utilization, and safer driver operations.

Call to action

Ready to prototype? Start by deploying resource-topology-exporter and a lightweight device plugin on one RISC‑V + GPU node, then run a simple scheduler plugin to prefer NVLink co‑placement. If you want, grab our sample repo (search "opensources.live RISC‑V NVLink kube samples") and join the discussion on topology-aware scheduling on the Kubernetes SIGs. Share your results and contribute your device plugin patterns so the community can standardize NVLink topology conventions.
