Building Open Drivers for NVLink on RISC‑V: Where to Start


2026-02-24

A practical roadmap for open-source developers to get NVLink running on RISC‑V: specs, kernel hooks, runtime shims, Nvidia constraints and a contributor checklist.

As AI workloads push memory-bandwidth and interconnect demands, teams building on RISC‑V platforms face a familiar but painful gap: the hardware link (NVLink) that binds host CPUs and Nvidia GPUs is now available on RISC‑V silicon (SiFive announced NVLink Fusion integration in late 2025), but the software stack to make it production-ready is not. If you’re an open-source developer or maintainer asking "where do I start?" — this roadmap gives you a clear, practical path: the specs you need, the kernel hooks to implement, the user-space runtime work, the constraints imposed by Nvidia's driver model, and a contributor checklist so you and your team can make measurable progress.

Executive summary

Short version: SiFive’s NVLink Fusion silicon will expose PCIe-like and dedicated NVLink PHYs on RISC‑V platforms. To make NVLink usable you need work across three layers:

  • Kernel/platform — PCIe host, IOMMU, MSI/MSI‑X, device-tree/ACPI, DMA mapping, VFIO and a kernel driver to manage NVLink link bring‑up and firmware.
  • Kernel driver for the GPU device — either upstream Nouveau extension or a new open driver that integrates NVLink link management, peer-to-peer DMA, and DMA‑BUF/export support.
  • User-space runtimes — runtime shims (CUDA ABI compatibility layers, OpenCL/SYCL backends, or a new OpenGPU runtime) to expose unified APIs for allocation, peer-to-peer transfers, and topology discovery.

Expect significant constraints: Nvidia’s official stack (CUDA + proprietary kernel modules) is closed-source and largely x86/arm-targeted; binary-only drivers for riscv64 are not available as of early 2026. Your open-source path will rely on upstream kernel work (PCIe/IOMMU and device model support), extending Nouveau or building a community driver, and practical user-space shims to support existing ML frameworks.

2026 context: why this is timely

Two trends make this the moment to act. First, SiFive’s 2025 announcement of NVLink Fusion support in its IP portfolio means RISC‑V SoCs are being designed with high-speed GPU interconnects in mind. Second, the industry pressure for open stacks in AI infrastructure — driven by data center operators and academia — has increased the appetite for community drivers and vendor-neutral runtimes. If you can deliver a working open pathway now, you’ll position your project for adoption by early RISC‑V AI platforms.

What to collect before you write a single line of code

Starting without the right artifacts wastes months. Collect these items first.

Specifications & documentation

  • NVLink Fusion spec or engineering agreements from your SoC vendor (SiFive) describing PHYs, link training, and management registers.
  • PCIe host controller spec used on your RISC‑V board and associated device-tree bindings or ACPI entries.
  • IOMMU spec used by the platform (RISC‑V platforms use varying IOMMU implementations; get vendor docs on translation structure and API semantics).
  • GPU hardware doc — register map, DMA capabilities, peer‑to‑peer behavior and any firmware blobs required to power the GPU.

Hardware and testbed

  • Development board with the NVLink‑enabled RISC‑V SoC and an NVLink‑capable Nvidia GPU.
  • An FPGA-based PCIe interposer or logic analyzer to trace link bring‑up and training if you need to reverse-engineer firmware interactions.
  • A test OS image: riscv64 Linux built from mainline kernel (6.x or later in 2026) with debug symbol support.
  • Confirm any NDA limits before using vendor-only firmware images in your public repo.
  • Plan contributor license agreements (CLAs) and a governance model if you intend to accept community patches.

Kernel-level work: hooks and architectures you must address

The kernel is the most complex piece. The goals are link management, secure DMA, peer-to-peer transfers and exposing topology to user-space. Below are the specific kernel subsystems and actionable tasks.

1) PCIe host/controller & device enumeration

  • Ensure the platform's PCIe host controller driver is upstream or acceptable for mainline. Enable MSI/MSI‑X support and ensure riscv64 platform IRQ mapping matches the GPU’s MSI vectors.
  • Provide correct device-tree nodes or ACPI entries for the NVLink‑attached GPU. Example DT snippet for a PCIe root port on RISC‑V:
/ {
  soc {
    pcie0: pci@40000000 {
      compatible = "snps,dw-pcie";
      reg = <0x0 0x40000000 0x0 0x10000000>;
      #address-cells = <3>;
      #size-cells = <2>;
      ranges;
    };
  };
};

2) IOMMU and DMA mapping

GPUs perform large DMA transfers; without a properly configured IOMMU you cannot guarantee address mapping or isolation. Tasks:

  • Port or enable the platform IOMMU driver into mainline; expose it through the Linux iommu subsystem so vfio and device drivers can attach.
  • Test DMA coherence and ensure cache syncs on RISC‑V cores — some platforms require explicit cache maintenance before DMA.

3) NVLink link management

NVLink requires link training and error handling. Depending on how SiFive exposes NVLink, it may be presented as a separate PHY controlled via MMIO registers that your kernel module must manage:

  • Add a kernel driver that exposes link up/down, lane training status, and health counters via sysfs and devlink (for diagnostics).
  • Integrate with the kernel's devfreq/power framework for link power management if needed.

4) VFIO / DMA‑BUF and peer‑to‑peer

For user-space runtimes to move pages between host and GPU (and GPU-to-GPU), you need DMA‑BUF exporters and VFIO to mediate access.

  • Expose GPU BARs through VFIO so user-space can mmap properly and manage DMA mappings using IOMMU API.
  • Implement DMA‑BUF exporter support in the GPU driver so buffers can be shared across devices over NVLink.

5) Topology reporting

User-space runtimes need to know which GPU is reachable via NVLink and what the topology/latency is.

  • Add sysfs or udev attributes reporting NVLink-connected peers and bandwidth. Consider integration with libnuma/topology APIs.

GPU driver strategy: Nouveau vs new open driver

There are two primary paths: extend Nouveau (existing open-source Nvidia driver) or create a new driver that targets modern GPU families and NVLink on RISC‑V.

Extending Nouveau

  • Pros: community familiarity, already integrates with DRM, DMA‑BUF and modesetting.
  • Cons: Nouveau historically lacks complete support for recent Nvidia hardware; adding NVLink support for modern GPUs may require lots of reverse engineering and firmware work.

New open driver

  • Pros: design for modern ABI, explicit NVLink and RISC‑V needs — avoids legacy quirks.
  • Cons: larger initial effort; you must implement DRM/KMS if you need display, or focus purely on the compute path and skip KMS to limit scope.

A pragmatic sequencing:

  1. Start with compute-only driver paths: expose PCI BARs, DMA, DMA‑BUF and peer‑to‑peer primitives without KMS. This reduces scope and aligns with ML/AI use-cases.
  2. Add topology & link management in the kernel; integrate with user-space runtimes (next section).

User-space runtimes: bridging app expectations and new hardware

The biggest practical blocker to adoption is user-space: ML frameworks expect CUDA semantics, yet CUDA and Nvidia-provided driver stacks are binary and platform-limited. Your options in 2026 are pragmatic shims, open runtimes, and ABI translation layers.

Option A — runtime shim / ABI compatibility layer

  • Implement a lightweight libcuda.so shim that maps common CUDA driver API calls to an open backend (e.g., OpenCL or a native runtime). Prioritize APIs used by major frameworks (memory alloc/free, memcpy, peer-to-peer, event synchronization).
  • Examples: intercept cuMemAlloc and translate to your runtime’s allocation + DMA‑BUF export.

Option B — support OpenCL/SYCL and push framework adaptation

  • Port key workloads to OpenCL or SYCL where possible. Promote OpenXLA or direct backend plugins for PyTorch/XLA to use your runtime.

Option C — community runtime (longer-term)

  • Work on an open runtime ("OpenGPU") that implements a minimal GPU driver model for compute, supporting NVLink peer-to-peer, unified memory semantics and multi-device collectives.

Nvidia constraints and realities

Be realistic about what’s possible without proprietary support.

  • No official riscv64 binaries: As of 2026, Nvidia distributes drivers targeting x86_64 and aarch64 (ARM64); riscv64 binary kernel modules are not provided. That blocks direct use of Nvidia's kernel modules on RISC‑V.
  • Closed CUDA ABI — CUDA user-space libraries are proprietary and expect the proprietary kernel module stack to be present for many features; a shim cannot cover everything.
  • Firmware/secure boot — some GPUs require firmware blobs and signed microcode. Verify whether your hardware can load open firmware or if the vendor supplies necessary blobs under a redistribution-friendly license.

Practically, that means an open-source route will initially focus on features accessible without full CUDA-driver parity (peer-to-peer DMA, unified memory within a controlled runtime, and explicit transfers). Aim to provide enough compatibility for research and early deployments rather than 100% CUDA feature parity.

Concrete development checklist for contributors

Use this checklist to track deliverables and to structure PRs and community milestones.

  1. Scoping & governance
    • Create a project README with scope, goals and contribution guidelines.
    • Decide on a license (MIT or Apache 2.0 for user-space; GPL‑2.0 or dual MIT/GPL for kernel patches, since upstream kernel code must be GPLv2-compatible).
  2. Obtain specs & hardware
    • Secure FPGA/board and an NVLink GPU; obtain NVLink Fusion docs from your SoC vendor.
    • File vendor requests for missing docs and mark NDA‑protected materials clearly.
  3. Platform bring-up
    • Ensure PCIe host and IOMMU drivers are upstream or maintainable as a stable tree. Open PRs for any missing device-tree bindings.
    • Implement IRQ/MSI mappings and validate with lspci/msi tests.
  4. Kernel NVLink driver
    • Produce an initial kernel module that exposes link state and MMIO control via sysfs and devlink. Submit small, focused patches upstream.
    • Test link training, error counters and link reset flows under stress tests.
  5. GPU compute driver
    • Start compute-only: map BARs, support DMA, add DMA‑BUF exporters and VFIO attachment.
    • Expose simple ioctl/uAPI for queue submission and buffer registration for early framework experiments.
  6. User-space runtime
    • Ship a libcuda shim that supports memory alloc/free, memcpy and peer-to-peer commands required by a chosen minimal benchmark (e.g., NCCL-like collectives or a basic PyTorch kernel).
    • Provide example patches to PyTorch/TF showing how to target the backend for basic workloads.
  7. Testing & CI
    • Automate kernel build/test with QEMU (for unit tests) and a hardware-in-the-loop lane for integration tests.
    • Performance benchmarks: report NVLink throughput, latency and end-to-end ML training speed for a small model.
  8. Documentation & demos
    • Publish step‑by‑step guides: building kernel, flashing platform, running the runtime sample and interpreting topology outputs.
    • Record a tutorial video or live demo to lower onboarding friction for contributors and operators.

The following is a minimal kernel module skeleton to expose NVLink link status via sysfs. This is a conceptual starting point — adapt to your register map and error handling.

#include <linux/io.h>
#include <linux/module.h>
#include <linux/platform_device.h>

/* Offset of the link-status register in the NVLink MMIO block
 * (placeholder until the real register map is available). */
#define NVLINK_LINK_STATUS 0x0

static ssize_t link_status_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
{
    void __iomem *base = dev_get_drvdata(dev);

    return sysfs_emit(buf, "0x%x\n", readl(base + NVLINK_LINK_STATUS));
}
static DEVICE_ATTR_RO(link_status);

static int nvlink_probe(struct platform_device *pdev)
{
    void __iomem *base;

    /* Map the first MMIO resource described by the device-tree node. */
    base = devm_platform_ioremap_resource(pdev, 0);
    if (IS_ERR(base))
        return PTR_ERR(base);
    dev_set_drvdata(&pdev->dev, base);

    return device_create_file(&pdev->dev, &dev_attr_link_status);
}

static int nvlink_remove(struct platform_device *pdev)
{
    device_remove_file(&pdev->dev, &dev_attr_link_status);
    return 0;
}

static struct platform_driver nvlink_driver = {
    .probe = nvlink_probe,
    .remove = nvlink_remove,
    .driver = {
        .name = "nvlink-sifive",
    },
};
module_platform_driver(nvlink_driver);

MODULE_DESCRIPTION("Minimal NVLink link-status skeleton");
MODULE_LICENSE("GPL");

Testing strategy and benchmarks

You need quantitative metrics to validate progress. Recommended tests:

  • PCIe enumeration & IOMMU attach tests (lspci -vv, dmesg validation).
  • Link validation: synthetic link training stress to surface lane errors and reset behavior.
  • Throughput: DMA copy between GPUs over NVLink using small/large transfers; measure bandwidth and CPU utilization.
  • End-to-end ML: train a small transformer or ResNet model across NVLink-connected GPUs and compare iteration throughput against an x86 setup (where available).

Community & collaboration: where to get help

You won’t build this alone. Key allies:

  • SiFive engineering contacts — for NVLink Fusion register maps, bring-up advice and reference boards.
  • Nouveau and kernel DRM maintainers — open early design discussions and small patches to solicit feedback.
  • RISC‑V community (riscv-linux mailing lists) — for platform kernel and device-tree best practices.
  • Open ML runtime projects (OpenXLA, SYCL communities) — to explore runtime integrations.

Future predictions and long-term view (2026–2028)

In 2026 we’re at an inflection point: silicon vendors plan NVLink-capable RISC‑V parts, cloud providers are evaluating open stacks, and the community is pushing for non-proprietary runtimes. Expect a two-track evolution over the next 24 months:

  • Short term (2026): community-driven kernel drivers and runtime shims will enable research workloads and early proofs of concept. Performance parity with proprietary stacks is unlikely initially.
  • Medium term (2027–2028): as more vendors adopt NVLink-like interconnects and upstream kernel code matures, expect better feature coverage and potential vendor contributions of firmware and documentation for safe open implementations.

"The best route to usable NVLink on RISC‑V is incremental: secure the hardware contracts, land platform kernel hooks, expose deterministic DMA semantics, then iterate on runtime compatibility." — Community playbook, 2026

Actionable takeaways — start here this week

  • Contact your SiFive rep and request NVLink Fusion register docs and recommended board designs.
  • Spin up a riscv64 Linux build (mainline 6.x) and validate PCIe enumeration for any attached GPU today.
  • Open a GitHub project and publish your scope, license, and a one‑page roadmap to attract contributors.
  • Prototype a minimal kernel NVLink driver that exposes link status via sysfs and submit it for review early; small patches get attention faster.
  • Write a libcuda shim covering the 20 most-used CUDA driver calls for your target framework and publish test cases (ML model training script) to prove viability.

Final checklist before you merge to mainline

  1. PCIe/IOMMU integration passes upstream maintainers' tests.
  2. Kernel NVLink driver reviewed and documented (devlink/sysfs interfaces stable).
  3. DMA‑BUF and VFIO support implemented and tested for buffer sharing.
  4. User-space runtime can perform multi-GPU memcpy and a representative ML workload runs end-to-end.
  5. Performance benchmarks and test reports published; any vendor-supplied blobs are accounted for legally.

Call to action

If you’re ready to make NVLink on RISC‑V real, start small: open a repo with your hardware details and a one-page roadmap, push a kernel NVLink sysfs prototype, and publish a minimal libcuda shim with tests. Share your progress on the Nouveau and riscv-linux lists and tag SiFive — the faster the community iterates, the faster RISC‑V can host production AI stacks. Want a contributor checklist template or a PR review checklist tailored to your board? Reach out on the project page and we’ll publish a starter CI pipeline and test harness for your first upstream patches.
