Unleash the Beast: Building and Sharing High-Performance ROCm Kernels with Hugging Face
In the fast-paced world of AI and deep learning, custom kernels are the secret sauce that unlocks peak performance. They’re the finely tuned engines that power specialized GPU operations, whether you’re crunching through image processing, manipulating massive tensors, or tackling any compute-intensive task. However, the path to creating these kernels can often be a labyrinth of compiler errors, complex build configurations (think CMake or Nix), and the dreaded ABI incompatibilities – a frustrating detour from the core innovation you’re aiming for.
But what if there was a simpler way? Enter Hugging Face, a name synonymous with democratizing AI, and their powerful kernel-builder and kernels libraries. These tools are designed to demystify the process of building and sharing custom kernels, with robust support for a wide array of GPU and accelerator backends, including CUDA, ROCm, Metal, and XPU. The goal? To ensure your kernels are not just lightning-fast but also portable and seamlessly integrated into your PyTorch workflows.
This guide zeroes in on the exciting realm of ROCm-compatible kernels, specifically for AMD GPUs. We’ll walk you through the entire lifecycle – from crafting your kernel and configuring its build process to testing, sharing, and ultimately, deploying it. You’ll learn how to harness the full potential of AMD hardware, ensuring your AI models perform at their absolute best, all while benefiting from best practices in reproducibility, packaging, and deployment.
Consider this your streamlined, ROCm-centric roadmap. For a broader look at building production-ready CUDA kernels, you can refer to the original guide: A Guide to Building and Scaling Production-Ready CUDA Kernels.
Dive into the RadeonFlow GEMM Kernel: A Showcase of AMD GPU Prowess
To illustrate the power of this workflow, we’ll use the GEMM kernel from the RadeonFlow_Kernels project as our running example. This isn’t just any matrix multiplication; it’s a highly optimized implementation designed for the cutting-edge AMD Instinct MI300X GPU, leveraging the efficiency of FP8 (8-bit floating-point) precision.
Authors: ColorsWind, Zesen Liu, and Andy
At its core, General Matrix Multiplication (GEMM) is the bedrock of most deep learning operations. It’s the mathematical dance of multiplying two matrices, A and B, to produce a result C = A × B. The RadeonFlow GEMM kernel takes this a step further by employing FP8, a format that shrewdly trades a marginal dip in precision for significant gains in computational speed and a reduction in memory bandwidth demands. This optimization was a standout feature, earning it the 🏆 Grand Prize in the AMD Developer Challenge 2025 for its exceptional performance and innovation on AMD hardware.
This kernel operates on quantized inputs in the e4m3fnuz floating-point format, an FP8 variant with 4 exponent bits and 3 mantissa bits. The "fn" suffix means the format encodes only finite values (no infinities), and "uz" means it has a single, unsigned zero. Because FP8’s dynamic range is considerably smaller than that of FP16 or FP32, the kernel relies on per-block scaling factors (a_scale and b_scale). Each factor rescales one block of an input matrix so that its values fit the FP8 range, and the scaling is undone when the result is computed, which preserves accuracy despite the reduced precision.
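To make the format concrete, here is a minimal Python sketch that decodes a single e4m3fnuz byte by hand. It assumes the commonly documented bit layout (1 sign bit, 4 exponent bits with bias 8, 3 mantissa bits, and the bit pattern 0x80 reserved for NaN). The helper names decode_e4m3fnuz and block_scale are hypothetical and are not part of RadeonFlow_Kernels; block_scale shows one plausible way to pick a per-block factor, not necessarily the one used for the competition inputs.

```python
import math


def decode_e4m3fnuz(byte: int) -> float:
    """Decode one e4m3fnuz-encoded byte into a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    if byte == 0x80:
        # The pattern that would be "negative zero" encodes NaN instead.
        return math.nan
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0:
        # Subnormal: no implicit leading one, fixed exponent of 1 - bias.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - 8)
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 8)


# Largest finite magnitude, found by scanning every non-NaN bit pattern.
FP8_MAX = max(decode_e4m3fnuz(b) for b in range(256) if b != 0x80)


def block_scale(block: list[float]) -> float:
    """Pick a scale so that every block[i] / scale fits in [-FP8_MAX, FP8_MAX]."""
    amax = max(abs(x) for x in block)
    return amax / FP8_MAX if amax > 0 else 1.0
```

Scanning all bit patterns rather than hard-coding the maximum keeps the sketch honest about where the range limit comes from: it is a property of the encoding, and the scaling factors exist to work around it.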
The kernel’s arguments are structured as follows:
- (a, b, a_scale, b_scale, c)
  - a: Input matrix, K × M, in e4m3fnuz format.
  - b: Input matrix, K × N, in e4m3fnuz format.
  - a_scale: Scaling factors for a, dimensions (K // 128) × M, in fp32.
  - b_scale: Scaling factors for b, dimensions (K // 128) × (N // 128), in fp32.
  - c: Output matrix, M × N, in bf16 (bfloat16) format.
It’s important to note that this kernel is precompiled for specific matrix dimensions and expects a transposed memory layout, a requirement dictated by the competition. To accommodate different matrix shapes or memory layouts, modifications to the kernel launcher would be necessary.
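The shape rules in the argument list can be written down as a small helper, together with a slow pure-Python reference of what a block-scaled GEMM computes. This is a sketch under stated assumptions: it follows the K × M and K × N layout from the list above, it assumes the scaling factors are multiplied onto the stored values, and the names expected_shapes and reference_gemm are hypothetical. It is meant for reasoning about shapes on tiny inputs, not for reproducing the kernel’s numerics.

```python
def expected_shapes(m: int, n: int, k: int, block: int = 128) -> dict[str, tuple[int, int]]:
    """Return the tensor shapes implied by the argument list for a given M, N, K."""
    if k % block or n % block:
        raise ValueError("K and N must be multiples of the block size")
    return {
        "a": (k, m),
        "b": (k, n),
        "a_scale": (k // block, m),
        "b_scale": (k // block, n // block),
        "c": (m, n),
    }


def reference_gemm(a, b, a_scale, b_scale, block):
    """Block-scaled C = A^T B on nested lists; a is K x M, b is K x N."""
    k_dim, m_dim, n_dim = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n_dim for _ in range(m_dim)]
    for k in range(k_dim):
        kb = k // block  # index of the scale block along K
        for m in range(m_dim):
            a_val = a[k][m] * a_scale[kb][m]
            for n in range(n_dim):
                c[m][n] += a_val * b[k][n] * b_scale[kb][n // block]
    return c
```

For example, expected_shapes(1024, 1536, 7168) gives an a_scale of shape (56, 1024) and a b_scale of shape (56, 12), since 7168 // 128 = 56 and 1536 // 128 = 12.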
From Raw Code to Shareable Asset: The Step-by-Step Build Process
Now that we have a powerful ROCm kernel, the crucial question is: how do we integrate it into a real-world PyTorch workflow and share it with the wider community? This is precisely where Hugging Face’s kernel-builder and kernels libraries shine. They provide the structure, tools, and distribution channels to make this process remarkably smooth.
While this guide delves into technical details, you can follow along step-by-step. Understanding every nuance isn’t essential for success; you can always revisit the deeper concepts later.
Step 1: Sculpting Your Project Structure
The kernel-builder expects a well-defined project layout. This structure is key to its ability to understand and compile your code efficiently.
The Standard Layout:
project-root/
├── build.toml              # The project's manifest file
├── kernel-code/            # Directory for your raw GPU source code
│   └── your_kernel.h       # Example kernel header
├── flake.nix               # For reproducible build environments
└── torch-ext/              # Python wrapper and PyTorch integration
    ├── torch_binding.cpp
    ├── torch_binding.h
    └── python_package/
        └── __init__.py
- build.toml: Think of this as the brain of your build process. It contains all the configuration and metadata.
- kernel-code/: This is where your GPU magic resides – the raw source files that will be compiled for the target hardware.
- flake.nix: The cornerstone of reproducible builds. It locks down the exact versions of all dependencies, ensuring consistency across different machines and environments.
- torch-ext/: This directory houses the C++ code that bridges your custom kernel with PyTorch, creating native operators.
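Before invoking any build tooling, it can help to confirm that a project actually follows this layout. The sketch below is a hypothetical pre-flight check written for this guide, not a kernel-builder feature: missing_layout_entries and its list of required paths are assumptions derived from the tree above.

```python
from pathlib import Path


def missing_layout_entries(root: Path, package: str) -> list[str]:
    """Return the expected files that are absent under the project root."""
    required = [
        "build.toml",
        "flake.nix",
        "torch-ext/torch_binding.cpp",
        "torch-ext/torch_binding.h",
        f"torch-ext/{package}/__init__.py",
    ]
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # Example: check the current directory for a package named "gemm".
    for rel in missing_layout_entries(Path("."), "gemm"):
        print(f"missing: {rel}")
```

The kernel source directory is deliberately left out of the check, because its name and contents vary per project and are declared in build.toml.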
Our GEMM Kernel’s Structure:
For our RadeonFlow GEMM example, the structure is slightly more detailed to accommodate its specific components:
gemm/
├── build.toml
├── gemm
│   ├── gemm_kernel.h
│   ├── gemm_kernel_legacy.h
│   ├── transpose_kernel.h
│   └── gemm_launcher.hip   # The HIP implementation file
├── include
│   ├── clangd_workaround.h
│   ├── gpu_libs.h
│   ├── gpu_types.h
│   └── timer.h
├── src/utils
│   ├── arithmetic.h
│   └── timer.hip
├── tests/checker
│   ├── checker.cpp
│   ├── metrics.h
│   └── checker.h
├── flake.nix
└── torch-ext
    ├── torch_binding.cpp
    ├── torch_binding.h
    └── gemm
        └── __init__.py
A key convention here is the file extension for your GPU code:
- .h (Header Files): Use for kernel declarations, inline functions, or template code that will be included by other files.
- .hip (HIP Implementation Files): Use for files containing HIP/GPU code that needs separate compilation, such as kernel launchers or complex device functions.
In our GEMM example, gemm_kernel.h, gemm_kernel_legacy.h, and transpose_kernel.h are headers, while gemm_launcher.hip contains the actual executable HIP code. This clear distinction helps the kernel-builder correctly process each file.
Step 2: Configuring Your Build with build.toml and flake.nix
The build.toml Manifest: The Blueprint for Your Kernel
This file is the central orchestrator, dictating what gets compiled, how it’s compiled, and how it all fits together.
[general]
name = "gemm"
universal = false

[torch]
src = [
  "torch-ext/torch_binding.cpp",
  "torch-ext/torch_binding.h",
]

[kernel.gemm]
backend = "rocm"
rocm-archs = [ "gfx942" ] # Targets AMD MI300 series
depends = ["torch"]
src = [
  "include/clangd_workaround.h",
  "include/gpu_libs.h",
  "include/gpu_types.h",
  "include/timer.h",
  "gemm/gemm_kernel.h",
  "gemm/gemm_kernel_legacy.h",
  "gemm/gemm_launcher.hip",
  "gemm/transpose_kernel.h",
  "src/utils/arithmetic.h",
  "src/utils/timer.hip",
  "tests/checker/metrics.h",
]
include = ["include"]
- [general]:
  - name: The name of your kernel (e.g., gemm). This will be used for the Python package.
  - universal: Set to true for pure Python kernels (like Triton). Defaults to false.
- [torch]:
  - src: A list of source files and headers that form your PyTorch extension, enabling Python access.
- [kernel.gemm] (you can define multiple [kernel] sections for different kernels):
  - backend: Specifies the compute backend. We use "rocm" for AMD GPUs.
  - rocm-archs: Crucial for ROCm. Lists the target architectures. "gfx942" is specific to the MI300 series.
  - depends: Dependencies your kernel needs. Here, it’s "torch" for PyTorch tensor operations.
  - src: Lists all the source files and headers that comprise your kernel.
  - include: Directories to search for header files.
The flake.nix File: Guaranteeing Reproducibility
To ensure anyone can build your kernel on any machine, the flake.nix file is indispensable. It locks down the exact versions of the kernel-builder and all its dependencies, eliminating the "it works on my machine" syndrome.
{
  description = "Flake for GEMM kernel";

  inputs = {
    kernel-builder.url = "github:huggingface/kernel-builder";
  };

  outputs =
    {
      self,
      kernel-builder,
    }:
    kernel-builder.lib.genFlakeOutputs {
      inherit self;
      path = ./.;
    };
}
Tip: You can often adapt an existing flake.nix for your project by simply changing the description.
Crafting the GPU Code (gemm/gemm_launcher.hip)
This is where the actual GPU execution logic is defined. The gemm_launcher.hip file orchestrates the calls to the optimized GEMM kernel or a fallback implementation based on the configuration.
// ... includes and definitions ...
extern "C" void run(
    void *a, void *b, void *as, void *bs, void *c,
    int m, int n, int k,
    PerfMetrics *metrics, hipStream_t job_stream0
) {
    const __FP8_TYPE *a_ptr = static_cast<const __FP8_TYPE *>(a);
    const __FP8_TYPE *b_ptr = static_cast<const __FP8_TYPE *>(b);
    __BF16_TYPE *c_ptr = static_cast<__BF16_TYPE *>(c);
    const float *as_ptr = static_cast<const float *>(as);
    const float *bs_ptr = static_cast<const float *>(bs);
    KernelTimerScoped timer(timers, 2LL * m * n * k,
                            metrics ? &metrics->entries[0].time : nullptr,
                            metrics ? &metrics->entries[0].gflops : nullptr,
                            job_stream0);
    // Dispatch GEMM to the fastest available implementation
    switch (pack_shape(m, n, k)) {
        case 0: DISPATCH_GEMM(1024, 1536, 7168, 256, 128, 128, 4, 2, 512, 4, 16);
        case 1: DISPATCH_GEMM(6144, 7168, 2304, 256, 128, 128, 4, 2, 512, 1, 16);
        default:
            printf("Error: Unsupported shape M=%d, K=%d, N=%d\n", m, k, n);
            abort();
    }
}
// ... more code ...
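Because the launcher calls abort() on an unsupported shape, a caller may prefer to reject bad shapes in Python before the process is torn down. The sketch below mirrors only the two cases visible in the excerpt above and assumes the first three DISPATCH_GEMM arguments are M, N and K, which matches the M, N, K = 1024, 1536, 7168 values used later in this guide. SUPPORTED_SHAPES and both function names are hypothetical; the real launcher may support more shapes than this excerpt shows.

```python
# (M, N, K) triples taken from the two DISPATCH_GEMM cases in the excerpt.
SUPPORTED_SHAPES = frozenset({
    (1024, 1536, 7168),
    (6144, 7168, 2304),
})


def is_supported_shape(m: int, n: int, k: int) -> bool:
    """Return True if the excerpted launcher has a dispatch case for this shape."""
    return (m, n, k) in SUPPORTED_SHAPES


def require_supported_shape(m: int, n: int, k: int) -> None:
    """Raise a Python exception instead of letting the launcher abort."""
    if not is_supported_shape(m, n, k):
        raise ValueError(f"unsupported GEMM shape M={m}, N={n}, K={k}")
```

A ValueError can be caught and reported by the calling code, whereas abort() inside the extension ends the whole Python process.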
Registering a Native PyTorch Operator
Making your kernel available in Python is crucial. The torch-ext/torch_binding.cpp file handles this by registering your kernel as a native PyTorch operator. This means it becomes a first-class citizen within PyTorch, accessible via torch.ops.
#include <torch/all.h>
#include <torch/library.h>
#include <hip/hip_runtime.h>
#include "registration.h"
#include "torch_binding.h"

// Forward declaration of the C function from gemm_launcher.hip
extern "C" {
    struct PerfMetrics;
    void run(void *a, void *b, void *as, void *bs, void *c,
             int m, int n, int k,
             PerfMetrics *metrics, hipStream_t job_stream0);
}

void gemm(
    torch::Tensor &out,
    torch::Tensor const &a,
    torch::Tensor const &b,
    torch::Tensor const &as,
    torch::Tensor const &bs
) {
    // Validate tensor properties (device, contiguity, etc.)
    TORCH_CHECK(a.device().is_cuda(), "Input tensor a must be on GPU device");
    // ... other checks ...

    // Get matrix dimensions
    int M = a.size(0);
    int K = a.size(1);
    int N = b.size(1);
    // ... dimension validation ...

    const hipStream_t stream = 0; // Use default HIP stream
    run(a.data_ptr(), b.data_ptr(), as.data_ptr(), bs.data_ptr(), out.data_ptr(),
        M, N, K, nullptr, stream);
}

TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
    ops.def("gemm(Tensor! out, Tensor a, Tensor b, Tensor a_scale, Tensor b_scale) -> ()");
    ops.impl("gemm", torch::kCUDA, &gemm);
}

REGISTER_EXTENSION(TORCH_EXTENSION_NAME)
The torch_binding.h file simply declares the C++ function that will be called from Python.
The __init__.py Wrapper: User-Friendly Access
To make your kernel easily accessible from Python, the torch-ext/gemm/__init__.py file acts as the entry point.
from typing import Optional

import torch

from ._ops import ops  # Assuming _ops.py contains the operator registration


def gemm(
    a: torch.Tensor,
    b: torch.Tensor,
    as_: torch.Tensor,  # Renamed to avoid conflict with Python keyword
    bs: torch.Tensor,
    out: Optional[torch.Tensor] = None
) -> torch.Tensor:
    if out is None:
        # Create output tensor with appropriate shape and dtype
        M, K = a.shape
        K_b, N = b.shape
        assert K == K_b, f"Matrix dimension mismatch: A has {K} cols, B has {K_b} rows"
        # Output should be BF16 type on the same device as inputs
        out = torch.empty((M, N), dtype=torch.bfloat16, device=a.device)
    ops.gemm(out, a, b, as_, bs)
    return out
Step 3: Building Your Kernel with Nix
Nix is the engine behind reproducible builds. If you don’t have Nix installed, use the official installer on Linux or the Determinate Nix installer on macOS. Xcode 16.x is only needed when building kernels on macOS; it plays no role in ROCm builds on Linux.
Getting Started with Nix:
Update Dependencies:
nix flake update
This creates a flake.lock file, pinning all dependencies. Commit both flake.nix and flake.lock to your repository for guaranteed reproducibility.
Leverage the Hugging Face Cache:
To speed up builds and avoid redundant compilation, configure the Hugging Face cache:
# Install and configure cachix permanently
cachix use huggingface
# Or use it temporarily without installation
nix run nixpkgs#cachix -- use huggingface
Building Kernels:
With flake.nix in place, building is straightforward:
cd Build_RadeonFlow_Kernels/gemm
nix build . -L
The compiled kernel artifacts will be located in the local build/ directory.
Development Shell for Local Iteration:
For a seamless development experience, kernel-builder offers dedicated development shells where all dependencies are pre-configured:
$ nix develop
Inside this shell, you can generate project files, build, and even install your kernel as a local Python package:
$ build2cmake generate-torch build.toml
$ cmake -B build-ext
$ cmake --build build-ext
$ pip install --no-build-isolation -e .
To target specific PyTorch and ROCm versions, you can use specific shell environments, for example:
$ rm -rf .venv # Clean previous virtual environment if any
$ nix develop .#devShells.torch27-cxx11-rocm63-x86_64-linux
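Shell and build variant names such as torch27-cxx11-rocm63-x86_64-linux pack several version choices into one string. The sketch below splits such a name into its parts, assuming the pattern torch<version>-<C++ ABI>-<backend><version>-<system> and assuming the first digit of each version is the major number (so 27 reads as 2.7 and 63 as 6.3). That interpretation is inferred from the example above, and BuildVariant and parse_variant are hypothetical names.

```python
import re
from typing import NamedTuple


class BuildVariant(NamedTuple):
    torch: str
    cxx_abi: str
    backend: str
    backend_version: str
    system: str


_VARIANT = re.compile(
    r"^torch(?P<torch>\d{2,})-(?P<abi>cxx\d+)-"
    r"(?P<backend>[a-z]+)(?P<version>\d{2,})-(?P<system>.+)$"
)


def _dotted(digits: str) -> str:
    """Turn '27' into '2.7', assuming a single-digit major version."""
    return f"{digits[0]}.{digits[1:]}"


def parse_variant(name: str) -> BuildVariant:
    """Split a variant name into PyTorch, ABI, backend and system parts."""
    match = _VARIANT.match(name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    return BuildVariant(
        torch=_dotted(match["torch"]),
        cxx_abi=match["abi"],
        backend=match["backend"],
        backend_version=_dotted(match["version"]),
        system=match["system"],
    )
```

Names that do not fit the assumed pattern raise a ValueError instead of being guessed at, which keeps the helper safe to use on directory listings that contain other entries.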
Step 4: Preparing and Uploading to the Hugging Face Hub
Before sharing, it’s good practice to clean up any development artifacts:
build2cmake clean build.toml
Building for All Supported Versions:
The kernel-builder tool automates building for all compatible PyTorch and ROCm versions:
nix build . -L
Note: This process can be lengthy as it compiles for multiple configurations. The output will be in the result/ directory.
Organizing Build Artifacts:
Move the compiled results into the expected build/ directory, which the kernels library will scan:
mkdir -p build
rsync -av --delete --chmod=Du+w,Fu+w result/ build/
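After copying, a quick inventory of build/ can confirm that each variant actually produced a shared library before anything is pushed. This sketch assumes build/ holds one subdirectory per build variant with .so files somewhere beneath it, which is consistent with the *.so tracking rule used for Git LFS later in this guide; the exact directory layout produced by kernel-builder is not verified here, and variants_with_binaries is a hypothetical helper.

```python
from pathlib import Path


def variants_with_binaries(build_dir: Path) -> dict[str, list[str]]:
    """Map each variant directory name to the .so files found beneath it."""
    inventory: dict[str, list[str]] = {}
    for variant in sorted(p for p in build_dir.iterdir() if p.is_dir()):
        libs = sorted(so.relative_to(variant).as_posix() for so in variant.rglob("*.so"))
        inventory[variant.name] = libs
    return inventory


if __name__ == "__main__":
    # Example: flag variants under ./build that contain no shared library.
    build = Path("build")
    if build.is_dir():
        for name, libs in variants_with_binaries(build).items():
            status = ", ".join(libs) if libs else "NO SHARED LIBRARY FOUND"
            print(f"{name}: {status}")
```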
Pushing to the Hugging Face Hub:
Sharing your kernel on the Hub makes it instantly accessible to the community.
Create a Repository:
hf repo create gemm
Ensure you’re logged in via huggingface-cli login.
Connect and Push Your Project:
# Initialize git and connect to the Hub repository
git init
git remote add origin https://huggingface.co/<your-username>/gemm
git pull origin main
git lfs install
git checkout -b main
# Configure Git LFS for binary files
git lfs track "*.so"
# Add and commit necessary files
git add \
  build/ \
  gemm/ \
  include/ \
  src/utils \
  tests/checker \
  torch-ext/torch_binding.cpp \
  torch-ext/torch_binding.h \
  torch-ext/gemm \
  flake.nix \
  flake.lock \
  build.toml
git commit -m "feat: Created a compliant gemm kernel"
git push -u origin main
Congratulations! Your high-performance ROCm kernel is now live on the Hugging Face Hub, ready for seamless integration.
Step 5: Effortless Integration with the kernels Library
The beauty of the Hugging Face kernels library is that you don’t "install" kernels in the traditional sense. You load them directly from their Hub repository, which automatically registers the new operator.
import torch
from kernels import get_kernel

# Load the kernel from the Hub (replace with your username)
# gemm_kernel = get_kernel("kernels-community/gemm")  # Example if using community repo
gemm_kernel = get_kernel("<your-username>/gemm")  # Load your specific kernel

# Matrix dimensions (must be supported by the kernel's launcher)
M, N, K = 1024, 1536, 7168
QUANT_SIZE = 128

# Setup device: ROCm builds of PyTorch expose AMD GPUs under the "cuda" device type
device = torch.device("cuda")

# Create input tensors. The binding shown earlier reads M and K from a.size(0) and
# a.size(1), and N from b.size(1), so A is created as (M, K) and B as (K, N).
A_fp32 = torch.randn(M, K, device=device)
B_fp32 = torch.randn(K, N, device=device)

# Convert to FP8
A_fp8 = A_fp32.to(torch.float8_e4m3fnuz)
B_fp8 = B_fp32.to(torch.float8_e4m3fnuz)

# Create scale factors (uniform scaling example)
A_scale = torch.ones(K // QUANT_SIZE, M, device=device, dtype=torch.float32)
B_scale = torch.ones(K // QUANT_SIZE, N // QUANT_SIZE, device=device, dtype=torch.float32)

# Create output tensor
C = torch.zeros(M, N, device=device, dtype=torch.bfloat16)

# Use the kernel through the Python wrapper from __init__.py,
# whose signature is gemm(a, b, as_, bs, out=None)
result = gemm_kernel.gemm(A_fp8, B_fp8, A_scale, B_scale, C)
print("ROCm GEMM kernel executed successfully!")
Note: The binding in torch_binding.cpp reads M and K from a.size(0) and a.size(1) and N from b.size(1), while the argument list earlier describes a as K × M. Verify the expected shapes and memory layout against gemm_launcher.hip in the RadeonFlow_Kernels project before relying on the output for your own data.
And that’s it! Your custom ROCm kernel is now seamlessly integrated and ready to accelerate your deep learning tasks.
Conclusion: Revolutionizing ROCm Kernel Development
Building and sharing ROCm kernels for AMD GPUs has never been more accessible. Thanks to Hugging Face’s kernel-builder and kernels libraries, coupled with Nix for robust reproducibility, developers can shift their focus from arduous build configurations and compatibility headaches to the core task of performance optimization. Once built and published on the Hugging Face Hub, your custom kernels become instantly available to the community, deployable across projects with minimal effort, and ready to drive significant performance gains.
Related Resources:
- kernel-builder: Your go-to tool for building and compiling custom kernels.
- kernels: The library for managing and loading kernels from the Hugging Face Hub.
- Kernels Community Hub: Discover and share a growing collection of community-contributed kernels.