The Dawn of Transformers v5: A Giant Leap for the AI Ecosystem
Five years is an eon in the fast-moving world of Artificial Intelligence. Back in November 2020, the release candidate for Transformers v4.0.0 marked a significant milestone. Today, December 1, 2025, we’re not just releasing an update; we’re ushering in a new era with the launch of Transformers v5.0.0rc-0. The numbers tell the story: the library now sees over 3 million daily installations via pip, a monumental leap from the 20,000 daily installs of the v4 era, and has surpassed 1.2 billion installations in total. This isn’t just growth; it’s a testament to the democratization of AI and the power of open-source collaboration.
The evolution is also reflected in the sheer breadth of supported models. From 40 architectures in v4, we now boast over 400, and the Hugging Face Hub, a vibrant marketplace of AI innovation, hosts more than 750,000 community-contributed model checkpoints compatible with Transformers – a dramatic increase from the roughly 1,000 available at the time of v4. This exponential expansion is fueled by the rapid advancements in AI and its mainstream adoption. As a leading library for model definitions, continuous evolution and adaptation are not just beneficial; they are essential for continued relevance. In the dynamic realm of AI, reinvention is the key to longevity.
We’re incredibly fortunate to collaborate with a vast and growing ecosystem of libraries and applications built on Transformers, including llama.cpp, MLX, onnxruntime, Jan, LMStudio, vLLM, SGLang, Unsloth, LlamaFactory, dLLM, MaxText, TensorRT, and Argmax, alongside countless other friends. For v5, our focus has been sharp: simplicity, training, inference, and production. We’re thrilled to share the journey and the work that has gone into each of these areas.
Simplicity at its Core: Clean Code, Clear Understanding
At the heart of the v5 development, simplicity has been our guiding star. For us, working on Transformers, the code itself is a product. We believe that our model integrations should be elegant and transparent. This clarity allows the broader AI ecosystem to not only depend on our model definitions but also to truly understand the inner workings of these powerful architectures, how they differentiate, and what makes each new model unique. Simplicity, in turn, breeds wider standardization, greater generality, and ultimately, more robust and widespread support. It’s the bedrock upon which innovation is built.
Model Additions: Scaling the Source of Truth
Transformers serves as the fundamental backbone for hundreds of thousands of projects. For instance, Unsloth, a vital player in efficient model training, relies heavily on Transformers. "We build on Transformers to help people fine-tune and train models efficiently, whether that’s BERT, text-to-speech (TTS), or others; to run fast inference for reinforcement learning (RL) even when models aren’t yet supported in other libraries. We’re excited for Transformers v5 and are super happy to be working with the Hugging Face team!" shares Michael Han at Unsloth.
At its core, Transformers remains a comprehensive model architecture toolkit. Our aspiration is to house all recent architectures and to be the definitive "source of truth" for model definitions. The dedication to this mission is evident: for the past five years, we’ve been integrating, on average, 1 to 3 new models every week. This consistent influx of innovation has been driven by an ongoing effort to refine and streamline the model addition process.
A Modular Approach: Building for the Future
Over the past year, we’ve made a significant commitment to a modular design philosophy. This has been a monumental step forward, enabling easier maintenance, accelerated integration, and fostering deeper collaboration within the community. We’ve shared a more in-depth perspective on this in our "Maintain the Unmaintainable" blog post. In essence, our goal is to create a significantly simpler model contribution process and to drastically reduce the maintenance burden. One compelling metric that underscores this success is the substantial reduction in lines of code required for contributions and reviews when employing a modular approach.
While we deeply respect the "one model, one file" ethos, we’ve also introduced abstractions to simplify the management of common functionality. A prime example is the AttentionInterface, a centralized abstraction layer for the various attention mechanisms. The eager implementation will continue to reside within the modeling file, while others, such as FlashAttention 1/2/3, FlexAttention, and SDPA, have moved to this dedicated interface. Wing Lian from Axolotl highlights the impact: "Over the past couple of years, the increasing amount of 0-day support for new model architectures and standardization of attention handling has helped to simplify our support for post-training modern LLMs."
Tooling for Model Conversion: Automating Innovation
We are actively developing tools to intelligently identify existing model architectures that new models resemble. This feature leverages machine learning to pinpoint code similarities across independent modeling files. Our ultimate aim is to fully automate the model conversion process by automatically generating draft pull requests for integrating new models into the Transformers format. This initiative promises to significantly reduce manual effort and ensure a high degree of consistency across all model integrations.
Code Reduction and Streamlining: Efficiency Unleashed
Streamlining Modeling & Tokenization/Processing Files
We’ve undertaken a substantial refactoring of our modeling and tokenization files. The modular approach has dramatically improved our modeling files, complemented by a drive towards standardization across all models. This standardization abstracts away much of the boilerplate code, allowing modeling files to focus solely on the essential forward and backward passes of a model.
Concurrently, we’re simplifying tokenization and processing files. Moving forward, our primary focus will be on the tokenizers backend, phasing out the distinction between "fast" and "slow" tokenizers. We will champion the tokenizers library as our main backend, mirroring our approach for PyTorch-based models. SentencePiece- and MistralCommon-backed tokenizers will still be supported, though as non-default options. Similarly, image processors will now exclusively feature their fast variant, which relies on the torchvision backend.
Furthermore, we are sunsetting our Flax/TensorFlow support to concentrate our efforts on PyTorch as the sole backend. However, we are actively collaborating with partners in the JAX ecosystem to ensure seamless compatibility between our models and their frameworks. Matt White, Executive Director of the PyTorch Foundation and GM of AI at the Linux Foundation, emphasizes this shift: "With its v5 release, transformers is going all in on PyTorch. Transformers acts as a source of truth and foundation for modeling across the field; we’ve been working with the team to ensure good performance across the stack. We’re excited to continue pushing for this in the future across training, inference, and deployment."
Training: Empowering Scale and Fine-tuning
Training remains a paramount focus for the v5 release. While our previous efforts heavily emphasized fine-tuning, we’ve recently invested significant resources to bolster our support for large-scale pre-training and full-model training.
Pre-training at Scale: Ready for the Big Leagues
Supporting pre-training at scale has necessitated a complete overhaul of our model initialization processes. We’ve ensured that our models perform optimally across various parallelism paradigms and have introduced support for optimized kernels for both forward and backward passes. Looking ahead, we’re thrilled to announce extended compatibility with established pre-training tools like Torchtitan, Megatron, and Nanotron, as well as an open invitation to any other pre-training tool interested in collaboration.
Fine-tuning & Post-training: A Collaborative Ecosystem
Our close collaboration with fine-tuning tools across the Python ecosystem continues to be a cornerstone of our strategy. We are committed to providing model implementations that are seamlessly compatible with Unsloth, Axolotl, LlamaFactory, TRL, and other key players in the PyTorch ecosystem. Beyond that, we are actively working with tools like MaxText in the JAX ecosystem to ensure robust interoperability between their frameworks and Transformers. All fine-tuning and post-training tools can now confidently rely on Transformers for accurate model definitions, further enabling powerful agentic use-cases through platforms like OpenEnv or the Prime Environment Hub.
Inference: Speed, Simplicity, and Scalability
For v5, we’re placing a significant emphasis on inference, introducing several paradigm shifts. These include the integration of specialized kernels, cleaner default configurations, new APIs, and enhanced support for optimized inference engines.
Similar to our training advancements, we’ve dedicated efforts to packaging kernels that are automatically utilized when your hardware and software configurations permit. If you’re new to the concept of kernels, we highly recommend exploring this documentation for a deeper understanding.
New Inference APIs: Continuous Batching and Paged Attention
Alongside these kernel enhancements, we’re introducing two powerful new APIs dedicated to inference: continuous batching and paged attention. These mechanisms have been undergoing internal testing and refinement, and we’re now focused on polishing the final details and developing comprehensive usage guides.
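Since the final API surface is still being polished, here is a deliberately library-independent toy illustration of the scheduling idea behind continuous batching: a finished sequence frees its batch slot immediately and a waiting request takes its place, instead of the whole batch blocking on its longest member. The function and its request format are invented for the example, not the Transformers API.

```python
from collections import deque

def continuous_batching(requests, max_batch_size):
    """Toy scheduler. Each request is (id, decode_steps); returns, per step,
    the sorted ids that occupied a batch slot at that step."""
    waiting = deque(requests)
    running = {}   # id -> remaining decode steps
    timeline = []
    while waiting or running:
        # Fill any free slots from the waiting queue (the "continuous" part).
        while waiting and len(running) < max_batch_size:
            rid, steps = waiting.popleft()
            running[rid] = steps
        timeline.append(sorted(running))
        # One decode step for every running sequence.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot freed immediately
    return timeline

# Three requests of different lengths sharing a batch of two:
print(continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2))
# [['a', 'b'], ['b', 'c'], ['b', 'c']]
```

With static batching, "c" could not start until both "a" and "b" had finished; here it starts the moment "a"'s slot frees up. Paged attention complements this by allocating the KV cache in fixed-size blocks, so slots of different lengths can share memory without fragmentation.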
Introducing transformers.serve: Your OpenAI-Compatible Serving Solution
We are proud to introduce transformers.serve, a new serving system designed specifically for Transformers. It deploys an OpenAI API-compatible server, a major stride forward for use-cases like large-scale evaluation, where numerous inference requests are processed concurrently. Our goal is not to replicate the specialized optimizations of dedicated inference engines like vLLM, SGLang, or TensorRT LLM; instead, we aim for full interoperability with these engines, as detailed in the following section.
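As a sketch, serving and querying might look like the following; the model name is illustrative, and the exact CLI flags and default port are assumptions to verify against the release notes.

```shell
# Start an OpenAI-compatible server (flags and port are illustrative).
transformers serve

# From another shell, query it with the standard chat-completions schema.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-0.5B-Instruct",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```

Because the wire format follows the OpenAI schema, existing OpenAI client SDKs can target the server by simply overriding the base URL.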
Simon Mo and Harry Mellor from vLLM highlight the impact: "The Transformers backend in vLLM has been very enabling to get more architectures, like BERT and other encoders, available to more users. We’ve been working with the Transformers team to ensure many models are available across modalities with the best performance possible. This is just the start of our collaboration: we’re happy to see the Transformers team will have this as a focus going into version 5."
Chenyang Zhao at SGLang echoes this sentiment: "Standardization is key to accelerating AI innovation. Transformers v5 empowers the SGLang team to spend less time on model reimplementation and more time on kernel optimization. We look forward to building a more efficient and unified AI ecosystem together!"
Production & Local Deployment: Bridging the Gap
We’ve been working hand-in-hand with the most popular inference engines, empowering them to leverage Transformers as their backend. This collaboration offers significant value: as soon as a new model is integrated into Transformers, it immediately becomes accessible within these inference engines, benefiting from their unique strengths such as inference optimizations, specialized kernels, and dynamic batching.
Our close partnerships with ONNXRuntime, llama.cpp, and MLX ensure exceptional interoperability between Transformers and these modeling libraries. For instance, thanks to a monumental community effort, loading GGUF files directly into Transformers for further fine-tuning is now remarkably straightforward. Conversely, Transformers models can be effortlessly converted into GGUF files for seamless use with llama.cpp.
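The GGUF direction can be sketched as follows: passing `gguf_file` to `from_pretrained` dequantizes the checkpoint into a regular Transformers model, ready for further fine-tuning. The repository and file names below are placeholders, not real checkpoints.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder names: substitute a real GGUF checkpoint from the Hub.
repo_id = "example-org/example-model-gguf"
gguf_file = "example-model.Q4_K_M.gguf"

# The GGUF weights are dequantized into a standard PyTorch model.
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)
```

For the reverse direction, llama.cpp ships conversion scripts (e.g., its `convert_hf_to_gguf.py`) that turn a Transformers checkpoint back into a GGUF file.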
Georgi Gerganov, from ggml-org, states: "The Transformers framework is the go-to place for reference AI model implementations. The framework plays a crucial role in enabling modern AI across the entire stack. The team and the community behind the project truly understand and embrace the spirit of open-source development and collaboration."
The same level of integration applies to MLX, where Transformers’ safetensors files are directly compatible with MLX models. Awni Hannun from MLX emphasizes the impact: "It’s hard to overstate the importance of Transformers (and datasets, tokenizers, etc) to the open-source and overall AI ecosystem. I can’t count the number of times I’ve personally used Transformers as a source-of-truth."
Finally, we’re pushing the boundaries of local inference, collaborating closely with the Executorch team to bring Transformers models to on-device deployment. Our efforts are expanding to encompass multimodal models, including vision and audio capabilities.
Quantization: Making AI Accessible and Efficient
Quantization is rapidly emerging as the de facto standard for state-of-the-art model development. A growing number of cutting-edge models are now released in low-precision formats, such as 8-bit and 4-bit (e.g., GPT-OSS, Kimi-K2, Deepseek-r1). Hardware is increasingly optimized for low-precision workloads, and the community is actively sharing high-quality quantized checkpoints. In v5, we’re elevating quantization to a central focus within Transformers, ensuring comprehensive compatibility with all major features and providing a reliable framework for both training and inference.
We’ve introduced a significant change in how model weights are loaded, fundamentally positioning quantization as a first-class citizen. Jerry Zhang at TorchAO notes the productive collaboration: "Our collaboration with the Transformers team was highly productive, marked by their proactive code reviews, feedback, and technical expertise. Their support was crucial in integrating TorchAO, expanding quantization features, and improving documentation for broader adoption in the V5."
Matthew Douglas and Titus von Koeller from bitsandbytes express their enthusiasm: "We’re excited that v5 has made quantization a first-class citizen. It provides the foundation for bitsandbytes to better support key features like TP and MoEs, and also makes it easier to integrate new quantization methods."
Conclusion: A Promise of Interoperability and Continued Innovation
The overarching theme of this v5 release is unequivocally "interoperability." Every refactor, performance enhancement, and standardization effort has been meticulously aligned with this principle. v5 is designed to seamlessly integrate and work end-to-end with the burgeoning AI ecosystem: you can train a model using Unsloth, Axolotl, LlamaFactory, or MaxText, deploy it with vLLM or SGLang, and then export it to llama.cpp, Executorch, or MLX for local execution. The possibilities are now more expansive than ever.
Version 5 is a monumental achievement, representing the collective efforts of a vast number of individuals within our community over the past five years. More than an accomplishment, it is a promise about the future direction of AI development. We’ve seized this opportunity to refine our toolkit, isolate what truly matters, and create a clean slate upon which to build even greater innovations. Thanks to the multitude of contributions from both the community and the core team, delivering improvements in performance, usability, and readability will become a far more streamlined process.
With the first Release Candidate for v5.0.0 now available, we eagerly await your feedback. Dive into our release notes for the exhaustive technical details, and share your thoughts and insights in our GitHub issues! The journey of Transformers is far from over; it’s just reaching an exciting new chapter.