Unlocking AI Superpowers: How NVIDIA DGX Spark and Ollama Are Revolutionizing Local LLM Performance

The Dawn of High-Performance Local AI: NVIDIA DGX Spark Meets Ollama

In the rapidly evolving landscape of Artificial Intelligence, the ability to run powerful Large Language Models (LLMs) locally, with speed and efficiency, is no longer a luxury – it’s a necessity. For developers, researchers, and businesses pushing the boundaries of what AI can do, the hardware and software ecosystem needs to deliver. Enter NVIDIA DGX Spark and Ollama, a dynamic duo that promises to elevate local LLM performance to unprecedented heights.

This article dives into the heart of this synergy, dissecting a recent performance analysis that puts NVIDIA DGX Spark, powered by the latest firmware and the versatile Ollama platform, through its paces. We’ll explore the benchmark results, understand the nuances of model performance, and offer insights into how you can leverage this potent combination for your own AI endeavors.

Behind the Benchmarks: Setting the Stage for Performance

To truly understand the capabilities of NVIDIA DGX Spark when running LLMs via Ollama, a rigorous testing methodology was employed. The goal was to provide a clear, fact-based assessment of how different models perform under consistent, controlled conditions. The tests were conducted on release day firmware and an updated Ollama version, ensuring that the results reflect the cutting edge of available technology. Specifically, the NVIDIA DGX Spark’s latest firmware, version 580.95.05, and Ollama version v0.12.6 were utilized.

Each test scenario was meticulously repeated 10 times to ensure statistical reliability and to iron out any performance anomalies. Several key parameters were held constant to provide a fair comparison across different models:

  • Temperature set to 0: Temperature is the sampling parameter that controls how random the output is. At 0, the model always picks the most probable next token (greedy decoding), so the output is as deterministic as possible and run-to-run comparisons stay consistent.
  • Constrained to 500 tokens output: This limits the length of the generated response, allowing for consistent measurement of inference speed without being bogged down by extremely long outputs. This is crucial for comparing the raw processing power.
  • Prompt: “write an in-depth summary of this story: $(head -n200 pg98.txt)”: The prompt is deliberately substantial, so the model has to process a significant chunk of text. The $(head -n200 pg98.txt) part is shell command substitution: it pastes in the first 200 lines of pg98.txt, the Project Gutenberg plain-text edition of "A Tale of Two Cities" (ebook #98). This simulates the real-world task of summarizing lengthy content. The test script and its accompanying readme are publicly available, so users can customize and replicate the tests.
  • Caching is disabled: This critical setting ensures that repeated tests are not artificially faster due to previously loaded model weights or generated tokens. Each run starts from a clean slate, providing a true measure of the system’s sustained performance.
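The fixed settings above can be sketched as a few shell helpers. This is a minimal sketch, not the published test script: the function names build_prompt, sampling_options, and mean are hypothetical. Passing the settings as temperature and num_predict in the options object of Ollama's API is an assumption about how the test applied them. Summarizing the 10 runs with a simple mean is also an assumption, since the article doesn't say how the runs were combined.

```shell
# Hypothetical helpers mirroring the test settings described above.
# Names are illustrative; this is not the published benchmark script.

TEMPERATURE=0    # greedy decoding
MAX_TOKENS=500   # cap on generated tokens
RUNS=10          # repetitions per scenario (informational here)

# Build the prompt from the first 200 lines of a text file.
build_prompt() {
  printf 'write an in-depth summary of this story: %s' "$(head -n 200 "$1")"
}

# Print the sampling settings as a JSON "options" object.
sampling_options() {
  printf '{"temperature": %s, "num_predict": %s}\n' "$TEMPERATURE" "$MAX_TOKENS"
}

# Average one number per line from stdin (assumes a simple mean).
mean() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.2f\n", sum / n }'
}
```

For example, build_prompt pg98.txt prints the full prompt, and piping one tokens-per-second reading per line into mean gives a single figure per model.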

Decoding the Data: A Detailed Look at LLM Performance

The results show how various LLMs fare on the NVIDIA DGX Spark, and they reveal a clear interplay between model size, quantization, and two throughput metrics: prefill and decode, both measured in tokens per second.

Let’s break down these metrics:

  • Prefill Tokens per Second: This measures how quickly the model can process the initial prompt and prepare to generate the response. It’s akin to the model "reading" and understanding the input. A higher number here means the model can ingest your requests faster.
  • Decode Tokens per Second: This measures how quickly the model can generate the actual output, token by token, after understanding the prompt. This is the speed at which you’ll see the AI’s response unfold.
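Both metrics are a token count divided by elapsed time. Ollama's API responses include prompt_eval_count and prompt_eval_duration for the prefill phase, and eval_count and eval_duration for the decode phase, with durations in nanoseconds. The sketch below assumes the benchmark figures were derived from fields like these, which the article doesn't confirm. The function name tokens_per_second is hypothetical and the sample numbers are made up.

```shell
# tokens_per_second COUNT DURATION_NS
# Converts a token count and a duration in nanoseconds into tokens/second.
tokens_per_second() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) { print "n/a"; exit }
    printf "%.2f\n", count / (ns / 1000000000)
  }'
}

# Made-up example values, not measurements from the article:
# prefill: 3000 prompt tokens processed in 1.5 s
tokens_per_second 3000 1500000000     # prints 2000.00
# decode: 500 generated tokens in 10 s
tokens_per_second 500 10000000000     # prints 50.00
```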

Here’s a closer look at the performance of different models:

The Powerhouse: gpt-oss Models

NVIDIA DGX Spark demonstrates remarkable capability with the gpt-oss models. The 20B parameter version, quantized with MXFP4, achieves an impressive 3.224k tokens per second during prefill and a stellar 58.27 tokens per second during decoding. This indicates a very swift understanding of prompts and rapid response generation, making it ideal for interactive applications.

Even the larger 120B parameter gpt-oss model, also using MXFP4 quantization, delivers robust performance. While the prefill speed drops to 1.169k tokens per second, the decode speed remains a strong 41.14 tokens per second. This is particularly noteworthy because the entire 120B model fits in the DGX Spark's 128 GB of unified memory, which the GB10 Grace Blackwell Superchip shares between CPU and GPU.
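Why a 120B model fits can be checked with back-of-the-envelope arithmetic. The sketch below assumes MXFP4 costs about 4.25 bits per weight (4-bit values plus one shared 8-bit scale per block of 32) and applies that to every parameter. In practice some layers are kept at higher precision, and the KV cache and runtime overhead come on top, so treat the result as a rough lower bound. The function name weight_gb is hypothetical.

```shell
# weight_gb PARAMS_BILLION BITS_PER_WEIGHT
# Rough size of the weights alone, in GB (10^9 bytes).
# Assumption: every parameter is stored at the given bit width.
weight_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}

weight_gb 120 4.25    # prints 63.75
```

Roughly 64 GB of weights leaves a lot of headroom within the DGX Spark's 128 GB of unified memory.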

A note on gpt-oss models: These models are officially provided by OpenAI and distributed via Ollama. It’s important to understand that some GGUF files labeled as MXFP4 found online might undergo further quantization to q8_0 in their attention layers. However, in the context of these tests with Ollama, the same layers are treated as BF16, aligning with OpenAI’s intended design for optimal performance.

The Versatile Gemma 3 Family

Google’s Gemma 3 models also make a strong showing on the DGX Spark.

  • The 12B parameter Gemma 3 with q4_K_M quantization achieves 1.894k tokens per second for prefill and 24.25 tokens per second for decode.
  • When upgrading to q8_0 quantization for the same 12B model, prefill slows to 1.406k tokens per second, and decode to 15.46 tokens per second. This highlights the trade-off: higher precision (q8_0) can sometimes come at the cost of raw speed compared to more aggressive quantization (q4_K_M), though it might offer improved accuracy.

Moving up to the 27B parameter Gemma 3:

  • With q4_K_M quantization, it achieves 834.1 tokens per second for prefill and 10.83 tokens per second for decode.
  • The q8_0 quantized version shows a further reduction, with 585.4 tokens per second for prefill and 7.210 tokens per second for decode. These figures demonstrate that while the larger Gemma 3 models are capable, they naturally require more computational resources, impacting their raw throughput compared to smaller counterparts.

The Agile Llama 3.1

Meta’s Llama 3.1 models are a testament to efficient design, and their performance on DGX Spark is compelling.

  • The 8B parameter Llama 3.1 with q4_K_M quantization is a speed demon, boasting an astonishing 7.614k tokens per second for prefill and 38.02 tokens per second for decode. This makes it an excellent choice for applications demanding quick responses and rapid iteration.
  • Upgrading to q8_0 quantization for the 8B model brings prefill down to 6.110k tokens per second and decode to 25.23 tokens per second. Again, the quantization level influences the speed-accuracy balance.

For the larger 70B parameter Llama 3.1 with q4_K_M quantization, we see prefill at 1.911k tokens per second and decode at 4.423 tokens per second. Even very large models run, but decode speed drops as model size grows, and at this rate a long response takes noticeably longer to generate.
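To see what these rates mean in practice, a response time can be estimated as prompt tokens divided by prefill speed plus output tokens divided by decode speed. This is a simplified sketch that ignores model load time and other overhead. The function name estimate_seconds is hypothetical, and the 4000-token prompt is an assumed size, since the article doesn't state the prompt's token count.

```shell
# estimate_seconds PROMPT_TOKENS PREFILL_TPS OUTPUT_TOKENS DECODE_TPS
# Approximate time to read the prompt and generate the response.
estimate_seconds() {
  awk -v pt="$1" -v pf="$2" -v ot="$3" -v dc="$4" 'BEGIN {
    printf "%.1f\n", pt / pf + ot / dc
  }'
}

# Assumed 4000-token prompt, 500-token response, rates quoted above:
estimate_seconds 4000 7614 500 38.02    # Llama 3.1 8B q4_K_M  -> 13.7
estimate_seconds 4000 1911 500 4.423    # Llama 3.1 70B q4_K_M -> 115.1
```

Under these assumptions the 70B model needs close to two minutes for a response that the 8B model produces in under 15 seconds, almost all of it spent in decode.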

DeepSeek R1 and Qwen 3: Exploring Other Top Contenders

The benchmark also includes insights into DeepSeek R1 and Qwen 3 models, showcasing the breadth of compatibility and performance.

  • DeepSeek R1 (14B parameters):

    • With q4_K_M, it achieves 5.919k tokens per second prefill and 19.99 tokens per second decode.
    • The q8_0 version yields 4.667k tokens per second prefill and 13.32 tokens per second decode. These figures position DeepSeek R1 as a strong performer for its size.
  • Qwen 3 (32B parameters):

    • The q4_K_M version delivers 705.0 tokens per second prefill and 9.411 tokens per second decode.
    • The q8_0 version sees prefill at 487.2 tokens per second and decode at 6.240 tokens per second. These results highlight Qwen 3’s capabilities, particularly for more substantial tasks, though at a lower token generation rate compared to some of the smaller, highly optimized models.
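The quantization trade-off can be put into numbers by dividing each model's q4_K_M decode speed by its q8_0 decode speed, using only the figures quoted above. The helper name ratio is hypothetical.

```shell
# ratio A B : prints A / B to two decimals.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# q4_K_M decode speed divided by q8_0 decode speed
ratio 24.25 15.46    # Gemma 3 12B     -> 1.57
ratio 10.83 7.210    # Gemma 3 27B     -> 1.50
ratio 38.02 25.23    # Llama 3.1 8B    -> 1.51
ratio 19.99 13.32    # DeepSeek R1 14B -> 1.50
ratio 9.411 6.240    # Qwen 3 32B      -> 1.51
```

Across all five pairs, q4_K_M decodes roughly 1.5 times faster than q8_0. One plausible reading is that decode speed is largely limited by how many bytes of weights have to be read per token, though the article doesn't test that explanation.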

Optimizing Your Environment: The Importance of Firmware Updates

To achieve these stellar performance figures, it’s crucial to ensure your NVIDIA DGX Spark is running the latest software. NVIDIA consistently releases firmware updates to enhance stability, security, and performance. For DGX Spark users, an update to firmware version 580.95.05 or later is highly recommended.

Why Update?

  • Performance Boosts: Newer firmware often includes optimizations that directly translate to faster LLM inference.
  • Bug Fixes: Updates address known issues, ensuring a more stable and reliable computing experience.
  • Security Enhancements: Keeping your system updated is vital for protecting against emerging threats.

How to Update:
If you’re using a DGX Spark firmware version below 580.95.05, it’s straightforward to get up to date. The most user-friendly method is through the DGX Dashboard. Simply navigate to the update section and follow the prompts.
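Checking whether a version is below 580.95.05 means comparing dotted version strings field by field. The sketch below does that with awk. The function name version_lt is hypothetical, and the article doesn't say how to read the installed version, so CURRENT is a placeholder to replace with your own value.

```shell
# version_lt A B : succeeds (exit 0) when dotted version A is older than B.
# Fields are compared numerically, left to right; missing fields count as 0.
version_lt() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = x[i] + 0; yi = y[i] + 0
      if (xi < yi) exit 0
      if (xi > yi) exit 1
    }
    exit 1   # versions are equal
  }'
}

REQUIRED=580.95.05
CURRENT=580.82.09   # placeholder value; substitute your installed version

if version_lt "$CURRENT" "$REQUIRED"; then
  echo "update recommended"
else
  echo "up to date"
fi
```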

For those who prefer command-line management, the process involves updating both the Ubuntu distribution and the firmware itself. Here are the commands you’ll need:

sudo apt update
sudo apt dist-upgrade

Once your system packages are up to date, you can refresh and upgrade the firmware:

sudo fwupdmgr refresh
sudo fwupdmgr upgrade

After the firmware has been successfully upgraded, a reboot is necessary to apply the changes:

sudo reboot

By keeping your DGX Spark firmware current, you’re ensuring that you’re benefiting from the latest advancements and maximizing the performance potential for your AI workloads.

Getting Started with Ollama: Your Gateway to Local LLMs

Ollama has rapidly become a favorite among AI practitioners for its simplicity and power in running LLMs locally. It provides a streamlined experience for downloading, managing, and running a wide variety of open-source models.

Installation is a breeze:

curl -fsSL https://ollama.com/install.sh | sh

This single command downloads and executes the installation script, setting up Ollama on your system.

Running your first model is just as easy:

ollama run gpt-oss

This command will download the gpt-oss model (if not already present) and launch an interactive session, allowing you to start chatting with the AI. You can replace gpt-oss with any other model name available in the Ollama library.
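To see prefill and decode rates on your own machine, ollama run accepts a --verbose flag that prints timing statistics after each response. The sketch below pulls the two rates out of that output. It assumes the statistics contain lines of the form "prompt eval rate: N tokens/s" and "eval rate: N tokens/s" and that they are written to standard error; check the output of your Ollama version before relying on it. The function name extract_rates is hypothetical.

```shell
# extract_rates : reads "ollama run --verbose" statistics on stdin and
# prints the prefill and decode rates in tokens per second.
# Assumption: lines look like "prompt eval rate:   266.67 tokens/s".
extract_rates() {
  awk -F': *' '
    $1 == "prompt eval rate" { split($2, v, " "); print "prefill", v[1] }
    $1 == "eval rate"        { split($2, v, " "); print "decode", v[1] }
  '
}

# Intended usage (requires Ollama and a downloaded model):
#   ollama run gpt-oss --verbose "Say hello" 2>&1 | extract_rates
```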

The Synergy of Code and AI: Coding with Codex & Ollama

For developers looking to integrate LLMs into their workflows, the combination of OpenAI's Codex CLI and Ollama on NVIDIA DGX Spark offers a powerful development environment. Codex is a coding agent that runs in the terminal, and it can work with Ollama-hosted models running locally.

To get started with Codex:

First, install it globally using npm:

npm install -g @openai/codex

Once installed, you can leverage Codex with Ollama models. To use the gpt-oss model, simply run:

codex --oss --model gpt-oss

Harnessing the 120B model:

The true power of DGX Spark shines with its ability to accommodate massive models. For instance, the larger gpt-oss-120b model can be fully loaded into the 128 GB of unified memory provided by the GB10 Grace Blackwell Superchip. This allows for running very large, sophisticated models locally without compromise:

codex --oss --model gpt-oss:120b

This integration opens up exciting possibilities for code assistance, natural language interfaces for development tools, and rapid prototyping of AI-powered applications.

The Future is Local: Empowering AI Innovation

The performance benchmarks presented here clearly demonstrate that NVIDIA DGX Spark, in conjunction with Ollama, is a formidable platform for running LLMs locally. The combination of cutting-edge hardware, optimized firmware, and user-friendly software like Ollama is democratizing access to high-performance AI. Whether you’re fine-tuning models, building AI-driven applications, or conducting cutting-edge research, the ability to run these powerful tools on your own infrastructure, with remarkable speed, is a game-changer. As these technologies continue to mature, we can expect even more groundbreaking innovations to emerge from the local AI revolution.
