NVIDIA tests diffusion models for faster AI text generation

Joachim Høgby

May 23, 2026 · 5 min read · Source: NVIDIA / Hugging Face

NVIDIA has published a new Nemotron-Labs Diffusion family on Hugging Face. At first glance, it looks like a research note. The practical implications are bigger.

The core issue is that language models do not always have to write one token at a time. Today's autoregressive models are familiar and effective, but they have a hard operating limit: every new token requires another model pass. When a company builds coding tools, customer-service workflows, document handling or internal analysis systems on top of language models, that limit shows up as higher latency, poorer GPU utilization and higher cost.

Nemotron-Labs Diffusion targets exactly that bottleneck. The models can generate multiple tokens in parallel and then refine the answer over several steps. NVIDIA describes it as a generate-and-refine approach. It also gives operators a way to adjust how much compute the model should spend at runtime. Fewer refinement steps can reduce latency and cost, while more steps can be used where quality matters more.
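To make the runtime dial concrete, here is a minimal Python sketch of block-wise generate-and-refine decoding. It is a generic illustration of iterative parallel unmasking, not NVIDIA's sampler: `refine_block`, `toy_predict` and the rule of committing the most confident positions first are all assumptions made for this example.

```python
import math
from typing import Callable, List, Optional, Tuple

Token = int
Prediction = Tuple[Token, float]  # (proposed token, confidence in [0, 1])


def refine_block(
    block_len: int,
    steps: int,
    predict: Callable[[List[Optional[Token]]], List[Prediction]],
) -> Tuple[List[Token], int]:
    """Fill a block of masked positions in at most `steps` parallel passes."""
    if steps < 1 or block_len < 1:
        raise ValueError("steps and block_len must be at least 1")
    block: List[Optional[Token]] = [None] * block_len  # None means "still masked"
    passes = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok is None]
        if not masked:
            break
        preds = predict(block)  # one model pass scores every position at once
        passes += 1
        # Spread the remaining masked positions evenly over the remaining steps.
        quota = math.ceil(len(masked) / (steps - step))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in most_confident[:quota]:
            block[i] = preds[i][0]
    return [tok for tok in block if tok is not None], passes


def toy_predict(block: List[Optional[Token]]) -> List[Prediction]:
    # Hypothetical stand-in for a model: proposes token 10*i at position i,
    # with confidence falling as i grows.
    return [(10 * i, 1.0 / (i + 1)) for i in range(len(block))]


fast_tokens, fast_passes = refine_block(8, steps=2, predict=toy_predict)
careful_tokens, careful_passes = refine_block(8, steps=8, predict=toy_predict)
print(fast_passes, careful_passes)  # prints: 2 8
```

The toy predictor returns the same tokens regardless of the step budget, so only the pass count changes here. With a real model, extra steps let later predictions condition on tokens already committed, which is where the quality-versus-compute trade-off would come from.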

That is an important governance point. Many AI projects do not fail because the model is useless. They fail because the economics of production are too weak. Token-by-token generation looks cheap in a pilot. In production, latency, batch size, GPU memory and throughput become part of the business case.

NVIDIA is releasing models at 3B, 8B and 14B scale, including both base and instruction-tuned variants. It is also releasing an 8B vision-language model (VLM). The text models are available under the NVIDIA Nemotron Open Model License, while the VLM is under the NVIDIA Source Code License. NVIDIA is also publishing training recipes and code through Megatron Bridge.

The most interesting part is not just that the models are open. It is that the same model can be served in three modes.

The first mode is standard autoregressive generation. The model behaves like a normal left-to-right language model.

The second mode is diffusion. The model fills blocks of text by refining several tokens in parallel steps.

The third mode is self-speculation. The model uses diffusion to draft several candidate tokens, then uses autoregressive decoding to verify them. NVIDIA argues that this gives developers a path to higher speed without changing application logic.
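The draft-then-verify idea behind the third mode can be sketched in a few lines of Python. This is a simplified, greedy version of speculative decoding under stated assumptions, not NVIDIA's implementation: `self_speculate`, `toy_draft_block` and `toy_next_token` are hypothetical names, and the verifier is called once per token for readability, whereas a real system would score a whole drafted block in a single forward pass.

```python
from typing import Callable, List, Tuple

Token = int


def self_speculate(
    prefix: List[Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    next_token: Callable[[List[Token]], Token],
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Draft blocks of tokens, keep the verified prefix, repeat."""
    out = list(prefix)
    target_len = len(prefix) + max_new_tokens
    rounds = 0
    while len(out) < target_len:
        k = min(block_size, target_len - len(out))
        draft = draft_block(out, k)  # diffusion-style parallel draft
        if not draft:
            raise ValueError("draft_block must return at least one token")
        rounds += 1
        for drafted in draft:
            verified = next_token(out)  # autoregressive check of this position
            out.append(verified)        # the verifier's token is always kept
            if drafted != verified:
                break                   # discard the rest of the draft
    return out, rounds


def toy_next_token(ctx: List[Token]) -> Token:
    # Hypothetical autoregressive model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft_block(ctx: List[Token], k: int) -> List[Token]:
    # Hypothetical drafter: mostly right, but wrong at the third position.
    draft = [(ctx[-1] + 1 + i) % 10 for i in range(k)]
    if k >= 3:
        draft[2] = (draft[2] + 5) % 10  # deliberate drafting error
    return draft


tokens, rounds = self_speculate([0], toy_draft_block, toy_next_token, 4, 9)
print(tokens, rounds)  # prints: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 3
```

Because the verifier's token is kept at every position, the output matches plain autoregressive decoding exactly. The gain is that nine tokens needed three draft-and-verify rounds instead of nine sequential steps, and application code that consumes the tokens would not need to change.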

In the blog post, NVIDIA says Nemotron-Labs Diffusion 8B achieved 1.2 percentage points higher average accuracy than Qwen3 8B on evaluated tasks. Measured in tokens per forward pass, the company reports 2.6 times higher efficiency in diffusion mode than autoregressive models. Self-speculation raises that to 6 times for the linear variant and 6.4 times for the quadratic variant, with comparable accuracy across the evaluated tasks.
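As a rough illustration of what those multipliers mean, the Python sketch below converts them into forward-pass counts for a 1,000-token response. It assumes the reported factors apply directly to an autoregressive baseline of one token per pass, and the 1,000-token length is an arbitrary example value. Forward passes in different modes do not necessarily cost the same, so this is not a wall-clock estimate.

```python
import math

# Tokens per forward pass. The autoregressive value is 1 by definition; the
# others assume NVIDIA's reported multipliers apply directly to that baseline.
TOKENS_PER_PASS = {
    "autoregressive": 1.0,
    "diffusion": 2.6,
    "self-speculation (linear)": 6.0,
    "self-speculation (quadratic)": 6.4,
}


def forward_passes(output_tokens: int, tokens_per_pass: float) -> int:
    """Number of model passes needed to emit `output_tokens` tokens."""
    return math.ceil(output_tokens / tokens_per_pass)


passes = {mode: forward_passes(1000, tpp) for mode, tpp in TOKENS_PER_PASS.items()}
for mode, count in passes.items():
    print(f"{mode}: {count} passes")
# autoregressive: 1000 passes
# diffusion: 385 passes
# self-speculation (linear): 167 passes
# self-speculation (quadratic): 157 passes
```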

The most concrete operating number comes from the SGLang integration. NVIDIA writes that LinearSpec reached around 865 tokens per second on a B200 GPU on the speedbench dataset. That is roughly four times the autoregressive baseline on the same hardware at temperature 0.
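A quick back-of-the-envelope calculation in Python shows what that throughput could mean for a single long response. The baseline figure is only implied by "roughly four times", not reported directly, and the 2,000-token response length is an arbitrary assumption. The estimate ignores prompt processing, time to first token and batching effects.

```python
REPORTED_TPS = 865.0    # LinearSpec on B200, as reported by NVIDIA
REPORTED_SPEEDUP = 4.0  # "roughly four times", so treat this as approximate

# Implied, not reported: the autoregressive baseline on the same hardware.
implied_baseline_tps = REPORTED_TPS / REPORTED_SPEEDUP


def seconds_to_generate(output_tokens: int, tokens_per_second: float) -> float:
    """Decode time only, assuming steady throughput for one request."""
    return output_tokens / tokens_per_second


fast = seconds_to_generate(2000, REPORTED_TPS)
slow = seconds_to_generate(2000, implied_baseline_tps)
print(f"baseline ~{implied_baseline_tps:.0f} tok/s: {slow:.1f} s vs {fast:.1f} s")
# prints: baseline ~216 tok/s: 9.2 s vs 2.3 s
```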

For CIOs and technology leaders, this is not a reason to replace every current model. It is a signal that model selection is becoming an operating-architecture decision. When a model can run in several inference modes, the question changes from “which model is best?” to “which mode fits this workflow, this latency target and this risk level?”

Coding is the obvious use case. Agentic developer tools need fast responses, but they also need precision. A model that can draft quickly and verify before committing fits well with pull requests, tests and human approval.

Customer service and document processing are another fit. Parts of an answer may be generated faster, while the riskiest parts still need stricter controls. It is not enough to ask whether the model is cheap. Leaders need to know which generation mode is being used, what quality thresholds are set and how errors are detected before they reach a customer or case worker.

For procurement and FinOps, this means the benchmarks need to get sharper. Price per million tokens is too blunt. Companies need to measure latency, tokens per second, batch size, GPU type, cache strategy, quality at lower inference budgets and what happens when the model is allowed to refine the answer several times.
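One way to capture those dimensions is a single benchmark record per run, sketched below in Python. The `BenchmarkRun` class and its field names are hypothetical, and every number in the example is an invented placeholder rather than a measurement. The cost formula assumes one GPU billed by the hour and running at the measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    model: str
    mode: str                 # e.g. "autoregressive", "diffusion", "self-speculation"
    gpu_type: str
    batch_size: int
    cache_strategy: str
    refinement_steps: int     # the inference budget being tested
    p95_latency_s: float
    tokens_per_second: float
    quality_score: float      # from the organization's own evaluation set
    gpu_hourly_cost: float    # in the buyer's currency

    def cost_per_million_tokens(self) -> float:
        tokens_per_hour = self.tokens_per_second * 3600
        return self.gpu_hourly_cost / tokens_per_hour * 1_000_000


# Placeholder values for illustration only.
low_budget = BenchmarkRun("example-model", "diffusion", "example-gpu", 8,
                          "prefix-cache", 2, 1.2, 1000.0, 0.81, 9.0)
high_budget = BenchmarkRun("example-model", "diffusion", "example-gpu", 8,
                           "prefix-cache", 8, 2.9, 500.0, 0.86, 9.0)

for run in (low_budget, high_budget):
    print(run.refinement_steps, run.quality_score,
          round(run.cost_per_million_tokens(), 2))
# prints: 2 0.81 2.5
# prints: 8 0.86 5.0
```

Keeping quality, latency and cost in the same record makes the trade-off visible per run, which a single price-per-million-tokens figure hides.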

There is also a security angle. Diffusion and self-speculation make generation more complex. That may be good for speed, but logging, evaluations and observability need to understand the full generation path. If an AI agent makes a wrong move in a system that touches money, code or customer data, logging the final answer is not enough. The organization needs to explain how the suggestion was drafted, verified and approved.
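What logging the full generation path could look like is sketched below in Python. The `GenerationTrace` and `SpeculationRound` structures and all field names are hypothetical and not taken from NVIDIA or SGLang. The point is only that draft, verification and approval each leave a record alongside the final answer.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SpeculationRound:
    drafted: List[str]         # tokens proposed by the draft step
    accepted: int              # how many drafted tokens passed verification
    correction: Optional[str]  # verifier's replacement token, if any


@dataclass
class GenerationTrace:
    request_id: str
    model: str
    mode: str                  # which inference mode produced the answer
    rounds: List[SpeculationRound] = field(default_factory=list)
    final_text: str = ""
    approved_by: Optional[str] = None  # human or policy that signed off

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only.
trace = GenerationTrace("req-001", "example-model", "self-speculation")
trace.rounds.append(
    SpeculationRound(drafted=["refund", " the", " order"], accepted=2,
                     correction=" customer")
)
trace.final_text = "refund the customer"
trace.approved_by = "reviewer@example.com"
print(trace.to_json())
```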

That is why Nemotron-Labs Diffusion is mainly a production story. It points toward a more mature AI stack where speed, cost and quality can be governed at inference level, not just in the model card.

Sources and media

Primary source: NVIDIA / Hugging Face, “Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models”, published May 23, 2026: https://huggingface.co/blog/nvidia/nemotron-labs-diffusion
Model collection: https://huggingface.co/collections/nvidia/nemotron-labs-diffusion
Training recipe and code: https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/diffusion/recipes/nemotron_labs_diffusion
Thumbnail: OpenAI Image 2 / hogby.ai
