NVIDIA tests diffusion models for faster AI text generation

Joachim Høgby
May 23, 2026 · 5 min read · Source: NVIDIA / Hugging Face

NVIDIA has published a new Nemotron-Labs Diffusion family on Hugging Face. At first glance, it looks like a research note. The practical implications are bigger.

The core issue is that language models do not always have to write one token at a time. Today's autoregressive models are familiar and effective, but they have a hard operating limit: every new token requires another model pass. When a company builds coding tools, customer-service workflows, document handling or internal analysis systems on top of language models, that limit translates into higher latency, poorer GPU utilization and higher cost.

Nemotron-Labs Diffusion targets exactly that bottleneck. The models can generate multiple tokens in parallel and then refine the answer over several steps. NVIDIA describes it as a generate-and-refine approach. It also gives operators a way to adjust how much compute the model should spend at runtime. Fewer refinement steps can reduce latency and cost, while more steps can be used where quality matters more.
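To make the compute knob concrete, here is a toy sketch of confidence-based parallel unmasking, one common way masked-diffusion text models fill a block. It is not NVIDIA's implementation. The names `toy_denoiser`, `generate_block` and `steps` are hypothetical, and the stand-in "model" simply reveals a known target, so the sketch shows only how the step budget sets the number of model passes, not how quality changes.

```python
import math
import random

MASK = "_"  # placeholder for a position that has not been filled yet


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for one model pass.

    Returns {position: (proposed_token, confidence)} for every masked
    position. A real model would predict the tokens; this toy copies
    them from `target` and attaches a random confidence.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def generate_block(target, steps, seed=0):
    """Fill one block in at most `steps` parallel passes."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    passes = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break  # block is complete, no further passes needed
        proposals = toy_denoiser(block, target, rng)
        passes += 1
        # Spread the remaining masked positions over the remaining steps.
        keep = math.ceil(len(masked) / (steps - step))
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:keep]
        for i in most_confident:
            block[i] = proposals[i][0]
    return block, passes
```

For an eight-token block, `generate_block(tokens, steps=2)` finishes in two passes and `steps=8` takes eight, which matches one-token-at-a-time decoding. In this sketch the step budget is a runtime argument, not something baked into the model.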

That is an important governance point. Many AI projects do not fail because the model is useless. They fail because the economics of production are too weak. Token-by-token generation looks cheap in a pilot. In production, latency, batch size, GPU memory and throughput become part of the business case.

NVIDIA is releasing models at 3B, 8B and 14B scale, including both base and instruction-tuned variants. It is also releasing an 8B vision-language model (VLM). The text models are available under the NVIDIA Nemotron Open Model License, while the VLM is under the NVIDIA Source Code License. NVIDIA is also publishing training recipes and code through Megatron Bridge.

The most interesting part is not just that the models are open. It is that the same model can be served in three modes.

The first mode is standard autoregressive generation. The model behaves like a normal left-to-right language model.

The second mode is diffusion. The model fills blocks of text by refining several tokens in parallel steps.

The third mode is self-speculation. The model uses diffusion to draft several candidate tokens, then uses autoregressive decoding to verify them. NVIDIA argues that this gives developers a path to higher speed without changing application logic.
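The draft-then-verify idea can be sketched with a toy greedy speculative loop. This is an illustration under stated assumptions, not NVIDIA's code: `draft_block` and `verify_next` are hypothetical stand-ins for the diffusion drafter and the autoregressive verifier, and the fixed sentence replaces a real model. In a real system the verifier checks all drafted positions in a single forward pass, which is why the sketch counts one verify pass per block.

```python
from typing import Callable, List, Sequence, Tuple

Token = str

# Toy "ground truth" that the verifier reproduces word by word.
SENTENCE: List[Token] = (
    "the model drafts a block and then checks every token in order".split()
)


def verify_next(tokens: Sequence[Token]) -> Token:
    """Hypothetical autoregressive verifier: the next word, by position."""
    return SENTENCE[len(tokens) % len(SENTENCE)]


def draft_block(tokens: Sequence[Token], block_size: int) -> List[Token]:
    """Hypothetical drafter: mostly right, wrong where position % 7 == 3."""
    start = len(tokens)
    return [
        "???" if (start + i) % 7 == 3 else SENTENCE[(start + i) % len(SENTENCE)]
        for i in range(block_size)
    ]


def speculative_step(
    prefix: Sequence[Token],
    draft: Sequence[Token],
    verify: Callable[[Sequence[Token]], Token],
) -> List[Token]:
    """Accept the longest draft prefix the verifier agrees with.

    The result always equals what greedy decoding with `verify` alone
    would have produced, so drafting errors never reach the output.
    """
    accepted: List[Token] = []
    for proposed in draft:
        expected = verify(list(prefix) + accepted)
        if proposed != expected:
            accepted.append(expected)  # replace the first wrong token, stop
            return accepted
        accepted.append(proposed)
    # Whole draft accepted: the verifier contributes one extra token.
    accepted.append(verify(list(prefix) + accepted))
    return accepted


def generate(
    prompt: Sequence[Token], max_new_tokens: int, block_size: int = 4
) -> Tuple[List[Token], int]:
    """Return the generated tokens and the number of verify passes used."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(speculative_step(out, draft, verify_next))
        verify_passes += 1
    return out[len(prompt):len(prompt) + max_new_tokens], verify_passes
```

Calling `generate([], max_new_tokens=12)` reproduces the twelve-word sentence in four verify passes instead of twelve, even though the drafter gets two positions wrong along the way. The drafting passes cost compute too, which this toy does not count.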

In the blog post, NVIDIA says Nemotron-Labs Diffusion 8B achieved 1.2 percentage points higher average accuracy than Qwen3 8B on the evaluated tasks. Measured in tokens per forward pass, the company reports that diffusion mode delivers 2.6 times the efficiency of autoregressive decoding. Self-speculation raises that to 6 times for the linear variant and 6.4 times for the quadratic variant, with comparable accuracy across the evaluated tasks.
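A rough way to read those multipliers is as forward passes needed for a fixed-length answer. The sketch below assumes the autoregressive baseline emits exactly one token per pass, so the reported multipliers can be treated as average tokens per pass. The `forward_passes` helper and the 1,000-token answer length are illustrative assumptions, not figures from NVIDIA.

```python
import math

# Reported multipliers, read as average tokens per forward pass
# (assumption: the autoregressive baseline is 1 token per pass).
TOKENS_PER_FORWARD_PASS = {
    "autoregressive": 1.0,
    "diffusion": 2.6,
    "self-speculation (linear)": 6.0,
    "self-speculation (quadratic)": 6.4,
}


def forward_passes(num_tokens: int, tokens_per_pass: float) -> int:
    """Forward passes needed to emit `num_tokens` at a given average rate."""
    return math.ceil(num_tokens / tokens_per_pass)


def passes_by_mode(num_tokens: int) -> dict:
    """Map each mode to the passes needed for an answer of `num_tokens`."""
    return {
        mode: forward_passes(num_tokens, rate)
        for mode, rate in TOKENS_PER_FORWARD_PASS.items()
    }
```

For a 1,000-token answer this gives 1,000 passes for autoregressive decoding, 385 for diffusion, 167 for the linear variant and 157 for the quadratic variant. Passes are not equally expensive across modes, so these counts do not translate one-to-one into wall-clock speed.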

The most concrete operating number comes from the SGLang integration. NVIDIA writes that LinearSpec reached around 865 tokens per second on a B200 GPU on the speedbench dataset. That is roughly four times the autoregressive baseline on the same hardware at temperature 0.

For CIOs and technology leaders, this is not a reason to replace every current model. It is a signal that model selection is becoming an operating-architecture decision. When a model can run in several inference modes, the question changes from “which model is best?” to “which mode fits this workflow, this latency target and this risk level?”

Coding is the obvious use case. Agentic developer tools need fast responses, but they also need precision. A model that can draft quickly and verify before committing fits well with pull requests, tests and human approval.

Customer service and document processing are two more. Parts of an answer may be generated faster, while the riskiest parts still need stricter controls. It is not enough to ask whether the model is cheap. Leaders need to know which generation mode is being used, what quality thresholds are set and how errors are detected before they reach a customer or case worker.

For procurement and FinOps, this means the benchmarks need to get sharper. Price per million tokens is too blunt. Companies need to measure latency, tokens per second, batch size, GPU type, cache strategy, quality at lower inference budgets and what happens when the model is allowed to refine the answer several times.
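One way to make that sharper benchmark tangible is to record each run with its operating conditions and derive cost from measured throughput. The sketch below is a minimal illustration: the `BenchmarkRun` fields, the `quality_score` scale and the `cheapest_acceptable` helper are hypothetical, and any prices or thresholds fed into it are the reader's own inputs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BenchmarkRun:
    """One measured configuration. All field names are illustrative."""

    mode: str                 # e.g. "autoregressive", "diffusion"
    gpu_type: str
    batch_size: int
    tokens_per_second: float  # measured throughput per GPU
    p95_latency_ms: float
    quality_score: float      # score on the team's own evaluation set
    gpu_hour_price: float     # what the organization pays per GPU hour

    def cost_per_million_tokens(self) -> float:
        """GPU cost of generating one million tokens at this throughput."""
        tokens_per_hour = self.tokens_per_second * 3600
        return self.gpu_hour_price / tokens_per_hour * 1_000_000


def cheapest_acceptable(
    runs: Iterable[BenchmarkRun],
    max_p95_latency_ms: float,
    min_quality: float,
) -> Optional[BenchmarkRun]:
    """Cheapest run that meets both the latency and the quality bar."""
    acceptable = [
        run
        for run in runs
        if run.p95_latency_ms <= max_p95_latency_ms
        and run.quality_score >= min_quality
    ]
    return min(
        acceptable,
        key=lambda run: run.cost_per_million_tokens(),
        default=None,
    )
```

The design choice here is to filter on latency and quality first and compare cost only among runs that pass. A configuration that is cheap per token but misses the latency target never wins.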

There is also a security angle. Diffusion and self-speculation make generation more complex. That may be good for speed, but logging, evaluations and observability need to understand the full generation path. If an AI agent makes a wrong move in a system that touches money, code or customer data, logging the final answer is not enough. The organization needs to explain how the suggestion was drafted, verified and approved.
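What "the full generation path" might look like in a log is sketched below. The `GenerationTrace` record and all of its field names are hypothetical, chosen only to show that mode, step budget, draft acceptance and approval can be captured next to the final answer. It does not reflect any logging format that NVIDIA or SGLang provides.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerationTrace:
    """Illustrative audit record for one generated answer."""

    request_id: str
    model: str
    mode: str                        # "autoregressive", "diffusion", ...
    refinement_steps: Optional[int]  # None when the mode has no step budget
    drafted_tokens: int
    accepted_tokens: int
    verified: bool                   # did a verification pass run?
    approved_by: Optional[str]       # human approver, if any
    final_text: str

    def acceptance_rate(self) -> Optional[float]:
        """Share of drafted tokens that survived verification."""
        if self.drafted_tokens == 0:
            return None
        return self.accepted_tokens / self.drafted_tokens

    def to_json(self) -> str:
        """Serialize the trace, including the derived acceptance rate."""
        record = asdict(self)
        record["acceptance_rate"] = self.acceptance_rate()
        return json.dumps(record, sort_keys=True)
```

A record like this lets a reviewer answer three questions after an incident: which mode produced the output, how much of the draft was rejected, and who signed off.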

That is why Nemotron-Labs Diffusion is mainly a production story. It points toward a more mature AI stack where speed, cost and quality can be governed at inference level, not just in the model card.

Sources and media

  • Primary source: NVIDIA / Hugging Face, “Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models”, published May 23, 2026: https://huggingface.co/blog/nvidia/nemotron-labs-diffusion
  • Model collection: https://huggingface.co/collections/nvidia/nemotron-labs-diffusion
  • Training recipe and code: https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/diffusion/recipes/nemotron_labs_diffusion
  • Thumbnail: OpenAI Image 2 / hogby.ai
