“OS” for AI factories: NVIDIA Dynamo will speed up inference and simplify scaling of reasoning AI models

NVIDIA unveiled NVIDIA Dynamo, the successor to NVIDIA Triton Inference Server, an open-source software environment for developers that accelerates inference and makes it easier to scale reasoning AI models in AI factories with minimal overhead and maximum efficiency. NVIDIA CEO Jensen Huang called Dynamo “the operating system for AI factories.”

NVIDIA Dynamo improves inference performance while reducing the cost of scaling test-time compute. By optimizing inference on NVIDIA Blackwell, the platform is reported to deliver a multi-fold increase in throughput for the DeepSeek-R1 reasoning AI model.

Image source: NVIDIA

Designed to maximize token revenue for AI data centers, the NVIDIA Dynamo platform orchestrates and accelerates inference communications across thousands of accelerators, and uses disaggregated serving to separate the prompt-processing (prefill) and token-generation (decode) phases of large language models (LLMs) across different accelerators. This allows each phase to be optimized independently for its specific needs, ensuring maximum utilization of compute resources.

With the same number of accelerators, Dynamo doubles the performance (and thus the effective revenue of AI factories) of Llama models on the NVIDIA Hopper platform. When running the DeepSeek-R1 model on a large GB200 NVL72 cluster, NVIDIA reports, Dynamo's intelligent inference optimization increases the number of tokens generated per accelerator by more than 30 times.

NVIDIA Dynamo can dynamically redistribute workloads across accelerators in response to changing request volumes and types, and can pinpoint specific accelerators in large clusters that minimize response computation, routing requests to them. The platform can also offload inference data to more affordable memory and storage devices and quickly retrieve it when needed.
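The offloading idea can be illustrated with a toy tiered store. This is an assumption-laden sketch, not Dynamo's implementation: a small "hot" tier stands in for accelerator memory, a "cold" dictionary stands in for host memory or storage, and an LRU policy decides what gets offloaded.

```python
# Minimal sketch (illustrative only, not Dynamo code) of offloading
# cold KV-cache entries to a cheaper tier and retrieving them on demand.
from collections import OrderedDict

class TieredKVStore:
    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # stand-in for accelerator memory (LRU order)
        self.cold = {}             # stand-in for host memory / storage
        self.hot_capacity = hot_capacity

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)  # mark as most recently used
        while len(self.hot) > self.hot_capacity:
            evicted_key, evicted_val = self.hot.popitem(last=False)
            self.cold[evicted_key] = evicted_val  # offload the coldest entry

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold.pop(key)  # retrieve and re-promote on reuse
        self.put(key, value)
        return value

store = TieredKVStore(hot_capacity=2)
for k in ("a", "b", "c"):
    store.put(k, k.upper())
print(sorted(store.cold))  # "a" was evicted to the cold tier
```

A real system would of course track memory in bytes, move tensors over NVLink or NVMe, and overlap transfers with compute, but the hot/cold promotion logic is the same shape.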

NVIDIA Dynamo is fully open source and supports PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM, allowing customers to develop and optimize ways to run AI models with disaggregated inference. According to NVIDIA, this will accelerate adoption across platforms including AWS, Cohere, CoreWeave, Dell, Fireworks, Google Cloud, Lambda, Meta, Microsoft Azure, Nebius, NetApp, OCI, Perplexity, Together AI, and VAST.

NVIDIA Dynamo maps the information that inference systems keep in memory from serving previous queries (the KV cache) across potentially thousands of accelerators. The platform then routes new queries to the accelerators whose KV cache contents most closely match the new query, thereby avoiding costly recomputation.

NVIDIA Dynamo also provides disaggregated query processing, which assigns different stages of LLM execution, from understanding the query to generating the response, to different accelerators. This approach is well suited to reasoning models. Disaggregated serving allows each phase to be configured and provisioned independently, delivering higher throughput and faster query responses.
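The phase split can be made concrete with a toy model. This sketch assumes nothing about Dynamo's internals: `prefill` stands in for the compute-bound pass over the whole prompt that builds the KV cache, and `decode` for the memory-bandwidth-bound loop that generates one token at a time from that cache, each of which could run on its own accelerator pool.

```python
# Toy illustration of disaggregated serving (not Dynamo code):
# prefill builds a KV cache from the prompt; decode consumes it.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    tokens: list = field(default_factory=list)  # tokens processed so far

def prefill(prompt_tokens):
    """Compute-bound phase: process the whole prompt in one pass."""
    return KVCache(tokens=list(prompt_tokens))

def decode(cache, steps):
    """Bandwidth-bound phase: generate one token per step from the cache."""
    out = []
    for _ in range(steps):
        nxt = cache.tokens[-1] + 1  # stand-in for a model forward pass
        cache.tokens.append(nxt)
        out.append(nxt)
    return out

cache = prefill([10, 11, 12])  # could run on a prefill-optimized pool
print(decode(cache, 3))        # could run on a separate decode pool
```

Because the two phases stress hardware differently, separating them lets a deployment buy throughput where prefill is the bottleneck and latency where decode is, instead of compromising on one pool sized for both.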

NVIDIA Dynamo includes four key mechanisms:

  • GPU Planner: A scheduling engine that dynamically adjusts the number of accelerators to match changing demand, avoiding both over- and under-provisioning of resources.
  • Smart Router: An LLM-aware router that distributes requests across large groups of accelerators to minimize expensive recomputation for duplicate or overlapping requests, freeing up resources to handle new ones.
  • Low-Latency Communication Library: An inference-optimized library that supports communication between accelerators and simplifies communication between disparate devices, accelerating data transfer.
  • Memory Manager: A mechanism that transparently and intelligently loads, unloads, and distributes inference data between memory and storage devices.

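The GPU Planner's core idea, sizing each pool from observed demand, can be reduced to a few lines. The capacity numbers and function below are hypothetical, chosen only to show why disaggregation lets the two phases scale independently.

```python
# Rough sketch of demand-driven pool sizing (not the GPU Planner's
# actual logic): derive worker counts from request rate and a
# per-worker capacity estimate, separately for each phase.
import math

def plan_workers(request_rate, per_worker_capacity, min_workers=1):
    """Workers needed to serve `request_rate` req/s at
    `per_worker_capacity` req/s per worker, never below a floor."""
    return max(min_workers, math.ceil(request_rate / per_worker_capacity))

# Prefill and decode pools are sized independently: each phase
# scales to its own bottleneck, which is the point of disaggregation.
prefill_workers = plan_workers(request_rate=120, per_worker_capacity=25)
decode_workers = plan_workers(request_rate=120, per_worker_capacity=8)
print(prefill_workers, decode_workers)  # 5 15
```

A production planner would also react to queue depths and latency targets rather than a single rate, but the asymmetry between the two pool sizes is exactly what fixed static provisioning cannot capture.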
The NVIDIA Dynamo platform will be available in NVIDIA NIM microservices and will be supported in a future release of the NVIDIA AI Enterprise platform.
