NVIDIA has unveiled NVIDIA Dynamo, the successor to the NVIDIA Triton Inference Server: an open-source framework for developers that accelerates inference and makes it easier to scale reasoning AI models in AI factories with minimal overhead and maximum efficiency. NVIDIA CEO Jensen Huang called Dynamo “the operating system for AI factories.”
NVIDIA Dynamo improves inference performance while reducing the cost of scaling test-time compute. By optimizing inference on NVIDIA Blackwell, the platform is reported to boost the throughput of the DeepSeek-R1 reasoning AI model severalfold.
Image source: NVIDIA
Designed to maximize token revenue for AI data centers, the NVIDIA Dynamo platform orchestrates and accelerates inference communications across thousands of accelerators, and uses disaggregated serving to separate the processing (prefill) and generation (decode) phases of large language models (LLMs) onto different accelerators. This allows each phase to be optimized independently for its specific needs, ensuring maximum utilization of compute resources.
With the same number of accelerators, Dynamo doubles the performance (and hence the actual revenue of AI factories) of Llama models on the NVIDIA Hopper platform. When running the DeepSeek-R1 model on a large GB200 NVL72 cluster, NVIDIA reports, intelligent inference optimization by NVIDIA Dynamo increases the number of tokens generated per accelerator by more than 30 times.
NVIDIA Dynamo can dynamically redistribute workloads across accelerators in response to changing request volumes and types, and can pinpoint specific accelerators in large clusters that minimize response computations, routing requests to them. The platform can also offload inference data to cheaper memory and storage devices and quickly retrieve it when needed.
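To make the offloading idea concrete, here is a minimal, purely illustrative sketch of a tiered KV-cache store that keeps hot blocks in fast memory and spills cold ones to a cheaper tier, restoring them on demand. The TieredKVStore class and its LRU eviction policy are assumptions for illustration, not Dynamo's actual API; plain dicts stand in for device and host memory, where a real system would move tensors between them.

```python
from collections import OrderedDict

class TieredKVStore:
    """Keeps hot KV-cache blocks in the fast tier and spills cold ones
    to a cheaper tier, restoring them on demand (hypothetical sketch)."""

    def __init__(self, fast_capacity_blocks: int):
        self.fast_capacity = fast_capacity_blocks
        self.fast = OrderedDict()   # block_id -> data, kept in LRU order
        self.slow = {}              # offloaded blocks (cheaper memory/storage)

    def put(self, block_id, data):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)   # refresh recency
            self.fast[block_id] = data
            return
        if len(self.fast) >= self.fast_capacity:
            victim_id, victim = self.fast.popitem(last=False)  # evict LRU block
            self.slow[victim_id] = victim                      # offload it
        self.fast[block_id] = data

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)   # mark as recently used
            return self.fast[block_id]
        data = self.slow.pop(block_id)        # restore from the slow tier
        self.put(block_id, data)
        return data

store = TieredKVStore(fast_capacity_blocks=2)
for i in range(3):
    store.put(i, f"kv-block-{i}")   # block 0 gets evicted to the slow tier
print(store.get(0))                 # pulled back into fast memory on demand
```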
NVIDIA Dynamo is fully open source and supports PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM, allowing customers to develop and optimize ways of serving AI models with disaggregated inference. According to NVIDIA, this will accelerate adoption across platforms including AWS, Cohere, CoreWeave, Dell, Fireworks, Google Cloud, Lambda, Meta, Microsoft Azure, Nebius, NetApp, OCI, Perplexity, Together AI, and VAST.
NVIDIA Dynamo distributes the data that inference systems retain in memory from serving previous queries, the KV cache, across potentially thousands of accelerators. It then routes new queries to the accelerators whose KV-cache contents best match the incoming query, avoiding costly recomputation.
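A minimal sketch of how such KV-cache-aware routing can work: each worker advertises hashes of the token prefixes it already holds, and the router picks the worker with the largest overlap so the fewest KV blocks must be recomputed. The prefix_hashes and route functions, the block size, and the worker layout are all hypothetical, not Dynamo's actual router interface.

```python
import hashlib

def prefix_hashes(tokens, block_size=16):
    """Hash each block-aligned prefix of a token sequence."""
    hashes = []
    for end in range(block_size, len(tokens) + 1, block_size):
        digest = hashlib.sha256(str(tokens[:end]).encode()).hexdigest()
        hashes.append(digest)
    return hashes

def route(tokens, workers):
    """Pick the worker whose cached prefixes overlap the query most,
    so the fewest KV-cache blocks have to be recomputed."""
    query = prefix_hashes(tokens)
    def overlap(worker):
        return sum(1 for h in query if h in worker["cached_prefixes"])
    return max(workers, key=overlap)

workers = [
    {"name": "gpu-0", "cached_prefixes": set()},
    {"name": "gpu-1", "cached_prefixes": set(prefix_hashes(list(range(64))))},
]
# gpu-1 already caches the first 64 tokens of this 96-token query.
print(route(list(range(96)), workers)["name"])  # -> gpu-1
```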
NVIDIA Dynamo also provides disaggregated serving, which assigns the different stages of LLM execution, from understanding the query to generating tokens, to different accelerators. This approach is ideal for reasoning models. Disaggregated serving allows each phase to be tuned and provisioned independently, delivering higher throughput and faster query responses.
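The following sketch illustrates the disaggregated pattern under simplifying assumptions: a prefill pool processes prompts and hands the resulting KV cache to a separate decode pool through a queue. Threads and in-process queues stand in for what would be separate accelerator workers and high-speed interconnect transfers; none of the names mirror Dynamo's actual implementation.

```python
import queue
import threading

prefill_q = queue.Queue()
decode_q = queue.Ueue() if False else queue.Queue()  # decode stage's inbox

def prefill_worker():
    # Would run on accelerators tuned for throughput (large prompt batches).
    while True:
        req = prefill_q.get()
        if req is None:
            decode_q.put(None)   # propagate shutdown downstream
            break
        kv_cache = f"kv({req['prompt']!r})"   # stand-in for the computed KV cache
        decode_q.put({**req, "kv": kv_cache})

def decode_worker():
    # Would run on accelerators tuned for low-latency token generation.
    while True:
        req = decode_q.get()
        if req is None:
            break
        print(f"{req['id']}: decoding with {req['kv']}")

t1 = threading.Thread(target=prefill_worker)
t2 = threading.Thread(target=decode_worker)
t1.start(); t2.start()
prefill_q.put({"id": "req-1", "prompt": "Why is the sky blue?"})
prefill_q.put(None)   # signal shutdown
t1.join(); t2.join()
```

Because the two pools are independent, the prefill side can be scaled for compute-heavy prompt processing while the decode side is scaled for memory-bandwidth-bound generation.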
NVIDIA Dynamo includes four key mechanisms:

- GPU Planner: a planning engine that dynamically adds and removes accelerators to match fluctuating user demand (a minimal scaling sketch follows this list).
- Smart Router: an LLM-aware router that directs requests across large accelerator fleets to minimize costly recomputation of KV-cache data.
- Low-Latency Communication Library (NIXL): an inference-optimized library for fast accelerator-to-accelerator data transfer that abstracts the complexity of exchanging data across heterogeneous devices.
- Memory Manager: an engine that transparently offloads and reloads inference data to and from lower-cost memory and storage devices.
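To illustrate the planner's role, here is a minimal, hypothetical sketch of demand-driven scaling: the worker count grows when the per-worker backlog exceeds one threshold and shrinks when it falls below another. The Planner class and its thresholds are assumptions for illustration, not the GPU Planner's actual interface.

```python
class Planner:
    """Hypothetical demand-driven scaler for a pool of inference workers."""

    def __init__(self, min_workers=1, max_workers=8,
                 scale_up_depth=32, scale_down_depth=4):
        self.workers = min_workers
        self.min_workers, self.max_workers = min_workers, max_workers
        self.scale_up_depth, self.scale_down_depth = scale_up_depth, scale_down_depth

    def plan(self, queue_depth: int) -> int:
        """Return the new worker count for the observed request backlog."""
        per_worker = queue_depth / self.workers
        if per_worker > self.scale_up_depth and self.workers < self.max_workers:
            self.workers += 1   # add an accelerator to absorb demand
        elif per_worker < self.scale_down_depth and self.workers > self.min_workers:
            self.workers -= 1   # release an accelerator when demand drops
        return self.workers

planner = Planner()
for depth in (10, 80, 150, 20, 2):
    print(depth, "->", planner.plan(depth))  # worker count tracks the backlog
```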
The NVIDIA Dynamo platform will be available in NVIDIA NIM microservices and will be supported in a future release of the NVIDIA AI Enterprise platform.