Nvidia's gargantuan Blackwell Ultra and upcoming Vera and Rubin CPUs and GPUs have certainly grabbed plenty of headlines at the corp's GPU Technology Conference this week. But arguably one of the most important announcements of the annual developer event wasn't a chip at all but rather a software framework called Dynamo, designed to tackle the challenges of AI inference at scale.
Announced on stage at GTC, it was described by CEO Jensen Huang as the "operating system of an AI factory," and drew comparisons to the real-world dynamo that kicked off an industrial revolution. "The dynamo was the first instrument that started the last industrial revolution," the chief exec said. "The industrial revolution of energy — water comes in, electricity comes out."
At its heart, the open source inference suite is designed to better optimize inference engines such as TensorRT-LLM, SGLang, and vLLM to run across large numbers of GPUs as quickly and efficiently as possible.
As we've previously discussed, the faster and cheaper you can turn out token after token from a model, the better the experience for users.
Inference is harder than it looks
At a high level, LLM output performance can be broken into two broad categories: Prefill and decode. Prefill is dictated by how quickly the GPU's floating-point matrix math accelerators can process the input prompt. The longer the prompt — say, a summarization task — the longer this typically takes.
Decode, on the other hand, is what most people associate with LLM performance, and equates to how quickly the GPUs can produce the actual tokens as a response to the user's prompt.
So long as your GPU has enough memory to fit the model, decode performance is usually a function of how fast that memory is and how many tokens you're generating. A GPU with 8TB/s of memory bandwidth will churn out tokens more than twice as fast as one with 3.35TB/s.
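To see where the "more than twice as fast" figure comes from, here is a back-of-the-envelope sketch. It assumes decode is purely memory-bound and that every generated token requires reading all of the model's weights once; it ignores KV cache traffic, batching, and compute limits. The 140GB model size is a hypothetical stand-in (roughly a 70-billion-parameter model at 16 bits per weight), not a figure from Nvidia.

```python
def decode_tokens_per_second(bandwidth_tb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on single-stream decode rate for a memory-bound model.

    Assumption: each token needs one full pass over the weights, so the rate
    is simply memory bandwidth divided by model size.
    """
    bytes_per_second = bandwidth_tb_s * 1e12
    bytes_per_token = model_size_gb * 1e9
    return bytes_per_second / bytes_per_token


MODEL_SIZE_GB = 140.0  # hypothetical model size, chosen only for illustration

fast = decode_tokens_per_second(8.0, MODEL_SIZE_GB)    # 8TB/s part
slow = decode_tokens_per_second(3.35, MODEL_SIZE_GB)   # 3.35TB/s part
speedup = fast / slow

print(f"{fast:.1f} vs {slow:.1f} tokens/s, ratio {speedup:.2f}x")
```

Under these assumptions the model size cancels out, so the ratio is just 8 divided by 3.35, or roughly 2.4x, whatever model you pick.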
Where things start to get complicated is when you start looking at serving up larger models to more people with longer input and output sequences, like you might see in an AI research assistant or reasoning model.
Large models are typically distributed across multiple GPUs, and the way this is accomplished can have a major impact on performance and throughput, something Huang discussed at length during his keynote.
"Under the Pareto frontier are millions of points we could have configured the datacenter to do. We could have parallelized and split the work and sharded the work in a whole lot of different ways," he said.
What he means is that, depending on your model's parallelism, you might be able to serve millions of concurrent users but only at 10 tokens a second each. Meanwhile, another combination might only be able to serve a few thousand concurrent requests but generate hundreds of tokens in the blink of an eye.
According to Huang, if you can figure out where on this curve your workload delivers the ideal mix of individual performance while also achieving the maximum throughput possible, you'll be able to charge a premium for your service and also drive down operating costs. We imagine this is the balancing act at least some LLM providers perform when scaling up their generative applications and services to more and more customers.
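The trade-off Huang describes can be illustrated with a small sketch that picks the Pareto-optimal points out of a set of candidate configurations. The configuration names and numbers below are invented purely for illustration and are not measurements of any real system; a configuration sits on the frontier if no other one is at least as good on both axes and strictly better on one.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    tokens_per_sec_per_user: float  # individual performance
    concurrent_users: int           # overall throughput proxy


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configs that no other config dominates on both axes."""
    frontier = []
    for c in configs:
        dominated = any(
            o.tokens_per_sec_per_user >= c.tokens_per_sec_per_user
            and o.concurrent_users >= c.concurrent_users
            and (
                o.tokens_per_sec_per_user > c.tokens_per_sec_per_user
                or o.concurrent_users > c.concurrent_users
            )
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.tokens_per_sec_per_user)


# Hypothetical parallelism configurations (made-up numbers).
configs = [
    Config("A", 10, 1_000_000),
    Config("B", 50, 200_000),
    Config("C", 40, 150_000),  # worse than B on both axes
    Config("D", 300, 3_000),
    Config("E", 250, 2_000),   # worse than D on both axes
]

frontier = pareto_frontier(configs)
frontier_names = [c.name for c in frontier]
print(frontier_names)
```

Everything off the frontier is a configuration you would never want to run, because another option is better for users and for throughput at the same time. Choosing among the points that remain is the business decision.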
Cranking the Dynamo
Finding this happy medium between performance and throughput is one of the key capabilities offered by Dynamo, we're told.
In addition to providing users with insights as to what the ideal mix of expert, pipeline, or tensor parallelism might be, Dynamo disaggregates prefill and decode onto different accelerators.
According to Nvidia, a GPU planner within Dynamo determines how many accelerators should be dedicated to prefill and decode based on demand.
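Nvidia hasn't detailed the planner's algorithm here, so the following is only a minimal sketch of the general idea: divide a pool of accelerators between the two phases in proportion to current demand. The function name, the load units (think queued GPU-seconds of work per phase), and the proportional policy are all assumptions for illustration, not Dynamo's actual behavior.

```python
from typing import Tuple


def plan_gpu_split(total_gpus: int, prefill_load: float, decode_load: float) -> Tuple[int, int]:
    """Split GPUs between prefill and decode in proportion to demand.

    Hypothetical policy: always keep at least one GPU per phase so neither
    pipeline stage is starved.
    """
    if total_gpus < 2:
        raise ValueError("need at least one GPU per phase")
    if prefill_load < 0 or decode_load < 0:
        raise ValueError("loads must be non-negative")

    total_load = prefill_load + decode_load
    if total_load == 0:
        prefill_gpus = total_gpus // 2  # no demand signal: split evenly
    else:
        prefill_gpus = round(total_gpus * prefill_load / total_load)

    # Clamp so both phases keep at least one accelerator.
    prefill_gpus = max(1, min(total_gpus - 1, prefill_gpus))
    return prefill_gpus, total_gpus - prefill_gpus


# Long prompts, short answers: prefill-heavy demand.
summarization_split = plan_gpu_split(8, prefill_load=30.0, decode_load=10.0)
# Short prompts, long answers: decode-heavy demand.
chat_split = plan_gpu_split(8, prefill_load=0.0, decode_load=100.0)
print(summarization_split, chat_split)
```

A real planner would presumably re-run a decision like this continuously as the traffic mix shifts, and weigh the cost of reassigning a GPU from one phase to the other.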
However, Dynamo isn't just a GPU profiler. The framework also includes prompt routing functionality, which identifies and directs overlapping requests to specific groups of GPUs to maximize the likelihood of a key-value (KV) cache hit.
If you're not familiar, the KV cache represents the state of the model at any given time. So, if multiple users ask similar questions in short order, the model can pull from this cache rather than recalculating the model state over and over again.
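As a rough illustration of cache-aware routing, the sketch below sends each request to whichever worker has already seen the longest matching prompt prefix, and falls back to round-robin when nothing overlaps. The class and worker names are hypothetical, whitespace-separated words stand in for tokens, and the matching is a naive linear scan; this is not Dynamo's router, which would also need to account for load and cache eviction.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


class PrefixAwareRouter:
    """Toy router that favors workers likely to hold a matching KV cache."""

    def __init__(self, worker_ids: Iterable[str]) -> None:
        self.worker_ids: List[str] = list(worker_ids)
        # Which token sequences each worker has processed (assumed cached).
        self.cached: Dict[str, List[Tuple[str, ...]]] = {w: [] for w in self.worker_ids}
        self._next = 0  # round-robin cursor for requests with no overlap

    @staticmethod
    def _shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def route(self, tokens: Iterable[str]) -> Tuple[str, int]:
        """Return (worker_id, number of prefix tokens expected to hit cache)."""
        request = tuple(tokens)
        best_worker: Optional[str] = None
        best_overlap = 0
        for worker in self.worker_ids:
            for seq in self.cached[worker]:
                overlap = self._shared_prefix_len(request, seq)
                if overlap > best_overlap:
                    best_worker, best_overlap = worker, overlap
        if best_worker is None:
            best_worker = self.worker_ids[self._next % len(self.worker_ids)]
            self._next += 1
        self.cached[best_worker].append(request)
        return best_worker, best_overlap


router = PrefixAwareRouter(["gpu-group-0", "gpu-group-1"])
first = router.route("You are a helpful assistant . Summarize report A".split())
second = router.route("Translate this sentence into French".split())
third = router.route("You are a helpful assistant . Summarize report B".split())
print(first, second, third)
```

The third request shares its system prompt and most of its instruction with the first, so it lands on the same group, where those leading tokens would not need to be prefilled again.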
Alongside the smart router, Dynamo also features a low-latency communication library to speed up GPU-to-GPU data flows, and a memory management subsystem that moves KV cache data between HBM and system memory or cold storage to maximize responsiveness and minimize wait times.
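The tiering idea can be sketched with a toy cache that keeps recently used entries in a small fast tier and spills the least recently used ones to a larger slow tier. This is an assumption-laden simplification: it models only two tiers rather than HBM, system memory, and cold storage, uses a plain least-recently-used policy, and stores placeholder strings instead of real KV tensors. None of the names correspond to Dynamo's API.

```python
from collections import OrderedDict
from typing import Any, Optional, Tuple


class TieredKVCache:
    """Toy two-tier LRU cache: a fast tier (think HBM) backed by a slow tier."""

    def __init__(self, fast_capacity: int, slow_capacity: int) -> None:
        if fast_capacity < 1 or slow_capacity < 0:
            raise ValueError("fast tier needs room for at least one entry")
        self.fast: "OrderedDict[str, Any]" = OrderedDict()
        self.slow: "OrderedDict[str, Any]" = OrderedDict()
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity

    def put(self, key: str, value: Any) -> None:
        self.slow.pop(key, None)       # an entry lives in one tier at a time
        self.fast[key] = value
        self.fast.move_to_end(key)     # mark as most recently used
        self._spill()

    def get(self, key: str) -> Tuple[Optional[Any], str]:
        """Return (value, tier) where tier is 'fast', 'slow', or 'miss'."""
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key], "fast"
        if key in self.slow:
            value = self.slow.pop(key)
            self.put(key, value)       # promote back into the fast tier
            return value, "slow"
        return None, "miss"

    def _spill(self) -> None:
        # Demote least recently used entries from fast to slow...
        while len(self.fast) > self.fast_capacity:
            key, value = self.fast.popitem(last=False)
            self.slow[key] = value
        # ...and drop the oldest slow entries once that tier is full too.
        while len(self.slow) > self.slow_capacity:
            self.slow.popitem(last=False)


cache = TieredKVCache(fast_capacity=2, slow_capacity=2)
for name in ("a", "b", "c"):
    cache.put(name, f"kv-{name}")   # "a" is pushed out to the slow tier

hit_a = cache.get("a")   # found in the slow tier, then promoted
hit_c = cache.get("c")   # still resident in the fast tier
miss = cache.get("z")    # never cached: would need a fresh prefill
print(hit_a, hit_c, miss)
```

A slow-tier hit still beats a miss, since copying cached state back is typically cheaper than recomputing the prefill, which is the bet a tiered design like this makes.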
For Hopper-based systems running Llama models, Nvidia claims Dynamo can effectively double the inference performance. Meanwhile for larger Blackwell NVL72 systems, the GPU giant claims a 30x advantage in DeepSeek-R1 over Hopper with the framework enabled.
Broad compatibility
While Dynamo is obviously tuned for Nvidia's hardware and software stacks, much like the Triton Inference Server it replaces, the framework is designed to integrate with popular software libraries for model serving, like vLLM, PyTorch, and SGLang.
This means, if you