Michael D. Kats
Nvidia's gargantuan Blackwell Ultra and upcoming Vera and Rubin CPUs and GPUs have certainly grabbed plenty of headlines at the corp's GPU Technology Conference this week. But arguably one of the most important announcements of the annual developer event wasn't a chip at all but rather a software framework called Dynamo, designed to tackle the challenges of AI inference at scale.
Announced on stage at GTC, it was described by CEO Jensen Huang as the "operating system of an AI factory," and drew comparisons to the real-world dynamo that kicked off an industrial revolution. "The dynamo was the first instrument that started the last industrial revolution," the chief exec said. "The industrial revolution of energy — water comes in, electricity comes out."
At its heart, the open source inference suite is designed to better optimize inference engines such as TensorRT LLM, SGLang, and vLLM to run across large quantities of GPUs as quickly and efficiently as possible.
As we've previously discussed, the faster and cheaper you can turn out token after token from a model, the better the experience for users.
Inference is harder than it looks
At a high level, LLM output performance can be broken into two broad categories: Prefill and decode. Prefill is dictated by how quickly the GPU's floating-point matrix math accelerators can process the input prompt. The longer the prompt — say, a summarization task — the longer this typically takes.
Decode, on the other hand, is what most people associate with LLM performance, and equates to how quickly the GPUs can produce the actual tokens as a response to the user's prompt.
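As a rough way to see the split: time-to-first-token is dominated by prefill, while the gap between subsequent tokens reflects decode speed. Here's a minimal sketch that measures both against any OpenAI-compatible endpoint (such as those exposed by vLLM or SGLang); the base URL, API key, and model name are placeholders you'd swap for your own.

```python
import time
from openai import OpenAI  # pip install openai

# Assumes a local OpenAI-compatible server (e.g. vLLM or SGLang); URL and model are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
token_times = []
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Summarize the history of the dynamo."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        token_times.append(time.perf_counter())

ttft = token_times[0] - start  # dominated by prefill
decode_s = token_times[-1] - token_times[0]
decode_rate = (len(token_times) - 1) / decode_s if decode_s > 0 else 0.0  # tokens/s during decode
print(f"Time to first token: {ttft:.2f}s, decode rate: {decode_rate:.1f} tok/s")
```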
So long as your GPU has enough memory to fit the model, decode performance is usually a function of how fast that memory is and how many tokens you're generating. A GPU with 8TB/s of memory bandwidth will churn out tokens more than twice as fast as one with 3.35TB/s.
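Those figures follow from a simple bandwidth-bound model: each generated token requires streaming roughly the full set of weights from memory, so peak decode rate is approximately bandwidth divided by model size. A back-of-envelope sketch, assuming a hypothetical 70-billion-parameter model at one byte per weight and ignoring KV cache traffic and other overheads:

```python
# Rough upper bound on decode throughput for a memory-bandwidth-bound model.
# Assumes one full pass over the weights per token; real-world numbers will be lower.
def max_tokens_per_second(bandwidth_tb_s: float, model_size_gb: float) -> float:
    return (bandwidth_tb_s * 1000) / model_size_gb  # GB/s divided by GB read per token

model_gb = 70  # hypothetical 70B-parameter model at 1 byte/weight
for bw in (3.35, 8.0):  # the HBM bandwidths cited above, in TB/s
    print(f"{bw} TB/s -> ~{max_tokens_per_second(bw, model_gb):.0f} tokens/s per user")
# 8 / 3.35 is roughly 2.4x, matching the "more than twice as fast" claim.
```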
Where things start to get complicated is when you start looking at serving up larger models to more people with longer input and output sequences, like you might see in an AI research assistant or reasoning model.
Large models are typically distributed across multiple GPUs, and the way this is accomplished can have a major impact on performance and throughput, something Huang discussed at length during his keynote.
"Under the Pareto frontier are millions of points we could have configured the datacenter to do. We could have parallelized and split the work and sharded the work in a whole lot of different ways," he said.
What he means is, depending on your model's parallelism, you might be able to serve millions of concurrent users, but only at 10 tokens a second each. Meanwhile, another combination might only be able to serve a few thousand concurrent requests, but generate hundreds of tokens in the blink of an eye.
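To make that trade-off concrete, here's a toy sketch that scores a few hypothetical parallelism configurations by aggregate throughput (concurrent users times tokens per second per user); the numbers are invented purely to illustrate the Pareto frontier Huang describes.

```python
# Toy illustration of the throughput-vs-latency trade-off; all figures are invented.
configs = [
    {"name": "max concurrency", "concurrent_users": 1_000_000, "tok_per_s_per_user": 10},
    {"name": "balanced",        "concurrent_users": 50_000,    "tok_per_s_per_user": 60},
    {"name": "max speed",       "concurrent_users": 2_000,     "tok_per_s_per_user": 300},
]

for c in configs:
    total = c["concurrent_users"] * c["tok_per_s_per_user"]  # aggregate tokens/s for the cluster
    print(f"{c['name']:>15}: {total:,.0f} tokens/s total at {c['tok_per_s_per_user']} tok/s per user")
```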
According to Huang, if you can figure out where on this curve your workload delivers the ideal mix of individual performance while also achieving the maximum throughput possible, you'll be able to charge a premium for your service and also drive down operating costs. We imagine this is the balancing act at least some LLM providers perform when scaling up their generative applications and services to more and more customers.
Cranking the Dynamo
Finding this happy medium between performance and throughput is one of the key capabilities offered by Dynamo, we're told.
In addition to providing users with insights as to what the ideal mix of expert, pipeline, or tensor parallelism might be, Dynamo disaggregates prefill and decode onto different accelerators.
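In broad strokes, disaggregated serving means a request's prompt is crunched on one pool of GPUs and the resulting KV cache is handed to another pool that streams out the response. A loose conceptual sketch of the idea (not Dynamo's actual API) might look like this:

```python
# Conceptual sketch of disaggregated prefill/decode serving; not Dynamo's actual API.
from dataclasses import dataclass

@dataclass
class KVCache:
    """Opaque handle to the attention key/value state produced by prefill."""
    data: bytes

def prefill_worker(prompt: str) -> KVCache:
    # Runs on compute-heavy GPUs: process the whole prompt in one pass.
    return KVCache(data=prompt.encode())  # stand-in for the real KV tensors

def decode_worker(cache: KVCache, max_tokens: int):
    # Runs on bandwidth-heavy GPUs: generate tokens one at a time from the cache.
    for i in range(max_tokens):
        yield f"token_{i}"

cache = prefill_worker("Summarize this very long document ...")
response = " ".join(decode_worker(cache, max_tokens=5))
print(response)
```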
According to Nvidia, a GPU planner within Dynamo determines how many accelerators should be dedicated to prefill and decode based on demand.
However, Dynamo isn't just a GPU profiler. The framework also includes prompt routing functionality, which identifies and directs overlapping requests to specific groups of GPUs to maximize the likelihood of a key-value (KV) cache hit.
If you're not familiar, the KV cache represents the state of the model at any given time. So, if multiple users ask similar questions in short order, the model can pull from this cache rather than recalculating the model state over and over again.
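One way to picture the router's job: hash the shared prefix of incoming prompts and steer requests with the same prefix to the worker that most likely already holds the matching KV cache. A simplified sketch of that idea (not Dynamo's implementation):

```python
# Simplified KV-cache-aware routing sketch; not Dynamo's implementation.
import hashlib

NUM_WORKERS = 4

def route(prompt: str, prefix_tokens: int = 8) -> int:
    """Pick a worker based on the prompt's leading words, so repeated or similar
    prompts land where their KV cache is most likely already resident."""
    prefix = " ".join(prompt.split()[:prefix_tokens])
    digest = hashlib.sha256(prefix.encode()).hexdigest()
    return int(digest, 16) % NUM_WORKERS

system = "You are a helpful research assistant. Answer concisely."
print(route(system + " What is the capital of France?"))
print(route(system + " Explain KV caches."))  # shared system-prompt prefix -> same worker
```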
Alongside the smart router, Dynamo also features a low-latency communication library to speed up GPU-to-GPU data flows, and a memory management subsystem that's responsible for pushing or pulling KV cache data from HBM to or from system memory or cold storage to maximize responsiveness and minimize wait times.
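That memory manager behaves a little like a tiered cache: hot KV entries stay in HBM, colder ones spill to host memory or storage and are pulled back when a matching request arrives. A stripped-down sketch of the eviction pattern (purely illustrative, not Dynamo's code):

```python
# Illustrative two-tier KV cache with LRU spill from "HBM" to "host memory".
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_slots: int):
        self.hbm = OrderedDict()   # fast tier, limited capacity
        self.host = {}             # slower tier, effectively unbounded here
        self.hbm_slots = hbm_slots

    def put(self, key: str, kv_blob: bytes):
        self.hbm[key] = kv_blob
        self.hbm.move_to_end(key)
        while len(self.hbm) > self.hbm_slots:        # spill least-recently-used entry
            old_key, old_blob = self.hbm.popitem(last=False)
            self.host[old_key] = old_blob

    def get(self, key: str) -> bytes | None:
        if key in self.hbm:
            self.hbm.move_to_end(key)
            return self.hbm[key]
        if key in self.host:                          # pull back into the fast tier
            blob = self.host.pop(key)
            self.put(key, blob)
            return blob
        return None

cache = TieredKVCache(hbm_slots=2)
for k in ("a", "b", "c"):
    cache.put(k, b"kv-state")
print("a" in cache.hbm, "a" in cache.host)  # False True: 'a' was spilled to host memory
```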
For Hopper-based systems running Llama models, Nvidia claims Dynamo can effectively double the inference performance. Meanwhile for larger Blackwell NVL72 systems, the GPU giant claims a 30x advantage in DeepSeek-R1 over Hopper with the framework enabled.
Broad compatibility
While Dynamo is obviously tuned for Nvidia's hardware and software stacks, much like the Triton Inference Server it replaces, the framework is designed to integrate with popular software libraries for model serving, like vLLM, PyTorch, and SGLang.
This means, if you