Nvidia's Blackwell Ultra and upcoming Vera and Rubin CPUs and GPUs

Mar 24, 2025 at 01:38 am

By Michael D. Kats

Nvidia's gargantuan Blackwell Ultra and upcoming Vera and Rubin CPUs and GPUs have certainly grabbed plenty of headlines at the corp's GPU Technology Conference this week. But arguably one of the most important announcements of the annual developer event wasn't a chip at all but rather a software framework called Dynamo, designed to tackle the challenges of AI inference at scale.

Announced on stage at GTC, it was described by CEO Jensen Huang as the "operating system of an AI factory," and drew comparisons to the real-world dynamo that kicked off an industrial revolution. "The dynamo was the first instrument that started the last industrial revolution," the chief exec said. "The industrial revolution of energy — water comes in, electricity comes out."

At its heart, the open source inference suite is designed to better optimize inference engines such as TensorRT LLM, SGLang, and vLLM to run across large quantities of GPUs as quickly and efficiently as possible.

As we've previously discussed, the faster and cheaper you can turn out token after token from a model, the better the experience for users.

Inference is harder than it looks

At a high level, LLM output performance can be broken into two broad categories: Prefill and decode. Prefill is dictated by how quickly the GPU's floating-point matrix math accelerators can process the input prompt. The longer the prompt — say, a summarization task — the longer this typically takes.

Decode, on the other hand, is what most people associate with LLM performance, and equates to how quickly the GPUs can produce the actual tokens as a response to the user's prompt.

So long as your GPU has enough memory to fit the model, decode performance is usually a function of how fast that memory is and how many tokens you're generating. A GPU with 8TB/s of memory bandwidth will churn out tokens more than twice as fast as one with 3.35TB/s.
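
To make that concrete, here is a rough back-of-the-envelope sketch of why memory bandwidth sets the decode ceiling. It assumes decode is purely bandwidth-bound and that every generated token requires streaming the full set of weights from HBM once; the model size and bandwidth figures are illustrative, not benchmarks.

```python
# Back-of-the-envelope decode-rate estimate (a rough sketch, not a benchmark).
# Assumes decode is purely memory-bandwidth bound: every generated token
# requires streaming the full set of model weights from HBM once. Real systems
# batch requests and also move KV cache data, so treat these as ceilings.

def decode_tokens_per_sec(bandwidth_tb_s: float,
                          params_billion: float,
                          bytes_per_param: float = 2.0) -> float:
    """Upper-bound tokens/sec for a single request on one GPU."""
    weight_bytes = params_billion * 1e9 * bytes_per_param   # model size in bytes
    bandwidth_bytes = bandwidth_tb_s * 1e12                 # TB/s -> bytes/s
    return bandwidth_bytes / weight_bytes

if __name__ == "__main__":
    model_b = 70  # hypothetical 70B-parameter model in FP16/BF16
    for bw in (3.35, 8.0):  # roughly Hopper-class vs Blackwell-class HBM bandwidth
        print(f"{bw} TB/s -> ~{decode_tokens_per_sec(bw, model_b):.0f} tokens/s (ceiling)")
```

Run with those assumptions, the 8TB/s part comes out about 2.4x faster per token than the 3.35TB/s one, which is where the "more than twice as fast" figure comes from.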

Where things start to get complicated is when you start looking at serving up larger models to more people with longer input and output sequences, like you might see in an AI research assistant or reasoning model.

Large models are typically distributed across multiple GPUs, and the way this is accomplished can have a major impact on performance and throughput, something Huang discussed at length during his keynote.

"Under the Pareto frontier are millions of points we could have configured the datacenter to do. We could have parallelized and split the work and sharded the work in a whole lot of different ways," he said.

What he means is that, depending on your model's parallelism, you might be able to serve millions of concurrent users but only at 10 tokens a second each. Meanwhile, another combination might only be able to serve a few thousand concurrent requests, but generate hundreds of tokens in the blink of an eye.

According to Huang, if you can figure out where on this curve your workload delivers the ideal balance of per-user performance and total throughput, you'll be able to charge a premium for your service while also driving down operating costs. We imagine this is the balancing act at least some LLM providers perform when scaling up their generative applications and services to more and more customers.
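
To illustrate the trade-off Huang is describing, here is a toy sketch that sweeps a handful of invented parallelism configurations and picks the one that maximizes total throughput while keeping per-user speed above a floor. The configuration names and numbers are made up purely for illustration.

```python
# Toy illustration of the throughput-vs-interactivity trade-off.
# The configurations and figures below are invented; real values depend on
# the model, the hardware, and the serving stack.

from dataclasses import dataclass

@dataclass
class Config:
    name: str                 # parallelism scheme (hypothetical)
    tokens_per_user: float    # per-user decode speed, tokens/s
    concurrent_users: int     # requests served at that speed

    @property
    def total_throughput(self) -> float:
        return self.tokens_per_user * self.concurrent_users

configs = [
    Config("tensor-parallel, huge batch", 10, 1_000_000),
    Config("pipeline-parallel, medium batch", 40, 100_000),
    Config("expert-parallel, small batch", 150, 5_000),
]

# Pick the point on the curve that keeps users happy (>= 30 tok/s each)
# while maximizing total tokens served per second.
MIN_INTERACTIVE = 30.0
viable = [c for c in configs if c.tokens_per_user >= MIN_INTERACTIVE]
best = max(viable, key=lambda c: c.total_throughput)
print(f"Chosen config: {best.name} "
      f"({best.tokens_per_user} tok/s/user, {best.total_throughput:,.0f} tok/s total)")
```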

Cranking the Dynamo

Finding this happy medium between performance and throughput is one of the key capabilities offered by Dynamo, we're told.

In addition to providing users with insight into what the ideal mix of expert, pipeline, or tensor parallelism might be, Dynamo disaggregates prefill and decode onto different accelerators.

According to Nvidia, a GPU planner within Dynamo determines how many accelerators should be dedicated to prefill and decode based on demand.
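
Nvidia doesn't detail the planner's internals here, but the idea can be sketched roughly: split a fixed pool of GPUs between prefill and decode in proportion to the work queued for each phase. The function below is a hypothetical illustration, not Dynamo's actual API, and its per-GPU throughput figures are assumptions.

```python
# Hypothetical sketch of a prefill/decode "GPU planner" decision.
# This is NOT Dynamo's real API; it just illustrates splitting a fixed pool
# of GPUs between the two phases based on observed demand.

def plan_gpu_split(total_gpus: int,
                   pending_prefill_tokens: int,
                   pending_decode_tokens: int,
                   prefill_tokens_per_gpu_s: float = 40_000,   # assumed rates
                   decode_tokens_per_gpu_s: float = 2_000) -> tuple[int, int]:
    """Return (prefill_gpus, decode_gpus) proportional to the work queued in
    each phase, normalized by how quickly one GPU can clear that work."""
    prefill_load = pending_prefill_tokens / prefill_tokens_per_gpu_s
    decode_load = pending_decode_tokens / decode_tokens_per_gpu_s
    total_load = prefill_load + decode_load or 1.0
    prefill_gpus = max(1, round(total_gpus * prefill_load / total_load))
    prefill_gpus = min(prefill_gpus, total_gpus - 1)  # always keep decode capacity
    return prefill_gpus, total_gpus - prefill_gpus

# Long prompts queued up, relatively little generation in flight:
print(plan_gpu_split(total_gpus=16,
                     pending_prefill_tokens=2_000_000,
                     pending_decode_tokens=300_000))
```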

However, Dynamo isn't just a GPU profiler. The framework also includes prompt routing functionality, which identifies and directs overlapping requests to specific groups of GPUs to maximize the likelihood of a key-value (KV) cache hit.

If you're not familiar, the KV cache stores the attention keys and values the model has already computed for a given prompt. So, if multiple users ask similar questions in short order, the model can reuse that cached state rather than recalculating it over and over again.
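
One simple way to picture cache-aware routing is to hash the leading portion of each prompt and pin it to a worker group, so repeated or overlapping prompts land where their KV cache already lives. The snippet below is a hypothetical sketch of that idea only; Dynamo's actual router is more sophisticated than a prefix hash.

```python
# Hypothetical sketch of KV-cache-aware request routing: send prompts that
# share a prefix to the same worker group so its cached keys/values can be
# reused. This is an illustration, not Dynamo's real routing logic.

import hashlib

NUM_WORKER_GROUPS = 4
PREFIX_WORDS = 64  # route on the first chunk of the prompt (an assumption)

def route_request(prompt: str) -> int:
    """Map a prompt's leading text to a worker group deterministically,
    so repeated or similar prompts land where their KV cache already lives."""
    prefix = " ".join(prompt.split()[:PREFIX_WORDS])
    digest = hashlib.sha256(prefix.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_WORKER_GROUPS

print(route_request("Summarize the following earnings report: ..."))
print(route_request("Summarize the following earnings report: ..."))  # same group -> likely cache hit
```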

Alongside the smart router, Dynamo also features a low-latency communication library to speed up GPU-to-GPU data flows, and a memory management subsystem responsible for pushing KV cache data out of HBM into system memory or cold storage, and pulling it back when needed, to maximize responsiveness and minimize wait times.
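
That offloading behavior resembles a tiered, least-recently-used cache: hot KV blocks stay in HBM, colder ones spill to host memory and then to storage, and anything touched again gets promoted back. The class below is a toy sketch of that pattern with invented names, not Dynamo's actual memory subsystem.

```python
# Toy sketch of tiered KV-cache management: keep hot entries in (simulated) HBM,
# spill the least-recently-used ones to host memory, and fall back to "cold
# storage" beyond that. Purely illustrative; not Dynamo's actual subsystem.

from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_slots: int, host_slots: int):
        self.hbm = OrderedDict()      # fastest tier (GPU HBM)
        self.host = OrderedDict()     # slower tier (system memory)
        self.cold = {}                # slowest tier (disk / object store)
        self.hbm_slots, self.host_slots = hbm_slots, host_slots

    def put(self, seq_id: str, kv_blocks: bytes) -> None:
        self.hbm[seq_id] = kv_blocks
        self.hbm.move_to_end(seq_id)
        while len(self.hbm) > self.hbm_slots:          # spill LRU entry downward
            victim, blocks = self.hbm.popitem(last=False)
            self.host[victim] = blocks
        while len(self.host) > self.host_slots:
            victim, blocks = self.host.popitem(last=False)
            self.cold[victim] = blocks

    def get(self, seq_id: str):
        for tier in (self.hbm, self.host, self.cold):
            if seq_id in tier:
                blocks = tier.pop(seq_id)
                self.put(seq_id, blocks)               # promote back to HBM
                return blocks
        return None

cache = TieredKVCache(hbm_slots=2, host_slots=2)
for i in range(5):
    cache.put(f"session-{i}", b"kv")
print(cache.get("session-0") is not None)  # served from a lower tier, then promoted
```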

For Hopper-based systems running Llama models, Nvidia claims Dynamo can effectively double the inference performance. Meanwhile for larger Blackwell NVL72 systems, the GPU giant claims a 30x advantage in DeepSeek-R1 over Hopper with the framework enabled.

Broad compatibility

While Dynamo is obviously tuned for Nvidia's hardware and software stacks, much like the Triton Inference Server it replaces, the framework is designed to integrate with popular software libraries for model serving, like vLLM, PyTorch, and SGLang.

This means, if you
