Introducing speculative decoding, asynchronous batch API, and expanded LoRA support to Workers AI

Apr 11, 2025 at 09:00 pm

Since the launch of Workers AI in September, our mission has been to make inference accessible to everyone. Over the last few quarters, our Workers AI team has been heads down on improving the quality of our platform, working on various routing improvements, GPU optimizations, and capacity management improvements. Managing a distributed inference platform is not a simple task, but distributed systems are also what we do best. You’ll notice a recurring theme from all these announcements that has always been part of the core Cloudflare ethos — we try to solve problems through clever engineering so that we are able to do more with less.

Today, we’re excited to introduce speculative decoding to bring you faster inference, an asynchronous batch API for large workloads, and expanded LoRA support for more customized responses. Lastly, we’ll be recapping some of our newly added models, updated pricing, and unveiling a new dashboard to round out the usability of the platform.

Speeding up inference by 2-4x with speculative decoding and more

We’re excited to be rolling out speed improvements to models in our catalog, starting with the Llama 3.3 70b model. These improvements include speculative decoding, prefix caching, an updated inference backend, and more. We’ve previously done a technical deep dive on speculative decoding and how we’re making Workers AI faster, which you can read about here. With these changes, we’ve been able to improve inference times by 2-4x, without any significant change to the quality of answers generated. We’re planning to incorporate these improvements into more models in the future as we release them. Today, we’re starting to roll out these changes so all Workers AI users of @cf/meta/llama-3.3-70b-instruct-fp8-fast will enjoy this automatic speed boost.
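
If you want to try the faster model from a Worker, a minimal sketch of the call is below. It assumes an AI binding named "AI" configured in wrangler.toml, and the exact input fields may differ for your use case; the speed improvements themselves require no code changes on your side.

// Minimal sketch: calling the upgraded model through the Workers AI binding.
// The binding name "AI" is an assumption; declare it in wrangler.toml.
export interface Env {
  AI: { run(model: string, input: Record<string, unknown>): Promise<unknown> };
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    // Speculative decoding and prefix caching are applied server-side,
    // so the request itself is unchanged.
    const result = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
      messages: [
        { role: "system", content: "You are a concise assistant." },
        { role: "user", content: "Summarize speculative decoding in two sentences." },
      ],
    });
    return Response.json(result);
  },
};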

What is speculative decoding?

LLMs generate text by predicting the next token in a sequence given the previous tokens. Typically, an LLM predicts a single future token (n+1) with one forward pass through the model. These forward passes are computationally expensive, since they must work through all of the model's parameters to produce one token (e.g., 70 billion parameters for Llama 3.3 70b).
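
As a rough illustration, a plain autoregressive loop looks like the sketch below; model.forward is a hypothetical stand-in for one full pass through all of the model's parameters, not an actual Workers AI call.

type Token = number;

interface Model {
  // One expensive forward pass over every parameter, returning the next token.
  forward(context: Token[]): Token;
}

function greedyDecode(model: Model, prompt: Token[], maxTokens: number): Token[] {
  const output = [...prompt];
  for (let i = 0; i < maxTokens; i++) {
    // Each new token (n+1) costs one full pass through the model.
    output.push(model.forward(output));
  }
  return output.slice(prompt.length);
}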

With speculative decoding, we put a small model (known as the draft model) in front of the original model to help predict n+x future tokens. The draft model generates a sequence of candidate tokens, and the original model only has to evaluate them and confirm whether they should be included in the generation. Evaluating tokens is less computationally expensive, because the model can score multiple candidates concurrently in a single forward pass. As a result, inference times can be sped up by 2-4x, meaning that users get responses much faster.
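
Sketched in the same style, and again with hypothetical helpers rather than our actual implementation, the speculative loop lets the draft model guess several tokens and the original model verify them in one pass:

type Token = number; // same alias as in the previous sketch

interface DraftModel {
  // Cheap pass of the small model: guess k future tokens.
  propose(context: Token[], k: number): Token[];
}

interface TargetModel {
  // One forward pass of the large model scores all candidates at once and
  // returns the prefix it agrees with, plus its own next token.
  verify(context: Token[], candidates: Token[]): { accepted: Token[]; next: Token };
}

function speculativeDecode(
  draft: DraftModel,
  target: TargetModel,
  prompt: Token[],
  maxTokens: number,
  k = 4,
): Token[] {
  const output = [...prompt];
  while (output.length - prompt.length < maxTokens) {
    const candidates = draft.propose(output, k);
    const { accepted, next } = target.verify(output, candidates);
    output.push(...accepted, next);
  }
  // Trim in case the last round overshot maxTokens.
  return output.slice(prompt.length, prompt.length + maxTokens);
}

In the best case all k drafted tokens are accepted and a single large-model pass yields k+1 tokens; if the draft guesses poorly, the large model still produces at least one token per pass, so in this sketch the output matches what the original model would have generated on its own.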

What makes speculative decoding particularly efficient is that it uses GPU compute that would otherwise go to waste: LLM inference is typically limited by the GPU memory bottleneck, leaving compute capacity idle. Speculative decoding takes advantage of that spare compute by squeezing in a draft model to generate tokens faster, which lets us drive our GPUs to their full extent instead of leaving parts of them sitting idle.

What is prefix caching?

With LLMs, there are usually two stages of generation: the first, known as "pre-fill", processes the user's input tokens such as the prompt and context, and the second, "decode", generates the response token by token. Prefix caching is aimed at reducing the pre-fill time of a request. As an example, if you were asking a model to generate code based on a given file, you might insert the whole file into the context window of a request. Then, if you want to make a second request to generate the next line of code, you might send us the whole file again. Prefix caching allows us to cache the pre-fill tokens so we don't have to process the context twice: with the same example, we would only do the pre-fill stage once for both requests, rather than once per request. This method is especially useful for requests that reuse the same context, such as Retrieval Augmented Generation (RAG), code generation, chatbots with memory, and more. Skipping the pre-fill stage for similar requests means faster responses for our users and more efficient usage of resources.
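
To make the example concrete, the sketch below shows the request pattern that benefits from prefix caching; the caching itself happens on our side, and the binding shape and response field shown here are assumptions rather than a prescribed client setup.

// Two requests that share the same long prefix (system prompt plus the whole file).
interface Env {
  AI: { run(model: string, input: Record<string, unknown>): Promise<{ response: string }> };
}

async function suggestNextLines(env: Env, fileContents: string): Promise<string[]> {
  const model = "@cf/meta/llama-3.3-70b-instruct-fp8-fast";
  const sharedPrefix = [
    { role: "system", content: "You are a code completion assistant." },
    { role: "user", content: "Here is the file:\n" + fileContents },
  ];

  // Request 1: the full file is processed during pre-fill.
  const first = await env.AI.run(model, {
    messages: [...sharedPrefix, { role: "user", content: "Suggest the next line of code." }],
  });

  // Request 2: the shared prefix can be served from the prefix cache, so only
  // the new question adds pre-fill work.
  const second = await env.AI.run(model, {
    messages: [...sharedPrefix, { role: "user", content: "Now suggest the line after that." }],
  });

  return [first.response, second.response];
}

Nothing about the second request changes on the client; the benefit comes from sending an identical prefix, so structuring prompts with the stable context first and the varying question last makes cache reuse more likely.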

How did you validate that quality is preserved through these optimizations?

Since this is an in-place update to an existing model, we were particularly cautious to ensure that we would not break any existing applications. We did extensive A/B testing through a blind arena with internal employees to validate model quality, and we asked internal and external customers to test the new version of the model to make sure that response formats remained compatible and quality was acceptable. Our testing concluded that the model performed up to standard, and testers were particularly excited about its speed. Most LLMs are not perfectly deterministic even with the same set of inputs, but if you do notice something off with the new version, please let us know.
