Cryptocurrency News

Introducing speculative decoding, async batch API, and expanded LoRA support for Workers AI

2025/04/11 21:00

Over the past few quarters, our Workers AI team has been working to improve the quality of our platform, focusing on various routing improvements

Since the launch of Workers AI in September, our mission has been to make inference accessible to everyone. Over the last few quarters, our Workers AI team has been heads down on improving the quality of our platform, working on various routing improvements, GPU optimizations, and capacity management improvements. Managing a distributed inference platform is not a simple task, but distributed systems are also what we do best. You’ll notice a recurring theme from all these announcements that has always been part of the core Cloudflare ethos — we try to solve problems through clever engineering so that we are able to do more with less.

Today, we’re excited to introduce speculative decoding to bring you faster inference, an asynchronous batch API for large workloads, and expanded LoRA support for more customized responses. Lastly, we’ll be recapping some of our newly added models, updated pricing, and unveiling a new dashboard to round out the usability of the platform.

Speeding up inference by 2-4x with speculative decoding and more

We’re excited to be rolling out speed improvements to models in our catalog, starting with the Llama 3.3 70b model. These improvements include speculative decoding, prefix caching, an updated inference backend, and more. We’ve previously done a technical deep dive on speculative decoding and how we’re making Workers AI faster, which you can read about here. With these changes, we’ve been able to improve inference times by 2-4x, without any significant change to the quality of answers generated. We’re planning to incorporate these improvements into more models in the future as we release them. Today, we’re starting to roll out these changes so all Workers AI users of @cf/meta/llama-3.3-70b-instruct-fp8-fast will enjoy this automatic speed boost.
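
For existing applications, no code changes are needed to pick up the speedup. As a rough sketch (assuming the standard Workers AI binding, here named `AI`, the `Ai` type from `@cloudflare/workers-types`, and the usual `prompt`-style input for this model; check the Workers AI docs for the exact request and response shapes), a Worker that already calls this model simply keeps doing what it does and gets faster responses:

```ts
// Sketch of a Worker that already uses the model; the 2-4x inference
// speedup described above is applied on Cloudflare's side, so this code
// does not change. Assumes an AI binding configured as "AI" for this
// Worker and types from @cloudflare/workers-types.
export interface Env {
  AI: Ai;
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    const result = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
      prompt: "Explain speculative decoding in two sentences.",
    });
    return Response.json(result);
  },
};
```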

What is speculative decoding?

LLMs generate text by predicting the next token in a sentence given the previous tokens. Typically, an LLM is able to predict a single future token (n+1) with one forward pass through the model. These forward passes can be computationally expensive, since they need to work through all the parameters of a model to generate one token (e.g., 70 billion parameters for Llama 3.3 70b).
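
To make that cost structure concrete, here is a minimal, purely illustrative sketch (the `Model` interface and `argmax` helper are hypothetical, not Workers AI internals) of ordinary greedy decoding, where every new token requires one full forward pass through all of the model's weights:

```ts
// Hypothetical model interface: one forward pass returns, for every input
// position, the logits over the vocabulary for the next token.
export interface Model {
  forward(tokens: number[]): Promise<number[][]>; // [position][vocabId] -> logit
}

export const argmax = (logits: number[]): number =>
  logits.reduce((best, v, i) => (v > logits[best] ? i : best), 0);

// Ordinary greedy decoding: one expensive forward pass per generated token.
export async function greedyDecode(
  model: Model,
  prompt: number[],
  maxNewTokens: number,
): Promise<number[]> {
  const tokens = [...prompt];
  for (let step = 0; step < maxNewTokens; step++) {
    const logits = await model.forward(tokens);
    tokens.push(argmax(logits[logits.length - 1])); // only token n+1 is produced
  }
  return tokens;
}
```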

With speculative decoding, we put a small model (known as the draft model) in front of the original model that helps predict n+x future tokens. The draft model generates a subset of candidate tokens, and the original model just has to evaluate and confirm if they should be included in the generation. Evaluating tokens is less computationally expensive, as the model can evaluate multiple tokens concurrently in a forward pass. As such, inference times can be sped up by 2-4x — meaning that users can get responses much faster.
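
Continuing the hypothetical sketch above: the draft model proposes a handful of candidate tokens, and the large target model verifies them all with a single forward pass. This version uses simple greedy acceptance to keep the idea visible; production systems typically use a probabilistic accept/reject rule that preserves the target model's output distribution.

```ts
// One speculative decoding step, reusing the hypothetical Model interface
// and argmax helper from the previous sketch. `draft` is the small model,
// `target` is the large model whose answers we actually want.
export async function speculativeStep(
  target: Model,
  draft: Model,
  tokens: number[],
  k: number, // how many candidate tokens the draft proposes per step
): Promise<number[]> {
  // 1. The cheap draft model proposes k candidate tokens.
  const candidates: number[] = [];
  for (let i = 0; i < k; i++) {
    const logits = await draft.forward([...tokens, ...candidates]);
    candidates.push(argmax(logits[logits.length - 1]));
  }

  // 2. The target model scores the prompt plus ALL candidates in one
  //    forward pass, instead of one pass per token.
  const logits = await target.forward([...tokens, ...candidates]);

  // 3. Accept candidates for as long as they match what the target would
  //    have generated anyway; on the first mismatch, keep the target's own
  //    token and stop. Each target pass therefore yields 1..k new tokens.
  const out = [...tokens];
  for (let i = 0; i < k; i++) {
    const targetToken = argmax(logits[tokens.length - 1 + i]);
    out.push(targetToken);
    if (targetToken !== candidates[i]) break;
  }
  return out;
}
```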

What makes speculative decoding particularly efficient is that it’s able to use unused GPU compute left behind due to the GPU memory bottleneck LLMs create. Speculative decoding takes advantage of this unused compute by squeezing in a draft model to generate tokens faster. This means we’re able to improve the utilization of our GPUs by using them to their full extent without having parts of the GPU sit idle.
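
As a rough back-of-the-envelope illustration of that bottleneck (the bandwidth and throughput figures below are assumed order-of-magnitude numbers for a modern datacenter GPU, not measured Workers AI figures): decoding one token for a 70B-parameter fp8 model must stream roughly 70 GB of weights from GPU memory, yet only needs on the order of 2 FLOPs per parameter of arithmetic, so the compute units spend most of each step waiting.

```ts
// Order-of-magnitude sketch only; all hardware numbers are assumptions.
const params = 70e9;            // Llama 3.3 70B parameters
const bytesPerParam = 1;        // fp8 weights are roughly 1 byte per parameter
const memBandwidth = 3e12;      // assumed ~3 TB/s HBM bandwidth
const fp8Throughput = 1e15;     // assumed ~1 PFLOP/s fp8 tensor throughput

const weightStreamMs = ((params * bytesPerParam) / memBandwidth) * 1e3; // ~23 ms
const arithmeticMs = ((2 * params) / fp8Throughput) * 1e3;              // ~0.14 ms

console.log(`weight streaming per token: ~${weightStreamMs.toFixed(1)} ms`);
console.log(`arithmetic per token:       ~${arithmeticMs.toFixed(2)} ms`);
// The gap between the two is the idle compute that a small draft model
// (and batched verification of its candidates) can put to work.
```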

What is prefix caching?

With LLMs, there are usually two stages of generation: the first, known as “pre-fill”, processes the user’s input tokens such as the prompt and context, and the second, decoding, generates the output tokens. Prefix caching is aimed at reducing the pre-fill time of a request. As an example, if you were asking a model to generate code based on a given file, you might insert the whole file into the context window of a request. Then, if you want to make a second request to generate the next line of code, you might send us the whole file again in the second request. Prefix caching allows us to cache the pre-fill tokens so we don’t have to process the context twice. With the same example, we would only do the pre-fill stage once for both requests, rather than doing it per request. This method is especially useful for requests that reuse the same context, such as Retrieval Augmented Generation (RAG), code generation, chatbots with memory, and more. Skipping the pre-fill stage for similar requests means faster responses for our users and more efficient usage of resources.
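
Here is a minimal sketch of the idea (every name below is hypothetical and not the actual Workers AI implementation, which handles this transparently on the server side): the expensive pre-fill result is cached keyed by the shared context, so a second request that sends the same file or document skips straight to decoding.

```ts
// Hypothetical handle for the pre-fill result (attention key/value state).
export interface PrefillState {
  contextLength: number;
}

// Hypothetical inference engine split into its two stages.
export interface Engine {
  prefill(contextTokens: number[]): Promise<PrefillState>;                  // expensive
  decodeFrom(state: PrefillState, promptTokens: number[]): Promise<string>; // cheap
}

// Naive in-memory prefix cache; a real system would hash the prefix,
// bound the cache size, and evict old entries.
const prefixCache = new Map<string, PrefillState>();

export async function generate(
  engine: Engine,
  contextTokens: number[], // e.g. the whole source file sent with every request
  promptTokens: number[],  // e.g. "write the next line of code"
): Promise<string> {
  const key = contextTokens.join(",");
  let state = prefixCache.get(key);
  if (!state) {
    state = await engine.prefill(contextTokens); // paid once per distinct context
    prefixCache.set(key, state);
  }
  return engine.decodeFrom(state, promptTokens); // repeat requests reuse the pre-fill
}
```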

How did you validate that quality is preserved through these optimizations?

Since this is an in-place update to an existing model, we were particularly cautious in ensuring that we would not break any existing applications with this update. We did extensive A/B testing through a blind arena with internal employees to validate the model quality, and we asked internal and external customers to test the new version of the model to ensure that response formats were compatible and model quality was acceptable. Our testing concluded that the model performed up to standards, with people being extremely excited about the speed of the model. Most LLMs are not perfectly deterministic even with the same set of inputs, but if you do notice something
