Introducing speculative decoding, an asynchronous batch API, and expanded LoRA support for Workers AI

2025/04/11 21:00

Over the last few quarters, our Workers AI team has been improving the quality of our platform, working on various routing improvements.

Since the launch of Workers AI in September, our mission has been to make inference accessible to everyone. Over the last few quarters, our Workers AI team has been heads down on improving the quality of our platform, working on various routing improvements, GPU optimizations, and capacity management improvements. Managing a distributed inference platform is not a simple task, but distributed systems are also what we do best. You’ll notice a recurring theme from all these announcements that has always been part of the core Cloudflare ethos — we try to solve problems through clever engineering so that we are able to do more with less.

Today, we’re excited to introduce speculative decoding to bring you faster inference, an asynchronous batch API for large workloads, and expanded LoRA support for more customized responses. Lastly, we’ll be recapping some of our newly added models, updated pricing, and unveiling a new dashboard to round out the usability of the platform.

Speeding up inference by 2-4x with speculative decoding and more

We’re excited to be rolling out speed improvements to models in our catalog, starting with the Llama 3.3 70b model. These improvements include speculative decoding, prefix caching, an updated inference backend, and more. We’ve previously done a technical deep dive on speculative decoding and how we’re making Workers AI faster, which you can read about here. With these changes, we’ve been able to improve inference times by 2-4x, without any significant change to the quality of answers generated. We’re planning to incorporate these improvements into more models in the future as we release them. Today, we’re starting to roll out these changes so all Workers AI users of @cf/meta/llama-3.3-70b-instruct-fp8-fast will enjoy this automatic speed boost.
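
For reference, calling the model from a Worker stays exactly the same; the sketch below assumes a Workers AI binding named AI configured in wrangler.toml and types from @cloudflare/workers-types, with the speed-up applied transparently on the server side.

```typescript
// Minimal sketch of calling the upgraded model from a Worker.
// Assumes a Workers AI binding named "AI" in wrangler.toml; the speed
// improvements are server-side, so existing calls need no changes.
export interface Env {
  AI: Ai;
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    const result = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
      messages: [
        { role: "system", content: "You are a concise assistant." },
        { role: "user", content: "Explain speculative decoding in one sentence." },
      ],
    });
    return Response.json(result);
  },
};
```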

What is speculative decoding?

LLMs generate text by predicting the next token in a sequence given the previous tokens. Typically, an LLM is able to predict a single future token (n+1) with one forward pass through the model. These forward passes can be computationally expensive, since they need to work through all the parameters of a model to generate one token (e.g., 70 billion parameters for Llama 3.3 70b).

With speculative decoding, we put a small model (known as the draft model) in front of the original model that helps predict n+x future tokens. The draft model generates a subset of candidate tokens, and the original model just has to evaluate and confirm if they should be included in the generation. Evaluating tokens is less computationally expensive, as the model can evaluate multiple tokens concurrently in a forward pass. As such, inference times can be sped up by 2-4x — meaning that users can get responses much faster.

What makes speculative decoding particularly efficient is that it’s able to use unused GPU compute left behind due to the GPU memory bottleneck LLMs create. Speculative decoding takes advantage of this unused compute by squeezing in a draft model to generate tokens faster. This means we’re able to improve the utilization of our GPUs by using them to their full extent without having parts of the GPU sit idle.
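
To make the draft-and-verify loop concrete, here is a simplified, greedy-acceptance sketch of the idea. The DraftModel and TargetModel interfaces are hypothetical stand-ins rather than Workers AI APIs, and production implementations accept or reject draft tokens probabilistically so the large model's output distribution is preserved.

```typescript
// Simplified, greedy-acceptance speculative decoding step (illustrative only).
// DraftModel and TargetModel are hypothetical stand-ins, not Workers AI APIs.

type Token = number;

interface DraftModel {
  // Cheaply proposes k candidate tokens that continue the context.
  propose(context: Token[], k: number): Token[];
}

interface TargetModel {
  // One forward pass over context + candidates returns, for each candidate
  // position, the token the large model itself would have produced there.
  verify(context: Token[], candidates: Token[]): Token[];
}

function speculativeDecodeStep(
  context: Token[],
  draft: DraftModel,
  target: TargetModel,
  k = 4,
): Token[] {
  const proposed = draft.propose(context, k);
  const preferred = target.verify(context, proposed);

  const accepted: Token[] = [];
  for (let i = 0; i < proposed.length; i++) {
    if (proposed[i] === preferred[i]) {
      // The large model agrees with the draft: this token only cost a
      // (cheap) verification, not a full generation pass.
      accepted.push(proposed[i]);
    } else {
      // First disagreement: keep the large model's token and stop.
      accepted.push(preferred[i]);
      break;
    }
  }

  // Up to k tokens are emitted per large-model forward pass instead of one,
  // which is where the 2-4x speed-up comes from.
  return [...context, ...accepted];
}
```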

What is prefix caching?

With LLMs, there are usually two stages of generation — the first is known as “pre-fill”, which processes the user’s input tokens such as the prompt and context. Prefix caching is aimed at reducing the pre-fill time of a request. As an example, if you were asking a model to generate code based on a given file, you might insert the whole file into the context window of a request. Then, if you want to make a second request to generate the next line of code, you might send us the whole file again in the second request. Prefix caching allows us to cache the pre-fill tokens so we don’t have to process the context twice. With the same example, we would only do the pre-fill stage once for both requests, rather than doing it per request. This method is especially useful for requests that reuse the same context, such as Retrieval Augmented Generation (RAG), code generation, chatbots with memory, and more. Skipping the pre-fill stage for similar requests means faster responses for our users and more efficient usage of resources.
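
As a rough illustration (not the actual Workers AI implementation), a prefix cache can be thought of as keying the expensive pre-fill result on the shared context, so a request that repeats the same prefix skips straight to decoding. The PrefillState and runPrefill names below are placeholders.

```typescript
// Toy prefix cache keyed on the shared context. PrefillState and runPrefill
// are placeholders for the model's pre-fill (KV-cache) computation.

interface PrefillState {
  // Stand-in for the attention/KV state produced by processing the prompt.
  tokenCount: number;
}

function runPrefill(contextTokens: string[]): PrefillState {
  // Placeholder for the expensive forward pass over the whole context.
  return { tokenCount: contextTokens.length };
}

const prefillCache = new Map<string, PrefillState>();

function prefillWithCache(contextTokens: string[]): PrefillState {
  // Requests that reuse the same context (RAG over the same document, code
  // generation against the same file, chatbots with memory) produce the same
  // key, so the pre-fill work is only paid once per unique prefix.
  const key = contextTokens.join("\u0000");

  let state = prefillCache.get(key);
  if (state === undefined) {
    state = runPrefill(contextTokens);
    prefillCache.set(key, state);
  }
  return state;
}
```

A real implementation works at the level of model KV caches and partial prefix matches, but the lookup above captures why a repeated context makes the second request cheaper.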

How did you validate that quality is preserved through these optimizations?

Since this is an in-place update to an existing model, we were particularly cautious in ensuring that we would not break any existing applications with this update. We did extensive A/B testing through a blind arena with internal employees to validate the model quality, and we asked internal and external customers to test the new version of the model to ensure that response formats were compatible and model quality was acceptable. Our testing concluded that the model performed up to standards, with people being extremely excited about the speed of the model. Most LLMs are not perfectly deterministic even with the same set of inputs, but if you do notice something
