$85164.293495 USD

0.46%

ethereum

$1631.626805 USD

-0.06%

tether

$0.999902 USD

0.05%

xrp

$2.140262 USD

-0.29%

bnb

$585.593727 USD

-0.75%

solana

$129.553695 USD

-2.38%

usd-coin

$0.999953 USD

0.01%

tron

$0.252961 USD

-2.17%

dogecoin

$0.159379 USD

-3.88%

cardano

$0.637759 USD

-1.07%

unus-sed-leo

$9.434465 USD

0.10%

avalanche

$19.984115 USD

-0.50%

chainlink

$12.624915 USD

-1.61%

stellar

$0.241348 USD

0.09%

toncoin

$2.899684 USD

1.82%

暗号通貨のニュース記事

投機的なデコード、非同期バッチAPI、およびLORAサポートの拡大を労働者AIに導入する

2025/04/11 21:00

過去数四半期にわたって、私たちの労働者AIチームは、私たちのプラットフォームの品質を改善し、さまざまなルーティングの改善に取り組んでいます

Since the launch of Workers AI in September, our mission has been to make inference accessible to everyone. Over the last few quarters, our Workers AI team has been heads down on improving the quality of our platform, working on various routing improvements, GPU optimizations, and capacity management improvements. Managing a distributed inference platform is not a simple task, but distributed systems are also what we do best. You’ll notice a recurring theme from all these announcements that has always been part of the core Cloudflare ethos — we try to solve problems through clever engineering so that we are able to do more with less.

9月に労働者AIが発売されて以来、私たちの使命は、すべての人が推論にアクセスできるようにすることでした。過去数四半期にわたって、ワーカーAIチームは、プラットフォームの品質を向上させ、さまざまなルーティングの改善、GPUの最適化、能力管理の改善に取り組んでいます。分散型推論プラットフォームの管理は簡単なタスクではありませんが、分散システムも最善を尽くしています。これらすべての発表から、常にCore CloudFlare Ethosの一部であったこれらすべての発表から繰り返されるテーマに気付くでしょう。巧妙なエンジニアリングを通じて問題を解決しようとして、より少ないことでより多くのことをすることができます。

Today, we’re excited to introduce speculative decoding to bring you faster inference, an asynchronous batch API for large workloads, and expanded LoRA support for more customized responses. Lastly, we’ll be recapping some of our newly added models, updated pricing, and unveiling a new dashboard to round out the usability of the platform.

今日、推測を導入して、より高速な推論、大規模なワークロード用の非同期バッチAPI、およびよりカスタマイズされた応答のためにLORAサポートを拡張することを楽しみにしています。最後に、新しく追加されたモデルのいくつかを再採取し、価格設定を更新し、プラットフォームの使いやすさを締めくくるために新しいダッシュボードを発表します。

Speeding up inference by 2-4x with speculative decoding and more

投機的なデコードなどで2-4Xで推論を高速化する

We’re excited to be rolling out speed improvements to models in our catalog, starting with the Llama 3.3 70b model. These improvements include speculative decoding, prefix caching, an updated inference backend, and more. We’ve previously done a technical deep dive on speculative decoding and how we’re making Workers AI faster, which you can read about here. With these changes, we’ve been able to improve inference times by 2-4x, without any significant change to the quality of answers generated. We’re planning to incorporate these improvements into more models in the future as we release them. Today, we’re starting to roll out these changes so all Workers AI users of @cf/meta/llama-3.3-70b-instruct-fp8-fast will enjoy this automatic speed boost.

Llama 3.3 70bモデルから始めて、カタログのモデルの速度改善を展開できることを楽しみにしています。これらの改善には、投機的デコード、プレフィックスキャッシュ、更新された推論バックエンドなどが含まれます。私たちは以前、投機的なデコードと、労働者のaiをより速くする方法について技術的な深いダイビングを行いました。これについては、ここで読むことができます。これらの変更により、生成された回答の質に大きな変化はなく、推論時間を2〜4倍改善することができました。これらの改善を、将来、それらをリリースする際に、より多くのモデルに組み込むことを計画しています。今日、私たちはこれらの変更を展開し始めているので、 @CF/Meta/Llama-3.3-70B-Instruct-FP8-FASTのすべてのワーカーAIユーザーは、この自動速度のブーストを享受します。

What is speculative decoding?

投機的デコードとは何ですか？

The way LLMs work is by generating text by predicting the next token in a sentence given the previous tokens. Typically, an LLM is able to predict a single future token (n+1) with one forward pass through the model. These forward passes can be computationally expensive, since they need to work through all the parameters of a model to generate one token (e.g., 70 billion parameters for Llama 3.3 70b).

LLMSの仕組みは、以前のトークンを与えられた文で次のトークンを予測することにより、テキストを生成することです。通常、LLMは、モデルを1つのフォワードパスを使用して、単一の将来のトークン（n+1）を予測することができます。これらのフォワードパスは、モデルのすべてのパラメーターを使用して1つのトークンを生成する必要があるため、計算上高価です（例えば、Llama 3.3 70bの700億パラメーター）。

With speculative decoding, we put a small model (known as the draft model) in front of the original model that helps predict n+x future tokens. The draft model generates a subset of candidate tokens, and the original model just has to evaluate and confirm if they should be included in the generation. Evaluating tokens is less computationally expensive, as the model can evaluate multiple tokens concurrently in a forward pass. As such, inference times can be sped up by 2-4x — meaning that users can get responses much faster.

投機的デコードを使用すると、N+Xの将来のトークンを予測するのに役立つ元のモデルの前に小さなモデル（ドラフトモデルとして知られています）を元のモデルの前に置きます。ドラフトモデルは候補トークンのサブセットを生成し、元のモデルは、それらが生成に含まれるべきかどうかを評価して確認する必要があります。モデルは前方パスで複数のトークンを同時に評価できるため、トークンの評価は計算上の高価ではありません。そのため、推論時間は2〜4倍上昇できます。つまり、ユーザーはより速く応答を得ることができます。

What makes speculative decoding particularly efficient is that it’s able to use unused GPU compute left behind due to the GPU memory bottleneck LLMs create. Speculative decoding takes advantage of this unused compute by squeezing in a draft model to generate tokens faster. This means we’re able to improve the utilization of our GPUs by using them to their full extent without having parts of the GPU sit idle.

投機的デコードが特に効率的になっているのは、GPUメモリボトルネックLLMSの作成により、未使用のGPUコンピューティングを使用することができることです。投機的デコードは、ドラフトモデルを絞ることにより、この未使用の計算を利用してトークンをより速く生成します。これは、GPUの一部をアイドル状態にすることなく、GPUを最大限に使用することにより、GPUの利用を改善できることを意味します。

What is prefix caching?

プレフィックスキャッシュとは何ですか？

With LLMs, there are usually two stages of generation — the first is known as “pre-fill”, which processes the user’s input tokens such as the prompt and context. Prefix caching is aimed at reducing the pre-fill time of a request. As an example, if you were asking a model to generate code based on a given file, you might insert the whole file into the context window of a request. Then, if you want to make a second request to generate the next line of code, you might send us the whole file again in the second request. Prefix caching allows us to cache the pre-fill tokens so we don’t have to process the context twice. With the same example, we would only do the pre-fill stage once for both requests, rather than doing it per request. This method is especially useful for requests that reuse the same context, such as Retrieval Augmented Generation (RAG), code generation, chatbots with memory, and more. Skipping the pre-fill stage for similar requests means faster responses for our users and more efficient usage of resources.

LLMSを使用すると、一般に2つの生成段階があります。最初の段階は「Pre-Fill」として知られており、プロンプトやコンテキストなどのユーザーの入力トークンを処理します。プレフィックスキャッシングは、リクエストの事前充填時間を短縮することを目的としています。例として、特定のファイルに基づいてコードを生成するようにモデルに求めている場合は、リクエストのコンテキストウィンドウにファイル全体を挿入する場合があります。次に、次のコード行を生成するために2番目のリクエストを作成する場合は、2番目のリクエストでファイル全体を再度送信する場合があります。プレフィックスキャッシュを使用すると、事前に充填トークンをキャッシュできるため、コンテキストを2回処理する必要はありません。同じ例を使用すると、リクエストごとに行うのではなく、両方のリクエストに対して1回のみ装備段階を1回しか行いません。この方法は、検索拡張生成（RAG）、コード生成、メモリ付きチャットボットなど、同じコンテキストを再利用するリクエストに特に役立ちます。同様のリクエストのために事前に充填段階をスキップすると、ユーザーに対する応答が高速であり、リソースのより効率的な使用が得られます。

How did you validate that quality is preserved through these optimizations?

これらの最適化を通じて品質が保存されていることをどのように検証しましたか？

Since this is an in-place update to an existing model, we were particularly cautious in ensuring that we would not break any existing applications with this update. We did extensive A/B testing through a blind arena with internal employees to validate the model quality, and we asked internal and external customers to test the new version of the model to ensure that response formats were compatible and model quality was acceptable. Our testing concluded that the model performed up to standards, with people being extremely excited about the speed of the model. Most LLMs are not perfectly deterministic even with the same set of inputs, but if you do notice something

これは既存のモデルのインプレースアップデートであるため、このアップデートで既存のアプリケーションを破らないようにすることに特に慎重になりました。私たちは、モデルの品質を検証するために内部従業員との盲目のアリーナを通じて広範なA/Bテストを行い、内部および外部の顧客にモデルの新しいバージョンをテストして、応答形式が互換性があり、モデルの品質が許容可能であることを確認するように依頼しました。私たちのテストは、モデルが基準に合わせて実行され、人々はモデルの速度に非常に興奮していると結論付けました。ほとんどのLLMは、同じ入力セットがあっても完全に決定論的ではありませんが、何かに気付いた場合

免責事項:info@kdj.com

提供される情報は取引に関するアドバイスではありません。 kdj.com は、この記事で提供される情報に基づいて行われた投資に対して一切の責任を負いません。暗号通貨は変動性が高いため、十分な調査を行った上で慎重に投資することを強くお勧めします。

このウェブサイトで使用されているコンテンツが著作権を侵害していると思われる場合は、直ちに当社 (info@kdj.com) までご連絡ください。速やかに削除させていただきます。

2025年04月15日に掲載されたその他の記事

もっと