Cryptocurrency News Articles

Over-Tokenized Transformers: Reimagining Vocabulary Design for Efficient and Scalable Language Models

2025/01/31 01:43

Tokenization plays a fundamental role in the performance and scalability of large language models (LLMs). Despite being a critical component, its influence on model training and efficiency remains under-explored. While larger vocabularies can compress sequences and reduce computational costs, existing methods tie input and output vocabularies together, creating a trade-off in which scaling benefits larger models but harms smaller ones. This article introduces a framework called Over-Tokenized Transformers, which reimagines vocabulary design by decoupling input and output tokenization, unlocking new pathways for model efficiency and performance.

Tokenization, a fundamental aspect of language models, has largely remained unexplored in terms of its influence on model training efficiency and performance. While increasing vocabulary size can reduce sequence length and computational costs, existing approaches tie input and output vocabularies together, creating trade-offs where scaling benefits larger models but harms smaller ones. To address this, researchers introduce Over-Tokenized Transformers, a framework that reimagines vocabulary design by decoupling input and output tokenization, unlocking new pathways for model efficiency and performance.

Traditional tokenization methods use identical vocabularies for both input processing and output prediction. While larger vocabularies allow models to process longer n-gram tokens (e.g., multi-character sequences), they force smaller models to handle overly granular output predictions, increasing the risk of underfitting. For instance, a 3-gram tokenizer reduces sequence length by 66% but requires predicting three characters jointly—a task manageable for large models but overwhelming for smaller ones. Previous work like multi-token prediction (MTP) attempted to address this by predicting future tokens in parallel, but these methods still entangled input/output granularity and struggled with smaller architectures.
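
As a toy illustration of the sequence-length arithmetic above (this is not the paper's tokenizer, only the counting), grouping characters into fixed 3-grams shrinks the token count to roughly a third:

# Toy sketch: character-level tokens vs. fixed 3-gram grouping.
text = "the cat sat on the mat"

char_tokens = list(text)                                      # one token per character
tri_tokens = [text[i:i + 3] for i in range(0, len(text), 3)]  # one token per 3 characters

reduction = 1 - len(tri_tokens) / len(char_tokens)
print(len(char_tokens), len(tri_tokens), f"{reduction:.0%}")  # 22 8 64%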

The research team identified a critical insight through synthetic experiments with context-free grammars: input and output vocabularies influence models differently. Larger input vocabularies consistently improved all model sizes by enriching context representations through multi-gram embeddings. Conversely, larger output vocabularies introduced fine-grained prediction tasks that only benefited sufficiently large models. This dichotomy motivated their Over-Tokenized framework, which separates input encoding (Over-Encoding) and output decoding (Over-Decoding) vocabularies.

Over-Encoding (OE) scales input vocabularies exponentially using hierarchical n-gram embeddings. Instead of a single token ID, each input token is represented as the sum of 1-, 2-, and 3-gram embeddings. For example, the word “cat” might decompose into embeddings for “c,” “ca,” and “cat,” allowing the model to capture multi-scale contextual cues. To avoid impractical memory costs from large n-gram tables (e.g., 100k³ entries), the team used parameter-efficient techniques to keep the effective embedding tables small.
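
The excerpt does not spell out which parameter-efficient tricks are used, so the following is only a minimal sketch of the hierarchical n-gram input embedding, assuming PyTorch and a simple hash-modulo scheme to keep the 2-gram and 3-gram tables small; the table sizes and the hashing are illustrative assumptions, not the paper's exact parameterization.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OverEncodedEmbedding(nn.Module):
    """Sum of 1-, 2-, and 3-gram embeddings for each input position.

    The 2-/3-gram IDs are hashed into fixed-size tables to avoid a
    100k^2 / 100k^3 parameter blow-up (an assumption; the paper's actual
    parameter-efficient scheme may differ).
    """

    def __init__(self, vocab_size=100_000, dim=512, hashed_size=2_000_000):
        super().__init__()
        self.vocab_size = vocab_size
        self.hashed_size = hashed_size
        self.uni = nn.Embedding(vocab_size, dim)   # ordinary 1-gram table
        self.bi = nn.Embedding(hashed_size, dim)   # hashed 2-gram table
        self.tri = nn.Embedding(hashed_size, dim)  # hashed 3-gram table

    def forward(self, ids):                        # ids: (batch, seq) of token IDs
        prev1 = F.pad(ids, (1, 0))[:, :-1]         # token at position i-1 (0 at the start)
        prev2 = F.pad(ids, (2, 0))[:, :-2]         # token at position i-2

        # Fold (current token, previous tokens) into n-gram IDs, then hash
        # them into the smaller tables with a cheap modulo.
        bi_ids = (ids * self.vocab_size + prev1) % self.hashed_size
        tri_ids = (bi_ids * self.vocab_size + prev2) % self.hashed_size

        return self.uni(ids) + self.bi(bi_ids) + self.tri(tri_ids)

The extra n-gram lookups add embedding parameters but little compute per token, which is one reason the input side can be scaled aggressively even for small models.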

Over-Decoding (OD) approximates larger output vocabularies by predicting multiple future tokens sequentially, a refinement of earlier MTP methods. For instance, instead of predicting one token at a time, OD trains the model to predict the next two tokens conditioned on the first prediction. Crucially, OD is selectively applied—only larger models benefit from this granular supervision, while smaller ones retain single-token decoding to avoid underfitting.
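
As a rough sketch of the sequential two-token prediction idea, assuming PyTorch: the second head below conditions on the first prediction by re-embedding it and fusing it with the hidden state. That conditioning mechanism (and the teacher forcing during training) is an illustrative assumption, not necessarily the paper's exact formulation.

import torch
import torch.nn as nn

class SequentialTwoTokenHead(nn.Module):
    """Predict token t+1, then token t+2 conditioned on the t+1 prediction.

    Smaller models would keep only head_1 (plain next-token decoding),
    matching the selective application of Over-Decoding described above.
    """

    def __init__(self, dim=512, vocab_size=100_000):
        super().__init__()
        self.head_1 = nn.Linear(dim, vocab_size)    # logits for token t+1
        self.embed = nn.Embedding(vocab_size, dim)  # re-embed the t+1 prediction
        self.fuse = nn.Linear(2 * dim, dim)         # combine hidden state + prediction
        self.head_2 = nn.Linear(dim, vocab_size)    # logits for token t+2

    def forward(self, hidden, target_next=None):    # hidden: (batch, seq, dim)
        logits_1 = self.head_1(hidden)

        # Teacher forcing during training; otherwise condition on the
        # model's own argmax prediction.
        cond = target_next if target_next is not None else logits_1.argmax(dim=-1)
        fused = torch.tanh(self.fuse(torch.cat([hidden, self.embed(cond)], dim=-1)))
        logits_2 = self.head_2(fused)
        return logits_1, logits_2

Training would apply the usual cross-entropy loss to logits_1 against the next token and to logits_2 against the token after that, so the output supervision becomes finer-grained without changing the input vocabulary.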

The researchers performed experiments on the OLMo and OLMoE architectures, with three key findings summarized below.

On evaluations, the framework demonstrated consistent performance improvements across various model types. For dense models, a 151M Over-Encoded (OE) model achieved a 14% reduction in perplexity compared to its baseline. Similarly, in sparse Mixture-of-Experts (MoE) models, the OLMoE-1.3B with OE reduced validation loss by 0.12 points, although the gains were less pronounced as the benefits of sparse experts diluted the impact of embedding enhancements. Beyond synthetic experiments, real-world evaluations on large-scale datasets further validated these findings. Over-Encoded models consistently improved performance across multiple benchmarks, including MMLU-Var, Hellaswag, ARC-Challenge, ARC-Easy, and PIQA. Notably, the framework accelerated convergence, achieving a 5.7× speedup in training loss reduction. Additionally, downstream evaluations showed significant acceleration, with OE delivering speedups of 3.2× on MMLU-Var, 3.0× on Hellaswag, 2.6× on ARC-Challenge, 3.1× on ARC-Easy, and 3.9× on PIQA, highlighting its efficiency and effectiveness across diverse tasks.

In conclusion, this work redefines tokenization as a scalable dimension in language model design. By decoupling input and output vocabularies, Over-Tokenized Transformers break traditional trade-offs, enabling smaller models to benefit from compressed input sequences without grappling with overly complex prediction tasks. The log-linear relationship between input vocabulary size and performance suggests embedding parameters represent a new axis for scaling laws, complementing existing work on model depth and width. Practically, the framework offers a low-cost upgrade path for existing architectures—integrating Over-Encoding requires minimal code changes but yields immediate efficiency gains. Future research could explore hybrid tokenization strategies or dynamic vocabulary adaptation, further solidifying tokenization’s role in the next generation of efficient, high-performing LLMs.
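
Reading "performance" as training or validation loss, the reported log-linear trend can be written as a hedged scaling form, with a and b as fitted constants that are not given in this excerpt:

    loss(V_in) ≈ a - b · log(V_in)

That is, each doubling of the input vocabulary size V_in buys a roughly constant reduction in loss, which is what makes embedding size a candidate axis for scaling laws alongside depth and width.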

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com does not assume any responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile and it is highly recommended that you invest with caution after thorough research!

If you believe that the content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will delete it promptly.
