Over-Tokenized Transformers: Reimagining Vocabulary Design for Efficient and Scalable Language Models

2025/01/31 01:43

Tokenization, a fundamental aspect of language models, has largely remained unexplored in terms of its influence on model training efficiency and performance. While increasing vocabulary size can reduce sequence length and computational costs, existing approaches tie input and output vocabularies together, creating trade-offs where scaling benefits larger models but harms smaller ones. To address this, researchers introduce Over-Tokenized Transformers, a framework that reimagines vocabulary design by decoupling input and output tokenization, unlocking new pathways for model efficiency and performance.

Traditional tokenization methods use identical vocabularies for both input processing and output prediction. While larger vocabularies allow models to process longer n-gram tokens (e.g., multi-character sequences), they force smaller models to handle overly granular output predictions, increasing the risk of underfitting. For instance, a 3-gram tokenizer reduces sequence length by 66% but requires predicting three characters jointly—a task manageable for large models but overwhelming for smaller ones. Previous work like multi-token prediction (MTP) attempted to address this by predicting future tokens in parallel, but these methods still entangled input/output granularity and struggled with smaller architectures.
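
To make that trade-off concrete, here is a tiny plain-Python illustration (the sample text and alphabet size are hypothetical, chosen only for exposition): grouping characters into non-overlapping 3-grams cuts the sequence to roughly a third of its length, but the space of symbols the model must predict over grows cubically.

# Illustration only: character-level vs. non-overlapping 3-gram tokenization.
text = "the cat sat on the mat"                                  # 22 characters

char_tokens = list(text)                                          # 1-gram tokens
trigram_tokens = [text[i:i + 3] for i in range(0, len(text), 3)]  # 3-gram tokens

print(len(char_tokens))     # 22 tokens
print(len(trigram_tokens))  # 8 tokens, roughly a 66% shorter sequence

# The catch: with an alphabet of V symbols, the output space grows from V
# (single characters) to about V**3 (all possible 3-grams), which is what
# overwhelms smaller models at prediction time.
alphabet_size = 27          # 26 letters + space (hypothetical)
print(alphabet_size, alphabet_size ** 3)                          # 27 vs. 19683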

The research team identified a critical insight through synthetic experiments with context-free grammars: input and output vocabularies influence models differently. Larger input vocabularies consistently improved all model sizes by enriching context representations through multi-gram embeddings. Conversely, larger output vocabularies introduced fine-grained prediction tasks that only benefited sufficiently large models. This dichotomy motivated their Over-Tokenized framework, which separates input encoding (Over-Encoding) and output decoding (Over-Decoding) vocabularies.

Over-Encoding (OE) scales input vocabularies exponentially using hierarchical n-gram embeddings. Instead of a single token ID, each input token is represented as the sum of 1-, 2-, and 3-gram embeddings. For example, the word “cat” might decompose into embeddings for “c,” “ca,” and “cat,” allowing the model to capture multi-scale contextual cues. To avoid impractical memory costs from large n-gram tables (e.g., 100k³ entries), the team used parameter-efficient techniques.
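
The sketch below shows one way the hierarchical n-gram embedding sum could be implemented in PyTorch. It is a minimal illustration under my own assumptions (2- and 3-gram IDs are hashed into fixed-size tables as a stand-in for the paper's parameter-efficient techniques; the module name, table size, and hashing scheme are hypothetical), not the authors' implementation.

import torch
import torch.nn as nn

class OverEncodedEmbedding(nn.Module):
    # Sketch of Over-Encoding: each position is embedded as the sum of its
    # 1-, 2-, and 3-gram embeddings. The huge 2-/3-gram ID spaces are hashed
    # into fixed-size tables to keep memory practical (a stand-in for the
    # parameter-efficient techniques mentioned above).
    def __init__(self, vocab_size: int, d_model: int, table_size: int = 2 ** 20):
        super().__init__()
        self.vocab_size = vocab_size
        self.table_size = table_size
        self.emb_1 = nn.Embedding(vocab_size, d_model)  # ordinary token embedding
        self.emb_2 = nn.Embedding(table_size, d_model)  # hashed 2-gram embedding
        self.emb_3 = nn.Embedding(table_size, d_model)  # hashed 3-gram embedding

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer token IDs
        pad = torch.zeros_like(token_ids[:, :1])                  # pad so every position has a history
        prev1 = torch.cat([pad, token_ids[:, :-1]], dim=1)        # token at t-1
        prev2 = torch.cat([pad, pad, token_ids[:, :-2]], dim=1)   # token at t-2

        # Fold the current token and its history into 2-gram / 3-gram IDs,
        # then hash them into the fixed-size tables.
        ids_2 = (token_ids * self.vocab_size + prev1) % self.table_size
        ids_3 = (ids_2 * self.vocab_size + prev2) % self.table_size

        return self.emb_1(token_ids) + self.emb_2(ids_2) + self.emb_3(ids_3)

# Example usage with random token IDs (batch of 2, sequence length 16).
emb = OverEncodedEmbedding(vocab_size=100_000, d_model=256)
x = torch.randint(0, 100_000, (2, 16))
print(emb(x).shape)  # torch.Size([2, 16, 256])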

Over-Decoding (OD) approximates larger output vocabularies by predicting multiple future tokens sequentially, a refinement of earlier MTP methods. For instance, instead of predicting one token at a time, OD trains the model to predict the next two tokens conditioned on the first prediction. Crucially, OD is selectively applied—only larger models benefit from this granular supervision, while smaller ones retain single-token decoding to avoid underfitting.
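
Below is a rough PyTorch sketch of that sequential two-token prediction head. The module and its names are hypothetical, and the conditioning scheme (concatenating the hidden state with the embedding of the next token under teacher forcing) is one plausible reading of the description above, not the authors' code.

import torch
import torch.nn as nn

class OverDecodingHead(nn.Module):
    # Sketch of Over-Decoding: predict the next two tokens sequentially, with
    # the second prediction conditioned on the first. In the framework this
    # finer-grained supervision is applied only to sufficiently large models;
    # smaller ones keep ordinary single-token decoding.
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.head_1 = nn.Linear(d_model, vocab_size)      # logits for token t+1
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # embeds the conditioning token
        self.cond_proj = nn.Linear(2 * d_model, d_model)  # fuses hidden state with token t+1
        self.head_2 = nn.Linear(d_model, vocab_size)      # logits for token t+2

    def forward(self, hidden: torch.Tensor, next_tokens: torch.Tensor):
        # hidden:      (batch, seq_len, d_model) final transformer states
        # next_tokens: (batch, seq_len) ground-truth token t+1 (teacher forcing);
        #              at inference this would be sampled from logits_1 instead
        logits_1 = self.head_1(hidden)
        fused = torch.cat([hidden, self.tok_emb(next_tokens)], dim=-1)
        logits_2 = self.head_2(self.cond_proj(fused))     # conditioned on token t+1
        return logits_1, logits_2

During training, logits_1 and logits_2 would each receive a cross-entropy loss against tokens t+1 and t+2 respectively, while smaller models would simply drop head_2.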

The researchers performed experiments on the OLMo and OLMoE architectures and demonstrated three key findings, discussed below.

On evaluations, the framework demonstrated consistent performance improvements across various model types. For dense models, a 151M Over-Encoded (OE) model achieved a 14% reduction in perplexity compared to its baseline. Similarly, in sparse Mixture-of-Experts (MoE) models, the OLMoE-1.3B with OE reduced validation loss by 0.12 points, although the gains were less pronounced as the benefits of sparse experts diluted the impact of embedding enhancements. Beyond synthetic experiments, real-world evaluations on large-scale datasets further validated these findings. Over-Encoded models consistently improved performance across multiple benchmarks, including MMLU-Var, Hellaswag, ARC-Challenge, ARC-Easy, and PIQA. Notably, the framework accelerated convergence, achieving a 5.7× speedup in training loss reduction. Additionally, downstream evaluations showed significant acceleration, with OE delivering speedups of 3.2× on MMLU-Var, 3.0× on Hellaswag, 2.6× on ARC-Challenge, 3.1× on ARC-Easy, and 3.9× on PIQA, highlighting its efficiency and effectiveness across diverse tasks.

In conclusion, this work redefines tokenization as a scalable dimension in language model design. By decoupling input and output vocabularies, Over-Tokenized Transformers break traditional trade-offs, enabling smaller models to benefit from compressed input sequences without grappling with overly complex prediction tasks. The log-linear relationship between input vocabulary size and performance suggests embedding parameters represent a new axis for scaling laws, complementing existing work on model depth and width. Practically, the framework offers a low-cost upgrade path for existing architectures—integrating Over-Encoding requires minimal code changes but yields immediate efficiency gains. Future research could explore hybrid tokenization strategies or dynamic vocabulary adaptation, further solidifying tokenization’s role in the next generation of efficient, high-performing LLMs.
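
For concreteness, the log-linear relationship can be written in the following illustrative form (the functional shape is an assumption used here for exposition; a and b > 0 would be fitted constants, not values reported by the authors):

\mathcal{L}(V_{\mathrm{in}}) \approx a - b \log V_{\mathrm{in}}, \qquad b > 0

Under this form, each doubling of the input vocabulary size buys a roughly constant drop in loss, which is why embedding parameters behave like an additional scaling axis alongside depth and width.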

Check out the Paper. All credit for this research goes to the researchers of this project.
