
Over-Tokenized Transformer: Reimagining Vocabulary Design for Efficient and Scalable Language Models

2025/01/31 01:43

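Tokenization plays a fundamental role in the performance and scalability of large language models (LLMs). Although it is a critical component, its effect on model training and efficiency remains underexplored. Larger vocabularies can compress sequences and reduce computational cost, but existing approaches tie the input and output vocabularies together, creating a trade-off in which scaling helps larger models but hurts smaller ones. This paper introduces Over-Tokenized Transformers, a framework that reimagines vocabulary design by decoupling input and output tokenization, opening new paths to model efficiency and performance.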

Tokenization, a fundamental aspect of language models, has largely remained unexplored in terms of its influence on model training efficiency and performance. While increasing vocabulary size can reduce sequence length and computational costs, existing approaches tie input and output vocabularies together, creating trade-offs where scaling benefits larger models but harms smaller ones. To address this, researchers introduce Over-Tokenized Transformers, a framework that reimagines vocabulary design by decoupling input and output tokenization, unlocking new pathways for model efficiency and performance.

Traditional tokenization methods use identical vocabularies for both input processing and output prediction. While larger vocabularies allow models to process longer n-gram tokens (e.g., multi-character sequences), they force smaller models to handle overly granular output predictions, increasing the risk of underfitting. For instance, a 3-gram tokenizer reduces sequence length by 66% but requires predicting three characters jointly—a task manageable for large models but overwhelming for smaller ones. Previous work like multi-token prediction (MTP) attempted to address this by predicting future tokens in parallel, but these methods still entangled input/output granularity and struggled with smaller architectures.
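To make the trade-off concrete, the following sketch tokenizes a short string one character at a time and then in non-overlapping 3-character chunks. The sample text, the 27-symbol alphabet, and the helper name `ngram_tokenize` are assumptions chosen for illustration; they do not come from the paper.

```python
def ngram_tokenize(text: str, n: int) -> list[str]:
    """Split text into consecutive, non-overlapping chunks of n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]


text = "the cat sat on a red mat"  # hypothetical sample, 24 characters
chars = ngram_tokenize(text, 1)
trigrams = ngram_tokenize(text, 3)

# Fraction of sequence positions saved by moving from 1-grams to 3-grams.
reduction = 1 - len(trigrams) / len(chars)

# Hypothetical alphabet: 26 lowercase letters plus the space character.
base_vocab = 27
# Each 3-gram output is a joint choice over three characters.
joint_vocab = base_vocab ** 3

print(f"sequence length: {len(chars)} -> {len(trigrams)} ({reduction:.1%} shorter)")
print(f"output classes per step: {base_vocab} -> {joint_vocab}")
```

The sequence shrinks by two thirds, but every output step becomes a choice among 27³ classes instead of 27, which is the harder prediction task that the paragraph above says small models struggle with.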

The research team identified a critical insight through synthetic experiments with context-free grammars: input and output vocabularies influence models differently. Larger input vocabularies consistently improved all model sizes by enriching context representations through multi-gram embeddings. Conversely, larger output vocabularies introduced fine-grained prediction tasks that only benefited sufficiently large models. This dichotomy motivated their Over-Tokenized framework, which separates input encoding (Over-Encoding) and output decoding (Over-Decoding) vocabularies.

Over-Encoding (OE) scales input vocabularies exponentially using hierarchical n-gram embeddings. Instead of a single token ID lookup, each input token is represented as the sum of 1-, 2-, and 3-gram embeddings. For example, the word “cat” might decompose into embeddings for “c,” “ca,” and “cat,” allowing the model to capture multi-scale contextual cues. Because a full n-gram table would be impractically large (e.g., 100k³ entries for 3-grams over a 100k vocabulary), the team relied on parameter-efficient techniques to keep memory costs manageable.
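As a rough illustration of the idea, the sketch below sums 1-, 2-, and 3-gram embeddings for each position. The article does not say which parameter-efficient techniques were used, so the modulo hashing of n-gram IDs into fixed-size tables is an assumption made here, as are all names and sizes (`VOCAB`, `TABLE_ROWS`, `DIM`, `over_encode`). The random tables stand in for learned weights.

```python
import random

VOCAB = 1000       # hypothetical base vocabulary size
TABLE_ROWS = 4096  # hypothetical row count of each hashed n-gram table
DIM = 8            # hypothetical embedding width

_rng = random.Random(0)


def make_table(rows: int) -> list[list[float]]:
    """Build a randomly initialised embedding table (stand-in for learned weights)."""
    return [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(rows)]


unigram_table = make_table(VOCAB)
bigram_table = make_table(TABLE_ROWS)
trigram_table = make_table(TABLE_ROWS)


def over_encode(token_ids: list[int]) -> list[list[float]]:
    """Return one vector per position: the sum of its 1-, 2-, and 3-gram embeddings."""
    # Positions before the start of the sequence are padded with id 0
    # (a simplification: this shares an id with the real token 0).
    padded = [0, 0] + list(token_ids)
    output = []
    for i, tok in enumerate(token_ids):
        prev2, prev1 = padded[i], padded[i + 1]
        # An n-gram id is the base-VOCAB number formed by the last n tokens.
        bigram_id = prev1 * VOCAB + tok
        trigram_id = (prev2 * VOCAB + prev1) * VOCAB + tok
        rows = (
            unigram_table[tok],
            # Assumed trick: fold the huge id space into a small table by modulo.
            bigram_table[bigram_id % TABLE_ROWS],
            trigram_table[trigram_id % TABLE_ROWS],
        )
        output.append([sum(values) for values in zip(*rows)])
    return output


embeddings = over_encode([5, 17, 42, 7])
print(len(embeddings), len(embeddings[0]))
```

With this layout the memory cost is fixed by `TABLE_ROWS` rather than by `VOCAB ** 3`, at the price of occasional collisions between unrelated n-grams.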

Over-Decoding (OD) approximates larger output vocabularies by predicting multiple future tokens sequentially, a refinement of earlier MTP methods. For instance, instead of predicting one token at a time, OD trains the model to predict the next two tokens conditioned on the first prediction. Crucially, OD is selectively applied—only larger models benefit from this granular supervision, while smaller ones retain single-token decoding to avoid underfitting.
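The link between predicting two tokens in sequence and predicting from a larger output vocabulary can be sketched as follows. This is an illustration under stated assumptions, not the paper's training code: the helper names `next_two_targets` and `joint_log_prob`, the toy vocabulary, and the probability tables are all hypothetical.

```python
import math


def next_two_targets(token_ids: list[int], vocab_size: int) -> list[tuple[int, int, int]]:
    """For each position with two successors, return (first, second, joint_id).

    joint_id indexes the pair in a vocabulary of size vocab_size ** 2, which is
    the larger output vocabulary that two-step prediction approximates.
    """
    targets = []
    for i in range(len(token_ids) - 2):
        first, second = token_ids[i + 1], token_ids[i + 2]
        targets.append((first, second, first * vocab_size + second))
    return targets


def joint_log_prob(p_first: list[float],
                   p_second_given_first: list[list[float]],
                   first: int,
                   second: int) -> float:
    """Log-probability of the pair, factorised as p(first) * p(second | first)."""
    return math.log(p_first[first]) + math.log(p_second_given_first[first][second])


targets = next_two_targets([3, 1, 4, 1, 5], vocab_size=10)

# Toy two-token vocabulary with made-up probabilities.
p_first = [0.5, 0.5]
p_second_given_first = [[0.9, 0.1], [0.2, 0.8]]
pair_score = joint_log_prob(p_first, p_second_given_first, first=1, second=1)
print(targets, pair_score)
```

Scoring the pair through two small distributions avoids materialising a softmax over all `vocab_size ** 2` pairs while still supervising the model on the joint outcome.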

The researchers performed experiments on OLMo and OLMoE architectures and demonstrated three key findings:

In evaluations, the framework delivered consistent improvements across model types. For dense models, a 151M Over-Encoded (OE) model achieved a 14% reduction in perplexity compared to its baseline. Similarly, among sparse Mixture-of-Experts (MoE) models, OLMoE-1.3B with OE reduced validation loss by 0.12 points, although the gains were less pronounced because the benefits of sparse experts diluted the impact of the embedding enhancements. Beyond the synthetic experiments, evaluations on large-scale datasets further supported these findings. Over-Encoded models consistently improved performance across multiple benchmarks, including MMLU-Var, HellaSwag, ARC-Challenge, ARC-Easy, and PIQA. Notably, the framework accelerated convergence, reaching a 5.7× speedup in training loss reduction. Downstream evaluations showed similar acceleration, with OE delivering speedups of 3.2× on MMLU-Var, 3.0× on HellaSwag, 2.6× on ARC-Challenge, 3.1× on ARC-Easy, and 3.9× on PIQA.
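The dense result is reported as a perplexity reduction and the MoE result as a loss reduction, so the two are easier to compare on a common scale. The sketch below converts between them under the assumption, not stated in the article, that the loss is mean cross-entropy in nats and that perplexity equals exp(loss). The function names are hypothetical.

```python
import math


def loss_drop_from_ppl_reduction(ppl_reduction: float) -> float:
    """Loss decrease (nats) implied by a relative perplexity reduction."""
    return -math.log(1.0 - ppl_reduction)


def ppl_reduction_from_loss_drop(loss_drop: float) -> float:
    """Relative perplexity reduction implied by a loss decrease (nats)."""
    return 1.0 - math.exp(-loss_drop)


dense_loss_drop = loss_drop_from_ppl_reduction(0.14)   # 151M dense model
moe_ppl_reduction = ppl_reduction_from_loss_drop(0.12)  # OLMoE-1.3B

print(f"14% lower perplexity ~ {dense_loss_drop:.3f} nats lower loss")
print(f"0.12 nats lower loss ~ {moe_ppl_reduction:.1%} lower perplexity")
```

Under these assumptions the dense gain corresponds to roughly 0.15 nats and the MoE gain to roughly an 11% perplexity reduction, which is consistent with the article's remark that the MoE improvement was smaller.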

In conclusion, this work redefines tokenization as a scalable dimension in language model design. By decoupling input and output vocabularies, Over-Tokenized Transformers break traditional trade-offs, enabling smaller models to benefit from compressed input sequences without grappling with overly complex prediction tasks. The log-linear relationship between input vocabulary size and performance suggests embedding parameters represent a new axis for scaling laws, complementing existing work on model depth and width. Practically, the framework offers a low-cost upgrade path for existing architectures—integrating Over-Encoding requires minimal code changes but yields immediate efficiency gains. Future research could explore hybrid tokenization strategies or dynamic vocabulary adaptation, further solidifying tokenization’s role in the next generation of efficient, high-performing LLMs.

Check out the Paper. All credit for this research goes to the researchers of this project.
