Multi-Token Attention (MTA) Enables Efficient Retrieval of Contextual Information

2025/04/02 14:54

This article introduces Multi-Token Attention (MTA), an advanced attention mechanism that conditions attention weights on multiple query and key vectors simultaneously.

Large Language Models (LLMs) have benefited significantly from attention mechanisms, which enable the effective retrieval of contextual information. However, traditional attention methods rely primarily on single-token attention, where each attention weight is computed from a single pair of query and key vectors.

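For contrast, single-token attention can be sketched in a few lines of NumPy; every logit depends on exactly one (query, key) pair. Names and shapes here are illustrative, not taken from any particular implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_token_attention(Q, K, V):
    # Q: (n_q, d), K: (n_k, d), V: (n_k, d_v).
    # Each logits[i, j] depends only on the single pair (Q[i], K[j]),
    # which is exactly the limitation MTA targets.
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)       # (n_q, n_k)
    weights = softmax(logits, axis=-1)  # each row sums to 1
    return weights @ V                  # (n_q, d_v)
```
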
This design inherently constrains the model's ability to discern contexts that require the integration of multiple token signals, ultimately limiting its effectiveness on complex linguistic dependencies. For instance, identifying sentences that simultaneously contain both "Alice" and "rabbit" poses a challenge because conventional attention mechanisms struggle to combine multiple separate attention signals efficiently without substantially increasing model complexity.

To address this limitation, researchers from Meta AI have introduced Multi-Token Attention (MTA), an advanced attention mechanism that simultaneously conditions attention weights on multiple query and key vectors. MTA integrates convolution operations over queries, keys, and attention heads, thus enhancing the precision and efficiency of contextual information retrieval.

The MTA framework consists of two convolutional components:

1) key-query convolution, which aggregates multiple token signals within individual attention heads, and

2) head mixing convolution, which facilitates information sharing among different attention heads. MTA is implemented using group normalization with depth-dependent scaling to stabilize gradient flow, further improving model training stability and efficacy.

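A minimal sketch of these two ideas, under stated assumptions: head mixing is modeled here as a learned linear combination over the head axis of the logits, and the depth-dependent scale as a 1/√depth factor. The paper's exact kernel shapes and scaling schedule may differ:

```python
import numpy as np

def head_mixing(logits, W_mix):
    # logits: (H, n_q, n_k) attention logits for H heads.
    # W_mix: (H, H) learned mixing weights; each output head is a
    # linear combination of every input head's logits, letting
    # heads share what they have attended to.
    return np.einsum('gh,hqk->gqk', W_mix, logits)

def depth_scaled_group_norm(x, depth, eps=1e-5):
    # Normalize each head's logits to zero mean / unit variance,
    # then scale by an assumed 1/sqrt(depth) factor to keep
    # magnitudes stable as layers stack.
    mean = x.mean(axis=(-2, -1), keepdims=True)
    var = x.var(axis=(-2, -1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps) / np.sqrt(depth)
```

With `W_mix` set to the identity matrix, head mixing reduces to ordinary independent heads, which makes the baseline a special case of this sketch.
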
At a technical level, MTA modifies standard attention calculations by incorporating a two-dimensional convolution operation on the attention logits before softmax normalization. This convolution allows adjacent queries and keys to influence attention scores mutually, enabling the attention mechanism to identify contextual relationships more precisely. Consequently, the model efficiently aggregates local token interactions without significantly increasing the number of parameters or the dimensionality of attention vectors.

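The pre-softmax convolution can be illustrated with a toy 2-D kernel over the (query, key) plane; a real implementation would learn the kernel per head and apply causal masking, both omitted here for brevity:

```python
import numpy as np

def key_query_conv(logits, kernel):
    # logits: (n_q, n_k) pre-softmax attention scores.
    # kernel: (cq, ck) window; each output score becomes a weighted
    # sum of scores at neighbouring (query, key) positions, so
    # nearby tokens can jointly shape a single attention weight.
    cq, ck = kernel.shape
    pq, pk = cq // 2, ck // 2
    padded = np.pad(logits, ((pq, pq), (pk, pk)))
    out = np.empty_like(logits)
    for i in range(logits.shape[0]):
        for j in range(logits.shape[1]):
            out[i, j] = (padded[i:i + cq, j:j + ck] * kernel).sum()
    return out  # same shape as logits; feed this to softmax
```

With an identity kernel (a single 1 at the center) this reduces to standard attention; a wider kernel lets scores for nearby tokens such as "Alice" and "rabbit" reinforce one another before softmax.
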
MTA promotes effective knowledge transfer among attention heads, selectively amplifying relevant context signals while attenuating less pertinent information. These enhancements collectively yield a more robust attention mechanism capable of capturing complex multi-token interactions.

Empirical evaluations validate the efficacy of MTA across several natural language processing (NLP) benchmarks. In a structured motivating task explicitly designed to illustrate the shortcomings of single-token attention mechanisms, MTA demonstrated near-perfect performance, achieving an error rate of only 0.1% in tasks with 4 x 1024 token sequences. In contrast, standard Transformer models exhibited error rates greater than 50%.

Further large-scale experiments involved an 880M-parameter model trained on 105 billion tokens using MTA and baseline architectures. MTA achieved superior validation perplexity scores across diverse datasets such as arXiv, GitHub, and Wikipedia.

MTA outperformed standard Transformer models in tasks requiring extended context comprehension, such as the Needle-in-the-Haystack and BabiLong benchmarks. In the Needle-in-the-Haystack task with 4K token contexts containing multiple needles, MTA achieved accuracies ranging from 67% to 97.6%, surpassing standard models by substantial margins. These results highlight the potential of MTA for enabling LLMs to efficiently process very long-range dependencies.

In summary, Multi-Token Attention (MTA) presents a refined advancement in attention mechanisms by addressing fundamental limitations of traditional single-token attention. Leveraging convolutional operations to concurrently integrate multiple query-key interactions, MTA enhances the ability of language models to handle intricate contextual dependencies.

These methodological improvements facilitate more precise and efficient performance, particularly in scenarios involving complex token interactions and long-range contextual understanding. Through targeted modifications to standard attention mechanisms, MTA contributes meaningfully to the evolution of more sophisticated, accurate, and computationally efficient language models.
