Attention Transfer: Isolating the Role of Attention Mechanisms in Vision Transformers

2024/11/21 18:00

Vision Transformers (ViTs) have emerged as a powerful architecture in computer vision, thanks to their self-attention mechanisms that can effectively process image data. Unlike Convolutional Neural Networks (CNNs), which extract features using convolutional layers, ViTs break down images into smaller patches and treat them as individual tokens. This token-based approach enables scalable and efficient processing of large datasets, making ViTs particularly well-suited for high-dimensional tasks like image classification and object detection. The decoupling of how information flows between tokens from how features are extracted within tokens provides a flexible framework for tackling diverse computer vision challenges.
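
As a rough illustration of this patch-based tokenization, the sketch below splits an image into fixed-size patches and projects each one to an embedding vector. The module name and dimensions are illustrative defaults, not taken from the article.

```python
# Minimal sketch of ViT-style patch tokenization (illustrative dimensions).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution slices the image into non-overlapping patches
        # and linearly projects each patch to an embed_dim-dimensional token.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768) -- one token per patch

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```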

Despite their success, a key question that arises is whether pre-training is necessary for ViTs. It has been widely assumed that pre-training enhances downstream task performance by learning useful feature representations. However, recent research has begun to question whether these features are the sole contributors to performance improvements or whether other factors, such as attention patterns, might play a more significant role. This investigation challenges the traditional belief in the dominance of feature learning, suggesting that a deeper understanding of the mechanisms driving ViTs’ effectiveness could lead to more efficient training methodologies and improved performance.

Conventional approaches to utilizing pre-trained ViTs involve fine-tuning the entire model on specific downstream tasks. This process combines attention transfer and feature learning, making it difficult to isolate each contribution. While knowledge distillation frameworks have been employed to transfer logits or feature representations, they largely ignore the potential of attention patterns. The lack of focused analysis on attention mechanisms limits a comprehensive understanding of their role in improving downstream task outcomes. This gap highlights the need for methods to assess attention maps’ impact independently.
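
For contrast, here is a minimal sketch of the kind of logit-based knowledge distillation the article refers to: the student matches the teacher's softened class probabilities rather than its attention maps. The temperature and weighting values are illustrative assumptions.

```python
# Minimal sketch of conventional logit distillation, shown only to contrast
# with attention transfer; temperature T and weight alpha are illustrative.
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```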

Researchers from Carnegie Mellon University and FAIR have introduced a novel method called “Attention Transfer,” designed to isolate and transfer only the attention patterns from pre-trained ViTs. The proposed framework consists of two methods: Attention Copy and Attention Distillation. In Attention Copy, the pre-trained teacher ViT generates attention maps that are applied directly to a student model, while the student learns all other parameters from scratch. In contrast, Attention Distillation uses a distillation loss to train the student to align its attention maps with the teacher’s, so the teacher model is required only during training. Both methods separate intra-token computation from inter-token information flow, offering a fresh perspective on pre-training dynamics in ViTs.
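
A minimal sketch of the Attention Copy idea as described above: the frozen teacher's attention maps replace the student's own inside each block, so only the student's non-attention parameters (value/output projections, MLP) are learned. The single-head block structure and names are hypothetical simplifications, not the authors' code.

```python
# Simplified sketch of "Attention Copy": the student re-uses the frozen
# teacher's attention maps and learns everything else from scratch.
# Shapes and names are hypothetical; this is not the authors' implementation.
import torch
import torch.nn as nn

class AttentionCopyBlock(nn.Module):
    def __init__(self, dim=768, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.v_proj = nn.Linear(dim, dim)      # student value projection (learned)
        self.out_proj = nn.Linear(dim, dim)    # student output projection (learned)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, teacher_attn):
        # teacher_attn: (B, num_tokens, num_tokens), precomputed by the frozen
        # teacher and detached, so it dictates inter-token information flow.
        v = self.v_proj(self.norm1(x))
        x = x + self.out_proj(teacher_attn.detach() @ v)  # copy teacher's token routing
        x = x + self.mlp(self.norm2(x))                   # intra-token features learned from scratch
        return x
```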

Attention Copy transfers pre-trained attention maps to a student model, effectively guiding how tokens interact without retaining learned features. This setup requires both the teacher and student models during inference, which may add computational complexity. Attention Distillation, on the other hand, refines the student model’s attention maps through a loss function that compares them to the teacher’s patterns. After training, the teacher model is no longer needed, making this approach more practical. Both methods leverage the unique architecture of ViTs, where self-attention maps dictate inter-token relationships, allowing the student to focus on learning its own features from scratch.
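
The attention-distillation term could look roughly like the following: the student computes its own attention maps and is penalized for diverging from the teacher's. The loss form (KL divergence over attention rows) and the weighting are assumptions made for illustration; the paper's exact formulation may differ.

```python
# Rough sketch of an attention-distillation term: align the student's
# attention maps with the teacher's. KL over attention distributions is an
# illustrative choice, not necessarily the loss used in the paper.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    # student_attn, teacher_attn: (B, heads, N, N); each row is a softmax
    # distribution over the N tokens being attended to.
    log_s = torch.log(student_attn + eps)
    return F.kl_div(log_s, teacher_attn, reduction="batchmean")

# total_loss = task_loss + lambda_attn * sum(per-layer attention losses)
```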

The performance of these methods demonstrates the effectiveness of attention patterns in pre-trained ViTs. Attention Distillation achieved a top-1 accuracy of 85.7% on the ImageNet-1K dataset, equaling the performance of fully fine-tuned models. While slightly less effective, Attention Copy closed 77.8% of the gap between training from scratch and fine-tuning, reaching 85.1% accuracy. Furthermore, ensembling the student and teacher models enhanced accuracy to 86.3%, showcasing the complementary nature of their predictions. The study also revealed that transferring attention maps from task-specific fine-tuned teachers further improved accuracy, demonstrating the adaptability of attention mechanisms to specific downstream requirements. However, challenges arose under data distribution shifts, where attention transfer underperformed compared to weight tuning, highlighting limitations in generalization.
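
To make the “gap closed” figure concrete: if fine-tuning reaches 85.7% and Attention Copy reaches 85.1% while closing 77.8% of the gap to training from scratch, the implied from-scratch baseline works out to roughly 83%. This is an inference from the reported numbers, not a figure stated in the article.

```python
# Back-of-the-envelope check of the reported "77.8% of the gap closed".
# Assumes gap_closed = (copy - scratch) / (finetune - scratch); the implied
# from-scratch baseline is derived, not quoted from the article.
finetune, copy, gap_closed = 85.7, 85.1, 0.778
scratch = (copy - gap_closed * finetune) / (1 - gap_closed)
print(f"implied from-scratch accuracy ≈ {scratch:.1f}%")  # ≈ 83.0%
```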

This research illustrates that pre-trained attention patterns are sufficient for achieving high downstream task performance, questioning the necessity of traditional feature-centric pre-training paradigms. The proposed Attention Transfer method decouples attention mechanisms from feature learning, offering an alternative approach that reduces reliance on computationally intensive weight fine-tuning. While limitations such as distribution shift sensitivity and scalability across diverse tasks remain, this study opens new avenues for optimizing the use of ViTs in computer vision. Future work could address these challenges, refine attention transfer techniques, and explore their applicability to broader domains, paving the way for more efficient, effective machine learning models.

News source: www.marktechpost.com

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com does not assume any responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile and it is highly recommended that you invest with caution after thorough research!

If you believe that the content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will delete it promptly.
