A straightforward breakdown of “Attention is All You Need”¹
Aveek Goswami
Towards Data Science
The transformer came out in 2017. There have been many, many articles explaining how it works, but I often find them either going too deep into the math or too shallow on the details. I end up spending as much time googling (or chatGPT-ing) as I do reading, which isn’t the best approach to understanding a topic. That brought me to writing this article, where I attempt to explain the most revolutionary aspects of the transformer while keeping it succinct and simple for anyone to read.
This article assumes a general understanding of machine learning principles.
The ideas behind the Transformer led us to the era of Generative AI
Transformers represented a new architecture of sequence transduction models. A sequence model is a type of model that transforms an input sequence into an output sequence. The input sequence can be of various data types, such as characters, words, tokens, bytes, numbers or phonemes (in speech recognition), and may also be multimodal¹.
Before transformers, sequence models were largely based on recurrent neural networks (RNNs), long short-term memory (LSTM), gated recurrent units (GRUs) and convolutional neural networks (CNNs). They often contained some form of an attention mechanism to account for the context provided by items in various positions of a sequence.
The downsides of previous models
Recurrent models process tokens one position at a time, which rules out parallelisation within a training example and makes long-range dependencies hard to preserve, while convolution-based approaches need a number of operations that grows with the distance between two positions in order to relate them.
Hence, introducing the Transformer, which relies entirely on the attention mechanism and does away with recurrence and convolutions. Attention is what the model uses to focus on different parts of the input sequence at each step of generating an output. The Transformer was the first model to use attention without sequential processing, allowing for parallelisation and hence faster training without losing long-term dependencies. It also performs a constant number of operations between input positions, regardless of how far apart they are.
Walking through the Transformer model architecture
The important features of the transformer are: tokenisation, the embedding layer, the attention mechanism, the encoder and the decoder. Let’s imagine an input sequence in French, “Je suis étudiant”, and a target output sequence in English, “I am a student” (I am blatantly copying from this link, which explains the process very descriptively).
Tokenisation
The input sequence of words is converted into tokens, typically 3–4 characters long.
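To make this concrete, here is a minimal sketch of subword-style tokenisation. The vocabulary below is hypothetical and hand-picked for the example; real tokenisers (for instance byte-pair encoding) learn their subword units from the training corpus and will split text differently.

```python
# A minimal sketch of subword tokenisation (hypothetical vocabulary).
vocab = {"je", "suis", "etud", "iant"}

def tokenise(text: str) -> list[str]:
    """Greedily split each word into the longest subwords found in the vocabulary."""
    tokens = []
    for word in text.lower().split():
        while word:
            for i in range(len(word), 0, -1):
                if word[:i] in vocab:
                    tokens.append(word[:i])
                    word = word[i:]
                    break
            else:
                # No known subword matches: fall back to a single character.
                tokens.append(word[0])
                word = word[1:]
    return tokens

print(tokenise("Je suis etudiant"))  # ['je', 'suis', 'etud', 'iant']
```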
Embeddings
The input and output sequences are mapped to sequences of continuous representations, z, which are the input and output embeddings. Each token is represented by an embedding vector that captures some kind of meaning, which helps in computing its relationship to other tokens. To create these embeddings, we use the vocabulary of the training dataset, which contains every unique token used to train the model. We then choose an appropriate embedding dimension, which corresponds to the size of the vector representation of each token; higher embedding dimensions better capture more complex, diverse and intricate meanings and relationships. For a vocabulary of size V and embedding dimension D, the embedding matrix therefore has dimensions V x D, with each row being the D-dimensional vector for one token.
These embeddings are initialised randomly, and more accurate representations are learned during the training process as the embedding matrix is updated.
Positional encodings are added to these embeddings because the transformer does not have a built-in sense of the order of tokens.
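As a rough illustration, here is how a token embedding lookup plus the sinusoidal positional encoding from the original paper might look; the sizes and token ids are made up for the example, and a trained model would have learned the embedding matrix rather than drawn it at random.

```python
import numpy as np

V, D = 10_000, 512                            # illustrative vocabulary size and embedding dimension
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(V, D))    # V x D matrix, learned during training

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10_000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

token_ids = np.array([17, 42, 256, 891])        # hypothetical ids for the four tokens above
x = embedding_matrix[token_ids]                 # (4, D) token embeddings
x = x + positional_encoding(len(token_ids), D)  # inject information about token order
```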
Attention mechanism
Self-attention is the mechanism where each token in a sequence computes attention scores with every other token in the sequence, capturing relationships between all tokens regardless of their distance from each other. I’m going to avoid too much math in this article, but you can read up here about the different matrices formed to compute these attention scores.
These attention scores result in a new set of representations⁴ for each token, which is then used in the next layer of processing. During training, the weight matrices are updated through back-propagation, so the model can better account for relationships between tokens.
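For readers who want slightly more detail than the prose, here is a single-head, numpy-only sketch of scaled dot-product attention; the projection matrices are random stand-ins for the weights a real model would learn.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # new representation for each token

# Toy example: 4 tokens with model dimension 8. W_q, W_k, W_v stand in for learned weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                           # token embeddings (with positional encodings)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                                      # (4, 8)
```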
Multi-head attention is an extension of self-attention: several sets of attention scores are computed in parallel, the results are concatenated and transformed, and the resulting representation enhances the model’s ability to capture various complex relationships between tokens.
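Continuing the sketch above (and reusing scaled_dot_product_attention from it), multi-head attention can be pictured as running the same computation on separate slices of the projections and concatenating the results; the output projection W_o is another learned matrix, shown here only as a placeholder.

```python
def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=8):
    """Attend within each head's slice of the projections, then concatenate and project."""
    d_head = x.shape[-1] // num_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o       # back to (seq_len, d_model)
```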
Encoder
Input embeddings (built from the input sequence) with positional encodings added are fed into the encoder. The encoder consists of 6 layers, with each layer containing 2 sub-layers: multi-head attention and a feed-forward network. There is also a residual connection around each sub-layer, so the output of each sub-layer is LayerNorm(x + Sublayer(x)). The output of the encoder is a sequence of vectors which are contextualised representations of the inputs after accounting for the attention scores. These are then fed to the decoder.
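Putting the pieces together, one encoder layer might be sketched as below, reusing multi_head_attention from the previous snippet; the layer normalisation here omits the learned scale and bias parameters that a real implementation includes.

```python
def layer_norm(x, eps=1e-6):
    """Normalise each token's features to zero mean and unit variance (scale/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn_weights, ffn_weights, num_heads=8):
    """One encoder layer: LayerNorm(x + Sublayer(x)) around each of the two sub-layers."""
    x = layer_norm(x + multi_head_attention(x, *attn_weights, num_heads=num_heads))
    x = layer_norm(x + feed_forward(x, *ffn_weights))
    return x  # the full encoder stacks 6 of these layers
```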
Decoder
Output embeddings (generated from the target output sequence) with positional encodings are fed into the decoder. The decoder also contains 6 layers, and there are three sub-layers in each: masked multi-head attention over the decoder’s previous outputs, multi-head attention over the encoder’s output, and a feed-forward network, each again wrapped with a residual connection and layer normalisation. The masking ensures that the prediction for a given position can only depend on the tokens at earlier positions.
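A decoder layer differs from an encoder layer in its masked self-attention and in attending over the encoder output. Under the same simplifications as above (single-head attention, reusing layer_norm, feed_forward and scaled_dot_product_attention from the earlier sketches, with placeholder weight tuples), it could look like this.

```python
def masked_self_attention(y, W_q, W_k, W_v):
    """Self-attention where each position may only attend to itself and earlier positions."""
    Q, K, V = y @ W_q, y @ W_k, y @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)           # block attention to future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def cross_attention(y, enc_out, W_q, W_k, W_v):
    """Queries come from the decoder; keys and values come from the encoder output."""
    return scaled_dot_product_attention(y @ W_q, enc_out @ W_k, enc_out @ W_v)

def decoder_layer(y, enc_out, self_w, cross_w, ffn_w):
    """One decoder layer: masked self-attention, encoder-decoder attention, feed-forward."""
    y = layer_norm(y + masked_self_attention(y, *self_w))
    y = layer_norm(y + cross_attention(y, enc_out, *cross_w))
    y = layer_norm(y + feed_forward(y, *ffn_w))
    return y  # the full decoder stacks 6 of these, followed by a final linear layer and softmax
```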