A straightforward breakdown of “Attention is All You Need”¹
Aveek Goswami
Towards Data Science
The transformer came out in 2017. There have been many, many articles explaining how it works, but I often find them either going too deep into the math or too shallow on the details. I end up spending as much time googling (or chatGPT-ing) as I do reading, which isn’t the best approach to understanding a topic. That brought me to writing this article, where I attempt to explain the most revolutionary aspects of the transformer while keeping it succinct and simple for anyone to read.
This article assumes a general understanding of machine learning principles.
The ideas behind the Transformer led us to the era of Generative AI
Transformers represented a new architecture of sequence transduction models. A sequence model is a type of model that transforms an input sequence to an output sequence. This input sequence can be of various data types, such as characters, words, tokens, bytes, numbers, phonemes (speech recognition), and may also be multimodal¹.
Before transformers, sequence models were largely based on recurrent neural networks (RNNs), long short-term memory (LSTM), gated recurrent units (GRUs) and convolutional neural networks (CNNs). They often contained some form of an attention mechanism to account for the context provided by items in various positions of a sequence.
The downsides of previous models
These earlier architectures process tokens sequentially, which limits parallelisation during training and makes it harder to capture dependencies between positions that are far apart in the sequence.
Hence the Transformer, which relies entirely on the attention mechanism and does away with recurrence and convolutions. Attention is what the model uses to focus on different parts of the input sequence at each step of generating an output. The Transformer was the first model to use attention without sequential processing, allowing for parallelisation and hence faster training without losing long-term dependencies. It also performs a constant number of operations between input positions, regardless of how far apart they are.
Walking through the Transformer model architecture
The important features of the transformer are: tokenisation, the embedding layer, the attention mechanism, the encoder and the decoder. Let’s imagine an input sequence in French, “Je suis etudiant”, and a target output sequence in English, “I am a student” (I am blatantly copying from this link, which explains the process very descriptively).
Tokenisation
The input sequence of words is converted into tokens, typically 3–4 characters long.
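As a toy illustration (not the actual tokeniser used in practice; real models learn a subword vocabulary, for example with byte-pair encoding), the French sentence might be split into short chunks, with each chunk mapped to an integer id that the embedding layer can later look up. The chunks and vocabulary below are invented for the example.

# Toy sketch only: the subword pieces and vocabulary here are made up for illustration.
sentence = "Je suis etudiant"

# Hypothetical subword-style tokens, roughly 3-4 characters each.
tokens = ["Je", "suis", "etud", "iant"]

# Build a toy vocabulary: every unique token gets an integer id.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

print(tokens)     # ['Je', 'suis', 'etud', 'iant']
print(token_ids)  # integer ids that the embedding layer will look up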
Embeddings
The input and output sequences are mapped to sequences of continuous representations, z, which are the input and output embeddings. Each token is represented by an embedding, a vector that captures some kind of meaning and helps in computing the token’s relationship to other tokens. To create these embeddings, we use the vocabulary of the training dataset, which contains every unique token used to train the model. We then choose an appropriate embedding dimension, which corresponds to the size of the vector representation for each token; higher embedding dimensions can capture more complex, diverse and intricate meanings and relationships. For vocabulary size V and embedding dimension D, the embedding matrix therefore has dimensions V x D, with each row being the D-dimensional vector for one token.
These embeddings are typically initialised randomly, and more accurate embeddings are learned during training as the embedding matrix is updated.
Positional encodings are added to these embeddings because the transformer does not have a built-in sense of the order of tokens.
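As a minimal sketch in PyTorch (the sizes V = 10,000 and D = 512 are assumed here purely for illustration), the embedding lookup and the sinusoidal positional encoding scheme described in the paper might look like this:

import math
import torch
import torch.nn as nn

V, D = 10_000, 512                      # assumed vocabulary size and embedding dimension

# V x D embedding matrix, initialised randomly and updated during training.
embedding = nn.Embedding(V, D)

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encodings, following the scheme in the original paper."""
    position = torch.arange(seq_len).unsqueeze(1)                                  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

token_ids = torch.tensor([[3, 41, 7, 512, 9]])      # toy batch of token ids
x = embedding(token_ids)                             # (1, 5, D) token embeddings
x = x + positional_encoding(token_ids.size(1), D)    # inject information about token order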
Attention mechanism
Self-attention is the mechanism by which each token in a sequence computes attention scores with every other token in the sequence, to capture relationships between all tokens regardless of their distance from each other. I’m going to avoid too much math in this article, but you can read up here about the different matrices formed to compute attention scores and hence capture relationships between each token and every other token.
These attention scores produce a new set of representations⁴ for each token, which are then used in the next layer of processing. During training, the weight matrices are updated through back-propagation, so the model can better account for relationships between tokens.
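To make the idea concrete, here is a minimal single-head sketch of scaled dot-product self-attention in PyTorch. The dimensions and weight matrices below are toy assumptions, not the article’s code; in a real model the projection matrices Wq, Wk and Wv are learned parameters.

import math
import torch

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) token representations.
    Wq, Wk, Wv: learned projection matrices (updated by back-propagation).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values
    d_k = Q.size(-1)
    scores = Q @ K.T / math.sqrt(d_k)                 # (seq_len, seq_len) attention scores
    weights = torch.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ V                                # new representation for every token

# Toy usage with assumed sizes.
seq_len, d_model = 4, 8
X = torch.randn(seq_len, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                   # (4, 8): one contextualised vector per token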
Multi-head attention is just an extension of self-attention. Several sets of attention scores are computed in parallel, the results are concatenated and linearly transformed, and the resulting representation enhances the model’s ability to capture various complex relationships between tokens.
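PyTorch ships a ready-made multi-head attention module, which is enough to see the shapes involved; the sizes below (d_model = 512 with 8 heads, as in the original paper) are used only as an example.

import torch
import torch.nn as nn

d_model, num_heads = 512, 8                       # the paper's sizes, used here as an example
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(1, 5, d_model)                    # (batch, seq_len, d_model) token embeddings
# Self-attention: queries, keys and values all come from the same sequence.
out, attn_weights = mha(x, x, x)
print(out.shape)                                  # torch.Size([1, 5, 512]) -- one enhanced vector per token
print(attn_weights.shape)                         # torch.Size([1, 5, 5])   -- attention weights averaged over heads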
Encoder
Input embeddings (built from the input sequence) with positional encodings are fed into the encoder. The encoder is a stack of 6 layers, with each layer containing 2 sub-layers: multi-head attention and a feed-forward network. There is also a residual connection around each sub-layer, so the output of each sub-layer is LayerNorm(x + Sublayer(x)) as shown. The output of the encoder is a sequence of vectors which are contextualised representations of the inputs after accounting for attention scores. These are then fed to the decoder.
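A minimal PyTorch sketch of one encoder layer follows, assuming the paper’s sizes (d_model = 512, 8 heads, feed-forward width 2048) and omitting dropout; it illustrates the LayerNorm(x + Sublayer(x)) pattern rather than being the article’s implementation.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention and a feed-forward network,
    each wrapped in a residual connection followed by layer normalisation."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)          # self-attention sub-layer
        x = self.norm1(x + attn_out)              # LayerNorm(x + Sublayer(x))
        x = self.norm2(x + self.ff(x))            # feed-forward sub-layer
        return x

# The full encoder stacks 6 of these layers.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
out = encoder(torch.randn(1, 5, 512))             # contextualised representations of the inputs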
Decoder
Output embeddings (generated from the target output sequence) with positional encodings are fed into the decoder. The decoder also contains 6 layers, and each layer has three sub-layers: masked multi-head self-attention over the previously generated outputs, multi-head attention over the encoder output, and a feed-forward network. The masking ensures that predictions for a given position can only depend on positions before it.
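Continuing the same sketch, one decoder layer might look like the following; again the sizes are the paper’s defaults and the code is illustrative, not the article’s. The causal mask is what prevents a position from attending to later positions in the output sequence.

import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, attention over the encoder
    output, and a feed-forward network, each with a residual connection and
    layer normalisation (dropout omitted)."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, memory):
        # Causal mask: True marks positions each token is NOT allowed to attend to.
        seq_len = y.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

        self_out, _ = self.self_attn(y, y, y, attn_mask=mask)    # masked self-attention
        y = self.norm1(y + self_out)

        cross_out, _ = self.cross_attn(y, memory, memory)        # attend to encoder output
        y = self.norm2(y + cross_out)

        return self.norm3(y + self.ff(y))                        # feed-forward sub-layer

# Toy usage: 5 encoder output vectors ("memory") and 3 target-side tokens so far.
layer = DecoderLayer()
out = layer(torch.randn(1, 3, 512), torch.randn(1, 5, 512))      # (1, 3, 512)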