Market Cap: $3.5189T 1.180%
Volume(24h): $338.4236B 9.730%
  • Market Cap: $3.5189T 1.180%
  • Volume(24h): $338.4236B 9.730%
  • Fear & Greed Index:
  • Market Cap: $3.5189T 1.180%
Cryptos
Topics
Cryptospedia
News
CryptosTopics
Videos
Top News
Cryptos
Topics
Cryptospedia
News
CryptosTopics
Videos
bitcoin
bitcoin

$108064.256573 USD

2.62%

ethereum
ethereum

$3416.451426 USD

4.04%

xrp
xrp

$3.182014 USD

-0.61%

tether
tether

$0.998286 USD

-0.06%

solana
solana

$258.371362 USD

-5.60%

bnb
bnb

$703.182066 USD

-0.59%

dogecoin
dogecoin

$0.378176 USD

-4.38%

usd-coin
usd-coin

$1.000010 USD

-0.01%

cardano
cardano

$1.062758 USD

-0.47%

tron
tron

$0.239600 USD

-1.00%

chainlink
chainlink

$25.901897 USD

10.66%

avalanche
avalanche

$38.079479 USD

-2.52%

sui
sui

$4.720134 USD

-3.00%

stellar
stellar

$0.462876 USD

-3.68%

hedera
hedera

$0.354732 USD

0.20%

Cryptocurrency News Articles

Understanding Transformers

Jun 28, 2024 at 02:02 am

A straightforward breakdown of “Attention is All You Need”¹

Understanding Transformers

A straightforward breakdown of “Attention is All You Need”¹

Aveek Goswami

Follow

Towards Data Science

--

Listen

Share

The transformer came out in 2017. There have been many, many articles explaining how it works, but I often find them either going too deep into the math or too shallow on the details. I end up spending as much time googling (or chatGPT-ing) as I do reading, which isn’t the best approach to understanding a topic. That brought me to writing this article, where I attempt to explain the most revolutionary aspects of the transformer while keeping it succinct and simple for anyone to read.

This article assumes a general understanding of machine learning principles.

The ideas behind the Transformer led us to the era of Generative AI

Transformers represented a new architecture of sequence transduction models. A sequence model is a type of model that transforms an input sequence to an output sequence. This input sequence can be of various data types, such as characters, words, tokens, bytes, numbers, phonemes (speech recognition), and may also be multimodal¹.

Before transformers, sequence models were largely based on recurrent neural networks (RNNs), long short-term memory (LSTM), gated recurrent units (GRUs) and convolutional neural networks (CNNs). They often contained some form of an attention mechanism to account for the context provided by items in various positions of a sequence.

The downsides of previous models

Hence, introducing the Transformer, which relies entirely on the attention mechanism and does away with the recurrence and convolutions. Attention is what the model uses to focus on different parts of the input sequence at each step of generating an output. The Transformer was the first model to use attention without sequential processing, allowing for parallelisation and hence faster training without losing long-term dependencies. It also performs a constant number of operations between input positions, regardless of how far apart they are.

Walking through the Transformer model architecture

The important features of the transformer are: tokenisation, the embedding layer, the attention mechanism, the encoder and the decoder. Let’s imagine an input sequence in french: “Je suis etudiant” and a target output sequence in English “I am a student” (I am blatantly copying from this link, which explains the process very descriptively)

Tokenisation

The input sequence of words is converted into tokens of 3–4 characters long

Embeddings

The input and output sequence are mapped to a sequence of continuous representations, z, which represents the input and output embeddings. Each token will be represented by an embedding to capture some kind of meaning, which helps in computing its relationship to other tokens; this embedding will be represented as a vector. To create these embeddings, we use the vocabulary of the training dataset, which contains every unique output token that is being used to train the model. We then determine an appropriate embedding dimension, which corresponds to the size of the vector representation for each token; higher embedding dimensions will better capture more complex / diverse / intricate meanings and relationships. The dimensions of the embedding matrix, for vocabulary size V and embedding dimension D, hence becomes V x D, making it a high-dimensional vector.

At initialisation, these embeddings can be initialised randomly and more accurate embeddings are learned during the training process. The embedding matrix is then updated during training.

Positional encodings are added to these embeddings because the transformer does not have a built-in sense of the order of tokens.

Attention mechanism

Self-attention is the mechanism where each token in a sequence computes attention scores with every other token in a sequence to understand relationships between all tokens regardless of distance from each other. I’m going to avoid too much math in this article, but you can read up here about the different matrices formed to compute attention scores and hence capture relationships between each token and every other token.

These attention scores result in a new set of representations⁴ for each token which is then used in the next layer of processing. During training, the weight matrices are updated through back-propagation, so the model can better account for relationships between tokens.

Multi-head attention is just an extension of self-attention. Different attention scores are computed, the results are concatenated and transformed and the resulting representation enhances the model’s ability to capture various complex relationships between tokens.

Encoder

Input embeddings (built from the input sequence) with positional encodings are fed into the encoder. The input embeddings are 6 layers, with each layer containing 2 sub-layers: multi-head attention and feed forward networks. There is also a residual connection which leads to the output of each layer being LayerNorm(x+Sublayer(x)) as shown. The output of the encoder is a sequence of vectors which are contextualised representations of the inputs after accounting for attention scored. These are then fed to the decoder.

Decoder

Output embeddings (generated from the target output sequence) with positional encodings are fed into the decoder. The decoder also contains 6 layers, and there are

Disclaimer:info@kdj.com

The information provided is not trading advice. kdj.com does not assume any responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile and it is highly recommended that you invest with caution after thorough research!

If you believe that the content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will delete it promptly.

Other articles published on Jan 21, 2025