
Understanding Transformers

2024/06/28 02:02

A straightforward breakdown of “Attention is All You Need”¹

Aveek Goswami

Towards Data Science

The Transformer came out in 2017. There have been many, many articles explaining how it works, but I often find them either going too deep into the math or too shallow on the details. I end up spending as much time googling (or ChatGPT-ing) as I do reading, which isn’t the best way to understand a topic. That led me to write this article, where I attempt to explain the most revolutionary aspects of the Transformer while keeping it succinct and simple for anyone to read.

This article assumes a general understanding of machine learning principles.

The ideas behind the Transformer led us to the era of Generative AI

Transformers introduced a new architecture for sequence transduction models. A sequence transduction model transforms an input sequence into an output sequence. The input sequence can consist of various data types, such as characters, words, tokens, bytes, numbers or phonemes (as in speech recognition), and may also be multimodal¹.

Before transformers, sequence models were largely based on recurrent neural networks (RNNs), including long short-term memory networks (LSTMs) and gated recurrent units (GRUs), and on convolutional neural networks (CNNs). They often included some form of attention mechanism to account for the context provided by items at various positions in a sequence.

The downsides of previous models

Enter the Transformer, which relies entirely on the attention mechanism and does away with recurrence and convolutions. Attention is what the model uses to focus on different parts of the input sequence at each step of generating an output. Recurrent models process a sequence one position at a time, which prevents parallelisation within a training example and makes long-range dependencies harder to retain. The Transformer was the first sequence transduction model to rely on attention alone, without sequential processing, which allows parallelisation and hence faster training without losing long-term dependencies. It also needs only a constant number of operations to relate any two input positions, regardless of how far apart they are.

Walking through the Transformer model architecture

The main components of the Transformer are tokenisation, the embedding layer, the attention mechanism, the encoder and the decoder. Let’s imagine an input sequence in French, “Je suis étudiant”, and a target output sequence in English, “I am a student” (I am blatantly copying from this link, which explains the process very descriptively).

Tokenisation

The input sequence of words is split into tokens, typically sub-word pieces around 3–4 characters long, and each token is mapped to an integer ID from a fixed vocabulary.
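
To make the tokenisation step concrete, here is a minimal sketch of a greedy longest-match sub-word tokeniser. The vocabulary, the `<unk>` token and the function names are all made up for illustration; real tokenisers (byte-pair encoding, WordPiece and so on) learn their vocabulary from data and handle many more cases.

```python
# Toy vocabulary: sub-word pieces mapped to integer IDs (all values are made up).
VOCAB = {"<unk>": 0, "je": 1, "suis": 2, "étu": 3, "diant": 4}

def tokenise(text, vocab=VOCAB):
    """Greedy longest-match split of each lowercased word into known pieces."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, then shrink it.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    tokens.append(piece)
                    start = end
                    break
            else:
                # No known piece starts here: emit <unk> and skip one character.
                tokens.append("<unk>")
                start += 1
    return tokens

def encode(text, vocab=VOCAB):
    """Map each token to its integer ID."""
    return [vocab[token] for token in tokenise(text, vocab)]

tokens = tokenise("Je suis étudiant")
ids = encode("Je suis étudiant")
```

With this toy vocabulary the sentence splits into four pieces, and it is the list of integer IDs, not the raw text, that the rest of the model sees.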

Embeddings

The input and output sequences are each mapped to a sequence of continuous representations: the input and output embeddings. Each token is represented by an embedding vector that captures some of its meaning, which helps in computing its relationship to other tokens. To create these embeddings, we use the vocabulary of the training dataset, which contains every unique token used to train the model. We then choose an embedding dimension, which is the size of the vector representation for each token; higher embedding dimensions can capture more complex and intricate meanings and relationships, at the cost of more parameters. For vocabulary size V and embedding dimension D, the embedding matrix has dimensions V x D: one D-dimensional vector per token in the vocabulary.
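
As a rough illustration of the V x D embedding matrix, the sketch below builds a randomly initialised matrix and looks up one row per token ID. The sizes and token IDs are toy values chosen for readability (the original paper uses an embedding dimension of 512), and in a real model the matrix would be a trainable parameter rather than a plain list.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

V = 5  # vocabulary size (toy value)
D = 4  # embedding dimension (toy value)

# Embedding matrix of shape V x D, randomly initialised; training would update it.
embedding_matrix = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(V)]

def embed(token_ids, matrix=embedding_matrix):
    """Look up one D-dimensional row per token ID."""
    return [matrix[token_id] for token_id in token_ids]

token_ids = [1, 2, 3, 4]     # hypothetical IDs produced by a tokeniser
embedded = embed(token_ids)  # shape: len(token_ids) x D
```

The lookup is just row indexing, which is why an embedding layer is often described as a table of vectors.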

The embeddings are initialised randomly, and the embedding matrix is then updated during training so that more accurate embeddings are learned.

Positional encodings are added to these embeddings because the transformer does not have a built-in sense of the order of tokens.
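
The article does not say which positional encoding is used, so the sketch below assumes the fixed sinusoidal scheme from the original paper, where even dimensions use a sine and odd dimensions a cosine of the position scaled by a dimension-dependent frequency. The function names are illustrative only.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, one d_model-sized vector per position."""
    encodings = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encodings.append(row)
    return encodings

def add_positions(embeddings):
    """Add the encoding for position p to the embedding at position p."""
    d_model = len(embeddings[0])
    pe = positional_encoding(len(embeddings), d_model)
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(embeddings, pe)]

# Two identical toy embeddings become distinguishable once positions are added.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_positions = add_positions(same_token_twice)
```

Without this step, the same token appearing at two positions would produce exactly the same vector, and the model could not tell “dog bites man” from “man bites dog”.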

Attention mechanism

Self-attention is the mechanism by which each token in a sequence computes attention scores with every other token in the sequence, so the model can capture relationships between all tokens regardless of how far apart they are. I’m going to avoid too much math in this article, but you can read up here about the different matrices formed to compute attention scores.
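
For readers who do want a little of the math, here is a minimal sketch of scaled dot-product attention as defined in the original paper, softmax(QKᵀ / √d_k) V, written in plain Python. As a simplifying assumption, the toy example feeds the same vectors in as queries, keys and values; a real layer would first multiply the inputs by three learned weight matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per key: how relevant is that token to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # New representation: weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Self-attention on three toy token vectors (no learned projections here).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of `weights` says how much one token attends to every token in the sequence, and each row of `outputs` is the corresponding blended representation.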

These attention scores produce a new set of representations⁴ for each token, which are then used in the next layer of processing. During training, the weight matrices are updated through back-propagation, so the model can better account for relationships between tokens.

Multi-head attention is just an extension of self-attention. Several sets of attention scores are computed in parallel, one per head, each with its own learned weights; the results are concatenated and linearly transformed, and the resulting representation enhances the model’s ability to capture various complex relationships between tokens.
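
The split, attend, concatenate and transform flow can be sketched as follows. This is a deliberately simplified version: each head here works on a fixed slice of the input vector instead of its own learned projection, and an identity matrix stands in for the learned output projection. The function names and toy values are assumptions for illustration.

```python
import math

def attend(x):
    """Single-head self-attention over the rows of x (no learned projections)."""
    d_k = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d_k)])
    return out

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
            for row in a]

def multi_head_attention(x, num_heads, w_out):
    """Run attention per head on a slice of each vector, concatenate, then project."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Simplification: each head sees a fixed slice, not a learned projection.
        head_input = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_outputs.append(attend(head_input))
    # Concatenate the heads back into d_model-sized vectors.
    concatenated = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    return matmul(concatenated, w_out)

x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
# Identity matrix as a stand-in for the learned output projection.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
result = multi_head_attention(x, num_heads=2, w_out=identity)
```

Because each head computes its own attention weights, one head can focus on, say, nearby tokens while another tracks a distant subject, and the final projection mixes what they found.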

Encoder

Input embeddings (built from the input sequence) with positional encodings are fed into the encoder. The encoder is a stack of 6 layers, with each layer containing 2 sub-layers: multi-head attention and a feed-forward network. There is also a residual connection around each sub-layer, followed by layer normalisation, so the output of each sub-layer is LayerNorm(x + Sublayer(x)). The output of the encoder is a sequence of vectors which are contextualised representations of the inputs after accounting for attention scores. These are then fed to the decoder.
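
The LayerNorm(x + Sublayer(x)) pattern and the 6-layer stack can be sketched in a few lines. The two sub-layers below are placeholders with no learned weights, included only so the example runs, and the layer normalisation omits the learned gain and bias that a real implementation would have.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalise one token vector to zero mean and (roughly) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a sequence of token vectors."""
    sublayer_out = sublayer(x)
    return [layer_norm([a + b for a, b in zip(x_row, s_row)])
            for x_row, s_row in zip(x, sublayer_out)]

def encoder_layer(x, self_attention, feed_forward):
    """One encoder layer: attention sub-layer, then feed-forward sub-layer."""
    x = residual_block(x, self_attention)
    return residual_block(x, feed_forward)

def encoder(x, layers):
    """Pass the input through each (self_attention, feed_forward) pair in turn."""
    for self_attention, feed_forward in layers:
        x = encoder_layer(x, self_attention, feed_forward)
    return x

# Placeholder sub-layers so the sketch runs; real ones have learned weights.
def toy_attention(x):
    mean = [sum(col) / len(col) for col in zip(*x)]  # every token sees the average
    return [mean[:] for _ in x]

def toy_feed_forward(x):
    return [[max(0.0, v) for v in row] for row in x]  # ReLU only, no weights

inputs = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
encoded = encoder(inputs, [(toy_attention, toy_feed_forward)] * 6)
```

The residual connection lets each sub-layer learn a correction to its input rather than a full replacement, which is a large part of why stacks this deep train reliably.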

Decoder

Output embeddings (generated from the target output sequence) with positional encodings are fed into the decoder. The decoder also contains 6 layers, and there are
