## Attention Mechanisms and Self-Attention in Neural Networks
Attention mechanisms allow neural networks to f**ocus on specific parts of the input when generating each part of the output**. They assign different weights to different inputs, helping the model decide which inputs are most relevant to the task at hand. This is crucial in tasks like machine translation, where understanding the context of the entire sentence is necessary for accurate translation.
이를 위해 많은 레이어가 사용되므로 많은 학습 가능한 매개변수가 이 정보를 포착하게 됩니다.
{% endhint %}
### Understanding Attention Mechanisms
In traditional sequence-to-sequence models used for language translation, the model encodes an input sequence into a fixed-size context vector. However, this approach struggles with long sentences because the fixed-size context vector may not capture all necessary information. Attention mechanisms address this limitation by allowing the model to consider all input tokens when generating each output token.
#### Example: Machine Translation
Consider translating the German sentence "Kannst du mir helfen diesen Satz zu übersetzen" into English. A word-by-word translation would not produce a grammatically correct English sentence due to differences in grammatical structures between languages. An attention mechanism enables the model to focus on relevant parts of the input sentence when generating each word of the output sentence, leading to a more accurate and coherent translation.
### Introduction to Self-Attention
Self-attention, or intra-attention, is a mechanism where attention is applied within a single sequence to compute a representation of that sequence. It allows each token in the sequence to attend to all other tokens, helping the model capture dependencies between tokens regardless of their distance in the sequence.
#### Key Concepts
* **Tokens**: 입력 시퀀스의 개별 요소(예: 문장의 단어).
* **Embeddings**: 의미 정보를 포착하는 토큰의 벡터 표현.
* **Attention Weights**: 다른 토큰에 대한 각 토큰의 중요성을 결정하는 값.
### Calculating Attention Weights: A Step-by-Step Example
Let's consider the sentence **"Hello shiny sun!"** and represent each word with a 3-dimensional embedding:
* **Hello**: `[0.34, 0.22, 0.54]`
* **shiny**: `[0.53, 0.34, 0.98]`
* **sun**: `[0.29, 0.54, 0.93]`
Our goal is to compute the **context vector** for the word **"shiny"** using self-attention.
1.**Compute Attention Scores**: Use the dot product between the embedding of the target word and the embeddings of all words in the sequence.
2.**Normalize Scores to Get Attention Weights**: Apply the softmax function to the attention scores to obtain weights that sum to 1.
3.**Compute Context Vector**: Multiply each word's embedding by its attention weight and sum the results.
## Self-Attention with Trainable Weights
In practice, self-attention mechanisms use **trainable weights** to learn the best representations for queries, keys, and values. This involves introducing three weight matrices:
[https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb)에서 예제를 가져와서 우리가 이야기한 self-attendant 기능을 구현하는 이 클래스를 확인할 수 있습니다:
Code example from [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb):
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New
def forward(self, x):
b, num_tokens, d_in = x.shape
# b is the num of batches
# num_tokens is the number of tokens per batch
# d_in is the dimensions er token
keys = self.W_key(x) # This generates the keys of the tokens
queries = self.W_query(x)
values = self.W_value(x)
attn_scores = queries @ keys.transpose(1, 2) # Moves the third dimension to the second one and the second one to the third one to be able to multiply
attn_scores.masked_fill_( # New, _ ops are in-place
self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
attn_weights = torch.softmax(
attn_scores / keys.shape[-1]**0.5, dim=-1
)
attn_weights = self.dropout(attn_weights)
context_vec = attn_weights @ values
return context_vec
torch.manual_seed(123)
context_length = batch.shape[1]
d_in = 3
d_out = 2
ca = CausalAttention(d_in, d_out, context_length, 0.0)
이전 코드를 재사용하고 여러 번 실행하는 래퍼를 추가하는 것이 가능할 수 있지만, 이는 [https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01\_main-chapter-code/ch03.ipynb)에서 제공하는 더 최적화된 버전으로, 모든 헤드를 동시에 처리합니다(비용이 많이 드는 for 루프의 수를 줄임). 코드에서 볼 수 있듯이, 각 토큰의 차원은 헤드 수에 따라 서로 다른 차원으로 나뉩니다. 이렇게 하면 토큰이 8차원을 가지고 있고 3개의 헤드를 사용하고자 할 경우, 차원은 4차원의 2개의 배열로 나뉘며 각 헤드는 그 중 하나를 사용합니다:
For another compact and efficient implementation you could use the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch.
ChatGPT의 짧은 답변: 각 헤드가 모든 토큰의 모든 차원을 확인하는 대신 토큰의 차원을 헤드 간에 나누는 것이 더 나은 이유:
각 헤드가 모든 임베딩 차원을 처리할 수 있도록 하는 것이 유리해 보일 수 있지만, 표준 관행은 **임베딩 차원을 헤드 간에 나누는 것**입니다. 이 접근 방식은 계산 효율성과 모델 성능의 균형을 맞추고 각 헤드가 다양한 표현을 학습하도록 장려합니다. 따라서 임베딩 차원을 나누는 것이 일반적으로 각 헤드가 모든 차원을 확인하는 것보다 선호됩니다.