Self-Attention for Dummies

Hi to everyone new to AI! Let me tell you something cool. Long before ChatGPT was out there answering your random thoughts, we had older types of neural networks like RNNs (Recurrent Neural Networks), LSTMs (Long Short-Term Memory), and GRUs (Gated Recurrent Units). So what was the flaw? They forgot things. They would go word by word, trying to remember everything by passing information along from one step to the next, but they were bad at remembering things that happened much earlier in the sentence.

[Figure: RNN illustration]

Let's take this sentence:

"The cat sat on the mat. It is eating now."

You and I know that "it" refers to "the cat". Simple, right? But to older models like RNNs or LSTMs, this wasn't so obvious. By the time they reached the word "it", they had already half-forgotten what happened with the cat.

So What Was Missing?

The missing piece was attention. The model needed a way to pay attention to the right words so that it could understand the context better.

Self-attention is like having every word look at every other word in the sentence and ask:

"Hey, are you important to me?"

In our case, "cat" is very important to "it", while words like "on" and "mat" matter much less, though they are still related.

So How Does It Actually Work?

So let's dive a little into the math behind self-attention. Some of you may get bored here, but this is actually the juicy part.

First, each word (token) is turned into a vector, just like with any other vector embedding method you know of. Our goal is to decide which other tokens are important for understanding the current word, so we need a way for each token to evaluate the other tokens in the sequence. This is where Q, K, and V come in. Each word in the sentence is transformed into three vectors:

A Query vector representing what the token wants to know.
A Key vector representing what information a token offers.
A Value vector representing the actual content a token carries.

For example, the word "it" gets turned into:

Q_it = W_q × embedding(it)
K_it = W_k × embedding(it)
V_it = W_v × embedding(it)

Where embedding(it) is the word vector for "it", and W_q, W_k, and W_v are learned weight matrices.
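Here is a minimal sketch of those three projections in plain Python. The embedding and the weight matrices are made-up toy numbers (a real model learns W_q, W_k, and W_v during training), and matvec is just a hypothetical helper name for a matrix-times-vector product.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical 4-dimensional embedding for the token "it" (toy values).
embedding_it = [1.0, 0.0, 2.0, -1.0]

# Hypothetical 2x4 weight matrices: each maps a 4-d embedding to a 2-d vector.
# In a real model these are learned, not written by hand.
W_q = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0]]
W_k = [[0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
W_v = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]

Q_it = matvec(W_q, embedding_it)  # what "it" wants to know
K_it = matvec(W_k, embedding_it)  # what "it" offers to other tokens
V_it = matvec(W_v, embedding_it)  # the content "it" carries

print(Q_it, K_it, V_it)
```

The same three matrices are applied to every token in the sentence, so each word ends up with its own Query, Key, and Value.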

Why Q, K, V? Because this is how we compute relevance, or importance. Each token computes a dot product between its own Query and the Key of every token in the sentence (including its own). The result is called the attention score.

score(i, j) = qᵢ · kⱼ

This gives a raw similarity score for how relevant token j is to token i. If the two vectors point in similar directions (high dot product), token i treats token j as important. Otherwise, token j gets downweighted.
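To make the "similar directions" idea concrete, here is a small sketch with made-up 2-d vectors. The names q_it, k_cat, and k_on and their values are assumptions for illustration only, not numbers from a trained model.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q_it = [1.0, 2.0]     # hypothetical Query for "it"
k_cat = [2.0, 1.5]    # hypothetical Key for "cat": points in a similar direction
k_on = [-1.0, 0.25]   # hypothetical Key for "on": points somewhere else

score_cat = dot(q_it, k_cat)  # 1.0*2.0 + 2.0*1.5
score_on = dot(q_it, k_on)    # 1.0*(-1.0) + 2.0*0.25

print(score_cat, score_on)
```

With these toy numbers, "cat" gets a much higher raw score than "on", so "it" would end up paying more attention to "cat".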

So in our case, to know how much attention "it" should pay to "cat", we compute the dot product between the Query of "it" and the Key of "cat":

score = Q_it · K_cat^T (note the transpose on the Key vector: with Q and K written as row vectors, Q × K^T is a single number)

[Figure: Attention calculation]

This gives us the similarity between "it" and "cat".

Now for some post-processing. We divide the score by the square root of the vector size d (the dimension of the Key vectors) so that the dot products don't grow too large as d increases, and then pass the scores through a softmax function to get nice, clean attention weights:

attention_weights = softmax( (Q × K^T) / √d )

Thanks to the softmax function, each attention weight is always between 0 and 1, and the weights for a given word add up to 1. We compute these attention weights for every word with respect to every word in the sentence, which tells us how much attention each word deserves.
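The scaling and softmax steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production implementation: the query and key values are made up, and subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores for one query, passed through softmax."""
    d = len(query)  # vector size used for the 1/sqrt(d) scaling
    scaled = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scaled)

# Hypothetical Query for "it" and Keys for three tokens (toy values).
query_it = [1.0, 2.0]
keys = [[2.0, 1.5], [-1.0, 0.25], [0.5, 0.5]]

weights = attention_weights(query_it, keys)
print([round(w, 3) for w in weights])
print(sum(weights))  # adds up to 1, up to floating-point rounding
```

The token whose Key lines up best with the Query gets the largest share, and every other token still gets a small, non-zero weight.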

We were using the sentence "The cat sat on the mat. It is eating now."

So in our case, the attention weights for the word "it" could look something like this (made-up numbers for illustration, limited to a few words so that they add up to 1):

"The" gets 0.01 (very low importance)
"Cat" gets 0.32 (high importance)
"Sat" gets 0.19 (medium importance)
"On" gets 0.01 (very low importance)
"Mat" gets 0.12 (low importance)
"it" gets 0.35 (high importance; yes, we calculate this for the word itself too)

[Figure: Attention weights]

Now the word "it" knows which other words in the sentence are relevant for interpreting it (thanks to the attention weights). For the final step, it gathers everyone's Value vectors, multiplies each by its attention weight, and adds them all up:

output = ∑ (attention_weight × Value)

This final output vector becomes the updated, context-aware representation of the word "it". Because it is influenced by the entire sequence, the model's responses become much more context-aware. It enables models to understand longer sentences and complex relationships between words. This mechanism is the core of the Transformer model, the architecture behind BERT, GPT, and many other powerful language models today.
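Putting all the steps together, here is a rough end-to-end sketch of self-attention in plain Python. It assumes a single attention head with no masking or batching, and the embeddings and identity weight matrices are toy values chosen only to keep the example readable. The function and helper names (self_attention, matvec, dot, softmax) are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings, W_q, W_k, W_v):
    """Return one context-aware output vector per input embedding."""
    queries = [matvec(W_q, x) for x in embeddings]
    keys = [matvec(W_k, x) for x in embeddings]
    values = [matvec(W_v, x) for x in embeddings]
    d = len(keys[0])  # Key dimension, used for the 1/sqrt(d) scaling

    outputs = []
    for q in queries:
        # Attention weights of this token over every token (itself included).
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        # output = sum(attention_weight * Value), one dimension at a time.
        output = [sum(w * v[i] for w, v in zip(weights, values))
                  for i in range(len(values[0]))]
        outputs.append(output)
    return outputs

# Hypothetical 2-d embeddings for three tokens, with identity weight matrices.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

outputs = self_attention(embeddings, identity, identity, identity)
for vector in outputs:
    print([round(x, 3) for x in vector])
```

Each output is a weighted blend of all the Value vectors, so every token's new representation carries a bit of every other token. Real implementations do the same thing with matrix operations on whole sequences at once.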

If you want to dive deeper and can't wait, you should explore the Transformer architecture. Here is a wonderful resource

Outro for Dummies

If you found this explanation helpful, please share it with your friends and colleagues! If you have any questions or suggestions for future topics, feel free to ping me on Twitter / X or LinkedIn.
