Concept of Attention Sink for Dummies

Unlike most of my technical blogs, let's start this one with a story.

In 2023, researchers stumbled upon a strange phenomenon in transformer models. They were experimenting with sliding attention.

Sliding attention (also called sliding-window attention) is a variant of self-attention that speeds up transformers by letting each token look only at a small window of recent tokens instead of the whole sequence. This cuts the attention work from O(n²) to O(n × W), where n is the sequence length and W is the window size.
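To make the O(n²) versus O(n × W) difference concrete, here is a minimal sketch that builds the attention mask and counts how many query-key scores need computing. It assumes causal attention (each token sees only itself and earlier tokens), and the helper names `sliding_window_mask` and `count_scores` are made up for this post.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumes causal attention: position i sees itself and the window - 1
    positions immediately before it.
    """
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def count_scores(mask: list[list[bool]]) -> int:
    """Number of query-key scores that actually have to be computed."""
    return sum(sum(row) for row in mask)


n, window = 8, 3
full_causal = sliding_window_mask(n, window=n)  # window as long as the sequence
windowed = sliding_window_mask(n, window)

print(count_scores(full_causal))  # n * (n + 1) / 2 pairs, grows like n^2
print(count_scores(windowed))     # never more than n * window pairs

# Draw the windowed mask: "x" = allowed, "." = masked out.
for row in windowed:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed mask, the band of x marks slides to the right as you go down the rows, and the first column stops being visible after the first few rows. Keep that detail in mind for the rest of the story.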

But instead of improving performance, something bizarre happened. The model started predicting completely unrelated text. A sentence would begin normally, but somewhere in the middle, the model would veer off into strange territory. At first, this seemed like an ordinary training quirk. But as they dug deeper, a clear pattern emerged:

"Attention heads often focused heavily on the very first tokens in the sequence, sometimes just the BOS (beginning-of-sequence) marker or even a piece of punctuation, which were completely unrelated to the word."

These tokens at the beginning were semantically unrelated and ideally should not receive much attention. Yet the model kept assigning them unusually high attention scores, even when predicting tokens far away in the sequence.

This is when the researchers realized what was actually happening. With a sliding window, they kept moving the window forward to include newer words, but that also pushed out the starting tokens that had been receiving very high attention scores. The output became distorted because the model had learned to rely on those initial tokens at every step of the sequence.

These tokens at the beginning are called sink tokens, or attention sinks. They matter not because of what they say but because the model leans on them at every position, which is why dropping them from the window resulted in bad performance.

BUT WHY WAS THIS HAPPENING IN THE FIRST PLACE?

You now know why the model was performing badly, but why was it giving high attention scores to beginning tokens that were unrelated in every way? Because they act as the default location for any leftover attention after softmax. Let's understand this with an example:

Toy example without attention sinks

Let's say we have 5 tokens:

[<BOS>,  The,  cat,  sat,  quietly ]

And the model is trying to generate the next word. If the current token has no relevant past info, softmax still forces it to pick something.

Example:

Query from word "quietly":

Raw attention scores before softmax: [-0.1, -1.6, -1.2, -1.15, -1.25]
Softmax result: [0.45, 0.10, 0.15, 0.16, 0.14]

Notice something interesting? Even though the beginning-of-sequence token <BOS> is semantically useless at this point, it still receives around 45 percent of the attention. Why does this happen? Because it acts as the default location for any leftover attention. The softmax function must convert the logits into probabilities that sum to exactly 1, so a head cannot simply attend to nothing. Once the model has assigned the necessary probability mass to the truly relevant tokens, the remainder has to land somewhere, and the model learns to park it on <BOS>. The first tokens are the natural choice because, under causal attention, they are visible to every later token.

This is why the beginning tokens became attention sinks. If we remove <BOS> from the cache, that 45% of "wasted" attention has to go somewhere else, which changes the attention weights throughout the network and destabilizes predictions. That's why models collapse without sinks.
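Here is a small sketch of that redistribution using only the Python standard library. The logits are illustrative values picked so that the softmax lands close to the weights quoted above; they do not come from a real model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["<BOS>", "The", "cat", "sat", "quietly"]
# Illustrative logits (an assumption for this sketch, not real model output).
logits = [-0.1, -1.6, -1.2, -1.15, -1.25]

with_sink = softmax(logits)
without_sink = softmax(logits[1:])  # <BOS> has been evicted from the cache

print("with <BOS> in the cache:")
for tok, w in zip(tokens, with_sink):
    print(f"{tok:>8}: {w:.2f}")

print("after dropping <BOS>:")
for tok, w in zip(tokens[1:], without_sink):
    print(f"{tok:>8}: {w:.2f}")
```

Once <BOS> is gone, every remaining weight is scaled up by the same factor, roughly 1 / (1 - 0.45) ≈ 1.8. No token became more relevant, yet each one now contributes almost twice as much to the output, and that is a situation the model never saw during training.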

How did they deal with this?

Researchers realized that this leftover-attention behavior could be used intentionally instead of being treated as a quirk. The idea is to give the model a small set of tokens that always remain in the attention window, so it can safely put unused attention there without interfering with the rest of the context. The beginning tokens are always present and are never evicted from the cache, no matter where the attention window currently sits. Keeping the first few tokens (for example, <BOS> plus three more) in memory forever, no matter how long the conversation gets, lets models keep performing well even with sliding attention.
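The cache policy described above can be sketched in a few lines. This is a simplified illustration, not any library's actual implementation: the function name `kept_positions` is made up, and it assumes `window` counts only the recent tokens, with the sinks kept on top of that.

```python
def kept_positions(seq_len: int, window: int, n_sink: int = 4) -> list[int]:
    """Positions whose keys/values stay in the cache.

    Keeps the first `n_sink` positions forever, plus the most recent
    `window` positions. With n_sink=0 this is a plain sliding window.
    """
    sinks = list(range(min(n_sink, seq_len)))
    # Recent tokens start after the sinks so no position is listed twice.
    recent_start = max(len(sinks), seq_len - window)
    return sinks + list(range(recent_start, seq_len))


print(kept_positions(seq_len=6, window=8))             # short text: nothing evicted yet
print(kept_positions(seq_len=20, window=8))            # 4 sinks + the last 8 positions
print(kept_positions(seq_len=20, window=8, n_sink=0))  # plain sliding window: sinks are gone
```

The cache never grows beyond `n_sink + window` entries, so memory stays bounded however long the conversation runs, while the tokens the model uses as sinks are still there.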

Nowadays, newer models that use sliding attention can be trained with sink tokens in mind from the start: the model is pretrained with a dedicated <SINK> token placed at the very beginning of every sequence:

[ <SINK>, actual_text_tokens... ]

In this case you only need one sink, because the model learns from the start to funnel all unused attention into it. Earlier models, which were trained without an explicit sink token, implicitly learned to spread this leftover attention across several of the earliest tokens. That's why StreamingLLM keeps four sinks: it matches the model's existing habit.

The OpenAI way

Instead of using a special token, OpenAI's method gives each attention head one trainable scalar that takes part in the softmax as an extra logit. The usual scaled dot-product scores stay as they are:

$s_{ij} = \frac{Q_i \cdot K_j}{\sqrt{d}}$

but the attention weight that query i puts on key j becomes

$a_{ij} = \frac{\exp(s_{ij})}{\exp(\alpha_h) + \sum_{k} \exp(s_{ik})}$

Here $\alpha_h$ is the learned sink scalar for head h.

So we skip adding a special token and instead introduce a learnable scalar α for each head. Note that the scalar cannot simply be added to every attention logit: softmax does not change when all of its inputs shift by the same constant. Instead, α adds one extra term to the softmax denominator. Because it competes with the real tokens before normalization, the weights over the real tokens can sum to less than 1, and whatever mass lands on the sink is simply discarded. The head gets a place to put leftover attention without adding a new token and without dumping that attention on unrelated words.
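A minimal sketch of how a per-head sink scalar can work, assuming it is treated as one extra logit in the softmax whose weight is thrown away afterwards. The function name `softmax_with_sink` and all numbers are illustrative assumptions; a real model would learn the scalar and do this per head on tensors.

```python
import math


def softmax_with_sink(scores: list[float], sink_logit: float) -> list[float]:
    """Softmax over `scores` with one extra logit that absorbs probability mass.

    The sink's own weight is never returned, so the weights over the real
    tokens sum to less than 1.
    """
    m = max(max(scores), sink_logit)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)  # the sink only enters the denominator
    return [e / denom for e in exps]


# Toy scores for four real tokens; there is no <BOS> in the window at all.
scores = [-1.6, -1.2, -1.15, -1.25]
weights = softmax_with_sink(scores, sink_logit=-0.1)

print([round(w, 2) for w in weights])  # weights on the real tokens
print(round(1.0 - sum(weights), 2))    # mass absorbed by the sink
```

The sink here plays the role <BOS> played in the toy example: it soaks up the leftover attention, but it lives in the head's parameters rather than in the cache, so a sliding window can never evict it.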

Both these approaches successfully solve the fundamental problem: giving attention somewhere to go when it has nothing meaningful to attend to.

Outro for Dummies

If you found this explanation helpful, please share it with your friends and colleagues! If you have any questions or suggestions for future topics, feel free to ping me on Twitter / X or LinkedIn.

Diary of a Developer