The Geometry of Attention

A geometric intuition for why attention mechanisms work so well across domains—from NLP to vision to audio.

When we talk about attention mechanisms in transformers, we often describe them algorithmically: queries, keys, values, softmax. But there’s a deeper geometric intuition that makes the whole thing click. It’s worth remembering that attention predates transformers: the original mechanism was introduced by Bahdanau et al. (2014) for neural machine translation.

Consider what’s actually happening in the attention operation. Each token projects itself into three spaces—as a query asking “what should I attend to?”, as a key advertising “here’s what I contain”, and as a value saying “here’s what I contribute.”
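As a minimal sketch (the dimensions and variable names here are illustrative, not from the original post), these three projections are just learned linear maps applied to the same token embeddings:

import torch
import torch.nn as nn

d_model, d_k = 64, 64                      # illustrative sizes
W_q = nn.Linear(d_model, d_k, bias=False)  # query: "what should I attend to?"
W_k = nn.Linear(d_model, d_k, bias=False)  # key:   "here's what I contain"
W_v = nn.Linear(d_model, d_k, bias=False)  # value: "here's what I contribute"

x = torch.randn(1, 10, d_model)            # a batch with 10 token embeddings
Q, K, V = W_q(x), W_k(x), W_v(x)           # same tokens, three different spaces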

The Softmax as a Lens

The softmax function isn’t just a normalization trick—it’s a focusing mechanism that turns raw similarity scores into a probability distribution over positions. Softmax is only one choice here; recent work explores alternatives such as sparse attention and linear attention. From an information-theoretic perspective, the entropy of this distribution tells us how selectively a token gathers information: low entropy means it concentrates on a few positions, high entropy means it averages broadly.
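To make that concrete, here is a small sketch (the score values are made up for illustration) that applies softmax to one query’s similarity scores and measures the entropy of the resulting distribution:

import torch
import torch.nn.functional as F

# Illustrative (made-up) similarity scores for one query against five positions.
scores = torch.tensor([2.0, 0.5, -1.0, 0.1, 3.0])
weights = F.softmax(scores, dim=-1)          # a probability distribution over positions
entropy = -(weights * weights.log()).sum()   # low entropy = focused, high entropy = diffuse
print(weights, entropy)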

When attention is sharp, we’re essentially doing content-based addressing—looking up specific information. When it’s diffuse, we’re aggregating across contexts. This has implications for how we think about model capacity and the flow of information through layers: multi-head attention lets the model jointly attend to information from different representation subspaces at different positions, so each head can strike its own balance between lookup and aggregation.
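The sharp-versus-diffuse distinction can be seen in a toy example (illustrative values, not from the post): a query that strongly matches one key behaves like a lookup, while a query with only weak matches returns a blend of values.

import torch
import torch.nn.functional as F

d_k = 4
K = torch.eye(d_k)                     # four keys, one per "slot"
V = torch.arange(16.0).reshape(4, 4)   # four values to retrieve

sharp_q = 10.0 * K[2]                  # strongly matches key 2 -> near-pure lookup
diffuse_q = torch.ones(d_k)            # matches all keys weakly -> aggregation

for q in (sharp_q, diffuse_q):
    w = F.softmax(q @ K.T / d_k**0.5, dim=-1)
    print(w, w @ V)                    # sharp: ~V[2]; diffuse: roughly the mean of V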

Mathematical Formulation

The attention operation can be written as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices respectively, and $d_k$ is the dimension of the keys. The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large, which would push softmax into regions with extremely small gradients.
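A quick sanity check of the scaling argument (an illustrative sketch using random vectors): for entries with unit variance, the dot product of two $d_k$-dimensional vectors has standard deviation on the order of $\sqrt{d_k}$, so dividing by $\sqrt{d_k}$ keeps the scores in a range where softmax still has usable gradients.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)

raw = (q * k).sum(dim=-1)            # unscaled dot products of paired vectors
print(raw.std())                     # roughly sqrt(d_k) ~ 22.6
print((raw / d_k**0.5).std())        # roughly 1 after scaling

# With unscaled scores this large, softmax saturates: nearly all the
# probability mass lands on one position and the gradients vanish.
print(F.softmax(torch.tensor([22.6, 0.0, -22.6]), dim=-1))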

Why This Matters

Understanding attention geometrically helps explain why transformers generalize so well across domains. The same mechanism that finds relevant words in a sentence can find relevant patches in an image or relevant timesteps in audio. This universality is why the transformer architecture has become the backbone of modern AI, from GPT to DALL-E to Whisper.

The key insight is that attention is fundamentally about learned routing—dynamically deciding where information should flow based on content, not just position.

Here’s a simple Python example:

import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Normalize each row into a probability distribution over positions
    attn_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values: the information routed to each query
    return torch.matmul(attn_weights, V)
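For example (the shapes below are arbitrary, chosen just to show a call):

Q = torch.randn(2, 10, 64)   # (batch, sequence length, d_k)
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)             # torch.Size([2, 10, 64])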

This handful of lines is the core operation at the heart of every transformer layer.
