aiagentrank.io
🏗️ Architecture
also: attention mechanism, self-attention, attention layer

Attention mechanism

The neural-network primitive that lets a transformer model weigh the importance of every input token when generating each output token — the core innovation behind LLMs.

The attention mechanism predates transformers: it first appeared in neural machine translation work around 2014. The 2017 paper "Attention Is All You Need" built an entire architecture, the transformer, on attention alone, and that architecture made modern LLMs possible. For each output token, the model computes a weighted sum over the representations of all input tokens, and the weights (the "attention") determine how much each input contributes. This lets the model focus on relevant context regardless of position.
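The weighted sum described above can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention for a single query, not how any production model implements it: the function names `softmax` and `attend` and the toy vectors are made up for this example, and real models derive the query, key, and value vectors from learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns (output, weights): the weighted sum of `values` and the
    attention weight assigned to each input position.
    """
    scale = math.sqrt(len(query))
    # One score per input token: how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy data (invented for illustration): three input tokens, 2-dimensional vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 1.0]

output, weights = attend(query, keys, values)
print(weights)  # the third token matches the query best, so its weight is largest
print(output)
```

The position of a token never enters the calculation: only the match between the query and each key decides the weight, which is why relevant context can be picked up from anywhere in the input.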

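Self-attention, where each token attends to other tokens in the same sequence (in decoder-only LLMs, to itself and the tokens before it), is how transformers combine information across a context, and it underpins much of their apparent reasoning ability. The variants in common use, such as FlashAttention, PagedAttention, and Ring Attention, are all optimizations of the same core operation: they change how it is computed, stored, or distributed across devices to make it faster or more memory-efficient at long contexts.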
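To see how an optimization can leave the core operation unchanged, here is a rough sketch of the running-softmax idea that FlashAttention-style kernels rely on: keys and values are processed in blocks, and earlier partial results are rescaled as new scores arrive, so the full list of scores is never held at once. This is a plain-Python illustration only; the function names and `block_size` parameter are assumptions for this example, and real kernels do this on GPU tiles with many further details.

```python
import math

def naive_attention(query, keys, values):
    """Reference version: score every key, softmax once, take the weighted sum."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[i] for e, v in zip(exps, values)) / total for i in range(dim)]

def blockwise_attention(query, keys, values, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = math.sqrt(len(query))
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator seen so far
    acc = [0.0] * dim            # unnormalized weighted sum seen so far
    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        block_values = values[start:start + block_size]
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in block_keys]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data (invented for illustration): five tokens, 2-dimensional vectors.
keys = [[0.2, 1.0], [1.5, -0.3], [0.0, 0.7], [-1.0, 0.4], [2.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
query = [0.9, -0.4]

reference = naive_attention(query, keys, values)
blocked = blockwise_attention(query, keys, values, block_size=2)
```

Both functions return the same vector up to floating-point rounding, which is the sense in which such variants are optimizations rather than different mechanisms.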

Agent builders rarely touch attention directly. But understanding it helps with practical decisions: why long contexts get expensive (attention cost grows quadratically with sequence length), why the KV cache exists (it stores the keys and values of past tokens so they are not recomputed at every generation step), and why models can "lose the middle" on very long inputs (one common explanation is that attention is spread thin across many positions).
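The KV cache point can be made concrete with a small sketch. Everything here is a simplified assumption: `project` is a hypothetical stand-in for a model's learned key and value projections, the token IDs are arbitrary, and a single fixed query is reused at every step (a real model computes a fresh query per step). The sketch counts how many projections each decoding strategy performs.

```python
import math

def attend(query, keys, values):
    """Softmax-weighted sum of `values`, scored by query-key dot products."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[i] for e, v in zip(exps, values)) / total
            for i in range(len(values[0]))]

def project(token):
    """Hypothetical stand-in for learned key/value projections; counts its calls."""
    project.calls += 1
    key = [math.sin(token), math.cos(token)]
    value = [float(token), 1.0]
    return key, value

project.calls = 0

def decode_without_cache(tokens, query):
    """Recompute the key and value of every past token at each step."""
    outputs = []
    for t in range(1, len(tokens) + 1):
        pairs = [project(tok) for tok in tokens[:t]]
        outputs.append(attend(query, [k for k, _ in pairs], [v for _, v in pairs]))
    return outputs

def decode_with_cache(tokens, query):
    """Project only the newest token and reuse the cached keys and values."""
    cached_keys, cached_values, outputs = [], [], []
    for tok in tokens:
        key, value = project(tok)
        cached_keys.append(key)
        cached_values.append(value)
        outputs.append(attend(query, cached_keys, cached_values))
    return outputs

tokens = [3, 1, 4, 1, 5, 9]
query = [0.5, -0.25]

project.calls = 0
slow = decode_without_cache(tokens, query)
calls_without_cache = project.calls  # 1 + 2 + ... + 6 = 21

project.calls = 0
fast = decode_with_cache(tokens, query)
calls_with_cache = project.calls     # one per token = 6
```

The outputs match step for step, but the cached version does work proportional to the number of tokens instead of its square. The trade-off is memory: the cache grows with every token kept in context.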

Frequently asked

What is self-attention?

Attention where a token attends to other tokens in the same sequence — letting the model build a contextual representation where each token "knows about" the others. Self-attention is the building block of every transformer.

Why is attention quadratic in sequence length?

Each token attends to every other token, so for N tokens you have N² attention computations. This is why long-context inference is expensive and why optimizations like flash attention matter so much in production.
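The arithmetic is easy to check. The helper names below are made up for this illustration, and the counts cover a single attention head in a single layer. With a causal mask, as used in decoder-only models, each token skips the tokens after it, which roughly halves the count but leaves the growth quadratic.

```python
def full_score_count(num_tokens):
    """Query-key scores when every token attends to every token."""
    return num_tokens * num_tokens

def causal_score_count(num_tokens):
    """Scores when each token attends only to itself and earlier tokens."""
    return num_tokens * (num_tokens + 1) // 2

for n in (1_000, 2_000, 4_000):
    print(n, full_score_count(n), causal_score_count(n))

# Doubling the sequence length quadruples the full score count.
growth = full_score_count(2_000) / full_score_count(1_000)
```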

Related terms