Attention Matching for Key Value Compaction
Posted on Mon 01 June 2026 in posts
Problem Setup
Context compaction preserves LLM output quality in long conversations while reducing the size of the KV cache[1]. Compaction itself can be resource-heavy. This paper introduces a family of efficient methods that compact the KV cache directly.
Key Idea
The context is partitioned into a fixed portion and a compacted portion. The compacted portion is replaced by a much smaller synthetic KV representation. Attention over a context block is decomposed into two components:
- locally normalized attention within the block
- the total attention mass assigned to the block.
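The decomposition can be checked numerically. The following is a minimal sketch, assuming standard scaled dot-product attention for a single query; the helper name `block_attention` and the random test data are illustrative and do not come from the paper. It shows that attention over two concatenated blocks equals the mass-weighted average of each block's locally normalized output.

```python
import numpy as np

def block_attention(q, K, V):
    """Return (locally normalized output, attention mass) of one block for query q."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores)          # unnormalized weights; fine for small demo values
    mass = weights.sum()              # total attention mass assigned to the block
    local_out = (weights / mass) @ V  # softmax restricted to this block
    return local_out, mass

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)
K_a, V_a = rng.normal(size=(5, d)), rng.normal(size=(5, d))
K_b, V_b = rng.normal(size=(7, d)), rng.normal(size=(7, d))

out_a, mass_a = block_attention(q, K_a, V_a)
out_b, mass_b = block_attention(q, K_b, V_b)

# Recombine the two blocks using only their local outputs and masses.
combined = (mass_a * out_a + mass_b * out_b) / (mass_a + mass_b)

# Reference: attention over the concatenated context.
full, _ = block_attention(q, np.vstack([K_a, K_b]), np.vstack([V_a, V_b]))
print(np.allclose(combined, full))
```

Because a block only enters the final result through these two quantities, a replacement block that reproduces both would leave the overall attention output unchanged.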
The key optimization problem is then to find a tuple \((\mathbf{C}_k, \beta, \mathbf{C}_v)\) for a block \((\mathbf{K}, \mathbf{V})\) such that the compacted block reproduces both components: its locally normalized attention should approximate that of the original block, and so should the attention mass it receives.
Here the original block holds \(T\) entries and the compacted one holds \(t<T\), meaning the compacted representation contains fewer entries than the original block.
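For a single query \(\mathbf{q}\), the two matching conditions could be written as follows. This is a sketch under assumptions that the text above does not state explicitly: \(\mathbf{K}, \mathbf{V} \in \mathbb{R}^{T \times d}\), \(\mathbf{C}_k, \mathbf{C}_v \in \mathbb{R}^{t \times d}\), \(\beta \in \mathbb{R}^{t}\) acts as an additive per-entry bias on the attention logits, and the usual \(1/\sqrt{d}\) scaling is absorbed into \(\mathbf{q}\). The paper's exact notation may differ.

\[
\operatorname{softmax}\!\left(\mathbf{q}\mathbf{C}_k^\top + \beta\right)\mathbf{C}_v \;\approx\; \operatorname{softmax}\!\left(\mathbf{q}\mathbf{K}^\top\right)\mathbf{V}
\]

\[
\sum_{j=1}^{t} \exp\!\left(\mathbf{q}\,(\mathbf{C}_k)_j^\top + \beta_j\right) \;\approx\; \sum_{i=1}^{T} \exp\!\left(\mathbf{q}\,\mathbf{K}_i^\top\right)
\]

The first line matches the locally normalized attention output, the second the attention mass.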
Method Overview
The paper presents a family of methods to efficiently select compaction values and avoid gradient optimization at inference time.
Each method uses reference queries as proxies for future attention queries when evaluating the quality of the compacted KV cache.
Finding \((\mathbf{C}_k, \beta, \mathbf{C}_v)\) takes two steps: first the compacted keys \(\mathbf{C}_k\) are chosen, then \(\beta\) and \(\mathbf{C}_v\) are computed, since both have a closed-form solution once \(\mathbf{C}_k\) is fixed.
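To illustrate why fixing \(\mathbf{C}_k\) makes the rest tractable, here is a minimal sketch that fits \(\beta\) and \(\mathbf{C}_v\) by two linear least-squares solves over a set of reference queries. It rests on assumptions rather than on the paper's exact procedure: \(\beta\) is treated as a per-entry additive logit bias, the positivity of \(\exp(\beta)\) is enforced by a crude clip instead of a constrained solver, and the function name `fit_beta_and_values` is hypothetical.

```python
import numpy as np

def fit_beta_and_values(Q, K, V, C_k):
    """Fit (beta, C_v) for fixed compacted keys C_k.

    Q: (n, d) reference queries; K, V: (T, d) original block; C_k: (t, d).
    """
    scale = 1.0 / np.sqrt(Q.shape[1])
    full = np.exp(Q @ K.T * scale)     # (n, T) unnormalized weights, original block
    comp = np.exp(Q @ C_k.T * scale)   # (n, t) unnormalized weights, compacted keys
    target_mass = full.sum(axis=1)     # (n,) attention mass per reference query

    # Mass matching: comp @ w ~= target_mass with w = exp(beta).
    w, *_ = np.linalg.lstsq(comp, target_mass, rcond=None)
    w = np.clip(w, 1e-8, None)         # assumption: clip to keep log(w) defined
    beta = np.log(w)

    # Value matching: biased softmax weights @ C_v ~= original attention output.
    target_out = (full / target_mass[:, None]) @ V       # (n, d)
    comp_w = comp * w
    comp_attn = comp_w / comp_w.sum(axis=1, keepdims=True)
    C_v, *_ = np.linalg.lstsq(comp_attn, target_out, rcond=None)
    return beta, C_v

rng = np.random.default_rng(0)
n, T, t, d = 64, 12, 4, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
C_k = K[:t]  # placeholder choice of compacted keys, purely for illustration
beta, C_v = fit_beta_and_values(Q, K, V, C_k)
print(beta.shape, C_v.shape)  # (4,) (4, 8)
```

A non-negative least-squares solver would be a more principled way to keep \(\exp(\beta)\) positive than the clip used here.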
The compacted keys \(\mathbf{C}_k\) are selected either by keeping the keys that receive the highest attention or by orthogonal matching pursuit (OMP), which is slower but empirically performs better.
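The two selection strategies can be sketched as follows. Both functions are illustrative assumptions rather than the paper's implementation: the first averages attention over the reference queries to score keys (the aggregation rule is a guess), and the second is a generic OMP that greedily picks keys whose unnormalized attention columns best explain the block's attention mass (the choice of target is also an assumption). The function names are hypothetical.

```python
import numpy as np

def select_top_attention_keys(Q, K, t):
    """Indices of the t keys with the highest mean attention over reference queries Q."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    importance = attn.mean(axis=0)                       # (T,) one score per key
    return np.sort(np.argsort(importance)[-t:])

def select_omp_keys(Q, K, t):
    """Greedy OMP: pick t keys whose columns best reconstruct the attention mass."""
    A = np.exp(Q @ K.T / np.sqrt(Q.shape[1]))  # (n, T) one dictionary column per key
    target = A.sum(axis=1)                     # (n,) mass of the full block
    col_norms = np.linalg.norm(A, axis=0)      # strictly positive since exp > 0
    selected = []
    residual = target.copy()
    for _ in range(t):
        corr = np.abs(A.T @ residual) / col_norms
        if selected:
            corr[selected] = -np.inf           # never pick the same key twice
        selected.append(int(np.argmax(corr)))
        # Refit all selected columns jointly, then update the residual.
        w, *_ = np.linalg.lstsq(A[:, selected], target, rcond=None)
        residual = target - A[:, selected] @ w
    return np.sort(np.array(selected))

rng = np.random.default_rng(0)
Q = rng.normal(size=(32, 8))
K = rng.normal(size=(10, 8))
top_idx = select_top_attention_keys(Q, K, 3)
omp_idx = select_omp_keys(Q, K, 3)
print(top_idx, omp_idx)
```

The OMP variant solves a least-squares problem at every step, which is one reason it is slower than a single top-\(t\) selection.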
Results
The proposed Attention Matching (AM) methods consistently outperform existing KV-compaction strategies across multiple benchmarks. Compared to optimization-heavy latent compaction methods such as Cartridges, AM achieves competitive quality while dramatically reducing compaction cost by avoiding gradient-based optimization during inference.
Limitations
The authors note the main limitations of their methods:
- the more accurate, optimization-based version of AM can still take a long time
- at more extreme compaction rates, AM is outperformed by optimization-heavy methods such as Cartridges
[1] The key-value (KV) cache reduces the cost of attention computation by storing intermediate key and value vectors so they do not have to be recomputed.