Attention Matching for Key Value Compaction

Problem Setup

Context compaction preserves LLM output quality in long conversations by reducing the size of the KV cache¹. Implementing it can be resource-intensive. This paper introduces a new, efficient family of methods that compact the KV cache directly.

Key Idea

The context is partitioned into a fixed portion and a compacted portion. The compacted portion is replaced by a much smaller synthetic KV representation. Attention over a context block is decomposed into two components:

locally normalized attention within the block
the total attention mass assigned to the block.
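The decomposition above can be checked numerically. The following is a minimal NumPy sketch, assuming unscaled dot-product logits and random data; the function name `block_attention` and all shapes are illustrative and not taken from the paper. It shows that attention over the whole context equals the per-block locally normalized outputs, recombined with weights proportional to each block's attention mass.

```python
import numpy as np

def block_attention(q, K, V):
    """Return (locally normalized output, attention mass) for one block."""
    logits = K @ q                    # one logit per key, shape (T,)
    weights = np.exp(logits)          # unnormalized attention weights
    mass = weights.sum()              # total attention mass of the block
    local_out = weights @ V / mass    # softmax restricted to this block
    return local_out, mass

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)
K_fixed, V_fixed = rng.normal(size=(5, d)), rng.normal(size=(5, d))
K_block, V_block = rng.normal(size=(12, d)), rng.normal(size=(12, d))

# Full attention over the concatenated context.
K_all = np.vstack([K_fixed, K_block])
V_all = np.vstack([V_fixed, V_block])
full_out, _ = block_attention(q, K_all, V_all)

# The same result, recombined from per-block outputs weighted by their masses.
out_f, mass_f = block_attention(q, K_fixed, V_fixed)
out_b, mass_b = block_attention(q, K_block, V_block)
recombined = (mass_f * out_f + mass_b * out_b) / (mass_f + mass_b)
```

Because the recombination only needs each block's output and mass, a compacted block that reproduces both quantities can stand in for the original without changing the overall attention result.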

The key optimization problem is then to find, for a block $(\mathbf{K}, \mathbf{V})$ with $T$ entries, a tuple $(\mathbf{C}_k, \beta, \mathbf{C}_v)$ with $t$ entries such that, for a query $\mathbf{q}$, the locally normalized attention is approximately preserved,

$$ \frac{\exp\left(\mathbf{q}\mathbf{K}^{\top}\right)\mathbf{V}} {\sum_{i=1}^{T}\exp\left(\mathbf{q}\mathbf{K}_{i}^{\top}\right)} \approx \frac{\exp\left(\mathbf{q}\mathbf{C}_k^{\top} + \beta\right)\mathbf{C}_v} {\sum_{i=1}^{t}\exp\left(\mathbf{q}(\mathbf{C}_k)_{i}^{\top} + \beta_i \right)}, $$

and so is the attention mass,

$$ \sum_{i=1}^{T}\exp\left(\mathbf{q}\mathbf{K}_{i}^{\top}\right) \approx \sum_{i=1}^{t}\exp\left(\mathbf{q}(\mathbf{C}_k)_{i}^{\top} + \beta_i \right). $$

Here $t<T$, meaning the compacted representation contains fewer entries than the original block; $\top$ denotes the transpose, and $\beta_i$ is the bias added to the logit of the $i$-th compacted key.

Method Overview

The paper presents a family of methods that select the compacted values efficiently and avoid gradient-based optimization at inference time.

Each method uses reference queries as proxies for future attention queries when evaluating the quality of the compacted KV cache.
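As an illustration of how reference queries can score a compacted cache, here is a minimal NumPy sketch. The function names `attention_stats` and `compaction_error`, and the particular error measures (relative Frobenius error for the output, mean relative error for the mass), are assumptions made for this example and are not claimed to be the paper's metrics.

```python
import numpy as np

def attention_stats(Q, K, V, beta=None):
    """Per-query locally normalized output and attention mass.

    Q: (n, d) reference queries, K: (m, d) keys, V: (m, d_v) values,
    beta: optional (m,) per-key bias added to the logits.
    """
    logits = Q @ K.T
    if beta is not None:
        logits = logits + beta            # broadcast the bias over queries
    weights = np.exp(logits)
    mass = weights.sum(axis=1)            # shape (n,)
    out = (weights @ V) / mass[:, None]   # shape (n, d_v)
    return out, mass

def compaction_error(Q_ref, K, V, C_k, beta, C_v):
    """Relative output error (Frobenius norm) and mean relative mass error."""
    out, mass = attention_stats(Q_ref, K, V)
    out_c, mass_c = attention_stats(Q_ref, C_k, C_v, beta)
    out_err = np.linalg.norm(out - out_c) / np.linalg.norm(out)
    mass_err = np.mean(np.abs(mass - mass_c) / mass)
    return out_err, mass_err

# Sanity check: the uncompacted block with zero bias reproduces itself.
rng = np.random.default_rng(0)
Q_ref = rng.normal(size=(16, 8))
K, V = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
out_err, mass_err = compaction_error(Q_ref, K, V, K, np.zeros(32), V)
```

Both errors are computed only on the reference queries, so the score is informative to the extent that those queries resemble the ones the model issues later.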

The required values $(\mathbf{C}_k, \beta, \mathbf{C}_v)$ are found in two steps: the compacted keys $\mathbf{C}_k$ are chosen first, and once they are fixed, $\beta$ and $\mathbf{C}_v$ have a closed-form solution.
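The summary does not spell out the closed form, so the sketch below is only one plausible reading, not necessarily the paper's exact procedure. It assumes that $\exp(\beta)$ is fitted by least squares so that the compacted attention mass matches the original on the reference queries, and that $\mathbf{C}_v$ is then fitted by a second least-squares solve on the locally normalized outputs. The function name `fit_beta_and_values`, the clipping constant `eps`, and the use of clipping in place of a proper nonnegative least-squares solver are all assumptions.

```python
import numpy as np

def fit_beta_and_values(Q_ref, K, V, C_k, eps=1e-8):
    """Fit beta and C_v for fixed compacted keys C_k (illustrative sketch)."""
    # Targets computed from the original block on the reference queries.
    W = np.exp(Q_ref @ K.T)                      # (n, T)
    mass = W.sum(axis=1)                         # (n,)
    target_out = (W @ V) / mass[:, None]         # (n, d_v)

    # Step 1: weights w_j = exp(beta_j) such that exp(Q C_k^T) @ w ~ mass.
    A = np.exp(Q_ref @ C_k.T)                    # (n, t)
    w, *_ = np.linalg.lstsq(A, mass, rcond=None)
    w = np.clip(w, eps, None)                    # assumption: clip so log is defined
    beta = np.log(w)

    # Step 2: with the attention weights fixed, solve for C_v by least squares.
    P = A * w                                    # biased unnormalized weights
    P = P / P.sum(axis=1, keepdims=True)         # locally normalized attention
    C_v, *_ = np.linalg.lstsq(P, target_out, rcond=None)
    return beta, C_v

rng = np.random.default_rng(0)
Q_ref = rng.normal(size=(64, 8))
K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
C_k = K[:4]                                      # arbitrary choice, for illustration only
beta, C_v = fit_beta_and_values(Q_ref, K, V, C_k)
```

Both steps are linear solves, which is what makes this stage cheap compared with gradient-based fitting. The quality of the result still depends on how $\mathbf{C}_k$ was chosen; taking the first four keys here is purely for demonstration.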

The compacted keys $\mathbf{C}_k$ are selected either by taking the keys that receive the highest attention or by running the orthogonal matching pursuit (OMP) algorithm, which is slower but empirically performs better.
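The simpler of the two selection strategies can be sketched as follows. This is a minimal NumPy example that assumes keys are scored by their softmax attention averaged over the reference queries; the function name `select_top_attention_keys` and the averaging rule are assumptions, since the summary does not say how attention is aggregated. The OMP variant is not shown.

```python
import numpy as np

def select_top_attention_keys(Q_ref, K, t):
    """Pick the t keys with the highest average attention over Q_ref."""
    logits = Q_ref @ K.T                                  # (n, T)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)               # softmax per query
    scores = attn.mean(axis=0)                            # average attention per key
    idx = np.sort(np.argsort(scores)[-t:])                # top-t, original order kept
    return idx, K[idx]

rng = np.random.default_rng(0)
Q_ref = rng.normal(size=(32, 8))
K = rng.normal(size=(20, 8))
idx, C_k = select_top_attention_keys(Q_ref, K, t=5)
```

The returned indices are sorted so that the selected keys keep their original order, which makes it straightforward to pick the matching rows of $\mathbf{V}$ as a starting point if needed.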

Results

The proposed Attention Matching (AM) methods consistently outperform existing KV-compaction strategies across multiple benchmarks. Compared to optimization-heavy latent compaction methods such as Cartridges, AM achieves competitive quality while dramatically reducing compaction cost by avoiding gradient-based optimization during inference.

Limitations

The authors note two main limitations of their methods:

the more accurate optimization-based version of AM can still take a long time
at more extreme compaction rates, AM is outperformed by traditional methods such as Cartridges

¹ The key-value (KV) cache reduces the computation needed for attention by storing the key and value vectors of previously processed tokens so that they do not have to be recomputed.
