Not All Thoughts Matter: Selective Attention for Efficient Reasoning

Best AI papers explained - A podcast by Enoch H. Kang



This paper studies an inference-time optimization technique that reduces the high computational cost of reasoning-optimized large language models (LLMs), which generate long chains of thought. Because self-attention scales quadratically with sequence length, these long reasoning chains become prohibitively expensive. The proposed method, RWR, exploits the redundancy of intermediate reasoning steps by retaining only two strategically chosen parts of the key-value (KV) cache: the first window, which holds the critical problem context, and the last window, which contains the most recent reasoning steps. This simple policy preserves accuracy while cutting the KV-cache budget by up to 50%, yielding substantial memory and compute savings on math reasoning, code generation, and academic question answering, even for models trained with full quadratic attention.
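To make the first-window + last-window idea concrete, here is a minimal sketch of such a KV-cache eviction policy. It is an illustration under stated assumptions, not the paper's implementation: the function name `prune_kv_cache`, the parameters `first_window` and `last_window`, and the default window sizes are hypothetical, and the cache is assumed to be stored as tensors of shape [batch, heads, seq_len, head_dim].

```python
# Minimal sketch (assumed API, not the paper's code): keep only the first
# `first_window` positions (problem context) and the last `last_window`
# positions (most recent reasoning steps) of a KV cache.
import torch


def prune_kv_cache(keys: torch.Tensor,
                   values: torch.Tensor,
                   first_window: int = 128,
                   last_window: int = 1024):
    """Retain the first and last windows of a KV cache shaped
    [batch, heads, seq_len, head_dim]; evict everything in between."""
    seq_len = keys.shape[2]
    if seq_len <= first_window + last_window:
        # The cache already fits within the budget; nothing to evict.
        return keys, values
    keep = torch.cat([
        torch.arange(first_window),                    # problem context
        torch.arange(seq_len - last_window, seq_len),  # recent reasoning steps
    ])
    return keys[:, :, keep, :], values[:, :, keep, :]


# Example: a 4096-token cache is reduced to 128 + 1024 = 1152 kept positions.
k = torch.randn(1, 8, 4096, 64)
v = torch.randn(1, 8, 4096, 64)
k_small, v_small = prune_kv_cache(k, v)
print(k_small.shape)  # torch.Size([1, 8, 1152, 64])
```

In this sketch the pruning step would be applied during decoding so that attention is only computed over the retained positions, which is where the memory and compute savings described above would come from.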
