FlashAttention: The Attention Speedup That Came From Reading GPU Memory
FlashAttention keeps attention exact but makes it IO-aware, using tiling to reduce slow GPU memory traffic and make long-sequence Transformers faster and cheaper.
FlashAttention keeps attention exact but makes it IO-aware, using tiling to reduce slow GPU memory traffic and make long-sequence Transformers faster and cheaper.
What problem it solves
Self-attention is expensive on long sequences because time and memory scale quadratically with sequence length. Many alternatives approximate attention, but approximation can hurt model quality and does not always improve real wall-clock speed. FlashAttention argues that the missing bottleneck is not only arithmetic; it is movement between different levels of GPU memory.
The core method
FlashAttention computes exact attention with an IO-aware tiling algorithm. It avoids materializing the full attention matrix in high-bandwidth memory, instead moving blocks through faster on-chip SRAM and carefully recomputing what is cheaper to recompute than to store. The method keeps the mathematical result exact while changing the memory schedule.
Key results
The paper reports faster Transformer training than existing baselines, including large speedups on GPT-2 and long-range tasks. It also enables longer contexts with better model quality, including better-than-chance performance on long Path-X style challenges. The impact grew because the idea fit real GPU hardware rather than only asymptotic complexity charts.
Why it matters
FlashAttention became infrastructure. It made attention-heavy models faster, reduced memory pressure, and helped long-context systems become practical. Many users never interact with the paper directly; they benefit because modern training and inference stacks quietly include FlashAttention-style kernels underneath.
Limits and open questions
FlashAttention accelerates attention, but it does not change the quadratic dependency itself in the same way a different architecture would. It also depends on careful kernel engineering and hardware assumptions. The broader lesson is durable: for large models, algorithm design and systems design are no longer separable.
One line: FlashAttention made exact attention fast by treating memory movement as the real problem.