FlashAttention: Fast Transformer Training with Long Sequences
10/01/2023
Transformers have been growing both deeper and wider, but there's a catch: training them on long sequences is a computational nightmare. Enter FlashAttention, a new algorithm designed to make your attention layers faster and more memory-efficient, all without approximation. Since its release six months ago, it has been widely adopted, and we're here to share some exciting updates!
We're pushing the envelope on sequence length in Transformers. The multihead attention layer at the heart of every Transformer model has runtime and memory requirements that scale quadratically with the input sequence length. The standard 2K sequence length is limiting when you want to train models that can read books, understand high-resolution images, or navigate webpages.
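To make the quadratic cost concrete, here is a minimal PyTorch sketch of standard, non-fused attention (the function name and shapes are purely illustrative, not FlashAttention's API). The N x N score matrix it materializes is exactly what dominates time and memory at long sequence lengths.

```python
import math
import torch

# Illustrative-only sketch of standard (non-fused) attention: the full
# N x N score matrix is materialized in memory before the softmax.
def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, heads, N, N)
    probs = scores.softmax(dim=-1)                   # N^2 memory per head
    return probs @ v

q = k = v = torch.randn(1, 12, 2048, 64)
out = naive_attention(q, k, v)  # the 2048 x 2048 score matrix dominates memory
```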
FlashAttention reorders attention computation and leverages classic techniques like tiling to make the attention mechanism much more efficient. This is particularly beneficial when you're dealing with long sequences and small batch sizes.
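The following is a didactic sketch of the tiling idea for a single head (names like `tiled_attention` are our own for illustration; the actual implementation is a fused CUDA kernel): attention is computed one key/value block at a time with a running, numerically stable softmax, so the full N x N score matrix is never materialized.

```python
import math
import torch

# Didactic sketch of tiled attention with an online (streaming) softmax.
def tiled_attention(q, k, v, block_size=256):
    # q: (n_q, head_dim); k, v: (n_k, head_dim); a single attention head
    n_q, d = q.shape
    n_k = k.shape[0]
    scale = 1.0 / math.sqrt(d)

    out = torch.zeros_like(q)                    # running (unnormalized) output
    row_max = torch.full((n_q,), float("-inf"))  # running max per query row
    row_sum = torch.zeros(n_q)                   # running softmax denominator

    for start in range(0, n_k, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale           # only (n_q, block_size) at a time

        new_max = torch.maximum(row_max, scores.max(dim=-1).values)
        correction = torch.exp(row_max - new_max)  # rescale earlier partial sums
        probs = torch.exp(scores - new_max[:, None])

        row_sum = row_sum * correction + probs.sum(dim=-1)
        out = out * correction[:, None] + probs @ v_blk
        row_max = new_max

    return out / row_sum[:, None]
```

Up to floating-point error, this matches the reference softmax(QK^T / sqrt(d))V computed with the full score matrix; the running max and denominator are what let each block be folded in exactly.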
To further optimize for long sequences, we've introduced additional parallelism over the sequence length dimension. We use multiple workers (thread blocks) to process each attention head, with each worker handling a block of rows of the attention matrix. This leads to significant speedups.
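A rough Python illustration of the same decomposition (in the real kernel these chunks are assigned to CUDA thread blocks; the Python loop and the `row_parallel_attention` name are only for exposition): because each output row depends only on its own query rows, blocks of query rows can be processed by independent workers and simply concatenated.

```python
import torch

# Rough illustration of parallelism over the sequence length dimension.
# `tiled_attention` refers to the block-by-block sketch above.
def row_parallel_attention(q, k, v, row_block=512):
    # q, k, v: (seq_len, head_dim); each chunk could run on its own worker
    chunks = [
        tiled_attention(q[start:start + row_block], k, v)
        for start in range(0, q.shape[0], row_block)
    ]
    return torch.cat(chunks, dim=0)
```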
Compared with standard PyTorch and Megatron-LM implementations, FlashAttention is up to 2.7x faster. For example, on an A100 40GB GPU, FlashAttention reaches a training throughput of up to 175 TFLOPs/s.
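If you want to try a fused attention kernel yourself, one convenient route (an illustration, not the exact setup behind the numbers above) is PyTorch's fused scaled_dot_product_attention, available in recent versions, which can dispatch to a FlashAttention-based kernel on supported GPUs such as the A100.

```python
import torch
import torch.nn.functional as F

# Illustration only: PyTorch's fused attention entry point, which can dispatch
# to a FlashAttention-style kernel on supported hardware. Requires a CUDA GPU.
q = torch.randn(8, 16, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (8, 16, 4096, 64)
```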
We trained GPT-3 models on the Pile dataset with varying context lengths and found that longer context generally leads to better models. For instance, on the ChapterBreak challenge dataset, increasing the context length resulted in a consistent quality improvement.
FlashAttention is just one step towards the future where ML models can handle long sequences for a variety of applications, from personalized user interactions to multi-modal tasks involving text, vision, and speech.
We're excited about this vision and would love to hear your thoughts or see your applications!