FlashAttention: Fast Transformer Training with Long Sequences
10/01/2023
Transformers have been growing both deeper and wider, but there's a catch: training them on long sequences is a computational nightmare. Enter FlashAttention, a new algorithm designed to make your attention layers faster and more memory-efficient, all without approximation. Since its release six months ago, it has been widely adopted, and we're here to share some exciting updates!
We're pushing the envelope on sequence length in Transformers. The multihead attention layer at the heart of every Transformer model has runtime and memory requirements that scale quadratically with the input sequence length. The standard 2K sequence length is limiting when you want to train models that can read books, understand high-resolution images, or navigate webpages.
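To make the quadratic cost concrete, here is a minimal PyTorch sketch of standard, non-fused attention (the function name and shapes are purely illustrative, not FlashAttention's API). The N x N score matrix it materializes is exactly what dominates time and memory at long sequence lengths.

```python
import math
import torch

# Illustrative-only sketch of standard (non-fused) attention: the full
# N x N score matrix is materialized in memory before the softmax.
def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, heads, N, N)
    probs = scores.softmax(dim=-1)                   # N^2 memory per head
    return probs @ v

q = k = v = torch.randn(1, 12, 2048, 64)
out = naive_attention(q, k, v)  # the 2048 x 2048 score matrix dominates memory
```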
FlashAttention reorders attention computation and leverages classic techniques like tiling to make the attention mechanism much more efficient. This is particularly beneficial when you're dealing with long sequences and small batch sizes.
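The following is a didactic sketch of the tiling idea for a single head (names like `tiled_attention` are our own for illustration; the actual implementation is a fused CUDA kernel): attention is computed one key/value block at a time with a running, numerically stable softmax, so the full N x N score matrix is never materialized.

```python
import math
import torch

# Didactic sketch of tiled attention with an online (streaming) softmax.
def tiled_attention(q, k, v, block_size=256):
    # q: (n_q, head_dim); k, v: (n_k, head_dim); a single attention head
    n_q, d = q.shape
    n_k = k.shape[0]
    scale = 1.0 / math.sqrt(d)

    out = torch.zeros_like(q)                    # running (unnormalized) output
    row_max = torch.full((n_q,), float("-inf"))  # running max per query row
    row_sum = torch.zeros(n_q)                   # running softmax denominator

    for start in range(0, n_k, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale           # only (n_q, block_size) at a time

        new_max = torch.maximum(row_max, scores.max(dim=-1).values)
        correction = torch.exp(row_max - new_max)  # rescale earlier partial sums
        probs = torch.exp(scores - new_max[:, None])

        row_sum = row_sum * correction + probs.sum(dim=-1)
        out = out * correction[:, None] + probs @ v_blk
        row_max = new_max

    return out / row_sum[:, None]
```

Up to floating-point error, this matches the reference softmax(QK^T / sqrt(d))V computed with the full score matrix; the running max and denominator are what let each block be folded in exactly.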
To further optimize for long sequences, we've introduced additional parallelism over the sequence length dimension. We use multiple workers (thread blocks) to process each attention head, with each worker handling a block of rows of the attention matrix. This leads to significant speedups.
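A rough Python illustration of the same decomposition (in the real kernel these chunks are assigned to CUDA thread blocks; the Python loop and the `row_parallel_attention` name are only for exposition): because each output row depends only on its own query rows, blocks of query rows can be processed by independent workers and simply concatenated.

```python
import torch

# Rough illustration of parallelism over the sequence length dimension.
# `tiled_attention` refers to the block-by-block sketch above.
def row_parallel_attention(q, k, v, row_block=512):
    # q, k, v: (seq_len, head_dim); each chunk could run on its own worker
    chunks = [
        tiled_attention(q[start:start + row_block], k, v)
        for start in range(0, q.shape[0], row_block)
    ]
    return torch.cat(chunks, dim=0)
```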
Compared with standard PyTorch and Megatron-LM implementations, FlashAttention is up to 2.7x faster. For example, on an A100 40GB GPU, FlashAttention reaches a training throughput of up to 175 TFLOPs/s.
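If you want to try a fused attention kernel yourself, one convenient route (an illustration, not the exact setup behind the numbers above) is PyTorch's fused scaled_dot_product_attention, available in recent versions, which can dispatch to a FlashAttention-based kernel on supported GPUs such as the A100.

```python
import torch
import torch.nn.functional as F

# Illustration only: PyTorch's fused attention entry point, which can dispatch
# to a FlashAttention-style kernel on supported hardware. Requires a CUDA GPU.
q = torch.randn(8, 16, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (8, 16, 4096, 64)
```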
We trained GPT-3 models on the Pile dataset with varying context lengths and found that longer context generally leads to better models. For instance, on the ChapterBreak challenge dataset, increasing the context length resulted in a consistent quality improvement.
FlashAttention is just one step towards the future where ML models can handle long sequences for a variety of applications, from personalized user interactions to multi-modal tasks involving text, vision, and speech.
We're excited about this vision and would love to hear your thoughts or see your applications!