Lighthouse Attention: A Training-Efficient Approach to Long-Context Language Models

Introduction

Training large language models on long sequences is computationally intensive due to the quadratic scaling of scaled dot-product attention (SDPA) with sequence length. While FlashAttention mitigated memory bottlenecks through IO-aware tiling, the fundamental compute cost remained. Researchers at Nous Research have introduced Lighthouse Attention, a training-only method that achieves a 1.40× to 1.69× end-to-end wall-clock speedup over a cuDNN-backed SDPA baseline, while maintaining or improving final training loss.

Source: www.marktechpost.com

The Pitfalls of Existing Sparse Attention Methods

Prior sparse attention approaches share two common limitations. First, they compress asymmetrically: keys and values are pooled while queries stay at full resolution. Second, their selection logic is embedded within custom attention kernels, which prevents reuse of the optimized dense-attention kernels designed for modern GPU tensor cores.

Moreover, training-time sparse methods face a unique correctness challenge: after training, the model must still perform well with dense attention at inference. Lighthouse treats this as a central criterion, ensuring that the resulting weights remain competent for dense attention downstream.

How Lighthouse Works

Lighthouse departs from prior work in two key ways: it symmetrically pools queries, keys, and values across a multi-level pyramid, and it places selection entirely outside the attention kernel. After selecting relevant entries, the system gathers them into a contiguous dense sub-sequence and runs standard FlashAttention, the same kernel used by dense baselines. This design lets teams use highly optimized tensor core operations without custom kernel modifications.

The Four-Stage Pipeline in Detail

A Lighthouse attention layer wraps SDPA without modifying it. The pipeline consists of four stages:

Stage 1: Pyramid Construction

Average pooling builds an L-level pyramid from Q, K, and V. With pooling factor p, level ℓ contains N/p^ℓ tokens, each summarizing p^ℓ base positions. Importantly, all three projections are pooled symmetrically, yielding coherent (Q^(ℓ), K^(ℓ), V^(ℓ)) triples at every level. The total cost of pyramid construction is Θ(N) in time and memory, because the level sizes form a geometric series: N + N/p + N/p² + … is at most N·p/(p−1) for p > 1.
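The pyramid construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names avg_pool and build_pyramid are hypothetical, the sequence length is assumed to be divisible by p at every level, and levels are assumed to be indexed from 0 (the unpooled sequence) to L − 1.

```python
from typing import List

Vector = List[float]


def avg_pool(tokens: List[Vector], p: int) -> List[Vector]:
    """Average non-overlapping groups of p consecutive token vectors."""
    assert len(tokens) % p == 0, "sequence length must be divisible by p"
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), p):
        group = tokens[start:start + p]
        pooled.append([sum(vec[d] for vec in group) / p for d in range(dim)])
    return pooled


def build_pyramid(tokens: List[Vector], p: int, num_levels: int) -> List[List[Vector]]:
    """Return [level 0, level 1, ...]; level l holds N / p**l vectors."""
    levels = [tokens]
    for _ in range(num_levels - 1):
        # Pooling the previous level again averages p**l base positions.
        levels.append(avg_pool(levels[-1], p))
    return levels


# Toy input (assumed values): N = 8 tokens of dimension 2, p = 2, L = 3.
q = [[float(i), 1.0] for i in range(8)]
k = [[1.0, float(i)] for i in range(8)]
v = [[float(i), float(i)] for i in range(8)]

# Symmetric pooling: the same operation is applied to Q, K, and V.
q_pyr, k_pyr, v_pyr = (build_pyramid(x, p=2, num_levels=3) for x in (q, k, v))

print([len(level) for level in q_pyr])  # [8, 4, 2]
print(q_pyr[2])                         # [[1.5, 1.0], [5.5, 1.0]]
```

The pyramid holds 8 + 4 + 2 = 14 entries in total, which stays below the N·p/(p−1) = 16 bound from the geometric series.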

Stage 2: Parameter-Free Scoring

A scorer assigns each pyramid entry two scalar scores using per-head ℓ₂ norms: a query score (∥Q^(ℓ)_i∥₂) and a key score (∥K^(ℓ)_i∥₂). Coarser levels inherit scores from finer levels, ensuring a consistent ranking across the hierarchy. No learned parameters are involved; the scoring is purely statistical.
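As a rough sketch of the norm-based scoring, the snippet below computes the two ℓ₂-norm scores for every entry of a single pyramid level and a single head. The helper names are hypothetical, the input values are made up, and the rule by which coarser levels inherit scores from finer ones is not reproduced here because the description above does not specify it.

```python
import math
from typing import List, Tuple

Vector = List[float]


def l2_norm(vec: Vector) -> float:
    """Euclidean length of one vector."""
    return math.sqrt(sum(x * x for x in vec))


def score_level(q_level: List[Vector], k_level: List[Vector]) -> Tuple[List[float], List[float]]:
    """Return (query_scores, key_scores) for one level of one head.

    No learned parameters: each score is just the norm of the pooled vector.
    """
    return [l2_norm(x) for x in q_level], [l2_norm(x) for x in k_level]


# Assumed toy values for one level with three entries.
q_level = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
k_level = [[1.0, 0.0], [0.0, 2.0], [5.0, 12.0]]

q_scores, k_scores = score_level(q_level, k_level)
print(q_scores)  # [5.0, 1.0, 10.0]
print(k_scores)  # [1.0, 2.0, 13.0]
```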

Stage 3: Selection

Based on the scores, the system selects the top-K entries from the pyramid (here K is the selection budget, not the key matrix). The selected entries are then collected with a simple gather operation that runs independently of the attention kernel. This design allows the use of any off-the-shelf dense attention implementation for the subsequent step.
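A top-K selection followed by a gather can be sketched as below. This is an illustration under stated assumptions rather than the method's actual selection rule: the two scores per entry are assumed to have already been combined into one number, the selected indices are assumed to be kept in their original position order, and select_top_k, gather, and budget are hypothetical names.

```python
import heapq
from typing import List, Sequence


def select_top_k(scores: Sequence[float], budget: int) -> List[int]:
    """Indices of the `budget` highest scores, in ascending position order."""
    top = heapq.nlargest(budget, range(len(scores)), key=lambda i: scores[i])
    return sorted(top)


def gather(entries: Sequence, indices: Sequence[int]) -> list:
    """Copy the chosen entries into a new contiguous list."""
    return [entries[i] for i in indices]


# Hypothetical scores for six pyramid entries (one number per entry; how the
# query and key scores are combined is an assumption left out of this sketch).
scores = [0.2, 1.5, 0.1, 0.9, 2.0, 0.3]
keys = [[float(i), 0.0] for i in range(6)]

selected = select_top_k(scores, budget=3)
dense_keys = gather(keys, selected)
print(selected)    # [1, 3, 4]
print(dense_keys)  # [[1.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
```

Because the result is an ordinary contiguous list, whatever consumes it next does not need to know that a selection happened.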

Stage 4: Dense Attention

The selected entries form a contiguous dense sub-sequence. Standard FlashAttention is applied to this sub-sequence, leveraging tensor core optimizations. Attention over a sub-sequence of S entries costs on the order of S² query-key products instead of N², so when S is much smaller than N the overall compute scales sub-quadratically in the original sequence length.
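To make the last stage concrete, the sketch below runs ordinary softmax attention on a gathered sub-sequence. It is a stand-in under stated assumptions: a plain Python function replaces FlashAttention (which computes the same exact attention result with a faster, tiled kernel), there is a single head and no causal mask, and the selected indices are hard-coded rather than produced by a scorer.

```python
import math
from typing import List

Matrix = List[List[float]]


def sdpa(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Plain softmax(Q K^T / sqrt(d)) V for a single head, no masking."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        logits = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        out.append([
            sum(w * v_row[c] for w, v_row in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


# Six entries, of which a hypothetical selection step kept indices 1, 3, 4.
n, selected = 6, [1, 3, 4]
q = [[0.0, 0.0] for _ in range(n)]   # zero queries give uniform weights
k = [[float(i), 1.0] for i in range(n)]
v = [[float(i)] for i in range(n)]

# Gather Q, K, and V with the same indices, then call the unmodified kernel.
q_s, k_s, v_s = ([x[i] for i in selected] for x in (q, k, v))
out = sdpa(q_s, k_s, v_s)

print(len(out))                    # 3 output rows, one per selected entry
print(len(selected) ** 2, n ** 2)  # 9 36 (query-key pairs: sparse vs dense)
```

The attention function itself is untouched; only its inputs are shorter, which is where the reduction from 36 to 9 query-key pairs in this toy case comes from.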

Performance Gains and Implications

In experiments, Lighthouse delivered a 1.40× to 1.69× end-to-end wall-clock speedup over the cuDNN-backed SDPA baseline for pretraining at long contexts, with matching or lower final training loss. The method wraps the existing attention call in a training pipeline and requires no changes to inference-time attention.

This work highlights a promising direction for scaling LLM training to longer contexts without sacrificing hardware efficiency. By decoupling selection from the attention kernel and using symmetric pooling, Lighthouse gains training speed while keeping the resulting weights usable with dense attention at inference.

Conclusion

Lighthouse Attention addresses the quadratic compute bottleneck of attention during pretraining by combining symmetric hierarchical pooling with selection outside the kernel. It achieves significant speedups while maintaining model quality, making it a valuable tool for training long-context language models.
