Lighthouse Attention: A Training-Efficient Approach to Long-Context Language Models
Introduction
Training large language models on long sequences is computationally expensive because scaled dot-product attention (SDPA) scales quadratically with sequence length. FlashAttention mitigated the memory bottleneck through IO-aware tiling, but the quadratic compute cost remained. Researchers at Nous Research have introduced Lighthouse Attention, a training-only method that achieves a 1.40× to 1.69× end-to-end wall-clock speedup over a cuDNN-backed SDPA baseline while matching or improving final training loss.

The Pitfalls of Existing Sparse Attention Methods
Prior sparse attention approaches share two common limitations. First, they apply compression asymmetrically, pooling only keys and values while keeping queries at full resolution. Second, their selection logic is embedded within custom attention kernels, which prevents reuse of the optimized dense-attention kernels designed for modern GPU tensor cores.
Moreover, training-time sparse methods face a unique correctness challenge: after training, the model must still perform well with dense attention at inference. Lighthouse treats this as a central criterion, ensuring that the resulting weights remain competent for dense attention downstream.
How Lighthouse Works
Lighthouse departs from prior work in two key ways: it pools queries, keys, and values symmetrically across a multi-level pyramid, and it places selection entirely outside the attention kernel. After selecting relevant entries, it gathers them into a contiguous dense sub-sequence and runs standard FlashAttention, the same kernel used by dense baselines. This design lets teams use highly optimized tensor core operations without custom kernel modifications.
The Four-Stage Pipeline in Detail
A Lighthouse attention layer wraps SDPA without modifying it. The pipeline consists of four stages:
Stage 1: Pyramid Construction
Average pooling builds an L-level pyramid from Q, K, and V. With pooling factor p, level ℓ contains N/p^ℓ tokens, each summarizing p^ℓ base positions. Importantly, all three projections are pooled symmetrically, yielding coherent (Q^(ℓ), K^(ℓ), V^(ℓ)) triples at every level. Because the level sizes form a geometric series (N + N/p + N/p² + … < N·p/(p−1) for p ≥ 2), the total cost of pyramid construction is Θ(N) in time and memory.
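The pyramid construction can be illustrated with a minimal pure-Python sketch. It assumes levels are numbered from 0 (the unpooled sequence), that N is divisible by p at every level, and that each level is pooled from the one below it. The names avg_pool and build_pyramid are hypothetical and chosen only for this illustration; a real implementation would run on batched GPU tensors.

```python
def avg_pool(rows, p):
    """Average each run of p consecutive vectors into a single vector."""
    if len(rows) % p != 0:
        raise ValueError("sequence length must be divisible by p")
    pooled = []
    for start in range(0, len(rows), p):
        group = rows[start:start + p]
        # zip(*group) iterates over feature columns of the group.
        pooled.append([sum(col) / p for col in zip(*group)])
    return pooled


def build_pyramid(x, p, num_levels):
    """Return [level 0, level 1, ...]; level l holds len(x) / p**l vectors."""
    levels = [x]
    for _ in range(1, num_levels):
        levels.append(avg_pool(levels[-1], p))
    return levels


# Toy input: N = 8 positions, head dimension 2.
q = [[float(i), 1.0] for i in range(8)]
q_pyramid = build_pyramid(q, p=2, num_levels=3)
level_sizes = [len(level) for level in q_pyramid]  # [8, 4, 2]
```

Applying the same function to K and V with the same p and number of levels gives matching triples at every level, which is the symmetric pooling described above.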
Stage 2: Parameter-Free Scoring
A scorer assigns each pyramid entry two scalar scores using per-head ℓ₂ norms: a query score (∥Q^(ℓ)_i∥₂) and a key score (∥K^(ℓ)_i∥₂). Coarser levels inherit scores from finer levels, ensuring a consistent ranking across the hierarchy. No learned parameters are involved; the scoring is purely statistical.
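As a rough illustration of the norm-based scores, the sketch below computes the ℓ₂ norm of each query and key vector at a single pyramid level for one head. The function names are hypothetical. The rule by which coarser levels inherit scores from finer levels is not spelled out here, so the sketch deliberately leaves it out and shows only the raw norms.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one vector."""
    return math.sqrt(sum(v * v for v in vec))


def score_level(q_level, k_level):
    """Return (query_scores, key_scores) for one level of one head."""
    query_scores = [l2_norm(q) for q in q_level]
    key_scores = [l2_norm(k) for k in k_level]
    return query_scores, key_scores


# Two pooled entries with head dimension 2.
q_level = [[3.0, 4.0], [0.0, 1.0]]
k_level = [[1.0, 0.0], [6.0, 8.0]]
q_scores, k_scores = score_level(q_level, k_level)
# q_scores == [5.0, 1.0], k_scores == [1.0, 10.0]
```

Nothing in this step is trained: the scores depend only on the pooled activations.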

Stage 3: Selection
Based on the scores, the system selects the top-K entries from the pyramid (here K is the selection budget, not the key matrix). The selection logic is implemented as a simple gather operation, independent of the attention kernel. This design allows the use of any off-the-shelf dense attention implementation for the subsequent step.
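A top-k selection followed by a gather might look like the sketch below. It assumes a single score per entry; how the query and key scores are combined, and how entries from different levels are ranked against each other, is not described here. Returning the kept indices in position order is also an assumption of this sketch, and the function names are hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def gather(rows, indices):
    """Copy the chosen rows into a new contiguous list."""
    return [rows[i] for i in indices]


# Hypothetical per-entry scores and matching key vectors for five entries.
scores = [0.2, 1.5, 0.1, 0.9, 1.1]
keys = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]

keep = top_k_indices(scores, k=3)  # [1, 3, 4]
k_selected = gather(keys, keep)
```

Because the result is an ordinary contiguous list, it can be handed to any dense attention routine without that routine knowing a selection took place.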
Stage 4: Dense Attention
The selected entries form a contiguous dense sub-sequence. Standard FlashAttention is applied to this sub-sequence, leveraging tensor core optimizations. Because the sub-sequence is much shorter than the original sequence, the overall compute scales sub-quadratically.
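To make the final stage concrete, the sketch below runs a naive reference implementation of scaled dot-product attention on a small gathered sub-sequence and compares the number of query-key pairs scored with and without selection. It is a stand-in, not FlashAttention: FlashAttention computes the same exact attention result with a tiled, IO-aware kernel. The sketch is single-head and omits masking, and the sizes N = 8 and S = 3 are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dense_attention(q, k, v):
    """Naive scaled dot-product attention; single head, no masking."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        logits = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(logits)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


# Hypothetical gathered sub-sequence: S = 3 entries kept out of N = 8.
# Identical value rows make the result easy to check, since every output
# row is a weighted average of the value rows.
q_sel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k_sel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_sel = [[2.0, 3.0], [2.0, 3.0], [2.0, 3.0]]
out = dense_attention(q_sel, k_sel, v_sel)

# Query-key pairs scored: N * N for the full sequence, S * S after selection.
n, s = 8, len(q_sel)
pairs_full, pairs_selected = n * n, s * s  # 64 versus 9
```

The saving comes entirely from S being smaller than N; the attention routine itself is unchanged.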
Performance Gains and Implications
In experiments, Lighthouse delivered a 1.40× to 1.69× wall-clock speedup for pretraining at long contexts, with matching or lower final training loss compared to the dense baseline. The method seamlessly integrates with existing training pipelines and requires no changes to inference-time attention.
This work points to a promising direction for scaling LLM training to longer contexts without sacrificing hardware efficiency. By decoupling selection from the attention kernel and pooling symmetrically, Lighthouse achieves the speedup while keeping the trained weights usable with dense attention.
Conclusion
Lighthouse Attention addresses the quadratic compute bottleneck of attention during pretraining by combining symmetric hierarchical pooling with selection outside the kernel. It achieves significant speedups while maintaining model quality, making it a valuable tool for training long-context language models.