Attention variants
Drop-in alternatives to standard causal multi-head attention (MHA).
- RoPE + GQA — HF-Llama-compatible baseline (rotary position embeddings, grouped-query attention)
- ALiBi — linear position bias (Press et al., 2021)
- Sliding window — Mistral / GPT-OSS recipe
- Differential attention — subtract two softmaxes (Ye et al., 2024)
- YOCO — cross-layer KV sharing (Sun et al., 2024)
- StreamingLLM — attention sinks for unbounded-length streaming with a fixed-size KV cache (Xiao et al., 2024)
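
Several of the variants above differ from plain causal attention only in which keys a query may see, or in a bias added to the scores before the softmax. The following is a minimal pure-Python sketch of those patterns, not this module's implementation. All function names and parameters (`window`, `num_sinks`, `lam`) are hypothetical and chosen for illustration. The ALiBi slopes assume a power-of-two head count, the StreamingLLM mask shows only the visibility pattern (the paper also re-indexes positions inside the cache), and the differential-attention helper takes a fixed scalar `lam` where the paper learns it.

```python
import math


def causal_mask(n):
    """mask[i][j] is True where query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Causal mask limited to the `window` most recent keys, self included."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def sink_window_mask(n, window, num_sinks):
    """Sliding window plus the first `num_sinks` keys, which stay visible."""
    return [
        [j <= i and (j < num_sinks or j > i - window) for j in range(n)]
        for i in range(n)
    ]


def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8h/H) for h = 1..H (assumes H is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(n, slope):
    """Additive score bias: -slope * distance for past keys, -inf for future."""
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(n)]
        for i in range(n)
    ]


def softmax(row):
    """Numerically stable softmax; -inf entries get weight 0."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def differential_weights(scores_a, scores_b, lam):
    """Differential attention weights for one query: softmax(a) - lam * softmax(b)."""
    wa, wb = softmax(scores_a), softmax(scores_b)
    return [a - lam * b for a, b in zip(wa, wb)]
```

For example, `sliding_window_mask(4, 2)` lets the last query see only keys 2 and 3, while `sink_window_mask(5, 2, 1)` additionally keeps key 0 visible to every query. In a real model these masks and biases would be applied to the scaled dot-product scores of each head before the softmax.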