Deep Research: LLM Architecture Innovations Beyond Transformers (2025-2026)
This report surveys these architectures: their designs, benchmark results, scaling properties, and real-world adoption.
1. MAMBA / MAMBA-2 (State Space Models)
Origin: Albert Gu & Tri Dao (Carnegie Mellon / Princeton), December 2023 (Mamba-1), May 2024 (Mamba-2)
Architecture Details
- Core mechanism: Selective State Space Model (S6) — a continuous-time linear recurrence discretized for sequence modeling. Unlike transformers, attention is replaced with input-dependent state transitions.
- Mamba-1: Introduced selective scan — the state space parameters (A, B, C, delta) become functions of the input, allowing the model to selectively propagate or forget information along the sequence. This is the key departure from prior fixed-parameter SSMs (S4, H3).
- Mamba-2: Reformulated as Structured State Space Duality (SSD). Showed that selective SSMs can be expressed as a form of structured (semiseparable) matrix multiplication, creating a duality with attention. This enabled:
- 2-8x faster training than Mamba-1 via hardware-aware chunked algorithms
- Larger state dimensions (from 16 to 64-256) without proportional slowdown
- Tensor parallelism support for multi-GPU scaling
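The selective-scan idea above can be sketched as a plain per-token recurrence. This is a didactic, unoptimized version: real Mamba fuses the computation into a hardware-aware parallel scan, and the projection names (`W_delta`, `W_B`, `W_C`) are illustrative, not the library's actual parameter names.

```python
import numpy as np

def selective_ssm_scan(x, A, W_delta, W_B, W_C):
    """Toy selective SSM (S6-style) recurrence, one step per token.

    x:       (T, D) input sequence
    A:       (D, N) fixed negative state matrix (diagonal per channel)
    W_delta: (D, D), W_B/W_C: (D, N) projections that make delta, B, C
             functions of the input -- the "selective" part.
    """
    T, D = x.shape
    h = np.zeros((D, A.shape[1]))  # recurrent state: fixed size, no KV cache
    ys = []
    for t in range(T):
        delta = np.log1p(np.exp(x[t] @ W_delta))   # softplus step size, (D,)
        B = x[t] @ W_B                             # input-dependent B, (N,)
        C = x[t] @ W_C                             # input-dependent C, (N,)
        A_bar = np.exp(delta[:, None] * A)         # discretized decay, (D, N)
        h = A_bar * h + (delta[:, None] * B[None, :]) * x[t][:, None]
        ys.append(h @ C)                           # readout, (D,)
    return np.stack(ys)
```

Because `delta`, `B`, and `C` depend on `x[t]`, the model can choose per token how strongly to write into or decay the state, which is exactly what fixed-parameter SSMs like S4 could not do.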
Benchmark Results vs Transformers
- Mamba-1 (3B) matched or exceeded Transformer models of equivalent size on language modeling perplexity, common-sense reasoning (HellaSwag, PIQA, WinoGrande), and was notably strong on long-context tasks.
- Mamba-2 matched Transformer++ (Transformer with RMSNorm, SwiGLU, RoPE) at equivalent FLOP budgets up to ~3B parameters.
- Weakness: Pure Mamba models showed degradation on tasks requiring precise recall from long contexts (e.g., needle-in-a-haystack, associative recall), where attention excels.
Scaling Properties
- Linear complexity in sequence length: O(N) vs O(N^2) for attention.
- Training throughput scales well, but the community found that pure SSM models plateau relative to transformers at larger scales (7B+). This led to hybrid approaches (see Jamba below).
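The complexity gap can be made concrete with rough FLOP counts; the constants here are illustrative, not measured from any implementation.

```python
def attn_flops(seq_len, d_model):
    """Rough FLOPs for attention's QK^T and AV matmuls: each is about
    2 * N^2 * d multiply-adds, so cost grows quadratically in N."""
    return 4 * seq_len ** 2 * d_model

def ssm_flops(seq_len, d_model, state_dim):
    """Rough FLOPs for a diagonal SSM layer: each token updates and reads
    a (d_model x state_dim) state, so cost grows linearly in N."""
    return seq_len * d_model * 2 * state_dim
```

Doubling the sequence length quadruples the attention cost but only doubles the SSM cost, which is why the advantage compounds at 100K+ token contexts.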
Inference Efficiency
- Constant memory per token during generation (fixed-size recurrent state, no KV cache).
- Up to 5x higher throughput than equivalently-sized transformers at long sequence lengths during inference.
- Particularly advantageous for batch inference and very long contexts.
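A back-of-envelope comparison of generation-time memory makes the "no KV cache" point concrete. The configurations below are hypothetical 7B-class dimensions, not any released model's exact shapes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """fp16 KV-cache size for a transformer: one K and one V vector per
    layer per token, so memory grows linearly with context length."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

def ssm_state_bytes(n_layers, d_inner, state_dim, bytes_per_elt=2):
    """fp16 recurrent-state size for a Mamba-style model: fixed per layer,
    independent of how many tokens have been generated."""
    return n_layers * d_inner * state_dim * bytes_per_elt
```

With 32 layers, 8 KV heads of dimension 128 at a 131,072-token context, the fp16 KV cache comes to roughly 17 GB; a hypothetical 64-layer Mamba with an 8192-wide inner dimension and state size 128 holds about 134 MB, regardless of context length.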
Adoption Status
- Research adoption: Very high. Hundreds of papers building on Mamba.
- Production adoption: Limited as a standalone architecture. More commonly used as a component in hybrids (Jamba, Zamba). Cartesia AI shipped “Rene” models based on Mamba for on-device and real-time applications.
- As of early 2026: No major frontier lab has shipped a pure Mamba model as their flagship. The consensus settled on Mamba as a powerful component rather than a full replacement for transformers.
2. JAMBA (AI21 Labs)
Origin: AI21 Labs, March 2024 (Jamba), August 2024 (Jamba 1.5)
Architecture Details
- Hybrid SSM-Transformer with MoE: Interleaves Mamba (SSM) layers with standard Transformer attention layers, adding Mixture-of-Experts (MoE) on the MLP blocks.
- Typical ratio: ~7:1 Mamba-to-Attention layers (e.g., 7 Mamba layers per 1 attention layer in a repeating block).
- Uses MoE with top-2 routing across 16 experts on selected layers, giving large total parameter counts but lower active parameters per forward pass.
- Jamba 1.5 Mini: 12B active params (52B total). Jamba 1.5 Large: 94B active params (398B total).
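The interleaving pattern can be sketched as a layer schedule. The exact placement of the attention layer within each block and the MoE cadence here are assumptions for illustration, not Jamba's published configuration.

```python
def jamba_like_schedule(n_blocks, mamba_per_attn=7, moe_every=2, n_experts=16):
    """Hypothetical Jamba-style schedule: each repeating block holds
    `mamba_per_attn` Mamba layers plus one attention layer, and every
    other layer swaps its dense MLP for a top-2 MoE over `n_experts`."""
    layers = []
    for _ in range(n_blocks):
        kinds = ["mamba"] * mamba_per_attn
        kinds.insert(mamba_per_attn // 2, "attention")  # assumed mid-block
        for kind in kinds:
            mlp = f"moe-{n_experts}e-top2" if len(layers) % moe_every else "dense"
            layers.append((kind, mlp))
    return layers
```

A 4-block schedule yields 32 layers, of which only 4 are attention layers needing a KV cache, which is where the memory savings in the next subsection come from.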
Benchmark Results
- Jamba 1.5 Large was competitive with Llama-3.1 70B and Mixtral 8x22B on standard benchmarks (MMLU, HumanEval, GSM8K).
- Excelled at long-context tasks with a 256K token context window, handling them with significantly less memory than pure transformers.
- Jamba 1.5 Mini outperformed many dense models at its active parameter count.
Scaling Properties
- The hybrid approach resolved pure Mamba’s scaling issues — attention layers provide the precise recall capability that SSMs lack, while SSMs handle the bulk of sequential processing efficiently.
- MoE further improves compute efficiency per parameter.
Inference Efficiency
- KV cache reduced by ~8x compared to an equivalent pure transformer (only ~1/8 of layers have attention needing KV cache).
- Fits much longer contexts on a single GPU. Jamba could handle 256K contexts on a single 80GB A100, where a comparable transformer would need multiple GPUs.
- Throughput gains of 2-3x over comparable transformers at long contexts.
Adoption Status
- Production: AI21 offers Jamba models through their API and cloud platforms. Available on Hugging Face, NVIDIA NIM, Amazon Bedrock.
- This is the most commercially deployed non-transformer (hybrid) architecture. AI21 uses it as their flagship model family.
- Influenced the broader trend toward hybrid architectures.
3. RWKV-6 / RWKV-7
Origin: Bo Peng and RWKV Foundation (open-source community), RWKV-6 (2024), RWKV-7 “Goose” (early 2025)
Architecture Details
- “Receptance Weighted Key Value” — a linear attention / RNN hybrid that reformulates attention as a linear recurrence, achieving O(N) complexity.
- RWKV-6: Introduced data-dependent time decay (similar in spirit to Mamba’s selectivity). Each token’s decay rate is a function of the input, not a fixed parameter. Uses a “WKV” (weighted key-value) kernel as the core operation.
- RWKV-7 “Goose”: Further innovations:
- In-context learning rate mechanism: the model can dynamically adjust how much it “learns” from context, inspired by delta-rule / fast-weight ideas.
- Improved state management with larger and more expressive recurrent states.
- Anti-token-shift mechanisms to better handle bidirectional context needs.
- Trained models up to 14B parameters.
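The core data-dependent-decay recurrence can be shown with a toy version. This is not the actual WKV kernel: it omits RWKV's token-shift, bonus terms, and per-head structure, and the variable names are illustrative.

```python
import numpy as np

def rwkv_like_recurrence(ks, vs, rs, ws):
    """Toy RWKV-6-style linear recurrence: an outer-product state with a
    per-token, input-dependent decay, read out through receptance.

    ks, vs, rs: (T, D) keys / values / receptance; ws: (T, D) decays in (0, 1).
    """
    T, D = ks.shape
    S = np.zeros((D, D))  # fixed-size state, constant memory in T
    outs = []
    for t in range(T):
        S = ws[t][:, None] * S + np.outer(ks[t], vs[t])  # decay, then write
        outs.append(rs[t] @ S)                           # receptance readout
    return np.stack(outs)
```

Because `ws[t]` is computed from the input rather than being a fixed learned constant, each channel can decide per token how quickly to forget, the same selectivity idea that Mamba introduced.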
Benchmark Results
- RWKV-6 14B was roughly competitive with Llama-2 13B and other similarly-sized transformers on standard benchmarks, though typically a few points behind top transformers.
- RWKV-7 showed improvement, narrowing the gap. At 3B scale, RWKV-7 was competitive with Mamba-2 and within striking distance of Transformer++ baselines.
- Long-context performance was a standout — RWKV handles very long sequences with constant memory, though recall precision still trailed attention on needle-in-a-haystack.
Scaling Properties
- Linear complexity in both training and inference.
- Community has trained up to 14B parameters. Scaling curves are competitive but the gap with transformers tends to widen slightly at larger scales (a recurring theme for all alternative architectures).
Inference Efficiency
- Constant memory footprint during generation (recurrent state only, no KV cache).
- Extremely efficient on edge devices and resource-constrained environments.
- Can run large models on consumer hardware more easily than transformers.
Adoption Status
- Strong open-source community (RWKV Foundation, supported by various contributors globally).
- Available on Hugging Face. Multiple fine-tuned variants exist for different languages (particularly strong Chinese language support).
- Production: Some deployment in edge/embedded AI scenarios and by smaller companies. Not adopted by any major cloud provider as a primary offering.
- RWKV is notable as the most significant fully open-source, community-driven alternative architecture effort.
4. xLSTM (Extended Long Short-Term Memory)
Origin: Sepp Hochreiter (original LSTM inventor) et al., NXAI Lab (Linz, Austria), May 2024
Architecture Details
- Revisits the classic LSTM with two key innovations:
- sLSTM (scalar LSTM): Introduces exponential gating and a new memory mixing mechanism. The exponential gates allow for sharper, more decisive memory updates.
- mLSTM (matrix LSTM): Replaces the scalar cell state with a matrix-valued memory, vastly increasing storage capacity. The update rule becomes a form of covariance-based memory update (related to fast-weight programmers). This is the more powerful variant.
- xLSTM stacks residual blocks of sLSTM and mLSTM layers, forming a deep recurrent backbone.
- mLSTM can be parallelized during training via a chunk-wise formulation (similar to Mamba-2’s approach).
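The mLSTM update can be sketched in a few lines, simplified from the xLSTM paper's equations; the released models also carry a stabilizer state for the exponential gates, which is omitted here.

```python
import numpy as np

def mlstm_step(C, n, q, k, v, i_gate, f_gate):
    """One simplified mLSTM memory step: the matrix memory C accumulates
    outer products v k^T under input/forget gating (a covariance-style,
    fast-weight update), and retrieval is normalized by a running key
    accumulator n."""
    C = f_gate * C + i_gate * np.outer(v, k)   # (D, D) matrix memory
    n = f_gate * n + i_gate * k                # normalizer state
    h = (C @ q) / max(abs(n @ q), 1.0)         # normalized readout
    return C, n, h
```

Storing full outer products is what gives mLSTM its associative-recall edge over scalar-state recurrences: a value written under key `k` can be retrieved exactly by querying with `k` again.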
Benchmark Results
- At the 1.3B scale, xLSTM was competitive with Mamba, RWKV-5, and Transformers on language modeling perplexity.
- mLSTM specifically showed strong associative recall capabilities — better than Mamba, closer to transformers — due to the matrix memory providing more capacity for storing key-value associations.
- At 7B scale (xLSTM 7B, released late 2024), results were promising but still slightly behind best-in-class transformers.
Scaling Properties
- Linear complexity like other recurrent alternatives.
- mLSTM’s matrix memory scales the state size quadratically with head dimension, which increases memory but also capacity. Balancing state size vs. efficiency is an open research question.
- Early scaling results suggest competitive scaling laws, but data is limited beyond 7B.
Inference Efficiency
- Recurrent inference with fixed state size (no KV cache).
- mLSTM’s matrix state is larger than Mamba’s or RWKV’s state, so its per-token memory footprint is the highest among the recurrent alternatives, though still far below a transformer KV cache at long contexts.
Adoption Status
- NXAI (Hochreiter’s company) is commercializing xLSTM.
- Academic interest is high given Hochreiter’s reputation and the LSTM legacy.
- Production deployment: Very limited as of early 2026. Primarily a research architecture with commercial aspirations.
- Released open-weight models up to 7B.
5. GRIFFIN / RECURRENTGEMMA (Google DeepMind)
Origin: Google DeepMind, February 2024 (Griffin paper), April 2024 (RecurrentGemma release)
Architecture Details
- Griffin: A hybrid architecture combining:
- Real-Gated Linear Recurrence (RG-LRU): A simplified linear recurrence with real-valued (not complex) diagonal state matrices and input-dependent gating. Simpler than Mamba’s S6.
- Interleaved with local sliding window attention (window size ~2048 tokens) for precise local recall.
- MLP blocks with gated activations.
- RecurrentGemma: The productized version of Griffin, released as part of Google’s Gemma model family.
- RecurrentGemma 2B: Competitive with Gemma 2B (transformer) while being more efficient at inference.
- RecurrentGemma 9B: Released later, extending the approach.
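The RG-LRU recurrence is simple enough to sketch directly. This follows the form of the paper's equations but is simplified; the gate projections `W_r`, `W_i` and the constant `c` are parameterization assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rg_lru(x, W_r, W_i, lam, c=8.0):
    """Sketch of Griffin's Real-Gated Linear Recurrent Unit.

    x: (T, D) inputs; lam: learned (D,) logits with a = sigmoid(lam) in (0,1).
    All arithmetic is real-valued and diagonal, which is what makes the
    recurrence hardware-friendly."""
    T, D = x.shape
    a = sigmoid(lam)
    h = np.zeros(D)
    outs = []
    for t in range(T):
        r = sigmoid(x[t] @ W_r)        # recurrence gate
        i = sigmoid(x[t] @ W_i)        # input gate
        a_t = a ** (c * r)             # input-dependent decay, stays in (0,1)
        h = a_t * h + np.sqrt(1.0 - a_t ** 2) * (i * x[t])
        outs.append(h)
    return np.stack(outs)
```

The `sqrt(1 - a_t^2)` input scaling keeps the state's magnitude roughly constant as the decay varies, a normalization trick that avoids the state blowing up or vanishing when gates saturate.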
Benchmark Results
- Griffin matched Transformer baselines (Llama-2 equivalent) at the 1B and 3B scale on standard benchmarks.
- RecurrentGemma 2B was competitive with Gemma 2B on most benchmarks, with slight degradation on tasks requiring very long-range recall but improvements in throughput.
- The local attention window helped significantly with recall tasks compared to pure recurrent models.
Scaling Properties
- Google showed competitive scaling curves up to 7B in the paper.
- The simplicity of RG-LRU (real-valued, no complex arithmetic) makes it very hardware-friendly and easy to scale.
- Google’s internal scaling experiments reportedly went further, but results were not fully published.
Inference Efficiency
- Significantly reduced KV cache (only for local attention window, not full sequence).
- 2-3x throughput improvement over equivalently-sized Gemma transformers at long sequences.
- Fixed-size recurrent state + small sliding window cache.
Adoption Status
- RecurrentGemma is available on Google’s platforms (Vertex AI, Kaggle, Hugging Face).
- Represents the most significant investment by a major frontier lab in non-transformer architectures.
- However: Google’s flagship models (Gemini) remain transformer-based. RecurrentGemma appears positioned as an efficiency-oriented alternative for specific use cases, not a replacement.
- As of early 2026: RecurrentGemma has not become Google’s primary architecture. Gemini 2.x continues to be transformer-based (with undisclosed efficiency improvements).
6. OTHER NOTABLE ARCHITECTURES
Zamba (Zyphra)
- Hybrid Mamba-attention architecture from Zyphra AI.
- Zamba 7B and Zamba2 (1.2B, 2.7B, 7B) released through 2024-2025.
- Innovation: Shared attention layers — instead of unique attention layers at each position, a single set of attention parameters is shared across all attention positions, reducing parameter count.
- Particularly targeted at on-device / edge deployment.
- Available on Hugging Face, some on-device production use.
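The parameter saving from Zamba's shared attention is simple arithmetic; the numbers in the test are hypothetical, not Zamba's actual configuration.

```python
def attn_param_storage(n_attn_positions, params_per_attn, shared):
    """Illustrative cost of Zamba-style weight sharing: with sharing, one
    attention parameter set is stored once and applied at every attention
    position; without it, each position carries its own weights."""
    return params_per_attn if shared else n_attn_positions * params_per_attn
```

For example, six attention positions at a hypothetical 200M parameters each cost 1.2B parameters unshared but only 200M shared, which matters most for the on-device targets Zamba aims at.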
Based (Stanford Hazy Research / Together AI)
- “Linear attention + sliding window attention” hybrid.
- Explored distilling trained transformers into linear-attention architectures, a direction the Hazy Research group and Together AI continued in follow-up work.
MEGALODON (Meta)
- Evolution of the MEGA (Moving Average Equipped Gated Attention) line.
- Uses complex exponential moving average with gated attention.
- Showed competitive results at 7B scale.
- Meta continued using transformers (Llama series) for their main models.
DeltaNet / Gated DeltaNet
- Linear attention variant using the delta rule for memory updates.
- Influenced RWKV-7’s design.
- Strong theoretical properties for in-context learning.
- Primarily academic as of 2026.
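The delta rule's defining property, overwriting a stored value rather than accumulating on top of it, is easy to show in a few lines (a minimal sketch, not any library's DeltaNet implementation):

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One delta-rule fast-weight update. A plain linear-attention write
    does S += v k^T (purely additive); the delta rule first reads what the
    memory currently returns for key k and moves it toward v.

    S: (D, D) fast-weight memory; k, v: (D,); beta: write strength in [0, 1].
    """
    v_old = S @ k                            # current prediction for key k
    S = S + beta * np.outer(v - v_old, k)    # correct it toward the new value
    return S
```

With `beta = 1` and a unit-norm key, a second write under the same key replaces the first association instead of summing with it, which is why delta-rule memories degrade more gracefully when keys repeat in context.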
Titans (Google Research, late 2024)
- Proposed a “neural long-term memory” module alongside attention.
- The memory module is itself a small neural network that learns to memorize and forget.
- Interesting conceptual direction but limited public results at scale.
7. COMPARATIVE SUMMARY
| Architecture | Complexity | KV Cache | Long-Context Memory | Recall Precision | Max Proven Scale | Production Use |
|---|---|---|---|---|---|---|
| Transformer | O(N^2) | Full | Grows with context (full cache) | Excellent | 1T+ params | Dominant |
| Mamba-2 | O(N) | None | Fixed state | Moderate | ~3B (pure) | Component in hybrids |
| Jamba | O(N) mostly | ~1/8 of Transformer | Hybrid (state + sparse cache) | Good | 398B total / 94B active | Yes (AI21 API) |
| RWKV-7 | O(N) | None | Fixed state | Moderate | 14B | Edge / community |
| xLSTM | O(N) | None | Matrix state (larger) | Good (mLSTM) | 7B | Very limited |
| RecurrentGemma | O(N) mostly | Small (local window) | Hybrid | Good (local) | 9B | Available (Google) |
| Zamba | O(N) mostly | Small (shared attn) | Hybrid | Good | 7B | Edge / on-device |
8. KEY FINDINGS AND CONCLUSIONS
The Hybrid Consensus
The single most important finding from 2024-2025 is that pure alternative architectures do not fully replace transformers, but hybrid architectures combining recurrent/SSM layers with sparse attention are the clear winning pattern. Jamba, Griffin/RecurrentGemma, and Zamba all converge on this design: use recurrent layers for the bulk of processing (cheap, linear complexity) and sprinkle in attention layers for precise recall (expensive but necessary).
Why Transformers Still Dominate
- Scaling confidence: Transformers have proven scaling laws validated up to trillions of parameters. Alternatives have only been validated to ~14B (pure) or ~400B (hybrid with MoE).
- Ecosystem: Tooling, optimization libraries (Flash Attention, vLLM, TensorRT-LLM), and deployment infrastructure are all transformer-optimized.
- Recall precision: For tasks requiring exact retrieval from long contexts, attention remains superior. The “needle-in-a-haystack” gap is real.
- Risk aversion: Training a frontier model costs $100M+. Labs are unwilling to bet on less-proven architectures.
What Alternatives Actually Deliver
- Inference efficiency at long contexts — This is the killer use case. 2-5x throughput gains at 100K+ token contexts.
- Reduced memory — No or minimal KV cache enables deployment on smaller hardware.
- Edge/on-device deployment — Fixed-size recurrent states are ideal for constrained environments.
- Streaming/real-time processing — Constant per-token cost enables real-time applications.
What Is Actually in Production (Early 2026)
- Jamba (AI21): The most prominent non-transformer in commercial production.
- RecurrentGemma (Google): Available but not Google’s primary model.
- RWKV: Community deployments, some edge applications.
- Zamba: On-device deployments.
- Mamba-based models via Cartesia AI: Real-time voice/audio applications.
- Everything else: Research stage or very limited deployment.
The Frontier Remains Transformer
As of early 2026, every frontier model (GPT-4.x/5, Claude 3.5/4.x, Gemini 2.x, Llama 4, Grok) is still fundamentally transformer-based. The alternative architectures have carved out important niches in efficiency-sensitive deployment scenarios but have not displaced transformers at the frontier. The most likely path forward is continued convergence toward hybrid architectures, where the transformer’s attention mechanism is used more sparingly and efficiently alongside linear-complexity recurrent layers.