Google's Nested Learning Paradigm Reframes Deep Learning as Interconnected Optimization Layers — and Claims to Solve Catastrophic Forgetting
Models & Research · March 9, 2026 · Mountain View, United States


Published at NeurIPS 2025, Google Research's 'Nested Learning' treats neural network architecture and optimization as a single unified system of multi-level learning problems, introducing the self-modifying 'Hope' architecture that outperforms transformers on continual learning benchmarks.

Key Takeaways

Google Research introduces Nested Learning, a paradigm that unifies model architecture and optimization into interconnected optimization levels with distinct update frequencies — revealing that transformers and memory modules are fundamentally linear layers with different learning speeds. The accompanying 'Hope' architecture demonstrates superior continual learning, lower perplexity, and better long-context reasoning than standard transformers by treating memory as a continuum of update frequencies, directly addressing the long-standing problem of catastrophic forgetting.


The history of deep learning has been built on a clean conceptual division: on one side, the architecture — the network structure that determines how information flows; on the other, the optimization algorithm — the training rule that determines how the network learns. Researchers have spent decades refining each independently, producing ever-larger transformers, increasingly sophisticated optimizers, and a constellation of architectural innovations from attention mechanisms to state-space models.

Now a team at Google Research argues that this division is an illusion. In a paper published at NeurIPS 2025, Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni introduce 'Nested Learning' — a paradigm that treats a machine learning model not as a single continuous process, but as a system of interconnected, multi-level optimization problems that run simultaneously at different timescales. Architecture and optimization, they argue, are simply different 'levels' of the same underlying learning system. [1][2]

The claim is provocative, and the results justify the provocation. A proof-of-concept architecture called 'Hope,' designed using Nested Learning principles, outperforms standard transformers and modern recurrent models on language modeling, common-sense reasoning, and long-context tasks — while demonstrating a dramatically improved ability to learn new skills without forgetting old ones.

The Problem: Catastrophic Forgetting

Despite the remarkable capabilities of modern large language models, they share a fundamental limitation: they cannot effectively learn new things after training. When a model's parameters are updated with new data, it tends to lose proficiency on previously learned tasks — a phenomenon known as catastrophic forgetting. The human brain does not suffer from this constraint. Through neuroplasticity, the brain continuously adapts its structure in response to new experiences, forming memories and skills without overwriting old ones.

Current LLMs compensate for this limitation in two ways: either through the immediate context of their input window (essentially short-term memory) or through static knowledge encoded during pre-training (essentially crystallized long-term memory). Neither approach supports genuine continual learning — the ability to actively acquire new knowledge and skills over time. Researchers have traditionally tried to combat catastrophic forgetting through architectural tweaks or better optimization rules, but these approaches have treated the two as separate concerns. [1]

The Insight: Architecture and Optimization Are the Same Thing

Nested Learning's central insight is deceptively simple. The researchers demonstrate that well-known architectural components — such as the attention mechanism in transformers — can be formalized as simple associative memory modules. Similarly, the training process itself (backpropagation) can be modeled as an associative memory that maps data points to their local error signals. In both cases, the underlying computation is the same: learning to map one thing to another based on how surprising or unexpected the input is. [1]

If architecture and optimization are both forms of associative memory, the only difference between them is their update frequency — how often their parameters are adjusted. Attention in a transformer updates its associations with every new token in a sequence (high frequency), while feedforward layers store knowledge from pre-training and rarely change (low frequency). By defining an explicit update frequency for each component, Nested Learning orders these interconnected optimization problems into 'levels,' creating a structured hierarchy that forms the core of the new paradigm.
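To make the frequency-ordering idea concrete, here is a minimal numerical sketch (our simplification for illustration, not the paper's actual equations): both an attention-like fast memory and a feedforward-like slow memory are instances of the same object, a linear associative memory trained by gradient descent on its own prediction error ("surprise"), and the only thing distinguishing the two levels is how often they are allowed to update.

```python
import numpy as np

# Sketch: a linear associative memory that learns key -> value mappings
# by descending on its own prediction error ("surprise"). Under Nested
# Learning's reading, attention and pre-trained feedforward weights are
# both memories of this kind, differing only in update frequency.
class AssociativeMemory:
    def __init__(self, dim, lr, update_every):
        self.M = np.zeros((dim, dim))     # linear map: value ~ M @ key
        self.lr = lr                      # step size of the inner optimization
        self.update_every = update_every  # this level's update frequency
        self.step = 0

    def update(self, key, value):
        self.step += 1
        if self.step % self.update_every != 0:
            return  # frozen at this timescale; nothing happens
        key = key / np.linalg.norm(key)   # unit keys keep the update stable
        surprise = value - self.M @ key   # prediction error drives learning
        self.M += self.lr * np.outer(surprise, key)

# Two "levels" over the same token stream: a high-frequency level
# (updates on every token, like attention) and a low-frequency level
# (updates rarely, like pre-trained feedforward weights).
fast = AssociativeMemory(dim=8, lr=0.5, update_every=1)
slow = AssociativeMemory(dim=8, lr=0.05, update_every=100)
```

Feeding the same key–value pair repeatedly drives the fast memory's surprise toward zero within a few steps, while the slow memory ignores everything until its hundredth step: exactly the frequency ordering Nested Learning uses to arrange components into levels.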

Standard Deep Learning vs Nested Learning: A Unified View
```mermaid
graph TD
    A["Standard Deep Learning"] --> B["Architecture\n(fixed structure)"]
    A --> C["Optimization\n(training rule)"]
    B -.->|treated separately| C

    D["Nested Learning"] --> E["Level 1: High-Frequency\n(attention/sequence memory)"]
    D --> F["Level 2: Mid-Frequency\n(continuum memory)"]
    D --> G["Level 3: Low-Frequency\n(feedforward/long-term)"]
    E --> F
    F --> G

    style A fill:#ff6b6b,color:#fff
    style D fill:#4ecdc4,color:#fff
```
Source: Based on Behrouz et al., NeurIPS 2025

Continuum Memory Systems: Memory as a Spectrum

This hierarchical view unlocks a powerful design principle that the researchers call a 'Continuum Memory System' (CMS). In a standard transformer, memory exists at two extremes: the attention mechanism provides short-term memory (holding the immediate context), while feedforward neural networks provide long-term memory (storing pre-training knowledge). There is nothing in between. [1]

CMS extends this into a full spectrum of memory modules, each updating at a different, precisely calibrated frequency. The result is a much richer memory architecture that mirrors how the human brain operates — at multiple timescales simultaneously. Some modules adapt rapidly to new input (like working memory), others change more slowly (like episodic memory), and the deepest layers change only rarely (like semantic knowledge). This multi-timescale approach directly addresses catastrophic forgetting by ensuring that slowly-updating components preserve old knowledge while faster components integrate new information.
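The multi-timescale idea can be sketched as a chain of memory states whose update periods grow geometrically; the module names and update rule below are our assumptions for illustration, not the paper's CMS equations. Each level buffers incoming data and folds it into its state only when its period elapses, so slow levels preserve old knowledge while fast levels absorb new input.

```python
import numpy as np

# Hypothetical sketch of a continuum of memory modules: one state per
# timescale, with update periods (1, 8, 64) standing in for working,
# episodic, and semantic memory. Fast levels track recent input; slow
# levels change rarely and so resist being overwritten.
class ContinuumMemory:
    def __init__(self, dim, periods=(1, 8, 64)):
        self.periods = periods
        self.states = [np.zeros(dim) for _ in periods]   # one state per level
        self.buffers = [np.zeros(dim) for _ in periods]  # pending input per level
        self.step = 0

    def write(self, x):
        self.step += 1
        for i, p in enumerate(self.periods):
            self.buffers[i] += x
            if self.step % p == 0:
                # fold the buffered average into this level, then clear
                self.states[i] = 0.9 * self.states[i] + 0.1 * self.buffers[i] / p
                self.buffers[i][:] = 0

    def read(self):
        # readout mixes every timescale into one memory signal
        return sum(self.states)
```

After 64 identical writes, the fastest level has nearly converged to the input while the slowest level has updated exactly once — a toy version of fast components integrating new information while slow components change conservatively.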

Hope: The Self-Modifying Architecture

To validate these ideas, the researchers designed Hope — a self-modifying recurrent architecture built as a variant of the Titans architecture. Titans are long-term memory modules that prioritize memories by how surprising they are, but they support only two levels of parameter updates. Hope removes this limitation, allowing unbounded levels of in-context learning, and augments the design with CMS blocks to scale to larger context windows. [1]

The most striking capability of Hope is self-modification. Higher-level components can influence and control the learning speed and focus of lower-level ones. The architecture can essentially optimize its own memory through a self-referential process, creating an infinite loop of learning levels that grow deeper as needed. This is analogous to how the human prefrontal cortex can modulate the learning rates of other brain regions depending on task demands.
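A toy version of that modulation idea can be sketched as follows; the class name, the running-surprise estimate, and the tanh schedule are all our assumptions for illustration, not Hope's actual mechanism. A higher level watches the lower level's recent surprise and retunes its learning rate: fast when the data is novel, slow when it is familiar, protecting consolidated knowledge.

```python
import numpy as np

# Illustrative two-level system: the "higher level" is just a running
# average of surprise that modulates the "lower level's" learning rate,
# loosely analogous to the prefrontal modulation mentioned in the text.
class SelfModifyingMemory:
    def __init__(self, dim, base_lr=0.1):
        self.M = np.zeros((dim, dim))  # lower level: associative memory
        self.lr = base_lr
        self.avg_surprise = 0.0        # higher level's state

    def update(self, key, value):
        key = key / np.linalg.norm(key)
        surprise = value - self.M @ key
        mag = float(np.linalg.norm(surprise))
        # higher level: track surprise, then retune the lower level's lr
        self.avg_surprise = 0.9 * self.avg_surprise + 0.1 * mag
        self.lr = 0.1 + 0.4 * np.tanh(self.avg_surprise)  # fast when surprised
        # lower level: standard associative update at the modulated rate
        self.M += self.lr * np.outer(surprise, key)
        return mag
```

On a novel input the learning rate jumps; as the same association is seen repeatedly, the surprise signal decays and the rate settles back toward its base value, so further updates barely disturb what has been stored.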

Benchmark Results

| Task Category | Hope | Titans | Samba | Transformer |
|---|---|---|---|---|
| Language Modeling (Perplexity ↓) | Best | 2nd | 3rd | 4th |
| Common-Sense Reasoning (Accuracy ↑) | Best | 2nd | 3rd | 4th |
| NIAH Pass-Key | Near-perfect | Good | Moderate | Moderate |
| NIAH Word Retrieval | Best | Good | Moderate | Poor |
| Continual Learning | Best (by a wide margin) | Moderate | Poor | Poor |

Across a diverse set of language modeling and common-sense reasoning tasks, Hope demonstrates lower perplexity and higher accuracy than Titans, Samba, and standard transformers. On long-context Needle-In-A-Haystack (NIAH) benchmarks — which measure a model's ability to retrieve specific information from very long documents — Hope shows superior memory management, particularly on the hardest variants involving word-level retrieval. Most critically, Hope exhibits dramatically better continual learning performance, validating the core claim that Nested Learning can mitigate or eliminate catastrophic forgetting. [1][2]

Why This Matters

The implications extend beyond benchmark scores. If Nested Learning's core insight is correct — that architecture and optimization are fundamentally the same concept operating at different timescales — it opens an entirely new dimension for model design. Instead of choosing between bigger models or better optimizers, researchers can now explore how to structure the relationships between components at different learning speeds. This is a qualitatively new kind of design parameter that has been, in the researchers' words, 'previously invisible.'

More practically, solving catastrophic forgetting would transform how large language models are deployed. Today, updating a production LLM with new knowledge requires expensive retraining or fine-tuning, often introducing regressions on existing capabilities. A model with genuine continual learning ability could update itself incrementally — incorporating new data, correcting errors, and adapting to changing domains — without the costly and fragile retraining cycle.

'We believe the Nested Learning paradigm offers a robust foundation for closing the gap between the limited, forgetting nature of current LLMs and the remarkable continual learning abilities of the human brain,' the researchers write. [1] Whether that ambition is fully realized remains to be seen, but the framework and early results suggest a promising new direction — one that invites the research community to explore a dimension that, until now, no one knew was there.

📚 Sources & References

[1] Behrouz et al., "Nested Learning: The Illusion of Deep Learning Architectures," NeurIPS 2025. arxiv.org