Memory-Driven Planning in AI Agents: A Comprehensive Research Report
Architectures Where Past Experience Directly Informs Action Planning
1. Introduction
Memory-driven planning represents a paradigm shift in AI agent design: rather than treating each task as a fresh problem, agents accumulate experiences — successes, failures, partial solutions, and environmental observations — and use these to inform future action planning. This report examines the major architectures, their mechanisms for learning from past mistakes, and the state of the art as of early 2026.
2. Foundational Architectures
2.1 ReAct (Reasoning + Acting)
Paper: Yao et al., 2022 — “ReAct: Synergizing Reasoning and Acting in Language Models”
Core Mechanism:
ReAct interleaves chain-of-thought reasoning with action execution in a thought-action-observation loop:
```
Thought: I need to find the population of France.
Action: Search["population of France"]
Observation: France has a population of approximately 68 million.
Thought: Now I have the answer.
Action: Finish["68 million"]
```
Memory Model: ReAct’s memory is ephemeral and within-episode only. The trace of thought-action-observation triples serves as a short-term working memory within a single task. There is no cross-episode memory — the agent does not learn from past tasks.
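The loop above can be sketched in a few lines. This is a minimal, illustrative skeleton, not the paper's implementation: `scripted_policy` is a hypothetical stand-in for the LLM, and `Search` is a stub tool that returns a canned observation.

```python
from typing import Callable

def react_loop(policy: Callable[[str], str], tools: dict, max_steps: int = 5) -> str:
    """Run a ReAct-style thought-action-observation loop.

    `policy` stands in for the LLM: it maps the trace so far to the next
    "Thought: ...\nAction: Tool[arg]" step. The trace is the only memory,
    and it is discarded when the episode ends (no cross-episode learning).
    """
    trace = ""
    for _ in range(max_steps):
        step = policy(trace)
        trace += step + "\n"
        action = step.splitlines()[-1].removeprefix("Action: ")
        tool, _, arg = action.partition("[")
        arg = arg.rstrip("]")
        if tool == "Finish":
            return arg
        # Ground the next thought in a real observation from the tool.
        observation = tools[tool](arg)
        trace += f"Observation: {observation}\n"
    return ""

# Scripted policy imitating the France example from the text.
def scripted_policy(trace: str) -> str:
    if "Observation" not in trace:
        return "Thought: I need the population of France.\nAction: Search[population of France]"
    return "Thought: Now I have the answer.\nAction: Finish[68 million]"

tools = {"Search": lambda q: "France has a population of approximately 68 million."}
print(react_loop(scripted_policy, tools))  # 68 million
```

Note how nothing survives the call: once `react_loop` returns, the trace is gone, which is exactly the limitation the rest of this report addresses.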
Strengths:
- Simple, interpretable reasoning traces
- Grounds reasoning in real observations, reducing hallucination
- Works well for single-session knowledge-intensive tasks (QA, fact verification)
Limitations as a Memory-Driven Planner:
- No persistent memory across episodes
- Cannot learn from past mistakes — repeats the same errors on similar tasks
- Planning is purely reactive (one step at a time), not anticipatory
- No mechanism for strategy refinement over time
Role in the Ecosystem: ReAct is best understood as the baseline from which memory-driven architectures depart. Nearly every subsequent framework builds on or extends the ReAct loop.
2.2 Reflexion
Paper: Shinn et al., 2023 — “Reflexion: Language Agents with Verbal Reinforcement Learning”
Core Mechanism:
Reflexion adds an explicit self-reflection loop on top of a ReAct-style agent. After a task attempt (success or failure), the agent generates a natural-language reflection analyzing what went wrong and what to do differently. These reflections are stored in a persistent memory buffer and prepended to the prompt on subsequent attempts.
Architecture:
```
Episode N:
  1. Attempt task (ReAct-style)
  2. Receive environment feedback (success/failure/partial score)
  3. Self-reflect: generate verbal analysis of what went wrong
  4. Store reflection in memory buffer

Episode N+1:
  1. Load past reflections into context
  2. Attempt task, informed by past failures
  3. Repeat cycle
```
How It Learns from Mistakes:
- Reflections are verbal reinforcement signals — natural language descriptions of errors and corrective strategies
- The agent does not update weights; it updates its prompt context with accumulated wisdom
- Example reflection: “In my previous attempt, I searched for the wrong entity. Next time, I should disambiguate the query before searching.”
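The retry cycle with a sliding reflection buffer can be sketched as follows. This is a toy illustration, not the paper's code: `attempt`, `reflect`, and `evaluate` are hypothetical stand-ins for the LLM actor, the LLM critic, and the environment's feedback signal.

```python
from collections import deque

def reflexion(attempt, reflect, evaluate, trials: int = 5, window: int = 3):
    """Retry a task, prepending up to `window` past reflections each trial.

    No weights are updated: the only "learning" is the growing text buffer
    of verbal critiques fed back into the next attempt's context.
    """
    reflections = deque(maxlen=window)  # sliding window of recent reflections
    for trial in range(trials):
        trajectory = attempt(list(reflections))
        if evaluate(trajectory):
            return trajectory, trial
        # Verbal reinforcement: store a critique, not a gradient update.
        reflections.append(reflect(trajectory))
    return None, trials

# Toy task: the agent only "succeeds" once two reflections have accumulated.
attempt = lambda refs: {"answer": "right" if len(refs) >= 2 else "wrong"}
evaluate = lambda traj: traj["answer"] == "right"
reflect = lambda traj: "I answered too hastily; verify the entity first."

result, trials_used = reflexion(attempt, reflect, evaluate)
print(trials_used)  # 2
```

The `deque(maxlen=window)` makes the context-window bound concrete: once the buffer is full, the oldest reflection is silently dropped, which is why Reflexion cannot accumulate unlimited experience.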
Memory Structure:
- A sliding window of recent reflections (typically last 3-5)
- No hierarchical organization or retrieval mechanism — all reflections are linearly prepended
- Memory is task-specific, not shared across different task types
Empirical Results:
- Significant improvements on sequential decision-making benchmarks (AlfWorld, HotPotQA, programming tasks)
- Performance improves with each trial, plateauing around 3-5 attempts
- On HotPotQA: from ~34% (ReAct baseline) to ~68% after multiple Reflexion rounds
Limitations:
- Bounded by context window — cannot accumulate unlimited experience
- Reflections can be shallow or inaccurate (the LLM may misdiagnose its own failures)
- No generalization across tasks — reflections from Task A do not help with Task B
- Linear memory (no retrieval) means irrelevant reflections consume context
2.3 LATS (Language Agent Tree Search)
Paper: Zhou et al., 2023 — “Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models”
Core Mechanism:
LATS applies Monte Carlo Tree Search (MCTS) principles to LLM-based agent planning. Instead of a single linear trace (ReAct) or sequential retries (Reflexion), LATS builds a tree of possible action sequences, using the LLM as the policy (action proposer), the value function (state evaluator), and the reflection generator.
Architecture:
1. Selection: Traverse the tree using UCB1 to balance exploration and exploitation
2. Expansion: Generate multiple candidate actions from the current state
3. Evaluation: The LLM scores the resulting states (heuristic value function)
4. Simulation: Optionally roll out to terminal states
5. Backpropagation: Update node values based on outcomes
6. Reflection: On failure, generate a reflection and store it for reuse
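The selection step is standard UCB1, which is what makes failed branches self-avoiding: once backpropagation assigns a branch low value, its score drops below better-valued siblings. A minimal sketch (node representation and exploration constant are illustrative, not from the paper):

```python
import math

def ucb1(total_value: float, visits: int, parent_visits: int, c: float = 1.414) -> float:
    """UCB1 score used during selection: exploit high-value branches while
    still exploring rarely visited ones."""
    if visits == 0:
        return float("inf")  # unvisited children are always expanded first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select(children: list, parent_visits: int) -> dict:
    """Pick the child maximizing UCB1; branches that failed during earlier
    rollouts (low backed-up value) are naturally avoided once visited."""
    return max(children, key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))

# A failed branch (value 0 over 3 visits) loses to a promising one.
children = [
    {"name": "failed", "value": 0.0, "visits": 3},
    {"name": "promising", "value": 2.4, "visits": 3},
]
print(select(children, parent_visits=6)["name"])  # promising
```

The exploration term keeps the failed branch from being written off entirely after one visit, which matters because the LLM value estimates steering the search are noisy.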
Memory Model:
LATS has two forms of memory:
1. Within-episode tree memory: The search tree itself stores explored paths, their outcomes, and value estimates. Failed branches are remembered and avoided.
2. Cross-episode reflection memory: Like Reflexion, failed trajectories produce verbal reflections that can inform future attempts.
How It Learns from Mistakes:
- Failed branches in the tree are assigned low values during backpropagation, steering future search away from them
- The tree structure provides structured memory — not just “what went wrong” but “which specific decision point led to failure”
- Reflections from failed rollouts are more targeted because they can identify the exact branching point where things diverged
Empirical Results:
- Programming tasks (HumanEval): 94.4% pass@1 (vs. 80.1% for ReAct, 91.0% for Reflexion)
- WebShop: 75.9 average score (vs. 65.4 for ReAct)
- HotPotQA: 71.2% (competitive with Reflexion but with fewer total LLM calls in many cases)
Strengths:
- Principled exploration-exploitation tradeoff via MCTS
- Can recover from mid-plan errors without restarting
- Structured memory (tree) is richer than linear reflection lists
Limitations:
- Computationally expensive — each node expansion requires LLM calls
- Tree does not persist across episodes in the base formulation
- LLM as value function is noisy and uncalibrated
- Scales poorly with action space size and planning horizon
2.4 Voyager
Paper: Wang et al., 2023 — “Voyager: An Open-Ended Embodied Agent with Large Language Models”
Core Mechanism:
Voyager is an open-ended learning agent for Minecraft that builds a persistent, growing skill library from experience. It combines automatic curriculum generation, code-as-action, and a retrieval-augmented skill library.
Architecture:
1. Automatic Curriculum: The LLM proposes the next learning objective
2. Skill Generation: The LLM writes executable code (JavaScript) to achieve the objective
3. Environment Feedback: The code runs in Minecraft and returns success or an error
4. Iterative Refinement: If the code fails, the LLM debugs it using error messages and environment state
5. Skill Library: Successful skills are stored with descriptions for future retrieval
6. Skill Retrieval: When facing new tasks, relevant past skills are retrieved and composed
Memory Model:
Voyager’s memory is a skill library — a vector-indexed collection of:
- Skill code (executable functions)
- Natural language descriptions
- Embedding vectors for retrieval
This is fundamentally different from Reflexion’s reflection buffer or LATS’s search tree. Voyager stores positive knowledge (what works) rather than negative knowledge (what failed).
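The skill-library pattern can be sketched as a description-indexed store with similarity retrieval. Everything here is illustrative: the toy letter-count `embed` stands in for a real embedding model, and the skill names are made up, not Voyager's actual skills.

```python
import math

def embed(text: str) -> list:
    # Toy bag-of-letters embedding; a real system would use a learned model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SkillLibrary:
    """Stores only successful skills (positive knowledge) and retrieves
    them by embedding similarity to a new task description."""
    def __init__(self):
        self.skills = []  # list of (description, code, embedding)

    def add(self, description: str, code: str):
        self.skills.append((description, code, embed(description)))

    def retrieve(self, task: str, k: int = 1) -> list:
        query = embed(task)
        ranked = sorted(self.skills, key=lambda s: cosine(query, s[2]), reverse=True)
        return [code for _, code, _ in ranked[:k]]

lib = SkillLibrary()
lib.add("mine iron ore with a stone pickaxe", "function mineIron(bot) { ... }")
lib.add("craft a wooden sword", "function craftSword(bot) { ... }")
print(lib.retrieve("collect iron ore")[0])  # function mineIron(bot) { ... }
```

Two properties of the architecture are visible even in this sketch: the library only ever grows with verified successes, and retrieval quality is entirely hostage to how well the stored descriptions match future task phrasings.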
How It Learns from Mistakes:
- Iterative refinement within an episode: failed code is debugged using error traces
- But failures are not stored — only successful skills enter the library
- Learning from mistakes is implicit: the refined, working version of a skill represents the accumulated debugging effort
- The curriculum itself adapts: if a skill is too hard, the agent may attempt prerequisite skills first
Empirical Results:
- 3.3x more unique items obtained than next-best baseline in Minecraft
- Unlocks key tech tree milestones (diamond pickaxe) that no other LLM agent achieved
- Skill library grows continuously — agent gets more capable over time
- Skills transfer: library built in one world helps in new worlds
Strengths:
- True lifelong learning — capability grows without bound (not limited by context window)
- Composable skills — complex behaviors built from simpler ones
- Code as memory — executable, precise, not subject to language ambiguity
- Retrieval-based — scales beyond context window limits
Limitations:
- Domain-specific (Minecraft, though the pattern generalizes)
- Does not store failure cases — cannot warn against known bad approaches
- Skill retrieval depends on description quality
- No explicit planning over multi-step objectives (curriculum is one skill at a time)
3. 2024-2025 Advances
3.1 Cognitive Architectures: CoALA Framework
Paper: Sumers et al., 2024 — “Cognitive Architectures for Language Agents”
CoALA provides a unifying framework for understanding agent memory architectures by drawing on cognitive science. It decomposes agent memory into:
- Working Memory: Current context (prompt), short-term task state
- Episodic Memory: Records of past experiences (trajectories, outcomes)
- Semantic Memory: General knowledge (facts, rules, heuristics)
- Procedural Memory: Action policies and skills (code, plans, strategies)
Under this framework:
- ReAct uses only working memory
- Reflexion adds rudimentary episodic memory (reflections)
- LATS adds structured working memory (search tree) and episodic memory
- Voyager emphasizes procedural memory (skill library) with retrieval-augmented access
The key insight: state-of-the-art agents need all four memory types, with retrieval mechanisms to access them efficiently.
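The four-way decomposition can be made concrete as a single agent state object. The field contents are illustrative examples, not an API from the CoALA paper:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """CoALA's four memory types held side by side in one agent state."""
    working: list = field(default_factory=list)     # current prompt / task context
    episodic: list = field(default_factory=list)    # past trajectories and outcomes
    semantic: dict = field(default_factory=dict)    # facts, rules, heuristics
    procedural: dict = field(default_factory=dict)  # named skills, plans, policies

mem = AgentMemory()
mem.working.append("Task: find the capital of France")
mem.episodic.append({"task": "capital of Spain", "outcome": "success"})
mem.semantic["capital:France"] = "Paris"
mem.procedural["lookup_capital"] = lambda country: mem.semantic.get(f"capital:{country}")
print(mem.procedural["lookup_capital"]("France"))  # Paris
```

Mapping the surveyed systems onto these fields (ReAct uses only `working`, Voyager mostly `procedural`, and so on) is what makes CoALA useful as a comparison lens.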
3.2 Generative Agents and Emergent Memory
Paper: Park et al., 2023 — “Generative Agents: Interactive Simulacra of Human Behavior”
While not a planning-focused architecture per se, Generative Agents introduced a memory stream architecture that heavily influenced subsequent work:
- Memory Stream: Timestamped log of all observations and actions
- Retrieval: Scored by recency, importance, and relevance to current query
- Reflection: Periodic synthesis of memories into higher-level insights
- Planning: Daily plans generated from character description + recent memories + reflections
This architecture demonstrated that LLM agents with rich memory produce emergent social behaviors (planning parties, forming relationships, coordinating activities).
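The retrieval rule combines the three signals into a single score. A minimal sketch, with illustrative weights and a hardcoded relevance table standing in for embedding similarity (the paper's actual weights and normalization differ):

```python
def retrieval_score(memory: dict, now: int, query_relevance: float,
                    decay: float = 0.995) -> float:
    """Score a memory-stream entry for retrieval: recency decays
    exponentially with age, importance is a normalized 1-10 rating,
    relevance comes from similarity to the current query."""
    recency = decay ** (now - memory["timestamp"])
    importance = memory["importance"] / 10.0
    return recency + importance + query_relevance  # equal weights, for the sketch

stream = [
    {"text": "saw Klaus at the library", "timestamp": 90, "importance": 3},
    {"text": "planning a Valentine's party", "timestamp": 50, "importance": 8},
]
# Relevance would come from embedding similarity; hardcoded for this example.
relevance = {"saw Klaus at the library": 0.2, "planning a Valentine's party": 0.9}
now = 100
best = max(stream, key=lambda m: retrieval_score(m, now, relevance[m["text"]]))
print(best["text"])  # planning a Valentine's party
```

The example shows the intended behavior: an older but important and relevant memory outranks a fresher but trivial one.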
3.3 MemoryBank and Long-Term Memory Systems
Several 2024-2025 works formalized long-term memory for agents:
- MemGPT (Packer et al., 2023-2024): Treats context window management as a virtual memory system with paging. The agent can explicitly write to and read from long-term storage, managing its own context the way an OS manages RAM. This enables agents to maintain coherent behavior across very long interactions.
- REM (Retrieval-Enhanced Memory): Architectures that combine vector databases with structured memory stores, allowing agents to retrieve relevant past experiences based on semantic similarity to the current situation.
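The OS analogy can be made literal with a toy paging sketch. The class below is an illustration of the pattern, not MemGPT's actual API: capacity is counted in messages rather than tokens, and recall is keyword search rather than a function call the model issues.

```python
class PagedContext:
    """MemGPT-style context paging: overflowing messages are evicted to
    long-term storage and can be paged back in on demand."""
    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.working = []   # what the model actually sees ("RAM")
        self.archive = []   # long-term store ("disk")

    def append(self, message: str):
        self.working.append(message)
        while len(self.working) > self.capacity:
            self.archive.append(self.working.pop(0))  # page out the oldest

    def recall(self, keyword: str) -> list:
        """Page archived messages matching a query back into working memory."""
        hits = [m for m in self.archive if keyword in m]
        for m in hits:
            self.append(m)
        return hits

ctx = PagedContext(capacity=3)
for msg in ["user: my name is Ada", "user: hello", "user: what's up", "user: tell a joke"]:
    ctx.append(msg)
print(ctx.recall("name"))  # ['user: my name is Ada']
```

After the recall, the name is back in working context even though it had been paged out, which is the mechanism that lets such agents stay coherent far past the raw context limit.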
3.4 ExpeL (Experience Learning)
Paper: Zhao et al., 2024 — “ExpeL: LLM Agents Are Experiential Learners”
ExpeL explicitly addresses the gap between Reflexion (task-specific reflections) and generalizable learning:
- Agent attempts multiple tasks, collecting success and failure trajectories
- Cross-task analysis extracts general insights (not task-specific reflections)
- Insights are stored as transferable rules: “When navigating, always check the map before moving” rather than “In Task 7, I should have gone left”
- Insights are retrieved and applied to novel tasks
This represents a step toward genuine experience-driven generalization — the agent learns principles, not just task-specific patches.
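The contrast with Reflexion is easiest to see in code. In this sketch, `extract_insight` is a scripted stand-in for the LLM that, in ExpeL, compares trajectories; the trajectory fields and the keyword-based retrieval are illustrative simplifications.

```python
def extract_insight(failure: dict, success: dict) -> str:
    """Stand-in for the LLM comparison step: distill a paired failure and
    success into a general, transferable rule rather than a task patch."""
    return (f"When the goal involves '{success['goal_type']}', "
            f"prefer '{success['strategy']}' over '{failure['strategy']}'.")

insight_pool = []
failure = {"goal_type": "navigation", "strategy": "move immediately"}
success = {"goal_type": "navigation", "strategy": "check the map first"}
insight_pool.append(extract_insight(failure, success))

def insights_for(goal_type: str) -> list:
    # Keyword retrieval for the sketch; a real system would use semantic similarity.
    return [i for i in insight_pool if goal_type in i]

print(insights_for("navigation")[0])
```

The stored string mentions no specific task, so it can be prepended to any future navigation-type problem, which is the generalization step Reflexion's task-local buffer lacks.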
3.5 ReWOO and Plan-then-Execute
Paper: Xu et al., 2023 — “ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models”
ReWOO separates planning from execution:
1. Planner: Generates a full plan with anticipated dependencies upfront
2. Worker: Executes each step, collecting actual observations
3. Solver: Synthesizes observations into final answer
When combined with memory, the planner can draw on past plan templates (what plan structures worked for similar problems), creating a form of procedural memory for planning itself.
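The decoupling hinges on the planner emitting evidence placeholders (#E1, #E2, …) so later steps can reference earlier results before any tool has run. A toy sketch, with scripted stand-ins for both LLM roles and a stubbed `Search` tool:

```python
def planner(task: str) -> list:
    """Stand-in for the LLM planner: the full plan is produced upfront,
    with #E placeholders wiring step outputs into later step inputs."""
    return [
        {"id": "#E1", "tool": "Search", "arg": "capital of France"},
        {"id": "#E2", "tool": "Search", "arg": "population of #E1"},
    ]

def worker(plan: list, tools: dict) -> dict:
    """Execute every step, substituting earlier evidence into arguments."""
    evidence = {}
    for step in plan:
        arg = step["arg"]
        for eid, val in evidence.items():
            arg = arg.replace(eid, val)
        evidence[step["id"]] = tools[step["tool"]](arg)
    return evidence

def solver(task: str, evidence: dict) -> str:
    """Stand-in for the LLM solver synthesizing the final answer."""
    return evidence["#E2"]

tools = {"Search": lambda q: {"capital of France": "Paris",
                              "population of Paris": "about 2.1 million"}.get(q, "unknown")}
task = "What is the population of the capital of France?"
evidence = worker(planner(task), tools)
print(solver(task, evidence))  # about 2.1 million
```

Because the planner never sees intermediate observations, a memory-augmented variant only needs to store and retrieve these plan templates, which is the procedural-memory-for-planning idea described above.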
4. 2025-2026 State of the Art
4.1 Agent-as-a-Judge and Self-Improvement Loops
The latest generation of agents incorporates self-evaluation as a first-class component:
- Agents evaluate their own outputs using rubrics derived from past experience
- Failed evaluations trigger targeted retry with specific improvement directives
- Evaluation criteria themselves evolve based on what distinguishes success from failure
4.2 Hierarchical Memory with Forgetting
Drawing on cognitive science, newer architectures implement principled forgetting:
- Memories decay based on recency and frequency of access
- Important memories are consolidated into higher-level abstractions
- This prevents memory bloat while preserving critical lessons
- Analogous to human episodic memory consolidation during sleep
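A decay-and-prune policy of this kind can be sketched with a retention score that combines recency and access frequency. The formula, half-life, and threshold below are illustrative choices, not taken from any specific system:

```python
import math

def retention(memory: dict, now: float, half_life: float = 50.0) -> float:
    """Retention score: recency halves every `half_life` steps, boosted by
    how often the memory has been accessed (with diminishing returns)."""
    age = now - memory["last_access"]
    recency = math.exp(-math.log(2) * age / half_life)
    frequency = math.log1p(memory["access_count"])
    return recency * (1.0 + frequency)

def prune(store: list, now: float, threshold: float = 0.5):
    """Split memories into those kept and those forgotten (or, in a fuller
    system, handed off to a consolidation step that abstracts them)."""
    kept = [m for m in store if retention(m, now) >= threshold]
    forgotten = [m for m in store if retention(m, now) < threshold]
    return kept, forgotten

memories = [
    {"text": "critical lesson: never delete prod data", "last_access": 95, "access_count": 12},
    {"text": "weather was cloudy on day 3", "last_access": 10, "access_count": 1},
]
kept, forgotten = prune(memories, now=100)
print([m["text"] for m in forgotten])  # the stale, rarely used memory goes
```

Frequently accessed lessons survive long gaps in use, while stale one-off observations fall below threshold, which is the bloat-control behavior the text describes.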
4.3 Tool-Augmented Memory
Modern agents increasingly use external tools as memory infrastructure:
- Vector databases for semantic retrieval of past experiences
- Knowledge graphs for structured relationship storage
- Code repositories as procedural memory (Voyager-style)
- Structured logs with queryable metadata
4.4 Multi-Agent Memory Sharing
In multi-agent systems, memory-driven planning extends to collective experience:
- Shared memory pools where agents contribute and retrieve experiences
- Specialized memory agents that curate and organize collective knowledge
- Cross-agent learning: Agent B benefits from Agent A’s mistakes without repeating them
4.5 Agentic Reasoning Frameworks (2025-2026)
The most recent frameworks (late 2025 and into 2026) converge on several principles:
- Memory-conditioned planning: Plans are generated not from scratch but conditioned on retrieved relevant experiences, similar to case-based reasoning in classical AI.
- Continuous skill accumulation: Following Voyager’s lead, agents build persistent libraries of verified capabilities, but now with richer metadata (preconditions, failure modes, performance characteristics).
- Reflective abstraction: Rather than storing raw reflections (Reflexion) or raw trajectories, agents abstract experiences into reusable patterns — “when X happens, strategy Y tends to work because Z.”
- Adaptive retrieval: Instead of fixed retrieval strategies, agents learn which memories are useful in which contexts, effectively learning a meta-policy for memory access.
5. Comparative Analysis
| Dimension | ReAct | Reflexion | LATS | Voyager | ExpeL | Modern (2025-2026) |
|---|---|---|---|---|---|---|
| Memory Persistence | None (single episode) | Cross-trial, same task | Within-episode tree | Permanent skill library | Cross-task insights | Hierarchical, persistent |
| What’s Stored | Nothing | Verbal reflections | Tree + reflections | Working code + descriptions | General rules/insights | Abstracted patterns + skills + episodes |
| Learns from Failure | No | Yes (verbal) | Yes (tree values) | Implicitly (debugging) | Yes (cross-task) | Yes (structured) |
| Generalizes | No | No (task-specific) | No | Partially (skill reuse) | Yes | Yes |
| Retrieval | N/A | Linear (all loaded) | Tree traversal (UCB1) | Vector similarity | Relevance-based | Adaptive, learned |
| Scalability | N/A | Context-bounded | Compute-bounded | Unbounded (external DB) | Moderate | Unbounded |
| Planning Type | Reactive (1-step) | Reactive + retry | Deliberative (tree) | Curriculum-driven | Reactive + informed | Deliberative + experience-conditioned |
6. How Agents Learn from Past Mistakes: A Taxonomy
Across all architectures, there are five mechanisms for learning from stored mistakes:
6.1 Verbal Self-Reflection (Reflexion, ExpeL)
The agent generates natural language analysis of what went wrong. This is the most common approach because it leverages the LLM’s strength (language). Limitation: reflections can be superficial or incorrect.
6.2 Value Backpropagation (LATS)
Failed paths receive low value scores that propagate through the search tree. This is more principled (grounded in MCTS theory) but only operates within an episode.
6.3 Iterative Code Debugging (Voyager)
Failed code is refined using error messages. The mistake is “learned from” by producing working code, but the failure itself is discarded.
6.4 Negative Example Storage
Some newer systems explicitly store anti-patterns — “Do NOT do X because it leads to Y.” These are retrieved when the agent is about to make a similar mistake.
6.5 Contrastive Experience Pairs
The most sophisticated approach: store (failure, success) pairs for similar situations, allowing the agent to understand what distinguishes good from bad strategies.
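Mechanisms 6.4 and 6.5 can be combined in one small structure: index contrastive (failure, success) pairs by situation and surface both halves as advice. A toy sketch with invented example content, using exact-match lookup where a real system would use similarity retrieval:

```python
# Contrastive experience store: for each situation, what failed and what
# worked, so advice explains the distinction rather than just the outcome.
pairs = {
    "login request times out": {
        "failure": "retried the same request immediately in a tight loop",
        "success": "backed off exponentially and retried after 2s, 4s, 8s",
    }
}

def advise(situation: str) -> str:
    """Turn a stored (failure, success) pair into an anti-pattern warning
    plus a recommended alternative."""
    pair = pairs.get(situation)
    if pair is None:
        return "no prior experience"
    return f"Avoid: {pair['failure']}. Instead: {pair['success']}."

print(advise("login request times out"))
```

Storing the pair rather than either half alone is what lets the agent articulate why one strategy beats the other, which plain negative-example storage cannot do.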
7. Open Problems and Research Frontiers
- Memory Quality Control: How to ensure stored experiences are accurate and useful? Incorrect reflections can cause negative transfer.
- Forgetting and Consolidation: How to manage growing memory stores? Which experiences to retain, which to abstract, which to discard?
- Cross-Domain Transfer: Can experience from one domain (e.g., web navigation) help in another (e.g., coding)? Current systems are mostly domain-specific.
- Credit Assignment: In multi-step plans, identifying which specific decision caused failure remains hard. LATS partially addresses this with tree structure, but it is still an open problem.
- Memory-Planning Integration: Current systems often retrieve memories and plan separately. Tighter integration — where the planning process itself is guided by memory retrieval at each decision point — is an active area.
- Evaluation Benchmarks: There is no standard benchmark for measuring how well agents learn from experience over long horizons. Most evaluations are still single-task or short-horizon.
8. Conclusions
The evolution from ReAct to modern memory-driven planning architectures represents a clear trajectory:
- ReAct (2022): Established the reasoning-acting loop but with no memory.
- Reflexion (2023): Added verbal self-critique as a form of episodic memory, enabling within-task improvement.
- LATS (2023): Introduced structured deliberative planning with tree-based memory, enabling principled exploration.
- Voyager (2023): Demonstrated lifelong skill accumulation as procedural memory, enabling unbounded capability growth.
- ExpeL and successors (2024): Showed cross-task generalization from experience, moving toward genuine learning.
- 2025-2026 frameworks: Converge on hierarchical, multi-type memory systems with adaptive retrieval and abstracted experience patterns.
The state of the art in experience-informed planning is characterized by hybrid memory systems that combine episodic records, procedural skills, semantic knowledge, and reflective abstractions, accessed through learned retrieval policies and integrated into deliberative planning processes. The field is moving away from monolithic approaches (one type of memory, one retrieval strategy) toward cognitive-architecture-inspired designs that mirror the richness of human memory systems.
The most impactful open question is not how to store memories, but how to retrieve and apply the right memory at the right time — the retrieval-planning interface is where the next breakthrough is most likely to occur.