Memory-Driven Planning in AI Agents: A Comprehensive Research Report
Architectures Where Past Experience Directly Informs Action Planning
1. Introduction
Memory-driven planning represents a paradigm shift in AI agent design: rather than treating each task as a fresh problem, agents accumulate experiences — successes, failures, partial solutions, and environmental observations — and use these to inform future action planning. This report examines the major architectures, their mechanisms for learning from past mistakes, and the state of the art as of early 2026.
2. Foundational Architectures
2.1 ReAct (Reasoning + Acting)
Paper: Yao et al., 2022 — “ReAct: Synergizing Reasoning and Acting in Language Models”
Core Mechanism:
ReAct interleaves chain-of-thought reasoning with action execution in a thought-action-observation loop:
```
Thought: I need to find the population of France.
Action: Search["population of France"]
Observation: France has a population of approximately 68 million.
Thought: Now I have the answer.
Action: Finish["68 million"]
```
Memory Model: ReAct’s memory is ephemeral and within-episode only. The trace of thought-action-observation triples serves as a short-term working memory within a single task. There is no cross-episode memory — the agent does not learn from past tasks.
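The loop above can be sketched in a few lines. This is a minimal, illustrative skeleton, not the paper's implementation: `scripted_policy` is a hypothetical stand-in for the LLM, and `Search` is a stub tool that returns a canned observation.

```python
from typing import Callable

def react_loop(policy: Callable[[str], str], tools: dict, max_steps: int = 5) -> str:
    """Run a ReAct-style thought-action-observation loop.

    `policy` stands in for the LLM: it maps the trace so far to the next
    "Thought: ...\nAction: Tool[arg]" step. The trace is the only memory,
    and it is discarded when the episode ends (no cross-episode learning).
    """
    trace = ""
    for _ in range(max_steps):
        step = policy(trace)
        trace += step + "\n"
        action = step.splitlines()[-1].removeprefix("Action: ")
        tool, _, arg = action.partition("[")
        arg = arg.rstrip("]")
        if tool == "Finish":
            return arg
        # Ground the next thought in a real observation from the tool.
        observation = tools[tool](arg)
        trace += f"Observation: {observation}\n"
    return ""

# Scripted policy imitating the France example from the text.
def scripted_policy(trace: str) -> str:
    if "Observation" not in trace:
        return "Thought: I need the population of France.\nAction: Search[population of France]"
    return "Thought: Now I have the answer.\nAction: Finish[68 million]"

tools = {"Search": lambda q: "France has a population of approximately 68 million."}
print(react_loop(scripted_policy, tools))  # 68 million
```

Note how nothing survives the call: once `react_loop` returns, the trace is gone, which is exactly the limitation the rest of this report addresses.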
Strengths:
- Simple, interpretable reasoning traces
- Grounds reasoning in real observations, reducing hallucination
- Works well for single-session knowledge-intensive tasks (QA, fact verification)
Limitations as a Memory-Driven Planner:
- No persistent memory across episodes
- Cannot learn from past mistakes — repeats the same errors on similar tasks
- Planning is purely reactive (one step at a time), not anticipatory
- No mechanism for strategy refinement over time
Role in the Ecosystem: ReAct is best understood as the baseline from which memory-driven architectures depart. Nearly every subsequent framework builds on or extends the ReAct loop.
2.2 Reflexion
Paper: Shinn et al., 2023 — “Reflexion: Language Agents with Verbal Reinforcement Learning”
Core Mechanism:
Reflexion adds an explicit self-reflection loop on top of a ReAct-style agent. After a task attempt (success or failure), the agent generates a natural-language reflection analyzing what went wrong and what to do differently. These reflections are stored in a persistent memory buffer and prepended to the prompt on subsequent attempts.
Architecture:
```
Episode N:
  1. Attempt task (ReAct-style)
  2. Receive environment feedback (success/failure/partial score)
  3. Self-reflect: generate verbal analysis of what went wrong
  4. Store reflection in memory buffer

Episode N+1:
  1. Load past reflections into context
  2. Attempt task, informed by past failures
  3. Repeat cycle
```
How It Learns from Mistakes:
- Reflections are verbal reinforcement signals — natural language descriptions of errors and corrective strategies
- The agent does not update weights; it updates its prompt context with accumulated wisdom
- Example reflection: “In my previous attempt, I searched for the wrong entity. Next time, I should disambiguate the query before searching.”
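The retry cycle with a sliding reflection buffer can be sketched as follows. This is a toy illustration, not the paper's code: `attempt`, `reflect`, and `evaluate` are hypothetical stand-ins for the LLM actor, the LLM critic, and the environment's feedback signal.

```python
from collections import deque

def reflexion(attempt, reflect, evaluate, trials: int = 5, window: int = 3):
    """Retry a task, prepending up to `window` past reflections each trial.

    No weights are updated: the only "learning" is the growing text buffer
    of verbal critiques fed back into the next attempt's context.
    """
    reflections = deque(maxlen=window)  # sliding window of recent reflections
    for trial in range(trials):
        trajectory = attempt(list(reflections))
        if evaluate(trajectory):
            return trajectory, trial
        # Verbal reinforcement: store a critique, not a gradient update.
        reflections.append(reflect(trajectory))
    return None, trials

# Toy task: the agent only "succeeds" once two reflections have accumulated.
attempt = lambda refs: {"answer": "right" if len(refs) >= 2 else "wrong"}
evaluate = lambda traj: traj["answer"] == "right"
reflect = lambda traj: "I answered too hastily; verify the entity first."

result, trials_used = reflexion(attempt, reflect, evaluate)
print(trials_used)  # 2
```

The `deque(maxlen=window)` makes the context-window bound concrete: once the buffer is full, the oldest reflection is silently dropped, which is why Reflexion cannot accumulate unlimited experience.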
Memory Structure:
- A sliding window of recent reflections (typically last 3-5)
- No hierarchical organization or retrieval mechanism — all reflections are linearly prepended
- Memory is task-specific, not shared across different task types
Empirical Results:
- Significant improvements on sequential decision-making benchmarks (AlfWorld, HotPotQA, programming tasks)
- Performance improves with each trial, plateauing around 3-5 attempts
- On HotPotQA: from ~34% (ReAct baseline) to ~68% after multiple Reflexion rounds
Limitations:
- Bounded by context window — cannot accumulate unlimited experience
- Reflections can be shallow or inaccurate (the LLM may misdiagnose its own failures)
- No generalization across tasks — reflections from Task A do not help with Task B
- Linear memory (no retrieval) means irrelevant reflections consume context
2.3 LATS (Language Agent Tree Search)
Paper: Zhou et al., 2023 — “Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models”
Core Mechanism:
LATS applies Monte Carlo Tree Search (MCTS) principles to LLM-based agent planning. Instead of a single linear trace (ReAct) or sequential retries (Reflexion), LATS builds a tree of possible action sequences, using the LLM as the policy (action proposer), the value function (state evaluator), and the reflection generator.
Architecture:
1. Selection: Traverse the tree using UCB1 to balance exploration and exploitation
2. Expansion: Generate multiple candidate actions from the current state
3. Evaluation: The LLM scores the resulting states (heuristic value function)
4. Simulation: Optionally roll out to terminal states
5. Backpropagation: Update node values based on outcomes
6. Reflection: On failure, generate a reflection and store it for reuse
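The selection step is standard UCB1, which is what makes failed branches self-avoiding: once backpropagation assigns a branch low value, its score drops below better-valued siblings. A minimal sketch (node representation and exploration constant are illustrative, not from the paper):

```python
import math

def ucb1(total_value: float, visits: int, parent_visits: int, c: float = 1.414) -> float:
    """UCB1 score used during selection: exploit high-value branches while
    still exploring rarely visited ones."""
    if visits == 0:
        return float("inf")  # unvisited children are always expanded first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select(children: list, parent_visits: int) -> dict:
    """Pick the child maximizing UCB1; branches that failed during earlier
    rollouts (low backed-up value) are naturally avoided once visited."""
    return max(children, key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))

# A failed branch (value 0 over 3 visits) loses to a promising one.
children = [
    {"name": "failed", "value": 0.0, "visits": 3},
    {"name": "promising", "value": 2.4, "visits": 3},
]
print(select(children, parent_visits=6)["name"])  # promising
```

The exploration term keeps the failed branch from being written off entirely after one visit, which matters because the LLM value estimates steering the search are noisy.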
Memory Model:
LATS has two forms of memory:
1. Within-episode tree memory: The search tree itself stores explored paths, their outcomes, and value estimates. Failed branches are remembered and avoided.
2. Cross-episode reflection memory: Like Reflexion, failed trajectories produce verbal reflections that can inform future attempts.
How It Learns from Mistakes:
- Failed branches in the tree are assigned low values during backpropagation, steering future search away from them
- The tree structure provides structured memory — not just “what went wrong” but “which specific decision point led to failure”
- Reflections from failed rollouts are more targeted because they can identify the exact branching point where things diverged
Empirical Results:
- Programming tasks (HumanEval): 94.4% pass@1 (vs. 80.1% for ReAct, 91.0% for Reflexion)
- WebShop: 75.9 average score (vs. 65.4 for ReAct)
- HotPotQA: 71.2% (competitive with Reflexion but with fewer total LLM calls in many cases)
Strengths:
- Principled exploration-exploitation tradeoff via MCTS
- Can recover from mid-plan errors without restarting
- Structured memory (tree) is richer than linear reflection lists
Limitations:
- Computationally expensive — each node expansion requires LLM calls
- Tree does not persist across episodes in the base formulation
- LLM as value function is noisy and uncalibrated
- Scales poorly with action space size and planning horizon
2.4 Voyager
Paper: Wang et al., 2023 — “Voyager: An Open-Ended Embodied Agent with Large Language Models”
Core Mechanism:
Voyager is an open-ended learning agent for Minecraft that builds a persistent, growing skill library from experience. It combines automatic curriculum generation, code-as-action, and a retrieval-augmented skill library.
Architecture:
1. Automatic Curriculum: The LLM proposes the next learning objective
2. Skill Generation: The LLM writes executable code (JavaScript) to achieve the objective
3. Environment Feedback: The code runs in Minecraft and returns success or an error
4. Iterative Refinement: If the code fails, the LLM debugs it using error messages and environment state
5. Skill Library: Successful skills are stored with descriptions for future retrieval
6. Skill Retrieval: When facing new tasks, relevant past skills are retrieved and composed
Memory Model:
Voyager’s memory is a skill library — a vector-indexed collection of:
- Skill code (executable functions)
- Natural language descriptions
- Embedding vectors for retrieval
This is fundamentally different from Reflexion’s reflection buffer or LATS’s search tree. Voyager stores positive knowledge (what works) rather than negative knowledge (what failed).
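The skill-library pattern can be sketched as a description-indexed store with similarity retrieval. Everything here is illustrative: the toy letter-count `embed` stands in for a real embedding model, and the skill names are made up, not Voyager's actual skills.

```python
import math

def embed(text: str) -> list:
    # Toy bag-of-letters embedding; a real system would use a learned model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SkillLibrary:
    """Stores only successful skills (positive knowledge) and retrieves
    them by embedding similarity to a new task description."""
    def __init__(self):
        self.skills = []  # list of (description, code, embedding)

    def add(self, description: str, code: str):
        self.skills.append((description, code, embed(description)))

    def retrieve(self, task: str, k: int = 1) -> list:
        query = embed(task)
        ranked = sorted(self.skills, key=lambda s: cosine(query, s[2]), reverse=True)
        return [code for _, code, _ in ranked[:k]]

lib = SkillLibrary()
lib.add("mine iron ore with a stone pickaxe", "function mineIron(bot) { ... }")
lib.add("craft a wooden sword", "function craftSword(bot) { ... }")
print(lib.retrieve("collect iron ore")[0])  # function mineIron(bot) { ... }
```

Two properties of the architecture are visible even in this sketch: the library only ever grows with verified successes, and retrieval quality is entirely hostage to how well the stored descriptions match future task phrasings.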
How It Learns from Mistakes:
- Iterative refinement within an episode: failed code is debugged using error traces
- But failures are not stored — only successful skills enter the library
- Learning from mistakes is implicit: the refined, working version of a skill represents the accumulated debugging effort
- The curriculum itself adapts: if a skill is too hard, the agent may attempt prerequisite skills first
Empirical Results:
- 3.3x more unique items obtained than next-best baseline in Minecraft
- Unlocks key tech tree milestones (diamond pickaxe) that no other LLM agent achieved
- Skill library grows continuously — agent gets more capable over time
- Skills transfer: library built in one world helps in new worlds
Strengths:
- True lifelong learning — capability grows without bound (not limited by context window)
- Composable skills — complex behaviors built from simpler ones
- Code as memory — executable, precise, not subject to language ambiguity
- Retrieval-based — scales beyond context window limits
Limitations:
- Domain-specific (Minecraft, though the pattern generalizes)
- Does not store failure cases — cannot warn against known bad approaches
- Skill retrieval depends on description quality
- No explicit planning over multi-step objectives (curriculum is one skill at a time)
3. 2024-2025 Advances
3.1 Cognitive Architectures: CoALA Framework
Paper: Sumers et al., 2024 — “Cognitive Architectures for Language Agents”
CoALA provides a unifying framework for understanding agent memory architectures by drawing on cognitive science. It decomposes agent memory into:
- Working Memory: Current context (prompt), short-term task state
- Episodic Memory: Records of past experiences (trajectories, outcomes)
- Semantic Memory: General knowledge (facts, rules, heuristics)
- Procedural Memory: Action policies and skills (code, plans, strategies)
Under this framework:
- ReAct uses only working memory
- Reflexion adds rudimentary episodic memory (reflections)
- LATS adds structured working memory (search tree) and episodic memory
- Voyager emphasizes procedural memory (skill library) with retrieval-augmented access
The key insight: state-of-the-art agents need all four memory types, with retrieval mechanisms to access them efficiently.
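The four-way decomposition can be made concrete as a single agent state object. The field contents are illustrative examples, not an API from the CoALA paper:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """CoALA's four memory types held side by side in one agent state."""
    working: list = field(default_factory=list)     # current prompt / task context
    episodic: list = field(default_factory=list)    # past trajectories and outcomes
    semantic: dict = field(default_factory=dict)    # facts, rules, heuristics
    procedural: dict = field(default_factory=dict)  # named skills, plans, policies

mem = AgentMemory()
mem.working.append("Task: find the capital of France")
mem.episodic.append({"task": "capital of Spain", "outcome": "success"})
mem.semantic["capital:France"] = "Paris"
mem.procedural["lookup_capital"] = lambda country: mem.semantic.get(f"capital:{country}")
print(mem.procedural["lookup_capital"]("France"))  # Paris
```

Mapping the surveyed systems onto these fields (ReAct uses only `working`, Voyager mostly `procedural`, and so on) is what makes CoALA useful as a comparison lens.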
3.2 Generative Agents and Emergent Memory
Paper: Park et al., 2023 — “Generative Agents: Interactive Simulacra of Human Behavior”
While not a planning-focused architecture per se, Generative Agents introduced a memory stream architecture that heavily influenced subsequent work:
- Memory Stream: Timestamped log of all observations and actions
- Retrieval: Scored by recency, importance, and relevance to current query
- Reflection: Periodic synthesis of memories into higher-level insights
- Planning: Daily plans generated from character description + recent memories + reflections
This architecture demonstrated that LLM agents with rich memory produce emergent social behaviors (planning parties, forming relationships, coordinating activities).
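The retrieval rule combines the three signals into a single score. A minimal sketch, with illustrative weights and a hardcoded relevance table standing in for embedding similarity (the paper's actual weights and normalization differ):

```python
def retrieval_score(memory: dict, now: int, query_relevance: float,
                    decay: float = 0.995) -> float:
    """Score a memory-stream entry for retrieval: recency decays
    exponentially with age, importance is a normalized 1-10 rating,
    relevance comes from similarity to the current query."""
    recency = decay ** (now - memory["timestamp"])
    importance = memory["importance"] / 10.0
    return recency + importance + query_relevance  # equal weights, for the sketch

stream = [
    {"text": "saw Klaus at the library", "timestamp": 90, "importance": 3},
    {"text": "planning a Valentine's party", "timestamp": 50, "importance": 8},
]
# Relevance would come from embedding similarity; hardcoded for this example.
relevance = {"saw Klaus at the library": 0.2, "planning a Valentine's party": 0.9}
now = 100
best = max(stream, key=lambda m: retrieval_score(m, now, relevance[m["text"]]))
print(best["text"])  # planning a Valentine's party
```

The example shows the intended behavior: an older but important and relevant memory outranks a fresher but trivial one.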
3.3 MemoryBank and Long-Term Memory Systems
Several 2024-2025 works formalized long-term memory for agents:
- MemGPT (Packer et al., 2023-2024): Treats context window management as a virtual memory system with paging. The agent can explicitly write to and read from long-term storage, managing its own context the way an OS manages RAM. This enables agents to maintain coherent behavior across very long interactions.
- REM (Retrieval-Enhanced Memory): Architectures that combine vector databases with structured memory stores, allowing agents to retrieve relevant past experiences based on semantic similarity to the current situation.
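The OS analogy can be made literal with a toy paging sketch. The class below is an illustration of the pattern, not MemGPT's actual API: capacity is counted in messages rather than tokens, and recall is keyword search rather than a function call the model issues.

```python
class PagedContext:
    """MemGPT-style context paging: overflowing messages are evicted to
    long-term storage and can be paged back in on demand."""
    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.working = []   # what the model actually sees ("RAM")
        self.archive = []   # long-term store ("disk")

    def append(self, message: str):
        self.working.append(message)
        while len(self.working) > self.capacity:
            self.archive.append(self.working.pop(0))  # page out the oldest

    def recall(self, keyword: str) -> list:
        """Page archived messages matching a query back into working memory."""
        hits = [m for m in self.archive if keyword in m]
        for m in hits:
            self.append(m)
        return hits

ctx = PagedContext(capacity=3)
for msg in ["user: my name is Ada", "user: hello", "user: what's up", "user: tell a joke"]:
    ctx.append(msg)
print(ctx.recall("name"))  # ['user: my name is Ada']
```

After the recall, the name is back in working context even though it had been paged out, which is the mechanism that lets such agents stay coherent far past the raw context limit.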
3.4 ExpeL (Experience Learning)
Paper: Zhao et al., 2024 — “ExpeL: LLM Agents Are Experiential Learners”
ExpeL explicitly addresses the gap between Reflexion (task-specific reflections) and generalizable learning:
- Agent attempts multiple tasks, collecting success and failure trajectories
- Cross-task analysis extracts general insights (not task-specific reflections)
- Insights are stored as transferable rules: “When navigating, always check the map before moving” rather than “In Task 7, I should have gone left”
- Insights are retrieved and applied to novel tasks
This represents a step toward genuine experience-driven generalization — the agent learns principles, not just task-specific patches.
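The contrast with Reflexion is easiest to see in code. In this sketch, `extract_insight` is a scripted stand-in for the LLM that, in ExpeL, compares trajectories; the trajectory fields and the keyword-based retrieval are illustrative simplifications.

```python
def extract_insight(failure: dict, success: dict) -> str:
    """Stand-in for the LLM comparison step: distill a paired failure and
    success into a general, transferable rule rather than a task patch."""
    return (f"When the goal involves '{success['goal_type']}', "
            f"prefer '{success['strategy']}' over '{failure['strategy']}'.")

insight_pool = []
failure = {"goal_type": "navigation", "strategy": "move immediately"}
success = {"goal_type": "navigation", "strategy": "check the map first"}
insight_pool.append(extract_insight(failure, success))

def insights_for(goal_type: str) -> list:
    # Keyword retrieval for the sketch; a real system would use semantic similarity.
    return [i for i in insight_pool if goal_type in i]

print(insights_for("navigation")[0])
```

The stored string mentions no specific task, so it can be prepended to any future navigation-type problem, which is the generalization step Reflexion's task-local buffer lacks.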
3.5 ReWOO and Plan-then-Execute
Paper: Xu et al., 2023 — “ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models”
ReWOO separates planning from execution:
1. Planner: Generates a full plan with anticipated dependencies upfront
2. Worker: Executes each step, collecting actual observations
3. Solver: Synthesizes observations into final answer
When combined with memory, the planner can draw on past plan templates (what plan structures worked for similar problems), creating a form of procedural memory for planning itself.
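The decoupling hinges on the planner emitting evidence placeholders (#E1, #E2, …) so later steps can reference earlier results before any tool has run. A toy sketch, with scripted stand-ins for both LLM roles and a stubbed `Search` tool:

```python
def planner(task: str) -> list:
    """Stand-in for the LLM planner: the full plan is produced upfront,
    with #E placeholders wiring step outputs into later step inputs."""
    return [
        {"id": "#E1", "tool": "Search", "arg": "capital of France"},
        {"id": "#E2", "tool": "Search", "arg": "population of #E1"},
    ]

def worker(plan: list, tools: dict) -> dict:
    """Execute every step, substituting earlier evidence into arguments."""
    evidence = {}
    for step in plan:
        arg = step["arg"]
        for eid, val in evidence.items():
            arg = arg.replace(eid, val)
        evidence[step["id"]] = tools[step["tool"]](arg)
    return evidence

def solver(task: str, evidence: dict) -> str:
    """Stand-in for the LLM solver synthesizing the final answer."""
    return evidence["#E2"]

tools = {"Search": lambda q: {"capital of France": "Paris",
                              "population of Paris": "about 2.1 million"}.get(q, "unknown")}
task = "What is the population of the capital of France?"
evidence = worker(planner(task), tools)
print(solver(task, evidence))  # about 2.1 million
```

Because the planner never sees intermediate observations, a memory-augmented variant only needs to store and retrieve these plan templates, which is the procedural-memory-for-planning idea described above.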
4. 2025-2026 State of the Art
4.1 Agent-as-a-Judge and Self-Improvement Loops
The latest generation of agents incorporates self-evaluation as a first-class component:
- Agents evaluate their own outputs using rubrics derived from past experience
- Failed evaluations trigger targeted retry with specific improvement directives
- Evaluation criteria themselves evolve based on what distinguishes success from failure
4.2 Hierarchical Memory with Forgetting
Drawing on cognitive science, newer architectures implement principled forgetting:
- Memories decay based on recency and frequency of access
- Important memories are consolidated into higher-level abstractions
- This prevents memory bloat while preserving critical lessons
- Analogous to human episodic memory consolidation during sleep
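A decay-and-prune policy of this kind can be sketched with a retention score that combines recency and access frequency. The formula, half-life, and threshold below are illustrative choices, not taken from any specific system:

```python
import math

def retention(memory: dict, now: float, half_life: float = 50.0) -> float:
    """Retention score: recency halves every `half_life` steps, boosted by
    how often the memory has been accessed (with diminishing returns)."""
    age = now - memory["last_access"]
    recency = math.exp(-math.log(2) * age / half_life)
    frequency = math.log1p(memory["access_count"])
    return recency * (1.0 + frequency)

def prune(store: list, now: float, threshold: float = 0.5):
    """Split memories into those kept and those forgotten (or, in a fuller
    system, handed off to a consolidation step that abstracts them)."""
    kept = [m for m in store if retention(m, now) >= threshold]
    forgotten = [m for m in store if retention(m, now) < threshold]
    return kept, forgotten

memories = [
    {"text": "critical lesson: never delete prod data", "last_access": 95, "access_count": 12},
    {"text": "weather was cloudy on day 3", "last_access": 10, "access_count": 1},
]
kept, forgotten = prune(memories, now=100)
print([m["text"] for m in forgotten])  # the stale, rarely used memory goes
```

Frequently accessed lessons survive long gaps in use, while stale one-off observations fall below threshold, which is the bloat-control behavior the text describes.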
4.3 Tool-Augmented Memory
Modern agents increasingly use external tools as memory infrastructure:
- Vector databases for semantic retrieval of past experiences
- Knowledge graphs for structured relationship storage
- Code repositories as procedural memory (Voyager-style)
- Structured logs with queryable metadata
4.4 Multi-Agent Memory Sharing
In multi-agent systems, memory-driven planning extends to collective experience:
- Shared memory pools where agents contribute and retrieve experiences
- Specialized memory agents that curate and organize collective knowledge
- Cross-agent learning: Agent B benefits from Agent A’s mistakes without repeating them
4.5 Agentic Reasoning Frameworks (2025-2026)
The most recent frameworks (late 2025 and into 2026) converge on several principles:
- Memory-conditioned planning: Plans are generated not from scratch but conditioned on retrieved relevant experiences, similar to case-based reasoning in classical AI.
- Continuous skill accumulation: Following Voyager’s lead, agents build persistent libraries of verified capabilities, but now with richer metadata (preconditions, failure modes, performance characteristics).
- Reflective abstraction: Rather than storing raw reflections (Reflexion) or raw trajectories, agents abstract experiences into reusable patterns — “when X happens, strategy Y tends to work because Z.”
- Adaptive retrieval: Instead of fixed retrieval strategies, agents learn which memories are useful in which contexts, effectively learning a meta-policy for memory access.
5. Comparative Analysis
| Dimension | ReAct | Reflexion | LATS | Voyager | ExpeL | Modern (2025-2026) |
|---|---|---|---|---|---|---|
| Memory Persistence | None (single episode) | Cross-trial, same task | Within-episode tree | Permanent skill library | Cross-task insights | Hierarchical, persistent |
| What’s Stored | Nothing | Verbal reflections | Tree + reflections | Working code + descriptions | General rules/insights | Abstracted patterns + skills + episodes |
| Learns from Failure | No | Yes (verbal) | Yes (tree values) | Implicitly (debugging) | Yes (cross-task) | Yes (structured) |
| Generalizes | No | No (task-specific) | No | Partially (skill reuse) | Yes | Yes |
| Retrieval | N/A | Linear (all loaded) | Tree traversal (UCB1) | Vector similarity | Relevance-based | Adaptive, learned |
| Scalability | N/A | Context-bounded | Compute-bounded | Unbounded (external DB) | Moderate | Unbounded |
| Planning Type | Reactive (1-step) | Reactive + retry | Deliberative (tree) | Curriculum-driven | Reactive + informed | Deliberative + experience-conditioned |
6. How Agents Learn from Past Mistakes: A Taxonomy
Across all architectures, there are five mechanisms for learning from stored mistakes:
6.1 Verbal Self-Reflection (Reflexion, ExpeL)
The agent generates natural language analysis of what went wrong. This is the most common approach because it leverages the LLM’s strength (language). Limitation: reflections can be superficial or incorrect.
6.2 Value Backpropagation (LATS)
Failed paths receive low value scores that propagate through the search tree. This is more principled (grounded in MCTS theory) but only operates within an episode.
6.3 Iterative Code Debugging (Voyager)
Failed code is refined using error messages. The mistake is “learned from” by producing working code, but the failure itself is discarded.
6.4 Negative Example Storage
Some newer systems explicitly store anti-patterns — “Do NOT do X because it leads to Y.” These are retrieved when the agent is about to make a similar mistake.
6.5 Contrastive Experience Pairs
The most sophisticated approach: store (failure, success) pairs for similar situations, allowing the agent to understand what distinguishes good from bad strategies.
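Mechanisms 6.4 and 6.5 can be combined in one small structure: index contrastive (failure, success) pairs by situation and surface both halves as advice. A toy sketch with invented example content, using exact-match lookup where a real system would use similarity retrieval:

```python
# Contrastive experience store: for each situation, what failed and what
# worked, so advice explains the distinction rather than just the outcome.
pairs = {
    "login request times out": {
        "failure": "retried the same request immediately in a tight loop",
        "success": "backed off exponentially and retried after 2s, 4s, 8s",
    }
}

def advise(situation: str) -> str:
    """Turn a stored (failure, success) pair into an anti-pattern warning
    plus a recommended alternative."""
    pair = pairs.get(situation)
    if pair is None:
        return "no prior experience"
    return f"Avoid: {pair['failure']}. Instead: {pair['success']}."

print(advise("login request times out"))
```

Storing the pair rather than either half alone is what lets the agent articulate why one strategy beats the other, which plain negative-example storage cannot do.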
7. Open Problems and Research Frontiers
- Memory Quality Control: How to ensure stored experiences are accurate and useful? Incorrect reflections can cause negative transfer.
- Forgetting and Consolidation: How to manage growing memory stores? Which experiences to retain, which to abstract, which to discard?
- Cross-Domain Transfer: Can experience from one domain (e.g., web navigation) help in another (e.g., coding)? Current systems are mostly domain-specific.
- Credit Assignment: In multi-step plans, identifying which specific decision caused failure remains hard. LATS partially addresses this with tree structure, but it is still an open problem.
- Memory-Planning Integration: Current systems often retrieve memories and plan separately. Tighter integration — where the planning process itself is guided by memory retrieval at each decision point — is an active area.
- Evaluation Benchmarks: There is no standard benchmark for measuring how well agents learn from experience over long horizons. Most evaluations are still single-task or short-horizon.
8. Conclusions
The evolution from ReAct to modern memory-driven planning architectures represents a clear trajectory:
- ReAct (2022): Established the reasoning-acting loop but with no memory.
- Reflexion (2023): Added verbal self-critique as a form of episodic memory, enabling within-task improvement.
- LATS (2023): Introduced structured deliberative planning with tree-based memory, enabling principled exploration.
- Voyager (2023): Demonstrated lifelong skill accumulation as procedural memory, enabling unbounded capability growth.
- ExpeL and successors (2024): Showed cross-task generalization from experience, moving toward genuine learning.
- 2025-2026 frameworks: Converge on hierarchical, multi-type memory systems with adaptive retrieval and abstracted experience patterns.
The state of the art in experience-informed planning is characterized by hybrid memory systems that combine episodic records, procedural skills, semantic knowledge, and reflective abstractions, accessed through learned retrieval policies and integrated into deliberative planning processes. The field is moving away from monolithic approaches (one type of memory, one retrieval strategy) toward cognitive-architecture-inspired designs that mirror the richness of human memory systems.
The most impactful open question is not how to store memories, but how to retrieve and apply the right memory at the right time — the retrieval-planning interface is where the next breakthrough is most likely to occur.