LLMs are stateless. AI agents need to remember. The gap between those two facts is bridged by a memory layer — vector embeddings, session stores, long-term memory frameworks like Letta and Mem0, and increasingly procedural memory. This guide walks through the four kinds of memory a production AI agent in 2026 actually uses, the tools that implement each, and the design decisions that matter.
Forgetting is the single biggest reason naive agent demos fall apart in production. A demo agent works because the demo is short. A real agent fails because it's been running for three hours and has no idea what it agreed to in turn 7. The fix is a memory architecture — not a magic feature, but an explicit set of choices about what to remember, how, and for how long.
This article sits next to our agent stack reference architecture and RAG vs Fine-Tuning vs Agents. For the glossary basics, see memory, RAG, vector database, vector embedding and context window.
The four memory types you actually need
| Type | Lifetime | Where it lives | Example |
|---|---|---|---|
| Working memory | Current turn / loop | Context window | "User asked about pricing 2 turns ago" |
| Session memory | One job / conversation | Orchestration state | "User picked plan B in step 3" |
| Long-term semantic memory | Months / forever | Vector DB | "User prefers concise emails, hates Mondays" |
| Procedural memory | Forever | System prompt / fine-tuned weights / tool-side checks | "Always cite sources when answering medical questions" |
A production agent typically uses all four. Where each is implemented is the interesting design question.
1. Working memory — the context window
The model's working memory is just the context window. Frontier models in 2026 ship with 200K to 2M tokens of context, which sounds like a lot until your agent has been running for an hour and called 14 tools.
Decisions you make here:
- What goes in the system prompt? Stable, durable info (identity, tone, hard rules, tool definitions).
- What gets re-injected every turn? Recent conversation, current task state, retrieved RAG chunks.
- What gets summarized? As the window fills, older turns get condensed into a running summary that survives further into the conversation.
- What gets evicted? Tool outputs you've already used, completed sub-task plans, intermediate scratchpads.
The mistake people make: treating the context window as the entire memory store. It's the workspace, not the warehouse. Once a fact has been consumed and acted on, it should leave the window — preserved in another layer if it matters.
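The summarize-and-evict decisions above can be sketched in a few lines. This is a minimal illustration, not a production policy: `WorkingMemory`, `estimate_tokens` and `naive_summarize` are hypothetical names, the 4-characters-per-token estimate is a rough assumption, and the summarizer is a placeholder for what would normally be an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in the model's tokenizer.
    return max(1, len(text) // 4)


def naive_summarize(summary: str, evicted_turn: str) -> str:
    # Placeholder for an LLM summarization call: keep a clipped tail of old turns.
    return (summary + " " + evicted_turn[:20]).strip()[-80:]


@dataclass
class WorkingMemory:
    system_prompt: str                  # stable info, never evicted
    budget: int                         # token budget for the whole window
    summarize: Callable[[str, str], str] = naive_summarize
    summary: str = ""                   # running summary of evicted turns
    turns: list[str] = field(default_factory=list)

    def used(self) -> int:
        parts = [self.system_prompt, self.summary, *self.turns]
        return sum(estimate_tokens(p) for p in parts if p)

    def add_turn(self, text: str) -> None:
        self.turns.append(text)
        # Fold the oldest turns into the summary until the window fits,
        # always keeping the newest turn verbatim.
        while self.used() > self.budget and len(self.turns) > 1:
            oldest = self.turns.pop(0)
            self.summary = self.summarize(self.summary, oldest)

    def render(self) -> str:
        parts = [self.system_prompt]
        if self.summary:
            parts.append("Summary of earlier turns: " + self.summary)
        parts.extend(self.turns)
        return "\n".join(parts)
```

Anything that must survive eviction (a decision, a commitment to the user) should be written to session or long-term memory before it is folded into the summary, since a summary is lossy by design.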
2. Session memory — state across tool calls
Within one job (one customer interaction, one coding task, one research run), the agent needs to track its own state.
What lives here:
- The current plan or task list.
- Outputs of completed sub-tasks that may be needed later.
- A scratchpad of intermediate findings.
- Pending tool calls, retries, error counts.
Where it lives: the orchestration framework's state. LangGraph state objects, CrewAI shared context, your own Redis store, or a database row when the state has to outlive the process. Frameworks differ in how explicit they make this: LangGraph is most explicit, CrewAI tries to hide it.
Failure mode: session memory that doesn't survive restarts. If your agent crashes 8 minutes into a 12-minute research task and loses everything, you've shipped the wrong thing. Use a durable store (Postgres, or Redis with persistence enabled) for session state, not an in-memory dictionary.
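Here is one way to checkpoint session state so a crash does not lose it. This is a sketch under assumptions: `SessionStore` and the shape of the state dict are hypothetical, and Python's built-in `sqlite3` stands in for Postgres or Redis so the example runs without a server.

```python
import json
import sqlite3


class SessionStore:
    """Checkpoints session state to disk; sqlite3 stands in for Postgres/Redis."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.db.commit()

    def save(self, session_id: str, state: dict) -> None:
        # Overwrite the previous checkpoint for this session.
        self.db.execute(
            "INSERT OR REPLACE INTO sessions (id, state) VALUES (?, ?)",
            (session_id, json.dumps(state)),
        )
        self.db.commit()

    def load(self, session_id: str) -> dict:
        row = self.db.execute(
            "SELECT state FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        if row:
            return json.loads(row[0])
        # A fresh session starts with an empty plan, results, scratchpad, and error count.
        return {"plan": [], "done": {}, "scratchpad": [], "errors": 0}
```

Calling `save` after every completed sub-task means a restarted worker can `load` the same session ID and resume from the last checkpoint instead of starting over.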
3. Long-term semantic memory — vector + structured
This is the layer that lets the agent remember things across sessions, days, weeks, months. It's the most-discussed and most-misimplemented memory layer.
Two sub-types:
- Episodic memory — discrete events. "On 2026-04-12, user asked about the refund policy and we replied X." Stored as conversation transcripts indexed by user/time + embeddings.
- Semantic memory — distilled facts. "User prefers English, hates Mondays, manages a team of 6, last had a refund issue 3 weeks ago." Stored as structured facts.
Most real systems write both — keep raw transcripts for audit, distill facts for fast retrieval.
Mechanics:
- After each interaction, an extractor (often the same LLM with a "memory extraction" prompt) writes new facts to the store.
- At the start of each new interaction, a retriever pulls the top-k relevant memories and injects them into the system prompt or working context.
- A periodic compaction job merges duplicates, resolves contradictions, summarizes long-tail history into shorter facts.
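The write and read paths above can be sketched as follows. Everything here is illustrative: `MemoryStore`, `embed` and `inject` are hypothetical names, and the bag-of-words `embed` is a toy stand-in for a real embedding model so the example runs with the standard library only. Compaction is reduced to skipping exact duplicates at write time.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class MemoryStore:
    def __init__(self) -> None:
        self.facts: list[tuple[str, Counter]] = []

    def write(self, fact: str) -> None:
        # Write path: store the fact with its vector, skipping exact duplicates.
        if all(fact != existing for existing, _ in self.facts):
            self.facts.append((fact, embed(fact)))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Read path: rank stored facts by similarity to the query, keep the top k.
        q = embed(query)
        ranked = sorted(self.facts, key=lambda f: cosine(q, f[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]


def inject(system_prompt: str, memories: list[str]) -> str:
    # Prepend retrieved memories to the prompt for the new interaction.
    if not memories:
        return system_prompt
    lines = "\n".join(f"- {m}" for m in memories)
    return f"{system_prompt}\n\nKnown about this user:\n{lines}"
```

In a real system the extractor that calls `write` would be an LLM prompt, and the duplicate check would be a semantic comparison rather than string equality.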
Tools that implement this in 2026:
- Letta (formerly MemGPT) — hierarchical memory with explicit summarization tiers. Strong for agents that have to manage their own context.
- Mem0 — personalized memory with clean CRUD API. Easy to drop in.
- Zep — session-aware memory aimed at chat assistants.
- Cognee — graph-shaped memory for agents that benefit from explicit entity relationships.
- Build your own — pgvector + a small write/read API. Common for teams that want full control.
All four frameworks typically rely on a vector index for similarity search, in some cases alongside graph or relational storage; see the agent stack reference for the broader picture.
4. Procedural memory — the underused layer
Procedural memory is what the agent "knows how to do" without being told. In humans this is riding a bike; in agents it's "always log to Sentry on tool failure," "always cite a source for medical claims," "always check inventory before promising delivery."
In 2026, procedural memory is implemented three ways:
- System prompt patterns. Hardcoded rules + few-shot examples. Simple, brittle.
- Fine-tuning. Bake the procedure into the model weights. See RAG vs Fine-Tuning vs Agents.
- Tool-side enforcement. The tool itself refuses bad inputs and explains the rule. This is the strongest pattern: the agent is corrected at runtime, and the rule holds even if the prompt is ignored.
Most teams skip procedural memory by accident. They add a rule, the agent breaks it in week 3, they add the same rule again, and so on. A systematic procedural memory store — and a regression test that fires when a procedure is violated — is one of the highest-leverage investments in a mature agent program.
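A small sketch of tool-side enforcement plus the regression test that guards it, using the "check inventory before promising delivery" rule from above. `OrderTools`, `ProcedureViolation` and the SKU data are hypothetical names invented for illustration, not part of any framework.

```python
class ProcedureViolation(Exception):
    """Raised when a tool call breaks a procedure; the message states the rule."""


class OrderTools:
    """Hypothetical tool set for one session; names and rules are illustrative."""

    def __init__(self, inventory: dict[str, int]):
        self.inventory = inventory
        self.checked: set[str] = set()  # SKUs whose stock was checked this session

    def check_inventory(self, sku: str) -> int:
        self.checked.add(sku)
        return self.inventory.get(sku, 0)

    def promise_delivery(self, sku: str) -> str:
        # The tool, not the prompt, enforces the rule and explains it on refusal.
        if sku not in self.checked:
            raise ProcedureViolation(
                f"Call check_inventory('{sku}') before promising delivery."
            )
        if self.inventory.get(sku, 0) <= 0:
            raise ProcedureViolation(f"{sku} is out of stock; do not promise delivery.")
        return f"Delivery promised for {sku}"


def test_promise_requires_inventory_check() -> None:
    # Regression test: fails if the procedure ever stops being enforced.
    tools = OrderTools({"sku-1": 4})
    try:
        tools.promise_delivery("sku-1")
    except ProcedureViolation:
        pass
    else:
        raise AssertionError("promise_delivery allowed without an inventory check")
    assert tools.check_inventory("sku-1") == 4
    assert tools.promise_delivery("sku-1") == "Delivery promised for sku-1"
```

The error message is written for the agent to read: returning it as the tool result gives the model what it needs to retry correctly, and the test keeps the rule from silently disappearing in a refactor.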
Memory retrieval — the actually-hard part
Writing to memory is easy. Retrieving the right memory is hard. Three concrete failure modes:
Too much retrieval. You ask for top-50 and inject everything; the model gets distracted. The fix is aggressive reranking and a hard cap on how many memories enter the prompt (5–10 is typical).
Stale memory. A fact from last quarter contradicts a fact from yesterday and the retriever hands the agent the older one. The fix is recency weighting in the retrieval score + periodic compaction.
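Recency weighting can be as simple as multiplying the similarity score by an exponential decay. The sketch below assumes a similarity in [0, 1] and a 30-day half-life; both the `Memory` type and the default values are illustrative choices, not recommendations from any particular framework. The `cap` parameter applies the hard limit on injected memories at the same time.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    similarity: float  # relevance score from the retriever, assumed to be in [0, 1]
    age_days: float    # time since the memory was written


def recency_weighted(
    memories: list[Memory], half_life_days: float = 30.0, cap: int = 5
) -> list[Memory]:
    def score(m: Memory) -> float:
        # The weight halves every `half_life_days`, so newer facts outrank older ones
        # unless the older one is far more relevant.
        decay = 0.5 ** (m.age_days / half_life_days)
        return m.similarity * decay

    # Hard cap on how many memories may enter the prompt.
    return sorted(memories, key=score, reverse=True)[:cap]
```

A purely multiplicative decay will eventually bury old facts that are still true (a user's name, for example), so durable facts are usually better refreshed by the compaction job or exempted from decay.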
Wrong scope. Personal memory leaks into a different user's session, or organizational memory bleeds into a personal assistant. The fix is strict scope tagging at write time (user_id, tenant_id, session_id) and matching filters at read time. This is non-negotiable in regulated environments.
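Scope tagging is easiest to get right when the store has no unscoped code path at all. In this minimal sketch, `Scope` and `ScopedMemory` are hypothetical names; a `session_id` field could be added to `Scope` in the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    tenant_id: str
    user_id: str


class ScopedMemory:
    def __init__(self) -> None:
        self._rows: list[tuple[Scope, str]] = []

    def write(self, scope: Scope, fact: str) -> None:
        # Every write carries its scope; there is no unscoped write path.
        self._rows.append((scope, fact))

    def read(self, scope: Scope) -> list[str]:
        # Filter before any ranking, so out-of-scope rows never reach the prompt.
        return [fact for s, fact in self._rows if s == scope]
```

With a vector database the same idea becomes a metadata filter applied in the query itself, rather than a filter applied to results after retrieval.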
For broader retrieval mechanics see our RAG explainer, vector search and reranker entries.
What "memory" looks like at three sizes
Solo / startup: the minimum viable memory
- Working: context window (whatever the model gives you).
- Session: a Redis hash keyed by conversation ID.
- Long-term: pgvector inside an existing Postgres + a 30-line write/read API.
- Procedural: system prompt + a "lessons learned" doc that gets a new entry after every incident.
Cost: Effectively free (existing Postgres + a few cents in embeddings per active user per month).
SMB: a memory framework + a vector DB
- Working: managed by your orchestration framework (LangGraph state).
- Session: framework's state object backed by Redis.
- Long-term: Letta or Mem0 with managed Pinecone / Weaviate underneath.
- Procedural: prompt patterns + an eval that asserts each rule.
Cost: $200–$2,000/mo.
Enterprise: per-tenant memory with strong governance
- Working: per-tenant context, no cross-tenant leakage.
- Session: durable, audited, replayable.
- Long-term: per-tenant vector index (and, in extreme cases, a per-tenant embedding model); full audit trail of every write and read; PII redaction at write time.
- Procedural: fine-tuned small model that enforces procedures + runtime tool-side enforcement.
Cost: $10K+/mo for the memory layer alone, but a small fraction of total agent infrastructure.
When memory becomes the bottleneck
Three signals that you're under-investing in memory:
- Users repeat themselves. "I told you this last time" is the agent's worst review.
- Quality degrades across sessions. Demo is great; week 4 is bad.
- Eval scores are unstable run-to-run. One common cause is the agent leaking context from previous runs into new ones.
When any of these show up, audit the memory layer before tuning the model or the prompt.
What this means for buyers
When you evaluate an agent on the leaderboard, ask:
- Does the agent have any persistent memory at all? Many "demos" don't.
- Can a user inspect, edit and delete what the agent remembers about them? GDPR will care; users care too.
- Is memory per-tenant or shared? In multi-tenant SaaS this matters a lot.
- What's the retention policy? Forever is sometimes wrong.
- Does the vendor expose memory traces in their observability layer?
Memory is the layer that turns a clever chatbot into a useful colleague. Pick wrong and your agent forgets what it was told last week. Pick right and your users feel like the agent actually knows them.
See agent stack reference, observability comparison and AI agent design patterns for the layers above and below.