Why do AI agents forget things between conversations?

Because LLMs are stateless. The model has no memory of prior turns unless you re-send them in the prompt — and you can only send so many tokens before hitting the context window limit. Without an external memory layer, every new conversation (or every conversation past the context window) starts from a blank slate.

What types of memory does an AI agent need?

Four practical types: (1) Working memory — the current context window, managed by the orchestration layer; (2) Short-term / session memory — state that survives across tool calls within one job; (3) Long-term semantic memory — facts about users, preferences, past interactions, usually in a vector database; (4) Procedural memory — learned how-to patterns the agent reuses. Most production agents implement the first three; procedural memory is still mostly hand-coded as system-prompt snippets.

What is the difference between RAG and agent memory?

RAG retrieves chunks from a static-ish corpus to ground a single answer. Agent memory stores agent- or user-specific state that the agent writes and reads over many sessions — preferences, prior decisions, conversation history, learned facts. They use similar tech (embeddings + vector DB) but serve different roles. RAG answers 'what does the corpus say?'; memory answers 'what did we agree last week?'

What are the best AI agent memory frameworks in 2026?

Three independent frameworks dominate: Letta (formerly MemGPT) for hierarchical memory management with explicit summarization, Mem0 for personalized long-term memory with strong CRUD semantics, and Zep for session-aware memory with strong support for chat assistants. All three sit on top of a vector database (Pinecone, Weaviate, Chroma, pgvector) and offer hosted and self-hosted options.

How much memory does a typical AI agent need?

For a customer-facing agent — a few KB per active user (preferences, recent interactions, profile) plus an indexed history of conversations. For a coding agent — the full project context plus a memory of past decisions and code patterns. For a research agent — the full session's intermediate findings plus a queryable archive. Costs are small in absolute terms (cents per user per month) but the design choices around retention and retrieval are where serious engineering work lives.

AI Agent Memory in 2026: Vector, Episodic and Semantic

LLMs are stateless. AI agents need to remember. The gap between those two facts is bridged by a memory layer — vector embeddings, session stores, long-term memory frameworks like Letta and Mem0, and increasingly procedural memory. This guide walks through the four kinds of memory a production AI agent in 2026 actually uses, the tools that implement each, and the design decisions that matter.

Forgetting is the single biggest reason naive agent demos fall apart in production. A demo agent works because the demo is short. A real agent fails because it's been running for three hours and has no idea what it agreed to in turn 7. The fix is a memory architecture — not a magic feature, but an explicit set of choices about what to remember, how, and for how long.

This article sits next to our agent stack reference architecture and RAG vs Fine-Tuning vs Agents. For the glossary basics, see memory, RAG, vector database, vector embedding and context window.

The four memory types you actually need

Type	Lifetime	Where it lives	Example
Working memory	Current turn / loop	Context window	"User asked about pricing 2 turns ago"
Session memory	One job / conversation	Orchestration state	"User picked plan B in step 3"
Long-term semantic memory	Months / forever	Vector DB	"User prefers concise emails, hates Mondays"
Procedural memory	Forever	System prompt / fine-tuned weights	"Always cite sources when answering medical questions"

A mature production agent ends up using all four, even when procedural memory is nothing more than hand-written system-prompt rules. Where each is implemented is the interesting design question.

1. Working memory — the context window

The model's working memory is just the context window. Frontier models in 2026 ship with 200K to 2M tokens of context, which sounds like a lot until your agent has been running for an hour and called 14 tools.

Decisions you make here:

What goes in the system prompt? Stable, durable info (identity, tone, hard rules, tool definitions).
What gets re-injected every turn? Recent conversation, current task state, retrieved RAG chunks.
What gets summarized? As the window fills, older turns get condensed into a running summary that survives further into the conversation.
What gets evicted? Tool outputs you've already used, completed sub-task plans, intermediate scratchpads.

The mistake people make: treating the context window as the entire memory store. It's the workspace, not the warehouse. Once a fact has been consumed and acted on, it should leave the window — preserved in another layer if it matters.
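The summarize-and-evict decisions above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `WorkingMemory`, `count_tokens` and `summarize` are hypothetical names, the token count is a crude whitespace split standing in for a real tokenizer, and `summarize` merely truncates where a production system would call an LLM.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    """Assumption: one token per whitespace-separated word (a real tokenizer differs)."""
    return len(text.split())


def summarize(chunks: list[str]) -> str:
    """Placeholder for an LLM summarization call: keeps the first 8 words of each chunk."""
    return " | ".join(" ".join(c.split()[:8]) for c in chunks)


@dataclass
class WorkingMemory:
    system_prompt: str  # stable info, never evicted
    budget: int  # max tokens allowed in the assembled prompt
    summary: str = ""  # running summary of evicted turns
    turns: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Assemble the prompt: system prompt, then summary, then recent turns."""
        parts = [self.system_prompt]
        if self.summary:
            parts.append("Summary of earlier turns: " + self.summary)
        parts.extend(self.turns)
        return "\n".join(parts)

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)
        # Fold the oldest turns into the summary until the prompt fits.
        # The newest turn always stays verbatim, so one oversized turn
        # can still exceed the budget.
        while count_tokens(self.render()) > self.budget and len(self.turns) > 1:
            oldest = self.turns.pop(0)
            chunks = [self.summary, oldest] if self.summary else [oldest]
            self.summary = summarize(chunks)
```

Anything that matters beyond the current job should be written to a longer-lived layer before it is folded into the summary, since the summary is lossy by design.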

2. Session memory — state across tool calls

Within one job (one customer interaction, one coding task, one research run), the agent needs to track its own state.

What lives here:

The current plan or task list.
Outputs of completed sub-tasks that may be needed later.
A scratchpad of intermediate findings.
Pending tool calls, retries, error counts.

Where it lives: the orchestration framework's state. LangGraph state objects, CrewAI shared context, your own Redis cache, or a database row when the state has to survive restarts. Frameworks differ in how explicit they make this: LangGraph is the most explicit, CrewAI tries to hide it.

Failure mode: session memory that doesn't survive restarts. If your agent crashes 8 minutes into a 12-minute research task and loses everything, you've shipped the wrong thing. Use a durable store (Postgres, or Redis with persistence enabled) for session state, not an in-memory dictionary.
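A rough sketch of what "survives restarts" means in practice follows. It uses SQLite from the Python standard library purely so the example runs without a server; that is a swap-in for the Redis or Postgres store discussed above, and `SessionStore`, the table name and the `job-42` state are all hypothetical.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional


class SessionStore:
    """Checkpoints session state to disk so a restarted worker can resume."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS session_state ("
            "session_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, session_id: str, state: dict) -> None:
        # Overwrite the previous checkpoint for this session.
        self.conn.execute(
            "INSERT OR REPLACE INTO session_state (session_id, state) VALUES (?, ?)",
            (session_id, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, session_id: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT state FROM session_state WHERE session_id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


# Usage: checkpoint after each step, then simulate a crash and a restart.
path = os.path.join(tempfile.mkdtemp(), "sessions.db")
store = SessionStore(path)
store.save("job-42", {"plan": ["search", "read", "write"], "step": 2, "errors": 0})
store.conn.close()  # the worker "crashes" here

resumed = SessionStore(path).load("job-42")  # a fresh process picks up at step 2
```

Checkpointing after every completed sub-task, rather than only at the end, is what bounds how much work a crash can cost.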

3. Long-term semantic memory — vector + structured

This is the layer that lets the agent remember things across sessions, days, weeks, months. It's the most-discussed and most-misimplemented memory layer.

Two sub-types:

Episodic memory — discrete events. "On 2026-04-12, user asked about the refund policy and we replied X." Stored as conversation transcripts indexed by user/time + embeddings.
Semantic memory — distilled facts. "User prefers English, hates Mondays, manages a team of 6, last had a refund issue 3 weeks ago." Stored as structured facts.

Most real systems write both — keep raw transcripts for audit, distill facts for fast retrieval.

Mechanics:

After each interaction, an extractor (often the same LLM with a "memory extraction" prompt) writes new facts to the store.
At the start of each new interaction, a retriever pulls the top-k relevant memories and injects them into the system prompt or working context.
A periodic compaction job merges duplicates, resolves contradictions, summarizes long-tail history into shorter facts.
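The write, retrieve and compact steps above can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: `embed` is a bag-of-words counter rather than a real embedding model, `MemoryStore` is a hypothetical class, the extraction step is left out (facts arrive already distilled), and compaction only drops exact duplicates.

```python
import math
from collections import Counter
from dataclasses import dataclass


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system calls an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class Memory:
    user_id: str
    text: str
    vector: Counter


class MemoryStore:
    def __init__(self) -> None:
        self.items: list[Memory] = []

    def write(self, user_id: str, fact: str) -> None:
        """Called after an interaction with each fact the extractor produced."""
        self.items.append(Memory(user_id, fact, embed(fact)))

    def retrieve(self, user_id: str, query: str, k: int = 3) -> list[str]:
        """Called at the start of an interaction: top-k memories for this user."""
        q = embed(query)
        scored = [
            (cosine(q, m.vector), m.text) for m in self.items if m.user_id == user_id
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:k] if score > 0]

    def compact(self) -> None:
        """Periodic job: drops exact duplicates per user (case-insensitive).
        Real compaction would also merge near-duplicates and resolve contradictions."""
        seen = set()
        kept = []
        for m in self.items:
            key = (m.user_id, m.text.lower())
            if key not in seen:
                seen.add(key)
                kept.append(m)
        self.items = kept
```

The retrieved strings would then be injected into the system prompt or working context before the model sees the new user message.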

Tools that implement this in 2026:

Letta (formerly MemGPT) — hierarchical memory with explicit summarization tiers. Strong for agents that have to manage their own context.
Mem0 — personalized memory with clean CRUD API. Easy to drop in.
Zep — session-aware memory aimed at chat assistants.
Cognee — graph-shaped memory for agents that benefit from explicit entity relationships.
Build your own — pgvector + a small write/read API. Common for teams that want full control.

All four frameworks above rely on a vector database underneath; see the agent stack reference for the broader picture.

4. Procedural memory — the underused layer

Procedural memory is what the agent "knows how to do" without being told. In humans this is riding a bike; in agents it's "always log to Sentry on tool failure," "always cite a source for medical claims," "always check inventory before promising delivery."

In 2026, procedural memory is implemented three ways:

System prompt patterns. Hardcoded rules + few-shot examples. Simple, brittle.
Fine-tuning. Bake the procedure into the model weights. See RAG vs Fine-Tuning vs Agents.
Tool-side enforcement. The tool itself refuses bad inputs and explains the rule. Strongest pattern — the agent learns the procedure by being corrected at runtime.

Most teams skip procedural memory by accident. They add a rule, the agent breaks it in week 3, they add the same rule again, and so on. A systematic procedural memory store — and a regression test that fires when a procedure is violated — is one of the highest-leverage investments in a mature agent program.
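As a sketch of tool-side enforcement plus the regression test mentioned above, consider the "check inventory before promising delivery" rule. The tool names, the `INVENTORY` data and the `ProcedureViolation` exception are all hypothetical; the point is only that the rule lives in the tool and that a test fails loudly if it ever stops being enforced.

```python
class ProcedureViolation(Exception):
    """Raised by a tool when the agent skips a required procedure.
    The message is meant to be returned to the agent as the tool result."""


INVENTORY = {"sku-1": 4, "sku-2": 0}  # hypothetical stock levels


def check_inventory(sku: str, checked: set) -> int:
    """Tool: look up stock and record that the check happened in this session."""
    checked.add(sku)
    return INVENTORY.get(sku, 0)


def promise_delivery(sku: str, checked: set) -> str:
    """Tool: refuses to act unless the procedure was followed, and says why."""
    if sku not in checked:
        raise ProcedureViolation(
            f"Rule: check inventory before promising delivery. "
            f"Call check_inventory('{sku}') first."
        )
    if INVENTORY.get(sku, 0) <= 0:
        raise ProcedureViolation(f"Rule: cannot promise delivery, {sku} is out of stock.")
    return f"Delivery promised for {sku}"


def test_delivery_requires_inventory_check() -> None:
    """Regression test: fails if the tool ever accepts an unchecked promise."""
    checked: set = set()
    try:
        promise_delivery("sku-1", checked)
    except ProcedureViolation as err:
        assert "check inventory" in str(err)
    else:
        raise AssertionError("tool accepted a promise without an inventory check")
    check_inventory("sku-1", checked)
    assert promise_delivery("sku-1", checked) == "Delivery promised for sku-1"
```

Keeping one such test per procedure turns "we added that rule already" into something a CI run can verify.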

Memory retrieval — the actually-hard part

Writing to memory is easy. Retrieving the right memory is hard. Three concrete failure modes:

Too much retrieval. You ask for top-50 and inject everything; the model gets distracted. The fix is aggressive reranking and a hard cap on how many memories enter the prompt (5–10 is typical).

Stale memory. A fact from last quarter contradicts a fact from yesterday and the retriever hands the agent the older one. The fix is recency weighting in the retrieval score + periodic compaction.
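One possible form of recency weighting, shown here as a sketch under stated assumptions, multiplies the similarity score by an exponential decay with a configurable half-life, then applies a hard cap. The multiplicative form, the 30-day half-life, and the names `Candidate`, `recency_weighted` and `select_memories` are illustrative choices, not something the frameworks above prescribe.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    similarity: float  # from the vector search, assumed to lie in [0, 1]
    age_days: float  # time since the memory was written


def recency_weighted(c: Candidate, half_life_days: float = 30.0) -> float:
    """Similarity is halved for every half_life_days of age."""
    return c.similarity * 0.5 ** (c.age_days / half_life_days)


def select_memories(
    candidates: list[Candidate], cap: int = 5, half_life_days: float = 30.0
) -> list[Candidate]:
    """Rank by recency-weighted score and keep at most `cap` memories."""
    ranked = sorted(
        candidates,
        key=lambda c: recency_weighted(c, half_life_days),
        reverse=True,
    )
    return ranked[:cap]


# Usage: a close match from last quarter versus a slightly weaker match from yesterday.
old = Candidate("User is on plan A", similarity=0.90, age_days=90)
new = Candidate("User switched to plan B", similarity=0.80, age_days=1)
best = select_memories([old, new], cap=1)[0]
```

With a 30-day half-life the 90-day-old fact scores 0.90 × 0.5³ = 0.1125, so the newer fact wins despite its lower raw similarity. Decay only reorders results; contradictory facts still need the compaction job to resolve them.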

Wrong scope. Personal memory leaks into a different user's session, or organizational memory bleeds into a personal assistant. The fix is strict scope tagging at write time (user_id, tenant_id, session_id) and matching filters at read time. This is non-negotiable in regulated environments.
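Scope tagging at write time and matching filters at read time can be sketched as follows. `Scope`, `ScopedStore` and the field layout are hypothetical; a real deployment would push the same filter into the vector database query rather than filter in Python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    tenant_id: str
    user_id: str
    session_id: Optional[str] = None  # None: the memory outlives any one session


@dataclass
class ScopedMemory:
    scope: Scope
    text: str


class ScopedStore:
    def __init__(self) -> None:
        self._items: list[ScopedMemory] = []

    def write(self, scope: Scope, text: str) -> None:
        # Refuse untagged writes: an unscoped memory can leak anywhere.
        if not scope.tenant_id or not scope.user_id:
            raise ValueError("refusing to write a memory without tenant_id and user_id")
        self._items.append(ScopedMemory(scope, text))

    def read(self, tenant_id: str, user_id: str) -> list[str]:
        # The read filter must match the write tags exactly; there is no
        # "all users" or "all tenants" default.
        return [
            m.text
            for m in self._items
            if m.scope.tenant_id == tenant_id and m.scope.user_id == user_id
        ]
```

Making the scope a required argument on both paths means a forgotten filter shows up as a type error or an exception instead of a cross-user leak.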

For broader retrieval mechanics see our RAG explainer, vector search and reranker entries.

What "memory" looks like at three sizes

Solo / startup: the minimum viable memory

Working: context window (whatever the model gives you).
Session: a Redis hash keyed by conversation ID.
Long-term: pgvector inside an existing Postgres + a 30-line write/read API.
Procedural: system prompt + a "lessons learned" doc that's appended after every incident.

Cost: Effectively free (existing Postgres + a few cents in embeddings per active user per month).

SMB: a memory framework + a vector DB

Working: managed by your orchestration framework (LangGraph state).
Session: framework's state object backed by Redis.
Long-term: Letta or Mem0 with managed Pinecone / Weaviate underneath.
Procedural: prompt patterns + an eval that asserts each rule.

Cost: $200–$2,000/mo.

Enterprise: per-tenant memory with strong governance

Working: per-tenant context, no cross-tenant leakage.
Session: durable, audited, replayable.
Long-term: per-tenant vector index + per-tenant embedding model in extreme cases; full audit trail of every write and read; PII redaction at write time.
Procedural: fine-tuned small model that enforces procedures + runtime tool-side enforcement.

Cost: $10K+/mo for the memory layer alone, but a small fraction of total agent infrastructure.
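To illustrate "PII redaction at write time" together with an audit trail, here is a deliberately simplistic sketch. The two regular expressions are assumptions that catch only obvious email addresses and phone numbers; real redaction needs a dedicated PII detection step. `redact`, `write_memory` and the audit record shape are hypothetical.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace obvious emails and phone numbers before anything is stored."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def write_memory(store: list, audit: list, tenant_id: str, text: str) -> str:
    """Redact first, then store, then record the write in the audit trail."""
    clean = redact(text)
    store.append({"tenant_id": tenant_id, "text": clean})
    # The audit entry records that a write happened, not the content itself.
    audit.append({"op": "write", "tenant_id": tenant_id, "chars": len(clean)})
    return clean


# Usage
store, audit = [], []
saved = write_memory(
    store, audit, "tenant-a", "Reach me at jane.doe@example.com or +1 (555) 010-9999."
)
```

Redacting before the write, rather than at read time, means the raw value never reaches the index, the backups or the embedding provider.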

When memory becomes the bottleneck

Three signals that you're under-investing in memory:

Users repeat themselves. "I told you this last time" is the agent's worst review.
Quality degrades across sessions. Demo is great; week 4 is bad.
Eval scores are unstable run-to-run. A common cause is the agent leaking context from previous runs into new ones.

When any of these show up, audit the memory layer before tuning the model or the prompt.

What this means for buyers

When you evaluate an agent on the leaderboard, ask:

Does the agent have any persistent memory at all? Many "demos" don't.
Can a user inspect, edit and delete what the agent remembers about them? GDPR gives users rights of access, rectification and erasure; users care too.
Is memory per-tenant or shared? In multi-tenant SaaS this matters a lot.
What's the retention policy? Forever is sometimes wrong.
Does the vendor expose memory traces in their observability layer?

Memory is the layer that turns a clever chatbot into a useful colleague. Pick wrong and your agent has Alzheimer's. Pick right and your users feel like the agent actually knows them.

See agent stack reference, observability comparison and AI agent design patterns for the layers above and below.
