AI Agent Memory: Short-Term vs Long-Term vs Vector Storage
The difference between an AI agent that forgets you and one that remembers your last five conversations, your preferences, and your company's internal knowledge isn't the model — it's the memory architecture. Here is how memory actually works inside production AI agents, in plain English.
Key Takeaways
- AI agents have five memory types: short-term, long-term, episodic, semantic, procedural — each solves a different problem.
- Long context windows do not replace memory — they are expensive, imprecise, and forget once the session ends.
- Production memory is always a combination: summaries + vector search + structured facts.
- Bad memory design is the single biggest reason agents "feel dumb" over time.
Why memory is the hardest part of agent design
A stateless chatbot is easy. A stateful agent that remembers your last conversation, your account preferences, your company's internal documents, and what it decided three steps ago in a complex workflow — that is a systems problem. And it is the problem that separates agents that feel genuinely useful from ones that feel like amnesia-with-extra-steps.
Every memory decision — what to store, how to retrieve it, when to summarise it, how long to keep it — has consequences for cost, latency, accuracy, and privacy. The best agents we ship make these decisions explicitly. The worst ones stuff everything into context and hope for the best.
The five types of AI agent memory
Cognitive science distinguishes memory by duration (short-term vs long-term) and by content (episodic, semantic, procedural). Modern AI agent design borrows this taxonomy directly because it turns out to map well to the problems agents actually face.
| Memory type | What it stores | Typical storage | Lifetime |
|---|---|---|---|
| Short-term | Current conversation turns, in-flight reasoning | LLM context window | Single session |
| Long-term | Persistent summaries of past interactions | SQL, key-value, object store | Weeks to forever |
| Episodic | Specific past events and interactions | Event store, vector DB | User-scoped, retrievable |
| Semantic | Facts about user, org, domain | Knowledge graph, SQL, vector | Indefinite, versioned |
| Procedural | Learned patterns and skills | Fine-tuned weights, tool policies | Indefinite, retrained |
Short-term memory (the context window)
Short-term memory is what is in the LLM's context at this exact moment — the system prompt, the recent conversation, any retrieved content, and the tool outputs so far. In 2026 context windows range from 128K to 2M tokens depending on the model, which sounds unlimited but isn't. Every token costs money, raises latency, and gives the model more opportunity to get confused.
The three problems with relying on context alone
- Cost. A 100K-token conversation replayed 20 times a day at USD 3 / 1M input tokens is USD 6 per user per day — unsustainable at scale.
- Lost-in-the-middle. Even frontier models lose accuracy when relevant information is buried in a long context.
- Session-bound. Context disappears when the session ends. The agent has amnesia tomorrow.
Best practices for short-term memory
- Keep context lean — only what is genuinely relevant to this turn.
- Summarise aggressively once a conversation exceeds ~8K tokens.
- Structure the context: system prompt → user profile → retrieved facts → recent turns → current turn.
- Never blindly prepend the full history — compress.
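The practices above can be sketched as a small context builder. This is a minimal illustration, not a production implementation: the 4-characters-per-token estimate and the summary placeholder stand in for a real tokeniser and a real LLM summarisation call, and all function names are assumptions.

```python
# Sketch of a short-term context builder following the ordering above:
# system prompt -> user profile -> retrieved facts -> recent turns -> current turn.
# The 4-chars-per-token heuristic and the 8K threshold are illustrative.

SUMMARISE_AT_TOKENS = 8_000

def estimate_tokens(text: str) -> int:
    """Rough heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def build_context(system_prompt: str, profile: str, facts: str,
                  turns: list[str], current_turn: str) -> str:
    history = "\n".join(turns)
    if estimate_tokens(history) > SUMMARISE_AT_TOKENS:
        # In production this would be an LLM summarisation call;
        # here a placeholder plus the last few turns stands in for it.
        history = "[summary of earlier conversation]\n" + "\n".join(turns[-4:])
    return "\n\n".join([system_prompt, profile, facts, history, current_turn])
```

The point is the ordering and the hard threshold: the builder never prepends unbounded history, and compression kicks in automatically at a fixed budget.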
Long-term memory
Long-term memory is where the agent stores what happened after a session ends, so the next session doesn't start from zero. The standard pattern: at the end of each session (or at periodic checkpoints), the agent produces a summary. That summary is stored, indexed, and retrieved when the user returns.
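The end-of-session pattern is simple enough to sketch directly. SQLite stands in here for whatever SQL store you actually use; the schema and function names are illustrative assumptions, not a prescribed design.

```python
# Sketch of the end-of-session pattern above: write one summary row per
# session, then load the latest summaries when the user returns.
# SQLite is a stand-in for the real SQL store; the schema is illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE session_summaries (
    user_id TEXT, created_at TEXT, summary TEXT)""")

def save_session_summary(user_id: str, summary: str) -> None:
    db.execute(
        "INSERT INTO session_summaries VALUES (?, datetime('now'), ?)",
        (user_id, summary),
    )

def load_recent_summaries(user_id: str, limit: int = 3) -> list[str]:
    rows = db.execute(
        "SELECT summary FROM session_summaries WHERE user_id = ? "
        "ORDER BY created_at DESC, rowid DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [r[0] for r in rows]
```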
What to store in long-term memory
- A running summary of the user's history and preferences.
- Specific commitments the agent made (follow-ups, promises, action items).
- Structured facts extracted from conversations (user name, company, role, preferences).
- Outstanding tasks and their state.
What NOT to store in long-term memory
- Full verbatim transcripts (expensive, privacy risk, low value).
- PII you would not put in a database.
- Information that has a short shelf life — today's weather, transient preferences.
Episodic memory
Episodic memory stores specific events the agent participated in — a conversation with a customer on Tuesday, a support ticket resolved last month, a meeting scheduled last quarter. When the user says "remember when I asked about X?", the agent can retrieve the actual episode.
How episodic memory is typically implemented
- At the end of each interaction, generate an embedding and metadata (timestamp, user ID, topic tags).
- Store in a vector database with the rich metadata attached.
- On retrieval, use hybrid search — semantic similarity + metadata filter (user, date, topic).
- Inject the top 3–5 most relevant episodes into the next turn's context.
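The retrieval steps above can be illustrated with a toy in-memory store. A real system would use a vector database and a real embedding model; here the embeddings are plain lists of floats and the record shape is an assumption made for the example.

```python
# Toy stand-in for the hybrid-retrieval steps above: filter candidates by
# metadata (user scope, optional topic), then rank by cosine similarity
# and return the top k episodes.
import math

episodes: list[dict] = []  # {"embedding", "user_id", "topic", "text"}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_episodes(query_emb, user_id, k=3, topic=None):
    # Metadata filter first, semantic ranking second -- the "hybrid" part.
    candidates = [
        e for e in episodes
        if e["user_id"] == user_id and (topic is None or e["topic"] == topic)
    ]
    candidates.sort(key=lambda e: cosine(query_emb, e["embedding"]), reverse=True)
    return candidates[:k]
```

In production the metadata filter runs inside the vector DB query rather than in application code, but the shape of the operation is the same.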
Semantic memory
Semantic memory is the agent's knowledge — not events, but facts. "The customer's account is on the Premium plan." "Our refund policy is 30 days." "The engineering lead for this account is Jamie." Semantic memory is usually the highest-value memory type because it lets the agent act like it knows your business.
Three layers of semantic memory
- User semantic memory: facts about the individual user — name, preferences, role, history.
- Organisational semantic memory: facts about the customer's company — plan, usage patterns, key contacts.
- Domain semantic memory: facts about the world the agent operates in — product docs, policies, SOPs.
Most semantic memory flows through retrieval-augmented generation (RAG). The agent queries a vector database, gets the most relevant facts, and injects them into context. For a deeper look, read how to build an AI agent.
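One practical detail of the three-layer model: when the layers disagree, the more specific layer should win. A minimal sketch of that merge, with illustrative keys and values:

```python
# Sketch of merging the three semantic layers above, most general first,
# so user-level facts override org-level, which override domain defaults.
# Keys and values are illustrative.

def merge_semantic_memory(domain: dict, org: dict, user: dict) -> dict:
    facts: dict = {}
    for layer in (domain, org, user):  # later layers win on conflict
        facts.update(layer)
    return facts
```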
Procedural memory
Procedural memory is learned behaviour — the agent's equivalent of muscle memory. When an agent has processed 10,000 refund cases, it has implicit knowledge about what good looks like. Procedural memory is usually captured in one of three ways:
- Fine-tuning on examples of high-quality past behaviour.
- Few-shot examples pulled dynamically into the prompt as reference.
- Tool policies that encode learned decision rules as code.
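The second option, dynamic few-shot, is the easiest to sketch. Word-overlap (Jaccard) similarity stands in here for real semantic similarity, and the example store is purely illustrative:

```python
# Sketch of dynamic few-shot selection: pull the past examples most similar
# to the current case into the prompt as references. Jaccard overlap is a
# stand-in for real embedding similarity.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_few_shot(case: str, past_examples: list[str], k: int = 2) -> list[str]:
    return sorted(past_examples, key=lambda ex: jaccard(case, ex), reverse=True)[:k]
```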
Procedural memory is the slowest to build and the most expensive to change — but it is often the most defensible. It is what makes an agent that has been running in your business for a year feel fundamentally different from one you stood up yesterday.
Build an agent that actually remembers
Bananalabs designs proper multi-tier memory architectures for every agent we ship — no amnesia, no runaway token bills. Book a free strategy call and we will design the right memory stack for your use case.
Book a Free Strategy Call →
Vector databases in 2026: the landscape
Vector databases store and search embeddings — dense numerical representations of text, images, or other content. They are the workhorse of semantic and episodic memory. The 2026 landscape has consolidated around a handful of leaders.
| Vector DB | Strengths | Best for |
|---|---|---|
| Pinecone | Managed, zero ops, strong filtering | Teams that want it to just work |
| Weaviate | Hybrid search, modular, open source | Self-hosted deployments with rich metadata |
| Qdrant | Fast filtering, Rust-based, small binaries | High-throughput production with self-hosting |
| pgvector (Postgres) | Already integrated with SQL data | Teams already on Postgres, mid-scale use |
| Milvus / Zilliz | Massive scale, GPU-accelerated | Billion-scale vector workloads |
| MongoDB Atlas Vector | Integrated with doc store | Existing MongoDB shops |
Our default at Bananalabs is pgvector unless scale or feature requirements push us elsewhere. One database is better than two. If the client is already on Pinecone or Weaviate, we stay there. The product rarely decides; the context does.
Production memory patterns we actually use
Pattern 1: The summary + retrieval hybrid
Our most common pattern. Each session ends with a two-paragraph summary written by the agent. Summary is stored in a SQL row per user. For each new turn, we retrieve the summary plus the top 3 relevant past episodes via vector search. Context stays small; memory feels deep.
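A sketch of how the two sources come together per turn. `load_summary` and `search_episodes` are hypothetical stand-ins for the SQL read and vector search described above:

```python
# Sketch of pattern 1: one stored summary plus the top-3 episodes per turn,
# rendered as a compact memory block for the context window.

def build_memory_block(user_id: str, query: str,
                       load_summary, search_episodes) -> str:
    summary = load_summary(user_id)               # SQL row read
    past = search_episodes(user_id, query, k=3)   # vector search
    lines = [f"User summary: {summary}"]
    lines += [f"Relevant episode: {e}" for e in past]
    return "\n".join(lines)
```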
Pattern 2: Structured user profile + RAG
For agents where structured facts dominate (B2B sales, account management), we extract user/org facts into a structured profile (Postgres). The agent reads the profile directly into context. RAG layers on top for less structured knowledge like past conversation excerpts.
Pattern 3: The working memory sketchpad
For long autonomous runs (research agents, multi-step ops agents), we give the agent a "sketchpad" — an editable scratch buffer it uses to track state across steps. This is technically short-term memory but persists across LLM calls within a single task. It is how we avoid the agent re-deriving its own conclusions repeatedly.
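The sketchpad itself can be as simple as a keyed buffer the agent rewrites between steps. The class and method names below are illustrative:

```python
# Sketch of the working-memory sketchpad: a small editable buffer the agent
# reads and rewrites between LLM calls within a single task.

class Sketchpad:
    def __init__(self) -> None:
        self.notes: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self.notes[key] = value  # overwrite: conclusions can be revised

    def render(self) -> str:
        """Serialise the pad for injection into the next LLM call."""
        return "\n".join(f"- {k}: {v}" for k, v in self.notes.items())
```

Because entries are keyed, a revised conclusion replaces the old one instead of accumulating — which is exactly how the pattern avoids the agent re-deriving its own state.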
Pattern 4: Episodic decay
Not all past episodes deserve equal weight. We apply time-decay to retrieval scoring, so last week's conversations rank higher than last year's unless they are explicitly relevant. This prevents the agent from dredging up stale context.
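One common way to implement this — and an assumption on our part, since any decay curve works — is an exponential penalty on the retrieval score. The 30-day half-life below is an illustrative tuning choice:

```python
# Sketch of episodic decay: multiply the similarity score by an exponential
# time penalty so recent episodes outrank old ones at equal relevance.
# The 30-day half-life is an illustrative tuning choice.

HALF_LIFE_DAYS = 30.0

def decayed_score(similarity: float, age_days: float) -> float:
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    return similarity * decay
```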
Pattern 5: Opt-in persistent memory
For consumer agents, we make long-term memory an explicit user opt-in with clear controls — what is stored, how to view it, how to delete it. This is a privacy best practice and, post-GDPR, a legal one.
The real cost of memory
Memory seems like a "cheap" feature until you run the math on retrieval-per-turn at scale. Every turn with 5K tokens of retrieved context costs meaningfully more than a turn with 500. Multiply by users and turns per day and memory becomes the dominant line item.
- Embedding storage: pennies per user per month for typical volumes.
- Vector search compute: nearly free at <10M vectors.
- Retrieval tokens in context: the real cost — often 30–60% of total token spend.
- Summarisation calls: cheap if you batch them, expensive if you summarise every turn.
Our rule of thumb: budget 30–50% of total LLM spend for retrieval tokens. If you are not measuring this, start today. For the full cost picture, see the hidden costs of building AI agents.
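The arithmetic behind these numbers is worth keeping in a spreadsheet (or a one-liner). The rates below are illustrative; plug in your own:

```python
# Sketch of the cost arithmetic above: retrieval tokens per turn, times
# turns per day, times the price per million input tokens.

def daily_retrieval_cost(tokens_per_turn: int, turns_per_day: int,
                         usd_per_million_tokens: float) -> float:
    return tokens_per_turn * turns_per_day * usd_per_million_tokens / 1_000_000
```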
Memory pitfalls to avoid
1. "Just put everything in context"
The lazy approach. It breaks as soon as you have more than one user or any real volume. Long context ≠ memory.
2. No summarisation pipeline
Conversations grow unbounded. Token costs climb. Accuracy drops from lost-in-the-middle. Summarise at logical breakpoints.
3. One retrieval strategy for everything
Semantic search alone misses structured facts. Keyword search alone misses paraphrase. Hybrid search is not optional in production.
4. Storing PII you cannot delete
GDPR requires the right to be forgotten — including from vector indexes. Plan deletion from day one.
5. Infinite retention
Old memories go stale and lose relevance over time. Define decay or TTL on every memory type.
6. No memory observability
If you cannot see what the agent retrieved, you cannot debug why it answered wrong. Log every retrieval with the query, the result, and the ranking signal.
7. Treating memory as a database problem only
Memory is an information architecture problem. The vector DB is the easy part. The design of what to remember, when to retrieve, and how to summarise is the hard part.
Memory decision framework
Does the agent need state between sessions?
- No → short-term only. Simpler, cheaper.
- Yes → at minimum, summary-based long-term memory.
Do users expect the agent to recall specific past events?
- Yes → add episodic memory via vector store.
Does the agent need domain knowledge or org-specific facts?
- Yes → semantic memory via RAG.
Will the agent handle thousands of similar cases where consistency matters?
- Yes → procedural memory via fine-tuning or dynamic few-shot.
Do you have regulatory or privacy constraints?
- Yes → plan memory with deletion, audit, and consent primitives from day one. See AI agent security.
The bottom line on AI agent memory
The agents that people love are the ones that feel like they know you. That feeling is not magic — it is a memory architecture with short-term context, distilled long-term summaries, retrievable episodes, structured facts, and learned patterns, all orchestrated cleanly. Teams that get memory right end up with agents that get smarter over time. Teams that skip it end up with agents their users stop trusting.
If memory design feels like a lot — because it is — that is exactly the kind of problem a specialist partner solves for you. Our design baseline at Bananalabs includes a multi-tier memory system out of the gate. The rest of the agent is downstream of getting that right.
Frequently Asked Questions
What is AI agent memory?
AI agent memory is the system an agent uses to store, retrieve, and use information across time. It includes short-term memory (the current context window), long-term memory (persistent summaries and facts), episodic memory (records of past interactions), semantic memory (facts about the world or user), and procedural memory (learned patterns and skills). Good memory design is the difference between an agent that forgets you every conversation and one that gets smarter over time.
What is the difference between a vector database and agent memory?
A vector database is one storage technology used inside an agent memory system. The memory system is the broader architecture that decides what to remember, when to retrieve it, how to compress it, and how to keep it relevant. A vector database handles semantic search over embeddings, but long-term memory usually also needs structured stores (SQL, graph) and summary stores for efficient recall.
How much memory does an AI agent need?
Most business AI agents need roughly 10 MB to 500 MB of per-user memory, composed of conversation summaries, user facts, and retrieved knowledge. Enterprise-wide agents serving thousands of users typically need 10 GB to 500 GB of vector storage plus structured metadata. The actual number depends on how much context is preserved per interaction and how aggressively you summarise.
Should I use long context windows or vector memory?
Use long context windows for single-session reasoning over bounded content like one document or one conversation. Use vector memory for knowledge that outlives the session — past conversations, user preferences, organisational knowledge. Most production agents combine both: retrieval pulls the right information into the context window for each turn. Long context alone is expensive and imprecise at scale.
Which vector database is best for AI agents in 2026?
Pinecone, Weaviate, Qdrant, and pgvector (Postgres) are the leading choices. Pinecone leads for managed simplicity, Qdrant and Weaviate for self-hosted control and hybrid search, and pgvector for teams who already run Postgres and want one fewer system. For most Bananalabs builds, pgvector is the default unless scale or specific features require otherwise.