LLM Memory & Retrieval
How LLMs find information, why they forget between sessions, and how to solve that.
The memory problem
LLMs don't remember anything by default. Every session starts from zero.
Ask an LLM to book a trip to Taipei. Close the session. Open a new one. Say "buy the ticket." It has no idea what trip you're talking about. The previous conversation is gone.
This happens because LLMs are stateless. They process the text in front of them — the context window — and produce output. Nothing persists between sessions unless you explicitly save it somewhere.
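The statelessness is easy to demonstrate. In the sketch below, `fake_llm` is a stand-in for any chat API: it can only answer from the messages passed in that one request, which is exactly how a real model call behaves.

```python
# A minimal sketch of LLM statelessness. `fake_llm` is a hypothetical
# stand-in for a real model call: it sees only the messages in this
# one request -- nothing from any earlier session.

def fake_llm(messages: list[dict]) -> str:
    """Pretend model: answers only from the messages it was handed."""
    text = " ".join(m["content"] for m in messages)
    return "Taipei" if "Taipei" in text else "What trip?"

# Session 1: the trip is in the context window.
session_1 = [{"role": "user", "content": "Book a trip to Taipei."}]
print(fake_llm(session_1))   # the model can see "Taipei"

# Session 2: a fresh context window. The previous messages are gone.
session_2 = [{"role": "user", "content": "Buy the ticket."}]
print(fake_llm(session_2))   # the model has no idea which trip
```

The only "memory" a real chat session has is the message list the client resends with every request; once the client discards that list, so does the model.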
What survives and what doesn't
Anything written to disk or an external store before the session ends survives. Anything that lived only in the context window is gone.
The session boundary is the key concept.
Three approaches to persistence
Think of the LLM as a contractor with amnesia: each approach is a different way to leave notes on the desk for them to find tomorrow.
File-based memory
Write instructions and outputs to files. The LLM reads them at the start of each session. You control exactly what persists and how it's structured.
Examples: README files, CLAUDE.md, structured markdown wikis, config files. Human-readable, version-controllable, transparent.
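The whole pattern is two operations: append facts at the end of a session, read the file back at the start of the next one. A minimal sketch, assuming a hypothetical `MEMORY.md` file (any re-readable structure works):

```python
# File-based memory in two functions. MEMORY.md is a hypothetical
# filename; the point is that you choose what persists and can read
# it yourself.
from pathlib import Path

MEMORY = Path("MEMORY.md")

def save_note(note: str) -> None:
    """Append one fact at the end of a session."""
    with MEMORY.open("a") as f:
        f.write(f"- {note}\n")

def load_memory() -> str:
    """Read the whole file at session start; prepend it to the prompt."""
    return MEMORY.read_text() if MEMORY.exists() else ""

save_note("User is planning a trip to Taipei in March.")
context = load_memory()   # goes at the top of the next session's prompt
print(context)
```

Because it's a plain file, you can diff it, version it, and edit it by hand, which is where the transparency and portability come from.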
Vector-based memory
Embed all conversations and documents into a vector database. At the start of each session, query the database for relevant context and preload it.
Examples: Pinecone, Chroma, Weaviate. Automatic, scales to millions of chunks, but opaque — you don't control what gets retrieved.
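The retrieval step has the same shape regardless of vendor: embed the query, compare against stored vectors, preload the closest chunks. A toy sketch, using word counts as a stand-in for a learned embedding model and a linear scan as a stand-in for a vector DB:

```python
# Toy vector retrieval. Real systems use a learned embedding model and
# a vector database; word-count vectors and a linear scan stand in here
# to show the retrieve-by-similarity shape.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: word counts instead of a learned vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "User wants to book a trip to Taipei in March.",
    "The deploy script lives in scripts/deploy.sh.",
    "Team standup is at 9:30 every weekday.",
]
index = [(d, embed(d)) for d in docs]   # "store" the vectors

query = embed("buy ticket for Taipei trip")
best = max(index, key=lambda pair: cosine(query, pair[1]))
print(best[0])   # the chunk preloaded into the next session's context
```

The opacity comes from that `max` over similarity scores: the database, not you, decides which chunks the model sees.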
Platform memory
The LLM platform itself stores memory entries between sessions. You tell it "remember this" and it saves a note internally.
Examples: ChatGPT's memory feature, Claude's memory files. Convenient but limited — you're trusting the platform to decide what's relevant.
| | File-based | Vector-based | Platform |
|---|---|---|---|
| Control | Full | Low | Medium |
| Effort | High (manual curation) | Low (automatic) | Low |
| Transparency | Total | Low (similarity scores) | Medium |
| Scale | Hundreds of pages | Millions of chunks | Limited |
| Infrastructure | Just files | Database + embedding pipeline | None |
| Portability | Works anywhere | Vendor-locked | Vendor-locked |
The two knobs: retrieval vs behavior
When you set up any LLM-based system, you're turning two separate knobs. They are independent controls: you can have either without the other, or both.
What to look at
The retrieval problem. Out of all available information, which pieces does the LLM read?
How to process it
The behavior problem. Once the LLM has the information, what should it do with it?
| | What to look at (retrieval) | How to process (behavior) |
|---|---|---|
| File-based system | You pick the file, or LLM reads an index | Instructions file defines the rules |
| Vector system (RAG) | Vector DB picks the chunks | Still needs instructions/prompts |
| No system | User pastes text into chat | Whatever the LLM defaults to |
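The independence of the two knobs is visible in code: retrieval is one function, behavior is a separate instruction block, and the prompt just glues them together. A sketch, where `retrieve` uses naive keyword overlap and `BEHAVIOR` is a hypothetical instruction string:

```python
# The two knobs as two separate pieces. Swapping the retrieval function
# or the behavior string changes one knob without touching the other.

def retrieve(query: str, corpus: list[str]) -> list[str]:
    """Knob 1 -- what to look at: here, naive keyword overlap."""
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

# Knob 2 -- how to process it: the instructions.
BEHAVIOR = "Answer in one sentence, citing only the context below."

def build_prompt(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    return f"{BEHAVIOR}\n\nContext:\n{context}\n\nQuestion: {query}"

corpus = ["The trip to Taipei departs March 3.", "Standup is at 9:30."]
prompt = build_prompt("Taipei trip departure", corpus)
print(prompt)
```

In a file-based system you'd replace `retrieve` with "read the index file"; in a RAG system, with a vector DB query. `BEHAVIOR` stays a prompt either way.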
Vectors inside the LLM vs a vector database
These sound the same but are completely different things.
Vectors inside the LLM
Every token gets converted to a vector during processing. Attention is computed in vector space. This is just how neural networks work.
Ephemeral. These vectors exist during your conversation and are gone when the session ends. You don't control or see them.
A vector database
A separate piece of infrastructure you build and host. You embed your documents, store the vectors on disk, and query them later.
Persistent. Survives session restarts. You decide what goes in. It's like Postgres, but for similarity search instead of exact queries.
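The persistence difference fits in a few lines: vectors written to disk are still there after the process restarts, unlike the activations inside the model. A sketch, assuming a hypothetical `vectors.json` file and fixed toy vectors in place of a real embedding model:

```python
# Persistence sketch: vectors on disk survive a restart. The 3-number
# "embeddings" are made up; a real system would call an embedding model
# and a real vector database instead of a JSON file.
import json
import math
from pathlib import Path

DB = Path("vectors.json")   # hypothetical store, a tiny stand-in vector DB

def save(docs: dict[str, list[float]]) -> None:
    DB.write_text(json.dumps(docs))

def nearest(query: list[float]) -> str:
    docs = json.loads(DB.read_text())   # reload after a "restart"
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.dist(a, [0] * len(a)) * math.dist(b, [0] * len(b)))
    return max(docs, key=lambda name: cos(query, docs[name]))

# "Session 1": embed and persist.
save({"taipei-trip": [0.9, 0.1, 0.0], "deploy-notes": [0.0, 0.2, 0.9]})

# "Session 2": the process could have restarted; the file is still there.
print(nearest([1.0, 0.0, 0.1]))   # -> taipei-trip
```

That `DB.read_text()` in a fresh process is the whole point: you decided what went in, and it's still queryable tomorrow.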