
How to Build AI Agents That Actually Remember: Memory Architecture for Production LLM Apps

Your AI agent just forgot everything.

The user has been chatting for twenty minutes, carefully explaining their codebase architecture, their deployment constraints, their team's coding style. Then they ask a follow-up question — and the agent responds as if they've never met. The context window overflowed. Everything before message #47 is gone. The user starts over, frustrated.

If you've built anything with LLMs beyond a toy demo, you've hit this wall. Context windows are not memory. A 128K token window feels massive until you realize a single large codebase scan fills it in seconds. A 1M token window sounds infinite until you calculate the API cost of stuffing it every request. And even with the biggest windows, there's no persistence — restart the session, and your agent has amnesia.

This is the fundamental challenge of production AI applications in 2026: How do you build agents that actually remember?

Not "remember for the next 5 messages." Remember across sessions. Remember user preferences from three weeks ago. Remember that the database migration failed last Tuesday and the workaround is still in place. Remember like a human colleague would.

This guide covers the memory architecture patterns that production teams are actually using — from simple sliding windows to hierarchical memory systems to graph-based knowledge stores. We'll build real implementations, compare the major frameworks (Mem0, LangChain Memory, Letta), and show you exactly where each pattern breaks down.

Why Context Windows Are Not Memory

Before diving into solutions, let's be precise about the problem.

The Context Window Illusion

Every LLM has a context window — the maximum number of tokens it can process in a single request. In 2026, these have grown significantly:

Model          | Context Window | Approximate Cost (Input)
---------------|----------------|-------------------------
GPT-4.1        | 1M tokens      | ~$2.00/M tokens
Claude Opus 4  | 200K tokens    | ~$5.00/M tokens
Gemini 2.5 Pro | 1M tokens      | ~$1.25/M tokens
Llama 4 Scout  | 10M tokens     | Self-hosted

"Just use a bigger window" seems like an obvious solution. Here's why it isn't:

1. Cost scales linearly (or worse). Stuffing 500K tokens into every API call when you only need 2K of relevant context is burning money. At $2/M tokens, a 500K-token request costs $1. If your agent handles 1,000 conversations per day, that's $1,000/day just on input tokens — for context that's 99% irrelevant.

2. Performance degrades with noise. Research consistently shows that LLMs perform worse when given large amounts of irrelevant context. The "needle in a haystack" problem is real: models struggle to find and use specific information buried in massive context windows. More context ≠ better answers.

3. Latency increases with context size. Time-to-first-token scales with input length. A 500K-token input takes noticeably longer than a 5K-token input, which destroys the real-time feel of chat applications.

4. No persistence across sessions. Context windows are ephemeral. Close the tab, and everything is gone. There's no mechanism to carry information from one session to the next without external storage.

5. No selective forgetting. Humans don't remember everything — they remember what's important. A context window has no concept of importance; it's just a FIFO buffer that drops the oldest tokens first, regardless of whether they contain critical information or idle chatter.
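To make the cost argument concrete, here is a back-of-the-envelope calculator. The prices and volumes are illustrative (taken from the table above); check your provider's current pricing before relying on these numbers.

```typescript
// Back-of-the-envelope daily input cost for stuffing the window on every
// request. Inputs are illustrative, not provider-specific.
function dailyInputCost(
  tokensPerRequest: number,
  requestsPerDay: number,
  dollarsPerMillionTokens: number
): number {
  return (tokensPerRequest * dollarsPerMillionTokens * requestsPerDay) / 1_000_000;
}

// Stuffing a 500K-token window, 1,000 conversations/day, at $2.00/M tokens:
console.log(dailyInputCost(500_000, 1_000, 2.0)); // 1000 → $1,000/day

// Retrieving only the ~2K relevant tokens instead:
console.log(dailyInputCost(2_000, 1_000, 2.0)); // 4 → $4/day
```

The 250x difference is the whole economic case for retrieval-based memory over window stuffing.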

What Real Memory Looks Like

Human memory isn't a single buffer. It's a layered system:

  • Working memory (immediate context): What you're thinking about right now. Small capacity, fast access.
  • Short-term memory (recent events): What happened in the last few minutes/hours. Moderate capacity, moderate access.
  • Long-term memory (persistent knowledge): Facts, skills, and experiences accumulated over time. Massive capacity, slower retrieval.

Effective AI agent memory mimics this architecture. Let's build it.

Pattern 1: Sliding Window with Smart Truncation

The simplest memory pattern — and often sufficient for basic chatbots.

How It Works

Keep the most recent N messages in the context window. When the conversation exceeds the limit, drop the oldest messages.

interface Message {
  role: 'user' | 'assistant' | 'system';
  content: string;
  timestamp: number;
  tokenCount: number;
}

class SlidingWindowMemory {
  private messages: Message[] = [];
  private maxTokens: number;
  private systemPrompt: Message;

  constructor(maxTokens: number, systemPrompt: string) {
    this.maxTokens = maxTokens;
    this.systemPrompt = {
      role: 'system',
      content: systemPrompt,
      timestamp: Date.now(),
      tokenCount: this.estimateTokens(systemPrompt),
    };
  }

  addMessage(role: 'user' | 'assistant', content: string): void {
    this.messages.push({
      role,
      content,
      timestamp: Date.now(),
      tokenCount: this.estimateTokens(content),
    });
    this.trim();
  }

  getContext(): Message[] {
    return [this.systemPrompt, ...this.messages];
  }

  private trim(): void {
    let totalTokens =
      this.systemPrompt.tokenCount +
      this.messages.reduce((sum, m) => sum + m.tokenCount, 0);
    while (totalTokens > this.maxTokens && this.messages.length > 2) {
      const removed = this.messages.shift()!;
      totalTokens -= removed.tokenCount;
    }
  }

  private estimateTokens(text: string): number {
    // Rough estimate: ~4 chars per token for English
    return Math.ceil(text.length / 4);
  }
}

When to Use It

  • Simple Q&A chatbots where historical context doesn't matter much
  • Customer support bots with short, focused conversations
  • Prototyping and MVPs

Where It Breaks

  • Important early context gets dropped. If the user explained their requirements in the first message, that's the first thing to go.
  • No cross-session persistence. Every new session starts blank.
  • No intelligence in what's kept. A confused tangent gets the same priority as critical specifications.

Pattern 2: Conversation Summarization

The first step toward intelligent memory — compress old context instead of dropping it.

How It Works

When the conversation grows too long, summarize the older portion and replace it with the summary. The agent works with: [System Prompt] + [Summary of older messages] + [Recent messages].

class SummarizingMemory {
  private messages: Message[] = [];
  private summary: string = '';
  private maxTokens: number;
  private recentWindowSize: number;
  private llm: LLMClient;

  constructor(maxTokens: number, recentWindowSize: number, llm: LLMClient) {
    this.maxTokens = maxTokens;
    this.recentWindowSize = recentWindowSize;
    this.llm = llm;
  }

  async addMessage(role: 'user' | 'assistant', content: string): Promise<void> {
    this.messages.push({
      role,
      content,
      timestamp: Date.now(),
      tokenCount: this.estimateTokens(content),
    });
    if (this.shouldSummarize()) {
      await this.compactHistory();
    }
  }

  async getContext(): Promise<Message[]> {
    const context: Message[] = [];
    if (this.summary) {
      context.push({
        role: 'system',
        content: `Previous conversation summary:\n${this.summary}`,
        timestamp: 0,
        tokenCount: this.estimateTokens(this.summary),
      });
    }
    context.push(...this.messages.slice(-this.recentWindowSize));
    return context;
  }

  private shouldSummarize(): boolean {
    const totalTokens = this.messages.reduce((sum, m) => sum + m.tokenCount, 0);
    return totalTokens > this.maxTokens * 0.8;
  }

  private async compactHistory(): Promise<void> {
    const messagesToSummarize = this.messages.slice(
      0,
      this.messages.length - this.recentWindowSize
    );
    if (messagesToSummarize.length === 0) return;

    const conversationText = messagesToSummarize
      .map(m => `${m.role}: ${m.content}`)
      .join('\n');
    const existingSummary = this.summary
      ? `Existing summary:\n${this.summary}\n\n`
      : '';

    this.summary = await this.llm.complete({
      prompt: `${existingSummary}New conversation to incorporate:\n${conversationText}\n\nCreate a comprehensive summary that preserves:\n1. Key decisions and conclusions\n2. User preferences and requirements\n3. Technical specifications mentioned\n4. Action items and pending tasks\n5. Important context for future reference\n\nBe concise but don't lose critical details.`,
      maxTokens: 500,
    });

    // Keep only recent messages
    this.messages = this.messages.slice(-this.recentWindowSize);
  }

  private estimateTokens(text: string): number {
    return Math.ceil(text.length / 4);
  }
}

The Progressive Summarization Trick

Instead of one flat summary, use hierarchical summarization:

class HierarchicalSummarizingMemory {
  private detailedSummary: string = '';   // Last ~30 messages
  private broadSummary: string = '';      // Everything before that
  private recentMessages: Message[] = []; // Last ~10 messages
  private llm: LLMClient;

  async compactHistory(): Promise<void> {
    // Step 1: Merge current detailed summary into broad summary
    if (this.detailedSummary) {
      this.broadSummary = await this.llm.complete({
        prompt: `Existing high-level summary:\n${this.broadSummary}\n\nDetailed summary to incorporate:\n${this.detailedSummary}\n\nCreate a high-level summary preserving only the most important facts, decisions, and user preferences. Maximum 200 words.`,
        maxTokens: 300,
      });
    }

    // Step 2: Summarize recent overflow into detailed summary
    const overflow = this.recentMessages.slice(0, -10);
    this.detailedSummary = await this.llm.complete({
      prompt: `Conversation segment:\n${overflow.map(m => `${m.role}: ${m.content}`).join('\n')}\n\nCreate a detailed summary preserving specific technical details, code snippets mentioned, and exact requirements. Maximum 500 words.`,
      maxTokens: 600,
    });

    this.recentMessages = this.recentMessages.slice(-10);
  }

  async getContext(): Promise<string> {
    let ctx = '';
    if (this.broadSummary) {
      ctx += `## Background Context\n${this.broadSummary}\n\n`;
    }
    if (this.detailedSummary) {
      ctx += `## Recent Context (Detailed)\n${this.detailedSummary}\n\n`;
    }
    ctx += `## Current Conversation\n`;
    ctx += this.recentMessages.map(m => `${m.role}: ${m.content}`).join('\n');
    return ctx;
  }
}

When to Use It

  • Long-running chat applications
  • Customer support with complex, multi-turn conversations
  • Coding assistants that need project context

Where It Breaks

  • Summarization loses detail. The summary says "user wants a REST API" but forgets that they specifically asked for HATEOAS-compliant responses with ETag headers.
  • Compounding summarization errors. Summarizing a summary of a summary introduces progressive information loss. After 5 compaction cycles, the original nuance is gone.
  • Cost of summarization. Each compaction requires an LLM call, adding latency and cost.

Pattern 3: Entity and Fact Extraction

Instead of summarizing entire conversations, extract and store structured facts.

How It Works

After each interaction, extract key entities and facts into a persistent store. Before each response, retrieve relevant facts based on the current query.

interface Fact {
  id: string;
  subject: string;
  predicate: string;
  object: string;
  confidence: number;
  source: string; // which conversation
  timestamp: number;
  supersededBy?: string; // ID of the fact that replaced this one
}

class EntityMemory {
  private facts: Map<string, Fact[]> = new Map();
  private llm: LLMClient;
  private vectorStore: VectorStore;

  async extractFacts(messages: Message[]): Promise<Fact[]> {
    const conversation = messages
      .map(m => `${m.role}: ${m.content}`)
      .join('\n');

    const response = await this.llm.complete({
      prompt: `Extract key facts from this conversation as structured data.

Conversation:
${conversation}

Return facts as JSON array:
[{
  "subject": "entity name",
  "predicate": "relationship or attribute",
  "object": "value or related entity",
  "confidence": 0.0-1.0
}]

Focus on:
- User preferences and requirements
- Technical decisions made
- Project specifications
- Deadlines and constraints
- People and their roles mentioned`,
      responseFormat: 'json',
    });

    return JSON.parse(response).map((f: any) => ({
      ...f,
      id: crypto.randomUUID(),
      source: 'current_session',
      timestamp: Date.now(),
    }));
  }

  async storeFacts(facts: Fact[]): Promise<void> {
    for (const fact of facts) {
      const key = `${fact.subject}::${fact.predicate}`;

      // Check for contradictions: an existing fact with a different value
      // and lower confidence is marked as superseded by the new one
      const existing = this.facts.get(key) || [];
      const contradicting = existing.find(
        e => e.object !== fact.object && e.confidence < fact.confidence
      );
      if (contradicting) {
        contradicting.supersededBy = fact.id;
      }

      if (!this.facts.has(key)) {
        this.facts.set(key, []);
      }
      this.facts.get(key)!.push(fact);

      // Also store in vector DB for semantic retrieval
      await this.vectorStore.upsert({
        id: fact.id,
        text: `${fact.subject} ${fact.predicate} ${fact.object}`,
        metadata: fact,
      });
    }
  }

  async recallRelevant(query: string, limit: number = 20): Promise<Fact[]> {
    const results = await this.vectorStore.search(query, limit);
    return results
      .map(r => r.metadata as Fact)
      .filter(f => !f.supersededBy) // Exclude superseded facts
      .sort((a, b) => b.confidence - a.confidence);
  }

  async buildContext(query: string, recentMessages: Message[]): Promise<string> {
    const relevantFacts = await this.recallRelevant(query);

    let context = '';
    if (relevantFacts.length > 0) {
      context += '## Known Facts About This User/Project\n';
      for (const fact of relevantFacts) {
        context += `- ${fact.subject} ${fact.predicate}: ${fact.object}\n`;
      }
      context += '\n';
    }

    context += '## Current Conversation\n';
    context += recentMessages.map(m => `${m.role}: ${m.content}`).join('\n');
    return context;
  }
}

When to Use It

  • Personal AI assistants that should learn about the user over time
  • Project management bots that track decisions across many meetings
  • Any application where specific facts matter more than conversational flow

Where It Breaks

  • Extraction isn't perfect. LLMs miss nuances and sometimes hallucinate facts that weren't stated.
  • Contradictions are hard. "The deadline is Friday" followed by "actually, let's push to Monday" requires the system to detect and resolve conflicts.
  • Fact staleness. Without explicit expiration, outdated facts pollute the context. The user changed their preferred framework two months ago, but the old fact is still there.
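One mitigation for staleness is to give each fact category a time-to-live and drop facts that haven't been re-confirmed within it. A sketch (not part of the EntityMemory class above; the category names and TTL values are invented for illustration):

```typescript
// Sketch: expire facts by category-specific TTL unless re-confirmed.
// Categories and TTL values are illustrative, not prescriptive.
interface StoredFact {
  text: string;
  category: 'preference' | 'decision' | 'constraint';
  lastConfirmedAt: number; // epoch ms, bumped whenever the user restates it
}

const TTL_DAYS: Record<StoredFact['category'], number> = {
  preference: 60,  // preferences drift ("I switched frameworks")
  decision: 180,   // decisions are more durable
  constraint: 365, // hard constraints rarely change
};

const DAY_MS = 24 * 60 * 60 * 1000;

function isStale(fact: StoredFact, now: number = Date.now()): boolean {
  return (now - fact.lastConfirmedAt) / DAY_MS > TTL_DAYS[fact.category];
}

// Filter the context-building step down to facts still within their TTL.
function freshFacts(facts: StoredFact[], now: number = Date.now()): StoredFact[] {
  return facts.filter(f => !isStale(f, now));
}
```

Expired facts don't have to be deleted; archiving them lets the agent ask "you mentioned X a while ago — is that still true?" instead of silently assuming.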

Pattern 4: Hierarchical Memory Architecture

This is what production systems actually use. It combines multiple patterns into a layered system that mirrors human memory.

The Three-Tier Model

┌─────────────────────────────────────────────┐
│           Tier 1: Working Memory            │
│  (Current context window, last ~10 msgs)    │
│  Access: Instant | Capacity: Small          │
├─────────────────────────────────────────────┤
│          Tier 2: Short-Term Memory          │
│  (Session summaries, recent facts)          │
│  Access: Fast retrieval | Capacity: Medium  │
├─────────────────────────────────────────────┤
│          Tier 3: Long-Term Memory           │
│  (Knowledge graph, user profile, history)   │
│  Access: Semantic search | Capacity: Large  │
└─────────────────────────────────────────────┘

Implementation

class HierarchicalMemory {
  private workingMemory: Message[] = [];
  private shortTermMemory: ShortTermStore;
  private longTermMemory: LongTermStore;
  private llm: LLMClient;

  constructor(config: MemoryConfig) {
    this.shortTermMemory = new ShortTermStore(config.shortTermTTL);
    this.longTermMemory = new LongTermStore(config.vectorStore, config.graphDB);
    this.llm = config.llm;
  }

  async processMessage(message: Message): Promise<void> {
    // 1. Add to working memory
    this.workingMemory.push(message);

    // 2. Extract facts for short-term storage
    if (this.workingMemory.length % 5 === 0) {
      const recentFacts = await this.extractFacts(this.workingMemory.slice(-5));
      await this.shortTermMemory.store(recentFacts);
    }

    // 3. Promote important facts to long-term memory
    if (this.workingMemory.length % 20 === 0) {
      await this.consolidate();
    }

    // 4. Trim working memory if needed
    if (this.getWorkingMemoryTokens() > 8000) {
      await this.compactWorkingMemory();
    }
  }

  async buildContext(query: string): Promise<ContextBundle> {
    // Retrieve from all three tiers
    const [shortTermResults, longTermResults] = await Promise.all([
      this.shortTermMemory.search(query, 10),
      this.longTermMemory.search(query, 15),
    ]);

    // Deduplicate and rank by relevance
    const allFacts = this.deduplicateAndRank([
      ...shortTermResults,
      ...longTermResults,
    ]);

    return {
      systemContext: this.buildSystemContext(allFacts),
      workingMemory: this.workingMemory.slice(-10),
      relevantFacts: allFacts.slice(0, 20),
      tokenBudget: {
        system: 2000,
        facts: 3000,
        working: 8000,
        response: 4000,
      },
    };
  }

  private async consolidate(): Promise<void> {
    const shortTermFacts = await this.shortTermMemory.getAll();

    // Use LLM to identify which facts are worth long-term storage
    const assessment = await this.llm.complete({
      prompt: `Review these facts and determine which should be stored long-term.

Facts:
${shortTermFacts.map(f => `- ${f.subject} ${f.predicate}: ${f.object} (confidence: ${f.confidence})`).join('\n')}

For each fact, respond with:
- KEEP: Important for future interactions (user preferences, key decisions, project specs)
- DISCARD: Temporary or conversational (greetings, acknowledgments, transient states)
- MERGE: Can be combined with another fact

Return as JSON array with {id, action, mergeWith?} objects.`,
      responseFormat: 'json',
    });

    const actions = JSON.parse(assessment);
    for (const action of actions) {
      if (action.action === 'KEEP') {
        const fact = shortTermFacts.find(f => f.id === action.id);
        if (fact) {
          await this.longTermMemory.store(fact);
        }
      }
    }

    // Clean up promoted facts from short-term
    await this.shortTermMemory.prunePromoted(
      actions.filter((a: any) => a.action === 'KEEP').map((a: any) => a.id)
    );
  }

  private async compactWorkingMemory(): Promise<void> {
    const overflow = this.workingMemory.slice(0, -10);
    const summary = await this.llm.complete({
      prompt: `Summarize this conversation segment preserving technical details:\n${overflow.map(m => `${m.role}: ${m.content}`).join('\n')}`,
      maxTokens: 300,
    });

    // Extract any remaining facts before discarding
    const facts = await this.extractFacts(overflow);
    await this.shortTermMemory.store(facts);

    // Replace overflow with summary
    this.workingMemory = [
      {
        role: 'system',
        content: `[Previous conversation summary: ${summary}]`,
        timestamp: Date.now(),
        tokenCount: this.estimateTokens(summary),
      },
      ...this.workingMemory.slice(-10),
    ];
  }

  // ... helper methods
}

When to Use It

  • Production AI assistants with multi-session interactions
  • Enterprise copilots that need to remember project context over weeks
  • Any application where long-term user personalization is critical

Pattern 5: Graph-Based Memory (GraphRAG)

The cutting edge of agent memory. Instead of storing facts as flat text, represent knowledge as a graph of relationships.

Why Graphs Beat Vectors

Vector similarity search (the backbone of traditional RAG) has a fundamental limitation: it finds things that sound similar but misses things that are structurally related.

Example: "Alice manages the payment team" and "The payment team owns the checkout microservice" have little semantic overlap with each other, and neither resembles the question "Who should I talk to about checkout bugs?" But in a graph, you can traverse: Alice → manages → payment team → owns → checkout microservice. So a graph can answer "Alice", while a pure vector search usually cannot make the two-hop connection.

class GraphMemory {
  private graph: GraphDatabase; // Neo4j, or in-memory

  async addKnowledge(
    subject: string,
    predicate: string,
    object: string,
    metadata: Record<string, any>
  ): Promise<void> {
    await this.graph.query(`
      MERGE (s:Entity {name: $subject})
      MERGE (o:Entity {name: $object})
      MERGE (s)-[r:${predicate.toUpperCase().replace(/\s/g, '_')}]->(o)
      SET r += $metadata, r.updatedAt = timestamp()
    `, { subject, object, metadata });
  }

  async query(question: string): Promise<GraphResult[]> {
    // Step 1: Extract entities from the question
    const entities = await this.extractEntities(question);

    // Step 2: Find relevant subgraph around those entities
    const subgraph = await this.graph.query(`
      MATCH (e:Entity)-[r*1..3]-(connected:Entity)
      WHERE e.name IN $entities
      RETURN e, r, connected
      LIMIT 50
    `, { entities });

    // Step 3: Format subgraph as context
    return this.formatSubgraph(subgraph);
  }

  async traverseForContext(
    startEntity: string,
    maxDepth: number = 3
  ): Promise<string> {
    const result = await this.graph.query(`
      MATCH path = (start:Entity {name: $startEntity})-[*1..${maxDepth}]-(end:Entity)
      RETURN path
      ORDER BY length(path)
      LIMIT 30
    `, { startEntity });

    return result.paths
      .map(p => this.pathToSentence(p))
      .join('\n');
  }

  private pathToSentence(path: GraphPath): string {
    // Convert: (Alice)-[:MANAGES]->(PaymentTeam)-[:OWNS]->(CheckoutService)
    // To: "Alice manages PaymentTeam, which owns CheckoutService"
    return path.segments
      .map(s => `${s.start.name} ${s.relationship.type.toLowerCase().replace(/_/g, ' ')} ${s.end.name}`)
      .join(', which ');
  }
}

Hybrid Approach: Vectors + Graphs

The most effective production systems combine both:

class HybridMemory {
  private vectorStore: VectorStore; // For semantic similarity
  private graphStore: GraphMemory;  // For structural relationships
  private llm: LLMClient;

  async recall(query: string): Promise<MemoryResult> {
    const [vectorResults, graphResults] = await Promise.all([
      this.vectorStore.search(query, 10),
      this.graphStore.query(query),
    ]);

    // Merge and deduplicate
    const merged = this.mergeResults(vectorResults, graphResults);

    // Re-rank using LLM
    const ranked = await this.llm.complete({
      prompt: `Given the user query: "${query}"

Rate the relevance of each memory item (0-10):
${merged.map((m, i) => `${i}: ${m.text}`).join('\n')}

Return as JSON: [{index, score, reason}]`,
      responseFormat: 'json',
    });

    return {
      memories: this.applyRanking(merged, JSON.parse(ranked)),
      sources: { vector: vectorResults.length, graph: graphResults.length },
    };
  }
}

Framework Comparison: Mem0 vs LangChain Memory vs Letta

Let's compare the three most popular memory frameworks for LLM applications.

Mem0

Mem0 provides a managed memory layer for AI applications with a multi-store architecture (KV store + vector store + graph layer).

import { MemoryClient } from 'mem0ai';

const memory = new MemoryClient({ apiKey: process.env.MEM0_API_KEY });

// Add memories from conversation
await memory.add(
  "I prefer Python over JavaScript for backend work",
  { user_id: "alice", metadata: { category: "preferences" } }
);

// Search memories
const results = await memory.search(
  "What programming language does Alice prefer?",
  { user_id: "alice" }
);
// Returns: [{memory: "Prefers Python over JavaScript for backend", score: 0.95}]

// Get all memories for a user
const allMemories = await memory.getAll({ user_id: "alice" });

Strengths:

  • Dead simple API — add/search/get in three lines
  • Managed infrastructure (no vector DB setup needed)
  • Automatic deduplication and conflict resolution
  • Works across sessions by default
  • Self-hosted option available (mem0 OSS)

Weaknesses:

  • Limited control over memory representation
  • Opaque ranking algorithm
  • Cloud dependency for the managed version
  • Graph layer is newer and less battle-tested than the vector store

LangChain Memory

LangChain provides multiple memory implementations out of the box:

import {
  BufferWindowMemory,
  ConversationSummaryMemory,
  VectorStoreRetrieverMemory,
  CombinedMemory,
} from 'langchain/memory';

// Option 1: Simple buffer
const bufferMemory = new BufferWindowMemory({ k: 10 });

// Option 2: Summarization
const summaryMemory = new ConversationSummaryMemory({
  llm: chatModel,
  returnMessages: true,
});

// Option 3: Vector-based retrieval
const vectorMemory = new VectorStoreRetrieverMemory({
  vectorStoreRetriever: vectorStore.asRetriever(5),
  memoryKey: 'relevant_history',
});

// Option 4: Combine multiple memory types
const combinedMemory = new CombinedMemory({
  memories: [bufferMemory, summaryMemory, vectorMemory],
});

Strengths:

  • Maximum flexibility — mix and match memory types
  • Deep integration with LangChain ecosystem (agents, chains, tools)
  • Community-maintained storage backends (Redis, PostgreSQL, MongoDB)
  • Open-source and self-hosted
  • Well-documented with many examples

Weaknesses:

  • Requires more setup and infrastructure decisions
  • Can be over-engineered for simple use cases
  • Memory types don't always compose cleanly
  • Depends on the broader LangChain framework

Letta (formerly MemGPT)

Letta takes a fundamentally different approach — it treats memory management as an operating system problem.

import { Letta } from 'letta';

const client = new Letta({ apiKey: process.env.LETTA_API_KEY });

// Create an agent with OS-like memory management
const agent = await client.createAgent({
  name: 'project-assistant',
  memory: {
    coreMemory: {
      // Always in context — like your system prompt
      persona: 'You are a senior software engineer...',
      human: '', // Populated automatically from conversations
    },
    archivalMemory: true, // Long-term vector storage
    recallMemory: true,   // Conversation history search
  },
  model: 'gpt-4.1',
  tools: [
    'archival_memory_insert',
    'archival_memory_search',
    'core_memory_replace',
    'core_memory_append',
  ],
});

// The agent manages its own memory via tool calls
const response = await agent.sendMessage(
  "I'm working on a Next.js project with PostgreSQL and Drizzle ORM"
);
// Agent internally calls:
// core_memory_append(section="human", content="Works with Next.js, PostgreSQL, Drizzle ORM")
// archival_memory_insert(content="User's current project stack: Next.js + PostgreSQL + Drizzle ORM")

Strengths:

  • Self-managing memory — the agent decides what to remember
  • OS-inspired architecture (core/archival/recall tiers)
  • Persistent by default — memory survives across sessions
  • The agent can explicitly reason about what to store and retrieve
  • Cloud and self-hosted options

Weaknesses:

  • Requires extra LLM calls for memory management (cost/latency overhead)
  • Opinionated architecture may not fit all use cases
  • Younger ecosystem compared to LangChain
  • Core memory updates can be unpredictable

Framework Decision Matrix

Criterion            | Mem0              | LangChain Memory     | Letta
---------------------|-------------------|----------------------|------
Setup complexity     | ⭐ Low            | ⭐⭐⭐ High          | ⭐⭐ Medium
Flexibility          | ⭐⭐ Medium       | ⭐⭐⭐ High          | ⭐⭐ Medium
Cross-session memory | ✅ Built-in       | ⚙️ Requires config   | ✅ Built-in
Self-managing        | ❌                | ❌                   | ✅
Self-hosted option   | ✅ (mem0 OSS)     | ✅                   | ✅
Production readiness | ⭐⭐⭐            | ⭐⭐⭐               | ⭐⭐
Best for             | Quick integration | Custom architectures | Autonomous agents

Production Patterns and Anti-Patterns

Pattern: Memory-Aware Prompt Engineering

The most impactful optimization is often not the memory system itself but how you present retrieved memories to the model.

// ❌ Bad: Dumping all memories as flat text
const badPrompt = `
Here are some things you know:
${memories.map(m => m.text).join('\n')}

User: ${query}
`;

// ✅ Good: Structured, prioritized, with freshness signals
const goodPrompt = `
## Your Knowledge About This User
${highConfidenceMemories.map(m =>
  `- ${m.text} (last confirmed: ${formatRelativeTime(m.updatedAt)})`
).join('\n')}

## Relevant Project Context
${projectMemories.map(m => `- ${m.text}`).join('\n')}

## Potentially Outdated (verify before using)
${staleMemories.map(m =>
  `- ${m.text} (from ${formatDate(m.createdAt)}, may have changed)`
).join('\n')}

## Current Conversation
${recentMessages.map(m => `${m.role}: ${m.content}`).join('\n')}
`;

Anti-Pattern: Memory Without Forgetting

Just as important as remembering is knowing when to forget.

class MemoryManager {
  // Implement decay — memories that are never retrieved fade
  async applyDecay(): Promise<void> {
    const allMemories = await this.store.getAll();

    for (const memory of allMemories) {
      const daysSinceAccess =
        (Date.now() - memory.lastAccessedAt) / (1000 * 60 * 60 * 24);

      // Decay formula: reduce confidence over time if not accessed
      const decayFactor = Math.exp(-0.01 * daysSinceAccess);
      const newConfidence = memory.confidence * decayFactor;

      if (newConfidence < 0.2) {
        await this.store.archive(memory.id); // Don't delete, archive
      } else {
        await this.store.updateConfidence(memory.id, newConfidence);
      }
    }
  }

  // Implement contradiction detection
  async addWithContradictionCheck(newFact: Fact): Promise<void> {
    const existing = await this.store.search(
      `${newFact.subject} ${newFact.predicate}`,
      5
    );

    const contradictions = existing.filter(e =>
      e.subject === newFact.subject &&
      e.predicate === newFact.predicate &&
      e.object !== newFact.object
    );

    if (contradictions.length > 0) {
      // More recent fact wins, but keep history
      for (const old of contradictions) {
        await this.store.markSuperseded(old.id, newFact.id);
      }
    }

    await this.store.add(newFact);
  }
}

Anti-Pattern: Over-Engineering Memory for Simple Use Cases

Not every application needs a three-tier hierarchical memory system with GraphRAG and automatic consolidation. Consider this decision tree:

1. Does your conversation last <20 messages?
   → Sliding window is fine. Stop here.

2. Does the user need to return to the same conversation later?
   → Add conversation summarization. Maybe stop here.

3. Does the agent need to remember facts across different conversations?
   → Add entity extraction + vector store.

4. Does the agent need to understand relationships between entities?
   → Add graph-based memory.

5. Does the agent need to autonomously manage what it remembers?
   → Consider Letta's self-managing approach.
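The same decision tree can be encoded as a function. This is a sketch; the returned labels are descriptive, not framework names, and the thresholds mirror the questions above.

```typescript
// The decision tree above as code. Returns the memory layers to stack,
// simplest first. Labels are descriptive, not framework names.
interface MemoryNeeds {
  typicalConversationLength: number; // messages per conversation
  resumesSameConversation: boolean;  // user returns to the thread later
  crossConversationFacts: boolean;   // facts must survive across threads
  entityRelationships: boolean;      // needs to traverse who-owns-what
  autonomousAgent: boolean;          // agent should manage its own memory
}

function recommendMemoryStack(n: MemoryNeeds): string[] {
  const stack = ['sliding window']; // always the base layer
  if (n.typicalConversationLength >= 20 || n.resumesSameConversation) {
    stack.push('conversation summarization');
  }
  if (n.crossConversationFacts) {
    stack.push('entity extraction + vector store');
  }
  if (n.entityRelationships) {
    stack.push('graph-based memory');
  }
  if (n.autonomousAgent) {
    stack.push('self-managing memory');
  }
  return stack;
}
```

A short support-bot conversation stays at the first layer; an enterprise copilot with org-chart awareness ends up accumulating all five.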

Benchmarking Your Memory System

How do you know if your memory system is actually working? Define these metrics:

Recall Accuracy

// Test: Can the agent recall facts from N messages ago?
async function testRecallAccuracy(
  agent: Agent,
  testFacts: { fact: string; queryAfterNMessages: number }[]
): Promise<number> {
  let correct = 0;

  for (const test of testFacts) {
    // Plant the fact
    await agent.processMessage({ role: 'user', content: test.fact });

    // Add N filler messages
    for (let i = 0; i < test.queryAfterNMessages; i++) {
      await agent.processMessage({
        role: 'user',
        content: `Filler message ${i}: Tell me about ${randomTopic()}`,
      });
    }

    // Test recall
    const response = await agent.processMessage({
      role: 'user',
      content: `What did I tell you about ${extractSubject(test.fact)}?`,
    });

    if (responseContainsFact(response, test.fact)) {
      correct++;
    }
  }

  return correct / testFacts.length;
}

Key Metrics to Track

Metric               | What It Measures                              | Target
---------------------|-----------------------------------------------|-------------
Recall@N             | Can the agent recall a fact after N messages? | >90% at N=50
Contradiction rate   | How often does the agent use outdated info?   | <5%
Memory latency       | Time to retrieve relevant memories            | <200ms
Token efficiency     | Ratio of relevant vs. total context tokens    | >60%
Cross-session recall | Can it remember facts from previous sessions? | >80%
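Token efficiency is the easiest of these to instrument. A minimal sketch, assuming your evaluation pipeline can mark which retrieved items were actually relevant (via an LLM judge or human labels — this function only does the bookkeeping):

```typescript
// Sketch: token efficiency = relevant context tokens / total context tokens.
// `markedRelevant` is assumed to be set upstream by your evaluator.
interface ContextItem {
  text: string;
  tokenCount: number;
  markedRelevant: boolean;
}

function tokenEfficiency(context: ContextItem[]): number {
  const total = context.reduce((sum, c) => sum + c.tokenCount, 0);
  if (total === 0) return 0;
  const relevant = context
    .filter(c => c.markedRelevant)
    .reduce((sum, c) => sum + c.tokenCount, 0);
  return relevant / total;
}

// 100 relevant tokens out of 400 total → 0.25, well under the 60% target.
```

Track this per request and alert when it dips; a falling ratio usually means your retrieval step is padding the prompt with noise.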

Conclusion

Building AI agents that remember isn't about finding the biggest context window. It's about designing a memory architecture that mirrors how information actually needs to flow in your application.

Here's the practical hierarchy:

  1. Start simple. Sliding window + conversation summarization handles 80% of use cases. Don't over-engineer from day one.

  2. Add persistence when users demand it. The moment users expect your agent to remember them across sessions, you need entity extraction and a persistent store. Mem0 is the fastest path here.

  3. Add structure when flat memories aren't enough. When your agent needs to understand relationships between things — org charts, dependency graphs, system architectures — that's when graph-based memory pays off.

  4. Let the agent manage itself when the problem is complex enough. For autonomous agents running multi-hour tasks, Letta's self-managing approach avoids the brittleness of hardcoded memory rules.

  5. Always implement forgetting. A memory system without decay becomes a liability. Outdated facts cause more damage than missing facts.

The tooling has matured enormously in the last year. What used to require custom-built infrastructure is now a pip install or API call away. The hard part isn't the technology anymore — it's designing the right memory architecture for your specific use case.

Your users won't thank you for a perfect memory system. But they'll definitely notice when your agent forgets.

Tags: AI · LLM · memory · agents · RAG · LangChain · production · architecture
