Making AI Agents Remember: A Memory Architecture

Apr 21, 2026

Memory architecture

A student tells your AI tutor, "I have a big exam tomorrow. I'm really nervous."

The next day, the AI opens with: "How did it go?"

That single sentence is the difference between a chatbot and something that actually feels human. It's not magic; it's memory. And building it correctly, at scale, is one of the most interesting systems problems I've worked on.

At Adda247, our team was tasked with building a centralized memory and context service for chat and voice agents powering experiences across all our products: live tutoring, real-time doubt solving, and more. A big chunk of this layer was designed and battle-tested with Ansh Tanwar, who drove a lot of the core logic. When you're operating in India's edtech space, you're dealing with millions of concurrent users jumping across wildly different contexts, speaking multiple languages, picking up sessions mid-thought.

The naive approach fails fast. It usually means dumping the last N messages into the prompt or sending a vague summary, and the model ends up hallucinating context.

After diving deep into MemGPT, semantic memory research, and a lot of painful production incidents, we landed on an architecture that actually holds up.

In this post, I'll walk you through the memory system we built: what each layer does, why we designed it that way, and the specific engineering decisions that made it work at scale. By the end, you'll have a concrete mental model you can apply directly, or tear apart and improve on.

1. Working Memory (The Current Session)

This tier manages the live conversation context. We maintain a sequential turn buffer, with each exchange stored as a structured object holding role, content, and timestamp. As the conversation grows, older turns are evicted to stay within the model's context window.
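A minimal sketch of such a turn buffer, to make the eviction behavior concrete. The class names, the word-count token estimate, and the `max_tokens` budget are all assumptions for illustration, not the production implementation.

```python
from collections import deque
from dataclasses import dataclass, field
import time


@dataclass
class Turn:
    role: str                     # "user" or "assistant"
    content: str
    timestamp: float = field(default_factory=time.time)


class TurnBuffer:
    """Sequential turn buffer that evicts the oldest turns past a token budget."""

    def __init__(self, max_tokens: int = 2000):
        self.max_tokens = max_tokens   # hypothetical budget, tune per model
        self.turns = deque()

    @staticmethod
    def estimate_tokens(text: str) -> int:
        # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
        return len(text.split())

    def _total_tokens(self) -> int:
        return sum(self.estimate_tokens(t.content) for t in self.turns)

    def append(self, turn: Turn) -> list:
        """Add a turn and return the turns that had to be evicted (oldest first)."""
        self.turns.append(turn)
        evicted = []
        # Always keep at least the newest turn, even if it alone exceeds the budget.
        while len(self.turns) > 1 and self._total_tokens() > self.max_tokens:
            evicted.append(self.turns.popleft())
        return evicted
```

Returning the evicted turns from `append` gives the caller a natural hook for handing them to whatever compresses older history.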


But what if the user references something from 30 turns ago, maybe a topic the sliding window has long since evicted? The conversation summarizer handles exactly this. It periodically compresses older history into a concise summary that stays injected in the prompt throughout the session. The model never sees those raw turns again, but it retains the substance of what was discussed, just in compressed form rather than complete exchanges.
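One way to picture the summarizer is as a rolling summary that absorbs evicted turns in batches. The sketch below assumes a `summarize` callable standing in for the LLM call and an `every` batch size; both are hypothetical, as is the prompt wording.

```python
from typing import Callable


class RollingSummary:
    """Keeps one running summary of turns that left the sliding window."""

    def __init__(self, summarize: Callable[[str, list], str], every: int = 4):
        self.summarize = summarize   # hypothetical stand-in for the LLM summarizer
        self.every = every           # compress after this many evicted turns (assumed)
        self.pending = []            # evicted turns not yet folded into the summary
        self.summary = ""

    def add_evicted(self, role: str, content: str) -> None:
        self.pending.append(f"{role}: {content}")
        if len(self.pending) >= self.every:
            # Fold the batch into the existing summary, then drop the raw turns.
            self.summary = self.summarize(self.summary, self.pending)
            self.pending = []

    def build_prompt(self, recent_turns: list) -> str:
        parts = []
        if self.summary:
            parts.append("Summary of earlier conversation:\n" + self.summary)
        parts.extend(self.pending)      # not yet compressed, so keep them verbatim
        parts.extend(recent_turns)
        return "\n".join(parts)


def fake_summarize(previous: str, turns: list) -> str:
    # Placeholder: a real system would call a model here.
    return " | ".join(([previous] if previous else []) + turns)
```

Passing the previous summary back into the summarizer keeps the compressed context cumulative instead of restarting from scratch on every batch.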

On top of this, recent turns are used to ground retrieval queries. Before hitting the vector search, the user's current utterance is prepended with the last few turns, so anaphoric references like "tell me more about that" resolve correctly against the knowledge base.
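The query grounding step can be as small as the sketch below. The function name and the default of three turns are assumptions; the idea is just that the retrieval query carries enough recent context for "that" to mean something.

```python
def build_retrieval_query(utterance: str, recent_turns: list, k: int = 3) -> str:
    """Prepend the last k turns to the current utterance before vector search."""
    # Guard k == 0 explicitly: recent_turns[-0:] would return the whole list.
    context = recent_turns[-k:] if k > 0 else []
    return "\n".join([*context, utterance])


query = build_retrieval_query(
    "tell me more about that",
    ["user: What is photosynthesis?", "assistant: It is how plants make food."],
    k=2,
)
```

The resulting string is what gets embedded, so the vector search sees "photosynthesis" even though the user only said "that".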

2. Episodic Memory (What Actually Happened)

This is the layer I am most excited about, as it bridges the gap between raw intelligence and personalization.

From an engineering standpoint, this runs entirely asynchronously, i.e., after the session ends, not during it. We cannot afford to add background work to live conversation latency. So once a user completes a session, the transcript goes into a queue, gets picked up by a worker, and is processed in the background while the user has already moved on with their day.
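The queue-and-worker shape looks roughly like this. The sketch uses Python's in-process `queue.Queue` and a thread purely for illustration; a production system would more likely use a message broker and separate worker processes, and `extract_memories` here is a placeholder for the LLM extractor.

```python
import queue
import threading

transcript_queue = queue.Queue()
extracted = []   # stand-in for the memory store


def extract_memories(transcript: str) -> list:
    # Placeholder for the LLM extractor: keeps user lines only.
    return [{"content": line} for line in transcript.splitlines()
            if line.startswith("user:")]


def worker() -> None:
    while True:
        job = transcript_queue.get()
        try:
            if job is None:          # sentinel: shut the worker down
                break
            for memory in extract_memories(job["transcript"]):
                extracted.append({"user_id": job["user_id"], **memory})
        finally:
            transcript_queue.task_done()


def start_worker() -> threading.Thread:
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def on_session_end(user_id: str, transcript: str) -> None:
    # Enqueue and return immediately; the live session never waits on extraction.
    transcript_queue.put({"user_id": user_id, "transcript": transcript})
```

`on_session_end` only pays the cost of a queue put, which is the whole point: extraction latency is moved off the conversation path.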

The extractor is a lightweight LLM call. We feed it the transcript and ask it one question: what is actually important here? It reads the conversation and pulls out the meaningful events: not loose phrases, but specific, structured records to store. Here is an example of what the fields of such a record look like:

Column        Values
memory_type   preference
content       User Likes Veg Food
tags          ["veg", "food"]
importance    0.9
embedding     [0.016948, -0.014532, -0.073399, ...] (768-dim vector)

Every single field here serves a distinct downstream purpose:

memory_type: We categorize memories strictly into a fixed set of enum values:

  • Fact: An absolute truth about the user (e.g., name, age).
  • Preference: Interaction styles (e.g., prefers Hindi, wants concise answers).
  • Assessment: Behavioral traits or abilities we have learned about the user.
  • Event: Time-bound occurrences (e.g., an exam next week) that act as triggers.
  • Goal: High-level targets (e.g., UPSC CSE 2025).

This explicit typing allows us to route the memory appropriately in future sessions.

content: We heavily engineered the extraction prompt so that this output is strictly third-person, specific, and actionable. "User seems confused" is a useless record. A detailed, specific string gives the agent a clear behavioral pattern to adapt its future communication style around.

tags: These are lightweight labels that enable efficient filtered retrieval and hybrid search.

importance: Generated by an importance classifier, this float between 0 and 1 acts as a strict storage gate. A score of 0.9 is critical context. A score of 0.6 (e.g., "prefers to eat ice cream over chocolate") is useful but not urgent.
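Put together, a record with these fields could be modeled as below. The enum values mirror the types listed above; the class names, the validation, and the `MIN_IMPORTANCE` cutoff of 0.5 are assumptions for illustration (the post does not state the actual gate value).

```python
from dataclasses import dataclass, field
from enum import Enum


class MemoryType(str, Enum):
    FACT = "fact"
    PREFERENCE = "preference"
    ASSESSMENT = "assessment"
    EVENT = "event"
    GOAL = "goal"


@dataclass
class MemoryRecord:
    memory_type: MemoryType
    content: str
    tags: list
    importance: float
    embedding: list = field(default_factory=list)   # 768 floats in the example above

    def __post_init__(self):
        # Coerce plain strings to the enum; unknown types raise ValueError.
        self.memory_type = MemoryType(self.memory_type)
        if not 0.0 <= self.importance <= 1.0:
            raise ValueError("importance must be between 0 and 1")


MIN_IMPORTANCE = 0.5   # hypothetical storage gate


def passes_storage_gate(record: MemoryRecord,
                        min_importance: float = MIN_IMPORTANCE) -> bool:
    return record.importance >= min_importance


record = MemoryRecord("preference", "User Likes Veg Food", ["veg", "food"], 0.9)
```

Rejecting unknown `memory_type` values at construction time is what makes the typing "strict": a malformed extractor output fails loudly instead of landing in the store.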

The Embedding Layer: More Than Just Retrieval

Normally, we use these embeddings for lookups to pull relevant context into the live prompt. However, we engineered a secondary use case for these vectors that is just as critical: Deduplication.

Imagine a user who struggles to express themselves or talks about the same thing in every single session. Over 50 sessions, a standard extractor would write 50 almost identical database rows. That creates massive noise. To solve this, we implemented a two-path decision routing system that runs before any memory actually hits the database.

Before inserting a new memory record, we first compute its embedding and then measure its similarity against the records already in the database.

Path 1: Fast Insert (Low Similarity). We calculate the cosine similarity between the new memory embedding and the user's existing memories. If the maximum similarity falls below a specific threshold, meaning we have never observed anything like this before, we insert the record directly into the database. It's a fast-path insert: zero token cost, near-zero latency.
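The similarity gate reduces to a max-over-cosine check. In this sketch the function names and the 0.85 threshold are assumptions (the post does not give the real value), and the pure-Python loop stands in for what would normally be a vector-index query.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0   # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


SIMILARITY_THRESHOLD = 0.85   # hypothetical


def route_new_memory(new_embedding: list, existing_embeddings: list,
                     threshold: float = SIMILARITY_THRESHOLD) -> str:
    """Return which path a new memory takes, based on its closest existing match."""
    max_similarity = max(
        (cosine_similarity(new_embedding, e) for e in existing_embeddings),
        default=0.0,   # a user with no memories always takes the fast path
    )
    return "fast_insert" if max_similarity < threshold else "llm_evaluation"
```

Only the single closest match matters for routing; the matching records themselves become context for the second path.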

Path 2: LLM Evaluation (High Similarity). If a close vector match exists (indicating a potential duplicate), we don't blindly overwrite or ignore it. Instead, we pass the new episode and the matching historical memories to the LLM as context. The LLM then makes a discrete decision via tool calls:

  • add_memory: If it is genuinely new despite surface-level similarity.
  • update_memory: If it refines or updates something we already know.
  • Do nothing: If it is pure noise or an exact duplicate, the LLM simply drops it.
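On the application side, the three outcomes above map to a small dispatcher. The tool names `add_memory` and `update_memory` come from the list; the argument names (`memory_id`, `content`), the dict-based store, and `fake_llm_decide` (a rule-based stand-in for the model) are assumptions for illustration.

```python
from typing import Optional


def apply_decision(store: dict, tool_call: Optional[dict]) -> str:
    """Apply the model's tool call to the store; None means 'do nothing'."""
    if tool_call is None:
        return "noop"
    name, args = tool_call["name"], tool_call["arguments"]
    if name == "add_memory":
        new_id = max(store, default=0) + 1
        store[new_id] = args["content"]
        return "added"
    if name == "update_memory":
        if args["memory_id"] not in store:
            raise KeyError(f"unknown memory_id: {args['memory_id']}")
        store[args["memory_id"]] = args["content"]
        return "updated"
    raise ValueError(f"unknown tool: {name}")


def fake_llm_decide(new_content: str, matches: dict) -> Optional[dict]:
    # Rule-based placeholder for the LLM: exact match -> drop,
    # extension of an existing memory -> update, otherwise -> add.
    for memory_id, old_content in matches.items():
        if old_content == new_content:
            return None
        if new_content.startswith(old_content):
            return {"name": "update_memory",
                    "arguments": {"memory_id": memory_id, "content": new_content}}
    return {"name": "add_memory", "arguments": {"content": new_content}}
```

Keeping the dispatcher strict about unknown tool names and missing IDs means a malformed model response fails visibly rather than silently corrupting the store.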

This architecture keeps our memory store precision very high. Every single row earns its place because either the vector space proved it was novel, or the LLM explicitly decided it was worth keeping.

In practice, roughly 35% of new memories trigger the similarity gate and are routed to the LLM, and about 25% of those are outright dropped by the LLM as noise. Overall, our token spend dropped significantly once we stopped asking the LLM to process memories it would have written identically anyway.
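As a quick piece of arithmetic on those two figures, assuming the 35% is measured against all candidate memories reaching the gate:

```python
routed_to_llm = 0.35           # share of candidates that hit the similarity gate
dropped_given_routed = 0.25    # share of routed candidates the LLM drops

fast_path = 1 - routed_to_llm                           # no LLM call at all
dropped_overall = routed_to_llm * dropped_given_routed  # dropped, as a share of all candidates
kept_after_llm = routed_to_llm - dropped_overall        # added or updated after review

print(f"fast path: {fast_path:.2%}")
print(f"dropped overall: {dropped_overall:.2%}")
print(f"kept after LLM review: {kept_after_llm:.2%}")
```

Under that assumption, about 65% of candidates never cost a token at the dedup stage, and just under 9% of all candidates end up discarded.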

Outro for dummies

Umm... so, to put it simply: no one wants a friend who forgets who you are every time you sit down at the desk.

By organizing our AI's memory like a human brain, with notes for the chat and a diary for daily events, the AI doesn't just answer questions anymore; it builds a relationship with the user. And doing that at our scale is what makes this project so much fun.

Thanks for making it to the end. See you in the next deep dive!

Is this perfect? Probably not. But it's a real example of two engineers staring at a hard problem, trying things, breaking things, and eventually shipping something that actually works at scale. The results speak for themselves, and that's enough for us.

If you're building something similar or just want to talk through the architecture, feel free to reach out to me on Twitter / X or LinkedIn, or ping Ansh directly. We're always up for a good systems conversation.