The 5-layer memory stack

Most agent frameworks treat memory as RAG: a vector store and a retriever. You stuff relevant chunks into context, hope the model picks the right ones, and call it memory.

This works for static-corpus Q&A. It does not work for agents that grow over time.

Thoth’s answer is five composed layers.

The stack

┌────────────────────────────────────────────────────┐
│ L5 REFLECTION · self-judgment at session end       │
├────────────────────────────────────────────────────┤
│ L4 PROCEDURAL · skills, persona, hard rules        │
├────────────────────────────────────────────────────┤
│ L3 EPISODIC · cosine recall over local embeddings  │
├────────────────────────────────────────────────────┤
│ L2 IDENTITY · theory-of-mind on each peer (Honcho) │
├────────────────────────────────────────────────────┤
│ L1 WORKING · current session state                 │
└────────────────────────────────────────────────────┘

Each layer has its own:

  • Writer — who/what writes new entries
  • Cadence — how often writes happen
  • Read pattern — when entries are loaded into context
  • Retention — how long entries persist
  • Truth kind — what kind of fact this layer holds

Below: each layer in detail.

L1 — Working memory

What’s happening right now in this session.

  • Storage — Claude session state + auto-managed MEMORY.md file
  • Writer — The model itself
  • Cadence — Per-turn (implicit)
  • Read pattern — Available throughout the session
  • Retention — Session-lifetime; persists to file via Claude’s auto-memory
  • Truth — What is happening now

This is the layer everyone has. It’s the conversation history Claude can see, plus any auto-memory file Claude writes to itself across sessions. Table stakes.

You don’t configure L1 — Claude handles it.

L2 — Identity memory

Who is this person I’m talking to, and how should I respond?

  • Storage — Honcho (managed cloud or self-host)
  • Writer — The bridge (per-turn fire-and-forget ingest)
  • Cadence — Every turn
  • Read pattern — Pre-spawn dialectic call, ~1.5s timeout
  • Retention — Lifetime of the Honcho workspace
  • Truth — Who each peer is

Identity memory is theory-of-mind: a derived, evolving model of who each peer is across all sessions, all topics, all time.

Honcho’s primitive is Workspace > Peer > Session > Message, with two background processes:

  • The Deriver extracts observations (explicit facts + deductive inferences) about each peer on every message.
  • The Dialectic answers “what should I know to respond to this peer right now?” — pulling only relevant observations into the current turn’s context.

Example observation Thoth might derive:

“User prefers terse answers when stressed. Switches to mythological language when discussing system design — this seems intentional, not performative. Uses ‘lol’ to indicate surprise rather than humor.”

These observations don’t sit in your prompt. They sit in Honcho. The Dialectic surfaces only what’s relevant to this turn as a <user-model> block prepended to the system context.
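The pre-spawn read pattern (dialectic call, ~1.5s timeout) can be sketched like this. A minimal TypeScript sketch: `queryDialectic` is a hypothetical stand-in for the bridge’s call to Honcho’s dialectic endpoint, not Honcho’s real SDK.

```typescript
// Sketch of the pre-spawn identity read: race the dialectic call
// against a timeout so a slow Honcho round-trip can never delay the
// agent spawn. On timeout, the turn proceeds with no <user-model>.
// `queryDialectic` is a hypothetical placeholder, not the real SDK.

async function queryDialectic(peerId: string, query: string): Promise<string> {
  // Placeholder: in the real bridge this calls Honcho's dialectic API.
  return `observations relevant to ${peerId}: ${query}`;
}

async function preSpawnUserModel(
  peerId: string,
  query: string,
  timeoutMs = 1500,
): Promise<string> {
  const timeout = new Promise<string>((resolve) =>
    setTimeout(() => resolve(""), timeoutMs),
  );
  const answer = await Promise.race([queryDialectic(peerId, query), timeout]);
  // Empty string → nothing is prepended; otherwise wrap as a
  // <user-model> block for the system context.
  return answer ? `<user-model>\n${answer}\n</user-model>` : "";
}
```

The race is the important part: identity memory is best-effort per turn, so the bridge degrades to "no user model" rather than blocking the reply.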

L3 — Episodic memory

What happened?

  • Storage — SQLite + 384-dim Float32 BLOB embeddings
  • Writer — The bridge (post-success, async)
  • Cadence — Per turn (after the user has the reply)
  • Read pattern — First turn of each new thread; /recall on demand
  • Retention — Configurable per Cloud tier (Free 30d, Pro 365d, Enterprise unlimited)
  • Truth — The autobiographical record

Every successful turn is summarized, embedded, and stored. When you start a new thread, Thoth runs cosine search (recency-weighted, τ = 14 days) over recent episodes and finds the top 3 most similar past episodes. Those go in as a <related-episodes> block.

This is what gives the agent cross-thread continuity. You’re in a new Slack thread asking about cherry-picking. The agent has no conversation history from past threads — but the query’s embedding is similar to a thread from two weeks ago about merge strategies, which is similar to a thread from a month ago about staging-vs-prod policy. The agent surfaces that history.

// What goes in
{
  user_text: "How should I cherry-pick this hotfix?",
  apex_summary: "Cherry-pick from staging to main; never merge...",
  num_turns: 1,
  total_cost_usd: 0.012
}
// 384-dim Float32 embedding generated locally via Xenova MiniLM
// Cost: $0/encode, ~50–100ms on CPU
// Cached on disk after first download (~25 MB)

We deliberately do NOT use sqlite-vec. At our scale (< 100K episodes), a JS-side cosine over a recency-pre-filtered candidate set (default 200) is fast and adds zero native-build complexity.
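That JS-side scoring path is simple enough to sketch in full. The `Episode` shape below is illustrative, not the real schema; the real bridge reads its recency-pre-filtered candidates from SQLite.

```typescript
// Sketch of L3 recall: score a recency-pre-filtered candidate set with
// cosine similarity decayed by episode age (τ = 14 days), keep top 3.
// The Episode shape is illustrative, not the real schema.

interface Episode {
  summary: string;
  ageDays: number;          // days since the episode was stored
  embedding: Float32Array;  // 384-dim MiniLM vector
}

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function recall(
  query: Float32Array,
  candidates: Episode[],
  tauDays = 14,
  k = 3,
): Episode[] {
  return candidates
    .map((e) => ({
      e,
      // Recency weighting: similarity decays exponentially with age.
      score: cosine(query, e.embedding) * Math.exp(-e.ageDays / tauDays),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.e);
}
```

At 384 dims × 200 candidates this is a few hundred thousand multiply-adds per recall — microseconds in V8, which is why a native vector extension buys nothing here.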

L4 — Procedural memory

What works?

  • Storage — .claude/skills/<slug>/SKILL.md + persona files (git-tracked)
  • Writer — Reflection writer (drafts) → founder approval (commits)
  • Cadence — Per-session at most; typically per-week or per-month
  • Read pattern — Skill auto-discovery + persona loaded at session boot
  • Retention — Permanent (until explicitly deleted)
  • Truth — The durable rules of how this agent works

This is skills and persona — the crystallized output of sessions that have been judged worth keeping.

Skills follow the agentskills.io v1 format (also used by Anthropic Claude Code, OpenAI Agents, Cursor):

my-skill/
├── SKILL.md        # human description (canonical)
├── manifest.json   # name, version, author, license, tools_allowed
└── src/            # files the skill uses

When invoked, the skill’s full SKILL.md loads on-demand into Claude’s context for that turn only.
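A hypothetical manifest for the tree above, using only the fields named in its comment — the exact schema is defined by the agentskills.io v1 spec, and the values (including the `tools_allowed` entries) are invented for illustration:

```json
{
  "name": "my-skill",
  "version": "1.0.0",
  "author": "you@example.com",
  "license": "MIT",
  "tools_allowed": ["Read", "Bash"]
}
```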

Persona files define the agent’s identity. The default Thoth persona stack:

  • IDENTITY.md — one-liner persona name
  • SOUL.md — voice, mission, axioms
  • RULES.md — operational rules
  • AGENTS.md — subagent roster
  • USER.md — founder profile, communication preferences
  • MEMORY.md — long-term memory snapshot
  • TOOLS.md — available external services

See The persona stack for the full walkthrough.

L5 — Reflection

What should I learn from what just happened?

  • Storage — Drafts to disk + Slack DMs
  • Writer — A forked claude -p --effort low subprocess
  • Cadence — At session end (/done command or 30-min idle)
  • Read pattern — Founder review, then propagated to other layers
  • Retention — Permanent in audit log
  • Truth — What this agent should become

When a session ends, a fresh claude -p subprocess runs over the transcript and emits structured JSON describing:

  • What worked
  • What didn’t
  • Should this become a new skill? (name + description + body)
  • What memory notes to save
  • What persona observations to consider
  • When to check back on this thread (next_check_at)
  • What user-model updates apply per peer
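The emitted JSON can be typed roughly as follows. This is an illustrative shape derived from the bullet list above — field names are assumptions, not the real schema:

```typescript
// Illustrative type for the reflection subprocess output, built from
// the fields listed above; the real schema and field names may differ.

interface SkillProposal {
  name: string;
  description: string;
  body: string; // proposed SKILL.md content
}

interface ReflectionOutput {
  what_worked: string[];
  what_didnt: string[];
  skill_proposal?: SkillProposal;  // present only when a skill is worth drafting
  memory_notes: string[];          // appended to MEMORY.md
  persona_observations: string[];  // DM'd to founders, never auto-applied
  next_check_at?: string;          // ISO timestamp for thread follow-up
  user_model_updates: Record<string, string[]>; // peer id → observations for Honcho
}

// Example instance, as the fan-out step might receive it after parsing.
const example: ReflectionOutput = {
  what_worked: ["terse answers landed well"],
  what_didnt: ["over-explained git internals"],
  memory_notes: ["founder prefers hotfixes cherry-picked, never merged"],
  persona_observations: [],
  user_model_updates: { alice: ["uses 'lol' to signal surprise"] },
};
```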

The JSON gets parsed and fanned out to four writers:

  1. Auto-Memory writer — appends memory_notes to MEMORY.md (with secret redaction)
  2. Skill draft writer — writes proposed SKILL.md to .claude/skills/<slug>/, posts a Slack approval card
  3. Persona observation writer — DMs founders candidate observations. Never auto-applies.
  4. Honcho writer — feeds user_model_updates back into Honcho as Thoth-authored observations

Reflection is the growth loop. Without it, the agent accumulates facts but doesn’t accumulate wisdom.

Why five and not one

You can imagine collapsing all this into a single vector store. The collapse loses something on every axis:

  • Retention policy — episodes can age out; persona is permanent; identity slowly evolves; reflection is timestamped. A single store has one retention policy.
  • Writer — Claude writes its own working memory. The bridge writes episodic. Reflection writes drafts; you approve. Honcho derives identity. Different writers, different invariants.
  • Read pattern — Identity surfaces pre-spawn. Episodic surfaces on first turn of new threads. Persona loads on session start. Reflection surfaces only as DMs.
  • Mutability — Persona is human-curated. Skills are human-approved. Identity is machine-derived but human-correctable. Episodes are append-only. Different governance.
  • Truth kind — L1 is “what’s happening now.” L2 is “who you are.” L3 is “what we did.” L4 is “how we work.” L5 is “what we should become.” These are different kinds of truth.

Stuffing them into one vector store works for retrieval. It does not work for governance. And without governance, you don’t have a mind — you have a search index.

Disabling layers

Each non-essential layer can be disabled independently:

# In .env
HONCHO_DISABLED=true      # disables L2 identity memory
EPISODIC_DISABLED=true    # disables L3 cross-thread recall
REFLECTION_DISABLED=true  # disables L5 (and skill compilation)

L1 working memory and L4 procedural always run.

What’s next