
System Architecture

How PersistMemory stores, indexes, extracts, and retrieves AI memory at scale. From MCP clients to vector search, knowledge graphs, and the auto-extraction engine.

High-Level System Architecture

End-to-end flow from MCP clients and REST API through the Cloudflare edge network to AI models and persistent storage.

Memory Write Pipeline

Every new memory flows through extraction, deduplication, embedding, storage, and graph population.
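The five stages above can be sketched as an ordered pipeline. The stage names and the `writeMemory` orchestrator below are illustrative placeholders, not PersistMemory's actual code:

```typescript
// Sketch of the write pipeline: each stage runs in order on the new memory.
// Stage implementations are stubs that only record execution order.
type Stage = (memory: { text: string }) => Promise<void>;

const trace: string[] = [];
const stage = (name: string): Stage => async () => { trace.push(name); };

// extraction → deduplication → embedding → storage → graph population
const pipeline: Stage[] = [
  stage("extract"),
  stage("dedupe"),
  stage("embed"),
  stage("store"),
  stage("populateGraph"),
];

async function writeMemory(text: string): Promise<string[]> {
  const memory = { text };
  for (const run of pipeline) await run(memory);
  return trace;
}
```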

Memory Retrieval Pipeline

Semantic search embeds the query, searches the HNSW index, enriches with metadata, and returns ranked results.

Knowledge Graph Example

Entities and relationships auto-extracted from conversations. Each node is an entity, each edge is a typed relationship.

Auto-Extraction Sequence

Step-by-step sequence diagram showing how facts are extracted asynchronously after each conversation turn.

Database Schema (ER Diagram)

Core tables and relationships in Neon Postgres. Every memory links to a space, chunks, and knowledge graph entities.

Infrastructure Stack

Edge Compute

Cloudflare Workers

Request routing, auth, MCP protocol handling. Deployed to 300+ edge locations worldwide.

AI Models

Workers AI

BGE-large-en-v1.5 (1024-dim embeddings), Llama 3.1 8B (fact extraction), Whisper (audio transcription).

Vector Index

Cloudflare Vectorize

HNSW approximate nearest neighbor. Sub-millisecond search over millions of vectors.
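HNSW trades a small amount of recall for speed. The exact computation it approximates is a brute-force cosine scan over every vector, sketched here as a reference (useful for understanding scores, not for production):

```typescript
// Exact nearest-neighbor search by cosine similarity — the computation
// that an HNSW index approximates at sub-millisecond latency.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(query: number[], vectors: { id: string; values: number[] }[], k: number) {
  return vectors
    .map((v) => ({ id: v.id, score: cosine(query, v.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```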

Relational DB

Neon Postgres

Memories, entities, edges, spaces, users. Drizzle ORM with type-safe queries.

Real-time State

Durable Objects

Per-space knowledge graph state. Real-time graph updates and notifications.

File Storage

R2 / Workers

PDF, DOCX, image, audio file processing. Chunking and content extraction.

Database Schema

Core tables in Neon Postgres. Every memory is embedded, indexed, and optionally linked into the knowledge graph.

Table     | Key Columns                                           | Purpose
memories  | id, space, title, snippet, vectorId, metadata (JSONB) | Core memory storage. metadata holds fact_type, entities, confidence, auto_extracted flag.
chunks    | id, memoryId, content, vectorId                       | Chunked content from large documents. Each chunk independently embedded.
entities  | id, space, label, metadata                            | Knowledge graph nodes. Normalized entity names per space.
edges     | id, fromId, toId, type, metadata                      | Knowledge graph relationships. Typed edges between entities.
spaces    | id, owner, name, tags, summary, color, encrypted      | Memory namespaces. Auto-generated tags/summary from extracted entities.
summaries | id, memoryId, text, vectorId                          | LLM-generated summaries for long content. Also embedded for search.
messages  | id, space, sender, text, metadata                     | Chat messages within spaces. Trigger auto-extraction on user messages.
feedback  | id, name, email, rating, category, message            | User feedback submitted from the website.
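The memories row and its JSONB metadata can be described as TypeScript types. Column names follow the schema table; the concrete field types are assumptions, not the actual Drizzle definitions:

```typescript
// Shape of a memories row and its JSONB metadata column.
// Field names come from the schema table above; types are assumed.
interface MemoryMetadata {
  fact_type: "preference" | "fact" | "relationship" | "event" | "skill" | "context";
  entities: string[];
  confidence: number;       // extraction confidence, 0..1
  auto_extracted: boolean;  // true when written by the extraction engine
}

interface MemoryRow {
  id: string;
  space: string;     // owning space (namespace)
  title: string;
  snippet: string;
  vectorId: string;  // key of the embedding in the Vectorize index
  metadata: MemoryMetadata;
}

const example: MemoryRow = {
  id: "mem_1",
  space: "space_1",
  title: "TypeScript preference",
  snippet: "User prefers TypeScript with strict mode",
  vectorId: "vec_1",
  metadata: {
    fact_type: "preference",
    entities: ["TypeScript"],
    confidence: 0.92,
    auto_extracted: true,
  },
};
```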

Auto-Extraction Engine

Every conversation triggers a fire-and-forget extraction pipeline. The response is sent to the user immediately — extraction happens asynchronously without blocking.

1. Fact Extraction (Llama 3.1 8B)

Conversation text is sent to Llama 3.1 8B with a structured prompt. Returns JSON array of facts with type (preference, fact, relationship, event, skill, context), entities, and confidence score.
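Model output still has to be validated before storage. The `parseFacts` helper below is an illustrative sketch based only on the fields listed above (text, type, entities, confidence) — the real prompt and response contract may differ:

```typescript
// Validate the JSON array returned by the extraction model, dropping
// anything that doesn't match the expected fact shape.
const FACT_TYPES = ["preference", "fact", "relationship", "event", "skill", "context"] as const;
type FactType = (typeof FACT_TYPES)[number];

interface ExtractedFact {
  text: string;
  type: FactType;
  entities: string[];
  confidence: number; // 0..1
}

function parseFacts(raw: string): ExtractedFact[] {
  const parsed = JSON.parse(raw);
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(
    (f): f is ExtractedFact =>
      typeof f?.text === "string" &&
      FACT_TYPES.includes(f?.type) &&
      Array.isArray(f?.entities) &&
      typeof f?.confidence === "number" &&
      f.confidence >= 0 &&
      f.confidence <= 1
  );
}
```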

2. Deduplication Check

Each extracted fact is embedded with BGE-large. The resulting vector is compared against existing memories in the space using cosine similarity. If any match exceeds 0.88, the fact is skipped as a duplicate.
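The threshold check reduces to a cosine comparison against each existing vector. A minimal sketch of that logic, assuming plain number arrays for vectors:

```typescript
// Duplicate check: skip a fact when any existing memory vector in the
// space exceeds the cosine similarity threshold.
const DUPLICATE_THRESHOLD = 0.88;

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function isDuplicate(candidate: number[], existing: number[][]): boolean {
  return existing.some((v) => cosineSim(candidate, v) > DUPLICATE_THRESHOLD);
}
```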

3. Memory Storage

Non-duplicate facts are stored in the memories table with structured metadata: fact_type, entities array, confidence, and auto_extracted flag. The embedding vector is indexed in Vectorize.

4. Knowledge Graph Population

Entities from each fact are upserted into the entities table (normalized, per-space). Edges are created between co-occurring entities with the relationship type and source fact. The space's Durable Object is notified for real-time graph sync.

5. Space Meta Update

The space's tags are auto-generated from the most frequent entities. The summary is rebuilt from fact type distribution (e.g., "12 preferences, 8 facts, 3 relationships").
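Both derivations are simple aggregations over a space's facts. A sketch with illustrative function names:

```typescript
// Rebuild a space's tags (most frequent entities) and summary
// (fact-type distribution) from its extracted facts.
interface Fact { type: string; entities: string[] }

function spaceTags(facts: Fact[], limit = 5): string[] {
  const counts = new Map<string, number>();
  for (const f of facts)
    for (const e of f.entities) counts.set(e, (counts.get(e) ?? 0) + 1);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([entity]) => entity);
}

function spaceSummary(facts: Fact[]): string {
  const byType = new Map<string, number>();
  for (const f of facts) byType.set(f.type, (byType.get(f.type) ?? 0) + 1);
  return [...byType.entries()]
    .map(([t, n]) => `${n} ${t}${n === 1 ? "" : "s"}`)
    .join(", ");
}
```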

Vector Search

Semantic search is the core retrieval mechanism. Queries are embedded in real-time and matched against the HNSW index.

Embedding dimensions: 1024
Index algorithm: HNSW
Similarity metric: cosine

// Retrieval with metadata enrichment (chat handler)
const results = await queryVectors(env, queryEmbedding, spaceId, topK)

// Each result includes:
{
  text: "User prefers TypeScript with strict mode",
  score: 0.94,                    // cosine similarity
  metadata: {
    fact_type: "preference",      // from auto-extraction
    entities: ["TypeScript"],     // linked to knowledge graph
    confidence: 0.92,             // extraction confidence
    auto_extracted: true          // vs manually stored
  }
}

// Context sent to LLM:
// "[preference] User prefers TypeScript with strict mode (related: TypeScript)"
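The final context line can be produced by a small formatter. `toContextLine` is a hypothetical helper matching the bracketed format in the comment above:

```typescript
// Turn an enriched search result into the "[type] text (related: …)"
// context line sent to the LLM.
interface SearchResult {
  text: string;
  metadata: { fact_type: string; entities: string[] };
}

function toContextLine(r: SearchResult): string {
  const related = r.metadata.entities.length
    ? ` (related: ${r.metadata.entities.join(", ")})`
    : "";
  return `[${r.metadata.fact_type}] ${r.text}${related}`;
}
```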

MCP Protocol Integration

PersistMemory implements the Model Context Protocol as a remote server. MCP clients connect via npx mcp-remote, authenticate with OAuth, and discover memory tools automatically.

MCP Client (Claude/Cursor/Windsurf)
  │
  ├─ tools/list → discovers: addMemory, search, listMemories, deleteMemory
  │
  ├─ tools/call: addMemory
  │   → Worker receives text
  │   → embedText() → BGE-large → 1024-dim vector
  │   → INSERT memories + upsertVector()
  │   → async: autoExtractAndStore() (fire-and-forget)
  │   → return { success: true, id }
  │
  ├─ tools/call: search
  │   → Worker receives query
  │   → embedText() → queryVectors(topK=5)
  │   → JOIN memories for text + metadata
  │   → return ranked results with scores
  │
  └─ tools/call: deleteMemory
      → DELETE from memories + deleteVector()
      → return { success: true }
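On the wire, each `tools/call` is a standard MCP JSON-RPC 2.0 exchange. The envelope below follows the MCP spec (`method`, `params.name`, `params.arguments`); the argument and result payloads mirror the addMemory flow above, with hypothetical values:

```typescript
// JSON-RPC 2.0 request a client sends to invoke the addMemory tool.
const callRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "addMemory",
    arguments: { text: "User prefers TypeScript with strict mode" },
  },
};

// Tool results come back as content blocks; the JSON payload here
// mirrors the { success, id } return shown in the flow.
const callResponse = {
  jsonrpc: "2.0",
  id: 1,
  result: {
    content: [{ type: "text", text: JSON.stringify({ success: true, id: "mem_1" }) }],
  },
};
```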

Security Model

Space Isolation

Every memory operation is scoped to a space. Spaces are owned by users with strict access control. No cross-space data leakage.

OAuth Authentication

MCP clients authenticate via OAuth flow. API keys for REST access. JWT tokens with short expiry.

End-to-End Encryption

Optional per-space encryption. When enabled, memory content is encrypted before storage. Server cannot read encrypted memories.

Edge Processing

All AI inference runs on Cloudflare Workers AI. Data never leaves the Cloudflare network for embedding or extraction.

Built for Production

This architecture handles millions of memories at edge speed. Try it free.