System Architecture
How PersistMemory stores, indexes, extracts, and retrieves AI memory at scale. From MCP clients to vector search, knowledge graphs, and the auto-extraction engine.
High-Level System Architecture
End-to-end flow from MCP clients and REST API through the Cloudflare edge network to AI models and persistent storage.
Memory Write Pipeline
Every new memory flows through extraction, deduplication, embedding, storage, and graph population.
Memory Retrieval Pipeline
Semantic search embeds the query, searches the HNSW index, enriches with metadata, and returns ranked results.
Knowledge Graph Example
Entities and relationships auto-extracted from conversations. Each node is an entity, each edge is a typed relationship.
Auto-Extraction Sequence
Step-by-step sequence diagram showing how facts are extracted asynchronously after each conversation turn.
Database Schema (ER Diagram)
Core tables and relationships in Neon Postgres. Every memory links to a space, chunks, and knowledge graph entities.
Infrastructure Stack
Edge Compute
Cloudflare Workers
Request routing, auth, MCP protocol handling. Deployed to 300+ edge locations worldwide.
AI Models
Workers AI
BGE-large-en-v1.5 (1024-dim embeddings), Llama 3.1 8B (fact extraction), Whisper (audio transcription).
Vector Index
Cloudflare Vectorize
HNSW approximate nearest neighbor. Sub-millisecond search over millions of vectors.
Relational DB
Neon Postgres
Memories, entities, edges, spaces, users. Drizzle ORM with type-safe queries.
Real-time State
Durable Objects
Per-space knowledge graph state. Real-time graph updates and notifications.
File Storage
R2 / Workers
PDF, DOCX, image, audio file processing. Chunking and content extraction.
Database Schema
Core tables in Neon Postgres. Every memory is embedded, indexed, and optionally linked into the knowledge graph.
| Table | Key Columns | Purpose |
|---|---|---|
| memories | id, space, title, snippet, vectorId, metadata (JSONB) | Core memory storage. metadata holds fact_type, entities, confidence, auto_extracted flag. |
| chunks | id, memoryId, content, vectorId | Chunked content from large documents. Each chunk independently embedded. |
| entities | id, space, label, metadata | Knowledge graph nodes. Normalized entity names per space. |
| edges | id, fromId, toId, type, metadata | Knowledge graph relationships. Typed edges between entities. |
| spaces | id, owner, name, tags, summary, color, encrypted | Memory namespaces. Auto-generated tags/summary from extracted entities. |
| summaries | id, memoryId, text, vectorId | LLM-generated summaries for long content. Also embedded for search. |
| messages | id, space, sender, text, metadata | Chat messages within spaces. Trigger auto-extraction on user messages. |
| feedback | id, name, email, rating, category, message | User feedback submitted from the website. |
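The `chunks` table above holds independently embedded slices of large documents. A minimal sketch of how such a chunker might work, assuming a fixed window with overlap (the `chunkText` name and the size/overlap constants are illustrative, not the actual implementation):

```typescript
// Split long document text into overlapping chunks so each piece stays
// within the embedding model's input window. Sizes here are assumptions.
const CHUNK_SIZE = 1000;
const OVERLAP = 200;

function chunkText(text: string): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + CHUNK_SIZE));
    if (start + CHUNK_SIZE >= text.length) break;
    start += CHUNK_SIZE - OVERLAP; // step forward, keeping overlap for context
  }
  return chunks;
}
```

Each returned chunk would then be embedded separately and stored with its own `vectorId`, pointing back to the parent memory via `memoryId`.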
Auto-Extraction Engine
Every conversation triggers a fire-and-forget extraction pipeline. The response is sent to the user immediately — extraction happens asynchronously without blocking.
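On Cloudflare Workers, the standard way to run work after the response is returned is `ctx.waitUntil()`. A self-contained sketch of the fire-and-forget pattern; the `ExecutionContext` stub and `autoExtractAndStore` body are illustrative stand-ins, not the production code:

```typescript
// Fire-and-forget: reply immediately, let extraction finish in the background.
// On Workers, ctx.waitUntil() keeps the runtime alive for the pending promise;
// it is stubbed here so the sketch runs anywhere.
interface ExecutionContext {
  waitUntil(promise: Promise<unknown>): void;
}

let extractionDone = false;

async function autoExtractAndStore(text: string): Promise<void> {
  // Placeholder for the real pipeline: extract -> dedupe -> embed -> store.
  await new Promise((resolve) => setTimeout(resolve, 10));
  extractionDone = true;
}

function handleMessage(text: string, ctx: ExecutionContext): { reply: string } {
  ctx.waitUntil(autoExtractAndStore(text)); // not awaited: never blocks the reply
  return { reply: `ack: ${text}` };
}
```

The key property is that `handleMessage` returns before `autoExtractAndStore` resolves, so extraction latency never appears in the user-facing response time.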
1. Fact Extraction (Llama 3.1 8B)
Conversation text is sent to Llama 3.1 8B with a structured prompt. Returns JSON array of facts with type (preference, fact, relationship, event, skill, context), entities, and confidence score.
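Because the model's output is untrusted text, the returned JSON needs validation before anything is stored. A hedged sketch of that parsing step, assuming the fact shape described above (the `parseFacts` helper and field names are illustrative):

```typescript
// Parse and validate the fact array returned by the extraction model.
// Malformed JSON or facts with an unknown type are silently dropped.
const FACT_TYPES = ["preference", "fact", "relationship", "event", "skill", "context"] as const;
type FactType = (typeof FACT_TYPES)[number];

interface ExtractedFact {
  type: FactType;
  text: string;
  entities: string[];
  confidence: number;
}

function parseFacts(raw: string): ExtractedFact[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // model returned malformed JSON; skip this turn
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(
    (f): f is ExtractedFact =>
      typeof f === "object" && f !== null &&
      FACT_TYPES.includes((f as any).type) &&
      typeof (f as any).text === "string" &&
      Array.isArray((f as any).entities) &&
      typeof (f as any).confidence === "number"
  );
}
```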
2. Deduplication Check
Each extracted fact is embedded with BGE-large. The resulting vector is compared against existing memories in the space using cosine similarity. If any match exceeds 0.88, the fact is skipped as a duplicate.
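The dedup check reduces to a cosine-similarity threshold over the space's existing vectors. A self-contained sketch; the 0.88 threshold comes from the step above, while the function names are illustrative:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const DUPLICATE_THRESHOLD = 0.88;

// A fact is a duplicate if any existing memory vector is close enough.
function isDuplicate(factVec: number[], existing: number[][]): boolean {
  return existing.some((vec) => cosine(factVec, vec) > DUPLICATE_THRESHOLD);
}
```

In production the nearest-neighbor lookup would go through the Vectorize index rather than a linear scan, but the threshold decision is the same.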
3. Memory Storage
Non-duplicate facts are stored in the memories table with structured metadata: fact_type, entities array, confidence, and auto_extracted flag. The embedding vector is indexed in Vectorize.
4. Knowledge Graph Population
Entities from each fact are upserted into the entities table (normalized, per-space). Edges are created between co-occurring entities with the relationship type and source fact. The space's Durable Object is notified for real-time graph sync.
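Edge creation between co-occurring entities is essentially a pairwise expansion over each fact's entity list. A sketch under stated assumptions (the `Edge` shape and lowercase normalization are illustrative, not the actual schema):

```typescript
interface Edge {
  from: string;
  to: string;
  type: string;
}

// Normalize entity labels so "TypeScript" and "typescript" map to one node.
function normalize(label: string): string {
  return label.trim().toLowerCase();
}

// One edge per unordered pair of distinct co-occurring entities in a fact.
function edgesForFact(entities: string[], relType: string): Edge[] {
  const nodes = [...new Set(entities.map(normalize))];
  const edges: Edge[] = [];
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      edges.push({ from: nodes[i], to: nodes[j], type: relType });
    }
  }
  return edges;
}
```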
5. Space Meta Update
The space's tags are auto-generated from the most frequent entities. The summary is rebuilt from fact type distribution (e.g., "12 preferences, 8 facts, 3 relationships").
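Both derived fields are simple aggregations over the space's extracted facts. A plausible sketch (the helper names and the top-5 cutoff are assumptions):

```typescript
// Top-N most frequent entities across all facts become the space's tags.
function topTags(entityLists: string[][], n = 5): string[] {
  const counts = new Map<string, number>();
  for (const list of entityLists)
    for (const entity of list) counts.set(entity, (counts.get(entity) ?? 0) + 1);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([entity]) => entity);
}

// The summary is rebuilt from the fact-type distribution.
function buildSummary(factTypes: string[]): string {
  const counts = new Map<string, number>();
  for (const t of factTypes) counts.set(t, (counts.get(t) ?? 0) + 1);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([type, count]) => `${count} ${type}${count === 1 ? "" : "s"}`)
    .join(", ");
}
```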
Vector Search
Semantic search is the core retrieval mechanism. Queries are embedded in real-time and matched against the HNSW index.
Index parameters: 1024 embedding dimensions, HNSW index algorithm, cosine similarity metric.
```typescript
// Retrieval with metadata enrichment (chat handler)
const results = await queryVectors(env, queryEmbedding, spaceId, topK)

// Each result includes:
{
  text: "User prefers TypeScript with strict mode",
  score: 0.94,                // cosine similarity
  metadata: {
    fact_type: "preference",  // from auto-extraction
    entities: ["TypeScript"], // linked to knowledge graph
    confidence: 0.92,         // extraction confidence
    auto_extracted: true      // vs. manually stored
  }
}

// Context sent to LLM:
// "[preference] User prefers TypeScript with strict mode (related: TypeScript)"
```

MCP Protocol Integration
PersistMemory implements the Model Context Protocol as a remote server. MCP clients connect via `npx mcp-remote`, authenticate with OAuth, and discover memory tools automatically.
```
MCP Client (Claude / Cursor / Windsurf)
│
├─ tools/list → discovers: addMemory, search, listMemories, deleteMemory
│
├─ tools/call: addMemory
│    → Worker receives text
│    → embedText() → BGE-large → 1024-dim vector
│    → INSERT memories + upsertVector()
│    → async: autoExtractAndStore() (fire-and-forget)
│    → return { success: true, id }
│
├─ tools/call: search
│    → Worker receives query
│    → embedText() → queryVectors(topK=5)
│    → JOIN memories for text + metadata
│    → return ranked results with scores
│
└─ tools/call: deleteMemory
     → DELETE from memories + deleteVector()
     → return { success: true }
```

Security Model
Space Isolation
Every memory operation is scoped to a space. Spaces are owned by users with strict access control. No cross-space data leakage.
OAuth Authentication
MCP clients authenticate via OAuth flow. API keys for REST access. JWT tokens with short expiry.
End-to-End Encryption
Optional per-space encryption. When enabled, memory content is encrypted before storage. Server cannot read encrypted memories.
Edge Processing
All AI inference runs on Cloudflare Workers AI. Data never leaves the Cloudflare network for embedding or extraction.
Built for Production
This architecture handles millions of memories at edge speed. Try it free.