System Architecture
How PersistMemory stores, indexes, extracts, and retrieves AI memory at scale. From MCP clients to vector search, knowledge graphs, and the auto-extraction engine.
High-Level System Architecture
End-to-end flow from MCP clients and REST API through the Cloudflare edge network to AI models and persistent storage.
Memory Write Pipeline
Every new memory flows through extraction, deduplication, embedding, storage, and graph population.
Memory Retrieval Pipeline
Semantic search embeds the query, searches the HNSW index, enriches with metadata, and returns ranked results.
Knowledge Graph Example
Entities and relationships auto-extracted from conversations. Each node is an entity, each edge is a typed relationship.
Auto-Extraction Sequence
Step-by-step sequence diagram showing how facts are extracted asynchronously after each conversation turn.
Database Schema (ER Diagram)
Core tables and relationships in Neon Postgres. Every memory links to a space, chunks, and knowledge graph entities.
Infrastructure Stack
Edge Compute
Cloudflare Workers
Request routing, auth, MCP protocol handling. Deployed to 300+ edge locations worldwide.
AI Models
Workers AI
BGE-large-en-v1.5 (1024-dim embeddings), Llama 3.1 8B (fact extraction), Whisper (audio transcription).
Vector Index
Cloudflare Vectorize
HNSW approximate nearest neighbor. Sub-millisecond search over millions of vectors.
Relational DB
Neon Postgres
Memories, entities, edges, spaces, users. Drizzle ORM with type-safe queries.
Real-time State
Durable Objects
Per-space knowledge graph state. Real-time graph updates and notifications.
File Storage
R2 / Workers
PDF, DOCX, image, audio file processing. Chunking and content extraction.
Database Schema
Core tables in Neon Postgres. Every memory is embedded, indexed, and optionally linked into the knowledge graph.
| Table | Key Columns | Purpose |
|---|---|---|
| memories | id, space, title, snippet, vectorId, metadata (JSONB) | Core memory storage. metadata holds fact_type, entities, confidence, auto_extracted flag. |
| chunks | id, memoryId, content, vectorId | Chunked content from large documents. Each chunk independently embedded. |
| entities | id, space, label, metadata | Knowledge graph nodes. Normalized entity names per space. |
| edges | id, fromId, toId, type, metadata | Knowledge graph relationships. Typed edges between entities. |
| spaces | id, owner, name, tags, summary, color, encrypted | Memory namespaces. Auto-generated tags/summary from extracted entities. |
| summaries | id, memoryId, text, vectorId | LLM-generated summaries for long content. Also embedded for search. |
| messages | id, space, sender, text, metadata | Chat messages within spaces. Trigger auto-extraction on user messages. |
| feedback | id, name, email, rating, category, message | User feedback submitted from the website. |
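The `chunks` table above holds independently embedded slices of large documents. A minimal sketch of how such a chunker might work, assuming a fixed window with overlap (the `chunkText` name and the size/overlap constants are illustrative, not the actual implementation):

```typescript
// Split long document text into overlapping chunks so each piece stays
// within the embedding model's input window. Sizes here are assumptions.
const CHUNK_SIZE = 1000;
const OVERLAP = 200;

function chunkText(text: string): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + CHUNK_SIZE));
    if (start + CHUNK_SIZE >= text.length) break;
    start += CHUNK_SIZE - OVERLAP; // step forward, keeping overlap for context
  }
  return chunks;
}
```

Each returned chunk would then be embedded separately and stored with its own `vectorId`, pointing back to the parent memory via `memoryId`.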
Auto-Extraction Engine
Every conversation triggers a fire-and-forget extraction pipeline. The response is sent to the user immediately — extraction happens asynchronously without blocking.
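On Cloudflare Workers, the standard way to run work after the response is returned is `ctx.waitUntil()`. A self-contained sketch of the fire-and-forget pattern; the `ExecutionContext` stub and `autoExtractAndStore` body are illustrative stand-ins, not the production code:

```typescript
// Fire-and-forget: reply immediately, let extraction finish in the background.
// On Workers, ctx.waitUntil() keeps the runtime alive for the pending promise;
// it is stubbed here so the sketch runs anywhere.
interface ExecutionContext {
  waitUntil(promise: Promise<unknown>): void;
}

let extractionDone = false;

async function autoExtractAndStore(text: string): Promise<void> {
  // Placeholder for the real pipeline: extract -> dedupe -> embed -> store.
  await new Promise((resolve) => setTimeout(resolve, 10));
  extractionDone = true;
}

function handleMessage(text: string, ctx: ExecutionContext): { reply: string } {
  ctx.waitUntil(autoExtractAndStore(text)); // not awaited: never blocks the reply
  return { reply: `ack: ${text}` };
}
```

The key property is that `handleMessage` returns before `autoExtractAndStore` resolves, so extraction latency never appears in the user-facing response time.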
1. Fact Extraction (Llama 3.1 8B)
Conversation text is sent to Llama 3.1 8B with a structured prompt. Returns JSON array of facts with type (preference, fact, relationship, event, skill, context), entities, and confidence score.
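Because the model's output is untrusted text, the returned JSON needs validation before anything is stored. A hedged sketch of that parsing step, assuming the fact shape described above (the `parseFacts` helper and field names are illustrative):

```typescript
// Parse and validate the fact array returned by the extraction model.
// Malformed JSON or facts with an unknown type are silently dropped.
const FACT_TYPES = ["preference", "fact", "relationship", "event", "skill", "context"] as const;
type FactType = (typeof FACT_TYPES)[number];

interface ExtractedFact {
  type: FactType;
  text: string;
  entities: string[];
  confidence: number;
}

function parseFacts(raw: string): ExtractedFact[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // model returned malformed JSON; skip this turn
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(
    (f): f is ExtractedFact =>
      typeof f === "object" && f !== null &&
      FACT_TYPES.includes((f as any).type) &&
      typeof (f as any).text === "string" &&
      Array.isArray((f as any).entities) &&
      typeof (f as any).confidence === "number"
  );
}
```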
2. Deduplication Check
Each extracted fact is embedded with BGE-large. The resulting vector is compared against existing memories in the space using cosine similarity. If any match exceeds 0.88, the fact is skipped as a duplicate.
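The dedup check reduces to a cosine-similarity threshold over the space's existing vectors. A self-contained sketch; the 0.88 threshold comes from the step above, while the function names are illustrative:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const DUPLICATE_THRESHOLD = 0.88;

// A fact is a duplicate if any existing memory vector is close enough.
function isDuplicate(factVec: number[], existing: number[][]): boolean {
  return existing.some((vec) => cosine(factVec, vec) > DUPLICATE_THRESHOLD);
}
```

In production the nearest-neighbor lookup would go through the Vectorize index rather than a linear scan, but the threshold decision is the same.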
3. Memory Storage
Non-duplicate facts are stored in the memories table with structured metadata: fact_type, entities array, confidence, and auto_extracted flag. The embedding vector is indexed in Vectorize.
4. Knowledge Graph Population
Entities from each fact are upserted into the entities table (normalized, per-space). Edges are created between co-occurring entities with the relationship type and source fact. The space's Durable Object is notified for real-time graph sync.
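Edge creation between co-occurring entities is essentially a pairwise expansion over each fact's entity list. A sketch under stated assumptions (the `Edge` shape and lowercase normalization are illustrative, not the actual schema):

```typescript
interface Edge {
  from: string;
  to: string;
  type: string;
}

// Normalize entity labels so "TypeScript" and "typescript" map to one node.
function normalize(label: string): string {
  return label.trim().toLowerCase();
}

// One edge per unordered pair of distinct co-occurring entities in a fact.
function edgesForFact(entities: string[], relType: string): Edge[] {
  const nodes = [...new Set(entities.map(normalize))];
  const edges: Edge[] = [];
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      edges.push({ from: nodes[i], to: nodes[j], type: relType });
    }
  }
  return edges;
}
```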
5. Space Meta Update
The space's tags are auto-generated from the most frequent entities. The summary is rebuilt from fact type distribution (e.g., "12 preferences, 8 facts, 3 relationships").
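Both derived fields are simple aggregations over the space's extracted facts. A plausible sketch (the helper names and the top-5 cutoff are assumptions):

```typescript
// Top-N most frequent entities across all facts become the space's tags.
function topTags(entityLists: string[][], n = 5): string[] {
  const counts = new Map<string, number>();
  for (const list of entityLists)
    for (const entity of list) counts.set(entity, (counts.get(entity) ?? 0) + 1);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([entity]) => entity);
}

// The summary is rebuilt from the fact-type distribution.
function buildSummary(factTypes: string[]): string {
  const counts = new Map<string, number>();
  for (const t of factTypes) counts.set(t, (counts.get(t) ?? 0) + 1);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([type, count]) => `${count} ${type}${count === 1 ? "" : "s"}`)
    .join(", ");
}
```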
Vector Search
Semantic search is the core retrieval mechanism. Queries are embedded in real-time and matched against the HNSW index.
Index parameters: 1024 embedding dimensions, HNSW index algorithm, cosine similarity metric.
```typescript
// Retrieval with metadata enrichment (chat handler)
const results = await queryVectors(env, queryEmbedding, spaceId, topK)

// Each result includes:
{
  text: "User prefers TypeScript with strict mode",
  score: 0.94,                // cosine similarity
  metadata: {
    fact_type: "preference",  // from auto-extraction
    entities: ["TypeScript"], // linked to knowledge graph
    confidence: 0.92,         // extraction confidence
    auto_extracted: true      // vs. manually stored
  }
}

// Context sent to LLM:
// "[preference] User prefers TypeScript with strict mode (related: TypeScript)"
```

MCP Protocol Integration
PersistMemory implements the Model Context Protocol as a remote server. MCP clients connect via `npx mcp-remote`, authenticate with OAuth, and discover memory tools automatically.
```
MCP Client (Claude / Cursor / Windsurf)
│
├─ tools/list → discovers: addMemory, search, listMemories, deleteMemory
│
├─ tools/call: addMemory
│    → Worker receives text
│    → embedText() → BGE-large → 1024-dim vector
│    → INSERT memories + upsertVector()
│    → async: autoExtractAndStore() (fire-and-forget)
│    → return { success: true, id }
│
├─ tools/call: search
│    → Worker receives query
│    → embedText() → queryVectors(topK=5)
│    → JOIN memories for text + metadata
│    → return ranked results with scores
│
└─ tools/call: deleteMemory
     → DELETE from memories + deleteVector()
     → return { success: true }
```

Security Model
Space Isolation
Every memory operation is scoped to a space. Spaces are owned by users with strict access control. No cross-space data leakage.
OAuth Authentication
MCP clients authenticate via OAuth flow. API keys for REST access. JWT tokens with short expiry.
End-to-End Encryption
Optional per-space encryption. When enabled, memory content is encrypted before storage. Server cannot read encrypted memories.
Edge Processing
All AI inference runs on Cloudflare Workers AI. Data never leaves the Cloudflare network for embedding or extraction.
Built for Production
This architecture handles millions of memories at edge speed. Try it free.