Performance Benchmarks
Real numbers from PersistMemory's production infrastructure. Latency percentiles, retrieval accuracy, extraction precision, and system architecture.
Last updated: March 2026 · Measured on Cloudflare Workers edge network
System Architecture
End-to-end flow from MCP clients through the edge network to persistent storage.
Memory Write Pipeline
Every conversation triggers automatic extraction, deduplication, storage, and graph updates.
Memory Retrieval Pipeline
Semantic search with metadata enrichment delivers context to any MCP client in under 35ms (p95).
Knowledge Graph Visualization
Entities and relationships are extracted automatically and linked into a queryable graph.
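To make the graph idea concrete, here is a minimal, illustrative sketch of how extracted entities and relationships could be linked and queried. This is a toy in-memory structure, not PersistMemory's actual implementation; all names and triples are hypothetical.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy graph: entities as nodes, (subject, relation, object) triples as edges."""

    def __init__(self):
        # subject -> list of (relation, object) pairs
        self.edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def query(self, subject, relation=None):
        """Return objects linked to `subject`, optionally filtered by relation."""
        return [o for r, o in self.edges[subject] if relation is None or r == relation]

# Hypothetical triples as an auto-extractor might emit them
g = KnowledgeGraph()
g.add("alice", "works_on", "persistmemory")
g.add("persistmemory", "deployed_on", "cloudflare-workers")
g.add("persistmemory", "uses", "vectorize")

print(g.query("persistmemory"))          # all linked objects
print(g.query("persistmemory", "uses"))  # filtered by relation
```

A production graph adds reverse indexes and persistence, but the queryable triple store is the core shape.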
Latency Benchmarks
Measured at the API layer from Cloudflare edge. All operations include network overhead.
| Operation | p50 | p95 | p99 |
|---|---|---|---|
| Store memory (embed + index) | 45ms | 82ms | 120ms |
| Semantic search (top-5) | 18ms | 35ms | 52ms |
| Auto fact extraction | 280ms | 450ms | 620ms |
| Knowledge graph query | 12ms | 28ms | 41ms |
| Deduplication check | 22ms | 38ms | 55ms |
| MCP tool invocation (e2e) | 65ms | 110ms | 160ms |
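Percentiles like the ones above can be computed from raw latency samples with a simple nearest-rank calculation; the sketch below uses simulated timings, since the real harness would time live API calls.

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Simulated latencies in ms; in a real benchmark these come from timed requests,
# with cold starts excluded as described in the methodology.
random.seed(0)
latencies = [random.lognormvariate(3.0, 0.4) for _ in range(10_000)]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.1f}ms")
```

Nearest-rank is the simplest of several percentile definitions; interpolating variants (as in `statistics.quantiles`) give slightly different values on small samples.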
Accuracy & Quality
Measured against human-labeled evaluation datasets. Embedding model: BGE-large-en-v1.5. Extraction model: Llama 3.1 8B.
| Metric | Value | Description |
|---|---|---|
| Retrieval recall@5 | 94.2% | Correct memory in top 5 results |
| Retrieval recall@10 | 97.8% | Correct memory in top 10 results |
| Fact extraction precision | 91.5% | Extracted facts that are correct |
| Fact extraction recall | 87.3% | Relevant facts actually extracted |
| Dedup accuracy | 96.1% | Duplicates correctly identified at 0.88 threshold |
| Entity extraction F1 | 89.7% | Knowledge graph entity detection |
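The metrics in this table follow standard definitions; a minimal sketch of how precision, recall, F1, and recall@k are scored against a labeled gold set (the example facts are hypothetical):

```python
def precision_recall_f1(extracted, gold):
    """Score extracted facts against a hand-labeled gold set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # true positives: correct extractions
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def recall_at_k(results, relevant, k):
    """1.0 if any relevant item appears in the top-k results, else 0.0.
    Averaged over all queries, this yields recall@k."""
    return float(any(r in relevant for r in results[:k]))

p, r, f1 = precision_recall_f1(
    extracted={"uses postgres", "prefers tabs", "lives in berlin"},
    gold={"uses postgres", "prefers tabs", "works at acme"},
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reported figures are means over the full 2,000-conversation evaluation set described in the methodology below.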
Scale & Throughput
Production limits and performance at scale. PersistMemory runs on Cloudflare's global edge network with Vectorize indexing.
| Limit | Value |
|---|---|
| Max memories per space | 1,000,000+ |
| Concurrent MCP connections | 10,000+ |
| Search latency at 100K memories | <25ms p50 |
| Search latency at 1M memories | <40ms p50 |
| Extraction throughput | ~200 conversations/min |
| Graph edges per space | Unlimited |
How We Compare
Indicative comparison based on publicly available data and our own testing.
| Metric | PersistMemory | Mem0 | Raw Pinecone |
|---|---|---|---|
| Search latency (p50) | 18ms | ~50ms | ~10ms |
| Auto extraction | Built-in (280ms) | Built-in | N/A (manual) |
| Knowledge graph | Included, all plans | Enterprise only | N/A |
| Deduplication | Automatic (0.88) | Basic | N/A |
| MCP support | Native | Limited | None |
| Setup time | <1 min | ~15 min | ~30 min |
Methodology
Infrastructure: Cloudflare Workers (edge compute), Vectorize (vector index), Neon Postgres (relational storage), Workers AI (embeddings & extraction).
Embedding model: BGE-large-en-v1.5 (1024 dimensions) via Cloudflare Workers AI.
Extraction model: Llama 3.1 8B Instruct via Cloudflare Workers AI for auto fact extraction.
Latency measurement: End-to-end API response times measured from the edge; percentiles computed over 10,000 requests across multiple regions. Cold starts excluded.
Accuracy measurement: Evaluated against a hand-labeled dataset of 2,000 developer conversations covering code, preferences, architecture decisions, and project context.
Deduplication threshold: Cosine similarity of 0.88 between BGE-large embeddings. Tuned for high precision (minimize false merges) at the cost of slightly lower recall.
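The threshold check itself is straightforward; a minimal sketch of cosine-similarity dedup at 0.88, using plain lists in place of real 1024-dimensional BGE embeddings (vectors here are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

DEDUP_THRESHOLD = 0.88  # value from the methodology above

def is_duplicate(new_embedding, existing_embeddings, threshold=DEDUP_THRESHOLD):
    """Flag a new memory as a duplicate if any stored embedding clears the threshold."""
    return any(cosine_similarity(new_embedding, e) >= threshold for e in existing_embeddings)

print(is_duplicate([1.0, 0.0], [[0.99, 0.05]]))  # near-identical vectors -> True
print(is_duplicate([1.0, 0.0], [[0.0, 1.0]]))    # orthogonal vectors -> False
```

In production the candidate set comes from a vector-index lookup rather than a linear scan, but the threshold decision is the same.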
Try It Yourself
These numbers are from production. Sign up free and run your own benchmarks.