
Performance Benchmarks

Real numbers from PersistMemory's production infrastructure. Latency percentiles, retrieval accuracy, extraction precision, and system architecture.

Last updated: March 2026 · Measured on Cloudflare Workers edge network

System Architecture

End-to-end flow from MCP clients through the edge network to persistent storage.

Memory Write Pipeline

Every conversation triggers automatic extraction, deduplication, storage, and graph updates.
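The write path can be sketched as a simple loop. This is an illustrative sketch only; every function and name below (`extract_facts`, `handle_conversation`, the in-memory `store` dict standing in for Vectorize and Postgres) is hypothetical, not PersistMemory's actual API.

```python
# Sketch of the write pipeline: extract -> dedup -> store -> graph update.
# All names here are illustrative stubs, not PersistMemory's real interface.

def extract_facts(conversation: str) -> list[str]:
    # Stand-in for LLM-based fact extraction (Llama 3.1 8B in production).
    return [line.strip() for line in conversation.splitlines() if line.strip()]

def handle_conversation(conversation: str, store: dict[str, str]) -> list[str]:
    stored = []
    for fact in extract_facts(conversation):
        if fact in store.values():      # stand-in for the 0.88 cosine dedup check
            continue                    # skip near-duplicates
        memory_id = f"mem-{len(store)}"
        store[memory_id] = fact         # stand-in for Vectorize + Postgres write
        # knowledge-graph update (entity/relation linking) would happen here
        stored.append(memory_id)
    return stored
```

The key design point the sketch captures: deduplication runs before storage, so repeated facts in a conversation produce a single memory.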

Memory Retrieval Pipeline

Semantic search with metadata enrichment delivers context to any MCP client in under 35ms (p95).

Knowledge Graph Visualization

Entities and relationships are extracted automatically and linked into a queryable graph.

Latency at a Glance (p50)

Median response times for core operations.

Knowledge graph query: 12ms
Semantic search (top-5): 18ms
Dedup check: 22ms
Store memory: 45ms
MCP tool (e2e): 65ms
Auto extraction: 280ms

Latency Benchmarks

Measured at the API layer from the Cloudflare edge network. All figures include network overhead.

| Operation | p50 | p95 | p99 |
|---|---|---|---|
| Store memory (embed + index) | 45ms | 82ms | 120ms |
| Semantic search (top-5) | 18ms | 35ms | 52ms |
| Auto fact extraction | 280ms | 450ms | 620ms |
| Knowledge graph query | 12ms | 28ms | 41ms |
| Deduplication check | 22ms | 38ms | 55ms |
| MCP tool invocation (e2e) | 65ms | 110ms | 160ms |

Accuracy & Quality

Measured against human-labeled evaluation datasets. Embedding model: BGE-large-en-v1.5. Extraction model: Llama 3.1 8B.

| Metric | Value | Description |
|---|---|---|
| Retrieval recall@5 | 94.2% | Correct memory in top 5 results |
| Retrieval recall@10 | 97.8% | Correct memory in top 10 results |
| Fact extraction precision | 91.5% | Extracted facts that are correct |
| Fact extraction recall | 87.3% | Relevant facts actually extracted |
| Dedup accuracy | 96.1% | Duplicates correctly identified at 0.88 threshold |
| Entity extraction F1 | 89.7% | Knowledge graph entity detection |
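The metrics above follow the standard definitions of recall@k and precision/recall/F1. A minimal sketch of how they are computed against a labeled evaluation set (generic helper functions, not PersistMemory's evaluation harness):

```python
def recall_at_k(results: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose labeled correct memory appears in the top-k results."""
    hits = sum(1 for ranked, gold in zip(results, relevant) if gold in ranked[:k])
    return hits / len(relevant)

def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Extraction metrics: precision, recall, and F1 over extracted facts."""
    tp = len(predicted & gold)                          # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, an extractor that produces three facts of which two match the human labels, against a gold set of three facts, scores 66.7% precision and 66.7% recall.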

Scale & Throughput

Production limits and performance at scale. PersistMemory runs on Cloudflare's global edge network with Vectorize indexing.

Max memories per space: 1,000,000+

Concurrent MCP connections: 10,000+

Search latency at 100K memories: <25ms p50

Search latency at 1M memories: <40ms p50

Extraction throughput: ~200 conversations/min

Graph edges per space: Unlimited

How We Compare

Indicative comparison based on publicly available data and our own testing.

| Metric | PersistMemory | Mem0 | Raw Pinecone |
|---|---|---|---|
| Search latency (p50) | 18ms | ~50ms | ~10ms |
| Auto extraction | Built-in (280ms) | Built-in | N/A (manual) |
| Knowledge graph | Included, all plans | Enterprise only | N/A |
| Deduplication | Automatic (0.88) | Basic | N/A |
| MCP support | Native | Limited | None |
| Setup time | <1 min | ~15 min | ~30 min |

Methodology

Infrastructure: Cloudflare Workers (edge compute), Vectorize (vector index), Neon Postgres (relational storage), Workers AI (embeddings & extraction).

Embedding model: BGE-large-en-v1.5 (1024 dimensions) via Cloudflare Workers AI.

Extraction model: Llama 3.1 8B Instruct via Cloudflare Workers AI for auto fact extraction.

Latency measurement: End-to-end API response times measured from edge locations, aggregated over 10,000 requests across multiple regions. Excludes cold starts.
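Reporting p50/p95/p99 from a batch of latency samples reduces to a percentile function. A minimal nearest-rank sketch (this mirrors how the reported figures are derived, not the exact benchmark harness):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample at or above rank ceil(p% * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]
```

On 10,000 samples, `percentile(samples, 99)` returns the 9,900th-slowest request, which is why p99 is far more sensitive to tail behavior than an average.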

Accuracy measurement: Evaluated against a hand-labeled dataset of 2,000 developer conversations covering code, preferences, architecture decisions, and project context.

Deduplication threshold: Cosine similarity of 0.88 between BGE-large embeddings. Tuned for high precision (minimize false merges) at the cost of slightly lower recall.
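The dedup check described above is a cosine-similarity comparison against existing embeddings. A minimal sketch, assuming plain Python lists as vectors (the `is_duplicate` helper is illustrative, not the production implementation, which queries the vector index rather than scanning):

```python
import math

DEDUP_THRESHOLD = 0.88  # cosine similarity cutoff from the methodology above

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def is_duplicate(new_vec: list[float], existing: list[list[float]]) -> bool:
    """Flag a new memory as a duplicate if any stored embedding clears the threshold."""
    return any(cosine_similarity(new_vec, v) >= DEDUP_THRESHOLD for v in existing)
```

Raising the threshold toward 1.0 merges fewer memories (higher precision, lower recall); 0.88 trades a little recall for very few false merges.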

Try It Yourself

These numbers are from production. Sign up free and run your own benchmarks.