AI Memory Landscape
This page positions Sibyl against the May 2026 AI memory systems field. The headline result — 500/500 hit@5, 96.96% strict R@5, 98.90% R@10 on LongMemEval-S — is one number in a noisy landscape. The point of this page is to make the comparison legible without overclaim and without burying real competitive strengths.
The single most important framing for the rest of this page comes first.
The Apples-and-Oranges Problem
"LongMemEval score" is two different numbers depending on which lane the publisher is running in. Most public comparison tables conflate them.
Retrieval recall (the axis Sibyl reports). Did the retriever surface the right answer session(s)? Measured as Recall@K and NDCG@K directly against the gold answer_session_ids in the dataset. No model generation, no LLM judge.
End-to-end QA accuracy (a different axis). Did the full system answer the question correctly? Measured by retrieving sessions, generating an answer with a strong reader model, then grading that answer with GPT-4o as judge using the per-question rubric from the original LongMemEval paper. Combines retrieval + reading + generation + judging into a single percentage.
The retrieval number is strictly easier than the QA accuracy number on the same dataset because finding the right sessions is necessary but not sufficient to answer correctly. Reading the sessions, reasoning across them, producing a judged-correct answer all add failure modes that retrieval alone does not have. Vectorize's Agent Memory Benchmark Manifesto and rohitg00's LONGMEMEVAL.md analysis both call out this conflation as the leading source of leaderboard noise.
There is a second distinction inside the retrieval lane:
- recall_any@K (sometimes called "hit@K"): did any gold session appear in the top K?
- recall_all@K (strict R@K): for multi-answer questions, did the retriever surface every gold session? Many LongMemEval-S questions have multiple correct sessions (250 of 500 have exactly 2, 41 have 3, the rest more). recall_any is strictly easier than recall_all.
Sibyl reports both: hit@5 = 100% (the easier metric, equivalent to recall_any@5) and recall@5 = 96.96% (the strict multi-answer metric). When MemPalace, agentmemory, and Schift report "R@5", they generally mean recall_any@5.
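The two definitions can be made concrete with a minimal Python sketch. The function names and the binary all-or-nothing scoring per question are assumptions taken from the wording above; the real harness may score partial coverage of multi-answer questions differently.

```python
from typing import Sequence, Set


def recall_any_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """hit@K: 1.0 if at least one gold session appears in the top k."""
    return 1.0 if gold & set(ranked[:k]) else 0.0


def recall_all_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Strict variant: 1.0 only if every gold session appears in the top k."""
    return 1.0 if gold <= set(ranked[:k]) else 0.0


# Hypothetical multi-answer question: two gold sessions, one ranked sixth.
ranked_sessions = ["s3", "s7", "s1", "s9", "s4", "s2"]
gold_sessions = {"s1", "s2"}

print(recall_any_at_k(ranked_sessions, gold_sessions, k=5))  # 1.0
print(recall_all_at_k(ranked_sessions, gold_sessions, k=5))  # 0.0
print(recall_all_at_k(ranked_sessions, gold_sessions, k=6))  # 1.0
```

The same ranking counts as a hit under recall_any@5 and as a miss under recall_all@5, which is why the two numbers cannot share a column.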
Where Sibyl Sits
Sibyl's defensible position is the intersection of six properties. No competitor hits all six.
| Property | Sibyl | Notes |
|---|---|---|
| Self-hosted, open source | ✓ | Apache-2.0. No mandatory cloud. |
| Graph-native runtime (graph + vector + full-text + traversal) | ✓ | SurrealDB unified. |
| Physical tenant isolation (namespace-per-org, not filter-based) | ✓ | SurrealDB namespace boundary. |
| Source-preserving memory records | ✓ | Session entities keep original content. |
| Live API path benchmarking with reproducible CI artifacts | ✓ | The full eval runs against POST /api/search. |
| No LLM in the retrieval or extraction path (for the benchmark) | ✓ | OpenAI embeddings used; no LLM extraction, no LLM reranking. |
Going around the field on those six dimensions:
- Cognee gets closest. Self-hosted, graph-native, physical tenant isolation via EBAC permissions, live benchmarking. Trails on LLM-free retrieval — extraction is LLM-driven.
- Graphiti (Zep's underlying engine) is the closest architectural sibling. Zep itself deprecated self-hosted Community Edition in April 2025; Graphiti the library still runs, but you bring your own DB, your own multi-tenancy, and your own LLM extraction.
- Memweave hits source-preserving, no-LLM, live, and self-hosted. Trails on graph-native and multi-tenant — it's single-process file-on-disk.
- Mastra Observational Memory hits self-hosted and live benchmarking but runs two LLMs (Observer + Reflector) continuously to compress conversations into observations. Not source-preserving by design.
- Mem0 hits live benchmarking on a hosted product. Trails on graph-native (vector-first with entity linking), source-preserving (single-pass LLM extraction is the default), and LLM-free retrieval.
- Letta hits self-hosted but is a stateful-agent runtime, not a memory substrate; has not published LongMemEval numbers.
The combination is the position. No single property is unique.
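To illustrate why the table distinguishes namespace-per-org from filter-based isolation, here is a toy in-memory sketch. Both store classes are hypothetical and are not Sibyl or SurrealDB code; they only model where the tenant boundary lives.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Optional


class FilterBasedStore:
    """All tenants share one collection; isolation depends on each query's filter."""

    def __init__(self) -> None:
        self._rows: List[Dict[str, str]] = []

    def add(self, org: str, text: str) -> None:
        self._rows.append({"org": org, "text": text})

    def search(self, term: str, org: Optional[str] = None) -> List[str]:
        # A caller that omits `org` searches every tenant's rows.
        return [
            row["text"]
            for row in self._rows
            if term in row["text"] and (org is None or row["org"] == org)
        ]


class Namespace:
    """A handle bound to one tenant's rows; it has no way to name another tenant."""

    def __init__(self, rows: List[str]) -> None:
        self._rows = rows

    def add(self, text: str) -> None:
        self._rows.append(text)

    def search(self, term: str) -> List[str]:
        return [text for text in self._rows if term in text]


class NamespacedStore:
    """Each tenant gets a separate collection, selected before any query runs."""

    def __init__(self) -> None:
        self._namespaces: DefaultDict[str, List[str]] = defaultdict(list)

    def namespace(self, org: str) -> Namespace:
        return Namespace(self._namespaces[org])


shared = FilterBasedStore()
shared.add("acme", "acme launch plan")
shared.add("globex", "globex launch plan")
print(shared.search("launch"))  # both tenants' rows: the filter was forgotten

isolated = NamespacedStore()
isolated.namespace("acme").add("acme launch plan")
isolated.namespace("globex").add("globex launch plan")
print(isolated.namespace("acme").search("launch"))  # only acme's row
```

In the filter-based model a single missing predicate leaks data across tenants; in the namespaced model the boundary is chosen before the query exists, so there is no predicate to forget.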
Retrieval-Axis Comparison (Sibyl's Lane)
These are systems publishing retrieval-layer numbers on LongMemEval-S. Apples-to-apples or close to it.
| System | Headline | Metric type | Strict multi-answer | LLM in retrieval | Live API | Tenant isolation |
|---|---|---|---|---|---|---|
| Sibyl | 96.96% R@5, 98.90% R@10 | strict R@K | ✓ | ✗ | ✓ | ✓ |
| MemPalace raw | 96.6% R@5 | recall_any@K | ✗ | ✗ | ✗ | ✗ |
| MemPalace hybrid | 100% R@5 (full), 98.4% held-out | recall_any@K | ✗ | ✓ (Haiku) | ✗ | ✗ |
| Memweave | 98.0% R@5, 99.11% R@10 | recall_any@K | unclear | ✗ | ✗ | filesystem |
| agentmemory | 95.2% R@5, 98.6% R@10 | recall_any@K | ✗ | ✗ | ✗ | ✗ |
| Schift | 96.0% R@5 | unclear | unclear | ✗ | ✗ | ✗ |
A few honest readings of this table:
Memweave is the cleanest direct competitor on retrieval quality. Its 98.0% R@5 / 99.11% R@10 on a 450-question held-out split is real, well documented, cross-validated (±0.12% std dev), and methodologically transparent. The held-out split excludes 50 questions used for tuning, and the metric is recall_any rather than strict recall_all, so the comparison is not perfectly apples-to-apples — but Memweave is the system to point at when someone asks "is anyone close to Sibyl on this axis?". Its real edge is brutal simplicity: plain Markdown source files, SQLite + sqlite-vec + FTS5 index, zero infrastructure, graceful degradation. For a single developer on a laptop, Memweave is a defensible choice.
MemPalace had the loudest 2026 launch and the public methodology hasn't held up under independent review. The 96.6% raw number is a ChromaDB + all-MiniLM-L6-v2 baseline; the palace architecture is not actually exercised in the benchmark, and turning the palace features on reduces recall (89.4% with rooms, 84.2% with AAAK compression). The 100% hybrid result was overfit by iteratively patching failing questions until they passed (Vectorize critique, MemPalace #875). The honest MemPalace number is the 98.4% held-out result. Even that is recall_any@5 over a single-tenant local benchmark, not a strict-multi-answer live-API run.
agentmemory by rohitg00 is the small, clean reference point. BM25 + all-MiniLM-L6-v2 hybrid, no LLM in the loop, explicitly flags the recall_any vs strict distinction in its own README. Less prominent than MemPalace but its numbers are believable and its methodology disclosure is exemplary.
Schift reports 96.0% R@5 with their own schift-embed-1 embedder. The metric definition (any vs all) is not explicit in the blog post. Treat as a peer in the same lane with somewhat opaque methodology.
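The held-out discipline discussed above can be sketched in a few lines. The 50/450 sizes mirror the Memweave split described here; the seed, the ID format, and the function name are assumptions for illustration, not any project's actual procedure.

```python
import random
from typing import List, Sequence, Tuple


def split_tuning_and_held_out(
    question_ids: Sequence[str], tuning_size: int, seed: int
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically, then reserve the first block for tuning."""
    shuffled = list(question_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:tuning_size], shuffled[tuning_size:]


# Hypothetical IDs standing in for the 500 LongMemEval-S questions.
all_ids = [f"q{i:03d}" for i in range(500)]
tuning_ids, held_out_ids = split_tuning_and_held_out(all_ids, tuning_size=50, seed=0)

# Tune only against tuning_ids; report the headline only on held_out_ids.
print(len(tuning_ids), len(held_out_ids))  # 50 450
```

Patching failures observed on the reporting set and then re-reporting on that same set is what turns a benchmark into a fitted number; keeping the two sets disjoint prevents it.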
QA-Accuracy Comparison (Different Lane)
These systems report end-to-end QA accuracy on LongMemEval-S with an LLM judge. Sibyl does not currently publish a number on this axis. Listing them here so the contrast is explicit and so the comparison cannot be misread.
| System | Headline | Reader model | Judge | LLM extraction | LLM reranker |
|---|---|---|---|---|---|
| OMEGA | 95.4% | GPT-4.1 | likely GPT-4o | likely | ✓ |
| Mastra OM | 94.87% | GPT-5-mini | GPT-4o | ✓ | — |
| Mem0 (Apr 2026 algo) | 94.4% | (managed) | GPT-4o | ✓ | ✓ |
| ByteRover | 92.8% | Gemini 3.1 Pro | Gemini 3 Flash | ✓ | — |
| Hindsight (Vectorize) | 91.4% | Gemini 3 Pro | GPT-4o | ✓ | ✓ |
| Memoria (MatrixOrigin) | 88.78% | GPT-5.4 | GPT-5.4 | not stated | not stated |
| Emergence AI | 86% | GPT-4o | GPT-4o | ✓ | ✓ |
| Supermemory | 81.6–85.4% | various | GPT-4o | ✓ | unknown |
| RetainDB | 79% | (in-context) | GPT-4o | ✓ | ✗ |
| Zep (Cloud) | 71.2% | GPT-4o | GPT-4o | ✓ | — |
Putting Sibyl's 96.96% R@5 next to Mem0's 94.4% QA-accuracy or Mastra's 94.87% as if they were the same metric is the exact category error MemPalace was called out for. The two axes answer different questions:
- Retrieval R@K answers: "did we find the right context?"
- QA accuracy answers: "did the whole pipeline produce a correct answer that GPT-4o agreed with?"
Both matter. Sibyl is the retrieval substrate; a downstream reader model and prompt determine the QA accuracy on top of it. Adding a reader pass and judge to publish a comparable QA-accuracy number is a deliberate next step, not a hidden gap.
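How the stages compose can be sketched as follows. Every name here (qa_accuracy and the toy retriever, reader, and judge) is a hypothetical stand-in: a real run would call a retrieval API, a reader model, and the GPT-4o judge with the LongMemEval rubric instead of these stubs.

```python
from typing import Callable, Dict, List, Sequence

Retriever = Callable[[str], List[str]]        # question -> retrieved session texts
Reader = Callable[[str, Sequence[str]], str]  # (question, sessions) -> answer
Judge = Callable[[str, str], bool]            # (answer, rubric) -> verdict


def qa_accuracy(
    questions: Sequence[Dict[str, str]],
    retrieve: Retriever,
    read: Reader,
    judge: Judge,
) -> float:
    """Fraction of questions whose generated answer the judge accepts."""
    if not questions:
        return 0.0
    correct = 0
    for item in questions:
        sessions = retrieve(item["question"])
        answer = read(item["question"], sessions)
        correct += judge(answer, item["rubric"])
    return correct / len(questions)


# Toy stand-ins: retrieval is perfect, the reader is deliberately naive.
SESSIONS = ["The user adopted a cat named Miso.", "The user moved to Lisbon."]


def toy_retrieve(question: str) -> List[str]:
    return list(SESSIONS)  # always surfaces every gold session


def toy_read(question: str, sessions: Sequence[str]) -> str:
    return sessions[0]  # answers from the first session only


def toy_judge(answer: str, rubric: str) -> bool:
    return rubric in answer


toy_questions = [
    {"question": "What is the cat's name?", "rubric": "Miso"},
    {"question": "Where did the user move?", "rubric": "Lisbon"},
]
print(qa_accuracy(toy_questions, toy_retrieve, toy_read, toy_judge))  # 0.5
```

Retrieval surfaces the right session for both toy questions, yet QA accuracy is 50%, because the reader stage adds its own failures on top of retrieval.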
The Architectural Landscape
Seven rough clusters cover most of the field as of May 2026.
Cluster 1: Hosted-First Commercial Platforms
Mem0 (Cloud + OpenMemory self-host), Zep Cloud, AWS AgentCore Memory. Vector-first or hybrid. All three want you on their cloud. Zep killed self-hosted Community Edition in 2025; Mem0's April 2026 algorithm rewrite quietly dropped advertised graph-store integrations in favor of internal entity linking. Physical tenant isolation is usually an Enterprise-tier feature.
Cluster 2: Self-Hostable Graph-Native Engines
Cognee, MemOS (MemTensor), and raw Graphiti as a library. Sibyl's closest architectural siblings. Cognee in particular ships per-tenant graph + vector store isolation via its EBAC permission model — the most direct production analog to Sibyl's namespace-per-org pattern.
Cluster 3: Agent Runtimes That Bundle Memory
Letta (formerly MemGPT), Mastra, and Agno (formerly Phidata). These compete on the "stateful agent platform" axis, with memory as one of several primitives. Letta is the most mature; Mastra's Observational Memory is the only architecture in this cluster posting credible >94% LongMemEval QA-accuracy scores.
Cluster 4: Framework-Coupled Memory Libraries
LangMem (LangGraph), CrewAI memory, third-party shims for Pydantic AI. Cheap if you're already in the framework, irrelevant if you're not. Note: AutoGen entered maintenance mode in Q1 2026; Microsoft moved active development to the Microsoft Agent Framework.
Cluster 5: Source-Preserving Local Libraries
Memweave and MemPalace. Both reject vector DBs as required infrastructure and store source artifacts as the truth. Memweave is the honest, methodologically clean version; MemPalace is the viral version where benchmark methodology fell apart on independent review.
Cluster 6: Provider-Bundled Consumer Memory
Claude memory (consumer + Managed Agents filesystem memory, launched April 2026), ChatGPT memory. Not direct competitors to a self-hosted developer-facing memory system, but they shape user expectations. Anthropic's filesystem-mounted Managed Agents memory plus 1M-token context GA is the platform-level threat: as context grows, fewer problems require dedicated memory infrastructure.
Cluster 7: Research Frontier
Memory-R1, A-MEM, Mem-α, MAGMA, Kumiho, FiFA, MemOS. None production-ready in May 2026. The next paradigm shift is RL-learned memory operations (store, retrieve, update, summarize, discard as tools the agent uses, with policy learned via PPO or GRPO).
Academic Frontier: Where Sibyl Trails
Sibyl is at the LongMemEval-S retrieval ceiling. The field has moved on. Honest assessment of where Sibyl trails academic SOTA:
- Cross-encoder reranker. BGE-reranker-v2-m3 and ColBERT add a reported +33–40% accuracy at a 50–100 ms latency cost on most public benchmarks. Sibyl uses interpretable query-aware ranking instead. That choice buys observability and a lower cost per query; the cap is strict-recall ranking quality on diffuse-evidence questions (single-session-preference at 79.26% NDCG@5 illustrates this).
- Principled forgetting and consolidation. FadeMem reports 45% storage reduction with biologically inspired exponential decay; FiFA introduces six forgetting policies (FIFO, LRU, priority decay, reflection-summary, random-drop, hybrid) with privacy sensitivity scores. Sibyl accumulates memory indefinitely today.
- Procedural memory and skill learning. Letta's skill learning reports +36.8% relative on Terminal-Bench 2.0. Sibyl has task learnings but no procedural-memory primitive that an agent can update its own behavior from.
- Temporal decay scoring fully enabled. Sibyl has temporal boosting code, but it is not enabled by default and does not integrate deeply with the main search path. The Stanford generative agents recency + importance + relevance scoring model has become a de facto standard.
- Multi-graph disentanglement. MAGMA splits memory into four orthogonal graphs (semantic, temporal, causal, entity) with intent-aware query routing, achieving +45.5% accuracy on LOCOMO at 0.7–4.2k tokens per query vs 101k for full context. Sibyl uses a single graph.
- RL-learned memory operations. Memory-R1 trains store/retrieve/update/summarize/discard as RL-tuned tools with 152 training pairs, gaining 31% F1 / 49% BLEU / 36% LLM-judge over Mem0. Sibyl uses hand-crafted query frames.
- Belief revision semantics. Kumiho proves AGM belief-revision postulates over a property graph runtime. Sibyl handles contradictions implicitly through projection.
- LongMemEval-V2. The agent-shaped successor benchmark (May 2026, 451 questions, up to 115M tokens) tests workflow knowledge, environment gotchas, premise awareness — closer to operational competence than conversational recall. Sibyl has no published number here. Best published is AgentRunbook-C at 72.5%.
- BEAM (10M-token long-memory). Hindsight reports 64.1% at 10M tokens, +58% over the next system. Sibyl has not been evaluated at this scale.
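As one illustration of the recency + importance + relevance model mentioned in the list above, here is a minimal sketch. It is not Sibyl's temporal boosting code. The Memory fields, the equal weights, and the 0.995 hourly decay are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Memory:
    text: str
    hours_since_access: float
    importance: float  # assumed scale, e.g. 1 to 10
    relevance: float   # assumed to be a query similarity score


def _min_max(values: Sequence[float]) -> List[float]:
    """Rescale to [0, 1]; a constant column carries no signal and maps to 0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(value - low) / (high - low) for value in values]


def rank_memories(
    memories: Sequence[Memory],
    decay: float = 0.995,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[Tuple[float, Memory]]:
    """Score = weighted sum of normalized recency, importance, and relevance."""
    if not memories:
        return []
    recency = _min_max([decay ** m.hours_since_access for m in memories])
    importance = _min_max([m.importance for m in memories])
    relevance = _min_max([m.relevance for m in memories])
    w_rec, w_imp, w_rel = weights
    scored = [
        (w_rec * r + w_imp * i + w_rel * v, m)
        for r, i, v, m in zip(recency, importance, relevance, memories)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    Memory("old but highly relevant", 500.0, 3.0, 0.9),
    Memory("fresh and relevant", 1.0, 3.0, 0.8),
    Memory("fresh but irrelevant", 1.0, 3.0, 0.1),
]
print(rank_memories(candidates)[0][1].text)  # fresh and relevant
```

With recency in the score, a slightly less relevant but recently touched memory outranks a stale one, which is the behavior a retrieval path without decay cannot express.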
Honest Gaps
Things we do not yet have, and want to be explicit about:
- No published QA-accuracy number. Adding a thin reader pass over Sibyl's retrieved sessions plus the official LongMemEval GPT-4o judge would let us publish a number on the same axis as Mem0, Mastra, OMEGA, Hindsight, and Zep. This is on the roadmap, not the benchmark we lead with.
- No public local-embedding variant. The full run uses OpenAI embeddings. A variant that replaces text-embedding-3-small with all-MiniLM-L6-v2 or BGE-M3 would be directly comparable to MemPalace raw and agentmemory's measurements.
- No LongMemEval-V2 number. The benchmark was published in May 2026; Sibyl has not been evaluated against it yet.
- No LOCOMO, BEAM, FiFA numbers. LongMemEval-S is one dataset. The field is broader.
- No published latency-cost trade-off curve. Search p95 is 1,115 ms in the full run; that's a working number but not yet contextualized against competitors' published latency-cost envelopes.
- No cross-encoder reranker. We chose interpretable ranking; that choice has a cost on strict-recall ranking quality.
- No principled forgetting. Storage grows monotonically; old facts compete with new ones in retrieval scoring.
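For the latency bullet above, a tail percentile such as p95 can be computed from raw per-query timings. This sketch uses the nearest-rank definition, which is an assumption: the page does not say which percentile method produced the 1,115 ms figure, and the sample values below are made up.

```python
from typing import Sequence


def percentile_nearest_rank(samples_ms: Sequence[float], pct: int) -> float:
    """Smallest sample such that at least pct percent of samples are <= it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


# Made-up timings: 19 fast queries and one slow outlier.
latencies_ms = [200.0] * 19 + [1500.0]
print(percentile_nearest_rank(latencies_ms, 50))   # 200.0
print(percentile_nearest_rank(latencies_ms, 95))   # 200.0
print(percentile_nearest_rank(latencies_ms, 100))  # 1500.0
```

A single percentile hides the shape of the tail, which is one reason a latency-cost curve is more informative than one p95 value.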
How To Read This Page
If you remember one thing: retrieval R@K and end-to-end QA accuracy are different axes, and most LongMemEval leaderboard tables mix them. Sibyl's 96.96% strict R@5 is in the retrieval lane. Numbers from Mem0, Mastra, OMEGA, Hindsight, Zep, ByteRover, RetainDB, Supermemory, or Emergence AI are QA-accuracy numbers, and they answer a different question.
The retrieval-lane comparison Sibyl can defend right now: Sibyl reaches the LongMemEval-S retrieval ceiling on a stricter metric (full multi-answer recall, not lenient hit-rate), on the live production API path, with per-question physical tenant isolation, and with no LLM in the retrieval or extraction path. The retrieval-lane systems with credible published numbers in the same neighborhood are MemPalace's honest 98.4% held-out, Memweave's 98.0% held-out, and agentmemory's 95.2%. All three are recall_any, single-tenant, offline notebook measurements.
That is the position. We are happy to be wrong about anything in this page if a reader brings a primary source that contradicts it.
Related
- LongMemEval Results — the headline eval and methodology
- Benchmark Methodology — the broader eval ladder, gates, reporting rules
- Retrieval System Architecture — how the eval-passing path actually works
