SurrealDB Graphiti Exit Benchmark Evidence
- Status: partial evidence, LongMemEval full offline artifacts present
- Last updated: 2026-05-13
- Related plan:
docs/architecture/SURREALDB_V07_GRAPHITI_EXIT_AND_PURE_SURREAL_PLAN.md
Answer
We do not yet have a saved apples-to-apples pre-Graphiti versus post-Graphiti benchmark pair.
We do have strong post-native context-pack evidence and earlier local compare-run artifacts, but those are not a Graphiti baseline. Do not use them to claim native Surreal is faster or better than Graphiti until the paired live runtime artifacts exist.
Current Artifacts
Passing Native/Compare Context-Pack Run
Latest artifact:
.moon/cache/evals/eval_context_pack_retrieval_compare_20260513_014405.json
Summary:
search_type:context-packretrieval_mode:comparerepeat_count:20cases:160passed:160failed:0pass_rate:1.0latency_ms:16.0latency_p95_ms:23.8source_metadata_coverage:1.0facet_order_match_rate:1.0leak_count:0.0
This supports the v0.7 default-loop quality gate for the current native/compare retrieval path.
The earlier passing artifact, .moon/cache/evals/eval_context_pack_retrieval_compare_20260512_225828.json, also passed 160/160 with 23.7 ms p95 latency. The 20260513_014405 artifact is the latest release-gate receipt.
Earlier Failed Local Compare Run
Artifact:
.moon/cache/evals/eval_context_pack_retrieval_compare_20260512_204628.json
Summary:
search_type:context-packcases:8passed:1failed:7pass_rate:0.125latency_ms:622.7latency_p95_ms:1187.0
This is useful debugging history. It is not a pre-Graphiti or legacy Graphiti baseline.
Offline LongMemEval-Style Full Artifacts
Artifacts:
benchmarks/results/ai-memory/manifest.jsonbenchmarks/results/ai-memory/longmemeval_sibyl_raw_20260513.jsonbenchmarks/results/ai-memory/longmemeval_sibyl_hybrid_20260513.json
These are full offline component artifacts generated with longmemeval-offline-v2. They include case_results for all 500 questions with question IDs, answer session IDs, ranked session IDs, and per-case metrics. They do not measure the live API path, default app startup, Graphiti construction, or post-native Surreal runtime behavior.
Source data came from the Hugging Face xiaowu0162/longmemeval-cleaned longmemeval_s_cleaned.json file, downloaded to .moon/cache/benchmarks/longmemeval_s_cleaned.json for the local run. The exact local wrapper used for both runs was:
uv run --with chromadb python benchmarks/longmemeval_bench.py \
.moon/cache/benchmarks/longmemeval_s_cleaned.json \
--mode raw \
--output benchmarks/results/ai-memory/longmemeval_sibyl_raw_20260513.json
uv run --with chromadb python benchmarks/longmemeval_bench.py \
.moon/cache/benchmarks/longmemeval_s_cleaned.json \
--mode hybrid \
--output benchmarks/results/ai-memory/longmemeval_sibyl_hybrid_20260513.jsonBoth artifacts passed:
moon run bench-gate -- \
benchmarks/results/ai-memory/longmemeval_sibyl_raw_20260513.json \
--profile ai-memory
moon run bench-gate -- \
benchmarks/results/ai-memory/longmemeval_sibyl_hybrid_20260513.json \
--profile ai-memoryOverall Results
| Mode | Questions | Elapsed | Recall@5 | NDCG@5 | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|---|
| raw | 500 | 312.3s | 0.966 | 0.888 | 0.982 | 0.889 |
| hybrid | 500 | 309.4s | 0.980 | 0.934 | 0.992 | 0.935 |
Results By Question Type
| Mode | Question Type | Recall@5 | NDCG@5 | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|
| raw | single-session-user | 0.914 | 0.828 | 0.971 | 0.847 |
| raw | multi-session | 0.992 | 0.919 | 1.000 | 0.914 |
| raw | single-session-preference | 0.967 | 0.833 | 0.967 | 0.833 |
| raw | temporal-reasoning | 0.947 | 0.839 | 0.970 | 0.839 |
| raw | knowledge-update | 1.000 | 0.946 | 1.000 | 0.944 |
| raw | single-session-assistant | 0.964 | 0.953 | 0.964 | 0.953 |
| hybrid | single-session-user | 0.986 | 0.957 | 1.000 | 0.962 |
| hybrid | multi-session | 0.985 | 0.947 | 1.000 | 0.945 |
| hybrid | single-session-preference | 0.900 | 0.786 | 0.967 | 0.807 |
| hybrid | temporal-reasoning | 0.985 | 0.908 | 0.992 | 0.904 |
| hybrid | knowledge-update | 1.000 | 0.980 | 1.000 | 0.980 |
| hybrid | single-session-assistant | 0.964 | 0.954 | 0.964 | 0.954 |
The hybrid offline mode improves overall Recall@5, NDCG@5, Recall@10, and NDCG@10 relative to raw mode. The one quality regression in this saved pair is the single-session-preference slice, where raw Recall@5 and NDCG remain higher. Runtime is effectively flat for this offline Chroma path, with raw at 312.3s and hybrid at 309.4s.
benchmarks/results/ai-memory/manifest.json is the machine-checkable release ledger for this section. Its citable rows must point at committed full artifacts and match their summary metrics; its planned rows are intentionally non-citable until raw artifacts exist.
Retrieval Component And Mini-Memory Gate
Command:
moon run core:bench-retrieval
Receipt:
25 passed in 0.42s
Coverage:
- Temporal boost correctness and throughput:
- recent entities rank higher when relevance is equal
- aggressive and gentle decay reorder results differently
- high relevance can beat recency
- entities without timestamps keep their original score
- minimum boost prevents zeroing old records
- per-type decay rates differ
1,000entities stay under the500 mstest budget10,000entities stay under the2,000 mstest budget
- Reciprocal Rank Fusion correctness and throughput:
- top shared results rank highest
- list weights influence final ranking
- disjoint lists merge successfully
3 x 100result fusion stays under the100 mstest budget3 x 1,000result fusion stays under the500 mstest budget
- Mini-LongMemEval-style retrieval:
- factual recall
- recent temporal recall
- entity-type filtering
- aggressive versus gentle temporal reshuffling
Recall@5 >= 0.60across known queries
- Reranking and hybrid-search wiring:
- reranking defaults and disabled passthrough
- graceful fallback when cross-encoder dependencies are unavailable
- content extraction
- hybrid config wiring
- temporal and non-temporal hybrid ordering
This is a component correctness and budget gate. It does not write a saved JSON benchmark artifact.
AI Memory Benchmark Coverage
The release evidence bundle should include full results for every AI memory benchmark or competitor comparison we cite. A one-line headline is not enough; each suite needs the raw result artifact, overall metrics, per-slice metrics, corpus or dataset version, command, commit, runtime mode, and known caveats.
Every public benchmark claim must be traceable to a committed or archived artifact that a future release audit can re-open without guessing the corpus, mode, or code revision. If a suite has no artifact, it belongs in planned coverage only.
Current coverage:
| Suite or comparison | Local artifact status | Current use |
|---|---|---|
| Context-pack frozen suite | Full current report in .moon/cache/evals/ | v0.7 default-loop quality gate |
| LongMemEval-style offline suite | Full raw and hybrid JSON results in benchmarks/results/ai-memory/; both pass ai-memory gate | Citable offline retrieval baseline |
| Live Graphiti vs native Surreal | Missing paired bench-live artifacts | Required before public pre/post claims |
| LOCOMO-style long-memory suite | No harness or result artifact found in this checkout | Future external-suite positioning evidence |
| RULER-style long-context suite | No harness or result artifact found in this checkout | Future long-context stress evidence |
| Mem0 comparison | No committed result artifact found in this checkout | Future competitor comparison, if cited |
| Zep comparison | No committed result artifact found in this checkout | Future competitor comparison, if cited |
| LangMem comparison | No committed result artifact found in this checkout | Future competitor comparison, if cited |
Until a suite has a raw artifact and full per-slice table, cite it only as planned coverage, not as evidence. If we decide to make benchmark claims in release notes, the missing rows above become release-blocking for that claim.
Full AI Memory Result Ledger
The table below is the release ledger for external AI memory benchmarks. Rows can move from planned to citable only after the raw artifact and the summarized table are both present in this document and benchmarks/results/ai-memory/manifest.json.
| Suite or comparison | Required artifact pattern | Current status | Required summary before citation |
|---|---|---|---|
| LongMemEval-style offline suite | benchmarks/results/ai-memory/longmemeval_sibyl_<mode>_<date>.json | citable | Overall, per-question-type metrics, and case_results for every question |
| LOCOMO-style long-memory suite | benchmarks/results/ai-memory/locomo_<engine>_<timestamp>.json | planned | Overall score, per-category score, abstention/error rate, latency |
| RULER-style long-context suite | benchmarks/results/ai-memory/ruler_<engine>_<timestamp>.json | planned | Task-level score, context length, failure modes, latency |
| Mem0 comparison | benchmarks/results/ai-memory/mem0_<engine>_<timestamp>.json | planned | Overall quality, per-slice quality, ingestion latency, query latency |
| Zep comparison | benchmarks/results/ai-memory/zep_<engine>_<timestamp>.json | planned | Overall quality, per-slice quality, ingestion latency, query latency |
| LangMem comparison | benchmarks/results/ai-memory/langmem_<engine>_<timestamp>.json | planned | Overall quality, per-slice quality, ingestion latency, query latency |
For citable rows, include both native Surreal and comparator artifacts when the claim is comparative. For non-comparative rows, include the native Surreal artifact and label the claim as Sibyl-only. The current LongMemEval files are the only citable AI-memory rows in this checkout. Every artifact must include enough metadata to reproduce the run without relying on local memory or chat history, and must pass:
moon run bench-gate -- <artifact>.json --profile ai-memoryExternal Benchmark Acceptance Checklist
Before any external AI memory result appears in release notes, README copy, launch notes, or a comparison table:
- store the raw artifact under
benchmarks/results/ai-memory/or another committed/archive path named in this document - record the exact suite source, dataset split, version or commit, and any preprocessing
- record the Sibyl commit, runtime mode, graph engine, store, embedding model, generation model if used, and tokenizer or context budget
- include overall metrics and the complete per-slice table, even when a slice regresses
- include ingestion time, query latency, timeout count, error count, and skipped-case count when the suite exposes them
- keep prompts, evaluators, and scoring scripts versioned or archived with the result
- add a claim boundary explaining what the result does and does not prove
- pass
moon run bench-gate -- <artifact>.json --profile ai-memory
This is deliberately stricter than a normal smoke benchmark. External memory benchmarks are positioning evidence, so they need the whole receipt.
Full Result Record Required Fields
Each cited suite needs a result record with:
- raw artifact path
- benchmark suite name and version
- corpus or dataset name and version
- Sibyl commit
- command
- runtime mode, including graph engine and store
- model and embedding configuration when the suite uses generated content or embeddings
- overall metrics
- per-slice metrics
- per-case result records with answer IDs, ranked result IDs, and case metrics
- elapsed time and latency metrics when available
- pass/fail gate and profile
- known caveats
For competitor comparisons, the record also needs competitor version, hosted versus self-hosted mode, data-ingestion path, and any tuning that changes retrieval quality or latency.
Release Notes Rule
Release notes can cite only the rows above that have full artifacts. For v0.7 today, that means the context-pack frozen suite and the offline LongMemEval-style raw/hybrid files are citable when their claim boundaries stay explicit. LOCOMO, RULER, Mem0, Zep, LangMem, and live Graphiti-vs-native claims stay out of release notes until their artifact rows exist.
Missing Live Runtime Artifacts
This checkout currently has no saved benchmarks/results/*.json live-runtime artifacts. That means there is no canonical bench-live legacy-vs-native pair to cite.
No committed full-result LOCOMO, Mem0, Zep, LangMem, RULER, or additional external AI-memory benchmark result files were found in this checkout. If we add those suites later, their raw result files must live next to the LongMemEval-style outputs, or in a named external archive manifest when they are too large for git, and be summarized here with the same artifact-first rule.
Claim Boundaries
Supported claims:
- Native/compare context-pack evaluation passes the frozen 20-run suite with
160/160cases and23.8 msp95 latency. - The default memory loop and selected entrypoints can run with
graphiti_coreimports blocked whenmoon run core:no-graphiti-smokeis green. - Offline LongMemEval-style full artifacts show raw and hybrid component retrieval metrics across 500 questions, including per-question answer IDs and ranked session IDs.
Unsupported claims until the missing pair exists:
- Native Surreal is faster than Graphiti.
- Native Surreal has better retrieval quality than Graphiti.
- The failed
20260512_204628context-pack artifact represents pre-Graphiti performance. - Offline LongMemEval-style files represent production runtime behavior.
Required Pre/Post Pair
To make a public pre/post Graphiti claim, capture two bench-live artifacts against the same corpus, queries, auth scope, and release candidate.
Legacy Graphiti runtime:
moon run bench-live -- \
--label graphiti-legacy \
--metadata store=legacy \
--metadata graph_engine=graphiti \
--metadata corpus=<corpus-name>Native Surreal runtime:
moon run bench-live -- \
--label surreal-native \
--metadata store=surreal \
--metadata graph_engine=native_surreal \
--metadata corpus=<corpus-name>Gate and compare:
moon run bench-gate -- benchmarks/results/<legacy-artifact>.json \
--profile acceptance \
--require-metadata graph_engine=graphiti
moon run bench-gate -- benchmarks/results/<surreal-artifact>.json \
--profile acceptance \
--require-metadata graph_engine=native_surreal
uv run python benchmarks/compare_eval_reports.py \
benchmarks/results/<legacy-artifact>.json \
benchmarks/results/<surreal-artifact>.jsonKeep the raw artifact paths in release notes or PR notes whenever citing any number.
Release Rule
v0.7 can ship on default-loop no-Graphiti proof plus the required context-pack gate. The refreshed LongMemEval-style artifacts can be cited as offline component retrieval evidence. Any release note, README, or announcement that compares native Surreal against Graphiti must wait for the paired bench-live artifacts above.
