LongMemEval-V2
LongMemEval-V2 is not the same shape as LongMemEval-S. V1 is a retrieval benchmark for finding the right memory item. V2 is an official memory-system harness: the memory backend ingests web-agent trajectories, returns compact context for a question, a fixed reader model answers, and the official scorers grade the answer.
Sibyl's V2 path therefore uses the official Memory interface instead of a benchmark-only oracle. The adapter writes trajectories through the live Sibyl API and queries /api/search; it strips the gold answer from official query context before backend code can read it, and it never sees gold trajectory IDs.
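The answer-stripping step can be pictured with a small sketch. The key names (`answer`, `gold_answer`, `gold_trajectory_ids`) and the `sanitize_query_context` helper are hypothetical and chosen only for illustration; the official query context and the real adapter may use different names.

```python
from typing import Any, Mapping

# Hypothetical key names; the official query context may label these differently.
GOLD_KEYS = frozenset({"answer", "gold_answer", "gold_trajectory_ids"})


def sanitize_query_context(context: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of the query context without gold-label fields."""
    return {key: value for key, value in context.items() if key not in GOLD_KEYS}


raw = {"question_id": "q-1", "question": "Which form was submitted?", "answer": "the expense form"}
safe = sanitize_query_context(raw)
assert "answer" not in safe
assert raw["answer"] == "the expense form"  # the original mapping is left untouched
```

Filtering into a new dictionary, rather than deleting keys in place, keeps the harness's own copy intact for scoring while the backend only ever receives the sanitized one.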
Current Commands
Download the text-context dataset slice:
moon run bench-longmemeval-v2-download -- \
--data-root .moon/cache/benchmarks/longmemeval-v2-full
Add --include-trajectory-screenshots only when testing a memory backend that returns image context items.
Fast metadata check:
moon run bench-longmemeval-v2-probe -- \
/path/to/longmemeval-v2 \
--tier medium \
--validate-trajectories
Plan an official run without model calls:
moon run bench-longmemeval-v2-official -- \
--data-root /path/to/longmemeval-v2 \
--domain enterprise \
--tier small \
--output-dir runs/sibyl_enterprise_small \
--plan-only \
--allow-localhost
Run one official domain with the official runtime dependencies:
moon run bench-longmemeval-v2-official-full -- \
--official-repo /path/to/LongMemEval-V2 \
--data-root /path/to/longmemeval-v2 \
--domain enterprise \
--tier small \
--output-dir runs/sibyl_enterprise_small \
--api-url http://127.0.0.1:3334/api \
--allow-localhost \
--reader-base-url http://localhost:8023/v1 \
--reader-model Qwen/Qwen3.5-9B \
--evaluator-model gpt-5.2
Test live Sibyl ingestion without reader or evaluator model calls:
moon run bench-longmemeval-v2-official-full -- \
--official-repo /path/to/LongMemEval-V2 \
--data-root .moon/cache/benchmarks/longmemeval-v2-full \
--domain enterprise \
--tier small \
--output-dir runs/sibyl_enterprise_ingest_1 \
--limit 1 \
--allow-localhost \
--save-memory \
--skip-evaluation
A leaderboard-valid operating point needs both domains at the same tier and method:
moon run bench-longmemeval-v2-official-full -- ... --domain enterprise --tier small
moon run bench-longmemeval-v2-official-full -- ... --domain web --tier small
python /path/to/LongMemEval-V2/leaderboard/combine_aggregated_metrics.py \
runs/sibyl_enterprise_small/aggregated_metrics.json \
runs/sibyl_web_small/aggregated_metrics.json \
-o runs/sibyl_small_combined_metrics.json
Honest-Run Requirements
- Official LongMemEval-V2 checkout available through --official-repo.
- Full dataset prepared with questions.jsonl, haystacks/lme_v2_<tier>.json, trajectories.jsonl, and screenshots if image evidence is enabled.
- Live disposable Sibyl API stack. The adapter mutates the target through /entities and /search.
- Reader model endpoint, normally Qwen/Qwen3.5-9B.
- Evaluator key/model for LLM-graded categories, normally gpt-5.2.
- Same method and tier for web and enterprise before combining metrics.
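A quick pre-flight check on the dataset files listed above can save a failed run. The sketch below assumes the three files sit at those relative paths directly under the data root; `missing_dataset_files` is a hypothetical helper, not part of the official harness or the Sibyl adapter.

```python
from pathlib import Path


def missing_dataset_files(data_root: Path, tier: str) -> list[str]:
    """List the required dataset files that are absent under data_root."""
    required = [
        "questions.jsonl",
        f"haystacks/lme_v2_{tier}.json",
        "trajectories.jsonl",
    ]
    return [name for name in required if not (data_root / name).is_file()]


# Example call; the path is the same placeholder used in the commands above.
missing = missing_dataset_files(Path("/path/to/longmemeval-v2"), "small")
if missing:
    print("missing:", ", ".join(missing))
```

Screenshots are left out of the check because they are only needed when image evidence is enabled.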
Adapter Contract
benchmarks/longmemeval_v2_memory/sibyl_memory.py registers sibyl_live_api with the official harness.
For each memory instance it:
- Authenticates once and reuses the token inside the process.
- Creates an isolated Sibyl project unless --project-id is supplied.
- Converts each trajectory into state-aware session chunks.
- Writes chunks with POST /api/entities/bulk.
- Searches only that project with POST /api/search.
- Returns text context items to the official reader.
The project boundary is the V2 equivalent of the V1 per-question tenant boundary. It avoids cross-question leakage without relying on repeated local signups, which would fight the local-first single-user default.
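The write-then-search steps and the project boundary can be sketched as follows. This is an illustration under stated assumptions, not the code in sibyl_memory.py: the request and response shapes (`project_id`, `entities`, `query`, `limit`, `results`, `content`), the `ProjectScopedMemory` class, and the in-memory `make_fake_api` stand-in are all hypothetical. Only the two endpoint paths come from the contract above.

```python
from dataclasses import dataclass
from typing import Any, Callable

# post(path, json_body) -> decoded JSON response. Injected so the sketch
# runs without a live Sibyl API; a real adapter would wrap an HTTP client.
Post = Callable[[str, dict[str, Any]], dict[str, Any]]


@dataclass
class ProjectScopedMemory:
    """Writes and searches inside exactly one project."""

    post: Post
    project_id: str

    def ingest(self, chunks: list[dict[str, Any]]) -> None:
        # Request body shape is an assumption, not the documented API schema.
        self.post("/api/entities/bulk", {"project_id": self.project_id, "entities": chunks})

    def search(self, question: str, limit: int = 5) -> list[str]:
        body = {"project_id": self.project_id, "query": question, "limit": limit}
        response = self.post("/api/search", body)
        return [hit["content"] for hit in response.get("results", [])]


def make_fake_api() -> Post:
    """In-memory stand-in for the API that keeps entities per project."""
    store: dict[str, list[dict[str, Any]]] = {}

    def post(path: str, body: dict[str, Any]) -> dict[str, Any]:
        if path == "/api/entities/bulk":
            store.setdefault(body["project_id"], []).extend(body["entities"])
            return {"created": len(body["entities"])}
        if path == "/api/search":
            words = body["query"].lower().split()
            hits = [
                entity
                for entity in store.get(body["project_id"], [])
                if any(word in entity["content"].lower() for word in words)
            ]
            return {"results": hits[: body["limit"]]}
        raise ValueError(f"unexpected path: {path}")

    return post


api = make_fake_api()
question_a = ProjectScopedMemory(api, project_id="question-a")
question_b = ProjectScopedMemory(api, project_id="question-b")
question_a.ingest([{"type": "session", "content": "Opened the invoice form and submitted it."}])

assert question_a.search("invoice") == ["Opened the invoice form and submitted it."]
assert question_b.search("invoice") == []  # nothing leaks across projects
```

Binding the project ID at construction time means no call site can forget to scope a search, which is the property the project boundary is meant to guarantee.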
Claim Boundary
The current V2 path proves we can run Sibyl inside the official full-suite contract. It does not count as a published V2 score until both domains complete with the official reader and evaluator.
Known limits:
- The adapter is text-context only today. It preserves screenshot references in text when requested, but does not yet return image context items.
- Medium haystacks can approach 500 trajectories per question; this is intentionally a stress test of ingestion backpressure and search isolation.
- The official harness loads trajectories into memory. Large runs should use a machine sized for the dataset and model endpoints.
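One way to keep bulk writes bounded when a haystack approaches 500 trajectories is to stream chunks in fixed-size batches instead of building one large request. The sketch below is an assumption about how that could look, not the adapter's actual code; `iter_batches`, the batch size of 200, and the figure of 12 chunks per trajectory are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of at most `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# 500 trajectories with a hypothetical 12 chunks each gives 6000 chunks.
chunks = (f"chunk-{i}" for i in range(500 * 12))
batch_sizes = [len(batch) for batch in iter_batches(chunks, 200)]
assert len(batch_sizes) == 30
assert all(n == 200 for n in batch_sizes)
```

Because the helper consumes a generator lazily, only one batch of chunks is held at a time, which matters when the harness already keeps the trajectories themselves in memory.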
