Benchmarking and Evaluation
Benchmark memory quality with repeatable datasets, fixed configurations, and a clear split between preliminary signal and publishable scorecards.
Current benchmark status
We have not yet published a full benchmark campaign across all target tasks, models, and ablation settings. What we have today are preliminary harness runs and archived evaluator outputs, which already give a directional read on retrieval quality and runtime behavior.
The purpose of this page is to show what is already instrumented, what the early LoCoMo recall run looks like, and what still needs to happen before we call the numbers final.
Important
Treat the results below as preliminary engineering signals, not a finalized benchmark report. Publication-grade LongMemEval runs, ablations, and model-normalized comparisons are still in progress.
Benchmark Phase
Preliminary
Harness is live; full campaign is still pending.
Published Dataset Signal
LoCoMo
Current docs publish the cleanest retrieval summary from archived logs.
Archived Artifacts
2 Logs
`locomo_output.txt` and `longmemeval_output.txt` are retained in the repo.
Next Milestone
Full LongMemEval
Run, validate, and publish the cleaned scorecard plus ablations.
What is already in the repo
Two benchmark artifacts are currently tracked in the workspace and referenced internally when validating the evaluator pipeline.
The LoCoMo artifact already contains a clean recall summary. The LongMemEval artifact is preserved as a working benchmark log, but we are intentionally not publishing a polished LongMemEval scorecard from it yet.
locomo_output.txt
Preliminary LoCoMo recall run log with full progress output, timing breakdowns, and the final Recall@8 summary.
longmemeval_output.txt
Archived preliminary benchmark output kept in the repo while the final LongMemEval evaluation flow is cleaned up and rerun end to end.
Preliminary LoCoMo snapshot
The strongest concrete benchmark signal we are comfortable surfacing today comes from the LoCoMo recall harness. This run focuses on whether the correct evidence session appears in the retrieved Top-8, which is a useful proxy for whether the memory engine is bringing the right context back into scope before any downstream answer-generation layer gets involved.
That distinction matters. We want to isolate memory retrieval quality first, then layer answer quality and judge-model evaluation on top. Otherwise, retrieval regressions and generation regressions get mixed together.
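The retrieval-only metric described above can be sketched as follows. This is a minimal illustration, not the evaluator's code: the `Question` struct, its field names, and the rule that a question counts as a hit when any one gold evidence session appears in the Top-K are assumptions made for the example.

```rust
use std::collections::HashSet;

/// One evaluated question. The struct and field names are hypothetical.
struct Question {
    /// Session ids that contain the gold evidence.
    evidence_sessions: Vec<String>,
    /// Session ids returned by retrieval, best first.
    retrieved_sessions: Vec<String>,
}

/// Assumed hit rule: at least one evidence session is in the top `k` results.
fn is_hit(q: &Question, k: usize) -> bool {
    let top_k: HashSet<&str> = q
        .retrieved_sessions
        .iter()
        .take(k)
        .map(String::as_str)
        .collect();
    q.evidence_sessions.iter().any(|s| top_k.contains(s.as_str()))
}

/// Recall@k as the fraction of questions with a hit; `None` for an empty set.
fn recall_at_k(questions: &[Question], k: usize) -> Option<f64> {
    if questions.is_empty() {
        return None;
    }
    let hits = questions.iter().filter(|q| is_hit(q, k)).count();
    Some(hits as f64 / questions.len() as f64)
}

fn main() {
    let questions = vec![
        Question {
            evidence_sessions: vec!["s3".to_string()],
            retrieved_sessions: vec!["s1".to_string(), "s3".to_string()],
        },
        Question {
            evidence_sessions: vec!["s9".to_string()],
            retrieved_sessions: vec!["s2".to_string(), "s4".to_string()],
        },
    ];
    // One of the two questions has its evidence session in the Top-8.
    println!("{:?}", recall_at_k(&questions, 8)); // prints Some(0.5)
}
```

Because the function only looks at session ids, no answer text or judge model is involved, which is what keeps retrieval regressions separate from generation regressions.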
Overall Recall@8
94.7%
Evidence session found in the Top-8 for 94.7% of the 1538 evaluated questions.
Questions Evaluated
1538
No skipped questions in the archived run.
Avg Query Time
65 ms
Average end-to-end query timing reported by the harness.
Avg Total Time
83 ms
Includes ingest/query/pack timing reported per evaluation cycle.
Avg Batch Ingest
407 ms
Average ingest batch time during the initial session indexing phase.
Top-K Window
8 Sessions
Run used `top-k=8` and `max-chunks-per-session=4`.
| Metric | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|---|---|---|---|---|---|
| Recall@8 | 90.8% | 80.4% | 97.9% | 93.8% | 94.7% |
| Question Count | 282 | 92 | 841 | 321 | 1538 |
This table is taken from the archived LoCoMo recall log and represents retrieval-stage evidence recovery, not a final answer-generation leaderboard.
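As a quick consistency check, the overall figure can be recomputed as a count-weighted mean of the per-category rows. The sketch below copies its numbers from the table; the constant and function names are illustrative only.

```rust
/// (category, Recall@8 in percent, question count), copied from the table.
const SLICES: [(&str, f64, u32); 4] = [
    ("Single-Hop", 90.8, 282),
    ("Multi-Hop", 80.4, 92),
    ("Open Domain", 97.9, 841),
    ("Temporal", 93.8, 321),
];

/// Count-weighted mean of per-slice recall, in percent.
/// Expects at least one slice with a non-zero count.
fn weighted_recall(slices: &[(&str, f64, u32)]) -> f64 {
    let total: u32 = slices.iter().map(|s| s.2).sum();
    let weighted: f64 = slices.iter().map(|s| s.1 * f64::from(s.2)).sum();
    weighted / f64::from(total)
}

fn main() {
    let total: u32 = SLICES.iter().map(|s| s.2).sum();
    println!("questions in slices: {}", total); // prints 1536
    println!("weighted Recall@8: {:.1}%", weighted_recall(&SLICES)); // prints 94.7%
}
```

The weighted mean rounds to the reported 94.7%. The per-category counts as tabulated sum to 1536, so the sketch weights by that sum rather than by the 1538 total.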
How to read these numbers
The LoCoMo result is promising because it shows that the right session is usually recovered even in a long-running conversation setting. The open-domain slice is the strongest in this early run at 97.9%, and the temporal slice sits close to the overall figure at 93.8%. That is consistent with the hybrid lexical-plus-semantic stack contributing beyond plain vector search, although the lexical-only and semantic-only ablations are still needed to confirm it.
The multi-hop slice is the current pressure point. That is not surprising: once the right evidence is split across multiple linked conversational fragments, raw retrieval has to do more than recover a single relevant session. This is exactly where graph lineage, fact companions, temporal linking, and deterministic aggregation become more important.
The timing split is also useful. Of the roughly 64 ms average engine total, hydration takes about 39 ms and embedding about 11 ms, while ANN search takes about 2 ms. That points optimization effort toward payload hydration, result packing, and indexing layout rather than ANN micro-tuning alone.
- Average engine query stages in the archived run were roughly: embed 11 ms, ANN 2 ms, FTS 4 ms, hydrate 39 ms, total 64 ms.
- Reranking was disabled in this run, which means the current score reflects the base hybrid retrieval pipeline without cross-encoder rescue.
- Semantic dedup and consolidation were also disabled, so there is still headroom for future ablation work.
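To make the stage breakdown concrete, the sketch below turns the approximate per-stage averages listed above into shares of the reported total. The numbers come from the archived run as quoted on this page; the function names are illustrative.

```rust
/// Approximate per-stage averages in milliseconds, as quoted above.
const STAGES: [(&str, u32); 4] = [("embed", 11), ("ann", 2), ("fts", 4), ("hydrate", 39)];
/// Approximate reported engine total in milliseconds.
const TOTAL_MS: u32 = 64;

/// Share of the reported total taken by one stage, in percent.
fn share_percent(stage_ms: u32, total_ms: u32) -> f64 {
    100.0 * f64::from(stage_ms) / f64::from(total_ms)
}

/// Time in the reported total that the listed stages do not account for.
fn unattributed_ms(stages: &[(&str, u32)], total_ms: u32) -> u32 {
    let listed: u32 = stages.iter().map(|s| s.1).sum();
    total_ms.saturating_sub(listed)
}

fn main() {
    for (name, ms) in STAGES.iter() {
        println!("{:8} {:3} ms {:5.1}%", name, ms, share_percent(*ms, TOTAL_MS));
    }
    println!("unattributed: {} ms", unattributed_ms(&STAGES, TOTAL_MS)); // prints 8 ms
}
```

With these rounded inputs, hydration is about 61% of the engine total and ANN search about 3%. The four listed stages sum to 56 ms, so roughly 8 ms of the total is not itemized in the summary.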
Run configuration behind the preliminary LoCoMo result
The archived LoCoMo run used the Rust evaluator directly against the engine with an explicit retrieval-only configuration. That is useful because it minimizes ambiguity about where latency and quality are coming from.
For benchmark reproducibility, keep the full invocation with the result artifact. Small flags such as rerank on/off, start index, chunk caps, or ingest concurrency can materially change both runtime and quality.
LoCoMo command:

    cargo run --release --manifest-path benchmarks/rust_evaluator/Cargo.toml -- \
      --dataset-kind locomo \
      --dataset benchmarks/LoCoMo/data/locomo10.json \
      --engine-url http://127.0.0.1:3000 \
      --engine-api-key XXX1111AAA \
      --reset-first \
      --limit 99999 \
      --ingest-concurrency 16 \
      --top-k 8 \
      --max-chunks-per-session 4 recall

LoCoMo run settings:

    dataset: LoCoMo
    questions: 1538
    top-k sessions: 8
    max chunks per session: 4
    neural rerank: false
    semantic dedup: false
    consolidation: false
    reset first: true

LongMemEval command:

    cargo run --release --manifest-path benchmarks/rust_evaluator/Cargo.toml -- \
      --dataset-kind longmemeval \
      --dataset benchmarks/LongMemEval/data/longmemeval_s_cleaned.json \
      --engine-url http://127.0.0.1:3000 \
      --engine-api-key XXX1111AAA \
      --reset-first \
      --ingest-concurrency 1 \
      --top-k 8 \
      --max-chunks-per-session 4 recall

What still needs to happen before we call benchmarking complete
A proper benchmark page for Aletheia should not stop at one preliminary retrieval run. We still need a full matrix across datasets, retrieval settings, optional reranking, answer-generation layers, and judge-model evaluation so the results are defensible outside the repo.
In practice, that means LoCoMo is only the first published checkpoint. LongMemEval, ablations, and answer-quality scoring are the next layer.
- Rerun LongMemEval end to end with the cleaned dataset and publish the final summary separately from working logs.
- Add retrieval ablations for rerank on/off, lexical-only, semantic-only, and hybrid fusion.
- Report answer-quality metrics alongside retrieval metrics so improvements can be tied to end-user outcome quality.
- Track warm versus cold runs, model versions, and hardware profile so latency claims remain reproducible.
- Keep failed examples and inspect them manually because aggregate percentages hide the most valuable failure modes.
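The retrieval ablations in the list above form a small grid. One way to enumerate it is sketched here. The type and variant names are hypothetical and do not correspond to existing evaluator flags; they only show how many runs the grid implies.

```rust
/// Retrieval modes named in the ablation list. Variant names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RetrievalMode {
    LexicalOnly,
    SemanticOnly,
    HybridFusion,
}

/// One cell of the ablation grid: a retrieval mode with rerank on or off.
#[derive(Debug, Clone, Copy, PartialEq)]
struct AblationCase {
    mode: RetrievalMode,
    rerank: bool,
}

/// Every combination of retrieval mode and rerank setting.
fn ablation_matrix() -> Vec<AblationCase> {
    let modes = [
        RetrievalMode::LexicalOnly,
        RetrievalMode::SemanticOnly,
        RetrievalMode::HybridFusion,
    ];
    let mut cases = Vec::new();
    for &mode in modes.iter() {
        for &rerank in [false, true].iter() {
            cases.push(AblationCase { mode, rerank });
        }
    }
    cases
}

fn main() {
    for case in ablation_matrix() {
        println!("{:?}", case);
    }
}
```

The grid has six cells per dataset. The preliminary LoCoMo run corresponds to a single cell, hybrid fusion with rerank off, so five more runs are needed before the LoCoMo comparison is complete.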
Benchmark principles we follow
- Use fixed datasets and deterministic start indices whenever possible.
- Record model versions, engine config, and policy version with every run.
- Separate warm and cold measurements so cache effects do not blur real latency.
- Keep ingest and query concurrency explicit in the command line and in the published report.
- Publish both retrieval-stage metrics and downstream answer metrics instead of treating them as interchangeable.
- Retain raw artifacts so regressions can be audited, not just summarized.
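These principles can be enforced with a small run record stored next to each raw artifact. The sketch below is one possible shape, not an existing part of the evaluator: `RunManifest`, `CacheState`, and every field name are assumptions, and the angle-bracket values are placeholders rather than real versions.

```rust
/// Whether the run started from a cold or warm cache. Hypothetical type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum CacheState {
    Cold,
    Warm,
}

/// Hypothetical per-run record kept alongside the raw benchmark artifact.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct RunManifest {
    dataset: String,
    command_line: String,
    engine_config: String,
    model_versions: Vec<String>,
    policy_version: String,
    ingest_concurrency: u32,
    query_concurrency: u32,
    cache_state: CacheState,
}

impl RunManifest {
    /// Names of required fields that are still empty.
    fn missing_fields(&self) -> Vec<&'static str> {
        let required = [
            ("dataset", &self.dataset),
            ("command_line", &self.command_line),
            ("engine_config", &self.engine_config),
            ("policy_version", &self.policy_version),
        ];
        let mut missing = Vec::new();
        for (name, value) in required.iter() {
            if value.trim().is_empty() {
                missing.push(*name);
            }
        }
        if self.model_versions.is_empty() {
            missing.push("model_versions");
        }
        missing
    }
}

fn main() {
    let manifest = RunManifest {
        dataset: "benchmarks/LoCoMo/data/locomo10.json".to_string(),
        command_line: "<full evaluator invocation>".to_string(),
        engine_config: "<engine config snapshot>".to_string(),
        model_versions: vec!["<embedding model version>".to_string()],
        policy_version: String::new(), // left empty to show the check
        ingest_concurrency: 16,
        query_concurrency: 1,          // placeholder value for illustration
        cache_state: CacheState::Cold, // assumed for illustration
    };
    println!("{:?}", manifest.missing_fields()); // prints ["policy_version"]
}
```

Refusing to publish a scorecard while `missing_fields` is non-empty is one way to keep every reported number tied to the configuration that produced it.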