Benchmarking and Evaluation

Benchmark memory quality with repeatable datasets, fixed configurations, and a clear split between preliminary signal and publishable scorecards.

Current benchmark status

We have not yet published a full benchmark campaign across all target tasks, models, and ablation settings. What we have today are preliminary harness runs and archived evaluator outputs that give a directional read on retrieval quality and runtime behavior.

The purpose of this page is to show what is already instrumented, what the early LoCoMo recall run looks like, and what still needs to happen before we call the numbers final.

Important

Treat the results below as preliminary engineering signals, not a finalized benchmark report. Publication-grade LongMemEval runs, ablations, and model-normalized comparisons are still in progress.

Benchmark Phase

Preliminary

Harness is live; full campaign is still pending.

Published Dataset Signal

LoCoMo

Current docs publish the cleanest retrieval summary from archived logs.

Archived Artifacts

2 Logs

`locomo_output.txt` and `longmemeval_output.txt` are retained in the repo.

Next Milestone

Full LongMemEval

Run, validate, and publish the cleaned scorecard plus ablations.

What is already in the repo

Two benchmark artifacts are currently tracked in the workspace and referenced internally when validating the evaluator pipeline.

The LoCoMo artifact already contains a clean recall summary. The LongMemEval artifact is preserved as a working benchmark log, but we are intentionally not publishing a polished LongMemEval scorecard from it yet.

Artifact

locomo_output.txt

Preliminary LoCoMo recall run log with full progress output, timing breakdowns, and the final Recall@8 summary.

Current best directional retrieval read for docs publication.

Artifact

longmemeval_output.txt

Archived preliminary benchmark output kept in the repo while the final LongMemEval evaluation flow is cleaned up and rerun end to end.

Tracked as a working artifact, not yet a publication-ready benchmark table.

Preliminary LoCoMo snapshot

The strongest concrete benchmark signal we are comfortable surfacing today comes from the LoCoMo recall harness. This run focuses on whether the correct evidence session appears in the retrieved Top-8, which is a useful proxy for whether the memory engine is bringing the right context back into scope before any downstream answer-generation layer gets involved.

That distinction matters. We want to isolate memory retrieval quality first, then layer answer quality and judge-model evaluation on top. Otherwise, retrieval regressions and generation regressions get mixed together.
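
As a rough illustration of the retrieval-only metric described above, here is a minimal Python sketch of session-level Recall@K. It assumes a question counts as a hit when any of its evidence sessions appears in the top K retrieved session IDs; the function names, the tuple layout, and the session IDs are hypothetical and are not taken from the Rust evaluator.

```python
from __future__ import annotations

from typing import Iterable, Sequence


def hit_at_k(retrieved: Sequence[str], evidence: Iterable[str], k: int = 8) -> bool:
    """True if any evidence session ID appears in the first k retrieved IDs."""
    top_k = set(retrieved[:k])
    return any(session_id in top_k for session_id in evidence)


def recall_at_k(results: Sequence[tuple[Sequence[str], Sequence[str]]], k: int = 8) -> float:
    """Fraction of questions with a hit in the top k; 0.0 for an empty run."""
    if not results:
        return 0.0
    hits = sum(hit_at_k(retrieved, evidence, k) for retrieved, evidence in results)
    return hits / len(results)


# Hypothetical (retrieved ranking, evidence sessions) pairs, for illustration only.
example = [
    (["s3", "s7", "s1"], ["s7"]),        # hit: s7 is ranked second
    (["s2", "s4", "s9"], ["s5"]),        # miss: s5 was never retrieved
    (["s8", "s6", "s5", "s0"], ["s0"]),  # hit at k=8, miss at k=3
]
print(recall_at_k(example, k=8))
print(recall_at_k(example, k=3))
```

Because no answer-generation step is involved, a change in this number can only come from the retrieval side, which is the isolation the paragraph above argues for.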

Overall Recall@8

94.7%

Evidence session found in the Top-8 for 94.7% of the 1538 evaluated questions.

Questions Evaluated

1538

No skipped questions in the archived run.

Avg Query Time

65 ms

Average end-to-end query timing reported by the harness.

Avg Total Time

83 ms

Includes ingest/query/pack timing reported per evaluation cycle.

Avg Batch Ingest

407 ms

Average ingest batch time during the initial session indexing phase.

Top-K Window

8 Sessions

Run used `top-k=8` and `max-chunks-per-session=4`.

| Metric | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
| --- | --- | --- | --- | --- | --- |
| Recall@8 | 90.8% | 80.4% | 97.9% | 93.8% | 94.7% |
| Question Count | 282 | 92 | 841 | 321 | 1538 |

This table is taken from the archived LoCoMo recall log and represents retrieval-stage evidence recovery, not a final answer-generation leaderboard.
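
One way to sanity-check the table is to recompute the overall figure as a question-weighted average of the per-category values. The sketch below uses only the numbers printed in the table; the dictionary layout is an assumption made for illustration, and the rounded percentages limit how precise the result can be.

```python
# Per-category values copied from the table: (Recall@8 in percent, question count).
slices = {
    "Single-Hop": (90.8, 282),
    "Multi-Hop": (80.4, 92),
    "Open Domain": (97.9, 841),
    "Temporal": (93.8, 321),
}

total_questions = sum(count for _, count in slices.values())
weighted_recall = (
    sum(recall * count for recall, count in slices.values()) / total_questions
)

print(total_questions)            # 1536
print(round(weighted_recall, 1))  # 94.7
```

The weighted average agrees with the reported 94.7% overall. The four category counts sum to 1536, two fewer than the 1538 questions reported overall; this page does not explain the difference, so it is worth confirming against the archived log before publication.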

How to read these numbers

The LoCoMo result is promising because it shows that the right session is usually recovered even in a long-running conversation setting. The open-domain (97.9%) and temporal (93.8%) slices are especially strong in this early run, which suggests the hybrid lexical-plus-semantic stack is contributing beyond naive vector search.

The multi-hop slice is the current pressure point. That is not surprising: once the right evidence is split across multiple linked conversational fragments, raw retrieval has to do more than recover a single relevant session. This is exactly where graph lineage, fact companions, temporal linking, and deterministic aggregation become more important.

The timing split is also useful. Average engine-stage totals show that payload hydration and embedding account for most of the query time, while ANN search is only a small fraction of it. That points optimization effort toward payload hydration, result packing, and indexing layout rather than only ANN micro-tuning.

Average engine query stages in the archived run were roughly: embed 11 ms, ANN 2 ms, FTS 4 ms, hydrate 39 ms, total 64 ms.
Reranking was disabled in this run, which means the current score reflects the base hybrid retrieval pipeline without cross-encoder rescue.
Semantic dedup and consolidation were also disabled, so there is still headroom for future ablation work.
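
To make the stage split concrete, the following sketch turns the rounded averages quoted above into shares of the total. The stage values come from this page; the `unattributed` bucket is an assumption introduced here to hold whatever part of the total is not itemized in the summary.

```python
# Rounded per-stage averages quoted above, in milliseconds.
stages_ms = {"embed": 11, "ann": 2, "fts": 4, "hydrate": 39}
total_ms = 64

# Time in the reported total that no listed stage accounts for.
unattributed_ms = total_ms - sum(stages_ms.values())

shares = {name: ms / total_ms for name, ms in stages_ms.items()}
shares["unattributed"] = unattributed_ms / total_ms

# Print stages from largest to smallest share of the total.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>12}: {share:6.1%}")
```

On these figures hydration is roughly 61% of the total and ANN search about 3%. The listed stages add up to 56 ms, so about 8 ms of the 64 ms total is not itemized here.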

Run configuration behind the preliminary LoCoMo result

The archived LoCoMo run used the Rust evaluator directly against the engine with an explicit retrieval-only configuration. That is useful because it minimizes ambiguity about where latency and quality are coming from.

For benchmark reproducibility, keep the full invocation with the result artifact. Small flags such as rerank on/off, start index, chunk caps, or ingest concurrency can materially change both runtime and quality.

Rust evaluator invocation

cargo run --release --manifest-path benchmarks/rust_evaluator/Cargo.toml -- \
  --dataset-kind locomo \
  --dataset benchmarks/LoCoMo/data/locomo10.json \
  --engine-url http://127.0.0.1:3000 \
  --engine-api-key XXX1111AAA \
  --reset-first \
  --limit 99999 \
  --ingest-concurrency 16 \
  --top-k 8 \
  --max-chunks-per-session 4 recall

Archived run characteristics

dataset: LoCoMo
questions: 1538
top-k sessions: 8
max chunks per session: 4
neural rerank: false
semantic dedup: false
consolidation: false
reset first: true
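
One possible way to keep this configuration attached to the result artifact is to store it as a small machine-readable manifest. The sketch below parses the `key: value` lines above into typed values and prints them as JSON; the manifest layout and the helper names are hypothetical, and nothing here is emitted or consumed by the Rust evaluator.

```python
import json

# The run characteristics listed above, copied verbatim.
RUN_CHARACTERISTICS = """\
dataset: LoCoMo
questions: 1538
top-k sessions: 8
max chunks per session: 4
neural rerank: false
semantic dedup: false
consolidation: false
reset first: true
"""


def parse_value(raw: str):
    """Convert 'true'/'false' to bool and digit strings to int; keep other text."""
    if raw in ("true", "false"):
        return raw == "true"
    return int(raw) if raw.isdigit() else raw


def parse_characteristics(text: str) -> dict:
    """Parse 'key: value' lines into a dict with snake_case keys."""
    config = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        normalized = key.strip().replace(" ", "_").replace("-", "_")
        config[normalized] = parse_value(value.strip())
    return config


# Hypothetical manifest layout: artifact name plus the parsed configuration.
manifest = {
    "artifact": "locomo_output.txt",
    "config": parse_characteristics(RUN_CHARACTERISTICS),
}
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing typed values rather than free text makes it easier to diff two runs and spot a changed flag such as rerank or chunk cap.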

LongMemEval harness entry point

cargo run --release --manifest-path benchmarks/rust_evaluator/Cargo.toml -- \
  --dataset-kind longmemeval \
  --dataset benchmarks/LongMemEval/data/longmemeval_s_cleaned.json \
  --engine-url http://127.0.0.1:3000 \
  --engine-api-key XXX1111AAA \
  --reset-first \
  --ingest-concurrency 1 \
  --top-k 8 \
  --max-chunks-per-session 4 recall

What still needs to happen before we call benchmarking complete

A proper benchmark page for Aletheia should not stop at one preliminary retrieval run. We still need a full matrix across datasets, retrieval settings, optional reranking, answer-generation layers, and judge-model evaluation so the results are defensible outside the repo.

In practice, that means LoCoMo is only the first published checkpoint. LongMemEval, ablations, and answer-quality scoring are the next layer.

Rerun LongMemEval end to end with the cleaned dataset and publish the final summary separately from working logs.
Add retrieval ablations for rerank on/off, lexical-only, semantic-only, and hybrid fusion.
Report answer-quality metrics alongside retrieval metrics so improvements can be tied to end-user outcome quality.
Track warm versus cold runs, model versions, and hardware profile so latency claims remain reproducible.
Keep failed examples and inspect them manually because aggregate percentages hide the most valuable failure modes.
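
The retrieval ablations listed above can be enumerated up front so that no combination is skipped. This is a minimal sketch that assumes each retrieval mode is crossed with rerank on and off; the mode labels and dictionary keys are illustrative and do not correspond to evaluator flags.

```python
from itertools import product

# Illustrative labels for the three retrieval modes named in the list above.
RETRIEVAL_MODES = ("lexical", "semantic", "hybrid")
RERANK = (False, True)

# Cross every retrieval mode with rerank off and on.
ablation_matrix = [
    {"retrieval_mode": mode, "rerank": rerank}
    for mode, rerank in product(RETRIEVAL_MODES, RERANK)
]

for run_id, cfg in enumerate(ablation_matrix, start=1):
    print(run_id, cfg)
```

This yields six runs per dataset. The hybrid configuration with rerank off matches the settings of the archived LoCoMo run, so it can serve as the baseline row.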

Benchmark principles we follow

Use fixed datasets and deterministic start indices whenever possible.
Record model versions, engine config, and policy version with every run.
Separate warm and cold measurements so cache effects do not blur real latency.
Keep ingest and query concurrency explicit in the command line and in the published report.
Publish both retrieval-stage metrics and downstream answer metrics instead of treating them as interchangeable.
Retain raw artifacts so regressions can be audited, not just summarized.
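
As an illustration of keeping warm and cold measurements separate, the sketch below groups latency samples by cache state before summarizing them. The sample values are invented for the example and are not measurements from any run; the `cold` and `warm` labels and the summary fields are assumptions.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical samples: (cache state, query latency in ms). Not measured values.
samples = [
    ("cold", 120.0), ("cold", 100.0),
    ("warm", 60.0), ("warm", 70.0), ("warm", 65.0),
]

# Group latencies by cache state so the two populations are never averaged together.
by_state = defaultdict(list)
for state, latency_ms in samples:
    by_state[state].append(latency_ms)

summary = {
    state: {"n": len(values), "mean_ms": mean(values), "median_ms": median(values)}
    for state, values in by_state.items()
}

for state, stats in summary.items():
    print(state, stats)
```

Reporting the sample count next to each mean makes it visible when one group is too small to support a latency claim.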
