Benchmarking and Evaluation

Benchmark memory quality with repeatable datasets and fixed configurations.

Benchmark principles

  • Use fixed datasets and a deterministic start index.
  • Record model versions and policy versions.
  • Separate warm and cold runs.
  • Keep ingest and query concurrency explicit.
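The principles above amount to pinning every variable a run depends on. A minimal sketch of such a run descriptor follows; the struct and field names are illustrative assumptions, not the evaluator's real API.

```rust
// Hypothetical sketch: a benchmark run descriptor that pins everything
// the principles above call for. Names are illustrative only.
#[derive(Debug, Clone)]
struct BenchRun {
    dataset: &'static str,      // fixed dataset file
    start_index: usize,         // deterministic start index
    model_version: &'static str,
    policy_version: &'static str,
    warm: bool,                 // warm vs. cold run, reported separately
    ingest_concurrency: usize,  // explicit concurrency settings
    query_concurrency: usize,
}

fn run_id(r: &BenchRun) -> String {
    // A stable identifier so two runs with identical settings compare 1:1.
    format!(
        "{}:{}:{}:{}:{}:{}x{}",
        r.dataset, r.start_index, r.model_version, r.policy_version,
        if r.warm { "warm" } else { "cold" },
        r.ingest_concurrency, r.query_concurrency
    )
}

fn main() {
    let run = BenchRun {
        dataset: "benchmarks/LoCoMo/data/locomo10.json",
        start_index: 0,
        model_version: "model-v1",   // record the exact versions used
        policy_version: "policy-v1",
        warm: false,
        ingest_concurrency: 4,
        query_concurrency: 1,
    };
    println!("{}", run_id(&run));
}
```

Recording an identifier like this alongside each result file makes it obvious when two runs are actually comparable.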

Representative run

Rust evaluator invocation
cargo run --release --manifest-path benchmarks/rust_evaluator/Cargo.toml -- \
  --dataset-kind locomo \
  --dataset benchmarks/LoCoMo/data/locomo10.json \
  --engine-url http://127.0.0.1:3000 \
  --top-k 8 --limit 100

Result interpretation

Report both retrieval metrics (recall@k, MRR) and QA outcome metrics. A retrieval gain that does not carry through to better downstream answers usually points to a ranking or prompting issue rather than a retrieval win.
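The two retrieval metrics named above can be sketched directly. This is a standard-definition sketch, not the evaluator's own implementation: `results` holds retrieved ids in rank order, `gold` the relevant ids for that query.

```rust
// recall@k: fraction of gold items found in the top k results.
fn recall_at_k(results: &[u32], gold: &[u32], k: usize) -> f64 {
    if gold.is_empty() {
        return 0.0;
    }
    let hits = results.iter().take(k).filter(|id| gold.contains(id)).count();
    hits as f64 / gold.len() as f64
}

// MRR: mean reciprocal rank of the first relevant result per query.
fn mrr(queries: &[(Vec<u32>, Vec<u32>)]) -> f64 {
    let sum: f64 = queries
        .iter()
        .map(|(results, gold)| {
            results
                .iter()
                .position(|id| gold.contains(id))
                .map_or(0.0, |p| 1.0 / (p as f64 + 1.0))
        })
        .sum();
    sum / queries.len() as f64
}

fn main() {
    let queries = vec![
        (vec![3, 1, 7], vec![1]), // relevant item at rank 2
        (vec![5, 9, 2], vec![4]), // relevant item never retrieved
    ];
    println!("recall@2 q0 = {}", recall_at_k(&queries[0].0, &queries[0].1, 2));
    println!("MRR = {}", mrr(&queries)); // (1/2 + 0) / 2 = 0.25
}
```

Computing both per query, then aggregating, is what lets you see a retrieval gain that QA metrics fail to confirm.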

Keep failed examples and inspect them manually; qualitative review catches failure modes that aggregate scores hide.
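Retaining failures can be as simple as splitting them out of the scored cases before aggregating. A minimal sketch, assuming a hypothetical `(query, expected, got)` case format rather than the evaluator's real output:

```rust
// Illustrative sketch: keep every failed example alongside the aggregate
// score so failure modes can be reviewed by hand. Field names are
// assumptions, not the evaluator's real output format.
#[derive(Debug)]
struct FailedCase {
    query: String,
    expected: String,
    got: String,
}

// Returns the total case count plus the full list of failures.
fn split_failures(cases: Vec<(String, String, String)>) -> (usize, Vec<FailedCase>) {
    let total = cases.len();
    let failures: Vec<FailedCase> = cases
        .into_iter()
        .filter(|(_, expected, got)| expected != got)
        .map(|(query, expected, got)| FailedCase { query, expected, got })
        .collect();
    (total, failures)
}

fn main() {
    let cases = vec![
        ("when did Ana move?".into(), "2019".into(), "2019".into()),
        ("where does Bo work?".into(), "Acme".into(), "unknown".into()),
    ];
    let (total, failures) = split_failures(cases);
    println!("passed {}/{}", total - failures.len(), total);
    for f in &failures {
        // Dump the whole example, not just the score, for manual review.
        println!("FAIL: {:?}", f);
    }
}
```

Writing the failure list to a file per run gives the qualitative review a stable artifact to diff across configurations.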