Benchmarking and Evaluation
Benchmark memory quality with repeatable datasets and fixed configurations.
Benchmark principles
- Use fixed datasets and deterministic start index.
- Record model versions and policy versions.
- Separate warm and cold runs.
- Keep ingest and query concurrency explicit.
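The principles above amount to recording every variable that can change a run's outcome. A minimal sketch of such a run record in Rust; the struct and field names are illustrative, not part of the evaluator's actual configuration schema:

```rust
// Hypothetical record of one benchmark run. Persisting this next to the
// results makes runs reproducible and comparable across changes.
#[derive(Debug, Clone, PartialEq)]
struct BenchRunConfig {
    dataset_path: String,      // fixed dataset file
    start_index: usize,        // deterministic start index
    model_version: String,     // model version under test
    policy_version: String,    // memory policy version
    warm: bool,                // warm vs cold run, reported separately
    ingest_concurrency: usize, // explicit ingest concurrency
    query_concurrency: usize,  // explicit query concurrency
}

fn main() {
    let run = BenchRunConfig {
        dataset_path: "benchmarks/LoCoMo/data/locomo10.json".into(),
        start_index: 0,
        model_version: "model-v1".into(),
        policy_version: "policy-v1".into(),
        warm: false,
        ingest_concurrency: 4,
        query_concurrency: 1,
    };
    println!("{run:?}");
}
```

Two runs are only comparable when every field of this record matches except the one under study.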
Representative run
cargo run --release --manifest-path benchmarks/rust_evaluator/Cargo.toml -- \
  --dataset-kind locomo \
  --dataset benchmarks/LoCoMo/data/locomo10.json \
  --engine-url http://127.0.0.1:3000 \
  --top-k 8 --limit 100

Result interpretation
Report both retrieval metrics (recall@k, MRR) and QA outcome metrics. A retrieval gain that does not translate into better downstream answers often points to a ranking or prompting problem rather than retrieval itself.
Keep failed examples and inspect them manually; qualitative review catches failure modes that aggregate scores hide.
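The retrieval metrics named above are straightforward to compute from ranked results. A minimal sketch, assuming `retrieved` is one query's ranked list of document ids and `relevant` is the gold set; these names are illustrative, not the evaluator's real API:

```rust
// recall@k: fraction of relevant documents found in the top k results.
fn recall_at_k(retrieved: &[u32], relevant: &[u32], k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = retrieved
        .iter()
        .take(k)
        .filter(|id| relevant.contains(id))
        .count();
    hits as f64 / relevant.len() as f64
}

// Reciprocal rank: 1 / (1-based rank of the first relevant result),
// or 0.0 if no relevant result appears. MRR is the mean over queries.
fn reciprocal_rank(retrieved: &[u32], relevant: &[u32]) -> f64 {
    retrieved
        .iter()
        .position(|id| relevant.contains(id))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

fn main() {
    let retrieved = [7, 3, 9, 1];
    let relevant = [3, 1];
    println!("recall@2 = {}", recall_at_k(&retrieved, &relevant, 2)); // 0.5
    println!("RR = {}", reciprocal_rank(&retrieved, &relevant)); // 0.5
}
```

Computing both on the same ranked lists makes the divergence described above visible: recall@k can rise while reciprocal rank stays flat if relevant items enter the top k but not the top ranks.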