LongMemEval-S Benchmark
Public Benchmarks
Transparent, reproducible evaluation of Aletheia against industry leaders on standard agent memory benchmarks.
- Overall Accuracy: 90.5%
- P95 Retrieval Latency: <200 ms
- vs Mem0 Overall: +61 pt
- Overall Ranking: #2
LongMemEval-S Results (%)
| Model | Overall | Single Session | Temporal | Preferences | Knowledge Updates | Multi-Session |
|---|---|---|---|---|---|---|
| Aletheia | 90.5% | 98.0% | 88.3% | 95.2% | 96.1% | 74.8% |
| HydraDB | 90.8% | 100.0% | 91.0% | 96.7% | 97.4% | 76.7% |
| Zep | 71.2% | 92.9% | 62.4% | 56.7% | 83.3% | 57.9% |
| Mem0 | 29.1% | 38.7% | 25.6% | 40.0% | 52.6% | 20.3% |
Overall Score Comparison
- Aletheia: 90.5%
- HydraDB: 90.8%
- Zep: 71.2%
- Mem0: 29.1%
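The headline "#2" ranking and the "+61pt" lead over Mem0 both follow directly from the Overall column. As a minimal sketch (the helper names `rank_of` and `point_gap` are hypothetical and are not part of the Aletheia codebase), the arithmetic can be checked like this:

```rust
/// Overall accuracy (%) copied from the LongMemEval-S results table.
const OVERALL: [(&str, f64); 4] = [
    ("Aletheia", 90.5),
    ("HydraDB", 90.8),
    ("Zep", 71.2),
    ("Mem0", 29.1),
];

/// 1-based rank of `name` when systems are ordered by descending score.
/// (Hypothetical helper, for illustration only.)
fn rank_of(name: &str, scores: &[(&str, f64)]) -> Option<usize> {
    let target = scores.iter().find(|(n, _)| *n == name)?.1;
    // Rank = 1 + number of systems with a strictly higher score.
    Some(1 + scores.iter().filter(|(_, s)| *s > target).count())
}

/// Percentage-point gap between two systems (`a` minus `b`).
/// (Hypothetical helper, for illustration only.)
fn point_gap(a: &str, b: &str, scores: &[(&str, f64)]) -> Option<f64> {
    let get = |name: &str| scores.iter().find(|(n, _)| *n == name).map(|(_, s)| *s);
    Some(get(a)? - get(b)?)
}

fn main() {
    let rank = rank_of("Aletheia", &OVERALL).unwrap();
    let gap = point_gap("Aletheia", "Mem0", &OVERALL).unwrap();
    println!("Aletheia rank: #{rank}"); // prints "Aletheia rank: #2"
    println!("Gap vs Mem0: {gap:+.1} pt"); // prints "Gap vs Mem0: +61.4 pt"
}
```

The gap is 61.4 percentage points, which the summary card rounds to "+61pt"; HydraDB's 90.8% is the only score above Aletheia's 90.5%.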
Methodology
- Dataset: LongMemEval-S benchmark, 6 categories covering single-session and multi-session recall, temporal reasoning, preference extraction, and knowledge updates (the results table reports these in five category columns)
- Hardware: All tests run on equivalent cloud instances (4 vCPU, 16 GB RAM)
- Evaluation code: Open source at github.com/sharjeel619/aletheia
- Competitor results: Sourced from publicly published benchmarks and our own evaluation rig. HydraDB results are from hydradb.com/benchmarks, Zep results from the getzep.com benchmarks, and Mem0 results from published baselines.
- Last updated: May 2026
- Run them yourself:
cargo run --release --bench longmemeval
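The page reports a P95 retrieval latency under 200 ms but does not say how the percentile is computed. One common convention is the nearest-rank method, sketched below. This is an assumption for illustration, not the benchmark's confirmed implementation; the `percentile` function and the sample latencies are hypothetical.

```rust
use std::time::Duration;

/// Nearest-rank percentile: the smallest sample such that at least
/// `pct` percent of all samples are less than or equal to it.
/// (Hypothetical helper; the real benchmark may use another method.)
fn percentile(samples: &[Duration], pct: usize) -> Option<Duration> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // 1-based rank = ceil(pct * n / 100), computed in integer arithmetic.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

fn main() {
    // Made-up latencies of 1..=100 ms, for illustration only; not measured data.
    let samples: Vec<Duration> = (1..=100).map(Duration::from_millis).collect();
    let p95 = percentile(&samples, 95).unwrap();
    println!("p95 = {} ms", p95.as_millis());
    // The kind of check a "<200ms" claim implies.
    assert!(p95 < Duration::from_millis(200));
}
```

Other percentile definitions (for example, linear interpolation between neighboring samples) can give slightly different values on small sample sets, so it helps to state which one a reported figure uses.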