Retrieval

Lexical Index (BM25)

Lexical retrieval catches exact names, tokens, and numeric strings that dense embeddings may miss.

Why lexical still matters

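Semantic retrieval is strong on paraphrase but can miss exact literal matches, since dense embeddings do not guarantee that a document containing the identical string ranks first. BM25 scores exact term overlap, which restores precision for IDs, dates, error codes, and uncommon terms.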
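
To make the exact-match behaviour concrete, here is a minimal sketch of BM25 scoring over pre-tokenized documents. It is an illustration, not the scoring code of any particular engine: the function name bm25_scores, the defaults k1=1.5 and b=0.75, and the always-non-negative IDF variant are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25.

    k1 and b are illustrative defaults (assumptions), not tuned values.
    """
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # Rare terms (low df) get a high IDF; this variant never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A rare token such as a hex order ID appears in very few documents, so it carries a high IDF weight and only documents that contain it literally receive a non-zero contribution from it.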

Hybrid retrieval avoids the false dichotomy of semantic-only vs keyword-only systems.

Tokenization guidance

  • Normalize casing with domain-aware exceptions.
  • Keep punctuation splitting consistent between ingest and query.
  • Preserve key delimiters for IDs when possible.
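
The three guidelines above can be sketched as a single tokenizer shared by the ingest and query paths. This is one possible approach under stated assumptions: the regular expression, the CASE_SENSITIVE set, and the example terms in it are hypothetical and would need to be adapted to the actual domain.

```python
import re

# Hypothetical pattern: runs of letters/digits joined by '-', '_' or '.'
# stay together, so IDs like "shard-3" or "order_id" survive as one token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-_.][A-Za-z0-9]+)*")

# Hypothetical domain exceptions whose casing is meaningful and is kept.
CASE_SENSITIVE = frozenset({"ERR_TIMEOUT", "GET"})


def tokenize(text: str) -> list[str]:
    """Tokenize text identically for indexing and for querying."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        # Lowercase everything except the listed case-sensitive terms.
        tokens.append(tok if tok in CASE_SENSITIVE else tok.lower())
    return tokens
```

Calling the same tokenize function at ingest and at query time is what keeps punctuation splitting consistent; two separate implementations tend to drift apart.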

Example lexical-heavy query

{
  "textual_query": "order_id 9f8a12e0 timeout on shard-3",
  "entity_id": "tenant-77",
  "limit": 8
}

Fusion expectations

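BM25 candidates should be fused with semantic candidates, not blindly appended. Rank fusion methods such as reciprocal rank fusion (RRF) combine the two lists by rank position rather than by raw score, which keeps both signals and prevents either side from dominating.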
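
As an illustration of rank fusion, the following is a minimal sketch of reciprocal rank fusion over two or more ranked lists of document IDs. The function name rrf_fuse, the document IDs, and the limit default are hypothetical; k=60 is a commonly used constant, assumed here rather than tuned.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, limit=8):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ordered[:limit]]


bm25_hits = ["a", "b", "c"]    # hypothetical lexical ranking
dense_hits = ["b", "d", "a"]   # hypothetical semantic ranking
fused = rrf_fuse([bm25_hits, dense_hits])
```

Because only rank positions enter the sum, BM25 scores and embedding similarities never need to be put on a common scale, and a document ranked well by both retrievers rises above one ranked well by only one of them.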