C1 — Competitive Recall Accuracy

Can a memory system reliably recall what it wrote?

20 config-heavy facts. 40 recall questions. Each system writes each fact to its own isolated namespace, then answers two targeted questions about that fact. Score: expected answer substring appears in the returned context. No LLM judge.

System | Recall | Context per query
Iranti | 100% (40/40) | 20 tok/query
Shodh | 100% (40/40) | 20 tok/query
Mem0 | 80% (32/40) | 13 tok/query
Graphiti | 57% (23/40) | 37 tok/query

Corpus design

The 20 facts cover real-world application configuration: JWT auth, PostgreSQL connection pooling, rate limiting, feature flags, background workers, logging, Redis caching, email delivery, Elasticsearch, file uploads, encryption, webhooks, audit logs, CORS, and payment processing, along with sessions, API versioning, health checks, deployment, and middleware ordering.

Facts are assigned a risk tier based on operational severity: HIGH for security and data integrity configs, MEDIUM for infrastructure settings, LOW for operational preferences. 8 HIGH, 6 MEDIUM, 6 LOW facts.

Each fact contains at least two independently queryable values — one semantic (e.g., a path or name) and one numeric (e.g., a timeout or limit). Both are tested as separate questions to probe each system's ability to return exact config values.
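A fact of this shape can be sketched as a small record. The field names and question wording below are illustrative (built around F01's JWT values), not the benchmark's actual schema:

```python
# Hypothetical fact record for F01 (JWT expiry + public key path).
# One semantic value (a path) and one numeric value (a timeout),
# each probed by its own recall question.
fact = {
    "id": "F01",
    "tier": "HIGH",
    "text": "JWT auth config: tokens expire after 3600 seconds; "
            "public key at /etc/secrets/jwt-public.pem.",
    "questions": [
        {"q": "Where is the JWT public key stored?",            # semantic
         "expected": "/etc/secrets/jwt-public.pem"},
        {"q": "After how many seconds do JWT tokens expire?",   # numeric
         "expected": "3600"},
    ],
}
```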

Isolation and scoring

Each fact is written to a dedicated, isolated namespace per system. For Iranti and Shodh this is a per-fact user ID. For Mem0 it is a per-fact user identifier with a dedicated Chroma collection. For Graphiti it is a per-fact group_id.

Isolation ensures that each query can only return context from the one fact written to that scope — eliminating cross-contamination. The score is a deterministic substring match: the expected answer string must appear verbatim in the returned context.

Scoring function:

def answer_hit(expected: str, text: str) -> bool:
    return expected.lower() in text.lower()
# Yes/No answers additionally match their "true"/"false" aliases
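Combining the isolation scheme and the substring scorer, the per-fact evaluation loop amounts to something like the sketch below. The `adapter` with `write`/`search` methods is a hypothetical wrapper around each system's client, not any system's real API:

```python
def run_fact(adapter, fact):
    """Write one fact to its own namespace, then score its two questions."""
    namespace = f"bench-{fact['id']}"   # per-fact isolation scope
    adapter.write(namespace, fact["text"])
    hits = 0
    for qa in fact["questions"]:
        context = adapter.search(namespace, qa["q"])
        # Same deterministic substring check as answer_hit above
        hits += qa["expected"].lower() in context.lower()
    return hits  # 0, 1, or 2 questions correct for this fact
```

A system that returns the stored text verbatim scores 2 per fact; totalling hits across all 20 facts gives the 40-question score.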

System comparison

Recall accuracy (higher is better):
Iranti: 100% (40/40)
Shodh: 100% (40/40)
Mem0: 80% (32/40)
Graphiti: 57% (23/40)

Avg tokens per query (lower is better; this is the injected context size per query):
Iranti: 20 tok/q
Shodh: 20 tok/q
Mem0: 13 tok/q
Graphiti: 37 tok/q

Results by risk tier

Questions scored per tier (HIGH / MEDIUM / LOW). Each tier shows questions correct / total.

Tier | Iranti | Shodh | Mem0 | Graphiti
HIGH | 16/16 (100%) | 16/16 (100%) | 12/16 (75%) | 9/16 (56%)
MEDIUM | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 6/12 (50%)
LOW | 12/12 (100%) | 12/12 (100%) | 8/12 (67%) | 8/12 (67%)

Per-fact recall table

✓ = both questions answered correctly · ✗ = one or both questions missed · – = per-fact breakdown not available

ID | Tier | Questions | Iranti | Shodh | Mem0 | Graphiti
F01 | HIGH | JWT expiry + public key path | ✓ | ✓ | ✓ | –
F02 | HIGH | DB max connections + idle timeout | ✓ | ✓ | ✓ | –
F03 | HIGH | Rate limit write RPM + header name | ✓ | ✓ | ✗ | –
F04 | HIGH | Feature flags + default-enabled flag | ✓ | ✓ | ✓ | –
F05 | HIGH | Worker count + job timeout | ✓ | ✓ | ✓ | –
F06 | MEDIUM | Log file path + rotation days | ✓ | ✓ | ✓ | –
F07 | MEDIUM | Cache TTL seconds + max memory MB | ✓ | ✓ | ✓ | –
F08 | MEDIUM | From email address + send timeout | ✓ | ✓ | ✓ | –
F09 | MEDIUM | Elasticsearch index name + debounce MS | ✓ | ✓ | ✓ | –
F10 | MEDIUM | Max file size MB + S3 bucket name | ✓ | ✓ | ✓ | –
F11 | LOW | Auth middleware chain + order | ✓ | ✓ | ✗ | –
F12 | LOW | AWS region + deployment strategy | ✓ | ✓ | ✓ | –
F13 | LOW | Stable API version + v0 sunset date | ✓ | ✓ | ✓ | –
F14 | LOW | Health endpoint method + auth requirement | ✓ | ✓ | ✗ | –
F15 | LOW | Session TTL days + invalidate on pw change | ✓ | ✓ | ✓ | –
F16 | HIGH | Encryption algorithm + PBKDF2 iterations | ✓ | ✓ | ✓ | –
F17 | HIGH | Webhook signature header + timeout MS | ✓ | ✓ | ✗ | –
F18 | MEDIUM | Audit table name + immutability | ✓ | ✓ | ✓ | –
F19 | LOW | CORS credentials + preflight cache seconds | ✓ | ✓ | ✓ | –
F20 | HIGH | Stripe webhook path + max amount cents | ✓ | ✓ | ✓ | –
Total | | | 100% | 100% | 80% | 57%
Mem0 miss pattern

Mem0 misses F03, F11, F14, F17 — spanning HIGH and LOW tiers. The common thread is structured config values that lack strong semantic context: rate limit header names, middleware ordering, health endpoint auth flags, and webhook signature headers. Mem0's vector similarity finds semantically related context but not the exact config key-value.

Graphiti miss pattern

Graphiti misses 17 of 40 questions. The pattern concentrates on numeric config values: millisecond timeouts, RPM limits, iteration counts, byte counts. LLM entity extraction rephrases facts semantically — "JWT expiry is 3600 seconds" becomes an edge like "JWT token expiry is issued by myapp.prod" — the number is lost in translation.

Why Graphiti loses numeric values

Graphiti (getzep/graphiti v0.28.2) uses an LLM to extract entities and relationships from each ingested episode. The extracted entities and their edge fact strings are what get returned on search — not the original verbatim text.

During extraction, the LLM is asked to identify subjects, predicates, and objects in natural language. Numeric configuration values — which lack semantic subject-predicate structure — are frequently absorbed into vague predicates or dropped entirely.

The benchmark searched with per-fact group_id isolation, meaning each query was scoped to the exact namespace where that fact was written. The issue is not retrieval routing — it is that the extractable fact content itself no longer contains the original numeric value.

Input episode body:

  JWT authentication config. Uses RS256 signing.
  [expirySeconds=3600, issuer=myapp.prod,
   keyFile=/etc/secrets/jwt-public.pem]

↓ LLM entity extraction

Extracted edge.fact (what search returns):

  "JWT token expiry is issued by myapp.prod"
  → "3600" not found · "/etc/secrets/jwt-public.pem" not found
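This failure mode is mechanically detectable: every numeric token in the input episode should reappear in at least one extracted edge fact. A minimal round-trip check, sketched here as an illustration (not part of the benchmark harness or Graphiti's API):

```python
import re

def numbers_preserved(episode: str, edge_facts: list[str]) -> list[str]:
    """Return numeric tokens from the episode missing from every edge fact."""
    nums = set(re.findall(r"\d+", episode))
    combined = " ".join(edge_facts)
    return sorted(n for n in nums if n not in combined)

episode = ("JWT authentication config. Uses RS256 signing. "
           "[expirySeconds=3600, issuer=myapp.prod]")
edges = ["JWT token expiry is issued by myapp.prod"]
# numbers_preserved(episode, edges) flags "3600" (and "256", from RS256)
```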

Graphiti is well-suited for narrative and relationship-heavy knowledge where semantic extraction adds value. It is not well-suited for verbatim config recall where exact numeric values must survive the write-read round trip.
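If verbatim recall matters, one mitigation (an assumption sketched here, not a Graphiti feature) is to keep the raw episode text alongside the graph and fall back to substring search over it whenever the extracted facts lack the queried value:

```python
def recall(query_terms, edge_facts, raw_episodes):
    """Prefer extracted edge facts; fall back to verbatim episode text."""
    def matches(texts):
        return [t for t in texts
                if any(q.lower() in t.lower() for q in query_terms)]
    # Fall back only when no extracted fact contains any queried term
    return matches(edge_facts) or matches(raw_episodes)
```

With the JWT example above, a query for "3600" would miss every extracted edge and be served from the stored episode instead.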

NOTE
Cognee was excluded from this benchmark. Cognee requires Python <3.14 (strict upper bound in its package metadata). The benchmark environment runs Python 3.14.2 — no compatible version of Cognee could be installed via pip. This is not a performance disqualification. Cognee will be re-evaluated when a compatible release is available.

Key findings

1. Iranti and Shodh achieve perfect recall on exact-match factual questions.

2. Mem0 misses 8 questions across four facts (F03, F11, F14, F17), concentrated in the HIGH and LOW risk tiers.

3. Graphiti extracts entities and relationships via LLM but rephrases numeric config values, producing a 43% miss rate (17/40 questions).

4. Cognee could not be tested: it requires Python <3.14 and the benchmark environment runs Python 3.14.2.