C1 — Competitive Recall Accuracy

Can a memory system reliably recall what it wrote?

20 config-heavy facts. 40 recall questions. Each system writes each fact to its own isolated namespace, then answers two targeted questions about that fact. Score: expected answer substring appears in the returned context. No LLM judge.

System | Recall | Context per query
Iranti | 100% (40/40) | 20 tok/query
Shodh | 100% (40/40) | 20 tok/query
Mem0 | 80% (32/40) | 13 tok/query
Graphiti | 57% (23/40) | 37 tok/query

Corpus design

The 20 facts cover real-world application configuration: JWT auth, PostgreSQL connection pooling, rate limiting, feature flags, background workers, logging, Redis caching, email delivery, Elasticsearch, file uploads, encryption, webhooks, audit logs, CORS, and payment processing, along with sessions, API versioning, health checks, deployment, and middleware ordering.

Facts are assigned a risk tier based on operational severity: HIGH for security and data integrity configs, MEDIUM for infrastructure settings, LOW for operational preferences. 8 HIGH, 6 MEDIUM, 6 LOW facts.

Each fact contains at least two independently queryable values — one semantic (e.g., a path or name) and one numeric (e.g., a timeout or limit). Both are tested as separate questions to probe each system's ability to return exact config values.
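A fact of this shape can be sketched as a small record. The field names and question wording below are illustrative (built around F01's JWT values), not the benchmark's actual schema:

```python
# Hypothetical fact record for F01 (JWT expiry + public key path).
# One semantic value (a path) and one numeric value (a timeout),
# each probed by its own recall question.
fact = {
    "id": "F01",
    "tier": "HIGH",
    "text": "JWT auth config: tokens expire after 3600 seconds; "
            "public key at /etc/secrets/jwt-public.pem.",
    "questions": [
        {"q": "Where is the JWT public key stored?",            # semantic
         "expected": "/etc/secrets/jwt-public.pem"},
        {"q": "After how many seconds do JWT tokens expire?",   # numeric
         "expected": "3600"},
    ],
}
```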

Isolation and scoring

Each fact is written to a dedicated, isolated namespace per system. For Iranti and Shodh this is a per-fact user ID. For Mem0 it is a per-fact user identifier with a dedicated Chroma collection. For Graphiti it is a per-fact group_id.

Isolation ensures that each query can only return context from the one fact written to that scope — eliminating cross-contamination. The score is a deterministic substring match: the expected answer string must appear verbatim in the returned context.

Scoring function:

def answer_hit(expected: str, text: str) -> bool:
    return expected.lower() in text.lower()
# Yes/No answers additionally match their "true"/"false" aliases
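Combining the isolation scheme and the substring scorer, the per-fact evaluation loop amounts to something like the sketch below. The `adapter` with `write`/`search` methods is a hypothetical wrapper around each system's client, not any system's real API:

```python
def run_fact(adapter, fact):
    """Write one fact to its own namespace, then score its two questions."""
    namespace = f"bench-{fact['id']}"   # per-fact isolation scope
    adapter.write(namespace, fact["text"])
    hits = 0
    for qa in fact["questions"]:
        context = adapter.search(namespace, qa["q"])
        # Same deterministic substring check as answer_hit above
        hits += qa["expected"].lower() in context.lower()
    return hits  # 0, 1, or 2 questions correct for this fact
```

A system that returns the stored text verbatim scores 2 per fact; totalling hits across all 20 facts gives the 40-question score.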

System comparison

Recall accuracy (higher is better):
Iranti: 100% (40/40)
Shodh: 100% (40/40)
Mem0: 80% (32/40)
Graphiti: 57% (23/40)

Avg tokens per query (lower is better; this is the injected context size per query):
Iranti: 20 tok/q
Shodh: 20 tok/q
Mem0: 13 tok/q
Graphiti: 37 tok/q

Results by risk tier

Questions scored per tier (HIGH / MEDIUM / LOW). Each tier shows questions correct / total.

Tier | Iranti | Shodh | Mem0 | Graphiti
HIGH | 16/16 (100%) | 16/16 (100%) | 12/16 (75%) | 9/16 (56%)
MEDIUM | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 6/12 (50%)
LOW | 12/12 (100%) | 12/12 (100%) | 8/12 (67%) | 8/12 (67%)

Per-fact recall table

✓ = both questions answered correctly · ✗ = one or both questions missed · – = per-fact breakdown not available

ID | Tier | Questions | Iranti | Shodh | Mem0 | Graphiti
F01 | HIGH | JWT expiry + public key path | ✓ | ✓ | ✓ | –
F02 | HIGH | DB max connections + idle timeout | ✓ | ✓ | ✓ | –
F03 | HIGH | Rate limit write RPM + header name | ✓ | ✓ | ✗ | –
F04 | HIGH | Feature flags + default-enabled flag | ✓ | ✓ | ✓ | –
F05 | HIGH | Worker count + job timeout | ✓ | ✓ | ✓ | –
F06 | MEDIUM | Log file path + rotation days | ✓ | ✓ | ✓ | –
F07 | MEDIUM | Cache TTL seconds + max memory MB | ✓ | ✓ | ✓ | –
F08 | MEDIUM | From email address + send timeout | ✓ | ✓ | ✓ | –
F09 | MEDIUM | Elasticsearch index name + debounce MS | ✓ | ✓ | ✓ | –
F10 | MEDIUM | Max file size MB + S3 bucket name | ✓ | ✓ | ✓ | –
F11 | LOW | Auth middleware chain + order | ✓ | ✓ | ✗ | –
F12 | LOW | AWS region + deployment strategy | ✓ | ✓ | ✓ | –
F13 | LOW | Stable API version + v0 sunset date | ✓ | ✓ | ✓ | –
F14 | LOW | Health endpoint method + auth requirement | ✓ | ✓ | ✗ | –
F15 | LOW | Session TTL days + invalidate on pw change | ✓ | ✓ | ✓ | –
F16 | HIGH | Encryption algorithm + PBKDF2 iterations | ✓ | ✓ | ✓ | –
F17 | HIGH | Webhook signature header + timeout MS | ✓ | ✓ | ✗ | –
F18 | MEDIUM | Audit table name + immutability | ✓ | ✓ | ✓ | –
F19 | LOW | CORS credentials + preflight cache seconds | ✓ | ✓ | ✓ | –
F20 | HIGH | Stripe webhook path + max amount cents | ✓ | ✓ | ✓ | –
Total | | | 100% | 100% | 80% | 57%
Mem0 miss pattern

Mem0 misses F03, F11, F14, F17 — spanning HIGH and LOW tiers. The common thread is structured config values that lack strong semantic context: rate limit header names, middleware ordering, health endpoint auth flags, and webhook signature headers. Mem0's vector similarity finds semantically related context but not the exact config key-value.

Graphiti miss pattern

Graphiti misses 17 of 40 questions. The pattern concentrates on numeric config values: millisecond timeouts, RPM limits, iteration counts, byte counts. LLM entity extraction rephrases facts semantically — "JWT expiry is 3600 seconds" becomes an edge like "JWT token expiry is issued by myapp.prod" — the number is lost in translation.

Why Graphiti loses numeric values

Graphiti (getzep/graphiti v0.28.2) uses an LLM to extract entities and relationships from each ingested episode. The extracted entities and their edge fact strings are what get returned on search — not the original verbatim text.

During extraction, the LLM is asked to identify subjects, predicates, and objects in natural language. Numeric configuration values — which lack semantic subject-predicate structure — are frequently absorbed into vague predicates or dropped entirely.

The benchmark searched with per-fact group_id isolation, meaning each query was scoped to the exact namespace where that fact was written. The issue is not retrieval routing — it is that the extractable fact content itself no longer contains the original numeric value.

Input episode body:

  JWT authentication config. Uses RS256 signing.
  [expirySeconds=3600, issuer=myapp.prod,
   keyFile=/etc/secrets/jwt-public.pem]

↓ LLM entity extraction

Extracted edge.fact (what search returns):

  "JWT token expiry is issued by myapp.prod"
  → "3600" not found · "/etc/secrets/jwt-public.pem" not found
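This failure mode is mechanically detectable: every numeric token in the input episode should reappear in at least one extracted edge fact. A minimal round-trip check, sketched here as an illustration (not part of the benchmark harness or Graphiti's API):

```python
import re

def numbers_preserved(episode: str, edge_facts: list[str]) -> list[str]:
    """Return numeric tokens from the episode missing from every edge fact."""
    nums = set(re.findall(r"\d+", episode))
    combined = " ".join(edge_facts)
    return sorted(n for n in nums if n not in combined)

episode = ("JWT authentication config. Uses RS256 signing. "
           "[expirySeconds=3600, issuer=myapp.prod]")
edges = ["JWT token expiry is issued by myapp.prod"]
# numbers_preserved(episode, edges) flags "3600" (and "256", from RS256)
```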

Graphiti is well-suited for narrative and relationship-heavy knowledge where semantic extraction adds value. It is not well-suited for verbatim config recall where exact numeric values must survive the write-read round trip.
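If verbatim recall matters, one mitigation (an assumption sketched here, not a Graphiti feature) is to keep the raw episode text alongside the graph and fall back to substring search over it whenever the extracted facts lack the queried value:

```python
def recall(query_terms, edge_facts, raw_episodes):
    """Prefer extracted edge facts; fall back to verbatim episode text."""
    def matches(texts):
        return [t for t in texts
                if any(q.lower() in t.lower() for q in query_terms)]
    # Fall back only when no extracted fact contains any queried term
    return matches(edge_facts) or matches(raw_episodes)
```

With the JWT example above, a query for "3600" would miss every extracted edge and be served from the stored episode instead.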

NOTE
Cognee was excluded from this benchmark. Cognee requires Python <3.14 (strict upper bound in its package metadata). The benchmark environment runs Python 3.14.2 — no compatible version of Cognee could be installed via pip. This is not a performance disqualification. Cognee will be re-evaluated when a compatible release is available.

Key findings

1. Iranti and Shodh achieve perfect recall on exact-match factual questions.

2. Mem0 misses 8 questions across four facts (F03, F11, F14, F17), concentrated in the HIGH and LOW risk tiers.

3. Graphiti extracts entities and relationships via LLM but rephrases numeric config values, producing a 43% miss rate (17/40 questions).

4. Cognee could not be tested: it requires Python <3.14 and the benchmark environment runs Python 3.14.2.