Can a memory system reliably recall what it wrote?
20 config-heavy facts. 40 recall questions. Each system writes each fact to its own isolated namespace, then answers two targeted questions about that fact. Score: expected answer substring appears in the returned context. No LLM judge.
Corpus design
The 20 facts cover real-world application configuration: JWT auth, PostgreSQL connection pooling, rate limiting, feature flags, background workers, logging, Redis caching, email delivery, Elasticsearch, file uploads, encryption, webhooks, audit logs, CORS, and payment processing.
Facts are assigned a risk tier based on operational severity: HIGH for security and data integrity configs, MEDIUM for infrastructure settings, LOW for operational preferences. 8 HIGH, 6 MEDIUM, 6 LOW facts.
Each fact contains at least two independently queryable values — one semantic (e.g., a path or name) and one numeric (e.g., a timeout or limit). Both are tested as separate questions to probe each system's ability to return exact config values.
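A corpus entry of this shape can be pictured as a small record with one fact and two questions. The sketch below is illustrative only: the `Fact` and `Question` names, the field names, and the question wording are assumptions, and the fact text is built from the JWT example values quoted elsewhere in this write-up rather than copied from the actual corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    prompt: str    # natural-language recall question (wording assumed)
    expected: str  # substring that must appear in the returned context


@dataclass(frozen=True)
class Fact:
    fact_id: str
    tier: str      # "HIGH", "MEDIUM", or "LOW"
    text: str      # what gets written to the memory system
    questions: tuple[Question, Question]  # one numeric, one semantic


# Hypothetical entry; values follow the JWT example, wording is assumed.
f01 = Fact(
    fact_id="F01",
    tier="HIGH",
    text="JWT auth: expirySeconds=3600, issuer=myapp.prod, "
         "keyFile=/etc/secrets/jwt-public.pem",
    questions=(
        Question("What is the JWT expiry in seconds?", "3600"),
        Question("Where is the JWT public key stored?",
                 "/etc/secrets/jwt-public.pem"),
    ),
)

# Sanity check: both expected answers exist in the text that is written,
# so a miss at query time cannot be blamed on the corpus itself.
assert all(q.expected in f01.text for q in f01.questions)
```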
Isolation and scoring
Each fact is written to a dedicated, isolated namespace per system. For Iranti and Shodh this is a per-fact user ID. For Mem0 it is a per-fact user identifier with a dedicated Chroma collection. For Graphiti it is a per-fact group_id.
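The per-system scoping can be summarised as a mapping from a fact ID to the identifiers used for that fact. This is a rough sketch, not the benchmark's harness: only `group_id` is named in the text, while `scope_for`, the `bench-` prefix, and the `user_id` and `collection_name` key names are assumptions chosen for illustration.

```python
def scope_for(system: str, fact_id: str) -> dict:
    """Return hypothetical isolation parameters for one fact on one system."""
    namespace = f"bench-{fact_id.lower()}"  # assumed naming scheme
    if system in ("iranti", "shodh"):
        # Per-fact user ID.
        return {"user_id": namespace}
    if system == "mem0":
        # Per-fact user identifier plus a dedicated Chroma collection.
        return {"user_id": namespace, "collection_name": namespace}
    if system == "graphiti":
        # Per-fact group_id.
        return {"group_id": namespace}
    raise ValueError(f"unknown system: {system}")


print(scope_for("graphiti", "F01"))
print(scope_for("mem0", "F17"))
```

Because the namespace is derived from the fact ID alone, no two facts share a scope on any system.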
Isolation ensures that each query can only return context from the one fact written to that scope — eliminating cross-contamination. The score is a deterministic substring match: the expected answer string must appear verbatim in the returned context.
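A minimal sketch of how scoped retrieval and substring scoring fit together is shown below. `InMemoryStore` is a hypothetical stand-in, not the API of any system in this comparison, and the namespace names and stored text are made up for the example.

```python
from collections import defaultdict


class InMemoryStore:
    """Hypothetical stand-in for a memory system; not a real library API."""

    def __init__(self):
        self._by_namespace = defaultdict(list)

    def write(self, namespace: str, text: str) -> None:
        self._by_namespace[namespace].append(text)

    def search(self, namespace: str, query: str) -> str:
        # Scoped lookup: only text written to this namespace can come back.
        # The query is ignored here because each namespace holds one fact.
        return "\n".join(self._by_namespace.get(namespace, []))


def score(expected: str, context: str) -> bool:
    """Deterministic pass/fail: the expected answer must appear verbatim."""
    return expected in context


store = InMemoryStore()
store.write("fact-F01", "JWT expirySeconds=3600, keyFile=/etc/secrets/jwt-public.pem")
store.write("fact-other", "placeholder text for a different fact")

question = "What is the JWT expiry in seconds?"
hit = score("3600", store.search("fact-F01", question))
miss = score("3600", store.search("fact-other", question))
print(hit, miss)
```

The match is case-sensitive and format-sensitive by design: a context that renders the value as "3,600" or "one hour" would not pass.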
System comparison
Chart: injected context size per query, by system (lower is better).
Results by risk tier
Questions scored per tier (HIGH / MEDIUM / LOW). Each tier shows questions correct / total.
| Tier | Iranti | Shodh | Mem0 | Graphiti |
|---|---|---|---|---|
| HIGH | 16/16 (100%) | 16/16 (100%) | 12/16 (75%) | 9/16 (56%) |
| MEDIUM | 12/12 (100%) | 12/12 (100%) | 12/12 (100%) | 6/12 (50%) |
| LOW | 12/12 (100%) | 12/12 (100%) | 8/12 (67%) | 8/12 (67%) |
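
The tier percentages and overall question counts follow directly from the correct/total pairs in the tier table. The short script below reproduces that arithmetic; the counts are copied from the table, while the `percent` and `overall` helper names are only illustrative.

```python
# (correct, total) question counts per tier, copied from the tier table.
tier_counts = {
    "Iranti":   {"HIGH": (16, 16), "MEDIUM": (12, 12), "LOW": (12, 12)},
    "Shodh":    {"HIGH": (16, 16), "MEDIUM": (12, 12), "LOW": (12, 12)},
    "Mem0":     {"HIGH": (12, 16), "MEDIUM": (12, 12), "LOW": (8, 12)},
    "Graphiti": {"HIGH": (9, 16),  "MEDIUM": (6, 12),  "LOW": (8, 12)},
}


def percent(correct: int, total: int) -> int:
    """Recall as a whole-number percentage."""
    return round(100 * correct / total)


def overall(system: str) -> tuple[int, int]:
    """Sum correct and total questions across all tiers for one system."""
    pairs = tier_counts[system].values()
    return sum(c for c, _ in pairs), sum(t for _, t in pairs)


for system, tiers in tier_counts.items():
    per_tier = {tier: percent(c, t) for tier, (c, t) in tiers.items()}
    correct, total = overall(system)
    print(f"{system}: {per_tier} overall {correct}/{total}")
```

Summing the tiers gives 32/40 for Mem0 and 23/40 for Graphiti, which matches the 8 and 17 missed questions cited in the discussion.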
Per-fact recall table
✓ = both questions answered correctly · ✗ = one or both questions missed
| ID | Tier | Questions | Iranti | Shodh | Mem0 | Graphiti |
|---|---|---|---|---|---|---|
| F01 | HIGH | JWT expiry + public key path | ✓ | ✓ | ✓ | ✗ |
| F02 | HIGH | DB max connections + idle timeout | ✓ | ✓ | ✓ | ✗ |
| F03 | HIGH | Rate limit write RPM + header name | ✓ | ✓ | ✗ | ✗ |
| F04 | HIGH | Feature flags + default-enabled flag | ✓ | ✓ | ✓ | ✗ |
| F05 | HIGH | Worker count + job timeout | ✓ | ✓ | ✓ | ✗ |
| F06 | MEDIUM | Log file path + rotation days | ✓ | ✓ | ✓ | ✓ |
| F07 | MEDIUM | Cache TTL seconds + max memory MB | ✓ | ✓ | ✓ | ✗ |
| F08 | MEDIUM | From email address + send timeout | ✓ | ✓ | ✓ | ✓ |
| F09 | MEDIUM | Elasticsearch index name + debounce MS | ✓ | ✓ | ✓ | ✗ |
| F10 | MEDIUM | Max file size MB + S3 bucket name | ✓ | ✓ | ✓ | ✓ |
| F11 | LOW | Auth middleware chain + order | ✓ | ✓ | ✗ | ✗ |
| F12 | LOW | AWS region + deployment strategy | ✓ | ✓ | ✓ | ✓ |
| F13 | LOW | Stable API version + v0 sunset date | ✓ | ✓ | ✓ | ✓ |
| F14 | LOW | Health endpoint method + auth requirement | ✓ | ✓ | ✗ | ✓ |
| F15 | LOW | Session TTL days + invalidate on pw change | ✓ | ✓ | ✓ | ✗ |
| F16 | HIGH | Encryption algorithm + PBKDF2 iterations | ✓ | ✓ | ✓ | ✗ |
| F17 | HIGH | Webhook signature header + timeout MS | ✓ | ✓ | ✗ | ✗ |
| F18 | MEDIUM | Audit table name + immutability | ✓ | ✓ | ✓ | ✗ |
| F19 | LOW | CORS credentials + preflight cache seconds | ✓ | ✓ | ✓ | ✓ |
| F20 | HIGH | Stripe webhook path + max amount cents | ✓ | ✓ | ✓ | ✗ |
| Total | | | 100% | 100% | 80% | 57% |
Mem0 misses both questions on four facts (F03, F11, F14, F17), spanning the HIGH and LOW tiers. The common thread is structured config values that lack strong semantic context: rate limit header names, middleware ordering, health endpoint auth flags, and webhook signature headers. Mem0's vector similarity finds semantically related context but not the exact config key-value pair.
Graphiti misses 17 of 40 questions. The misses concentrate on numeric config values: millisecond timeouts, RPM limits, iteration counts, byte counts. LLM entity extraction rephrases facts semantically: "JWT expiry is 3600 seconds" becomes an edge like "JWT token expiry is issued by myapp.prod", and the number is lost in translation.
Why Graphiti loses numeric values
Graphiti (getzep/graphiti v0.28.2) uses an LLM to extract entities and relationships from each ingested episode. The extracted entities and their edge fact strings are what get returned on search — not the original verbatim text.
During extraction, the LLM is asked to identify subjects, predicates, and objects in natural language. Numeric configuration values — which lack semantic subject-predicate structure — are frequently absorbed into vague predicates or dropped entirely.
The benchmark searched with per-fact group_id isolation, meaning each query was scoped to the exact namespace where that fact was written. The issue is not retrieval routing — it is that the extractable fact content itself no longer contains the original numeric value.
Example fact as written: [expirySeconds=3600, issuer=myapp.prod, keyFile=/etc/secrets/jwt-public.pem]
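The effect on scoring can be shown with the two strings quoted in this write-up: the original statement and the kind of edge that extraction produces. This is a small sketch of the failure mode, not Graphiti's code; `lost_numbers` is a hypothetical helper.

```python
import re

original = "JWT expiry is 3600 seconds"
extracted_edge = "JWT token expiry is issued by myapp.prod"  # edge quoted in the text


def lost_numbers(source: str, returned: str) -> list[str]:
    """Numeric tokens present in the source text but absent from the returned context."""
    return [n for n in re.findall(r"\d+(?:\.\d+)?", source) if n not in returned]


expected = "3600"
print(expected in original)        # True: verbatim text would pass the check
print(expected in extracted_edge)  # False: the rephrased edge fails it
print(lost_numbers(original, extracted_edge))  # ['3600']
```

Scoping the search correctly does not help here, because the only content available in the namespace no longer contains the expected substring.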
Graphiti is well-suited for narrative and relationship-heavy knowledge where semantic extraction adds value. It is not well-suited for verbatim config recall where exact numeric values must survive the write-read round trip.
Key findings
Iranti and Shodh achieve perfect recall on exact-match factual questions.
Mem0 misses 8 questions across four facts (F03, F11, F14, F17), concentrated in the HIGH and LOW risk tiers.
Graphiti extracts entities and relationships via LLM but rephrases numeric config values, causing a 43% miss rate (17 of 40 questions).
Cognee could not be tested: it is incompatible with Python 3.14 (requires <3.14).