C3 — Conflict Resolution

Two systems scored 100%.
One of them is wrong.

Iranti and Shodh both pass this benchmark — because the scoring counts any response containing the updated value as correct. But Shodh returns the old value too. On every single query. Your agent receives contradictory context and has to guess which value is authoritative.

Iranti: 100% (9 v2 · 1 both)
Shodh: 100%* (10 both)
Mem0: 80% (7 v2 · 1 both · 2 miss)
Graphiti: 40% (2 v2 · 2 both · 2 stale · 4 miss)

The silent failure

What your agent actually receives from Shodh

You've just updated the API write rate limit from 60 to 100 requests per minute. You write the updated fact to your memory system. Later, your agent needs to enforce the limit and queries for the current value.

In Iranti, the write deterministically replaced the old value. The agent gets back 100 rpm. Done.

In Shodh, both facts are stored. The agent gets back both — and nothing in the response signals which one is current. It may apply the wrong limit, log a misleading rate, or produce non-deterministic behavior depending on which value the LLM picks from the context.

This happened on all 10 conflict pairs in the test — not as an edge case, but as the consistent behavior. Shodh accumulates; it does not replace.

Iranti, injected context:

    entity: api/rate-limits
    key: writeRpm
    value: 100
    confidence: 0.95

→ Agent enforces 100 rpm. Correct.

Shodh, injected context:

    memory 1: API write rate limit is 60 requests per minute per key.
    memory 2: API write rate limit increased to 100 requests per minute per key.

→ Agent sees both. No signal about which is current.

Test design

10 fact pairs, each covering a real-world configuration update: budget approvals, rate limit changes, timeout extensions, capacity scaling, compliance-driven policy changes. The values are structurally simple — one numeric field changes — so there is no ambiguity about what the correct answer is.

Write sequence: v1 is written first, v2 is written second, same namespace. For Graphiti, v1 is timestamped one hour before v2 to give temporal ordering context. All namespaces are isolated per conflict pair — same isolation as C1.
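The write sequence can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `ConflictPair`, `RecordingClient`, `write_pair`, the `run_id` value, and the namespace format are all hypothetical names, and a real run would replace `RecordingClient` with an adapter for each memory system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


@dataclass
class ConflictPair:
    pair_id: str   # e.g. "C02"
    v1_text: str   # original fact
    v2_text: str   # updated fact


@dataclass
class RecordingClient:
    """Hypothetical stand-in for a memory-system adapter; it only logs writes."""
    writes: List[Tuple[str, str, Optional[datetime]]] = field(default_factory=list)

    def write(self, namespace: str, text: str,
              timestamp: Optional[datetime] = None) -> None:
        self.writes.append((namespace, text, timestamp))


def write_pair(client, pair: ConflictPair, run_id: str,
               backdate_v1: bool = False,
               now: Optional[datetime] = None) -> str:
    """Write v1 then v2 into a namespace used by this pair only."""
    namespace = f"{run_id}/{pair.pair_id}"  # one namespace per conflict pair
    now = now or datetime.now(timezone.utc)
    # Explicit timestamps are passed only when requested (Graphiti in this test).
    t1 = now - timedelta(hours=1) if backdate_v1 else None
    t2 = now if backdate_v1 else None
    client.write(namespace, pair.v1_text, timestamp=t1)  # v1 first
    client.write(namespace, pair.v2_text, timestamp=t2)  # v2 second
    return namespace


pair = ConflictPair(
    "C02",
    "API write rate limit is 60 requests per minute per key.",
    "API write rate limit increased to 100 requests per minute per key.",
)
client = RecordingClient()
write_pair(client, pair, run_id="c3-demo", backdate_v1=True)
```

After the call, `client.writes` holds two entries in the same namespace, v1 first, with timestamps exactly one hour apart.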

Scoring is lenient: any response containing v2 counts as a pass. This is why Shodh scores 100% — the correct value is present. The "both" verdict is the footnote that makes the score misleading.

Verdict definitions

v2 ✓

Response contains the updated value only. Clean replacement — no ambiguity for the caller.

both

Response contains both old and new values. Passes the benchmark. Fails in production — caller must guess which is authoritative.

stale

Response contains only the outdated value. Fails the benchmark and returns wrong information.

miss

Response contains neither value. System failed to retrieve any relevant context.
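The four verdicts and the lenient pass rule can be expressed as a small classifier. This is a sketch under assumptions: the benchmark's real matching logic is not shown in this write-up, so `contains_value`, `classify`, and `passes_lenient` are hypothetical names, and the regex guard (so that "10" does not match inside "100") is one possible way to compare values.

```python
import re


def contains_value(response: str, value: str) -> bool:
    """True if `value` appears as a standalone token, not inside a longer number."""
    pattern = r"(?<![\w.])" + re.escape(value) + r"(?!\w)"
    return re.search(pattern, response) is not None


def classify(response: str, v1: str, v2: str) -> str:
    """Map a response to one of the four verdicts defined above."""
    has_v1 = contains_value(response, v1)
    has_v2 = contains_value(response, v2)
    if has_v1 and has_v2:
        return "both"
    if has_v2:
        return "v2"
    if has_v1:
        return "stale"
    return "miss"


def passes_lenient(verdict: str) -> bool:
    """Lenient scoring: any response containing v2 is a pass."""
    return verdict in ("v2", "both")


iranti_response = "entity: api/rate-limits key: writeRpm value: 100"
shodh_response = (
    "API write rate limit is 60 requests per minute per key.\n"
    "API write rate limit increased to 100 requests per minute per key."
)

iranti_verdict = classify(iranti_response, "60", "100")
shodh_verdict = classify(shodh_response, "60", "100")
```

Here `iranti_verdict` is `"v2"` and `shodh_verdict` is `"both"`, yet `passes_lenient` returns `True` for each, which is how two different behaviors end up with the same score.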

Per-conflict results

v1 → v2 for each pair. Correct answer is always v2. Note Shodh's column: 10/10 "both" is not the same as 10/10 "v2 only."

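| ID | Change | v1 | v2 | Iranti | Shodh | Mem0 | Graphiti |
| --- | --- | --- | --- | --- | --- | --- | --- |
| C01 | Project budget | $50,000 | $75,000 | v2 ✓ | both | v2 ✓ | both |
| C02 | API write rate limit | 60 rpm | 100 rpm | v2 ✓ | both | v2 ✓ | v2 ✓ |
| C03 | Max file upload size | 10 MB | 25 MB | v2 ✓ | both | v2 ✓ | miss |
| C04 | Redis cache TTL | 900s | 1800s | v2 ✓ | both | v2 ✓ | stale |
| C05 | JWT token expiry | 3600s | 7200s | v2 ✓ | both | v2 ✓ | miss |
| C06 | Background workers | 4 procs | 8 procs | v2 ✓ | both | miss | miss |
| C07 | Log rotation | 7 days | 14 days | v2 ✓ | both | miss | stale |
| C08 | PostgreSQL max connections | 20 | 50 | v2 ✓ | both | v2 ✓ | miss |
| C09 | Webhook max retries | 3 | 5 | v2 ✓ | both | v2 ✓ | miss |
| C10 | Webhook timeout | 15000ms | 30000ms | both | both | v2 ✓ | v2 ✓ |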

Why each system behaves this way

Iranti: 9 v2-only · 1 both

Iranti uses entity+key addressing. Writing v2 to the same entity and key as v1 deterministically overwrites the stored value at the storage level — there is no accumulation by design. 9 of 10 pairs return v2-only.

The one "both" (C10: webhook timeout) is the known B5 regression: conservative LLM arbitration on a close-confidence update treated v2 as a challenger rather than a replacement, accumulating instead of overwriting. Direct writes (same entity+key, same source) are unaffected — this is an LLM arbitration edge case only.
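The direct-write path described above behaves like a dictionary keyed on (entity, key). The sketch below illustrates that overwrite semantics only; it is not Iranti's implementation, `KeyedStore` is a hypothetical name, and it deliberately leaves out the LLM arbitration path behind the C10 result.

```python
from typing import Dict, Optional, Tuple


class KeyedStore:
    """Toy store: one value per (entity, key); a second write replaces the first."""

    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], str] = {}

    def write(self, entity: str, key: str, value: str) -> None:
        # Same address, new value: the old value is gone after this line.
        self._facts[(entity, key)] = value

    def read(self, entity: str, key: str) -> Optional[str]:
        return self._facts.get((entity, key))


keyed = KeyedStore()
keyed.write("api/rate-limits", "writeRpm", "60")   # v1
keyed.write("api/rate-limits", "writeRpm", "100")  # v2 overwrites v1
current = keyed.read("api/rate-limits", "writeRpm")
```

Because the address is the identity of the fact, there is nothing left for the caller to disambiguate: `current` holds only the v2 value.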

Shodh: 10 both · 0 v2-only

Shodh is an accumulative memory system. It does not replace facts — it appends them. A second write of the same information creates a second memory record alongside the first. Recall returns all matching records, regardless of recency.

This behavior is consistent and predictable — it is not a bug. But it means the caller is responsible for disambiguation. In an LLM-driven pipeline with no post-processing, the agent receives contradictory values and must choose, with no signal about which was written more recently or which is authoritative.
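Accumulation can be illustrated with an append-only list. This is a toy model, not Shodh's implementation: `AppendOnlyStore` is a hypothetical name, and the keyword match in `recall` stands in for whatever retrieval the real system performs.

```python
from typing import List


class AppendOnlyStore:
    """Toy store: every write adds a record; nothing is ever replaced."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def write(self, text: str) -> None:
        self._records.append(text)

    def recall(self, query: str) -> List[str]:
        # Keyword match as a placeholder for semantic retrieval (assumption).
        terms = query.lower().split()
        return [r for r in self._records
                if all(t in r.lower() for t in terms)]


appended = AppendOnlyStore()
appended.write("API write rate limit is 60 requests per minute per key.")
appended.write("API write rate limit increased to 100 requests per minute per key.")
memories = appended.recall("write rate limit")
```

`memories` contains both records, and nothing in the returned strings marks one as superseding the other, so any ordering or filtering has to happen on the caller's side.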

Mem0: 7 v2-only · 1 both · 2 miss

Mem0 uses semantic deduplication on write. When v2 is semantically similar enough to v1, Mem0 updates the existing record rather than creating a new one — producing clean v2-only returns. This works correctly on 7 of 10 pairs.

The 2 misses (workers, log rotation) returned neither value — the semantic similarity between v1 and v2 was too low to trigger deduplication, but recall also failed to surface either record for those queries. These are retrieval gaps, not conflict handling failures.
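Threshold-based deduplication on write can be sketched as below. Everything here is an assumption made for illustration: Mem0's actual similarity measure and update logic are not described in this write-up, so the sketch swaps in token-level Jaccard similarity, an arbitrary threshold of 0.5, and a hypothetical `DedupStore` class. The two worker sentences are invented wordings, not the benchmark's test data.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a crude stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


class DedupStore:
    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold  # arbitrary value chosen for the example
        self.records: List[str] = []

    def write(self, text: str) -> str:
        best_i, best_sim = None, 0.0
        for i, existing in enumerate(self.records):
            sim = jaccard(existing, text)
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_i is not None and best_sim >= self.threshold:
            self.records[best_i] = text  # similar enough: update in place
            return "updated"
        self.records.append(text)        # too different: stored as a new record
        return "added"


dedup = DedupStore()
r1 = dedup.write("API write rate limit is 60 requests per minute per key.")
r2 = dedup.write("API write rate limit increased to 100 requests per minute per key.")
r3 = dedup.write("Run 4 background worker processes.")
r4 = dedup.write("Job pool scaled to 8 procs.")
```

The rate-limit pair shares most of its tokens, so the second write updates the first record. The two worker sentences share none, so both are stored and the store ends up with three records, which shows how an update phrased differently from the original can slip past deduplication.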

Graphiti: 2 v2 · 2 both · 2 stale · 4 miss

Graphiti was given the best possible setup: v1 timestamped at t−1h, v2 at t−0, so temporal ordering was explicit. Despite this, 2 pairs returned the stale value and 4 returned nothing.

The root cause is entity extraction: when Graphiti's LLM extracts edge facts from v2, numeric values are often rephrased or dropped. If the v2 edge fact no longer contains the updated number, the temporal ordering is irrelevant — the answer was lost at ingestion. See C1 for the full extraction analysis.
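The interaction between temporal ordering and lossy extraction can be shown with a toy model. This is not Graphiti's data model or API: `EdgeFact`, `latest_fact`, and `mentions` are hypothetical names, and the second fact string is an invented example of an extraction that dropped the number.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class EdgeFact:
    fact: str           # text extracted by the LLM at ingestion
    valid_at: datetime  # timestamp supplied with the write


def latest_fact(edges: List[EdgeFact]) -> EdgeFact:
    """Temporal ordering: prefer the most recently valid edge."""
    return max(edges, key=lambda e: e.valid_at)


def mentions(text: str, value: str) -> bool:
    pattern = r"(?<![\w.])" + re.escape(value) + r"(?!\w)"
    return re.search(pattern, text) is not None


now = datetime.now(timezone.utc)
edges = [
    EdgeFact("Redis cache TTL is 900s", now - timedelta(hours=1)),  # v1
    EdgeFact("Redis cache TTL was extended", now),  # v2, number dropped (invented)
]

newest = latest_fact(edges)
v2_recoverable = any(mentions(e.fact, "1800s") for e in edges)
```

Ordering works as intended, since `newest` is the v2 edge, but `v2_recoverable` is `False`: no stored fact contains 1800s, so no amount of temporal reasoning at query time can return it.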

Verdict distribution

Shodh's bar is entirely amber: 100% "both". That is a different result from Iranti's bar, which is 90% teal ("v2 only").

Iranti: v2 only 9/10 · both values 1/10
Shodh: both values 10/10
Mem0: v2 only 7/10 · both values 1/10 · miss 2/10
Graphiti: v2 only 2/10 · both values 2/10 · stale 2/10 · miss 4/10
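The headline scores follow directly from these counts. The short calculation below reproduces them and, for comparison, also computes a stricter rate that counts only "v2 only" responses. The strict rate is an illustration added here, not a metric the benchmark reports, and the names `VERDICT_COUNTS` and `rates` are hypothetical.

```python
from typing import Dict, Tuple

# Verdict counts per system, copied from the distribution above.
VERDICT_COUNTS: Dict[str, Dict[str, int]] = {
    "Iranti": {"v2": 9, "both": 1},
    "Shodh": {"both": 10},
    "Mem0": {"v2": 7, "both": 1, "miss": 2},
    "Graphiti": {"v2": 2, "both": 2, "stale": 2, "miss": 4},
}


def rates(counts: Dict[str, int]) -> Tuple[int, int]:
    """Return (lenient %, strict %) as whole percentages."""
    total = sum(counts.values())
    lenient = counts.get("v2", 0) + counts.get("both", 0)  # benchmark rule
    strict = counts.get("v2", 0)                           # v2 only
    return 100 * lenient // total, 100 * strict // total


summary = {name: rates(c) for name, c in VERDICT_COUNTS.items()}
for name, (lenient, strict) in summary.items():
    print(f"{name}: lenient {lenient}%, strict {strict}%")
```

Under the lenient rule Iranti and Shodh tie at 100%. Under the strict rule they separate to 90% and 0%, while Mem0 and Graphiti drop to 70% and 20%.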

Key findings

01. Iranti uses entity+key addressing: a v2 write deterministically replaces v1 at the same key. 9/10 clean v2-only returns.

02. Shodh technically scores 100% but returns both old and new values on every query, so the caller must disambiguate.

03. Mem0 misses 2 conflicts entirely ("miss" verdict): recall surfaces neither v1 nor v2 on those queries.

04. Graphiti returns the stale v1 value on 2 pairs and misses on 4, so explicit temporal ordering only partially helps.