Conversational Episodic Memory
10/10 both arms. Null result — expected at this scale.
A 51-turn synthetic meeting transcript embeds 12 facts across the conversation. Both the context-reading baseline and Iranti-assisted arm score 10/10. The null result is correct — 5,500 tokens is well within Claude's context window. B7 establishes the methodology and write-and-retrieve pattern. The discriminative test requires 50,000–200,000 tokens.
Results at a glance
What this measures
Long-running agents accumulate facts across many turns: dates set in early meetings, thresholds established mid-conversation, names and identifiers introduced then referred to later. Episodic recall is the ability to retrieve a specific fact from its point of origin in a long conversation — not by scanning the whole transcript, but by querying a structured store.
B7 simulates a 51-turn project meeting. Twelve facts are embedded across the transcript: project dates, milestone numbers, a ROUGE-L threshold, a stakeholder name, and percentages. Ten probe questions are drawn from those facts at the end of the session. The question is whether an agent with Iranti — writing facts to its KB as they appear — recalls them more reliably than an agent that re-reads the entire transcript each time.
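The write-and-retrieve pattern can be sketched as follows. This is a minimal illustration, not Iranti's actual API — the `KB` class, its method names, and the fact keys are all hypothetical stand-ins:

```python
class KB:
    """Toy knowledge base: facts stored under distinct keys at write time."""
    def __init__(self):
        self._facts = {}

    def write(self, key, value):
        # Store the fact at the moment it is stated in the transcript.
        self._facts[key] = value

    def lookup(self, key):
        # Exact-match retrieval: no transcript scan, cost independent of length.
        return self._facts[key]

kb = KB()
# Facts written as they appear mid-conversation (keys are illustrative):
kb.write("rouge_l_threshold", "0.65")
kb.write("milestone_m2_date", "April 12")

# Probe question at the end of the session:
answer = kb.lookup("milestone_m2_date")
```

The point of the pattern is that retrieval depends only on the key, not on how many turns separate the fact's origin from the probe.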
At 5,500 tokens, there is no difference. That is the right answer. The benchmark is not designed to show an advantage at this scale — it is designed to confirm that the infrastructure works before scaling to longer transcripts where the advantage becomes real.
B7 is a methodology-establishment benchmark. Read the scale context section below before drawing conclusions about Iranti's performance in production episodic-recall scenarios.
Session timeline
Twelve facts are embedded across 51 turns. The Iranti arm writes each fact to its KB at the moment it is stated (colored dots). The baseline arm reads the full transcript at the end. At this length, both strategies are equally effective.
Scale context
5,500 tokens is 2.75% of Claude's 200,000-token context window. Context-reading works perfectly at this length. The interesting comparison begins where context-reading degrades.
The teal region (left edge) represents B7's 5,500-token transcript — 2.75% of Claude's 200k context window. At this scale, nothing requires Iranti. The discriminative test begins where context-reading becomes probabilistic about precision among similar values — around 100k–200k tokens.
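The scale figure above is simple arithmetic and can be reproduced directly:

```python
transcript_tokens = 5_500
context_window = 200_000  # Claude's context window

fraction = transcript_tokens / context_window
print(f"{fraction:.2%}")  # 2.75%
```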
The four-date cluster
Turns 47–50 establish four project dates in quick succession — a deliberate design for the scale-up benchmark. At 5,500 tokens all four are trivially distinguishable. At 100k+ tokens with intervening content, they become the critical test of precision.
At 5,500 tokens, all four dates are trivially recallable — they appear in the final four turns of a short transcript. At 100k+ tokens with intervening content, context-reading becomes less reliable at distinguishing similar values such as April 10, 12, 19, and 22. Iranti's exact-match lookup is deterministic regardless of scale — each date is stored under a distinct key and retrieved exactly as written.
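Concretely, the cluster's resistance to similar-value confusion comes from keying each date separately. The key names below are hypothetical; the dates are the four from the cluster:

```python
# Four similar values stored under four distinct keys.
dates = {
    "kickoff_date": "April 10",
    "design_review_date": "April 12",
    "code_freeze_date": "April 19",
    "launch_date": "April 22",
}

# Exact-match lookup cannot blur similar values: the key selects the fact,
# regardless of how much intervening content the transcript contains.
answer = dates["code_freeze_date"]
```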
What B7 actually proves
What B7 cannot prove
What's next — scale-up plan
B7 establishes the methodology. The discriminative benchmarks are B7-50k and B7-200k, which will test whether Iranti's exact-match lookup maintains accuracy where context-reading degrades.
- ~50,000-token transcript (~500 turns)
- Same 12 fact types, denser embedding
- Four-date cluster test under intervening content
- Full context-window stress test
- Cross-session recall variant (facts from earlier sessions)
- Precision under similar-value ambiguity
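One way to construct the scaled transcript — a sketch under assumptions, not the actual B7-50k generator — is to embed the same 12 fact turns at random positions among filler turns:

```python
import random

def build_transcript(fact_turns, total_turns=500, seed=0):
    """Scatter fact turns across a longer transcript padded with filler."""
    rng = random.Random(seed)  # fixed seed so trials are reproducible
    positions = sorted(rng.sample(range(total_turns), len(fact_turns)))
    transcript = ["(filler discussion)"] * total_turns
    for pos, fact in zip(positions, fact_turns):
        transcript[pos] = fact
    return transcript

# 12 placeholder facts standing in for dates, thresholds, names, percentages.
facts = [f"Fact {i}: ..." for i in range(12)]
transcript = build_transcript(facts)
```

A denser-embedding variant would shrink the gaps between fact positions rather than sampling them uniformly.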
Honest limitations
Full trial execution records, the transcript, probe questions, scoring notes, and methodology details are in the benchmarking repository.