Benchmark B7

Conversational Episodic Memory
10/10 in both arms. A null result, expected at this scale.

A 51-turn synthetic meeting transcript embeds 12 facts across the conversation. Both the context-reading baseline and Iranti-assisted arm score 10/10. The null result is correct — 5,500 tokens is well within Claude's context window. B7 establishes the methodology and write-and-retrieve pattern. The discriminative test requires 50,000–200,000 tokens.

Executed 2026-03-21 · 51-turn transcript · ~5,500 tokens · Methodology establishment

Results at a glance

10/10 · Context-reading baseline (expected ceiling at this length)
10/10 · Iranti-assisted (write-and-retrieve pattern confirmed)
Null · Differential (methodology test, not discriminative)
Note: Both arms answered all 10 probe questions correctly. The null result is expected and honest: 5,500 tokens (~2.75% of Claude's 200k context window) is trivially recallable without memory infrastructure. B7 proves the pattern works; it does not yet prove a performance advantage.

What this measures

Long-running agents accumulate facts across many turns: dates set in early meetings, thresholds established mid-conversation, names and identifiers introduced then referred to later. Episodic recall is the ability to retrieve a specific fact from its point of origin in a long conversation — not by scanning the whole transcript, but by querying a structured store.

B7 simulates a 51-turn project meeting. Twelve facts are embedded across the transcript: project dates, milestone numbers, a ROUGE-L threshold, a stakeholder name, and percentages. Ten probe questions are drawn from those facts at the end of the session. The question is whether an agent with Iranti — writing facts to its KB as they appear — recalls them more reliably than an agent that re-reads the entire transcript each time.
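The write-and-retrieve pattern under test can be sketched as follows. The KB interface is hypothetical (a plain dict stands in for Iranti's structured store), and the fact keys are illustrative; the date values come from the transcript's four-date cluster.

```python
# Minimal sketch of the write-and-retrieve pattern.
# Hypothetical KB interface: a plain dict stands in for Iranti's store.

kb = {}

def write_fact(key, value):
    """Write a fact to the KB at the turn where it is stated."""
    kb[key] = value

def retrieve_fact(key):
    """Exact-match lookup at probe time; no transcript re-read required."""
    return kb.get(key)

# Facts written as they appear across the transcript
# (keys are illustrative; values are from B7's four-date cluster)
write_fact("phase1_deployment", "April 12")
write_fact("stakeholder_demo", "April 22")

# Probe questions at the end of the session
assert retrieve_fact("phase1_deployment") == "April 12"
assert retrieve_fact("stakeholder_demo") == "April 22"
```

The baseline arm, by contrast, answers each probe by re-reading the full transcript, which is equally reliable at 5,500 tokens.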

At 5,500 tokens, there is no difference. That is the right answer. The benchmark is not designed to show an advantage at this scale — it is designed to confirm that the infrastructure works before scaling to longer transcripts where the advantage becomes real.

B7 is a methodology-establishment benchmark. Read the scale context section below before drawing conclusions about Iranti's performance in production episodic-recall scenarios.

Session timeline

Twelve facts are embedded across 51 turns. The Iranti arm writes each fact to its KB at the moment it is stated (colored dots). The baseline arm reads the full transcript at the end. At this length, both strategies are equally effective.

[Timeline figure, turns 1–51: a date fact at T3, number facts at T12 and T22, a percentage fact at T31, name facts at T40 and T45; the four-date cluster Apr 10 (T47), Apr 12 (T48), Apr 19 (T49), Apr 22 (T50); 10 probe questions at the end. Iranti arm writes to the KB as facts are stated; baseline arm reads the entire transcript at the end.]

Scale context

5,500 tokens is 2.75% of Claude's 200,000-token context window. Context-reading works perfectly at this length. The interesting comparison begins where context-reading degrades.

[Scale bar: B7's ~5,500-token transcript sits at the left edge of Claude's 200,000-token context window; degradation is expected from roughly 100k tokens.]

The teal region (left edge) represents B7's 5,500-token transcript — 2.75% of Claude's 200k context window. At this scale, nothing requires Iranti. The discriminative test begins where context-reading becomes probabilistic about precision among similar values — around 100k–200k tokens.
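The scale figure quoted above is simple arithmetic:

```python
# Verify the quoted scale figure: transcript size as a share of the window
context_window = 200_000   # Claude's context window, in tokens
transcript_tokens = 5_500  # B7's transcript length

fraction = transcript_tokens / context_window
print(f"{fraction:.2%}")  # → 2.75%
```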

The four-date cluster

Turns 47–50 establish four project dates in quick succession, a deliberate design choice made with the scale-up benchmarks in mind. At 5,500 tokens all four are trivially distinguishable. At 100k+ tokens with intervening content, they become the critical test of precision.

Turn 47 · April 10 · Checkpoint freeze
Turn 48 · April 12 · Phase 1 deployment
Turn 49 · April 19 · Deployment deadline
Turn 50 · April 22 · Stakeholder demo

Why this cluster matters at scale

At 5,500 tokens, all four dates are trivially recallable — they appear in the final four turns of a short transcript. At 100k+ tokens with intervening content, context-reading becomes probabilistic about precision among similar values like April 10, 12, 19, and 22. Iranti's exact-match lookup is deterministic regardless of scale: each date is stored under a distinct key and retrieved exactly as written.
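As a sketch of why key-based lookup sidesteps similar-value confusion (the keys below are hypothetical; the dates and turns are from the cluster above):

```python
# The four-date cluster stored under distinct keys (keys are illustrative)
dates = {
    "checkpoint_freeze":   "April 10",  # turn 47
    "phase1_deployment":   "April 12",  # turn 48
    "deployment_deadline": "April 19",  # turn 49
    "stakeholder_demo":    "April 22",  # turn 50
}

# Retrieval resolves a key, not a fuzzy match over the transcript, so the
# similar values (10, 12, 19, 22) never compete with each other.
assert dates["deployment_deadline"] == "April 19"
assert dates["checkpoint_freeze"] == "April 10"
```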

What B7 actually proves

Finding: Write-and-retrieve pattern works. The Iranti arm correctly wrote 12 facts to its KB as they appeared in the transcript and retrieved the right values at probe time. The pattern is functional and ready to scale.
Finding: Evaluation infrastructure is functional. The 10-probe scoring system, synthetic transcript generation, and two-arm comparison methodology produced clean, interpretable results. This infrastructure will support B7-scale-up with 50k and 200k token transcripts.
Finding: Context-reading works at this length. The baseline arm's 10/10 score is the expected result. This is not a failure of the benchmark — it is the correct control measurement confirming the evaluation is calibrated.
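A minimal version of the two-arm probe scoring could look like the sketch below. The probe answers and expected values are placeholders, not B7's actual probe set.

```python
def score_arm(answers, expected):
    """Count probes answered exactly correctly (out of len(expected))."""
    return sum(a == e for a, e in zip(answers, expected))

# Placeholder probes; in B7 both arms matched all 10 expected answers
expected = [f"fact-{i}" for i in range(10)]
baseline_answers = list(expected)
iranti_answers = list(expected)

assert score_arm(baseline_answers, expected) == 10
assert score_arm(iranti_answers, expected) == 10
# Differential: 0 — the null result reported above
```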

What B7 cannot prove

Limitation: Not discriminative at this scale. A null result with both arms at 10/10 tells you nothing about Iranti's recall advantage. The advantage only becomes measurable when context-reading begins to fail — which does not happen at 5,500 tokens.
Limitation: Not comparable to B1/B2. B1 and B2 test cross-session recall — facts stored in one session retrieved in a later session. B7 tests within-session episodic recall. These are different problems. B7's null result says nothing about cross-session performance.
Limitation: Not a production recall signal. Production episodic recall involves much longer transcripts, cross-session gaps, and retrieval under ambiguity. B7's clean 10/10 result does not predict performance under those conditions.

What's next — scale-up plan

B7 establishes the methodology. The discriminative benchmarks are B7-50k and B7-200k, which will test whether Iranti's exact-match lookup maintains accuracy where context-reading degrades.

B7-50k — planned
  • ~50,000-token transcript (~500 turns)
  • Same 12 fact types, denser embedding
  • Four-date cluster test under intervening content
B7-200k — planned
  • Full context-window stress test
  • Cross-session recall variant (facts from earlier sessions)
  • Precision under similar-value ambiguity
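One way the denser embedding planned for B7-50k might be generated (purely illustrative; this is not the benchmark's actual generator, and the phrasing template is an assumption):

```python
import random

def embed_facts(n_turns, facts, seed=0):
    """Scatter (key, value) facts across a synthetic transcript of
    n_turns filler turns; returns the list of turn strings."""
    rng = random.Random(seed)
    turns = [f"Turn {i}: filler project discussion."
             for i in range(1, n_turns + 1)]
    positions = sorted(rng.sample(range(n_turns), len(facts)))
    for pos, (key, value) in zip(positions, facts):
        turns[pos] += f" For the record, {key} is {value}."
    return turns

# ~500 turns, as sketched for B7-50k
transcript = embed_facts(500, [("phase1_deployment", "April 12"),
                               ("stakeholder_demo", "April 22")])
assert len(transcript) == 500
assert sum("For the record" in t for t in transcript) == 2
```

Seeding the generator keeps fact positions reproducible across trial runs, which matters when comparing the two arms on identical transcripts.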

Honest limitations

Limitation: Synthetic transcript. The 51-turn meeting was generated, not drawn from real project conversations. Synthetic transcripts may embed facts more clearly than real meetings, where facts are stated ambiguously, contradicted, or revised. Real-world episodic recall is harder.
Limitation: Self-evaluation. The probe questions were answered and scored within the same evaluation framework that generated the benchmark. Independent evaluation by a separate system or human reviewer was not performed.
Limitation: Moderate length only. 5,500 tokens does not stress-test any recall mechanism. Both context-reading and Iranti should score 10/10 here. Conclusions about episodic memory advantages cannot be drawn from this benchmark alone.
Note: n=10 probe questions. Ten probes drawn from twelve embedded facts provides a reasonable methodology check but is not a large enough sample to characterize recall reliability statistically. Scale-up benchmarks will use larger probe sets.

Raw data

Full trial execution records, transcript, probe questions, scoring notes, and methodology details in the benchmarking repository.