Benchmark B1

Entity Fact Retrieval under Distractor Density

Does Iranti's exact entity/key lookup maintain precision as the distractor entity count grows, a regime where context-reading is expected to degrade? This benchmark is a Needle-in-a-Haystack variant adapted for structured fact retrieval.

Executed 2026-03-20 / 2026-03-21 · NIAH variant (structured key lookup) · small-n, indicative only · hypothesis: precision holds at scale

Results at a glance

30/30 · Baseline across N=5, N=20, and adversarial
8/8 · Iranti retrieval arm (prior-session KB data)
8/8 · Iranti write arm (full ingest → retrieve)
4/4 · Iranti arm at N=5000 (~276k tok), the first positive differential
Finding: At N=5000 (~276k tokens), the baseline is infeasible: the registry document exceeds the 200k context window and cannot be read. Iranti returned all 4 target facts correctly via exact key lookup. This is the first positive differential in the benchmark program: Iranti 4/4, baseline 0/4.

Accuracy vs scale

Both arms at ceiling through N=1000; baseline infeasible at N=5000

Both arms at ceiling through N=1000. At N=5000 (~276k tokens), the baseline becomes infeasible: the haystack document exceeds Claude's 200k context window, so baseline context-reading can no longer run. Iranti returned 4/4 facts via exact key lookup.

| Condition | Haystack size | Baseline (context-reading) | Iranti (exact key lookup) |
|---|---|---|---|
| N=5 | ~400 tok | 10/10 | not run |
| N=20 | ~1.6k tok | 10/10 | not run |
| N=20+adversarial | ~1.6k tok | 10/10 | not run |
| N=100 | ~8k tok | 10/10 | 8/8 |
| N=500 | ~28k tok | 10/10 | 10/10 |
| N=1000 | ~57k tok | 10/10 | 10/10 |
| N=5000 | ~276k tok | infeasible (276k tok > 200k ctx) | 4/4 |

⚑ N=20+adversarial: wrong facts injected for needle entities. Baseline unaffected at this scale.

Methodology

What this measures

In production multi-agent systems, agents accumulate many facts about many entities across their lifetime. If those facts live in context, retrieval precision degrades as transcript length grows — the model has more to sort through, with more surface area for confusion. If facts live in a structured KB, retrieval is an O(1) exact lookup unaffected by how many other entities exist.

B1 tests this directly: we embed two target entities ("needle" entities) among N distractor entities, ask 10 questions about the needle entities, and compare context-reading accuracy against Iranti's iranti_query(entity, key) lookup.
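The two arms can be sketched in a few lines of Python. The dict-backed KB and the one-fact-per-line haystack format below are illustrative stand-ins, not Iranti's actual storage or the real haystack prose; `iranti_query` is the real tool being modeled by `kb_lookup`.

```python
# Sketch of the two B1 arms. A plain dict stands in for the Iranti KB;
# needle values are taken from the test-data section below.

NEEDLES = {
    ("researcher/alice_chen", "affiliation"): "MIT Computer Science",
    ("researcher/bob_okafor", "publication_count"): "23",
}

def build_haystack(n_distractors):
    """N distractor entities plus the needle facts, one fact per line."""
    lines = [f"researcher/distractor_{i}.affiliation=Lab {i}"
             for i in range(n_distractors)]
    lines += [f"{e}.{k}={v}" for (e, k), v in NEEDLES.items()]
    return "\n".join(lines)

def context_lookup(haystack, entity, key):
    """Baseline stand-in: scan the whole document for the fact."""
    marker = f"{entity}.{key}="
    for line in haystack.splitlines():
        if line.startswith(marker):
            return line[len(marker):]
    return None  # needle lost in the haystack

def kb_lookup(kb, entity, key):
    """Iranti stand-in: exact entity/key lookup, independent of N."""
    return kb.get((entity, key))
```

The point of the comparison is structural: `context_lookup` must process the entire haystack, while `kb_lookup` touches one key regardless of how many distractors exist.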

Inspired by Greg Kamradt's Needle-in-a-Haystack (2023) and RULER (Hsieh et al., 2024). Our adaptation replaces the sentence needle with a structured entity/key fact and replaces document-position variation with entity-count variation.

Conditions

Tested scales

| Condition | Haystack size | Notes |
|---|---|---|
| N=5 | ~400 tok | Short haystack; high signal-to-noise |
| N=20 | ~1.6k tok | Medium haystack |
| N=20+adversarial | ~1.6k tok | Wrong values injected for needle entities to test confound resistance |
| N=100 | ~8k tok | Larger haystack; both arms still at ceiling |
| N=500 | ~28k tok | Long context; both arms still at ceiling |
| N=1000 | ~57k tok | Null differential confirmed; both arms at ceiling through ~57k tokens |
| N=5000 | ~276k tok | First positive differential: Iranti 4/4, baseline infeasible (document exceeds 200k context window) |
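As a minimal sketch of the feasibility boundary these conditions probe (token counts are the measured haystack sizes above; the 200k window is the Claude limit cited in this report):

```python
# Why the N=5000 arm is infeasible for the baseline: the haystack must
# fit in context for context-reading to run at all.

CONTEXT_WINDOW = 200_000  # tokens, Claude's limit as assumed in this report

haystack_tokens = {5: 400, 20: 1_600, 100: 8_000,
                   500: 28_000, 1_000: 57_000, 5_000: 276_000}

def baseline_feasible(n):
    """The context-reading arm can run only if the whole haystack fits."""
    return haystack_tokens[n] <= CONTEXT_WINDOW

infeasible = [n for n in sorted(haystack_tokens) if not baseline_feasible(n)]
# Only N=5000 (~276k tokens) exceeds the window.
```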

Test data

The needle entities

Two fictional researchers with four facts each. Embedded at a fixed position in every haystack.

researcher/alice_chen
- affiliation: MIT Computer Science
- publication_count: 47
- previous_employer: OpenAI (2018–2021)
- research_focus: natural language processing

researcher/bob_okafor
- affiliation: Stanford AI Lab
- publication_count: 23
- previous_employer: DeepMind (2020–2023)
- research_focus: computer vision
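The needle fixture above, written out as a Python dict. The shape is illustrative only; in the benchmark these facts live in the KB via `iranti_write`.

```python
# The two needle entities and their four facts each, as test fixtures.

NEEDLE_ENTITIES = {
    "researcher/alice_chen": {
        "affiliation": "MIT Computer Science",
        "publication_count": 47,
        "previous_employer": "OpenAI (2018–2021)",
        "research_focus": "natural language processing",
    },
    "researcher/bob_okafor": {
        "affiliation": "Stanford AI Lab",
        "publication_count": 23,
        "previous_employer": "DeepMind (2020–2023)",
        "research_focus": "computer vision",
    },
}
```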

Write arm trial results

Full ingest → retrieve cycle

Both entities written to KB via iranti_write, then retrieved via iranti_query with no haystack document in context. Tests the full pipeline — not just retrieval from pre-existing KB data. Each cell below represents one trial.

| Entity | affiliation | publication_count | previous_employer | research_focus |
|---|---|---|---|---|
| researcher/alice_chen | ✓ | ✓ | ✓ | ✓ |
| researcher/bob_okafor | ✓ | ✓ | ✓ | ✓ |

All 8 cells confirmed correct. Zero hallucinations. Zero cross-entity contamination (alice facts never attributed to bob, and vice versa).

| Trial | Entity | Key | Result |
|---|---|---|---|
| IB1 | researcher/alice_chen | affiliation | ✓ correct |
| IB2 | researcher/alice_chen | publication_count | ✓ correct |
| IB3 | researcher/alice_chen | previous_employer | ✓ correct |
| IB4 | researcher/alice_chen | research_focus | ✓ correct |
| IB5 | researcher/bob_okafor | affiliation | ✓ correct |
| IB6 | researcher/bob_okafor | publication_count | ✓ correct |
| IB7 | researcher/bob_okafor | previous_employer | ✓ correct |
| IB8 | researcher/bob_okafor | research_focus | ✓ correct |

Write arm accuracy: 8/8 (100%). Recall: 8/8. Precision: 8/8.
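The write-arm protocol can be sketched as follows. A plain dict stands in for the Iranti KB, and `iranti_write`/`iranti_query` are represented by dict writes and reads; only the ingest-then-retrieve shape is the real protocol.

```python
# Sketch of the write-arm trials IB1-IB8: ingest every fact, then
# retrieve each one with no haystack in context, scoring exact matches.

FACTS = {
    ("researcher/alice_chen", "affiliation"): "MIT Computer Science",
    ("researcher/alice_chen", "publication_count"): "47",
    ("researcher/alice_chen", "previous_employer"): "OpenAI (2018–2021)",
    ("researcher/alice_chen", "research_focus"): "natural language processing",
    ("researcher/bob_okafor", "affiliation"): "Stanford AI Lab",
    ("researcher/bob_okafor", "publication_count"): "23",
    ("researcher/bob_okafor", "previous_employer"): "DeepMind (2020–2023)",
    ("researcher/bob_okafor", "research_focus"): "computer vision",
}

kb = {}
for (entity, key), value in FACTS.items():   # stands in for iranti_write
    kb[(entity, key)] = value

retrieved = {ek: kb.get(ek) for ek in FACTS}  # stands in for iranti_query
correct = sum(retrieved[ek] == v for ek, v in FACTS.items())
accuracy = correct / len(FACTS)
```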

The metadata advantage

Provenance, not just value

Every iranti_query result includes confidence, source, timestamp, and contested status. Context-reading returns a value and nothing else. The metadata is what allows downstream agents to reason about reliability — without needing to re-read the source document.

iranti_query(entity="researcher/alice_chen", key="affiliation")

{
  "found": true,
  "entity": "researcher/alice_chen",
  "key": "affiliation",
  "value": "MIT Computer Science",
  "confidence": 0.98,
  "source": "agent/site_main",
  "validFrom": "2026-03-20T14:30:00Z",
  "contested": false
}
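One way a downstream agent might act on these provenance fields is a confidence/contested gate. The 0.9 threshold and the `trustworthy` helper below are illustrative, not part of Iranti.

```python
# Sketch: gating on the provenance metadata an iranti_query result carries.

result = {
    "found": True,
    "entity": "researcher/alice_chen",
    "key": "affiliation",
    "value": "MIT Computer Science",
    "confidence": 0.98,
    "source": "agent/site_main",
    "validFrom": "2026-03-20T14:30:00Z",
    "contested": False,
}

def trustworthy(r, min_confidence=0.9):
    """Accept a fact only if found, uncontested, and confident enough."""
    return r["found"] and not r["contested"] and r["confidence"] >= min_confidence

value = result["value"] if trustworthy(result) else None
```

Context-reading offers no equivalent gate: a value read out of a transcript carries no machine-checkable reliability signal.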

Analysis

Key findings and limitations

Finding: Iranti retrieval is O(1) and deterministic. The same query returns the same result regardless of how many other entities exist in the KB. Baseline context-reading should degrade at scale in principle, but we have not yet found that degradation threshold with Claude Sonnet 4.6.

Finding: Semantic search is not a substitute for exact query. A side test comparing iranti_search (semantic) against iranti_query (exact) found that the semantic path returned the target fact in fifth position, not first. For structured fact retrieval, exact key lookup is the correct tool.

Finding: Retrieved facts include provenance. Every iranti_query result carries confidence, source, validFrom, and contested fields. Context-reading returns none of this metadata.

Finding: First positive differential at N=5000. At ~276k tokens, the haystack document exceeds the 200k context window, so baseline context-reading is infeasible. Iranti returned 4/4 facts correctly via exact key lookup. This is the condition the program was built to find: Iranti 4/4, baseline 0/4.

Limitation: Infeasible is not the same as degraded. The baseline scored 0/4 because the test could not be attempted, not because the model failed to find the needle. The practical outcome is the same (no answer), but the failure modes are technically distinct. Either way, context-reading cannot serve a 5,000-entity knowledge base; Iranti can.

Limitation: Self-evaluation bias. The same model (Claude Sonnet 4.6) designed the test, ran the baseline arm, and evaluated answers. Self-consistency effects may inflate baseline scores. Independent evaluation infrastructure is required for publication-grade claims.

Limitation: Small-n caution. The largest arm uses 10 trials. Sample sizes are too small to support statistical claims; all results are directional, appropriate for development feedback, not benchmarking papers.
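The exact-vs-semantic finding can be illustrated with a toy ranked list. The candidate facts and similarity scores below are invented for illustration; only the rank-5 outcome mirrors the side test that compared iranti_search with iranti_query.

```python
# Toy illustration: semantic search returns a ranked list in which the
# target fact may sit below the top, while exact key lookup is
# deterministic and always returns the target directly.

kb = {("researcher/alice_chen", "affiliation"): "MIT Computer Science"}

semantic_results = [  # (candidate fact, similarity score), sorted by score
    ("alice_chen collaborates with MIT Media Lab", 0.81),
    ("alice_chen research_focus: natural language processing", 0.79),
    ("bob_okafor affiliation: Stanford AI Lab", 0.77),
    ("alice_chen previous_employer: OpenAI", 0.74),
    ("alice_chen affiliation: MIT Computer Science", 0.72),
]

target_rank = 1 + next(i for i, (fact, _) in enumerate(semantic_results)
                       if "affiliation: MIT" in fact)

exact_hit = kb[("researcher/alice_chen", "affiliation")]  # one lookup, no ranking
```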
Raw data

Full trial execution records, baseline runs, dataset definitions, and statistical notes are in the benchmarking repository.