Benchmark B1

Entity Fact Retrieval under Distractor Density

Does Iranti's exact entity/key lookup maintain precision as the distractor entity count grows, a regime where context-reading is expected to degrade? This benchmark is a Needle-in-a-Haystack variant adapted for structured fact retrieval.

Executed 2026-03-20 / 2026-03-21 · NIAH variant (structured key lookup) · small-n, indicative only · hypothesis: precision holds at scale

Results at a glance

30/30 · Baseline across N=5, N=20, and adversarial
8/8 · Iranti retrieval arm (prior-session KB data)
8/8 · Iranti write arm (full ingest → retrieve)
4/4 · Iranti arm at N=5000 (~276k tok), the first positive differential
Finding: At N=5000 (~276k tokens), the baseline is infeasible: the registry document exceeds the 200k context window and cannot be read. Iranti returned all 4 target facts correctly via exact key lookup. This is the first positive differential in the benchmark program: Iranti 4/4, baseline 0/4.

Accuracy vs scale

Both arms at ceiling through N=1000; baseline infeasible at N=5000

Both arms at ceiling through N=1000. At N=5000 (~276k tokens), the baseline becomes infeasible: the haystack document exceeds Claude's 200k context window, so baseline context-reading can no longer run. Iranti returned 4/4 facts via exact key lookup.

| Condition | Haystack size | Baseline (context-reading) | Iranti (exact key lookup) |
|---|---|---|---|
| N=5 | ~400 tok | 10/10 | not run |
| N=20 | ~1.6k tok | 10/10 | not run |
| N=20+adversarial | ~1.6k tok | 10/10 | not run |
| N=100 | ~8k tok | 10/10 | 8/8 |
| N=500 | ~28k tok | 10/10 | 10/10 |
| N=1000 | ~57k tok | 10/10 | 10/10 |
| N=5000 | ~276k tok | infeasible (276k tok > 200k ctx) | 4/4 |

⚑ N=20+adversarial: wrong facts injected for needle entities. Baseline unaffected at this scale.

Methodology

What this measures

In production multi-agent systems, agents accumulate many facts about many entities across their lifetime. If those facts live in context, retrieval precision degrades as transcript length grows — the model has more to sort through, with more surface area for confusion. If facts live in a structured KB, retrieval is an O(1) exact lookup unaffected by how many other entities exist.

B1 tests this directly: we embed two target entities ("needle" entities) among N distractor entities, ask 10 questions about the needle entities, and compare context-reading accuracy against Iranti's iranti_query(entity, key) lookup.
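The two arms can be sketched in a few lines of Python. The dict-backed KB and the one-fact-per-line haystack format below are illustrative stand-ins, not Iranti's actual storage or the real haystack prose; `iranti_query` is the real tool being modeled by `kb_lookup`.

```python
# Sketch of the two B1 arms. A plain dict stands in for the Iranti KB;
# needle values are taken from the test-data section below.

NEEDLES = {
    ("researcher/alice_chen", "affiliation"): "MIT Computer Science",
    ("researcher/bob_okafor", "publication_count"): "23",
}

def build_haystack(n_distractors):
    """N distractor entities plus the needle facts, one fact per line."""
    lines = [f"researcher/distractor_{i}.affiliation=Lab {i}"
             for i in range(n_distractors)]
    lines += [f"{e}.{k}={v}" for (e, k), v in NEEDLES.items()]
    return "\n".join(lines)

def context_lookup(haystack, entity, key):
    """Baseline stand-in: scan the whole document for the fact."""
    marker = f"{entity}.{key}="
    for line in haystack.splitlines():
        if line.startswith(marker):
            return line[len(marker):]
    return None  # needle lost in the haystack

def kb_lookup(kb, entity, key):
    """Iranti stand-in: exact entity/key lookup, independent of N."""
    return kb.get((entity, key))
```

The point of the comparison is structural: `context_lookup` must process the entire haystack, while `kb_lookup` touches one key regardless of how many distractors exist.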

Inspired by Greg Kamradt's Needle-in-a-Haystack (2023) and RULER (Hsieh et al., 2024). Our adaptation replaces the sentence needle with a structured entity/key fact and replaces document-position variation with entity-count variation.

Conditions

Tested scales

| Condition | Haystack size | Notes |
|---|---|---|
| N=5 | ~400 tok | Short haystack; high signal-to-noise |
| N=20 | ~1.6k tok | Medium haystack |
| N=20+adversarial | ~1.6k tok | Wrong values injected for needle entities to test confound resistance |
| N=100 | ~8k tok | Larger haystack; both arms still at ceiling |
| N=500 | ~28k tok | Long context; both arms still at ceiling |
| N=1000 | ~57k tok | Null differential confirmed; both arms at ceiling through ~57k tokens |
| N=5000 | ~276k tok | First positive differential: Iranti 4/4, baseline infeasible (document exceeds 200k context window) |
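As a minimal sketch of the feasibility boundary these conditions probe (token counts are the measured haystack sizes above; the 200k window is the Claude limit cited in this report):

```python
# Why the N=5000 arm is infeasible for the baseline: the haystack must
# fit in context for context-reading to run at all.

CONTEXT_WINDOW = 200_000  # tokens, Claude's limit as assumed in this report

haystack_tokens = {5: 400, 20: 1_600, 100: 8_000,
                   500: 28_000, 1_000: 57_000, 5_000: 276_000}

def baseline_feasible(n):
    """The context-reading arm can run only if the whole haystack fits."""
    return haystack_tokens[n] <= CONTEXT_WINDOW

infeasible = [n for n in sorted(haystack_tokens) if not baseline_feasible(n)]
# Only N=5000 (~276k tokens) exceeds the window.
```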

Test data

The needle entities

Two fictional researchers with four facts each. Embedded at a fixed position in every haystack.

researcher/alice_chen
- affiliation: MIT Computer Science
- publication_count: 47
- previous_employer: OpenAI (2018–2021)
- research_focus: natural language processing

researcher/bob_okafor
- affiliation: Stanford AI Lab
- publication_count: 23
- previous_employer: DeepMind (2020–2023)
- research_focus: computer vision
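The needle fixture above, written out as a Python dict. The shape is illustrative only; in the benchmark these facts live in the KB via `iranti_write`.

```python
# The two needle entities and their four facts each, as test fixtures.

NEEDLE_ENTITIES = {
    "researcher/alice_chen": {
        "affiliation": "MIT Computer Science",
        "publication_count": 47,
        "previous_employer": "OpenAI (2018–2021)",
        "research_focus": "natural language processing",
    },
    "researcher/bob_okafor": {
        "affiliation": "Stanford AI Lab",
        "publication_count": 23,
        "previous_employer": "DeepMind (2020–2023)",
        "research_focus": "computer vision",
    },
}
```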

Write arm trial results

Full ingest → retrieve cycle

Both entities written to KB via iranti_write, then retrieved via iranti_query with no haystack document in context. Tests the full pipeline — not just retrieval from pre-existing KB data. Each cell below represents one trial.

| Entity | affiliation | publication_count | previous_employer | research_focus |
|---|---|---|---|---|
| researcher/alice_chen | ✓ | ✓ | ✓ | ✓ |
| researcher/bob_okafor | ✓ | ✓ | ✓ | ✓ |

All 8 cells confirmed correct. Zero hallucinations. Zero cross-entity contamination (alice facts never attributed to bob, and vice versa).

| Trial | Entity | Key | Result |
|---|---|---|---|
| IB1 | researcher/alice_chen | affiliation | ✓ correct |
| IB2 | researcher/alice_chen | publication_count | ✓ correct |
| IB3 | researcher/alice_chen | previous_employer | ✓ correct |
| IB4 | researcher/alice_chen | research_focus | ✓ correct |
| IB5 | researcher/bob_okafor | affiliation | ✓ correct |
| IB6 | researcher/bob_okafor | publication_count | ✓ correct |
| IB7 | researcher/bob_okafor | previous_employer | ✓ correct |
| IB8 | researcher/bob_okafor | research_focus | ✓ correct |

Write arm accuracy: 8/8 (100%). Recall: 8/8. Precision: 8/8.
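The write-arm protocol can be sketched as follows. A plain dict stands in for the Iranti KB, and `iranti_write`/`iranti_query` are represented by dict writes and reads; only the ingest-then-retrieve shape is the real protocol.

```python
# Sketch of the write-arm trials IB1-IB8: ingest every fact, then
# retrieve each one with no haystack in context, scoring exact matches.

FACTS = {
    ("researcher/alice_chen", "affiliation"): "MIT Computer Science",
    ("researcher/alice_chen", "publication_count"): "47",
    ("researcher/alice_chen", "previous_employer"): "OpenAI (2018–2021)",
    ("researcher/alice_chen", "research_focus"): "natural language processing",
    ("researcher/bob_okafor", "affiliation"): "Stanford AI Lab",
    ("researcher/bob_okafor", "publication_count"): "23",
    ("researcher/bob_okafor", "previous_employer"): "DeepMind (2020–2023)",
    ("researcher/bob_okafor", "research_focus"): "computer vision",
}

kb = {}
for (entity, key), value in FACTS.items():   # stands in for iranti_write
    kb[(entity, key)] = value

retrieved = {ek: kb.get(ek) for ek in FACTS}  # stands in for iranti_query
correct = sum(retrieved[ek] == v for ek, v in FACTS.items())
accuracy = correct / len(FACTS)
```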

The metadata advantage

Provenance, not just value

Every iranti_query result includes confidence, source, timestamp, and contested status. Context-reading returns a value and nothing else. The metadata is what allows downstream agents to reason about reliability — without needing to re-read the source document.

iranti_query(entity="researcher/alice_chen", key="affiliation")

{
  "found": true,
  "entity": "researcher/alice_chen",
  "key": "affiliation",
  "value": "MIT Computer Science",
  "confidence": 0.98,
  "source": "agent/site_main",
  "validFrom": "2026-03-20T14:30:00Z",
  "contested": false
}
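One way a downstream agent might act on these provenance fields is a confidence/contested gate. The 0.9 threshold and the `trustworthy` helper below are illustrative, not part of Iranti.

```python
# Sketch: gating on the provenance metadata an iranti_query result carries.

result = {
    "found": True,
    "entity": "researcher/alice_chen",
    "key": "affiliation",
    "value": "MIT Computer Science",
    "confidence": 0.98,
    "source": "agent/site_main",
    "validFrom": "2026-03-20T14:30:00Z",
    "contested": False,
}

def trustworthy(r, min_confidence=0.9):
    """Accept a fact only if found, uncontested, and confident enough."""
    return r["found"] and not r["contested"] and r["confidence"] >= min_confidence

value = result["value"] if trustworthy(result) else None
```

Context-reading offers no equivalent gate: a value read out of a transcript carries no machine-checkable reliability signal.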

Analysis

Key findings and limitations

Finding: Iranti retrieval is O(1) and deterministic. The same query returns the same result regardless of how many other entities exist in the KB. Baseline context-reading should degrade at scale in principle, but we have not yet found that degradation threshold with Claude Sonnet 4.6.

Finding: Semantic search is not a substitute for exact query. A side test comparing iranti_search (semantic) against iranti_query (exact) found that the semantic path returned the target fact in fifth position, not first. For structured fact retrieval, exact key lookup is the correct tool.

Finding: Retrieved facts include provenance. Every iranti_query result carries confidence, source, validFrom, and contested fields. Context-reading returns none of this metadata.

Finding: First positive differential at N=5000. At ~276k tokens, the haystack document exceeds the 200k context window, so baseline context-reading is infeasible. Iranti returned 4/4 facts correctly via exact key lookup. This is the condition the program was built to find: Iranti 4/4, baseline 0/4.

Limitation: Infeasible is not the same as degraded. The baseline scored 0/4 because the test could not be attempted, not because the model failed to find the needle. The practical outcome is the same (no answer), but the failure modes are technically distinct. Either way, context-reading cannot serve a 5,000-entity knowledge base; Iranti can.

Limitation: Self-evaluation bias. The same model (Claude Sonnet 4.6) designed the test, ran the baseline arm, and evaluated answers. Self-consistency effects may inflate baseline scores. Independent evaluation infrastructure is required for publication-grade claims.

Limitation: Small-n caution. The largest arm uses 10 trials. Sample sizes are too small to support statistical claims; all results are directional, appropriate for development feedback, not benchmarking papers.
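The exact-vs-semantic finding can be illustrated with a toy ranked list. The candidate facts and similarity scores below are invented for illustration; only the rank-5 outcome mirrors the side test that compared iranti_search with iranti_query.

```python
# Toy illustration: semantic search returns a ranked list in which the
# target fact may sit below the top, while exact key lookup is
# deterministic and always returns the target directly.

kb = {("researcher/alice_chen", "affiliation"): "MIT Computer Science"}

semantic_results = [  # (candidate fact, similarity score), sorted by score
    ("alice_chen collaborates with MIT Media Lab", 0.81),
    ("alice_chen research_focus: natural language processing", 0.79),
    ("bob_okafor affiliation: Stanford AI Lab", 0.77),
    ("alice_chen previous_employer: OpenAI", 0.74),
    ("alice_chen affiliation: MIT Computer Science", 0.72),
]

target_rank = 1 + next(i for i, (fact, _) in enumerate(semantic_results)
                       if "affiliation: MIT" in fact)

exact_hit = kb[("researcher/alice_chen", "affiliation")]  # one lookup, no ranking
```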
Raw data

Full trial execution records, baseline runs, dataset definitions, and statistical notes are in the benchmarking repository.