Benchmark B4

Multi-hop Entity Reasoning
Oracle path: 4/4. Search path: 1/4.

B4 tests whether Iranti supports chained entity lookups — hop 1 resolves an entity, hop 2 uses that result to find another. The finding is a clear split: when entity IDs are known upfront, chains work perfectly. When hop 2 requires search-based discovery, it consistently fails.

Executed 2026-03-21. n=4, 3 arms. Search discovery gap documented.

Results at a glance

Oracle path: 4/4 — entity IDs known, exact lookup chains work perfectly.
Search path: 1/4 — search-based discovery at hop 2 fails reliably.
Baseline (context-reading): 4/4 — expected for small KB size.
Finding: The oracle path and search path produce opposite results. This is not a marginal gap — it is a capability boundary. Multi-hop reasoning in Iranti is currently reliable only when callers supply entity IDs directly.

What this measures

Multi-hop entity reasoning is the ability to chain lookups: resolve entity A, extract a property from A, use that property to locate entity B, then continue. This pattern appears constantly in real-world agent work — traversing org charts, tracing dependency graphs, following relationship chains across a knowledge base.

B4 constructs two-hop chains and tests them three ways. The baseline arm gives the model a plain text document and lets it read the answer directly — no Iranti calls, just context. The oracle arm provides entity IDs at each hop so Iranti can do exact lookups. The search arm withholds the hop-2 ID and forces the model to discover the target entity by querying on attribute values.
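The two-hop structure shared by the three arms can be sketched against a toy in-memory graph. `EntityKB`, `lookup`, and `search` are hypothetical stand-ins for illustration, not Iranti's actual API; the toy search succeeds trivially, whereas B4's failure occurs in real vector-based discovery.

```python
# Toy two-hop harness illustrating the benchmark arms.
# EntityKB and its methods are hypothetical stand-ins, not Iranti's API.

class EntityKB:
    def __init__(self, entities):
        self.entities = entities  # id -> {field: value}

    def lookup(self, entity_id):
        """Exact lookup by ID (oracle path)."""
        return self.entities[entity_id]

    def search(self, field, value):
        """Attribute-value discovery (search path) -- the step that fails in B4.
        This toy version matches exactly; the real system uses vector search."""
        for eid, attrs in self.entities.items():
            if attrs.get(field) == value:
                return eid
        return None

kb = EntityKB({
    "entity/person/alice": {"name": "Alice Chen", "dept": "entity/dept/mit-cs"},
    "entity/dept/mit-cs": {"name": "MIT CS", "institution": "MIT"},
})

# Hop 1: exact lookup resolves Alice and yields the hop-2 entity ID.
dept_id = kb.lookup("entity/person/alice")["dept"]

# Oracle arm: the hop-2 ID is supplied, so hop 2 is another exact lookup.
oracle_answer = kb.lookup(dept_id)["name"]

# Search arm: the hop-2 ID is withheld; the target must be discovered by attribute.
search_hit = kb.search("institution", "MIT")
```

The baseline arm has no analogue here: the model simply reads the same facts from a plain text document.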

The gap between oracle (4/4) and search (1/4) isolates the exact failure: not the chaining logic, not the model, not the hop-1 retrieval — just search-based entity discovery at intermediate hops.

The baseline outperforming search-based Iranti is expected at small KB sizes. As KB size grows beyond what fits in context, the ordering is expected to invert. This benchmark does not test that crossover point.

The two-hop chain

Hop 1 uses an exact entity ID — the oracle path (solid teal) works perfectly. Hop 2 requires discovering the next entity by attribute value — the search path (dashed amber) fails consistently. The broken step is search, not the chain logic.

[Diagram: two-hop chain. Hop 1, exact lookup (oracle path, solid teal): Alice Chen (entity/person/alice) → MIT CS (entity/dept/mit-cs). Hop 2, search-based (search path, dashed amber): MIT CS → ? (search target, ID unknown — needs search). Hop 1 succeeds; hop 2 fails.]

Results across all three arms

Four multi-hop chains tested per arm. The oracle arm matches the baseline. The search arm fails 3 of 4.

Arm | Score | Per-question breakdown
Baseline (context-reading) | 4/4 | Model scans a plain text document; all 4 multi-hop entity chains resolved correctly.
Iranti oracle path | 4/4 | Entity IDs known upfront; exact lookup chains work across both hops.
Iranti search path | 1/4 | Search-based entity discovery at hop 2 fails; consistently returns oldest KB entries, ignores recent writes.
Total (3 arms) | 9/12 | Oracle path drives all Iranti correctness.

All chains used the same underlying entity graph. The only difference between oracle and search arms is whether the hop-2 entity ID was supplied or had to be discovered through search.

Why did search fail? Three hypotheses

The search arm consistently returned the oldest KB entries and ignored recently written ones. The raw results do not conclusively determine the cause — these are the three most plausible explanations from the observed behavior.

Hypothesis 1: Indexing lag

Vector embeddings may not generate immediately for newly written entries. When the benchmark writes an entity and then queries for it seconds later, the embedding index may not yet include it — causing search to fall back to older, already-indexed entries.
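The lag hypothesis can be simulated with a store whose search index updates only on an explicit flush. `LaggyIndex` and its methods are illustrative, not part of Iranti; the point is the symptom shape, a write that search cannot see until indexing catches up.

```python
# Simulation of the indexing-lag hypothesis: search only sees entries the
# (hypothetical) embedding index has already picked up.

class LaggyIndex:
    def __init__(self):
        self.store = {}      # all written entries
        self.indexed = {}    # entries visible to search

    def write(self, key, value):
        self.store[key] = value              # the write lands immediately...

    def flush(self):
        self.indexed = dict(self.store)      # ...but search sees it only here

    def search(self, value):
        return [k for k, v in self.indexed.items() if v == value]

idx = LaggyIndex()
idx.write("entity/dept/old", "Stanford CS")
idx.flush()                                  # old entry is indexed
idx.write("entity/dept/mit-cs", "MIT CS")    # new entry written, not yet indexed

missed = idx.search("MIT CS")                # empty -- the benchmark's symptom
idx.flush()
found = idx.search("MIT CS")                 # visible once indexing catches up
```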

Hypothesis 2: Value-vs-summary indexing

Search may index auto-generated summaries rather than structured field values. If the summary does not faithfully reproduce the written value, attribute-value searches will miss the correct entry even when it is present in the store.
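A minimal sketch of this hypothesis: if matching runs against a lossy auto-generated summary rather than the stored fields, an entry that plainly contains the value can still miss the query. `summarize` and both search functions are toy stand-ins, not Iranti internals.

```python
# Sketch of the value-vs-summary hypothesis with a deliberately lossy summary.

def summarize(entry):
    # Hypothetical auto-summary that drops the department field entirely.
    return f"{entry['name']} is a researcher."

entries = {
    "entity/person/alice": {"name": "Alice Chen", "dept": "MIT CS"},
}

def search_values(query):
    return [k for k, e in entries.items() if query in e.values()]

def search_summaries(query):
    return [k for k, e in entries.items() if query in summarize(e)]

by_value = search_values("MIT CS")       # hit: the field holds the value
by_summary = search_summaries("MIT CS")  # miss: the summary never mentions it
```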

Hypothesis 3: Score bias toward older entries

Older KB entries have accumulated higher confidence scores over repeated reads. When search returns the top-5 by relevance, score-weighted ranking may consistently surface high-confidence old entries over lower-scored new ones, regardless of semantic match quality.
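The bias is easy to reproduce with a toy ranking that multiplies semantic relevance by accumulated confidence. All scores below are illustrative numbers, not measured Iranti values.

```python
# Sketch of the score-bias hypothesis: a score-weighted ranking lets old,
# high-confidence entries outrank a stronger semantic match.

entries = [
    # (id, semantic_relevance, confidence accumulated over repeated reads)
    ("entity/dept/old-a", 0.40, 0.95),
    ("entity/dept/old-b", 0.35, 0.90),
    ("entity/dept/mit-cs", 0.74, 0.30),  # best semantic match, newly written
]

def ranked(rows):
    # Hypothetical combined score: relevance * confidence.
    return sorted(rows, key=lambda r: r[1] * r[2], reverse=True)

top = ranked(entries)[0][0]  # an old entry wins despite weaker semantic match
```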

These hypotheses are not mutually exclusive. All three could contribute simultaneously. Isolating the actual cause requires a targeted follow-up benchmark that controls for write timing, summary content, and entry age independently.

Honest limitations

Limitation: Small test set (n=4). Four chains per arm is enough to identify a clear directional signal. It is not enough to characterize the failure rate precisely — the true search failure rate could range from 50% to near 100% on different KB configurations.
Limitation: Single session. All three arms ran in one benchmark session. Session-specific factors — KB state at test time, ordering effects, the specific entity values chosen — may influence results in ways that do not generalize.
Limitation: Single KB size. The crossover point where search-based Iranti outperforms the context-reading baseline was not tested. At large KB sizes, context-reading becomes impossible and the oracle path remains the only reliable option. This benchmark does not characterize that regime.
Note: Hypotheses are unverified. The three search-failure hypotheses are inferred from behavior, not confirmed through controlled experiments. Do not treat them as established facts about Iranti's internals.

Key findings

Finding: Oracle multi-hop works perfectly. When entity IDs are known at each hop, Iranti chains exact lookups without error. Chain as many hops as needed — the constraint is ID availability, not chain depth.
Finding: Search-based discovery is unreliable at hop 2. Using search to discover the next entity in a chain currently fails 3 of 4 times. Applications that depend on attribute-value searches at intermediate hops should not rely on this path in production.
Finding: Context-reading matches oracle at small scale. For knowledge bases small enough to fit in context, the baseline approach (plain text document) is equivalent to the oracle path. The oracle path advantage emerges at scale, which this benchmark does not test.
Finding: This is a documented capability gap, not a bug. The failure is in search-based entity discovery, a known hard problem in vector search systems. The gap is real, documented honestly, and has a clear workaround: supply entity IDs whenever possible.
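The ID-first workaround can be sketched as a hop resolver that takes an exact ID when the caller has one and falls back to search only when forced. `resolve_hop` and the dict-shaped KB are hypothetical, not Iranti's API.

```python
# Sketch of the ID-first workaround: stay on the oracle path whenever an
# entity ID is available; treat search as a last resort.

def resolve_hop(kb, entity_id=None, attribute=None, value=None):
    if entity_id is not None:
        return entity_id          # oracle path: reliable (4/4 in B4)
    # search path: currently unreliable at intermediate hops (1/4 in B4)
    for eid, attrs in kb.items():
        if attrs.get(attribute) == value:
            return eid
    return None

kb = {
    "entity/person/alice": {"dept": "entity/dept/mit-cs"},
    "entity/dept/mit-cs": {"institution": "MIT"},
}

hop1 = resolve_hop(kb, entity_id="entity/person/alice")
hop2_id = kb[hop1]["dept"]                 # extract the next ID from hop 1
hop2 = resolve_hop(kb, entity_id=hop2_id)  # supply it: stays on the oracle path
```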

v0.2.16 Update: Search Restored

The search regression that produced the 1/4 result in v0.2.12, and a full crash in v0.2.14, has been fixed in v0.2.16. Vector scoring is now active (scores 0.35–0.74), and the three originally failing queries — find-by-institution, find-by-prior-employer, find-by-institution-peer — now pass. All three used direct attribute values, which is now the reliable path.

Version | iranti_search | Vector score | Direct attribute | Semantic paraphrase
0.2.12 | Degraded | 0 (disabled) | Partial | Fails
0.2.14 | Crashes | N/A | N/A | N/A
0.2.16 | Operational | 0.35–0.74 | Works | Fails
Finding: Direct attribute search now works. Queries like "find the researcher at MIT Computer Science" or "find the researcher who came from OpenAI" now return the correct entity. The original 3 failing queries from v0.2.12 all pass in v0.2.16. Multi-hop chains over direct attributes are viable.
Limitation: Remaining ceiling is semantic paraphrase. If the query describes an entity indirectly — "find the researcher who studies causality and inference without econometrics" — rather than naming an attribute directly, search still fails. The system retrieves by attribute value better than it reasons about semantic descriptions of what those values mean. Design multi-hop chains around direct attribute lookups, not indirect descriptions.
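The contrast can be sketched with a naive matcher: a query that names a stored value hits, while a query that only describes the value misses. The substring check is a deliberately crude stand-in for vector search, used only to show the shape of the ceiling.

```python
# Toy contrast between direct-attribute and semantic-paraphrase queries.
# attribute_match is a naive stand-in for vector search, not Iranti's matcher.

entry = {
    "id": "entity/person/alice",
    "institution": "MIT Computer Science",
    "prior_employer": "OpenAI",
}

def attribute_match(query, e):
    # Hits only when the query literally contains a stored attribute value.
    return any(str(v) in query for k, v in e.items() if k != "id")

direct = attribute_match("find the researcher at MIT Computer Science", entry)
paraphrase = attribute_match(
    "find the researcher who studies causality without econometrics", entry)
```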
Raw data

Full trial execution records, per-question scores, entity graphs, and methodology notes in the benchmarking repository.