Agentic graph-walk vs dense RAG on a cBioPortal paper wiki

Run 2026-04-20-0446 · primary split=train · n=30 questions × 2 modes

Abstract

We compare two retrieval strategies over a curated, 56-paper subset of the cBioPortal cancer-genomics publication corpus. The first (Agentic) is a language-model agent that navigates a hand-compiled wiki of papers, genes, cancer types, datasets, drugs, and methods via six graph-walk tools. The second (RAG) is standard dense-vector retrieval: chunk paper markdowns, embed with Vertex AI gemini-embedding-001, pull the top-40 chunks, answer in a single call. Both modes share the answering model (claude-haiku-4-5) and a rubric judge (claude-opus-4-6). On a 50-question set split 30/10/10 across train/val/test and four question categories, the agent wins on every quality dimension of the held-out test split (accuracy 4.30 vs 3.80, completeness 4.40 vs 3.30, citation recall 0.925 vs 0.708) at roughly 22% more wall time (31.2 s vs 25.6 s). The direction of the result is the same on all three splits, which argues against train-set overfitting, although the val and test splits are small (n=10 each). We read this as a preliminary step before trying GraphRAG-style community summarization, adaptive retrieval (Self-RAG / CRAG), and a larger corpus. The apparatus, data, and source can all be reproduced from eval/.

Introduction and background

What good is a knowledge base if you can’t get information out of it? Retrieval-augmented generation (Lewis 2020) has become the default recipe for grounding language-model answers using a knowledge base. The classic implementation is vector search over fixed-size text chunks (Karpukhin 2020, Santhanam 2022), evaluated on heterogeneous retrieval benchmarks (Thakur 2021, Muennighoff 2023) and generation-quality rubrics (Es 2023). In parallel, two threads have emerged that treat retrieval as a behaviour rather than a single lookup: graph-augmented retrieval (Edge 2024, Gutiérrez 2024, Guo 2024, Gao 2024) builds an explicit structure over the corpus and walks it; agentic retrieval (Yao 2023, Nakano 2022, Asai 2023, Yan 2024, Singh 2025) lets a language-model agent decide what to read next based on what it just saw. Both converge on the same observation: for questions that span documents, flat top-k vector search is not the best fit.

This experiment is a preliminary step towards unpacking this idea for a cancer-genomics knowledge base that powers a chatbot. We already maintain a hand-compiled wiki of papers, genes, cancer types, datasets, drugs, and methods (56 papers as of this run, a subset of the cBioPortal publication list plus some extras), with cross-links between entity pages and their citing papers. The question we want to answer is modest: given that graph, does an agent that walks it beat a vanilla RAG pipeline over the same papers? A positive signal motivates building more structure into our cBioPortal knowledge base; a negative one says the structure isn’t paying for itself and a plain embedding pipeline would do. Borrowing from the tradition of biomedical-QA evaluations, which distinguish factoid, list, yes/no, and summary questions (Krithara 2023, Jin 2019, Wadden 2020), we asked an LLM to craft a typed question set; the four categories we actually used (lookup, list, synthesis, definition) are described under Methods.

This is only a first pass. Our corpus is incomplete, and we haven’t thought critically about how to structure the graph, what content to abstract from each paper, or what should count as an entity. Our wiki graph has no community-detection layer (Edge 2024), no pre-computed theme/subgraph summaries, and no adaptive retrieval loop (Asai 2023, Yan 2024). The questions are AI-generated with minimal review, so treat the numbers here as indicative rather than definitive.

Methods

Corpus. 56 papers from the cBioPortal publication list, selected because they were fully ingested into both our compiled wiki (at wiki/papers/{pmid}.md) and our chunking/embedding pipeline (raw markdown at data/raw/papers/{pmid}.md). Both retrieval modes see exactly this set — no side channel.

Question set. 50 questions authored by an LLM, split 30 / 10 / 10 across train / val / test, and tagged with one of four categories: lookup (single-paper factoid), list (enumerate papers or entities meeting a criterion), synthesis (cross-paper claim), or definition (what is X?). Each question carries one or more gold PMIDs so we can measure citation recall independently of the judge.

Agentic mode. A PydanticAI agent using claude-haiku-4-5 with six graph-walk tools: read_page, read_section, follow_links, find_references, get_page_metadata, list_pages. The system prompt instructs the agent to start from index.md and traverse by title match → entity pivot → follow links. A hard cap of 20 tool calls per query and a 180-second wall-clock deadline are enforced at the server and runner layers respectively.
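The two guardrails described above can be sketched as a small budget object that is charged before every tool call. This is a minimal illustration, not the harness code: the names RunBudget, BudgetExceeded, and charge_tool_call are hypothetical, and for brevity the sketch enforces both limits in one place, whereas the run enforces the tool cap at the server layer and the deadline at the runner layer.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a query runs past its tool-call or wall-clock budget."""


class RunBudget:
    """Tracks the two per-query limits: a tool-call cap and a wall-clock deadline.

    Hypothetical helper for illustration; the defaults mirror the values
    stated in the text (20 tool calls, 180 seconds).
    """

    def __init__(self, max_tool_calls=20, deadline_s=180.0, clock=time.monotonic):
        self.max_tool_calls = max_tool_calls
        self.deadline_s = deadline_s
        self._clock = clock          # injectable so the logic can be tested
        self._start = clock()
        self.tool_calls = 0

    def charge_tool_call(self):
        """Call once before each tool invocation; raises if a limit is hit."""
        if self._clock() - self._start > self.deadline_s:
            raise BudgetExceeded("wall-clock deadline exceeded")
        if self.tool_calls >= self.max_tool_calls:
            raise BudgetExceeded("tool-call cap reached")
        self.tool_calls += 1
```

A caller would catch BudgetExceeded, keep whatever partial answer exists, and record which limit fired, which matches how capped runs are scored in the Run guardrails section.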

RAG mode. Paper markdowns are chunked (~900 characters with a 120-character overlap), embedded with Vertex AI gemini-embedding-001 (3072-dim), and stored in a FAISS IndexFlatIP. That index ranks by inner product, which equals cosine similarity when the vectors are L2-normalized. At query time we embed the question, pull the top-40 passages (budgeted to ~60k characters of context), and issue a single claude-haiku-4-5 call with the stitched passages plus the question.
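To make the chunking and ranking steps concrete, here is a minimal pure-Python sketch. It is not the pipeline code: chunk_text, l2_normalize, and top_k are hypothetical names, the embedding call and FAISS are replaced by plain lists of floats, and the ~60k-character context budget is left out.

```python
import math


def chunk_text(text, size=900, overlap=120):
    """Split text into windows of `size` characters.

    Each window starts `size - overlap` characters after the previous one,
    so consecutive chunks share `overlap` characters.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, chunk_vecs, k=40):
    """Return indices of the k chunks with the highest inner product.

    Both sides are normalized first, so the inner product is the cosine
    similarity. Ties are broken by the lower chunk index.
    """
    q = l2_normalize(query_vec)
    scored = []
    for idx, vec in enumerate(chunk_vecs):
        v = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

With these defaults a 2,000-character document yields three chunks starting at offsets 0, 780, and 1560.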

Judge. claude-opus-4-6 reads the question, gold notes, and the final answer from whichever mode is being scored, and returns three integer scores on a 1–5 rubric (accuracy, completeness, citation_correctness) plus a free-text reason. We separately compute citation recall as the fraction of gold PMIDs cited in the answer (detected by regex on PMID:\d+ and papers/\d+\.html link forms). Full rubric and scoring code are in eval/judge.py and eval/README.md.
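Citation recall as defined above is simple enough to sketch in a few lines. The two patterns follow the link forms named in the text; the function names cited_pmids and citation_recall are ours for illustration, and the actual implementation in eval/judge.py may differ in detail.

```python
import re

# The two citation forms named in the text. Each pattern captures the PMID digits.
PMID_PATTERNS = (
    re.compile(r"PMID:(\d+)"),
    re.compile(r"papers/(\d+)\.html"),
)


def cited_pmids(answer):
    """Collect every PMID that the answer cites in either form."""
    found = set()
    for pattern in PMID_PATTERNS:
        found.update(pattern.findall(answer))
    return found


def citation_recall(answer, gold_pmids):
    """Fraction of gold PMIDs that appear among the answer's citations."""
    gold = set(gold_pmids)
    if not gold:
        raise ValueError("each question needs at least one gold PMID")
    return len(gold & cited_pmids(answer)) / len(gold)
```

Note that this measure ignores extra, non-gold citations; those are only penalized through the judge's citation_correctness score.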

Results

Headline numbers (train split)

Mean scores on the 30-question train split. Judge scores are on a 1–5 scale; citation recall is the fraction of gold PMIDs appearing in the answer.

Metric                        Agentic   RAG
accuracy (1–5)                4.43      4.30
completeness (1–5)            4.33      3.87
citation correctness (1–5)    4.30      4.10
citation recall vs gold       0.844     0.745
avg wall time (s)             51.0      27.3
avg input tokens              60177     11890
avg output tokens             934       321

Takeaway — agentic wins on completeness and recall; RAG uses ~5× fewer input tokens (and ~3× fewer output tokens) and is ~1.9× faster in wall time. Accuracy and citation correctness are roughly equivalent.
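As a quick arithmetic check, the cost ratios implied by the headline table can be recomputed directly. This is a minimal sketch; the dictionary layout and the name cost_ratios are ours for illustration and are not part of eval/.

```python
# Cost rows copied from the headline table (train split means).
HEADLINE = {
    "agentic": {"wall_time_s": 51.0, "input_tokens": 60177, "output_tokens": 934},
    "rag": {"wall_time_s": 27.3, "input_tokens": 11890, "output_tokens": 321},
}


def cost_ratios(table):
    """Return agentic / RAG for every cost column."""
    return {key: table["agentic"][key] / table["rag"][key] for key in table["rag"]}


RATIOS = cost_ratios(HEADLINE)
# wall_time_s   -> about 1.87
# input_tokens  -> about 5.06
# output_tokens -> about 2.91
```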

Figure 1 — Mean judge score by question category

Each category is a different flavour of question. Categories where completeness (did you enumerate the facts?) carries signal are where the agent’s per-page reading pays off.

Figure 1. Mean judge score by question category, agentic vs RAG. Each panel is one metric (accuracy / completeness / citation). Agentic pulls clearly ahead on completeness for list and synthesis questions.

Takeaway — lookup and definition are a wash; agentic edges RAG on list (+0.3 completeness) and especially synthesis (+1.3 completeness), the categories where full-page reading matters most.

Generalization: train / val / test

Three disjoint splits run with identical configuration. Consistency across the three is evidence the agent’s win isn’t train-set overfitting (we tuned nothing between splits).

Agentic

metric             train (n=30)   val (n=10)   test (n=10)
accuracy           4.43           4.60         4.30
completeness       4.33           4.60         4.40
citation           4.30           4.30         4.30
citation recall    0.844          0.887        0.925
wall time (s)      51.0           39.9         31.2

RAG

metric             train (n=30)   val (n=10)   test (n=10)
accuracy           4.30           4.50         3.80
completeness       3.87           4.00         3.30
citation           4.10           4.10         3.70
citation recall    0.745          0.750        0.708
wall time (s)      27.3           29.0         25.6

Takeaway — agentic stays ahead on every quality metric on every split (it is slower on all three), with the test margins matching or exceeding the train margins.
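The per-split margins behind this takeaway can be recomputed from the two tables above. The sketch below simply subtracts the RAG mean from the agentic mean for each quality metric and split; the names SPLITS, margins, and MARGINS are ours for illustration.

```python
SPLITS = ("train", "val", "test")

# Quality metrics copied from the two split tables above (train, val, test).
AGENTIC = {
    "accuracy": (4.43, 4.60, 4.30),
    "completeness": (4.33, 4.60, 4.40),
    "citation": (4.30, 4.30, 4.30),
    "citation_recall": (0.844, 0.887, 0.925),
}
RAG = {
    "accuracy": (4.30, 4.50, 3.80),
    "completeness": (3.87, 4.00, 3.30),
    "citation": (4.10, 4.10, 3.70),
    "citation_recall": (0.745, 0.750, 0.708),
}


def margins(agentic, rag):
    """Agentic minus RAG, per metric and split, rounded to 3 decimals."""
    out = {}
    for metric, a_vals in agentic.items():
        out[metric] = {
            split: round(a - r, 3)
            for split, a, r in zip(SPLITS, a_vals, rag[metric])
        }
    return out


MARGINS = margins(AGENTIC, RAG)
```

These are differences of means over 10 or 30 questions, so they say nothing yet about statistical significance.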

Figure 2 — Per-question comparison

Each point is one train-split question, plotted at its total judge score (accuracy + completeness + citation, range 3–15) under RAG on the x-axis and under the agent on the y-axis. Bubble size is proportional to the wall time of whichever mode took longer on that question. Points above the dashed y = x line are agent wins; points below it are RAG wins.

Hover over any point in the interactive version for the full question and both answers.

Takeaway — most synthesis points (red) sit above the diagonal, most definition and high-scoring lookup points pile near (15, 15). Two agent-losses stand out: LS04 (a list question the agent got lost on) and S04 (a synthesis question both modes struggled with, but RAG did better).
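The total score and the win/tie call for each point can be sketched as follows. The rule used here (compare the summed scores, call equal sums a tie) is our reading of the figure; it reproduces the Winner column of the per-question table in Appendix C, but the function names total and winner are hypothetical, and the sample rows are just a handful copied from that table.

```python
def total(scores):
    """Sum an 'acc/comp/cite' string such as '5/4/1' into a 3-15 total."""
    return sum(int(part) for part in scores.split("/"))


def winner(agentic_scores, rag_scores):
    """Compare total judge scores; equal totals are a tie."""
    a, r = total(agentic_scores), total(rag_scores)
    if a > r:
        return "agentic"
    if r > a:
        return "rag"
    return "tie"


# A few rows from the per-question table: (agentic, RAG, listed winner).
SAMPLE_ROWS = {
    "D01": ("5/4/1", "5/5/2", "rag"),
    "D03": ("5/5/5", "4/3/5", "agentic"),
    "LS04": ("2/1/1", "5/4/4", "rag"),
    "LS06": ("4/2/3", "3/3/3", "tie"),
    "S04": ("1/1/1", "3/2/2", "rag"),
}

MISMATCHES = [
    qid for qid, (a, r, listed) in SAMPLE_ROWS.items() if winner(a, r) != listed
]
```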

Figure 3 — Cost vs quality

Cost here is deliberately not budget-matched: each strategy carries its own shape. The left panel shows input-token cost (log scale) against total judge score; the right panel shows the same against wall time. A well-behaved retrieval strategy should have its high-scoring dots concentrated at the left of each panel.

Figure 3. Total judge score vs input tokens (log scale, left) and wall time (right), agentic and RAG overlaid. RAG clusters cheap/fast; agentic spends more to reach the same score ceiling, but also reaches it on questions where RAG cannot.

Takeaway — RAG has a tight, cheap, fast cluster in the 10k-token band; agentic spreads across 10k–200k tokens. Both modes hit the score ceiling of 15, but for several mid-scoring synthesis questions only the agent crosses into the high band.

Run guardrails

  • Agentic: 0 / 30 runs hit the 180 s wall-clock timeout; 1 hit the 20-call tool cap.
  • RAG: 0 / 30 runs hit the 180 s wall-clock timeout; 0 hit the 20-call tool cap.

Exceeding either cap is a harness-level failure, not a judge failure; the record is kept and scored on whatever partial output was emitted before the cap fired.

Discussion and next steps

The held-out test split is the number to quote: the agent beats RAG on every quality dimension (accuracy, completeness, citation, citation recall) at roughly 22% more wall time (31.2 s vs 25.6 s). The magnitude of the win is largest on completeness, which matches the intuition that graph-walking over whole pages beats passage-packing when the answer needs to enumerate. Consistency across train, val, and test argues against overfitting to a single split, though it does not substitute for the significance testing noted below.

Several caveats bound how far to push this result. This is a single corpus (56 papers), a single question set (50 questions), and a single judge model, with no statistical-significance testing on the metric differences. RAG hyperparameters (chunk size, overlap, top-k, embedding model) were not swept; we used a sensible-default configuration. The agent benefits from a wiki that was itself authored by language-model agents from the same paper corpus, so some of the lift could be attributed to that pre-processing rather than graph-walk retrieval per se; a dedicated ablation would be needed to separate the two.

Next steps, in rough priority order:

  1. GraphRAG-style community summarization on our wiki’s cross-link graph (Edge 2024). We already emit wiki/graph.json; running Leiden over it and generating theme-level summaries would give a third retrieval strategy to compare against — and address our current lack of synthesis/theme pages.

  2. Adaptive retrieval — layering a reflection-token style decision (Asai 2023) or a retrieval evaluator (Yan 2024) on top of the RAG path so it can fall through to graph-walk when the top-k is weak.

  3. Corpus expansion. We run on 56 of cBioPortal’s several hundred published studies. Re-evaluating on a 200+ paper corpus would test whether the agent’s advantage holds as the corpus grows.

  4. Question-set scale and judges. A larger question set, ideally with more human review than the current LLM-authored one; multiple judge models and bootstrap confidence intervals on every delta.

  5. Authoring themes and community pages. A structural gap in the current wiki: no pre-written cross-paper synthesis pages. Their absence probably hurts both modes, but the agent more, since such pages would be its natural top-down landing point.
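As an illustration of the bootstrap confidence intervals mentioned in step 4, here is a minimal paired percentile bootstrap on the per-question difference in total judge score (agentic minus RAG) for the train split. The deltas are derived from the per-question table in Appendix C; the function name, the resample count, and the seed are our own choices, and this is a sketch rather than the analysis we plan to ship.

```python
import random
import statistics

# Agentic total minus RAG total for each train question, in table order
# (D01-D06, L01-L09, LS01-LS06, S01-S09).
DELTAS = [
    -2, 0, 3, 0, 2, 0,
    0, 1, 0, 0, 0, 0, 0, 0, -4,
    1, 5, 5, -9, 1, 0,
    1, 4, 4, -4, 3, 5, 2, 0, 6,
]


def paired_bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences.

    Questions are resampled with replacement; because each delta is already
    a within-question difference, the pairing is preserved.
    """
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(deltas), lower, upper


MEAN_DELTA, CI_LOW, CI_HIGH = paired_bootstrap_ci(DELTAS)
```

The observed mean delta is 0.8 points on the 3–15 scale. If the resulting interval includes zero, the train-split difference in total score is not distinguishable from noise at this sample size.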

References

Seminal RAG

  • Guu, K. et al. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. ICML. https://arxiv.org/abs/2002.08909 — Pre-trains retrieval directly into the language model via masked LM signal.
  • Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. https://arxiv.org/abs/2005.11401 — Founding paper of the RAG paradigm — dense retriever + seq2seq generator.
  • Izacard, G. and Grave, E. (2021). Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. EACL. https://arxiv.org/abs/2007.01282 — Fusion-in-Decoder (FiD) architecture — encodes passages independently, fuses them in the decoder.
  • Izacard, G. et al. (2022). Atlas: Few-shot Learning with Retrieval Augmented Language Models. JMLR. https://arxiv.org/abs/2208.03299 — End-to-end pretrained RAG — Contriever + FiD trained jointly at scale.

Dense retrieval and evaluation

  • Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP. https://arxiv.org/abs/2004.04906 — DPR — dual-encoder dense retrievers, the canonical RAG retriever baseline.
  • Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets & Benchmarks. https://arxiv.org/abs/2104.08663 — 18 diverse IR tasks — the standard benchmark for retriever robustness.
  • Santhanam, K., Khattab, O. et al. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL. https://arxiv.org/abs/2112.01488 — Token-level late interaction at cross-encoder quality, 6-10x smaller footprint.
  • Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv. https://arxiv.org/abs/2309.15217 — Reference-free RAG metrics — context precision/recall, faithfulness, answer relevancy.
  • Muennighoff, N. et al. (2023). MTEB: Massive Text Embedding Benchmark. EACL. https://arxiv.org/abs/2210.07316 — 8 embedding task types (retrieval, classification, clustering, and others) covering 58 datasets; de facto embedding leaderboard.

GraphRAG and knowledge-graph retrieval

  • Edge, D. et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv (Microsoft Research). https://arxiv.org/abs/2404.16130 — Entity KG + Leiden communities + hierarchical summaries; large gains on global sensemaking queries.
  • Gao, Z. et al. (2024). StructRAG: Boosting Knowledge Intensive Reasoning via Inference-time Hybrid Information Structurization. arXiv. https://arxiv.org/abs/2410.08815 — LLM router picks per-query structure (table / graph / catalogue); beats vanilla RAG on CS + science corpora.
  • Guo, Z. et al. (2024). LightRAG: Simple and Fast Retrieval-Augmented Generation. EMNLP Findings. https://arxiv.org/abs/2410.05779 — Lightweight KG with dual-level entity + theme retrieval; ~30% latency drop vs GraphRAG.
  • Gutiérrez, B. J., Shu, Y., Gu, Y. et al. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. NeurIPS. https://arxiv.org/abs/2405.14831 — KG + Personalized PageRank inspired by hippocampal indexing; strong on multi-hop QA at 10-30x lower cost than iterative RAG.

Agentic retrieval and tool use

  • Nakano, R. et al. (2022). WebGPT: Browser-assisted Question-Answering with Human Feedback. arXiv (OpenAI). https://arxiv.org/abs/2112.09332 — Fine-tuned GPT-3 browses the web and cites sources; early LLM-agent milestone.
  • Asai, A. et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. ICLR 2024. https://arxiv.org/abs/2310.11511 — Reflection tokens let the model decide when to retrieve and critique its own outputs.
  • Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR. https://arxiv.org/abs/2210.03629 — Interleaved reasoning traces and tool calls — the canonical agent-loop pattern.
  • Yan, S.-Q. et al. (2024). Corrective Retrieval Augmented Generation. arXiv. https://arxiv.org/abs/2401.15884 — Retrieval evaluator triggers refinement or web-search fallback — robustness against bad retrievals.
  • Singh, A. et al. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv. https://arxiv.org/abs/2501.09136 — 2025 survey of agentic-RAG systems — autonomous agents replacing static pipelines.

Biomedical question-answering benchmarks

  • Jin, Q. et al. (2019). PubMedQA: A Dataset for Biomedical Research Question Answering. EMNLP. https://arxiv.org/abs/1909.06146 — PubMed-abstract QA, yes/no/maybe format; canonical biomedical QA benchmark.
  • Wadden, D. et al. (2020). Fact or Fiction: Verifying Scientific Claims. EMNLP. https://arxiv.org/abs/2004.14974 — SciFact — scientific claims verified against retrieved abstracts, with annotated rationale sentences as evidence.
  • Krithara, A. et al. (2023). BioASQ-QA: A Manually Curated Corpus for Biomedical Question Answering. Nature Scientific Data. https://www.nature.com/articles/s41597-023-02068-4 — Curated biomedical QA — factoid / list / yes-no / summary types, annual shared task.

Appendix

A. Question bank

Every question that appears in this report, with its split, category, full text, and the gold PMIDs used to score citation recall. Scores for each question are in the Per-question results section below.

ID Split Category Question Gold PMIDs
D01 train definition What is the MSK-IMPACT panel, and what are its main panel versions referenced in the corpus? 37591896, 37078708, 38922339
D02 train definition Describe each of the four features in the Aerts CT radiomic signature and what aspect of tumor phenotype they capture. 24892406
D03 train definition What is the PRISSMM NLP curation framework and how is it used in the MSK-CHORD study? 39506116
D04 train definition What is the LymphGen classification system, and how was it validated in the corpus? 38497151
D05 train definition What is the ROBIN consortium, what projects does it run, and which diseases does it cover? 41941260
D06 train definition How is homologous-recombination deficiency (HRD) characterized in the HGSOC multimodal risk-stratification paper? 35764743
L01 train lookup In the MSK-CHORD study, what percentage of lung adenocarcinoma patients harbored SETD2 driver mutations? 39506116
L02 train lookup In the radiation-associated sarcoma cohort, which driver alteration dominates RT-angiosarcoma of the breast/chest wall, and at what frequency? 37350195
L03 train lookup In the ADC radiosensitization study, what tumor doubling-time increase was observed in HER2-positive OE19 xenografts receiving T-DM1 combined with 2.5 Gy x 3 radiation? 27698471
L04 train lookup In the ROBIN consortium preclinical data, which dietary intervention unexpectedly promoted radio-resistance? 41941260
L05 train lookup What four CT radiomic features make up the Aerts prognostic signature trained on NSCLC Lung1/MAASTRO? 24892406
L06 train lookup In the phase 2 IMRT + nivolumab trial for recurrent HNSCC, what was the 1-year progression-free survival rate? 38780927
L07 train lookup What targeted therapy is the subject of the FGFR3-altered urothelial carcinoma real-world outcomes study? 37682528
L08 train lookup In the peripheral T-cell lymphoma CHOP-based chemotherapy cohort, which mutation identified high-risk events? 37078708
L09 train lookup In the appendiceal adenocarcinoma molecular classification, which two RAS-pathway genes co-drive the mucinous subtype? 36493333
LS01 train list Which corpus papers profile patients using an MSK-IMPACT targeted panel (any version, including IMPACT341/410/468/505 or the generic msk-impact-panel)? 35764743, 36357680, 36493333, 36862133, 37078708, 37084736, 37315267, 37350195, 37477937, 37591896, 37643132, 37651310, 37682528, 37699004, 37730754, 37769223, 37910594, 38147626, 38488807, 38497151, 38630790, 38653864, 38758238, 38864854, 38922339, 38949888, 38995739, 39214094, 39289779, 39506116
LS02 train list Which corpus papers focus on head and neck squamous cell carcinoma (HNSC)? 24892406, 27698471, 30179230, 37397861, 38362943, 38780927, 41941260
LS03 train list List all corpus papers that analyze circulating tumor DNA (ctDNA) as a primary modality. 36357680, 37769223, 39147831, 39289779
LS04 train list Which corpus papers describe publicly released TCIA imaging datasets? 24892406, 28872634, 30179230, 30325352, 38362943
LS05 train list Which corpus papers analyze KRAS mutations as a driver or biomarker focus (not just incidental mention)? 36357680, 36493333, 37084736, 37591896, 37643132, 37699004, 38864854, 38922339, 39147831, 39214094, 39506116
LS06 train list Which corpus papers evaluate PD-1 or PD-L1 immune checkpoint blockade in patients? 36862133, 37406106, 37643132, 37651310, 38630790, 38653864, 38758238, 38780927, 38949888, 38995739, 41941260
S01 train synthesis Where in the corpus do papers disagree about whether a CT radiomic signature generalizes across institutions for head-and-neck cancer prognosis? Summarize both sides. 24892406, 37397861
S02 train synthesis How do the corpus papers characterize the long-term molecular sequelae of radiotherapy on previously-irradiated tissue? 37350195, 38117484
S03 train synthesis Across the corpus trials of PD-1 or PD-L1 blockade, what biomarkers of response appear, and do they agree across disease sites? 37406106, 38653864, 38780927, 38949888, 38995739
S04 train synthesis What evidence across the corpus links DNA-damage-response biology to aggressive or treatment-refractory phenotypes in multiple tumor types? 34493726, 37350195, 38488813, 39386723
S05 train synthesis Across the ctDNA papers in the corpus, what common utilities emerge, and where do the approaches differ in clinical question? 36357680, 37769223, 39147831, 39289779
S06 train synthesis Across the lymphoma papers in the corpus (CLL, Hodgkin, DLBCL, PTCL, PCNSL), what common themes emerge about how genomic subclassification informs clinical stratification? 35927489, 36723991, 37078708, 38497151, 38995739
S07 train synthesis What does the corpus say about the role of tumor volume as a baseline against which imaging-derived models must prove incremental prognostic value? 24892406, 35764743, 37397861
S08 train synthesis How do the two rhabdomyosarcoma papers in the corpus differ in their genomic findings and clinical framing? 37315267, 37730754
S09 train synthesis Across the corpus, what evidence links environmental radiation or UV exposure to acquired somatic alterations or epigenetic drift? 37350195, 38117484, 38895302, 39091884
D07 val definition What is the GLASS consortium and what does its IDH-mutant glioma methylation analysis contribute to the corpus? 38117484
D08 val definition What is the Hidden-Genome Classifier introduced in the cholangiocarcinoma paper? 38864854
L10 val lookup What concordance index did the Aerts four-feature CT radiomic signature achieve on the HNSCC H&N1 MAASTRO validation cohort? 24892406
L11 val lookup How many patients were included in the MSK-CHORD harmonized clinicogenomic dataset? 39506116
L12 val lookup In the GLASS consortium study of IDH-mutant gliomas, what fraction of GCIMP-high tumors progressed to GCIMP-low in the treated group versus the untreated group? 38117484
LS07 val list Which corpus papers use whole-genome sequencing as part of their methodology? 25730765, 35927489, 36723991, 37202560, 37730754, 38117484, 38412093, 38488813
LS08 val list Which corpus papers cover rhabdomyosarcoma (RMS)? 37315267, 37730754
S10 val synthesis How does the corpus characterize molecular disparities by race or ancestry in cancer genomics? 37651310
S11 val synthesis What limitations of PD-L1 (CD274) combined positive score as a predictive biomarker emerge across the corpus? 37406106, 38780927
S12 val synthesis What role does clonal hematopoiesis play in interpreting ctDNA or cfDNA findings according to the corpus? 38147626
D09 test definition What defines a radiation-associated sarcoma clinically, and which histotypes does the corpus examine? 37350195
D10 test definition What is the RADCURE dataset, how large is it, and what makes it distinctive compared with earlier public HNSCC imaging resources? 38362943, 37397861
L13 test lookup In the NSCLC brain metastasis study, which osimertinib-treated EGFR-mutant population was examined for CNS penetrance? 37591896
L14 test lookup In the extremity rhabdomyosarcoma study, which fusion gene pair defines the alveolar (ARMS) subtype? 37315267
L15 test lookup In the pleural mesothelioma near-haploidization study, what chromosomal signature defined the newly recognized subset? 38630790
LS09 test list Which corpus papers investigate BRAF alterations or BRAF fusions? 38412093, 38895302, 38922339, 39214094
LS10 test list Which corpus papers study prostate cancer (PRAD or PRNE)? 28045614, 38488813, 38949888, 39147831, 39506116, 41941260
S13 test synthesis How does the corpus address the use of real-world data and NLP to augment structured genomic data for outcome prediction? 39506116
S14 test synthesis Beyond breast and ovarian cancer, where does the corpus document BRCA1/BRCA2 alterations or HR-pathway involvement? 34493726, 37769223, 38412093, 38488813, 38949888, 39386723
S15 test synthesis Across the corpus, what consistent findings emerge about DDR or BRCA1/BRCA2 expression and outcomes in aggressive pediatric or young-adult cancers? 34493726, 37315267, 37730754

B. Per-category breakdown (train split)

Agentic mode

category     n   accuracy   completeness   cite_correct   citation_recall   wall_time_s
lookup       9   4.67       4.78           5.00           1.000             25.7
list         6   4.00       3.33           3.50           0.640             40.1
synthesis    9   4.22       4.22           4.11           0.830             91.4
definition   6   4.83       4.83           4.33           0.833             39.4

RAG mode

category     n   accuracy   completeness   cite_correct   citation_recall   wall_time_s
lookup       9   4.78       5.00           5.00           1.000             23.9
list         6   4.00       3.00           3.33           0.543             18.7
synthesis    9   3.89       2.89           3.44           0.565             40.6
definition   6   4.50       4.50           4.50           0.833             21.0
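The category rows recombine into the headline train means when weighted by the number of questions per category. The sketch below shows that consistency check; the tuple layout and the name weighted_mean are ours for illustration.

```python
COLUMNS = (
    "n", "accuracy", "completeness",
    "cite_correct", "citation_recall", "wall_time_s",
)

# Rows copied from the two per-category tables above.
AGENTIC_BY_CATEGORY = {
    "lookup": (9, 4.67, 4.78, 5.00, 1.000, 25.7),
    "list": (6, 4.00, 3.33, 3.50, 0.640, 40.1),
    "synthesis": (9, 4.22, 4.22, 4.11, 0.830, 91.4),
    "definition": (6, 4.83, 4.83, 4.33, 0.833, 39.4),
}
RAG_BY_CATEGORY = {
    "lookup": (9, 4.78, 5.00, 5.00, 1.000, 23.9),
    "list": (6, 4.00, 3.00, 3.33, 0.543, 18.7),
    "synthesis": (9, 3.89, 2.89, 3.44, 0.565, 40.6),
    "definition": (6, 4.50, 4.50, 4.50, 0.833, 21.0),
}


def weighted_mean(table, column):
    """Mean of `column` across categories, weighted by question count n."""
    idx = COLUMNS.index(column)
    total_n = sum(row[0] for row in table.values())
    return sum(row[0] * row[idx] for row in table.values()) / total_n
```

For example, weighted_mean(AGENTIC_BY_CATEGORY, "accuracy") rounds to 4.43 and weighted_mean(RAG_BY_CATEGORY, "completeness") rounds to 3.87, matching the headline table.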

C. Per-question results (train split)

Scores shown as accuracy / completeness / citation_correctness (each 1–5).

ID     Category     Agentic acc/comp/cite   RAG acc/comp/cite   Agentic s   RAG s   Winner
D01    definition   5/4/1                   5/5/2               18.4        5.8     rag
D02    definition   5/5/5                   5/5/5               86.5        22.6    tie
D03    definition   5/5/5                   4/3/5               55.6        23.6    agentic
D04    definition   4/5/5                   4/5/5               29.8        22.2    tie
D05    definition   5/5/5                   4/4/5               24.2        25.5    agentic
D06    definition   5/5/5                   5/5/5               22.2        26.5    tie
L01    lookup       5/5/5                   5/5/5               7.3         3.7     tie
L02    lookup       5/5/5                   4/5/5               6.2         28.7    agentic
L03    lookup       5/5/5                   5/5/5               27.1        22.6    tie
L04    lookup       5/5/5                   5/5/5               31.1        27.7    tie
L05    lookup       5/5/5                   5/5/5               44.1        26.5    tie
L06    lookup       5/5/5                   5/5/5               25.3        29.7    tie
L07    lookup       5/5/5                   5/5/5               31.2        25.4    tie
L08    lookup       5/5/5                   5/5/5               27.6        25.9    tie
L09    lookup       2/3/5                   4/5/5               31.3        24.6    rag
LS01   list         4/4/4                   4/3/4               20.6        5.0     agentic
LS02   list         5/5/5                   4/3/3               16.7        23.6    agentic
LS03   list         5/5/5                   4/3/3               30.1        36.9    agentic
LS04   list         2/1/1                   5/4/4               71.5        31.8    rag
LS05   list         4/3/3                   4/2/3               10.5        4.8     agentic
LS06   list         4/2/3                   3/3/3               91.2        10.4    tie
S01    synthesis    5/5/5                   4/5/5               164.4       38.9    agentic
S02    synthesis    5/5/4                   4/2/4               144.1       38.8    agentic
S03    synthesis    4/4/4                   3/2/3               114.4       55.8    agentic
S04    synthesis    1/1/1                   3/2/2               22.8        61.6    rag
S05    synthesis    4/5/4                   4/3/3               111.2       35.5    agentic
S06    synthesis    5/5/5                   4/3/3               86.7        35.5    agentic
S07    synthesis    4/3/4                   4/2/3               65.6        34.6    agentic
S08    synthesis    5/5/5                   5/5/5               29.4        26.8    tie
S09    synthesis    5/5/5                   4/2/3               83.7        38.0    agentic

D. Notable failures

Questions where either mode scored ≤ 2 on accuracy:

  • L09 (agentic, lookup) — acc=2 comp=3 cite=5
    • Judge: The answer incorrectly identifies NRAS as the co-driver with KRAS for the mucinous subtype, when the correct answer is KRAS and GNAS. While GNAS is mentioned in the answer, it is incorrectly described as being absent in the relevant subtype rather than being a co-driver of the mucinous subtype. The citation is correct.
  • LS04 (agentic, list) — acc=2 comp=1 cite=1
    • Judge: The answer appears to be an incomplete series of search attempts with no actual findings, conclusions, or cited PMIDs. No papers are identified or referenced, so the question is essentially unanswered despite the process narrative.
  • S04 (agentic, synthesis) — acc=1 comp=1 cite=1
    • Judge: The answer provides no substantive content, no facts, no citations, and no analysis. It only describes an intent to search without delivering any results or addressing the question.

E. Source data

  • Run directory: eval/results/2026-04-20-0446/
  • Raw records: runs.jsonl (60 rows)
  • Bibliography source: eval/bibliography.yaml
  • Narrative source: eval/narrative.md
  • Regenerate this page: uv run python -m eval.build_report --run-dir … --val-dir … --test-dir …