Publication Issues Tracker 2
Publication Issues Tracker (continued)
Continuation of `docs/publication_issues_tracker.md`. New issues are logged here to keep file sizes manageable.
Last updated: 2026-04-16
Issue index
| # | Title | Short description | Status |
|---|---|---|---|
| 52 | “All Arms” view join fans out across all trial_arm_interventions, duplicating dose | View v23 lines 474-475 match every drug_intervention to “All Arms” outcomes regardless of which arm the intervention belongs to. Creates duplicate rows with conflicting dose profiles (RP2D vs escalation range). ~4,405 pubs with conflicting dose, ~15k row groups. Fixed: v24 eliminates fan-out (agreement-only dose), v25 adds numeric dose envelope via dose_value_numeric column. v26 adds “prefer non-escalation” aggregation (652 groups narrowed). Prompt v3 + o4-mini for 16 pubs missing escalation arms. | Complete |
| 53 | Multi-cohort pubs have per-cohort AEs with no subgroup linkage | ~1,883 pubs with multiple disease cohorts (e.g. CLL + MCL) report AEs per-cohort but the data model has no way to link adverse_events to trial_subgroups. Per-cohort values get flattened into “All Arms” and mixed. Needs schema change + re-classification. | Open |
| 54 | Zero-sentinel ORR (ORR = 0) when PR% > 0 for same subgroup | LLM extracts measure_value=0 for ORR on subgroups where abstract reports PR but not ORR explicitly. Fixed via post-process validation: ORR=0% overridden with PR%+CR% when same-confirmed-flag PR% > 0 (or derived from counts). Guards: confirmed-flag match, PR>0 required, sibling non-zero ORR skip. Backfill task for existing data. | Complete |
| 55 | Secondary-analysis cross-tabulated subgroups not extracted (SD × MR stratification) | When abstract reports efficacy for a secondary analysis population (e.g. Stable Disease patients) further stratified by a biomarker (e.g. molecular response), the cross-tabulated subgroups are not extracted. Persists on classify_publications_version=1. Related to Issue 38 pattern. | Complete |
| 56 | Cross-tabulated subgroups missed in run-on table format (Issue 33/43 residual) | extract_subgroups (v2) has correct cross-tab instructions (Step 2b) but LLM fails to parse run-on HTML-stripped tables where the matrix data has no clear delimiters. Single-dimension subgroups extracted from prose, cross-products from table body missed. Persists on sev=2. | Investigation complete |
| 57 | Subgroup disease misattribution: biomarker-named cohorts matched to wrong disease | Disease matching pipeline (PublicationDiseaseWorkflow) assigns breast cancer disease IDs to biomarker-named subgroups (e.g. “Cohort A (HER2 IHC3+)”) on non-breast pubs. Subgroups filtered out by query disease filter → cohorts invisible in report. 2,368 cross-family mismatches across 1,574 pubs. | Complete |
| 58 | extract_dose_evidence v2 sets dose_min/dose_max to full escalation range instead of RDE range | In escalation→expansion studies, extract_dose_evidence correctly identifies dose_context_type=escalation and rp2d, but sets dose_min/dose_max to the full Part 1a range (e.g. 100-1500 mg) instead of the Part 1b RDE range (e.g. 300-900 mg). Arm-level dose bypasses the view’s pub_dose_lookup gate. Issue 29 residual. Verified 2026-04-16: all spot-check pubs correct, 16 Phase 3b pubs re-extracted with separate arms. | Complete |
| 59 | LLM miscounts partial responses: tumor reductions below RECIST threshold counted as PRs | classify_publications extracts PR count = number of patients with ANY tumor reduction, not just ≥30% reduction (RECIST v1.1 PR threshold). Inflates ORR. Also: escalation-phase responses attributed to expansion cohort with wrong N. Scale TBD. | Open |
| 60 | Overall ORR value attributed to per-cohort subgroup instead of Overall | LLM assigns the overall population ORR to a cohort subgroup when both responses came from that cohort. Cohort gets the overall rate (e.g. 9% = 2/23) instead of the cohort-specific rate (13.3% = 2/15). ~2-5 pubs. Fixed via post-process validation + prompt reinforcement. | Complete |
| 61 | Non-canonical endpoint abbreviations (OR, RR, BOR, mPFS, etc.) invisible to query | LLM outputs abbreviation variants that don’t match the query’s canonical list. Fixed via abbreviation normalization in post-process + backfill of 1,019 existing trial_endpoints. | Complete |
| 62 | Population-subgroup cuts modeled as fake trial_arms (ethnicity/analysis-set pseudo-arms + garbage RP2D) | intervention_extraction v1/v2 prompt lets the LLM create trial_arms for population cuts (Asian / non-Asian / overall / responders / ITT) instead of only real treatment arms. These pseudo-arms receive the pooled ORR/PFS, dose arms go empty, and the TAI rp2d field gets stuffed with a comma-separated dose list. Fixed by v3 prompt guardrails (no prompt change needed, verified on pub 190844). ~1,133 stale pubs (163 v1 + 98 v2 + 872 null-version). Reset task + MIN_REPROCESS_VERSION=3 pending. | Fix ready, backfill pending |
Each issue entry should keep analysis and remediation separate.
Recommended issue structure:
1. Short summary
2. Where this sits in the current pipeline
3. Exact restriction causing the drop
4. Concrete examples
5. Downstream impact
6. What the issue is not
7. Scale
8. Spot checks
9. Open characterization questions
10. Explored solution direction
11. Solution applied
Solution applied should remain empty until an actual fix is agreed and implemented.
Backfill pattern: When an issue requires backfilling historical data, see the “One-Off Backfill Tasks” section in `.claude/skills/backend-expert/SKILL.md`.
52. “All Arms” view join fans out across all trial_arm_interventions, duplicating dose
Short summary
View v23’s drug join for “All Arms” outcomes uses a broad match (`aoe.arm_name = 'All Arms'`) that matches every `trial_arm_intervention` for the publication, not just the relevant one. When a publication has multiple dose-level arms (e.g., dose escalation with range 4.8-16.0 mg/kg AND dose expansion with `single_dose` 12.0 mg/kg), each “All Arms” outcome row gets duplicated with conflicting dose profiles. The query layer then takes `first_row`, making the dose shown effectively non-deterministic.
Where this sits in the current pipeline
View layer: `db/views/vw_publication_efficacy_data_v23.sql`, lines 470-475 (drug join for Source 0 “All Arms”):

```sql
-- Source 0 (trial_arm_interventions): direct trial_arm_id match
(di.trial_arm_id IS NOT NULL AND aoe.trial_arm_id IS NOT NULL AND di.trial_arm_id = aoe.trial_arm_id)
-- Source 0 for "All Arms" outcomes (trial_arm_id set but no interventions on that arm)
OR (di.trial_arm_id IS NOT NULL AND aoe.trial_arm_id IS NOT NULL AND aoe.arm_name = 'All Arms')
```

Query layer: `app/queries/tpp/clinical_evidence_query.rb`, line 386, where `build_single_row` takes dose from `first_row`:

```ruby
single_dose: first_row['single_dose'],
dose_min: first_row['dose_min'],
dose_max: first_row['dose_max'],
rp2d: first_row['rp2d'],
```

Exact restriction causing the drop
The `aoe.arm_name = 'All Arms'` clause is a catch-all: it matches any `drug_intervention` row that has a `trial_arm_id` set, regardless of which arm the intervention belongs to. When multiple arms have interventions with different dose profiles, the fan-out creates N duplicate rows per outcome (one per intervention).
The arm_dose_lookup CTE then joins on trial_arm_intervention_id, pulling each intervention’s distinct dose. The deduped_rows CTE (SELECT DISTINCT *) doesn’t collapse these because the dose columns differ.
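The fan-out can be reproduced outside the database. The following is a minimal Ruby simulation, not the view's SQL: the hashes and the `join_v23` helper are hypothetical stand-ins, the dose values are those reported for pub 74158, the intervention arm ids (1 and 2) are invented for illustration, and the `IS NOT NULL` checks from the real predicate are omitted.

```ruby
# Hypothetical in-memory rows; dose values are taken from pub 74158,
# the intervention arm ids (1 and 2) are made up for illustration.
interventions = [
  { tai_id: 33097, trial_arm_id: 1, single_dose: nil,  dose_min: 4.8, dose_max: 16.0 },
  { tai_id: 33098, trial_arm_id: 2, single_dose: 12.0, dose_min: nil, dose_max: nil }
]
outcomes = [
  { endpoint: 'ORR', trial_arm_id: 25_400, arm_name: 'All Arms' }
]

# Mirrors the v23 predicate: direct arm match OR the "All Arms" catch-all.
def join_v23(outcomes, interventions)
  outcomes.flat_map do |aoe|
    matches = interventions.select do |di|
      di[:trial_arm_id] == aoe[:trial_arm_id] || aoe[:arm_name] == 'All Arms'
    end
    # Keep only the dose columns from the intervention side of the join.
    matches.map { |di| aoe.merge(di.slice(:single_dose, :dose_min, :dose_max)) }
  end
end

# Array#uniq plays the role of SELECT DISTINCT *: the rows differ in their
# dose columns, so nothing collapses.
rows = join_v23(outcomes, interventions).uniq
puts rows.size # prints 2
```

One outcome becomes two rows, one carrying the RP2D and one carrying the escalation range, which is the shape `first_row` then picks from.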
Concrete examples
Pub 74158 (DS-7300 / Ifinatamab deruxtecan, B7-H3 ADC phase I/II extended follow-up):
- Two `trial_arm_interventions`:
  - TAI 33097 on arm “Dose escalation (4.8-16.0 mg/kg)”: `dose_min=4.8, dose_max=16.0, dose_context_type=escalation`
  - TAI 33098 on arm “Dose expansion (12.0 mg/kg)”: `single_dose=12.0, rp2d=12.0, dose_context_type=rp2d`
- sqNSCLC outcomes are on arm “All Arms” (arm 25400), which has NO interventions
- View produces duplicate rows: some with `single_dose=12.0 mg/kg`, others with `dose_min=4.8 / dose_max=16.0`
- Abstract explicitly states efficacy is pooled across 4.8-16.0 mg/kg cohorts, so the range is correct, but the query may show 12.0 mg/kg (RP2D) depending on sort order
- Audit issue 8512 flagged `single_dose=12.0 mg/kg` as incorrect for sqNSCLC
Downstream impact
- Clinical Evidence report shows a non-deterministic dose for “All Arms” rows: whichever duplicate sorts first wins
- The RP2D (`single_dose`) may be shown when the efficacy population spans the full dose range, misrepresenting the study population
- Row duplication in the view inflates materialized view size and may cause subtle metric selection issues in `extract_efficacy_metrics`
What the issue is not
- Not a data extraction issue: the LLM correctly identified both dose arms
- Not an arm-linking issue (Issue 49) — the “All Arms” arm intentionally has no interventions
- Not the same as Issue 31 (dose bleed onto control arms via COALESCE) — this is a JOIN fan-out, not a fallback chain
- Not the same as Issue 51 (per-arm dose not populated) — here both arms have correct dose, but both are matched to outcomes they don’t belong to
Scale

| Metric | Count |
|---|---|
| Pubs with “All Arms” outcomes + conflicting dose profiles across different arms (single/rp2d vs range) | 715 |
| Pubs where the view produces duplicate rows with different dose variants for “All Arms” outcomes | 2,776 |
| Total duplicate row groups in the view | 15,025 |
Spot checks
- Pub 74158: Confirmed. sqNSCLC rows duplicated with RP2D (12.0) and range (4.8-16.0). Abstract pools all cohorts.
- Pub 65575 (ozuriftamab vedotin, BA3021 R/M SCCHN): Two dosing cohorts: Q2W (1.8 mg/kg) and 2Q3W (dose unspecified). “All Arms” rows for the pooled “Evaluable patients” subgroup pick up `single_dose=1.8 mg/kg` and `dose_frequency=Q2W` from the Q2W arm’s intervention, even though the pooled population includes both Q2W and 2Q3W patients. Audit issues 8523/8524 flagged this correctly.
- Pub 29738 (IMMU-132 / sacituzumab govitecan, Trop-2 ADC phase I/II): Worst-case fan-out: 5 TAIs with different dose levels (8, 10, 12, 18 mg/kg + duplicates with different ranges) all fan out onto “All Arms” outcomes. Produces 25 view rows for a single subgroup (5 endpoints × 5 dose variants). Query non-deterministically picks one TAI’s dose via `first_row`. Audit issues 8560/8561 flagged the non-deterministic dose.
- Pub 29737 (IMMU-132 / sacituzumab govitecan, phase I/II GI cancers): 3 TAIs (8, 10 mg/kg `single_dose` + one with `dose_min=8`/`dose_max=10` range) → 75 view rows across 5 disease subgroups. CRC efficacy is pooled across both dose levels but query picks `single_dose=10 mg/kg` from one TAI. Audit issues 8558/8559.
- Pub 29700 (ABBV-400 / telisotuzumab adizutecan, CRC dose escalation + expansion): 3 dose-level arms (1.6, 2.4, 3.0 mg/kg). “CRC → Low c-Met expression” and “CRC → High c-Met expression” subgroups are on “All Arms” → triplicated via fan-out (3 TAIs × 1 ORR endpoint = 3 identical rows per subgroup). Audit issues 8537/8538.
- Pub 48880: “Overall” on pooled 5.4+6.4 mg/kg arms gets `single_dose=5.4 mg/kg` from one TAI. Audit issue 8563.
- Pub 114571 (HER2+ mCRC): Two dose levels (6.3 mg/kg n=32, 8.4 mg/kg n=1). Subgroup gets `single_dose=6.3`, ignoring the second level. Audit issue 8578.
- Pub 238624 (EV + pembrolizumab, R/M HNSCC): Two dose levels (1.25 mg/kg 2Q3W, 1.5 mg/kg 2Q3W). Gets `single_dose=1.25`. Audit issue 8519.
- Pub 242943 (becotatug vedotin, R/M SCCHN): Either 2.0 or 2.3 mg/kg Q3W. Gets `single_dose=2.0`. Audit issue 8533.
Open characterization questions
- How many of the 2,776 pubs have the “wrong” dose surfaced by the query (i.e., `first_row` picks RP2D when range would be more accurate)?
- Should “All Arms” outcomes inherit dose from the broadest-range intervention, or should dose be null/aggregated?
- Does `dose_context_type` (escalation vs rp2d) provide enough signal to pick the right intervention?
Explored solution direction
Option A (view-layer fix): When `aoe.arm_name = 'All Arms'`, prefer the intervention with `dose_context_type = 'escalation'` (or the one with a `dose_min`/`dose_max` range) over `single_dose`/`rp2d`. Could use a `ROW_NUMBER()` window with a priority ordering.
Option B (query-layer fix): In `build_single_row`, when multiple view rows exist for the same outcome with different dose profiles, prefer the range (`dose_min`/`dose_max`) over `single_dose` for “All Arms” rows, since pooled data is better described by the range.
Option C (data-model fix): Link “All Arms” outcomes to all relevant arms explicitly (via `trial_arm_id` or a new junction), then aggregate dose at the query level.
Option A is the most targeted fix with lowest risk. Option B is simpler but doesn’t address the view duplication. Option C is the cleanest long-term but highest effort.
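The priority ordering behind Option A can be sketched in Ruby as follows. This illustrates the explored option only, not what was eventually shipped; the helper name, the priority map, and the fallback rank of 99 are assumptions, and the sample rows reuse the pub 74158 TAIs.

```ruby
# Ruby analogue of ROW_NUMBER() OVER (ORDER BY priority, tai_id) = 1.
def pick_all_arms_intervention(interventions)
  priority = { 'escalation' => 0, 'rp2d' => 1 } # assumed ordering; lower wins
  interventions.min_by do |di|
    # Unknown context types sort last; tai_id breaks ties deterministically.
    [priority.fetch(di[:dose_context_type], 99), di[:tai_id]]
  end
end

tais = [
  { tai_id: 33098, dose_context_type: 'rp2d', single_dose: 12.0 },
  { tai_id: 33097, dose_context_type: 'escalation', dose_min: 4.8, dose_max: 16.0 }
]
chosen = pick_all_arms_intervention(tais)
puts chosen[:tai_id] # prints 33097
```

Because the tie-break is on `tai_id`, the choice no longer depends on row order, which is the property `first_row` lacks.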
Solution applied
Phase 1 (2026-04-13, first deploy):
- Implemented extraction-layer fix in `app/tasks/publications_llm_classification/dose_evidence_extraction.rb`:
  - `DOSE_EVIDENCE_VERSION = 3`
  - `MIN_REPROCESS_VERSION = 2` so old rows are not automatically re-run unless explicitly reset
  - Prompt now instructs the LLM to:
    - extract from abstract text, not arm name
    - use the RDE / expansion range for `dose_min`/`dose_max` when expansion doses are identified
    - avoid carrying escalation starting doses into efficacy-population dose fields
    - keep fixed-dose arms as `single_dose` without contaminating them with cross-arm ranges
- Added a forward-only intervention extraction guard in `app/tasks/publications_llm_classification/intervention_extraction.rb`:
  - `PROMPT_VERSION = 2`
  - `MIN_REPROCESS_VERSION = 1`
  - Prompt now tells the LLM not to invent a canonical arm from the first listed escalation dose and not to create separate 100/300/600-style arms unless the abstract reports them as distinct analyzed cohorts.
  - This is a forward fix only; no intervention-extraction backfill is planned as part of Issue 58.
- Prod remediation: reset escalation/range TAIs (job 1864), re-extracted (job 1865).
Phase 1 residual: Spot-check revealed ~1,332 TAIs (361 pub+drug combos, ~25.6% of escalation TAIs) still had the full escalation range leaked onto per-cohort arms. Root cause: the prompt’s “multi-arm fixed-dose” rule used the word “fixed-dose”, so the LLM skipped it when the abstract said “escalation” — even though each arm was a specific dose-level cohort.
Phase 2 (2026-04-13, second deploy):
- Added explicit “per-cohort arms” rule to the dose evidence prompt: when the input contains multiple interventions for the same drug at different dose levels, each intervention should use `single_dose` for its specific dose, not `dose_min`/`dose_max` for the study-wide range. Only use `dose_min`/`dose_max` when a single intervention represents the entire escalation.
- Tested locally against 9 publications (4 previously broken + 5 regression checks). All 4 broken cases fixed, no regressions.
- Prod remediation: reset escalation/range TAIs again (reused job 1864 task), re-extracted (job 1865 re-run).
Phase 2 verification: Per-cohort range leak dropped from 1,332 TAIs → 190 TAIs (~86% reduction). The remaining 190 TAIs across 78 pub+drug combos were investigated and confirmed to be false positives — these are arms split by treatment context (monotherapy vs combo, disease cohort, dosing schedule), not by dose level. Each arm genuinely received a range of doses. Examples:
- OH2 (pub 1504): monotherapy vs combo arms, both escalated across the same range
- Nivolumab (pub 30263): arms split by disease (Melanoma, NSCLC, RCC), each received a dose range
- Olaparib (pub 36921): continuous vs intermittent schedule arms, each spanning the dose range
Audit issues 8530/8531 (pub 115389) confirmed fixed after mat view refresh — DL1/DL2 arms now show correct single_dose per arm with null dose_min/dose_max.
Phase 2 gap identified: The reset task only targeted dose_context_type IN ('escalation', 'range'). TAIs with other context types (weight_based, fixed, bsa_based, rp2d, unknown) were left on v2 — 73,130 TAIs / 19,243 pubs. Examples:
- Pub 116843 (audit issues 8551-8553): TAIs have `dose_context_type=weight_based` (v2), still show full study range `dose_min=1.6, dose_max=2.4` on per-cohort arms
- Many v2 TAIs may have the same per-cohort range leak or other v2 prompt issues
Phase 3 (2026-04-13, deployed + executed):
- Bumped `MIN_REPROCESS_VERSION` from 2 → 3 so `base_scope` picks up all v1/v2 TAIs automatically on the next extraction run
- No separate reset task needed: the extraction’s own scope handles version filtering
- Prod execution: job 1873 (`extract_dose_evidence --batched --parallelism=5`), completed in 68 minutes (19:42–20:50 UTC)
- Result: 92,334 TAIs on v3 (99.87%). 122 TAIs remain on v1/v2, all excluded by the `target_disease_or_hemonc_relevant` scope (not missed by the job)
- Only 2 v3 TAIs show the old range-leak pattern (escalation + min!=max + single_dose set); both are radiation therapy pubs (proton/carbon-ion) where the pattern is legitimate
Phase 3 spot-check verification:
- Pub 74158 (DS-7300): Escalation arm has correct range (4.8–16.0). Expansion arm uses `single_dose=12.0` with `rp2d` context. Per-cohort arms (8.0, 16.0) use `single_dose` with null min/max. Fixed.
- Pub 116843 (Temab-A, weight_based gap): Per-cohort arms now use `single_dose` (2.0, 2.4 mg/kg) with null min/max. Old `dose_min=1.6, dose_max=2.4` range leak gone. Fixed.
- Pub 115389 (DL1/DL2, audit issues 8530/8531): DL1 `single_dose=2.0`, DL2 `single_dose=2.3`, both null min/max. Fixed.
- Pub 114973 (MRG003, audit issue 8536): Per-cohort arms (3.6, 5.4) use `single_dose` with null min/max. Fixed.
- Pub 238709 (telisotuzumab vedotin, audit issue 8585): `>=4.0 mpk` subset has `dose_min=4.0, dose_max=5.0` with `rp2d` context. No longer inheriting full 1.0–8.0 escalation range. Fixed.
Phases 1-3 status: Complete. Phase 1 fixed prompt (v3) + reset escalation/range TAIs. Phase 2 added per-cohort rule + re-extracted. Phase 3 re-extracted all remaining v1/v2 TAIs to v3. All previously-flagged examples verified fixed.
Phase 4 (2026-04-14): View-layer fan-out elimination
- Created `vw_publication_efficacy_data_v24.sql`:
  - New `all_arms_drug_agg` CTE: aggregates `drug_interventions` Source 0 rows to one row per `(publication_id, drug_id)` for “All Arms” outcomes. Uses `MIN(trial_arm_intervention_id)` for deterministic representative TAI selection.
  - New `all_arms_dose_agg` CTE: agreement-only dose aggregation. Each dose field (`single_dose`, `dose_min`, `dose_max`, `rp2d`, `dose_units`, `dose_frequency`) kept if all TAIs agree (`COUNT(DISTINCT) <= 1`), NULL otherwise. No text parsing.
  - Removed “All Arms” OR clause from drug join (old lines 473-475). “All Arms” outcomes now join to `all_arms_drug_agg` instead of individual `di` rows.
  - Per-arm outcomes completely unchanged: they still use the direct `trial_arm_id` match.
- Eliminates ~15,025 duplicate row groups from the materialized view.
- Multi-dose pooled populations show NULL dose (honest) instead of random wrong dose.
- Migration: `20260414200000_update_vw_publication_efficacy_data_to_version_24.rb`
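The agreement-only rule can be illustrated outside SQL. The sketch below is a Ruby analogue under stated assumptions, not the CTE itself: `agreement_only_dose` is a hypothetical helper, and the sample rows model the two pooled dose levels of pub 48880 with assumed text values.

```ruby
# Keeps a dose field only when every TAI that has a value agrees on it.
# nil values are ignored, mirroring how SQL COUNT(DISTINCT ...) skips NULLs.
def agreement_only_dose(tais)
  fields = %i[single_dose dose_min dose_max rp2d dose_units dose_frequency]
  fields.to_h do |field|
    distinct = tais.map { |t| t[field] }.compact.uniq
    [field, distinct.size == 1 ? distinct.first : nil]
  end
end

# Two pooled dose levels, as in pub 48880 (5.4 and 6.4 mg/kg).
pooled = agreement_only_dose([
  { single_dose: '5.4 mg/kg', dose_units: 'mg/kg' },
  { single_dose: '6.4 mg/kg', dose_units: 'mg/kg' }
])
p pooled[:single_dose] # nil, because the two TAIs disagree
p pooled[:dose_units]  # "mg/kg", because both TAIs agree
```

Disagreement collapses to `nil` rather than to an arbitrary winner, which matches the "NULL dose (honest)" behavior described for v24.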
Phase 5 (2026-04-14): Numeric dose column + backfill
- Added `dose_value_numeric` (numeric, nullable) column to `trial_arm_interventions`.
- Regex backfill task (`lib/tasks/one_off/backfill_dose_value_numeric.thor`): extracts leading number from `single_dose` text (e.g., “10 mg/kg” → 10.0, “1,200 mg” → 1200.0). Covers 94.8% of TAIs (25,767/27,194 locally). Non-parseable values (“AUC 5”, “low-dose”) → NULL.
- LLM backfill task (`lib/tasks/one_off/backfill_dose_value_numeric_llm.thor`): sends remaining ~1,427 non-standard cases to GPT-5-mini for numeric extraction. Handles scientific notation (“10^7 pfu/mL” → 10000000.0), AUC (“AUC 5” → 5.0), approximate (“~200 mg” → 200.0), and correctly NULLs non-dose text (“low-dose”, “three sessions”).
- Updated `extract_dose_evidence` prompt + schema + post-processing to output `dose_value_numeric` for new extractions.
- Migration: `20260414210754_add_dose_value_numeric_to_trial_arm_interventions.rb`
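A leading-number parse of this kind might look like the sketch below. The pattern and the `parse_dose_value_numeric` helper are assumptions for illustration, not the code in the backfill task; in particular, rejecting "10^7"-style text so it falls through to the LLM pass is a guess at how the two passes divide the work.

```ruby
# Regex-only parse of a single_dose string. Returns nil when the text does not
# start with a plain number, leaving it for a later pass.
def parse_dose_value_numeric(text)
  return nil if text.nil?

  # Leading integer (optionally with thousands separators) plus optional
  # decimals. The lookahead rejects partial matches and "10^7"-style notation.
  pattern = /\A\s*(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?(?![\d.,]|\s*\^)/
  match = pattern.match(text)
  return nil unless match

  "#{match[1].delete(',')}#{match[2]}".to_f
end

p parse_dose_value_numeric('10 mg/kg')  # 10.0
p parse_dose_value_numeric('1,200 mg')  # 1200.0
p parse_dose_value_numeric('AUC 5')     # nil
p parse_dose_value_numeric('low-dose')  # nil
```

Anchoring at the start of the string is what keeps "AUC 5" out of the regex pass: the 5 is a number, but not a leading one.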
Phase 6 (2026-04-14): View update for numeric dose envelope
- Created `vw_publication_efficacy_data_v25.sql`:
  - Updated `all_arms_dose_agg` CTE to compute `MIN(dose_value_numeric)` / `MAX(dose_value_numeric)` across TAIs per `(publication_id, drug_id)`.
  - Gated on `dose_units` agreement: mixed units (e.g., “mg/kg” vs “mg”) → NULL envelope.
  - Maps numeric min/max back to text for display via `ARRAY_AGG(single_dose ORDER BY dose_value_numeric)`.
  - Dose COALESCE: arm dose → agreement-only aggregate → numeric envelope text → subgroup fallback.
- Multi-dose pooled populations now show correct dose range (e.g., `dose_min=5.4 mg/kg, dose_max=6.4 mg/kg`).
- Validated locally: pub 48880 (5.4–6.4 mg/kg), pub 242943 (2.0–2.3 mg/kg), per-arm pubs unchanged.
- Migration: `20260414220000_update_vw_publication_efficacy_data_to_version_25.rb`
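The unit-gated envelope can be sketched in Ruby as below. This is an analogue of the described CTE logic under assumptions, not the view SQL: `dose_envelope` is a hypothetical helper and the sample rows model pub 48880 with assumed column values.

```ruby
# Returns the display envelope for pooled TAIs, or nil when units disagree
# or no TAI has a numeric dose.
def dose_envelope(tais)
  units = tais.map { |t| t[:dose_units] }.compact.uniq
  return nil unless units.size == 1 # mixed or missing units: no envelope

  numeric = tais.reject { |t| t[:dose_value_numeric].nil? }
  return nil if numeric.empty?

  # Sorting by the numeric column and taking the ends maps min/max back to
  # the original text, like ARRAY_AGG(single_dose ORDER BY dose_value_numeric).
  sorted = numeric.sort_by { |t| t[:dose_value_numeric] }
  { dose_min: sorted.first[:single_dose], dose_max: sorted.last[:single_dose] }
end

envelope = dose_envelope([
  { single_dose: '6.4 mg/kg', dose_value_numeric: 6.4, dose_units: 'mg/kg' },
  { single_dose: '5.4 mg/kg', dose_value_numeric: 5.4, dose_units: 'mg/kg' }
])
p envelope # {:dose_min=>"5.4 mg/kg", :dose_max=>"6.4 mg/kg"} (exact formatting varies by Ruby version)
```

Comparing on the numeric column rather than the text avoids string-ordering surprises such as "10 mg/kg" sorting before "8 mg/kg".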
Phases 4-6 status: Deployed. v24, v25, and v26 all live in production. Regex backfill (job 1914) and LLM backfill (job 1915) completed. 16 Phase 3b pubs reset (job 1923) and re-extracted (jobs 1924, 1922, 1921).
Production verification (2026-04-16):
- Pub 29738: 25 view rows → 5 (1 per endpoint). Dose correctly shows aggregated range `dose_min=8 mg/kg, dose_max=10 mg/kg`.
- Pub 48880: “Overall” on “All Arms” shows `dose_min=5.4, dose_max=6.4 mg/kg` (was non-deterministically picking 5.4).
- Pub 114571: `dose_min=6.3, dose_max=8.4 mg/kg` (was ignoring the 8.4 arm).
- Pub 238624: EV shows `dose_min=1.25, dose_max=1.5 mg/kg` (was picking 1.25 only). Pembrolizumab correctly on separate row at 200 mg.
- Pub 242943: v26 prefer-non-escalation selects rp2d TAI (2.3 mg/kg), excluding escalation TAI (2.0 mg/kg).
- Pub 73299: escalation subgroup no longer inherits expansion dose; outcomes are correctly on a separate “Part 1a dose escalation” arm.
- No remaining single-drug fan-out duplicates on “All Arms”. Multi-drug combo rows (1,551 pubs) are expected behavior (each drug gets its own dose row).
Status: Complete.
54. Zero-sentinel cORR (confirmed ORR = 0) persists after Issue 49 re-extraction
Short summary
LLM extracts `measure_value=0` (or `0.0`) for confirmed ORR on subgroups where the abstract reports confirmed PR (cPR) but does not explicitly state confirmed ORR. Since CR = 0 for these subgroups, cORR should equal cPR (ORR = CR + PR), but the LLM sets cORR = 0 instead of null or the derived value. This persists on `classify_publications_version=1`: pubs 29704 and 235204 were re-extracted in the Issue 49 backfill and still have cORR = 0.
Where this sits in the current pipeline
Extraction layer: `app/tasks/publications_llm_classification/task.rb`: `classify_publications` extracts ORR with `confirmed: true` and `measure_value: 0` when the abstract doesn’t explicitly state cORR per subgroup. The current prompt (version 1) does not instruct the LLM to derive ORR from CR + PR when ORR is not explicitly stated.
View layer: `vw_publication_efficacy_data` surfaces these as `endpoint_abbreviation='ORR', confirmed=true, measure_value='0.0'`.
Query layer: `app/queries/tpp/clinical_evidence_query.rb` lines 666–683: `extract_efficacy_metrics` picks these confirmed ORR = 0 rows for the cORR metric, producing `corr.value = 0%` in the report.
Exact restriction causing the drop
The `classify_publications` prompt does not instruct the LLM to:
- Leave ORR null when only PR/CR are reported per subgroup (instead it outputs 0)
- Derive ORR = CR + PR when the abstract reports individual response types but not the aggregate
This is not a backfill scope gap — pubs 29704 and 235204 have classify_publications_version=1 and subgroup_extraction_version=2, confirming the Issue 49 re-extraction ran on them. The current prompt still produces zero-sentinel cORR.
Concrete examples
Pub 29704 (ABBV-400 ctDNA + CRC, telisotuzumab adizutecan):
Abstract reports overall confirmed ORR = 18% (20/113), then a biomarker subgroup table with confirmed PR (cPR) rates per genomic alteration (e.g. BRAF mutant: 4/14 = 29%, KRAS mutant: 11/73 = 15%, TMB high: 11/48 = 23%).
View has, for each of the 10 biomarker subgroups (BRAF mutant values shown):
- `CR confirmed=true, measure_value=0` (correct: no CRs reported)
- `PR confirmed=true, measure_value=29` (correct: matches abstract cPR)
- `ORR confirmed=true, measure_value=0.0` (wrong: should be 29%, i.e. CR + PR)
All 10 subgroups have the same pattern. The overall population row correctly has cORR = 18%.
Pub 29700 (ABBV-400 FIH CRC): Same pattern — cORR = 0 for CRC subgroup with N=32.
Pub 65575 (ozuriftamab vedotin BA3021, R/M SCCHN): Two dosing cohorts — Q2W (n=12, 1 CR + 2 PR) and 2Q3W (n=13, 5 PR). LLM extracted CR/PR/SD per cohort but NOT ORR. Since ClinicalEvidenceQuery drops rows without primary efficacy endpoints (ORR/PFS/OS/DOR — line 315: rows.select { |r| r[:efficacy].present? }), both per-cohort arms vanish entirely from the query output. The abstract clearly reports per-cohort response data. Audit issues 8525/8526 flagged this as missing subgroups. This is the most severe manifestation: not just wrong cORR, but entire arms invisible to the report.
Downstream impact
- Clinical Evidence report shows cORR = 0% for biomarker subgroups where the true confirmed response rate is 15–29%
- Entire per-arm rows disappear when the LLM extracts CR/PR but not ORR, because the query’s efficacy filter drops them (pub 65575)
- 10 audit issues on pub 29704 alone (all `incorrect_value` on `efficacy.corr.value`)
- Understates drug efficacy in biomarker-stratified analyses
What the issue is not
- Not Issue 25 (confirmed flag confusion): the `confirmed=true` flag is correct, the value is wrong
- Not Issue 27 (query picking confirmed as plain ORR): here the confirmed value itself is zero-sentinel
- Not Issue 36 (cORR = ORR copy): here cORR = 0, not a copy of unconfirmed ORR
- Not a genuine 0% ORR: confirmed PR > 0 or unconfirmed ORR > 0 exists for the same subgroup
Scale

| Metric | Count |
|---|---|
| Pubs with confirmed ORR `measure_value` = 0 or 0.0 | 161 |
| Total confirmed ORR = 0 records across those pubs | 211 |
| Definitive zero-sentinels (confirmed PR > 0 OR unconfirmed ORR > 0 for same subgroup/arm) | 12 pubs / 23 records |
| Ambiguous (may be genuine 0% or zero-sentinel, needs abstract check) | ~149 pubs / ~188 records |
| Pubs already on current `prompt_version` | 0 of 161 |
```sql
-- All pubs with confirmed ORR = 0
SELECT DISTINCT tom.source_id AS pub_id
FROM trial_outcome_measures tom
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
JOIN trial_arm_outcomes tao ON tao.trial_outcome_measure_id = tom.id
WHERE tom.source_type = 'Publication'
  AND te.abbreviation = 'ORR'
  AND tom.confirmed = true
  AND (tao.measure_value = '0' OR tao.measure_value = '0.0');
```

```sql
-- Definitive zero-sentinels only (confirmed PR > 0 or unconfirmed ORR > 0 for same subgroup)
SELECT DISTINCT tom_corr.source_id AS pub_id
FROM trial_outcome_measures tom_corr
JOIN trial_endpoints te_corr ON te_corr.id = tom_corr.trial_endpoint_id
JOIN trial_arm_outcomes tao_corr ON tao_corr.trial_outcome_measure_id = tom_corr.id
WHERE tom_corr.source_type = 'Publication'
  AND te_corr.abbreviation = 'ORR'
  AND tom_corr.confirmed = true
  AND (tao_corr.measure_value = '0' OR tao_corr.measure_value = '0.0')
  AND (
    -- Unconfirmed ORR > 0 on the same subgroup and arm
    EXISTS (
      SELECT 1
      FROM trial_outcome_measures tom2
      JOIN trial_endpoints te2 ON te2.id = tom2.trial_endpoint_id
      JOIN trial_arm_outcomes tao2 ON tao2.trial_outcome_measure_id = tom2.id
      WHERE tom2.source_type = 'Publication'
        AND tom2.source_id = tom_corr.source_id
        AND tom2.trial_subgroup_id = tom_corr.trial_subgroup_id
        AND te2.abbreviation = 'ORR'
        AND (tom2.confirmed = false OR tom2.confirmed IS NULL)
        AND tao2.measure_value ~ '^\d'
        AND tao2.measure_value::numeric > 0
        AND tao2.arm_name = tao_corr.arm_name
    )
    -- Confirmed PR > 0 on the same subgroup and arm
    OR EXISTS (
      SELECT 1
      FROM trial_outcome_measures tom_pr
      JOIN trial_endpoints te_pr ON te_pr.id = tom_pr.trial_endpoint_id
      JOIN trial_arm_outcomes tao_pr ON tao_pr.trial_outcome_measure_id = tom_pr.id
      WHERE tom_pr.source_type = 'Publication'
        AND tom_pr.source_id = tom_corr.source_id
        AND tom_pr.trial_subgroup_id = tom_corr.trial_subgroup_id
        AND te_pr.abbreviation = 'PR'
        AND tom_pr.confirmed = true
        AND tao_pr.measure_value ~ '^\d'
        AND tao_pr.measure_value::numeric > 0
        AND tao_pr.arm_name = tao_corr.arm_name
    )
  );
```

Explored solution direction
Option A (prompt fix + re-extraction): Update the `classify_publications` prompt to either (a) instruct the LLM to leave ORR null when only PR/CR are reported, or (b) instruct the LLM to derive ORR = CR + PR. Requires a version bump and re-extraction of affected pubs. Most correct long-term fix.
Option B (post-process derivation): Add logic to `post_process.rb` to derive confirmed ORR from confirmed CR + confirmed PR when confirmed ORR is 0 or null but confirmed PR exists. Fixes data at materialization time without re-extraction. Similar to the existing `derive_orr_for_subgroup` logic.
Option C (query-layer derivation): Add logic to `extract_efficacy_metrics` to derive cORR from confirmed CR + confirmed PR when confirmed ORR = 0 but confirmed PR > 0. Doesn’t fix the underlying data but prevents the 0% from reaching the report. Lowest effort, most targeted.
Option B is likely the best balance — fixes data at the right layer and catches both existing and future occurrences without prompt changes.
Open question: why isn’t ORR already derived from CR + PR?
Issue 25’s backfill v2 added a “derived ORR fix” to `post_process.rb` (`derive_orr_for_subgroup`) that propagates the confirmed flag from source PR/CR records. `post_process.rb` also has `derive_orr_for_subgroup` logic that creates ORR from CR + PR when ORR is missing. So why are we still seeing arms with CR + PR but no ORR?
Possible explanations to investigate:
1. Does `derive_orr_for_subgroup` only run when ORR is completely absent, but not when ORR = 0? If so, the zero-sentinel ORR blocks derivation: the system thinks ORR exists (it’s just 0).
2. Does the derivation only operate at the subgroup level, not at the per-arm level? Pub 65575’s Q2W/2Q3W are arms, not subgroups, so the derivation may not traverse arm-level data.
3. Was the derivation added only for the `confirmed` flag propagation (Issue 25) but never extended to derive the actual ORR value from CR + PR?
4. Did the Issue 49 re-extraction (which re-ran post_process) happen BEFORE or AFTER the derivation logic was added?
This needs to be traced through post_process.rb to understand exactly what derive_orr_for_subgroup does and why it didn’t fire for these cases. If it’s condition (1), the fix is simple: treat ORR = 0 as absent when PR > 0. If it’s condition (2), the derivation scope needs widening.
Related pattern: missing N on subgroups where abstract gives only percentages
Pub 116873 (EV + pembrolizumab, 1L R/M HNSCC, EV-202 cohort 9): Abstract says “39% had PD-L1 CPS 1-19 and 61% had CPS ≥20” out of 41 enrolled pts. LLM extracted ORR correctly for both CPS subgroups (43.8% and 36.0%) but left `number_of_participants = NULL`. The abstract doesn’t state explicit counts (16 and 25), only percentages, and the LLM didn’t compute N from the percentage × total. Audit issues 8517–8518 flagged `patient_number_efficacy = 0` (NULL rendered as 0 in query).
This is the zero-sentinel pattern applied to N rather than ORR: when the abstract gives only a percentage breakdown, the LLM leaves N null instead of deriving it. Scale TBD — need to check how many subgroups have ORR but null N where the abstract provides enough info to compute N.
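The arithmetic for pub 116873 is small enough to show directly. The helper name below is hypothetical; rounding to the nearest integer is an assumption that happens to reproduce the expected counts here.

```ruby
# Derives a subgroup N from the enrolled total and a reported percentage.
def derive_subgroup_n(total_n, percent)
  (total_n * percent / 100.0).round
end

cps_low  = derive_subgroup_n(41, 39) # 15.99 rounds to 16
cps_high = derive_subgroup_n(41, 61) # 25.01 rounds to 25
puts cps_low + cps_high              # prints 41
```

A useful sanity check for any such derivation is that the derived subgroup counts sum back to the enrolled total, as they do here (16 + 25 = 41). The reported ORRs are also consistent with these denominators: 43.8% matches 7/16 and 36.0% matches 9/25.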
Pub 75999 (ADC phase Ib, SCCHN cohort): SCCHN subgroup has ORR=40% and DCR=80% but number_of_participants = NULL. Abstract says 13 SCCHN pts enrolled with SCCHN-specific ORR/DCR reported. Audit issue 8529 flagged patient_number_efficacy=0 (NULL→0). Same missing-N pattern.
Related pattern: derive-ORR gap leaves arms invisible in query
Pub 64384 (CX-2029, 2L+ R/M HNSCC): The “Dose-comparison 1100 mg” arm has CR=0 and PR=1 (N=10) in the view, but NO ORR row. The query’s extract_efficacy_metrics requires a PRIMARY_EFFICACY_ABBREVIATION (ORR, PFS, OS, DOR, etc.) to surface a row — without ORR, the 1100 mg arm is dropped entirely. The derive_orr_for_subgroup logic in post-process should compute ORR = (CR + PR) / N but didn’t fire here. This ties directly to the open question above about why derive_orr_for_subgroup doesn’t fire for certain arms. Audit issue 8586.
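The drop can be illustrated with a minimal sketch. PRIMARY_ABBREVIATIONS, surfaces? and orr_from_counts are hypothetical names for this example, and the abbreviation list is only the subset named above, not the query's actual constant.

```ruby
# Assumed subset of primary efficacy abbreviations (illustrative only).
PRIMARY_ABBREVIATIONS = %w[ORR PFS OS DOR].freeze

# An arm surfaces only if at least one of its measures is a primary one.
def surfaces?(measures)
  measures.keys.any? { |abbr| PRIMARY_ABBREVIATIONS.include?(abbr) }
end

# ORR as a percentage from response counts; nil when N is missing or zero.
def orr_from_counts(cr, pr, n)
  return nil if n.nil? || n.zero?

  ((cr + pr) * 100.0 / n).round(1)
end

arm = { "CR" => 0, "PR" => 1 }       # the 1100 mg arm, N = 10
dropped_before = !surfaces?(arm)      # no primary measure, so no row
arm["ORR"] = orr_from_counts(arm["CR"], arm["PR"], 10)
visible_after = surfaces?(arm)        # ORR is now present
```

With CR=0, PR=1 and N=10 the derived ORR is 10.0%, which is enough for the arm to reappear.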
Caution: the reverse problem exists too (genuine 0% extracted as null)
Pub 134450 (MRG003 phase 1b, CRC subgroup): abstract explicitly states ORR = 0% for CRC (drug tested, zero responses). The LLM extracted measure_value = null instead of 0. This is the flip side of the zero-sentinel — the Issue 8 fix (nullable: true in schema) may have overcorrected, making the LLM prefer null even when 0% is the real stated value.
Any forward fix for Issue 54 must distinguish between:
- Unstated value → should be null (abstract doesn’t mention ORR for this subgroup)
- Stated 0% → should be 0 (abstract explicitly says “ORR was 0%” or “0/N”)
The prompt or post-process derivation needs to preserve genuine zeros while nulling out unstated values. A blanket “treat 0 as null when PR > 0” rule would break cases where ORR genuinely is 0% and PR is also 0%.
Solution applied
Option B implemented — post-process validation in validate_orr_denominator (post_process.rb).
Root cause: derive_orr_for_subgroup skips when any ORR percentage already exists (return if existing_orr.exists?), so ORR=0% blocks derivation. And the Issue 60 validate_orr_denominator check passes ORR=0% because 0 × N / 100 = 0, which is a perfect integer — the denominator test never fires.
Fix: Two new checks added to validate_orr_denominator, before the Issue 60 logic:
- Issue 54a (percentage path): If ORR=0% and a same-confirmed-flag PR% TOM exists with value > 0 for the same arm, set ORR = PR% + CR%. Covers the common pattern (pub 29704) where PR/CR are stored as percentages.
- Issue 54b (count path): If ORR=0% and same-confirmed-flag PR count > 0, derive ORR = (PR + CR) / N × 100. Covers cases where PR/CR are stored as counts.
Three guards prevent false positives:
- Confirmed-flag matching: Only uses PR/CR with the same confirmed value as the ORR TOM. Prevents overriding confirmed ORR=0% using unconfirmed PR (pub 65983: genuine cORR=0% with unconfirmed PR=1).
- PR > 0 required: Won’t fire on CR alone. A lone CR with ORR=0% can be legitimate (pub 48436: post-progression CR that RECIST excludes from ORR).
- Sibling non-zero ORR check: If another ORR TOM exists for the same subgroup/confirmed with a non-zero value, skip the fix. Indicates different evaluation criteria (pub 54321: RECIST ORR=0% alongside PERCIST ORR=50%).
Works for all confirmed variants: cORR from cPR+cCR, ORR from PR+CR, nil-confirmed from nil-confirmed.
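The percentage path and its three guards can be sketched as follows. This is not the code in validate_orr_denominator: Measure and corrected_zero_orr are hypothetical names, and the example percentages are made up for illustration.

```ruby
# Illustrative record for one outcome measure on a subgroup/arm.
Measure = Struct.new(:abbr, :value, :confirmed, keyword_init: true)

# Returns the corrected ORR percentage, or nil when the zero should stand.
# `orr` is the ORR=0% record; `siblings` are the other measures on the same
# subgroup/arm.
def corrected_zero_orr(orr, siblings)
  return nil unless orr.abbr == "ORR" && orr.value == 0

  # Guard 1: only consider measures with the same confirmed flag.
  same_flag = siblings.select { |m| m.confirmed == orr.confirmed }

  # Guard 3: a non-zero ORR with the same flag suggests different criteria.
  return nil if same_flag.any? { |m| m.abbr == "ORR" && m.value.to_f > 0 }

  # Guard 2: a PR greater than 0 is required; CR alone does not trigger.
  pr = same_flag.find { |m| m.abbr == "PR" && m.value.to_f > 0 }
  return nil if pr.nil?

  cr = same_flag.find { |m| m.abbr == "CR" }
  pr.value + (cr ? cr.value.to_f : 0.0)
end

orr = Measure.new(abbr: "ORR", value: 0, confirmed: true)
fixable = [
  Measure.new(abbr: "PR", value: 20.0, confirmed: true),
  Measure.new(abbr: "CR", value: 5.0, confirmed: true)
]
fixed_value = corrected_zero_orr(orr, fixable) # PR% + CR%

# An unconfirmed PR must not override a confirmed ORR of 0%.
unconfirmed = [Measure.new(abbr: "PR", value: 10.0, confirmed: false)]
kept_zero = corrected_zero_orr(orr, unconfirmed)
```

Returning nil rather than the original value keeps "no change" distinct from "changed to some number", which makes a dry-run report easier to build.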
Backfill: lib/tasks/one_off/fix_zero_sentinel_orr.thor — three-step task (identify → sample → fix). Local run: 15 records fixed across 6 pubs, 0 false positives. Job 1876 created but never executed.
Production verification (2026-04-16): The backfill is no longer needed. Reclassification to cpv=2 (job 1871, pubs 29704/235204 and others) eliminated the original zero-sentinel records at the extraction layer. The original examples (29704: 10 bad cORR=0 records, 235204: similar) now have 1 remaining cORR=0 each — both on MR neg (methylation panel) subgroups with no cPR sibling, so the fix wouldn’t fire anyway. Corpus-wide: 0 records match the backfill’s trigger condition (cORR=0 AND cPR>0 for same subgroup/arm). The remaining 12 pubs / 14 records with cORR=0 are all the uORR>0 pattern (unconfirmed ORR positive, confirmed ORR zero) — clinically valid (responses not yet confirmed). Pub 65575 remains on cpv=1/sev=1 (not reclassified).
Open questions resolved:
- Q: Why doesn’t derive_orr_for_subgroup fire? A: Because ORR=0% exists as a percentage TOM, so existing_orr.exists? returns true and derivation is skipped.
- Q: Why doesn’t Issue 60’s denominator check catch it? A: 0 × N / 100 = 0, remainder = 0, which passes the <= 0.3 threshold.
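The denominator arithmetic can be checked with a small sketch. The exact form of the Issue 60 test is an assumption here (distance of the implied responder count from the nearest whole number, with a 0.3 tolerance); plausible_denominator? is a hypothetical name.

```ruby
# Assumed shape of the Issue 60 test: the responder count implied by ORR% and
# N should sit within `tolerance` of a whole number.
def plausible_denominator?(orr_percent, n, tolerance: 0.3)
  implied = orr_percent * n / 100.0
  (implied - implied.round).abs <= tolerance
end

zero_passes  = plausible_denominator?(0, 27)     # 0.0 implied responders
good_passes  = plausible_denominator?(43.8, 16)  # 7.008, close to 7
bad_rejected = !plausible_denominator?(43.8, 10) # 4.38, too far from 4
```

Because zero is always a whole number, ORR=0% passes for every N, which is why the zero-sentinel checks have to run before this test.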
55. Secondary-analysis cross-tabulated subgroups not extracted (SD × MR stratification)
Short summary
When an abstract reports efficacy for a secondary analysis population (e.g. patients with Stable Disease) further stratified by a biomarker or response metric (e.g. molecular response pos/neg), the cross-tabulated subgroups are not extracted by extract_subgroups or classify_publications. The parent secondary subgroup and the biomarker stratification are each captured independently, but not the cross-product. Persists on classify_publications_version=1 — confirmed on pubs 29704 and 235204 after Issue 49 re-extraction.
Related to Issue 38 (biomarker subgroups in secondary analyses) and Issue 33 (cross-tabulated subgroups in basket trials), but distinct: here the cross-tabulation is between a response-category subpopulation and a biomarker panel, not between disease and biomarker.
Where this sits in the current pipeline
Extraction layer: app/tasks/publications_llm_classification/subgroup_extraction.rb — extract_subgroups identifies the SD subgroup and the MR pos/neg subgroups independently but does not create their cross-product (SD × MR pos, SD × MR neg).
Extraction layer: app/tasks/publications_llm_classification/task.rb — classify_publications extracts endpoints for the subgroups it receives from extract_subgroups. Since SD × MR subgroups don’t exist, their PFS values are never extracted.
Exact restriction causing the drop
The abstract reports a section “MR in pts with SD” with its own table of MR rates and PFS by MR pos/neg — this is a secondary analysis within a response category. extract_subgroups captures:
- SD subgroups (e.g. CRC → SD → 74-gene panel) with MMR endpoint ✓
- MR pos/neg subgroups (e.g. CRC → MR pos (74-gene)) with ORR + PFS ✓
- But NOT the cross-product: CRC → SD → MR pos (74-gene) with PFS ✗
The Issue 33 fix added cross-tabulation support for disease × biomarker, but not for response-category × biomarker.
Concrete examples
Pub 235204 (ABBV-400 ctDNA + CRC, telisotuzumab adizutecan):
Abstract reports “MR in pts with SD” section:
- 74-gene panel: 27/45 (60%) MR rate
  - MR pos: mPFS 5.3 mo (95% CI 4.5–5.9), events 21/27
  - MR neg: mPFS 3.9 mo (95% CI 2.8–4.3), events 16/18
- Methylation panel: 31/53 (58%) MR rate
  - MR pos: mPFS 5.3 mo (95% CI 4.5–5.9), events 23/31
  - MR neg: mPFS 4.0 mo (95% CI 2.8–4.4), events 19/22
View has the SD × panel MMR subgroups but NOT the SD × panel × MR pos/neg PFS subgroups.
Pub 29704 (same abstract, ASCO duplicate per Issue 17): Same pattern — SD × MR subgroups missing.
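The gap for pub 235204 can be expressed as a set difference between the cross-products the abstract reports and the subgroups that were extracted. The sketch below is illustrative; the methylation subgroup names follow the 74-gene naming pattern and are assumptions, not values read from the view.

```ruby
# Dimension values taken from the pub 235204 example.
panels   = ["74-gene", "methylation"]
statuses = ["MR pos", "MR neg"]

# Every response-category x biomarker subgroup the abstract reports PFS for.
expected = panels.product(statuses).map do |panel, status|
  "CRC → SD → #{status} (#{panel})"
end

# Subgroups present before the fix: each dimension on its own only.
extracted = [
  "CRC → SD → 74-gene panel",
  "CRC → MR pos (74-gene)",
  "CRC → MR neg (74-gene)"
]

# Whatever is expected but not extracted is a missing cross-product.
missing = expected - extracted
```

Here all four SD × MR cells come back as missing, matching the 4 audit issues on the pub.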
Downstream impact
- Clinical Evidence report missing PFS data for SD patients stratified by molecular response
- 4 audit issues on pub 235204 (missing_subgroup: SD × MR pos/neg for each panel)
- Likely affects other publications with response-category × biomarker cross-tabulations
Corpus-wide scan completed 2026-04-13. Pattern is rare:
- 60 pubs have both response-category and biomarker subgroups, but most don’t have actual cross-tabulated results
- 2 confirmed affected pubs: 235204 and 29704 (ASCO duplicates of the same abstract)
- Regression testing on 5 unrelated pubs showed no hallucinated cross-tabs from the prompt change
Solution applied
Prompt fix (Option A) applied 2026-04-13:
- Added response-category × biomarker as an explicit pattern in Step 2b of extract_subgroups prompt
  - Expanded dimension examples: response category × biomarker/panel
  - Added prose pattern: "Among pts with SD, MR pos had mPFS 5.3 mo; MR neg had mPFS 3.9 mo"
  - Added subgroup examples: "SD → MR pos (74-gene panel)", "PR → ctDNA clearance"
- Bumped PROMPT_VERSION from 2 → 3
- Tested on pubs 235204 and 29704: both now correctly extract SD × MR pos/neg cross-products for PFS
- Regression tested on 5 diverse pubs (15598, 58975, 77005, 125286, 57146): no false-positive cross-tabs
Backfill: Reset and re-extract pubs 235204 and 29704 via one_off:reset_publications:reset. New general-purpose reset task created at lib/tasks/one_off/reset_publications.thor for resetting any pubs by ID through the full pipeline.
Status: Complete (forward fix + backfill of 2 affected pubs)
Production verification (2026-04-16): Pub 235204 has all 6 SD × MR cross-tab subgroups with correct data: MR pos (74-gene) PFS=5.3mo (N=27), MR neg (74-gene) PFS=3.9mo (N=18), MR pos (methylation) PFS=5.3mo (N=31), MR neg (methylation) PFS=4.0mo (N=22), plus MMR rates (60% 74-gene, 58% methylation). All values match abstract.
53. Multi-cohort pubs have per-cohort AEs with no subgroup linkage
Short summary
Publications that report adverse events separately for multiple disease cohorts (e.g., CLL and MCL in the same phase I/Ib trial) have no way to link AE records to the corresponding trial_subgroup. All per-cohort AE values get flattened into “All Arms” and mixed together, producing duplicate or incorrect entries.
Discovered while investigating a related issue where classify_publications collapsed compact AE pairs like (70%, 3% ≥G3) into a single record with swapped values. That issue was fixed with a prompt change to PROMPT_VERSION = 2.
Where this sits in the current pipeline
- Extraction: app/tasks/publications_llm_classification/task.rb — AEs are extracted with arm linkage only (via arms[].id)
- Post-process: app/tasks/publications_llm_classification/post_process.rb:423 — process_adverse_events creates AdverseEvent → TrialArmOutcome records, linked to trial_arm only
- Data model: adverse_events has no trial_subgroup_id or equivalent. trial_arm_outcomes links to trial_arm but not to trial_subgroup
Concrete examples
Pub 88644 (TGR-1202 + ibrutinib, phase I/Ib in R/R CLL and MCL):
- Two subgroups: CLL (subgroup 226526) and MCL (subgroup 226527), no distinct trial arms
- Abstract reports AEs separately per cohort:
  - CLL: neutropenia (37.5%, all gr 3-4) — entire incidence at grade 3-4
  - MCL: neutropenia (37.5%, 12.5% gr 4) — 37.5% all-grade, 12.5% grade 4
- Both get extracted as “All Arms” with no cohort distinction
- LLM confuses values across cohorts, producing duplicate entries with wrong grade assignments
| Metric | Count |
|---|---|
| Pubs with ≥2 disease partition subgroups | 4,699 |
| Of those, pubs that also have adverse events | 3,021 |
| Of those, pubs with no distinct trial arms (AEs land on “All Arms”) | 1,883 |
Downstream impact
- Per-cohort AE rates are mixed or duplicated in clinical evidence reports
- Clients see nonsensical AE profiles when cohorts have different safety profiles
- No way to filter AEs by disease cohort in the frontend
What the issue is not
- Not the swapped all-grade/graded values issue — that was a separate prompt bug, fixed in PROMPT_VERSION 2
- Not an arm-linking issue (Issue 49) — these pubs correctly have no distinct arms, the cohorts are subgroups
Explored solution direction
Schema change: Add trial_subgroup_id (nullable FK) to adverse_events or trial_arm_outcomes. The LLM schema already extracts arm linkage — extend it to also extract subgroup linkage when AEs are reported per-cohort.
Pipeline changes:
- Update Details::AdverseEvent schema to include optional subgroup reference
- Update classify_publications prompt to instruct per-cohort AE extraction with subgroup tagging
- Update post_process.rb#process_adverse_events to set the subgroup FK
- Update app/queries/tpp/clinical_evidence_query.rb — extract_safety_metrics_for_publication and extract_ranked_named_ae_summaries need to filter/group AEs by subgroup when available, so per-cohort safety data is reported correctly in clinical evidence
- Re-run classify_publications on the ~1,883 affected pubs
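How the query side might group AEs once a nullable subgroup link exists can be sketched in plain Ruby. AdverseEventRow and group_by_cohort are hypothetical names, the third row is a placeholder, and none of this reflects the actual Details::AdverseEvent schema.

```ruby
# Illustrative AE record; trial_subgroup_id is the proposed nullable link.
AdverseEventRow = Struct.new(:name, :all_grade_pct, :trial_subgroup_id,
                             keyword_init: true)

# Groups AEs by subgroup when the link is present; rows without one keep the
# current behaviour and land under "All Arms".
def group_by_cohort(rows, subgroup_names)
  rows.group_by do |row|
    subgroup_names.fetch(row.trial_subgroup_id, "All Arms")
  end
end

names = { 226526 => "CLL", 226527 => "MCL" } # subgroup IDs from pub 88644
rows = [
  AdverseEventRow.new(name: "neutropenia", all_grade_pct: 37.5, trial_subgroup_id: 226526),
  AdverseEventRow.new(name: "neutropenia", all_grade_pct: 37.5, trial_subgroup_id: 226527),
  AdverseEventRow.new(name: "placeholder AE", all_grade_pct: nil, trial_subgroup_id: nil)
]

grouped = group_by_cohort(rows, names)
cohorts = grouped.keys
```

Keeping the "All Arms" fallback means pubs that are never re-classified continue to render as they do today.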
Solution applied
56. Cross-tabulated subgroups missed in run-on table format (Issue 33/43 residual)
Short summary
extract_subgroups (version 2) has the correct Step 2b instructions for cross-tabulated subgroups (tumor type × biomarker), but the LLM fails to parse run-on HTML-stripped tables where the matrix data has no clear delimiters. The LLM extracts single-dimension subgroups from the prose summary but misses the cross-product data embedded in the table body. Persists on subgroup_extraction_version=2 — the Issue 33/43 prompt fix is in place, the LLM just doesn’t follow it for this table format.
Prior work (Issues 33 and 43)
- Issue 33 (cross-tabulated subgroups in basket trials): Identified that extract_subgroups didn’t create disease × biomarker cross-products. Fix: added Step 2b instructions to the prompt. Backfill applied 2026-03-28, 262 confirmed pubs remediated. Status: Complete.
- Issue 43 (cross-tabs only extracted for highest-response HER2 level): Issue 33 backfill re-extracted cross-tabs but LLM only created cross-products for the most prominent biomarker level (e.g. IHC3+), skipping lower-response levels. Fix: prompt refinement emphasizing ALL levels. Backfill: 234 confirmed, remediated 2026-03-31. Status: Complete (pipeline re-run pending).
Both fixes are reflected in the current subgroup_extraction.rb prompt (lines 56–71), including explicit examples like "In CRC, HER2 IHC 3+ had an ORR of 100% (3/3)" and instructions to create cross-products for ALL combinations including zero-response results.
Where this sits in the current pipeline
Extraction layer: app/tasks/publications_llm_classification/subgroup_extraction.rb — Step 2b has the correct instructions, but the LLM fails to apply them when the cross-tabulated data is in a poorly formatted run-on table.
Exact restriction causing the drop
The abstract’s cross-tab table is HTML-stripped into a single run-on string with no clear delimiters:
```
Table: 656MO ORR by tumor type and HER2 status BTC (N=22)UC (N=23)GC/GEJA (N=13)CRC (N=14)Other tumors (N=26) ORR*, n/N (%)(95% CI)9/15 (60.0%)(32.3–83.7)13/22 (59.1%)(36.4–79.3)6/12 (50.0%)(21.1–78.9)4/11 (36.4%)(10.9–69.2)7/25 (28.0%)(12.1–49.4)HER2 IHC 3+7/10 (70.0%) (34.8–93.3)2/6 (33.3%) (4.3–77.7)4/10 (40.0%) (12.2–73.8)3/3 (100.0%) (29.2–100.0)4/8 (50.0%) (15.7–84.3)HER2 IHC 2+0/17/10 (70.0%) (34.8–93.3)2/2 (100.0%) (15.8–100.0)0/31/8 (12.5%) (0.3–52.7)...
```
The LLM’s abstract_data for ORR captured only the prose numbers:
“ORR was 45.9% (39/85 evaluable pts); ORR was 54.1% (20/37) in HER2 IHC3+, 41.7% (10/24) with IHC2+, 50.0% (7/14) with IHC1+. For individual tumor type, ORR was 56.3% (9/16) in BTC, 59.1% (13/22) in UC, 50.0% (6/12) in GC/GEJA, 36.4% (4/11) in CRC.”
The cross-tab table body (with per-tumor × per-HER2 values) was visible in the abstract but not captured in abstract_data and cross-product subgroups were not created.
Concrete examples
Pub 72043 (SHR-A1811 / trastuzumab rezetecan, HER2 ADC in non-breast solid tumors):
Abstract reports a matrix: ORR by tumor type (BTC, UC, GC/GEJA, CRC, Other) × HER2 status (IHC3+, IHC2+, IHC1+, mutation/amp).
Extracted (single-dimension only):
- Per tumor: BTC (56.3%), UC (59.1%), GC/GEJA (50.0%), CRC (36.4%) ✓
- Per HER2: IHC3+ (54.1%), IHC2+ (41.7%), IHC1+ (50.0%) ✓
- Cross-products: NONE ✗
Missing from CRC × HER2 alone:
- CRC → HER2 IHC 3+: 3/3 (100%)
- CRC → HER2 IHC 2+: 0/3
- CRC → HER2 IHC 1+: 0/1
- CRC → HER2 mutation/amp: 0/3
And similarly for BTC × HER2, UC × HER2, GC/GEJA × HER2 — the entire matrix is missing. That’s ~20 cross-product cells with data.
Audit issues 8570–8573 flagged the 4 CRC × HER2 cross-products.
Pub 70322 (ABBV-400 / telisotuzumab adizutecan, c-Met ADC in advanced solid tumors):
Abstract has a cross-tab table: biomarker response by population (“All pts” vs “CRC pts”) × biomarker status (High TMB, High MSI, KRAS mut, BRAF mut). The “CRC pts” column has KRAS mut (2/16 = 13% cPR) and BRAF mut (1/3 = 33% cPR) — these biomarkers are CRC-specific in the table.
LLM extracted “CRC → High TMB” and “CRC → High MSI” correctly (from prose), but nested KRAS and BRAF under “Overall” instead of “CRC”:
- “Overall → KRAS mut” (ORR=13%, N=16) — should be “CRC → KRAS mut”
- “Overall → BRAF mut” (ORR=33%, N=3) — should be “CRC → BRAF mut”
Because the subgroup names lack “CRC”, disease adjudication finds no disease signal (core_disease_phrase: null, semantic_class: disease_related_context), so disease_id stays NULL. The CRC disease filter then excludes both subgroups — making them invisible in the CRC report.
Key distinction from Issue 57: This is NOT a disease matching error. The matching pipeline correctly returns NULL because “KRAS mut” is a biomarker, not a disease. The root cause is the extraction layer naming the subgroup “Overall → KRAS mut” instead of “CRC → KRAS mut”.
Audit issues 8549–8550 flagged the 2 missing CRC × biomarker cross-products.
Downstream impact
- Clinical Evidence report shows per-tumor ORR but not per-tumor × per-HER2 stratification
- For CRC specifically: 100% (3/3) for IHC3+ is a high-signal result that’s invisible in the report
- Basket trial cross-tabs are the most clinically valuable data for HER2-targeting ADCs across tumor types
What the issue is not
- Not a prompt gap — Step 2b explicitly covers this pattern with the exact same example (“In CRC, HER2 IHC 3+…”)
- Not Issue 43 (highest-response-only extraction) — here NO cross-products are created at all, not just missing lower levels
- Not a data availability issue — the table is present in the abstract
TBD — needs assessment of how many basket-trial abstracts have run-on HTML-stripped tables that the LLM fails to parse despite correct Step 2b instructions. The Issue 33 backfill remediated 262 pubs and Issue 43 caught 234 more, but this residual suggests the fix doesn’t work for all table formats.
Open questions
- Is this a table parsing failure or a prompt following failure? If we reformat the abstract table with clear delimiters (tabs, newlines), does the LLM then correctly create cross-products? If so, the fix is preprocessing abstract tables before sending to extract_subgroups.
- How many pubs have this pattern? Need to compare sev=2 pubs that have both disease AND biomarker subgroups but no cross-products, against the abstract text to see if a table was present.
- Would a two-pass approach work? First pass: extract single-dimension subgroups (current). Second pass: given the extracted dimensions, explicitly ask the LLM to find cross-product data in the abstract. Lower LLM ambiguity since we tell it exactly which dimensions to cross.
Explored solution direction
Option A — Abstract preprocessing: Detect HTML-stripped tables in abstracts and reformat them with clear delimiters before sending to extract_subgroups. Addresses root cause (unparseable table format) but requires table detection heuristics.
Option B — Two-pass extraction: After Step 2a identifies both disease and biomarker subgroups, run a targeted second pass: “Given subgroups [CRC, BTC, UC…] and [IHC3+, IHC2+, IHC1+…], find the cross-tabulated results in the abstract.” More explicit prompt, higher success rate for complex tables.
Option C — Prompt reinforcement: Add more aggressive instructions to Step 2b: “ALWAYS check for tables at the end of the abstract. Tables may be HTML-stripped into run-on text — look for patterns like ‘Row Label Value1 Value2 Value3’ without newlines.” May help marginally but doesn’t address the fundamental parsing challenge.
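For Option B, the second-pass instruction could be generated from the dimensions found in the first pass. The sketch below is a rough illustration: cross_tab_prompt is a hypothetical helper and the wording is a placeholder, not a tested production prompt.

```ruby
# Builds a second-pass instruction that names every disease x biomarker cell
# explicitly, so the model is told exactly which dimensions to cross.
def cross_tab_prompt(diseases, biomarkers)
  pairs = diseases.product(biomarkers).map { |d, b| "#{d} x #{b}" }
  <<~PROMPT
    Given subgroups [#{diseases.join(', ')}] and [#{biomarkers.join(', ')}],
    find the cross-tabulated results in the abstract.
    Report a value or "not reported" for each of these #{pairs.size} cells:
    #{pairs.join("\n")}
  PROMPT
end

prompt = cross_tab_prompt(%w[CRC BTC UC], ["IHC3+", "IHC2+", "IHC1+"])
```

Listing the cells one by one also gives post-processing a fixed checklist to compare the response against.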
Solution applied
57. Subgroup disease misattribution: biomarker-named cohorts matched to wrong disease
Short summary
The disease matching pipeline (PublicationDiseaseWorkflow) assigns incorrect disease IDs to biomarker-named subgroups. When a subgroup name contains a biomarker like “HER2 IHC3+”, the matching pipeline interprets it as a disease indicator and assigns breast cancer subtypes (e.g. HER2-Positive Breast Cancer, disease 6216) — even when the publication is about a completely different disease (e.g. CRC). These subgroups then get filtered out by the query’s disease filter, making entire cohorts invisible in the Clinical Evidence report.
Where this sits in the current pipeline
Disease matching pipeline: app/workflows/publication_disease_workflow.rb — specifically the trial subgroup disease matching branch (steps adjudicate_subgroup_diseases → populate_disease_terms_for_trial_subgroups → … → post_process_trial_subgroup_disease_matches).
The adjudicate_subgroup_diseases step is supposed to run contextual LLM adjudication to determine if a subgroup is a true disease cohort before it enters DiseaseMatching. For pub 48926, the adjudication either didn’t run, misclassified the biomarker cohorts as disease cohorts, or ran correctly but the downstream matching overrode it.
Query layer: app/queries/tpp/clinical_evidence_query.rb lines 157-160 — filters rows by subgroup_disease_id. When subgroup_disease_id is set (even incorrectly), the query uses it directly. It only falls back to tdd.disease_id when subgroup_disease_id IS NULL.
Exact restriction causing the drop
The subgroup names “Cohort A (HER2 IHC3+ or IHC2+/ISH+)”, “Cohort B (HER2 IHC2+/ISH−)”, “Cohort C (HER2 IHC1+)” contain HER2 biomarker status. The disease matching pipeline interprets these as disease descriptors and matches them to:
- Cohort A → disease 6216 (HER2-Positive Breast Cancer)
- Cohort B → disease 6215 (HER2-Low Breast Cancer)
- Cohort C → disease 6215 (HER2-Low Breast Cancer)
The pub’s trial_disease_details correctly says CRC (4345). But the query filters subgroup_disease_id = ANY(CRC subtree) — breast cancer IDs don’t match, so all three cohorts are excluded.
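The filter behaviour can be mirrored in a few lines. This is a sketch of the rule as described, not the SQL in clinical_evidence_query.rb; visible_in_report? is a hypothetical name and the CRC subtree is reduced to its root ID for brevity.

```ruby
require "set"

# The subgroup's own disease wins whenever it is set; the pub-level TDD
# disease is used only when the subgroup disease is NULL (nil).
def visible_in_report?(subgroup_disease_id, tdd_disease_id, report_subtree)
  effective = subgroup_disease_id.nil? ? tdd_disease_id : subgroup_disease_id
  report_subtree.include?(effective)
end

crc_subtree = Set[4345] # CRC plus descendants; only the root is shown here

cohort_a_visible = visible_in_report?(6216, 4345, crc_subtree) # wrong ID set
child_visible    = visible_in_report?(nil, 4345, crc_subtree)  # TDD fallback
```

A wrong non-null ID is therefore worse than no ID at all: it suppresses the fallback that would have put the cohort in the CRC report.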
Concrete examples
Pub 48926 (T-DXd DESTINY-CRC01, trastuzumab deruxtecan in HER2+ mCRC):
CRC publication with 3 HER2-stratified cohorts. All cohort subgroups tagged with breast cancer diseases instead of CRC:
- “Cohort A (HER2 IHC3+ or IHC2+/ISH+)” → disease 6216 (HER2+ Breast Cancer) — should be CRC or null
- “Cohort B (HER2 IHC2+/ISH−)” → disease 6215 (HER2-Low Breast Cancer) — should be CRC or null
- “Cohort C (HER2 IHC1+)” → disease 6215 (HER2-Low Breast Cancer) — should be CRC or null
Child subgroups (Cohort A → IHC3+ status, etc.) have disease_id = NULL and correctly appear via the TDD fallback.
Audit issues 8565-8567 flagged all 3 cohorts as missing from the CRC query output.
Pub 7559: “HER2+ metastatic breast cancer” subgroup on a CRC pub → matched to HER2+ Breast Cancer.
Downstream impact
- Entire cohorts vanish from the Clinical Evidence report when queried by the publication’s actual disease
- For pub 48926: the primary endpoint data (cORR 45.3%, PFS 6.9mo, OS 15.5mo for Cohort A) is invisible in the CRC report
- High-value clinical data from HER2-targeting therapies in non-breast tumors (CRC, GC, BTC) is systematically underreported
| Metric | Count |
|---|---|
| Total subgroups where disease_id ≠ any pub TDD disease_id | 19,426 subgroups / 6,344 pubs |
| In-family (legitimate subtypes/ancestors via all_descendants check) | 57,016 |
| Cross-family mismatches (ancestor/descendant/exact-match check fails) | 2,368 subgroups / 1,574 pubs |
| Subgroups on pubs with only broad TDDs (can’t validate) | ~5,004 subgroups / 2,157 pubs |
| Subgroups pointing to non-simplified (orphaned) diseases | 32 |
The 19,426 number includes many legitimate subtypes (e.g. TNBC subgroup on a Breast Cancer TDD pub). The cross-family check uses diseases.all_descendants on simplified diseases to verify ancestor/descendant/exact-match relationships. Basket trial pubs with broad TDDs (e.g. “Solid Tumors”) correctly allow cross-branch subgroups.
Open questions
- Did adjudicate_subgroup_diseases run on pub 48926? Answered: No — these subgroups have subgroup_type='disease', so they bypass adjudication entirely and go straight into disease matching via populate_term_matches. The adjudication llm_data fields are empty.
- Is the disease matching pipeline treating “HER2 IHC3+” as a disease term? Answered: Yes — the normalized term "cohort a (her2 ihc3+ or ihc2+/ish+)" is sent to semantic matching, which returns “HER2-Positive Breast Cancer” as the best global match. The root cause is upstream: classify_publications sets subgroup_type='disease' for biomarker-stratified cohorts.
- How many of the 12,846 mismatches are cross-family? Answered: 2,368 cross-family (ancestor/descendant check on all_descendants), 57,016 in-family.
Explored solution direction
Option A — Fix adjudication: The adjudicate_subgroup_diseases step should reject biomarker-only subgroup names as disease candidates. If the subgroup name is purely a biomarker selection criterion (HER2 status, PD-L1 level, TMB), it shouldn’t enter disease matching at all.
Option B — Cross-validate against pub disease: After disease matching, add a validation step: if the matched disease is in a completely different root disease family from the pub’s TDD diseases, null out the subgroup disease_id. This catches cross-family misattributions without changing the matching pipeline.
Option C — Null out wrong IDs: For the known cross-family mismatches, directly set disease_id = NULL on the affected subgroups. They’ll fall through to the TDD-based disease in the query. Quick surgical fix.
Option B is probably the safest — it catches the bug without risking regressions on legitimate subtype matches.
Solution applied
Forward fix (2026-04-13):
Added cross-family validation gate to post_process_disease_matches in lib/tasks/clinical_trials/trial_subgroups.thor. After resolving a disease_id from the TermMatch lookup, the post-processor now checks whether the matched disease is an ancestor, descendant, or exact match of at least one of the publication’s TDD diseases (using the diseases.all_descendants JSONB column on simplified diseases). If not, the match is rejected and disease_id is left NULL — the clinical evidence query falls back to the pub-level TDD disease.
Implementation:
- build_tdd_subtree_lookup: preloads TDD disease subtrees per publication into a { pub_id => Set[disease_id, ...] } hash
- disease_in_family?(disease_id, pub_id, tdd_subtrees): checks (1) candidate is in any TDD’s subtree, or (2) any TDD disease appears in the candidate’s all_descendants
- Both disease-type and adjudicated-disease-cohort processing loops now call disease_in_family? before assigning disease_id
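A rough sketch of how the two checks could fit together is shown below. It is not the code in trial_subgroups.thor: DESCENDANTS is an in-memory stand-in for diseases.all_descendants, disease IDs 4346 and 1000 are invented for the example, and the extra tdd_ids_by_pub argument is an assumption added so that check (2) can see the TDD diseases themselves.

```ruby
require "set"

# Assumed stand-in for diseases.all_descendants: disease_id => descendant IDs.
DESCENDANTS = {
  4345 => Set[4346],       # CRC and a hypothetical CRC subtype
  1000 => Set[6215, 6216], # hypothetical breast cancer root
  6215 => Set[],
  6216 => Set[]
}.freeze

# pub_id => Set of every TDD disease plus its descendants.
def build_tdd_subtree_lookup(tdd_ids_by_pub)
  tdd_ids_by_pub.transform_values do |tdd_ids|
    tdd_ids.each_with_object(Set.new) do |id, subtree|
      subtree << id
      subtree.merge(DESCENDANTS.fetch(id, Set[]))
    end
  end
end

def disease_in_family?(disease_id, pub_id, tdd_subtrees, tdd_ids_by_pub)
  # (1) candidate is a TDD disease or one of its descendants
  return true if tdd_subtrees.fetch(pub_id, Set[]).include?(disease_id)

  # (2) candidate is an ancestor: a TDD disease is among its descendants
  candidate_descendants = DESCENDANTS.fetch(disease_id, Set[])
  tdd_ids_by_pub.fetch(pub_id, []).any? { |id| candidate_descendants.include?(id) }
end

tdds = { 48926 => [4345] } # pub 48926 has CRC as its TDD disease
subtrees = build_tdd_subtree_lookup(tdds)

breast_rejected = !disease_in_family?(6216, 48926, subtrees, tdds)
subtype_kept    = disease_in_family?(4346, 48926, subtrees, tdds)
```

Preloading the subtrees once per run keeps the per-subgroup check to a couple of set lookups instead of a query per candidate.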
Backfill (2026-04-13):
Created lib/tasks/one_off/reset_cross_family_subgroup_diseases.thor (Issue 57). Uses the same ancestor/descendant/exact-match check via SQL to identify cross-family subgroups, then sets disease_id = NULL on all of them.
Scale: 2,368 cross-family subgroups across 1,574 pubs nulled out. 57,306 in-family subgroups left untouched. Verified pub 48926 (DESTINY-CRC01): all 3 HER2-stratified cohorts (Cohort A → HER2+ Breast Cancer, Cohort B/C → HER2-Low Breast Cancer) reset to NULL, now correctly fall back to CRC (4345) via TDD.
Also dropped the unused diseases.parent_ids column (migration 20260413202034_remove_parent_ids_from_diseases) to avoid future confusion — the canonical hierarchy is disease_parents join table + all_descendants JSONB.
Residual: ~5,004 subgroups on 2,157 pubs with only broad TDDs (e.g. “Solid Tumors”) cannot be validated by this approach — a breast cancer subgroup on a “Solid Tumors” TDD pub passes the is-a check legitimately. These are mostly basket trials where cross-disease subgroups are expected. Not actioned.
Root cause note: Many of the 2,368 cross-family subgroups were biomarker-stratified cohorts (e.g. “HER2 IHC3+”) with subgroup_type='disease' — they bypassed adjudication entirely and entered disease matching with biomarker terms. The matching pipeline correctly matched “HER2” to breast cancer subtypes because those are the strongest semantic matches in the ontology. The upstream fix (classifying these as subgroup_type='biomarker' instead of disease) is a separate extraction prompt issue, not addressed here.
Backfill job 1870 (reset_cross_family_subgroup_diseases:reset) completed 2026-04-13.
Production verification (2026-04-16): Pub 48926 (DESTINY-CRC01) confirmed fixed. All 3 HER2-stratified cohorts have disease_id=null, correctly fall back to CRC (4345) via TDD. Cohort A now visible with ORR=45.3%, PFS=6.9mo, OS=15.5mo (N=53). Cohort B: PFS=2.1mo, OS=7.3mo. Cohort C: PFS=1.4mo, OS=7.7mo. IHC3+ subgroup ORR=57.5% (N=40) also visible.
Status: Resolved. Forward fix prevents future cross-family assignments. Backfill clears existing ones.
58. extract_dose_evidence v2 sets dose_min/dose_max to full escalation range instead of RDE range
Short summary
In escalation→expansion (Phase 1a→1b) studies, extract_dose_evidence v2 correctly identifies dose_context_type=escalation and populates the correct rp2d value, but sets dose_min/dose_max on the trial_arm_intervention to the full Part 1a escalation range instead of the Part 1b RDE/expansion range. Since the arm-level dose comes through a direct join in the view (not through pub_dose_lookup), the view’s dose_context_type gate (Issue 35, v21) does not block it. Expansion cohort subgroups inherit the wrong dose range.
Residual of Issue 29 (dose extraction captures study-level range, not efficacy population range). Issue 29 was marked “Complete” but this pattern persists on extract_dose_evidence v2.
Where this sits in the current pipeline
Extraction layer: extract_dose_evidence step — populates dose_evidence JSONB on trial_arm_interventions, which is then copied to the TAI’s single_dose, dose_min, dose_max, rp2d, etc. columns.
View layer: vw_publication_efficacy_data v23 — the arm_dose_lookup CTE joins trial_arm_interventions directly to outcomes via trial_arm_id. For “All Arms” outcomes, the fan-out (Issue 52) matches any arm’s intervention. The pub_dose_lookup gate blocks escalation dose from the publication_interventions fallback, but cannot block dose that comes from the arm intervention join.
Exact restriction causing the drop
extract_dose_evidence v2 prompt extracts dose from the abstract and correctly identifies:
- dose_context_type: "escalation"
- rp2d: "300, 600 and 900 mg Q2W" (the expansion doses)
- dose_min: "100 mg" (Part 1a starting dose — WRONG for expansion)
- dose_max: "1500 mg" (Part 1a max dose — WRONG for expansion)
The prompt doesn’t narrow dose_min/dose_max to the RDE range when the efficacy population is from the expansion phase. It reports the full dose range tested across both phases.
The view’s dose_context_type gate (v20/v21, Issue 35) only blocks pub_dose_lookup COALESCE fallback — it doesn’t affect arm_dose_lookup, which is a direct join. So the escalation range propagates to all subgroup rows via the arm.
Concrete examples
Pub 114077 (INCA33890, PD-1/TGFβR2 bispecific, phase I in advanced solid tumors):
- Part 1a (escalation): 100, 300, 600, 900, 1200, 1500 mg Q2W (n=48)
- Part 1b (expansion at RDEs): 300, 600, 900 mg Q2W — HNSCC n=18, MSS-CRC n=94, etc.
- HNSCC efficacy: 3 PRs, ORR 16.7%
extract_dose_evidence v2 output on the single TAI:
```json
{
  "dose_context_type": "escalation",
  "rp2d": "300, 600 and 900 mg Q2W",
  "dose_min": "100 mg",      // Wrong: Part 1a start
  "dose_max": "1500 mg",     // Wrong: Part 1a max
  "single_dose": "100 mg",   // Wrong: Part 1a start
  "confidence": 0.9
}
```

View shows HNSCC rows with single_dose=100 mg, dose_min=100 mg, dose_max=1500 mg, but HNSCC patients received 300-900 mg Q2W.
Audit issues 8587-8589 correctly flagged all three dose fields.
Pub 116843 (Temab-A + bevacizumab, dose expansion in 3L+ CRC):
Three dose-level arms: Temab-A 2.0 mg/kg (n=26), 2.4 mg/kg (n=30), SOC TAS-102 + Bev (n=20). Safety lead-in tested 1.6, 2.0, 2.4 mg/kg. Efficacy is from the dose optimization randomized cohort.
View shows per-arm rows with correct single_dose (2.0 or 2.4) but dose_min=1.6 mg/kg and dose_max=2.4 mg/kg on ALL arms — the safety lead-in range, not the per-arm dose:
- Temab-A 2.0 arm: `single_dose=2.0` ✓, `dose_min=1.6, dose_max=2.4` ✗ (should be null: fixed single dose, no range)
- Temab-A 2.4 arm: `single_dose=2.4` ✓, `dose_min=1.6, dose_max=2.4` ✗ (should be null: fixed single dose, no range)
Audit issues 8551-8553 correctly flagged dose_min/dose_max on both arms.
Pub 29738 (IMMU-132 / sacituzumab govitecan, Trop-2 ADC phase I/II):
Phase I escalation: 8 → 12 → 18 mg/kg (DLT at 18). Phase II focusing on 8 and 10 mg/kg. Efficacy pooled across all pts except pancreatic cancer (3 PR / 24 assessable).
Multiple TAIs created with different dose levels (8, 10, 12, 18 mg/kg), each with study-level dose_min/dose_max ranges (8-10 or 8-18). The Phase II efficacy population received 8 or 10 mg/kg, but some TAIs show dose_max=18 (the escalation max). Combined with Issue 52 fan-out, the query can non-deterministically show dose_max=10 or dose_max=18 depending on which TAI sorts first. Audit issue 8560 flagged dose_max=10 when the escalation went to 18.
Pub 115389 (becotatug vedotin + pucotenlimab, 1L R/M SCCHN):
Two fixed dose-level arms: DL1 (2.0 mg/kg, n=21) and DL2 (2.3 mg/kg, n=10). View shows per-arm rows with correct single_dose but study-level dose_min=2.0, dose_max=2.3 on BOTH arms — each arm’s range is contaminated with the other arm’s dose:
- DL1 arm: `single_dose=2.0` ✓, `dose_min=2.0, dose_max=2.3` ✗ (should be null: fixed single dose)
- DL2 arm: `single_dose=2.3` ✓, `dose_min=2.0, dose_max=2.3` ✗ (should be null: fixed single dose)
Audit issues 8530/8531 correctly flagged dose_max on DL1 and dose_min on DL2.
Pub 73299 (SN-38 ADC, phase I/II CRC + GC dose escalation + expansion):
“CRC (dose-escalation phase)” subgroup (N=18) sits on “All Arms” and picks up single_dose=100 mg/m2, dose_frequency=Q2W from the CRC expansion cohort arm’s intervention. The escalation phase tested multiple dose levels, but the expansion dose bleeds onto escalation rows via the All Arms fan-out (also Issue 52). The escalation subgroup should have no single_dose — multiple dose levels were tested.
Audit issues 8574/8575 correctly flagged single_dose and dose_frequency on the escalation subgroup.
Pub 48903 (T-DXd, HER2 solid tumors): dose_max=6.4 mg/kg but Part 1 escalation went up to 8.0 mg/kg. Expansion used 5.4 and 6.4 mg/kg. The dose_max reflects expansion range, not full escalation range. Audit issue 8564.
Pub 52543 (zanidatamab zovodotin, CRC): single_dose=10.0 mg/kg on a dose-escalation cohort spanning 2.2-10.0 mg/kg. The max dose is shown as a single_dose. Audit issue 8569.
Pub 114973 (MRG003, HNSCC): single_dose=3.6 mg/kg on a range of 3.6-5.4 mg/kg Q3W. Lowest dose shown as single_dose. Audit issue 8536.
Pub 238709 (telisotuzumab vedotin, c-Met+ solid tumors): Subgroup “Dose escalation ≥4.0 mpk → cMET+ → squamous cell carcinoma” has dose_min=1.0 mg/kg from the full escalation range, but the efficacy subset is only patients at ≥4.0 mg/kg. Audit issue 8585.
Downstream impact
- Clinical Evidence report shows wrong dose range for expansion cohorts
- Dose escalation starting dose (100 mg) appears as the dose for efficacy populations that never received it
- Misleading for clinicians: suggests efficacy was seen at lower doses than actually tested
TBD — needs assessment of how many pubs have dose_context_type=escalation on TAIs with efficacy subgroups that are expansion cohorts. The Issue 35 gate addressed the pub_dose_lookup path but this arm-level path was not gated.
Open questions
- Should `extract_dose_evidence` set dose_min/dose_max to the RDE range when it identifies expansion? The `rp2d` field already has the right values.
- Should the view’s arm_dose_lookup also gate on `dose_context_type`? If the arm is tagged as escalation, suppress dose_min/dose_max and only surface rp2d?
- Should `extract_interventions` create separate arms for Part 1a escalation vs Part 1b expansion? That would naturally scope dose to the right population.
Explored solution direction
Option A — Prompt fix in extract_dose_evidence: When dose_context_type=escalation and RDEs are identified, set dose_min to the lowest RDE and dose_max to the highest RDE (not the full escalation range). The efficacy population is at RDEs, not the full range.
Option B — View gate on arm_dose_lookup: Extend the dose_context_type gate to also suppress arm_dose_lookup dose fields (dose_min, dose_max, single_dose) when arm_dose_context_type IN ('escalation', 'range'). Surface only rp2d for these arms. This is a view-layer fix that doesn’t require re-extraction.
Option C — Create expansion arms in extract_interventions: Have the LLM create separate arms for Part 1a escalation and Part 1b expansion cohorts. Expansion arms get RDE-range dose. Most correct structurally but highest effort.
Option A is simplest — the LLM already identifies the RDEs correctly in the rp2d field, it just needs to use them for dose_min/dose_max too.
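The narrowing that Option A asks the prompt to perform can be illustrated outside the LLM. This is a minimal sketch, assuming the rp2d string lists the RDE levels ahead of a single unit token; the helper name `rde_range` and its parsing rule are hypothetical, and only the field names and the pub 114077 values come from the examples above.

```ruby
# Hypothetical helper: derive dose_min/dose_max from the RDE list in rp2d.
# Assumes the dose levels appear before one unit token (mg, mg/kg or mg/m2).
def rde_range(rp2d)
  match = rp2d.match(%r{\A(.*?\d)\s*(mg/kg|mg/m2|mg)\b})
  return nil unless match

  levels = match[1].scan(/\d+(?:\.\d+)?/) # e.g. ["300", "600", "900"]
  unit = match[2]
  {
    dose_min: "#{levels.min_by(&:to_f)} #{unit}",
    dose_max: "#{levels.max_by(&:to_f)} #{unit}"
  }
end

# dose_evidence as extracted for pub 114077 (values from the example above).
evidence = {
  dose_context_type: "escalation",
  rp2d: "300, 600 and 900 mg Q2W",
  dose_min: "100 mg",
  dose_max: "1500 mg"
}

# Only narrow when the arm is tagged as escalation and RDEs can be parsed.
narrowed =
  if evidence[:dose_context_type] == "escalation"
    evidence.merge(rde_range(evidence[:rp2d]) || {})
  else
    evidence
  end
```

With these inputs the narrowed hash carries the 300 to 900 mg expansion range in place of the 100 to 1500 mg escalation range, and an unparseable rp2d leaves the record unchanged.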
Solution applied
Phase 1 (v24): all_arms_drug_agg CTE aggregates drug_interventions to one row per (pub, drug) for “All Arms” outcomes. Eliminates the fan-out where every TAI was matched to every “All Arms” outcome.
Phase 2 (v25 + job 1915): all_arms_dose_agg uses dose_value_numeric column (backfilled via LLM, 97.7% coverage) for proper numeric MIN/MAX envelope when TAIs disagree on single_dose. Replaces non-deterministic text comparison.
Phase 3 (v26): “Prefer non-escalation” strategy in all_arms_dose_agg. When a (pub, drug) has both escalation and non-escalation TAIs, the agreement check and numeric envelope exclude escalation/range TAIs so dose reflects the efficacy population. Falls back to all TAIs when only escalation TAIs exist. 652 (pub, drug) groups narrowed, all range-narrowing, zero widening, zero data loss.
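The Phase 2 and Phase 3 behaviour described above can be sketched in Ruby for a single (pub, drug) group. This is an illustration of the rule, not the SQL in all_arms_dose_agg: the method name, the hash shape, and the sample dose values are assumptions, while `dose_context_type` and `dose_value_numeric` are the columns named in the text.

```ruby
# Context types treated as escalation-like, per the Phase 3 description.
ESCALATION_TYPES = %w[escalation range].freeze

# Hypothetical aggregation over the TAIs of one (pub, drug) group.
def all_arms_dose(tais)
  preferred = tais.reject { |t| ESCALATION_TYPES.include?(t[:dose_context_type]) }
  # Fall back to every TAI when the group has only escalation/range TAIs.
  pool = preferred.empty? ? tais : preferred

  values = pool.map { |t| t[:dose_value_numeric] }.compact
  return { single_dose: nil, dose_min: nil, dose_max: nil } if values.empty?

  if values.uniq.size == 1
    # Agreement: every remaining TAI reports the same dose.
    { single_dose: values.first, dose_min: nil, dose_max: nil }
  else
    # Disagreement: numeric MIN/MAX envelope instead of text comparison.
    { single_dose: nil, dose_min: values.min, dose_max: values.max }
  end
end

# Illustrative values only (not taken from a specific publication).
tais = [
  { dose_context_type: "escalation", dose_value_numeric: 12.0 },
  { dose_context_type: "escalation", dose_value_numeric: 18.0 },
  { dose_context_type: "rp2d", dose_value_numeric: 8.0 },
  { dose_context_type: "rp2d", dose_value_numeric: 10.0 }
]

mixed_group = all_arms_dose(tais)               # escalation TAIs excluded
escalation_only = all_arms_dose(tais.first(2))  # fallback to all TAIs
```

In the mixed group the envelope narrows to the non-escalation doses, which is the "range-narrowing, zero widening" property reported for v26; the escalation-only group keeps its full range because there is nothing better to prefer.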
Phase 3b (prompt v3 + o4-mini): extract_interventions prompt updated to instruct LLM to create a separate escalation arm when the abstract reports both escalation and expansion results. Default model changed to o4-mini (gpt-5-mini doesn’t follow this instruction). 16 pubs reset (job 1923) and re-extracted: interventions (job 1924), subgroups (job 1922), dose evidence (job 1921). All completed 2026-04-15.
Infra fix: Removed MIN_REPROCESS_VERSION from intervention_extraction.rb and dose_evidence_extraction.rb. Base scope now uses IS NULL only — re-extraction requires explicit reset. Prevents accidental mass re-processing on version bumps.
Production verification (2026-04-16): 99.87% of TAIs on v3 (96,442/96,564). 122 remain on v1/v2 (excluded by target_disease_or_hemonc_relevant scope). All spot-check pubs confirmed correct:
- Pub 74158: escalation arm has 4.8–16.0 range, expansion `single_dose=12.0`, per-cohort arms (8.0, 16.0) use `single_dose` with null min/max.
- Pub 116843: per-cohort `single_dose=2.0/2.4` with null min/max. Old range leak (1.6–2.4) gone.
- Pub 115389: DL1 `single_dose=2.0`, DL2 `single_dose=2.3`, both null min/max.
- Pub 114973: per-cohort 3.6/5.4 use `single_dose` with null min/max.
- Pub 238709: `dose_min=4.0, dose_max=5.0` for ≥4.0 mpk subset. No longer inheriting 1.0–8.0 escalation range.
- Pub 73299 (Phase 3b): escalation subgroup now on separate “Part 1a dose escalation” arm. Expansion cohort correctly shows `single_dose=100 mg/m2 Q2W`.
- All 16 Phase 3b pubs now have separate escalation/expansion arms after re-extraction.
Status: Complete.
59. LLM miscounts partial responses: tumor reductions below RECIST threshold counted as PRs
Short summary
classify_publications extracts PR count as the number of patients with ANY tumor size reduction, rather than only those meeting RECIST v1.1 criteria (≥30% decrease in sum of target lesion diameters). This inflates both PR count and ORR. Separately, the LLM may attribute dose-escalation-phase efficacy results to the expansion cohort with an incorrect N denominator, creating spurious subgroup rows.
Where this sits in the current pipeline
Extraction layer: app/tasks/publications_llm_classification/task.rb — the LLM reads the abstract and extracts measure_value for ORR and PR endpoints per subgroup. The prompt does not explicitly instruct the LLM on RECIST criteria for what qualifies as a PR.
Exact restriction causing the drop
The abstract describes tumor size reductions (e.g. 13%, 21%, 27%, 35%) and notes one “unconfirmed partial response” (the 35% one). The LLM counts all 4 reductions as PRs and computes ORR = 4/13 = 30.8%. Per RECIST v1.1, only the 35% reduction qualifies as a PR — the others are stable disease.
Secondary pattern: when an abstract reports a dose-escalation-phase result (e.g. “1 confirmed PR in a melanoma patient” from Part 1) alongside expansion data, the LLM creates a subgroup for the escalation result using the expansion N as denominator.
Concrete examples
Pub 75056 (MGC018 / vobramitamab duocarmazine, B7-H3 ADC in advanced solid tumors):
Abstract states: “Of 13 mCRPC patients with measurable disease, six were not yet evaluable, and seven had first 9-week imaging. Of the seven, four had reductions in target lesion sums of 13%, 21%, 27%, and 35% (unconfirmed partial response)”
LLM extracted for “mCRPC → measurable disease”:
- PR = 4, ORR = 30.8%, N = 13
- Should be: PR = 1 (only the 35% meets RECIST ≥30%), ORR ≈ 7.7% (1/13) or 14.3% (1/7 evaluable)
Audit issue 8534 flagged this.
Pub 75056 — spurious Melanoma row:
Abstract mentions “1 confirmed partial response in a melanoma patient” from the dose escalation phase. The expansion enrolled mCRPC (26), NSCLC (16), TNBC (7) = 49 patients — melanoma enrollment hadn’t started yet.
LLM created a “Melanoma” subgroup with:
- PR = 1, ORR = 2.0%, N = 49
- The PR is from escalation (different phase), and N = 49 is total expansion enrollment (not melanoma patients)
Audit issue 8535 flagged this as a spurious row.
Downstream impact
- Inflated ORR in Clinical Evidence report — clinicians see 30.8% instead of ~7.7%
- Spurious disease rows with wrong denominators create misleading efficacy signals
- Particularly dangerous for early-phase trials where small N makes each miscount a large percentage swing
What the issue is not
- Not a post-processing or view issue — the wrong values originate at extraction
- Not a zero-sentinel issue (Issue 54) — the LLM is actively extracting a wrong positive value, not defaulting to 0
TBD — needs investigation. Key questions to assess scope:
- PR overcounting: How many pubs report tumor size reductions without explicit PR classification? Query for pubs where extracted PR count > 0 but abstract contains phrases like “reduction in target lesion” or “tumor shrinkage” without “partial response” in the same sentence. Phase I/II dose escalation pubs are highest risk.
- Cross-phase attribution: How many pubs mix escalation and expansion results? Look for pubs with subgroups where the subgroup disease doesn’t match any expansion cohort disease, or where N doesn’t match any described cohort size.
- Proxy signal: Pubs where ORR is high but only “unconfirmed” or no confirmed PRs are mentioned in abstract text.
Open questions
- Should the `classify_publications` prompt explicitly instruct the LLM on RECIST v1.1 criteria? Or is the problem that the abstract itself is ambiguous about which reductions are PRs?
- For cross-phase attribution: should the prompt instruct the LLM to only extract efficacy from the expansion phase when both phases are reported? Or should escalation results be kept but tagged differently?
- How prevalent is the “tumor reduction ≠ PR” pattern in early-phase oncology abstracts?
Explored solution direction
Option A — Prompt reinforcement: Add explicit instructions to classify_publications prompt: “Only count a patient as having a Partial Response (PR) if the abstract explicitly states PR or if the tumor reduction meets RECIST v1.1 criteria (≥30% decrease). Do not count tumor size reductions below 30% as PRs.” Also: “Only extract efficacy results for patients enrolled in the study phase being reported. Do not mix dose-escalation phase results into expansion cohort data.”
Option B — Validation step: Add a post-extraction validation that flags ORR/PR values where the abstract text mentions specific reduction percentages below 30% — these could be auto-corrected or flagged for manual review.
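One way the Option B flag could work is sketched below, assuming the abstract quotes the reductions in a sentence containing "reduction(s) in target lesion". The method name `pr_overcount?`, the constant, and the phrase match are hypothetical and not part of the existing pipeline; the abstract text and counts are those of pub 75056.

```ruby
# Hypothetical Option B check, not existing pipeline code.
# Compares the extracted PR count with the number of quoted reductions that
# reach the RECIST v1.1 partial-response cutoff.
RECIST_PR_CUTOFF = 30.0 # percent decrease in sum of target lesion diameters

def pr_overcount?(abstract, extracted_pr_count)
  # Capture from the phrase to the end of its sentence (a period followed by
  # whitespace, or end of text) so decimals such as "13.5%" are not cut short.
  sentence = abstract[/reductions? in target lesion.*?(?=\.\s|\z)/im]
  return false unless sentence

  percentages = sentence.scan(/(\d+(?:\.\d+)?)%/).flatten.map(&:to_f)
  return false if percentages.empty?

  qualifying = percentages.count { |pct| pct >= RECIST_PR_CUTOFF }
  extracted_pr_count > qualifying
end

abstract = "Of the seven, four had reductions in target lesion sums of " \
           "13%, 21%, 27%, and 35% (unconfirmed partial response)"

flag_before = pr_overcount?(abstract, 4) # LLM counted all four reductions
flag_after  = pr_overcount?(abstract, 1) # only the 35% reduction counted
```

A flag like this only catches abstracts that spell out the percentages, so it would complement the prompt change in Option A rather than replace it.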
Solution applied
60. Overall ORR value attributed to per-cohort subgroup instead of Overall
Short summary
When an abstract reports a single ORR for the overall population and all responses come from one cohort, the LLM assigns that overall ORR value to the cohort subgroup rather than to Overall. The cohort gets the overall rate (e.g. 9% = 2/23) instead of the correct cohort-specific rate (e.g. 13.3% = 2/15). The LLM may also extract individual-patient DoR values as if they were summary statistics when N=1.
Where this sits in the current pipeline
Extraction layer: app/tasks/publications_llm_classification/task.rb — the LLM reads “response rate of 9%” in the abstract and assigns it to Cohort 1 because both responses were in Cohort 1, without recognizing the denominator (23) doesn’t match the cohort N (15).
Exact restriction causing the drop
The abstract says: “Two responses were reported, both in cohort 1 (1 complete and 1 unconfirmed partial response) for a response rate of 9% (95% CI: 0-20%).”
The “9%” uses the overall N=23 denominator (2/23 = 8.7% ≈ 9%). The LLM sees “both in cohort 1” and “response rate of 9%” in the same sentence and assigns 9% to Cohort 1 (N=15). The correct Cohort 1 ORR would be 2/15 = 13.3%.
Concrete examples
Pub 57529 (ABBV-399 / telisotuzumab vedotin, c-Met+ SCC, Lung-MAP S1400K):
Two cohorts: Cohort 1 ICI-naïve (N=15), Cohort 2 ICI-refractory (N=13). Total evaluable = 23. Both responses (1 CR, 1 UPR) were in Cohort 1.
View shows:
- Cohort 1: ORR = 9%, N = 15 — wrong (should be 13.3% = 2/15)
- Overall: ORR = 8.7%, N = 23 — correct (2/23)
Also: DoR = 2.3 months (N=1) for Cohort 1. This is the individual-patient DoR for the single UPR, not a median. With N=1 the value is technically correct, but extracting individual-patient DoR as a summary endpoint is questionable.
Audit issues 8580–8581 flagged both.
Pub 29700 (ABBV-400 / telisotuzumab adizutecan, CRC dose escalation + expansion):
Abstract says: “Activity was also seen at lower c-Met expression levels (10–15% ORR).” This is a range, not a single value — no specific ORR is stated for low c-Met.
LLM extracted for “CRC → Low c-Met expression”:
- ORR = 12.5%, cORR = 12.5% — fabricated midpoint of the 10-15% range
- Additionally tagged as confirmed ORR when the abstract doesn’t specify confirmed for this subgroup
The LLM converted a range to a point estimate by averaging. The per-arm cORR from the table (0%, 15%, 20%) doesn’t map to 12.5% either.
Audit issues 8537–8538 flagged both ORR and cORR.
Note: This is a range-fabrication issue, not a denominator-mismatch issue. No CR/PR counts exist for the Low c-Met subgroup so the post-process validation cannot catch it. Would need a separate prompt fix about not averaging ranges.
Pub 136275 (HER3-DXd, aNSCLC with brain metastases):
“untreated brain metastases” subgroup: ORR = 33.0% (N=7) in view. Abstract says 2/7 = 28.6%. The LLM extracted 33% — wrong value, unclear where 33% comes from (possibly confused with another subgroup). Audit issue 8584.
Update: After re-extraction with prompt v2, the LLM now extracts ORR=30.0% (N=20) for the overall cohort — matching the abstract’s “six (30%) of 20 patients having intracranial responses”. The incorrect untreated BM subgroup is no longer extracted separately. Fixed by re-extraction.
Pub 238377 (zanidatamab zovodotin, CRC → RAF/RAS-mut):
View shows DoR = 48 (N=4). Abstract says “two RAS-mut pts having DoR >48 weeks” — this is an individual-patient threshold (“>48 weeks”), not a median DoR. The LLM extracted the raw number 48 as a summary DoR value. Audit reports it converted to 11.03 months. Audit issue 8579.
Note: This is a different bug type — individual-patient threshold extracted as summary stat. Not addressable by denominator validation.
Pub 151763 (MRG003, R/M SCCHN):
DoR patient_count = 10 but abstract says 6 responders. The LLM used the wrong denominator for DoR — possibly using N evaluable (10) instead of N responders (6). Audit issue 8527.
Update: After re-extraction with prompt v2, DoR now shows N=6 (responders). Fixed by re-extraction.
Pub 65578 (telisotuzumab vedotin, NSCLC → squamous):
N = 19 but abstract says 20. Off-by-one extraction error. Audit issue 8583.
Update: After investigation, abstract table shows Squamous NSCLC N=20 enrolled but 19 evaluable for response. The extraction correctly uses N=19 (evaluable) with ORR=15.8% (3/19). Not a bug — different denominators (enrolled vs evaluable).
Pub 30082 (cetuximab-based ADC, HNSCC → Squamous):
View shows N=25 for ORR but N=29 for PFS on the same subgroup. Audit flags N=25 vs abstract’s 29. Likely different denominators per endpoint (evaluable for response vs enrolled for PFS) — may be correct extraction but needs abstract verification. Audit issue 8582.
Update: Confirmed after abstract verification. Table shows Squamous N=29 enrolled, ORR 84.0% (21/25 evaluable), 6-mo PFS rate 73.5% (N=29). The extraction correctly uses N=25 for ORR and N=29 for PFS. Not a bug — abstract uses different denominators per endpoint.
Downstream impact
- Cohort-specific ORR is understated (9% vs 13.3%) — gives a worse picture of per-cohort efficacy
- When responses cluster in one cohort (common in multi-cohort trials), the pattern systematically deflates that cohort’s ORR
- Individual-patient DoR values may mislead when displayed without N context
What the issue is not
- Not a post-processing or view issue — wrong values originate at extraction
- Not a zero-sentinel issue (Issue 54) — the LLM extracts a real value, just the wrong one
- Not an ORR derivation issue — the LLM is copying a stated percentage, not computing from CR+PR
Very low — ~2-5 pubs. Investigated 2026-04-13.
Proxy query: pubs with both an Overall and a non-Overall ORR where the values match (within 1pp) but the Ns differ. Found 25 distinct pubs with non-zero ORR. Spot-checked 7 — 5 were false positives (ORR values coincidentally match because the ratios genuinely round the same way), 1 was a different bug (pub 70501: subgroup N combines arms while ORR is arm-specific), 1 was a coincidental cross-arm match.
Strict math check (ORR × cohort_N doesn’t produce a near-integer number of responders): only 2 pubs flagged, including the known example pub 57529.
The proxy query has a very high false-positive rate because ORR values legitimately coincide more often than expected, especially with round percentages and small Ns. The known examples from this tracker (57529, 29700, 136275) likely represent most of the affected pubs.
Related sub-issues (from this tracker entry’s examples):
- DoR with N=1 (individual-patient values as summary stats): 367 pubs, 479 records — much larger, may warrant its own investigation.
- `OR` instead of `ORR` abbreviation (Issue 61): 125 pubs, 228 records.
Open questions
- Should the `classify_publications` prompt instruct the LLM to verify ORR denominators match the subgroup N? Done: prompt reinforcement added.
- Should we add a post-extraction validation that checks `ORR ≈ (CR + PR) / N × 100` and flags mismatches? Done: `validate_orr_denominator` in post_process.rb.
- For DoR with N=1: should we tag these differently (e.g. `is_individual_patient: true`) or just accept them? Still open: 367 pubs, may warrant its own issue.
Explored solution direction
Option A — Prompt reinforcement: Instruct the LLM to always verify that extracted ORR is consistent with the subgroup’s N and response count. When the abstract reports an overall ORR and all responses are in one cohort, compute cohort-specific ORR = responses / cohort N.
Option B — Post-extraction math check: After extraction, validate ORR ≈ (CR + PR) / N × 100 for each subgroup. When the check fails, flag for re-extraction or auto-correct using the stated CR/PR counts.
Solution applied
Both Option A and B implemented.
Prompt reinforcement (app/tasks/publications_llm_classification/task.rb): Added “ORR denominator consistency” section after ORR definition. Instructs the LLM to verify ORR × N produces a near-integer responder count and not to copy overall ORR to subgroups with different N. Effective for some pubs (pub 29158 self-corrected from 8.7% to 13.3%) but not all (pub 57529 still extracted 9% at o4-mini medium reasoning).
Post-process validation (app/tasks/publications_llm_classification/post_process.rb): validate_orr_denominator runs after each subgroup is materialized. For every ORR percentage:
- Checks if `ORR × N / 100` produces a near-integer responder count (tolerance ≤ 0.3)
- If not, looks up CR+PR counts for the same subgroup (summing across all confirmed flags)
- Computes expected ORR from `(CR + PR) / N × 100`
- If difference > 1pp, overwrites the stated ORR with the computed value and logs the correction
Guards against false positives: requires CR+PR counts to exist, ORR must fail the integer-responder check, computed ORR must differ by >1pp, and must be ≤100%. Tested on 100 pubs with ORR + CR/PR counts — zero false positives, caught 1 additional real bug (pub 29158).
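The checks and guards above can be sketched as a pure function. This is not the code in post_process.rb: the method name `validate_orr`, its signature, and its return value are assumptions, and logging is omitted. The tolerance, the 1pp threshold, and the 100% cap follow the description.

```ruby
# Sketch of the denominator validation described above (hypothetical signature).
# Returns the ORR to store: the stated value, or the value recomputed from counts.
def validate_orr(stated_orr, n, cr_count: nil, pr_count: nil)
  return stated_orr if n.nil? || n.zero?

  # Step 1: does ORR x N / 100 land near a whole number of responders?
  implied_responders = stated_orr * n / 100.0
  return stated_orr if (implied_responders - implied_responders.round).abs <= 0.3

  # Guard: CR/PR counts must exist before anything is overwritten.
  return stated_orr if cr_count.nil? && pr_count.nil?

  # Step 2: expected ORR from the response counts.
  expected = (cr_count.to_i + pr_count.to_i) * 100.0 / n

  # Guards: computed value must be plausible and differ by more than 1pp.
  return stated_orr if expected > 100.0
  return stated_orr if (expected - stated_orr).abs <= 1.0

  expected.round(1)
end

# Pub 57529: 1 CR + 1 unconfirmed PR, both in Cohort 1.
cohort_1 = validate_orr(9.0, 15, cr_count: 1, pr_count: 1) # 9% x 15 = 1.35 responders
overall  = validate_orr(9.0, 23, cr_count: 1, pr_count: 1) # 9% x 23 = 2.07 responders
```

For Cohort 1 the implied responder count of 1.35 fails the integer check and the value is recomputed as 2/15; for Overall the implied count of 2.07 passes and the stated 9% is kept, which matches the before/after behaviour reported for this publication.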
Concrete example results after re-extraction + post-processing:
| Pub | Before | After | How fixed |
|---|---|---|---|
| 57529 | Cohort 1 ORR=9% (N=15) | 13.3% | Post-process validation (2/15) |
| 29158 | Non-MEN ORR=8.7% (N=15) | 13.3% | LLM self-corrected with new prompt |
| 136275 | Untreated BM ORR=33% (N=7) | 30% (N=20, overall) | LLM dropped incorrect subgroup |
| 151763 | DoR patient_count=10 | DoR N=6 | LLM self-corrected |
| 65578 | N=19 flagged as wrong | N=19 (correct — evaluable pts) | Was not a bug |
| 30082 | N=25 vs N=29 mismatch | N=25 ORR, N=29 PFS | Was not a bug (different denominators per endpoint) |
| 29700 | Low c-Met ORR=12.5% | Still 12.5% | Not fixed — range fabrication, different bug type |
| 238377 | DoR=48 individual threshold | ORR correct, DoR needs checking | Different bug type |
Production action: Reset pubs 57529 and 29158 via job 1871 (reset_publications:reset), re-extracted through pipeline. Completed 2026-04-13.
Production verification (2026-04-16):
- Pub 57529: wrong Cohort 1 ORR=9% (N=15) is gone. Now only “Overall” ORR=9% (N=23) — correct overall population rate. Per-cohort subgroups not separately extracted (abstract doesn’t clearly state per-cohort ORR).
- Pub 136275: ORR=30.0% (N=20) — matches abstract’s “six (30%) of 20 patients”. Incorrect untreated BM subgroup no longer extracted.
- Pub 151763: DoR now shows N=6 (responders) — correct.
- Pub 29700: Low c-Met ORR=12.5% persists — range fabrication residual (abstract says “10-15% ORR”, LLM averaged). Different bug type, not covered by this fix.
Status: Complete.
61. Endpoint abbreviation ‘OR’ instead of ‘ORR’ makes ORR invisible in query
Short summary
Post-processing creates a trial_outcome_measure with endpoint_abbreviation = 'OR' instead of 'ORR'. The query’s extract_efficacy_metrics filters for PRIMARY_EFFICACY_ABBREVIATIONS = %w[OS PFS ORR DOR DoR DFS DCR] — ‘OR’ doesn’t match, so the publication appears to have no ORR even though the data exists in the view.
Where this sits in the current pipeline
Post-processing layer: app/tasks/publications_llm_classification/post_process.rb — when creating trial_outcome_measures, the endpoint abbreviation is set from the LLM output. If the LLM outputs “OR” instead of “ORR”, it flows through uncorrected.
Query layer: app/queries/tpp/clinical_evidence_query.rb — PRIMARY_EFFICACY_ABBREVIATIONS is a strict allowlist. ‘OR’ is not on it.
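The effect of the strict allowlist can be reproduced in a few lines. The constant is quoted from the summary above; the row hashes are a simplified, assumed stand-in for view rows, using the pub 74193 values.

```ruby
# Allowlist as quoted in the short summary.
PRIMARY_EFFICACY_ABBREVIATIONS = %w[OS PFS ORR DOR DoR DFS DCR].freeze

# Simplified stand-in for view rows (hash shape is an assumption).
rows = [
  { endpoint_abbreviation: "OR",  measure_value: 12.5, n: 8 }, # as stored
  { endpoint_abbreviation: "PFS", measure_value: 4.1,  n: 8 }  # illustrative value
]

# Exact string membership: "OR" is dropped even though it holds the ORR data.
visible = rows.select do |row|
  PRIMARY_EFFICACY_ABBREVIATIONS.include?(row[:endpoint_abbreviation])
end

visible_abbreviations = visible.map { |row| row[:endpoint_abbreviation] }
```

Because membership is an exact string match, any abbreviation variant silently removes the row from the report instead of raising an error.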
Concrete examples
Pub 74193 (T-DM1, HER2+ solid tumors):
View shows endpoint_abbreviation = 'OR' with measure_value = 12.5 (N=8) — matches abstract’s 12.5% (1/8). The data is correct but invisible because the abbreviation is ‘OR’ not ‘ORR’. Audit issue 8568 flagged missing ORR for Overall.
TBD — needs investigation. Query to assess:
```sql
SELECT COUNT(DISTINCT tom.source_id)
FROM trial_outcome_measures tom
WHERE tom.source_type = 'Publication'
  AND tom.endpoint_abbreviation = 'OR'
```

Also check for other near-miss abbreviations: ‘PFS2’ vs ‘PFS’, ‘mOS’ vs ‘OS’, ‘DoR’ vs ‘DOR’ (note: query already handles DoR).
Open questions
- Should post-processing normalize common abbreviation variants (‘OR’ → ‘ORR’, ‘mPFS’ → ‘PFS’, etc.)?
- Should the query’s PRIMARY_EFFICACY_ABBREVIATIONS include common variants as fallbacks?
- Is there a comprehensive list of abbreviation variants the LLM produces?
Explored solution direction
Option A — Post-processing normalization: Add an abbreviation mapping in post_process.rb that normalizes known variants before creating trial_outcome_measures. E.g. {'OR' => 'ORR', 'mPFS' => 'PFS', 'mOS' => 'OS'}.
Option B — Query-layer fallback: Expand PRIMARY_EFFICACY_ABBREVIATIONS to include common variants. Lower effort but doesn’t fix the underlying data.
Option A is preferred — fix at the source so all downstream consumers benefit.
Solution applied
Option A implemented — abbreviation normalization in post_process.rb + backfill task.
Root cause: The LLM outputs valid clinical abbreviations (OR, RR, BOR, mPFS, etc.) that don’t match the query’s PRIMARY_EFFICACY_ABBREVIATIONS list. The identifier_extraction.rb step tries to normalize via Endpoint lookup, but when the abbreviation isn’t in the endpoints table (no record for ‘OR’, ‘RR’, ‘BOR’), the raw LLM value flows through to trial_endpoints.abbreviation.
Fix: normalize_abbreviation method in post_process.rb, called before endpoint lookup in both process_endpoints and process_outcome_measures. Two-tier mapping:
- Unconditional (abbreviation alone is unambiguous): `BOR→ORR`, `bORR→ORR`, `cORR→ORR`, `ORS→ORR`, `mPFS→PFS`, `mOS→OS`, `mDOR→DOR`, `mDFS→DFS`
- Conditional (abbreviation is ambiguous, checks endpoint name): `OR→ORR` and `RR→ORR` only when endpoint name matches `/response|remission/i`. Preserves legitimate non-ORR uses like “Relapse Rate”, “Recurrence Rate”, “Odds Ratio”.
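The two-tier mapping can be sketched as follows. The mappings and the regex are taken from the list above; the constant names and the two-argument signature are assumptions, since the actual method body in post_process.rb is not shown here.

```ruby
# Tier 1: the abbreviation alone is unambiguous.
UNCONDITIONAL_ABBREVIATIONS = {
  "BOR" => "ORR", "bORR" => "ORR", "cORR" => "ORR", "ORS" => "ORR",
  "mPFS" => "PFS", "mOS" => "OS", "mDOR" => "DOR", "mDFS" => "DFS"
}.freeze

# Tier 2: ambiguous abbreviations, rewritten only for response-type endpoints.
CONDITIONAL_ABBREVIATIONS = { "OR" => "ORR", "RR" => "ORR" }.freeze
RESPONSE_NAME_PATTERN = /response|remission/i

# Hypothetical signature: raw LLM abbreviation plus the endpoint's full name.
def normalize_abbreviation(abbreviation, endpoint_name)
  if UNCONDITIONAL_ABBREVIATIONS.key?(abbreviation)
    return UNCONDITIONAL_ABBREVIATIONS[abbreviation]
  end

  if CONDITIONAL_ABBREVIATIONS.key?(abbreviation) &&
     endpoint_name.to_s.match?(RESPONSE_NAME_PATTERN)
    return CONDITIONAL_ABBREVIATIONS[abbreviation]
  end

  abbreviation # unknown or legitimately different endpoint: leave untouched
end

normalized_or = normalize_abbreviation("OR", "Objective Response")
kept_or       = normalize_abbreviation("OR", "Odds Ratio")
kept_rr       = normalize_abbreviation("RR", "Relapse Rate")
```

Keeping the ambiguous pair in a separate table makes the name check the only thing that can turn `OR` or `RR` into `ORR`, which is why endpoints such as Odds Ratio and Relapse Rate survive the backfill.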
Scale: 1,019 trial_endpoints normalized locally (460 BOR, 384 RR, 120 OR, 24 cORR, 14 mPFS, 9 mOS, 4 mDOR, 2 bORR, 1 ORS, 1 mDFS). 190 records correctly kept unchanged (184 RR with non-response names, 6 OR with non-response names).
Backfill: lib/tasks/one_off/normalize_endpoint_abbreviations.thor — two-step task (identify → fix). Job 1875 completed 2026-04-13.
Production verification (2026-04-16): Pub 74193 now has abbreviation='ORR' (was OR), value 12.5% visible in query. Zero non-canonical abbreviations remain for the unconditional mappings (BOR, bORR, cORR, ORS, mPFS, mOS, mDOR, mDFS all normalized). Remaining OR (8 records: Ovulation Rate, Odds Ratio, Oligorecurrence Rate, Risk of HCC) and RR (312 records: Relapse Rate, Recurrence Rate, Resection Rate, etc.) are all legitimate non-response endpoints — correctly preserved by the conditional guard.
Status: Complete.
62. Population-subgroup cuts modeled as fake trial_arms (ethnicity/analysis-set pseudo-arms + garbage RP2D)
Short summary
intervention_extraction v1/v2 prompts let the LLM model population-cut subgroups (ethnicity, demographic, response category, overall/ITT/pooled population) as full-fledged trial_arms alongside the real randomized treatment arms. Those pseudo-arms then receive the pooled efficacy outcomes (ORR/DCR/PFS), while the real dose arms end up with zero trial_arm_outcomes attached. Symptom-adjacent: the TAI rp2d field on these pseudo-arms gets stuffed with a comma-separated dose list (e.g. "1.6, 2.4, and 3.0 mg/kg Q3W"), which is structurally invalid — RP2D is a single dose.
In the Clinical Evidence tool this manifests as: publication visibly has “3 dose arms” in the admin Raw tab, but only renders one aggregated efficacy row because the outcomes live on the pseudo-subgroup arm, not on the dose arms. The tool appears to “average across doses” when in fact it’s faithfully surfacing the pooled-overall extraction.
Where this sits in the current pipeline
Extraction layer: app/tasks/publications_llm_classification/intervention_extraction.rb — PROMPT_VERSION=1,2 (pre-Issue 52 Phase 2).
The v1/v2 system prompt did not constrain what counts as an arm beyond “a group of patients who received the same treatment”. With no explicit rule forbidding population cuts, the LLM emitted arms for ethnicity / age / response-category subgroups whenever the abstract reported efficacy by those cuts.
Materialization: TrialArmMaterializer.materialize! creates trial_arms + trial_arm_interventions rows from llm_data.intervention_arms. No validation of arm type vs. treatment role, so pseudo-arms flow straight through.
Query layer: app/queries/tpp/clinical_evidence_query.rb — build_result_rows groups by [pub_id, disease_id, effective_line, arm_name, subgroup_value]. Pseudo-arms get their own rows with the pooled outcome; real dose arms either don’t appear (no outcome linked) or appear with empty efficacy.
Exact restriction causing the drop
The LLM prompt defines an arm as “a group of patients who received the same treatment” but doesn’t forbid modeling population subgroups as arms. When the abstract presents efficacy results broken down by a population cut (e.g. “ORR was 11.1% in Asian pts, 16.3% in non-Asian, 15.6% overall”), the LLM creates three extra arms to hold those three ORR values, alongside the real randomized arms.
The v3 prompt’s escalation/expansion guardrails (added for Issue 52 Phase 2) empirically fix this pattern without needing an explicit “don’t model population cuts as arms” rule — verified on pub 190844 (see Spot checks).
Concrete examples
Pub 190844 (Temab-A in Asian patients with advanced CRC, Phase I subanalysis):
Abstract Table 203P reports efficacy by ethnicity:
- AS (n=18): ORR 11.1, DCR 77.8, mPFS 4.40
- Non-AS (n=104): ORR 16.3, DCR 74.0, mPFS 5.16
- OVR (N=122): ORR 15.6, DCR 74.6, mPFS 4.63
Real randomized arms per the Methods: EXP randomized to 1.6 / 2.4 / 3.0 mg/kg Q3W, plus ESC pooled.
v1 extraction (observed in prod):
- 8 trial_arms total
- 4 legitimate: “Temab-A 1.6 mg/kg Q3W (EXP)”, “Temab-A 2.4 mg/kg Q3W (EXP)”, “Temab-A 3.0 mg/kg Q3W (EXP)”, “Dose escalation cohorts (ESC, 1.6–6.0 mg/kg)”
- 3 pseudo-arms: “Temab-A monotherapy — Asian subgroup (AS)”, “Temab-A monotherapy — non-Asian subgroup (Non-AS)”, “Temab-A monotherapy — overall CRC population (OVR)”
- 1 auto-created “All Arms”
- All 14 `trial_arm_outcomes` (ORR/DCR/PFS for the 3 subgroup cuts) attached to the 3 pseudo-arms; zero TAOs on the dose arms
- Pseudo-arms’ TAIs had `dose="varied (ESC 1.6–6.0 mg/kg; EXP randomized to 1.6, 2.4, 3.0 mg/kg Q3W)"` and `rp2d="1.6, 2.4, and 3.0 mg/kg Q3W"` with `dose_context_type=rp2d`, which is structural garbage.
v3 re-extraction (verified locally, 2026-04-22):
- 4 treatment arms only: “Dose escalation”, “Expansion 1.6 mg/kg Q3W”, “Expansion 2.4 mg/kg Q3W”, “Expansion 3.0 mg/kg Q3W”
- Pseudo-arms gone
- Clean dose fields, no RP2D garbage
- After classify_publications re-runs, the 3 ORR values will attach to `trial_subgroups` (CRC / CRC → Asian / CRC → non-Asian, which already exist) with outcomes linked to “All Arms” or the dose arms, not to a fake arm
Downstream impact
- Clinical Evidence tool shows one pooled ORR row for these pubs (e.g. 15.6% on “OVR arm, N=122, dose range 1.6–3.0 mg/kg”) instead of per-dose or cleanly subgroup-scoped rows
- The real dose arms have no efficacy surfaced, so dose-response patterns are invisible
- Clients reasonably infer “the tool is averaging across doses” when actually the LLM attached the pooled value to a fake arm
- `rp2d` field contains nonsense strings that break any RP2D-based filters or displays
What the issue is not
- Not an abstract-content problem: the abstract faithfully reports efficacy by ethnicity, not by dose. Per-dose ORR is genuinely not in the source. The fix surfaces the correct structure (subgroup cuts as `trial_subgroups`, outcomes on real arms); it does not invent per-dose ORR values that don’t exist in the abstract.
- Not Issue 52 (dose fan-out in view): that was a view-layer join bug; this is an extraction-layer modeling bug.
- Not Issue 55/56 (cross-tabulated subgroups missing): those are about under-extraction; this is about over-extraction (creating fake arms instead of using subgroups).
Prod counts of pubs with extracted intervention_arms:
| intervention_extraction_version | Total pubs | Matches tight signal (stale + pseudo-arm pattern) |
|---|---|---|
| null (pre-versioning) | 18,324 | 872 |
| v1 | 5,262 | 163 |
| v2 | 736 | 98 |
| v3 (current) | 81 | 2 (edge cases) |
Tight-signal target: 1,133 pubs. Signal = pubs with (intervention_extraction_version < 3 OR null) AND (at least one trial_arm.name matches an ethnicity/demographic/response-category/overall-subgroup/analysis-set pattern OR at least one trial_arm_intervention.rp2d contains a comma or " and ").
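The signal reads as a two-part predicate: a version test, then either of two pattern tests. A minimal sketch in plain Ruby follows. The name regex is hypothetical (the real pattern used for the prod counts is not reproduced in this tracker), and the hash shape for a pub is an assumption made for the example.

```ruby
# Hypothetical name pattern; the real regex is not shown in this tracker.
PSEUDO_ARM_NAME = /\b(asian|subgroup|overall|itt|pooled|responders?)\b/i

# RP2D should be a single dose; a comma or " and " suggests a dose list.
RP2D_LIST = /,| and /

# Stale means pre-versioning (nil) or any version below 3.
def stale_version?(version)
  version.nil? || version < 3
end

# pub is an assumed hash with :intervention_extraction_version,
# :arm_names (array of strings) and :rp2d_values (array, may contain nil).
def tight_signal?(pub)
  return false unless stale_version?(pub[:intervention_extraction_version])

  pub[:arm_names].any? { |name| name.match?(PSEUDO_ARM_NAME) } ||
    pub[:rp2d_values].compact.any? { |rp2d| rp2d.match?(RP2D_LIST) }
end

v1_pub = { intervention_extraction_version: 1,
           arm_names: ["Temab-A 1.6 mg/kg Q3W (EXP)", "Asian subgroup (AS)"],
           rp2d_values: [nil] }
v3_pub = v1_pub.merge(intervention_extraction_version: 3)
legacy_pub = { intervention_extraction_version: nil,
               arm_names: ["Dose escalation"],
               rp2d_values: ["1.6, 2.4, and 3.0 mg/kg Q3W"] }
clean_v2_pub = { intervention_extraction_version: 2,
                 arm_names: ["Expansion 2.4 mg/kg Q3W"],
                 rp2d_values: ["2.4 mg/kg Q3W"] }

puts tight_signal?(v1_pub)
puts tight_signal?(v3_pub)
puts tight_signal?(legacy_pub)
puts tight_signal?(clean_v2_pub)
```

Checking the version first keeps current-version pubs out of the cohort even when an arm name happens to match the pattern.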
Estimated cost of targeted re-extraction: ~$16 on o4-mini (1,133 × ~$0.014/pub from local test).
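The target and cost figures can be re-derived from the table above. The short check below uses only numbers already stated in this section; the variable names are illustrative.

```ruby
# Tight-signal matches per stale version, taken from the table above.
matches_by_version = { nil => 872, 1 => 163, 2 => 98 }
target = matches_by_version.values.sum

# Per-pub cost on o4-mini from the local test.
cost_per_pub = 0.014
estimated_cost = (target * cost_per_pub).round(2)

# Total pubs with extracted intervention_arms across all four version rows.
total_pubs = 18_324 + 5_262 + 736 + 81
share_of_total = (target.to_f / total_pubs * 100).round(1)

puts "target pubs: #{target}"
puts "estimated cost (USD): #{estimated_cost}"
puts "share of all pubs (%): #{share_of_total}"
```

The target is under 5% of the pubs with extracted arms, which is what makes a targeted reset much cheaper than a full re-extraction.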
Spot checks
- Pub 190844: v1 had 7 extracted arms (3 pseudo + 4 real), plus the auto-created “All Arms”. After local reset + v3 re-extraction: 4 clean treatment arms, no pseudo-arms, clean dose fields. Verified the 3 ethnicity ORR values are retained in `trial_subgroups` as “CRC”, “CRC → Asian”, “CRC → non-Asian”; they will re-attach to real arms after classify_publications re-runs.
Open characterization questions
- Do the null-version pubs (18,324 total, 872 matching the tight signal) include systematic issues beyond pseudo-arms? Worth sampling 20 pubs outside the tight signal to see if anything else is silently broken on the oldest extractions.
- Should legitimate biomarker-defined cohorts (e.g. basket trial “Cohort A (HER2-amp)”) be modeled as arms or subgroups? Currently mixed — no explicit rule. Could regress if a too-aggressive “no population cuts as arms” guardrail is added.
- Post re-extraction, classify_publications needs to re-link orphan `trial_arm_outcomes` (8,586 unlinked in the local reset) to the newly-materialized arms by arm-name matching. Does the current arm-linking logic handle the case where a TAO’s original arm name (“Temab-A monotherapy — Asian subgroup (AS)”) no longer exists as an arm?
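The failure case in the last question can be made concrete with a small sketch. This is not the current arm-linking logic, which is not shown in this tracker; it is one hypothetical policy (exact name match, else fall back to “All Arms”), and the arm ids and the shortened stale name are placeholders.

```ruby
# Arms present after v3 re-extraction of pub 190844, plus the auto-created
# "All Arms". The ids are placeholders for the example.
ARMS_BY_NAME = {
  "Dose escalation" => 1,
  "Expansion 1.6 mg/kg Q3W" => 2,
  "Expansion 2.4 mg/kg Q3W" => 3,
  "Expansion 3.0 mg/kg Q3W" => 4,
  "All Arms" => 5
}.freeze

# Hypothetical policy: use an exact name match when one exists, otherwise
# fall back to "All Arms" so the outcome is not left orphaned.
def relink_arm_id(original_arm_name, arms_by_name)
  arms_by_name.fetch(original_arm_name) { arms_by_name["All Arms"] }
end

# A surviving name links directly; a deleted pseudo-arm name falls back.
direct = relink_arm_id("Expansion 2.4 mg/kg Q3W", ARMS_BY_NAME)
fallback = relink_arm_id("Asian subgroup (AS)", ARMS_BY_NAME)

puts "direct match -> arm #{direct}"
puts "stale pseudo-arm name -> arm #{fallback}"
```

Without some fallback of this kind, a pure name lookup returns nothing for every TAO that used to sit on a pseudo-arm, which is exactly the cohort the reset creates.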
Explored solution direction
Option A (rely on v3 prompt as-is): v3 already fixes pub 190844 without any prompt change. The escalation/expansion guardrails implicitly steer the model away from pseudo-arms. Simplest path.
Option B (add explicit “no population cuts as arms” rule to the prompt): Belt-and-suspenders. Risk: drafting the rule precisely enough to exclude ethnicity pseudo-arms but include legitimate biomarker-defined cohorts is non-trivial. Skipped unless v3 regressions appear.
Backfill strategy: tight signal via SQL (suspicious arm name OR comma in rp2d) + targeted reset, rather than re-extracting all 24k pubs. Saves ~95% of the work.
Solution applied
Forward fix: already live. InterventionExtraction::PROMPT_VERSION = 3 deployed via Issue 52 Phase 2 (2026-04-13). Verified on pub 190844 locally: v3 produces 4 clean treatment arms with no ethnicity pseudo-arms.
Backfill task added (2026-04-22): lib/tasks/one_off/reset_publications.thor now has a reset_pseudo_arm_pubs method. Scope:
- `intervention_extraction_version` < 3 OR null
- AND either (a) at least one `trial_arm.name` matches the ethnicity / demographic / response-category / overall-subgroup / analysis-set regex, OR (b) at least one `trial_arm_intervention.rp2d` contains a comma or " and "
Deletes trial_arms + trial_arm_interventions, unlinks trial_arm_outcomes (sets trial_arm_id=NULL, preserves values), and strips intervention + subgroup + classify outputs from llm_data. Local dry-run matched 1,132 pubs (5,101 arms, 6,362 TAIs, 8,586 TAOs unlinked) — within 1 of the prod estimate of 1,133.
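The three reset steps (delete, unlink, strip) can be sketched against in-memory stand-ins. This is not the body of reset_pseudo_arm_pubs: the Store struct, the row hashes, and two of the three llm_data key names are assumptions made so the example runs on its own.

```ruby
# In-memory stand-ins for the tables touched by the reset.
Store = Struct.new(:trial_arms, :trial_arm_interventions, :trial_arm_outcomes, :llm_data)

# Assumed llm_data keys. Only intervention_arms is named in this tracker;
# the other two key names are placeholders.
STRIPPED_KEYS = %i[intervention_arms subgroups classification].freeze

def reset_pub!(store, pub_id)
  arm_ids = store.trial_arms.select { |arm| arm[:pub_id] == pub_id }.map { |arm| arm[:id] }

  # Delete the publication's TAIs and arms.
  store.trial_arm_interventions.reject! { |tai| arm_ids.include?(tai[:trial_arm_id]) }
  store.trial_arms.reject! { |arm| arm_ids.include?(arm[:id]) }

  # Unlink outcomes but preserve their values.
  unlinked = store.trial_arm_outcomes.select { |tao| arm_ids.include?(tao[:trial_arm_id]) }
  unlinked.each { |tao| tao[:trial_arm_id] = nil }

  # Strip extraction outputs so the pipeline regenerates them.
  kept = store.llm_data.fetch(pub_id, {}).reject { |key, _| STRIPPED_KEYS.include?(key) }
  store.llm_data[pub_id] = kept

  { arms_deleted: arm_ids.size, taos_unlinked: unlinked.size }
end

# Pub id 7 is an unrelated placeholder publication that must be untouched.
store = Store.new(
  [{ id: 10, pub_id: 190_844, name: "OVR" }, { id: 11, pub_id: 7, name: "Arm A" }],
  [{ id: 1, trial_arm_id: 10, rp2d: "1.6, 2.4, and 3.0 mg/kg Q3W" }],
  [{ id: 100, trial_arm_id: 10, measure: "ORR", value: 15.6 }],
  { 190_844 => { intervention_arms: [], subgroups: [], title: "kept" } }
)
summary = reset_pub!(store, 190_844)
puts summary.inspect
```

Keeping the outcome rows and only nulling trial_arm_id is what lets the later classify_publications run re-attach the same values instead of re-deriving them.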
Pending:
- Add `MIN_REPROCESS_VERSION = 3` to `InterventionExtraction` and gate `base_scope` on it, so stale pubs can also be picked up via the normal pipeline re-run rather than needing an explicit reset.
- Production run of `reset_pseudo_arm_pubs` (stage with `--limit 50` first for spot-check).
- Re-run `extract_interventions → extract_subgroups → classify_publications` on the reset cohort.
- Verify a random sample post-backfill: confirm pseudo-arms are gone and efficacy is correctly attached to real arms.
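The first pending item, gating on a minimum version, could look roughly like the sketch below. It is written as plain Ruby over hashes; the module name, the needs_reprocessing? helper, and the array-filtering base_scope are stand-ins, since the real InterventionExtraction scope is not shown in this tracker.

```ruby
# Stand-in module; the real class is InterventionExtraction.
module InterventionExtractionSketch
  PROMPT_VERSION = 3
  MIN_REPROCESS_VERSION = 3

  # A pub is stale when it was never versioned or sits below the minimum.
  def self.needs_reprocessing?(version)
    version.nil? || version < MIN_REPROCESS_VERSION
  end

  # Plain-array stand-in for base_scope: keep only pubs that need a re-run.
  def self.base_scope(pubs)
    pubs.select { |pub| needs_reprocessing?(pub[:intervention_extraction_version]) }
  end
end

pubs = [
  { id: 1, intervention_extraction_version: nil },
  { id: 2, intervention_extraction_version: 2 },
  { id: 3, intervention_extraction_version: 3 }
]
picked = InterventionExtractionSketch.base_scope(pubs).map { |pub| pub[:id] }
puts picked.inspect
```

Keeping MIN_REPROCESS_VERSION separate from PROMPT_VERSION means a future prompt bump does not automatically force a re-run of every pub; the minimum is raised only when older output is known to be wrong.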
Status: Fix ready, backfill pending.