# Publication Issues Tracker

Temporary working document for tracking publication-processing issues identified during investigation.
The main motivation for this doc is sheet 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8, in which the client has collected clinical data for different disease areas and drugs. The purpose of this document is to identify gaps in the publications database that prevent us from correctly reconstructing that sheet in the future using structured data only (from the bioloupe data lake database).
Last updated: 2026-04-03 (Issue 49: backfill plan, prompt versioning, investigational tagger removal, Issues 42/44 prompt fixes)
## Issue index

| # | Title | Short description | Status |
|---|---|---|---|
| 8 | Zero-sentinel contamination (residual) | LLM outputs 0 instead of null for unstated efficacy values (N, ORR). Full-corpus: 55k N=0 arm outcomes (16k auto-sentinels), 20k measure_value=0. Root cause: schema lacked nullable: true | Complete — forward fix + backfill applied 2026-03-29. Guard regression fixed 2026-03-30 (Issue 43) |
| 26 | Parent population N propagated to child subgroups (residual) | classify_publications copies the parent subgroup’s number_of_participants to child subgroups instead of extracting the subset-specific N. Original fix addressed bulk but residuals remain (e.g. pub 200353 MR subgroups) | Incomplete — residuals |
| 17 | ASCO abstract + presentation copies create duplicate publication rows | ASCO ingestion saves AbstractContentItem and PresentationContentItem separately by source_id, so the same DOI can appear twice in the report | Investigation complete |
| 18 | PubMed-indexed journal article missing from publication corpus | The sqNSCLC worksheet row for Cofetuzumab now points to 10.1016/j.lungcan.2025.108492, but that article is absent from publications, so the row is still missing despite a valid journal source | Implementation complete — 2025 PubMed backfill pending |
| 27 | extract_efficacy_metrics picks confirmed ORR as plain ORR | When both confirmed and unconfirmed ORR rows exist with the same N, max_by(number_of_participants) picks the confirmed row for the plain ORR metric — making ORR and cORR identical and the ORR value wrong | Complete — applied 2026-03-26 |
| 28 | build_result_rows collapses dose-level arms when study_plan_arm_id is null | Grouping key uses study_plan_arm_id which is null for publication-extracted arms — distinct dose cohorts (e.g. “8.0 mg/kg” vs “10.0 mg/kg”) sharing the same subgroup collapse into one row, silently dropping the lower-N arm | Complete — applied 2026-03-29 |
| 29 | Dose extraction captures study-level range, not efficacy population range | In dose-escalation studies, LLM extracts the full dose range (e.g. 1.0–8.3 mg/kg) even when efficacy is reported only for a subset (e.g. ≥4.0 mg/kg) — dose_min on the efficacy row is too low | Complete — forward fix + backfill applied in prod |
| 30 | Cross-study data contamination from abstract background sections | LLM extracts efficacy values from a referenced prior study cited in the abstract’s background, attributing them to the current publication which has no efficacy data yet | Complete — full pipeline (triage, validate, remediate, retriage, prune, reset_stale) applied 2026-03-30 |
| 31 | Investigational drug dose data bleeds onto control/comparator arms | pub_dose_lookup COALESCE fallback propagates investigational drug dose fields to control arms when publication_interventions.study_plan_arm_id is NULL — 2,890 rows across 566 publications | Complete — applied 2026-03-29 |
| 32 | TTP (time to progression) misclassified as PFS | LLM extraction maps TTP values to PFS endpoint — 241 publications mention TTP in abstract but have PFS extracted without TTP; additionally SD-subpopulation TTP values get attributed to full cohort. Query-layer TTP→PFS fallback also remapped correctly-extracted TTP back to PFS. | Complete — extraction fix 2026-03-28, query fix 2026-03-30 |
| 33 | Cross-tabulated subgroups not identified in basket trials | extract_subgroups identifies single-dimension subgroups (tumor type OR biomarker) but not the cross-product (tumor type × biomarker) when tabular data is present — 262 confirmed pubs (from 6,081 candidates → 934 pass 1 → 262 pass 2) | Complete — applied in prod 2026-03-28 |
| 34 | "Immature" endpoints extracted as "Not Reached" | LLM maps "not yet mature" / "data immature" to "Not Reached" — but immature means no median can be estimated (should be null), while "Not Reached" means median exceeds follow-up. ~71 pubs have immature language without "not reached" but have "Not Reached" extracted | Investigation complete |
| 35 | Dose extraction confuses PK thresholds, imaging agent doses, and missing dose_max | LLM extracts PK observation thresholds or imaging tracer doses instead of therapeutic drug doses; also omits dose_max when abstract states a range with “≥X” pattern | Complete — forward fix + view v21 (rp2d gate) applied 2026-03-31; backfill validated (job 1694), 42 remediation pending |
| 36 | cORR set equal to ORR when abstract distinguishes confirmed vs unconfirmed | LLM extraction sets cORR = ORR instead of counting only confirmed responses. Reverse of Issue 27 — here the ORR value is copied to cORR rather than cORR leaking into ORR | Investigation complete |
| 37 | Mean survival values extracted as median | LLM extracts mean OS/PFS values without distinguishing them from median — the pipeline has no field to flag the statistic type, so mean values are silently presented as median | Investigation complete |
| 38 | Biomarker subgroups in secondary analyses not identified by extract_subgroups | extract_subgroups misses biomarker-defined subgroups (e.g. p16+ oropharyngeal) when they appear as secondary efficacy analyses rather than pre-specified study arms | Complete — backfill applied 2026-03-30 (1,718/1,730 reprocessed). Partial screen complete 2026-03-31 (16,709 screened, 1,483 flagged). Prompt fixes validated; remediation pending deployment. |
| 39 | Multi-drug randomized trial dose cross-contamination | In randomized trials with multiple investigational drugs, LLM assigns one drug’s dose to all arms instead of arm-specific doses | Investigation complete |
| 40 | Hierarchical subgroup rows in view lose N from flat counterparts | Mostly false positive. 3 of 4 audit examples (pubs 134450, 67379, 200353) have null N because the abstract genuinely doesn’t state per-subgroup N — correct extraction. Only pub 48926 is a real bug: flat IHC3+ has N=40 but hierarchical copy has N=null. Real scope: 182 TAOs across 59 pubs where flat counterpart has N but hierarchical copy doesn’t. | Downscoped — mostly not a bug. Post-process propagation fix deferred (low impact: 182 records). |
| 41 | Safety data cross-contamination between dose arms | Safety N and discontinuation rates from one dose arm attributed to another dose arm in the same publication. Related to Issue 31 but in safety domain — extraction/query layer, not view COALESCE. | Complete |
| 42 | Tumor shrinkage rate confused with RECIST ORR | LLM extracts “any tumor reduction” percentage as ORR instead of RECIST-defined objective response rate. e.g. pub 162304: 35% had any shrinkage but true ORR was ~1.5% (1/66 PR). | Forward fix applied 2026-04-03. Included in Issue 49 re-extraction (PROMPT_VERSION=1). |
| 43 | Cross-tabulated subgroups only extracted for highest-response HER2 level | Issue 33 backfill re-extracted cross-tabs but LLM only creates disease × biomarker cross-products for the most prominent level (e.g. IHC3+), skipping IHC2+, IHC1+, mutation/amp where responses are low/zero. Residual gap in Issue 33. | Forward fix applied 2026-03-30. Backfill: screened 5,348 → rescreened → 234 confirmed → remediated 2026-03-31. Pipeline re-run pending. |
| 44 | PFS/OS event count extracted as number_of_participants | In survival tables reporting “median (95% CI) events n/N”, LLM extracts the event numerator as N instead of the denominator. e.g. “5.3 (4.5, 5.9) 23/31” → N=23 (events) instead of N=31 (patients). Scale TBD. | Forward fix applied 2026-04-03. Included in Issue 49 re-extraction (PROMPT_VERSION=1). |
| 45 | Qualifying-subset denominator used as subgroup N instead of subset count | When abstract reports “X/Y pts had [condition]”, LLM uses Y (tested/assessed) as subgroup N instead of X (qualifying subset). Applies to biomarker, analysis population, prior-therapy, and condition-present subgroups. ~17% of target-disease pubs affected. | Forward fix applied + screen → remediate → re-extract backfill ready. Production screening pending. |
| 46 | Incomplete endpoint extraction across sibling dose arms | LLM extracts an endpoint (e.g. DoR) for one dose arm but skips the same endpoint for a sibling arm in the same table. Possibly biased toward higher-response or first-listed arm. Combined with Issue 45 screening. | Forward fix applied + combined with Issue 45 backfill. Production screening pending. |
| 49 | Arm name mismatch between extract_interventions and classify_publications | Two independent LLM steps name the same arm differently (e.g. “Control group” vs “Control”), preventing trial_arm_outcomes from linking to trial_arms by name. ~18% of arm outcomes unlinked after backfill. | Forward fix applied 2026-04-02. Backfill plan ready: 3,943 target-disease pubs, full pipeline re-run (~$178). Reset task + prompt versioning + investigational tagger removal. Tested on 10 pubs — 100% linking. |
| 50 | DrugLinker false-matches non-drug interventions to drugs | SimpleCandidateMatchingService (LLM-based last resort) matches non-pharmacological interventions (e.g. “Classical music” → Orca-T) to drugs. ~3,093 false matches on procedure/device/other intervention types. | Forward fix applied 2026-04-04 (DrugMatchingService + caching). Backfill cleanup pending production run. |
| 51 | Per-arm dose not populated on backfilled trial_arm_interventions | Backfill copied study-level dose from publication_interventions to trial_arm_interventions. Multi-dose-arm pubs have the same range on every arm instead of arm-specific dose. ~23.5k pubs need extract_dose_evidence re-run. | Fix validated — version bump + prompt refinement tested on 9 pubs. Production extract_dose_evidence run pending (~$103 est.). |
Each issue entry should keep analysis and remediation separate.
Recommended issue structure:
- Short summary
- Where this sits in the current pipeline
- Exact restriction causing the drop
- Concrete examples
- Downstream impact
- What the issue is not
- Scale
- Spot checks
- Open characterization questions
- Explored solution direction
- Solution applied
Solution applied should remain empty until an actual fix is agreed and implemented.
Backfill pattern: when an issue requires backfilling historical data, see the "One-Off Backfill Tasks" section in `.claude/skills/backend-expert/SKILL.md`.
## 8. Zero-sentinel contamination (residual)

### Short summary

The original Issue 8 fix addressed `max_prior_lines` zero-sentinel contamination (the LLM outputting 0 instead of null for unstated values). That fix is complete — no min > max contradictions remain. However, the same zero-sentinel pattern persists in efficacy fields: `patient_number_efficacy`, `measure_value` (ORR), and `patient_count`. When a publication abstract doesn't state a per-arm N or per-subgroup ORR, the LLM extracts 0 instead of leaving the value null.
### Concrete examples

- Pub 241259 (Temab-A E-R analysis): per-arm N not stated for the 2.0 and 2.4 mg/kg dose arms (63 total across arms), but N=0 extracted for each arm
- Pub 29699 (ABBV-400 E-R analysis): ORR=0% extracted for all arms, but abstract only reports exposure-response correlations (p<0.05) — no numeric ORR values stated
- Pub 134450 (MRG003 phase 1b): N=0 for CRC and SCCHN disease subgroups despite ORR and DCR being reported (ORR=0%, DCR=25% for CRC; ORR=40%, DCR=100% for SCCHN)
- Pub 67379 (ROME trial): N=0 for hTMB/MSS subgroup, yet PFS=3.6 months with HR=0.65 and p=0.01 are extracted
Full-corpus scan (2026-03-29): 55,499 N=0 arm outcomes across 10,461 publications; 19,872 measure_value=0 across 9,307 publications. Of the N=0 set, 15,968 are definitive auto-sentinels (have non-zero sibling measure_values); the remaining ~39k are ambiguous (N not stated, no sibling data to confirm). Originally identified as 14 residual instances in HNSCC+ADC and CRC+ADC audits — the actual scope is corpus-wide.
### Explored solution direction

Update the `classify_publications` prompt: "When the abstract does not state a specific numeric value for a field (e.g., number of patients in a subgroup, ORR for an arm), leave the field null. Never output 0 as a placeholder for unstated values — 0 and null have different clinical meanings."
### Solution applied

Forward fix (2026-03-29), in three parts:

- `details.rb`: added `nullable: true` to `number_of_participants` (line 43) and `measure_value` (line 44) in the Arm schema. This allows the JSON schema to accept null, which is the primary signal the LLM uses to decide valid outputs.
- `task.rb`: added a zero-vs-null prompt instruction after the child-subgroup N section (lines 151-156): "Use null (not 0) when no numeric value is stated. 0 and null have different clinical meanings."
- `post_process.rb`: added two defensive guards:
  - N=0 → nil for all arm outcomes (zero patients with reported efficacy is always a sentinel)
  - measure_value=0 → nil when ALL arms for a percentage endpoint have value 0 (LLM fabricated zeros for unreported endpoints)
Backfill (2026-03-29): `lib/tasks/one_off/backfill_zero_sentinel_efficacy.thor` — a three-phase Thor task (identify → validate → remediate). N=0 candidates with non-zero sibling measure_values are auto-classified as sentinels without the LLM. Remaining candidates (ambiguous N=0, and measure_value=0 for percentage endpoints) are validated via GPT-5-mini against the abstract text. The audit trail is stored in `trial_subgroup.llm_data['zero_sentinel_checks']` and `['zero_sentinel_patches']`. All three phases completed 2026-03-29 (jobs 1661-1664).
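The phase-1 auto-classification rule can be expressed as a small standalone predicate. This is a sketch with hypothetical names, not the actual Thor task code: an N=0 arm outcome is treated as a definitive sentinel only when a sibling outcome carries a non-zero measure value.

```ruby
# Hypothetical standalone form of the phase-1 rule — not the actual
# Thor task code. Zero patients cannot produce efficacy data, so an
# N=0 candidate with a non-zero sibling measure_value is a sentinel.
def auto_sentinel?(candidate, siblings)
  candidate[:number_of_participants].to_i.zero? &&
    siblings.any? { |s| s[:measure_value].to_f.positive? }
end

siblings = [{ measure_value: 25.0 }, { measure_value: 40.0 }]
p auto_sentinel?({ number_of_participants: 0 }, siblings)               # => true
p auto_sentinel?({ number_of_participants: 0 }, [{ measure_value: 0 }]) # => false
```

Candidates failing this predicate stay ambiguous and fall through to the GPT-5-mini validation phase.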
Known regression (fixed 2026-03-30): the `post_process.rb` guard that nulls measure_value=0 when all arms have 0 for a percentage endpoint was too aggressive — it killed real 0% ORR values (e.g. pub 31990 IHC2+/ISH- and IHC1+ cohorts genuinely had 0% ORR). Fixed in Issue 43: the guard now nulls only when all arms also have nil/zero N (the fabrication signal). A real 0% with stated N > 0 is preserved.
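The corrected guard semantics can be sketched standalone (hypothetical names; this is not the shipped `post_process.rb` code): an all-zero percentage endpoint is nulled only when no arm has a stated N > 0.

```ruby
# Sketch of the Issue 43 guard semantics, not the production code.
# An all-zero percentage endpoint is treated as fabricated only when
# no arm has a stated N > 0; a real 0% with N > 0 is preserved.
def null_fabricated_zeros(arm_outcomes)
  all_zero  = arm_outcomes.all? { |a| a[:measure_value].to_f.zero? }
  no_real_n = arm_outcomes.all? { |a| a[:number_of_participants].to_i <= 0 }
  return arm_outcomes unless all_zero && no_real_n

  arm_outcomes.map { |a| a.merge(measure_value: nil) }
end

# Fabricated: every arm is 0% with no stated N -> values nulled
fabricated = [
  { measure_value: 0, number_of_participants: nil },
  { measure_value: 0, number_of_participants: 0 }
]
# Real: 0% with a stated denominator (the pub 31990 pattern) -> preserved
real = [
  { measure_value: 0, number_of_participants: 40 },
  { measure_value: 0, number_of_participants: 21 }
]

p null_fabricated_zeros(fabricated).map { |a| a[:measure_value] } # => [nil, nil]
p null_fabricated_zeros(real).map { |a| a[:measure_value] }       # => [0, 0]
```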
## 26. Parent population N propagated to child subgroups (residual)

### Short summary

The original Issue 26 fix addressed the bulk of cases where `classify_publications` copies the parent subgroup's `number_of_participants` to child subgroups. However, residual instances remain where the parent population N is applied to child subgroups that represent a strict subset.
### Concrete examples

- Pub 200353 (T-DXd biomarker analysis, DESTINY-CRC02): 97 paired BL/C3D1 ctDNA samples total. Both the "Complete MR at C3D1" and "Absent MR at C3D1" child subgroups have N=97, yet each is a subset of the 97 paired samples. The abstract references a table (not inline text) with the split, but the LLM defaulted to the parent N.
Two residual instances in the job 1635 audit — far lower frequency than the original Issue 26 (~5,058 subgroups across 1,174 pubs), suggesting the fix addressed the majority but edge cases remain, particularly when the child subgroup N is only available in a referenced table rather than inline text.
### Explored solution direction

The original prompt fix instructed the LLM to extract the subset-specific N. Residuals likely need a reinforcement: "When a child subgroup represents a subset of the parent (e.g., 'Complete MR' vs 'Absent MR' within a paired sample set), the child's N must be less than the parent's N. If the specific N is not stated, leave it null rather than copying the parent's N."
### Solution applied

(empty — pending implementation)
## 17. ASCO abstract and presentation copies create duplicate publication rows

### Short summary

After broadening ASCO ingestion to include both `AbstractContentItem` and `PresentationContentItem`, the same scientific abstract can now be stored twice under different ASCO uids. `EmergingClinicalDataQuery` groups by `publication_id`, not DOI/title, so both copies surface as separate rows.
This showed up repeatedly during the sqNSCLC pass and makes the local output look larger and noisier than the sheet.
### Where this sits in the current pipeline

`app/services/publications/asco_api_service.rb`:

- `fetch_abstract_hits` requests `contentTypes: ['Abstract', 'Presentation']`
- `save_publication` persists records using `Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])`
`app/queries/tpp/emerging_clinical_data_query.rb`:

- `build_result_rows` groups by `publication_id`, `disease_id`, `effective_line`, and `study_plan_arm_id`
There is no DOI-level or title-level deduplication step between ingestion and reporting.
### Exact restriction causing the duplication

The ASCO fix for Issue 2 intentionally broadened the search and detail query to include `PresentationContentItem`. That solved the "missing presentation" problem, but persistence still keys uniqueness on `source_id`:

```ruby
publication = Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])
```

So if ASCO exposes both:

- `ABSTRACT492030`
- `PRESENTATION251481`

with the same DOI and the same text, the two are treated as distinct publications locally.
### Concrete examples from sqNSCLC validation

#### Example 1: PF-08046054

Same DOI: `10.1200/JCO.2025.43.16_suppl.8611`

Stored twice:

- publication `48035` — source_id `ABSTRACT492030`
- publication `238708` — source_id `PRESENTATION251481`

Both produce the same sqNSCLC row (ORR = 33.3%, N = 6).
#### Example 2: IBI363

Same DOI: `10.1200/JCO.2025.43.16_suppl.8509`

Stored twice:

- publication `139344` — source_id `ABSTRACT500470`
- publication `237445` — source_id `PRESENTATION246467`

Both produce the same main sqNSCLC 3 mg/kg Q3W row.
#### Example 3: Additional duplicate DOI pairs in the same sqNSCLC slice

- Datopotamab deruxtecan: `10.1200/JCO.2025.43.16_suppl.8501`
- Sacituzumab govitecan: `10.1200/JCO.2025.43.16_suppl.8599`
### Downstream impact

- one worksheet row can correspond to two local rows
- counts for “how many publication-backed rows do we have?” are overstated
- manual comparison against the sheet becomes noisy
- any future ranking or aggregation that does not dedupe by DOI/title risks double-counting conference data
### What the issue is not

This is not a disease-mapping issue and not a subgroup-extraction issue.
The data itself is usually valid in both copies. The problem is that they are the same scientific result represented twice because ASCO exposes two content-item types.
This is also not an argument to undo Issue 2 entirely. We needed `PresentationContentItem` support to recover records like SHR-A2102. The gap is specifically the lack of a deduplication strategy after broadening the source.
In the sqNSCLC ADC/fusion slice alone, there are 4 duplicate DOI pairs:
- PF-08046054
- IBI363
- Datopotamab deruxtecan
- Sacituzumab govitecan
So the effect is already material in a small disease/technology slice.
### Explored solution direction

Two reasonable options:
1. Query/report deduplication
Keep both source records in `publications`, but dedupe in `EmergingClinicalDataQuery` or the TPP report by a stable key such as:
- DOI + disease + subgroup/arm
- or DOI + publication title
This is lower risk for ingestion history.
2. Ingestion-time merge
When saving ASCO records, detect that an incoming presentation and an existing abstract share the same DOI/title/NCT tuple and merge them into one canonical Publication.
This is cleaner downstream but riskier because it changes persistence semantics for already-ingested ASCO records.
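Option 1 can be sketched as a pure-Ruby dedupe step, assuming rows carry `doi` and `source_id` fields (illustrative names, not the actual query schema); the abstract copy is preferred when both content types are present.

```ruby
# Sketch of option 1 (query/report deduplication): keep one row per
# DOI, preferring the abstract copy over the presentation copy.
# Field names are illustrative, not the real query schema.
def dedupe_by_doi(rows)
  rows.group_by { |r| r[:doi] }.map do |_doi, copies|
    copies.min_by { |r| r[:source_id].start_with?('ABSTRACT') ? 0 : 1 }
  end
end

rows = [
  { doi: '10.1200/JCO.2025.43.16_suppl.8611', source_id: 'ABSTRACT492030' },
  { doi: '10.1200/JCO.2025.43.16_suppl.8611', source_id: 'PRESENTATION251481' },
  { doi: '10.1200/JCO.2025.43.16_suppl.8509', source_id: 'PRESENTATION246467' }
]

p dedupe_by_doi(rows).map { |r| r[:source_id] }
# => ["ABSTRACT492030", "PRESENTATION246467"]
```

In the real query the key would likely need to include disease and subgroup/arm as well, per the options above; a presentation without an abstract counterpart survives untouched.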
## 18. PubMed-indexed journal article missing from publication corpus

### Short summary

The current sqNSCLC worksheet row for Cofetuzumab pelidotin points to the 2025 journal article:

- DOI: `10.1016/j.lungcan.2025.108492`
- PMID: `40086026`
That article exists on PubMed and contains the sqNSCLC result the sheet uses, but there is no corresponding `Publication` row in the local database. As a result, the row is completely absent from `EmergingClinicalDataQuery`.
### Where this sits in the current pipeline

This drop happens before `EmergingClinicalDataQuery`.

During validation:

- `Publication.where(doi: '10.1016/j.lungcan.2025.108492')` returned no rows
- `Publication.where(source_id: '40086026')` returned no rows
So the publication never entered the local corpus, or it was dropped before persistence.
### Exact restriction causing the drop

Root cause isolated.

There are two distinct PubMed ingestion limitations affecting this paper:

- the disease-specific path depends on PubMed exposing a `ClinicalTrials.gov/NCT...` databank entry, and this record does not appear to expose that linking metadata even though PubMed marks it as a clinical trial
- the broad PubMed path in `Publications::PubmedApiService` built one giant combined query for the oncology MeSH clause plus the recovery clause; that combined search term excluded qualifying records that PubMed returned when the intended criteria were tested separately
What was verified live for PMID 40086026:

- PubMed resolves DOI `10.1016/j.lungcan.2025.108492` to PMID `40086026`
- the record has `Clinical Trial, Phase I`
- the record has oncology MeSH including `Carcinoma, Non-Small-Cell Lung` and `Lung Neoplasms`
- `40086026[uid] AND mesh AND clinical-trial publication types AND 2025 date` returned 1
- `40086026[uid] AND full previous combined search term` returned 0

So the missing publication was not due to missing PubMed record metadata for the broad query. It was due to our query construction.
### Concrete example

#### Worksheet row: Cofetuzumab pelidotin in sqNSCLC

Worksheet entry:

- Drug: Cofetuzumab pelidotin
- Publication: Lung Cancer (journal), 2025
- Link: https://doi.org/10.1016/j.lungcan.2025.108492
- ORR = 12.5%
- cORR = 12.5%
- mPFS = 5.3
- mDoR = 2.2
Local database state:

- no `Publication` row for DOI `10.1016/j.lungcan.2025.108492`
- no `Publication` row for PMID `40086026`
- only older cofetuzumab records exist:
  - publication `150086` — ASCO 2021
  - publication `71934` — ESMO 2023
  - publication `101600` — Clinical Cancer Research 2021

External confirmation:

- PubMed lists the paper as "A phase 1b study of cofetuzumab pelidotin monotherapy in patients with PTK7-expressing recurrent non-small cell lung cancer" with PMID `40086026`
### Downstream impact

- the sqNSCLC worksheet still has one fully missing non-investor row even after the backfills and corrections
- the earlier tracker note that the cofetuzumab sqNSCLC value was poster-only is now stale for the current worksheet version
- the publication will remain absent until a non-`--disease-specific` 2025 PubMed run is executed against the fixed query logic
- `--disease-specific` alone is still insufficient for this class of paper because PubMed does not appear to expose the `ClinicalTrials.gov` linking metadata we rely on
### What the issue is not

This does not contradict the earlier ESMO 2023 analysis in Issue 11.
That earlier note was about publication 71934, where the squamous-specific value was not in the 2023 abstract text. The current worksheet has since moved to a later 2025 journal article. That newer source should be representable if it is ingested.
Currently one confirmed sqNSCLC worksheet row for the original worksheet discrepancy.
For 2025-01-01 through 2025-12-31, after fixing the PubMed query construction:

- the broad oncology/malignant-heme PubMed query returns 6,013 PMIDs
- 3,831 of those are not already in local `publications`
- compared with the old `Clinical Trial[pt]` path, there are 435 additional PMIDs
- 431 of those additional PMIDs are not already in local `publications`
So this is not just one missing-paper edge case. The broken combined query was suppressing a non-trivial number of 2025 PubMed records.
### Spot checks

- `Publication.where(doi: '10.1016/j.lungcan.2025.108492')` returned no rows before the fix
- `Publication.where(source_id: '40086026')` returned no rows before the fix
- after the `PubmedApiService` query change, `fetch_uids_by_date('2025/01/01', '2025/12/31', nct_ids: [])` includes PMID `40086026`
- live verification after the fix returned: `includes_pmid_40086026 = true`, `total = 6013`
### Open characterization questions

- After the 2025 backfill, how many of the 431 incremental publications are truly result publications versus broader cancer-clinical-trial noise?
- Do we want to keep the broad non-`--disease-specific` PubMed run as a regular sync, or use it only as a periodic coverage backfill?
### Explored solution direction

Characterize the missing publication upstream of the query, then narrow the fix to the actual failure point:

- Trace the PubMed/journal ingestion path for DOI `10.1016/j.lungcan.2025.108492` / PMID `40086026`
- Compare direct PubMed criteria matches against the full generated search term
- Split the broad PubMed search into separate query terms and union PMIDs in Ruby instead of relying on one giant combined PubMed query
### Solution applied

- updated `Publications::PubmedApiService` so the broad PubMed path now runs separate search terms for:
  - oncology/malignant-heme MeSH + clinical-trial publication types
  - oncology/malignant-heme MeSH + recovery result terms for the recent recovery window
- changed PubMed UID fetching to execute each term separately and union the PMIDs in Ruby
- aligned total-count logic with the split-query approach
- verified live that the fixed 2025 query now includes PMID `40086026`
- syntax check passed: `ruby -c app/services/publications/pubmed_api_service.rb`
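The split-and-union shape of the fix can be sketched standalone. Here `fetch_pmids` is a stand-in returning canned IDs, not the real esearch call, and the two terms abbreviate the actual MeSH/publication-type clauses.

```ruby
# Sketch of the split-query approach: run each search term separately
# and union the PMIDs in Ruby, instead of one giant combined PubMed
# query. fetch_pmids stands in for the real esearch request; the IDs
# other than 40086026 are canned illustration values.
def fetch_pmids(term)
  {
    'mesh AND clinical-trial' => %w[40086026 40090001],
    'mesh AND recovery-terms' => %w[40090001 40099999]
  }.fetch(term, [])
end

terms = ['mesh AND clinical-trial', 'mesh AND recovery-terms']
pmids = terms.flat_map { |t| fetch_pmids(t) }.uniq

p pmids # => ["40086026", "40090001", "40099999"]
```

The key property is that a record qualifying under either term survives, whereas the old combined term could exclude records that matched each clause when tested separately.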
## 27. extract_efficacy_metrics picks confirmed ORR as plain ORR

### Short summary

When both confirmed (`confirmed=true`) and unconfirmed (`confirmed=false`) ORR rows exist for the same subgroup in the view, `ClinicalEvidenceQuery#extract_efficacy_metrics` can pick the confirmed row as the plain ORR metric value. This happens because the ORR extraction loop does not exclude confirmed rows, and when both rows have the same `number_of_participants`, `max_by` returns whichever comes first — often the confirmed row.
### Where this sits in the current pipeline

`ClinicalEvidenceQuery#extract_efficacy_metrics` — `app/queries/tpp/clinical_evidence_query.rb`, lines 590–628.
The cORR extraction (lines 658–675) correctly filters `confirmed == true` and is unaffected. The problem is exclusively in the general efficacy extraction loop that handles ORR alongside OS, PFS, DOR, etc.
### Exact restriction causing the drop

Lines 600–611:

```ruby
PRIMARY_EFFICACY_ABBREVIATIONS.each do |abbr|
  matching = grouped[abbr] || grouped[abbr.downcase]
  next if matching.nil? || matching.empty?

  matching = filter_by_valid_unit(matching, abbr)
  next if matching.empty?

  experimental = matching.select { |r| r['resolved_group_type'] == 'EXPERIMENTAL' }
  experimental = matching if experimental.empty?

  best_row = experimental.max_by { |r| r['number_of_participants'].to_i } || matching.first
```

When `abbr == 'ORR'`, `matching` includes ALL ORR rows regardless of the `confirmed` flag. If both confirmed=true (value 26.7%) and confirmed=false (value 43.3%) exist with the same N, `max_by` picks the first match. The result: `metrics[:orr]` gets the confirmed value, making it identical to `metrics[:corr]` and wrong as a standalone ORR.
### Concrete examples

Publication 117228 (RM-1929 photoimmunotherapy in rHNSCC):
Abstract states:
- “unconfirmed objective response rate (ORR) 43.3%”
- “confirmed ORR 26.7%”
The view correctly has both rows (subgroup "Heavily pretreated rHNSCC → Part 2"):

- `confirmed=true, measure_value=26.7, number_of_participants=30`
- `confirmed=false, measure_value=43.3, number_of_participants=30`

Report output: `efficacy.orr.value = 26.7` (should be 43.3)
The cORR extraction correctly returns 26.7%, but the ORR extraction ALSO returns 26.7% instead of 43.3%.
### Downstream impact

- Understated ORR: when confirmed ORR is lower than unconfirmed ORR (the typical pattern), the report shows the lower confirmed value as the headline ORR. For pub 117228, ORR is understated from 43.3% to 26.7%.
- Duplicate values: ORR and cORR columns show the same value, making the cORR column appear redundant and hiding the existence of a lower confirmed rate.
- Audit noise: the audit correctly flags these as `incorrect_value` on `efficacy.orr.value`, generating true-positive findings that overlap with Issue 25 audit findings.
477 publications currently have both confirmed=true and confirmed=false ORR rows (the correct Issue 25 extraction pattern). When both rows have the same N (which is common — confirmed and unconfirmed ORR are computed from the same denominator), the confirmed value gets picked as plain ORR.
```sql
-- Publications where confirmed and unconfirmed ORR have the same N
-- (susceptible to the wrong-pick bug)
SELECT count(DISTINCT ts.source_id)
FROM trial_subgroups ts
JOIN trial_outcome_measures tom_c ON tom_c.trial_subgroup_id = ts.id AND tom_c.confirmed = true
JOIN trial_outcome_measures tom_u ON tom_u.trial_subgroup_id = ts.id AND tom_u.confirmed = false
JOIN trial_endpoints te_c ON te_c.id = tom_c.trial_endpoint_id AND te_c.abbreviation = 'ORR'
JOIN trial_endpoints te_u ON te_u.id = tom_u.trial_endpoint_id AND te_u.abbreviation = 'ORR'
JOIN trial_arm_outcomes tao_c ON tao_c.trial_outcome_measure_id = tom_c.id
JOIN trial_arm_outcomes tao_u ON tao_u.trial_outcome_measure_id = tom_u.id
WHERE ts.source_type = 'Publication'
  AND tao_c.number_of_participants = tao_u.number_of_participants;
```

### Explored solution direction

Forward fix: in `extract_efficacy_metrics`, when processing ORR, exclude confirmed=true rows if confirmed=false rows also exist for the same subgroup. This ensures the plain ORR metric always uses the unconfirmed/total ORR:
```ruby
# Inside the PRIMARY_EFFICACY_ABBREVIATIONS.each loop, after filtering matching:
if abbr == 'ORR'
  unconfirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  matching = unconfirmed if unconfirmed.any?
end
```

This is a ~3 line change in `clinical_evidence_query.rb`. No backfill needed — fixing the query immediately fixes all report output.
No backfill required: This is a query-layer bug, not a data issue. The underlying data (trial_outcome_measures with correct confirmed flags) is correct. Fixing the Ruby code fixes all publications instantly.
Solution applied
Forward fix (2026-03-26): Added a 5-line guard in the extract_efficacy_metrics method of app/queries/tpp/clinical_evidence_query.rb (lines 610–613). When processing ORR, it rejects confirmed=true rows if non-confirmed rows exist. This ensures the plain ORR metric uses the unconfirmed/total ORR, while the cORR extraction (lines 667–683) independently picks confirmed=true rows.
```ruby
if abbr == 'ORR'
  non_confirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  matching = non_confirmed if non_confirmed.any?
end
```

Edge cases handled:
- Both confirmed + unconfirmed exist → ORR gets unconfirmed, cORR gets confirmed (correct)
- Only confirmed exists (no unconfirmed) → ORR falls back to confirmed value (safe fallback — same as cORR)
- Only unconfirmed/null exists → no change (correct)
No backfill needed — query-layer fix applies immediately to all report output
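The guard's behavior can be demonstrated standalone. This is a minimal sketch, not the application code: `pick_orr_rows` is a hypothetical wrapper, and the row hashes mirror the view rows, where `confirmed` may arrive as a boolean or as Postgres's `'t'` string.

```ruby
# Sketch of the Issue 27 guard: prefer unconfirmed ORR rows, falling back to
# confirmed rows only when no unconfirmed row exists for the subgroup.
def pick_orr_rows(matching)
  non_confirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  non_confirmed.any? ? non_confirmed : matching # safe fallback, same as cORR
end

rows = [
  { 'confirmed' => true,  'measure_value' => 40.0 }, # cORR
  { 'confirmed' => false, 'measure_value' => 52.0 }  # total ORR
]
pick_orr_rows(rows)                                              # only the unconfirmed row survives
pick_orr_rows([{ 'confirmed' => 't', 'measure_value' => 40.0 }]) # falls back to the confirmed row
```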
28. build_result_rows collapses dose-level arms when study_plan_arm_id is null
Short summary
ClinicalEvidenceQuery.build_result_rows groups view rows by [publication_id, disease_id, effective_line, study_plan_arm_id, subgroup_value]. When study_plan_arm_id is null — which it is for all publication-extracted arms that haven’t been matched to a clinical trial study plan arm — distinct dose-level arms (e.g. “8.0 mg/kg” and “10.0 mg/kg”) sharing the same subgroup_value collapse into a single group. extract_efficacy_metrics then picks one arm by max_by(number_of_participants), silently dropping the other.
Where this sits in the current pipeline
app/queries/tpp/clinical_evidence_query.rb, build_result_rows method (line 306).
Exact restriction causing the drop
The grouping key at line 306 is:
```ruby
grouped = enriched_data.group_by { |row|
  [row['publication_id'], row['disease_id'], row['effective_line'],
   row['study_plan_arm_id'], row['subgroup_value']]
}
```

When study_plan_arm_id is null for both dose arms (common for unlinked publications), they group together. extract_efficacy_metrics (line 619) then picks one via max_by(number_of_participants).
Concrete examples
Pub 190656 (ARTEMIS-001, HS-20093 B7-H3 ADC in NSCLC):
- View has 6 rows for “NSCLC → Squamous cell carcinoma” (3 endpoints × 2 dose arms: 8.0 mg/kg N=32 and 10.0 mg/kg N=26)
- Both arms have study_plan_arm_id = null
- Query collapses to 1 row, picks 8.0 mg/kg (N=32 > N=26)
- Lost data: Sq 10.0 mg/kg cORR 26.9%, PFS 5.7, DOR 7.0
Downstream impact
Dose-level subgroup data is silently dropped from the Clinical Evidence report. For dose-escalation studies where different dose levels have meaningfully different efficacy, only the higher-N cohort appears.
Affects dose-escalation/expansion publications where arms aren’t matched to trial study plan arms. The view correctly distinguishes arms by arm_name, but the query ignores arm_name in its grouping key.
Explored solution direction
Add arm_name to the grouping key in build_result_rows, or fall back to arm_name when study_plan_arm_id is null. This preserves dose-level arm distinctions without breaking publications where study_plan_arm_id correctly differentiates arms.
Related to Issue 20 (study_plan_arm link is fragile) — same root cause of over-reliance on study_plan_arm_id.
Solution applied
Forward fix (2026-03-29): Added arm_name fallback to the grouping key in app/queries/tpp/clinical_evidence_query.rb build_result_rows method (line 307). When study_plan_arm_id is null, uses arm_name as the differentiator so distinct dose-level arms (e.g. “8.0 mg/kg” vs “10.0 mg/kg”) are preserved as separate rows.
```ruby
grouped = enriched_data.group_by { |row|
  [row['publication_id'], row['disease_id'], row['effective_line'],
   row['study_plan_arm_id'] || row['arm_name'], row['subgroup_value']]
}
```

No backfill needed — query-layer fix applies immediately to all report output.
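The effect of the fallback can be sketched on hypothetical rows mirroring the Pub 190656 example — with `study_plan_arm_id` nil for both dose arms, the old key collapses them while the new key keeps them apart:

```ruby
# Two dose arms of the same subgroup, both unlinked (study_plan_arm_id nil).
rows = [
  { 'publication_id' => 190656, 'disease_id' => 1, 'effective_line' => nil,
    'study_plan_arm_id' => nil, 'arm_name' => '8.0 mg/kg',  'subgroup_value' => 'Squamous' },
  { 'publication_id' => 190656, 'disease_id' => 1, 'effective_line' => nil,
    'study_plan_arm_id' => nil, 'arm_name' => '10.0 mg/kg', 'subgroup_value' => 'Squamous' }
]

old_key = rows.group_by { |r| [r['publication_id'], r['disease_id'], r['effective_line'],
                               r['study_plan_arm_id'], r['subgroup_value']] }
new_key = rows.group_by { |r| [r['publication_id'], r['disease_id'], r['effective_line'],
                               r['study_plan_arm_id'] || r['arm_name'], r['subgroup_value']] }

old_key.size # => 1  (dose arms collapsed; max_by would drop one)
new_key.size # => 2  (dose arms preserved as separate rows)
```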
29. Dose extraction captures study-level range, not efficacy population range
Short summary
In dose-escalation studies, classify_publications extracts the full dose range stated in the abstract (e.g. dose_min=1.0, dose_max=8.3 mg/kg) as a property of the subgroup. But when the abstract restricts efficacy reporting to a dose subset (e.g. “results for patients who received ≥4.0 mg/kg”), the dose_min on the efficacy row is too low, creating a mismatch between the dose range and the efficacy population.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — dose fields extracted as subgroup-level properties.
Exact restriction causing the drop
Dose extraction treats dose as a study-level attribute (“what doses were used?”) rather than scoping to the efficacy analysis population (“what doses did the patients in the reported results actually receive?”). The LLM prompt doesn’t instruct it to scope dose to the efficacy population.
Concrete examples
Pub 238709 (MYTX-011 KisMET-01 updated):
- Abstract: “85 pts received 1.0–8.3 mg/kg; 59 pts received ≥4.0 mg/kg” — efficacy reported only for ≥4.0 mg/kg subset
- Extracted: dose_min=1.0, dose_max=8.3
- Expected: dose_min=4.0, dose_max=8.3 (matching the efficacy population)
- RP2D correctly extracted as “5.0 mg/kg Q3W (2-on 1-off) and 4.0 mg/kg Q3W”
Downstream impact
Report rows show a broader dose range than the actual efficacy population received. Minor impact on report accuracy but misleading for dose-response interpretation.
Affects phase I dose-escalation studies where efficacy is reported for a dose subset. Relatively uncommon pattern — most studies report efficacy at a single dose or clearly per-dose-level.
Explored solution direction
Update the classify_publications dose extraction prompt to instruct the LLM: “When the abstract reports efficacy for a specific dose subset, use that subset’s dose range, not the full escalation range.” Alternatively, accept this as a known limitation since RP2D (when present) correctly reflects the clinically relevant dose.
Solution applied
Forward fix (2026-03-28):
- task.rb: Added “DOSE SCOPING” instruction to the Subgroup Dose Context section — instructs the LLM to set dose_min/dose_max to match the efficacy population, not the full escalation range, when the abstract restricts efficacy reporting to a dose subset.
- task.rb: Added “DOSE RANGE COMPLETENESS” instruction — instructs the LLM to always fill both dose_min and dose_max for dose-defined subgroups (e.g. “≥X” subgroups now get dose_max set to the highest dose level in the abstract).
- dose_evidence_extraction.rb: Added clarifying comment that drug-level dose extraction intentionally captures the full escalation range (efficacy-population scoping is handled in subgroup extraction).
Backfill (2026-03-28) — lib/tasks/one_off/backfill_dose_scope_mismatch.thor:
Three-phase approach, no regex. Also covers issue 35 (PK thresholds, imaging doses, missing dose_max).
- Structural query (identify): Finds any publication with materialized efficacy data AND dose_min set on trial_subgroups. No phase or trial-link filter — dose_min presence is the structural signal. 720 candidates in prod.
- LLM validation (validate): Sends abstract + current dose_min/dose_max to GPT-5-mini per subgroup. Schema: efficacy_restricted_to_dose_subset (bool), needs_correction (bool — true only when correct values differ from current extraction), correct_dose_min, correct_dose_max, explanation. Stores result in ts.llm_data['dose_scope_check'] on each trial_subgroup.
- Remediation (remediate --no-dry-run): Directly patches dose_min/dose_max on trial_subgroups using the validated correct_dose_min/correct_dose_max. Also syncs llm_data['subgroup_outcome_measures']. Stores audit trail in ts.llm_data['dose_scope_patch'] with previous values and explanation. Note: dry_run defaults to true — must pass --no-dry-run to apply.
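The dry-run-by-default shape of the remediation phase can be sketched as follows. This is an illustrative simplification, not the actual thor task: `remediate` and the subgroup hashes are hypothetical stand-ins, but the flow (validation result read from `llm_data`, audit trail written alongside the patch, writes gated behind `dry_run: false`) mirrors the description above.

```ruby
# Sketch of a dry-run-by-default remediation pass over validated subgroups.
# Nothing is written unless dry_run is explicitly disabled (--no-dry-run).
def remediate(subgroups, dry_run: true)
  patched = []
  subgroups.each do |ts|
    check = ts[:llm_data]['dose_scope_check']
    next unless check && check['needs_correction']
    # Audit trail captures previous values so the patch is reversible.
    patch = { 'previous'    => ts.values_at(:dose_min, :dose_max),
              'explanation' => check['explanation'] }
    unless dry_run
      ts[:dose_min] = check['correct_dose_min']
      ts[:dose_max] = check['correct_dose_max']
      ts[:llm_data]['dose_scope_patch'] = patch
    end
    patched << ts
  end
  patched
end
```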
Remediation applied in prod (2026-03-28):
- 405 trial_subgroups patched across 299 publications
- 315 subgroups: dose_max filled in (e.g. “≥240 mg” went from 240/null → 240/960)
- 90 subgroups: dose_min/dose_max nulled out (non-dose values: PK thresholds, cycle counts, % weight loss, radiation parameters, etc.)
- Spot-checked 12 random patched records against abstracts: 12/12 correct
- Audit trail stored in ts.llm_data['dose_scope_patch'] with previous values for reversal if needed
Production run sequence:
```shell
# 1. Identify structural candidates (read-only, ~720 candidates)
thor one_off:backfill_dose_scope_mismatch:identify

# 2. LLM validation — writes to llm_data only (~$1-2 for 720 pubs with gpt-5-mini)
thor one_off:backfill_dose_scope_mismatch:validate --batched

# 3. Dry-run remediation — preview all patches
thor one_off:backfill_dose_scope_mismatch:remediate --dry-run

# 4. Live remediation — patches dose fields on trial_subgroups
thor one_off:backfill_dose_scope_mismatch:remediate --no-dry-run
```

30. Cross-study data contamination from abstract background sections
Short summary
When a publication abstract references efficacy results from a prior study as background context (e.g. “In our previous study NCT05029882, ORR was 24.4%”), classify_publications extracts those values as if they belong to the current study. This produces fabricated efficacy data for publications that may have no efficacy results of their own yet.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — efficacy extraction from abstract text.
Exact restriction causing the drop
The LLM extraction prompt does not distinguish between efficacy results reported as outcomes of the current study vs. results cited from external/prior studies as background context. The abstract structure (Background → Methods → Results → Conclusions) is not enforced.
Concrete examples
Pub 29705 (ABBV-400/Telisotuzumab adizutecan signal-seeking study, NCT06084481):
- Abstract background: “Initial results from the ongoing first-in-human study (NCT05029882) of ABBV-400… an overall response rate of 24.4%”
- Current study status: “As of 19 January 2024, 24 patients have been enrolled” — no efficacy data reported
- Extracted: ORR=24.4%, N=24 (enrollment count misinterpreted as efficacy N)
- Expected: No efficacy data (null)
The 24.4% ORR belongs to NCT05029882, not NCT06084481. The N=24 is enrollment, not an efficacy population.
Downstream impact
Publications appear in the Clinical Evidence report with fabricated efficacy data from unrelated studies. This is particularly misleading for signal-seeking or early-enrollment publications where the abstract previews prior results to motivate the new study.
Affects publications whose abstracts cite efficacy results from prior/companion studies. Common in: signal-seeking study designs, follow-up studies referencing parent trials, and publications describing study rationale with prior data.
Initial backfill (2026-03-28) validated 7,675 pubs via NCT mismatch + multiple registry ID signals, finding 1,495 that cite prior study efficacy. However, 46,962 pubs with efficacy data remain unvalidated — the structural signals missed cases where the prior study is cited by author/journal reference (e.g. [Cohen, Cancer Research 2023]) or is a different cohort of the same trial (same NCT). Example: pub 30362 cites petosemtamab monotherapy 2L/3L results from [Cohen, Cancer Research 2023] as background, but shares NCT03526835 with the current 1L combination study — no registry ID mismatch to detect.
Explored solution direction
- Audit prompt guard (deployed): Added “CROSS-STUDY REFERENCES” instruction to the audit prompt so future audits flag these correctly.
- Extraction prompt fix (forward): Update the classify_publications prompt to instruct: “Only extract efficacy values reported as results of THIS study (typically in the Results section). Do not extract values cited from prior/external studies in the Background or Introduction.”
- Detection query: Publications where llm_data has efficacy values but the abstract contains phrases like “previous study”, “prior study”, “first-in-human study (NCT…)” with efficacy values in the same sentence could be flagged for review.
- Backfill: Identify cross-contaminated publications and, depending on the count, reset them to go through the publication pipeline again. Ideally we would not rely on regex-based solutions for identifying cross-contaminated pubs.
Solution applied
Forward fix (2026-03-28):
- task.rb: Added section 6 “Cross-Study References” to SYSTEM_PROMPT — instructs the LLM to only extract efficacy from THIS study, reject values from prior/external studies, and use the provided trial NCT IDs as authoritative identifiers. Tested on pub 29705: correctly returns empty outcome_measures instead of the prior study’s ORR=24.4%.
- subgroup_extraction.rb: Added cross-study guard to Step 3 — ignore subgroups/endpoints from prior study citations.
Backfill detection (2026-03-28) — lib/tasks/one_off/backfill_cross_study_contamination.thor:
Two-phase detection, no regex:
- Structural query (identify): Finds publications linked to a trial whose abstract mentions different NCT IDs (2,306 candidates from NCT mismatch). Filters to NCT-prefixed IDs only to avoid false positives from alternate registry entries (EudraCT, CTRI, etc.).
- LLM validation (validate): Sends abstract + linked NCT IDs to GPT-5-mini asking whether the pub reports its own efficacy or only cites prior studies. Schema: has_own_efficacy_results (bool), cites_prior_study_efficacy (bool), explanation, prior_studies (array). The --all flag validates all unvalidated pubs with efficacy data (no structural pre-filter).
Tested on 50+ random structural candidates + 5 known edge cases. Zero false positives. Correctly distinguishes:
- Pubs with own results only (true negative)
- Pubs with own results + prior study citations (mixed — needs re-extract)
- Pubs with no own efficacy + prior study citations (pure contamination — null out)
- Safety/PK/diagnostic pubs with no efficacy at all (different problem, excluded)
Backfill remediation (remediate) — two modes:
- Null out (own=false, cites=true): Destroys materialized efficacy data, sets subgroup_outcome_measures=[]. For pubs like 29705 that have zero own efficacy.
- Re-extract (own=true, cites=true): Resets extracted=false, clears subgroup_outcome_measures, destroys materialized data. Next classify_publications run re-extracts with the fixed prompt.
Production run (2026-03-28): Validated 7,675 pubs via NCT mismatch + multiple registry ID signals. Found 1,495 citing prior study efficacy. Remediated confirmed contamination (null-out only).
Backfill gap identified (2026-03-30): 46,962 pubs with efficacy data were never validated because the structural pre-filters (NCT mismatch, 2+ registry IDs) miss prior studies cited by author/journal reference or different cohorts of the same trial. Text pattern matching (ILIKE on “prior study”, citation brackets, etc.) only catches ~5% of known cases — language is too varied. Solution: validate all remaining pubs with --all (no structural pre-filter). Estimated cost: ~$12 via GPT-5-mini batch.
Full-corpus validation (2026-03-30) — job 1471: Validated 68,958 of 69,124 pubs with efficacy data. Results:
| Category | Pubs | Outcomes | Action |
|---|---|---|---|
| Clean (own=true, cites=false) | 53,380 | — | None |
| Mixed (own=true, cites=true) | 8,937 | 26,491 | Triage then re-extract |
| Pure contamination (own=false, cites=true) | 1,722 | 3,275 | Null out |
| No efficacy at all (own=false, cites=false) | 5,159 | 14,177 | Separate issue |
The own=false, cites=false bucket (5,159 pubs) contains PK/safety/DDI/biomarker publications with spurious subgroup_outcome_measures — not cross-contamination, but a separate data quality issue.
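The bucketing implied by the validation schema's two booleans can be sketched as a small Ruby mapping. This is an illustrative reconstruction (the function name is hypothetical), showing how each publication lands in exactly one of the four categories above:

```ruby
# Maps the LLM validation booleans to the remediation bucket for a publication.
def contamination_bucket(has_own_efficacy:, cites_prior:)
  case [has_own_efficacy, cites_prior]
  when [true,  false] then :clean              # no action
  when [true,  true]  then :mixed              # triage, re-extract only if data leaked
  when [false, true]  then :pure_contamination # null out materialized efficacy
  else                     :no_efficacy        # spurious SOMs — separate issue
  end
end

contamination_bucket(has_own_efficacy: false, cites_prior: true) # => :pure_contamination
```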
Triage step added (2026-03-30): Re-extracting all 8,937 mixed pubs is wasteful — most have trivial background citations (e.g. “promising phase 1/2 rates”) that didn’t leak into extracted data. Added triage command to backfill_cross_study_contamination.thor that sends each mixed pub’s abstract + prior study citations + extracted outcome measures to GPT-5-mini, which checks whether any extracted values actually match the cited prior study data. Results stored in llm_data['cross_study_triage']. Schema: has_contaminated_outcomes (bool), explanation, contaminated_indices (array of 0-based indices into subgroup_outcome_measures).
Spot-checked on 5 pubs (pub 30362 known contamination + 4 random): 1/5 flagged contaminated. Pub 30362 correctly identified index 0 (ORR=37.2% + DOR=6.0mo from Cohen 2023 monotherapy) as contaminated while index 1 (ORR=60% from current combination study) was clean. The remediate command now only re-extracts pubs where triage confirmed contamination, and warns if untriaged pubs exist.
Production triage run (2026-03-30) — job 1674: Triaged all 8,937 mixed pubs in ~34 minutes.
| Triage result | Pubs | Contaminated indices |
|---|---|---|
| Clean (no leakage) | 8,205 | — |
| Contaminated (prior study data leaked into extractions) | 732 | 970 |
91.8% of mixed pubs cite prior studies in Background but have clean extractions — triage saved ~8,200 unnecessary re-extractions.
Spot-check results (2026-03-30): Checked 18 contaminated, 5 clean, 5 pure-contamination pubs.
- Contaminated (732): True positives confirmed across diverse patterns (SCHOLAR-1 comparator data, preclinical studies citing clinical results, I-SPY 2 external validation). ~63 pubs are same-trial follow-ups where “previously reported” refers to the trial’s own earlier publication (e.g., ARAMIS OS follow-up citing its own MFS primary result, PAOLA-1 OS citing its own PFS primary). These are borderline — the values genuinely came from Background text, but they’re the trial’s own data. Re-extraction with the fixed prompt handles these correctly: it will extract values that belong to this trial and skip values only mentioned in Background from different studies.
- Clean (8,205): All correctly clean — extracted values don’t match cited prior study values. Triage reasoning is precise (compares specific numbers).
- Pure contamination (1,722): All correct — news summaries, trial design abstracts, preclinical studies with only cited clinical efficacy.
Prompt fix v1 (2026-03-30): Updated cross-study reference handling in task.rb section 6 and subgroup_extraction.rb:
- Removed “Previously reported…” from prior-study recognition patterns (was causing false positives on same-trial follow-ups)
- Added EXCEPTION clause: when abstract is a subgroup/post-hoc/updated analysis of the same trial (matched by NCT/registry ID or trial name/acronym), previously reported results ARE the study’s own data and should be extracted
- Kept matching criteria precise (NCT ID, registry ID, trial name/acronym only) — excluded fuzzy “same population and intervention” matching to avoid a Phase 3 incorrectly claiming a Phase 2’s results as its own
Production run (2026-03-30): Steps 1–4 completed. 720/732 pubs re-extracted, 12 correctly empty (no own efficacy). 701 materialized via post-processing.
Post-extraction review (2026-03-30): Spot-checked 10 random re-extracted pubs against triage explanations. Found two contamination patterns:
- Pattern A (pure background citations): Cleaned up — prompt fix works. Values from cited external studies in Background/Introduction are no longer extracted.
- Pattern B (cross-study comparisons): Persists — when an abstract explicitly compares its results against another study (benchmarking, MAIC, historical controls), o4-mini still extracts both sides. Root cause: subgroup_extraction creates subgroups for external study arms (e.g., “VISION”), then classify_publications fills them in.
Prompt fix v2 (2026-03-30): Strengthened both task.rb section 6 and subgroup_extraction.rb:
- Changed framing from “don’t extract prior study data” to “only extract data MEASURED IN PATIENTS ENROLLED IN THIS STUDY”
- Added explicit examples of cross-study comparison patterns to reject: benchmarking, MAIC, historical controls, side-by-side comparisons
- Added instruction to subgroup_extraction.rb to not create subgroups/endpoints for data from other studies even when presented as comparisons
Tested on 6 pubs locally (1 known + 5 Pattern B):
- 5/6 clean: subgroup extraction no longer creates external study subgroups, classify_publications only extracts own data
- 1/6 still contaminated (pub 119370): abstract presents cross-study comparison as formal arm comparison, indistinguishable from own trial arms
Retriage + prune commands (2026-03-30): Added retriage and prune to backfill_cross_study_contamination.thor for surgical cleanup of remaining contamination after re-extraction:
- retriage: Re-runs triage on re-extracted pubs, stores result in the cross_study_retriage key (preserves original triage data)
- prune: Removes specific contaminated SOMs by index, destroys materialized data, marks for post-processing rebuild
Tested on pub 119370: retriage correctly identified SOM index 1 (RICOVER-60 data) as contaminated, prune removed it, leaving only the Beijing cohort’s own data.
Production run sequence (remediation of 732 re-extracted pubs):
```shell
# 1. Deploy prompt fix v2 (task.rb + subgroup_extraction.rb)

# 2. Re-run subgroup extraction with fixed prompt (732 pubs)
thor clinical_trials:publications:extract_subgroups --publication_ids $(
  psql -t -c "SELECT id FROM publications WHERE llm_data->'cross_study_triage'->>'has_contaminated_outcomes' = 'true' AND extracted = true" | tr '\n' ' '
)

# 3. Re-extract with fixed prompt
thor clinical_trials:publications:classify_publications --batched

# 4. Re-triage new extractions to find remaining contamination
thor one_off:backfill_cross_study_contamination:retriage --batched

# 5. Surgically remove contaminated SOMs
thor one_off:backfill_cross_study_contamination:prune --no-dry-run

# 6. Post-process to rematerialize
thor clinical_trials:publications:post_process_publications --batched
```

31. Investigational drug dose data bleeds onto control/comparator arms
Short summary
When publication_interventions.study_plan_arm_id is NULL (the common case for publication-extracted drugs via Source 0), the drug_interventions CTE in vw_publication_efficacy_data joins the investigational drug to ALL arms — including control/comparator arms. The pub_dose_lookup COALESCE fallback then propagates the investigational drug’s dose fields (dose_min, dose_max, rp2d, dose_units, dose_frequency) onto control arm rows that have no subgroup-level dose override. This makes it appear that the comparator arm received the investigational drug’s dosing.
Where this sits in the current pipeline
db/views/vw_publication_efficacy_data_v18.sql:
- drug_interventions CTE (Source 0): Joins publication_interventions to arms. When both clinical_trial_id and study_plan_arm_id are NULL, the drug matches all arms via the OR di.study_plan_arm_id IS NULL fallback.
- pub_dose_lookup CTE: Pulls dose_evidence from publication_interventions. Joined to raw_rows via publication_intervention_id match from drug_interventions.
- raw_rows COALESCE chain (lines 449–469): Falls through subgroup-level dose → pub-level dose. No arm_type guard prevents control arms from inheriting investigational drug dose.
Exact restriction causing the drop
In raw_rows, the dose COALESCE chain:
```sql
COALESCE(tlm.subgroup_dose_min, ..., pdl.pub_dose_min) AS dose_min,
COALESCE(tlm.subgroup_dose_max, ..., pdl.pub_dose_max) AS dose_max,
COALESCE(tlm.subgroup_rp2d, pdl.pub_rp2d) AS rp2d,
```

has no guard for aoe.arm_type or aoe.resolved_group_type. When a control arm’s subgroup has no dose fields, the COALESCE falls through to pub_dose_lookup, which contains the investigational drug’s dose evidence.
Concrete examples
Pub 241259 (Temab-A exposure-response in mCRC):
- SOC arm = trifluridine/tipiracil+BEV (N=20)
- View shows: dose_min=1.6 mg/kg, dose_max=2.4 mg/kg, rp2d=2.4 mg/kg Q3W, dose_units=mg/kg, dose_frequency=Q3W
- These are Temab-A doses from publication_interventions id=51068 (study_plan_arm_id=NULL)
- Abstract explicitly states SOC is “trifluridine/tipiracil+BEV” — no Temab-A dosing
Pub 241978 (Enfortumab vedotin):
- “No upfront dose reduction” control arm shows dose_min=0.75 mg/kg, dose_max=1.25 mg/kg
Downstream impact
- Clinical Evidence report: Control arms display investigational drug dose fields, misleading reviewers into thinking comparator arms received the ADC
- Audit findings: Audit LLM correctly flags these as incorrect (5 of 7 issues on pub 241259 are this pattern)
- Data quality: Dose fields on control arms are nonsensical — they describe a drug the arm didn’t receive
- 2,890 view rows across 566 publications have dose data from pub_dose_lookup on control/comparator arms
- 1,197 additional control rows have subgroup-level dose (potentially legitimate for dose-comparison arms)
- Within ADC technology scope: 14 rows across 5 publications (smaller because most ADC trials are single-arm)
What the issue is not
- Drug NAME attribution to control arms is intentional — the report needs to show what drug the control is being compared against
- Subgroup-level dose on control arms may be correct (e.g., dose-comparison trials where the control is a different dose of the same drug)
- This does NOT affect experimental/investigational arm rows
Explored solution direction
Forward fix — view v19: Add an arm_type guard to the pub_dose_lookup COALESCE in raw_rows. When aoe.arm_type = 'control' (or aoe.resolved_group_type = 'ACTIVE_COMPARATOR'), skip the pub_dose_lookup fallback:
```sql
COALESCE(
  tlm.subgroup_dose_min,
  CASE WHEN tlm.subgroup_dose_value IS NOT NULL
       THEN tlm.subgroup_dose_value || ' ' || COALESCE(tlm.subgroup_dose_units, '')
  END,
  CASE WHEN aoe.arm_type != 'control' THEN pdl.pub_dose_min END
) AS dose_min,
```

Apply the same pattern to dose_max, rp2d, dose_units, dose_frequency, and single_dose. This preserves subgroup-level dose (tier 1) for all arms but blocks the publication-level fallback (tier 3) for control arms only.
No backfill needed — rematerializing the view after deploying v19 will fix all affected rows.
Related to Issue 20: The v16 Source 0 fix (using publication_interventions as primary drug source) introduced this side effect by broadening the drug_interventions join. The drug join itself is correct; only the dose COALESCE fallback needs the arm_type guard.
Solution applied
Forward fix — view v19 (2026-03-29): Added arm_type guard to all 6 dose COALESCE chains in db/views/vw_publication_efficacy_data_v19.sql. When aoe.arm_type is a control/comparator variant (control, comparator, active_comparator, placebo, placebo_comparator), the pub_dose_lookup fallback is skipped. Subgroup-level dose (tier 1) is preserved for all arms — only the publication-level fallback (tier 3) is blocked for control arms.
```sql
-- Example for dose_min (same pattern for dose_max, rp2d, dose_units, dose_frequency, single_dose):
COALESCE(
  tlm.subgroup_dose_min,
  CASE WHEN tlm.subgroup_dose_value IS NOT NULL
       THEN tlm.subgroup_dose_value || ' ' || COALESCE(tlm.subgroup_dose_units, '')
  END,
  CASE WHEN aoe.arm_type IS NULL
         OR LOWER(aoe.arm_type) NOT IN ('control', 'comparator', 'active_comparator', 'placebo', 'placebo_comparator')
       THEN pdl.pub_dose_min
  END
) AS dose_min,
```

No backfill needed — rematerializing the view after deploying v19 fixes all affected rows.
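The guard's logic can be rendered in Ruby for clarity. This is a sketch over hypothetical row hashes, not the view itself; note that a NULL arm_type keeps the fallback, matching the `aoe.arm_type IS NULL OR … NOT IN` condition in the SQL:

```ruby
# Control/comparator variants that must not inherit publication-level dose.
CONTROL_ARM_TYPES = %w[control comparator active_comparator placebo placebo_comparator].freeze

def resolved_dose_min(row)
  # Tier 1: subgroup-level dose wins for every arm type.
  return row['subgroup_dose_min'] if row['subgroup_dose_min']
  # Tier 3: publication-level fallback, blocked for control/comparator arms.
  arm = row['arm_type']&.downcase
  return nil if arm && CONTROL_ARM_TYPES.include?(arm)
  row['pub_dose_min']
end

resolved_dose_min('arm_type' => 'control', 'pub_dose_min' => 1.6)      # => nil (fallback blocked)
resolved_dose_min('arm_type' => 'experimental', 'pub_dose_min' => 1.6) # => 1.6
resolved_dose_min('arm_type' => 'control', 'subgroup_dose_min' => 0.75,
                  'pub_dose_min' => 1.6)                               # => 0.75 (tier 1 preserved)
```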
32. TTP (time to progression) misclassified as PFS
Short summary
The LLM extraction pipeline (classify_publications) maps TTP (time to progression) values to PFS (progression-free survival) when the abstract reports TTP but not PFS. These are distinct endpoints — TTP censors deaths while PFS counts them as events. Additionally, in some cases (e.g., pub 29737), TTP values reported for a best-response subpopulation (e.g., SD patients only) are attributed to the entire cohort.
Where this sits in the current pipeline
- app/tasks/publications_llm_classification/subgroup_extraction.rb: Identifies endpoints from the abstract. May correctly identify TTP but it gets mapped to PFS downstream.
- app/tasks/publications_llm_classification/task.rb: Extracts endpoint values. The LLM treats TTP as PFS when extracting, or the endpoint mapping normalizes TTP→PFS.
- Endpoint normalization: If TTP is not in the standard endpoint list, the LLM may substitute the closest recognized endpoint (PFS).
Exact restriction causing the drop
The classify_publications prompt and/or endpoint schema does not distinguish TTP from PFS. When an abstract reports “median TTP = X months”, the LLM maps this to the PFS endpoint because TTP is not available as a separate extraction target. The LLM lacks instruction to leave PFS null when only TTP is reported.
Concrete examples
Pub 29737 (IMMU-132 in GI cancers):
- Abstract: “time to progression (TTP) … median of 4.8+ mo for the SD pts”
- Extracted: PFS=4.8 months, patient_count=29 (entire CRC cohort)
- Correct: TTP=4.8+ months, applicable to 14 SD patients only — PFS should be null
- Two compounding errors: (1) TTP→PFS confusion, (2) SD-subpopulation value → full cohort
Pub 29737 KRAS-mutated subgroup:
- Abstract: “median TTP = 4.4+ mo” for 7 SD patients
- Extracted: PFS=4.4 months, patient_count=13 (all KRAS-mutated)
- Correct: TTP=4.4+ months for 7 SD patients — PFS should be null
Downstream impact
- Clinical Evidence report: PFS column shows TTP values, overstating the evidence (PFS is a stronger endpoint than TTP)
- Cross-study comparisons: TTP values mixed with genuine PFS values make comparisons unreliable
- Patient counts: When TTP is reported only for responders/SD patients, attributing it to the full cohort inflates the denominator
- 241 publications mention TTP in their abstract yet have PFS extracted without TTP (revised upward from 149 after widening text patterns to include hyphenated “time-to-progression” and “mTTP”)
- 181 publications have TTP correctly extracted as TTP (suggesting the pipeline CAN handle TTP in many cases)
- The SD-subpopulation misattribution is harder to quantify systematically but likely affects a subset of phase I/II publications reporting outcomes by best response category
Explored solution direction
- Extraction prompt fix (forward): Add explicit instruction to classify_publications: “TTP (time to progression) and PFS (progression-free survival) are distinct endpoints. If the abstract reports TTP but not PFS, extract TTP only — do NOT map TTP values to PFS. Leave PFS null when only TTP is reported.”
- Subpopulation guard: Add instruction: “When a time-based endpoint (TTP, PFS, DoR) is reported only for a best-response subgroup (e.g., ‘median TTP for SD patients’), do not attribute it to the parent population. Extract it under the response-specific subgroup or leave the parent’s value null.”
- Backfill: Re-extract PFS values for the 241 affected publications with the updated prompt. Scope: publications where the abstract contains TTP/time to progression but NOT PFS/progression-free survival, and a PFS endpoint was extracted.
Solution applied
Section titled “Solution applied”-
Prompt fix in
identifier_extraction.rb: Added<<< Endpoint Distinction Rules >>>section after the “keep broad” normalization instruction, explicitly stating TTP and PFS are clinically distinct and must never be merged. Also covers DFS vs EFS. Instructs LLM to use the exact term from the abstract when in doubt. -
Subpopulation guard in
task.rb: Added** Response-Specific Endpoint Attributionblock instructing the LLM not to attribute response-specific time-based endpoints (e.g., “TTP for SD patients”) to the parent population. -
Backfill task:
lib/tasks/one_off/backfill_ttp_pfs_misclassification.thorwithidentify(finds 241 affected pubs, stores findings inllm_data['ttp_pfs_check']) andremediate(resets pubs for full pipeline re-extraction). ~50% false positive rate in scope (TTP mentioned descriptively, not as a study endpoint) but re-extraction is harmless. -
Spot-check results: Ran
extract_trial_identifieron 5 confirmed misclassified pubs (13857, 143497, 53502, 12317, 143682) — all 5 now correctly extract TTP instead of PFS. -
Query-layer fix (2026-03-30): The extraction fix (steps 1-4) correctly stored TTP as
TTPin the view, butextract_efficacy_metricsinclinical_evidence_query.rbhad a TTP→PFS fallback (added ine7fc41f7, 2026-03-23) that silently remapped TTP back intometrics[:pfs]when no real PFS existed. This caused audit issue 8436 (pub 100, DX1002 phase 1 — abstract reports mTTP=2.70, query presented as mPFS=2.7). Fix: TTP is now a first-class metric (metrics[:ttp]) instead of a PFS stand-in. Changes:clinical_evidence_query.rb: TTP fallback writes to:ttpnot:pfs; added:ttpto patient count chainclinical_evidence_report.rb: AddedmTTP (month)andHR (TTP)columnsaudit_clinical_evidence.rb: Addedefficacy.ttp.value/hazard_ratioto auditable fields; removed TTP from “not tracked” exclusion list
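The shape of the query-layer change can be sketched in a few lines. This is a hypothetical simplification, not the actual extract_efficacy_metrics body; the point is that a TTP row now lands in metrics[:ttp] instead of being remapped into metrics[:pfs] when no real PFS exists.

```ruby
# Hypothetical, simplified sketch of the fixed fallback logic in
# clinical_evidence_query.rb. Row shape is illustrative only.
def extract_efficacy_metrics(rows)
  metrics = {}
  rows.each do |row|
    case row[:endpoint]
    when "PFS" then metrics[:pfs] ||= row[:value]
    when "OS"  then metrics[:os]  ||= row[:value]
    when "TTP" then metrics[:ttp] ||= row[:value] # previously: fell back into :pfs when no PFS existed
    end
  end
  metrics
end

# The pub 100 pattern (mTTP reported, no PFS) now surfaces under :ttp, never :pfs
extract_efficacy_metrics([{ endpoint: "TTP", value: 2.7 }])
# => { ttp: 2.7 }
```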
33. Cross-tabulated subgroups not identified in basket trials
Short summary
When basket trial abstracts report efficacy in a table structured as tumor type × biomarker status (e.g., CRC × HER2 IHC 3+/2+/1+), extract_subgroups identifies the single-dimension subgroups (tumor types and biomarker statuses separately) but not the cross-product subgroups (CRC IHC 3+, CRC IHC 2+, etc.). This means disease-specific biomarker-stratified efficacy data is lost — only the overall tumor-type and overall biomarker-status rows are extracted.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/subgroup_extraction.rb: Identifies subgroups and their endpoint associations from the abstract. The LLM prompt identifies subgroups as a flat list, and the hierarchical naming convention (e.g., “Non-breast STs → CRC”) captures one level of nesting but not cross-dimensional nesting.
Exact restriction causing the drop
The subgroup extraction prompt produces subgroups along each dimension independently:
- By tumor type: BTC, UC, GC/GEJA, CRC
- By biomarker: HER2 IHC3+, IHC2+, IHC1+
But it does not produce the cross-product: CRC IHC3+, CRC IHC2+, etc. The table data in the abstract contains these values, but the extraction doesn’t recognize the need to create nested subgroups for each cell in a tumor type × biomarker matrix.
Concrete examples
Pub 72043 (SHR-A1811 in non-breast solid tumors):
- Abstract table reports ORR for each tumor type × HER2 IHC status combination
- Extracted subgroups: CRC (36.4%), IHC3+ (54.1%), IHC2+ (41.7%), IHC1+ (50.0%)
- Missing: CRC IHC3+ (100%, 3/3), CRC IHC2+ (0%, 0/3), CRC IHC1+ (0%, 0/1), CRC HER2 mut/amp (0%, 0/3)
- 4 audit issues (8402-8405) all flagging missing cross-tabulated CRC subgroups
Downstream impact
- Clinical Evidence report: Disease-specific biomarker-stratified efficacy data missing — can only show overall CRC ORR, not CRC by HER2 status
- Granularity loss: The most clinically relevant data in basket trials is often the cross-tabulation (e.g., “does HER2 IHC 3+ predict response in CRC specifically?”)
- ~366 publications have both disease-type and biomarker-type subgroups with common biomarkers (HER2, EGFR, KRAS, BRAF, PD-L1, MSI, MMR)
- Not all 366 will have cross-tabulated data in the abstract — many will have separate analyses rather than a matrix table
- The issue primarily affects basket/platform trials reporting across multiple tumor types with biomarker stratification
What the issue is not
- This is NOT about missing biomarker context on existing subgroups (that’s Issue 19)
- This is NOT about dropped subgroups at the classify step (Issue 10) — the cross-product subgroups are never identified in the first place
- Parent-level tumor type and biomarker subgroups ARE correctly extracted
Explored solution direction
- Extraction prompt enhancement: Update the extract_subgroups prompt to recognize tabular cross-tabulation patterns: “When the abstract contains a table or matrix reporting efficacy by tumor type × biomarker status, create cross-product subgroups (e.g., ‘CRC → HER2 IHC 3+’) for each cell with reported data, in addition to the single-dimension subgroups.”
- Post-extraction cross-product generation: After extracting single-dimension subgroups, detect when a table exists with both dimensions and generate cross-product subgroups programmatically.
- Scope: Focus on publications with ≥2 disease subgroups AND ≥1 biomarker subgroup, and re-run extraction with the enhanced prompt.
- Backfill?
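The post-extraction direction above reduces to a small name generator. A minimal sketch, assuming flat lists of single-dimension subgroup names and the existing arrow nesting convention (the helper name is hypothetical; the real pipeline would only emit cells with reported data):

```ruby
# Hypothetical cross-product generator: combines single-dimension subgroup
# names using the existing arrow nesting convention ("CRC → HER2 IHC 3+").
def cross_product_subgroups(tumor_types, biomarker_statuses)
  tumor_types.product(biomarker_statuses).map { |t, b| "#{t} → #{b}" }
end

cross_product_subgroups(["CRC", "GC/GEJA"], ["HER2 IHC 3+", "HER2 IHC 2+"])
# => ["CRC → HER2 IHC 3+", "CRC → HER2 IHC 2+", "GC/GEJA → HER2 IHC 3+", "GC/GEJA → HER2 IHC 2+"]
```

Because classification and post_process already handle arbitrary subgroup strings, generated names like these would flow through the rest of the pipeline unchanged.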
Solution applied
- Prompt enhancement (subgroup_extraction.rb): Added Step 2b to SYSTEM_PROMPT instructing the LLM to create cross-product subgroups when the abstract reports efficacy broken down by two dimensions (e.g., tumor type × biomarker status) — covers both literal tables and prose patterns like “Among CRC patients, ORR was X% in IHC 3+”. Cross-products use the existing arrow nesting convention (“CRC → HER2 IHC 3+”) alongside preserved single-dimension parents. No schema or downstream changes needed — classification task, dropped-subgroup guard, and post_process all handle arbitrary subgroup strings already.
- Two-pass LLM-screened backfill (lib/tasks/one_off/backfill_cross_tabulated_subgroups.thor):
  - Structural scoping: 6,081 candidate pubs (≥2 disease-tagged + ≥1 biomarker-tagged subgroups)
  - Pass 1 (screen): Broad gpt-5-mini screening → 934 flagged (Job 1655)
  - Pass 2 (rescreen): Tighter prompt requiring ≥2 distinct tumor types with per-disease biomarker breakdown → 262 confirmed (Job 1657). Spot-check: ~7-9/10 true positives.
  - remediate resets subgroup_endpoints, subgroup_outcome_measures, llm_data_processed = false on the 262 confirmed pubs for pipeline re-run.
```shell
# Screening (already complete)
thor one_off:backfill_cross_tabulated_subgroups:screen --batched --parallelism=4    # Job 1655
thor one_off:backfill_cross_tabulated_subgroups:rescreen --batched --parallelism=4  # Job 1657

# Remaining steps
thor one_off:backfill_cross_tabulated_subgroups:remediate --dry_run
thor one_off:backfill_cross_tabulated_subgroups:remediate
# Then re-run full publications pipeline on affected pubs
```
34. “Immature” endpoints extracted as “Not Reached”
Short summary
When an abstract states that an endpoint (OS, PFS, DoR) is “not yet mature”, “data immature”, or “results are immature”, the LLM extraction maps this to “Not Reached”. These are clinically distinct concepts: “Not Reached” means the Kaplan-Meier curve hasn’t crossed the 50% mark (a real finding indicating the median exceeds current follow-up), while “immature” means insufficient events or follow-up to perform the analysis (no median can be estimated — value should be null).
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb: The classify_publications prompt doesn’t distinguish between “Not Reached” and “immature/not yet mature”. The LLM treats both as equivalent and extracts “Not Reached” for either.
Exact restriction causing the drop
The extraction prompt has no instruction to differentiate “Not Reached” (endpoint was analyzed, median exceeds follow-up) from “immature” (endpoint was NOT formally analyzed, insufficient data). Both get mapped to the string “Not Reached”.
Concrete examples
Pub 114571 (JSKN003 in HER2+ mCRC):
- Abstract: “The median overall survival (OS) was not yet mature”
- Extracted: OS = “Not Reached”
- Correct: OS should be null — data immature, no median estimated
Pub 115389 (from job 1594):
- Abstract: PFS described as “immature”
- Extracted: PFS = “Not Reached”
- Correct: PFS should be null
Downstream impact
- Clinical Evidence report: “Not Reached” implies a favorable outcome (median exceeds follow-up), while “immature” is neutral (no data yet). Reporting “Not Reached” when the data is simply immature overstates the evidence.
- Cross-study comparisons: “Not Reached” OS is treated as a positive signal, biasing comparisons against studies that honestly report immature data.
- ~71 publications have “immature”/“not yet mature” language in the abstract (without “not reached”) but have “Not Reached” extracted for OS, PFS, or DoR
- Breakdown: OS (~214 total “Not Reached” pubs with immature language, ~71 without “not reached” in abstract), PFS (~107), DoR (~68)
- Many abstracts legitimately say BOTH “immature” and “not reached” — these are correct and not affected
What the issue is not
- Abstracts that say “median OS was not reached” — these ARE correct as “Not Reached”
- Abstracts that say “OS data are immature; median was not reached” — also correct (both terms used)
- Only affects abstracts where “immature” is used WITHOUT “not reached” for the same endpoint
Explored solution direction
- Extraction prompt fix (forward): Add instruction to classify_publications: “Distinguish between ‘Not Reached’ (endpoint was analyzed but median exceeds follow-up — extract as ‘Not Reached’) and ‘immature/not yet mature’ (insufficient data to analyze the endpoint — extract as null/omit). Only use ‘Not Reached’ when the abstract explicitly states the median was not reached.”
- Backfill: Re-extract OS/PFS/DoR for the ~71 affected publications. Scope query:

```sql
SELECT DISTINCT v.publication_id
FROM vw_publication_efficacy_data v
JOIN publications p ON p.id = v.publication_id
WHERE v.measure_value = 'Not Reached'
  AND v.endpoint_abbreviation IN ('OS', 'PFS', 'DOR')
  AND (p.abstract ILIKE '%not yet mature%' OR p.abstract ILIKE '%data immature%'
       OR p.abstract ILIKE '%data are immature%' OR p.abstract ILIKE '%results are immature%')
  AND p.abstract NOT ILIKE '%not reached%'
  AND p.abstract NOT ILIKE '%not been reached%'
```
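The same scope logic can be expressed as a small Ruby predicate, which may be handy for spot-checking individual pubs in a console. This is a hypothetical helper mirroring the ILIKE patterns above, not pipeline code:

```ruby
# Hypothetical flagger: a "Not Reached" extraction is suspect when the
# abstract uses "immature" language but never says the median was not reached.
IMMATURE_PATTERNS    = [/not yet mature/i, /data (?:are |is )?immature/i, /results are immature/i].freeze
NOT_REACHED_PATTERNS = [/not reached/i, /not been reached/i].freeze

def suspect_not_reached?(measure_value, abstract)
  measure_value == "Not Reached" &&
    IMMATURE_PATTERNS.any? { |re| abstract.match?(re) } &&
    NOT_REACHED_PATTERNS.none? { |re| abstract.match?(re) }
end

suspect_not_reached?("Not Reached", "The median overall survival (OS) was not yet mature")
# => true  (the pub 114571 pattern; value should be null)
suspect_not_reached?("Not Reached", "OS data are immature; median was not reached")
# => false (both terms used; "Not Reached" is correct)
```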
Solution applied
(empty — pending implementation)
35. Dose extraction confuses PK thresholds, imaging agent doses, and missing dose_max
Short summary
The classify_publications dose extraction conflates several distinct concepts when populating subgroup dose fields. Pharmacokinetic thresholds (e.g. “tumor saturation above 100 mg/m²/d”), imaging/diagnostic agent doses (e.g. [⁶⁸Ga]Ga-PSMA-11 activity in MBq), and PD biomarker thresholds are extracted as if they were the therapeutic dose range for the efficacy population. Additionally, dose_max is sometimes left null when the abstract clearly states an upper bound.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — Subgroup Dose Context section of SYSTEM_PROMPT.
Exact restriction causing the drop
The dose extraction prompt instructs the LLM to extract dose fields for dose cohorts but does not distinguish between:
- Therapeutic dose (the drug dose patients received for treatment)
- PK/PD thresholds (concentration or exposure levels observed, e.g. “target saturation above X”)
- Imaging/diagnostic agent doses (tracer activity for PET scans, not therapeutic)
- dose_max omission — when dose_min is set from a “≥X” phrase, dose_max is left null even when the abstract states the upper bound of the escalation range
Concrete examples
PK threshold as dose — Pub 148480:
- Abstract: doses 5–400 mg/m²/d, “tumor saturation above 100 mg/m²/d”
- Extracted: dose_min=100 (PK threshold, not efficacy population)
- Expected: dose_min=5, dose_max=400 (full enrolled range, efficacy not restricted)
PK threshold as dose — Pub 229651:
- Abstract: doses 25–500 mg QD, “NTX changes ≥50 mg”
- Extracted: dose_min=50 (PD biomarker threshold)
- Expected: dose_min=25, dose_max=500 (full enrolled range)
PK threshold as dose — Pub 134251:
- Abstract: responses and PFS reported overall, “target concentrations ≥100 mg/day”
- Extracted: dose_min=100 (PK target, not dose restriction)
- Expected: null dose fields or full enrolled range
Imaging agent dose — Pub 244477:
- Abstract: [¹⁷⁷Lu]Lu-PSMA-617 at 7.4 GBq (therapeutic), [⁶⁸Ga]Ga-PSMA-11 111–259 MBq (imaging)
- Extracted: dose_min=111, dose_max=259 (imaging tracer activity)
- Expected: dose fields for the therapeutic agent, not the imaging tracer
Missing dose_max — Pub 58814:
- Abstract: “14 pts at doses ≥1.5 mg/kg” across cohorts 0.5–2.5 mg/kg
- Extracted: dose_min=1.5, dose_max=null
- Expected: dose_min=1.5, dose_max=2.5
Missing dose_max — Pub 137619:
- Abstract: “patients at dose of 0.15 mg/kg or above” across escalation up to 0.18 mg/kg
- Extracted: dose_min=0.15, dose_max=null
- Expected: dose_min=0.15, dose_max=0.18
Downstream impact
Report rows show incorrect dose context: PK observations misrepresented as dosing, imaging agent doses shown instead of therapeutic doses, and incomplete dose ranges when max is omitted. Affects dose-response interpretation and cross-study comparisons.
Discovered during issue 29 backfill validation: 14 of 20 random publications with dose_min set had some form of dose extraction error. Categories overlap — a single pub may have both a PK threshold issue and a missing dose_max. Full scope is ~720 publications with dose_min set on trial_subgroups; exact breakdown by error type pending full validation run.
Explored solution direction
- Prompt fix (forward): Update Subgroup Dose Context to explicitly instruct:
- “Only extract the THERAPEUTIC drug dose, not PK/PD thresholds, biomarker cutoffs, or diagnostic/imaging agent doses.”
- “When dose_min is set from a ‘≥X’ pattern, also set dose_max to the highest dose level stated in the abstract.”
- Backfill: The issue 29 backfill validation (backfill_dose_scope_mismatch.thor) already identifies these problems via correct_dose_min / correct_dose_max in the LLM check. Remediation can directly patch the dose fields on trial_subgroups using the validated corrections, rather than a full re-extract.
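The dose_max rule can be stated precisely as a tiny post-extraction helper. A minimal sketch, assuming a list of dose levels has been parsed from the abstract (the helper name is hypothetical; the actual fix is a prompt instruction, not code):

```ruby
# Hypothetical helper for the "≥X" rule: when dose_min came from a "≥X"
# phrase and dose_max is null, fill dose_max with the highest stated dose
# level at or above dose_min.
def fill_dose_max(dose_min, dose_max, stated_dose_levels)
  return dose_max unless dose_max.nil? && dose_min
  stated_dose_levels.select { |d| d >= dose_min }.max
end

# Pub 58814: "14 pts at doses ≥1.5 mg/kg" across cohorts 0.5–2.5 mg/kg
fill_dose_max(1.5, nil, [0.5, 1.0, 1.5, 2.0, 2.5])  # => 2.5
# Pub 137619: "0.15 mg/kg or above" across escalation up to 0.18 mg/kg
fill_dose_max(0.15, nil, [0.05, 0.1, 0.15, 0.18])   # => 0.18
```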
Solution applied
Forward fix (2026-03-31): Added “DOSE VALUE FILTERING” instruction to the Subgroup Dose Context section in task.rb — instructs LLM to only extract therapeutic drug doses, explicitly rejecting PK/PD thresholds (e.g. “target saturation above X”), imaging/diagnostic agent doses (e.g. tracer activity in MBq), and biomarker/lab value cutoffs.
View fix (2026-03-31): vw_publication_efficacy_data v20 — added dose_context_type gate to the pub-level dose COALESCE fallback. For publications with dose_context_type of escalation or range, the view no longer falls back to the study-level dose range from publication_interventions.dose_evidence. Subgroups in escalation studies that genuinely need dose fields already have them set at the trial_subgroup level (via extraction or Issue 29 backfill), so they take COALESCE priority 1 and are unaffected. Fixes ~2,612 publications where study-level escalation ranges were bleeding into non-dose subgroups (disease cohorts, biomarker groups, Overall).
View fix v21 (2026-03-31): Extended dose_context_type gate to also block rp2d studies. RP2D publications store the full escalation range in dose_min/dose_max on publication_interventions.dose_evidence, not just the RP2D value — non-dose subgroups (biomarker, disease, Overall) were inheriting the escalation range. Affects ~1,581 additional publications (e.g. pub 48903 “Low HER2” showing 5.4–8.0 instead of null, pub 135119 “Overall” inheriting Q2W-LD arm dose).
Backfill validation (2026-03-31): Job 1694 re-validated ~720 pubs with subgroup-level dose_min. 981 subgroups OK, 42 new corrections identified (26 wrong range, 13 non-dose→null, 2 PK/PD, 1 radiation). Remediation pending.
36. cORR set equal to ORR when abstract distinguishes confirmed vs unconfirmed
Short summary
LLM extraction sets confirmed ORR (cORR) equal to the overall ORR instead of counting only confirmed responses. This is the reverse of Issue 27 — there, extract_efficacy_metrics picked the confirmed row for the plain ORR metric. Here, the LLM itself outputs the same value for both ORR and cORR during classify_publications, so the view and query faithfully reproduce the wrong value.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — classify_publications LLM extraction step. The prompt asks for both ORR and cORR as separate endpoints, but the LLM sometimes fails to distinguish confirmed from unconfirmed responses.
Exact restriction causing the issue
The LLM extraction prompt does not provide explicit guidance on how to compute cORR when the abstract itemizes confirmed vs unconfirmed responses (e.g., “1 confirmed CR, 2 confirmed and 3 unconfirmed PRs”). The LLM defaults to the total response count for both metrics.
Concrete examples
Pub 30362 (Petosemtamab+pembro 1L r/m HNSCC):
- Abstract: “1 confirmed complete response, 2 confirmed and 3 unconfirmed partial responses” out of 10 evaluable pts
- Expected: ORR = 60% (6/10), cORR = 30% (3/10 at cutoff)
- Extracted: ORR = 60%, cORR = 60% (identical — wrong)
Downstream impact
Clinical Evidence report shows inflated cORR identical to ORR, obscuring the distinction between confirmed and unconfirmed responses. This is a meaningful clinical difference — confirmed response rates are the regulatory-grade metric.
1 instance found in job 1634 (HNSCC+BsAb). Related to Issue 27 (which was a query-layer pick issue, now fixed). This is a distinct extraction-layer issue. Full-corpus scale TBD — would require comparing ORR vs cORR values across all publications where both are extracted.
Explored solution direction
- Prompt fix (forward): Add explicit instruction to classify_publications: “cORR counts ONLY confirmed responses (CR + confirmed PR). If the abstract lists unconfirmed responses separately, exclude them from cORR. If the abstract does not distinguish confirmed from unconfirmed, leave cORR null.”
- Backfill: Identify publications where cORR = ORR and the abstract contains language distinguishing confirmed/unconfirmed. Re-extract cORR with a targeted prompt.
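The count-based derivation the prompt asks for is simple arithmetic. A minimal sketch using the pub 30362 counts (the helper and keyword names are hypothetical):

```ruby
# Hypothetical derivation: ORR counts all responses; cORR counts only
# confirmed CR + confirmed PR, excluding unconfirmed responses.
def response_rates(confirmed_cr:, confirmed_pr:, unconfirmed_pr:, evaluable:)
  {
    orr:  100.0 * (confirmed_cr + confirmed_pr + unconfirmed_pr) / evaluable,
    corr: 100.0 * (confirmed_cr + confirmed_pr) / evaluable
  }
end

# Pub 30362: 1 confirmed CR, 2 confirmed + 3 unconfirmed PRs, 10 evaluable
response_rates(confirmed_cr: 1, confirmed_pr: 2, unconfirmed_pr: 3, evaluable: 10)
# => { orr: 60.0, corr: 30.0 }
```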
Solution applied
- Forward fix (prompt): Already in place — the classify_publications prompt (task.rb:206-250) has comprehensive instructions for confirmed/unconfirmed ORR handling, including count-based derivation.
- Backfill: one_off:backfill_confirmed_unconfirmed_orr backfill_issue36 — targets 57 publications where confirmed=true ORR has the same measure_value as confirmed=false/null ORR. Re-extracts using a focused LLM prompt. Scope is structural (no text matching): joins ORR TOMs with matching values across confirmed flags.
37. Mean survival values extracted as median
Short summary
When an abstract reports mean OS or PFS (rather than median), the LLM extracts the numeric value without flagging the statistic type. The pipeline has no field to distinguish mean from median, so mean values are silently presented as median in the Clinical Evidence report.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — classify_publications extraction. The measure_value field captures a numeric value but has no companion field for the statistic type (mean vs median).
Exact restriction causing the issue
The LLM extraction schema defines survival endpoints with measure_value (numeric) and measure_unit (e.g., “months”) but has no statistic_type field. When the abstract says “mean OS = 25.3 months”, the LLM outputs 25.3 with unit “months”, indistinguishable from a median.
Concrete examples
Pub 51969 (FDG-PET target delineation SCCHN, CT-95):
- Abstract: “The mean OS was 25.3 months (95% CI, 22.5-28.1) and mean PFS was 23.2 months (95% CI, 20.3-26.1)”
- Extracted: OS = 25.3 months, PFS = 23.2 months (no indication these are means)
- Expected: Either null (mean is not the standard metric), or extracted with a flag indicating “mean”
Downstream impact
Report consumers assume survival values are medians (the standard in oncology). Mean survival overestimates the “typical” outcome when distributions are right-skewed (common in survival data). This creates misleading cross-study comparisons.
What the issue is not
This is not about rounding or approximation — the numeric value is correct. The problem is the absence of metadata distinguishing the statistic type.
1 instance found in job 1634 (HNSCC+BsAb). Mean survival reporting is uncommon in oncology abstracts (median is standard), so corpus-wide scale is likely small. Could identify candidates by searching abstracts for “mean OS” or “mean PFS” patterns.
Explored solution direction
Two approaches:
- Null approach: Update prompt to instruct: “Only extract median survival values. If the abstract reports mean (not median) OS or PFS, leave the value null.” Simple, preserves existing schema.
- Schema approach: Add a statistic_type field (enum: median, mean) to the outcome measure schema. More informative but requires a schema migration, view update, and query changes.
Option 1 is recommended for now — mean survival is rare and the null correctly signals “no standard median reported.”
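The null approach amounts to a guard like the following. This is a hypothetical sketch (the real fix would be a prompt instruction, and matching deliberately stays simple, so phrasings like “mean overall survival” would need their own pattern):

```ruby
# Hypothetical guard for the null approach: drop a survival value when the
# abstract reports it as a mean rather than a median.
def median_survival_or_nil(endpoint_abbrev, value, abstract)
  abstract.match?(/\bmean\s+#{Regexp.escape(endpoint_abbrev)}\b/i) ? nil : value
end

median_survival_or_nil("OS", 25.3, "The mean OS was 25.3 months")    # => nil
median_survival_or_nil("PFS", 6.8, "The median PFS was 6.8 months")  # => 6.8
```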
Solution applied
(empty — pending implementation)
38. Biomarker subgroups in secondary analyses not identified by extract_subgroups
Short summary
extract_subgroups misses biomarker-defined subgroups (e.g., p16+ oropharyngeal) when they appear as secondary efficacy analyses within the results section rather than as pre-specified study arms or primary subgroups. The efficacy data is present in the abstract but never enters the pipeline because the subgroup is not identified in the first extraction step.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/subgroup_extraction.rb — extract_subgroups runs BEFORE classify_publications. If a subgroup is not identified here, no efficacy data is extracted for it downstream.
Exact restriction causing the issue
The extract_subgroups prompt focuses on pre-specified study populations and arms. When an abstract reports a secondary analysis like “Of the 8 pts with p16+ oropharyngeal disease, 4 had confirmed responses (ORR 50%)”, this biomarker subgroup is not captured because it appears as an exploratory result rather than a defined study arm.
Concrete examples
Pub 48310 (Petosemtamab+pembro PD-L1+ r/m HNSCC):
- Abstract: “Of the 8 pts with p16+ oropharyngeal disease, 4 had confirmed responses (ORR 50%)”
- Extracted subgroups: “PD-L1+ HNSCC” only (the primary population)
- Missing: “p16+ oropharyngeal” subgroup with ORR = 50% (4/8)
Pub 238083 (same trial, same abstract — duplicate per Issue 17):
- Same missing “p16+ oropharyngeal” subgroup
Downstream impact
Biomarker-defined efficacy signals are lost from the Clinical Evidence report. For HNSCC specifically, p16/HPV status is a critical prognostic and predictive biomarker that clients expect to see.
What the issue is not
This is distinct from Issue 33 (cross-tabulated subgroups in basket trials). Issue 33 addressed multi-dimensional subgroup identification (tumor type × biomarker) in basket trial designs. This issue is about single-dimension biomarker subgroups in secondary analyses of non-basket trials.
Originally 2 instances found in job 1634 (HNSCC+BsAb), both from the same trial (duplicate pubs). Full-corpus LLM screening (gpt-5-mini) of ~18,500 candidate pubs identified 1,730 (9.2%) with potentially missed biomarker subgroups. Verified on pubs 549 (LAG-3 expression ORR 28% vs 7.7%) and 44673 (TP53 wild-type ORR 79%, CRc 74%, MRD neg 76%, 3yr OS 51%).
Explored solution direction
- Prompt fix (forward): Expand the extract_subgroups prompt to include: “Also identify biomarker-defined subgroups reported in secondary/exploratory analyses within the results section (e.g., ‘Of the N patients with [biomarker], ORR was X%’). These should be captured as child subgroups of the primary population even if not pre-specified as study arms.”
- Backfill: Identify publications with biomarker language in the abstract (regex patterns) where no corresponding subgroup exists. Re-run extract_subgroups with the updated prompt for those publications.
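The backfill's candidate scoping can be sketched as a word-boundary regex check. A minimal, hypothetical version (the real Thor task builds its regex dynamically from the Biomarker model and uses Postgres word boundaries; the list here is a stand-in):

```ruby
# Hypothetical candidate filter: flag a publication when a known biomarker
# appears in the abstract but no biomarker-tagged subgroup exists for it.
KNOWN_BIOMARKERS = ["p16", "HER2", "EGFR", "KRAS", "BRAF", "PD-L1"].freeze

def biomarker_backfill_candidate?(abstract, tagged_biomarkers)
  mentioned = KNOWN_BIOMARKERS.select { |b| abstract.match?(/\b#{Regexp.escape(b)}\b/i) }
  (mentioned - tagged_biomarkers).any?
end

# Pub 48310 pattern: p16 reported in a secondary analysis, no p16 subgroup
biomarker_backfill_candidate?(
  "Of the 8 pts with p16+ oropharyngeal disease, 4 had confirmed responses (ORR 50%)", []
)
# => true
```

As the screening results above show, regex scoping alone over-flags (biomarkers also appear in baseline characteristics), which is why an LLM screen follows this filter.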
Solution applied
Forward fix (2026-03-30): Added Step 2c to extract_subgroups SYSTEM_PROMPT in subgroup_extraction.rb. Explicitly instructs the LLM to capture biomarker-defined subgroups from secondary/exploratory analyses within the Results section, even when not pre-specified as study arms. Includes guard against over-extraction (only capture when numeric efficacy result is present, not for biomarkers in baseline characteristics or background).
Backfill (2026-03-30): lib/tasks/one_off/backfill_biomarker_secondary_subgroups.thor — three-phase Thor task (identify → screen → remediate). Candidate set: ~18,500 pubs with biomarker mentions in abstract (dynamic regex built from Biomarker model with Postgres word boundaries) but no biomarker-tagged trial_subgroups record. LLM screen (gpt-5-mini) checked each abstract against its existing subgroup list. Results: 18,742 screened, 1,730 flagged (9.2%). Remediation reset flagged pubs, followed by full pipeline re-run (extract_subgroups job 1682 + classify_publications job 1683). Final state: 1,718/1,730 fully reprocessed, 11 pending extraction, 1 pending classification.
Known gap (addressed 2026-03-31): The candidate query excluded any pub with existing biomarker-tagged subgroups (16,678 pubs skipped). Added screen_partial + remediate_partial commands to screen these pubs using the same LLM prompt. No regex pre-filter — LLM decides. Local validation: pub 29704 (known gap) correctly flagged; 3/10 flagged on single-biomarker pubs, 0/10 on random multi-biomarker pubs. Cross-tabulated pubs with partial biomarker coverage are separately handled by Issue 43.
Partial screening results (2026-03-31, job 1691): screen_partial completed on all 16,709 pubs with existing biomarker subgroups. Results: 1,483 flagged (8.9%), 15,226 clean, 0 remaining. Flag rate consistent with original screen (9.2%). Domain expert review of 5 random flagged pubs: 4/5 strong true positives, 1 plausible.
Prompt fixes for re-extraction quality (2026-03-31): Local testing of the full pipeline (extract_subgroups → classify_publications) on 14 flagged pubs revealed two regression patterns, both fixed with prompt changes:
- Biomarker tag loss on biomarker-defined populations (task.rb): When the overall study population is defined by a biomarker (e.g., “mIDH2 ND-AML”), the classify_publications prompt rule “overall must never combine with other tags” caused the biomarker tag to be dropped. Fixed by allowing overall to combine with biomarker and/or disease. Also added a biomarker inheritance instruction: child subgroups (e.g., “mIDH2 ND-AML → CR”) must carry the same biomarker tag and biomarkers array as their biomarker-qualified parent. Validated on pub 120034 (IDH2) and pub 119668 (ABL-class fusion) — all children now correctly tagged.
- Molecular qualifier dropped in subgroup naming (subgroup_extraction.rb): When a study population is defined by both a molecular feature and a clinical feature (e.g., “ABL-class fusion patients who responded slowly”), extract_subgroups sometimes chose only the clinical label (“Slow induction responders”), losing the biomarker context entirely. Fixed by adding a compound baseline instruction: when efficacy results are reported for a population defined by multiple qualifiers, biomarker/molecular qualifiers must never be dropped. Validated on pub 119668 — now correctly produces “ABL-class fusion patients → TKI group”.
Validation also confirmed correct behavior: Lab values (albumin, NLR, WBC, platelet count) are now properly tagged as risk_group rather than biomarker (pubs 52188, 154203). HPV-naive populations tagged as population, TAA immune responses tagged as response_status (pubs 138809, 141909). These reclassifications are correct — biomarker tag is reserved for molecular/genomic features relevant to population selection (HER2, EGFR, IDH2, etc.).
Status: Prompt fixes validated locally on 14 pubs. Pending deployment before running remediate_partial + pipeline re-run on the 1,483 flagged pubs.
Investigation notes: Concrete examples (pubs 48310, 238083) now capture the p16+ oropharyngeal subgroup with the current prompt. Regex-based scale estimation was inconclusive — cannot reliably distinguish “biomarker in baseline characteristics” from “biomarker-stratified efficacy in secondary analysis.” LLM screening required to determine true scale.
39. Multi-drug randomized trial dose cross-contamination
Short summary
In randomized trials comparing multiple investigational drugs (each in its own arm), the view shows all drugs’ doses on every arm instead of scoping each dose to its own arm. Originally thought to be an LLM extraction issue, but investigation revealed the per-drug dose_evidence extraction is correct — the contamination happens in the view layer.
Where this sits in the current pipeline
db/views/vw_publication_efficacy_data_v21.sql — the drug_interventions LEFT JOIN to arm_outcomes_expanded. When publication_interventions.study_plan_arm_id is NULL (common for publications without clinical trial linkage), the join condition di.study_plan_arm_id IS NULL creates a cross-product: every intervention joins to every arm, so each arm gets rows with all drugs’ doses.
Exact restriction causing the issue
Section titled “Exact restriction causing the issue”The drug_interventions join in raw_rows uses di.study_plan_arm_id IS NULL as a pass-through condition that matches any arm. For multi-drug publications with 3 interventions and 3 arms, this creates 9 combinations instead of 3 — each arm gets dose rows from all 3 drugs.
Concrete examples
Section titled “Concrete examples”Pub 239841 (Ivonescimab vs Cadonilimab vs Penpulimab neoadjuvant HNSCC):
- publication_interventions correctly extract per-drug doses: ivonescimab=10 mg/kg, cadonilimab=6 mg/kg, penpulimab=200 mg
- But the view shows 3 dose rows per arm (one per intervention), so the Ivonescimab arm shows 10 mg/kg, 6 mg/kg, AND 200 mg
Downstream impact
Section titled “Downstream impact”Report rows for each arm show all drugs’ doses as separate rows. Clinically misleading — ivonescimab at 6 mg/kg vs 10 mg/kg is a meaningful difference, and penpulimab’s fixed 200 mg dose is a completely different dosing paradigm than weight-based 6 mg/kg.
What the issue is not
Section titled “What the issue is not”This is NOT an LLM extraction issue. The dose_evidence_extraction pipeline correctly extracts per-drug doses on publication_interventions. The contamination is view-layer: the join creates a cross-product when study_plan_arm_id is NULL.
Related to Issue 31 (view-layer COALESCE bleed onto control arms) — same family of view join scoping issues.
7,258 publications have multiple distinct interventions. Of those, the fix only changes behavior for pubs where intervention names appear in arm names (enabling name-based scoping). Pubs with generic arm names (“Control”, “Intervention”) keep the existing cross-join behavior.
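The name-based scoping can be sketched in plain Ruby (a hypothetical illustration of the view logic, not the actual SQL): a drug's dose rows attach only to arms whose name contains the intervention name, mirroring LOWER(arm_name) LIKE '%' || LOWER(intervention_name) || '%', with a cross-join fallback for generic arm names.

```ruby
# Hypothetical sketch of the name-based arm scoping; the real fix lives in
# the v22 view SQL, not in Ruby.
def scope_doses_to_arms(arms, interventions)
  interventions.flat_map do |iv|
    matching = arms.select { |arm| arm.downcase.include?(iv[:name].downcase) }
    matching = arms if matching.empty? # fallback: keep cross-join for generic arm names
    matching.map { |arm| { arm: arm, drug: iv[:name], dose: iv[:dose] } }
  end
end

arms = ["Ivonescimab arm", "Cadonilimab arm", "Penpulimab arm"]
interventions = [
  { name: "Ivonescimab", dose: "10 mg/kg" },
  { name: "Cadonilimab", dose: "6 mg/kg" },
  { name: "Penpulimab",  dose: "200 mg" },
]
rows = scope_doses_to_arms(arms, interventions) # 3 scoped rows, not a 9-row cross-product
```

With generic arm names ("Control", "Intervention") no intervention matches, so every intervention falls back to every arm, preserving the existing behavior exactly as described above.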
Explored solution direction
Section titled “Explored solution direction”- View fix: In the drug_interventions join, for Source 0 multi-drug publications where study_plan_arm_id IS NULL, match each intervention to its arm by checking if intervention_name appears in arm_name. Falls back to cross-join when name matching is not feasible (generic arm names).
- Prompt fix (defense-in-depth): Added instruction to the classify_publications SYSTEM_PROMPT for multi-drug trials to extract arm-specific doses in subgroup dose fields.
Solution applied
Section titled “Solution applied”- View v22 (db/views/vw_publication_efficacy_data_v22.sql):
  - Added multi_drug_pubs CTE to identify publications with 2+ distinct interventions.
  - Modified the drug_interventions join: for Source 0 multi-drug pubs, requires LOWER(arm_name) LIKE '%' || LOWER(intervention_name) || '%' to scope each intervention to its matching arm.
  - Safe fallback: if an intervention doesn’t match ANY arm by name, reverts to cross-join (no data loss for pubs with generic arm names).
- Prompt fix (app/tasks/publications_llm_classification/task.rb):
  - Added “MULTI-DRUG RANDOMIZED TRIALS” instruction to the Subgroup Dose Context block.
  - Forward prevention for subgroup-level dose extraction.
40. Hierarchical subgroup rows in view lose N from flat counterparts
Section titled “40. Hierarchical subgroup rows in view lose N from flat counterparts”Short summary
Section titled “Short summary”The LLM extraction creates both flat subgroups (IHC3+) and hierarchical subgroups (RAS wild-type mCRC → Cohort A → IHC3+) as separate trial_subgroup → trial_outcome_measure → trial_arm_outcome chains. When the flat version has number_of_participants but the hierarchical copy doesn’t, ClinicalEvidenceQuery picks the hierarchical row (due to disease filtering) and reports null N.
Related to Issue 26 (parent N propagation) but distinct: Issue 26 is extraction-layer (LLM copies parent N to child subgroups). Issue 40 is post-processing-layer (hierarchical copies don’t carry forward N from their flat counterparts).
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/post_process.rb — creates both flat and hierarchical trial_arm_outcomes. The hierarchical copy’s N comes from the LLM output, which often omits it. Existing null_out_propagated_parent_n (line 565) handles the inverse case (removing incorrectly copied parent N).
Investigation findings (2026-04-01)
Section titled “Investigation findings (2026-04-01)”Most audit examples were false positives — null N is correct because the abstract doesn’t state per-subgroup N:
- Pub 134450 (MRG003 phase 1b): Abstract states N=39 for overall Phase 1b, gives per-disease ORR (SCCHN 40%, NPC 44%, CRC 0%) but never states per-disease N. Null N on Phase 1b dose expansion → CRC is accurate.
- Pub 67379 (ROME trial): Abstract states 200/200 randomized overall. hTMB/MSS exploratory analysis gives PFS and HR but never states subgroup N. Null N is accurate.
- Pub 200353 (T-DXd DESTINY-CRC02 biomarker): EGFR amplification mentioned as prognostic factor but no N given. Not even a hierarchical issue — this is a flat subgroup with legitimately unstated N.
Only pub 48926 is a real bug:
- Pub 48926 (DESTINY-CRC01 updated): IHC3+ flat has N=40, ORR=57.5. RAS wild-type mCRC → Cohort A → IHC3+ hierarchical has N=null, ORR=57.5. Same for IHC2+/ISH+ (13 vs null) and prior anti-HER2 therapy (16 vs null). The LLM extracted N for the flat version but not the hierarchical copy.
Real scope: 182 trial_arm_outcomes across 59 publications where the flat counterpart has N but the hierarchical copy doesn’t. Of ~32,874 hierarchical TAOs with null N, 32,692 have flat counterparts that also have null N (abstract doesn’t state it), and 182 have flat counterparts with N (extraction gap).
Explored solution direction
Section titled “Explored solution direction”- Post-process fix: Add a propagate_flat_n_to_hierarchical method in post_process.rb (sibling to the existing null_out_propagated_parent_n) to carry forward N from flat counterparts after all subgroups are created. Plus a backfill task for the existing 182 records.
- Prompt fix: Instruct the LLM to always carry N when creating hierarchical subgroups from data it already extracted for flat counterparts.
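The deferred post-process step could look roughly like this (a sketch over simplified hashes; the real implementation would operate on ActiveRecord records in post_process.rb):

```ruby
# Sketch of the proposed propagate_flat_n_to_hierarchical step: when a
# hierarchical subgroup's leaf label has a flat counterpart with a stated N,
# carry that N forward. Hash shapes are illustrative assumptions.
def propagate_flat_n_to_hierarchical(subgroups)
  flat_n = subgroups.reject { |s| s[:name].include?("→") }
                    .to_h { |s| [s[:name], s[:n]] }
  subgroups.map do |s|
    leaf = s[:name].split("→").last.strip
    s[:n].nil? && flat_n[leaf] ? s.merge(n: flat_n[leaf]) : s
  end
end

# Pub 48926 shape: flat IHC3+ has N=40, the hierarchical copy has N=null
subgroups = [
  { name: "IHC3+", n: 40 },
  { name: "RAS wild-type mCRC → Cohort A → IHC3+", n: nil },
]
propagate_flat_n_to_hierarchical(subgroups)
```

Hierarchical subgroups whose flat counterpart also lacks N (the 32,692-row majority above) are left untouched, since there is nothing to propagate.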
Solution applied
Section titled “Solution applied”Downscoped (2026-04-01): Investigation revealed most audit examples are false positives — null N is correct because the source abstracts don’t state per-subgroup patient counts. Real bug scope is narrow (182 TAOs / 59 pubs). Post-process propagation fix deferred as low priority.
41. Safety data cross-contamination between dose arms
Section titled “41. Safety data cross-contamination between dose arms”Short summary
Section titled “Short summary”In multi-arm dose-optimization studies, safety metrics (patient_number_safety, discontinuation rate) from one dose arm are attributed to a different dose arm. The safety extraction doesn’t scope by arm, so values from the most prominent or first-mentioned arm bleed onto sibling arms.
Related to Issue 31 (dose field bleed onto control arms via view COALESCE) but distinct: Issue 31 was view-layer dose field propagation onto control arms. Issue 41 is extraction/query-layer safety data misattribution between experimental dose arms.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/queries/tpp/clinical_evidence_query.rb — extract_safety_metrics_for_publication method. Safety data is queried by publication_id and optionally study_plan_arm_id, but when study_plan_arm_id is null (common for publication-extracted arms), safety data cannot be scoped to a specific arm.
Concrete examples
Section titled “Concrete examples”- Pub 116843 (Temab-A + Bev): Safety N=30 and discontinuation=3% attributed to the 2.0 mg/kg arm, but abstract reports these for the 2.4 mg/kg arm (n=30). The 2.0 mg/kg arm has n=26 with no discontinuation data stated.
- Pub 49900 (M9140 dose optimization): Safety N=29 attributed to 2.4 mg/kg arm, but 29 is the 2.8 mg/kg arm size. The 2.4 mg/kg arm has n=31.
3 audit issues from CRC+ADC audit (2026-03-30). Likely affects multi-arm dose-optimization studies where safety is discussed arm-by-arm in the abstract but study_plan_arm_id is null.
Explored solution direction
Section titled “Explored solution direction”- Extraction fix: When safety data is extracted per arm in the abstract, ensure arm-specific safety N and discontinuation rates are stored with correct arm attribution.
- Query fix: In extract_safety_metrics_for_publication, when multiple arms exist, attempt to match safety data to the correct arm by arm name or dose level.
Solution applied
Section titled “Solution applied”Query-layer forward fix (2026-03-30): Extracted scope_safety_results_to_arm helper used by extract_safety_metrics_for_publication and extract_ranked_named_ae_summaries in both ClinicalEvidenceQuery and EmergingClinicalDataQuery. Two-tier arm matching:
- Primary: Match by study_plan_arm_id (when present)
- Fallback: Match by arm_name using exact normalized comparison (downcase + whitespace normalization) — avoids false positives between similar dose levels (e.g. “2.0 mg/kg” vs “2.4 mg/kg”)
- Guard: When neither match succeeds and safety data contains multiple distinct arms, return empty rather than falling back to a wrong arm’s data. Single-arm or publication-level safety data (no arm differentiation) still falls through correctly.
This fixes both contamination patterns: (a) pub 49900 where study_plan_arm_id is null but arm_name distinguishes arms, and (b) pub 116843 where study_plan_arm_id exists but the requested arm has no safety data (guard prevents borrowing from a sibling arm).
No backfill needed — regenerating the clinical evidence report produces correct arm-scoped safety data.
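A rough reconstruction of that two-tier logic (data shapes are assumptions; the real helper lives in ClinicalEvidenceQuery and EmergingClinicalDataQuery):

```ruby
# Illustrative version of scope_safety_results_to_arm, not the production code.
def scope_safety_results_to_arm(safety_rows, arm_id:, arm_name:)
  norm = ->(s) { s.to_s.downcase.gsub(/\s+/, " ").strip }

  # Tier 1: exact study_plan_arm_id match when present
  by_id = safety_rows.select { |r| arm_id && r[:study_plan_arm_id] == arm_id }
  return by_id unless by_id.empty?

  # Tier 2: exact normalized arm_name match ("2.0 mg/kg" never matches "2.4 mg/kg")
  by_name = safety_rows.select { |r| norm.(r[:arm_name]) == norm.(arm_name) }
  return by_name unless by_name.empty?

  # Guard: only publication-level data (no arm differentiation at all) may fall
  # through; otherwise return nothing rather than borrow a sibling arm's data
  undiff = safety_rows.all? { |r| r[:study_plan_arm_id].nil? && norm.(r[:arm_name]).empty? }
  undiff ? safety_rows : []
end

# Pub 49900 shape: arm-named safety rows, no study_plan_arm_id linkage
rows = [
  { study_plan_arm_id: nil, arm_name: "2.4 mg/kg", safety_n: 31 },
  { study_plan_arm_id: nil, arm_name: "2.8 mg/kg", safety_n: 29 },
]
scope_safety_results_to_arm(rows, arm_id: nil, arm_name: "2.4 mg/kg")
```

The exact-match comparison in tier 2 is what prevents the original contamination: a substring or fuzzy match would let "2.4 mg/kg" data attach to the "2.0 mg/kg" arm again.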
42. Tumor shrinkage rate confused with RECIST ORR
Section titled “42. Tumor shrinkage rate confused with RECIST ORR”Short summary
Section titled “Short summary”The LLM extracts “any tumor reduction” or “tumor shrinkage rate” as ORR, when these are distinct from RECIST-defined objective response rate. Tumor shrinkage includes minor reductions (e.g. 0-20% decrease) that don’t meet RECIST PR threshold (≥30% decrease). This can dramatically inflate the reported ORR.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”- Pub 162304 (IMMU-130 phase I/II in mCRC): Abstract reports “Tumor reductions were seen in 23/66 (35%) pts, including one PR.” The LLM extracted ORR=35%, but the actual RECIST ORR is ~1.5% (1/66 PR). The 35% is any-shrinkage rate, not objective response.
1 instance found in CRC+ADC audit (2026-03-30). Scale TBD — need to investigate how many publications report non-RECIST shrinkage rates that could be confused with ORR. Likely uncommon but high-impact when it occurs (35% vs 1.5% is a massive error).
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction to classify_publications: “ORR (objective response rate) must be based on RECIST criteria (CR + PR). Do not use ‘any tumor reduction’, ‘tumor shrinkage rate’, or ‘disease control rate’ as ORR. If the abstract reports tumor shrinkage without specifying RECIST criteria, and separately reports a lower confirmed PR/CR rate, use the PR/CR rate as ORR.”
Solution applied
Section titled “Solution applied”Forward fix (2026-04-03): Added ORR definition instruction to classify_publications prompt (section 3d): “ORR must be based on RECIST criteria (CR + PR). Do not extract tumor shrinkage rate or any-tumor-reduction as ORR.” Will be picked up by Issue 49 backfill re-extraction (3,943 target-disease pubs, PROMPT_VERSION=1).
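The magnitude of the error is easy to show with a one-line derivation (a sanity-check sketch, not an existing pipeline step):

```ruby
# When an abstract states both an any-shrinkage fraction and confirmed CR/PR
# counts, RECIST ORR should be derived from CR + PR, never from the shrinkage
# rate. Method name is invented for illustration.
def recist_orr(cr:, pr:, n:)
  ((cr + pr) * 100.0 / n).round(1)
end

# Pub 162304: "Tumor reductions were seen in 23/66 (35%) pts, including one PR."
recist_orr(cr: 0, pr: 1, n: 66) # ~1.5%, not the 35% any-shrinkage rate
```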
43. Cross-tabulated subgroups only extracted for highest-response HER2 level
Section titled “43. Cross-tabulated subgroups only extracted for highest-response HER2 level”Short summary
Section titled “Short summary”Residual gap in Issue 33 (cross-tabulated subgroups). The Issue 33 backfill correctly flagged pubs and re-extracted cross-tabs, but the LLM only creates disease × biomarker cross-products for the most prominent biomarker level (typically the one with highest response rates). Lower-response or zero-response cross-tabulated subgroups are skipped.
Related to Issue 33 — same pipeline layer (extract_subgroups + classify_publications), but the fix and backfill worked for the dominant cross-tab; the LLM selectively omits cross-tabs where responses are low or absent.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/subgroup_extraction.rb — extract_subgroups creates the subgroup × biomarker cross-products. The prompt instructs creation of cross-tabs but doesn’t emphasize completeness across all biomarker levels.
Concrete examples
Section titled “Concrete examples”-
Pub 72043 (SHR-A1811 phase 1, non-breast solid tumors): Abstract table shows ORR by tumor type × HER2 status (IHC3+, IHC2+, IHC1+, mutation/amp) for each of BTC, UC, GC/GEJA, CRC, and Other. After Issue 33 backfill (needs_cross_tab_reextraction=true), only IHC3+ cross-tabs were extracted per tumor type:
- CRC → HER2 IHC3+ ✓ (ORR 100%, 3/3)
- CRC → HER2 IHC2+ ✗ (ORR 0%, 0/3) — missing
- CRC → HER2 IHC1+ ✗ (ORR 0%, 0/1) — missing
- CRC → HER2 mutation/amp ✗ (ORR 0%, 0/3) — missing

The overall HER2 subgroups exist (Non-breast STs → HER2 IHC1+/2+/3+) but disease × HER2 cross-tabs only exist for IHC3+.
3 audit issues from pub 72043 in CRC+ADC audit (2026-03-30). Likely affects other basket trials where the cross-tab table has zero-response cells — the LLM treats 0% ORR subgroups as not worth extracting. Scale TBD.
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Strengthen extract_subgroups to explicitly require all cells in a cross-tabulated table, including zero-response cells: “When a table cross-tabulates tumor type × biomarker status, create subgroups for ALL cells in the table, including those with 0% ORR or 0 responders. A zero-response subgroup is clinically meaningful data, not an absence of data.”
- Targeted re-extraction: Force re-extract specific pubs where the cross-tab is incomplete (e.g. --publication_ids=72043).
Solution applied
Section titled “Solution applied”Forward fix (2026-03-30): Two changes:
- subgroup_extraction.rb: Rewrote Step 2b cross-tab instruction to explicitly require ALL table cells including zero-response results (“0% ORR”, “0/3”). Previous wording (“Do NOT create subgroups for empty cells”) caused the LLM to treat zero-response cells as empty. New wording distinguishes zero responses (clinically meaningful) from truly empty/NE/NA cells.
- post_process.rb: Fixed Issue 8 regression — the all-zero measure_value guard now only nulls when all arms also have nil/zero number_of_participants (fabrication signal). Real 0% ORR with stated N (e.g., 0/3 → N=3) is preserved. The N=0→nil guard at line 365 ensures fabricated N values are already nil, so the check reliably distinguishes fabrications from real data.
- task.rb: Added classify instruction for zero-response extraction from cross-tabulated tables (“0/3” = 0% with N=3) and abstain-when-ambiguous for garbled table parsing. Also added a second-pass zero guard in post_process.rb after null_out_propagated_parent_n to catch fabricated zeros that initially bypassed the first-pass guard via copied parent N.
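The revised guard condition can be illustrated in isolation (simplified hashes standing in for trial_arm_outcome records; the real guard lives in post_process.rb):

```ruby
# An all-zero measure_value block is treated as a fabrication (and nulled)
# only when every arm also lacks a stated number_of_participants; a real
# 0% ORR with N=3 (from "0/3") is preserved. Method name is illustrative.
def fabricated_zeros?(arm_outcomes)
  all_zero = arm_outcomes.all? { |o| o[:measure_value].to_f.zero? }
  no_n     = arm_outcomes.all? { |o| o[:number_of_participants].to_i.zero? }
  all_zero && no_n
end

fabricated_zeros?([{ measure_value: 0, number_of_participants: 3 }])   # real 0% ORR, keep
fabricated_zeros?([{ measure_value: 0, number_of_participants: nil }]) # fabrication, null out
```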
Backfill (2026-03-30–31): backfill_cross_tabulated_subgroups.thor — three commands run in prod:
- screen_zero_response (job 1688, 2026-03-30): LLM screen (gpt-5-mini) of all 5,348 pubs with disease × biomarker cross-tab subgroups. Compared each abstract’s cross-tab structure against existing subgroups to find missing zero-response cells. Flagged with needs_zero_response_reextraction.
- rescreen_zero_response (job 1693, 2026-03-30): Tighter second-pass screen to reduce false positives. Result: 234 confirmed, 5,114 rejected.
- remediate_screened (job 1689, 2026-03-31): Reset 234 flagged pubs (destroyed trial_subgroups, cleared llm_data subgroup fields, set llm_data_processed = false).
Pipeline re-run pending: extract_subgroups → classify_publications → post_process_publications on the 234 remediated pubs.
44. PFS/OS event count extracted as number_of_participants
Section titled “44. PFS/OS event count extracted as number_of_participants”Short summary
Section titled “Short summary”In survival analysis tables that report “median (95% CI) events n/N”, the LLM extracts the event count numerator as number_of_participants instead of the denominator. The “events n/N” fraction (e.g. “23/31”) looks similar to a response fraction to the LLM, but the numerator is the number of events (deaths/progressions), not the number of patients.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”- Pub 235204 (Telisotuzumab adizutecan ctDNA biomarker): Table row reads “mPFS 5.3 (4.5, 5.9) 23/31” for SD → MR positive (methylation panel). LLM extracted N=23 (PFS events) instead of N=31 (patients in subgroup). The correct N (31) matches “MR in pts with SD: methylation panel 31/53 (58%)”.
1 instance found in CRC+ADC audit (2026-03-30). Scale TBD — this table format (“value (CI) events n/N”) is standard in survival analysis reporting across oncology publications. Likely affects many publications reporting PFS/OS with event fractions. Need to assess by searching for publications where number_of_participants on a PFS/OS endpoint is less than number_of_participants on a sibling ORR endpoint for the same subgroup.
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction: “In survival tables, when you see a format like ‘median (CI) n/N’, the N is the number of patients (use as number_of_participants) and n is the number of events (do not use as number_of_participants). Events n/N is NOT a response fraction.”
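The cell format this rule targets can be made concrete with a small parser (hypothetical; the regex and method name are illustrative only, the actual fix is prompt-level):

```ruby
# Parses survival-table cells of the form "median (95% CI) events n/N",
# showing which number belongs in number_of_participants.
SURVIVAL_CELL = /\A(?<median>[\d.]+)\s*\((?<ci>[^)]*)\)\s*(?<events>\d+)\/(?<n>\d+)\z/

def parse_survival_cell(cell)
  m = SURVIVAL_CELL.match(cell.strip) or return nil
  {
    median: m[:median].to_f,
    ci: m[:ci],
    events: m[:events].to_i,            # deaths/progressions (not patients)
    number_of_participants: m[:n].to_i, # the denominator is the subgroup N
  }
end

# Pub 235204 table cell: "5.3 (4.5, 5.9) 23/31" -> N=31 patients, 23 events
parse_survival_cell("5.3 (4.5, 5.9) 23/31")
```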
Solution applied
Section titled “Solution applied”Forward fix (2026-04-03): Added Sub-rule 6 to classify_publications prompt (section 3e): “In survival tables, n/N means n events out of N patients. Use N (denominator) as number_of_participants, NOT n (numerator).” Will be picked up by Issue 49 backfill re-extraction (3,943 target-disease pubs, PROMPT_VERSION=1).
45. Biomarker-tested denominator used as subgroup N instead of positive subset
Section titled “45. Biomarker-tested denominator used as subgroup N instead of positive subset”Short summary
Section titled “Short summary”When an abstract reports biomarker retention or status as “X/Y pts had [biomarker]”, the LLM uses Y (the tested population) as the subgroup’s number_of_participants instead of X (the biomarker-positive subset count). The subgroup is defined by having the biomarker, so its N should be X.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”- Pub 74193 (T-DM1 HERACLES-RESCUE): Abstract says “HER2 IHC 3+/amplification was retained on circulating tumor DNA in 2/3 pts.” The subgroup “HER2 retained in ctDNA” should have N=2 (the pts who retained HER2), but LLM extracted N=3 (the tested population).
1 instance found in CRC+ADC audit (2026-03-30). Issue-specific scale still TBD — biomarker retention/status reporting (“X/Y had [biomarker]”) is common in correlative analyses. Need to assess by searching for subgroups containing “retained”, “positive”, “expressing” etc. where N matches the denominator of the defining fraction rather than the numerator.
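A minimal sketch of that audit heuristic, over simplified inputs (the regex and method name are invented for illustration):

```ruby
# For a subgroup defined by a qualifying fraction "X/Y pts", the subgroup N
# should be the numerator X; flag records where the extracted N equals the
# denominator Y instead.
QUALIFYING_FRACTION = /(?<x>\d+)\/(?<y>\d+)\s+pts?/

def denominator_used_as_n?(defining_sentence, extracted_n)
  m = QUALIFYING_FRACTION.match(defining_sentence) or return false
  extracted_n == m[:y].to_i && extracted_n != m[:x].to_i
end

sentence = "HER2 IHC 3+/amplification was retained on circulating tumor DNA in 2/3 pts"
denominator_used_as_n?(sentence, 3) # pub 74193 bug shape: flagged
denominator_used_as_n?(sentence, 2) # correct N: not flagged
```

Note the second condition deliberately leaves "X/X" fractions unflagged, since numerator and denominator are then indistinguishable.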
Precursor rollout scope for the target disease slice was sized from bioloupe-db-prod on 2026-03-31:
- Full target scope (including 4116 / Solid Tumors): 123,207 subgroup rows across 45,397 publications
- Filtered query scope (excluding 4116 from querying, but leaving Publication::TARGET_DISEASE_IDS unchanged in code): 15,867 subgroup rows across 6,443 publications
- Filtered-scope breakdown: 4,194 rows matched directly via trial_subgroups.disease_id; 11,673 rows came from the trial_disease_details fallback when subgroup disease was null
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction for qualifying subset N-counting, generalized beyond biomarkers.
Solution applied
Section titled “Solution applied”Precursor (2026-03-31): population_role metadata rollout via deterministic inference + LLM fallback. Ran in production (one_off_jobs 1699, 1700). Covered ~162k subgroups deterministically, ~2.7k via LLM, 12 remained null.
Superseded by screen → remediate → re-extract approach (2026-03-31):
The deterministic inference was brittle for edge cases and population_role has no downstream consumers yet — it was a stepping stone. Replaced with a direct LLM screening approach that addresses both Issue 45 and Issue 46 together.
Forward fix:
- Removed deterministic PopulationRoleInference from post_process.rb — population_role now comes directly from LLM output (kept as free metadata).
- Added generalized qualifying-subset N-counting instruction to the classify_publications prompt in task.rb: applies to any filtered subset (biomarker, prior-therapy, analysis population, condition-present), not just biomarkers.
- Added complete endpoint extraction instruction for sibling arms (Issue 46 forward fix, also in task.rb).
Backfill: screen_subgroup_reextraction.thor — screens ~6,780 target-disease pubs with gpt-5-mini, then remediates flagged pubs for full re-extraction.
Local validation (82 pubs screened):
- ~17% flag rate (14/82), projecting ~1,100–1,200 pubs for re-extraction
- True positive rate: high — flagged pubs had genuine N-counting or missing endpoint issues
- False negative rate: ~2-3% — 1 clear miss (pub 49494: total cohort N copied to biomarker subsets), 2 borderline (abstract didn’t state fraction explicitly)
- Cost: ~$0.001/pub for screening, ~$7-9 for full 6,780 scope
Production screening (2026-03-31, one_off_job 1702):
Screened 6,443 pubs in ~1h38m. Results: 1,395 flagged (21.7% flag rate), 5,048 clean. Both concrete examples (pub 74193, pub 29700) correctly flagged. Pub 49494 (known false negative from local validation) was outside candidate scope (no subgroups).
Re-screening analysis (2026-04-01):
Manual spot-check of 6 random flagged pubs revealed ~33% true positive rate — the initial screen was too permissive. Common false positive patterns:
- Sibling endpoint asymmetry that is real in the abstract (not an extraction error)
- Single-arm studies with no structural issue
- Percentage-based endpoints where N is correctly the denominator (not the event count)
Re-extraction testing on false positives showed it is NOT safe to blindly re-run classify_publications on all 1,395 — model variance at temperature: 1 can regress correct CR/PR extraction due to conflicting prompt instructions.
Prompt restructuring (2026-04-01):
Restructured the classify_publications SYSTEM_PROMPT in task.rb to resolve 6 identified conflicts:
- “Keep endpoint associations as they are” vs “extract response components not in the input list” → unified into single coherent statement
- Garbled table guidance: “extract if confident” vs “don’t extract response components” → unified with graduated strictness
- N-counting rules scattered across 3 locations → consolidated in section 3e
- null vs 0 duplicated → single rule
- “All Arms” usage duplicated → merged in section 3c
- “Derive ORR from counts” vs “don’t fabricate components from composites” → clarified as one-way only
Also expanded response component extraction (section 3d) to cover SD, PD, VGPR, sCR alongside CR/PR, with clear 3-scenario framework (components only, composite + components, composite only). Updated details.rb schema enum to match.
Prompt reduced from 34,281 → 28,154 chars (18% shorter). Emphasis markers reduced from 14 (7x IMPORTANT, 5x CRITICAL, 2x MANDATORY) to 5 RULE: prefixes.
Validation: 10 pubs re-extracted (5 flagged, 5 clean) — 5 improved, 5 stable, 0 regressions.
Structured re-screen (2026-04-01):
Added rescreen command to screen_subgroup_reextraction.thor — a second-pass screen on flagged pubs that requires concrete evidence (specific subgroup, expected vs current value, abstract quote). Stores structured evidence in llm_data['rescreen_evidence'].
Validated on 5 known pubs: correctly confirmed 2 true positives (pub 163764: wrong Ns, pub 221507: missing endpoints) and rejected 3 false positives (pubs 43226, 59227, 136409). Cost: ~$0.002/pub.
Verification (2026-04-02)
Section titled “Verification (2026-04-02)”Status: RESOLVED. Full pipeline re-run completed (one_off_jobs 1714→1716→1717→1722, 2026-04-01). ~2,500 publications re-extracted.
Original example confirmed fixed: Pub 74193 (HERACLES-RESCUE) — ctDNA subgroup now has N=2 (the biomarker-positive subset), not N=3 (the tested denominator). Response breakdown (2 PD out of 2) correct. Solid biopsy subgroup also correct at N=5 (5/5 retained, 1 PR + 1 SD + 3 PD).
Additional spot-checks:
- Pub 48455 (Pembro+Trastuzumab EG cancer): Abstract says “12 of 16 had a decline in their maxVAF” → N correctly extracted as 12 (positive subset), not 16 (tested). All values match (PFS 14.7 vs 5.9, OS 29.7 vs 7.71, milestone PFS 75% vs 0%).
- Pub 56725 (ANV419, 6 dose arms): All dose levels have symmetric Ki-67 CD8/NK/Treg extraction with correct per-cohort Ns.
- Pub 234727 (DCF vs FLOT esophageal): Both arms have all 4 endpoints symmetrically with correct Ns.
- Pub 101692, 152836: Null Ns appropriate where abstract doesn’t give per-subset denominators.
No regressions found across 10+ randomly sampled publications.
Production rollout commands
# 1. Screen all target-disease pubs (DONE — one_off_job 1702)
# bundle exec thor one_off:screen_subgroup_reextraction:screen --batched --parallelism 4 --batch_size 2000

# 2. Re-screen flagged pubs with structured evidence
bundle exec thor one_off:screen_subgroup_reextraction:rescreen --batched --parallelism 4 --batch_size 2000

# 3. Check re-screening results
bundle exec thor one_off:screen_subgroup_reextraction:rescreen_stats

# 4. Remediate only confirmed true positives (--confirmed_only flag TBD)
bundle exec thor one_off:screen_subgroup_reextraction:remediate --dry_run
bundle exec thor one_off:screen_subgroup_reextraction:remediate

# 5. Re-run pipeline on remediated pub IDs
# extract_subgroups → classify_publications → post_process_publications

46. Incomplete endpoint extraction across sibling dose arms
Section titled “46. Incomplete endpoint extraction across sibling dose arms”Short summary
Section titled “Short summary”When a publication reports the same endpoint (e.g. DoR, PFS) across multiple dose arms in the same table, the LLM extracts the endpoint for some arms but skips others. This appears biased toward higher-response or first-listed arms.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”- Pub 29700 (ABBV-400 phase 1 CRC): Abstract table shows DoR for three dose arms: 1.6 mg/kg (no responses, no DoR), 2.4 mg/kg (DoR 4.1 mo), 3.0 mg/kg (DoR 5.5 mo). LLM extracted DoR for 2.4 mg/kg but skipped 3.0 mg/kg. Both values are in the same table row.
1 instance found in CRC+ADC audit (2026-03-30). Scale TBD — multi-arm dose-escalation/expansion studies with per-arm efficacy tables are common, especially in phase 1 ADC trials. Need to assess by searching for publications with multiple dose arms where some arms have an endpoint and sibling arms are missing it.
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction for complete endpoint extraction across all sibling arms.
Solution applied
Section titled “Solution applied”Combined with Issue 45 (2026-03-31) — both issues are addressed by the same screen → remediate → re-extract approach.
Forward fix: Added instruction to classify_publications prompt in task.rb: “When extracting efficacy from a table or listing with multiple dose arms or treatment cohorts, extract ALL endpoints for ALL arms listed. Do not skip arms with lower response rates, fewer patients, or that appear later in the table.”
Backfill: screen_subgroup_reextraction.thor screens for both Issue 45 (wrong N on qualifying subsets) and Issue 46 (missing sibling endpoints) in a single pass. Initial screen completed (one_off_job 1702, 1,395 flagged). Structured re-screen added to filter false positives before remediation. See Issue 45 for full production rollout details and commands.
Verification (2026-04-02)
Section titled “Verification (2026-04-02)”Status: RESOLVED. Full pipeline re-run completed (one_off_jobs 1714→1716→1717→1722, 2026-04-01). ~2,500 publications re-extracted.
Pub 29700 verified: DoR now extracted for all three dose arms — 2.4 mg/kg = 4.1 mo, 3.0 mg/kg = 5.5 mo, correctly absent for 1.6 mg/kg (no responses). All values match abstract table. ORR, CBR, PFS, OS also symmetric across all three arms with correct Ns (32, 40, 41).
Additional spot-checks on multi-arm dose-escalation pubs (56725, 234727) confirmed symmetric endpoint extraction across all sibling arms. No regressions found.
47. Hazard ratios and p-values not captured for subgroup comparisons
Section titled “47. Hazard ratios and p-values not captured for subgroup comparisons”Short summary
Section titled “Short summary”When abstracts report per-subgroup hazard ratios, confidence intervals, and p-values — either inline or in tables — these values are not captured into the hazard_ratio and p_value columns on trial_arm_outcomes or trial_outcome_measures, even though those columns exist in the schema. Median values are correctly extracted; only the statistical comparison measures are lost.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”-
Pub 53685 (PROpel — Olaparib + Abiraterone mCRPC): Abstract reports per-gene HRs for 9 HRR mutations: “rPFS: BRCA2, HR 0.20 (0.08–0.44); ATM, HR 0.55 (0.20–1.38); CDK12, HR 0.51 (0.20–1.18). OS: BRCA2, HR 0.20 (0.07–0.48)…” Median rPFS and OS are correctly extracted for both arms across all 9 genes. All
hazard_ratio fields are null.
Pub 158388 (NPM1-mutated AML with venetoclax): Abstract reports per-mutation HRs from regression: “IDH1/2 HR: 0.62 (0.42–0.89, p=0.011); FLT3-ITD HR: 1.42 (0.99–2.04, p=0.055); TET2 HR: 1.76 (1.24–2.50, p=0.002).” Subgroups for these mutations exist but have empty values and null Ns — the abstract gives only HRs, not median values per mutation individually. The NPM1-low vs NPM1-high comparison (which does have medians) was correctly extracted with CR/CRi, MRD-neg, and OS.
-
Pub 51636 (PRIME — Panitumumab KRAS/NRAS/BRAF): Abstract table includes HR and p-value columns alongside median OS and PFS for each RAS/BRAF subgroup (e.g., WT RAS: OS HR 0.78 p=0.04, PFS HR 0.72 p<0.01). Medians extracted correctly. HRs and p-values not captured.
Widespread. Abstracts routinely report HRs for subgroup comparisons — especially in randomized trials (treatment effect HRs) and in mutation/biomarker analyses (prognostic HRs from Cox regression). This affects both structured tables (Pub 53685, 51636) and inline text (Pub 158388). The schema already supports it (trial_arm_outcomes.hazard_ratio, trial_arm_outcomes.p_value, trial_outcome_measures.hazard_ratio, trial_outcome_measures.p_value) — the LLM extraction simply doesn’t populate these fields.
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction to extract HR, CI, and p-value when reported alongside efficacy endpoints, and map them to the existing schema columns.
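The inline HR patterns quoted above can be sketched as a parser (regex and field names are illustrative assumptions; the actual fix would be prompt-level, writing into the hazard_ratio and p_value columns named above):

```ruby
# Matches inline HR reporting such as "IDH1/2 HR: 0.62 (0.42-0.89, p=0.011)"
# or "BRCA2, HR 0.20 (0.08-0.44)". The character class accepts both the
# en-dash and hyphen seen in abstracts.
HR_PATTERN = /HR:?\s*(?<hr>[\d.]+)\s*\((?<lo>[\d.]+)[–-](?<hi>[\d.]+)(?:,\s*p\s*=\s*(?<p>[\d.]+))?\)/

def parse_hazard_ratio(text)
  m = HR_PATTERN.match(text) or return nil
  { hazard_ratio: m[:hr].to_f, ci_lower: m[:lo].to_f, ci_upper: m[:hi].to_f,
    p_value: m[:p]&.to_f }
end

parse_hazard_ratio("IDH1/2 HR: 0.62 (0.42–0.89, p=0.011)")
parse_hazard_ratio("BRCA2, HR 0.20 (0.08–0.44)") # no p-value reported, p_value stays nil
```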
Solution applied
Not yet addressed.
48. Milestone endpoint value missing for sibling arm in randomized trials
Short summary
When an abstract reports a milestone rate (e.g., a 6-month PFS rate) for both arms of a randomized trial — often in the same sentence — the LLM sometimes captures the value for one arm but not the other. Median survival values for the same endpoints are typically extracted correctly for both arms.
Where this sits in the current pipeline
`app/tasks/publications_llm_classification/task.rb` — `classify_publications` efficacy extraction.
Concrete examples
- Pub 43144 (Cetuximab maintenance, TIME trial): Abstract says “The 6-month progression-free rate was 38.8% (26 of 67; 95% CI, 27.1%-51.5%) in the cetuximab group and 5.6% (4 of 72; 95% CI, 1.5%-13.6%) in the observation group.” Milestone PFS was extracted for the cetuximab arm (38.8%, N=67) but not for the observation arm (5.6% not captured). Median PFS and OS were correctly extracted for both arms (5.3 vs 2.0, 24.8 vs 19.7).
Appears rare. Systematic query across re-extracted batch (2026-04-02) found ~10 publications where milestone endpoints existed for fewer arms than other endpoints in the same subgroup. Most were single-arm studies where “All Arms” is expected. Only Pub 43144 was a confirmed case of a multi-arm trial with asymmetric milestone extraction. Low priority.
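The asymmetry check behind that query can be sketched in plain Ruby, with in-memory rows standing in for `trial_arm_outcomes` (the `:arm`/`:milestone` keys are illustrative, not real columns): a subgroup is flagged when milestone outcomes cover fewer arms than its other endpoints.

```ruby
# Sketch: flag a subgroup whose milestone endpoints cover fewer arms
# than its other endpoints (the Pub 43144 pattern).
def asymmetric_milestone?(outcomes)
  by_type = outcomes.group_by { |o| o[:milestone] ? :milestone : :other }
  arms_of = ->(list) { (list || []).map { |o| o[:arm] }.uniq }
  milestone_arms = arms_of.(by_type[:milestone])
  other_arms     = arms_of.(by_type[:other])
  # Milestones exist, but some arm with other endpoints lacks one.
  !milestone_arms.empty? && (other_arms - milestone_arms).any?
end

outcomes = [
  { arm: "Cetuximab",   milestone: true  },  # 6-mo PFS captured
  { arm: "Cetuximab",   milestone: false },  # median PFS
  { arm: "Observation", milestone: false }   # median PFS only — 5.6% rate missed
]
asymmetric_milestone?(outcomes) # => true
```

Single-arm studies never trigger the flag, which matches the observation that most hits in the query were expected “All Arms” cases.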
Explored solution direction
- Prompt fix (forward): Could extend the existing “extract ALL endpoints for ALL arms” instruction to explicitly mention milestone/landmark rates.
Solution applied
Not yet addressed.
49. Arm name mismatch between extract_interventions and classify_publications
Short summary
Two independent LLM pipeline steps name the same arm differently — `extract_interventions` (step 5) might call an arm “Control group” while `classify_publications` (step 10) calls it “Control”. This prevents `trial_arm_outcomes` from linking to `trial_arms` by name, since `trial_arms` are created from step 5 output and `trial_arm_outcomes` are created from step 10 output.
Where this sits in the current pipeline
- `app/tasks/publications_llm_classification/intervention_extraction.rb` — extracts arm names into `llm_data['intervention_arms']`, which become `trial_arms`
- `app/tasks/publications_llm_classification/task.rb` — extracts arms into `llm_data['subgroup_outcome_measures'][].arms[]`, which become `trial_arm_outcomes`
- `app/tasks/publications_llm_classification/post_process.rb` — links `trial_arm_outcomes` to `trial_arms` by case-insensitive name match
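Why the mismatch breaks linkage is easiest to see in the name-match step itself. A simplified in-memory sketch (not the actual `post_process.rb` code): the link is exact case-insensitive equality, so “Control group” vs “Control” can never match.

```ruby
# Sketch of the legacy link: exact downcased equality between the
# outcome arm name and a trial_arm name. Close variants never link.
def link_by_name(outcome_arm_name, trial_arms)
  trial_arms.find { |arm| arm[:name].casecmp?(outcome_arm_name) }
end

trial_arms = [
  { id: 1, name: "Control" },
  { id: 2, name: "Olaparib + Abiraterone" }
]

link_by_name("control", trial_arms)        # matches id 1 (case-insensitive)
link_by_name("Control group", trial_arms)  # nil — the Issue 49 failure mode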
Concrete examples
Top unlinked arm names after backfill (local dev, 2026-04-02):
- “Control group” (883) vs “Control” in trial_arms
- “Intervention group” (419) vs “Intervention”
- “Experimental group” (481) vs treatment-specific names
- “Arm A” / “Arm B” (466/382) vs descriptive arm names
~85k `trial_arm_outcomes` (18% of 470k) remained unlinked after the initial backfill of `trial_arms`. This includes “All Arms” pooled outcomes (resolved separately by creating “All Arms” `trial_arm` entries) and name mismatches from the two-step naming inconsistency.
Downstream impact
Unlinked `trial_arm_outcomes` (`trial_arm_id = NULL`) cannot be joined to `trial_arm_interventions` for drug/dose attribution via the new FK path. They fall back to the legacy name-matching view logic.
Explored solution direction
- Forward fix (applied): Pass `extracted_arm_names` from `llm_data['intervention_arms']` into the `classify_publications` prompt, instructing the LLM to reuse those exact names instead of inventing new ones.
- Backfill: Affected publications need reprocessing through `classify_publications` with the new prompt to align arm names. Alternatively, fuzzy name matching in `post_process.rb` could catch common variants (but is fragile).
Solution applied
Resolved by design (2026-04-02): The name-matching approach was replaced entirely with ID-based linking.
- `extract_interventions` creates `trial_arms` with database IDs
- `classify_publications` receives those IDs in the prompt and assigns them to each outcome arm
- `post_process` reads `arm_data['id']` as `trial_arm_id` directly — no name matching at all
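Under the new design, materialization reduces to reading the ID the LLM echoed back. A sketch with an assumed payload shape (illustrative, not the exact `llm_data` structure):

```ruby
# Sketch: every outcome arm carries the trial_arm id it was given in
# the prompt, so the FK is assigned directly — no name matching.
def materialize_outcomes(subgroup)
  subgroup["arms"].map do |arm_data|
    {
      trial_arm_id:  arm_data["id"],            # ID echoed back by the LLM
      measure_value: arm_data["measure_value"]  # may be nil, never a 0-sentinel
    }
  end
end

rows = materialize_outcomes(
  "arms" => [
    { "id" => 11, "measure_value" => 38.8 },
    { "id" => 12, "measure_value" => 5.6 }
  ]
)
# every row carries a concrete trial_arm_id — 100% linkage by construction
```

This is why the prompt must forbid empty IDs: a missing `"id"` is the only remaining way to produce an unlinked outcome.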
Additional fixes:
- `extract_interventions` prompt updated: separate arms per dose cohort (different patients = different arms)
- `classify_publications` prompt updated: use provided arm IDs, never leave id empty (including “All Arms”)
- `TrialArmMaterializer` always creates an “All Arms” entry so the LLM has an ID for pooled results
- `study_plan_arms` no longer sent to `classify_publications` — `trial_arms` are the source of truth
Tested on 3 publications (190656, 54137, 91482) — all achieved 100% trial_arm_id linkage.
Backfill plan (2026-04-03):
Scoped to target disease pubs (`TARGET_DISEASE_IDS` minus 4116) with `trial_arm_interventions` and unlinked outcomes: 3,943 publications, ~26.7k unlinked outcomes.
Approach: full pipeline re-run, not name-matching heuristics. A reset task (`one_off:reset_classify_publications`) deletes `trial_arms`/interventions, clears the `llm_data` keys for all three extraction steps, and resets flags. After reset, re-run:
- `extract_interventions` — new arms with proper dose splitting + IDs
- `link_publication_drugs` — drug entity matching
- `extract_subgroups` — re-extract with new arm names (arm names influence whether dose cohorts become arms vs subgroups)
- `classify_publications` — outcomes with ID-based arm linking
- `post_process_publications` — materialize everything
`extract_subgroups` must re-run because it reads `intervention_arms` to decide arm-vs-subgroup boundaries (see `subgroup_extraction.rb` lines 47-51).
The `tag_investigational_interventions` step was removed from the pipeline — `intervention_role` (including the supportive role) is now set directly by `extract_interventions` + `TrialArmMaterializer`.
Prompt versioning was added to all three steps (`intervention_extraction_version`, `subgroup_extraction_version`, `classify_publications_version` in `llm_data`) to track which pubs have been processed with the new prompts.
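With the version stamps in place, the backfill scope becomes a simple comparison. A sketch, assuming a hypothetical current-version constant (the real constants live in the respective step modules):

```ruby
# Sketch: a pub needs reprocessing when its stamped prompt version is
# older than the current one; an unstamped pub counts as version 0.
CLASSIFY_PUBLICATIONS_VERSION = 2 # hypothetical current version

def needs_reprocessing?(llm_data)
  (llm_data["classify_publications_version"] || 0) < CLASSIFY_PUBLICATIONS_VERSION
end

needs_reprocessing?({})                                       # => true (never stamped)
needs_reprocessing?({ "classify_publications_version" => 2 }) # => false
```

The `|| 0` default is what makes pre-versioning pubs automatically eligible without a separate backfill flag.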
Tested on 10 random pubs from the scope — 100% arm linking, 0 errors. Extraction quality equal or better than production (richer response components, correct null handling, proper arm separation). Estimated cost: ~$0.045/pub ($178 total for 3,943 pubs).
50. DrugLinker false-matches non-drug interventions to drugs
Short summary
DrugLinker’s last-resort `SimpleCandidateMatchingService` (an LLM-based fuzzy matcher) matches non-pharmacological interventions to drug records — for example, “Classical music” → Orca-T (a cell therapy); “Noise-canceling headphones” could equally match an arbitrary drug. The candidate service has no guard against non-drug intervention types.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/drug_linker.rb—match_via_candidate_service(line 64)SimpleCandidateMatchingService— LLM-based candidate matching, used as fallback when NcitConcept and Drug.flexifind both fail
Concrete examples
- Pub 129 (music during MRI biopsy): “Classical music” intervention matched to drug Orca-T (id=13666, cell therapy). No synonym overlap — pure hallucination from the candidate service.
- 1,409 procedure-type interventions have `drug_id` set
- 140 device-type interventions have `drug_id` set
- 1,544 other-type interventions have `drug_id` set
- Total: ~3,093 likely false matches on non-drug intervention types
- 8,843 total interventions matched via candidate service (drug_id set, ncit_concept_id null) — some of these are legitimate drug matches, but the non-drug types above are almost certainly false positives
Downstream impact
False drug attribution in the efficacy view — non-drug interventions appear as if they’re associated with specific drugs, polluting drug-level clinical evidence reports.
Explored solution direction
Section titled “Explored solution direction”- Guard by intervention_type: Skip
match_via_candidate_servicewhenintervention_typeis in%w[procedure behavioral device dietary other radiation]. Only attempt LLM-based matching fordrugandbiologicaltypes. - Backfill cleanup: Null out
drug_idontrial_arm_interventions/publication_interventionswhereintervention_typeis non-drug and match came from candidate service (no ncit_concept_id).
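The guard itself reduces to an allowlist check before the fallback matcher runs. A sketch (the method name is hypothetical; the real guard would sit in `drug_linker.rb`):

```ruby
# Sketch: only drug-like intervention types may reach the LLM fuzzy
# matcher; everything else short-circuits before candidate matching.
DRUG_LIKE_TYPES = %w[drug biological].freeze

def candidate_match_allowed?(intervention_type)
  DRUG_LIKE_TYPES.include?(intervention_type)
end

candidate_match_allowed?("drug")      # => true
candidate_match_allowed?("procedure") # => false — “Classical music” never reaches the LLM
```

An allowlist is safer than a denylist here: a new intervention type added later defaults to "no LLM matching" rather than silently becoming a false-match source.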
Solution applied
Forward fix (2026-04-04): Two-part approach — domain-specific prompt + TermMatch caching.
- `DrugMatchingService` (`app/services/drug_matching_service.rb`): New wrapper around `SimpleCandidateMatchingService` with a drug-specific prompt that explicitly instructs the LLM to reject non-pharmacological interventions (procedures, devices, behavioral, imaging, dietary, radiation, observation). Replaces the generic “best match” prompt that caused false matches.
- `SimpleCandidateMatchingService` caching: Added `cache: true` + `strategy:` params. When enabled, checks `TermMatch` before the LLM call and persists results after. `DrugMatchingService` uses this with `strategy: "DrugMatching"` — repeated terms hit the cache instead of making LLM calls.
- `DrugLinker` updated: Now uses `DrugMatchingService` instead of raw `SimpleCandidateMatchingService`.
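The cache-through behavior can be illustrated with a Hash standing in for the persisted `TermMatch` table (class and method names here are illustrative, not the production API):

```ruby
# Illustrative cache-through matcher: a Hash keyed by [strategy, term]
# stands in for TermMatch; the LLM call is a stub.
class CachedMatcher
  attr_reader :llm_calls

  def initialize(strategy:, cache: {})
    @strategy  = strategy
    @cache     = cache
    @llm_calls = 0
  end

  def match(term)
    key = [@strategy, term.downcase] # canonical form dedups "Pembrolizumab"/"pembrolizumab"
    @cache.fetch(key) do
      @llm_calls += 1
      @cache[key] = fake_llm_match(term) # stand-in for the real LLM call
    end
  end

  private

  def fake_llm_match(term)
    "drug_id_for_#{term.downcase}"
  end
end

m = CachedMatcher.new(strategy: "DrugMatching")
m.match("Pembrolizumab")
m.match("pembrolizumab") # cache hit — no second LLM call
```

Keying the cache on `strategy` as well as the term keeps DrugMatching results separate from any other consumer of the shared candidate service.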
Root cause: The old code (pre-March 2026) used Elasticsearch fuzzy search (`edit_distance: 2`) + `min_confidence: 0.5` to find candidates, then the LLM in `match_mode: :best` picked from garbage candidates. The current code (pg_trgm) wouldn’t reproduce the “Classical music” → Orca-T case, but still reproduced others, like “PET/CT” → radiopharmaceutical drugs, because some drug synonyms contain procedure terms.
Backfill cleanup: `lib/tasks/one_off/cleanup_false_drug_matches.thor` — nulls `drug_id` on `trial_arm_interventions` (1,707) and `publication_interventions` (1,274) where `intervention_type` is non-drug and `ncit_concept_id IS NULL`. Production run pending.
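The cleanup predicate is small enough to state directly. A sketch with in-memory rows (the thor task would express the same condition as a SQL scope): a false match is a non-drug intervention whose `drug_id` came from the candidate service, i.e. `drug_id` present but `ncit_concept_id` absent.

```ruby
# Sketch of the cleanup predicate: candidate-service matches on
# non-drug intervention types are the rows whose drug_id gets nulled.
NON_DRUG_TYPES = %w[procedure behavioral device dietary other radiation].freeze

def false_drug_match?(row)
  NON_DRUG_TYPES.include?(row[:intervention_type]) &&
    !row[:drug_id].nil? && row[:ncit_concept_id].nil?
end

rows = [
  { intervention_type: "procedure", drug_id: 13666, ncit_concept_id: nil }, # "Classical music" → Orca-T
  { intervention_type: "drug",      drug_id: 42,    ncit_concept_id: nil }, # legitimate candidate match
  { intervention_type: "device",    drug_id: 7,     ncit_concept_id: 99 }   # NCIt-backed — keep
]
rows.count { |r| false_drug_match?(r) } # => 1
```

Requiring `ncit_concept_id` to be nil is what protects NCIt-backed matches from the cleanup, per the scope described above.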
Validated: “Classical music”, “PET/CT”, “no treatment/observation” → nil. “Pembrolizumab”, “Nivolumab”, “Trastuzumab deruxtecan”, “Keytruda” → correct matches. Caching works (second call hits TermMatch, no LLM). Canonical dedup works (“pembrolizumab” = “Pembrolizumab”).
51. Per-arm dose not populated on backfilled trial_arm_interventions
Short summary
The `trial_arms` backfill (2026-04-02) created `trial_arm_interventions` for ~23.5k publications by copying data from existing `publication_interventions`. `drug_id` and `intervention_role` were copied correctly, but dose fields were copied from the old study-level `dose_evidence` — not per-arm dose. For multi-dose-arm studies, every arm’s intervention has the same study-level dose range instead of its arm-specific dose.
The ~43.5k publications without trial_arm_interventions (created from arm outcomes only) are intentionally out of scope — those pubs were never processed through extract_interventions and are outside the target disease scope. They retain drug attribution via the legacy registry fallback (Sources 1a-1c in the view).
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”extract_dose_evidence(step 9) populates per-arm structured dose fields ontrial_arm_interventions- The backfill copied
publication_interventions.dose_evidence(study-level) totrial_arm_interventionsdose fields — this is a flat copy, not per-arm extraction
~23.5k publications with trial_arm_interventions that have study-level dose copied from PIs. Multi-dose-arm pubs within this set have incorrect dose attribution (same range on every arm).
Downstream impact
For single-dose studies: no impact (study-level dose = arm-level dose). For multi-dose-arm studies: each arm shows the full dose range instead of its specific dose. Same problem that originally motivated the `trial_arms` migration (Issue 49, pub 190656).
Explored solution direction
The `extract_dose_evidence` task has been updated to scope to `trial_arm_interventions`, but backfilled records already have `dose_evidence` populated (copied from PIs). The task’s scope filter (`dose_evidence IS NULL OR version < DOSE_EVIDENCE_VERSION`) skips them because they carry version 1 data.
Option A: Bump DOSE_EVIDENCE_VERSION — Change constant from 1 to 2 in dose_evidence_extraction.rb. The scope already checks (dose_evidence->>'version')::int < DOSE_EVIDENCE_VERSION, so all backfilled records (version 1) become eligible for re-extraction. New extractions get version 2. One-line change, no data cleanup needed. Downside: re-extracts ALL 23.5k pubs including single-dose studies where the study-level dose is already correct.
Option B: Null out structured columns, keep JSONB — Set dose_min, dose_max, single_dose, rp2d, dose_units, dose_frequency, dose_context_type to NULL on all backfilled trial_arm_interventions, but keep dose_evidence JSONB for audit trail. Then update the extract_dose_evidence scope to check structured columns instead of JSONB presence. More surgical — only re-extracts records with missing structured dose. Downside: requires scope change and a cleanup migration.
Option C: Null out dose_evidence entirely — Set dose_evidence = NULL on all backfilled trial_arm_interventions. Then run extract_dose_evidence as-is (it scopes on dose_evidence IS NULL). Simplest, but loses the study-level audit trail. For single-dose studies the data was correct and will just be re-extracted to the same values.
Option D: Targeted backfill for multi-dose only — Only null out dose on trial_arm_interventions where the publication has multiple arms with the same dose range (indicator of study-level copy). Leaves single-dose pubs alone. Most efficient LLM cost but requires a scoping query to identify affected records.
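Option A's eligibility check mirrors the SQL scope `dose_evidence IS NULL OR (dose_evidence->>'version')::int < DOSE_EVIDENCE_VERSION` in plain Ruby (a sketch of the predicate, not the actual task code):

```ruby
# Sketch: after the Option A bump, every version-1 (backfilled) record
# and every record with no dose_evidence at all becomes eligible.
DOSE_EVIDENCE_VERSION = 2 # was 1 before the bump

def dose_reextraction_needed?(dose_evidence)
  dose_evidence.nil? || dose_evidence["version"].to_i < DOSE_EVIDENCE_VERSION
end

dose_reextraction_needed?(nil)                # => true  (never extracted)
dose_reextraction_needed?({ "version" => 1 }) # => true  (backfilled study-level copy)
dose_reextraction_needed?({ "version" => 2 }) # => false (fresh per-arm extraction)
```

This shows why Option A is a one-line change: the comparison already exists in the scope, so bumping the constant flips eligibility for all version-1 rows at once.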
Solution applied
Option A: Bump `DOSE_EVIDENCE_VERSION` to 2 + prompt refinement for per-arm `dose_context_type`.
Two changes in `dose_evidence_extraction.rb`:
- Version bump (line 5): `DOSE_EVIDENCE_VERSION = 1 → 2`. All backfilled records (version 1) become eligible for re-extraction. The existing scope (`version < DOSE_EVIDENCE_VERSION`) handles this automatically.
- Prompt fix (SYSTEM_PROMPT): Replaced the study-level instruction “Set dose_context_type to rp2d when an RP2D is identified” with arm-level guidance — `rp2d` only for the arm that IS the RP2D/DRDE/MTD, `escalation` for other dose-finding cohorts, `fixed` for predetermined doses, etc.
Validated on 9 publications (mix of dose-escalation, fixed-dose combos, randomized multi-arm trials):
- Per-arm `single_dose` correct on all 9 pubs (previously every arm got the study-level range)
- `dose_context_type` correctly differentiates escalation vs rp2d vs fixed vs weight_based
- RP2D field only set on the arm that IS the RP2D, not on all arms in the study
- Cost: ~$0.004/pub → ~$103 projected for the full 23.5k backfill
- Only touches dose fields on existing `trial_arm_interventions` — no drug matching, arm creation, or linking changes
Production run: Bump the version, deploy, then run `extract_dose_evidence` (no flags needed — the scope picks up all version 1 records automatically).