# Publication Issues Tracker

Temporary working document for tracking publication-processing issues identified during investigation.
The main motivation for this doc is sheet 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8, in which the client has collected clinical data for different disease areas and drugs. The purpose of this document is to identify gaps in the publications database that prevent us from correctly reconstructing that sheet in the future using structured data only (from the bioloupe data lake database).
Last updated: 2026-04-03 (Issue 49: backfill plan, prompt versioning, investigational tagger removal, Issues 42/44 prompt fixes)
## Issue index

| # | Title | Short description | Status |
|---|---|---|---|
| 8 | Zero-sentinel contamination (residual) | LLM outputs 0 instead of null for unstated efficacy values (N, ORR). Full-corpus: 55k N=0 arm outcomes (16k auto-sentinels), 20k measure_value=0. Root cause: schema lacked nullable: true | Complete — forward fix + backfill applied 2026-03-29. Guard regression fixed 2026-03-30 (Issue 43) |
| 26 | Parent population N propagated to child subgroups (residual) | classify_publications copies the parent subgroup’s number_of_participants to child subgroups instead of extracting the subset-specific N. Original fix addressed bulk but residuals remain (e.g. pub 200353 MR subgroups) | Incomplete — residuals |
| 17 | ASCO abstract + presentation copies create duplicate publication rows | ASCO ingestion saves AbstractContentItem and PresentationContentItem separately by source_id, so the same DOI can appear twice in the report | Investigation complete |
| 18 | PubMed-indexed journal article missing from publication corpus | The sqNSCLC worksheet row for Cofetuzumab now points to 10.1016/j.lungcan.2025.108492, but that article is absent from publications, so the row is still missing despite a valid journal source | Implementation complete — 2025 PubMed backfill pending |
| 27 | extract_efficacy_metrics picks confirmed ORR as plain ORR | When both confirmed and unconfirmed ORR rows exist with the same N, max_by(number_of_participants) picks the confirmed row for the plain ORR metric — making ORR and cORR identical and the ORR value wrong | Complete — applied 2026-03-26 |
| 28 | build_result_rows collapses dose-level arms when study_plan_arm_id is null | Grouping key uses study_plan_arm_id which is null for publication-extracted arms — distinct dose cohorts (e.g. “8.0 mg/kg” vs “10.0 mg/kg”) sharing the same subgroup collapse into one row, silently dropping the lower-N arm | Complete — applied 2026-03-29 |
| 29 | Dose extraction captures study-level range, not efficacy population range | In dose-escalation studies, LLM extracts the full dose range (e.g. 1.0–8.3 mg/kg) even when efficacy is reported only for a subset (e.g. ≥4.0 mg/kg) — dose_min on the efficacy row is too low | Complete — forward fix + backfill applied in prod |
| 30 | Cross-study data contamination from abstract background sections | LLM extracts efficacy values from a referenced prior study cited in the abstract’s background, attributing them to the current publication which has no efficacy data yet | Complete — full pipeline (triage, validate, remediate, retriage, prune, reset_stale) applied 2026-03-30 |
| 31 | Investigational drug dose data bleeds onto control/comparator arms | pub_dose_lookup COALESCE fallback propagates investigational drug dose fields to control arms when publication_interventions.study_plan_arm_id is NULL — 2,890 rows across 566 publications | Complete — applied 2026-03-29 |
| 32 | TTP (time to progression) misclassified as PFS | LLM extraction maps TTP values to PFS endpoint — 241 publications mention TTP in abstract but have PFS extracted without TTP; additionally SD-subpopulation TTP values get attributed to full cohort. Query-layer TTP→PFS fallback also remapped correctly-extracted TTP back to PFS. | Complete — extraction fix 2026-03-28, query fix 2026-03-30 |
| 33 | Cross-tabulated subgroups not identified in basket trials | extract_subgroups identifies single-dimension subgroups (tumor type OR biomarker) but not the cross-product (tumor type × biomarker) when tabular data is present — 262 confirmed pubs (from 6,081 candidates → 934 pass 1 → 262 pass 2) | Complete — applied in prod 2026-03-28 |
| 34 | "Immature" endpoints extracted as "Not Reached" | LLM maps "not yet mature" / "data immature" to "Not Reached" — but immature means no median can be estimated (should be null), while "Not Reached" means median exceeds follow-up. ~71 pubs have immature language without "not reached" but have "Not Reached" extracted | Investigation complete |
| 35 | Dose extraction confuses PK thresholds, imaging agent doses, and missing dose_max | LLM extracts PK observation thresholds or imaging tracer doses instead of therapeutic drug doses; also omits dose_max when abstract states a range with “≥X” pattern | Complete — forward fix + view v21 (rp2d gate) applied 2026-03-31; backfill validated (job 1694), 42 remediation pending |
| 36 | cORR set equal to ORR when abstract distinguishes confirmed vs unconfirmed | LLM extraction sets cORR = ORR instead of counting only confirmed responses. Reverse of Issue 27 — here the ORR value is copied to cORR rather than cORR leaking into ORR | Investigation complete |
| 37 | Mean survival values extracted as median | LLM extracts mean OS/PFS values without distinguishing them from median — the pipeline has no field to flag the statistic type, so mean values are silently presented as median | Investigation complete |
| 38 | Biomarker subgroups in secondary analyses not identified by extract_subgroups | extract_subgroups misses biomarker-defined subgroups (e.g. p16+ oropharyngeal) when they appear as secondary efficacy analyses rather than pre-specified study arms | Complete — backfill applied 2026-03-30 (1,718/1,730 reprocessed). Partial screen complete 2026-03-31 (16,709 screened, 1,483 flagged). Prompt fixes validated; remediation pending deployment. |
| 39 | Multi-drug randomized trial dose cross-contamination | In randomized trials with multiple investigational drugs, LLM assigns one drug’s dose to all arms instead of arm-specific doses | Investigation complete |
| 40 | Hierarchical subgroup rows in view lose N from flat counterparts | Mostly false positive. 3 of 4 audit examples (pubs 134450, 67379, 200353) have null N because the abstract genuinely doesn’t state per-subgroup N — correct extraction. Only pub 48926 is a real bug: flat IHC3+ has N=40 but hierarchical copy has N=null. Real scope: 182 TAOs across 59 pubs where flat counterpart has N but hierarchical copy doesn’t. | Downscoped — mostly not a bug. Post-process propagation fix deferred (low impact: 182 records). |
| 41 | Safety data cross-contamination between dose arms | Safety N and discontinuation rates from one dose arm attributed to another dose arm in the same publication. Related to Issue 31 but in safety domain — extraction/query layer, not view COALESCE. | Complete |
| 42 | Tumor shrinkage rate confused with RECIST ORR | LLM extracts “any tumor reduction” percentage as ORR instead of RECIST-defined objective response rate. e.g. pub 162304: 35% had any shrinkage but true ORR was ~1.5% (1/66 PR). | Forward fix applied 2026-04-03. Included in Issue 49 re-extraction (PROMPT_VERSION=1). |
| 43 | Cross-tabulated subgroups only extracted for highest-response HER2 level | Issue 33 backfill re-extracted cross-tabs but LLM only creates disease × biomarker cross-products for the most prominent level (e.g. IHC3+), skipping IHC2+, IHC1+, mutation/amp where responses are low/zero. Residual gap in Issue 33. | Forward fix applied 2026-03-30. Backfill: screened 5,348 → rescreened → 234 confirmed → remediated 2026-03-31. Pipeline re-run pending. |
| 44 | PFS/OS event count extracted as number_of_participants | In survival tables reporting “median (95% CI) events n/N”, LLM extracts the event numerator as N instead of the denominator. e.g. “5.3 (4.5, 5.9) 23/31” → N=23 (events) instead of N=31 (patients). Scale TBD. | Forward fix applied 2026-04-03. Included in Issue 49 re-extraction (PROMPT_VERSION=1). |
| 45 | Qualifying-subset denominator used as subgroup N instead of subset count | When abstract reports “X/Y pts had [condition]”, LLM uses Y (tested/assessed) as subgroup N instead of X (qualifying subset). Applies to biomarker, analysis population, prior-therapy, and condition-present subgroups. ~17% of target-disease pubs affected. | Forward fix applied + screen → remediate → re-extract backfill ready. Production screening pending. |
| 46 | Incomplete endpoint extraction across sibling dose arms | LLM extracts an endpoint (e.g. DoR) for one dose arm but skips the same endpoint for a sibling arm in the same table. Possibly biased toward higher-response or first-listed arm. Combined with Issue 45 screening. | Forward fix applied + combined with Issue 45 backfill. Production screening pending. |
| 49 | Arm name mismatch between extract_interventions and classify_publications | Two independent LLM steps name the same arm differently (e.g. “Control group” vs “Control”), preventing trial_arm_outcomes from linking to trial_arms by name. ~18% of arm outcomes unlinked after backfill. | Forward fix applied 2026-04-02. Backfill plan ready: 3,943 target-disease pubs, full pipeline re-run (~$178). Reset task + prompt versioning + investigational tagger removal. Tested on 10 pubs — 100% linking. |
| 50 | DrugLinker false-matches non-drug interventions to drugs | SimpleCandidateMatchingService (LLM-based last resort) matches non-pharmacological interventions (e.g. “Classical music” → Orca-T) to drugs. ~3,093 false matches on procedure/device/other intervention types. | Forward fix applied 2026-04-04 (DrugMatchingService + caching). Backfill cleanup pending production run. |
| 51 | Per-arm dose not populated on backfilled trial_arm_interventions | Backfill copied study-level dose from publication_interventions to trial_arm_interventions. Multi-dose-arm pubs have the same range on every arm instead of arm-specific dose. ~23.5k pubs need extract_dose_evidence re-run. | Fix validated — version bump + prompt refinement tested on 9 pubs. Production extract_dose_evidence run pending (~$103 est.). |
Each issue entry should keep analysis and remediation separate.
Recommended issue structure:
- Short summary
- Where this sits in the current pipeline
- Exact restriction causing the drop
- Concrete examples
- Downstream impact
- What the issue is not
- Scale
- Spot checks
- Open characterization questions
- Explored solution direction
- Solution applied
Solution applied should remain empty until an actual fix is agreed and implemented.
Backfill pattern: when an issue requires backfilling historical data, see the "One-Off Backfill Tasks" section in `.claude/skills/backend-expert/SKILL.md`.
## 8. Zero-sentinel contamination (residual)

### Short summary

The original Issue 8 fix addressed `max_prior_lines` zero-sentinel contamination (the LLM outputting 0 instead of null for unstated values). That fix is complete — no min > max contradictions remain. However, the same zero-sentinel pattern persists in efficacy fields: `patient_number_efficacy`, `measure_value` (ORR), and `patient_count`. When a publication abstract doesn't state a per-arm N or per-subgroup ORR, the LLM extracts 0 instead of leaving the value null.
### Concrete examples

- Pub 241259 (Temab-A E-R analysis): per-arm N not stated for the 2.0 and 2.4 mg/kg dose arms (63 total across arms), but N=0 extracted for each arm
- Pub 29699 (ABBV-400 E-R analysis): ORR=0% extracted for all arms, but abstract only reports exposure-response correlations (p<0.05) — no numeric ORR values stated
- Pub 134450 (MRG003 phase 1b): N=0 for CRC and SCCHN disease subgroups despite ORR and DCR being reported (ORR=0%, DCR=25% for CRC; ORR=40%, DCR=100% for SCCHN)
- Pub 67379 (ROME trial): N=0 for hTMB/MSS subgroup, yet PFS=3.6 months with HR=0.65 and p=0.01 are extracted
Full-corpus scan (2026-03-29): 55,499 N=0 arm outcomes across 10,461 publications; 19,872 measure_value=0 across 9,307 publications. Of the N=0 set, 15,968 are definitive auto-sentinels (have non-zero sibling measure_values); the remaining ~39k are ambiguous (N not stated, no sibling data to confirm). Originally identified as 14 residual instances in HNSCC+ADC and CRC+ADC audits — the actual scope is corpus-wide.
### Explored solution direction

Update the `classify_publications` prompt: "When the abstract does not state a specific numeric value for a field (e.g., number of patients in a subgroup, ORR for an arm), leave the field null. Never output 0 as a placeholder for unstated values — 0 and null have different clinical meanings."
### Solution applied

Forward fix (2026-03-29), in three parts:

- `details.rb`: added `nullable: true` to `number_of_participants` (line 43) and `measure_value` (line 44) in the Arm schema. This allows the JSON schema to accept null, which is the primary signal the LLM uses to decide valid outputs.
- `task.rb`: added a zero-vs-null prompt instruction after the child-subgroup N section (lines 151-156): "Use null (not 0) when no numeric value is stated. 0 and null have different clinical meanings."
- `post_process.rb`: added two defensive guards:
  - N=0 → nil for all arm outcomes (zero patients with reported efficacy is always a sentinel)
  - measure_value=0 → nil when ALL arms for a percentage endpoint have value 0 (LLM fabricated zeros for unreported endpoints)
Backfill (2026-03-29): `lib/tasks/one_off/backfill_zero_sentinel_efficacy.thor` — a three-phase Thor task (identify → validate → remediate). N=0 candidates with non-zero sibling measure_values are auto-classified as sentinels without the LLM. Remaining candidates (ambiguous N=0, and measure_value=0 for percentage endpoints) are validated via GPT-5-mini against the abstract text. The audit trail is stored in `trial_subgroup.llm_data['zero_sentinel_checks']` and `['zero_sentinel_patches']`. All three phases completed 2026-03-29 (jobs 1661-1664).
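The phase-1 auto-classification rule can be expressed as a small standalone predicate. This is a sketch with hypothetical names, not the actual Thor task code: an N=0 arm outcome is treated as a definitive sentinel only when a sibling outcome carries a non-zero measure value.

```ruby
# Hypothetical standalone form of the phase-1 rule — not the actual
# Thor task code. Zero patients cannot produce efficacy data, so an
# N=0 candidate with a non-zero sibling measure_value is a sentinel.
def auto_sentinel?(candidate, siblings)
  candidate[:number_of_participants].to_i.zero? &&
    siblings.any? { |s| s[:measure_value].to_f.positive? }
end

siblings = [{ measure_value: 25.0 }, { measure_value: 40.0 }]
p auto_sentinel?({ number_of_participants: 0 }, siblings)               # => true
p auto_sentinel?({ number_of_participants: 0 }, [{ measure_value: 0 }]) # => false
```

Candidates failing this predicate stay ambiguous and fall through to the GPT-5-mini validation phase.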
Known regression (fixed 2026-03-30): the `post_process.rb` guard that nulls measure_value=0 when all arms have 0 for a percentage endpoint was too aggressive — it killed real 0% ORR values (e.g. pub 31990 IHC2+/ISH- and IHC1+ cohorts genuinely had 0% ORR). Fixed in Issue 43: the guard now nulls only when all arms also have nil/zero N (the fabrication signal). A real 0% with stated N > 0 is preserved.
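The corrected guard semantics can be sketched standalone (hypothetical names; this is not the shipped `post_process.rb` code): an all-zero percentage endpoint is nulled only when no arm has a stated N > 0.

```ruby
# Sketch of the Issue 43 guard semantics, not the production code.
# An all-zero percentage endpoint is treated as fabricated only when
# no arm has a stated N > 0; a real 0% with N > 0 is preserved.
def null_fabricated_zeros(arm_outcomes)
  all_zero  = arm_outcomes.all? { |a| a[:measure_value].to_f.zero? }
  no_real_n = arm_outcomes.all? { |a| a[:number_of_participants].to_i <= 0 }
  return arm_outcomes unless all_zero && no_real_n

  arm_outcomes.map { |a| a.merge(measure_value: nil) }
end

# Fabricated: every arm is 0% with no stated N -> values nulled
fabricated = [
  { measure_value: 0, number_of_participants: nil },
  { measure_value: 0, number_of_participants: 0 }
]
# Real: 0% with a stated denominator (the pub 31990 pattern) -> preserved
real = [
  { measure_value: 0, number_of_participants: 40 },
  { measure_value: 0, number_of_participants: 21 }
]

p null_fabricated_zeros(fabricated).map { |a| a[:measure_value] } # => [nil, nil]
p null_fabricated_zeros(real).map { |a| a[:measure_value] }       # => [0, 0]
```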
## 26. Parent population N propagated to child subgroups (residual)

### Short summary

The original Issue 26 fix addressed the bulk of cases where `classify_publications` copies the parent subgroup's `number_of_participants` to child subgroups. However, residual instances remain where the parent population N is applied to child subgroups that represent a strict subset.
### Concrete examples

- Pub 200353 (T-DXd biomarker analysis, DESTINY-CRC02): 97 paired BL/C3D1 ctDNA samples total. Both the "Complete MR at C3D1" and "Absent MR at C3D1" child subgroups have N=97, yet each is a subset of the 97 paired samples. The abstract references a table (not inline text) with the split, but the LLM defaulted to the parent N.
Two residual instances in the job 1635 audit — far lower frequency than the original Issue 26 (~5,058 subgroups across 1,174 pubs), suggesting the fix addressed the majority but edge cases remain, particularly when the child subgroup N is only available in a referenced table rather than inline text.
### Explored solution direction

The original prompt fix instructed the LLM to extract the subset-specific N. Residuals likely need a reinforcement: "When a child subgroup represents a subset of the parent (e.g., 'Complete MR' vs 'Absent MR' within a paired sample set), the child's N must be less than the parent's N. If the specific N is not stated, leave it null rather than copying the parent's N."
### Solution applied

(empty — pending implementation)
## 17. ASCO abstract and presentation copies create duplicate publication rows

### Short summary

After broadening ASCO ingestion to include both `AbstractContentItem` and `PresentationContentItem`, the same scientific abstract can now be stored twice under different ASCO uids. `EmergingClinicalDataQuery` groups by `publication_id`, not DOI/title, so both copies surface as separate rows.
This showed up repeatedly during the sqNSCLC pass and makes the local output look larger and noisier than the sheet.
### Where this sits in the current pipeline

`app/services/publications/asco_api_service.rb`:

- `fetch_abstract_hits` requests `contentTypes: ['Abstract', 'Presentation']`
- `save_publication` persists records using `Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])`
`app/queries/tpp/emerging_clinical_data_query.rb`:

- `build_result_rows` groups by `publication_id`, `disease_id`, `effective_line`, and `study_plan_arm_id`
There is no DOI-level or title-level deduplication step between ingestion and reporting.
### Exact restriction causing the duplication

The ASCO fix for Issue 2 intentionally broadened the search and detail query to include `PresentationContentItem`. That solved the "missing presentation" problem, but persistence still keys uniqueness on `source_id`:

```ruby
publication = Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])
```

So if ASCO exposes both:

- `ABSTRACT492030`
- `PRESENTATION251481`

with the same DOI and the same text, the two are treated as distinct publications locally.
### Concrete examples from sqNSCLC validation

#### Example 1: PF-08046054

Same DOI: `10.1200/JCO.2025.43.16_suppl.8611`

Stored twice:

- publication `48035` — source_id `ABSTRACT492030`
- publication `238708` — source_id `PRESENTATION251481`

Both produce the same sqNSCLC row (ORR = 33.3%, N = 6).
#### Example 2: IBI363

Same DOI: `10.1200/JCO.2025.43.16_suppl.8509`

Stored twice:

- publication `139344` — source_id `ABSTRACT500470`
- publication `237445` — source_id `PRESENTATION246467`

Both produce the same main sqNSCLC 3 mg/kg Q3W row.
#### Example 3: Additional duplicate DOI pairs in the same sqNSCLC slice

- Datopotamab deruxtecan: `10.1200/JCO.2025.43.16_suppl.8501`
- Sacituzumab govitecan: `10.1200/JCO.2025.43.16_suppl.8599`
### Downstream impact

- one worksheet row can correspond to two local rows
- counts for “how many publication-backed rows do we have?” are overstated
- manual comparison against the sheet becomes noisy
- any future ranking or aggregation that does not dedupe by DOI/title risks double-counting conference data
### What the issue is not

This is not a disease-mapping issue and not a subgroup-extraction issue.
The data itself is usually valid in both copies. The problem is that they are the same scientific result represented twice because ASCO exposes two content-item types.
This is also not an argument to undo Issue 2 entirely. We needed `PresentationContentItem` support to recover records like SHR-A2102. The gap is specifically the lack of a deduplication strategy after broadening the source.
In the sqNSCLC ADC/fusion slice alone, there are 4 duplicate DOI pairs:
- PF-08046054
- IBI363
- Datopotamab deruxtecan
- Sacituzumab govitecan
So the effect is already material in a small disease/technology slice.
### Explored solution direction

Two reasonable options:
1. Query/report deduplication
Keep both source records in `publications`, but dedupe in `EmergingClinicalDataQuery` or the TPP report by a stable key such as:
- DOI + disease + subgroup/arm
- or DOI + publication title
This is lower risk for ingestion history.
2. Ingestion-time merge
When saving ASCO records, detect that an incoming presentation and an existing abstract share the same DOI/title/NCT tuple and merge them into one canonical Publication.
This is cleaner downstream but riskier because it changes persistence semantics for already-ingested ASCO records.
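Option 1 can be sketched as a pure-Ruby dedupe step, assuming rows carry `doi` and `source_id` fields (illustrative names, not the actual query schema); the abstract copy is preferred when both content types are present.

```ruby
# Sketch of option 1 (query/report deduplication): keep one row per
# DOI, preferring the abstract copy over the presentation copy.
# Field names are illustrative, not the real query schema.
def dedupe_by_doi(rows)
  rows.group_by { |r| r[:doi] }.map do |_doi, copies|
    copies.min_by { |r| r[:source_id].start_with?('ABSTRACT') ? 0 : 1 }
  end
end

rows = [
  { doi: '10.1200/JCO.2025.43.16_suppl.8611', source_id: 'ABSTRACT492030' },
  { doi: '10.1200/JCO.2025.43.16_suppl.8611', source_id: 'PRESENTATION251481' },
  { doi: '10.1200/JCO.2025.43.16_suppl.8509', source_id: 'PRESENTATION246467' }
]

p dedupe_by_doi(rows).map { |r| r[:source_id] }
# => ["ABSTRACT492030", "PRESENTATION246467"]
```

In the real query the key would likely need to include disease and subgroup/arm as well, per the options above; a presentation without an abstract counterpart survives untouched.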
## 18. PubMed-indexed journal article missing from publication corpus

### Short summary

The current sqNSCLC worksheet row for Cofetuzumab pelidotin points to the 2025 journal article:

- DOI: `10.1016/j.lungcan.2025.108492`
- PMID: `40086026`
That article exists on PubMed and contains the sqNSCLC result the sheet uses, but there is no corresponding `Publication` row in the local database. As a result, the row is completely absent from `EmergingClinicalDataQuery`.
### Where this sits in the current pipeline

This drop happens before `EmergingClinicalDataQuery`.

During validation:

- `Publication.where(doi: '10.1016/j.lungcan.2025.108492')` returned no rows
- `Publication.where(source_id: '40086026')` returned no rows
So the publication never entered the local corpus, or it was dropped before persistence.
### Exact restriction causing the drop

Root cause isolated.

There are two distinct PubMed ingestion limitations affecting this paper:

- the disease-specific path depends on PubMed exposing a `ClinicalTrials.gov/NCT...` databank entry, and this record does not appear to expose that linking metadata even though PubMed marks it as a clinical trial
- the broad PubMed path in `Publications::PubmedApiService` built one giant combined query for the oncology MeSH clause plus the recovery clause; that combined search term excluded qualifying records that PubMed returned when the intended criteria were tested separately
What was verified live for PMID 40086026:

- PubMed resolves DOI `10.1016/j.lungcan.2025.108492` to PMID `40086026`
- the record has `Clinical Trial, Phase I`
- the record has oncology MeSH including `Carcinoma, Non-Small-Cell Lung` and `Lung Neoplasms`
- `40086026[uid] AND mesh AND clinical-trial publication types AND 2025 date` returned 1
- `40086026[uid] AND full previous combined search term` returned 0

So the missing publication was not due to missing PubMed record metadata for the broad query. It was due to our query construction.
### Concrete example

#### Worksheet row: Cofetuzumab pelidotin in sqNSCLC

Worksheet entry:

- Drug: Cofetuzumab pelidotin
- Publication: Lung Cancer (journal), 2025
- Link: https://doi.org/10.1016/j.lungcan.2025.108492
- ORR = 12.5%
- cORR = 12.5%
- mPFS = 5.3
- mDoR = 2.2
Local database state:

- no `Publication` row for DOI `10.1016/j.lungcan.2025.108492`
- no `Publication` row for PMID `40086026`
- only older cofetuzumab records exist:
  - publication `150086` — ASCO 2021
  - publication `71934` — ESMO 2023
  - publication `101600` — Clinical Cancer Research 2021

External confirmation:

- PubMed lists the paper as "A phase 1b study of cofetuzumab pelidotin monotherapy in patients with PTK7-expressing recurrent non-small cell lung cancer" with PMID `40086026`
### Downstream impact

- the sqNSCLC worksheet still has one fully missing non-investor row even after the backfills and corrections
- the earlier tracker note that the cofetuzumab sqNSCLC value was poster-only is now stale for the current worksheet version
- the publication will remain absent until a non-`--disease-specific` 2025 PubMed run is executed against the fixed query logic
- `--disease-specific` alone is still insufficient for this class of paper because PubMed does not appear to expose the `ClinicalTrials.gov` linking metadata we rely on
### What the issue is not

This does not contradict the earlier ESMO 2023 analysis in Issue 11.
That earlier note was about publication 71934, where the squamous-specific value was not in the 2023 abstract text. The current worksheet has since moved to a later 2025 journal article. That newer source should be representable if it is ingested.
Currently one confirmed sqNSCLC worksheet row for the original worksheet discrepancy.
For 2025-01-01 through 2025-12-31, after fixing the PubMed query construction:

- the broad oncology/malignant-heme PubMed query returns 6,013 PMIDs
- 3,831 of those are not already in local `publications`
- compared with the old `Clinical Trial[pt]` path, there are 435 additional PMIDs
- 431 of those additional PMIDs are not already in local `publications`
So this is not just one missing-paper edge case. The broken combined query was suppressing a non-trivial number of 2025 PubMed records.
### Spot checks

- `Publication.where(doi: '10.1016/j.lungcan.2025.108492')` returned no rows before the fix
- `Publication.where(source_id: '40086026')` returned no rows before the fix
- after the `PubmedApiService` query change, `fetch_uids_by_date('2025/01/01', '2025/12/31', nct_ids: [])` includes PMID `40086026`
- live verification after the fix returned: `includes_pmid_40086026 = true`, `total = 6013`
### Open characterization questions

- After the 2025 backfill, how many of the 431 incremental publications are truly result publications versus broader cancer-clinical-trial noise?
- Do we want to keep the broad non-`--disease-specific` PubMed run as a regular sync, or use it only as a periodic coverage backfill?
### Explored solution direction

Characterize the missing publication upstream of the query, then narrow the fix to the actual failure point:

- Trace the PubMed/journal ingestion path for DOI `10.1016/j.lungcan.2025.108492` / PMID `40086026`
- Compare direct PubMed criteria matches against the full generated search term
- Split the broad PubMed search into separate query terms and union PMIDs in Ruby instead of relying on one giant combined PubMed query
### Solution applied

- updated `Publications::PubmedApiService` so the broad PubMed path now runs separate search terms for:
  - oncology/malignant-heme MeSH + clinical-trial publication types
  - oncology/malignant-heme MeSH + recovery result terms for the recent recovery window
- changed PubMed UID fetching to execute each term separately and union the PMIDs in Ruby
- aligned total-count logic with the split-query approach
- verified live that the fixed 2025 query now includes PMID `40086026`
- syntax check passed: `ruby -c app/services/publications/pubmed_api_service.rb`
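The split-and-union shape of the fix can be sketched standalone. Here `fetch_pmids` is a stand-in returning canned IDs, not the real esearch call, and the two terms abbreviate the actual MeSH/publication-type clauses.

```ruby
# Sketch of the split-query approach: run each search term separately
# and union the PMIDs in Ruby, instead of one giant combined PubMed
# query. fetch_pmids stands in for the real esearch request; the IDs
# other than 40086026 are canned illustration values.
def fetch_pmids(term)
  {
    'mesh AND clinical-trial' => %w[40086026 40090001],
    'mesh AND recovery-terms' => %w[40090001 40099999]
  }.fetch(term, [])
end

terms = ['mesh AND clinical-trial', 'mesh AND recovery-terms']
pmids = terms.flat_map { |t| fetch_pmids(t) }.uniq

p pmids # => ["40086026", "40090001", "40099999"]
```

The key property is that a record qualifying under either term survives, whereas the old combined term could exclude records that matched each clause when tested separately.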
## 27. extract_efficacy_metrics picks confirmed ORR as plain ORR

### Short summary

When both confirmed (`confirmed=true`) and unconfirmed (`confirmed=false`) ORR rows exist for the same subgroup in the view, `ClinicalEvidenceQuery#extract_efficacy_metrics` can pick the confirmed row as the plain ORR metric value. This happens because the ORR extraction loop does not exclude confirmed rows, and when both rows have the same `number_of_participants`, `max_by` returns whichever comes first — often the confirmed row.
### Where this sits in the current pipeline

`ClinicalEvidenceQuery#extract_efficacy_metrics` — `app/queries/tpp/clinical_evidence_query.rb`, lines 590–628.
The cORR extraction (lines 658–675) correctly filters `confirmed == true` and is unaffected. The problem is exclusively in the general efficacy extraction loop that handles ORR alongside OS, PFS, DOR, etc.
### Exact restriction causing the drop

Lines 600–611:

```ruby
PRIMARY_EFFICACY_ABBREVIATIONS.each do |abbr|
  matching = grouped[abbr] || grouped[abbr.downcase]
  next if matching.nil? || matching.empty?

  matching = filter_by_valid_unit(matching, abbr)
  next if matching.empty?

  experimental = matching.select { |r| r['resolved_group_type'] == 'EXPERIMENTAL' }
  experimental = matching if experimental.empty?

  best_row = experimental.max_by { |r| r['number_of_participants'].to_i } || matching.first
```

When `abbr == 'ORR'`, `matching` includes ALL ORR rows regardless of the `confirmed` flag. If both confirmed=true (value 26.7%) and confirmed=false (value 43.3%) exist with the same N, `max_by` picks the first match. The result: `metrics[:orr]` gets the confirmed value, making it identical to `metrics[:corr]` and wrong as a standalone ORR.
### Concrete examples

Publication 117228 (RM-1929 photoimmunotherapy in rHNSCC):
Abstract states:
- “unconfirmed objective response rate (ORR) 43.3%”
- “confirmed ORR 26.7%”
The view correctly has both rows (subgroup "Heavily pretreated rHNSCC → Part 2"):

- `confirmed=true, measure_value=26.7, number_of_participants=30`
- `confirmed=false, measure_value=43.3, number_of_participants=30`

Report output: `efficacy.orr.value = 26.7` (should be 43.3)
The cORR extraction correctly returns 26.7%, but the ORR extraction ALSO returns 26.7% instead of 43.3%.
### Downstream impact

- Understated ORR: when confirmed ORR is lower than unconfirmed ORR (the typical pattern), the report shows the lower confirmed value as the headline ORR. For pub 117228, ORR is understated from 43.3% to 26.7%.
- Duplicate values: ORR and cORR columns show the same value, making the cORR column appear redundant and hiding the existence of a lower confirmed rate.
- Audit noise: the audit correctly flags these as `incorrect_value` on `efficacy.orr.value`, generating true-positive findings that overlap with Issue 25 audit findings.
477 publications currently have both confirmed=true and confirmed=false ORR rows (the correct Issue 25 extraction pattern). When both rows have the same N (which is common — confirmed and unconfirmed ORR are computed from the same denominator), the confirmed value gets picked as plain ORR.
```sql
-- Publications where confirmed and unconfirmed ORR have the same N
-- (susceptible to the wrong-pick bug)
SELECT count(DISTINCT ts.source_id)
FROM trial_subgroups ts
JOIN trial_outcome_measures tom_c ON tom_c.trial_subgroup_id = ts.id AND tom_c.confirmed = true
JOIN trial_outcome_measures tom_u ON tom_u.trial_subgroup_id = ts.id AND tom_u.confirmed = false
JOIN trial_endpoints te_c ON te_c.id = tom_c.trial_endpoint_id AND te_c.abbreviation = 'ORR'
JOIN trial_endpoints te_u ON te_u.id = tom_u.trial_endpoint_id AND te_u.abbreviation = 'ORR'
JOIN trial_arm_outcomes tao_c ON tao_c.trial_outcome_measure_id = tom_c.id
JOIN trial_arm_outcomes tao_u ON tao_u.trial_outcome_measure_id = tom_u.id
WHERE ts.source_type = 'Publication'
  AND tao_c.number_of_participants = tao_u.number_of_participants;
```

### Explored solution direction

Forward fix: in `extract_efficacy_metrics`, when processing ORR, exclude confirmed=true rows if confirmed=false rows also exist for the same subgroup. This ensures the plain ORR metric always uses the unconfirmed/total ORR:
```ruby
# Inside the PRIMARY_EFFICACY_ABBREVIATIONS.each loop, after filtering matching:
if abbr == 'ORR'
  unconfirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  matching = unconfirmed if unconfirmed.any?
end
```

This is a ~3 line change in `clinical_evidence_query.rb`. No backfill needed — fixing the query immediately fixes all report output.
No backfill required: This is a query-layer bug, not a data issue. The underlying data (trial_outcome_measures with correct confirmed flags) is correct. Fixing the Ruby code fixes all publications instantly.
Solution applied
Forward fix (2026-03-26): Added a 5-line guard in the extract_efficacy_metrics method of app/queries/tpp/clinical_evidence_query.rb (lines 610–613). When processing ORR, it rejects confirmed=true rows if non-confirmed rows exist. This ensures the plain ORR metric uses the unconfirmed/total ORR, while the cORR extraction (lines 667–683) independently picks confirmed=true rows.
```ruby
if abbr == 'ORR'
  non_confirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  matching = non_confirmed if non_confirmed.any?
end
```

Edge cases handled:
- Both confirmed + unconfirmed exist → ORR gets unconfirmed, cORR gets confirmed (correct)
- Only confirmed exists (no unconfirmed) → ORR falls back to confirmed value (safe fallback — same as cORR)
- Only unconfirmed/null exists → no change (correct)
No backfill needed — query-layer fix applies immediately to all report output
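The guard's behavior can be demonstrated standalone. This is a minimal sketch, not the application code: `pick_orr_rows` is a hypothetical wrapper, and the row hashes mirror the view rows, where `confirmed` may arrive as a boolean or as Postgres's `'t'` string.

```ruby
# Sketch of the Issue 27 guard: prefer unconfirmed ORR rows, falling back to
# confirmed rows only when no unconfirmed row exists for the subgroup.
def pick_orr_rows(matching)
  non_confirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  non_confirmed.any? ? non_confirmed : matching # safe fallback, same as cORR
end

rows = [
  { 'confirmed' => true,  'measure_value' => 40.0 }, # cORR
  { 'confirmed' => false, 'measure_value' => 52.0 }  # total ORR
]
pick_orr_rows(rows)                                              # only the unconfirmed row survives
pick_orr_rows([{ 'confirmed' => 't', 'measure_value' => 40.0 }]) # falls back to the confirmed row
```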
28. build_result_rows collapses dose-level arms when study_plan_arm_id is null
Short summary
ClinicalEvidenceQuery.build_result_rows groups view rows by [publication_id, disease_id, effective_line, study_plan_arm_id, subgroup_value]. When study_plan_arm_id is null — which it is for all publication-extracted arms that haven’t been matched to a clinical trial study plan arm — distinct dose-level arms (e.g. “8.0 mg/kg” and “10.0 mg/kg”) sharing the same subgroup_value collapse into a single group. extract_efficacy_metrics then picks one arm by max_by(number_of_participants), silently dropping the other.
Where this sits in the current pipeline
app/queries/tpp/clinical_evidence_query.rb, build_result_rows method (line 306).
Exact restriction causing the drop
The grouping key at line 306 is:
```ruby
grouped = enriched_data.group_by { |row|
  [row['publication_id'], row['disease_id'], row['effective_line'],
   row['study_plan_arm_id'], row['subgroup_value']]
}
```

When study_plan_arm_id is null for both dose arms (common for unlinked publications), they group together. extract_efficacy_metrics (line 619) then picks one via max_by(number_of_participants).
Concrete examples
Pub 190656 (ARTEMIS-001, HS-20093 B7-H3 ADC in NSCLC):
- View has 6 rows for “NSCLC → Squamous cell carcinoma” (3 endpoints × 2 dose arms: 8.0 mg/kg N=32 and 10.0 mg/kg N=26)
- Both arms have study_plan_arm_id = null
- Query collapses to 1 row, picks 8.0 mg/kg (N=32 > N=26)
- Lost data: Sq 10.0 mg/kg cORR 26.9%, PFS 5.7, DOR 7.0
Downstream impact
Dose-level subgroup data is silently dropped from the Clinical Evidence report. For dose-escalation studies where different dose levels have meaningfully different efficacy, only the higher-N cohort appears.
Affects dose-escalation/expansion publications where arms aren’t matched to trial study plan arms. The view correctly distinguishes arms by arm_name, but the query ignores arm_name in its grouping key.
Explored solution direction
Add arm_name to the grouping key in build_result_rows, or fall back to arm_name when study_plan_arm_id is null. This preserves dose-level arm distinctions without breaking publications where study_plan_arm_id correctly differentiates arms.
Related to Issue 20 (study_plan_arm link is fragile) — same root cause of over-reliance on study_plan_arm_id.
Solution applied
Forward fix (2026-03-29): Added arm_name fallback to the grouping key in app/queries/tpp/clinical_evidence_query.rb build_result_rows method (line 307). When study_plan_arm_id is null, uses arm_name as the differentiator so distinct dose-level arms (e.g. “8.0 mg/kg” vs “10.0 mg/kg”) are preserved as separate rows.
```ruby
grouped = enriched_data.group_by { |row|
  [row['publication_id'], row['disease_id'], row['effective_line'],
   row['study_plan_arm_id'] || row['arm_name'], row['subgroup_value']]
}
```

No backfill needed — query-layer fix applies immediately to all report output.
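The effect of the fallback can be sketched on hypothetical rows mirroring the Pub 190656 example — with `study_plan_arm_id` nil for both dose arms, the old key collapses them while the new key keeps them apart:

```ruby
# Two dose arms of the same subgroup, both unlinked (study_plan_arm_id nil).
rows = [
  { 'publication_id' => 190656, 'disease_id' => 1, 'effective_line' => nil,
    'study_plan_arm_id' => nil, 'arm_name' => '8.0 mg/kg',  'subgroup_value' => 'Squamous' },
  { 'publication_id' => 190656, 'disease_id' => 1, 'effective_line' => nil,
    'study_plan_arm_id' => nil, 'arm_name' => '10.0 mg/kg', 'subgroup_value' => 'Squamous' }
]

old_key = rows.group_by { |r| [r['publication_id'], r['disease_id'], r['effective_line'],
                               r['study_plan_arm_id'], r['subgroup_value']] }
new_key = rows.group_by { |r| [r['publication_id'], r['disease_id'], r['effective_line'],
                               r['study_plan_arm_id'] || r['arm_name'], r['subgroup_value']] }

old_key.size # => 1  (dose arms collapsed; max_by would drop one)
new_key.size # => 2  (dose arms preserved as separate rows)
```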
29. Dose extraction captures study-level range, not efficacy population range
Short summary
In dose-escalation studies, classify_publications extracts the full dose range stated in the abstract (e.g. dose_min=1.0, dose_max=8.3 mg/kg) as a property of the subgroup. But when the abstract restricts efficacy reporting to a dose subset (e.g. “results for patients who received ≥4.0 mg/kg”), the dose_min on the efficacy row is too low, creating a mismatch between the dose range and the efficacy population.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — dose fields extracted as subgroup-level properties.
Exact restriction causing the drop
Dose extraction treats dose as a study-level attribute (“what doses were used?”) rather than scoping to the efficacy analysis population (“what doses did the patients in the reported results actually receive?”). The LLM prompt doesn’t instruct it to scope dose to the efficacy population.
Concrete examples
Pub 238709 (MYTX-011 KisMET-01 updated):
- Abstract: “85 pts received 1.0–8.3 mg/kg; 59 pts received ≥4.0 mg/kg” — efficacy reported only for ≥4.0 mg/kg subset
- Extracted: dose_min=1.0, dose_max=8.3
- Expected: dose_min=4.0, dose_max=8.3 (matching the efficacy population)
- RP2D correctly extracted as “5.0 mg/kg Q3W (2-on 1-off) and 4.0 mg/kg Q3W”
Downstream impact
Report rows show a broader dose range than the actual efficacy population received. Minor impact on report accuracy but misleading for dose-response interpretation.
Affects phase I dose-escalation studies where efficacy is reported for a dose subset. Relatively uncommon pattern — most studies report efficacy at a single dose or clearly per-dose-level.
Explored solution direction
Update the classify_publications dose extraction prompt to instruct the LLM: “When the abstract reports efficacy for a specific dose subset, use that subset’s dose range, not the full escalation range.” Alternatively, accept this as a known limitation since RP2D (when present) correctly reflects the clinically relevant dose.
Solution applied
Forward fix (2026-03-28):
- task.rb: Added “DOSE SCOPING” instruction to the Subgroup Dose Context section — instructs the LLM to set dose_min/dose_max to match the efficacy population, not the full escalation range, when the abstract restricts efficacy reporting to a dose subset.
- task.rb: Added “DOSE RANGE COMPLETENESS” instruction — instructs the LLM to always fill both dose_min and dose_max for dose-defined subgroups (e.g. “≥X” subgroups now get dose_max set to the highest dose level in the abstract).
- dose_evidence_extraction.rb: Added clarifying comment that drug-level dose extraction intentionally captures the full escalation range (efficacy-population scoping is handled in subgroup extraction).
Backfill (2026-03-28) — lib/tasks/one_off/backfill_dose_scope_mismatch.thor:
Three-phase approach, no regex. Also covers issue 35 (PK thresholds, imaging doses, missing dose_max).
- Structural query (identify): Finds any publication with materialized efficacy data AND dose_min set on trial_subgroups. No phase or trial-link filter — dose_min presence is the structural signal. 720 candidates in prod.
- LLM validation (validate): Sends abstract + current dose_min/dose_max to GPT-5-mini per subgroup. Schema: efficacy_restricted_to_dose_subset (bool), needs_correction (bool — true only when correct values differ from current extraction), correct_dose_min, correct_dose_max, explanation. Stores result in ts.llm_data['dose_scope_check'] on each trial_subgroup.
- Remediation (remediate --no-dry-run): Directly patches dose_min/dose_max on trial_subgroups using the validated correct_dose_min/correct_dose_max. Also syncs llm_data['subgroup_outcome_measures']. Stores audit trail in ts.llm_data['dose_scope_patch'] with previous values and explanation. Note: dry_run defaults to true — must pass --no-dry-run to apply.
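The dry-run-by-default shape of the remediation phase can be sketched as follows. This is an illustrative simplification, not the actual thor task: `remediate` and the subgroup hashes are hypothetical stand-ins, but the flow (validation result read from `llm_data`, audit trail written alongside the patch, writes gated behind `dry_run: false`) mirrors the description above.

```ruby
# Sketch of a dry-run-by-default remediation pass over validated subgroups.
# Nothing is written unless dry_run is explicitly disabled (--no-dry-run).
def remediate(subgroups, dry_run: true)
  patched = []
  subgroups.each do |ts|
    check = ts[:llm_data]['dose_scope_check']
    next unless check && check['needs_correction']
    # Audit trail captures previous values so the patch is reversible.
    patch = { 'previous'    => ts.values_at(:dose_min, :dose_max),
              'explanation' => check['explanation'] }
    unless dry_run
      ts[:dose_min] = check['correct_dose_min']
      ts[:dose_max] = check['correct_dose_max']
      ts[:llm_data]['dose_scope_patch'] = patch
    end
    patched << ts
  end
  patched
end
```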
Remediation applied in prod (2026-03-28):
- 405 trial_subgroups patched across 299 publications
- 315 subgroups: dose_max filled in (e.g. “≥240 mg” went from 240/null → 240/960)
- 90 subgroups: dose_min/dose_max nulled out (non-dose values: PK thresholds, cycle counts, % weight loss, radiation parameters, etc.)
- Spot-checked 12 random patched records against abstracts: 12/12 correct
- Audit trail stored in ts.llm_data['dose_scope_patch'] with previous values for reversal if needed
Production run sequence:
```shell
# 1. Identify structural candidates (read-only, ~720 candidates)
thor one_off:backfill_dose_scope_mismatch:identify

# 2. LLM validation — writes to llm_data only (~$1-2 for 720 pubs with gpt-5-mini)
thor one_off:backfill_dose_scope_mismatch:validate --batched

# 3. Dry-run remediation — preview all patches
thor one_off:backfill_dose_scope_mismatch:remediate --dry-run

# 4. Live remediation — patches dose fields on trial_subgroups
thor one_off:backfill_dose_scope_mismatch:remediate --no-dry-run
```

30. Cross-study data contamination from abstract background sections
Short summary
When a publication abstract references efficacy results from a prior study as background context (e.g. “In our previous study NCT05029882, ORR was 24.4%”), classify_publications extracts those values as if they belong to the current study. This produces fabricated efficacy data for publications that may have no efficacy results of their own yet.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — efficacy extraction from abstract text.
Exact restriction causing the drop
The LLM extraction prompt does not distinguish between efficacy results reported as outcomes of the current study vs. results cited from external/prior studies as background context. The abstract structure (Background → Methods → Results → Conclusions) is not enforced.
Concrete examples
Pub 29705 (ABBV-400/Telisotuzumab adizutecan signal-seeking study, NCT06084481):
- Abstract background: “Initial results from the ongoing first-in-human study (NCT05029882) of ABBV-400… an overall response rate of 24.4%”
- Current study status: “As of 19 January 2024, 24 patients have been enrolled” — no efficacy data reported
- Extracted: ORR=24.4%, N=24 (enrollment count misinterpreted as efficacy N)
- Expected: No efficacy data (null)
The 24.4% ORR belongs to NCT05029882, not NCT06084481. The N=24 is enrollment, not an efficacy population.
Downstream impact
Publications appear in the Clinical Evidence report with fabricated efficacy data from unrelated studies. This is particularly misleading for signal-seeking or early-enrollment publications where the abstract previews prior results to motivate the new study.
Affects publications whose abstracts cite efficacy results from prior/companion studies. Common in: signal-seeking study designs, follow-up studies referencing parent trials, and publications describing study rationale with prior data.
Initial backfill (2026-03-28) validated 7,675 pubs via NCT mismatch + multiple registry ID signals, finding 1,495 that cite prior study efficacy. However, 46,962 pubs with efficacy data remain unvalidated — the structural signals missed cases where the prior study is cited by author/journal reference (e.g. [Cohen, Cancer Research 2023]) or is a different cohort of the same trial (same NCT). Example: pub 30362 cites petosemtamab monotherapy 2L/3L results from [Cohen, Cancer Research 2023] as background, but shares NCT03526835 with the current 1L combination study — no registry ID mismatch to detect.
Explored solution direction
- Audit prompt guard (deployed): Added “CROSS-STUDY REFERENCES” instruction to the audit prompt so future audits flag these correctly.
- Extraction prompt fix (forward): Update the classify_publications prompt to instruct: “Only extract efficacy values reported as results of THIS study (typically in the Results section). Do not extract values cited from prior/external studies in the Background or Introduction.”
- Detection query: Publications where llm_data has efficacy values but the abstract contains phrases like “previous study”, “prior study”, “first-in-human study (NCT…)” with efficacy values in the same sentence could be flagged for review.
- Backfill: Identify cross-contaminated publications and, depending on the count, reset them to go through the publication pipeline again. Ideally we would not rely on regex-based solutions for identifying cross-contaminated pubs.
Solution applied
Forward fix (2026-03-28):
- task.rb: Added section 6 “Cross-Study References” to SYSTEM_PROMPT — instructs the LLM to only extract efficacy from THIS study, reject values from prior/external studies, and use the provided trial NCT IDs as authoritative identifiers. Tested on pub 29705: correctly returns empty outcome_measures instead of the prior study’s ORR=24.4%.
- subgroup_extraction.rb: Added cross-study guard to Step 3 — ignore subgroups/endpoints from prior study citations.
Backfill detection (2026-03-28) — lib/tasks/one_off/backfill_cross_study_contamination.thor:
Two-phase detection, no regex:
- Structural query (identify): Finds publications linked to a trial whose abstract mentions different NCT IDs (2,306 candidates from NCT mismatch). Filters to NCT-prefixed IDs only to avoid false positives from alternate registry entries (EudraCT, CTRI, etc.).
- LLM validation (validate): Sends abstract + linked NCT IDs to GPT-5-mini asking whether the pub reports its own efficacy or only cites prior studies. Schema: has_own_efficacy_results (bool), cites_prior_study_efficacy (bool), explanation, prior_studies (array). The --all flag validates all unvalidated pubs with efficacy data (no structural pre-filter).
Tested on 50+ random structural candidates + 5 known edge cases. Zero false positives. Correctly distinguishes:
- Pubs with own results only (true negative)
- Pubs with own results + prior study citations (mixed — needs re-extract)
- Pubs with no own efficacy + prior study citations (pure contamination — null out)
- Safety/PK/diagnostic pubs with no efficacy at all (different problem, excluded)
Backfill remediation (remediate) — two modes:
- Null out (own=false, cites=true): Destroys materialized efficacy data, sets subgroup_outcome_measures=[]. For pubs like 29705 that have zero own efficacy.
- Re-extract (own=true, cites=true): Resets extracted=false, clears subgroup_outcome_measures, destroys materialized data. Next classify_publications run re-extracts with the fixed prompt.
Production run (2026-03-28): Validated 7,675 pubs via NCT mismatch + multiple registry ID signals. Found 1,495 citing prior study efficacy. Remediated confirmed contamination (null-out only).
Backfill gap identified (2026-03-30): 46,962 pubs with efficacy data were never validated because the structural pre-filters (NCT mismatch, 2+ registry IDs) miss prior studies cited by author/journal reference or different cohorts of the same trial. Text pattern matching (ILIKE on “prior study”, citation brackets, etc.) only catches ~5% of known cases — language is too varied. Solution: validate all remaining pubs with --all (no structural pre-filter). Estimated cost: ~$12 via GPT-5-mini batch.
Full-corpus validation (2026-03-30) — job 1471: Validated 68,958 of 69,124 pubs with efficacy data. Results:
| Category | Pubs | Outcomes | Action |
|---|---|---|---|
| Clean (own=true, cites=false) | 53,380 | — | None |
| Mixed (own=true, cites=true) | 8,937 | 26,491 | Triage then re-extract |
| Pure contamination (own=false, cites=true) | 1,722 | 3,275 | Null out |
| No efficacy at all (own=false, cites=false) | 5,159 | 14,177 | Separate issue |
The own=false, cites=false bucket (5,159 pubs) contains PK/safety/DDI/biomarker publications with spurious subgroup_outcome_measures — not cross-contamination, but a separate data quality issue.
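The bucketing implied by the validation schema's two booleans can be sketched as a small Ruby mapping. This is an illustrative reconstruction (the function name is hypothetical), showing how each publication lands in exactly one of the four categories above:

```ruby
# Maps the LLM validation booleans to the remediation bucket for a publication.
def contamination_bucket(has_own_efficacy:, cites_prior:)
  case [has_own_efficacy, cites_prior]
  when [true,  false] then :clean              # no action
  when [true,  true]  then :mixed              # triage, re-extract only if data leaked
  when [false, true]  then :pure_contamination # null out materialized efficacy
  else                     :no_efficacy        # spurious SOMs — separate issue
  end
end

contamination_bucket(has_own_efficacy: false, cites_prior: true) # => :pure_contamination
```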
Triage step added (2026-03-30): Re-extracting all 8,937 mixed pubs is wasteful — most have trivial background citations (e.g. “promising phase 1/2 rates”) that didn’t leak into extracted data. Added triage command to backfill_cross_study_contamination.thor that sends each mixed pub’s abstract + prior study citations + extracted outcome measures to GPT-5-mini, which checks whether any extracted values actually match the cited prior study data. Results stored in llm_data['cross_study_triage']. Schema: has_contaminated_outcomes (bool), explanation, contaminated_indices (array of 0-based indices into subgroup_outcome_measures).
Spot-checked on 5 pubs (pub 30362 known contamination + 4 random): 1/5 flagged contaminated. Pub 30362 correctly identified index 0 (ORR=37.2% + DOR=6.0mo from Cohen 2023 monotherapy) as contaminated while index 1 (ORR=60% from current combination study) was clean. The remediate command now only re-extracts pubs where triage confirmed contamination, and warns if untriaged pubs exist.
Production triage run (2026-03-30) — job 1674: Triaged all 8,937 mixed pubs in ~34 minutes.
| Triage result | Pubs | Contaminated indices |
|---|---|---|
| Clean (no leakage) | 8,205 | — |
| Contaminated (prior study data leaked into extractions) | 732 | 970 |
91.8% of mixed pubs cite prior studies in Background but have clean extractions — triage saved ~8,200 unnecessary re-extractions.
Spot-check results (2026-03-30): Checked 18 contaminated, 5 clean, 5 pure-contamination pubs.
- Contaminated (732): True positives confirmed across diverse patterns (SCHOLAR-1 comparator data, preclinical studies citing clinical results, I-SPY 2 external validation). ~63 pubs are same-trial follow-ups where “previously reported” refers to the trial’s own earlier publication (e.g., ARAMIS OS follow-up citing its own MFS primary result, PAOLA-1 OS citing its own PFS primary). These are borderline — the values genuinely came from Background text, but they’re the trial’s own data. Re-extraction with the fixed prompt handles these correctly: it will extract values that belong to this trial and skip values only mentioned in Background from different studies.
- Clean (8,205): All correctly clean — extracted values don’t match cited prior study values. Triage reasoning is precise (compares specific numbers).
- Pure contamination (1,722): All correct — news summaries, trial design abstracts, preclinical studies with only cited clinical efficacy.
Prompt fix v1 (2026-03-30): Updated cross-study reference handling in task.rb section 6 and subgroup_extraction.rb:
- Removed “Previously reported…” from prior-study recognition patterns (was causing false positives on same-trial follow-ups)
- Added EXCEPTION clause: when abstract is a subgroup/post-hoc/updated analysis of the same trial (matched by NCT/registry ID or trial name/acronym), previously reported results ARE the study’s own data and should be extracted
- Kept matching criteria precise (NCT ID, registry ID, trial name/acronym only) — excluded fuzzy “same population and intervention” matching to avoid a Phase 3 incorrectly claiming a Phase 2’s results as its own
Production run (2026-03-30): Steps 1–4 completed. 720/732 pubs re-extracted, 12 correctly empty (no own efficacy). 701 materialized via post-processing.
Post-extraction review (2026-03-30): Spot-checked 10 random re-extracted pubs against triage explanations. Found two contamination patterns:
- Pattern A (pure background citations): Cleaned up — prompt fix works. Values from cited external studies in Background/Introduction are no longer extracted.
- Pattern B (cross-study comparisons): Persists — when an abstract explicitly compares its results against another study (benchmarking, MAIC, historical controls), o4-mini still extracts both sides. Root cause: subgroup_extraction creates subgroups for external study arms (e.g., “VISION”), then classify_publications fills them in.
Prompt fix v2 (2026-03-30): Strengthened both task.rb section 6 and subgroup_extraction.rb:
- Changed framing from “don’t extract prior study data” to “only extract data MEASURED IN PATIENTS ENROLLED IN THIS STUDY”
- Added explicit examples of cross-study comparison patterns to reject: benchmarking, MAIC, historical controls, side-by-side comparisons
- Added instruction to subgroup_extraction.rb to not create subgroups/endpoints for data from other studies even when presented as comparisons
Tested on 6 pubs locally (1 known + 5 Pattern B):
- 5/6 clean: subgroup extraction no longer creates external study subgroups, classify_publications only extracts own data
- 1/6 still contaminated (pub 119370): abstract presents cross-study comparison as formal arm comparison, indistinguishable from own trial arms
Retriage + prune commands (2026-03-30): Added retriage and prune to backfill_cross_study_contamination.thor for surgical cleanup of remaining contamination after re-extraction:
- retriage: Re-runs triage on re-extracted pubs, stores result in the cross_study_retriage key (preserves original triage data)
- prune: Removes specific contaminated SOMs by index, destroys materialized data, marks for post-processing rebuild
Tested on pub 119370: retriage correctly identified SOM index 1 (RICOVER-60 data) as contaminated, prune removed it, leaving only the Beijing cohort’s own data.
Production run sequence (remediation of 732 re-extracted pubs):
```shell
# 1. Deploy prompt fix v2 (task.rb + subgroup_extraction.rb)

# 2. Re-run subgroup extraction with fixed prompt (732 pubs)
thor clinical_trials:publications:extract_subgroups --publication_ids $(
  psql -t -c "SELECT id FROM publications WHERE llm_data->'cross_study_triage'->>'has_contaminated_outcomes' = 'true' AND extracted = true" | tr '\n' ' '
)

# 3. Re-extract with fixed prompt
thor clinical_trials:publications:classify_publications --batched

# 4. Re-triage new extractions to find remaining contamination
thor one_off:backfill_cross_study_contamination:retriage --batched

# 5. Surgically remove contaminated SOMs
thor one_off:backfill_cross_study_contamination:prune --no-dry-run

# 6. Post-process to rematerialize
thor clinical_trials:publications:post_process_publications --batched
```

31. Investigational drug dose data bleeds onto control/comparator arms
Short summary
When publication_interventions.study_plan_arm_id is NULL (the common case for publication-extracted drugs via Source 0), the drug_interventions CTE in vw_publication_efficacy_data joins the investigational drug to ALL arms — including control/comparator arms. The pub_dose_lookup COALESCE fallback then propagates the investigational drug’s dose fields (dose_min, dose_max, rp2d, dose_units, dose_frequency) onto control arm rows that have no subgroup-level dose override. This makes it appear that the comparator arm received the investigational drug’s dosing.
Where this sits in the current pipeline
db/views/vw_publication_efficacy_data_v18.sql:
- drug_interventions CTE (Source 0): Joins publication_interventions to arms. When both clinical_trial_id and study_plan_arm_id are NULL, the drug matches all arms via the OR di.study_plan_arm_id IS NULL fallback.
- pub_dose_lookup CTE: Pulls dose_evidence from publication_interventions. Joined to raw_rows via publication_intervention_id match from drug_interventions.
- raw_rows COALESCE chain (lines 449–469): Falls through subgroup-level dose → pub-level dose. No arm_type guard prevents control arms from inheriting investigational drug dose.
Exact restriction causing the drop
In raw_rows, the dose COALESCE chain:
```sql
COALESCE(tlm.subgroup_dose_min, ..., pdl.pub_dose_min) AS dose_min,
COALESCE(tlm.subgroup_dose_max, ..., pdl.pub_dose_max) AS dose_max,
COALESCE(tlm.subgroup_rp2d, pdl.pub_rp2d) AS rp2d,
```

has no guard for aoe.arm_type or aoe.resolved_group_type. When a control arm’s subgroup has no dose fields, the COALESCE falls through to pub_dose_lookup, which contains the investigational drug’s dose evidence.
Concrete examples
Pub 241259 (Temab-A exposure-response in mCRC):
- SOC arm = trifluridine/tipiracil+BEV (N=20)
- View shows: dose_min=1.6 mg/kg, dose_max=2.4 mg/kg, rp2d=2.4 mg/kg Q3W, dose_units=mg/kg, dose_frequency=Q3W
- These are Temab-A doses from publication_interventions id=51068 (study_plan_arm_id=NULL)
- Abstract explicitly states SOC is “trifluridine/tipiracil+BEV” — no Temab-A dosing
Pub 241978 (Enfortumab vedotin):
- “No upfront dose reduction” control arm shows dose_min=0.75 mg/kg, dose_max=1.25 mg/kg
Downstream impact
- Clinical Evidence report: Control arms display investigational drug dose fields, misleading reviewers into thinking comparator arms received the ADC
- Audit findings: Audit LLM correctly flags these as incorrect (5 of 7 issues on pub 241259 are this pattern)
- Data quality: Dose fields on control arms are nonsensical — they describe a drug the arm didn’t receive
- 2,890 view rows across 566 publications have dose data from pub_dose_lookup on control/comparator arms
- 1,197 additional control rows have subgroup-level dose (potentially legitimate for dose-comparison arms)
- Within ADC technology scope: 14 rows across 5 publications (smaller because most ADC trials are single-arm)
What the issue is not
- Drug NAME attribution to control arms is intentional — the report needs to show what drug the control is being compared against
- Subgroup-level dose on control arms may be correct (e.g., dose-comparison trials where the control is a different dose of the same drug)
- This does NOT affect experimental/investigational arm rows
Explored solution direction
Forward fix — view v19: Add an arm_type guard to the pub_dose_lookup COALESCE in raw_rows. When aoe.arm_type = 'control' (or aoe.resolved_group_type = 'ACTIVE_COMPARATOR'), skip the pub_dose_lookup fallback:
```sql
COALESCE(
  tlm.subgroup_dose_min,
  CASE WHEN tlm.subgroup_dose_value IS NOT NULL
       THEN tlm.subgroup_dose_value || ' ' || COALESCE(tlm.subgroup_dose_units, '')
  END,
  CASE WHEN aoe.arm_type != 'control' THEN pdl.pub_dose_min END
) AS dose_min,
```

Apply the same pattern to dose_max, rp2d, dose_units, dose_frequency, and single_dose. This preserves subgroup-level dose (tier 1) for all arms but blocks the publication-level fallback (tier 3) for control arms only.
No backfill needed — rematerializing the view after deploying v19 will fix all affected rows.
Related to Issue 20: The v16 Source 0 fix (using publication_interventions as primary drug source) introduced this side effect by broadening the drug_interventions join. The drug join itself is correct; only the dose COALESCE fallback needs the arm_type guard.
Solution applied
Forward fix — view v19 (2026-03-29): Added arm_type guard to all 6 dose COALESCE chains in db/views/vw_publication_efficacy_data_v19.sql. When aoe.arm_type is a control/comparator variant (control, comparator, active_comparator, placebo, placebo_comparator), the pub_dose_lookup fallback is skipped. Subgroup-level dose (tier 1) is preserved for all arms — only the publication-level fallback (tier 3) is blocked for control arms.
```sql
-- Example for dose_min (same pattern for dose_max, rp2d, dose_units, dose_frequency, single_dose):
COALESCE(
  tlm.subgroup_dose_min,
  CASE WHEN tlm.subgroup_dose_value IS NOT NULL
       THEN tlm.subgroup_dose_value || ' ' || COALESCE(tlm.subgroup_dose_units, '')
  END,
  CASE WHEN aoe.arm_type IS NULL
         OR LOWER(aoe.arm_type) NOT IN ('control', 'comparator', 'active_comparator', 'placebo', 'placebo_comparator')
       THEN pdl.pub_dose_min
  END
) AS dose_min,
```

No backfill needed — rematerializing the view after deploying v19 fixes all affected rows.
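The guard's logic can be rendered in Ruby for clarity. This is a sketch over hypothetical row hashes, not the view itself; note that a NULL arm_type keeps the fallback, matching the `aoe.arm_type IS NULL OR … NOT IN` condition in the SQL:

```ruby
# Control/comparator variants that must not inherit publication-level dose.
CONTROL_ARM_TYPES = %w[control comparator active_comparator placebo placebo_comparator].freeze

def resolved_dose_min(row)
  # Tier 1: subgroup-level dose wins for every arm type.
  return row['subgroup_dose_min'] if row['subgroup_dose_min']
  # Tier 3: publication-level fallback, blocked for control/comparator arms.
  arm = row['arm_type']&.downcase
  return nil if arm && CONTROL_ARM_TYPES.include?(arm)
  row['pub_dose_min']
end

resolved_dose_min('arm_type' => 'control', 'pub_dose_min' => 1.6)      # => nil (fallback blocked)
resolved_dose_min('arm_type' => 'experimental', 'pub_dose_min' => 1.6) # => 1.6
resolved_dose_min('arm_type' => 'control', 'subgroup_dose_min' => 0.75,
                  'pub_dose_min' => 1.6)                               # => 0.75 (tier 1 preserved)
```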
32. TTP (time to progression) misclassified as PFS
Short summary
The LLM extraction pipeline (classify_publications) maps TTP (time to progression) values to PFS (progression-free survival) when the abstract reports TTP but not PFS. These are distinct endpoints — TTP censors deaths while PFS counts them as events. Additionally, in some cases (e.g., pub 29737), TTP values reported for a best-response subpopulation (e.g., SD patients only) are attributed to the entire cohort.
Where this sits in the current pipeline
- app/tasks/publications_llm_classification/subgroup_extraction.rb: Identifies endpoints from the abstract. May correctly identify TTP but it gets mapped to PFS downstream.
- app/tasks/publications_llm_classification/task.rb: Extracts endpoint values. The LLM treats TTP as PFS when extracting, or the endpoint mapping normalizes TTP→PFS.
- Endpoint normalization: If TTP is not in the standard endpoint list, the LLM may substitute the closest recognized endpoint (PFS).
Exact restriction causing the drop
The classify_publications prompt and/or endpoint schema does not distinguish TTP from PFS. When an abstract reports “median TTP = X months”, the LLM maps this to the PFS endpoint because TTP is not available as a separate extraction target. The LLM lacks instruction to leave PFS null when only TTP is reported.
Concrete examples
Pub 29737 (IMMU-132 in GI cancers):
- Abstract: “time to progression (TTP) … median of 4.8+ mo for the SD pts”
- Extracted: PFS=4.8 months, patient_count=29 (entire CRC cohort)
- Correct: TTP=4.8+ months, applicable to 14 SD patients only — PFS should be null
- Two compounding errors: (1) TTP→PFS confusion, (2) SD-subpopulation value → full cohort
Pub 29737 KRAS-mutated subgroup:
- Abstract: “median TTP = 4.4+ mo” for 7 SD patients
- Extracted: PFS=4.4 months, patient_count=13 (all KRAS-mutated)
- Correct: TTP=4.4+ months for 7 SD patients — PFS should be null
Downstream impact
- Clinical Evidence report: PFS column shows TTP values, overstating the evidence (PFS is a stronger endpoint than TTP)
- Cross-study comparisons: TTP values mixed with genuine PFS values make comparisons unreliable
- Patient counts: When TTP is reported only for responders/SD patients, attributing it to the full cohort inflates the denominator
- 241 publications mention TTP in their abstract yet have PFS extracted without TTP (revised upward from 149 after widening text patterns to include hyphenated “time-to-progression” and “mTTP”)
- 181 publications have TTP correctly extracted as TTP (suggesting the pipeline CAN handle TTP in many cases)
- The SD-subpopulation misattribution is harder to quantify systematically but likely affects a subset of phase I/II publications reporting outcomes by best response category
Explored solution direction
- Extraction prompt fix (forward): Add explicit instruction to classify_publications: “TTP (time to progression) and PFS (progression-free survival) are distinct endpoints. If the abstract reports TTP but not PFS, extract TTP only — do NOT map TTP values to PFS. Leave PFS null when only TTP is reported.”
- Subpopulation guard: Add instruction: “When a time-based endpoint (TTP, PFS, DoR) is reported only for a best-response subgroup (e.g., ‘median TTP for SD patients’), do not attribute it to the parent population. Extract it under the response-specific subgroup or leave the parent’s value null.”
- Backfill: Re-extract PFS values for the 241 affected publications with the updated prompt. Scope: publications where the abstract contains TTP/time to progression but NOT PFS/progression-free survival, and a PFS endpoint was extracted.
Solution applied
Section titled “Solution applied”-
Prompt fix in
identifier_extraction.rb: Added<<< Endpoint Distinction Rules >>>section after the “keep broad” normalization instruction, explicitly stating TTP and PFS are clinically distinct and must never be merged. Also covers DFS vs EFS. Instructs LLM to use the exact term from the abstract when in doubt. -
Subpopulation guard in
task.rb: Added** Response-Specific Endpoint Attributionblock instructing the LLM not to attribute response-specific time-based endpoints (e.g., “TTP for SD patients”) to the parent population. -
Backfill task:
lib/tasks/one_off/backfill_ttp_pfs_misclassification.thorwithidentify(finds 241 affected pubs, stores findings inllm_data['ttp_pfs_check']) andremediate(resets pubs for full pipeline re-extraction). ~50% false positive rate in scope (TTP mentioned descriptively, not as a study endpoint) but re-extraction is harmless. -
Spot-check results: Ran
extract_trial_identifieron 5 confirmed misclassified pubs (13857, 143497, 53502, 12317, 143682) — all 5 now correctly extract TTP instead of PFS. -
Query-layer fix (2026-03-30): The extraction fix (steps 1-4) correctly stored TTP as
TTPin the view, butextract_efficacy_metricsinclinical_evidence_query.rbhad a TTP→PFS fallback (added ine7fc41f7, 2026-03-23) that silently remapped TTP back intometrics[:pfs]when no real PFS existed. This caused audit issue 8436 (pub 100, DX1002 phase 1 — abstract reports mTTP=2.70, query presented as mPFS=2.7). Fix: TTP is now a first-class metric (metrics[:ttp]) instead of a PFS stand-in. Changes:clinical_evidence_query.rb: TTP fallback writes to:ttpnot:pfs; added:ttpto patient count chainclinical_evidence_report.rb: AddedmTTP (month)andHR (TTP)columnsaudit_clinical_evidence.rb: Addedefficacy.ttp.value/hazard_ratioto auditable fields; removed TTP from “not tracked” exclusion list
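The shape of the query-layer change can be sketched in a few lines. This is a hypothetical simplification, not the actual extract_efficacy_metrics body; the point is that a TTP row now lands in metrics[:ttp] instead of being remapped into metrics[:pfs] when no real PFS exists.

```ruby
# Hypothetical, simplified sketch of the fixed fallback logic in
# clinical_evidence_query.rb. Row shape is illustrative only.
def extract_efficacy_metrics(rows)
  metrics = {}
  rows.each do |row|
    case row[:endpoint]
    when "PFS" then metrics[:pfs] ||= row[:value]
    when "OS"  then metrics[:os]  ||= row[:value]
    when "TTP" then metrics[:ttp] ||= row[:value] # previously: fell back into :pfs when no PFS existed
    end
  end
  metrics
end

# The pub 100 pattern (mTTP reported, no PFS) now surfaces under :ttp, never :pfs
extract_efficacy_metrics([{ endpoint: "TTP", value: 2.7 }])
# => { ttp: 2.7 }
```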
33. Cross-tabulated subgroups not identified in basket trials
Short summary
When basket trial abstracts report efficacy in a table structured as tumor type × biomarker status (e.g., CRC × HER2 IHC 3+/2+/1+), extract_subgroups identifies the single-dimension subgroups (tumor types and biomarker statuses separately) but not the cross-product subgroups (CRC IHC 3+, CRC IHC 2+, etc.). This means disease-specific biomarker-stratified efficacy data is lost — only the overall tumor-type and overall biomarker-status rows are extracted.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/subgroup_extraction.rb: Identifies subgroups and their endpoint associations from the abstract. The LLM prompt identifies subgroups as a flat list, and the hierarchical naming convention (e.g., “Non-breast STs → CRC”) captures one level of nesting but not cross-dimensional nesting.
Exact restriction causing the drop
The subgroup extraction prompt produces subgroups along each dimension independently:
- By tumor type: BTC, UC, GC/GEJA, CRC
- By biomarker: HER2 IHC3+, IHC2+, IHC1+
But it does not produce the cross-product: CRC IHC3+, CRC IHC2+, etc. The table data in the abstract contains these values, but the extraction doesn’t recognize the need to create nested subgroups for each cell in a tumor type × biomarker matrix.
Concrete examples
Pub 72043 (SHR-A1811 in non-breast solid tumors):
- Abstract table reports ORR for each tumor type × HER2 IHC status combination
- Extracted subgroups: CRC (36.4%), IHC3+ (54.1%), IHC2+ (41.7%), IHC1+ (50.0%)
- Missing: CRC IHC3+ (100%, 3/3), CRC IHC2+ (0%, 0/3), CRC IHC1+ (0%, 0/1), CRC HER2 mut/amp (0%, 0/3)
- 4 audit issues (8402-8405) all flagging missing cross-tabulated CRC subgroups
Downstream impact
- Clinical Evidence report: Disease-specific biomarker-stratified efficacy data missing — can only show overall CRC ORR, not CRC by HER2 status
- Granularity loss: The most clinically relevant data in basket trials is often the cross-tabulation (e.g., “does HER2 IHC 3+ predict response in CRC specifically?”)
- ~366 publications have both disease-type and biomarker-type subgroups with common biomarkers (HER2, EGFR, KRAS, BRAF, PD-L1, MSI, MMR)
- Not all 366 will have cross-tabulated data in the abstract — many will have separate analyses rather than a matrix table
- The issue primarily affects basket/platform trials reporting across multiple tumor types with biomarker stratification
What the issue is not
- This is NOT about missing biomarker context on existing subgroups (that’s Issue 19)
- This is NOT about dropped subgroups at the classify step (Issue 10) — the cross-product subgroups are never identified in the first place
- Parent-level tumor type and biomarker subgroups ARE correctly extracted
Explored solution direction
- Extraction prompt enhancement: Update the extract_subgroups prompt to recognize tabular cross-tabulation patterns: “When the abstract contains a table or matrix reporting efficacy by tumor type × biomarker status, create cross-product subgroups (e.g., ‘CRC → HER2 IHC 3+’) for each cell with reported data, in addition to the single-dimension subgroups.”
- Post-extraction cross-product generation: After extracting single-dimension subgroups, detect when a table exists with both dimensions and generate cross-product subgroups programmatically.
- Scope: Focus on publications with ≥2 disease subgroups AND ≥1 biomarker subgroup, and re-run extraction with the enhanced prompt.
- Backfill?
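The post-extraction direction above reduces to a small name generator. A minimal sketch, assuming flat lists of single-dimension subgroup names and the existing arrow nesting convention (the helper name is hypothetical; the real pipeline would only emit cells with reported data):

```ruby
# Hypothetical cross-product generator: combines single-dimension subgroup
# names using the existing arrow nesting convention ("CRC → HER2 IHC 3+").
def cross_product_subgroups(tumor_types, biomarker_statuses)
  tumor_types.product(biomarker_statuses).map { |t, b| "#{t} → #{b}" }
end

cross_product_subgroups(["CRC", "GC/GEJA"], ["HER2 IHC 3+", "HER2 IHC 2+"])
# => ["CRC → HER2 IHC 3+", "CRC → HER2 IHC 2+", "GC/GEJA → HER2 IHC 3+", "GC/GEJA → HER2 IHC 2+"]
```

Because classification and post_process already handle arbitrary subgroup strings, generated names like these would flow through the rest of the pipeline unchanged.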
Solution applied
- Prompt enhancement (subgroup_extraction.rb): Added Step 2b to SYSTEM_PROMPT instructing the LLM to create cross-product subgroups when the abstract reports efficacy broken down by two dimensions (e.g., tumor type × biomarker status) — covers both literal tables and prose patterns like “Among CRC patients, ORR was X% in IHC 3+”. Cross-products use the existing arrow nesting convention (“CRC → HER2 IHC 3+”) alongside preserved single-dimension parents. No schema or downstream changes needed — classification task, dropped-subgroup guard, and post_process all handle arbitrary subgroup strings already.
- Two-pass LLM-screened backfill (lib/tasks/one_off/backfill_cross_tabulated_subgroups.thor):
  - Structural scoping: 6,081 candidate pubs (≥2 disease-tagged + ≥1 biomarker-tagged subgroups)
  - Pass 1 (screen): Broad gpt-5-mini screening → 934 flagged (Job 1655)
  - Pass 2 (rescreen): Tighter prompt requiring ≥2 distinct tumor types with per-disease biomarker breakdown → 262 confirmed (Job 1657). Spot-check: ~7-9/10 true positives.
  - remediate resets subgroup_endpoints, subgroup_outcome_measures, llm_data_processed = false on the 262 confirmed pubs for pipeline re-run.
```shell
# Screening (already complete)
thor one_off:backfill_cross_tabulated_subgroups:screen --batched --parallelism=4    # Job 1655
thor one_off:backfill_cross_tabulated_subgroups:rescreen --batched --parallelism=4  # Job 1657

# Remaining steps
thor one_off:backfill_cross_tabulated_subgroups:remediate --dry_run
thor one_off:backfill_cross_tabulated_subgroups:remediate
# Then re-run full publications pipeline on affected pubs
```
34. “Immature” endpoints extracted as “Not Reached”
Short summary
When an abstract states that an endpoint (OS, PFS, DoR) is “not yet mature”, “data immature”, or “results are immature”, the LLM extraction maps this to “Not Reached”. These are clinically distinct concepts: “Not Reached” means the Kaplan-Meier curve hasn’t crossed the 50% mark (a real finding indicating the median exceeds current follow-up), while “immature” means insufficient events or follow-up to perform the analysis (no median can be estimated — value should be null).
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb: The classify_publications prompt doesn’t distinguish between “Not Reached” and “immature/not yet mature”. The LLM treats both as equivalent and extracts “Not Reached” for either.
Exact restriction causing the drop
The extraction prompt has no instruction to differentiate “Not Reached” (endpoint was analyzed, median exceeds follow-up) from “immature” (endpoint was NOT formally analyzed, insufficient data). Both get mapped to the string “Not Reached”.
Concrete examples
Pub 114571 (JSKN003 in HER2+ mCRC):
- Abstract: “The median overall survival (OS) was not yet mature”
- Extracted: OS = “Not Reached”
- Correct: OS should be null — data immature, no median estimated
Pub 115389 (from job 1594):
- Abstract: PFS described as “immature”
- Extracted: PFS = “Not Reached”
- Correct: PFS should be null
Downstream impact
- Clinical Evidence report: “Not Reached” implies a favorable outcome (median exceeds follow-up), while “immature” is neutral (no data yet). Reporting “Not Reached” when the data is simply immature overstates the evidence.
- Cross-study comparisons: “Not Reached” OS is treated as a positive signal, biasing comparisons against studies that honestly report immature data.
- ~71 publications have “immature”/“not yet mature” language in the abstract (without “not reached”) but have “Not Reached” extracted for OS, PFS, or DoR
- Breakdown: OS (~214 total “Not Reached” pubs with immature language, ~71 without “not reached” in abstract), PFS (~107), DoR (~68)
- Many abstracts legitimately say BOTH “immature” and “not reached” — these are correct and not affected
What the issue is not
- Abstracts that say “median OS was not reached” — these ARE correct as “Not Reached”
- Abstracts that say “OS data are immature; median was not reached” — also correct (both terms used)
- Only affects abstracts where “immature” is used WITHOUT “not reached” for the same endpoint
Explored solution direction
- Extraction prompt fix (forward): Add instruction to classify_publications: “Distinguish between ‘Not Reached’ (endpoint was analyzed but median exceeds follow-up — extract as ‘Not Reached’) and ‘immature/not yet mature’ (insufficient data to analyze the endpoint — extract as null/omit). Only use ‘Not Reached’ when the abstract explicitly states the median was not reached.”
- Backfill: Re-extract OS/PFS/DoR for the ~71 affected publications. Scope query:

```sql
SELECT DISTINCT v.publication_id
FROM vw_publication_efficacy_data v
JOIN publications p ON p.id = v.publication_id
WHERE v.measure_value = 'Not Reached'
  AND v.endpoint_abbreviation IN ('OS', 'PFS', 'DOR')
  AND (p.abstract ILIKE '%not yet mature%' OR p.abstract ILIKE '%data immature%'
       OR p.abstract ILIKE '%data are immature%' OR p.abstract ILIKE '%results are immature%')
  AND p.abstract NOT ILIKE '%not reached%'
  AND p.abstract NOT ILIKE '%not been reached%'
```
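The same scope logic can be expressed as a small Ruby predicate, which may be handy for spot-checking individual pubs in a console. This is a hypothetical helper mirroring the ILIKE patterns above, not pipeline code:

```ruby
# Hypothetical flagger: a "Not Reached" extraction is suspect when the
# abstract uses "immature" language but never says the median was not reached.
IMMATURE_PATTERNS    = [/not yet mature/i, /data (?:are |is )?immature/i, /results are immature/i].freeze
NOT_REACHED_PATTERNS = [/not reached/i, /not been reached/i].freeze

def suspect_not_reached?(measure_value, abstract)
  measure_value == "Not Reached" &&
    IMMATURE_PATTERNS.any? { |re| abstract.match?(re) } &&
    NOT_REACHED_PATTERNS.none? { |re| abstract.match?(re) }
end

suspect_not_reached?("Not Reached", "The median overall survival (OS) was not yet mature")
# => true  (the pub 114571 pattern; value should be null)
suspect_not_reached?("Not Reached", "OS data are immature; median was not reached")
# => false (both terms used; "Not Reached" is correct)
```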
Solution applied
(empty — pending implementation)
35. Dose extraction confuses PK thresholds, imaging agent doses, and missing dose_max
Short summary
The classify_publications dose extraction conflates several distinct concepts when populating subgroup dose fields. Pharmacokinetic thresholds (e.g. “tumor saturation above 100 mg/m²/d”), imaging/diagnostic agent doses (e.g. [⁶⁸Ga]Ga-PSMA-11 activity in MBq), and PD biomarker thresholds are extracted as if they were the therapeutic dose range for the efficacy population. Additionally, dose_max is sometimes left null when the abstract clearly states an upper bound.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — Subgroup Dose Context section of SYSTEM_PROMPT.
Exact restriction causing the drop
The dose extraction prompt instructs the LLM to extract dose fields for dose cohorts but does not distinguish between:
- Therapeutic dose (the drug dose patients received for treatment)
- PK/PD thresholds (concentration or exposure levels observed, e.g. “target saturation above X”)
- Imaging/diagnostic agent doses (tracer activity for PET scans, not therapeutic)
- dose_max omission — when dose_min is set from a “≥X” phrase, dose_max is left null even when the abstract states the upper bound of the escalation range
Concrete examples
PK threshold as dose — Pub 148480:
- Abstract: doses 5–400 mg/m²/d, “tumor saturation above 100 mg/m²/d”
- Extracted: dose_min=100 (PK threshold, not efficacy population)
- Expected: dose_min=5, dose_max=400 (full enrolled range, efficacy not restricted)
PK threshold as dose — Pub 229651:
- Abstract: doses 25–500 mg QD, “NTX changes ≥50 mg”
- Extracted: dose_min=50 (PD biomarker threshold)
- Expected: dose_min=25, dose_max=500 (full enrolled range)
PK threshold as dose — Pub 134251:
- Abstract: responses and PFS reported overall, “target concentrations ≥100 mg/day”
- Extracted: dose_min=100 (PK target, not dose restriction)
- Expected: null dose fields or full enrolled range
Imaging agent dose — Pub 244477:
- Abstract: [¹⁷⁷Lu]Lu-PSMA-617 at 7.4 GBq (therapeutic), [⁶⁸Ga]Ga-PSMA-11 111–259 MBq (imaging)
- Extracted: dose_min=111, dose_max=259 (imaging tracer activity)
- Expected: dose fields for the therapeutic agent, not the imaging tracer
Missing dose_max — Pub 58814:
- Abstract: “14 pts at doses ≥1.5 mg/kg” across cohorts 0.5–2.5 mg/kg
- Extracted: dose_min=1.5, dose_max=null
- Expected: dose_min=1.5, dose_max=2.5
Missing dose_max — Pub 137619:
- Abstract: “patients at dose of 0.15 mg/kg or above” across escalation up to 0.18 mg/kg
- Extracted: dose_min=0.15, dose_max=null
- Expected: dose_min=0.15, dose_max=0.18
Downstream impact
Report rows show incorrect dose context: PK observations misrepresented as dosing, imaging agent doses shown instead of therapeutic doses, and incomplete dose ranges when max is omitted. Affects dose-response interpretation and cross-study comparisons.
Discovered during issue 29 backfill validation: 14 of 20 random publications with dose_min set had some form of dose extraction error. Categories overlap — a single pub may have both a PK threshold issue and a missing dose_max. Full scope is ~720 publications with dose_min set on trial_subgroups; exact breakdown by error type pending full validation run.
Explored solution direction
- Prompt fix (forward): Update Subgroup Dose Context to explicitly instruct:
- “Only extract the THERAPEUTIC drug dose, not PK/PD thresholds, biomarker cutoffs, or diagnostic/imaging agent doses.”
- “When dose_min is set from a ‘≥X’ pattern, also set dose_max to the highest dose level stated in the abstract.”
- Backfill: The issue 29 backfill validation (backfill_dose_scope_mismatch.thor) already identifies these problems via correct_dose_min / correct_dose_max in the LLM check. Remediation can directly patch the dose fields on trial_subgroups using the validated corrections, rather than a full re-extract.
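The dose_max rule can be stated precisely as a tiny post-extraction helper. A minimal sketch, assuming a list of dose levels has been parsed from the abstract (the helper name is hypothetical; the actual fix is a prompt instruction, not code):

```ruby
# Hypothetical helper for the "≥X" rule: when dose_min came from a "≥X"
# phrase and dose_max is null, fill dose_max with the highest stated dose
# level at or above dose_min.
def fill_dose_max(dose_min, dose_max, stated_dose_levels)
  return dose_max unless dose_max.nil? && dose_min
  stated_dose_levels.select { |d| d >= dose_min }.max
end

# Pub 58814: "14 pts at doses ≥1.5 mg/kg" across cohorts 0.5–2.5 mg/kg
fill_dose_max(1.5, nil, [0.5, 1.0, 1.5, 2.0, 2.5])  # => 2.5
# Pub 137619: "0.15 mg/kg or above" across escalation up to 0.18 mg/kg
fill_dose_max(0.15, nil, [0.05, 0.1, 0.15, 0.18])   # => 0.18
```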
Solution applied
Forward fix (2026-03-31): Added “DOSE VALUE FILTERING” instruction to the Subgroup Dose Context section in task.rb — instructs LLM to only extract therapeutic drug doses, explicitly rejecting PK/PD thresholds (e.g. “target saturation above X”), imaging/diagnostic agent doses (e.g. tracer activity in MBq), and biomarker/lab value cutoffs.
View fix (2026-03-31): vw_publication_efficacy_data v20 — added dose_context_type gate to the pub-level dose COALESCE fallback. For publications with dose_context_type of escalation or range, the view no longer falls back to the study-level dose range from publication_interventions.dose_evidence. Subgroups in escalation studies that genuinely need dose fields already have them set at the trial_subgroup level (via extraction or Issue 29 backfill), so they take COALESCE priority 1 and are unaffected. Fixes ~2,612 publications where study-level escalation ranges were bleeding into non-dose subgroups (disease cohorts, biomarker groups, Overall).
View fix v21 (2026-03-31): Extended dose_context_type gate to also block rp2d studies. RP2D publications store the full escalation range in dose_min/dose_max on publication_interventions.dose_evidence, not just the RP2D value — non-dose subgroups (biomarker, disease, Overall) were inheriting the escalation range. Affects ~1,581 additional publications (e.g. pub 48903 “Low HER2” showing 5.4–8.0 instead of null, pub 135119 “Overall” inheriting Q2W-LD arm dose).
Backfill validation (2026-03-31): Job 1694 re-validated ~720 pubs with subgroup-level dose_min. 981 subgroups OK, 42 new corrections identified (26 wrong range, 13 non-dose→null, 2 PK/PD, 1 radiation). Remediation pending.
36. cORR set equal to ORR when abstract distinguishes confirmed vs unconfirmed
Short summary
LLM extraction sets confirmed ORR (cORR) equal to the overall ORR instead of counting only confirmed responses. This is the reverse of Issue 27 — there, extract_efficacy_metrics picked the confirmed row for the plain ORR metric. Here, the LLM itself outputs the same value for both ORR and cORR during classify_publications, so the view and query faithfully reproduce the wrong value.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — classify_publications LLM extraction step. The prompt asks for both ORR and cORR as separate endpoints, but the LLM sometimes fails to distinguish confirmed from unconfirmed responses.
Exact restriction causing the issue
The LLM extraction prompt does not provide explicit guidance on how to compute cORR when the abstract itemizes confirmed vs unconfirmed responses (e.g., “1 confirmed CR, 2 confirmed and 3 unconfirmed PRs”). The LLM defaults to the total response count for both metrics.
Concrete examples
Pub 30362 (Petosemtamab+pembro 1L r/m HNSCC):
- Abstract: “1 confirmed complete response, 2 confirmed and 3 unconfirmed partial responses” out of 10 evaluable pts
- Expected: ORR = 60% (6/10), cORR = 30% (3/10 at cutoff)
- Extracted: ORR = 60%, cORR = 60% (identical — wrong)
Downstream impact
Clinical Evidence report shows inflated cORR identical to ORR, obscuring the distinction between confirmed and unconfirmed responses. This is a meaningful clinical difference — confirmed response rates are the regulatory-grade metric.
1 instance found in job 1634 (HNSCC+BsAb). Related to Issue 27 (which was a query-layer pick issue, now fixed). This is a distinct extraction-layer issue. Full-corpus scale TBD — would require comparing ORR vs cORR values across all publications where both are extracted.
Explored solution direction
- Prompt fix (forward): Add explicit instruction to classify_publications: “cORR counts ONLY confirmed responses (CR + confirmed PR). If the abstract lists unconfirmed responses separately, exclude them from cORR. If the abstract does not distinguish confirmed from unconfirmed, leave cORR null.”
- Backfill: Identify publications where cORR = ORR and the abstract contains language distinguishing confirmed/unconfirmed. Re-extract cORR with a targeted prompt.
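The count-based derivation the prompt asks for is simple arithmetic. A minimal sketch using the pub 30362 counts (the helper and keyword names are hypothetical):

```ruby
# Hypothetical derivation: ORR counts all responses; cORR counts only
# confirmed CR + confirmed PR, excluding unconfirmed responses.
def response_rates(confirmed_cr:, confirmed_pr:, unconfirmed_pr:, evaluable:)
  {
    orr:  100.0 * (confirmed_cr + confirmed_pr + unconfirmed_pr) / evaluable,
    corr: 100.0 * (confirmed_cr + confirmed_pr) / evaluable
  }
end

# Pub 30362: 1 confirmed CR, 2 confirmed + 3 unconfirmed PRs, 10 evaluable
response_rates(confirmed_cr: 1, confirmed_pr: 2, unconfirmed_pr: 3, evaluable: 10)
# => { orr: 60.0, corr: 30.0 }
```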
Solution applied
- Forward fix (prompt): Already in place — the classify_publications prompt (task.rb:206-250) has comprehensive instructions for confirmed/unconfirmed ORR handling, including count-based derivation.
- Backfill: one_off:backfill_confirmed_unconfirmed_orr backfill_issue36 — targets 57 publications where confirmed=true ORR has the same measure_value as confirmed=false/null ORR. Re-extracts using a focused LLM prompt. Scope is structural (no text matching): joins ORR TOMs with matching values across confirmed flags.
37. Mean survival values extracted as median
Short summary
When an abstract reports mean OS or PFS (rather than median), the LLM extracts the numeric value without flagging the statistic type. The pipeline has no field to distinguish mean from median, so mean values are silently presented as median in the Clinical Evidence report.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/task.rb — classify_publications extraction. The measure_value field captures a numeric value but has no companion field for the statistic type (mean vs median).
Exact restriction causing the issue
The LLM extraction schema defines survival endpoints with measure_value (numeric) and measure_unit (e.g., “months”) but has no statistic_type field. When the abstract says “mean OS = 25.3 months”, the LLM outputs 25.3 with unit “months”, indistinguishable from a median.
Concrete examples
Pub 51969 (FDG-PET target delineation SCCHN, CT-95):
- Abstract: “The mean OS was 25.3 months (95% CI, 22.5-28.1) and mean PFS was 23.2 months (95% CI, 20.3-26.1)”
- Extracted: OS = 25.3 months, PFS = 23.2 months (no indication these are means)
- Expected: Either null (mean is not the standard metric), or extracted with a flag indicating “mean”
Downstream impact
Report consumers assume survival values are medians (the standard in oncology). Mean survival overestimates the “typical” outcome when distributions are right-skewed (common in survival data). This creates misleading cross-study comparisons.
What the issue is not
This is not about rounding or approximation — the numeric value is correct. The problem is the absence of metadata distinguishing the statistic type.
1 instance found in job 1634 (HNSCC+BsAb). Mean survival reporting is uncommon in oncology abstracts (median is standard), so corpus-wide scale is likely small. Could identify candidates by searching abstracts for “mean OS” or “mean PFS” patterns.
Explored solution direction
Two approaches:
- Null approach: Update prompt to instruct: “Only extract median survival values. If the abstract reports mean (not median) OS or PFS, leave the value null.” Simple, preserves existing schema.
- Schema approach: Add a statistic_type field (enum: median, mean) to the outcome measure schema. More informative but requires a schema migration, view update, and query changes.
Option 1 is recommended for now — mean survival is rare and the null correctly signals “no standard median reported.”
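The null approach amounts to a guard like the following. This is a hypothetical sketch (the real fix would be a prompt instruction, and matching deliberately stays simple, so phrasings like “mean overall survival” would need their own pattern):

```ruby
# Hypothetical guard for the null approach: drop a survival value when the
# abstract reports it as a mean rather than a median.
def median_survival_or_nil(endpoint_abbrev, value, abstract)
  abstract.match?(/\bmean\s+#{Regexp.escape(endpoint_abbrev)}\b/i) ? nil : value
end

median_survival_or_nil("OS", 25.3, "The mean OS was 25.3 months")    # => nil
median_survival_or_nil("PFS", 6.8, "The median PFS was 6.8 months")  # => 6.8
```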
Solution applied
(empty — pending implementation)
38. Biomarker subgroups in secondary analyses not identified by extract_subgroups
Short summary
extract_subgroups misses biomarker-defined subgroups (e.g., p16+ oropharyngeal) when they appear as secondary efficacy analyses within the results section rather than as pre-specified study arms or primary subgroups. The efficacy data is present in the abstract but never enters the pipeline because the subgroup is not identified in the first extraction step.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/subgroup_extraction.rb — extract_subgroups runs BEFORE classify_publications. If a subgroup is not identified here, no efficacy data is extracted for it downstream.
Exact restriction causing the issue
The extract_subgroups prompt focuses on pre-specified study populations and arms. When an abstract reports a secondary analysis like “Of the 8 pts with p16+ oropharyngeal disease, 4 had confirmed responses (ORR 50%)”, this biomarker subgroup is not captured because it appears as an exploratory result rather than a defined study arm.
Concrete examples
Pub 48310 (Petosemtamab+pembro PD-L1+ r/m HNSCC):
- Abstract: “Of the 8 pts with p16+ oropharyngeal disease, 4 had confirmed responses (ORR 50%)”
- Extracted subgroups: “PD-L1+ HNSCC” only (the primary population)
- Missing: “p16+ oropharyngeal” subgroup with ORR = 50% (4/8)
Pub 238083 (same trial, same abstract — duplicate per Issue 17):
- Same missing “p16+ oropharyngeal” subgroup
Downstream impact
Biomarker-defined efficacy signals are lost from the Clinical Evidence report. For HNSCC specifically, p16/HPV status is a critical prognostic and predictive biomarker that clients expect to see.
What the issue is not
This is distinct from Issue 33 (cross-tabulated subgroups in basket trials). Issue 33 addressed multi-dimensional subgroup identification (tumor type × biomarker) in basket trial designs. This issue is about single-dimension biomarker subgroups in secondary analyses of non-basket trials.
Originally 2 instances found in job 1634 (HNSCC+BsAb), both from the same trial (duplicate pubs). Full-corpus LLM screening (gpt-5-mini) of ~18,500 candidate pubs identified 1,730 (9.2%) with potentially missed biomarker subgroups. Verified on pubs 549 (LAG-3 expression ORR 28% vs 7.7%) and 44673 (TP53 wild-type ORR 79%, CRc 74%, MRD neg 76%, 3yr OS 51%).
Explored solution direction
- Prompt fix (forward): Expand the extract_subgroups prompt to include: “Also identify biomarker-defined subgroups reported in secondary/exploratory analyses within the results section (e.g., ‘Of the N patients with [biomarker], ORR was X%’). These should be captured as child subgroups of the primary population even if not pre-specified as study arms.”
- Backfill: Identify publications with biomarker language in the abstract (regex patterns) where no corresponding subgroup exists. Re-run extract_subgroups with the updated prompt for those publications.
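The backfill's candidate scoping can be sketched as a word-boundary regex check. A minimal, hypothetical version (the real Thor task builds its regex dynamically from the Biomarker model and uses Postgres word boundaries; the list here is a stand-in):

```ruby
# Hypothetical candidate filter: flag a publication when a known biomarker
# appears in the abstract but no biomarker-tagged subgroup exists for it.
KNOWN_BIOMARKERS = ["p16", "HER2", "EGFR", "KRAS", "BRAF", "PD-L1"].freeze

def biomarker_backfill_candidate?(abstract, tagged_biomarkers)
  mentioned = KNOWN_BIOMARKERS.select { |b| abstract.match?(/\b#{Regexp.escape(b)}\b/i) }
  (mentioned - tagged_biomarkers).any?
end

# Pub 48310 pattern: p16 reported in a secondary analysis, no p16 subgroup
biomarker_backfill_candidate?(
  "Of the 8 pts with p16+ oropharyngeal disease, 4 had confirmed responses (ORR 50%)", []
)
# => true
```

As the screening results above show, regex scoping alone over-flags (biomarkers also appear in baseline characteristics), which is why an LLM screen follows this filter.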
Solution applied
Forward fix (2026-03-30): Added Step 2c to extract_subgroups SYSTEM_PROMPT in subgroup_extraction.rb. Explicitly instructs the LLM to capture biomarker-defined subgroups from secondary/exploratory analyses within the Results section, even when not pre-specified as study arms. Includes guard against over-extraction (only capture when numeric efficacy result is present, not for biomarkers in baseline characteristics or background).
Backfill (2026-03-30): lib/tasks/one_off/backfill_biomarker_secondary_subgroups.thor — three-phase Thor task (identify → screen → remediate). Candidate set: ~18,500 pubs with biomarker mentions in abstract (dynamic regex built from Biomarker model with Postgres word boundaries) but no biomarker-tagged trial_subgroups record. LLM screen (gpt-5-mini) checked each abstract against its existing subgroup list. Results: 18,742 screened, 1,730 flagged (9.2%). Remediation reset flagged pubs, followed by full pipeline re-run (extract_subgroups job 1682 + classify_publications job 1683). Final state: 1,718/1,730 fully reprocessed, 11 pending extraction, 1 pending classification.
Known gap (addressed 2026-03-31): The candidate query excluded any pub with existing biomarker-tagged subgroups (16,678 pubs skipped). Added screen_partial + remediate_partial commands to screen these pubs using the same LLM prompt. No regex pre-filter — LLM decides. Local validation: pub 29704 (known gap) correctly flagged; 3/10 flagged on single-biomarker pubs, 0/10 on random multi-biomarker pubs. Cross-tabulated pubs with partial biomarker coverage are separately handled by Issue 43.
Partial screening results (2026-03-31, job 1691): screen_partial completed on all 16,709 pubs with existing biomarker subgroups. Results: 1,483 flagged (8.9%), 15,226 clean, 0 remaining. Flag rate consistent with original screen (9.2%). Domain expert review of 5 random flagged pubs: 4/5 strong true positives, 1 plausible.
Prompt fixes for re-extraction quality (2026-03-31): Local testing of the full pipeline (extract_subgroups → classify_publications) on 14 flagged pubs revealed two regression patterns, both fixed with prompt changes:
- Biomarker tag loss on biomarker-defined populations (task.rb): When the overall study population is defined by a biomarker (e.g., “mIDH2 ND-AML”), the classify_publications prompt rule “overall must never combine with other tags” caused the biomarker tag to be dropped. Fixed by allowing overall to combine with biomarker and/or disease. Also added a biomarker inheritance instruction: child subgroups (e.g., “mIDH2 ND-AML → CR”) must carry the same biomarker tag and biomarkers array as their biomarker-qualified parent. Validated on pub 120034 (IDH2) and pub 119668 (ABL-class fusion) — all children now correctly tagged.
- Molecular qualifier dropped in subgroup naming (subgroup_extraction.rb): When a study population is defined by both a molecular feature and a clinical feature (e.g., “ABL-class fusion patients who responded slowly”), extract_subgroups sometimes chose only the clinical label (“Slow induction responders”), losing the biomarker context entirely. Fixed by adding a compound baseline instruction: when efficacy results are reported for a population defined by multiple qualifiers, biomarker/molecular qualifiers must never be dropped. Validated on pub 119668 — now correctly produces “ABL-class fusion patients → TKI group”.
Validation also confirmed correct behavior: Lab values (albumin, NLR, WBC, platelet count) are now properly tagged as risk_group rather than biomarker (pubs 52188, 154203). HPV-naive populations tagged as population, TAA immune responses tagged as response_status (pubs 138809, 141909). These reclassifications are correct — biomarker tag is reserved for molecular/genomic features relevant to population selection (HER2, EGFR, IDH2, etc.).
Status: Prompt fixes validated locally on 14 pubs. Pending deployment before running remediate_partial + pipeline re-run on the 1,483 flagged pubs.
Investigation notes: Concrete examples (pubs 48310, 238083) now capture the p16+ oropharyngeal subgroup with the current prompt. Regex-based scale estimation was inconclusive — cannot reliably distinguish “biomarker in baseline characteristics” from “biomarker-stratified efficacy in secondary analysis.” LLM screening required to determine true scale.
39. Multi-drug randomized trial dose cross-contamination
Short summary
In randomized trials comparing multiple investigational drugs (each in its own arm), the view shows all drugs’ doses on every arm instead of scoping each dose to its own arm. Originally thought to be an LLM extraction issue, but investigation revealed the per-drug dose_evidence extraction is correct — the contamination happens in the view layer.
Where this sits in the current pipeline
db/views/vw_publication_efficacy_data_v21.sql — the drug_interventions LEFT JOIN to arm_outcomes_expanded. When publication_interventions.study_plan_arm_id is NULL (common for publications without clinical trial linkage), the join condition di.study_plan_arm_id IS NULL creates a cross-product: every intervention joins to every arm, so each arm gets rows with all drugs’ doses.
Exact restriction causing the issue
Section titled “Exact restriction causing the issue”The drug_interventions join in raw_rows uses di.study_plan_arm_id IS NULL as a pass-through condition that matches any arm. For multi-drug publications with 3 interventions and 3 arms, this creates 9 combinations instead of 3 — each arm gets dose rows from all 3 drugs.
Concrete examples
Section titled “Concrete examples”Pub 239841 (Ivonescimab vs Cadonilimab vs Penpulimab neoadjuvant HNSCC):
- publication_interventions correctly extract per-drug doses: ivonescimab=10 mg/kg, cadonilimab=6 mg/kg, penpulimab=200 mg
- But the view shows 3 dose rows per arm (one per intervention), so the Ivonescimab arm shows 10 mg/kg, 6 mg/kg, AND 200 mg
Downstream impact
Section titled “Downstream impact”Report rows for each arm show all drugs’ doses as separate rows. Clinically misleading — ivonescimab at 6 mg/kg vs 10 mg/kg is a meaningful difference, and penpulimab’s fixed 200 mg dose is a completely different dosing paradigm than weight-based 6 mg/kg.
What the issue is not
Section titled “What the issue is not”This is NOT an LLM extraction issue. The dose_evidence_extraction pipeline correctly extracts per-drug doses on publication_interventions. The contamination is view-layer: the join creates a cross-product when study_plan_arm_id is NULL.
Related to Issue 31 (view-layer COALESCE bleed onto control arms) — same family of view join scoping issues.
7,258 publications have multiple distinct interventions. Of those, the fix only changes behavior for pubs where intervention names appear in arm names (enabling name-based scoping). Pubs with generic arm names (“Control”, “Intervention”) keep the existing cross-join behavior.
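The name-based scoping can be sketched in plain Ruby (a hypothetical illustration of the view logic, not the actual SQL): a drug's dose rows attach only to arms whose name contains the intervention name, mirroring LOWER(arm_name) LIKE '%' || LOWER(intervention_name) || '%', with a cross-join fallback for generic arm names.

```ruby
# Hypothetical sketch of the name-based arm scoping; the real fix lives in
# the v22 view SQL, not in Ruby.
def scope_doses_to_arms(arms, interventions)
  interventions.flat_map do |iv|
    matching = arms.select { |arm| arm.downcase.include?(iv[:name].downcase) }
    matching = arms if matching.empty? # fallback: keep cross-join for generic arm names
    matching.map { |arm| { arm: arm, drug: iv[:name], dose: iv[:dose] } }
  end
end

arms = ["Ivonescimab arm", "Cadonilimab arm", "Penpulimab arm"]
interventions = [
  { name: "Ivonescimab", dose: "10 mg/kg" },
  { name: "Cadonilimab", dose: "6 mg/kg" },
  { name: "Penpulimab",  dose: "200 mg" },
]
rows = scope_doses_to_arms(arms, interventions) # 3 scoped rows, not a 9-row cross-product
```

With generic arm names ("Control", "Intervention") no intervention matches, so every intervention falls back to every arm, preserving the existing behavior exactly as described above.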
Explored solution direction
Section titled “Explored solution direction”- View fix: In the drug_interventions join, for Source 0 multi-drug publications where study_plan_arm_id IS NULL, match each intervention to its arm by checking if intervention_name appears in arm_name. Falls back to cross-join when name matching is not feasible (generic arm names).
- Prompt fix (defense-in-depth): Added instruction to the classify_publications SYSTEM_PROMPT for multi-drug trials to extract arm-specific doses in subgroup dose fields.
Solution applied
Section titled “Solution applied”- View v22 (db/views/vw_publication_efficacy_data_v22.sql):
  - Added multi_drug_pubs CTE to identify publications with 2+ distinct interventions.
  - Modified the drug_interventions join: for Source 0 multi-drug pubs, requires LOWER(arm_name) LIKE '%' || LOWER(intervention_name) || '%' to scope each intervention to its matching arm.
  - Safe fallback: if an intervention doesn’t match ANY arm by name, reverts to cross-join (no data loss for pubs with generic arm names).
- Prompt fix (app/tasks/publications_llm_classification/task.rb):
  - Added “MULTI-DRUG RANDOMIZED TRIALS” instruction to the Subgroup Dose Context block.
  - Forward prevention for subgroup-level dose extraction.
40. Hierarchical subgroup rows in view lose N from flat counterparts
Section titled “40. Hierarchical subgroup rows in view lose N from flat counterparts”Short summary
Section titled “Short summary”The LLM extraction creates both flat subgroups (IHC3+) and hierarchical subgroups (RAS wild-type mCRC → Cohort A → IHC3+) as separate trial_subgroup → trial_outcome_measure → trial_arm_outcome chains. When the flat version has number_of_participants but the hierarchical copy doesn’t, ClinicalEvidenceQuery picks the hierarchical row (due to disease filtering) and reports null N.
Related to Issue 26 (parent N propagation) but distinct: Issue 26 is extraction-layer (LLM copies parent N to child subgroups). Issue 40 is post-processing-layer (hierarchical copies don’t carry forward N from their flat counterparts).
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/post_process.rb — creates both flat and hierarchical trial_arm_outcomes. The hierarchical copy’s N comes from the LLM output, which often omits it. Existing null_out_propagated_parent_n (line 565) handles the inverse case (removing incorrectly copied parent N).
Investigation findings (2026-04-01)
Section titled “Investigation findings (2026-04-01)”Most audit examples were false positives — null N is correct because the abstract doesn’t state per-subgroup N:
- Pub 134450 (MRG003 phase 1b): Abstract states N=39 for overall Phase 1b, gives per-disease ORR (SCCHN 40%, NPC 44%, CRC 0%) but never states per-disease N. Null N on Phase 1b dose expansion → CRC is accurate.
- Pub 67379 (ROME trial): Abstract states 200/200 randomized overall. hTMB/MSS exploratory analysis gives PFS and HR but never states subgroup N. Null N is accurate.
- Pub 200353 (T-DXd DESTINY-CRC02 biomarker): EGFR amplification mentioned as prognostic factor but no N given. Not even a hierarchical issue — this is a flat subgroup with legitimately unstated N.
Only pub 48926 is a real bug:
- Pub 48926 (DESTINY-CRC01 updated): IHC3+ flat has N=40, ORR=57.5. RAS wild-type mCRC → Cohort A → IHC3+ hierarchical has N=null, ORR=57.5. Same for IHC2+/ISH+ (13 vs null) and prior anti-HER2 therapy (16 vs null). The LLM extracted N for the flat version but not the hierarchical copy.
Real scope: 182 trial_arm_outcomes across 59 publications where the flat counterpart has N but the hierarchical copy doesn’t. Of ~32,874 hierarchical TAOs with null N, 32,692 have flat counterparts that also have null N (abstract doesn’t state it), and 182 have flat counterparts with N (extraction gap).
Explored solution direction
Section titled “Explored solution direction”- Post-process fix: Add a propagate_flat_n_to_hierarchical method in post_process.rb (sibling to the existing null_out_propagated_parent_n) to carry forward N from flat counterparts after all subgroups are created. Plus a backfill task for the existing 182 records.
- Prompt fix: Instruct the LLM to always carry N when creating hierarchical subgroups from data it already extracted for flat counterparts.
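The deferred post-process step could look roughly like this (a sketch over simplified hashes; the real implementation would operate on ActiveRecord records in post_process.rb):

```ruby
# Sketch of the proposed propagate_flat_n_to_hierarchical step: when a
# hierarchical subgroup's leaf label has a flat counterpart with a stated N,
# carry that N forward. Hash shapes are illustrative assumptions.
def propagate_flat_n_to_hierarchical(subgroups)
  flat_n = subgroups.reject { |s| s[:name].include?("→") }
                    .to_h { |s| [s[:name], s[:n]] }
  subgroups.map do |s|
    leaf = s[:name].split("→").last.strip
    s[:n].nil? && flat_n[leaf] ? s.merge(n: flat_n[leaf]) : s
  end
end

# Pub 48926 shape: flat IHC3+ has N=40, the hierarchical copy has N=null
subgroups = [
  { name: "IHC3+", n: 40 },
  { name: "RAS wild-type mCRC → Cohort A → IHC3+", n: nil },
]
propagate_flat_n_to_hierarchical(subgroups)
```

Hierarchical subgroups whose flat counterpart also lacks N (the 32,692-row majority above) are left untouched, since there is nothing to propagate.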
Solution applied
Section titled “Solution applied”Downscoped (2026-04-01): Investigation revealed most audit examples are false positives — null N is correct because the source abstracts don’t state per-subgroup patient counts. Real bug scope is narrow (182 TAOs / 59 pubs). Post-process propagation fix deferred as low priority.
41. Safety data cross-contamination between dose arms
Section titled “41. Safety data cross-contamination between dose arms”Short summary
Section titled “Short summary”In multi-arm dose-optimization studies, safety metrics (patient_number_safety, discontinuation rate) from one dose arm are attributed to a different dose arm. The safety extraction doesn’t scope by arm, so values from the most prominent or first-mentioned arm bleed onto sibling arms.
Related to Issue 31 (dose field bleed onto control arms via view COALESCE) but distinct: Issue 31 was view-layer dose field propagation onto control arms. Issue 41 is extraction/query-layer safety data misattribution between experimental dose arms.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/queries/tpp/clinical_evidence_query.rb — extract_safety_metrics_for_publication method. Safety data is queried by publication_id and optionally study_plan_arm_id, but when study_plan_arm_id is null (common for publication-extracted arms), safety data cannot be scoped to a specific arm.
Concrete examples
Section titled “Concrete examples”- Pub 116843 (Temab-A + Bev): Safety N=30 and discontinuation=3% attributed to the 2.0 mg/kg arm, but abstract reports these for the 2.4 mg/kg arm (n=30). The 2.0 mg/kg arm has n=26 with no discontinuation data stated.
- Pub 49900 (M9140 dose optimization): Safety N=29 attributed to 2.4 mg/kg arm, but 29 is the 2.8 mg/kg arm size. The 2.4 mg/kg arm has n=31.
3 audit issues from CRC+ADC audit (2026-03-30). Likely affects multi-arm dose-optimization studies where safety is discussed arm-by-arm in the abstract but study_plan_arm_id is null.
Explored solution direction
Section titled “Explored solution direction”- Extraction fix: When safety data is extracted per arm in the abstract, ensure arm-specific safety N and discontinuation rates are stored with correct arm attribution.
- Query fix: In extract_safety_metrics_for_publication, when multiple arms exist, attempt to match safety data to the correct arm by arm name or dose level.
Solution applied
Section titled “Solution applied”Query-layer forward fix (2026-03-30): Extracted scope_safety_results_to_arm helper used by extract_safety_metrics_for_publication and extract_ranked_named_ae_summaries in both ClinicalEvidenceQuery and EmergingClinicalDataQuery. Two-tier arm matching:
- Primary: Match by study_plan_arm_id (when present)
- Fallback: Match by arm_name using exact normalized comparison (downcase + whitespace normalization) — avoids false positives between similar dose levels (e.g. “2.0 mg/kg” vs “2.4 mg/kg”)
- Guard: When neither match succeeds and safety data contains multiple distinct arms, return empty rather than falling back to a wrong arm’s data. Single-arm or publication-level safety data (no arm differentiation) still falls through correctly.
This fixes both contamination patterns: (a) pub 49900 where study_plan_arm_id is null but arm_name distinguishes arms, and (b) pub 116843 where study_plan_arm_id exists but the requested arm has no safety data (guard prevents borrowing from a sibling arm).
No backfill needed — regenerating the clinical evidence report produces correct arm-scoped safety data.
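A rough reconstruction of that two-tier logic (data shapes are assumptions; the real helper lives in ClinicalEvidenceQuery and EmergingClinicalDataQuery):

```ruby
# Illustrative version of scope_safety_results_to_arm, not the production code.
def scope_safety_results_to_arm(safety_rows, arm_id:, arm_name:)
  norm = ->(s) { s.to_s.downcase.gsub(/\s+/, " ").strip }

  # Tier 1: exact study_plan_arm_id match when present
  by_id = safety_rows.select { |r| arm_id && r[:study_plan_arm_id] == arm_id }
  return by_id unless by_id.empty?

  # Tier 2: exact normalized arm_name match ("2.0 mg/kg" never matches "2.4 mg/kg")
  by_name = safety_rows.select { |r| norm.(r[:arm_name]) == norm.(arm_name) }
  return by_name unless by_name.empty?

  # Guard: only publication-level data (no arm differentiation at all) may fall
  # through; otherwise return nothing rather than borrow a sibling arm's data
  undiff = safety_rows.all? { |r| r[:study_plan_arm_id].nil? && norm.(r[:arm_name]).empty? }
  undiff ? safety_rows : []
end

# Pub 49900 shape: arm-named safety rows, no study_plan_arm_id linkage
rows = [
  { study_plan_arm_id: nil, arm_name: "2.4 mg/kg", safety_n: 31 },
  { study_plan_arm_id: nil, arm_name: "2.8 mg/kg", safety_n: 29 },
]
scope_safety_results_to_arm(rows, arm_id: nil, arm_name: "2.4 mg/kg")
```

The exact-match comparison in tier 2 is what prevents the original contamination: a substring or fuzzy match would let "2.4 mg/kg" data attach to the "2.0 mg/kg" arm again.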
42. Tumor shrinkage rate confused with RECIST ORR
Section titled “42. Tumor shrinkage rate confused with RECIST ORR”Short summary
Section titled “Short summary”The LLM extracts “any tumor reduction” or “tumor shrinkage rate” as ORR, when these are distinct from RECIST-defined objective response rate. Tumor shrinkage includes minor reductions (e.g. 0-20% decrease) that don’t meet RECIST PR threshold (≥30% decrease). This can dramatically inflate the reported ORR.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”- Pub 162304 (IMMU-130 phase I/II in mCRC): Abstract reports “Tumor reductions were seen in 23/66 (35%) pts, including one PR.” The LLM extracted ORR=35%, but the actual RECIST ORR is ~1.5% (1/66 PR). The 35% is any-shrinkage rate, not objective response.
1 instance found in CRC+ADC audit (2026-03-30). Scale TBD — need to investigate how many publications report non-RECIST shrinkage rates that could be confused with ORR. Likely uncommon but high-impact when it occurs (35% vs 1.5% is a massive error).
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction to classify_publications: “ORR (objective response rate) must be based on RECIST criteria (CR + PR). Do not use ‘any tumor reduction’, ‘tumor shrinkage rate’, or ‘disease control rate’ as ORR. If the abstract reports tumor shrinkage without specifying RECIST criteria, and separately reports a lower confirmed PR/CR rate, use the PR/CR rate as ORR.”
Solution applied
Section titled “Solution applied”Forward fix (2026-04-03): Added ORR definition instruction to classify_publications prompt (section 3d): “ORR must be based on RECIST criteria (CR + PR). Do not extract tumor shrinkage rate or any-tumor-reduction as ORR.” Will be picked up by Issue 49 backfill re-extraction (3,943 target-disease pubs, PROMPT_VERSION=1).
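The magnitude of the error is easy to show with a one-line derivation (a sanity-check sketch, not an existing pipeline step):

```ruby
# When an abstract states both an any-shrinkage fraction and confirmed CR/PR
# counts, RECIST ORR should be derived from CR + PR, never from the shrinkage
# rate. Method name is invented for illustration.
def recist_orr(cr:, pr:, n:)
  ((cr + pr) * 100.0 / n).round(1)
end

# Pub 162304: "Tumor reductions were seen in 23/66 (35%) pts, including one PR."
recist_orr(cr: 0, pr: 1, n: 66) # ~1.5%, not the 35% any-shrinkage rate
```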
43. Cross-tabulated subgroups only extracted for highest-response HER2 level
Section titled “43. Cross-tabulated subgroups only extracted for highest-response HER2 level”Short summary
Section titled “Short summary”Residual gap in Issue 33 (cross-tabulated subgroups). The Issue 33 backfill correctly flagged pubs and re-extracted cross-tabs, but the LLM only creates disease × biomarker cross-products for the most prominent biomarker level (typically the one with highest response rates). Lower-response or zero-response cross-tabulated subgroups are skipped.
Related to Issue 33 — same pipeline layer (extract_subgroups + classify_publications), but the fix and backfill worked for the dominant cross-tab; the LLM selectively omits cross-tabs where responses are low or absent.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/subgroup_extraction.rb — extract_subgroups creates the subgroup × biomarker cross-products. The prompt instructs creation of cross-tabs but doesn’t emphasize completeness across all biomarker levels.
Concrete examples
Section titled “Concrete examples”-
Pub 72043 (SHR-A1811 phase 1, non-breast solid tumors): Abstract table shows ORR by tumor type × HER2 status (IHC3+, IHC2+, IHC1+, mutation/amp) for each of BTC, UC, GC/GEJA, CRC, and Other. After Issue 33 backfill (needs_cross_tab_reextraction=true), only IHC3+ cross-tabs were extracted per tumor type:
- CRC → HER2 IHC3+ ✓ (ORR 100%, 3/3)
- CRC → HER2 IHC2+ ✗ (ORR 0%, 0/3) — missing
- CRC → HER2 IHC1+ ✗ (ORR 0%, 0/1) — missing
- CRC → HER2 mutation/amp ✗ (ORR 0%, 0/3) — missing

The overall HER2 subgroups exist (Non-breast STs → HER2 IHC1+/2+/3+) but disease × HER2 cross-tabs only exist for IHC3+.
3 audit issues from pub 72043 in CRC+ADC audit (2026-03-30). Likely affects other basket trials where the cross-tab table has zero-response cells — the LLM treats 0% ORR subgroups as not worth extracting. Scale TBD.
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Strengthen extract_subgroups to explicitly require all cells in a cross-tabulated table, including zero-response cells: “When a table cross-tabulates tumor type × biomarker status, create subgroups for ALL cells in the table, including those with 0% ORR or 0 responders. A zero-response subgroup is clinically meaningful data, not an absence of data.”
- Targeted re-extraction: Force re-extract specific pubs where the cross-tab is incomplete (e.g. --publication_ids=72043).
Solution applied
Section titled “Solution applied”Forward fix (2026-03-30): Two changes:
- subgroup_extraction.rb: Rewrote Step 2b cross-tab instruction to explicitly require ALL table cells including zero-response results (“0% ORR”, “0/3”). Previous wording (“Do NOT create subgroups for empty cells”) caused the LLM to treat zero-response cells as empty. New wording distinguishes zero responses (clinically meaningful) from truly empty/NE/NA cells.
- post_process.rb: Fixed Issue 8 regression — the all-zero measure_value guard now only nulls when all arms also have nil/zero number_of_participants (fabrication signal). Real 0% ORR with stated N (e.g., 0/3 → N=3) is preserved. The N=0→nil guard at line 365 ensures fabricated N values are already nil, so the check reliably distinguishes fabrications from real data.
- task.rb: Added classify instruction for zero-response extraction from cross-tabulated tables (“0/3” = 0% with N=3) and abstain-when-ambiguous for garbled table parsing. Also added a second-pass zero guard in post_process.rb after null_out_propagated_parent_n to catch fabricated zeros that initially bypassed the first-pass guard via copied parent N.
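The revised guard condition can be illustrated in isolation (simplified hashes standing in for trial_arm_outcome records; the real guard lives in post_process.rb):

```ruby
# An all-zero measure_value block is treated as a fabrication (and nulled)
# only when every arm also lacks a stated number_of_participants; a real
# 0% ORR with N=3 (from "0/3") is preserved. Method name is illustrative.
def fabricated_zeros?(arm_outcomes)
  all_zero = arm_outcomes.all? { |o| o[:measure_value].to_f.zero? }
  no_n     = arm_outcomes.all? { |o| o[:number_of_participants].to_i.zero? }
  all_zero && no_n
end

fabricated_zeros?([{ measure_value: 0, number_of_participants: 3 }])   # real 0% ORR, keep
fabricated_zeros?([{ measure_value: 0, number_of_participants: nil }]) # fabrication, null out
```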
Backfill (2026-03-30–31): backfill_cross_tabulated_subgroups.thor — three commands run in prod:
- screen_zero_response (job 1688, 2026-03-30): LLM screen (gpt-5-mini) of all 5,348 pubs with disease × biomarker cross-tab subgroups. Compared each abstract’s cross-tab structure against existing subgroups to find missing zero-response cells. Flagged with needs_zero_response_reextraction.
- rescreen_zero_response (job 1693, 2026-03-30): Tighter second-pass screen to reduce false positives. Result: 234 confirmed, 5,114 rejected.
- remediate_screened (job 1689, 2026-03-31): Reset 234 flagged pubs (destroyed trial_subgroups, cleared llm_data subgroup fields, set llm_data_processed = false).
Pipeline re-run pending: extract_subgroups → classify_publications → post_process_publications on the 234 remediated pubs.
44. PFS/OS event count extracted as number_of_participants
Section titled “44. PFS/OS event count extracted as number_of_participants”Short summary
Section titled “Short summary”In survival analysis tables that report “median (95% CI) events n/N”, the LLM extracts the event count numerator as number_of_participants instead of the denominator. The “events n/N” fraction (e.g. “23/31”) looks similar to a response fraction to the LLM, but the numerator is the number of events (deaths/progressions), not the number of patients.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”- Pub 235204 (Telisotuzumab adizutecan ctDNA biomarker): Table row reads “mPFS 5.3 (4.5, 5.9) 23/31” for SD → MR positive (methylation panel). LLM extracted N=23 (PFS events) instead of N=31 (patients in subgroup). The correct N (31) matches “MR in pts with SD: methylation panel 31/53 (58%)”.
1 instance found in CRC+ADC audit (2026-03-30). Scale TBD — this table format (“value (CI) events n/N”) is standard in survival analysis reporting across oncology publications. Likely affects many publications reporting PFS/OS with event fractions. Need to assess by searching for publications where number_of_participants on a PFS/OS endpoint is less than number_of_participants on a sibling ORR endpoint for the same subgroup.
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction: “In survival tables, when you see a format like ‘median (CI) n/N’, the N is the number of patients (use as number_of_participants) and n is the number of events (do not use as number_of_participants). Events n/N is NOT a response fraction.”
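The cell format this rule targets can be made concrete with a small parser (hypothetical; the regex and method name are illustrative only, the actual fix is prompt-level):

```ruby
# Parses survival-table cells of the form "median (95% CI) events n/N",
# showing which number belongs in number_of_participants.
SURVIVAL_CELL = /\A(?<median>[\d.]+)\s*\((?<ci>[^)]*)\)\s*(?<events>\d+)\/(?<n>\d+)\z/

def parse_survival_cell(cell)
  m = SURVIVAL_CELL.match(cell.strip) or return nil
  {
    median: m[:median].to_f,
    ci: m[:ci],
    events: m[:events].to_i,            # deaths/progressions (not patients)
    number_of_participants: m[:n].to_i, # the denominator is the subgroup N
  }
end

# Pub 235204 table cell: "5.3 (4.5, 5.9) 23/31" -> N=31 patients, 23 events
parse_survival_cell("5.3 (4.5, 5.9) 23/31")
```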
Solution applied
Section titled “Solution applied”Forward fix (2026-04-03): Added Sub-rule 6 to classify_publications prompt (section 3e): “In survival tables, n/N means n events out of N patients. Use N (denominator) as number_of_participants, NOT n (numerator).” Will be picked up by Issue 49 backfill re-extraction (3,943 target-disease pubs, PROMPT_VERSION=1).
45. Biomarker-tested denominator used as subgroup N instead of positive subset
Section titled “45. Biomarker-tested denominator used as subgroup N instead of positive subset”Short summary
Section titled “Short summary”When an abstract reports biomarker retention or status as “X/Y pts had [biomarker]”, the LLM uses Y (the tested population) as the subgroup’s number_of_participants instead of X (the biomarker-positive subset count). The subgroup is defined by having the biomarker, so its N should be X.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”- Pub 74193 (T-DM1 HERACLES-RESCUE): Abstract says “HER2 IHC 3+/amplification was retained on circulating tumor DNA in 2/3 pts.” The subgroup “HER2 retained in ctDNA” should have N=2 (the pts who retained HER2), but LLM extracted N=3 (the tested population).
1 instance found in CRC+ADC audit (2026-03-30). Issue-specific scale still TBD — biomarker retention/status reporting (“X/Y had [biomarker]”) is common in correlative analyses. Need to assess by searching for subgroups containing “retained”, “positive”, “expressing” etc. where N matches the denominator of the defining fraction rather than the numerator.
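A minimal sketch of that audit heuristic, over simplified inputs (the regex and method name are invented for illustration):

```ruby
# For a subgroup defined by a qualifying fraction "X/Y pts", the subgroup N
# should be the numerator X; flag records where the extracted N equals the
# denominator Y instead.
QUALIFYING_FRACTION = /(?<x>\d+)\/(?<y>\d+)\s+pts?/

def denominator_used_as_n?(defining_sentence, extracted_n)
  m = QUALIFYING_FRACTION.match(defining_sentence) or return false
  extracted_n == m[:y].to_i && extracted_n != m[:x].to_i
end

sentence = "HER2 IHC 3+/amplification was retained on circulating tumor DNA in 2/3 pts"
denominator_used_as_n?(sentence, 3) # pub 74193 bug shape: flagged
denominator_used_as_n?(sentence, 2) # correct N: not flagged
```

Note the second condition deliberately leaves "X/X" fractions unflagged, since numerator and denominator are then indistinguishable.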
Precursor rollout scope for the target disease slice was sized from bioloupe-db-prod on 2026-03-31:
- Full target scope (including 4116 / Solid Tumors): 123,207 subgroup rows across 45,397 publications
- Filtered query scope (excluding 4116 from querying, but leaving Publication::TARGET_DISEASE_IDS unchanged in code): 15,867 subgroup rows across 6,443 publications
- Filtered-scope breakdown: 4,194 rows matched directly via trial_subgroups.disease_id; 11,673 rows came from the trial_disease_details fallback when subgroup disease was null
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction for qualifying subset N-counting, generalized beyond biomarkers.
Solution applied
Section titled “Solution applied”Precursor (2026-03-31): population_role metadata rollout via deterministic inference + LLM fallback. Ran in production (one_off_jobs 1699, 1700). Covered ~162k subgroups deterministically, ~2.7k via LLM, 12 remained null.
Superseded by screen → remediate → re-extract approach (2026-03-31):
The deterministic inference was brittle for edge cases and population_role has no downstream consumers yet — it was a stepping stone. Replaced with a direct LLM screening approach that addresses both Issue 45 and Issue 46 together.
Forward fix:
- Removed deterministic PopulationRoleInference from post_process.rb — population_role now comes directly from LLM output (kept as free metadata).
- Added generalized qualifying-subset N-counting instruction to the classify_publications prompt in task.rb: applies to any filtered subset (biomarker, prior-therapy, analysis population, condition-present), not just biomarkers.
- Added complete endpoint extraction instruction for sibling arms (Issue 46 forward fix, also in task.rb).
Backfill: screen_subgroup_reextraction.thor — screens ~6,780 target-disease pubs with gpt-5-mini, then remediates flagged pubs for full re-extraction.
Local validation (82 pubs screened):
- ~17% flag rate (14/82), projecting ~1,100–1,200 pubs for re-extraction
- True positive rate: high — flagged pubs had genuine N-counting or missing endpoint issues
- False negative rate: ~2-3% — 1 clear miss (pub 49494: total cohort N copied to biomarker subsets), 2 borderline (abstract didn’t state fraction explicitly)
- Cost: ~$0.001/pub for screening, ~$7-9 for full 6,780 scope
Production screening (2026-03-31, one_off_job 1702):
Screened 6,443 pubs in ~1h38m. Results: 1,395 flagged (21.7% flag rate), 5,048 clean. Both concrete examples (pub 74193, pub 29700) correctly flagged. Pub 49494 (known false negative from local validation) was outside candidate scope (no subgroups).
Re-screening analysis (2026-04-01):
Manual spot-check of 6 random flagged pubs revealed ~33% true positive rate — the initial screen was too permissive. Common false positive patterns:
- Sibling endpoint asymmetry that is real in the abstract (not an extraction error)
- Single-arm studies with no structural issue
- Percentage-based endpoints where N is correctly the denominator (not the event count)
Re-extraction testing on false positives showed it is NOT safe to blindly re-run classify_publications on all 1,395 — model variance at temperature: 1 can regress correct CR/PR extraction due to conflicting prompt instructions.
Prompt restructuring (2026-04-01):
Restructured the classify_publications SYSTEM_PROMPT in task.rb to resolve 6 identified conflicts:
- “Keep endpoint associations as they are” vs “extract response components not in the input list” → unified into single coherent statement
- Garbled table guidance: “extract if confident” vs “don’t extract response components” → unified with graduated strictness
- N-counting rules scattered across 3 locations → consolidated in section 3e
- null vs 0 duplicated → single rule
- “All Arms” usage duplicated → merged in section 3c
- “Derive ORR from counts” vs “don’t fabricate components from composites” → clarified as one-way only
Also expanded response component extraction (section 3d) to cover SD, PD, VGPR, sCR alongside CR/PR, with clear 3-scenario framework (components only, composite + components, composite only). Updated details.rb schema enum to match.
Prompt reduced from 34,281 → 28,154 chars (18% shorter). Emphasis markers reduced from 14 (7x IMPORTANT, 5x CRITICAL, 2x MANDATORY) to 5 RULE: prefixes.
Validation: 10 pubs re-extracted (5 flagged, 5 clean) — 5 improved, 5 stable, 0 regressions.
Structured re-screen (2026-04-01):
Added rescreen command to screen_subgroup_reextraction.thor — a second-pass screen on flagged pubs that requires concrete evidence (specific subgroup, expected vs current value, abstract quote). Stores structured evidence in llm_data['rescreen_evidence'].
Validated on 5 known pubs: correctly confirmed 2 true positives (pub 163764: wrong Ns, pub 221507: missing endpoints) and rejected 3 false positives (pubs 43226, 59227, 136409). Cost: ~$0.002/pub.
Verification (2026-04-02)
Section titled “Verification (2026-04-02)”Status: RESOLVED. Full pipeline re-run completed (one_off_jobs 1714→1716→1717→1722, 2026-04-01). ~2,500 publications re-extracted.
Original example confirmed fixed: Pub 74193 (HERACLES-RESCUE) — ctDNA subgroup now has N=2 (the biomarker-positive subset), not N=3 (the tested denominator). Response breakdown (2 PD out of 2) correct. Solid biopsy subgroup also correct at N=5 (5/5 retained, 1 PR + 1 SD + 3 PD).
Additional spot-checks:
- Pub 48455 (Pembro+Trastuzumab EG cancer): Abstract says “12 of 16 had a decline in their maxVAF” → N correctly extracted as 12 (positive subset), not 16 (tested). All values match (PFS 14.7 vs 5.9, OS 29.7 vs 7.71, milestone PFS 75% vs 0%).
- Pub 56725 (ANV419, 6 dose arms): All dose levels have symmetric Ki-67 CD8/NK/Treg extraction with correct per-cohort Ns.
- Pub 234727 (DCF vs FLOT esophageal): Both arms have all 4 endpoints symmetrically with correct Ns.
- Pub 101692, 152836: Null Ns appropriate where abstract doesn’t give per-subset denominators.
No regressions found across 10+ randomly sampled publications.
Production rollout commands
# 1. Screen all target-disease pubs (DONE — one_off_job 1702)
# bundle exec thor one_off:screen_subgroup_reextraction:screen --batched --parallelism 4 --batch_size 2000

# 2. Re-screen flagged pubs with structured evidence
bundle exec thor one_off:screen_subgroup_reextraction:rescreen --batched --parallelism 4 --batch_size 2000

# 3. Check re-screening results
bundle exec thor one_off:screen_subgroup_reextraction:rescreen_stats

# 4. Remediate only confirmed true positives (--confirmed_only flag TBD)
bundle exec thor one_off:screen_subgroup_reextraction:remediate --dry_run
bundle exec thor one_off:screen_subgroup_reextraction:remediate

# 5. Re-run pipeline on remediated pub IDs
# extract_subgroups → classify_publications → post_process_publications

46. Incomplete endpoint extraction across sibling dose arms
Section titled “46. Incomplete endpoint extraction across sibling dose arms”Short summary
Section titled “Short summary”When a publication reports the same endpoint (e.g. DoR, PFS) across multiple dose arms in the same table, the LLM extracts the endpoint for some arms but skips others. This appears biased toward higher-response or first-listed arms.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”- Pub 29700 (ABBV-400 phase 1 CRC): Abstract table shows DoR for three dose arms: 1.6 mg/kg (no responses, no DoR), 2.4 mg/kg (DoR 4.1 mo), 3.0 mg/kg (DoR 5.5 mo). LLM extracted DoR for 2.4 mg/kg but skipped 3.0 mg/kg. Both values are in the same table row.
1 instance found in CRC+ADC audit (2026-03-30). Scale TBD — multi-arm dose-escalation/expansion studies with per-arm efficacy tables are common, especially in phase 1 ADC trials. Need to assess by searching for publications with multiple dose arms where some arms have an endpoint and sibling arms are missing it.
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction for complete endpoint extraction across all sibling arms.
Solution applied
Section titled “Solution applied”Combined with Issue 45 (2026-03-31) — both issues are addressed by the same screen → remediate → re-extract approach.
Forward fix: Added instruction to classify_publications prompt in task.rb: “When extracting efficacy from a table or listing with multiple dose arms or treatment cohorts, extract ALL endpoints for ALL arms listed. Do not skip arms with lower response rates, fewer patients, or that appear later in the table.”
Backfill: screen_subgroup_reextraction.thor screens for both Issue 45 (wrong N on qualifying subsets) and Issue 46 (missing sibling endpoints) in a single pass. Initial screen completed (one_off_job 1702, 1,395 flagged). Structured re-screen added to filter false positives before remediation. See Issue 45 for full production rollout details and commands.
Verification (2026-04-02)
Section titled “Verification (2026-04-02)”Status: RESOLVED. Full pipeline re-run completed (one_off_jobs 1714→1716→1717→1722, 2026-04-01). ~2,500 publications re-extracted.
Pub 29700 verified: DoR now extracted for all three dose arms — 2.4 mg/kg = 4.1 mo, 3.0 mg/kg = 5.5 mo, correctly absent for 1.6 mg/kg (no responses). All values match abstract table. ORR, CBR, PFS, OS also symmetric across all three arms with correct Ns (32, 40, 41).
Additional spot-checks on multi-arm dose-escalation pubs (56725, 234727) confirmed symmetric endpoint extraction across all sibling arms. No regressions found.
47. Hazard ratios and p-values not captured for subgroup comparisons
Section titled “47. Hazard ratios and p-values not captured for subgroup comparisons”Short summary
Section titled “Short summary”When abstracts report per-subgroup hazard ratios, confidence intervals, and p-values — either inline or in tables — these values are not captured into the hazard_ratio and p_value columns on trial_arm_outcomes or trial_outcome_measures, even though those columns exist in the schema. Median values are correctly extracted; only the statistical comparison measures are lost.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — classify_publications efficacy extraction.
Concrete examples
Section titled “Concrete examples”-
Pub 53685 (PROpel — Olaparib + Abiraterone mCRPC): Abstract reports per-gene HRs for 9 HRR mutations: “rPFS: BRCA2, HR 0.20 (0.08–0.44); ATM, HR 0.55 (0.20–1.38); CDK12, HR 0.51 (0.20–1.18). OS: BRCA2, HR 0.20 (0.07–0.48)…” Median rPFS and OS are correctly extracted for both arms across all 9 genes. All
hazard_ratio fields are null.
Pub 158388 (NPM1-mutated AML with venetoclax): Abstract reports per-mutation HRs from regression: “IDH1/2 HR: 0.62 (0.42–0.89, p=0.011); FLT3-ITD HR: 1.42 (0.99–2.04, p=0.055); TET2 HR: 1.76 (1.24–2.50, p=0.002).” Subgroups for these mutations exist but have empty values and null Ns — the abstract gives only HRs, not median values per mutation individually. The NPM1-low vs NPM1-high comparison (which does have medians) was correctly extracted with CR/CRi, MRD-neg, and OS.
-
Pub 51636 (PRIME — Panitumumab KRAS/NRAS/BRAF): Abstract table includes HR and p-value columns alongside median OS and PFS for each RAS/BRAF subgroup (e.g., WT RAS: OS HR 0.78 p=0.04, PFS HR 0.72 p<0.01). Medians extracted correctly. HRs and p-values not captured.
Widespread. Abstracts routinely report HRs for subgroup comparisons — especially in randomized trials (treatment effect HRs) and in mutation/biomarker analyses (prognostic HRs from Cox regression). This affects both structured tables (Pub 53685, 51636) and inline text (Pub 158388). The schema already supports it (trial_arm_outcomes.hazard_ratio, trial_arm_outcomes.p_value, trial_outcome_measures.hazard_ratio, trial_outcome_measures.p_value) — the LLM extraction simply doesn’t populate these fields.
Explored solution direction
Section titled “Explored solution direction”- Prompt fix (forward): Add instruction to extract HR, CI, and p-value when reported alongside efficacy endpoints, and map them to the existing schema columns.
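The inline HR patterns quoted above can be sketched as a parser (regex and field names are illustrative assumptions; the actual fix would be prompt-level, writing into the hazard_ratio and p_value columns named above):

```ruby
# Matches inline HR reporting such as "IDH1/2 HR: 0.62 (0.42-0.89, p=0.011)"
# or "BRCA2, HR 0.20 (0.08-0.44)". The character class accepts both the
# en-dash and hyphen seen in abstracts.
HR_PATTERN = /HR:?\s*(?<hr>[\d.]+)\s*\((?<lo>[\d.]+)[–-](?<hi>[\d.]+)(?:,\s*p\s*=\s*(?<p>[\d.]+))?\)/

def parse_hazard_ratio(text)
  m = HR_PATTERN.match(text) or return nil
  { hazard_ratio: m[:hr].to_f, ci_lower: m[:lo].to_f, ci_upper: m[:hi].to_f,
    p_value: m[:p]&.to_f }
end

parse_hazard_ratio("IDH1/2 HR: 0.62 (0.42–0.89, p=0.011)")
parse_hazard_ratio("BRCA2, HR 0.20 (0.08–0.44)") # no p-value reported, p_value stays nil
```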
Solution applied
Not yet addressed.
48. Milestone endpoint value missing for sibling arm in randomized trials
Short summary
When an abstract reports a milestone rate (e.g., a 6-month PFS rate) for both arms of a randomized trial — often in the same sentence — the LLM sometimes captures the value for one arm but not the other. Median survival values for the same endpoints are typically extracted correctly for both arms.
Where this sits in the current pipeline
`app/tasks/publications_llm_classification/task.rb` — `classify_publications` efficacy extraction.
Concrete examples
- Pub 43144 (Cetuximab maintenance, TIME trial): Abstract says “The 6-month progression-free rate was 38.8% (26 of 67; 95% CI, 27.1%-51.5%) in the cetuximab group and 5.6% (4 of 72; 95% CI, 1.5%-13.6%) in the observation group.” Milestone PFS was extracted for the cetuximab arm (38.8%, N=67) but not for the observation arm (5.6% not captured). Median PFS and OS were correctly extracted for both arms (5.3 vs 2.0, 24.8 vs 19.7).
Appears rare. Systematic query across re-extracted batch (2026-04-02) found ~10 publications where milestone endpoints existed for fewer arms than other endpoints in the same subgroup. Most were single-arm studies where “All Arms” is expected. Only Pub 43144 was a confirmed case of a multi-arm trial with asymmetric milestone extraction. Low priority.
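The asymmetry check behind that query can be sketched in plain Ruby, with in-memory rows standing in for `trial_arm_outcomes` (the `:arm`/`:milestone` keys are illustrative, not real columns): a subgroup is flagged when milestone outcomes cover fewer arms than its other endpoints.

```ruby
# Sketch: flag a subgroup whose milestone endpoints cover fewer arms
# than its other endpoints (the Pub 43144 pattern).
def asymmetric_milestone?(outcomes)
  by_type = outcomes.group_by { |o| o[:milestone] ? :milestone : :other }
  arms_of = ->(list) { (list || []).map { |o| o[:arm] }.uniq }
  milestone_arms = arms_of.(by_type[:milestone])
  other_arms     = arms_of.(by_type[:other])
  # Milestones exist, but some arm with other endpoints lacks one.
  !milestone_arms.empty? && (other_arms - milestone_arms).any?
end

outcomes = [
  { arm: "Cetuximab",   milestone: true  },  # 6-mo PFS captured
  { arm: "Cetuximab",   milestone: false },  # median PFS
  { arm: "Observation", milestone: false }   # median PFS only — 5.6% rate missed
]
asymmetric_milestone?(outcomes) # => true
```

Single-arm studies never trigger the flag, which matches the observation that most hits in the query were expected “All Arms” cases.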
Explored solution direction
- Prompt fix (forward): Could extend the existing “extract ALL endpoints for ALL arms” instruction to explicitly mention milestone/landmark rates.
Solution applied
Not yet addressed.
49. Arm name mismatch between extract_interventions and classify_publications
Short summary
Two independent LLM pipeline steps name the same arm differently — `extract_interventions` (step 5) might call an arm “Control group” while `classify_publications` (step 10) calls it “Control”. This prevents `trial_arm_outcomes` from linking to `trial_arms` by name, since `trial_arms` are created from step 5 output and `trial_arm_outcomes` are created from step 10 output.
Where this sits in the current pipeline
- `app/tasks/publications_llm_classification/intervention_extraction.rb` — extracts arm names into `llm_data['intervention_arms']`, which become `trial_arms`
- `app/tasks/publications_llm_classification/task.rb` — extracts arms into `llm_data['subgroup_outcome_measures'][].arms[]`, which become `trial_arm_outcomes`
- `app/tasks/publications_llm_classification/post_process.rb` — links `trial_arm_outcomes` to `trial_arms` by case-insensitive name match
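Why the mismatch breaks linkage is easiest to see in the name-match step itself. A simplified in-memory sketch (not the actual `post_process.rb` code): the link is exact case-insensitive equality, so “Control group” vs “Control” can never match.

```ruby
# Sketch of the legacy link: exact downcased equality between the
# outcome arm name and a trial_arm name. Close variants never link.
def link_by_name(outcome_arm_name, trial_arms)
  trial_arms.find { |arm| arm[:name].casecmp?(outcome_arm_name) }
end

trial_arms = [
  { id: 1, name: "Control" },
  { id: 2, name: "Olaparib + Abiraterone" }
]

link_by_name("control", trial_arms)        # matches id 1 (case-insensitive)
link_by_name("Control group", trial_arms)  # nil — the Issue 49 failure mode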
Concrete examples
Top unlinked arm names after backfill (local dev, 2026-04-02):
- “Control group” (883) vs “Control” in trial_arms
- “Intervention group” (419) vs “Intervention”
- “Experimental group” (481) vs treatment-specific names
- “Arm A” / “Arm B” (466/382) vs descriptive arm names
~85k `trial_arm_outcomes` (18% of 470k) remained unlinked after the initial backfill of `trial_arms`. This includes “All Arms” pooled outcomes (resolved separately by creating “All Arms” `trial_arm` entries) and name mismatches from the two-step naming inconsistency.
Downstream impact
Unlinked `trial_arm_outcomes` (`trial_arm_id = NULL`) cannot be joined to `trial_arm_interventions` for drug/dose attribution via the new FK path. They fall back to the legacy name-matching view logic.
Explored solution direction
- Forward fix (applied): Pass `extracted_arm_names` from `llm_data['intervention_arms']` into the `classify_publications` prompt, instructing the LLM to reuse those exact names instead of inventing new ones.
- Backfill: Affected publications need reprocessing through `classify_publications` with the new prompt to align arm names. Alternatively, fuzzy name matching in `post_process.rb` could catch common variants (but is fragile).
Solution applied
Resolved by design (2026-04-02): The name-matching approach was replaced entirely with ID-based linking.
- `extract_interventions` creates `trial_arms` with database IDs
- `classify_publications` receives those IDs in the prompt and assigns them to each outcome arm
- `post_process` reads `arm_data['id']` as `trial_arm_id` directly — no name matching at all
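Under the new design, materialization reduces to reading the ID the LLM echoed back. A sketch with an assumed payload shape (illustrative, not the exact `llm_data` structure):

```ruby
# Sketch: every outcome arm carries the trial_arm id it was given in
# the prompt, so the FK is assigned directly — no name matching.
def materialize_outcomes(subgroup)
  subgroup["arms"].map do |arm_data|
    {
      trial_arm_id:  arm_data["id"],            # ID echoed back by the LLM
      measure_value: arm_data["measure_value"]  # may be nil, never a 0-sentinel
    }
  end
end

rows = materialize_outcomes(
  "arms" => [
    { "id" => 11, "measure_value" => 38.8 },
    { "id" => 12, "measure_value" => 5.6 }
  ]
)
# every row carries a concrete trial_arm_id — 100% linkage by construction
```

This is why the prompt must forbid empty IDs: a missing `"id"` is the only remaining way to produce an unlinked outcome.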
Additional fixes:
- `extract_interventions` prompt updated: separate arms per dose cohort (different patients = different arms)
- `classify_publications` prompt updated: use provided arm IDs, never leave id empty (including “All Arms”)
- `TrialArmMaterializer` always creates an “All Arms” entry so the LLM has an ID for pooled results
- `study_plan_arms` no longer sent to `classify_publications` — `trial_arms` are the source of truth
Tested on 3 publications (190656, 54137, 91482) — all achieved 100% trial_arm_id linkage.
Backfill plan (2026-04-03):
Scoped to target disease pubs (`TARGET_DISEASE_IDS` minus 4116) with `trial_arm_interventions` and unlinked outcomes: 3,943 publications, ~26.7k unlinked outcomes.
Approach: full pipeline re-run, not name-matching heuristics. A reset task (`one_off:reset_classify_publications`) deletes `trial_arms`/interventions, clears the `llm_data` keys for all three extraction steps, and resets flags. After reset, re-run:
- `extract_interventions` — new arms with proper dose splitting + IDs
- `link_publication_drugs` — drug entity matching
- `extract_subgroups` — re-extract with new arm names (arm names influence whether dose cohorts become arms vs subgroups)
- `classify_publications` — outcomes with ID-based arm linking
- `post_process_publications` — materialize everything
`extract_subgroups` must re-run because it reads `intervention_arms` to decide arm-vs-subgroup boundaries (see `subgroup_extraction.rb` lines 47-51).
The `tag_investigational_interventions` step was removed from the pipeline — `intervention_role` (including the supportive role) is now set directly by `extract_interventions` + `TrialArmMaterializer`.
Prompt versioning was added to all three steps (`intervention_extraction_version`, `subgroup_extraction_version`, `classify_publications_version` in `llm_data`) to track which pubs have been processed with the new prompts.
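With the version stamps in place, the backfill scope becomes a simple comparison. A sketch, assuming a hypothetical current-version constant (the real constants live in the respective step modules):

```ruby
# Sketch: a pub needs reprocessing when its stamped prompt version is
# older than the current one; an unstamped pub counts as version 0.
CLASSIFY_PUBLICATIONS_VERSION = 2 # hypothetical current version

def needs_reprocessing?(llm_data)
  (llm_data["classify_publications_version"] || 0) < CLASSIFY_PUBLICATIONS_VERSION
end

needs_reprocessing?({})                                       # => true (never stamped)
needs_reprocessing?({ "classify_publications_version" => 2 }) # => false
```

The `|| 0` default is what makes pre-versioning pubs automatically eligible without a separate backfill flag.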
Tested on 10 random pubs from the scope — 100% arm linking, 0 errors. Extraction quality equal or better than production (richer response components, correct null handling, proper arm separation). Estimated cost: ~$0.045/pub ($178 total for 3,943 pubs).
50. DrugLinker false-matches non-drug interventions to drugs
Short summary
DrugLinker’s last-resort `SimpleCandidateMatchingService` (an LLM-based fuzzy matcher) matches non-pharmacological interventions to drug records — for example, “Classical music” → Orca-T (a cell therapy); “Noise-canceling headphones” could equally match an arbitrary drug. The candidate service has no guard against non-drug intervention types.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/drug_linker.rb—match_via_candidate_service(line 64)SimpleCandidateMatchingService— LLM-based candidate matching, used as fallback when NcitConcept and Drug.flexifind both fail
Concrete examples
- Pub 129 (music during MRI biopsy): “Classical music” intervention matched to drug Orca-T (id=13666, cell therapy). No synonym overlap — pure hallucination from the candidate service.
- 1,409 procedure-type interventions have `drug_id` set
- 140 device-type interventions have `drug_id` set
- 1,544 other-type interventions have `drug_id` set
- Total: ~3,093 likely false matches on non-drug intervention types
- 8,843 total interventions matched via candidate service (drug_id set, ncit_concept_id null) — some of these are legitimate drug matches, but the non-drug types above are almost certainly false positives
Downstream impact
False drug attribution in the efficacy view — non-drug interventions appear as if they’re associated with specific drugs, polluting drug-level clinical evidence reports.
Explored solution direction
Section titled “Explored solution direction”- Guard by intervention_type: Skip
match_via_candidate_servicewhenintervention_typeis in%w[procedure behavioral device dietary other radiation]. Only attempt LLM-based matching fordrugandbiologicaltypes. - Backfill cleanup: Null out
drug_idontrial_arm_interventions/publication_interventionswhereintervention_typeis non-drug and match came from candidate service (no ncit_concept_id).
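The guard itself reduces to an allowlist check before the fallback matcher runs. A sketch (the method name is hypothetical; the real guard would sit in `drug_linker.rb`):

```ruby
# Sketch: only drug-like intervention types may reach the LLM fuzzy
# matcher; everything else short-circuits before candidate matching.
DRUG_LIKE_TYPES = %w[drug biological].freeze

def candidate_match_allowed?(intervention_type)
  DRUG_LIKE_TYPES.include?(intervention_type)
end

candidate_match_allowed?("drug")      # => true
candidate_match_allowed?("procedure") # => false — “Classical music” never reaches the LLM
```

An allowlist is safer than a denylist here: a new intervention type added later defaults to "no LLM matching" rather than silently becoming a false-match source.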
Solution applied
Forward fix (2026-04-04): Two-part approach — domain-specific prompt + TermMatch caching.
- `DrugMatchingService` (`app/services/drug_matching_service.rb`): New wrapper around `SimpleCandidateMatchingService` with a drug-specific prompt that explicitly instructs the LLM to reject non-pharmacological interventions (procedures, devices, behavioral, imaging, dietary, radiation, observation). Replaces the generic “best match” prompt that caused false matches.
- `SimpleCandidateMatchingService` caching: Added `cache: true` + `strategy:` params. When enabled, checks `TermMatch` before the LLM call and persists results after. `DrugMatchingService` uses this with `strategy: "DrugMatching"` — repeated terms hit the cache instead of making LLM calls.
- `DrugLinker` updated: Now uses `DrugMatchingService` instead of raw `SimpleCandidateMatchingService`.
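The cache-through behavior can be illustrated with a Hash standing in for the persisted `TermMatch` table (class and method names here are illustrative, not the production API):

```ruby
# Illustrative cache-through matcher: a Hash keyed by [strategy, term]
# stands in for TermMatch; the LLM call is a stub.
class CachedMatcher
  attr_reader :llm_calls

  def initialize(strategy:, cache: {})
    @strategy  = strategy
    @cache     = cache
    @llm_calls = 0
  end

  def match(term)
    key = [@strategy, term.downcase] # canonical form dedups "Pembrolizumab"/"pembrolizumab"
    @cache.fetch(key) do
      @llm_calls += 1
      @cache[key] = fake_llm_match(term) # stand-in for the real LLM call
    end
  end

  private

  def fake_llm_match(term)
    "drug_id_for_#{term.downcase}"
  end
end

m = CachedMatcher.new(strategy: "DrugMatching")
m.match("Pembrolizumab")
m.match("pembrolizumab") # cache hit — no second LLM call
```

Keying the cache on `strategy` as well as the term keeps DrugMatching results separate from any other consumer of the shared candidate service.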
Root cause: The old code (pre-March 2026) used Elasticsearch fuzzy search (`edit_distance: 2`) + `min_confidence: 0.5` to find candidates, then the LLM in `match_mode: :best` picked from garbage candidates. The current code (pg_trgm) wouldn’t reproduce the “Classical music” → Orca-T case, but still reproduced others, like “PET/CT” → radiopharmaceutical drugs, because some drug synonyms contain procedure terms.
Backfill cleanup: `lib/tasks/one_off/cleanup_false_drug_matches.thor` — nulls `drug_id` on `trial_arm_interventions` (1,707) and `publication_interventions` (1,274) where `intervention_type` is non-drug and `ncit_concept_id IS NULL`. Production run pending.
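The cleanup predicate is small enough to state directly. A sketch with in-memory rows (the thor task would express the same condition as a SQL scope): a false match is a non-drug intervention whose `drug_id` came from the candidate service, i.e. `drug_id` present but `ncit_concept_id` absent.

```ruby
# Sketch of the cleanup predicate: candidate-service matches on
# non-drug intervention types are the rows whose drug_id gets nulled.
NON_DRUG_TYPES = %w[procedure behavioral device dietary other radiation].freeze

def false_drug_match?(row)
  NON_DRUG_TYPES.include?(row[:intervention_type]) &&
    !row[:drug_id].nil? && row[:ncit_concept_id].nil?
end

rows = [
  { intervention_type: "procedure", drug_id: 13666, ncit_concept_id: nil }, # "Classical music" → Orca-T
  { intervention_type: "drug",      drug_id: 42,    ncit_concept_id: nil }, # legitimate candidate match
  { intervention_type: "device",    drug_id: 7,     ncit_concept_id: 99 }   # NCIt-backed — keep
]
rows.count { |r| false_drug_match?(r) } # => 1
```

Requiring `ncit_concept_id` to be nil is what protects NCIt-backed matches from the cleanup, per the scope described above.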
Validated: “Classical music”, “PET/CT”, “no treatment/observation” → nil. “Pembrolizumab”, “Nivolumab”, “Trastuzumab deruxtecan”, “Keytruda” → correct matches. Caching works (second call hits TermMatch, no LLM). Canonical dedup works (“pembrolizumab” = “Pembrolizumab”).
51. Per-arm dose not populated on backfilled trial_arm_interventions
Short summary
The `trial_arms` backfill (2026-04-02) created `trial_arm_interventions` for ~23.5k publications by copying data from existing `publication_interventions`. `drug_id` and `intervention_role` were copied correctly, but dose fields were copied from the old study-level `dose_evidence` — not per-arm dose. For multi-dose-arm studies, every arm’s intervention has the same study-level dose range instead of its arm-specific dose.
The ~43.5k publications without trial_arm_interventions (created from arm outcomes only) are intentionally out of scope — those pubs were never processed through extract_interventions and are outside the target disease scope. They retain drug attribution via the legacy registry fallback (Sources 1a-1c in the view).
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”extract_dose_evidence(step 9) populates per-arm structured dose fields ontrial_arm_interventions- The backfill copied
publication_interventions.dose_evidence(study-level) totrial_arm_interventionsdose fields — this is a flat copy, not per-arm extraction
~23.5k publications with trial_arm_interventions that have study-level dose copied from PIs. Multi-dose-arm pubs within this set have incorrect dose attribution (same range on every arm).
Downstream impact
For single-dose studies: no impact (study-level dose = arm-level dose). For multi-dose-arm studies: each arm shows the full dose range instead of its specific dose. Same problem that originally motivated the `trial_arms` migration (Issue 49, pub 190656).
Explored solution direction
The `extract_dose_evidence` task has been updated to scope to `trial_arm_interventions`, but backfilled records already have `dose_evidence` populated (copied from PIs). The task’s scope filter (`dose_evidence IS NULL OR version < DOSE_EVIDENCE_VERSION`) skips them because they carry version 1 data.
Option A: Bump DOSE_EVIDENCE_VERSION — Change constant from 1 to 2 in dose_evidence_extraction.rb. The scope already checks (dose_evidence->>'version')::int < DOSE_EVIDENCE_VERSION, so all backfilled records (version 1) become eligible for re-extraction. New extractions get version 2. One-line change, no data cleanup needed. Downside: re-extracts ALL 23.5k pubs including single-dose studies where the study-level dose is already correct.
Option B: Null out structured columns, keep JSONB — Set dose_min, dose_max, single_dose, rp2d, dose_units, dose_frequency, dose_context_type to NULL on all backfilled trial_arm_interventions, but keep dose_evidence JSONB for audit trail. Then update the extract_dose_evidence scope to check structured columns instead of JSONB presence. More surgical — only re-extracts records with missing structured dose. Downside: requires scope change and a cleanup migration.
Option C: Null out dose_evidence entirely — Set dose_evidence = NULL on all backfilled trial_arm_interventions. Then run extract_dose_evidence as-is (it scopes on dose_evidence IS NULL). Simplest, but loses the study-level audit trail. For single-dose studies the data was correct and will just be re-extracted to the same values.
Option D: Targeted backfill for multi-dose only — Only null out dose on trial_arm_interventions where the publication has multiple arms with the same dose range (indicator of study-level copy). Leaves single-dose pubs alone. Most efficient LLM cost but requires a scoping query to identify affected records.
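Option A's eligibility check mirrors the SQL scope `dose_evidence IS NULL OR (dose_evidence->>'version')::int < DOSE_EVIDENCE_VERSION` in plain Ruby (a sketch of the predicate, not the actual task code):

```ruby
# Sketch: after the Option A bump, every version-1 (backfilled) record
# and every record with no dose_evidence at all becomes eligible.
DOSE_EVIDENCE_VERSION = 2 # was 1 before the bump

def dose_reextraction_needed?(dose_evidence)
  dose_evidence.nil? || dose_evidence["version"].to_i < DOSE_EVIDENCE_VERSION
end

dose_reextraction_needed?(nil)                # => true  (never extracted)
dose_reextraction_needed?({ "version" => 1 }) # => true  (backfilled study-level copy)
dose_reextraction_needed?({ "version" => 2 }) # => false (fresh per-arm extraction)
```

This shows why Option A is a one-line change: the comparison already exists in the scope, so bumping the constant flips eligibility for all version-1 rows at once.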
Solution applied
Option A: Bump `DOSE_EVIDENCE_VERSION` to 2 + prompt refinement for per-arm `dose_context_type`.
Two changes in `dose_evidence_extraction.rb`:
- Version bump (line 5): `DOSE_EVIDENCE_VERSION = 1 → 2`. All backfilled records (version 1) become eligible for re-extraction. The existing scope (`version < DOSE_EVIDENCE_VERSION`) handles this automatically.
- Prompt fix (SYSTEM_PROMPT): Replaced the study-level instruction “Set dose_context_type to rp2d when an RP2D is identified” with arm-level guidance — `rp2d` only for the arm that IS the RP2D/DRDE/MTD, `escalation` for other dose-finding cohorts, `fixed` for predetermined doses, etc.
Validated on 9 publications (mix of dose-escalation, fixed-dose combos, randomized multi-arm trials):
- Per-arm `single_dose` correct on all 9 pubs (previously every arm got the study-level range)
- `dose_context_type` correctly differentiates escalation vs rp2d vs fixed vs weight_based
- RP2D field only set on the arm that IS the RP2D, not on all arms in the study
- Cost: ~$0.004/pub → ~$103 projected for the full 23.5k backfill
- Only touches dose fields on existing `trial_arm_interventions` — no drug matching, arm creation, or linking changes
Production run: Bump the version, deploy, then run `extract_dose_evidence` (no flags needed — the scope picks up all version 1 records automatically).