Publication Issues Tracker Archive
Publication Issues Tracker
Temporary working document for tracking publication-processing issues identified during investigation.
The main motivation for this doc is the sheet: 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8, where the client has collected clinical data for different disease areas and drugs. The purpose of this document is to identify gaps in the publications database that are preventing us from being able to correctly reconstruct this sheet in the future using structured data only (from the bioloupe data lake database).
Last updated: 2026-03-28 (Issues 31-34 added from job 1635 CRC+ADC audit triage — dose cross-contamination to control arms, TTP→PFS misclassification, cross-tabulated subgroups, immature→Not Reached confusion)
Issue index
| # | Title | Short description | Status |
|---|---|---|---|
| 1 | Trial subgroup disease propagation gap | Non-disease subgroups with disease-like labels (e.g. MSS-CRC, NSCLC) never get disease_id populated because propagation is gated on subgroup_type = 'disease' | Complete — 1,924 subgroups pending term match resolution |
| 2 | ASCO API content type blind spot | PresentationContentItem publications silently dropped — search filter, detail query, and NCT ID search all restricted to AbstractContentItem only | Complete |
| 3 | Publication dose context gap | Linked publications use trial-derived dose; publication-specific dose extraction only runs for unlinked publications; no structured dose fields (min/max/RP2D/units/frequency) | Complete — extraction + view join fixed by Issue 20 (v16 view) |
| 4 | AE grade classification gap | Individual named AE rows lack grade category (all_grade vs grade_gte3), preventing ranked “Most Frequent AE” export columns | Complete — superseded by Issue 7 full re-run |
| 5 | Prior therapy context not extracted | Min/max/median prior lines and prior therapy exposure (e.g. prior taxane, prior IO) not captured from publication abstracts despite being available in text | Complete — max_prior_lines data quality cleanup needed |
| 6 | Data cutoff date not extracted | Publication data cutoff date is stated in ~6K abstracts but not persisted as a structured field — needed for worksheet Data Cut column | Implementation complete — backfill complete |
| 7 | AE grade category too coarse | Binary all_grade/grade_gte3 enum forces grade 1-2 rows into all_grade, producing inverted all-grade/grade≥3 value pairs | Complete — enum expanded, backfill run, inverted pairs reduced from ~50 to 33 |
| 8 | max_prior_lines zero-sentinel contamination | LLM outputs 0 instead of null for unstated max prior lines, producing 124K unusable values including 12.9K logically impossible min > max rows | Complete — cleanup applied, 0 contradictions remain, residual zeros in 1L/Adj/Neo populations only |
| 9 | All-grade AE extraction gap | Originally ~13K pubs suspected; after investigation only ~14 have genuine misclassification (any-grade values labeled as grade≥3). Already fixed by Issue 7 enum expansion — re-extraction produces correct results | Complete — fixed by Issue 7 prompt, no additional changes needed |
| 10 | classify_publications drops identified subgroups | LLM drops ~15% of subgroups identified by extract_subgroups — ~9,700 publications affected across all sources. Prompt + schema + validation fix implemented and full pipeline re-extraction completed in prod | Complete |
| 11 | Empty outcome_measures verified correct | All 102 publications with empty outcome_measures are correctly empty: trial designs, safety-only, biomarker studies, or truncated abstracts. Original worksheet gaps explained by Issue 10 + data availability | Closed — not an issue |
| 12 | Legacy Emerging Clinical Data query collapses subgroup-level results | Legacy EmergingClinicalDataQuery groups by [pub_id, disease_id, line, arm] and prefers “Overall” subgroup, hiding dose-level and biomarker-stratified data; the current ClinicalEvidenceQuery already preserves subgroup rows | Stale — superseded by ClinicalEvidenceQuery; legacy EmergingClinicalDataQuery still collapses subgroups |
| 13 | Technology filter excludes combination partner drugs | Query filters view rows by technology_id, removing combo partner drugs with different technologies — e.g. paclitaxel (chemo) filtered out when querying for BsAb, so Amivantamab+paclitaxel shows no combo partner | Complete — switched to fetch_combination_partners |
| 14 | Basket trial disease subgroups not extracted for minority cohorts | BNT324/DB-1311 abstract mentions SCLC, CRPC, NSCLC by name but not HNSCC (only “1 pt with BTC” style mentions) — HNSCC N=3 data was in poster/presentation only, not abstract text | Investigation complete — data availability limit |
| 15 | Disease extraction drops parent disease when subtype matches exist | build_match_set early-returns when subtype TermMatches succeed, skipping parent disease-name match — e.g. HNSCC (6200) dropped when H&N sub-sites match, making 1,856 pubs invisible under umbrella diseases | Complete — backfill ran 2026-03-18 |
| 16 | Confirmed ORR is not exported by EmergingClinicalDataQuery | Query/report endpoint whitelist omits cORR, so worksheet rows with Confirmed ORR (cORR) cannot be reconstructed even when ORR is present — folded into Issue 12 | Complete — confirmed boolean added, backfilled 3,061 rows |
| 17 | ASCO abstract + presentation copies create duplicate publication rows | ASCO ingestion saves AbstractContentItem and PresentationContentItem separately by source_id, so the same DOI can appear twice in the report | Investigation complete |
| 18 | PubMed-indexed journal article missing from publication corpus | The sqNSCLC worksheet row for Cofetuzumab now points to 10.1016/j.lungcan.2025.108492, but that article is absent from publications, so the row is still missing despite a valid journal source | Implementation complete — 2025 PubMed backfill pending |
| 19 | Biomarker context missing at subgroup level | Biomarkers are extracted at trial_disease_details level (disease scope), not per subgroup — ~13K biomarker-type subgroups like “EGFR-mutant” and “PD-L1 TPS≥1%” have no structured biomarker link, preventing biomarker-stratified export | Complete — extraction backfilled (52K records, 99%), matching pipeline run (67.3% matched), query layer aggregates multi-biomarker subgroups |
| 20 | study_plan_arm link is fragile and causes dose/drug/arm issues | vw_publication_efficacy_data joins through study_plan_arms for arm roles AND drug resolution — causing arm role failures (62% of rows), dose evidence drop (76% lost via drug_id mismatch), and row triplication. Merges Issue 3 dose gap. | Complete — v16 view deployed + arm_type backfill run in prod (2026-03-24) |
| 21 | Phase 1 basket trials report response counts, not ORR percentages | LLM extracts PR (count) faithfully from phase 1 abstracts reporting “1 PR in 9 HNSCC patients”, but the query only recognizes ORR (percentage). No ORR is derived, and fallback patient count inflates to the cross-tumor total | Complete — derived ORR in post_process + backfill |
| 22 | extract_subgroups doesn’t identify response counts as endpoints | When abstracts report best response narratively (“1 PR and 14 SD out of 29 patients”) without a formal ORR, extract_subgroups only identifies DCR and TTP as endpoints — individual response counts (PR, CR) are missed, so classify_publications can’t extract them | Complete — forward fix v2 + backfill v1+v2 run; 759→498 DCR-only pubs (remaining 498 verified clean) |
| 23 | Dose extraction misses implicit RP2D in phase I/II trials | When a phase I/II abstract says “dose levels of X and Y were chosen for phase II”, the dose extractor classifies this as a range (dose_min/dose_max) rather than RP2D — but in phase I/II trials, doses chosen for phase II ARE the RP2D by definition | Complete — backfill ran 2026-03-23 |
| 24 | Subgroup participant count wrong for biomarker sub-cohorts | KRAS-mutated CRC subgroup (pub 29737) reports n=7 but abstract states 13 KRAS-mutated patients with 7 having SD — LLM confused the SD count with the total KRAS cohort size | Complete — backfill ran 2026-03-23 |
| 25 | Confirmed vs unconfirmed ORR confusion in classify_publications | When abstracts report both confirmed and unconfirmed ORR (common in ADC trials), the LLM extracts the unconfirmed value but marks confirmed: true, or omits the confirmed ORR entirely — producing wrong cORR values and missing cORR endpoints | Incomplete — extraction residual post-fix, see 2026-03-26 audit findings |
| 26 | Parent population N propagated to child subgroups | classify_publications copies the parent subgroup’s number_of_participants to child subgroups instead of extracting the subset-specific N — ~5,058 child subgroups across 1,174 publications affected | Complete |
| 27 | extract_efficacy_metrics picks confirmed ORR as plain ORR | When both confirmed and unconfirmed ORR rows exist with the same N, max_by(number_of_participants) picks the confirmed row for the plain ORR metric — making ORR and cORR identical and the ORR value wrong | Investigation complete |
| 28 | build_result_rows collapses dose-level arms when study_plan_arm_id is null | Grouping key uses study_plan_arm_id which is null for publication-extracted arms — distinct dose cohorts (e.g. “8.0 mg/kg” vs “10.0 mg/kg”) sharing the same subgroup collapse into one row, silently dropping the lower-N arm | Investigation complete |
| 29 | Dose extraction captures study-level range, not efficacy population range | In dose-escalation studies, LLM extracts the full dose range (e.g. 1.0–8.3 mg/kg) even when efficacy is reported only for a subset (e.g. ≥4.0 mg/kg) — dose_min on the efficacy row is too low | Investigation complete |
| 30 | Cross-study data contamination from abstract background sections | LLM extracts efficacy values from a referenced prior study cited in the abstract’s background, attributing them to the current publication which has no efficacy data yet | Investigation complete |
| 31 | Investigational drug dose data bleeds onto control/comparator arms | pub_dose_lookup COALESCE fallback propagates investigational drug dose fields to control arms when publication_interventions.study_plan_arm_id is NULL — 2,890 rows across 566 publications | Investigation complete |
| 32 | TTP (time to progression) misclassified as PFS | LLM extraction maps TTP values to PFS endpoint — 149 publications mention TTP (not PFS) in abstract but have PFS extracted; additionally SD-subpopulation TTP values get attributed to full cohort | Investigation complete |
| 33 | Cross-tabulated subgroups not identified in basket trials | extract_subgroups identifies single-dimension subgroups (tumor type OR biomarker) but not the cross-product (tumor type × biomarker) when tabular data is present — ~366 pubs have both disease + biomarker subgroups that could have cross-tabulated data | Investigation complete |
| 34 | “Immature” endpoints extracted as “Not Reached” | LLM maps “not yet mature” / “data immature” to “Not Reached” — but immature means no median can be estimated (should be null), while “Not Reached” means median exceeds follow-up. ~71 pubs have immature language without “not reached” but have “Not Reached” extracted | Investigation complete |
Each issue entry should keep analysis and remediation separate.
Recommended issue structure:
- Short summary
- Where this sits in the current pipeline
- Exact restriction causing the drop
- Concrete examples
- Downstream impact
- What the issue is not
- Scale
- Spot checks
- Open characterization questions
- Explored solution direction
- Solution applied
Solution applied should remain empty until an actual fix is agreed and implemented.
Backfill pattern: When an issue requires backfilling historical data, see the “One-Off Backfill Tasks” section in `.claude/skills/backend-expert/SKILL.md`.
1. Trial subgroup disease propagation gap
Short summary
Publication subgroup rows can contain disease-like cohort labels in `trial_subgroups.subgroup_value`, but the current disease propagation path only assigns `trial_subgroups.disease_id` for subgroups whose `subgroup_type` is exactly `disease`.
If a subgroup is classified as analysis population or another non-disease type, its disease_id remains NULL even when:
- the subgroup label is clearly disease-like, and
- a high-confidence `TermMatch` already exists for that label.
This means disease-specific publication rows can fail to surface in reporting even though the publication contains a disease cohort tied to outcomes.
Where this sits in the current pipeline
Current publication flow:
- `extract_subgroups` identifies subgroup labels and endpoint associations.
- `classify_publications` emits `subgroup_outcome_measures`, including `type`, `value`, and linked outcome measures.
- `post_process_publications` destroys and recreates `publication.trial_subgroups` from the LLM output.
- The separate publication disease workflow creates `TermMatch` records for subgroup disease strings and later post-processes them back into `trial_subgroups.disease_id`.
Relevant code paths:
- `/Users/tomor/Sites/bioloupe-data-gov/app/workflows/publications_workflow.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/subgroup_extraction.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/task.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/post_process.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/app/models/trial_subgroup.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/lib/tasks/clinical_trials/trial_subgroups.thor`
- `/Users/tomor/Sites/bioloupe-data-gov/app/workflows/publication_disease_workflow.rb`
Exact restriction causing the drop
The subgroup disease term population and subgroup disease post-processing are both restricted to `subgroup_type = 'disease'`.
In the model:
- `TrialSubgroup.disease_type` is defined as `where(subgroup_type: 'disease')`
- `TrialSubgroup.populate_term_matches` only iterates `disease_type.with_subgroup_value`
In the Thor task:
- `post_process_disease_matches` builds the scope as `TrialSubgroup.disease_type.with_subgroup_value.without_disease_id`
So any subgroup classified as:
- `analysis population`
- `clinical feature`
- `mutation`
- `patient characteristic`
- or any other non-`disease` type
is excluded from disease propagation, even if its subgroup_value is disease-like.
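The effect of this gate can be sketched in plain Ruby. This is a minimal illustration with hash records and an illustrative method name, not the actual ActiveRecord scopes on `TrialSubgroup`:

```ruby
# Mirrors TrialSubgroup.disease_type.with_subgroup_value.without_disease_id:
# only rows typed exactly 'disease' are ever considered for propagation.
def eligible_for_disease_propagation(subgroups)
  subgroups.select do |sg|
    sg[:subgroup_type] == 'disease' &&
      !sg[:subgroup_value].to_s.strip.empty? &&
      sg[:disease_id].nil?
  end
end

subgroups = [
  # Disease-like label under a non-disease type: never propagated (this issue).
  { id: 210858, subgroup_type: 'analysis population', subgroup_value: 'MSS-CRC', disease_id: nil },
  { id: 1, subgroup_type: 'disease', subgroup_value: 'NSCLC', disease_id: nil },
]

eligible_for_disease_propagation(subgroups).map { |sg| sg[:id] }
# => [1] — the MSS-CRC row is skipped despite its high-confidence TermMatch
```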
Example: publication 114077
Publication:
- `publications.id = 114077`
- title: “A phase I study of INCA33890, a PD-1/TGFβR2 bispecific antibody, for advanced solid tumours”
- linked trial: `NCT05836324`
Publication-level disease rows:
- `trial_disease_details` contains only `4116 = Solid Tumors`
Subgroup row:
- `trial_subgroups.id = 210858`
- `source_type = 'Publication'`
- `source_id = 114077`
- `subgroup_type = 'analysis population'`
- `subgroup_value = 'MSS-CRC'`
- `disease_id = NULL`
But the disease matcher already knows what this means:
- `term_matches.id = 100095`
- `subject_type = 'TrialSubgroup'`
- `field = 'disease_name'`
- `strategy = 'DiseaseMatching'`
- `term = 'mss-crc'`
- `final_result.id = 4345`
- `final_result.score = 0.95`
- disease `4345 = Colorectal Cancer`
So the system has a validated disease match for the normalized term, but it is never propagated to trial_subgroups.disease_id because the subgroup is analysis population, not disease.
Why this matters downstream
The publication efficacy view uses subgroup disease from `trial_subgroups`, not from the linked clinical trial and not from `trial_disease_details`.
In /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql:
- `treatment_line_mapping` reads `trial_subgroups` where `source_type = 'Publication'`
- `subgroup_disease_id` is set directly from `trial_subgroups.disease_id`
The view does not join:
- `clinical_trial_end_diseases`
- `trial_disease_details`
for subgroup disease attribution.
So if a publication subgroup is disease-like but trial_subgroups.disease_id stays null, the view row does not carry that disease.
Later, in /Users/tomor/Sites/bioloupe-data-gov/app/queries/tpp/emerging_clinical_data_query.rb, filtering works like this:
- prefer `v.subgroup_disease_id`
- if that is null, fall back to `trial_disease_details`
For publication 114077, that fallback disease is only Solid Tumors, so the publication does not surface as CRC.
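That preference order can be sketched as a small function (illustrative names, not the query's actual code):

```ruby
# Prefer the subgroup-level disease; otherwise fall back to the
# publication-level diseases from trial_disease_details.
def effective_disease_ids(subgroup_disease_id, trial_disease_detail_ids)
  subgroup_disease_id ? [subgroup_disease_id] : trial_disease_detail_ids
end

# Publication 114077: subgroup disease was never propagated, so the row can
# only surface under publication-level Solid Tumors (4116), not CRC (4345).
effective_disease_ids(nil, [4116])   # => [4116]
effective_disease_ids(4345, [4116])  # => [4345]
```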
What the issue is not
This is not primarily a missing `TermMatch` problem.
For the MSS-CRC example, the TermMatch already exists and is high-confidence. The failure is in propagation from the normalized term match back onto the subgroup record.
This is also not a clinical-trial disease issue. In this path, the effective disease used by the publication efficacy view comes from publication subgroup records, not from clinical_trials.
Current semantic mismatch
The system currently behaves as if:
- `trial_subgroups.disease_id` means: “this subgroup is explicitly a disease subgroup”
But many real publication subgroup labels behave more like:
- disease cohort embedded inside another subgroup class
- disease-shaped analysis population
- disease-plus-qualifier cohort
Examples:
- `MSS-CRC`
- `Overall → RCC`
- `Relapsed/Refractory AML`
- `BCG-refractory NMIBC`
- `Stage I NSCLC`
These can carry real disease meaning even when the LLM classified the subgroup as analysis population or another non-disease type.
Scale of the issue in publication-sourced subgroup rows
For `trial_subgroups.source_type = 'Publication'` with `disease_id IS NULL`:
- total null-disease subgroup rows: 140,057
- distinct null-disease subgroup strings: 92,854
For subgroup_type = 'analysis population':
- rows with non-empty `subgroup_value` and null `disease_id`: 88,623
- distinct subgroup strings: 51,343
For non-disease subgroup rows overall:
- rows with non-empty `subgroup_value` and null `disease_id`: 134,211
- distinct subgroup strings: 88,382
Among publication analysis population rows specifically:
- 1,720 rows already have an existing exact normalized high-confidence `DiseaseMatching` result available by term
Among all publication non-disease subgroup rows:
- 2,422 rows already have an existing exact normalized high-confidence `DiseaseMatching` result available by term
This shows two things at once:
- there is recoverable disease signal being left unused
- most non-`disease` subgroup rows are not pre-validated disease matches
Why broadening this blindly is risky
Many `analysis population` values are obviously not disease cohorts:
- `Overall`
- `Responders`
- `Evaluable patients`
- `Monotherapy`
- `Cohort 1`
- `Placebo`
- `Dose escalation`
- `Healthy Volunteers`
- `First-line`
- `Japanese patients`
So “map all non-disease subgroup types through disease matching” would push large volumes of junk terms into a disease-normalization process that was not designed for them.
Spot checks showing recoverable signal
These publication subgroup values look meaningfully disease-like and appear useful for disease attribution:
- `MSS-CRC` -> Colorectal Cancer
- `Overall → RCC` -> Renal Cell Carcinoma (RCC)
- `Overall → GIST` -> Gastrointestinal Stromal Tumor (GIST)
- `Relapsed/Refractory AML` -> Acute Myeloid Leukemia (AML)
- `BCG-refractory NMIBC` -> Non-Muscle Invasive Bladder Cancer
- `Head and Neck Squamous Cell Carcinoma` -> Head and Neck Squamous Cell Carcinoma (HNSCC)
- `NSCLC` -> Non-Small Cell Lung Cancer (NSCLC)
- `Colorectal cancer` -> Colorectal Cancer
These are the kinds of subgroups that currently fail to contribute disease-specific reachability if their subgroup_type is not disease.
Spot checks showing noise or semantic drift
These examples show why broad disease assignment on subgroup labels can produce incorrect or misleading disease attribution:
- `Previously untreated mPDAC` -> matched to Multiple Myeloma at score 0.75 (abbreviation collision)
- `Relapsed/refractory cHL` -> matched to Chronic Leukemia at score 0.825 (clearly wrong)
- `Overall → Carcinoma In Situ` -> matched to Breast Ductal Carcinoma In Situ at score 0.85 (wrong in a bladder-cancer context)
- `Bone metastases` -> matched to Bone Metastasis (may be useful as a retrieval concept but not necessarily the publication’s disease cohort)
These are not hypothetical edge cases. They already exist in the term-matching results.
Reporting impact
Because `subgroup_disease_id` from publication subgroups is preferred when present, this issue affects:
- disease-specific publication discovery
- disease-specific efficacy row inclusion
- downstream CSV/report completeness for basket and umbrella studies
- publications whose abstract reports disease cohorts under non-`disease` subgroup types
The observed failure mode is:
- publication contains a disease cohort in subgroup results
- subgroup gets created with a non-`disease` type
- subgroup disease propagation never runs
- `vw_publication_efficacy_data` row has `subgroup_disease_id = NULL`
- reporting falls back to publication-level disease or misses the disease entirely
Core problem statement
The system currently treats subgroup disease attribution as a type-gated post-processing step:
- only `subgroup_type = 'disease'` is eligible
But in actual publication abstracts, disease-bearing cohort labels are often emitted under other subgroup types, especially analysis population.
As a result, the pipeline loses disease information that is already present in subgroup text and, in some cases, already normalized in term_matches.
Open characterization questions
These are not proposed fixes. They are the unresolved aspects of the issue:
- Is `trial_subgroups.disease_id` intended to mean “authoritative disease cohort” or “retrieval-relevant disease tag”?
- Should disease-bearing `analysis population` subgroups be treated differently from clearly non-disease `analysis population` values like `Responders` or `Cohort 1`?
Working assumptions from discussion
- Metastatic-site labels such as `Bone metastases` may be valid for publication reachability if the ontology already contains the corresponding disease concept.
- When subgroup disease is null, fallback to `trial_disease_details` should be interpreted as publication-level disease rather than subgroup-level disease.
Explored solution direction
The explored direction is not to map every subgroup directly into `trial_subgroups.disease_id`.
That would continue to create incorrect disease assignments, just with a different error pattern:
- fewer abbreviation-only failures
- more context-overreach failures
The better conceptual shape is:
```
extract_subgroups
  ↓
classify_publications
  ↓
subgroup disease adjudication (LLM, contextual)
  ↓
post_process / disease matching
```

The key idea is to separate two questions that are currently blurred together:
- Is this subgroup actually disease-like?
- If yes, which disease concept should it map to?
The explored adjudication step would analyze subgroup rows in publication context and emit something like:
- semantic class: `disease_cohort`, `disease_related_context`, or `not_disease`
- normalized disease phrase, if applicable
- evidence quote/span
- confidence
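An illustrative shape for one adjudication output (field names follow the list above; the exact schema and the evidence text are assumptions):

```ruby
# Hypothetical adjudication payload for the MSS-CRC example; illustrative only.
adjudication = {
  semantic_class: 'disease_cohort', # or 'disease_related_context' / 'not_disease'
  normalized_disease_phrase: 'Colorectal Cancer',
  evidence_quote: 'patients with MSS-CRC ...', # span copied from the abstract
  confidence: 0.95,
}

adjudication[:semantic_class] # => "disease_cohort"
```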
Behavioral intent of those outputs:
`disease_cohort`
- subgroup is a real disease-bearing cohort
- eligible to write into authoritative `trial_subgroups.disease_id`

`disease_related_context`
- subgroup contains disease signal that may help publication reachability or filtering
- should not automatically overwrite authoritative subgroup disease semantics
- may belong in a separate retrieval/tag field rather than `trial_subgroups.disease_id`

`not_disease`
- subgroup remains unmapped for disease attribution
This distinction matters because the current system uses trial_subgroups.disease_id as an authoritative signal in reporting, not just as a search helper.
So if all subgroup strings are pushed directly into the existing disease_id field, the reports inherit those assignments as if they were true disease cohorts.
That is acceptable for:
- `MSS-CRC`
- `Overall → RCC`
- `Relapsed/Refractory AML`
- `BCG-refractory NMIBC`
But not acceptable for:
- `Responders`
- `Cohort 1`
- `Placebo`
- `Evaluable patients`
- ambiguous or mis-normalized strings like `Relapsed/refractory cHL`
- context-sensitive strings like `Carcinoma In Situ`
Placement options explored:
- In the main publication workflow:
  - after `classify_publications`, before `post_process_publications`
  - this would affect subgroup creation semantics earlier
- In the publication disease branch:
  - near `/Users/tomor/Sites/bioloupe-data-gov/app/workflows/publication_disease_workflow.rb`
  - this keeps subgroup extraction/classification separate from disease enrichment
Current preferred exploration direction:
- yes, a new LLM subgroup adjudication step makes sense
- no, it should not directly map all subgroups into the existing authoritative `disease_id` field
- `analysis population` is the best first expansion target
- the main gain comes from separating “is this disease-like?” from “which disease is it?”, using publication context rather than term-only matching
Solution applied
Implemented contextual LLM subgroup disease adjudication for all non-disease publication subgroups (~132K rows, ~89K distinct values).
Scope: All publication-sourced subgroups where subgroup_type != 'disease', including analysis population (89K rows), clinical feature (25K), mutation (10K), patient characteristic (2.4K), and smaller types. Spot checks confirmed disease-bearing labels appear across all these types (e.g. Metastatic Urothelial Carcinoma → PD-L1- under clinical feature, Relapsed/refractory multiple myeloma → del17p under mutation).
Estimated cost: ~$50 with gpt-5-mini for full backfill.
New code:
- `app/tasks/subgroup_disease_adjudication/task.rb` — LLM adjudication task that classifies publication subgroup labels as `disease_cohort`, `disease_related_context`, or `not_disease`, with a normalized disease phrase, evidence span, and confidence score.
- `app/tasks/subgroup_disease_adjudication/response.rb` — JSON schema for the adjudication response (StoreModel + `DataTasks::JsonSchema`).
Modified code:
- `app/models/trial_subgroup.rb` — Added `adjudicated_disease_cohort` scope. Updated `populate_term_matches` to also generate TermMatch entries for adjudicated `disease_cohort` subgroups using the LLM-provided `normalized_disease_phrase`.
- `lib/tasks/clinical_trials/trial_subgroups.thor` — Added `adjudicate_subgroup_diseases` Thor task for CLI access. Updated `post_process_disease_matches` to process both explicit disease-type subgroups and adjudicated disease cohort subgroups.
- `app/workflows/publication_disease_workflow.rb` — Added `adjudicate_subgroup_diseases` step before `populate_disease_terms_for_trial_subgroups` in the workflow graph.
How it works:
- Adjudication runs on all publication-sourced non-`disease` subgroups and persists the result on `trial_subgroups.llm_data['subgroup_disease_adjudication']`.
- Only `semantic_class = 'disease_cohort'` subgroups enter the DiseaseMatching term population and post-processing paths.
- `disease_related_context` and `not_disease` subgroups remain excluded from `trial_subgroups.disease_id`.
- No changes to `vw_publication_efficacy_data` or `Tpp::EmergingClinicalDataQuery` — they consume the newly populated `subgroup_disease_id` automatically.
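The gating above reduces to a single predicate, sketched here (the real gating lives in the scopes and Thor task named earlier):

```ruby
# Only disease_cohort adjudications may feed the authoritative disease_id path;
# the other two classes are kept out by design. Sketch, not production code.
def enters_disease_matching?(semantic_class)
  semantic_class == 'disease_cohort'
end

enters_disease_matching?('disease_cohort')          # => true
enters_disease_matching?('disease_related_context') # => false
enters_disease_matching?('not_disease')             # => false
```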
Initial spot check (15 random subgroups): All classifications correct. Disease cohorts (AML, mCRPC, CML, melanoma, solid tumors) correctly identified. Metastatic sites, biomarkers, treatment arms, dose levels, and healthy controls correctly excluded.
Pending: Manual verification on a curated sample before broad backfill.
Validation (2026-03-13)
Coverage: 134,061 / 134,211 non-disease subgroups adjudicated (99.9%). 34,652 classified as disease_cohort, of which 32,811 (94.7%) received disease_id.
Tracker example verified: Pub 114077, MSS-CRC subgroup (id 210858) correctly resolved: disease_id = 4345 (Colorectal Cancer), flows through vw_publication_efficacy_data.
Remaining gap — 1,841 disease_cohort subgroups without disease_id:
The populate_term_matches step has already run after adjudication. TermMatch rows exist for these terms — the gap is in the DiseaseMatching resolution pipeline itself, which is expected behavior in most cases.
The unresolved terms fall into categories that are inherent to the disease ontology design:
1. Broad disease concepts not in the simplified tree (e.g. “lymphoma”, “mesothelioma”).
   - Disease 4668 = “Lymphoma” exists in `diseases` but has `simplified = false` — intentionally excluded from the matchable disease set.
   - The DiseaseMatching pipeline correctly found only subtypes (Follicular, Hodgkin, etc.) as candidates, rejected them as too narrow, and returned `null`.
   - Verified against abstracts: these publications genuinely reference “lymphoma” without specifying a subtype (e.g. pub 90447: “relapsed/refractory lymphomas”; pub 119434: “newly diagnosed lymphoma”). The LLM adjudication correctly normalized to “Lymphoma” because the abstracts don’t provide enough context to be more specific.
   - Same pattern for “mucosal melanoma” and “mesothelioma” — the broad concept isn’t in the simplified tree, and the abstracts don’t specify further.
2. Non-oncology diseases correctly absent from the ontology.
   - “Polycystic Ovary Syndrome” (41 subgroups), “Uterine leiomyoma” (10), “Sepsis” (9): not in our hemonc-focused disease ontology. These subgroups were correctly adjudicated as `disease_cohort` by the LLM (they are disease cohorts), but the diseases themselves are out of scope.
3. Too-generic terms below the matching threshold.
   - “Cancer” (21 subgroups): score 0.35, too broad. “Advanced cancer” (19): score 0.75, at threshold. “Pediatric cancer” (13): score 0.7, below threshold.
4. Finalization pipeline edge cases.
   - “Muscle-invasive urothelial carcinoma” (20 subgroups): Round 1 and Round 2 both agreed on disease 4424 (Muscle Invasive Bladder Cancer), judgment accepted with 0.9 confidence, but the majority-vote finalization step still produced `null`. This may warrant investigation as a potential finalization bug.
   - “Gastric and gastroesophageal junction adenocarcinoma” (12): compound disease phrase where the matcher couldn’t resolve to a single disease.
Assessment: The 1,841 gap is largely expected — broad/generic/out-of-scope terms that the disease tree intentionally doesn’t cover. The only potentially actionable subset is the ~32 subgroups affected by the finalization edge case (pattern 4), which may be a bug in the majority-vote logic.
2. ASCO API content type blind spot drops PresentationContentItem publications
Short summary
The ASCO GraphQL API classifies conference content into multiple `__typename` variants: `AbstractContentItem`, `PresentationContentItem`, `PosterContentItem`, `VideosSlidesContentItem`, `JournalContentItem`, and `SessionContentItem`. Our ingestion pipeline only handles `AbstractContentItem` — in both the search filter and the detail query. Publications typed as `PresentationContentItem` (and potentially `PosterContentItem`) are silently dropped.
Where this sits in the current pipeline
ASCO ingestion flow in `app/services/publications/asco_api_service.rb`:
- `fetch_abstract_hits` sends a GraphQL `Search` query with `filters: { contentTypes: ['Abstract'] }`.
- `fetch_full_abstract_details` sends `getContentByUID` with a single inline fragment: `... on AbstractContentItem { uid title body doi ... }`.
- `save_publication` receives the detail result and persists it.
Triggered from lib/tasks/clinical_trials/publications.thor via:
bundle exec thor clinical_trials:publications:import_from asco [options]Exact restrictions causing the drop
Three failure points, any one of which is sufficient to lose a publication:
1. Search filter excludes non-Abstract content types
filters_hash = { contentTypes: ['Abstract'] }
For wildcard searches (userInput: '*'), the ASCO API strictly filters by contentTypes. A PresentationContentItem is not returned when contentTypes: ['Abstract'] is used with a wildcard query.
Verified via API:
- userInput: '*', contentTypes: ['Abstract'], years: [2025] → returns only hex UIDs (AbstractContentItem)
- userInput: '*', contentTypes: ['Presentation'], years: [2025] → returns only PRESENTATION* UIDs
2. NCT ID text search returns zero hits for PresentationContentItem records
The ASCO search API does not index the clinicalTrialRegistryNumber field for search. Searching userInput: 'NCT05701709' returns zero hits regardless of contentTypes filter, even though the record has clinicalTrialRegistryNumber: 'NCT05701709' in its data.
Verified:
- search(userInput: "NCT05701709", filters: {}) → 0 hits
- search(userInput: "NCT05701709", filters: {contentTypes: ["Abstract"]}) → 0 hits
- search(userInput: "SHR A2102", filters: {contentTypes: ["Abstract"]}) → finds PRESENTATION245980
This means the disease-specific ingestion path (which searches by NCT ID) can never discover this publication.
3. Detail query GraphQL fragment only matches AbstractContentItem
... on AbstractContentItem { uid title body doi clinicalTrialRegistryNumber ... }
When getContentByUID returns a PresentationContentItem, the fragment does not match. The result is {}. save_publication then sees a blank title and silently skips the record:

if publication_data[:title].blank?
  increment_stat(:skipped)
  Rails.logger.warn("ASCO Abstract #{abstract_data['uid']} has no title")
  return :skipped
end

Concrete example
Publication: DOI 10.1200/JCO.2025.43.16_suppl.107
- Title: “Phase 1 trial of SHR-A2102, a nectin-4-directed antibody drug conjugate (ADC), in advanced solid tumors.”
- ASCO UID: PRESENTATION245980
- __typename: PresentationContentItem
- clinicalTrialRegistryNumber: NCT05701709
- Drug: SHR-A2102 (drug_id 13643, known in our system)
- Trial: NCT05701709 (clinical_trial_id 51789, linked to “Solid Tumors” disease)
- ESMO version of same study: publication_id 65886, successfully ingested and linked to trial
API verification:
# Search finds nothing by NCT ID
search(userInput: "NCT05701709") → 0 hits

# Search finds it by drug name
search(userInput: "SHR A2102") → PRESENTATION245980 (score 19.66)

# Detail with AbstractContentItem fragment → empty
getContentByUID("PRESENTATION245980") with ... on AbstractContentItem → result: {}

# Detail with PresentationContentItem fragment → full data
getContentByUID("PRESENTATION245980") with ... on PresentationContentItem → title, body, doi, NCT ID, authors ✓

Downstream impact
- Missing ASCO publications for trials where the abstract is classified as Presentation
- This particularly affects oral presentations and plenary sessions (low abstract numbers like 107), which are often the highest-impact results
- Disease-specific reporting misses these publications entirely
- Trial publication counts are understated
What the issue is not
- Not a disease-mapping problem — the drug and trial are correctly linked in our system
- Not a timing/availability problem — the abstract is live in the ASCO API
- Not specific to Chinese trials or specific sponsors — this is a content classification issue on the ASCO API side
- Not a one_off_jobs issue — job 1022 (Dec 31 wildcard run) did run but could not discover these due to the contentTypes filter
ASCO API schema introspection reveals 6 content item types. Four have DOI + clinicalTrialRegistryNumber + body fields:
| Type | Has DOI | Has NCT ID field | Has Body | Currently handled |
|---|---|---|---|---|
| AbstractContentItem | yes | yes | yes | yes |
| PresentationContentItem | yes | yes | yes | no |
| PosterContentItem | yes | yes | yes | no |
| VideosSlidesContentItem | yes | yes | yes | no |
| JournalContentItem | yes | no | yes | no |
| SessionContentItem | no | no | yes | no |
The exact count of PresentationContentItem records in ASCO is not easily determined (the API returns paginated results of 10 per page), but a drug-name search returning PRESENTATION UIDs alongside Abstract UIDs confirms they represent a meaningful fraction of conference content.
Our ASCO 2025 Annual Meeting coverage: 1,102 abstracts out of an estimated 5,000-6,000+ total — the gap is likely partly explained by this issue.
Spot checks
- PRESENTATION245980 (DOI 10.1200/JCO.2025.43.16_suppl.107): SHR-A2102 Phase 1 in solid tumors — missing
- PRESENTATION243121 (DOI 10.1200/JCO.2025.43.5_suppl.657): SHR-A2102 in urothelial carcinoma — missing
Both are PresentationContentItem with full abstract text, NCT IDs, authors, and DOIs available.
Open characterization questions
- What fraction of ASCO Annual Meeting oral presentations are classified as PresentationContentItem vs AbstractContentItem?
- Are PosterContentItem records also carrying unique abstracts we’re missing, or do they duplicate AbstractContentItem records?
- Should VideosSlidesContentItem be ingested (they carry DOI and NCT ID fields)?
Explored solution direction
The fix is contained entirely in app/services/publications/asco_api_service.rb. Two methods need changes:
1. fetch_abstract_hits — broaden the search contentTypes filter
Current (line 93):
filters_hash = { contentTypes: ['Abstract'] }Change to:
filters_hash = { contentTypes: ['Abstract', 'Presentation'] }This ensures the wildcard search (userInput: '*') returns both AbstractContentItem and PresentationContentItem records. The ASCO API enforces contentTypes strictly for wildcard queries, so without adding 'Presentation' these records never appear in search results.
PosterContentItem is excluded for now — open question whether posters carry unique abstract content or duplicate what’s already in AbstractContentItem records. Can be added later if spot checks show unique content.
2. fetch_full_abstract_detail — add a PresentationContentItem inline fragment
Current query (lines 130–157) uses only:
... on AbstractContentItem { uid title body doi clinicalTrialRegistryNumber ... }
Add a second fragment with the shared fields that both types expose:
... on PresentationContentItem { uid title body doi clinicalTrialRegistryNumber journalCitation taxonomy { subjectsThes drugsThes } publishDate { start } authors { displayName role publicationOrganization } }
These are the same fields already requested from AbstractContentItem. The PresentationContentItem schema exposes all of them (verified via schema introspection). GraphQL will match whichever fragment corresponds to the returned __typename and populate the result identically.
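For clarity, the fragment-matching behavior can be simulated in plain Ruby (the response hash is illustrative, not a real ASCO API payload):

```ruby
# Simulates GraphQL inline-fragment matching on a getContentByUID response:
# only fields from a fragment whose type condition matches __typename are
# populated. The response hash below is illustrative.
def apply_fragment(response, fragment_type, fields)
  return {} unless response['__typename'] == fragment_type

  response.slice(*fields)
end

presentation = {
  '__typename' => 'PresentationContentItem',
  'uid'        => 'PRESENTATION245980',
  'title'      => 'Phase 1 trial of SHR-A2102 ...',
}

# Fragment only on AbstractContentItem → empty hash, record later skipped
p apply_fragment(presentation, 'AbstractContentItem', %w[uid title])      # → {}

# A PresentationContentItem fragment recovers the same fields
p apply_fragment(presentation, 'PresentationContentItem', %w[uid title])
```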
No changes needed in save_publication — the downstream code reads abstract_data['title'], abstract_data['body'], etc. by string key. As long as the GraphQL fragment returns the same field names, save_publication works unchanged.
Deduplication — save_publication already uses Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id]), where source_id is the ASCO uid. Since PresentationContentItem records have distinct UIDs (e.g. PRESENTATION245980), they will not collide with existing AbstractContentItem records. If a presentation and an abstract share the same DOI but different UIDs, both would be saved — but find_or_initialize_by on source_id prevents true duplicates.
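The dedup behavior can be sketched in plain Ruby, with an in-memory Hash standing in for ActiveRecord’s find_or_initialize_by (the hex UID is hypothetical):

```ruby
# In-memory stand-in for Publication.find_or_initialize_by(source:, source_id:).
# ActiveRecord is not loaded here; a Hash keyed on [source, source_id] mimics
# the uniqueness behavior described above.
class FakePublicationStore
  def initialize
    @rows = {}
  end

  def find_or_initialize_by(source:, source_id:)
    @rows[[source, source_id]] ||= { source: source, source_id: source_id }
  end

  def count
    @rows.size
  end
end

store = FakePublicationStore.new
store.find_or_initialize_by(source: 'ASCO', source_id: 'PRESENTATION245980')
store.find_or_initialize_by(source: 'ASCO', source_id: '2e6f19c4')           # abstract-style hex UID (hypothetical)
store.find_or_initialize_by(source: 'ASCO', source_id: 'PRESENTATION245980') # re-ingest: no new row

puts store.count # → 2
```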
What this does not fix — the NCT ID search blind spot (failure point 2). The ASCO API does not index clinicalTrialRegistryNumber for text search regardless of content type. So the disease-specific ingestion path (userInput: 'NCT05701709') will still return zero hits for PresentationContentItem records. This is an ASCO API limitation outside our control. The fix works because the wildcard path (userInput: '*') will now find these records, and they will be correctly saved and linked to trials via clinicalTrialRegistryNumber at save time.
Solution applied
Updated app/services/publications/asco_api_service.rb with the following changes:
- Search filter: contentTypes: ['Abstract'] → contentTypes: ['Abstract', 'Presentation'] in fetch_abstract_hits.
- Detail query: Added ... on PresentationContentItem { ... } inline fragment with identical fields to fetch_full_abstract_detail.
- Performance: Parallelized detail fetches using Parallel.map(hits, in_threads: 5) in fetch_publications_by_criteria.
No changes to save_publication — fields are identical across both content types.
Verification: Test run confirmed PRESENTATION-prefixed UIDs are returned by search, detail query resolves fields correctly, and publications save to the database with source: 'ASCO', category: 'ASCO Abstract', and correct titles/metadata.
3. Publication dose context is trial-derived for linked result publications and still too unstructured for worksheet parity
Short summary
The disease clinical evidence worksheet needs publication dose fields with substantially more precision than our current publication pipeline can provide:
- Dose (if only one dose was used)
- Dose Min
- Dose Max
- RP2D
- Dose Units
- Dose Frequency
Today, most linked result publications never get publication-specific arm/intervention extraction at all. They still surface a dose in /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql, but that value is usually coming from trial study-plan interventions, not from the publication abstract.
Even when publication-specific intervention extraction does run, it only persists a free-text publication_interventions.dose string. That is enough to display a single dose blob, but not enough to reproduce the worksheet columns the client is maintaining manually in spreadsheet 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8.
Where this sits in the current pipeline
Current publication flow:
- /Users/tomor/Sites/bioloupe-data-gov/app/workflows/publications_workflow.rb runs extract_interventions before endpoint and AE processing.
- /Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/intervention_extraction.rb writes llm_data['intervention_arms'].
- /Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/drug_linker.rb persists publication_interventions and publication_arm_interventions.
- /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql builds drug_interventions for reporting:
  - linked publications use vw_bioloupe_interventions
  - only unlinked publications use publication_interventions
- /Users/tomor/Sites/bioloupe-data-gov/app/queries/tpp/emerging_clinical_data_query.rb reads v.dose as a single free-text field.
Exact restriction causing the drop
There are two separate restrictions, and they compound.
Restriction 1: intervention extraction is scoped to unlinked publications
In /Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/intervention_extraction.rb, base_scope is:
Publication.workflow_eligible
  .unlinked_to_trials
  .hematology_oncology_relevant
  .where("(llm_data -> 'intervention_arms') is null")
So once a result publication is linked to a trial, it normally never enters publication arm extraction.
Restriction 2: the efficacy view only uses publication_interventions for publications without a trial link
In /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql, drug_interventions explicitly says:
- sources 1a/1b/1c use vw_bioloupe_interventions for linked publications
- source 2 uses publication_interventions
- source 2 is restricted by:
WHERE pct.clinical_trial_id IS NULL and pi.source_type = 'Publication'
That means linked publications can show a dose, but it is almost always trial-derived.
Concrete examples
Section titled “Concrete examples”Example 1: publication 66552 (BL-B01D1 in ESCC, ESMO 2024)
Publication:
- publications.id = 66552
- title: BL-B01D1, an EGFR x her3 bispecific antibody-drug conjugate (ADC), in patients with locally advanced or metastatic esophageal squamous cell carcinoma (ESCC)
- linked trial: NCT05262491
Abstract dose language:
- 2.0, 2.5 and 3.0 mg/kg D1D8 Q3W
- 2.5 mg/kg (RP2D)
Current persisted state:
- jsonb_array_length(publications.llm_data -> 'intervention_arms') = 0
- no publication_interventions rows
- vw_publication_efficacy_data.dose = 'not specified'
But the worksheet row in the client spreadsheet is manually decomposed into:
- Dose Min = 2
- Dose Max = 2.5
- RP2D = 2.5
- Dose Units = mg/kg
- Dose Frequency = 2Q3W
So the publication abstract contains the dose context the worksheet needs, but the current linked-publication path discards it and falls back to trial-level not specified.
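As an illustration of the target decomposition, the worksheet row above can be modeled as a small struct (the struct and field names are hypothetical, mirroring the worksheet columns; values are the client’s manual entries for publication 66552):

```ruby
# Hypothetical worksheet-row model: the six dose columns the export needs,
# filled with the client's manual values for publication 66552. This mirrors
# the worksheet, not any existing model in the codebase.
WorksheetDoseRow = Struct.new(:single_dose, :dose_min, :dose_max, :rp2d,
                              :dose_units, :dose_frequency, keyword_init: true)

row_66552 = WorksheetDoseRow.new(
  single_dose:    nil,      # multiple dose levels reported, so no single dose
  dose_min:       2.0,
  dose_max:       2.5,
  rp2d:           2.5,
  dose_units:     'mg/kg',
  dose_frequency: '2Q3W'
)

puts row_66552.to_h
```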
Example 2: publication 133793 (simmitinib, ASCO 2024)
Publication:
- publications.id = 133793
- title: First-in-human study of simmitinib, a novel tyrosine kinase inhibitor targeting FGFR1-3, KDR and CSF-1R.
- linked trial: NCT04058587
Abstract dose language:
- dose escalation: 1 to 9 mg orally
- expansion regimens: 4 mg QD, 6 mg QD, and 6 mg 3 weeks on 1 week off
Current persisted state:
- no llm_data['intervention_arms']
- no publication_interventions
- vw_publication_efficacy_data.dose = 'starting dose 1mg/d'
This is not just incomplete. It is directionally misleading for reporting because the publication result set includes later expansion regimens and the worksheet needs to distinguish min/max/RP2D/schedule.
Example 3: publication 75999 (MRG003, ESMO 2021) shows the partial success case
Publication:
- publications.id = 75999
- title: FIH phase I dose escalation and dose expansion study of anti-EGFR ADC MRG003 in patients with advanced solid tumors
- no linked trial
Current persisted state:
- jsonb_array_length(publications.llm_data -> 'intervention_arms') = 5
- publication_interventions.dose = '0.1–3.0 mg/kg (dose-escalation cohorts)'
- publication_interventions.schedule = 'Q3W'
- vw_publication_efficacy_data.dose echoes the same free-text dose
This proves the existing publication arm extraction can capture publication-derived dosing when the publication is unlinked.
But it also shows the second gap:
- the persisted output is still one free-text dose blob
- the expansion dose 2.5 mg/kg Q3W is not decomposed into worksheet-ready columns
- RP2D is not persisted separately
So broadening extraction scope alone will improve provenance, but not worksheet parity.
Downstream impact
- The disease clinical evidence export cannot reliably recreate the client worksheet from our publication database.
- Linked publication rows can carry a dose string that looks structured enough to trust, while actually reflecting trial-plan interventions rather than the publication cohort being reported.
- Basket, dose-escalation, dose-expansion, and subgroup-specific publications are especially exposed because publication dose often differs from the trial’s broad intervention description.
- The existing add-dose-column-to-emerging-data direction is useful for visibility, but it does not solve the worksheet problem because the export needs decomposed dose fields, not only a single free-text dose.
What the issue is not
This is not just a missing CSV column problem.
Exposing vw_publication_efficacy_data.dose more widely would still leave us with:
- linked publications whose dose came from trial interventions instead of the publication
- free-text values like not specified, specified dose, dose escalation, and starting dose 1mg/d
- no reliable dose_min, dose_max, rp2d, dose_units, or dose_frequency
This is also not purely a trial curation problem.
In many cases the trial registry is doing exactly what it should: storing planned intervention doses at the study-plan level. The problem is that the publication is often talking about:
- a subset of dose-escalation cohorts
- a specific expansion dose
- a weight-banded administration rule
- a recommended phase 2 dose selected after escalation
- a disease-specific cohort inside a broader trial
That context exists in the publication narrative, not necessarily in the linked study plan.
This is also not a good regex problem.
Dose strings in the worksheet and in publication text mix:
- ranges
- RP2D statements
- schedules like Q3W, 2Q3W, QD, days 1, 8, and 15 of a 28-day cycle
- weight-banded doses
- escalation plus expansion language in the same abstract
We should not try to derive worksheet fields from vw_publication_efficacy_data.dose with string-splitting heuristics.
Current warehouse counts:
- linked result publications: 53,701
- linked result publications with any publication_interventions: 79
- linked result publications with publication-derived dose in publication_interventions: 50
- linked result publications with llm_data['intervention_arms']: 87
- linked result publications with a nonblank vw_publication_efficacy_data.dose: 36,840
- linked result publications with view dose but no publication-derived dose: 36,803
This is the key shape of the issue:
- dose appears broadly in reporting
- publication-specific dose provenance is almost absent for linked results
Contrast:
- unlinked result publications with publication-derived dose in publication_interventions: 2,374
The field shape is also not export-ready even when populated:
- vw_publication_efficacy_data rows with nonblank dose: 489,397
- distinct dose strings in the view: 18,002
- rows with obviously ambiguous values like not specified, not reported, or escalation-only labels: 45,236
Representative high-frequency values in the view:
- not specified (28,138 rows)
- specified dose (6,927 rows)
- escalating doses (1,434 rows)
- dose escalation (1,287 rows)
For publication-derived doses specifically:
- publication_interventions rows with nonblank dose: 4,668
- distinct publication-derived dose strings: 3,123
- rows with structurally complex dose text (ranges, RP2D text, schedules): 775
Examples of currently persisted publication-derived dose strings:
- 0.1–0.9 mg/m2 (administered over 1–10 minutes); RP2D 0.7 mg/m2 over 10 minutes
- 0.05 mg/kg rounded to nearest 1.5 mg; weight-band doses used: 1.5 mg (<30 kg), 3 mg (30–60 kg), 4.5 mg (60–90 kg)
- 1000 mg/m2 on days 1 and 8 every 3 weeks
These are useful raw evidence strings, but they are not already normalized worksheet fields.
Spot checks
Linked publications where publication text clearly contains richer dose context than the current export path:
- 66552 (BL-B01D1, ESCC): publication says 2.0, 2.5 and 3.0 mg/kg D1D8 Q3W; view says not specified
- 133793 (simmitinib): publication says 1 to 9 mg, 4 mg QD, 6 mg QD, 6 mg 3 weeks on 1 week off; view says starting dose 1mg/d
- 240515 (amivantamab, OrigAMI-1): worksheet needs the weight-based regimen; current linked-publication path has no publication intervention extraction at all
Unlinked publication showing the existing extraction path works but is still too shallow:
- 75999 (MRG003): publication-derived dose and schedule are persisted, but only as raw text rather than decomposed worksheet fields
Working assumptions from discussion
- The authoritative persistence grain should be publication + arm + subgroup, interpreted as the smallest defensible publication-result scope.
- We should not force false precision. Some dose evidence will legitimately be:
- publication-level
- publication + arm
- publication + subgroup
- publication + arm + subgroup
- publication + disease is too coarse for dose evidence because dose usually follows treatment context, not just disease context.
- Publication intervention extraction should run for all result publications, not just unlinked publications and not just records currently missing rows.
- Operationally, reruns can still be versioned/idempotent so we only refresh missing, stale, or schema-changed records.
- When a publication reports both escalation and expansion cohorts, we should persist:
- the raw dose evidence text
- a structured cohort array
- and derive a preferred export dose per report row from the matching publication context
- We should not persist one publication-wide preferred dose detached from arm/subgroup context.
- Publication-derived dose should be treated as the source of truth for publication-backed rows when it matches the same or narrower context than the row being exported.
- Linked trial dose remains fallback context only when the publication is silent or too vague to support a row-level dose assignment.
- We do want evidence quotes/spans and confidence for extracted dose claims such as RP2D, units, schedule, or frequency. This is primarily for analyst review and debugging.
Open characterization questions
Section titled “Open characterization questions”- How should the persistence model represent scope when an abstract supports only publication-level or arm-level dose evidence and no subgroup is reported?
- Should disease be denormalized onto the dose evidence row for easier querying, or resolved later from subgroup / publication disease context?
- What exact cohort labels do we want to persist for dose context classification:
  - escalation
  - expansion
  - rp2d_or_fixed_dose
  - mixed_or_unclear
- Should full text, when available, be allowed to override abstract-derived dose evidence, or only supplement it?
Explored solution direction
The direction that emerges from the worksheet and the warehouse evidence has several layers.
1. Use smallest-scope publication evidence as the persistence model
The target grain should be publication-result context, not publication-wide text blobs.
Preferred direction:
- persist dose evidence at publication + arm + subgroup scope when supported
- allow nullable arm/subgroup keys for publication-level and arm-only evidence
- derive disease-facing exports from these scoped evidence rows instead of trying to back-infer scope later
2. Expand publication arm/intervention extraction to all result publications, including linked ones
The current unlinked_to_trials restriction is too aggressive for dose-sensitive reporting.
Preferred direction:
- run publication arm/intervention extraction for all result publications, including linked ones
- persist publication_interventions and publication_arm_interventions even when a publication is linked to a trial
- keep trial-linked study-plan interventions as fallback context, not as the only source of dose
This addresses provenance.
3. Add a separate LLM-backed publication evidence extraction for worksheet dose fields
publication_interventions.dose should remain the raw publication dose phrase, but it should not be the final reporting shape.
Preferred structured output for the disease clinical evidence export:
- raw publication dose text
- structured cohort array
  - single_dose
  - dose_min
  - dose_max
  - rp2d
  - dose_units
  - dose_frequency
  - dose_context_type such as escalation, expansion, RP2D/fixed-dose, or mixed/unclear
- evidence quote/span
- confidence
- optional cohort / arm note explaining whether the values come from escalation, expansion, or a disease-specific subset
This should be extracted from publication text with publication context using an LLM-backed schema, not reverse-parsed from the existing free-text dose field and not derived through substring / regex heuristics.
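A minimal sketch of one structured cohort record under this schema (field names from the list above; the validator and enum check are illustrative, not production extraction code; values follow the MRG003 example):

```ruby
require 'json'

# Illustrative target schema for one extracted dose cohort. The field names
# come from the list above; valid_dose_cohort? is a hypothetical helper.
DOSE_COHORT_FIELDS = %w[
  single_dose dose_min dose_max rp2d dose_units dose_frequency
  dose_context_type evidence_quote confidence
].freeze

CONTEXT_TYPES = %w[escalation expansion rp2d_or_fixed_dose mixed_or_unclear].freeze

def valid_dose_cohort?(cohort)
  DOSE_COHORT_FIELDS.all? { |k| cohort.key?(k) } &&
    CONTEXT_TYPES.include?(cohort['dose_context_type'])
end

cohort = JSON.parse(<<~JSON)
  {
    "single_dose": null,
    "dose_min": "0.1 mg/kg",
    "dose_max": "3.0 mg/kg",
    "rp2d": "2.5 mg/kg",
    "dose_units": "mg/kg",
    "dose_frequency": "Q3W",
    "dose_context_type": "escalation",
    "evidence_quote": "0.1–3.0 mg/kg (dose-escalation cohorts)",
    "confidence": 0.95
  }
JSON

puts valid_dose_cohort?(cohort) # → true
```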
The current early extract_interventions step is still useful, but it is probably not sufficient on its own for dose attribution. The authoritative dose extraction likely belongs later in the workflow, after subgroup / arm / endpoint context exists, so the dose evidence can be attached to the correct publication result scope.
4. Use publication-derived dose as the preferred export source when it matches the publication result context
Source precedence for dose should likely be:
- publication-specific structured dose evidence
- publication raw intervention dose text
- linked trial intervention dose as fallback only
The important nuance is that we should derive the preferred export dose per output row from the matching publication context. We should not store or trust a single publication-wide preferred dose when the abstract contains multiple cohorts.
That is different from the current efficacy view, where linked publications are effectively forced into the trial-derived path.
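A plain-Ruby sketch of that per-row precedence, under the assumption that each candidate is nil when absent (the method name and the 'not specified' vagueness guard are hypothetical):

```ruby
# Hypothetical per-row resolver for the precedence above: structured
# publication dose evidence, then raw publication dose text, then the
# linked-trial dose as fallback only.
def preferred_dose(structured_evidence:, publication_dose_text:, trial_dose:)
  return structured_evidence if structured_evidence
  if publication_dose_text && publication_dose_text != 'not specified'
    return publication_dose_text
  end
  trial_dose # fallback only when the publication is silent or too vague
end

# Pub 66552-style row: publication evidence beats the trial-derived value
puts preferred_dose(
  structured_evidence:   nil,
  publication_dose_text: '2.0, 2.5 and 3.0 mg/kg D1D8 Q3W',
  trial_dose:            'not specified'
) # → 2.0, 2.5 and 3.0 mg/kg D1D8 Q3W

# Publication silent → trial dose as fallback
puts preferred_dose(structured_evidence: nil, publication_dose_text: nil, trial_dose: '1200 mg Q3W')
# → 1200 mg Q3W
```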
5. Keep this as an export/evidence enrichment concern, not a generic trial-study-plan rewrite
The problem we are solving is:
- can we recreate the worksheet from publication-backed evidence?
The answer does not require fully normalizing every historical publication intervention into canonical pharmacology. It requires a publication evidence layer that preserves what the publication actually says at the arm/cohort level.
Solution applied
Implemented 2026-03-11. Change: publication-dose-context-gap.
Four-part fix:
- Broadened intervention extraction scope — Removed .unlinked_to_trials from InterventionExtraction#base_scope. Previously ~53K linked publications were skipped because the therapeutic_area_filter step also had .unlinked_to_trials, so linked publications never got classified as hematology_oncology_relevant and never entered the intervention extraction scope — even though all trials in our database are hemonc by definition.
- Target disease scope for cost control — Running intervention + dose extraction across all 53K linked pubs would cost ~$480. Instead, scoped the backfill to publications linked to trials in target disease areas via clinical_trial_end_diseases:
  - Solid Tumors (4116), HNSCC (6200), ESCC (4260), sqNSCLC (4174), CRC (4345), Cholangiocarcinoma (6228/6229/4298)
  - Plus all existing hemonc-classified unlinked publications
  - Implemented as reusable scope Publication.target_disease_or_hemonc_relevant on the model
  - Reduces backfill from 53K to ~10K publications, estimated cost ~$66
  - These disease IDs are hardcoded for the initial backfill; scope can be broadened later by adding more disease IDs to Publication::TARGET_DISEASE_IDS
- New dose evidence extraction step — Created DoseEvidenceExtraction LLM task (app/tasks/publications_llm_classification/dose_evidence_extraction.rb) that decomposes free-text publication_interventions.dose into structured fields stored in publication_interventions.dose_evidence JSONB:
  - single_dose, dose_min, dose_max, rp2d, dose_units, dose_frequency, dose_context_type
  - evidence_quote, confidence, version
  - Uses gpt-5-mini at ~$0.004/publication — sufficient quality, no model upgrade needed
  - Prompt sends publication_intervention.id per intervention for deterministic persistence (no name matching)
  - Integrated into PublicationsWorkflow as a skippable step after extract_subgroups
- Efficacy view + export updated — vw_publication_efficacy_data v08 adds dose_min, dose_max, rp2d, dose_units, dose_frequency columns via a pub_dose_lookup CTE that reads publication_interventions.dose_evidence. emerging_clinical_data_query.rb includes these in export output.
Key discovery during implementation: The therapeutic_area_filter task also has .unlinked_to_trials in its scope, so 65,152 linked publications were never classified for hemonc relevance. Since all trials in our DB are hemonc, the classification gate is meaningless for linked pubs. Rather than running the LLM therapeutic area filter on 65K pubs unnecessarily, we bypass it with target_disease_or_hemonc_relevant which uses trial disease metadata for linked pubs and LLM classification for unlinked pubs.
Files changed:
- app/models/publication.rb (target_disease_or_hemonc_relevant scope + TARGET_DISEASE_IDS)
- app/tasks/publications_llm_classification/dose_evidence_extraction.rb (new)
- app/tasks/publications_llm_classification/intervention_extraction.rb (scope changed to target_disease_or_hemonc_relevant)
- app/workflows/publications_workflow.rb (new step added)
- app/admin/services/publication_console/publication_workflow_registry.rb (registry entries)
- app/admin/services/publication_console/publication_workflow_overview_service.rb (scope methods)
- lib/tasks/clinical_trials/publications.thor (Thor task wiring)
- db/migrate/20260311220054_add_dose_evidence_to_publication_interventions.rb (JSONB column + GIN index)
- db/views/vw_publication_efficacy_data_v08.sql (structured dose columns)
- db/migrate/20260311220657_update_vw_publication_efficacy_data_to_version8.rb (view migration)
- app/queries/tpp/emerging_clinical_data_query.rb (export columns)
Smoke test results (4 publications, gpt-5-mini):
- Pub 75999 (MRG003): dose_min=0.1 mg/kg, dose_max=3.0 mg/kg, rp2d=2.5 mg/kg, Q3W, context=escalation, confidence=0.95
- Pub 117 (Olanzapine/Pregabalin): fixed doses correctly extracted (5mg, 75mg, 8mg)
- Pub 88446 (21 interventions): all 21 matched by ID, non-drug interventions correctly got confidence=0.0
- Structured dose columns confirmed flowing through materialized view after refresh
Backfill completed 2026-03-12. Four steps ran in production:
1. thor clinical_trials:publications:extract_interventions --batched --parallelism=4 --batch-size=2000
2. thor clinical_trials:publications:link_publication_drugs --parallelism=5
3. thor clinical_trials:publications:extract_dose_evidence --batched --parallelism=4 --batch-size=2000 (ran twice — first pass covered unlinked pubs only; second pass covered newly materialized linked-pub interventions)
4. REFRESH MATERIALIZED VIEW CONCURRENTLY vw_publication_efficacy_data
Backfill results:
- 44,778 / 44,780 publication_interventions rows have dose_evidence populated
- Actual cost: ~$8 total across both dose evidence runs (gpt-5-mini batch API, ~$0.0004/pub — 10x cheaper than pre-implementation estimate)
- Extraction quality verified across random samples: high-confidence extractions accurate, RP2D correctly identified in escalation studies, weight-based/BSA-based classification correct, low-confidence calibration appropriate (no hallucinated doses)
Post-backfill cleanup:
- ~1.1% of rows (513) had LLM garbage in string fields — placeholder text, chain-of-thought leaking, field-name rotation, escaped JSON fragments. All correlated with non-drug interventions (surgery, imaging, lifestyle). Root cause: system prompt redundantly described JSON format when structured outputs already constrain it.
- ~5,500 rows had string "null" variants instead of JSON null.
- Both issues fixed by one_off:cleanup_dose_evidence_garbage:execute (one-off Thor task, 6,545 rows cleaned).
- Prevention added: sanitize_dose_evidence! in DoseEvidenceExtraction#persist_dose_evidence strips garbage on persist. System prompt simplified to avoid redundant format instructions with structured outputs.
Spot-check verification (tracker examples now resolved):
| Pub | Drug | Before | After |
|---|---|---|---|
| 66552 | BL-B01D1 | not specified | dose_min 2.0 mg/kg, dose_max 3.0 mg/kg, rp2d 2.5 mg/kg, D1D8 Q3W |
| 133793 | simmitinib | starting dose 1mg/d | dose_min 1 mg, dose_max 9 mg, rp2d 6 mg 3 weeks on 1 week off |
| 75999 | MRG003 | raw text only | dose_min 0.1 mg/kg, dose_max 3.0 mg/kg, rp2d 2.5 mg/kg, Q3W |
| 240515 | amivantamab | no intervention rows | no intervention_arms in llm_data (abstract may lack dose detail) |
Issue reopened: pub_dose_lookup view join drops 76% of extracted dose evidence (2026-03-23)
The extraction and persistence steps from the 2026-03-11 fix are working correctly — 23,503 publications have dose_evidence populated in publication_interventions. However, only 8,764 publications (37%) have structured dose fields flowing through to vw_publication_efficacy_data. The remaining 17,826 publications (76%) have dose evidence silently dropped by the view’s pub_dose_lookup join.
Root cause
The pub_dose_lookup CTE joins on (publication_id, drug_id):
```sql
LEFT JOIN pub_dose_lookup pdl
  ON po.publication_id = pdl.publication_id
 AND di.drug_id = pdl.drug_id
```

- `di.drug_id` comes from the `drug_interventions` CTE, which for linked publications sources from `vw_bioloupe_interventions` (trial registry drugs)
- `pdl.drug_id` comes from `publication_interventions.drug_id` (LLM-extracted and drug-linked)
This join fails in two ways:
Failure mode 1: NULL drug_id on publication_interventions (~13,600 pubs, 58%)
When link_publication_drugs doesn’t find a matching drug record, publication_interventions.drug_id stays NULL. The SQL predicate `di.drug_id = NULL` never evaluates to TRUE (equality against NULL yields UNKNOWN), so the dose evidence is silently dropped even though it was correctly extracted.
Failure mode 2: Drug_id mismatch between registry and publication (~2,148 pubs, 9%)
The trial registry and the LLM-extracted publication interventions can resolve to different drug records for the same compound:
- ADC vs naked antibody: Zanidatamab (10432) vs Zanidatamab zovodotin (15231)
- Unresolved drug matching: SHR-A1811 has drug_id=NULL in publication_interventions but drug_id=10733 (Trastuzumab rezetecan) in the trial registry
- Biosimilar/brand aliases: SCT510 (15900) vs Bevacizumab (9022)
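Both failure modes reduce to the same join semantics: the predicate `di.drug_id = pdl.drug_id` never matches when either side is NULL, and it mismatches when the registry and the publication resolved the same compound to different drug records. A minimal Ruby sketch with illustrative stand-in rows (not the real CTE output):

```ruby
# Stand-in rows for the two join sides (illustrative, not real CTE data).
registry_interventions = [
  { publication_id: 66516, drug_id: 15231 }, # registry resolved the ADC record
  { publication_id: 70960, drug_id: 10733 }, # registry resolved Trastuzumab rezetecan
]
pub_dose_lookup = [
  { publication_id: 66516, drug_id: 10432, dose: 'single_dose=1200 mg' }, # naked antibody record
  { publication_id: 70960, drug_id: nil,   dose: 'rp2d=6.4 mg/kg' },      # drug linking failed
]

# SQL equality is never TRUE when either side is NULL.
sql_eq = ->(a, b) { !a.nil? && !b.nil? && a == b }

joined = registry_interventions.map do |ri|
  match = pub_dose_lookup.find do |pdl|
    ri[:publication_id] == pdl[:publication_id] && sql_eq.call(ri[:drug_id], pdl[:drug_id])
  end
  { publication_id: ri[:publication_id], dose: match && match[:dose] }
end
# Both rows join to no dose: 66516 via drug_id mismatch, 70960 via NULL drug_id.
```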
Concrete examples from CRC ADC audit (disease 4345, technology 708)
| Pub | Drug | PI drug_id | View drug_id | Dose evidence | View dose fields |
|---|---|---|---|---|---|
| 66516 | Zanidatamab | 10432 (Zanidatamab) | 15231 (Zanidatamab zovodotin) | single_dose=1200 mg | all NULL |
| 70960 | SHR-A1811 | NULL | 10733 (Trastuzumab rezetecan) | dose_min=3.2, dose_max=8.0, rp2d=6.4 mg/kg | all NULL |
| 114758 | Zanidatamab | 10432 (Zanidatamab) | 15231 (Zanidatamab zovodotin) | single_dose=1200 mg | all NULL |
The unstructured dose column (from trial registry study_plan_components) still shows generic protocol text like “dose levels and schedules determined by the Safety Monitoring Committee (SMC)” for these publications.
- 23,503 publications with dose_evidence extracted
- 8,764 publications with structured dose in view (37%)
- 17,826 publications with dose evidence silently dropped (76%)

Breakdown of dropped:

- ~13,600 NULL drug_id on publication_interventions (58%)
- ~2,148 drug_id mismatch between registry and publication (9%)
- ~2,078 other (pub not in view, dose_evidence has no usable fields, etc.)

Fix applied
Resolved by Issue 20 fix (2026-03-23). The root cause was the drug_interventions CTE sourcing drug_id from vw_bioloupe_interventions (registry) while pub_dose_lookup used publication_interventions drug_id. The v16 view restructuring (see Issue 20 solution) fixes this by:
- Using `publication_interventions` as the primary drug source (Source 0), so `di.drug_id` and `pdl.drug_id` come from the same table.
- Threading `publication_intervention_id` through both CTEs for exact 1:1 join matching — eliminating the drug_id mismatch entirely, including for NULL drug_id interventions.
- Allowing NULL drug_id interventions through Source 0 (if we extracted them, they’re the source of truth — don’t fall back to registry).
Result: dose evidence coverage went from 8,764 pubs (71% of extracted) to 11,902 pubs (96.6% of extracted).
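The effect of the v16 keying can be sketched in Ruby: joining on a shared publication_intervention_id makes NULL or mismatched drug_ids irrelevant (row shapes are illustrative, not the real view):

```ruby
# Illustrative sketch of the v16 join key: both sides carry the
# publication_interventions primary key, so drug_id no longer participates.
interventions = [
  { publication_intervention_id: 1, publication_id: 66516, drug_id: 10432 },
  { publication_intervention_id: 2, publication_id: 70960, drug_id: nil }, # allowed through Source 0
]
pub_dose_lookup = [
  { publication_intervention_id: 1, dose: 'single_dose=1200 mg' },
  { publication_intervention_id: 2, dose: 'rp2d=6.4 mg/kg' },
]

lookup = pub_dose_lookup.to_h { |r| [r[:publication_intervention_id], r[:dose]] }
joined = interventions.map { |i| i.merge(dose: lookup[i[:publication_intervention_id]]) }
# Every extracted dose survives the join, including the NULL drug_id row.
```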
4. Most frequent AE columns lack grade-classified ranked export fields
Short summary
The disease clinical evidence worksheet has two AE columns per row:
- `Most Frequent AE All Grade` — e.g. `Anemia (85.4%), Leukopenia (53.7%), Thrombocytopenia (53.7%)`
- `Most Frequent AE >=Gr3` — e.g. `Anemia (28.0%), Leukopenia (15.9%), Thrombocytopenia (14.6%)`
These are ranked lists of the top individual named adverse events by incidence, separated into all-grade vs grade ≥3 buckets.
The current pipeline extracts individual named AE rows with numeric values but does not:
- Classify each AE row by grade category (all-grade vs ≥grade 3)
- Rank AEs by incidence within each grade bucket
- Produce a formatted summary string for export
As a result, the worksheet AE columns cannot be populated from structured data today.
Where this sits in the current pipeline
Publication AE flow:
- `classify_publications` extracts `llm_data['adverse_events']` from the abstract. The LLM schema (details.rb: `AdverseEvent`) captures `adverse_event` (name), `measure_unit`, `observation` (free text), and `arms[].measure_value` (numeric). There is no `grade_category` field — grade information lands in `observation` as unstructured text or gets embedded in the AE name.
- `post_process_publications` creates `adverse_events` rows and `trial_arm_outcomes` rows with numeric `measure_value`.
- `standardize_adverse_events` does rule-based name standardization.
- `classify_adverse_events` LLM-matches AEs to safety endpoint categories.
Relevant code paths:
- `app/tasks/publications_llm_classification/task.rb` — extraction prompt (section 4: Adverse Events)
- `app/tasks/publications_llm_classification/details.rb` — `AdverseEvent` schema (lines 111–134)
- `app/tasks/publications_llm_classification/post_process.rb` — `process_adverse_events` persists rows
- `app/queries/tpp/emerging_clinical_data_query.rb` — `extract_safety_metrics_for_publication` only handles aggregate metrics (TRAE ≥Gr3, TEAE ≥Gr3, discontinuation), not individual named AEs
Exact restriction causing the drop
Two separate restrictions:
1. The LLM extraction schema has no grade classification field
The AdverseEvent schema in details.rb captures:
```ruby
attribute :adverse_event, :string   # name
attribute :measure_unit, :string    # percentage/count
attribute :observation, :string     # free text — grade info lands here
attribute :arms, Arm.to_array_type  # numeric values per arm
```

There is no grade_category enum. The LLM puts grade context into observation as free text (e.g. "Grade ≥3", "Grade 3 treatment-related", "Any grade", "Most common adverse event", or empty).
2. The downstream safety extraction only handles aggregate metrics
classify_safety_metric in emerging_clinical_data_query.rb classifies AEs into aggregate categories (:grade3_traes, :grade3_teaes, :discontinuation) and returns nil for individual named AEs like Nausea or Neutropenia. These individual AEs are stored but never surfaced in any export path.
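The aggregate-only behavior can be illustrated with a simplified stand-in (the real classify_safety_metric lives in emerging_clinical_data_query.rb; the regexes here are assumptions that only mirror the behavior described above):

```ruby
# Simplified stand-in: aggregate rollups classify, named AEs fall through to nil.
def classify_safety_metric(ae_name)
  return :discontinuation if ae_name.match?(/discontinuation/i)

  grade3 = ae_name.match?(/(grade|gr)\s*(≥|>=)?\s*3/i)
  return :grade3_traes if grade3 && ae_name.match?(/TRAE|treatment[- ]related/i)
  return :grade3_teaes if grade3 && ae_name.match?(/TEAE|treatment[- ]emergent/i)

  nil # individual named AEs (Nausea, Neutropenia, ...) are never surfaced
end
```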
Concrete examples
Worksheet row: Izalontamab brengitecan in ESCC (ESCC tab, row 3)
The worksheet contains:
- `Most Frequent AE All Grade`: `Anemia (85.4%), Leukopenia (53.7%), Thrombocytopenia (53.7%), Neutropenia (42.7%)`
- `Most Frequent AE >=Gr3`: `Anemia (28.0%), Leukopenia (15.9%), Thrombocytopenia (14.6%), Neutropenia (14.6%)`
Our database has the individual AE rows and numeric values for this publication, but no way to classify which rows are all-grade vs ≥grade 3, and no export field that produces the ranked formatted string.
Worksheet row: Micvotabart pelidotin in HNSCC (HNSCC tab, row 4)
The worksheet contains:
- `Most Frequent AE All Grade`: `Cutaneous (44%); Neuropathy (34%); Neutropenia (22%); Anemia (17%)`
- `Most Frequent AE >=Gr3`: `Neuropathy (28%), Neutropenia (11%)`
The pattern is consistent: top 2–4 AEs ranked by incidence, with percentages, semicolon or comma separated.
Current database state for publication-sourced AE rows:
- Total publications with AE rows: 36,802
- Publications with AE rows that have numeric trial_arm_outcomes.measure_value: 33,835
- Total AE rows with numeric values: 156,325
- Average AE rows per publication: 4.6 (median 3, p90 8)
Grade context distribution across the 156K rows:
| Grade signal | Rows | % | Source |
|---|---|---|---|
| Clearly grade ≥3 (in observation) | 19,079 | 12% | `observation ~* 'grade.*(3\|≥3\|3/4)'` |
| Clearly grade ≥3 (in name) | 16,862 | 11% | `name ~* 'grade.*(3\|≥3\|3/4)'` |
| Subtotal grade ≥3 identifiable | 57,206 | 37% | Combined name + observation |
| Explicitly all-grade | 4,054 | 3% | `observation ~* '(any grade\|all grade)'` |
| No grade context at all | 75,616 | 48% | Neither name nor observation mentions grade |
| Low grade only (1-2) | ~4,024 | 3% | |
| Other grade context | ~2,797 | 2% | |
At the publication level:
| Category | Publications |
|---|---|
| Has BOTH all-grade and grade ≥3 rows | 4,053 |
| Has grade ≥3 rows only | 6,547 |
| Has any-grade rows only | 7,719 |
| Ambiguous (no clear grade signals) | 1,287 |
The key finding: 48% of individual AE rows (75K) have no grade context in either name or observation. These are likely all-grade AEs but cannot be reliably classified without the abstract context that was available at extraction time.
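The grade-signal counts above were derived with SQL pattern matching; a Ruby mirror of the same heuristic (illustrative regexes, not the production classifier) shows why the unlabeled 48% cannot be bucketed from the stored text alone:

```ruby
# Heuristic mirror of the SQL patterns in the table above (illustrative).
GRADE_GTE3 = /grade.*?(≥\s*3|>=\s*3|3\/4|\b3\b)/i
ALL_GRADE  = /any grade|all grade/i

def grade_signal(name, observation)
  text = [name, observation].compact.join(' ')
  return :grade_gte3 if text.match?(GRADE_GTE3)
  return :all_grade  if text.match?(ALL_GRADE)
  :no_grade_context # the 48% bucket — no signal in name or observation
end
```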
Worksheet AE column patterns
From spot-checking across HNSCC and ESCC tabs:
- All Grade column: typically 2–4 AEs, sometimes just names without % when percentages aren’t reported
- >=Gr3 column: typically 1–3 AEs, usually fewer than all-grade
- Some cells include `(NR)` for “not reported”
- Separator style varies: semicolons and commas both used
- Format: `AE_name (value%)`
Downstream impact
- The disease clinical evidence export cannot populate the two most-frequent-AE columns
- The existing safety extraction only surfaces aggregate TRAE/TEAE/discontinuation metrics
- Individual named AEs with percentages exist in the database but are invisible to reporting
- Publications where the abstract reports specific high-frequency AEs (the most clinically relevant safety signal) cannot be compared to the manually curated worksheet
What the issue is not
This is not a missing AE extraction problem. The pipeline already extracts individual named AEs with numeric values for ~34K publications. The AE data exists — it just lacks grade classification and a ranked export format.
This is also not an aggregate safety metric problem. TRAE ≥Gr3, TEAE ≥Gr3, and discontinuation rates are already handled by extract_safety_metrics_for_publication. The gap is specifically in individual named AE ranking.
Open characterization questions
- Should the ranked summary be persisted as pre-formatted strings (like the worksheet cells), or as structured arrays that the export formats at query time?
- When a publication has AE rows for multiple arms, should the ranked summary use the experimental arm only (current behavior for aggregate metrics) or present the arm that matches the export row context?
Explored solution direction
The solution has two parts: a schema enhancement for future publications and a backfill for existing data.
1. Modify classify_publications extraction to include grade classification (going forward)
Add a grade_category enum field to the AdverseEvent schema in details.rb:
```ruby
class AdverseEvent
  include StoreModel::Model
  include DataTasks::JsonSchema

  desc 'The name of the adverse event reported in the trial.'
  attribute :adverse_event, :string

  desc 'Grade category of this adverse event. Use all_grade for any-grade or ' \
       'unspecified-grade AEs, grade_gte3 for grade ≥3/grade 3-4/grade 3-5 AEs.'
  attribute :grade_category, :string # enum: all_grade, grade_gte3

  # ... existing fields ...
end
```

Update the extraction prompt (section 4 in task.rb) to instruct the LLM to classify grade at extraction time. The LLM already reads the abstract in full — it knows whether “Nausea (75.3%)” is reported as all-grade or ≥grade 3 from surrounding context. Adding one enum field is nearly free in token cost.
Add a grade_category column to the adverse_events table (migration). Update post_process.rb:process_adverse_events to persist the new field.
2. Backfill existing AE rows with LLM grade classification
A separate one-time LLM task that reads existing adverse_events rows + the publication abstract and classifies grade_category for each row.
Scope: ~33,835 publications, ~156K AE rows.
Input per publication prompt:
- Publication title + abstract (~2,750 chars avg)
- Existing AE rows with name, observation, and measure_value (~217 chars avg)
Output per AE row:
grade_category:all_grade|grade_gte3
Estimated cost with gpt-5-mini batched: ~$15–25 for the full backfill.
The backfill task would update adverse_events.grade_category directly. After completion, all AE rows (both historical and future) have grade classification from the same source of truth.
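The validate-then-update flow of such a backfill can be sketched in Ruby. All class, method, and field names below are hypothetical illustrations, not the actual AdverseEventGradeBackfill implementation:

```ruby
# Hypothetical sketch of the backfill loop: ask a classifier for a
# { ae_row_id => grade_category } map per publication, discard invalid
# values, and apply valid ones to the in-memory rows.
class GradeBackfillSketch
  VALID = %w[all_grade grade_gte3].freeze

  def initialize(classifier:)
    @classifier = classifier # callable: (abstract, ae_rows) -> { id => category }
  end

  # Returns the number of rows updated across all publications.
  def run(publications)
    publications.sum do |pub|
      updates = @classifier.call(pub[:abstract], pub[:ae_rows])
      updates.count do |ae_id, category|
        next false unless VALID.include?(category) # guard against LLM drift
        row = pub[:ae_rows].find { |r| r[:id] == ae_id }
        row && (row[:grade_category] = category; true)
      end
    end
  end
end
```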
3. Ranked summary derivation (query-time)
Once all AE rows have grade_category, producing the worksheet columns is a straightforward query:
```sql
-- For a given publication + arm context:
SELECT ae.name, tao.measure_value
FROM adverse_events ae
JOIN trial_arm_outcomes tao ON tao.adverse_event_id = ae.id
WHERE ae.source_id = :publication_id
  AND ae.source_type = 'Publication'
  AND ae.grade_category = 'all_grade' -- or 'grade_gte3'
  AND ae.measure_unit = 'percentage'
  AND tao.measure_value IS NOT NULL
  AND tao.measure_value::numeric > 0
ORDER BY tao.measure_value::numeric DESC
LIMIT 4
```

Format as: `AE_name (value%); AE_name (value%); ...`
This can be computed at export time from the grade-tagged rows without a separate persistence step.
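The final formatting step can be sketched in Ruby (helper name and row shape are illustrative, not the reporting-path implementation):

```ruby
# Formats grade-bucketed AE rows into the worksheet cell style
# "AE_name (value%); AE_name (value%); ..." — names here are illustrative.
def format_most_frequent_aes(rows, grade_category:, limit: 4)
  rows.select { |r| r[:grade_category] == grade_category && r[:value].to_f > 0 }
      .sort_by { |r| -r[:value].to_f }
      .first(limit)
      .map { |r| "#{r[:name]} (#{r[:value]}%)" }
      .join('; ')
end

rows = [
  { name: 'Anemia',     value: 28.0, grade_category: 'grade_gte3' },
  { name: 'Leukopenia', value: 15.9, grade_category: 'grade_gte3' },
  { name: 'Anemia',     value: 85.4, grade_category: 'all_grade' },
]
format_most_frequent_aes(rows, grade_category: 'grade_gte3')
# => "Anemia (28.0%); Leukopenia (15.9%)"
```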
4. Workflow placement
No new workflow step needed for the going-forward path — grade classification happens inside the existing classify_publications step and is persisted by post_process_publications.
The backfill task runs independently as a one-time Thor task, similar in pattern to the subgroup disease adjudication backfill (Issue 1).
Solution applied
Implemented 2026-03-11. Change: add-publication-ae-grade-classification.
Status: Implementation complete. Full historical backfill has not yet been run across the remaining eligible publication AE rows.
Applied fix:
- Persisted AE grade category on `adverse_events` — Added `adverse_events.grade_category` with canonical values `all_grade` and `grade_gte3`, plus model normalization/validation so downstream readers have a stable field instead of re-parsing free text.
- Extended publication extraction for new rows — Updated the publication LLM schema and prompt so `classify_publications` emits `grade_category` for each adverse event row, and updated `post_process_publications` to persist it when creating publication-sourced AE rows.
- Added historical backfill task — Created `PublicationsLlmClassification::AdverseEventGradeBackfill` and wired a Thor task: `thor clinical_trials:publications:backfill_adverse_event_grade_categories`
  - supports non-batched execution, `--publication-ids`, `--limit`, `--source`, `--model`, and `--overwrite`
  - default validation model: `gpt-5-mini`
- Added ranked named-AE export derivation — Implemented query-time ranking of named adverse events by `grade_category` and wired worksheet-style outputs into the reporting path: `Most Frequent AE All Grade`, `Most Frequent AE >=Gr3`
- Hardened ranked summary filtering after manual spot checks — Updated the summary helper so it:
  - prefers the actual adverse-event name over standardized bucket labels
  - excludes aggregate rollup rows such as `TRAE`, `TEAE`, `SAE`, `AESI`, `irAE`, discontinuation, fatal/grade-5 rollups
  - excludes zero-value / `not reported` rows from named-AE summaries
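A minimal sketch of the rollup exclusion, assuming a regex-based filter (the actual helper and its exact patterns live in the reporting path and may differ):

```ruby
# Illustrative filter: keep named AEs, drop aggregate rollup rows.
# The pattern list is an assumption based on the exclusions described above.
AGGREGATE_ROLLUP = /\b(TRAEs?|TEAEs?|SAEs?|AESIs?|irAEs?)\b|discontinuation|grade\s*5|fatal/i

def named_ae?(ae_name)
  !ae_name.match?(AGGREGATE_ROLLUP)
end
```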
Files changed:
- `app/models/adverse_event.rb`
- `app/tasks/publications_llm_classification/details.rb`
- `app/tasks/publications_llm_classification/task.rb`
- `app/tasks/publications_llm_classification/post_process.rb`
- `app/tasks/publications_llm_classification/adverse_event_grade_backfill.rb`
- `lib/tasks/clinical_trials/publications.thor`
- `app/queries/clinical_trials/publications_query.rb`
- `app/queries/tpp/emerging_clinical_data_query.rb`
- `app/services/tpp/reports/emerging_clinical_data_report.rb`
- `db/migrate/20260311222107_add_grade_category_to_adverse_events.rb`
Manual validation completed:
- Non-batched `gpt-5-mini` run on 4 hand-picked publications: `4` publications processed, `33` rows updated
- Confirmed persisted `all_grade` vs `grade_gte3`, default skip behavior, overwrite behavior, and arm fallback
- Additional non-batched `gpt-5-mini` run on 8 random publications: `8` publications processed, `33` rows updated
- Random spot checks confirmed:
  - named grade `3/4` and `>=3` rows classify as `grade_gte3`
  - named any-grade / grade-1 rows classify as `all_grade`
  - aggregate safety rows are excluded from ranked named-AE summaries
  - zero/`not reported` rows no longer emit bogus ranked summary strings
Model outcome: gpt-5-mini was good enough on the manual validation slices; no progression to a stronger model was needed.
Operational follow-up: run the full historical backfill for the remaining eligible publication AE rows before marking this issue fully complete.
5. Publication prior therapy context is not extracted — min/max prior lines and prior therapy exposure are missing
Short summary
The disease clinical evidence worksheet has four columns that describe the prior therapy context of a publication’s study population:
- `Min Prior Lines` — minimum number of prior lines of therapy (e.g. `1`)
- `Max Prior Lines` — maximum number of prior lines (e.g. `7`)
- `Treatment Line` — e.g. `2L+`, `3L+` (already extracted; this issue does not cover treatment line)
- `Prior Taxane Use` — e.g. `Yes`, `No`, `Allowed`, `Required`
Treatment line is already extracted and persisted on trial_subgroups.treatment_lines (see TreatmentContextExtraction task, renamed from TreatmentLineExtraction). But min_prior_lines, max_prior_lines, and prior therapy exposure are not extracted from publications at all.
The trial side has partial analogues:
- `trial_eligibility_criteria` with `modifier = 'prior_treatment_lines'` stores `min`/`max` for ~62K trial records
Note: indicated_prior_therapies is related to drug approval indications, not trials or publications. It captures required/excluded prior therapies for regulatory label context, not clinical study populations.
Publication-sourced rows have no equivalent for either prior line counts or prior therapy exposure. When the worksheet reports “median 4 prior therapies (range 0–7)” or “52% had prior taxane therapy for mCRPC,” that context exists only in the publication abstract and is not captured by the pipeline.
Where this sits in the current pipeline
Treatment line extraction:
- `TreatmentContextExtraction` in `app/tasks/publications_llm_classification/treatment_context_extraction.rb` maps abstracts to enum values (`1L`, `2L+`, `3L+`, etc.) and extracts prior therapy context
- Results persist on `trial_subgroups.treatment_lines` (JSONB array) and `trial_subgroups.llm_data['treatment_lines']`
- The efficacy view normalizes to `effective_line` (numeric 0–4) and `treatment_settings`
The treatment line extraction already reads prior therapy language to determine the line (e.g. “median of 4 prior therapies” → 3L+). But the numeric counts and specific therapy exposures are consumed as reasoning inputs, not persisted as structured data.
There is no extraction step for:
- publication-level prior line counts (min, max, median)
- prior therapy exposure flags (prior taxane, prior checkpoint inhibitor, etc.)
Exact restriction causing the gap
1. Treatment line extraction discards numeric prior therapy counts
The TreatmentLineExtraction system prompt instructs the LLM to use prior therapy counts for line determination:
- "median prior lines = N":
  - N ≥ 2 → "3L+"
  - N = 1 (or range includes 1–2) → "2L+"

But the output schema (TreatmentLineDetails) only captures treatment_lines (enum array) and evidence (free text). The actual numbers (median = 4, range 0–7) are consumed during reasoning but not persisted as structured fields.
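The discarded-numbers point can be made concrete with a tiny sketch of the prompt rule: the numeric median decides the bucket, and only the bucket survives to the output schema:

```ruby
# Sketch of the prompt rule: the median determines the line bucket,
# then the numbers themselves are discarded from the structured output.
def treatment_line_from_median(median_prior_lines)
  median_prior_lines >= 2 ? '3L+' : '2L+'
end

treatment_line_from_median(4) # => "3L+" (median 4 and range 0-7 are then lost)
treatment_line_from_median(1) # => "2L+"
```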
2. Prior therapy exposure is completely out of scope
The treatment line extraction prompt explicitly states:
> Out of scope: Dosing, endpoints, safety, biomarkers (unless they clarify line), efficacy stats.

Prior therapy exposure (e.g. “52% had prior taxane,” “required prior platinum,” “prior CAR-T allowed”) is not captured by any extraction step.
3. The efficacy view has no prior-line or prior-therapy columns from publications
vw_publication_efficacy_data exposes effective_line, treatment_settings, and raw_treatment_lines but has no min_prior_lines, max_prior_lines, or prior therapy fields. The trial efficacy view (vw_trial_efficacy_data) does have min_line and max_line from trial_eligibility_criteria, but the publication view has no equivalent.
Concrete examples
Example 1: publication 152908 (BOLD-100 in gastric cancer)
Abstract states:
“Patients had a median of 4 prior systemic therapies [0, 7], 1 with no prior therapy, 2 had 2 prior therapies, 5 with 3 prior therapies, and 13 patients with 4 or more prior therapies. 20/21 patients received prior platinum with 18/21 receiving prior FOLFOX/CAPOX.”
Current extraction result: treatment_lines: ["3L+"] — correct, but we lose:
- `min_prior_lines: 0`
- `max_prior_lines: 7`
- `median_prior_lines: 4`
- prior FOLFOX/CAPOX: 18/21 (86%)
Example 2: publication 162733 (sEphB4-HSA in mCRPC)
Abstract states:
“treatment with at least one second generation androgen receptor (AR)-targeted therapy but no more than three prior therapies for mCRPC” “received a median of three prior therapies (range 1-3)” “Ten patients received prior taxane for mCRPC or hormone sensitive prostate cancer”
Current extraction result: treatment_lines: ["2L+"] — correct, but we lose:
- `min_prior_lines: 1`
- `max_prior_lines: 3`
- `median_prior_lines: 3`
- prior AR-targeted therapy: 14/14 (100%, required)
Example 3: publication 53818 (PROfound — olaparib by prior taxane)
This is the paradigmatic case — the entire publication is organized around prior taxane use as a stratification factor. The abstract reports efficacy by prior taxane yes/no subgroups. The worksheet needs Prior Taxane Use: Yes/No (stratified).
Current extraction captures treatment_lines: ["2L+"] but does not capture that prior taxane is the defining subgroup variable.
Downstream impact
- The disease clinical evidence export cannot populate `Min Prior Lines`, `Max Prior Lines`, or `Prior Taxane Use` columns from publication data
- Researchers manually fill these from abstracts — exactly the kind of structured extraction the pipeline should automate
- Prior therapy context is clinically important for interpreting efficacy results (a drug showing ORR of 30% in a post-taxane population is very different from 30% in a treatment-naïve population)
- Without structured prior therapy data, comparative analyses across publications in the same disease are unreliable
What the issue is not
This is not a treatment line problem. Treatment line extraction works well and correctly maps abstracts to 1L, 2L+, 3L+, etc. The issue is that treatment line is a categorical bucket, while prior therapy context includes:
- numeric counts (min, max, median, range)
- specific therapy exposure flags
- exposure requirements (required, allowed, excluded)
This is also not a trial eligibility criteria problem. The trial side has prior_treatment_lines and indicated_prior_therapies, but these describe trial enrollment criteria, not the actual population characteristics reported in the publication abstract.
Prior therapy language in ~71K result publications:
| Pattern | Publications mentioning |
|---|---|
| Mentions median prior line count | 1,936 |
| Mentions prior line threshold (≥N) | 1,839 |
| Mentions prior line range | 883 |
| Mentions any specific prior therapy class | 1,458 |
Specific prior therapy class mentions (non-exclusive):
| Prior therapy class | Publications |
|---|---|
| Prior checkpoint/IO therapy | 572 |
| Prior platinum | 302 |
| Prior anti-VEGF | 241 |
| Prior radiation | 190 |
| Prior CDK4/6i | 152 |
| Prior hormonal/endocrine | 151 |
| Prior taxane | 148 |
| Prior CAR-T | 143 |
| Prior transplant | 84 |
| Prior HMA | 80 |
| Prior surgery | 55 |
| Prior PI/bortezomib | 54 |
| Prior IMiD/lenalidomide | 43 |
| Prior anthracycline | 39 |
| Prior fluoropyrimidine | 33 |
| Prior gemcitabine | 27 |
| Prior irinotecan | 21 |
| Prior bispecific | 12 |
| Prior BCG | 10 |
| Prior ADC | 3 |
Key observations:
- “Prior taxane” (148 publications) is just one instance of a general pattern — at least 15 therapy classes appear routinely
- The highest-frequency classes (checkpoint/IO, platinum, anti-VEGF) reflect current oncology practice where these are standard earlier-line therapies
- ~1,900 publications contain explicit numeric prior line counts that are currently consumed during treatment line reasoning but discarded
Spot checks
Publications with rich prior therapy context that is currently lost:
- `152908` (BOLD-100 in gastric cancer): median 4 prior therapies (range 0–7), 95% prior platinum — extracted as `3L+` only
- `162733` (sEphB4-HSA in mCRPC): median 3 prior therapies (range 1–3), 71% prior taxane, 100% prior AR-targeted — extracted as `2L+` only
- `53818` (PROfound olaparib): entire study stratified by prior taxane yes/no — extracted as `2L+` only, taxane context not captured
- `65484` (givastomig in GEC): median 3 prior lines, 74% prior PD-(L)1 inhibitor — extracted as `3L+` only
- `147778` (GSK2636771 in mCRPC): median 4 prior lines, 83% prior taxane — extracted as `3L+` only
Current semantic model vs what’s needed
The pipeline currently models treatment context as:
`trial_subgroups.treatment_lines → ["2L+"]` (categorical bucket)

The worksheet needs:

- `Treatment Line → 2L+` (categorical — already have)
- `Min Prior Lines → 1` (numeric — don’t have)
- `Max Prior Lines → 3` (numeric — don’t have)
- `Prior Taxane Use → Yes (71%)` (therapy exposure flag — don’t have)
- `Prior Platinum Use → Yes (95%)` (therapy exposure flag — don’t have)
- `Prior IO Use → No` (therapy exposure flag — don’t have)

The worksheet column is labeled “Prior Taxane Use” specifically, but the underlying data pattern is general: researchers track prior exposure to whatever therapy class is clinically relevant for the disease area. In breast cancer it’s taxane and anthracycline; in mCRPC it’s taxane and AR-targeted therapy; in myeloma it’s IMiD, PI, and anti-CD38; in lymphoma it’s CAR-T and bispecific.
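Once subgroups carry structured prior-therapy rows, deriving the worksheet cells is mechanical. A Ruby sketch, assuming the field names from the schema proposed later in this issue (exposure_percent is a hypothetical field, not part of the current schema):

```ruby
# Hypothetical derivation of a "Prior <class> Use" worksheet cell from
# structured prior-therapy rows. Field names are illustrative assumptions.
def prior_use_cell(prior_therapies, therapy_class)
  row = prior_therapies.find { |t| t[:therapy_class] == therapy_class }
  return 'No' if row.nil?

  pct = row[:exposure_percent]
  pct ? "Yes (#{pct.round}%)" : 'Yes'
end

subgroup = {
  min_prior_lines: 1,
  max_prior_lines: 3,
  prior_therapies: [
    { therapy_class: 'taxane',            exposure_percent: 71.0 },
    { therapy_class: 'endocrine_therapy', exposure_percent: 100.0 },
  ],
}
prior_use_cell(subgroup[:prior_therapies], 'taxane')   # => "Yes (71%)"
prior_use_cell(subgroup[:prior_therapies], 'platinum') # => "No"
```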
Open characterization questions
- Should we distinguish between required/allowed/excluded prior therapies, or just report exposure percentages?
- “At least one prior platinum” (required) vs “prior taxane was allowed” (optional) vs “52% had prior taxane” (reported)
- These carry different clinical meaning
- The `indicated_prior_therapies` optionality enum on the indications side uses: `must_have_received`, `progressed_on_after`, `not_previously_treated_with`, `After Failure Of`, `refractory_to`, `ineligible_for`, `inadequate_response_to`, `Intolerant to` — these are richer than what abstracts typically state, but the pattern is informative
- How should the persistence model represent scope when prior therapy context applies to the overall population but individual subgroups break it down differently?
Explored solution direction
Key design decisions from investigation
1. Subgroup-level, not publication-level
Prior therapy context should persist at the subgroup level (on trial_subgroups), not at the publication level. Evidence:
- Treatment line is already subgroup-level, and prior therapy context is tightly coupled to treatment line
- 1,827 publications have multiple disease subgroups with treatment lines; 119 of those have different treatment lines across subgroups (e.g. pub `1703`: “treatment-naive” subgroup at `1L` vs “previously treated” at `2L+`)
- When treatment lines differ across subgroups, prior therapy context necessarily differs too — a 1L subgroup has 0 prior lines while a 2L+ subgroup has ≥1
- The PROfound example (pub `53818`) shows prior taxane as a subgroup stratification variable — some subgroups are “prior taxane yes” and others “prior taxane no”
For publications where the abstract only states population-level prior therapy characteristics (the common case), all subgroups inherit the same values. The subgroup-level model handles both cases correctly.
2. Rename to TreatmentContextExtraction
The existing TreatmentLineExtraction should be renamed to TreatmentContextExtraction (or similar) to reflect its expanded scope. The task already reads all prior therapy language for treatment line reasoning — it just discards the structured details. Expanding the output schema is natural.
This is not “mixing concerns” — treatment line, prior line counts, and prior therapy exposure are all facets of the same clinical context question: “Where does this population sit in the treatment sequence?”
3. Strict enum for therapy_class + free text for therapy_name (two-field design)
The key design insight is separating what the abstract says from what we query on:
- `therapy_name: "taxane-based chemotherapy"` ← free text, what the abstract says (evidence)
- `therapy_class: "taxane"` ← strict enum, what we filter/query on

This avoids the disease_stages antipattern in ParticipationCriterion where an initial predefined list grew unbounded through LLM and import drift, producing duplicates like Stage I / Stage 1 / Stage IA with no normalization layer.
The therapy_class enum is fixed in the schema. The LLM must pick from the list or use other. If other accumulates a meaningful cluster over time, that’s signal to add a new enum value — a conscious schema change, not drift.
The enum covers ~20 therapy classes based on publication frequency analysis:
| therapy_class | Pubs mentioning | Example abstract phrases |
|---|---|---|
| checkpoint_inhibitor | 865 | “prior anti-PD-1”, “prior IO”, “prior pembrolizumab” |
| surgery | 572 | “prior resection”, “prior nephrectomy” |
| transplant | 533 | “prior HSCT”, “prior auto-SCT”, “prior allo-SCT” |
| platinum | 506 | “prior platinum”, “prior cisplatin”, “prior carboplatin” |
| endocrine_therapy | 477 | “prior ARPI”, “prior endocrine therapy”, “prior enzalutamide” |
| anti_vegf | 361 | “prior bevacizumab”, “prior anti-VEGF”, “prior anti-angiogenic” |
| taxane | 344 | “prior taxane”, “prior docetaxel”, “prior paclitaxel” |
| radiation | 340 | “prior radiation”, “prior radiotherapy”, “prior chemoradiation” |
| car_t | 313 | “prior CAR-T”, “prior CAR T-cell therapy” |
| cdk_inhibitor | 204 | “prior CDK4/6 inhibitor”, “prior palbociclib” |
| anti_her2 | 184 | “prior trastuzumab”, “prior T-DXd”, “prior pertuzumab” |
| imid | 141 | “prior lenalidomide”, “prior IMiD”, “prior pomalidomide” |
| hma | 121 | “prior azacitidine”, “prior HMA”, “prior decitabine” |
| proteasome_inhibitor | 121 | “prior bortezomib”, “prior PI”, “prior carfilzomib” |
| anthracycline | 90 | “prior anthracycline”, “prior doxorubicin” |
| fluoropyrimidine | 75 | “prior 5-FU”, “prior capecitabine” |
| bispecific | 48 | “prior bispecific antibody” |
| anti_cd38 | 43 | “prior daratumumab”, “prior anti-CD38” |
| adc | 38 | “prior ADC”, “prior antibody-drug conjugate” |
| bcg | 16 | “prior BCG” |
| chemotherapy | — | “prior chemotherapy” (generic, when no specific class stated) |
| other | — | Catch-all for anything not above |
The long tail drops off fast — only 20 classes cover virtually all clinically meaningful prior therapy mentions in oncology/hematology publications.
Compound semantics are manageable: most publications (2,101 out of 2,349 mentioning specific priors) reference only a single prior therapy class. Only 231 mention two, and 17 mention three or more.
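The strict-enum guard described in this section can be sketched as a one-line normalization (the class list is abbreviated from the table above; the fallback-to-other behavior is the design decision, the code itself is illustrative):

```ruby
# Sketch of the strict-enum guard: therapy_class must come from the fixed
# list; anything else collapses to "other" instead of drifting into the data.
THERAPY_CLASSES = %w[
  checkpoint_inhibitor platinum taxane anti_vegf radiation car_t
  endocrine_therapy cdk_inhibitor chemotherapy other
].freeze

def normalize_therapy_class(value)
  THERAPY_CLASSES.include?(value) ? value : 'other'
end

normalize_therapy_class('taxane')           # => "taxane"
normalize_therapy_class('Taxane therapy')   # => "other" (drift caught)
```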
4. Compound prior therapy semantics (“prior X and Y”, “prior X or Y”)
Real abstract patterns:
- Conjunctive (AND): `"who received prior taxane, endocrine therapy, CDK4/6 inhibitor, and 2-4 prior chemotherapies"` (TROPiCS-02) — all four are required
- Disjunctive (OR): `"prior platinum and/or fluoropyrimidine chemotherapy"` — either qualifies
- Mixed: `"prior checkpoint inhibitor and platinum-based chemotherapy"` — both required
The simplest model that handles all cases: extract each therapy as a separate row in the prior_therapies array. Each row has its own exposure_status. For compound requirements like TROPiCS-02, that becomes:
```json
[
  { "therapy_name": "taxane",            "exposure_status": "required", "evidence": "..." },
  { "therapy_name": "endocrine therapy", "exposure_status": "required", "evidence": "..." },
  { "therapy_name": "CDK4/6 inhibitor",  "exposure_status": "required", "evidence": "..." }
]
```

We do NOT need to model the logical relationship (AND/OR) between therapies explicitly. Each therapy entry stands on its own with its exposure status. This is sufficient for worksheet export ("Prior Taxane Use: Yes") and for filtering ("show publications requiring prior CDK4/6i").
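The filtering use case can be sketched as a plain-Ruby predicate over this per-row shape (hypothetical helper and illustrative data — not actual pipeline code):

```ruby
# Hypothetical filter over the per-row prior_therapies model above.
# The hash shape mirrors the JSON array; names are illustrative.
def requires_prior?(prior_therapies, therapy_name)
  prior_therapies.any? do |t|
    t["therapy_name"] == therapy_name && t["exposure_status"] == "required"
  end
end

# TROPiCS-02-style compound requirement, one row per therapy:
tropics_02 = [
  { "therapy_name" => "taxane",            "exposure_status" => "required" },
  { "therapy_name" => "endocrine therapy", "exposure_status" => "required" },
  { "therapy_name" => "CDK4/6 inhibitor",  "exposure_status" => "required" }
]
```

A query like "show publications requiring prior CDK4/6i" then reduces to calling the predicate per publication — no AND/OR modeling needed.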
Proposed schema
Rename TreatmentLineExtraction → TreatmentContextExtraction
```ruby
class Subgroup
  include StoreModel::Model
  include DataTasks::JsonSchema

  desc 'ID of the subgroup from the input'
  attribute :id, :integer
  attribute :subgroup_type, :string, ignore: true
  attribute :subgroup_value, :string, ignore: true

  # Existing
  attribute :treatment_lines, ArrayType.new, enum: Indication::TREATMENT_LINES
  desc 'Textual evidence or reasoning that supports the treatment line(s)'
  attribute :evidence, :string

  # New: prior line counts
  desc 'Minimum number of prior lines of therapy for this population (from eligibility criteria or reported range). Null if not stated.'
  attribute :min_prior_lines, :integer

  desc 'Maximum number of prior lines of therapy for this population. Null if not stated.'
  attribute :max_prior_lines, :integer

  desc 'Median number of prior lines of therapy, if explicitly stated in the abstract.'
  attribute :median_prior_lines, :integer

  # New: prior therapy exposures
  attribute :prior_therapies, PriorTherapyExposure.to_array_type
end

class PriorTherapyExposure
  include StoreModel::Model
  include DataTasks::JsonSchema

  THERAPY_CLASSES = %w[
    checkpoint_inhibitor surgery transplant platinum endocrine_therapy
    anti_vegf taxane radiation car_t cdk_inhibitor anti_her2 imid hma
    proteasome_inhibitor anthracycline fluoropyrimidine bispecific
    anti_cd38 adc bcg chemotherapy other
  ].freeze

  desc 'Normalized therapy class for filtering/querying. Must be one of the enum values.'
  attribute :therapy_class, :string # enum: THERAPY_CLASSES

  desc 'Therapy name as stated in the abstract (e.g. "taxane-based chemotherapy", "prior anti-PD-1 therapy", "lenalidomide"). Preserves original phrasing for evidence.'
  attribute :therapy_name, :string

  desc 'How this prior therapy relates to the study population'
  attribute :exposure_status, :string # enum: required, allowed, excluded, reported

  desc 'Percentage of patients with this prior exposure, if reported (e.g. 71.4). Null if not stated.'
  attribute :exposure_percentage, :float

  desc 'Evidence quote from the abstract'
  attribute :evidence, :string
end
```

Persistence

New columns on trial_subgroups:
- `min_prior_lines` (integer, nullable)
- `max_prior_lines` (integer, nullable)
- `median_prior_lines` (integer, nullable)
Prior therapy exposures persist in trial_subgroups.llm_data['prior_therapies'] (JSONB array), consistent with how treatment line evidence is already stored in trial_subgroups.llm_data['treatment_lines'].
The efficacy view would expose min_prior_lines and max_prior_lines alongside effective_line. The emerging clinical data query would format prior therapies for export.
Backfill
This requires a full backfill since we’re expanding the extraction schema. The renamed TreatmentContextExtraction task re-runs on all result publications that have subgroups.
Options to reduce cost:
- Only backfill publications where the abstract contains prior therapy language (~3K–5K publications based on regex estimates) for the prior therapy fields
- Use `gpt-5-mini` for the backfill since the extraction is well-defined
- Batch processing with the existing `DataTasks::Task` infrastructure
- The prior line count fields can be extracted in the same pass as treatment lines since the LLM already reasons about them
Estimated cost: ~$31 batched with gpt-5-mini for a full backfill of all 62K publications with subgroups. No regex pre-filter — the LLM returns empty arrays when no prior therapy context exists, and the cost per publication ($0.001) makes filtering unnecessary.
Export formatting
For the worksheet columns:
- `Min Prior Lines` → `trial_subgroups.min_prior_lines` (direct)
- `Max Prior Lines` → `trial_subgroups.max_prior_lines` (direct)
- `Prior Taxane Use` → derived from the `llm_data['prior_therapies']` array, filtering for `therapy_class = 'taxane'`:
  - If `exposure_status = 'required'` → `Yes (required)`
  - If `exposure_status = 'reported'` with percentage → `Yes (71%)`
  - If `exposure_status = 'excluded'` → `No (excluded)`
  - If `exposure_status = 'allowed'` → `Allowed`
  - If no entry with `therapy_class = 'taxane'` → `NR`
The worksheet currently labels this column “Prior Taxane Use” but the extraction captures all therapy classes via the strict therapy_class enum. The export filters by enum value — therapy_class = 'taxane' for this column, therapy_class = 'checkpoint_inhibitor' for “Prior IO Use”, etc. No schema changes needed to add new worksheet columns for different disease areas.
The therapy_name free text field preserves the original abstract phrasing for display and evidence review (e.g. “prior docetaxel-based chemotherapy” rather than just “taxane”).
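The column derivation above can be sketched as a small helper (hypothetical method name and data shape mirroring `llm_data['prior_therapies']` — the real formatting lives in the emerging clinical data query):

```ruby
# Sketch of the "Prior <class> Use" export mapping described above.
# Assumes entries shaped like llm_data['prior_therapies']; illustrative only.
def prior_therapy_use(prior_therapies, therapy_class)
  entry = prior_therapies.find { |t| t["therapy_class"] == therapy_class }
  return "NR" if entry.nil?

  case entry["exposure_status"]
  when "required" then "Yes (required)"
  when "reported"
    pct = entry["exposure_percentage"]
    pct ? "Yes (#{pct.round}%)" : "Yes"
  when "excluded" then "No (excluded)"
  when "allowed"  then "Allowed"
  end
end
```

Swapping the `therapy_class` argument yields the other worksheet columns ("Prior IO Use" via `checkpoint_inhibitor`, etc.) with no further changes.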
Solution applied
Implemented as the TreatmentContextExtraction task, which expands the former TreatmentLineExtraction to extract prior therapy context alongside treatment lines in a single LLM call.
Schema changes
New columns on trial_subgroups:
- `min_prior_lines` (integer, nullable) — minimum number of prior lines of therapy
- `max_prior_lines` (integer, nullable) — maximum number of prior lines
- `median_prior_lines` (integer, nullable) — median number of prior lines
New JSONB key in trial_subgroups.llm_data:
- `prior_therapies` — array of `PriorTherapyExposure` objects, each with:
  - `therapy_class` — strict enum of 22 values (`checkpoint_inhibitor`, `taxane`, `platinum`, `endocrine_therapy`, `anti_vegf`, `car_t`, `cdk_inhibitor`, `anti_her2`, `imid`, `hma`, `proteasome_inhibitor`, `anthracycline`, `fluoropyrimidine`, `bispecific`, `anti_cd38`, `adc`, `bcg`, `surgery`, `transplant`, `radiation`, `chemotherapy`, `other`)
  - `therapy_name` — free text preserving original abstract phrasing
  - `exposure_status` — enum: `required`, `allowed`, `excluded`, `reported`
  - `exposure_percentage` — float, nullable (e.g. 71.4 for "71% had prior taxane")
  - `evidence` — quote from abstract
Code changes
- `app/tasks/publications_llm_classification/treatment_context_extraction.rb` — renamed from `treatment_line_extraction.rb`. Expanded `Subgroup` schema adds `min_prior_lines`, `max_prior_lines`, `median_prior_lines`, and the `prior_therapies` array. System prompt extended with prior line count extraction rules and therapy class mapping with the 22-value enum.
- `app/tasks/publications_llm_classification/post_process.rb` — updated to write the `min_prior_lines`, `max_prior_lines`, `median_prior_lines` columns and the `prior_therapies` JSONB key during subgroup creation.
- `db/views/vw_publication_efficacy_data_v09.sql` — added `min_prior_lines`, `max_prior_lines`, `median_prior_lines` from `trial_subgroups` to the materialized view output.
- `app/queries/tpp/emerging_clinical_data_query.rb` — added `min_prior_lines`, `max_prior_lines`, `median_prior_lines` to result rows. Added a `prior_therapy_class` parameter; when specified, includes a `prior_therapy_use` column formatted as: `required` → "Yes (required)", `reported` with percentage → "Yes (71%)", `excluded` → "No (excluded)", `allowed` → "Allowed", no entry → "NR".
- `lib/tasks/one_off/backfill_prior_therapy_context.thor` — self-contained one-off backfill task processing all ~62K publications with subgroups (no regex pre-filter). Uses `gpt-5-mini`. Only writes prior therapy fields (`min_prior_lines`, `max_prior_lines`, `median_prior_lines`, `llm_data['prior_therapies']`) — does not overwrite existing `treatment_lines`. Delete when backfill is complete.
- `lib/tasks/one_off/cleanup_prior_therapy_values.thor` — one-off cleanup that nulls out invalid values from the backfill. Delete when done.
- Data validation — `sanitize_line_count` added to `treatment_context_extraction.rb` and `post_process.rb` to reject negative sentinel values (-1, -999) the LLM uses instead of null. `sanitize_prior_therapies` rejects negative `exposure_percentage` values.
Backfill results (2026-03-12)
- 62,008 publications processed via `gpt-5-mini` (synchronous)
- 40,278 subgroups have non-zero prior line counts
- 61,895 subgroups have at least one prior therapy entry
Post-backfill cleanup: LLM used sentinel values (-1, -999, -2147483648) instead of null for ~9K subgroups. Additionally ~25K subgroups had median outside [min, max] range. All cleaned via cleanup_prior_therapy_values.thor.
Spot-check verification (2026-03-12)
| Publication | Expected | Extracted | Status |
|---|---|---|---|
| 152908 (BOLD-100, gastric) | min=0, max=7, median=4, 95% platinum | min=0, max=7, median=4, platinum 95.2% | Correct |
| 162733 (sEphB4-HSA, mCRPC) | min=1, max=3, median=3, 71% taxane, 100% AR-targeted | min=1, max=3, median=3, taxane 71.4%, endocrine_therapy required | Correct |
| 53818 (PROfound, olaparib) | Stratified by prior taxane yes/no | "Prior taxane Yes" subgroups: taxane required 100%; "Prior taxane No": taxane excluded | Correct |
| 147778 (GSK2636731, mCRPC) | median=4, 83% taxane | median=4, taxane 83% | Correct |
Known limitations
- Subgroup-defining therapies: The LLM sometimes classifies subgroup-defining therapy characteristics (e.g. "Prior taxane Yes") as `reported` instead of `required`/`excluded`. The full backfill showed inconsistency vs spot-check runs — likely due to `temperature: 1` (a gpt-5-mini constraint). A prompt improvement could help but is not blocking.
- Endocrine therapy ambiguity in mCRPC: Background ADT (universally required) and novel AR agents (often excluded) both map to `endocrine_therapy`, creating apparent contradictions (both `required` and `excluded` on the same subgroup). Could be addressed by splitting into separate therapy classes in a future iteration.
- `max_prior_lines` zero-sentinel contamination: See Issue 8. The LLM outputs `0` instead of null for unstated max prior lines, producing 124K unusable values. This is a separate issue from the prior therapy extraction itself (which works correctly).
Validation (2026-03-13)
Coverage confirmed:
- 150,689 subgroups have `min_prior_lines` (95%)
- 149,952 have `max_prior_lines` (94%) — but see Issue 8 for a data quality concern
- 124,264 have `median_prior_lines` (78%)
- 61,895 have at least one `prior_therapies` entry (39%)
Prior therapy class enum distribution is healthy. All 22 enum values are used. Top classes: chemotherapy (19.9K), surgery (6.7K), platinum (6K), checkpoint_inhibitor (5.5K), endocrine_therapy (5.5K). `other` has 25.4K entries (26%) — high but acceptable given the long tail of therapy types not covered by the named classes.
Tracker examples all re-verified correct:
- Pub 152908 (BOLD-100): min=0, max=7, median=4, platinum 95.2%, fluoropyrimidine 85.7%
- Pub 162733 (sEphB4-HSA): min=1, max=3, median=3, taxane 71.4%, endocrine_therapy required
- Pub 53818 (PROfound): "Prior taxane Yes" subgroups have taxane reported, "Prior taxane No" subgroups have taxane reported — exposure_status is `reported` rather than `required`/`excluded` (see known limitation above)
- Pub 147778 (GSK2636731): median=4, taxane 83%
Report-readiness: Prior therapy class data and min_prior_lines are usable for reports. max_prior_lines is not usable without the cleanup described in Issue 8.
6. Data cutoff date is not extracted from publication abstracts
Short summary
The disease clinical evidence worksheet has a Data Cut column that records the date when trial data collection was frozen for analysis (e.g. Jun 26, 2024, Mar 20, 2025).
Data cutoff date is not currently extracted or persisted as structured data. The pipeline already reads this language during endpoint and treatment line extraction but discards it. Data cutoff dates appear in ~6,100 publication abstracts with an extractable date in ~3,800 of those.
Where this sits in the current pipeline
Current publication flow:
- `classify_publications` extracts endpoints and adverse events from the abstract. The system prompt references data cutoff incidentally (e.g. for maturity determination) but does not extract the date.
- The `not_reached` boolean on outcome measures captures the consequence of an immature data cutoff but not the cutoff date itself.
- `is_partial_result` / `is_partial` flags on publications signal interim results but not the specific cutoff date.
Relevant code paths:
- `app/tasks/publications_llm_classification/task.rb` — main extraction prompt
- `app/tasks/publications_llm_classification/details.rb` — `Details` schema (has `not_reached` but no cutoff date)
- `db/views/vw_publication_efficacy_data_v07.sql` — no data cutoff column
- `app/queries/tpp/emerging_clinical_data_query.rb` — no data cutoff in output
Exact restriction causing the gap
1. No extraction schema field for data cutoff date
The Details schema in details.rb captures endpoints, arms, adverse events, study design, and partial result flags — but has no data_cutoff_date field. The LLM reads the cutoff date in the abstract for reasoning (e.g. to determine endpoint maturity via not_reached) but has no output slot to persist it.
2. The efficacy view and export have no data cutoff column
vw_publication_efficacy_data exposes effective_line, treatment_settings, dose, but has no data_cutoff_date. The CSV export (emerging_clinical_data_report.rb) has 37 columns but none for data cutoff.
Concrete examples
Example 1: publication 241657 (belzutifan + lenvatinib in RCC)
Abstract states:
“for the first (IA1; data cutoff Jun 26, 2024) and second (IA2; data cutoff Apr 9, 2025) interim analysis”
This publication reports two separate data cutoffs for two interim analyses. The worksheet needs at minimum the most recent cutoff date (Apr 9, 2025). Neither date is captured.
Example 2: publication 116878 (BURAN — buparlisib in HNSCC)
Abstract states:
“data cut-off date of 15 March 2025, with a median follow up of 27 months”
Cutoff date (2025-03-15) is clearly stated. Not captured. The worksheet Data Cut column for this publication would be 15 Mar 2025.
Example 3: publication 240450 (BREAKWATER — encorafenib in mCRC)
Abstract states:
“At data cutoff (Mar 1, 2025), EC+FOLFIRI demonstrated a clinically meaningful and statistically significant improvement…”
Cutoff date (2025-03-01) is stated parenthetically. Not captured.
Example 4: publication 191190 (pembrolizumab + nab-paclitaxel in HNSCC)
Abstract states:
“data cutoff (February 27, 2025; median follow-up 23 months)”
Cutoff date (2025-02-27) is clearly stated. Not captured.
Why this matters downstream
Data cutoff date is clinically essential for interpreting results. When the same trial publishes multiple analyses, each with a different cutoff date, the cutoff distinguishes which analysis the reported endpoints belong to. Without it:
- The worksheet `Data Cut` column cannot be populated
- Analysts cannot distinguish interim from final analysis results for the same trial
- Publications reporting updated OS at longer follow-up cannot be correctly ordered or attributed
What the issue is not
This is not a not_reached problem. The not_reached flag captures whether a median was estimable at all. Data cutoff date describes when the analysis was performed, not whether an endpoint was reached.
This is also not a publication dating problem. publication_date is when the paper was presented or published. Data cutoff date is when the trial database was locked for that analysis — typically months before publication. The two dates serve different purposes and should not be conflated.
Across all publications with abstracts:
| Signal | Publications | % of all pubs with abstracts (194K) |
|---|---|---|
| Mentions data cutoff language | 6,148 | 3.2% |
| Data cutoff with extractable date (month/year or full date) | 3,849 | 2.0% |
| Single cutoff mention | 5,314 | 86% of cutoff pubs |
| Multiple cutoff mentions (2+) | 834 | 14% of cutoff pubs |
For the target worksheet diseases specifically:
| Disease | Total pubs | With data cutoff |
|---|---|---|
| Colorectal Cancer | 2,878 | 208 (7%) |
| HNSCC | 974 | 83 (9%) |
| NSCLC | 4,079 | 623 (15%) |
Key observations:
- Data cutoff dates are most common in NSCLC publications (15%), likely reflecting the higher proportion of large randomized trials in lung cancer
- ~14% of publications with cutoff language mention multiple cutoffs (e.g. different interim analyses), confirming that subgroup-level persistence is needed
- ~63% of publications with cutoff language include an extractable date with at least month + year precision
Data cutoff date formats in abstracts
From spot-checking 30 recent abstracts with data cutoff language:
| Format | Example | Frequency |
|---|---|---|
| Month DD, YYYY | data cutoff date of October 27, 2025 | Common |
| Mon DD, YYYY | data cutoff Jun 26, 2024 | Common |
| DD Mon YYYY | data cut-off (18 Sept 2025) | Common |
| Mon YYYY (no day) | data cut-off (July 2025) | Moderate |
| MM/DD/YYYY | data cut-off (06/13/2025) | Rare |
| Month YYYY only | data cutoff (Oct 2025) | Moderate |
Some abstracts state only month + year without a specific day. The LLM should extract whatever precision is available.
Some publications report multiple cutoff dates for interim analyses (e.g. publication 241657: IA1 cutoff Jun 26, 2024 and IA2 cutoff Apr 9, 2025). For the worksheet, the most recent cutoff associated with the reported results should be used.
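The precision handling above can be sketched in plain Ruby (hypothetical helper — in the proposed design the LLM emits YYYY-MM-DD directly; this only illustrates how the observed formats map to the convention):

```ruby
require "date"

# Formats observed in abstracts, most specific first. Month-only formats
# fall through to day 1 (Date.strptime defaults a missing day to 1),
# matching the YYYY-MM-01 convention above.
CUTOFF_FORMATS = [
  "%B %d, %Y", # October 27, 2025
  "%b %d, %Y", # Jun 26, 2024
  "%d %b %Y",  # 18 Sep 2025
  "%m/%d/%Y",  # 06/13/2025
  "%B %Y",     # July 2025
  "%b %Y"      # Oct 2025
].freeze

def normalize_cutoff(raw)
  CUTOFF_FORMATS.each do |fmt|
    begin
      return Date.strptime(raw, fmt).iso8601
    rescue ArgumentError
      next # try the next format
    end
  end
  nil # no recognizable date
end
```

A regex-first approach would need one pattern per format; delegating the recognition to the LLM (as proposed below for the backfill) sidesteps the format zoo entirely.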
Existing partial signals
The system already captures related but insufficient signals:
- `not_reached` (boolean): Whether a time-to-event median was estimable. Captures endpoint maturity but not the temporal context.
- `is_partial_result` (boolean): Whether the publication reports interim results. Related to data cutoff (interim = earlier cutoff) but does not carry the date.
- `publication_date`: When the paper was published. Distinct from data cutoff — typically the cutoff is 3–12 months before publication.
- LLM evidence text: Data cutoff dates appear embedded in `llm_data` observation/evidence free text (~2,400 publications) but are not structured or queryable.
Open characterization questions
- When a publication reports multiple interim analyses with different cutoffs (e.g. IA1, IA2) and different subgroups share the same cutoff, should the cutoff be denormalized onto each subgroup or stored once with an analysis label?
- Should the LLM extract an analysis label (e.g. “IA1”, “IA2”, “primary analysis”) alongside the cutoff date?
Explored solution direction
1. Subgroup-level persistence, not publication-level
Data cutoff date belongs on trial_subgroups, not on publications. Evidence:
- ~14% of publications with cutoff language mention multiple cutoffs (e.g. pub 241657: PFS cutoff Jun 2024, OS cutoff Apr 2025 for different interim analyses)
- Different subgroups or endpoint sets within the same publication can reference different analysis cutoffs
- The common case (~86%) is a single cutoff — all subgroups inherit it, so subgroup-level handles both cases
This is consistent with how treatment_lines already persists on trial_subgroups.
2. Bake into classify_publications for going-forward extraction
The classify_publications task (PublicationsLlmClassification::Task) already reads the full abstract and encounters data cutoff language naturally. Add data_cutoff_date to the SubgroupOutcome schema in details.rb:
```ruby
class SubgroupOutcome
  # ... existing fields ...

  desc 'Data cutoff date for results reported under this subgroup, in ISO 8601 format (YYYY-MM-DD). ' \
       'Use YYYY-MM-01 when only month and year are stated. Null if not mentioned in the abstract.'
  attribute :data_cutoff_date, :string, nullable: true
end
```

System prompt addition in task.rb (extend section 3, Endpoints and Outcome Measures):

```
** Data Cutoff Date:
- If the abstract states a data cutoff date for the results in this subgroup (e.g. "data cutoff Jun 26, 2024", "data cut-off date was Mar 20, 2025"), extract it as data_cutoff_date in YYYY-MM-DD format.
- Use YYYY-MM-01 when only month and year are given.
- If the publication reports a single cutoff for all results, apply it to every subgroup_outcome_measures entry.
- If different analyses have different cutoffs, assign each cutoff to the subgroup(s) whose results it covers.
- Leave null if not explicitly stated — do not infer from publication date.
```

Why this is better than a separate task:
- Zero marginal cost — one nullable string field per subgroup entry adds negligible tokens
- The LLM already has the full abstract in context and already reasons about data maturity (`not_reached`, `is_partial_result`)
- No new task class, no new Thor command, no new workflow step for going-forward publications
- Schema stays co-located with the subgroup outcome data it describes
3. Post-processing: propagate to trial_subgroups
post_process_publications already creates/updates trial_subgroups from llm_data['subgroup_outcome_measures']. Add data_cutoff_date to the attributes written during post-processing. This requires a migration to add data_cutoff_date (date, nullable) to the trial_subgroups table.
4. Backfill task for all existing result publications
A separate backfill task extracts data_cutoff_date from all existing result publications that already have llm_data['subgroup_outcome_measures'] (~63K publications). No regex pre-filtering — the LLM decides whether a cutoff date is present, not a pattern match. Regex would silently miss publications that state cutoff dates in unexpected phrasing.
The backfill task:
- Reads the publication abstract and its existing `trial_subgroups` records
- Extracts `data_cutoff_date` per subgroup
- Writes directly to `trial_subgroups.data_cutoff_date` and `trial_subgroups.llm_data`, same pattern as `TreatmentContextExtraction` (which finds each `trial_subgroup` by ID and updates in place)
- Does NOT re-run `post_process_publications` — that would destroy and recreate all trial_subgroups, wiping treatment lines and disease adjudication data
- Runs as a one-time Thor task, similar in pattern to `adjudicate_subgroup_diseases` (Issue 1)
Estimated cost: ~$30–50 with gpt-5-mini for ~63K publications (single nullable date field per subgroup, minimal output tokens).
After backfill, the going-forward path (classify_publications) handles all new publications automatically.
Solution applied
Status: Implemented — backfill complete (validated 2026-03-13)
All code changes are in place. Backfill has been run.
Going-forward extraction (classify_publications)
- Added `data_cutoff_date` (string, nullable) to `SubgroupOutcome` in `details.rb`
- Added data cutoff extraction instructions to the system prompt in `task.rb`
- Updated `post_process.rb` to propagate `data_cutoff_date` from `llm_data['subgroup_outcome_measures']` to `trial_subgroups.data_cutoff_date`
New publications processed through classify_publications → post_process_publications will automatically have data cutoff dates extracted and persisted.
Schema and view
- Added `data_cutoff_date` (date, nullable) column to `trial_subgroups`
- Added `data_cutoff_date` to `vw_publication_efficacy_data` (v10) sourced from `trial_subgroups.data_cutoff_date`
- Added `Data Cut` column to `EmergingClinicalDataQuery` output and CSV export
Backfill task
One-off backfill task at lib/tasks/one_off/backfill_data_cutoff_dates.thor extracts data cutoff dates from all existing result publications with trial_subgroups. No regex pre-filter — all ~62K publications are sent to gpt-5-mini (estimated cost ~$6-10). The LLM returns null for publications without cutoff language.
Run with:
```shell
bundle exec thor one_off:backfill_data_cutoff_dates:extract --batched --parallelism=4
```

Spot-check validation

Tested on 6 publications with known cutoff dates (5 extracted correctly, 1 correctly returned null for an abstract that says "at data cut-off" without stating the date):
| Pub ID | Abstract says | Extracted | Correct? |
|---|---|---|---|
| 116878 | "data cut-off date of 15 March 2025" | 2025-03-15 | Yes |
| 163930 | "data cutoff" Feb 4, 2021 | 2021-02-04 | Yes |
| 190005 | "at data cut-off" (no date) | null | Yes |
| 190016 | cutoff Sept 16, 2024 | 2024-09-16 | Yes |
| 190620 | "data cutoff, 01 Aug 2025" | 2025-08-01 (all 14 subgroups) | Yes |
| 190677 | "data cutoff (07 Oct 24)" | 2024-10-07 | Yes |
Validation (2026-03-13)
Coverage: 30,369 subgroups across 11,203 distinct publications have data_cutoff_date populated (19.1% of all publication subgroups). This exceeds the pre-implementation estimate of ~6K abstracts with cutoff language, confirming the backfill has been run.
Tracker spot-check pubs re-verified:
| Pub | Expected | Actual | Correct? |
|---|---|---|---|
| 116878 (BURAN) | 2025-03-15 | 2025-03-15 | Yes |
| 190016 (SERENA-1) | 2024-09-16 | 2024-09-16 | Yes |
| 190620 (POD1UM-303) | 2025-08-01 (all 17 subgroups) | 17/17 populated | Yes |
| 190677 (CAPItello-281) | 2024-10-07 | 2024-10-07 | Yes |
| 190005 (TROPION-Breast01) | null (no date in text) | null | Yes |
Tracker examples 241657 and 240450 have zero subgroups — they are newly ingested ASCO 2025 publications (created 2026-03-10) that haven’t been through classify_publications yet. Once the publication workflow runs, cutoff dates will be extracted automatically by the going-forward path.
Minor data quality issues:
- 9 subgroups have cutoff dates before 2000 — verified as legitimate (e.g. pub 144506 is a 1988 pilot study in Qidong County).
- 2 subgroups (pub 109543) have cutoff date 2028-12-01 — a hallucinated future date. Should be cleaned.
7. AE grade category enum is too coarse — grade 1-2 rows misclassified as all_grade
Short summary
The grade_category field on adverse_events only supports two values: all_grade and grade_gte3. Many publication abstracts report AEs in finer grade buckets (grade 1-2, grade 3-4, grade 5/fatal, SAE). When forced into the binary, grade 1-2 rows get shoehorned into all_grade, which is incorrect — true all-grade incidence includes all grades, while grade 1-2 is a strict subset.
This produces ~50 AE pairs where the grade_gte3 value is higher than the all_grade value for the same AE name, which is counter-intuitive but affects <0.3% of publications with AE data.
Scale and severity: low
- 36,545 publications have AE rows with `grade_category`
- 312 publications (0.9%) have the same AE name under both grade categories
- 50 AE pairs across those 312 pubs show inverted values (grade_gte3 > all_grade)
- 92 of the 312 are in target disease areas
Current misclassification breakdown from observation text analysis:
| Observation pattern | Classified as all_grade | Classified as grade_gte3 | Issue |
|---|---|---|---|
| Explicitly “all grade” / “any grade” | 5,143 | 401 | 401 wrong |
| Grade 1-2 specific | 7,883 | 221 | 7,883 should be grade_1_2 |
| Grade 3-4 specific | 1,679 | 15,204 | 1,679 wrong |
| Grade 5 / fatal | 176 | 3,803 | Should be own category |
| SAE context | 1,287 | 1,804 | Should be own category |
| No observation | 20,152 | 16,413 | Ambiguous |
| Other | 35,053 | 13,173 | Mixed |
The grade 1-2 → all_grade misclassification (7,883 rows) is the largest single issue. The grade ≥3 column is mostly correct, so the clinically important safety signal is preserved. The all-grade column underreports in affected cases.
Explored solution direction
Expand grade_category to a richer enum and re-run the backfill:
```ruby
# Current: all_grade, grade_gte3
# Proposed:
attribute :grade_category, :string
# enum: all_grade, grade_1_2, grade_gte3, grade_3_4, grade_5_fatal, sae
```

| Value | Meaning | Ranked summary use |
|---|---|---|
| all_grade | True all-grade / any-grade / unspecified | "Most Frequent AE All Grade" |
| grade_1_2 | Grade 1-2 only (low-grade bucket) | Excluded from ranked summaries |
| grade_gte3 | Grade ≥3 / grade 3+ | "Most Frequent AE >=Gr3" |
| grade_3_4 | Grade 3-4 specifically | Treated same as grade_gte3 for ranking |
| grade_5_fatal | Grade 5 / fatal / treatment-related death | Separate or excluded |
| sae | Serious adverse event (any grade) | Excluded from ranked summaries |
The ranked summary helper would then:
- "Most Frequent AE All Grade" → filter to `all_grade` only (not `grade_1_2` or `sae`)
- "Most Frequent AE >=Gr3" → filter to `grade_gte3` + `grade_3_4`
This eliminates the inversion problem because grade 1-2 and SAE rows no longer contaminate the all-grade bucket.
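The two filters can be sketched as follows (hypothetical helper and sample rows — the real logic lives in the ranked summary helper):

```ruby
# Buckets from the proposed enum. Per the rules above, grade_1_2 and sae
# never feed the all-grade ranking, and grade_3_4 counts toward >=Gr3.
ALL_GRADE_BUCKET = %w[all_grade].freeze
GTE3_BUCKET      = %w[grade_gte3 grade_3_4].freeze

def most_frequent_aes(rows, bucket, top_n: 3)
  rows.select  { |r| bucket.include?(r[:grade_category]) }
      .sort_by { |r| -r[:percentage] }
      .first(top_n)
      .map     { |r| r[:name] }
end

# Illustrative AE rows (not real publication data):
rows = [
  { name: "nausea",      grade_category: "all_grade",  percentage: 62.0 },
  { name: "nausea",      grade_category: "grade_1_2",  percentage: 55.0 },
  { name: "fatigue",     grade_category: "all_grade",  percentage: 48.0 },
  { name: "neutropenia", grade_category: "grade_gte3", percentage: 41.0 },
  { name: "anemia",      grade_category: "grade_3_4",  percentage: 12.0 }
]
```

Here the grade_1_2 nausea row no longer competes with the true all-grade rate, which is exactly the inversion fix.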
Cost: Re-running the full AE grade backfill at ~$10 with gpt-5-mini. The schema change to the extraction prompt and the AdverseEventGradeBackfill task already exist — just need to expand the enum, update the prompt, and re-run.
Downstream changes: Update AdverseEvent model normalization, ranked summary helper, and export query to handle the expanded enum.
Solution applied
Implemented 2026-03-13. Commit ef8bcfa8.
Enum expansion: adverse_events.grade_category expanded from 2 values to 6: all_grade, grade_1_2, grade_gte3, grade_3_4, grade_5_fatal, sae. All model normalization, extraction schema, backfill task, and export queries updated.
Backfill completed. Current distribution across 148,084 classified rows:
| grade_category | Count | % |
|---|---|---|
| all_grade | 60,450 | 40.8% |
| grade_gte3 | 37,747 | 25.5% |
| grade_3_4 | 21,685 | 14.6% |
| grade_1_2 | 12,907 | 8.7% |
| sae | 6,865 | 4.6% |
| grade_5_fatal | 6,430 | 4.3% |
| NULL | 2,305 | 1.6% |
Ranked summary updated: “Most Frequent AE All Grade” filters to all_grade only (excluding grade_1_2 and sae). “Most Frequent AE >=Gr3” filters to grade_gte3 + grade_3_4 + grade_5_fatal.
Residual: 2,305 rows (1.6%) across 1,257 publications still have NULL grade_category. Inverted AE pairs reduced from ~50 to 33 — remaining inversions likely reflect genuine data complexity (e.g. subgroup-level AE rates where a smaller subgroup has higher grade ≥3 than the overall all-grade rate).
8. max_prior_lines zero-sentinel contamination
Short summary
The TreatmentContextExtraction LLM task outputs 0 instead of null for max_prior_lines when the abstract does not state a maximum number of prior therapies. This produces 124,446 subgroups (78% of all publication subgroups) with max_prior_lines = 0, of which 12,924 are logically impossible (min_prior_lines > max_prior_lines).
Where this sits in the current pipeline
TreatmentContextExtraction (app/tasks/publications_llm_classification/treatment_context_extraction.rb):
- Schema declares `attribute :max_prior_lines, :integer, nullable: true` with desc `"Null if not stated."`
- System prompt (line 150): `"Leave null if not stated. Do not infer counts that are not explicitly stated."`
- `sanitize_line_count` rejects negative values (`value.negative?`) but passes `0` through
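A standalone reconstruction of the sanitizer behavior described above (the real method lives in `treatment_context_extraction.rb`; this self-contained version is for illustration only):

```ruby
# Reconstruction of sanitize_line_count as described above: negative
# sentinels are rejected, but the 0 sentinel passes straight through.
def sanitize_line_count(value)
  return nil if value.nil? || value.negative?
  value
end

sanitize_line_count(-999) # => nil (negative sentinel caught)
sanitize_line_count(0)    # => 0 (zero sentinel NOT caught: the gap behind this issue)
```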
Root cause
Two contributing factors:
1. Structured outputs integer default: When the LLM generates structured JSON with an `integer` field and the value is conceptually “not applicable,” many models default to `0` rather than `null`, even when the schema allows nullable and the prompt says “null if not stated.” This is a known behavior pattern with OpenAI structured outputs.
2. Sanitizer gap: `sanitize_line_count` (line 411) was designed to catch the `-1`/`-999` sentinel pattern discovered during the initial backfill, but did not anticipate `0` as a sentinel because `0` is a valid value for treatment-naïve (1L) populations.
| max_prior_lines | Count | % |
|---|---|---|
| 0 | 124,446 | 78.4% |
| 1-3 | 13,527 | 8.5% |
| 4-10 | 6,851 | 4.3% |
| >10 | 5,128 | 3.2% |
| NULL | 8,757 | 5.5% |
Logically impossible rows (min > max): 12,924
Breakdown by treatment line for max_prior_lines = 0:
| Treatment line | min=0 & max=0 | min>0 & max=0 (contradictory) |
|---|---|---|
| 2L+ | 17,225 | 6,282 |
| 3L+ | 5,614 | 3,411 |
| 1L only | 25,827 | 129 |
| Other (Adj/Neo/Ind/etc.) | 51,922 | 771 |
For 1L publications, min=0, max=0 is valid (treatment-naïve = zero prior lines). For 2L+ and 3L+ publications, max=0 is always wrong — by definition these populations have ≥1 prior line.
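The plausibility rule in the paragraph above can be sketched as a small predicate (the helper name and regex are assumptions for illustration, not code from the repo):

```ruby
# Hypothetical predicate: is max_prior_lines = 0 plausible for a given
# treatment line string? 2L+/3L+ etc. imply at least one prior line,
# so max = 0 is contradictory there; 1L and non-line-specific settings
# may legitimately have zero prior lines.
PRETREATED_LINE = /\b[2-9]L\+?\b/

def max_zero_plausible?(treatment_line)
  !treatment_line.to_s.match?(PRETREATED_LINE)
end

max_zero_plausible?("1L")  # => true (treatment-naive: zero prior lines is valid)
max_zero_plausible?("3L+") # => false
```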
Concrete examples
| Pub | Subgroup | Treatment line | min | max | Abstract says |
|---|---|---|---|---|---|
| 69513 | Asian pts | 3L+ | 2 | 0 | “at least 2 prior lines” (no max stated) |
| 45604 | Overall | 2L+ | 1 | 0 | “previously treated” (no max stated) |
| 101698 | HRAS-mutated UC → Evaluable | 2L+ | 1 | 0 | “at least one prior therapy” (no max stated) |
| 121922 | Overall | 3L+ | 2 | 0 | “≥2 prior systemic therapies” (no max stated) |
In all cases the abstract provides a minimum threshold but no maximum. The LLM correctly extracted min_prior_lines but output 0 instead of null for max_prior_lines.
Downstream impact
- `max_prior_lines` is not usable for reports in its current state — 78% of values are sentinel zeros
- The `Max Prior Lines` column in the worksheet export will show `0` for the vast majority of rows, which is misleading
- The efficacy view (`vw_publication_efficacy_data`) exposes `max_prior_lines` directly from `trial_subgroups`, so the bad values propagate to all downstream consumers
- `min_prior_lines` is less affected — `0` is valid for 1L populations, and the contradictory cases (min > 0 with max = 0) are identifiable
Recommended fix — two parts
Part 1: Cleanup existing data
The cleanup is not straightforward because 0 is valid for 1L populations. Possible approaches:
1. Conservative (rule-based): Set `max_prior_lines = NULL` where `min_prior_lines > max_prior_lines` (12,924 rows — clearly wrong). This fixes the worst cases but leaves ~111K ambiguous `max=0` rows untouched.
2. Moderate (rule-based with treatment line context): Additionally set `max_prior_lines = NULL` where `max_prior_lines = 0` AND `treatment_lines` contains `2L+` or `3L+` (these populations by definition have ≥1 prior line, so max=0 is impossible). This would cover ~23K additional rows.
3. Aggressive (re-extract via LLM): Re-run `TreatmentContextExtraction` on all affected publications. Most accurate, but costs another ~$30 and risks other field drift. Could be scoped to only publications where `max_prior_lines = 0` AND `treatment_lines` is not `1L`.
Recommendation: Start with approach 2 (rule-based cleanup of clearly wrong values), then evaluate whether the remaining 1L + max=0 population needs LLM re-extraction or whether 0 is acceptable there.
Part 2: Prevent recurrence
Three changes needed:
1. Update `sanitize_line_count` to also reject `0` for `max_prior_lines` when `min_prior_lines > 0`:

```ruby
def sanitize_line_count(value)
  return nil if value.nil? || value.negative?
  value
end
```

This alone is insufficient because the sanitizer doesn’t have cross-field context. Better to add a post-persist validation.

2. Update the system prompt to be more explicit about the 0-vs-null distinction:

```
- IMPORTANT: Use null (not 0) when no maximum is stated. 0 means "zero prior lines"
  (treatment-naïve only). If the abstract says "at least 2 prior lines" with no upper
  bound, set min=2 and max=null, NOT max=0.
```

3. Add a cross-field sanitizer in `persist_results` that nulls `max_prior_lines` when `min > max`:

```ruby
subgroup.max_prior_lines = nil if subgroup.min_prior_lines.present? &&
  subgroup.max_prior_lines.present? && subgroup.min_prior_lines > subgroup.max_prior_lines
```
Solution applied
Implemented 2026-03-13. Two-part fix:
Part 1: Prevent recurrence
- Updated `TreatmentContextExtraction` system prompt with explicit zero-vs-null disambiguation and a concrete example
- Added cross-field validation (`min > max → max = nil`) in all three persist paths: `TreatmentContextExtraction#persist_results`, `PostProcess` outcome measure building, and `backfill_prior_therapy_context.thor`
- Added a `MAX_PLAUSIBLE_PRIOR_LINES = 25` threshold to all three `sanitize_line_count` methods — values above 25 are nulled on persist (verified via spot-checking that real abstracts top out at ~20 prior lines in heavily pretreated myeloma/phase 1 basket trials)

Part 2: Historical data cleanup
- Extended `cleanup_prior_therapy_values.thor` with three new cleanup rules:
  - Nulled `max_prior_lines` where `min_prior_lines > max_prior_lines` (12,924 rows)
  - Nulled `max_prior_lines = 0` where `treatment_lines` contains 2L+/3L+/2L/3L/4L/4L+/5L/5L+ (23,193 additional rows)
  - Nulled sentinel junk (values > 25) across all three fields: `min_prior_lines` (95 rows), `max_prior_lines` (3,013 rows), `median_prior_lines` (1,281 rows). Common sentinels included INT_MAX (2,147,483,647), 999, 999999, 65535, 32767, 123456789, etc.
- Total cleaned: 40,506 rows across all rules
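The three cleanup rules can be sketched as a single pure-Ruby pass over a row hash (illustrative only: the real cleanup is a thor task issuing SQL updates; the field names mirror `trial_subgroups`, everything else is an assumption):

```ruby
# Pure-Ruby sketch of the three cleanup rules applied to one row hash.
MAX_PLAUSIBLE_PRIOR_LINES = 25
PRETREATED = /\b[2-9]L\+?\b/

def clean_prior_lines(row)
  r = row.dup
  # Rule 3: null sentinel junk (values above the plausibility cap)
  %i[min_prior_lines max_prior_lines median_prior_lines].each do |f|
    r[f] = nil if r[f] && r[f] > MAX_PLAUSIBLE_PRIOR_LINES
  end
  # Rule 1: null max where min > max (logically impossible)
  r[:max_prior_lines] = nil if r[:min_prior_lines] && r[:max_prior_lines] &&
                               r[:min_prior_lines] > r[:max_prior_lines]
  # Rule 2: max = 0 is impossible for pretreated (2L+/3L+/...) populations
  r[:max_prior_lines] = nil if r[:max_prior_lines] == 0 &&
                               r[:treatment_lines].to_s.match?(PRETREATED)
  r
end
```

A 3L+ row with `min=2, max=0` comes back with `max_prior_lines: nil`, while a 1L row with `min=0, max=0` is preserved, matching the behavior described above.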
Post-cleanup validation:
- Zero contradictory rows (`min > max`) remain
- Zero impossible zeros for 2L+/3L+ populations remain
- Spot-checked 4 example publications (69513, 45604, 101698, 121922) — all now have `max_prior_lines = NULL`
- 1L populations with `min=0, max=0` preserved correctly
- All three fields now cap at plausible values (max observed: min=14, max=20, median=18)
- ~40 rows in the 21-25 range were nulled that may have been valid; unrecoverable without LLM re-extraction, but the impact is negligible
Validation (2026-03-16)
Post-cleanup state confirmed:
- 0 contradictory rows (`min > max`) remain
- 90,476 subgroups still have `max_prior_lines = 0` — all are in 1L (5,564) or non-line-specific settings (Adjuvant, Neoadjuvant, Induction, etc.: 84,912). No 2L+/3L+ zeros remain.
- The remaining zeros in non-line-specific settings (e.g. Adjuvant with `max=0`) are likely still sentinel zeros, but these populations have no treatment line context anyway, so the downstream impact is negligible.
- `max_prior_lines` is now usable for reports where treatment line context exists. For 1L populations, `max=0` is valid. For populations without treatment lines, `max_prior_lines` should be treated as unreliable.
9. All-grade AE extraction gap — originally ~13K publications, revised to ~14 after investigation
Short summary
Originally suspected that classify_publications fails to extract all-grade named AEs for ~13,000 publications. After deep investigation (2026-03-16), the issue is much narrower than initially estimated.
Investigation (2026-03-16)
The 12,986 publications with only grade≥3 AE rows break down as:
| Category | Publications | Genuine extraction failure? |
|---|---|---|
| Abstract genuinely only reports grade ≥3 AEs | ~11,200 | No — abstract has no all-grade named AE data |
| Abstract mentions “any grade” in aggregate context only (e.g. “discontinuation due to any grade TRAE”) | ~400 | No — “any grade” appears as an aggregate stat, not per-AE |
| Abstract has grade_1_2 AEs separately (not combined all-grade) | 1,744 | No — abstract reports low/high grade separately, not combined |
| Abstract has clear two-column AE table (Any grade + Grade ≥3) but LLM misclassified | ~14 | Yes — any-grade values extracted but labeled as grade_gte3 |
Root cause for the ~14 genuine failures
The LLM extracts numeric values from embedded AE tables but reads the first column (any-grade) and labels it as grade≥3, completely ignoring the second column (actual grade≥3 values). This was caused by the old binary grade_category enum (all_grade/grade_gte3) which didn’t give the LLM enough guidance to distinguish columns.
Confirmed example: pub 60886 (Debio 0123 + carboplatin, phase 1)
Abstract table:

| TEAE | Any grade n (%) | Grade ≥3 n (%) |
|---|---|---|
| Thrombocytopenia | 12 (31.6) | 3 (7.9) |
| Nausea | 12 (31.6) | 0 |
| Anemia | 8 (21.1) | 1 (2.6) |
| Fatigue | 7 (18.4) | 0 |

Before (old extraction): 7 rows, all grade_gte3 — the values 31.6%, 31.6%, 21.1% are the ANY-GRADE column mislabeled. The grade≥3 column (7.9%, 0%, 2.6%, 0%) was completely missing.
After re-extraction (current prompt with 6-value enum): 14 rows — 7 all_grade (31.6%, 31.6%, 21.1%, 18.4%, 13.2%, 13.2%, 10.5%) + 7 grade_gte3 (7.9%, 2.6%, 2.6%, 2.6%, 0%, 0%, 0%). All correct.
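The target shape, one any-grade row plus one grade ≥3 row per named AE from a two-column table, can be sketched as follows (field names are assumptions for illustration, not the real schema):

```ruby
# Sketch: one two-column table row expands into one all_grade row plus
# one grade_gte3 row (field names assumed for illustration).
def expand_ae_row(name:, any_grade_pct:, grade_gte3_pct:)
  [
    { ae_name: name, grade_category: "all_grade",  percentage: any_grade_pct },
    { ae_name: name, grade_category: "grade_gte3", percentage: grade_gte3_pct }
  ]
end

rows = expand_ae_row(name: "Thrombocytopenia", any_grade_pct: 31.6, grade_gte3_pct: 7.9)
# Two rows per named AE; the old extraction emitted only one, mislabeled grade_gte3.
```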
Confirmed example: pub 56057 (Debio 0123 + carbo + etoposide, phase 1)
Same pattern — any-grade column extracted as grade≥3. After re-extraction: 4 all_grade + 4 grade_gte3, all values correct.
Why the current prompt already fixes this
The expanded 6-value grade_category enum (Issue 7, implemented 2026-03-13) and the detailed grade classification instructions in the prompt give the LLM enough context to correctly distinguish table columns. Re-running classify_publications on the affected pubs with the current prompt produces correct results — verified on 2/2 test pubs.
What does NOT need fixing
- ~11,200 pubs where the abstract genuinely only reports grade≥3: The all-grade data is not in the abstract. It may be in the full paper, poster, or oral presentation. This is a data availability limitation, not an extraction failure.
- ~400 pubs where “any grade” appears in aggregate context: The abstract says things like “discontinuation due to any grade TRAE occurred in 7.5%” — this is an aggregate stat correctly handled by the safety metrics extraction, not individual named AEs.
- 1,744 pubs with grade_1_2 + grade≥3 but no all_grade: The abstract reports grades separately (grade 1-2 and grade 3-4), not as a combined “any grade” bucket. This is correct — the “Most Frequent AE All Grade” column should only use true all-grade data, not sum of grade buckets.
Remaining 35 inverted AE pairs
35 AE pairs across pubs WITH both all_grade and grade≥3 rows show grade≥3 > all_grade for the same AE name. These are likely the same column-swap bug in pubs that DID get partial all-grade extraction. The Issue 10 re-extraction (2,182 pubs through full classify_publications) will fix any that overlap.
Solution applied
Going forward: The Issue 7 enum expansion (2026-03-13) and current prompt instructions are sufficient — re-running classify_publications on affected pubs produces correct two-column extraction. Verified on pubs 60886 and 56057: before=7 rows all grade_gte3 (any-grade values mislabeled), after=14 rows (7 all_grade + 7 grade_gte3, all values correct).
Why the Issue 7 AE grade backfill didn’t fix existing data: The backfill (AdverseEventGradeBackfill) can only reclassify existing AE rows — it cannot create new rows. For the ~14 affected pubs:
- The original `classify_publications` extracted only the any-grade column values and labeled them `grade_gte3` (wrong)
- The grade≥3 column values (Nausea 0%, Thrombocytopenia 7.9%, etc.) were never extracted as rows at all
- The backfill skipped these rows because `grade_category` was already non-null (set incorrectly by the original extraction)
- Even with `--overwrite`, the backfill would at best reclassify the 7 rows from `grade_gte3` → `all_grade`, but the 7 missing grade≥3 rows still wouldn’t exist
Fix requires re-running classify_publications on the affected pubs — only full re-extraction creates both sets of rows. The ~14 pubs will be fixed by either:
- The Issue 10 re-extraction (2,182 pubs) if they overlap, or
- The next full publications workflow run
No additional prompt changes or backfill tasks needed.
10. classify_publications drops subgroups identified by extract_subgroups
Short summary
The classify_publications LLM task receives a list of subgroups with endpoint associations from the upstream extract_subgroups step, but sometimes drops subgroups entirely — producing subgroup_outcome_measures entries for only a subset of the provided subgroups. The subgroup extraction step correctly identifies the subgroup, the schema enum correctly includes it, and the endpoint association is correctly passed — but the main classification LLM simply doesn’t create an output entry for it.
This was discovered during worksheet validation against the client sheet 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8.
Where this sits in the current pipeline
Publication classification runs in two LLM steps:
1. `extract_subgroups` (`subgroup_extraction.rb`) reads the abstract and identifies subgroup labels with their endpoint associations. Output is `llm_data['subgroup_endpoints']`.
2. `classify_publications` (`task.rb`) receives `subgroup_endpoints`, derives `distinct_subgroups`, and passes them to the main LLM as:
   - A `subgroup_endpoints` field in the user prompt (subgroup → endpoint mapping)
   - An `enum` constraint on `subgroup_outcome_measures[].value` in the structured output schema (`details.rb` line 185)
   - A system prompt instruction: “Look at the provided ‘subgroup_endpoints’, keep the associations between the endpoints and subgroups as they are.” (line 31)
The schema constraint (details.rb line 185) enforces that subgroup_outcome_measures[].value MUST be one of the distinct_subgroups values — the LLM cannot hallucinate new subgroups. But the schema does not enforce that every enum value must appear at least once. The LLM is free to produce output with only a subset of the provided subgroups, and it does.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”The structured output schema makes subgroup entries optional, not required.
The subgroup_outcome_measures field is an array of objects. Each object has a value field constrained to the enum. But the array itself has no minimum length and no constraint requiring each enum value to appear. The LLM is structurally allowed to produce output with 1 subgroup entry out of N provided.
The system prompt says to “keep the associations as they are” but this is a soft instruction. With structured outputs, the LLM’s tendency to minimize output length can override soft prompt instructions, especially when one subgroup has much more data in the abstract than another.
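Why a partial output is still schema-valid can be shown with a simplified sketch of the schema fragment (this is an illustration of the constraint shape, not the actual `details.rb` schema):

```ruby
# Simplified sketch: the enum restricts WHICH subgroup values may appear,
# but nothing requires that EVERY enum value appears, so dropping
# "1L HNSCC" still validates.
distinct_subgroups = ["NSCLC (PD-L1 TPS >=1)", "1L HNSCC"]

schema = {
  type: "array",
  items: {
    type: "object",
    properties: {
      value: { type: "string", enum: distinct_subgroups },
      outcome_measures: { type: "array" }
    },
    required: %w[value outcome_measures]
  }
  # no minItems and no per-enum-value requirement: 1 entry out of 2 passes
}

partial_output = [{ "value" => distinct_subgroups.first, "outcome_measures" => [] }]
partial_output.size # => 1, yet valid against the schema
```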
Concrete examples
Example 1: pub 47147 (sigvotatug vedotin + pembrolizumab, ASCO 2025) — confirmed LLM drop
Abstract text (verbatim):
“In 7 efficacy-evaluable pts with TPS≥1 NSCLC, 1 confirmed (c) complete response (CR), 1 c partial response (PR), and 2 PRs pending confirmation were observed (ORR 57%; cORR 29%). In 8 efficacy-evaluable pts with 1L HNSCC, 2 cCR and 1 cPR were observed (cORR 37.5%).”
Both disease cohorts are in the same sentence block with explicit efficacy values.
extract_subgroups output (llm_data['subgroup_endpoints']):
```json
[
  { "endpoint": "Objective response Rate", "subgroups": ["NSCLC (PD-L1 TPS ≥1)"] },
  { "endpoint": "Confirmed Objective Response Rate", "subgroups": ["NSCLC (PD-L1 TPS ≥1)", "1L HNSCC"] }
]
```

Step 1 correctly identified both subgroups. 1L HNSCC is associated with Confirmed Objective Response Rate.
distinct_subgroups passed to schema enum: ["NSCLC (PD-L1 TPS ≥1)", "1L HNSCC"]
Both subgroups were available as valid enum values in the structured output schema.
classify_publications output (llm_data['subgroup_outcome_measures']):
```json
[
  {
    "type": "disease",
    "value": "NSCLC (PD-L1 TPS ≥1)",
    "outcome_measures": [
      { "endpoint": "ORR", "measure_value": 57, "number_of_participants": 7 },
      { "endpoint": "cORR", "measure_value": 29, "number_of_participants": 7 }
    ]
  }
]
```

Only the NSCLC subgroup was created. The 1L HNSCC subgroup with cORR=37.5% was completely dropped despite being:
- explicitly mentioned in the abstract with a numeric value
- correctly identified by `extract_subgroups`
- present in the schema enum
- associated with `Confirmed Objective Response Rate` in the input
Worksheet impact: The sheet row for HNSCC says ORR=37.5% from this trial (NCT04389632). Our database has no HNSCC efficacy row for this publication.
Example 2: pub 71934 (cofetuzumab pelidotin, ESMO 2023) — data not in abstract table
Abstract embedded table has two columns:
| Parameter | NSQ EGFR WT, PTK7 ≥90%/≥2+ N=21 | Overall N=56 |
|---|---|---|
| ORR | 30.0% | 19.6% |
| CBR | 90.0% | 78.6% |
| mDOR | 5.8 mo | 7.2 mo |
| mPFS | 5.5 mo | 5.3 mo |
The LLM correctly extracted both columns as subgroups: PTK7-expressing rNSCLC (Overall) and NSQ EGFR WT → PTK7 ≥90%.
The abstract narrative mentions three histology cohorts: “27 NSQ EGFR WT, 13 NSQ EGFR mutant, and 16 squamous (SQ)” and states “Enrollment of SQ and NSQ EGFR mutant pts was halted to prioritize NSQ EGFR WT accrual due to response rates in each subgroup.”
However, the per-histology ORR values (including sqNSCLC ORR=12.5% from the worksheet) are not present in the abstract’s table or narrative text. The abstract only shows the overall and NSQ EGFR WT results. The squamous-specific data was likely in the poster or supplementary material, not the abstract.
This is NOT an LLM extraction failure — the data isn’t in the text we have. The worksheet’s sqNSCLC ORR=12.5% comes from a source outside our abstract corpus.
Root cause analysis
The two examples show different failure modes:
1. Pub 47147 (HNSCC cORR=37.5%): Pure LLM output quality failure. The data is in the abstract, the subgroup was correctly identified upstream, the schema allowed it — but the LLM still dropped it. This is the actionable issue.
2. Pub 71934 (sqNSCLC ORR=12.5%): Not an extraction failure. The data isn’t in the abstract. The worksheet references data from a source we don’t have.

For the actionable case (pub 47147 pattern), the root cause is:
- The structured output schema does not require completeness — the LLM can produce fewer `subgroup_outcome_measures` entries than there are enum values
- The system prompt instruction (“keep the associations as they are”) is not strong enough to override the LLM’s tendency to minimize output when one subgroup has much less data than another
- The HNSCC subgroup had only one endpoint value (cORR=37.5%) while NSCLC had two (ORR=57%, cORR=29%), making it a “smaller” subgroup that the LLM is more likely to drop
The scale of the problem was measured by comparing `llm_data['subgroup_endpoints']` (distinct subgroups identified by `extract_subgroups`) against `llm_data['subgroup_outcome_measures']` (entries with non-empty `outcome_measures` produced by `classify_publications`):
| Status | Publications | % |
|---|---|---|
| All subgroups used | 55,683 | 84.8% |
| Partial drop (some subgroups lost) | 9,245 | 14.1% |
| Total drop (all subgroups lost — zero outcome measures) | 473 | 0.7% |
| More used than identified (LLM created extra) | 293 | 0.4% |
| Total with dropped subgroups | 9,718 | 14.8% |
Note: initial measurement (2,760) undercounted due to a category filter that excluded PubMed, EHA, and other non-ASCO sources. The corrected count uses result = true across all sources.
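The same comparison can be sketched per publication in plain Ruby over the `llm_data` keys described above (simplified; the production measurement runs in SQL):

```ruby
# Per-publication drop classification over llm_data (keys as described
# above; simplified relative to the SQL measurement).
def drop_status(llm_data)
  identified = llm_data.fetch("subgroup_endpoints", [])
                       .flat_map { |e| e["subgroups"] }.uniq
  used = llm_data.fetch("subgroup_outcome_measures", [])
                 .select { |s| Array(s["outcome_measures"]).any? }
                 .map { |s| s["value"] }.uniq
  if used.empty? then :total_drop
  elsif used.size > identified.size then :extra_created
  elsif used.size < identified.size then :partial_drop
  else :all_used
  end
end

# Pub 47147 shape: two subgroups identified, only one used
pub = {
  "subgroup_endpoints" => [
    { "endpoint" => "cORR", "subgroups" => ["NSCLC (PD-L1 TPS >=1)", "1L HNSCC"] }
  ],
  "subgroup_outcome_measures" => [
    { "value" => "NSCLC (PD-L1 TPS >=1)", "outcome_measures" => [{ "endpoint" => "ORR" }] }
  ]
}
drop_status(pub) # => :partial_drop
```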
Explored solution direction
1. Strengthen the prompt instruction (going-forward prevention)
Add explicit language to the system prompt in task.rb:

```
- IMPORTANT: You MUST create a subgroup_outcome_measures entry for EVERY subgroup in the
  provided list that has associated endpoints. Do not skip subgroups even if they have fewer
  results than others. If a subgroup has only one endpoint value, still create the entry.
  Every subgroup provided to you was identified because the abstract contains results for it.
```

This won’t guarantee compliance (the current prompt already says “keep the associations as they are” and the LLM ignores it), but it raises the bar.
2. Schema-level enforcement
Add `minItems: distinct_subgroups.length` to the `subgroup_outcome_measures` array in `to_json_schema`. OpenAI structured outputs may or may not honor this — needs testing. If it works, it forces the LLM to produce at least N entries, preventing the drop.

```ruby
schema[:properties]['subgroup_outcome_measures'][:minItems] = distinct_subgroups.length
```

3. Post-extraction validation + selective re-extraction (fix existing data)
The detection query is cheap (no LLM needed):
```sql
-- Compare identified vs used subgroup counts
SELECT p.id,
  (SELECT count(DISTINCT sg)
     FROM jsonb_array_elements(p.llm_data -> 'subgroup_endpoints') e,
          jsonb_array_elements_text(e -> 'subgroups') sg) AS identified,
  (SELECT count(*)
     FROM jsonb_array_elements(p.llm_data -> 'subgroup_outcome_measures') s
    WHERE jsonb_array_length(s -> 'outcome_measures') > 0) AS used
FROM publications p
WHERE ...
HAVING identified > used
```

For the 2,760 affected publications, re-run classify_publications with the strengthened prompt. Estimated cost: ~$20 with o4-mini for 2,760 pubs.
This could also be wired as a permanent validation step in post_process_publications that flags mismatches for automatic re-extraction (with a retry limit to prevent infinite loops on genuinely ambiguous abstracts).
4. For the 701 total-drop publications (zero outcome_measures)
These need separate investigation — likely a mix of:
- Trial-in-progress abstracts (correct behavior, no results to extract)
- Genuine extraction failures where the LLM returned empty outcomes
- Abstracts too short or ambiguous for the LLM to extract anything
A quick filter: check if partial_result_tags contains ‘Trial Design/Enrollment’ — if yes, the empty outcome is expected.
5. Re-run chain for the 2,760 affected publications
Because post_process_publications destroys and recreates trial_subgroups (line 138 of post_process.rb), re-running classify_publications requires re-running downstream steps that write to subgroup rows. The full chain:
1. `classify_publications --publication_ids <ids> --batched` — re-extracts `subgroup_outcome_measures` with the fixed prompt. Reads `llm_data['subgroup_endpoints']` (already correct from `extract_subgroups`). ~$20 with o4-mini.
2. `post_process_publications --publication_ids <ids> --overwrite` — destroys all `trial_subgroups`, `trial_outcome_measures`, `adverse_events`, `trial_disease_details` for these pubs and recreates from `llm_data`. Re-persists treatment lines and prior therapy context for subgroups that match by `subgroup_type + subgroup_value` against `llm_data['treatment_lines']['subgroups']`. New subgroups (the ones previously dropped) will get null treatment context because `treatment_context_extraction` never ran on them.
3. `extract_treatment_lines --publication_ids <ids>` — re-runs `TreatmentContextExtraction` on the new subgroups. Reads existing `trial_subgroups` by ID and writes treatment lines, min/max/median prior lines, and prior therapies. ~$20 with gpt-5-mini. Note: the `extract_treatment_lines` scope (line 294) filters to `llm_data->'treatment_lines' IS NULL` — but since `post_process` writes `llm_data['treatment_lines']` on the publication (not null), we need to either pass `--publication_ids` to bypass the scope or temporarily null out the field. Alternatively, since `post_process` matched existing subgroups correctly, only the new subgroups lack treatment context. A targeted approach: after step 2, query for the newly created `trial_subgroups` that have null `treatment_lines` and run treatment context extraction on just those publications.
4. Disease workflow steps — re-run for these pubs:
   - `adjudicate_subgroup_diseases` — re-adjudicate new non-disease subgroups
   - `populate_disease_terms_for_trial_subgroups` + `post_process_disease_matches` — re-populate `trial_subgroups.disease_id`
Steps that do NOT need re-running: extract_subgroups (input is already correct), extract_interventions, link_publication_drugs, tag_investigational_interventions, extract_dose_evidence, therapeutic_area_filter — all write to llm_data on the publication or to publication_interventions, not to trial_subgroups.
Full downstream chain: Since post_process_publications destroys and recreates trial_subgroups, trial_endpoints, trial_outcome_measures, adverse_events, and trial_disease_details, all downstream steps need to re-run: extract_treatment_lines, standardize_adverse_events, classify_adverse_events, llm_classify_publication_endpoints_domains, llm_match_publication_endpoints, plus the publication_disease_workflow for disease_id. The simplest approach is to re-run the full publications_workflow from classify_publications onward, then the publication_disease_workflow.
Estimated cost: ~$40 for classify_publications re-extraction with o4-mini + ~$20 for extract_treatment_lines with gpt-5-mini + minor costs for other LLM steps.
Solution applied
Implemented 2026-03-16. Three-part fix:
1. Prompt hardening (task.rb): Added explicit instruction to the classify_publications system prompt:

```
IMPORTANT: You MUST create a subgroup_outcome_measures entry for EVERY subgroup in the
provided list that has associated endpoints. Do not skip subgroups even if they have fewer
results than others. If a subgroup has only one endpoint value, still create the entry.
Every subgroup provided to you was identified because the abstract contains results for it.
```

2. Schema enforcement (details.rb): Added `minItems: distinct_subgroups.length` to the `subgroup_outcome_measures` array in the structured output JSON schema. This prevents the LLM from producing fewer entries than there are identified subgroups.
3. Post-extraction validation logging (task.rb): After each publication is persisted, compares the set of subgroups from extract_subgroups against the set produced by classify_publications. Logs a warning if any subgroups were dropped.
Local test results: 6/6 publications with previously dropped subgroups now have all subgroups populated after re-extraction:
- Pub 47147 (sigvotatug vedotin): `1L HNSCC` subgroup with cORR=37.5% now extracted (previously dropped)
- Pubs 51804, 53951, 56337, 60242, 144841: all dropped subgroups recovered
One-off re-extraction task: lib/tasks/one_off/reextract_dropped_subgroups.thor identifies the ~9,700 affected publications and creates OneOffJob records for the re-extraction. After classify_publications completes, re-run the full publications_workflow from post_process_publications onward, then publication_disease_workflow.
Production re-extraction completed 2026-03-21. Full pipeline re-run (extract_subgroups → classify_publications → post_process) executed across all affected publications. Issue is now closed.
11. Recently ingested publications have empty endpoint extractions — Closed: not an issue
Short summary
Initially suspected that recently ingested publications (ASCO 2025, ESMO 2025) had llm_data['subgroup_outcome_measures'] with subgroup entries but empty outcome_measures: [] arrays, suggesting extraction failures.
Investigation (2026-03-16)
Systematic analysis of all 102 publications with subgroup_outcome_measures containing only empty outcome_measures arrays:
| Category | Count | Genuine extraction failure? |
|---|---|---|
| Trial Design/Enrollment (no results in abstract) | 61 | No — correct behavior |
| Safety/AE-focused publications (no efficacy endpoints) | ~15 | No — correct behavior |
| Biomarker/correlative science (no clinical endpoints) | ~8 | No — correct behavior |
| Truncated abstracts (data in figure/table not captured in text) | ~2 | Data availability limit, not bug |
| Mistagged pubs (tagged “Interim Result” but actually TDE) | ~7 | No — tagging wrong, extraction correct |
| Genuinely missed efficacy data | 0 | — |
Regex scan for standard efficacy keywords (ORR, mPFS, mOS, HR with numeric values) across all 41 non-TDE pubs found ~7 with keyword matches, but manual inspection confirmed all were false positives:
- Pub 53912: pCR prediction AUROC values, not clinical endpoints
- Pub 63720 (BNT327/PM8002): abstract text truncated — Results section jumps from enrollment stats to Conclusions, efficacy data was in an embedded figure not captured in text
- Remainder: biomarker studies where HR/ORR appears in passing context, not as reported results
The worksheet rows that couldn’t be matched (MICVO ORR=46%, sigvotatug HNSCC cORR=37.5%, cofetuzumab sqNSCLC ORR=12.5%) were caused by:
- MICVO ORR=46%: data from a Nov 2025 corporate presentation not in our publication corpus
- Sigvotatug HNSCC cORR=37.5%: Issue 10 — data was in the abstract but the subgroup was dropped by `classify_publications` (now fixed)
- Cofetuzumab sqNSCLC ORR=12.5%: data not in the abstract at all — squamous-specific results were in the poster/supplementary material
Resolution
Closed — not an issue. The empty outcome_measures are correct in all 102 cases. The original concern was caused by confounding with Issue 10 (subgroup drops) and data availability limitations (corporate presentations, poster-only data).
12. Legacy Emerging Clinical Data query collapses subgroup-level results into Overall-preferred rows
Short summary
Legacy Tpp::EmergingClinicalDataQuery groups all view rows by [publication_id, disease_id, effective_line, study_plan_arm_id] and then picks the “Overall” subgroup when extracting efficacy metrics. This means dose-level cohorts, biomarker-stratified subgroups, and other clinically meaningful splits are hidden behind the Overall population row — even when the data is correctly extracted and present in vw_publication_efficacy_data.
Status note: No further work planned for now. Subgroup-preserving behavior is available via Tpp::ClinicalEvidenceQuery, which is the current client-facing path for this use case. The remaining collapse behavior exists only on the legacy EmergingClinicalDataQuery path.
Where this sits in the current pipeline
app/queries/tpp/emerging_clinical_data_query.rb:
- `build_result_rows` (line 913): groups by `[pub_id, disease_id, effective_line, study_plan_arm_id]`
- `extract_efficacy_metrics` (line 1057): `overall_rows = matching_rows.select { |r| r['subgroup_value'] == 'Overall' }` — prefers Overall when present
- All subgroups with the same `disease_id` (including via the `trial_disease_details` fallback) collapse into a single output row
Concrete examples
Example 1: Ficlatuzumab HPV-negative subgroup (pub 43175)
Publication: “Randomized Phase II Trial of Ficlatuzumab With or Without Cetuximab in Pan-Refractory HNSCC” (NCT03422536)
Correctly extracted subgroups:
- Overall → ORR=19%, PFS=3.7, N=32
- Overall → HPV-negative → ORR=38%, PFS=4.1, N=16
- Overall → HPV-negative → cMet overexpression → PFS data
- Overall → HPV-positive → ORR=0%, PFS=2.3, N=16
All four subgroups are in vw_publication_efficacy_data with subgroup_disease_id = NULL. The fallback to trial_disease_details.disease_id = 6200 (HNSCC) gives them all the same disease_id. So they all land in the same group key [43175, 6200, 3L, nil].
extract_efficacy_metrics then picks subgroup_value = 'Overall' (ORR=19%), discarding the HPV-negative result (ORR=38%) that the worksheet expects.
Worksheet says: Ficlatuzumab HPV-neg N=16 ORR=38%
Query returns: Ficlatuzumab Overall N=32 ORR=19%
Example 2: PF-08046054 dose-level cohorts (pub 65346, ESMO 2024)
The ESMO 2024 abstract for this solid-tumor basket trial extracted a single subgroup: PDL1-expressing solid tumors with N=55 ORR=27.3%. The sheet expects HNSCC-specific dose-level splits (N=19 at 1.5mg/kg ORR=10.5%, N=7 at 1.75mg/kg ORR=42.9%).
This is a compound issue:
- The abstract itself is a cross-tumor overview — HNSCC-specific dose-level data was in the poster/slides, not the abstract text (data availability)
- Even if separate subgroups existed, the query would collapse them into one row
Example 3: IBI363 TPS<1 squamous subgroup (pub 139344 / 237445, ASCO 2025)
The sqNSCLC worksheet keeps two rows for the same IBI363 abstract:
- SqNSCLC 3 mg/kg Q3W: ORR = 43.3%, mPFS = 7.3, N = 30
- SqNSCLC with TPS <1: ORR = 45.5%, N = 22
Both rows are correctly present in vw_publication_efficacy_data:
- Advanced NSCLC → Squamous cell carcinoma → 3 mg/kg Q3W → ORR = 43.3, PFS = 7.3, N = 30
- Advanced NSCLC → TPS <1 → Squamous cell carcinoma → ORR = 45.5, N = 22
But EmergingClinicalDataQuery groups both into the same key:
- abstract copy: [139344, 4174, 1, nil]
- presentation copy: [237445, 4174, 1, nil]
There is no subgroup_value = 'Overall', so extract_efficacy_metrics falls back to max_by(number_of_participants) and picks the 30-patient row. The TPS <1 row is hidden even though it is already structured and disease-linked.
Worksheet says: IBI363 TPS <1 SqNSCLC ORR = 45.5%, N = 22
Query returns: IBI363 SqNSCLC ORR = 43.3%, N = 30
Root cause
The query was designed for one-row-per-publication summary display, not for subgroup-level comparisons. The “prefer Overall” logic (line 1057) is intentional — it prevents small subgroup analyses from overriding the main population result in summary tables. But for worksheet reconstruction, the subgroup-level detail IS the desired output.
Difficult to quantify precisely, but any publication with biomarker-stratified results (HPV+/-, PD-L1 CPS levels, mutation status) or dose-level cohorts will lose the subgroup-level detail. This affects basket trials and biomarker-enriched studies disproportionately.
From the HNSCC sheet comparison:
- Ficlatuzumab HPV-neg (N=16 ORR=38%) — data present, hidden by Overall preference
- PF-08046054 dose-levels (N=19, N=7) — data not in abstract, but would be hidden even if extracted
- Becotatug vedotin 2.3mg/kg (N=32 ORR=43%) — data IS extracted (pub 71438 subgroup 2.3 mg/kg → 2/3-line prior platinum & PD-1/L1 inhibitor failure has 4 outcome measures) but invisible due to Issue 15 (disease mapping)
- IBI363 TPS <1 SqNSCLC (N=22 ORR=45.5%) — data present, hidden behind the larger SqNSCLC 3 mg/kg row (N=30 ORR=43.3%)
What the issue is not
This is not an extraction failure. The LLM correctly identifies and extracts subgroup-level data. The data exists in trial_subgroups, trial_outcome_measures, and vw_publication_efficacy_data. The loss happens at query time in the Ruby layer.
Explored solution direction
No additional implementation is planned at this time. The legacy EmergingClinicalDataQuery behavior remains documented below for reference, but this issue is currently superseded by ClinicalEvidenceQuery, which already preserves subgroup-level rows and surfaces cORR.
Two possible approaches:
1. Subgroup-aware grouping: Change build_result_rows to group by [pub_id, disease_id, effective_line, study_plan_arm_id, subgroup_value] instead of collapsing subgroups. This would produce multiple rows per publication — one for Overall, one for HPV-neg, one for each dose level. Downstream consumers (the TPP React component) would need to handle multiple rows per publication.
2. Subgroup expansion mode: Add an optional parameter (e.g. expand_subgroups: true) that preserves subgroup-level rows when set. Default behavior stays unchanged for summary display, but worksheet reconstruction can request the expanded view.
Option 2 would be the lower-risk approach if the legacy Emerging Clinical Data report needs to be revived without adopting ClinicalEvidenceQuery.
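Option 2 can be sketched as follows (the method body is reduced to the grouping behavior; the real build_result_rows does much more, and the flag name is illustrative):

```ruby
# Sketch of an opt-in flag that preserves subgroup rows while keeping the
# legacy collapse as the default.
def build_result_rows(rows, expand_subgroups: false)
  key = lambda do |r|
    base = [r['publication_id'], r['disease_id'], r['effective_line']]
    expand_subgroups ? base + [r['subgroup_value']] : base
  end
  rows.group_by(&key).values.map do |group|
    group.find { |r| r['subgroup_value'] == 'Overall' } ||
      group.max_by { |r| r['number_of_participants'] }
  end
end

rows = [
  { 'publication_id' => 1, 'disease_id' => 6200, 'effective_line' => '3L',
    'subgroup_value' => 'Overall', 'number_of_participants' => 32 },
  { 'publication_id' => 1, 'disease_id' => 6200, 'effective_line' => '3L',
    'subgroup_value' => 'HPV-negative', 'number_of_participants' => 16 },
]

build_result_rows(rows).size                          # => 1 (legacy collapse)
build_result_rows(rows, expand_subgroups: true).size  # => 2 (subgroups preserved)
```

Default callers see no change; worksheet reconstruction opts into the expanded shape.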
3. Confirmed ORR (cORR) not surfaced as a separate column
The worksheet has separate columns for ORR and Confirmed ORR (cORR). Our query only exports ORR. The data IS in the database — but it’s not distinguishable at query time.
Current state:
- The endpoints catalog has no cORR entry — only ORR (ids 10, 64)
- The EndpointMatcher maps all confirmed ORR extractions to the catalog ORR endpoint
- When the LLM extracts “cORR” or “Confirmed Objective Response Rate”, it becomes a regular ORR row with “confirmed” noted in the trial_outcome_measures.observation text
- 2,377 ORR rows have “confirmed” in their observation text
- Only 7 rows in the entire DB have an explicit cORR / Confirmed ORR abbreviation on trial_endpoints
- When an abstract reports both ORR and cORR (e.g. pub 47147: ORR=57%, cORR=29%), both are extracted as separate ORR rows — but the query picks one
Explored approach — adding cORR as a separate catalog endpoint: Not recommended. Confirmed ORR is not a different clinical endpoint — it’s the same ORR with confirmation scans. Splitting the catalog would create ambiguity in the matching step (should “ORR 35%” map to ORR or cORR?) and wouldn’t help for the 2,377 rows that already have “confirmed” buried in observation text.
Recommended approach — structured confirmed boolean on outcome measures:
Add a confirmed boolean field to the outcome measure schema in classify_publications. The LLM already knows whether a response is confirmed (it writes “confirmed” in the observation) — we should ask it to put that in a proper field rather than relying on substring/regex matching at query time.
The field would sit on:
- The outcome measure in
llm_data['subgroup_outcome_measures'][].outcome_measures[]— set by the LLM duringclassify_publications trial_outcome_measures— persisted bypost_process_publications
Then EmergingClinicalDataQuery can pull ORR rows where confirmed = true for the cORR column and confirmed = false/null for the regular ORR column.
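A minimal sketch of that query-time split, assuming the confirmed field exists (row shape illustrative, values from pub 47147 as cited above):

```ruby
# With a structured `confirmed` flag, one pass over the ORR rows can populate
# separate ORR and cORR columns instead of picking a single row.
orr_rows = [
  { 'endpoint' => 'ORR', 'confirmed' => nil,  'value' => 57.0 },
  { 'endpoint' => 'ORR', 'confirmed' => true, 'value' => 29.0 },
]

corr = orr_rows.find { |r| r['confirmed'] == true }&.fetch('value')
orr  = orr_rows.find { |r| r['confirmed'] != true }&.fetch('value')

# orr => 57.0, corr => 29.0 — mirrors pub 47147 (ORR=57%, cORR=29%)
```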
Implementation steps:
- Add confirmed boolean to the outcome measure JSON schema in details.rb
- Add prompt instruction to task.rb: “Set confirmed: true when the response has been confirmed by follow-up assessment (e.g. cORR, confirmed CR/PR). Set confirmed: false or omit when unconfirmed or not stated.”
- Add confirmed column to trial_outcome_measures (migration)
- Persist the field in post_process.rb
- Expose in vw_publication_efficacy_data
- Use in EmergingClinicalDataQuery to populate a separate cORR column
- Backfill: re-run classify_publications on affected pubs, or run a lightweight AdverseEventGradeBackfill-style task that re-classifies existing ORR rows using the observation text
Solution applied (2026-03-18):
- Migration: Added confirmed boolean column to trial_outcome_measures (nullable, no default)
- Schema: Added confirmed attribute to the Outcome StoreModel class in details.rb with a description guiding the LLM
- Prompt: Added “Confirmed Response” instruction to the task.rb system prompt — confirmed: true for cORR/confirmed CR/PR, false for unconfirmed, null when not stated
- Persistence: Added confirmed: om['confirmed'] to the post_process.rb trial_outcome_measures.create! call
- View: Created vw_publication_efficacy_data_v11.sql exposing the tom.confirmed column
- Backfill: Created lib/tasks/one_off/backfill_confirmed_orr.thor — rule-based detection from observation text and endpoint name (no LLM cost). Results:
  - 3,061 rows updated (2,722 confirmed=true, 339 confirmed=false)
  - 62,207 rows left as null (no signal in text)
  - 2,076 publications had llm_data synced
- View refreshed: 4,332 confirmed rows, 517 unconfirmed rows visible
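The rule-based detection can be sketched like this (the patterns here are illustrative, not the thor task's exact regexes):

```ruby
# Illustrative rule-based classification of confirmed-response rows from
# endpoint naming plus observation text. Returns true/false/nil, matching the
# nullable column: nil means "no signal, leave as null".
def detect_confirmed(endpoint_abbreviation, observation)
  text = [endpoint_abbreviation, observation].compact.join(' ')
  return false if text.match?(/\bunconfirmed\b/i) # check first: "unconfirmed" contains "confirmed"
  return true  if text.match?(/\bcORR\b|confirmed/i)
  nil
end

detect_confirmed('cORR', nil)                       # => true
detect_confirmed('ORR', 'unconfirmed ORR was 45.5%') # => false
detect_confirmed('ORR', 'ORR 33% in 6 pts')          # => nil
```

Ordering the unconfirmed check first matters, since a substring match on "confirmed" would otherwise misclassify "unconfirmed" rows.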
Verified on pub 47147 (sigvotatug vedotin):
- HNSCC cORR=37.5% → confirmed=true ✓
- NSCLC ORR=57% → confirmed=null ✓
- NSCLC cORR=29% → confirmed=true ✓
Going forward: New publications processed via classify_publications will have the confirmed field set by the LLM during extraction. The legacy EmergingClinicalDataQuery can now filter by confirmed = true for a cORR column, but no further query-layer work is planned here because subgroup-preserving behavior is already available in ClinicalEvidenceQuery.
Subgroup-level dose fields (2026-03-18)
Problem: Dose evidence was stored at publication_interventions level (one record per publication+drug), not per subgroup. When a publication reports multiple dose cohorts (e.g. Becotatug 2.0 mg/kg vs 2.3 mg/kg), efficacy is split into separate subgroups but they all share the same publication-wide dose_min/dose_max. ~17K publications with dose evidence have subgroups that could carry dose context.
Solution applied:
- Migration: Added 6 dose columns to trial_subgroups: dose_value, dose_min, dose_max, rp2d, dose_units, dose_frequency (all nullable strings)
- Schema: Added dose attributes to the SubgroupOutcome class in details.rb — numeric values only, units separate in dose_units
- Prompt: Added “Subgroup Dose Context” instruction to the task.rb system prompt — extract dose into subgroup fields for dose cohorts, leave null for non-dose subgroups
- Persistence: Added dose field mapping in the post_process.rb trial_subgroups.create! call
- View: Created vw_publication_efficacy_data_v12.sql — COALESCEs subgroup-level dose over publication-level dose: COALESCE(ts.dose_min, pdl.pub_dose_min) AS dose_min, etc. Also surfaces a single_dose column via COALESCE(ts.dose_value, pdl.pub_single_dose)
- Backfill: Created lib/tasks/one_off/backfill_subgroup_dose.thor — sends all subgroups for publications with dose_evidence to gpt-5-mini; the LLM determines which are dose-specific
Scope: 17,170 publications, 50,403 subgroups. Estimated cost ~$15 with gpt-5-mini batched.
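The view's COALESCE precedence can be mirrored in plain Ruby (column names follow the view; the helper itself is illustrative):

```ruby
# Mirror of the v12 view's dose precedence: subgroup-level dose, when present,
# overrides the publication-wide dose (COALESCE semantics over nullable strings).
def effective_dose(subgroup, publication)
  {
    dose_min:    subgroup[:dose_min]   || publication[:pub_dose_min],
    dose_max:    subgroup[:dose_max]   || publication[:pub_dose_max],
    single_dose: subgroup[:dose_value] || publication[:pub_single_dose],
  }
end

dose_cohort = { dose_value: '2.3', dose_min: nil, dose_max: nil }
pub_level   = { pub_single_dose: nil, pub_dose_min: '2.0', pub_dose_max: '2.3' }

effective_dose(dose_cohort, pub_level)
# => { dose_min: '2.0', dose_max: '2.3', single_dose: '2.3' }
```

A dose cohort like Becotatug 2.3 mg/kg gets its own single_dose while still inheriting the publication-wide range.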
Key design decisions:
- Dose value fields are numeric-only (e.g. “2.3”) with units in a separate dose_units field (e.g. “mg/kg”). The initial run had 45/47 values with units leaked into the numeric fields; fixed by making the schema descriptions explicit (“WITHOUT units”)
- Backfill scope is all publications with dose_evidence on publication_interventions, not regex-filtered by subgroup name. The earlier regex approach (mg|mg/kg|...) missed Gy, IU, U/kg, cell therapy doses (×10^N), DLT/MTD keywords, and schedule-only cohorts (QD/BID)
- The LLM correctly nulls non-dose subgroups (disease cohorts, biomarker subgroups, “Overall”) even when they’re sent in the same prompt
Prod deployment:
- Run migrations (add columns + update view to v12)
- Run backfill: thor one_off:backfill_subgroup_dose:backfill --batched
- Refresh materialized view
Going forward: New publications processed via classify_publications → post_process will automatically populate subgroup dose fields. No additional work is planned on the legacy Emerging Clinical Data path; ClinicalEvidenceQuery is the subgroup-preserving query for current use.
13. Technology filter excludes combination partner drugs
Short summary
EmergingClinicalDataQuery filters vw_publication_efficacy_data rows by technology_id, which removes view rows for combination partner drugs that have a different technology than the investigational drug. This means extract_combination_partners_from_rows (which works from the filtered view rows) cannot see the combo partner, so the combination_partners field is blank even when the partner is correctly recorded in publication_interventions.
Where this sits in the current pipeline
app/queries/tpp/emerging_clinical_data_query.rb:
- build_base_query (line 501): AND v.technology_id = ANY(ARRAY[:technology_ids]::integer[]) filters ALL view rows by technology
- extract_combination_partners_from_rows (line 1495): scans the filtered rows for investigational_component = false — but those rows were already removed by the technology filter
- The older fetch_combination_partners method (line 1560) queries publication_interventions directly and would work, but it’s not used by build_single_row — extract_combination_partners_from_rows is used instead
Concrete examples
Example 1: Amivantamab + Paclitaxel (pub 114606, ESMO 2025)
publication_interventions correctly records:
- Amivantamab: drug_id=10180, intervention_role='investigational', technology = Bispecific Antibody (235)
- Paclitaxel: drug_id=10109, intervention_role='supportive', technology = (chemotherapy/small molecule)
When the query runs with technology_id = 235 (Bispecific Antibody):
- View rows for Amivantamab pass the filter (technology_id = 235) ✓
- View rows for Paclitaxel are filtered OUT (different technology) ✗
- extract_combination_partners_from_rows sees only Amivantamab rows → combination_partners = nil
Worksheet says: Combination Partner = “Paclitaxel”
Query returns: combination_partners = nil
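A minimal reproduction of the failure mode (row shapes simplified to the two fields that matter):

```ruby
# Filtering view rows by technology_id BEFORE combo-partner extraction removes
# the partner drug, so partner extraction has nothing left to find.
view_rows = [
  { drug: 'Amivantamab', technology_id: 235, investigational_component: true  },
  { drug: 'Paclitaxel',  technology_id: nil, investigational_component: false },
]

filtered = view_rows.select { |r| r[:technology_id] == 235 }
partners = filtered.select { |r| !r[:investigational_component] }.map { |r| r[:drug] }
partners # => [] — Paclitaxel was filtered out before partner extraction ran
```

Running the same partner extraction over the unfiltered rows would correctly return Paclitaxel, which is why the fix queries publication_interventions directly.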
Example 2: Petosemtamab + Pembrolizumab (pub 30362/209252)
publication_interventions correctly records Pembrolizumab as intervention_role='supportive'. When querying with technology_id = 235 (Bispecific Antibody), Pembrolizumab (Monoclonal Antibody, technology 230) is filtered out.
Note: even when running a separate query with technology_id = 230, Pembrolizumab rows would appear but Petosemtamab rows would be filtered out — so the combination context is lost in both directions.
Root cause
The technology filter is applied to view rows before drug role analysis. The filter is correct for identifying the investigational drug’s technology, but it eliminates combo partner rows that necessarily have a different technology. This is a fundamental design tension: the filter scopes results to a technology of interest, while combination therapy inherently crosses technology boundaries.
Affects any publication where the investigational drug and combination partner have different technologies. Common patterns:
- ADC + checkpoint inhibitor (e.g. sigvotatug + pembrolizumab)
- BsAb + chemotherapy (e.g. amivantamab + paclitaxel)
- BsAb + checkpoint inhibitor (e.g. petosemtamab + pembrolizumab)
These are increasingly common in oncology clinical trials.
Additionally, the Amivantamab + Pembrolizumab 1L row from the MHNCS Feb 2026 conference is missing entirely — this publication does not exist in our database. The “Multidisciplinary Head and Neck Cancers Symposium” is not an ingested source. This is a data availability gap, not an extraction or query issue.
Explored solution direction
Option 1: Fall back to publication_interventions for combo partners. Instead of relying on filtered view rows, use the existing fetch_combination_partners method (line 1560) which queries publication_interventions directly. This method already exists and handles both publication-based and trial-based combo partner lookup. Change build_single_row to call fetch_combination_partners instead of extract_combination_partners_from_rows.
Option 2: Remove technology filter from combo partner extraction. Run a secondary unfiltered query for publication_interventions where investigational_component = false for the matched publication_ids.
Option 1 is simplest — the method already exists, just needs to be wired in.
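The wiring can be sketched with stubbed lookups (both lambdas stand in for the real methods on EmergingClinicalDataQuery; pub 114606 is the Amivantamab example above):

```ruby
# Sketch of the Option 1 wiring: prefer the publication_interventions lookup
# (not subject to the view's technology filter), falling back to row scanning
# for non-publication rows.
fetch_combination_partners = lambda do |publication_id|
  # stand-in for the direct publication_interventions query
  { 114606 => ['Paclitaxel'] }.fetch(publication_id, [])
end

extract_from_rows = lambda do |rows|
  rows.reject { |r| r[:investigational_component] }.map { |r| r[:drug] }
end

partners_for = lambda do |publication_id, rows|
  publication_id ? fetch_combination_partners.call(publication_id) : extract_from_rows.call(rows)
end

partners_for.call(114606, []) # => ["Paclitaxel"]
```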
Solution applied
Implemented 2026-03-18. Two changes in app/queries/tpp/emerging_clinical_data_query.rb:
1. Fixed fetch_combination_partners SQL bug (line 1567): Changed pi.publication_id to pi.source_id — the column was renamed during the polymorphize migration but the SQL was never updated, so this method silently failed for all publications.
2. Switched build_single_row to use fetch_combination_partners (line 951): Replaced extract_combination_partners_from_rows(rows) with fetch_combination_partners(publication_id, clinical_trial_id, primary_drug_id, primary_drug_name). This queries publication_interventions directly, bypassing the technology_id filter on the view. Falls back to extract_combination_partners_from_rows for non-publication rows.
Verified:
- Amivantamab + Paclitaxel (pub 114606): now shows combo=Paclitaxel (was blank)
- Petosemtamab + Pembrolizumab (pub 30362): now shows combo=Pembrolizumab (was blank)
- Monotherapy publications: correctly show no combo partner
14. Basket trial disease subgroups not extracted for minority cohorts
Short summary
BNT324/DB-1311 (NCT05914116) is a solid-tumor basket trial. The ESMO abstract (pub 64328) reports results for 77 evaluable patients across multiple tumor types but only names SCLC (ORR=45.5%, n=33), CRPC (3 PRs), NSCLC (3 PRs), and BTC (1 PR) explicitly. HNSCC is never mentioned in the abstract text. The client sheet lists HNSCC N=3 ORR=100% from this trial — this data was in the poster/presentation, not the abstract.
Investigation (2026-03-17)
Publication corpus: 5 publications linked to NCT05914116:
| Pub ID | Source | Disease focus | HNSCC mentioned? |
|---|---|---|---|
| 64328 | ESMO | Broad solid tumors (SCLC emphasis) | No |
| 137185 | ASCO | CRPC | No |
| 190691 | ESMO | Cervical cancer / ovarian | No |
| 236643 | ASCO | CRPC | No |
| 241480 | ASCO | mCRPC + Lu-177 analysis | No |
Abstract text analysis (pub 64328):
The abstract mentions PRs by tumor type: “In pts with SCLC (n=33), unconfirmed ORR was 45.5%… PRs were also observed in 3 pts with CRPC, 3 pts with NSCLC and 1 pt with BTC.” HNSCC is not in this list. The HNSCC N=3 data likely appeared in the ESMO Asia poster/supplementary materials.
Database state:
- trial_subgroups for this trial with disease_id = 6200 (HNSCC): 2 records, both source_type = 'News'/'NewsTrialMention' — NOT from publication extraction
- publication_interventions: BNT324 (drug_id=12964) correctly linked with technology_id = 708 (ADC) ✓
- No publication-sourced trial_subgroups have disease_id = 6200 for this trial
Root cause
Data availability limitation. The LLM extraction is correct — it cannot extract HNSCC data that isn’t in the abstract text. The HNSCC results for this basket trial were only available in the poster/presentation at ESMO Asia 2024, which is not captured in our abstract corpus.
This is a common pattern for basket trials: the main abstract reports overall + top-responding tumor types, while per-tumor breakdowns for minority cohorts appear only in the poster, supplementary slides, or corporate presentations.
What would fix this
- Full poster/presentation ingestion — if ESMO Asia poster PDFs were ingested and processed, the per-tumor-type data would be extractable
- Corporate presentation ingestion — the sheet source “ESMO Asia 2024” may reference a BioNTech R&D day presentation rather than an abstract
- News-sourced subgroup promotion — the HNSCC subgroups exist from News/NewsTrialMention sources; these could potentially be surfaced alongside publication data, but this would require view/query changes to accept non-publication sources
This pattern affects any basket trial where minority cohort data is only in supplementary materials. Likely affects dozens of phase 1 solid-tumor basket trials in the database.
15. Disease extraction favors subtype matches over parent disease, losing the umbrella disease
Short summary
The disease_extraction.rb matching logic tries subtype-level matches first, and if they succeed, skips the parent disease-level match entirely (early return on line 219). For pub 71438 (Becotatug vedotin, ESMO), the LLM correctly extracted name = "squamous cell carcinoma of the head and neck" with subtypes ["oral cavity", "oropharynx", "hypopharynx", "larynx"]. The subtype combos matched to Oropharyngeal Cancer (5040), Hypopharyngeal Cancer (5031), etc. via TermMatch. Because those subtype matches succeeded, the disease-name-level match to HNSCC (6200) was never attempted. The publication ends up with 4 trial_disease_details rows for sub-site cancers but none for HNSCC itself.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/disease_extraction.rb:
- build_match_set (line 207): Takes a disease name and subtype values
- Lines 212-216: For each subtype, builds the combo "squamous cell carcinoma of the head and neck - oral cavity" and looks up a TermMatch with field = 'disease_subtypes'
- Line 219: return matches if matches.any? — if ANY subtype matched, skip the disease-name match entirely
- Lines 221-223: Only reached if no subtype matches — looks up "squamous cell carcinoma of the head and neck" as disease_name, which resolves to HNSCC (6200)
Then in post_process.rb:
- Lines 401-436: Iterates over processed diseases, uses matched_disease.matched_disease_id to find the Disease record
- Creates one trial_disease_details row per entry — since there are 4 subtype-matched entries (not the parent), 4 sub-site disease rows are created
Exact data flow for pub 71438
Step 1 — LLM extraction (extract_diseases):
The LLM correctly extracted ONE disease:
{ "name": {"value": "squamous cell carcinoma of the head and neck"}, "subtypes": [{"value": "oral cavity"}, {"value": "oropharynx"}, {"value": "hypopharynx"}, {"value": "larynx"}]}Step 2 — Disease matching (disease_extraction.rb):
build_match_set receives disease_name = "squamous cell carcinoma of the head and neck", subtype_values = ["oral cavity", "oropharynx", "hypopharynx", "larynx"].
For each subtype, it builds a combo and finds a TermMatch:
| Combo term | TermMatch ID | Matched disease | Confidence |
|---|---|---|---|
squamous cell carcinoma of the head and neck - oral cavity | 51046 | Lip and Oral Cavity Cancer (5047) | 0.925 |
squamous cell carcinoma of the head and neck - oropharynx | 50748 | Oropharyngeal Cancer (5040) | 0.9 |
squamous cell carcinoma of the head and neck - hypopharynx | 50744 | Hypopharyngeal Cancer (5031) | 0.975 |
squamous cell carcinoma of the head and neck - larynx | 50745 | Laryngeal Cancer (5023) | 0.9 |
All 4 subtype matches succeed → line 219 early return → disease-name match to HNSCC (6200) never runs.
The ONE input disease entry is split into 4 output entries, each with a subtype-matched disease and matched_disease.matched_disease_id pointing to the sub-site cancer (not HNSCC).
Step 3 — Post-processing (post_process.rb):
The 4 processed disease entries become 4 trial_disease_details rows:
| TDD ID | disease_id | disease_name | subtypes |
|---|---|---|---|
| 94126 | 5047 | Lip and Oral Cavity Cancer | ["oral cavity"] |
| 94127 | 5040 | Oropharyngeal Cancer | ["oropharynx"] |
| 94128 | 5031 | Hypopharyngeal Cancer | ["hypopharynx"] |
| 94129 | 5023 | Laryngeal Cancer | ["larynx"] |
HNSCC (6200) is nowhere in trial_disease_details for this publication.
Step 4 — Query (EmergingClinicalDataQuery):
The query uses Disease.subtree_for([6200]) which returns only [6200] (HNSCC has no descendants, all_descendants = []). None of the sub-site diseases (5047, 5040, 5031, 5023) are in this set. The publication is invisible.
Comparison with pub 242943 (PubMed, same trial)
Pub 242943 for the same trial (NCT04868162) has trial_disease_details.disease_id = 6200 (HNSCC directly). This works because the PubMed abstract either:
- Did not have subtypes, so the disease-name fallback (line 221) ran and matched HNSCC
- Or had different subtype values that didn’t match any disease_subtypes TermMatch
Why patient_population_diseases shows the correct match
The llm_data['patient_population_diseases'] for pub 71438 shows matched_disease.matched_disease_id = 6200 with confidence 1.0. But this is stale data — it was set before disease_extraction.rb re-processed the entries. The extraction step replaces the matched_disease field on each cloned entry (line 174), overwriting the original HNSCC match with the subtype-level match.
Root cause
The early return on line 219 of disease_extraction.rb treats subtype matches as replacing the parent disease match, rather than supplementing it. When the LLM extracts “squamous cell carcinoma of the head and neck” with anatomical subtypes, the system should create BOTH:
- A parent disease record for HNSCC (6200) — so the publication is discoverable under the umbrella term
- Subtype records for the anatomical sub-sites — for more granular filtering
Instead, it creates ONLY the subtype records and drops the parent entirely.
Disease ontology contributing factor
Even if the subtype records were the only ones created, the publication would still be discoverable IF the sub-site diseases were descendants of HNSCC in the disease hierarchy. But they are all root-level siblings:
| Disease ID | Name | Parent | all_descendants |
|---|---|---|---|
| 6200 | Head and Neck Squamous Cell Carcinoma (HNSCC) | NULL | [] |
| 5040 | Oropharyngeal Cancer | NULL | (separate tree) |
| 5031 | Hypopharyngeal Cancer | NULL | (separate tree) |
| 5023 | Laryngeal Cancer | NULL | (separate tree) |
| 5047 | Lip and Oral Cavity Cancer | NULL | (separate tree) |
So Disease.subtree_for([6200]) returns only [6200], excluding all sub-site diseases.
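A plain-Ruby mirror of the subtree lookup (hierarchy stubbed from the table above) shows why the publication is invisible:

```ruby
# The sub-site diseases are root-level siblings, not descendants of HNSCC,
# so a subtree expansion of 6200 never reaches them.
DESCENDANTS = { 6200 => [], 5040 => [], 5031 => [], 5023 => [], 5047 => [] }.freeze

def subtree_for(ids)
  ids.flat_map { |id| [id, *DESCENDANTS.fetch(id, [])] }.uniq
end

subtree_for([6200]) # => [6200] — excludes 5040, 5031, 5023, 5047
```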
Not yet quantified. Affects any publication where:
- The LLM extracts a disease with anatomical subtypes
- Those subtypes have disease_subtypes TermMatches to separate diseases
- The separate diseases are not descendants of the umbrella disease
This pattern is common for:
- Head & neck cancers (HNSCC → oropharyngeal, laryngeal, hypopharyngeal, oral cavity)
- Lung cancers (NSCLC → adenocarcinoma, squamous)
- Potentially others with anatomical sub-site taxonomy
Explored solution direction
Option 1 (recommended): Always include parent disease match alongside subtype matches.
In disease_extraction.rb build_match_set, after collecting subtype matches, also run the disease-name match and include it in the result. Remove the early return on line 219:
```ruby
# Current (line 219):
return matches if matches.any?
```
```ruby
# Proposed: always also try the disease-name match
term_match = lookup_term_match('disease_name', disease_name)
if valid_match?(term_match)
  # Only add parent match if it resolved to a different disease than the subtypes
  parent_disease_id = term_match.final_result['id']
  subtype_disease_ids = matches.filter_map { |m| m['matched_disease_id'] }
  unless subtype_disease_ids.include?(parent_disease_id)
    matches << format_match_data(disease_name, subtype_values, term_match, matched_subtype: nil)
  end
end
```

This ensures HNSCC (6200) gets a trial_disease_details row alongside the sub-site rows. The deduplication check prevents creating a duplicate if the parent and subtype resolve to the same disease.
Option 2: Fix disease hierarchy. Make sub-site H&N cancers descendants of HNSCC. This is conceptually correct but clinically nuanced — not all oropharyngeal cancers are squamous cell carcinomas. Would need expert review.
Option 3: Both. Fix the extraction to always include the parent, AND fix the hierarchy for confirmed relationships. Belt and suspenders.
Solution applied
Implemented 2026-03-18.
1. Fixed disease_extraction.rb build_match_set: Removed the early return on line 219 that skipped the parent disease-name match when subtype matches existed. The method now always also tries the disease_name TermMatch lookup and includes it in the result set if it resolves to a different disease_id than any of the subtype matches. Added deduplication in merge_disease_matches to prevent the parent disease from being added multiple times when multiple sibling subtypes share the same parent.
2. Created backfill task lib/tasks/one_off/backfill_parent_disease_matches.thor:
- identify — finds 1,856 publications with subtype-only disease matches
- backfill — re-runs disease matching with the fixed logic, then destroys and recreates trial_disease_details only (does not touch subgroups, endpoints, or AEs)
Verified on pub 71438 (Becotatug vedotin, ESMO):
- Before: trial_disease_details had 4 sub-site diseases (5047, 5040, 5031, 5023), no HNSCC
- After: 5 entries — 4 sub-sites + HNSCC (6200)
- Publication now surfaces in HNSCC queries via EmergingClinicalDataQuery
Scale: 1,856 publications affected. Top disease names: breast cancer (593 entries across case variants), NSCLC (205), prostate cancer (127), lymphoma (130), mesothelioma (70), renal cell carcinoma (36), H&N SCC (29).
Pending: Production backfill of the 1,856 affected publications.
16. Confirmed ORR is not exported by EmergingClinicalDataQuery
Short summary
The disease worksheet has a dedicated Confirmed ORR (cORR) column, but EmergingClinicalDataQuery only exports OS, PFS, ORR, DoR, DFS, and DCR. Even when a worksheet row distinguishes confirmed from unconfirmed response, the query output has no place to carry that metric.
This means worksheet rows can look “partially matched” because the main ORR is present while the confirmed-response column is always blank.
Where this sits in the current pipeline
app/queries/tpp/emerging_clinical_data_query.rb:
- PRIMARY_EFFICACY_ABBREVIATIONS is defined as %w[OS PFS ORR DOR DoR DFS DCR]
- extract_efficacy_metrics iterates only that whitelist
- the result hash has no :corr or :confirmed_orr key
- summary_statistics, orr_ranking, and CSV export all inherit the same endpoint set
This is a reporting-layer omission. It sits after publication ingestion and after subgroup extraction.
Exact restriction causing the drop
The query hard-codes the primary efficacy endpoint set:
```ruby
PRIMARY_EFFICACY_ABBREVIATIONS = %w[OS PFS ORR DOR DoR DFS DCR].freeze
```

Because cORR is not in that list:
- extract_efficacy_metrics never reads confirmed-response rows even if they exist upstream
- build_single_row never exposes a confirmed-response field
- downstream consumers cannot distinguish:
  - unconfirmed ORR
  - confirmed ORR
  - rows where both are reported
Concrete examples from sqNSCLC sheet validation
Example 1: PF-08046054 (ASCO 2025)
Worksheet row:
- ORR = 33.3%
- cORR = 33.3%
- N = 6
Query row:
- `ORR = 33.3%`
- no `cORR` field
Example 2: Ifinatamab deruxtecan (ESMO 2023)
Worksheet row:
- `ORR = 31%`
- `cORR = 31%`
- `mDoR = 4.1`
Query row:
- `ORR = 31%`
- `mDoR = 4.1`
- no `cORR`
Example 3: IBI363 (ASCO 2025)
Worksheet rows:
- SqNSCLC: `ORR = 43.3%`, `cORR = 36.7%`
- SqNSCLC `TPS <1`: `ORR = 45.5%`, `cORR = 36.7%`
Query rows:
- `ORR = 43.3%` on the main SqNSCLC row
- no `cORR`
- the `TPS <1` row is additionally hidden by Issue 12
Downstream impact
- the worksheet `Confirmed ORR (cORR)` column cannot be reconstructed from structured output
- studies that report both ORR and cORR appear more complete than they really are because only one of the two response metrics survives
- comparisons between abstracts that emphasize unconfirmed responses versus confirmed responses become unreliable
What the issue is not
This is not primarily a data-availability problem.
For the sqNSCLC examples above, the worksheet values are tied to concrete conference/journal records that we already ingest or otherwise match on the main ORR metric. The missing part is the confirmed-response export path.
This is also not the same as Issue 12. Issue 12 hides subgroup rows; Issue 16 removes an entire metric family from the report shape.
In the current sqNSCLC worksheet:
- `5 / 10` populated rows include a `cORR` value
- these rows cover at least `4` distinct studies
So this is not an edge case for the worksheet format.
Explored solution direction
Add confirmed response as a first-class efficacy metric:
- Expand the endpoint whitelist to include the confirmed-response abbreviation actually used in the data (`cORR` / normalized equivalent)
- Store it in the row hash alongside `:orr`
- Add a `Confirmed ORR` column to CSV/export formatting
- Keep `ORR` and `cORR` separate rather than trying to merge or overwrite one with the other
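As a sketch of that direction — constant and method names mirror the query, but the alias normalization list is an assumption:

```ruby
# Illustrative sketch only: expand the endpoint whitelist so confirmed ORR
# survives extraction. CONFIRMED_ORR_ALIASES is an assumed normalization list.
PRIMARY_EFFICACY_ABBREVIATIONS = %w[OS PFS ORR cORR DOR DoR DFS DCR].freeze
CONFIRMED_ORR_ALIASES = %w[cORR CORR confirmed_orr].freeze

def extract_efficacy_metrics(outcome_rows)
  outcome_rows.each_with_object({}) do |row, metrics|
    abbrev = CONFIRMED_ORR_ALIASES.include?(row[:abbreviation]) ? 'cORR' : row[:abbreviation]
    next unless PRIMARY_EFFICACY_ABBREVIATIONS.include?(abbrev)

    # Keep ORR and cORR under separate keys; never overwrite one with the other.
    metrics[abbrev.downcase.to_sym] ||= row[:value]
  end
end

extract_efficacy_metrics([
  { abbreviation: 'ORR',  value: '43.3%' },
  { abbreviation: 'cORR', value: '36.7%' },
])
# => { orr: "43.3%", corr: "36.7%" }
```

With this shape, the IBI363 row above would carry both response metrics instead of silently dropping the confirmed one.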
17. ASCO abstract and presentation copies create duplicate publication rows
Short summary
Section titled “Short summary”After broadening ASCO ingestion to include both AbstractContentItem and PresentationContentItem, the same scientific abstract can now be stored twice under different ASCO uids. EmergingClinicalDataQuery groups by publication_id, not DOI/title, so both copies surface as separate rows.
This showed up repeatedly during the sqNSCLC pass and makes the local output look larger and noisier than the sheet.
Where this sits in the current pipeline
`app/services/publications/asco_api_service.rb`:
- `fetch_abstract_hits` requests `contentTypes: ['Abstract', 'Presentation']`
- `save_publication` persists records using `Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])`
`app/queries/tpp/emerging_clinical_data_query.rb`:
- `build_result_rows` groups by `publication_id`, `disease_id`, `effective_line`, and `study_plan_arm_id`
There is no DOI-level or title-level deduplication step between ingestion and reporting.
Exact restriction causing the duplication
The ASCO fix for Issue 2 intentionally broadened the search and detail query to include PresentationContentItem. That solved the “missing presentation” problem, but persistence still keys uniqueness on source_id:
`publication = Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])`

So if ASCO exposes both:
- `ABSTRACT492030`
- `PRESENTATION251481`
with the same DOI and same text, both are considered distinct publications locally.
Concrete examples from sqNSCLC validation
Section titled “Concrete examples from sqNSCLC validation”Example 1: PF-08046054
Same DOI:
10.1200/JCO.2025.43.16_suppl.8611
Stored twice:
- publication `48035` — source_id `ABSTRACT492030`
- publication `238708` — source_id `PRESENTATION251481`
Both produce the same sqNSCLC row (ORR = 33.3%, N = 6).
Example 2: IBI363
Same DOI:
10.1200/JCO.2025.43.16_suppl.8509
Stored twice:
- publication `139344` — source_id `ABSTRACT500470`
- publication `237445` — source_id `PRESENTATION246467`
Both produce the same main sqNSCLC 3 mg/kg Q3W row.
Example 3: Additional duplicate DOI pairs in the same sqNSCLC slice
- Datopotamab deruxtecan: `10.1200/JCO.2025.43.16_suppl.8501`
- Sacituzumab govitecan: `10.1200/JCO.2025.43.16_suppl.8599`
Downstream impact
- one worksheet row can correspond to two local rows
- counts for “how many publication-backed rows do we have?” are overstated
- manual comparison against the sheet becomes noisy
- any future ranking or aggregation that does not dedupe by DOI/title risks double-counting conference data
What the issue is not
This is not a disease-mapping issue and not a subgroup-extraction issue.
The data itself is usually valid in both copies. The problem is that they are the same scientific result represented twice because ASCO exposes two content-item types.
This is also not an argument to undo Issue 2 entirely. We needed PresentationContentItem support to recover records like SHR-A2102. The gap is specifically the lack of a deduplication strategy after broadening the source.
In the sqNSCLC ADC/fusion slice alone, there are 4 duplicate DOI pairs:
- PF-08046054
- IBI363
- Datopotamab deruxtecan
- Sacituzumab govitecan
So the effect is already material in a small disease/technology slice.
Explored solution direction
Two reasonable options:
1. Query/report deduplication
Keep both source records in `publications`, but dedupe in `EmergingClinicalDataQuery` or the TPP report by a stable key such as:
- DOI + disease + subgroup/arm
- or DOI + publication title
This is lower risk for ingestion history.
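A minimal sketch of the report-layer dedupe, assuming plain row hashes with `doi`/`disease_id`/`trial_subgroup_id` keys — the key choice and the abstract-over-presentation preference are assumptions, not the implemented behavior:

```ruby
# Illustrative sketch: collapse ASCO abstract/presentation twins at report time.
# Prefers the abstract copy when both rows share the dedupe key.
def dedupe_publication_rows(rows)
  prefer = { 'ABSTRACT' => 0, 'PRESENTATION' => 1 }
  rows
    .group_by { |r| [r[:doi], r[:disease_id], r[:trial_subgroup_id]] }
    .map { |_key, copies| copies.min_by { |r| prefer.fetch(r[:source_id][/\A[A-Z]+/], 9) } }
end

rows = [
  { doi: '10.1200/JCO.2025.43.16_suppl.8611', disease_id: 1, trial_subgroup_id: 7,
    source_id: 'ABSTRACT492030', publication_id: 48035 },
  { doi: '10.1200/JCO.2025.43.16_suppl.8611', disease_id: 1, trial_subgroup_id: 7,
    source_id: 'PRESENTATION251481', publication_id: 238708 },
]
dedupe_publication_rows(rows).map { |r| r[:publication_id] }
# => [48035]
```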
2. Ingestion-time merge
When saving ASCO records, detect that an incoming presentation and an existing abstract share the same DOI/title/NCT tuple and merge them into one canonical Publication.
This is cleaner downstream but riskier because it changes persistence semantics for already-ingested ASCO records.
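The detection step for option 2 can be sketched with plain hashes standing in for `Publication` records — the matching tuple here is DOI-only for brevity, where the real check would also compare title/NCT as described above:

```ruby
# Illustrative sketch: before persisting an incoming ASCO record, look for an
# existing ASCO publication with the same DOI but a different source_id and
# reuse it as the canonical record instead of creating a twin.
def find_canonical_asco_publication(incoming, existing)
  twin = existing.find do |pub|
    pub[:source] == 'ASCO' &&
      pub[:doi] == incoming[:doi] &&
      pub[:source_id] != incoming[:source_id]
  end
  twin || incoming
end

existing = [{ source: 'ASCO', source_id: 'ABSTRACT492030',
              doi: '10.1200/JCO.2025.43.16_suppl.8611', id: 48035 }]
incoming = { source: 'ASCO', source_id: 'PRESENTATION251481',
             doi: '10.1200/JCO.2025.43.16_suppl.8611' }
find_canonical_asco_publication(incoming, existing)[:id]
# => 48035
```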
18. PubMed-indexed journal article missing from publication corpus
Short summary
Section titled “Short summary”The current sqNSCLC worksheet row for Cofetuzumab pelidotin points to the 2025 journal article:
- DOI: `10.1016/j.lungcan.2025.108492`
- PMID: `40086026`
That article exists on PubMed and contains the sqNSCLC result the sheet uses, but there is no corresponding Publication row in the local database. As a result, the row is completely absent from EmergingClinicalDataQuery.
Where this sits in the current pipeline
This drop happens before EmergingClinicalDataQuery.
During validation:
- `Publication.where(doi: '10.1016/j.lungcan.2025.108492')` returned no rows
- `Publication.where(source_id: '40086026')` returned no rows
So the publication never entered the local corpus, or it was dropped before persistence.
Exact restriction causing the drop
Root cause isolated.
There are two distinct PubMed ingestion limitations affecting this paper:
- the disease-specific path depends on PubMed exposing a `ClinicalTrials.gov/NCT...` databank entry, and this record does not appear to expose that linking metadata even though PubMed marks it as a clinical trial
- the broad PubMed path in `Publications::PubmedApiService` built one giant combined query for the oncology MeSH clause plus the recovery clause; that combined search term excluded qualifying records that PubMed returned when the intended criteria were tested separately
What was verified live for PMID 40086026:
- PubMed resolves DOI `10.1016/j.lungcan.2025.108492` to PMID `40086026`
- the record has `Clinical Trial, Phase I`
- the record has oncology MeSH including `Carcinoma, Non-Small-Cell Lung` and `Lung Neoplasms`
- `40086026[uid] AND mesh AND clinical-trial publication types AND 2025 date` returned `1`
- `40086026[uid] AND full previous combined search term` returned `0`
So the missing publication was not due to missing PubMed record metadata for the broad query. It was due to our query construction.
Concrete example
Section titled “Concrete example”Worksheet row: Cofetuzumab pelidotin in sqNSCLC
Worksheet entry:
- Drug: Cofetuzumab pelidotin
- Publication: Lung Cancer (Journal), 2025
- Link: `https://doi.org/10.1016/j.lungcan.2025.108492`
- `ORR = 12.5%`
- `cORR = 12.5%`
- `mPFS = 5.3`
- `mDoR = 2.2`
Local database state:
- no `Publication` row for DOI `10.1016/j.lungcan.2025.108492`
- no `Publication` row for PMID `40086026`
- only older cofetuzumab records exist:
  - publication `150086` — ASCO 2021
  - publication `71934` — ESMO 2023
  - publication `101600` — Clinical Cancer Research 2021
External confirmation:
- PubMed lists the paper as “A phase 1b study of cofetuzumab pelidotin monotherapy in patients with PTK7-expressing recurrent non-small cell lung cancer” with PMID `40086026`
Downstream impact
- the sqNSCLC worksheet still has one fully missing non-investor row even after the backfills and corrections
- the earlier tracker note that the cofetuzumab sqNSCLC value was poster-only is now stale for the current worksheet version
- the publication will remain absent until a non-`--disease-specific` 2025 PubMed run is executed against the fixed query logic
- `--disease-specific` alone is still insufficient for this class of paper because PubMed does not appear to expose the `ClinicalTrials.gov` linking metadata we rely on
What the issue is not
This does not contradict the earlier ESMO 2023 analysis in Issue 11.
That earlier note was about publication 71934, where the squamous-specific value was not in the 2023 abstract text. The current worksheet has since moved to a later 2025 journal article. That newer source should be representable if it is ingested.
This remains the one confirmed missing sqNSCLC worksheet row from the original worksheet discrepancy.
For 2025-01-01 through 2025-12-31, after fixing the PubMed query construction:
- the broad oncology/malignant-heme PubMed query returns `6,013` PMIDs
- `3,831` of those are not already in local `publications`
- compared with the old `Clinical Trial[pt]` path, there are `435` additional PMIDs
- `431` of those additional PMIDs are not already in local `publications`
So this is not just one missing-paper edge case. The broken combined query was suppressing a non-trivial number of 2025 PubMed records.
Spot checks
Section titled “Spot checks”Publication.where(doi: '10.1016/j.lungcan.2025.108492')returned no rows before the fixPublication.where(source_id: '40086026')returned no rows before the fix- after the
PubmedApiServicequery change,fetch_uids_by_date('2025/01/01', '2025/12/31', nct_ids: [])includes PMID40086026 - live verification after the fix returned:
includes_pmid_40086026 = truetotal = 6013
Open characterization questions
- After the 2025 backfill, how many of the `431` incremental publications are truly result publications versus broader cancer-clinical-trial noise?
- Do we want to keep the broad non-`--disease-specific` PubMed run as a regular sync, or use it only as a periodic coverage backfill?
Explored solution direction
Characterize the missing publication upstream of the query, then narrow the fix to the actual failure point:
- Trace the PubMed/journal ingestion path for DOI `10.1016/j.lungcan.2025.108492` / PMID `40086026`
- Compare direct PubMed criteria matches against the full generated search term
- Split the broad PubMed search into separate query terms and union PMIDs in Ruby instead of relying on one giant combined PubMed query
Solution applied
Section titled “Solution applied”- updated
Publications::PubmedApiServiceso the broad PubMed path now runs separate search terms for:- oncology/malignant-heme MeSH + clinical-trial publication types
- oncology/malignant-heme MeSH + recovery result terms for the recent recovery window
- changed PubMed UID fetching to execute each term separately and union the PMIDs in Ruby
- aligned total-count logic with the split-query approach
- verified live that the fixed 2025 query now includes PMID
40086026 - syntax check passed:
ruby -c app/services/publications/pubmed_api_service.rb
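The split-and-union shape can be sketched like this — `esearch` is an illustrative stand-in for the real E-utilities call, not the service’s actual method:

```ruby
# Illustrative sketch: run each PubMed search term separately and union the
# PMIDs in Ruby, instead of relying on one giant combined search term.
def fetch_uids_for_terms(terms, esearch:)
  terms.flat_map { |term| esearch.call(term) }.uniq
end

fake_esearch = lambda do |term|
  {
    'oncology mesh AND clinical trial [pt]' => %w[40086026 40000001],
    'oncology mesh AND recovery terms'      => %w[40000001 40000002],
  }.fetch(term)
end

fetch_uids_for_terms(
  ['oncology mesh AND clinical trial [pt]', 'oncology mesh AND recovery terms'],
  esearch: fake_esearch
)
# => ["40086026", "40000001", "40000002"]
```

Unioning in Ruby keeps each search term simple enough that PubMed evaluates the intended criteria independently, which is what the live verification above exercised.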
19. Biomarker context missing at subgroup level
Short summary
The worksheet slices data three ways: dose, treatment line, and biomarker. Treatment line and dose are now structured on `trial_subgroups` (Issues 5/12), but biomarker context is not. Biomarkers are extracted at the `trial_disease_details` level (publication + disease scope) via `disease_extraction.rb`, stored in `trial_disease_biomarkers`. There is no link between a biomarker-type subgroup (e.g. “EGFR-mutant → ORR=45%”) and the structured biomarker record (EGFR = positive).
~13,177 subgroups have biomarker-type classifications (mutation: 11,850, biomarker: 913, molecular subtype: 118, etc.). ~94% are single-biomarker subgroups; ~6% are multi-biomarker (e.g. “EGFR/ALK-negative”, “KRAS wild-type + BRAF-mutated”).
Where this sits in the current pipeline
Biomarker extraction (disease level):
- `disease_extraction.rb` → LLM extracts `patient_population_diseases[].biomarkers[]` from abstract
- `post_process.rb` lines 445-463 → creates `trial_disease_biomarkers` linked to `trial_disease_details`
- Matching via `Biomarker.flexifind(biomarker_name, 'synonyms')` → `biomarker_id`
Subgroup extraction (no biomarker logic):
- `subgroup_extraction.rb` → identifies subgroup labels (e.g. “EGFR-mutant”), classifies `subgroup_type = 'mutation'`
- `classify_publications` (`task.rb`) → extracts outcome measures per subgroup
- `post_process.rb` lines 251-260 → creates `trial_subgroups` with `subgroup_type`, `subgroup_value` — no biomarker fields
No biomarker usage in query/view:
- `vw_publication_efficacy_data` does not join or expose biomarker data
- `EmergingClinicalDataQuery` does not query `trial_disease_biomarkers`
Exact restriction causing the gap
`trial_subgroups` has no biomarker columns. Biomarker information is only available as:
- Unstructured text in `subgroup_value` (e.g. “EGFR-mutant”, “PD-L1 TPS≥1%”, “TMB high”)
- Structured records in `trial_disease_biomarkers` — but these are linked to `trial_disease_details`, not to `trial_subgroups`
Concrete examples
Example 1: EGFR-mutant subgroup (pub 176313)
- `trial_subgroups`: `subgroup_type='mutation'`, `subgroup_value='EGFR-mutant'`, `biomarker_id=NULL`
- `trial_disease_biomarkers`: `biomarker_name='EGFR'`, `value='positive'`, `biomarker_id=656` — attached to `trial_disease_detail`, no link to the subgroup
Example 2: PD-L1 TPS≥1% subgroup
- Subgroup value: “Non-squamous NSCLC → PD-L1 TPS≥1%”
- Needs: `biomarker_id` → PD-L1, `biomarker_value` → “≥1%”
- Currently: only unstructured text in `subgroup_value`
Example 3: Multi-biomarker (6% of cases)
- Subgroup value: “KRAS wild-type + BRAF-mutated”
- Contains two biomarkers — a single `biomarker_id` column would capture only one
- 13,177 subgroups with biomarker-type `subgroup_type`
- ~5,117 (39%) contain a single recognized biomarker name
- ~811 (6%) contain multiple biomarker names
- ~7,361 (55%) contain less common markers not in the top-40 list but still single-biomarker (e.g. AKT1, VHL, DNMT3A, EZH2)
- Total: ~94% single biomarker per subgroup
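For intuition, the 1:N shape a join table needs to hold can be sketched with a naive label splitter — the real extraction is LLM-based, and this regex split plus the assumed label format are illustrative only:

```ruby
# Illustrative sketch only: split a multi-biomarker subgroup label into
# per-biomarker (name, value) pairs — the 1:N shape the join table stores.
def split_biomarker_label(label)
  label.split(/\s*\+\s*/).map do |part|
    m = part.match(/\A([A-Za-z0-9]+)[- ](.+)\z/)
    m ? { biomarker_name: m[1], value: m[2] } : { biomarker_name: part, value: nil }
  end
end

split_biomarker_label('KRAS wild-type + BRAF-mutated')
# => [{ biomarker_name: "KRAS", value: "wild-type" },
#     { biomarker_name: "BRAF", value: "mutated" }]
```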
What the issue is not
- Not an extraction failure — biomarkers ARE extracted, just at the wrong granularity (disease level, not subgroup level)
- Not a matching failure — `Biomarker.flexifind` works well, and `BiomarkerMatchingService` provides advanced LLM-based matching
- Not a view/query issue — the data simply doesn’t exist on `trial_subgroups` yet
Resolution: Partially addressed by subgroup tagging
Phase 1 (complete): Subgroup tagging (`openspec/changes/subgroup-tagging/`) added a `biomarker` tag to `trial_subgroups.tags`, solving the filtering problem — users can find biomarker subgroups via `tags @> '["biomarker"]'`. Tags are multi-valued (“EGFR-mutant NSCLC” gets `["biomarker", "disease"]`), exposed in `vw_publication_efficacy_data` and admin UI.
Phase 2 (implemented): Structured biomarker link for display and matching.
What was implemented:
- Join table `trial_subgroup_biomarkers` — mirrors `trial_disease_biomarkers` schema:
  - `trial_subgroup_id` → FK to `trial_subgroups` (cascade delete)
  - `biomarker_name` → LLM-extracted name (e.g., “KRAS”)
  - `value` → status/value (e.g., “mutated”, “wild-type”, “TPS≥1%”)
  - `numeric_value` → threshold if applicable (e.g., “1” for TPS≥1%)
  - `biomarker_id` → FK to `biomarkers` (populated by BiomarkerMatchingService, not flexifind)
- LLM extraction — two paths:
  - Backfill: `lib/tasks/one_off/backfill_subgroup_biomarkers.thor` — sends abstract + biomarker-tagged subgroups to GPT-5-mini per-publication. Extracts biomarker name + value. Handles multi-biomarker (e.g., “BRCA1/2” → two entries). ~13K subgroups, ~$5-10.
  - Forward pipeline: `SubgroupBiomarker` schema added to `SubgroupOutcome` in `details.rb`. `post_process.rb` creates `trial_subgroup_biomarkers` records when `tags.include?('biomarker')`.
- No flexifind — `biomarker_id` is left NULL at extraction time. All matching goes through the `BiomarkerMatchingService` pipeline in `PublicationDiseaseWorkflow`:
  - `populate_term_matches` → creates TermMatch entries with `strategy: 'BiomarkerMatching'`, `field: 'name'`
  - Deduplicates with 6,151 existing BiomarkerMatching term matches (3,186 already resolved from ParticipationCriterionBiomarker runs)
  - `suggest_keywords` → `find_candidates` (semantic) → `pick_best_match` → `qa_best_match` → `judge` (gpt-5) → `post_process` writes `biomarker_id`
  - Also applied to `trial_disease_biomarkers` — removed flexifind from `post_process.rb` disease biomarker creation. Same matching pipeline now handles both subgroup-level and disease-level biomarkers.
- Workflow steps — added to `PublicationDiseaseWorkflow` as two parallel branches from the first node:
  - Subgroup biomarker branch: `populate_term_matches_for_subgroup_biomarkers` → 6 matching steps → `post_process_subgroup_biomarkers`
  - Disease biomarker branch: `populate_term_matches_for_disease_biomarkers` → 6 matching steps → `post_process_disease_biomarkers`
  - Both run in parallel with existing disease/subtype matching branches.
- View v15 — `vw_publication_efficacy_data` now exposes `trial_subgroup_id` for query-layer joins.
- Query updates — `ClinicalEvidenceQuery` and `EmergingClinicalDataQuery` now COALESCE subgroup-level biomarkers over disease-level:

```sql
LEFT JOIN trial_subgroup_biomarkers tsb ON tsb.trial_subgroup_id = v.trial_subgroup_id
LEFT JOIN biomarkers sb ON tsb.biomarker_id = sb.id
-- ...existing disease-level joins...
COALESCE(tsb.biomarker_id, tdb.biomarker_id) AS biomarker_id,
COALESCE(sb.name, tsb.biomarker_name, b.name, tdb.biomarker_name) AS biomarker_name,
COALESCE(tsb.value, tdb.value) AS biomarker_value,
```
Production deployment:
```sh
# 1. Run migration (create trial_subgroup_biomarkers table + view v15) ✅
# 2. Backfill subgroup biomarker extraction ✅ (2026-03-24, gpt-5.4-mini)
#    Results: 52,063 records across 44,725 subgroups (99% of 45,184 biomarker-tagged)
#    1.16 records/subgroup avg. Top markers: HER2 (3,728), PD-L1 (3,188), EGFR (1,964)
thor one_off:backfill_subgroup_biomarkers:backfill --batched --parallelism 4 --model=gpt-5.4-mini
# 3. Run PublicationDiseaseWorkflow — biomarker branches match both subgroup + disease biomarkers ✅ (2026-03-25)
#    Results: 3,439 TermMatches created for TrialSubgroupBiomarker (3,191 resolved, 248 pending)
#    35,026 / 52,063 records matched to biomarker_id (67.3%)
#    Unmatched breakdown: 7,038 resolved no-match (long tail), 8,005 deduped via PCB no-match, 477 PCB match not propagated, 1,517 unknown
# 4. Query layer fix: LEFT JOIN LATERAL with STRING_AGG to aggregate multi-biomarker subgroups ✅ (2026-03-25)
#    Prevents row multiplication for ~5,810 multi-biomarker subgroups
#    All biomarker names surface (matched or raw) via COALESCE
```

Design notes:
- TermMatch `field: 'name'` is shared across all biomarker sources (ParticipationCriterionBiomarker, TrialSubgroupBiomarker, TrialDiseaseBiomarker) for deduplication
field: 'name'is shared across all biomarker sources (ParticipationCriterionBiomarker, TrialSubgroupBiomarker, TrialDiseaseBiomarker) for deduplication - ~6% of biomarker subgroups are multi-biomarker — join table handles 1:N cleanly
- Judge step uses gpt-5 (temperature=nil, since gpt-5 only supports default temperature)
- Query layer uses `LEFT JOIN LATERAL` with `STRING_AGG` to aggregate multiple biomarkers per subgroup into comma-separated strings, avoiding row multiplication while preserving all biomarker names/values
LEFT JOIN LATERALwithSTRING_AGGto aggregate multiple biomarkers per subgroup into comma-separated strings, avoiding row multiplication while preserving all biomarker names/values
20. study_plan_arm link is fragile and causes dose/drug/arm issues (merges Issue 3)
Section titled “20. study_plan_arm link is fragile and causes dose/drug/arm issues (merges Issue 3)”Short summary
Section titled “Short summary”The vw_publication_efficacy_data materialized view depends on study_plan_arms (trial registry) for two critical functions: resolving arm roles (EXPERIMENTAL vs COMPARATOR) and resolving drug attribution (via vw_bioloupe_interventions). This dependency is the root cause of three cascading problems:
- Arm role failures — 62% of view rows have no
study_plan_armmatch and default to EXPERIMENTAL - Dose evidence drop (Issue 3) — The
pub_dose_lookupCTE joins ondrug_id, but the view’s drug_id comes from the registry while dose evidence drug_id comes frompublication_interventions. This mismatch causes 76% of extracted dose evidence (17,826 of 23,503 pubs) to silently drop. - Row triplication — Multiple
study_plan_armsper trial create duplicate rows in thedrug_interventionsCTE
The fix is to drop the study_plan_arm dependency entirely and use publication_interventions as the primary drug source, with LLM-classified arm roles replacing the registry lookup.
Where this sits in the current pipeline
The `study_plan_arm` link flows through:
- `publication_clinical_trials` links a publication to a `clinical_trial`
- `trial_arm_outcomes.study_plan_arm_id` links an LLM-extracted outcome row to a registry arm
- `study_plan_arms.arm_type` provides the registry’s classification (EXPERIMENTAL, ACTIVE_COMPARATOR, PLACEBO_COMPARATOR, etc.)
- `vw_publication_efficacy_data` resolves `resolved_group_type` via `COALESCE(UPPER(spa.arm_type), CASE WHEN arm_type = 'experimental' THEN 'EXPERIMENTAL' ... END)`
- The `drug_interventions` CTE in the view joins `vw_bioloupe_interventions` (trial registry drug data) to the correct arm via `study_plan_arm_id`
Relevant code paths:
- `app/queries/tpp/clinical_evidence_query.rb` — uses `resolved_group_type` to prefer EXPERIMENTAL rows for efficacy/safety, extract comparator values
- `db/views/vw_publication_efficacy_data_v18.sql` — the materialized view definition
- `fetch_trial_enrichments` in the query — fetches comparator arm names from `study_plan_arms`
Exact restriction causing the issue
`trial_arm_outcomes.arm_type` is always NULL for publication-sourced data. The LLM extraction pipeline (`classify_publications`) extracts arm names but does not classify arm roles. The only path to arm role classification is the `study_plan_arm_id` foreign key, which requires:
- The publication is linked to a trial (`publication_clinical_trials` exists)
- The LLM-extracted arm name was matched to a registry arm (`study_plan_arm_id` is set)
Both conditions frequently fail.
Coverage analysis of vw_publication_efficacy_data (total ~1.04M rows):
| Category | Row count | % of total |
|---|---|---|
| Trial + arm linked (has `study_plan_arm_id`) | 399,373 | 38% |
| Trial linked, no arm match | 447,912 | 43% |
| Unlinked (uses `publication_interventions`) | 196,723 | 19% |
So 62% of view rows have no study_plan_arm link and default to EXPERIMENTAL.
For HNSCC specifically (14,660 rows):
- 1,463 rows (10%) have comparator identification via the arm link
- 12,360 rows are marked EXPERIMENTAL
- 569 rows have NULL `resolved_group_type`
Dose evidence impact (from Issue 3 reopened investigation, 2026-03-23)
The same study_plan_arm dependency causes the `drug_interventions` CTE to use registry drug_ids. The `pub_dose_lookup` CTE then fails to join because `publication_interventions.drug_id` (LLM-extracted) doesn’t match:
- `23,503` publications with dose_evidence extracted
- `8,764` publications with structured dose in view (37%)
- `17,826` publications with dose evidence silently dropped (76%)
Breakdown of dropped:

- ~13,600 NULL drug_id on `publication_interventions` (58%)
- ~2,148 drug_id mismatch: registry vs LLM-extracted (9%)
- ~2,078 other (pub not in view, no usable fields, etc.)

Concrete examples from CRC ADC audit (disease 4345, technology 708):
| Pub | Drug | PI drug_id | View drug_id | Dose evidence | View dose |
|---|---|---|---|---|---|
| 66516 | Zanidatamab | 10432 (antibody) | 15231 (ADC: zovodotin) | 1200 mg | NULL |
| 70960 | SHR-A1811 | NULL | 10733 (Trastuzumab rezetecan) | rp2d=6.4 mg/kg | NULL |
| 114758 | Zanidatamab | 10432 (antibody) | 15231 (ADC: zovodotin) | 1200 mg | NULL |
Dropping the study_plan_arm dependency and using publication_interventions as the primary drug source would fix this automatically — drug_id and pub_dose_lookup would use the same source.
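The join-key change can be sketched in miniature — row shapes and the `publication_intervention_id` plumbing here are illustrative:

```ruby
# Illustrative sketch: attach dose evidence to view rows via
# publication_intervention_id so both sides come from the same LLM-extracted
# source, instead of joining on registry-vs-LLM drug_ids that disagree.
def attach_dose(view_rows, dose_evidence)
  by_pi = dose_evidence.group_by { |d| d[:publication_intervention_id] }
  view_rows.map do |row|
    dose = by_pi[row[:publication_intervention_id]]&.first
    row.merge(dose: dose && dose[:dose])
  end
end

rows  = [{ publication_id: 70960, publication_intervention_id: 901 }]
doses = [{ publication_intervention_id: 901, dose: 'rp2d=6.4 mg/kg' }]
attach_dose(rows, doses).first[:dose]
# => "rp2d=6.4 mg/kg"
```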
Concrete examples
LLM-extracted arm names that clearly indicate their role without registry lookup:
| arm_name (LLM-extracted) | resolved_group_type (from registry) | Obvious from name? |
|---|---|---|
| Cetuximab + Chemotherapy (Control) | ACTIVE_COMPARATOR | Yes — “(Control)” |
| Standard Treatment | ACTIVE_COMPARATOR | Yes — “Standard” |
| Placebo | PLACEBO_COMPARATOR | Yes — “Placebo” |
| Extreme Regimen | ACTIVE_COMPARATOR | Ambiguous — SOC regimen name |
| Experimental group | EXPERIMENTAL | Yes — “Experimental” |
| Non-Randomized Single-Arm | EXPERIMENTAL | Yes — single-arm |
| BCA101 + pembrolizumab | NO_INTERVENTION | Registry is wrong — this is clearly experimental |
| Arm B: Cetuximab/Methotrexate/Docetaxel | ACTIVE_COMPARATOR | Ambiguous — needs context |
| 1, 2, Arm I | varies | Not classifiable from name alone |
What the client worksheet actually needs from the trial link
Mapping each worksheet column against its data source:
| Sheet column | Data source | Needs trial link? | Needs study_plan_arm? |
|---|---|---|---|
| Drug | publication_interventions | No | No |
| Technology | drugs → technologies | No (via drug_id) | No |
| Target(s) | drug_target_actions | No (via drug_id) | No |
| Company | drug_ownerships | No (via drug_id) | No |
| Clinical Trial (NCT ID) | publication_clinical_trials → clinical_trials | Yes (trial ID only) | No |
| Clinical Trial Name | clinical_trials.brief_title | Yes (trial ID only) | No |
| Clinical Trial Location | locations table (country rollup) | Yes (trial ID only) | No |
| Combination Partner | publication_interventions | No | No |
| Comparator | study_plan_arms (COMPARATOR type) | Yes | Yes (current path) |
| Disease | trial_disease_details / trial_subgroups | No | No |
| Publication Date | publications | No | No |
| Data Cut Date | trial_subgroups (pub-extracted) | No | No |
| Prior Lines (min/max/median) | trial_subgroups (pub-extracted) | No | No |
| Biomarker | subgroup tags (pub-extracted) | No | No |
| Dose fields | trial_subgroups + publication_interventions | No | No |
| Efficacy (mOS, mPFS, ORR, etc.) | trial_outcome_measures / trial_arm_outcomes | No | No |
| Safety (TRAE, TEAE, etc.) | adverse_events | No | No |
| Phase (internal filter) | clinical_trials.phase | Yes (trial ID only) | No |
| Randomized (internal) | study_designs.allocation | Yes (trial ID only) | No |
| Is Basket Trial (internal) | clinical_trial_end_diseases (computed) | Yes (trial ID only) | No |
Conclusion: The study_plan_arm link is only needed for the “Comparator” column and for resolved_group_type (experimental vs comparator arm selection). All other trial-derived fields only need publication_clinical_trials.clinical_trial_id.
Drug resolution dependency
`publication_interventions` currently only exists for publications processed through the target-disease extraction pipeline (~17K publications). For the remaining ~45K linked publications, drug resolution still flows through `vw_bioloupe_interventions` via the trial link and arm join.
However, ClinicalEvidenceQuery is always scoped to a specific disease, which means its publications will have gone through the target-disease pipeline and will have publication_interventions. This is not a blocker for the clinical evidence report specifically.
Downstream impact
- Efficacy extraction — `extract_efficacy_metrics` prefers `EXPERIMENTAL` rows. Without arm role classification, randomized trial publications would have both experimental and comparator values lumped together, and the “best” row would be picked by patient count rather than arm role.
- Comparator value — the query extracts `comparator_value` (e.g., comparator mPFS) from rows with `resolved_group_type` containing `COMPARATOR`. Without this, the comparator column and comparator efficacy values would be empty.
- Safety extraction — `extract_safety_metrics_for_publication` filters to the `EXPERIMENTAL` arm for safety. Less critical since most single-arm studies (the majority of the corpus) only have one arm anyway.
Explored solution direction
Drop the `study_plan_arm` dependency; add LLM arm role classification.
The proposed approach has two parts:
Part 1: Classify arm roles from LLM-extracted arm names
Add an `arm_role` field to `trial_arm_outcomes` (or `arm_type` — currently always NULL for publication data). Populate it via one of:
Option A: LLM classification during classify_publications — Add arm role to the extraction schema so the LLM outputs "arm_role": "experimental" or "arm_role": "comparator" alongside the arm name. This is the most reliable since the LLM has the full abstract context and knows which drug is investigational.
Option B: Post-hoc heuristic — Pattern match on arm names: keywords like “control”, “placebo”, “standard of care”, “SOC”, “comparator” → COMPARATOR; “experimental”, “investigational”, “study drug”, “treatment” → EXPERIMENTAL. This catches ~70% of cases but fails on regimen names like “Extreme Regimen” (HNSCC SOC) or numbered arms like “Arm B”.
Option A is recommended because the LLM already has the context to make this classification, and the marginal cost per publication is negligible.
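For comparison, option B’s heuristic is easy to sketch — and its failure mode shows up immediately (the keyword lists here are illustrative, not exhaustive):

```ruby
# Illustrative sketch of the post-hoc heuristic (option B). Returns nil for
# names that need abstract context, which is exactly where the LLM wins.
COMPARATOR_HINTS   = /control|placebo|standard of care|\bSOC\b|standard treatment|comparator/i
EXPERIMENTAL_HINTS = /experimental|investigational|study drug|single-arm/i

def classify_arm_role(arm_name)
  return 'comparator'   if arm_name.match?(COMPARATOR_HINTS)
  return 'experimental' if arm_name.match?(EXPERIMENTAL_HINTS)

  nil # ambiguous — e.g. "Extreme Regimen" or "Arm B"
end

classify_arm_role('Cetuximab + Chemotherapy (Control)') # => "comparator"
classify_arm_role('Non-Randomized Single-Arm')          # => "experimental"
classify_arm_role('Extreme Regimen')                    # => nil
```

Note the sketch deliberately omits the bare keyword “treatment” from the experimental list, since it would misfire on names like “Standard Treatment”.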
Part 2: Simplify the view and query
Once arm roles are self-classified:
- `vw_publication_efficacy_data`: Remove the `study_plan_arms` join from `arm_outcomes_expanded`. Use the new `arm_role` field on `trial_arm_outcomes` instead.
- `drug_interventions` CTE: Remove the `publication_arm_links` → `vw_bioloupe_interventions` join path entirely for clinical evidence queries. Use `publication_interventions` as the sole drug source (acceptable since clinical evidence queries are disease-scoped).
- `fetch_trial_enrichments`: Keep the enrichment query but simplify — it only needs `clinical_trials` + `locations` + `study_designs` for metadata. Remove the `study_plan_arms` subquery for comparator arm names; instead, derive comparator name from the LLM-extracted arm names where `arm_role = 'comparator'`.
- `fetch_combination_partners`: Already uses `publication_interventions` as primary path. No change needed.
What this preserves
- NCT ID, trial name, phase, location, randomized, basket trial detection — all via `publication_clinical_trials` → `clinical_trials` (no arm join)
- Correct experimental vs comparator arm selection — via LLM-classified `arm_role`
- Comparator name in the report — derived from arm names where `arm_role = 'comparator'`
What this removes
- Dependency on `study_plan_arm_id` matching (currently fails for 62% of rows)
- Registry arm type overriding LLM context (sometimes wrong, e.g., `BCA101 + pembrolizumab` tagged `NO_INTERVENTION`)
- Drug resolution via `vw_bioloupe_interventions` for linked publications (replaced by `publication_interventions`)
Solution applied
Implemented 2026-03-23. Change: `fix-study-plan-arm-dependency`.
Four-part fix:
1. `vw_publication_efficacy_data` v16 — restructured the `drug_interventions` CTE:
   - Added Source 0: `publication_interventions` as the primary drug source for all pubs that have them (linked AND unlinked). Includes NULL `drug_id` interventions — if we extracted them, that’s the source of truth.
   - Sources 1a/1a-fallback/1b/1c gated with `NOT EXISTS (pubs_with_pi)` — they only fire as fallback for publications without `publication_interventions` (non-target-disease pubs used by `EmergingClinicalDataQuery`).
   - Removed Source 2 (unlinked-only path) — subsumed by Source 0.
   - Threaded `publication_intervention_id` through Source 0 and `pub_dose_lookup` for exact join matching, eliminating the drug_id mismatch that dropped 76% of dose evidence.
2. `vw_publication_efficacy_data` v16 — inverted `arm_outcomes_expanded` priority:
   - LLM-classified `tao.arm_type` is now preferred over the registry `spa.arm_type` via a CASE expression.
   - Maps `control`/`active_comparator` → ACTIVE_COMPARATOR, `placebo`/`placebo_comparator` → PLACEBO_COMPARATOR.
   - Falls back to `spa.arm_type` only when the LLM value is NULL.
3. Safety queries in `clinical_evidence_query.rb`:
   - Updated both inline safety SQL queries to use the same LLM-first arm role logic.
4. Arm role classification improvements (going-forward + backfill):
   - Expanded the `arm_type` enum in `details.rb` from `[investigational, control]` to `[investigational, control, active_comparator, placebo_comparator]`.
   - LLM-based backfill task `lib/tasks/one_off/backfill_arm_type_from_name.thor`:
     - Phase 1 fast-path: single-arm publications (39K pubs, 239K rows) → `investigational` directly.
     - Phase 2 LLM: multi-arm publications (28K pubs) sent to GPT-5-mini with abstract context for classification. Estimated cost ~$17.
   - Tested on 65 publications with 0 errors. The LLM correctly classifies drug-name arms (e.g. “Sorafenib” → control, “Chemotherapy” → control), ambiguous labels (e.g. “Arm B”, “Group 1”), and placebo variants.
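The LLM-first priority implemented by the v16 CASE expression is equivalent to this Ruby sketch (the real logic is SQL in the view; the mapping of `investigational` to EXPERIMENTAL is an assumption consistent with the view labels used elsewhere in this section):

```ruby
# Sketch of the v16 arm-type resolution: prefer the LLM-classified value on
# trial_arm_outcomes, map it to view labels, and fall back to the registry
# study_plan_arms value only when the LLM value is nil. Illustrative only.
LLM_ARM_TYPE_MAP = {
  "investigational"    => "EXPERIMENTAL",        # assumed mapping
  "control"            => "ACTIVE_COMPARATOR",
  "active_comparator"  => "ACTIVE_COMPARATOR",
  "placebo"            => "PLACEBO_COMPARATOR",
  "placebo_comparator" => "PLACEBO_COMPARATOR"
}.freeze

def resolve_arm_type(llm_arm_type, registry_arm_type)
  LLM_ARM_TYPE_MAP[llm_arm_type] || registry_arm_type
end
```

This ordering is what lets the LLM override stale registry values like `NO_INTERVENTION` while still using the registry when no LLM classification exists.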
Results (prod, post-backfill, 2026-03-24):
| Metric | Before v16 | After v16 + backfill |
|---|---|---|
| Pubs with structured dose in view | 8,764 | 11,916 (+36%) |
| Coverage of extracted dose evidence | 71% | 96.7% |
| ACTIVE_COMPARATOR rows | sparse (registry-only) | 124,346 (12.6% of view) |
| PLACEBO_COMPARATOR rows | sparse (registry-only) | 32,383 (3.3% of view) |
| Total comparator identification | ~38% coverage when arm linked | 15.9% of all rows (up from near-zero for LLM-sourced pubs) |
| Stale registry values (PLACEHOLDER/NO_INTERVENTION/OTHER) | 7,806 rows | 17 rows |
Prod verification (2026-03-24): Spot-checked 55+ publications across multiple categories:
- Combo arms with “placebo” in name (e.g. “Nivo+Ipi+Placebo for Nivo”) → correctly EXPERIMENTAL
- Drug-name comparators (e.g. “Sorafenib”, “FOLFIRI”, “Chemotherapy”) → correctly ACTIVE_COMPARATOR
- Novel drug monotherapy vs combo (EV mono vs EV+pembro) → correctly identified mono as comparator
- Phase I multi-arm dose trials → correctly all EXPERIMENTAL
- Randomized dose-finding (same drug, different schedules) → correctly all EXPERIMENTAL
- No false positives or misclassifications found
Tracker spot-checks resolved:
| Pub | Drug | Before | After |
|---|---|---|---|
| 66516 | Zanidatamab | all NULL (drug_id mismatch: 10432 vs 15231) | single_dose=1200 mg, dose_units=mg |
| 114758 | Zanidatamab | all NULL (same mismatch) | single_dose=1200 mg, dose_frequency=on days 1 and 15 |
| 70960 | SHR-A1811 | all NULL (drug_id was NULL) | dose_min=3.2 mg/kg, dose_max=8.0 mg/kg, rp2d=6.4 mg/kg |
Files changed:
- `db/views/vw_publication_efficacy_data_v16.sql` (new)
- `db/migrate/20260323212725_update_vw_publication_efficacy_data_to_version_16.rb` (new)
- `app/queries/tpp/clinical_evidence_query.rb` (safety query arm role logic)
- `app/tasks/publications_llm_classification/details.rb` (arm_type enum expansion)
- `lib/tasks/one_off/backfill_arm_type_from_name.thor` (new — LLM arm type backfill)
Deployment steps:
1. `rake db:migrate` (creates v16 view + materializes)
2. `REFRESH MATERIALIZED VIEW CONCURRENTLY vw_publication_efficacy_data`
3. `thor one_off:backfill_arm_type_from_name:backfill --batched --parallelism 4 --batch-size 2000`
4. `REFRESH MATERIALIZED VIEW CONCURRENTLY vw_publication_efficacy_data` (again after backfill)
21. Phase 1 basket trials report response counts, not ORR percentages
Short summary
Phase 1 dose-escalation and basket trial abstracts often report efficacy as response counts per tumor type (e.g. “1 PR in 9 HNSCC patients”) rather than ORR percentages. The LLM faithfully extracts these as a PR endpoint with `measure_unit = count`, but the query only recognizes ORR with `measure_unit = percentage`. This causes two downstream problems:
- No efficacy shown — the publication surfaces in the report with empty ORR/PFS/OS columns despite having extractable response data
- Inflated patient count — when no recognized efficacy endpoint exists for the disease subgroup, `extract_patient_count` falls back to the largest `number_of_participants` across all rows, which is typically the cross-tumor Overall population (e.g. N=92 instead of N=9)
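The inflation mechanism can be seen in a minimal sketch (a simplified stand-in for the `extract_patient_count` fallback, using illustrative row hashes rather than the real query rows):

```ruby
# Simplified illustration: with no recognized efficacy endpoint in the group,
# the fallback takes the max number_of_participants across all rows — which
# for a basket trial is the cross-tumor Overall population, not the cohort.
rows = [
  { subgroup: "Overall",         endpoint: "SD", n: 92 },
  { subgroup: "Overall → HNSCC", endpoint: "PR", n: 9  }
]
fallback_n = rows.map { |r| r[:n] }.max
# fallback_n is 92, although the HNSCC cohort is only 9 patients
```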
Concrete example
Publication 29759 — Praluzatamab ravtansine (CX-2009) first-in-human phase 1 (NCT03504488), ASCO 2020.
Abstract reports: “92 patients … 5 PRs in breast cancer (n=39), 2 PRs in ovarian (n=22), 1 PR in HNSCC (n=9)”
Extracted data (correct):
| Subgroup | Endpoint | Value | Unit | N |
|---|---|---|---|---|
| Overall | SD | 21 | count | 92 |
| Overall → HNSCC | PR | 1 | count | 9 |
| Overall → Breast Cancer | PR | 5 | count | 39 |
| Overall → Ovarian Cancer | PR | 2 | count | 22 |
Query output for HNSCC (incorrect):
- ORR: empty (no `ORR` endpoint exists)
- Patient count: 92 (fallback to Overall N, should be 9)
- The row appears in the report with no efficacy and a misleading N
Root cause
Two gaps in the query layer:
- `extract_efficacy_metrics` only looks for `PRIMARY_EFFICACY_ABBREVIATIONS` (OS, PFS, ORR, DOR, DFS, DCR). PR and CR counts are not recognized. No logic derives ORR from `PR count / N`.
- `extract_patient_count` takes the max `number_of_participants` across all rows in the group. For basket trials where the Overall subgroup (N=92) and disease subgroup (N=9) coexist in the same group key, the fallback picks N=92.
Phase 1 dose-escalation trials commonly report response counts rather than ORR. Basket trials with disease-specific cohorts are particularly affected since they report per-tumor-type counts. The exact count of affected publications needs characterization, but this pattern is common in early-phase oncology abstracts.
Explored solution direction
Option 1: Derive ORR from PR/CR counts at query time. When no ORR endpoint exists for a subgroup but PR and/or CR counts exist with `number_of_participants > 0`, compute `ORR = (PR + CR) / N * 100`. This is clinically correct and matches how the client sheet manually computes these values.
Option 2: Have the LLM compute ORR during extraction. Add a prompt instruction: when only response counts are reported, also emit a derived ORR endpoint with measure_unit = percentage. Risk: the LLM might hallucinate percentages or miscount.
Option 3: Filter out publications with no recognized efficacy endpoints. If a publication has no ORR/PFS/OS/DoR for the disease subgroup, don’t surface it in the report. This avoids misleading rows but loses legitimate phase 1 data.
Option 1 is most reliable — the data is already correctly extracted, just needs a calculation step in the query.
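The Option 1 calculation is a one-liner; a minimal sketch (method name hypothetical):

```ruby
# Sketch of Option 1: derive ORR (%) from PR/CR counts when no explicit ORR
# percentage was reported. Returns nil when N is missing or zero.
def derived_orr(pr_count:, n:, cr_count: 0)
  return nil if n.nil? || n.zero?
  ((pr_count + cr_count).to_f / n * 100).round(1)
end
```

For the HNSCC example above, `derived_orr(pr_count: 1, n: 9)` yields 11.1 — the value the client sheet would compute by hand.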
Solution applied
Implemented (2026-03-20):
- Going-forward fix in `post_process.rb`: Added `derive_orr_for_subgroup` — after persisting outcome measures for each subgroup, checks whether PR/CR counts exist with N > 0 but no ORR percentage. If so, creates a derived ORR row: `(PR + CR) / N * 100` with `measure_unit = 'percentage'` and `observation = 'Derived from PR + CR counts'`. Skips subgroups tagged `response_status` (response-defined subgroups where derivation is meaningless).
- Backfill task `lib/tasks/one_off/backfill_derived_orr.thor`: Finds all publication subgroups with PR/CR counts but no ORR percentage and creates derived ORR rows. Results: 753 ORR rows created across 512 publications.
- Empty efficacy filter in `ClinicalEvidenceQuery`: Rows with no recognized efficacy endpoints (empty `efficacy` hash) are now filtered out in `build_result_rows`, preventing publications with only safety/DLT data from appearing as empty rows with misleading patient counts.
Prod deployment:
- Run the response_status backfill first: `thor one_off:backfill_response_status_tags:backfill --batched`
- Run the derived ORR backfill: `thor one_off:backfill_derived_orr:backfill`
- Refresh the materialized view
22. extract_subgroups doesn’t identify response counts as endpoints
Short summary
When abstracts report best response as narrative counts (“1 PR and 14 SD out of 29 CRC patients”, “1 PR and 4 SD among 8 esophageal cancer patients”) without computing an explicit ORR percentage, the upstream `extract_subgroups` step only identifies formal endpoints like DCR and TTP. Individual response counts (PR, CR) are not recognized as extractable endpoints. Since `classify_publications` constrains its `endpoint_abbreviation` enum to the abbreviations identified upstream, the LLM cannot create PR/CR endpoint rows even though it sees the data in the abstract.
Where this sits in the current pipeline
- `extract_subgroups` (step 7 in `PublicationsWorkflow`) scans the abstract and identifies subgroups + their associated endpoints → stored in `llm_data['subgroup_endpoints']`
- `classify_publications` (step 9) receives `subgroup_endpoints` as input, builds a JSON schema with `endpoint_abbreviation` constrained to the upstream list, and extracts structured outcome measures
- If PR/CR aren’t in the upstream endpoint list, `classify_publications` can’t output them
Concrete example
Publication 29737 — IMMU-132 (sacituzumab govitecan) phase I/II in GI cancers (NCT01631552), ASCO 2020.
Abstract text: “Of 29 CRC pts… 1 had a PR and 14 had SD as the best response by RECIST, with a time to progression (TTP) of 11.5+ months for the PR… This is a disease control rate (DCR) of 51.7%.”
subgroup_endpoints identified upstream:
- `Time to progression` → 5 subgroups
- `Disease control rate` → 3 subgroups
Missing: Partial Response / PR was not identified as an endpoint despite being explicitly reported per disease cohort.
LLM output: Extracted DCR=51.7% (N=29) and TTP values. The PR count (1/29) was noted in the DCR observation text (“1 PR and 14 SD out of 29 evaluable CRC patients”) but not as a separate endpoint row.
Result: No ORR can be derived (Issue 21’s derivation requires PR/CR rows to exist), and the publication shows DCR but no ORR in the report.
- 759 publications have DCR but no ORR, PR, or CR endpoints
- 287 of those have response counts (PR/CR) mentioned in the DCR observation text — confirming the data was seen by the LLM but not extracted as separate endpoints
- 414 publications have SD counts but no ORR/PR/CR/DCR — similar pattern with stable disease
Root cause
`extract_subgroups` identifies endpoints by looking for formal endpoint patterns in the abstract (named endpoints with abbreviations, table headings, structured results). Narrative best-response descriptions like “1 had a PR and 14 had SD” are not recognized as formal endpoints because:
- They don’t follow the `endpoint = value` pattern
- PR/SD/CR appear as best overall response categories, not as measured endpoints
- The abstract often only computes a summary metric (DCR) from these counts
The classify_publications schema then constrains the LLM to only the identified abbreviations, preventing it from creating PR/CR rows even though it clearly reads the counts (as evidenced by the observation text).
Explored solution direction
Option 1: Expand `extract_subgroups` to detect response count patterns. Add pattern matching for narrative response descriptions: “N had a PR”, “X partial responses”, “CR in Y patients”, etc. When detected, add PR/CR as endpoints alongside DCR/TTP.
Option 2: Allow classify_publications to add endpoints not in the upstream list. Remove or relax the endpoint_abbreviation enum constraint so the LLM can create PR/CR rows when it sees response counts. Risk: the LLM might hallucinate endpoints.
Option 3: Post-processing derivation from DCR observation text. Parse the observation strings like “1 PR and 14 SD out of 29 evaluable CRC patients” to extract PR/CR counts. This is fragile (regex on LLM-generated text) but catches the 287 publications where the data is already captured.
Option 4: Prompt instruction in classify_publications. Add an explicit instruction: “When the abstract reports individual response counts (e.g. ‘1 PR’, ‘2 CR’) per subgroup without an explicit ORR, also extract these as separate PR/CR endpoints with measure_unit=count.” Combined with relaxing the enum constraint for response-type abbreviations.
Option 4 is cleanest — it works within the existing pipeline, the LLM already sees the data, and combined with Issue 21’s derivation logic, the ORR gets computed automatically.
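For contrast with the chosen Option 4, here is what the Option 3 parse would look like — a sketch with illustrative regexes, fragile by design, which is precisely the objection raised above:

```ruby
# Sketch of Option 3: pull response counts out of observation strings like
# "1 PR and 14 SD out of 29 evaluable CRC patients". Regexes are illustrative;
# this brittleness is why the prompt-based Option 4 was preferred.
def parse_response_counts(text)
  counts = {}
  text.scan(/(\d+)\s+(PR|CR|SD)s?\b/i) { |count, label| counts[label.upcase] = count.to_i }
  if (m = text.match(/out of (\d+)/i))
    counts["N"] = m[1].to_i # denominator, when phrased as "out of N"
  end
  counts
end
```

Even this narrow pattern would miss phrasings like “among 8 esophageal cancer patients”, underscoring why regexing LLM-generated observation text only catches a subset of the 287 publications.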
Solution applied
Forward fix (v1): Updated the `task.rb` classify_publications prompt to instruct the LLM to extract PR/CR counts. See Issue 21 for the ORR derivation that consumes these counts.
Forward fix (v2): Updated task.rb classify_publications prompt to also extract PR/CR/ORR percentages from DCR breakdowns (e.g. “DCR was 54% (CR 8%, PR 15%, SD 31%)”). Added dCR (durable CR) and pCR/MPR exclusions to prevent misidentification as standard CR.
Backfill v1 (2026-03-20/21): screen_missing_response_counts:screen (job 1568) screened candidates and flagged pubs with narrative response counts (e.g. “1 PR, 14 SD”). Re-extraction via classify_publications (job 1570) on flagged pubs. Reduced DCR-only population from 759→620 (~139 fixed).
Backfill v1 gap: The v1 screener explicitly excluded percentage-based response rates (“ORR was 35%” → NO), missing a second pattern where abstracts report ORR/PR/CR as percentages — either standalone (“ORR was 33%”, “BOR rate 18.2%”) or embedded in DCR breakdowns (“DCR was 54% (CR 8%, PR 15%, SD 31%)”). Prod analysis (2026-03-24) found ~92 publications with extractable response rate percentages but no response endpoint, of which 73 were never re-processed (pre-fix) and 19 ran with the v1 prompt but were missed.
Backfill v1 screener (historical): screen_missing_response_counts.thor was used to identify candidates for v1 re-extraction. Its prompt only detected integer counts and explicitly excluded percentage-based ORR — this is why the v1 gap exists. The screener is no longer needed for v2 since the targeted backfill scopes structurally via SQL.
Backfill v2 (complete, job 1604, 2026-03-24): Targeted backfill task backfill_missing_response_endpoints.thor — sent a focused LLM prompt (o4-mini) per publication extracting ORR/PR/CR values anchored to existing subgroups. Created trial_endpoint + trial_outcome_measure + trial_arm_outcome records directly without re-running the full classify pipeline. ORR derived inline from PR% + CR% when LLM didn’t return explicit ORR. Guards: skips zero values, excludes dCR/pCR/MPR, skips response-status subgroups, idempotent (skips existing records).
Results: 97 new records created (41 PR counts, 22 ORR percentages, 14 PR percentages, 12 CR counts, 8 CR percentages). DCR-only population reduced from 550 → 498 (~52 pubs fixed). Combined with v1 backfill: 759 → 498 total (~261 pubs fixed, ~34% reduction).
Verified: 10 random remaining DCR-only pubs manually checked against full abstract text — all 10 genuinely report only DCR with no PR/CR/ORR breakdown (phase I safety studies, PK/biomarker analyses, maintenance trials with DCR as primary endpoint, composite response rates ≠ ORR). The remaining 498 are clean.
23. Dose extraction misses implicit RP2D in phase I/II trials
Short summary
The dose extraction LLM classifies “dose levels of 8 and 10 mg/kg were chosen for phase II” as a range (`dose_min`/`dose_max`) rather than RP2D. In phase I/II trials, doses selected for phase II expansion ARE the recommended phase 2 dose by definition — this is the entire purpose of the phase I dose escalation.
Concrete example
Publication 29737 — “Phase I/II trial of IMMU-132 (sacituzumab govitecan)”
Abstract states: “starting at a dose of 8 mg/kg given on days 1 and 8 of a 3-week cycle. Dose levels of 8 and 10 mg/kg were chosen for phase II”
Current extraction:
```json
{ "dose_min": "8 mg/kg", "dose_max": "10 mg/kg", "rp2d": null, "dose_context_type": "range" }
```
Expected: `rp2d` should capture that 8 and 10 mg/kg are the RP2D levels. The phrase “chosen for phase II” in a phase I/II trial is semantically equivalent to “recommended phase 2 dose.”
Explored solution direction
Update the dose extraction prompt to recognize implicit RP2D language in phase I/II trials:
- “doses chosen/selected for phase II”
- “phase II dose levels”
- “expansion cohort dose”
- “dose carried forward to phase II”
The challenge is that RP2D is currently a single-value field. When two dose levels are selected (8 and 10 mg/kg), storing both requires either a comma-separated value or keeping `dose_min`/`dose_max` AND setting `rp2d`.
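The implicit-RP2D phrases listed above can be encoded as a small detector. This is a sketch only — the actual fix lives in the extraction prompt, not in code, and the patterns are illustrative:

```ruby
# Sketch: detect implicit-RP2D language in phase I/II abstract text.
# Patterns mirror the phrase list above; purely illustrative.
IMPLICIT_RP2D_PATTERNS = [
  /(chosen|selected)\s+for\s+phase\s*(ii|2)\b/i,
  /phase\s*(ii|2)\s+dose\s+level/i,
  /expansion\s+cohort\s+dose/i,
  /carried\s+forward\s+to\s+phase\s*(ii|2)\b/i
].freeze

def implicit_rp2d?(abstract_text)
  IMPLICIT_RP2D_PATTERNS.any? { |re| abstract_text.match?(re) }
end
```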
Note: this publication also has a secondary issue — `publication_interventions.drug_id` is NULL, so the dose evidence can’t join to the view via `pub_dose_lookup` even if the extraction were correct.
~4,617 interventions across ~4,100 publications have a `dose_context_type` of range, escalation, or rp2d (i.e. typed as RP2D but with the value missing) and no `rp2d` value. LLM verification on a sample of 17 publications found implicit RP2D in ~20% of candidates (MTD declarations, phase II dose selections, expansion cohort doses).
Solution applied
Forward fix: Updated the `dose_evidence_extraction.rb` system prompt to recognize implicit RP2D language: MTD declarations, “chosen/selected for phase II”, expansion cohort doses, “recommended for further study”.
Backfill: `lib/tasks/one_off/backfill_implicit_rp2d.thor` — sends the abstract + current dose evidence for ~4,100 publications to GPT-5-mini. The LLM determines whether an implicit RP2D exists and extracts the value. Only updates the `rp2d` and `dose_context_type` fields — does not overwrite existing dose_min/dose_max/units/frequency. Corrections tagged with `rp2d_source: 'implicit_backfill'` for audit. Estimated ~800 RP2Ds to be found. Cost: ~$2.
`thor one_off:backfill_implicit_rp2d:backfill --batched --parallelism 4`
24. Subgroup participant count wrong for biomarker sub-cohorts
Short summary
When abstracts report results for a biomarker-defined sub-cohort within a disease subgroup, the LLM sometimes confuses the count of patients with a specific outcome with the total sub-cohort size.
Concrete example
Publication 29737, KRAS-mutated CRC subgroup:
Abstract states: “Thirteen CRC pts had KRAS mutations, 7 with SD (median TTP = 4.4+ mo)”
Current extraction: subgroup "Advanced GI cancers → Colorectal cancer → KRAS-mutated" with TTP endpoint, n=7
Expected: n=13 (the KRAS-mutated cohort size), with 7 being the count of patients with SD (stable disease).
The LLM set number_of_participants=7 (the SD count) instead of 13 (the KRAS cohort size). This is a pattern likely to recur wherever abstracts report “N patients had X, Y with outcome Z.”
~112 highly suspicious subgroups identified via a heuristic (response count = N, and N < 30% of the publication max N). The true scope is likely larger but hard to detect structurally — confirmed by LLM verification on a sample of 12 publications, which found 11 corrections across 104 verified arms (~10.6% correction rate).
Affected patterns:
- Basket trials: per-tumor-type enrollment vs SD/PR counts (pub 53427: CRC N=6 should be 14, PDAC N=6 should be 25, etc.)
- Biomarker sub-cohorts: mutation cohort size vs outcome count (pub 29737: KRAS N=7 should be 13)
- Response cohorts: assessable patients vs responder count (pub 3674: Cohort 2 N=13 should be 17 — 13 was cCR count, 17 was assessable)
- Disease sub-cohorts in phase I: per-histology enrollment vs outcome count (pub 5024: DIPG N=7 should be 9, sDMG N=7 should be 2)
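The detection heuristic used to find the ~112 suspicious subgroups can be sketched as follows (method and parameter names are hypothetical):

```ruby
# Sketch of the suspicious-N heuristic: a subgroup is flagged when its
# reported N equals a response count AND is under 30% of the publication's
# largest N — suggesting the outcome count was stored as the cohort size.
def suspicious_n?(subgroup_n:, response_count:, publication_max_n:)
  subgroup_n == response_count && subgroup_n < 0.3 * publication_max_n
end
```

For the KRAS example, a subgroup with N=7 and an SD count of 7 in a publication whose largest cohort is far bigger trips the flag, while the corrected N=13 would not.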
Solution applied
Forward fix: Updated the `classify_publications` prompt in `task.rb` with an explicit anti-example: “CRITICAL: Set number_of_participants to the TOTAL evaluable patients in that subgroup/cohort — NOT the count of patients with a specific outcome.”
Backfill: `lib/tasks/one_off/backfill_subgroup_participant_counts.thor` — sends the abstract + all arm outcomes for ~1,240 publications with PR/CR/SD count endpoints to GPT-5-mini for verification. The LLM compares the current N against the abstract and corrects it where wrong. All corrections are logged in `trial_subgroups.llm_data['n_corrections']` for audit/revert. Estimated cost: ~$1.50.
`thor one_off:backfill_subgroup_participant_counts:backfill --batched --parallelism 4`
25. Confirmed vs unconfirmed ORR confusion in classify_publications
Short summary
When abstracts report both confirmed and unconfirmed ORR (a common pattern in ADC oncology trials), classify_publications either (a) extracts the unconfirmed ORR value and incorrectly marks it `confirmed: true`, or (b) extracts only the unconfirmed ORR and omits the confirmed value entirely. This produces wrong cORR values in the report and missing cORR endpoints.
Where this sits in the current pipeline
classify_publications (`app/tasks/publications_llm_classification/task.rb`) — the LLM extraction step that produces `subgroup_outcome_measures`. The `confirmed` boolean on ORR endpoints was added by Issue 16, but the extraction prompt doesn’t instruct the LLM on how to handle abstracts that report both confirmed and unconfirmed ORR.
Exact restriction causing the drop
The extraction schema allows a single ORR record per subgroup arm with a `confirmed` boolean. When an abstract reports “unconfirmed ORR was X% (confirmed: Y%)”, the LLM extracts one ORR record with `measure_value=X` (the unconfirmed value) and sets `confirmed: true` because the word “confirmed” appears in the abstract context. The actual confirmed value (Y%) is only captured in the free-text `observation` field.
The prompt does not instruct the LLM to:
- Create TWO separate ORR records when both confirmed and unconfirmed values are reported
- Distinguish which numeric value corresponds to confirmed vs unconfirmed status
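Why the mis-set flag matters downstream can be shown with a minimal sketch — a simplified stand-in for the report's cORR selection (which filters to `confirmed=true`), not the actual `ClinicalEvidenceQuery` code:

```ruby
# Hypothetical cORR selector: pick the measure_value of the confirmed ORR
# record, mirroring the report's confirmed=true filter. Illustrative only.
def corr_value(orr_records)
  rec = orr_records.find { |r| r[:confirmed] == true }
  rec && rec[:measure_value]
end

# Single mis-flagged record: the unconfirmed value surfaces as cORR.
wrong   = [{ measure_value: 24.1, confirmed: true }]
# Correct two-record split: the true confirmed value surfaces.
desired = [{ measure_value: 24.1, confirmed: false },
           { measure_value: 13.8, confirmed: true }]
# corr_value(wrong)   → 24.1 (overstated)
# corr_value(desired) → 13.8
```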
Concrete examples
Publication 192026 (Precemtabart tocentecan, PROCEADE-CRC-01 dose optimization):
Abstract states: “The unconfirmed objective response rate (ORR) at 2.8 mg/kg was 24.1% (95% CI: 10.3, 43.5) (confirmed: 13.8% [95% CI: 3.9, 31.7]).”
Extracted: ORR confirmed=true, measure_value=24.1
- observation: “Unconfirmed ORR; confirmed ORR was 13.8%”
Expected: Two records:
- `ORR confirmed=false, measure_value=24.1` (unconfirmed)
- `ORR confirmed=true, measure_value=13.8` (confirmed)
Same pattern in pubs 237309, 49900 (same drug, different data cuts). Also confirmed in pub 190845 (missing cORR entirely) and pub 116824 (missing cORR for dose subgroups).
Downstream impact
- Wrong cORR values: The Clinical Evidence report shows unconfirmed ORR in the cORR column. For Precemtabart at 2.8 mg/kg, the report shows cORR=24.1% when it should be 13.8% — a 75% overstatement.
- Missing cORR endpoints: Some publications have no confirmed ORR extracted at all, leaving the cORR column blank when the abstract does report it.
- Audit failures: 24 of 145 open audit issues in the CRC ADC scope (disease 4345, technology 708) are caused by this pattern: 18 `incorrect_value` issues on `efficacy.corr.value` and 6 `missing_endpoint` issues on `efficacy.corr.value`.
Directly confirmed in 5 publications (192026, 237309, 49900, 190845, 116824) across the CRC ADC audit scope. Likely affects any ADC trial publication reporting both confirmed and unconfirmed ORR — estimated dozens across the full corpus.
```sql
-- Publications with confirmed=true ORR that may have unconfirmed values stored as confirmed
SELECT DISTINCT ts.source_id AS publication_id
FROM trial_subgroups ts
JOIN trial_outcome_measures tom ON tom.trial_subgroup_id = ts.id
JOIN trial_arm_outcomes tao ON tao.trial_outcome_measure_id = tom.id
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
WHERE ts.source_type = 'Publication'
  AND te.abbreviation = 'ORR'
  AND tom.confirmed = true
  AND tom.observation ILIKE '%unconfirmed%'
```
Explored solution direction
Forward fix: Update the classify_publications prompt in task.rb to explicitly handle confirmed/unconfirmed ORR:
“When an abstract reports both confirmed and unconfirmed ORR for the same subgroup/arm, create TWO separate ORR records: one with `confirmed: false` and the unconfirmed value, and one with `confirmed: true` and the confirmed value. The unconfirmed ORR is typically the larger number. Do NOT set `confirmed: true` on the unconfirmed ORR value.”
Backfill: Re-extract affected publications with the updated prompt. Scope can be identified by querying for ORR records where confirmed=true and the observation mentions “unconfirmed”. Estimated cost: minimal (small number of publications).
Solution applied
Forward fix (2026-03-24): Updated the classify_publications prompt in `app/tasks/publications_llm_classification/task.rb` to:
- Instruct the LLM to create TWO separate ORR endpoint records when an abstract reports both confirmed and unconfirmed values
- Not confuse different RECIST assessment criteria (RECIST 1.1 vs mRECIST) with confirmation status — use RECIST 1.1 as the primary `measure_value`, note other criteria in `observation`
Targeted backfill v1 (2026-03-24): Created `lib/tasks/one_off/backfill_confirmed_unconfirmed_orr.thor`. The initial run (job 1603) fixed the most obvious cases but missed ~398 records due to overly conservative guardrails and narrow scope.
Backfill v2 (2026-03-24): Expanded the task to address gaps found in v1:
Problems found in v1:
- 131 `confirmed=true` ORR records with “unconfirmed” in the observation but no `confirmed=false` pair created — the guardrail required the LLM to return a full pair, skipping cases where it only returned one side
- 23 `confirmed=true` ORR records where the abstract never mentions response confirmation — the LLM hallucinated the flag
- 244 `confirmed=false` ORR records missing their `confirmed=true` sibling
- ~50 publications with PR/CR `confirmed=null` where the abstract says “confirmed PR” — the wrong flag propagates to derived ORR via post_process, making it invisible to the cORR metric in clinical_evidence_query
Changes in v2:
- Scope widened: Now covers 2,530 pubs — incomplete ORR pairs (1,596) + derived ORR pubs with PR/CR that may need confirmed flags (~934)
- PR/CR coverage: LLM now evaluates confirmed flags on PR, CR, and ORR in one pass (was ORR-only)
- Guardrail relaxed: Acts when LLM returns any non-null confirmed entry (was: required both true+false pair)
- Null upgrade: Can upgrade `confirmed=null` records to true/false instead of only creating new records
- Derived ORR fix: After correcting PR/CR flags, surgically updates the derived ORR `confirmed` to match the source PR/CR — no post_process re-run needed
- Prompt improved: Instructs the LLM to derive both confirmed and unconfirmed ORR from response counts (e.g. “6 confirmed PRs and 2 unconfirmed PRs among 40 patients”)
Forward fix for derived ORR (2026-03-24): Updated `post_process.rb` `derive_orr_for_subgroup` to propagate the `confirmed` flag from the source PR/CR records. If all PR/CR are `confirmed=true`, the derived ORR gets `confirmed=true`. If mixed, derives both a confirmed and an unconfirmed ORR.
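The propagation rule can be sketched as follows (method name hypothetical; the real logic lives in `derive_orr_for_subgroup`):

```ruby
# Sketch: which confirmed flag(s) should the derived ORR row(s) carry,
# given the confirmed flags on the source PR/CR records? Illustrative only.
def derived_orr_confirmed_flags(source_flags)
  flags = source_flags.compact.uniq
  return [nil]  if flags.empty?      # no confirmation info on any source record
  return flags  if flags.length == 1 # all confirmed, or all unconfirmed
  [true, false]                      # mixed → derive both a confirmed and an unconfirmed ORR
end
```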
Commands:
```shell
# Preview scope
bundle exec thor one_off:backfill_confirmed_unconfirmed_orr:identify

# Run (use --batched for large runs)
bundle exec thor one_off:backfill_confirmed_unconfirmed_orr:backfill --batched
```
Validation (2026-03-24):
- v1 tested on 20 random publications across two rounds — correctly handled split ORR, single confirmed, RECIST criteria, ambiguous cases
- v2 tested on 36 publications (30 dry run + 6 real run). Verified against abstracts:
  - Pub 1246: “unconfirmed partial response” → PR `confirmed=false`, derived ORR `confirmed=false` ✓
  - Pub 31619: mixed confirmed/unconfirmed PRs across disease subgroups — each subgroup’s derived ORR correctly matched its PR flag ✓
  - Pubs 1527, 5024, 7313, 7499: no confirmation language → all flags left as `confirmed=null` ✓
  - Zero spurious changes on pubs without confirmation language
Backfill v2 production run (2026-03-24, job 1608): 2,530 publications processed.
Results verified in prod:
- `confirmed=false` ORR records: 575 → 744 (+169 new unconfirmed pairs created)
- `confirmed=true` ORR records: 2,240 → 2,583 (+343 flags upgraded or new records)
- Derived ORR with confirmed flag: 94 `true` + 28 `false` (was all `null`)
- Known broken pubs verified correct: 47342 (27.6%/34.5%), 65504 (13.0%/17.4%), 74897 (46.7%/60.0%), 234678 (15.0%/20.0%)
- Remaining 55 `confirmed=true` ORR (45 pubs) with “unconfirmed” in the observation but no pair — root cause: the backfill was sending the existing `confirmed` flag to the LLM, which anchored on it and echoed it back instead of making a fresh determination from the abstract.
Backfill v3 fix (2026-03-24): Two changes to the prompt/input:
- Removed the `confirmed` field from existing records sent to the LLM — forces a fresh determination from abstract text only
Scope: ~2,000 pubs still in scope (1,553 incomplete pairs + 691 derived ORR with null confirmed). The pubs fixed by v2 are excluded (they now have complete pairs).
Command: `bundle exec thor one_off:backfill_confirmed_unconfirmed_orr:backfill --batched`
Backfill v3 production run (2026-03-24, job 1612): ~2,000 publications processed.
Results: 55 → 26 remaining records with “unconfirmed” in observation but no pair. The 26 remaining break down as:
- ~20 combined rates where the abstract reports “confirmed and unconfirmed responses” as a single number — `confirmed=true` is wrong (should be `null`) but can’t be split into two rows. Not a data loss since the value itself is correct.
- 2 genuine LLM misses (116973, 236929) — abstract has the data but LLM didn’t split
2026-03-26 audit findings — Issue reopened
A Clinical Evidence audit (`publications:audit_clinical_evidence`) on HNSCC publications identified 7 open cORR-related audit issues across 5 publications, demonstrating that the extraction fix is insufficient. Three categories of residual failure:
Category 1: LLM counts all responses as confirmed (post-fix)
Publication 30362 (Petosemtamab, updated_at: 2026-03-23 — processed AFTER the v3 backfill):
- Abstract: “1 confirmed complete response, 2 confirmed and 3 unconfirmed partial responses” among 10 evaluable patients
- Expected: cORR = 30% (3/10 confirmed), ORR = 60% (6/10 total)
- Extracted: ORR = 60.0% with `confirmed: true` — LLM counted ALL responses as confirmed
- Note: v3 backfill categorized this as “truncated abstract” but the abstract is NOT truncated — full response breakdown is present. The backfill LLM erroneously classified it as truncated.
Category 2: “cORR” terminology not recognized as confirmed flag
Publication 29660 (Tisotumab vedotin):
- Abstract explicitly uses “confirmed objective response rate (cORR)” as primary endpoint throughout
- Values: cORR = 32.5% (full cohort), cORR = 40.0% (≤2 prior lines)
- Extracted: ORR with `confirmed: null` for both subgroups — correct values but missing confirmed flag
- Impact: cORR column is empty in the report despite values being correctly extracted
Category 3: Total ORR mislabeled as confirmed
Publication 65575 (Ozuriftamab vedotin):
- Abstract: “ORR was 32% including confirmed and unconfirmed responses”
- Extracted: ORR = 32.0% with `confirmed: true` — the total ORR (including unconfirmed) is marked as confirmed
- Only a `confirmed: true` record exists; no `confirmed: false` pair
Additional confirmed cases with correct extraction but wrong audit flags (false positives from audit LLM):
- Pubs 65346, 151763, 237727: Both confirmed and unconfirmed ORR rows exist with correct values and flags. The `ClinicalEvidenceQuery` cORR extraction at lines 658–675 correctly filters `confirmed=true`. These audit findings appear to be audit LLM errors (confusing which row is ORR vs cORR).
Remaining scope estimate:
```sql
-- Publications with only confirmed=true ORR (no unconfirmed counterpart)
-- that might have wrong confirmed attribution
SELECT count(DISTINCT ts.source_id)
FROM trial_subgroups ts
JOIN trial_outcome_measures tom ON tom.trial_subgroup_id = ts.id
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
WHERE ts.source_type = 'Publication'
  AND te.abbreviation = 'ORR'
  AND tom.confirmed = true
  AND NOT EXISTS (
    SELECT 1
    FROM trial_outcome_measures tom2
    JOIN trial_endpoints te2 ON te2.id = tom2.trial_endpoint_id
    WHERE tom2.trial_subgroup_id = ts.id
      AND te2.abbreviation = 'ORR'
      AND tom2.confirmed = false
  );
-- Returns 1,178 publications — subset may have wrong attribution
```

Forward fix needed: The classify_publications prompt needs stronger instructions for three specific failure modes:
- When abstract lists individual confirmed + unconfirmed responses by count (e.g., “2 confirmed PR, 3 unconfirmed PR”), derive both cORR and ORR from counts — don’t sum them into one value
- When abstract uses “cORR” or “confirmed ORR” terminology, set `confirmed: true` on the endpoint even if no separate unconfirmed value is stated
- When abstract says “ORR X% (including confirmed and unconfirmed)”, set `confirmed: false` or `confirmed: null` — not `confirmed: true`
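The first failure mode — deriving two records from mixed response counts — is plain arithmetic over the counts. A minimal sketch (hypothetical helper for illustration, not part of the pipeline):

```ruby
# Hypothetical helper illustrating the cORR/ORR split the prompt asks for:
# derive TWO rates from per-category response counts instead of one summed value.
def split_orr(confirmed_responses:, unconfirmed_responses:, evaluable:)
  corr = (confirmed_responses.to_f / evaluable * 100).round(1)
  orr  = ((confirmed_responses + unconfirmed_responses).to_f / evaluable * 100).round(1)
  { corr: corr, orr: orr }
end

# Pub 30362 pattern: "1 confirmed CR, 2 confirmed PR, 3 unconfirmed PR" among 10 evaluable
split_orr(confirmed_responses: 3, unconfirmed_responses: 3, evaluable: 10)
# → { corr: 30.0, orr: 60.0 }
```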
Related: See also Issue 27 — even when extraction is correct, extract_efficacy_metrics in ClinicalEvidenceQuery can pick the confirmed ORR value for the plain ORR metric
Forward fix v4 (2026-03-26): Added two additional prompt instructions to app/tasks/publications_llm_classification/task.rb:
- Explicit example for deriving TWO ORR records from mixed confirmed/unconfirmed response counts (e.g., “1 confirmed CR, 2 confirmed PR, and 3 unconfirmed PR among 10 patients” → cORR=30%, ORR=60%). Addresses the pattern where the LLM sums all responses and marks as confirmed.
- Instruction that when the primary endpoint is described as “cORR” or “confirmed ORR”, the value IS confirmed and `confirmed: true` must be set — do not leave as null.
Backfill scope (2026-03-26): Structural scope (no text matching) — all ~25.5K publications with ORR that don’t already have both a confirmed=true AND confirmed=false ORR record. The LLM determines from the abstract whether confirmation language exists; apply_result is a no-op for pubs where the LLM returns confirmed=null.
Estimated affected (will actually change): ~1,000-1,500 publications based on text analysis showing ~985 with “confirmed ORR”/“cORR” language + null flag, ~77 with wrong confirmed=true, ~21 v3 remnants.
```sql
-- V4 structural scope: all ORR pubs without complete confirmed pair
SELECT DISTINCT ts.source_id
FROM trial_subgroups ts
JOIN trial_outcome_measures tom ON tom.trial_subgroup_id = ts.id
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
WHERE ts.source_type = 'Publication'
  AND te.abbreviation = 'ORR'
  AND NOT EXISTS (
    SELECT 1
    FROM trial_outcome_measures t1
    JOIN trial_endpoints e1 ON e1.id = t1.trial_endpoint_id
    JOIN trial_outcome_measures t2 ON t2.trial_subgroup_id = t1.trial_subgroup_id
    JOIN trial_endpoints e2 ON e2.id = t2.trial_endpoint_id
    WHERE t1.trial_subgroup_id = tom.trial_subgroup_id
      AND e1.abbreviation = 'ORR'
      AND t1.confirmed = true
      AND e2.abbreviation = 'ORR'
      AND t2.confirmed = false
  );
```

Cost: ~$6 using gpt-4o-mini in batch mode (simple classification, no reasoning model needed).
Backfill v4 production run (2026-03-26, job 1626): 25,594 publications processed (full structural scope, gpt-5-mini batch).
Results:
- `confirmed=true` records: 2,685 → 5,948 (+3,263 new confirmed flags)
- `confirmed=false` records: 744 → 1,424 (+680 new unconfirmed records)
- Complete confirmed/unconfirmed pairs: 477 → 783 pubs (+306)
- Pubs with any confirmed flag: 2,240 → 3,436 (+1,196)
Spot-checked 12 random publications against full abstracts — 11 correct, 1 pre-existing extraction error:
Round 1:
- Pub 29807 (AZD9291): abstract says “confirmed+unconfirmed ORR 51%” → correctly split to cORR=33.9%, ORR=51%
- Pub 56237 (IMO+ipi): abstract says “6 PR (3 confirmed)” among 15 → correctly split to cORR=20%, ORR=46.7%
- Pub 76478 (Pralsetinib): abstract says “all confirmed” for naïve subgroup → correct 73.7% confirmed; overall has small gap (63.3% vs 64.6%)
- Pub 65763 (Belrestotug): complete pairs for all 4 arms, confirmed < unconfirmed in each (expected)
- Pub 59860 (Pazopanib GCT): pre-existing extraction error — abstract reports marker response (4/5 AFP/HCG decrease), not RECIST ORR. The 80% “ORR” is a marker response rate, not a true ORR. Backfill correctly split confirmed/unconfirmed given the existing data, but the underlying extraction is wrong. Not a backfill bug.
Round 2 (full abstract read → compare):
- Pub 58824 (Fruquintinib+S-1): 1 confirmed PR at 4mg, 2 unconfirmed at 5mg → cORR=16.67% (1/6), ORR=50% (3/6) ✓
- Pub 62418 (Zongertinib GI): abstract explicitly states “confirmed ORR 17.2%” and “regardless of confirmation 20.7%” → exact match ✓
- Pub 70313 (D-1553 KRAS G12C): “1 confirmed CR, 3 confirmed PR, 1 unconfirmed PR” → cORR=40% (4/10), ORR=50% (5/10) ✓
- Pub 234635 (Ficerafusp SCAC): “6 of 7 responses confirmed” → cORR=27.3% (6/22), ORR=31.8% (7/22) ✓
- Pub 238559 (BC3195 ADC): “4 confirmed PR (cPR)” out of 31 at 2.4mg, 5 total PR → cORR=12.9%, ORR=16.1% ✓
Zero spurious changes on pubs without confirmation language (24K+ no-ops)
26. Parent population N propagated to child subgroups
Section titled “26. Parent population N propagated to child subgroups”Short summary
Section titled “Short summary”When classify_publications extracts data for hierarchical subgroups (e.g., “Phase 1b dose expansion → SCCHN”), the LLM copies the parent subgroup’s number_of_participants to all child subgroups instead of extracting the subset-specific N. This produces incorrect patient counts for ~5,058 child subgroups across 1,174 publications.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”classify_publications — app/tasks/publications_llm_classification/task.rb
The prompt currently instructs (line 120–123):
“CRITICAL: Set number_of_participants to the TOTAL evaluable patients in that subgroup/cohort — NOT the count of patients with a specific outcome.”
This instruction was added for Issue 24 (confusing outcome counts with cohort size), but it has a side effect: the LLM interprets “total evaluable patients in that subgroup” as the parent population total when it doesn’t know the child-specific N.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”No prompt instruction distinguishes between:
- The parent population N (e.g., 39 patients in Phase 1b)
- The child subgroup N (e.g., the SCCHN subset of those 39)
The LLM defaults to the known parent N rather than outputting null when the child N isn’t explicitly stated.
Concrete examples
Section titled “Concrete examples”Publication 134450 — MRG003 Phase 1 (SCCHN/NPC/CRC basket):
- “Phase 1b dose expansion” → N=39 (correct, total)
- “Phase 1b dose expansion → SCCHN” → N=39 (WRONG — SCCHN is a subset)
- “Phase 1b dose expansion → NPC” → N=39 (WRONG — NPC is a subset)
- “Phase 1b dose expansion → CRC” → N=39 (WRONG — CRC is a subset)
All three disease children show the parent’s N instead of the actual per-disease cohort size.
Publication 5799 — Neoadjuvant hormonal therapy in prostate cancer:
- “Overall” → N=62 (correct)
- “Overall → Baseline tumor burden: Low” → N=62 (WRONG — subset)
- “Overall → Baseline tumor burden: High” → N=62 (WRONG — subset)
- “Overall → PTEN/ERG immunostatus: Altered” → N=62 (WRONG — subset)
- “Overall → PTEN/ERG immunostatus: Wild-type” → N=62 (WRONG — subset)
Downstream impact
Section titled “Downstream impact”- Clinical Evidence report shows inflated patient counts for sub-cohort rows
- ORR percentages combined with wrong N produce misleading responder counts (e.g., 40% ORR with N=39 implies 15.6 responders, but the actual SCCHN cohort may only have 10 patients)
- Undermines per-cohort comparisons in basket trial reporting
~5,058 child subgroup-endpoint rows across 1,174 publications where 2+ siblings all share the parent’s N.
```sql
-- Identify affected parent-child groups
WITH parent_child AS (
  SELECT DISTINCT
    ts_parent.source_id AS pub_id,
    ts_parent.subgroup_value AS parent,
    ts_child.subgroup_value AS child,
    tao_child.number_of_participants AS child_n,
    tao_parent.number_of_participants AS parent_n,
    te_child.abbreviation AS endpoint
  FROM trial_subgroups ts_child
  JOIN trial_subgroups ts_parent
    ON ts_parent.source_id = ts_child.source_id
    AND ts_parent.source_type = ts_child.source_type
    AND ts_child.subgroup_value LIKE ts_parent.subgroup_value || ' → %'
    AND ts_child.subgroup_value NOT LIKE ts_parent.subgroup_value || ' → % → %'
  JOIN trial_outcome_measures tom_child ON tom_child.trial_subgroup_id = ts_child.id
  JOIN trial_arm_outcomes tao_child ON tao_child.trial_outcome_measure_id = tom_child.id
  JOIN trial_outcome_measures tom_parent ON tom_parent.trial_subgroup_id = ts_parent.id
  JOIN trial_endpoints te_child ON tom_child.trial_endpoint_id = te_child.id
  JOIN trial_endpoints te_parent ON tom_parent.trial_endpoint_id = te_parent.id
  JOIN trial_arm_outcomes tao_parent ON tao_parent.trial_outcome_measure_id = tom_parent.id
  WHERE ts_child.source_type = 'Publication'
    AND te_child.abbreviation = te_parent.abbreviation
    AND tao_child.number_of_participants = tao_parent.number_of_participants
    AND tao_child.number_of_participants > 0
)
SELECT pub_id, parent, COUNT(DISTINCT child) AS num_siblings
FROM parent_child
GROUP BY pub_id, parent
HAVING COUNT(DISTINCT child) >= 2;
-- Returns 1,776 parent groups across 1,174 publications
```

Explored solution direction
Section titled “Explored solution direction”Forward fix: Add a prompt instruction to classify_publications in task.rb:
“When extracting `number_of_participants` for a child subgroup (e.g., ‘Overall → NSCLC’, ‘Phase 1b → SCCHN’), use the N specific to that sub-cohort, NOT the parent population’s total. If the abstract does not explicitly state how many patients are in the child sub-cohort, set `number_of_participants` to null rather than copying the parent’s N. For example, if ‘Phase 1b’ enrolled 39 patients across SCCHN, NPC, and CRC, do NOT set N=39 for each disease — set N to null unless the abstract specifies the per-disease count.”
Backfill: Re-extract the ~1,174 affected publications with the updated prompt. Alternatively, a cheaper post-processing cleanup could null out child N values that match the parent N when 2+ siblings exist — but this may also catch legitimate cases (e.g., crossover designs where all patients go through each arm), so prompt fix + re-extraction is safer.
Related issues: Issue 24 (subgroup participant count wrong for biomarker sub-cohorts) is a specific instance of this broader pattern.
Solution applied
Section titled “Solution applied”Three-part fix:
- Forward prompt fix (`app/tasks/publications_llm_classification/task.rb`): Added instruction telling the LLM to use null for child subgroup `number_of_participants` when the abstract doesn’t explicitly state the per-subset count, rather than copying the parent’s N. Includes concrete right/wrong examples.
- Post-processing guard (`app/tasks/publications_llm_classification/post_process.rb`): Added `null_out_propagated_parent_n` method that runs after `process_outcome_measures`. Detects parent-child pairs where 2+ siblings share the parent’s N for the same endpoint and nulls out those child N values. Acts as a permanent safety net regardless of LLM behavior.
- One-off backfill (`lib/tasks/one_off/null_propagated_parent_n.thor`): SQL-based fix for existing affected records. Identifies child `trial_arm_outcomes` where 2+ siblings share the parent’s N and sets `number_of_participants` to NULL. No LLM re-runs needed — the correct answer is NULL since these abstracts don’t state the per-subset N.
  - Run `thor one_off:null_propagated_parent_n:identify` to preview scope
  - Run `thor one_off:null_propagated_parent_n:backfill --no-dry-run` to apply
Scope note: Only the 2+ siblings case is addressed. Single-child cases are ambiguous — the child could legitimately be the full parent population — and are left untouched.
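The 2+ siblings heuristic can be sketched in plain Ruby (the row shape and helper name are assumptions for illustration, not the actual post_process.rb code):

```ruby
# Flag child subgroups whose N equals the parent's N when 2+ siblings match —
# the signature of propagated parent N. Rows: { subgroup: String, n: Integer or nil }.
# Single-child matches are left alone, mirroring the scope note above.
def propagated_child_ns(rows)
  suspects = []
  rows.each do |parent|
    # Direct children only: one more " → " segment than the parent
    children = rows.select do |r|
      r[:subgroup].start_with?("#{parent[:subgroup]} → ") &&
        !r[:subgroup].delete_prefix("#{parent[:subgroup]} → ").include?(' → ')
    end
    matches = children.select { |c| c[:n] && c[:n] == parent[:n] }
    suspects.concat(matches.map { |c| c[:subgroup] }) if matches.size >= 2
  end
  suspects.uniq
end

# Pub 134450 pattern: parent N=39 copied onto all three disease children
rows = [
  { subgroup: 'Phase 1b dose expansion',         n: 39 },
  { subgroup: 'Phase 1b dose expansion → SCCHN', n: 39 },
  { subgroup: 'Phase 1b dose expansion → NPC',   n: 39 },
  { subgroup: 'Phase 1b dose expansion → CRC',   n: 39 }
]
propagated_child_ns(rows) # flags all three disease children
```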
Backfill v1 (prod): Ran successfully, nulled out all same-endpoint matches (0 remaining for same-endpoint check).
Backfill v2 fix (2026-03-24): v1 only matched child N against parent N on the same endpoint (e.g., child ORR N vs parent ORR N). But N propagation happens at the subgroup level — a child DCR can have the parent’s N even if the parent only has ORR. Found 2,495 pubs / 6,820 children still affected. Fixed both the one-off task and the post-process guard to match child N against ANY parent N across all endpoints.
Command: `bundle exec thor one_off:null_propagated_parent_n:backfill --no-dry-run`
27. extract_efficacy_metrics picks confirmed ORR as plain ORR
Section titled “27. extract_efficacy_metrics picks confirmed ORR as plain ORR”Short summary
Section titled “Short summary”When both confirmed (confirmed=true) and unconfirmed (confirmed=false) ORR rows exist for the same subgroup in the view, ClinicalEvidenceQuery#extract_efficacy_metrics can pick the confirmed row as the plain ORR metric value. This happens because the ORR extraction loop does not exclude confirmed rows, and when both rows have the same number_of_participants, max_by returns whichever comes first — often the confirmed row.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”ClinicalEvidenceQuery#extract_efficacy_metrics — app/queries/tpp/clinical_evidence_query.rb, lines 590–628.
The cORR extraction (lines 658–675) correctly filters confirmed == true and is unaffected. The problem is exclusively in the general efficacy extraction loop that handles ORR alongside OS, PFS, DOR, etc.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”Lines 600–611:
```ruby
PRIMARY_EFFICACY_ABBREVIATIONS.each do |abbr|
  matching = grouped[abbr] || grouped[abbr.downcase]
  next if matching.nil? || matching.empty?

  matching = filter_by_valid_unit(matching, abbr)
  next if matching.empty?

  experimental = matching.select { |r| r['resolved_group_type'] == 'EXPERIMENTAL' }
  experimental = matching if experimental.empty?

  best_row = experimental.max_by { |r| r['number_of_participants'].to_i } || matching.first
```

When `abbr == 'ORR'`, `matching` includes ALL ORR rows regardless of confirmed flag. If both `confirmed=true` (value=26.7%) and `confirmed=false` (value=43.3%) exist with the same N, `max_by` picks the first match. The result: `metrics[:orr]` gets the confirmed value, making it identical to `metrics[:corr]` and wrong as a standalone ORR.
Concrete examples
Section titled “Concrete examples”Publication 117228 (RM-1929 photoimmunotherapy in rHNSCC):
Abstract states:
- “unconfirmed objective response rate (ORR) 43.3%”
- “confirmed ORR 26.7%”
View correctly has both rows (subgroup “Heavily pretreated rHNSCC → Part 2”):
- `confirmed=true, measure_value=26.7, number_of_participants=30`
- `confirmed=false, measure_value=43.3, number_of_participants=30`
Report output: efficacy.orr.value = 26.7 (should be 43.3)
The cORR extraction correctly returns 26.7%, but the ORR extraction ALSO returns 26.7% instead of 43.3%.
Downstream impact
Section titled “Downstream impact”- Understated ORR: When confirmed ORR is lower than unconfirmed ORR (the typical pattern), the report shows the lower confirmed value as the headline ORR. For pub 117228, ORR is understated from 43.3% to 26.7%.
- Duplicate values: ORR and cORR columns show the same value, making the cORR column appear redundant and hiding the existence of a lower confirmed rate.
- Audit noise: The audit correctly flags these as `incorrect_value` on `efficacy.orr.value`, generating true-positive findings that overlap with Issue 25 audit findings.
477 publications currently have both confirmed=true and confirmed=false ORR rows (the correct Issue 25 extraction pattern). When both rows have the same N (which is common — confirmed and unconfirmed ORR are computed from the same denominator), the confirmed value gets picked as plain ORR.
```sql
-- Publications where confirmed and unconfirmed ORR have the same N
-- (susceptible to the wrong-pick bug)
SELECT count(DISTINCT ts.source_id)
FROM trial_subgroups ts
JOIN trial_outcome_measures tom_c ON tom_c.trial_subgroup_id = ts.id AND tom_c.confirmed = true
JOIN trial_outcome_measures tom_u ON tom_u.trial_subgroup_id = ts.id AND tom_u.confirmed = false
JOIN trial_endpoints te_c ON te_c.id = tom_c.trial_endpoint_id AND te_c.abbreviation = 'ORR'
JOIN trial_endpoints te_u ON te_u.id = tom_u.trial_endpoint_id AND te_u.abbreviation = 'ORR'
JOIN trial_arm_outcomes tao_c ON tao_c.trial_outcome_measure_id = tom_c.id
JOIN trial_arm_outcomes tao_u ON tao_u.trial_outcome_measure_id = tom_u.id
WHERE ts.source_type = 'Publication'
  AND tao_c.number_of_participants = tao_u.number_of_participants;
```

Explored solution direction
Section titled “Explored solution direction”Forward fix: In extract_efficacy_metrics, when processing ORR, exclude confirmed=true rows if confirmed=false rows also exist for the same subgroup. This ensures the plain ORR metric always uses the unconfirmed/total ORR:
```ruby
# Inside the PRIMARY_EFFICACY_ABBREVIATIONS.each loop, after filtering matching:
if abbr == 'ORR'
  unconfirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  matching = unconfirmed if unconfirmed.any?
end
```

This is a ~3 line change in `clinical_evidence_query.rb`. No backfill needed — fixing the query immediately fixes all report output.
No backfill required: This is a query-layer bug, not a data issue. The underlying data (trial_outcome_measures with correct confirmed flags) is correct. Fixing the Ruby code fixes all publications instantly.
Solution applied
Section titled “Solution applied”Forward fix (2026-03-26): Added 5-line guard in app/queries/tpp/clinical_evidence_query.rb extract_efficacy_metrics method (line 610-613). When processing ORR, rejects confirmed=true rows if non-confirmed rows exist. This ensures the plain ORR metric uses the unconfirmed/total ORR, while the cORR extraction (line 667-683) independently picks confirmed=true rows.
```ruby
if abbr == 'ORR'
  non_confirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  matching = non_confirmed if non_confirmed.any?
end
```

Edge cases handled:
- Both confirmed + unconfirmed exist → ORR gets unconfirmed, cORR gets confirmed (correct)
- Only confirmed exists (no unconfirmed) → ORR falls back to confirmed value (safe fallback — same as cORR)
- Only unconfirmed/null exists → no change (correct)
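To illustrate the guard's behavior on pub 117228's two rows (hash shapes assumed to mirror the view columns referenced in this issue):

```ruby
# Two ORR rows with equal N — without the guard, max_by ties on N and can
# return the confirmed row, understating the plain ORR.
rows = [
  { 'confirmed' => true,  'measure_value' => 26.7, 'number_of_participants' => 30 },
  { 'confirmed' => false, 'measure_value' => 43.3, 'number_of_participants' => 30 }
]

non_confirmed = rows.reject { |r| [true, 't'].include?(r['confirmed']) }
matching = non_confirmed.any? ? non_confirmed : rows
best_row = matching.max_by { |r| r['number_of_participants'].to_i }
best_row['measure_value'] # → 43.3 (plain ORR gets the unconfirmed/total value)
```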
No backfill needed — query-layer fix applies immediately to all report output
28. build_result_rows collapses dose-level arms when study_plan_arm_id is null
Section titled “28. build_result_rows collapses dose-level arms when study_plan_arm_id is null”Short summary
Section titled “Short summary”ClinicalEvidenceQuery.build_result_rows groups view rows by [publication_id, disease_id, effective_line, study_plan_arm_id, subgroup_value]. When study_plan_arm_id is null — which it is for all publication-extracted arms that haven’t been matched to a clinical trial study plan arm — distinct dose-level arms (e.g. “8.0 mg/kg” and “10.0 mg/kg”) sharing the same subgroup_value collapse into a single group. extract_efficacy_metrics then picks one arm by max_by(number_of_participants), silently dropping the other.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/queries/tpp/clinical_evidence_query.rb, build_result_rows method (line 306).
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”The grouping key at line 306 is:
```ruby
grouped = enriched_data.group_by { |row|
  [row['publication_id'], row['disease_id'], row['effective_line'],
   row['study_plan_arm_id'], row['subgroup_value']]
}
```

When `study_plan_arm_id` is null for both dose arms (common for unlinked publications), they group together. `extract_efficacy_metrics` (line 619) then picks one via `max_by(number_of_participants)`.
Concrete examples
Section titled “Concrete examples”Pub 190656 (ARTEMIS-001, HS-20093 B7-H3 ADC in NSCLC):
- View has 6 rows for “NSCLC → Squamous cell carcinoma” (3 endpoints × 2 dose arms: 8.0 mg/kg N=32 and 10.0 mg/kg N=26)
- Both arms have `study_plan_arm_id = null`
- Query collapses to 1 row, picks 8.0 mg/kg (N=32 > N=26)
- Lost data: Sq 10.0 mg/kg cORR 26.9%, PFS 5.7, DOR 7.0
Downstream impact
Section titled “Downstream impact”Dose-level subgroup data is silently dropped from the Clinical Evidence report. For dose-escalation studies where different dose levels have meaningfully different efficacy, only the higher-N cohort appears.
Affects dose-escalation/expansion publications where arms aren’t matched to trial study plan arms. The view correctly distinguishes arms by arm_name, but the query ignores arm_name in its grouping key.
Explored solution direction
Section titled “Explored solution direction”Add arm_name to the grouping key in build_result_rows, or fall back to arm_name when study_plan_arm_id is null. This preserves dose-level arm distinctions without breaking publications where study_plan_arm_id correctly differentiates arms.
Related to Issue 20 (study_plan_arm link is fragile) — same root cause of over-reliance on study_plan_arm_id.
Solution applied
Section titled “Solution applied”29. Dose extraction captures study-level range, not efficacy population range
Section titled “29. Dose extraction captures study-level range, not efficacy population range”Short summary
Section titled “Short summary”In dose-escalation studies, classify_publications extracts the full dose range stated in the abstract (e.g. dose_min=1.0, dose_max=8.3 mg/kg) as a property of the subgroup. But when the abstract restricts efficacy reporting to a dose subset (e.g. “results for patients who received ≥4.0 mg/kg”), the dose_min on the efficacy row is too low, creating a mismatch between the dose range and the efficacy population.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — dose fields extracted as subgroup-level properties.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”Dose extraction treats dose as a study-level attribute (“what doses were used?”) rather than scoping to the efficacy analysis population (“what doses did the patients in the reported results actually receive?”). The LLM prompt doesn’t instruct it to scope dose to the efficacy population.
Concrete examples
Section titled “Concrete examples”Pub 238709 (MYTX-011 KisMET-01 updated):
- Abstract: “85 pts received 1.0–8.3 mg/kg; 59 pts received ≥4.0 mg/kg” — efficacy reported only for ≥4.0 mg/kg subset
- Extracted: `dose_min=1.0, dose_max=8.3`
- Expected: `dose_min=4.0, dose_max=8.3` (matching the efficacy population)
- RP2D correctly extracted as “5.0 mg/kg Q3W (2-on 1-off) and 4.0 mg/kg Q3W”
Downstream impact
Section titled “Downstream impact”Report rows show a broader dose range than the actual efficacy population received. Minor impact on report accuracy but misleading for dose-response interpretation.
Affects phase I dose-escalation studies where efficacy is reported for a dose subset. Relatively uncommon pattern — most studies report efficacy at a single dose or clearly per-dose-level.
Explored solution direction
Section titled “Explored solution direction”Update the classify_publications dose extraction prompt to instruct the LLM: “When the abstract reports efficacy for a specific dose subset, use that subset’s dose range, not the full escalation range.” Alternatively, accept this as a known limitation since RP2D (when present) correctly reflects the clinically relevant dose.
Solution applied
Section titled “Solution applied”30. Cross-study data contamination from abstract background sections
Section titled “30. Cross-study data contamination from abstract background sections”Short summary
Section titled “Short summary”When a publication abstract references efficacy results from a prior study as background context (e.g. “In our previous study NCT05029882, ORR was 24.4%”), classify_publications extracts those values as if they belong to the current study. This produces fabricated efficacy data for publications that may have no efficacy results of their own yet.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — efficacy extraction from abstract text.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”The LLM extraction prompt does not distinguish between efficacy results reported as outcomes of the current study vs. results cited from external/prior studies as background context. The abstract structure (Background → Methods → Results → Conclusions) is not enforced.
Concrete examples
Section titled “Concrete examples”Pub 29705 (ABBV-400/Telisotuzumab adizutecan signal-seeking study, NCT06084481):
- Abstract background: “Initial results from the ongoing first-in-human study (NCT05029882) of ABBV-400… an overall response rate of 24.4%”
- Current study status: “As of 19 January 2024, 24 patients have been enrolled” — no efficacy data reported
- Extracted: ORR=24.4%, N=24 (enrollment count misinterpreted as efficacy N)
- Expected: No efficacy data (null)
The 24.4% ORR belongs to NCT05029882, not NCT06084481. The N=24 is enrollment, not an efficacy population.
Downstream impact
Section titled “Downstream impact”Publications appear in the Clinical Evidence report with fabricated efficacy data from unrelated studies. This is particularly misleading for signal-seeking or early-enrollment publications where the abstract previews prior results to motivate the new study.
Affects publications whose abstracts cite efficacy results from prior/companion studies. Common in: signal-seeking study designs, follow-up studies referencing parent trials, and publications describing study rationale with prior data. Exact count unknown — requires systematic detection.
Explored solution direction
Section titled “Explored solution direction”- Audit prompt guard (deployed): Added “CROSS-STUDY REFERENCES” instruction to the audit prompt so future audits flag these correctly.
- Extraction prompt fix (forward): Update the `classify_publications` prompt to instruct: “Only extract efficacy values reported as results of THIS study (typically in the Results section). Do not extract values cited from prior/external studies in the Background or Introduction.”
- Detection query: Publications where `llm_data` has efficacy values but the abstract contains phrases like “previous study”, “prior study”, “first-in-human study (NCT…)” with efficacy values in the same sentence could be flagged for review.
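The detection idea can be sketched with a sentence-level regex (hypothetical helper; the phrase list and pattern would need tuning against real abstracts):

```ruby
# Flag abstracts that cite prior-study efficacy in the same sentence —
# candidates for Issue 30 review.
PRIOR_STUDY_EFFICACY = /\b(previous|prior|first-in-human)\s+study\b.*\b(ORR|response rate)\b.*\d+(\.\d+)?%/i

def cites_prior_efficacy?(abstract)
  # Naive sentence split; any sentence mixing prior-study language with a rate is flagged
  abstract.split(/(?<=[.!?])\s+/).any? { |s| s =~ PRIOR_STUDY_EFFICACY }
end

# Pub 29705 pattern: background cites the parent trial's ORR
cites_prior_efficacy?(
  "Initial results from the ongoing first-in-human study (NCT05029882) " \
  "of ABBV-400 showed an overall response rate of 24.4%."
) # → true
```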
Solution applied
Section titled “Solution applied”Audit prompt updated with cross-study reference guard (2026-03-27). Extraction-level fix pending.
Job 1594 Triage Log (HNSCC + ADC, disease_id=6200, technology_ids=708)
Section titled “Job 1594 Triage Log (HNSCC + ADC, disease_id=6200, technology_ids=708)”| Audit ID | Pub ID | Type | Field | Classification | Notes |
|---|---|---|---|---|---|
| 8338 | 29660 | incorrect_value | efficacy.dor.value | True issue — extraction (minor) | LLM appended spurious “(4.55)” to “Not Reached” DOR |
| 8341 | 29705 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 30) | ORR from referenced prior study NCT05029882, not current study |
| 8342 | 29705 | incorrect_value | efficacy.orr.patient_count | True issue — extraction (Issue 30) | Enrollment count (24) misinterpreted as efficacy N |
| 8339 | 44216 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Dose-escalation range (0.3) on dose-expansion efficacy row (RP2D=2.0) |
| 8340 | 44216 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Dose-escalation range (2.2) on dose-expansion efficacy row (RP2D=2.0) |
| 8343 | 115389 | incorrect_value | efficacy.pfs.value | True issue — extraction | “Not Reached” should be null; abstract says “immature” (insufficient data) |
| 8344 | 134450 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8 residual) | Zero-sentinel: N=0 instead of null for unstated SCCHN-specific N |
| 8345 | 134450 | incorrect_value | dose_min | True issue — extraction (Issue 29 variant) | Child subgroup inherited phase 1a escalation dose (0.1) instead of parent’s fixed dose (2.5) |
| 8346 | 134450 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Dose range from escalation phase on expansion subgroup |
| 8347 | 75542 | missing_subgroup | — | False positive — audit LLM | ctDNA abundance is a Cox model correlation, not a tabulated efficacy subgroup |
| 8348 | 75542 | missing_subgroup | — | False positive — audit LLM | VAF persistence is a statistical correlation, not a reportable subgroup |
| 8349 | 114973 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Full escalation range (0.3) on efficacy row; efficacy population was 3.6-5.4 |
| 8350 | 114973 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Full escalation range (8.0) on efficacy row; efficacy population was 3.6-5.4 |
Job 1635 Triage Log (CRC + ADC, disease_id=4345, technology_ids=708)
Section titled “Job 1635 Triage Log (CRC + ADC, disease_id=4345, technology_ids=708)”| Audit ID | Pub ID | Type | Field | Classification | Notes |
|---|---|---|---|---|---|
| 8360 | 241259 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 2.0 mg/kg arm; per-arm N not stated |
| 8361 | 241259 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 2.4 mg/kg arm; per-arm N not stated |
| 8362 | 241259 | incorrect_value | dose_min | True issue — view (Issue 31) | SOC arm has Temab-A dose_min=1.6; SOC is trifluridine/tipiracil+BEV |
| 8363 | 241259 | incorrect_value | dose_max | True issue — view (Issue 31) | SOC arm has Temab-A dose_max=2.4 |
| 8364 | 241259 | incorrect_value | dose_units | True issue — view (Issue 31) | SOC arm has mg/kg (Temab-A units) |
| 8365 | 241259 | incorrect_value | dose_frequency | True issue — view (Issue 31) | SOC arm has Q3W (Temab-A schedule) |
| 8366 | 241259 | incorrect_value | rp2d | True issue — view (Issue 31) | SOC arm has Temab-A RP2D |
| 8352 | 29699 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 8) | Zero-sentinel: ORR=0% for overall mCRC; no numeric ORR in abstract (E-R paper) |
| 8353 | 29699 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 2.4 mg/kg arm |
| 8354 | 29699 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 8) | Zero-sentinel: ORR=0% for 2.4 mg/kg; E-R correlations only |
| 8355 | 29699 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 3.0 mg/kg arm |
| 8356 | 29699 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 8) | Zero-sentinel: ORR=0% for 3.0 mg/kg; E-R correlations only |
| 8368 | 29737 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 4.8+ mo (SD pts only) mapped to PFS for full CRC cohort |
| 8369 | 29737 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 32) | N=29 (full CRC) but TTP was for 14 SD patients only |
| 8370 | 29737 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 4.4+ mo (SD pts only) mapped to PFS for KRAS-mutated |
| 8371 | 29737 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 32) | N=13 (full KRAS) but TTP was for 7 SD patients only |
| 8411 | 134450 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for CRC phase 1b; ORR/DCR reported |
| 8412 | 134450 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Phase 1a escalation min (0.1) on phase 1b efficacy row (RP2D=2.5) |
| 8413 | 134450 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for SCCHN phase 1b; ORR/DCR reported |
| 8414 | 134450 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Same as 8412 for SCCHN child subgroup |
| 8402 | 72043 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | CRC × HER2 IHC 3+ cross-tabulated subgroup missing |
| 8403 | 72043 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | CRC × HER2 IHC 2+ cross-tabulated subgroup missing |
| 8404 | 72043 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | CRC × HER2 IHC 1+ cross-tabulated subgroup missing |
| 8405 | 72043 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | CRC × HER2 mut/amp cross-tabulated subgroup missing |
| 8386 | 74193 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 1.6 mo mapped to PFS |
| 8387 | 74193 | incorrect_value | patient_number_efficacy | True issue — extraction | ctDNA retained subgroup: N=3 (tested) but only 2 had retention |
| 8388 | 74193 | incorrect_value | efficacy.orr.patient_count | True issue — extraction | Same: ORR denominator=3 should be 2 |
| 8389 | 74193 | incorrect_value | efficacy.dcr.patient_count | True issue — extraction | Same: DCR denominator=3 should be 2 |
| 8380 | 200353 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 26) | Parent N=97 propagated to “Absent MR” child subgroup |
| 8381 | 200353 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 26) | Parent N=97 propagated to “Complete MR” child subgroup |
| 8382 | 200353 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for EGFR amplification subgroup |
| 8383 | 200353 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 8) | Zero-sentinel: PFS patient_count=0 for EGFR amp |
| 8373 | 48880 | incorrect_value | single_dose | True issue — extraction | Pooled Overall row shows single_dose=5.4; study had both 5.4 and 6.4 mg/kg |
| 8374 | 48880 | incorrect_value | dose_min | False positive — audit LLM | dose_min=5.4 IS the minimum dose; audit confused by dose_max also being 5.4 |
| 8375 | 48880 | incorrect_value | dose_max | True issue — extraction | dose_max=5.4 should be 6.4 (second arm omitted from pub-level dose) |
| 8407 | 135119 | incorrect_value | patient_number_safety | True issue — extraction | Safety N=28 (Q2W-LD only); full study N=43 includes Q3W arm |
| 8408 | 135119 | incorrect_value | dose_max | True issue — extraction | dose_max=170 but Q3W arm went to 190 mg/m² |
| 8409 | 135119 | incorrect_value | dose_frequency | True issue — extraction | Q2W only; study used both Q2W and Q3W schedules |
| 8397 | 66892 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 0.8 on efficacy row; efficacy population ≥6 mg/kg |
| 8398 | 66892 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Same for IHC 2+/FISH+ child subgroup |
| 8399 | 66892 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | HER2 IHC 3+ subgroup (ORR 16/30=53.3%) not extracted |
| 8377 | 48926 | incorrect_value | patient_number_efficacy | True issue — query/view | Disease-scoped IHC2+/ISH+ duplicate has N=0; non-scoped row has correct N=13 |
| 8378 | 48926 | incorrect_value | patient_number_efficacy | True issue — query/view | Disease-scoped IHC3+ duplicate has N=0; non-scoped row has correct N=40 |
| 8379 | 48926 | incorrect_value | patient_number_efficacy | True issue — query/view | Disease-scoped prior anti-HER2 duplicate has N=0; non-scoped row has correct N=16 |
| 8390 | 49899 | incorrect_value | patient_number_efficacy | True issue — extraction | N=40 (overall) for ≥2.4 mg/kg subgroup; should be 34 per abstract |
| 8391 | 49899 | incorrect_value | efficacy.orr.patient_count | True issue — extraction | ORR denominator=40 should be 34 |
| 8392 | 49899 | incorrect_value | efficacy.corr.patient_count | True issue — extraction | cORR denominator=40 should be 34 |
| 8393 | 49900 | incorrect_value | patient_number_safety | True issue — extraction | Safety N=29 for 2.4 mg/kg arm; abstract says 31 treated |
| 8351 | 100 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 2.70 mo mapped to PFS |
| 8394 | 51436 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 1.5 on ≥6 mg/kg efficacy subgroup |
| 8396 | 52543 | incorrect_value | efficacy.orr.patient_count | False positive — audit LLM | patient_count=3 is denominator (correct); audit confused numerator/denominator |
| 8384 | 67379 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for hTMB/MSS; PFS+HR reported |
| 8385 | 67379 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 8) | Zero-sentinel: PFS patient_count=0 for same |
| 8400 | 70960 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 3.2 on RP2D (6.4) subgroup |
| 8401 | 70960 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Escalation max 8.0 on RP2D (6.4) subgroup |
| 8406 | 73299 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 1.8 mo mapped to PFS for CRC cohort |
| 8410 | 75999 | spurious_row | — | True issue — query scoping | NPC subgroup in CRC-scoped report (basket trial leak) |
| 8415 | 114571 | incorrect_value | efficacy.os.value | True issue — extraction | OS=“Not Reached” but abstract says “not yet mature” → should be null |
| 8358 | 116843 | incorrect_value | rp2d | True issue — view (Issue 31) | SOC arm has Temab-A RP2D (dose cross-contamination) |
| 8417 | 152942 | spurious_row | — | True issue — query scoping | PDA subgroup in CRC-scoped report (basket trial leak) |
| 8418 | 162304 | incorrect_value | efficacy.orr.value | True issue — extraction | ORR=35% is “any tumor reduction” rate; actual ORR≈1.5% (1/66 PR) |
| 8359 | 235204 | incorrect_value | patient_number_efficacy | True issue — extraction | N=23 is PFS event count, not patient count; should be 31 |
| 8416 | 238377 | incorrect_value | efficacy.dor.value | True issue — extraction | DoR=11.03mo from “>48 weeks” (lower bound, not median) |
| 8395 | 240052 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 1.5 on ≥6 mg/kg efficacy subgroup |
| 8357 | 29700 | missing_endpoint | efficacy.dor.value | True issue — extraction | DoR=5.5 mo in abstract for 3.0 mg/kg but not extracted |
| 8367 | 29735 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 5.1 mo mapped to PFS for CRC |
| 8372 | 29738 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 18 wks → 4.14 mo converted and mapped to PFS |
| 8376 | 48903 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Part 1 max (8.0) on Part 2 expansion row (5.4/6.4) |
31. Investigational drug dose data bleeds onto control/comparator arms
Short summary
When publication_interventions.study_plan_arm_id is NULL (the common case for publication-extracted drugs via Source 0), the drug_interventions CTE in vw_publication_efficacy_data joins the investigational drug to ALL arms — including control/comparator arms. The pub_dose_lookup COALESCE fallback then propagates the investigational drug’s dose fields (dose_min, dose_max, rp2d, dose_units, dose_frequency) onto control arm rows that have no subgroup-level dose override. This makes it appear that the comparator arm received the investigational drug’s dosing.
Where this sits in the current pipeline
`db/views/vw_publication_efficacy_data_v18.sql`:
- `drug_interventions` CTE (Source 0): Joins `publication_interventions` to arms. When both `clinical_trial_id` and `study_plan_arm_id` are NULL, the drug matches all arms via the `OR di.study_plan_arm_id IS NULL` fallback.
- `pub_dose_lookup` CTE: Pulls dose_evidence from `publication_interventions`. Joined to `raw_rows` via a `publication_intervention_id` match from `drug_interventions`.
- `raw_rows` COALESCE chain (lines 449–469): Falls through subgroup-level dose → pub-level dose. No arm_type guard prevents control arms from inheriting the investigational drug’s dose.
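The arm-matching fallback can be sketched as a predicate (a minimal Python sketch with hypothetical dict shapes, not the actual view SQL): when `study_plan_arm_id` is NULL, the drug matches every arm of the publication, control arms included.

```python
# Hypothetical sketch of the Source 0 arm-matching predicate. With
# study_plan_arm_id NULL, the OR fallback makes the drug match all arms.

def drug_matches_arm(di, arm):
    """Mimics: di.study_plan_arm_id = arm.id OR di.study_plan_arm_id IS NULL."""
    return di.get("study_plan_arm_id") == arm["id"] or di.get("study_plan_arm_id") is None

di = {"study_plan_arm_id": None}  # common case for publication-extracted drugs
arms = [{"id": 1, "arm_type": "experimental"}, {"id": 2, "arm_type": "control"}]
assert all(drug_matches_arm(di, a) for a in arms)  # matches the control arm too
```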
Exact restriction causing the drop
In `raw_rows`, the dose COALESCE chain:

```sql
COALESCE(tlm.subgroup_dose_min, ..., pdl.pub_dose_min) AS dose_min,
COALESCE(tlm.subgroup_dose_max, ..., pdl.pub_dose_max) AS dose_max,
COALESCE(tlm.subgroup_rp2d, pdl.pub_rp2d) AS rp2d,
```

has no guard on `aoe.arm_type` or `aoe.resolved_group_type`. When a control arm’s subgroup has no dose fields, the COALESCE falls through to `pub_dose_lookup`, which contains the investigational drug’s dose evidence.
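The fallthrough can be illustrated with a small sketch (hypothetical field names mirroring the view columns, not pipeline code): with no subgroup-level dose, a control arm row lands on the publication-level dose of the investigational drug.

```python
# Minimal sketch of the three-tier dose fallback in raw_rows: subgroup dose,
# derived subgroup dose string, then the publication-level dose from
# pub_dose_lookup. Nothing checks arm_type, so a control arm with no
# subgroup-level dose inherits the investigational drug's dose.

def resolve_dose_min(row):
    """Mimics COALESCE(tlm.subgroup_dose_min, derived, pdl.pub_dose_min)."""
    if row.get("subgroup_dose_min") is not None:
        return row["subgroup_dose_min"]            # tier 1: subgroup override
    if row.get("subgroup_dose_value") is not None:
        units = row.get("subgroup_dose_units") or ""
        return f'{row["subgroup_dose_value"]} {units}'.strip()  # tier 2: derived
    return row.get("pub_dose_min")                 # tier 3: pub-level fallback

# Control arm (SOC, pub 241259): no subgroup dose, so tier 3 leaks Temab-A's dose.
control = {"arm_type": "control", "pub_dose_min": "1.6 mg/kg"}
assert resolve_dose_min(control) == "1.6 mg/kg"    # the bug: SOC shows an ADC dose
```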
Concrete examples
Pub 241259 (Temab-A exposure-response in mCRC):
- SOC arm = trifluridine/tipiracil+BEV (N=20)
- View shows: dose_min=1.6 mg/kg, dose_max=2.4 mg/kg, rp2d=2.4 mg/kg Q3W, dose_units=mg/kg, dose_frequency=Q3W
- These are Temab-A doses from `publication_interventions` id=51068 (study_plan_arm_id=NULL)
- Abstract explicitly states SOC is “trifluridine/tipiracil+BEV” — no Temab-A dosing
Pub 241978 (Enfortumab vedotin):
- “No upfront dose reduction” control arm shows dose_min=0.75 mg/kg, dose_max=1.25 mg/kg
Downstream impact
- Clinical Evidence report: Control arms display investigational drug dose fields, misleading reviewers into thinking comparator arms received the ADC
- Audit findings: Audit LLM correctly flags these as incorrect (5 of 7 issues on pub 241259 are this pattern)
- Data quality: Dose fields on control arms are nonsensical — they describe a drug the arm didn’t receive
- 2,890 view rows across 566 publications have dose data from pub_dose_lookup on control/comparator arms
- 1,197 additional control rows have subgroup-level dose (potentially legitimate for dose-comparison arms)
- Within ADC technology scope: 14 rows across 5 publications (smaller because most ADC trials are single-arm)
What the issue is not
- Drug NAME attribution to control arms is intentional — the report needs to show what drug the control is being compared against
- Subgroup-level dose on control arms may be correct (e.g., dose-comparison trials where the control is a different dose of the same drug)
- This does NOT affect experimental/investigational arm rows
Explored solution direction
Forward fix — view v19: Add an arm_type guard to the `pub_dose_lookup` COALESCE in `raw_rows`. When `aoe.arm_type = 'control'` (or `aoe.resolved_group_type = 'ACTIVE_COMPARATOR'`), skip the `pub_dose_lookup` fallback:

```sql
COALESCE(
  tlm.subgroup_dose_min,
  CASE WHEN tlm.subgroup_dose_value IS NOT NULL
       THEN tlm.subgroup_dose_value || ' ' || COALESCE(tlm.subgroup_dose_units, '')
  END,
  CASE WHEN aoe.arm_type != 'control' THEN pdl.pub_dose_min END
) AS dose_min,
```

Apply the same pattern to dose_max, rp2d, dose_units, dose_frequency, and single_dose. This preserves subgroup-level dose (tier 1) for all arms but blocks the publication-level fallback (tier 3) for control arms only.
No backfill needed — rematerializing the view after deploying v19 will fix all affected rows.
Related to Issue 20: The v16 Source 0 fix (using publication_interventions as primary drug source) introduced this side effect by broadening the drug_interventions join. The drug join itself is correct; only the dose COALESCE fallback needs the arm_type guard.
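The proposed v19 guard behaves as follows (a sketch under the same hypothetical field names, not the deployed SQL): subgroup-level dose still wins for every arm, but the publication-level fallback is suppressed on control arms.

```python
# Sketch of the v19 guard: tiers 1-2 (subgroup dose) apply to every arm;
# tier 3 (pub-level dose from pub_dose_lookup) is skipped for control arms.

def resolve_dose_min_v19(row):
    if row.get("subgroup_dose_min") is not None:
        return row["subgroup_dose_min"]
    if row.get("subgroup_dose_value") is not None:
        units = row.get("subgroup_dose_units") or ""
        return f'{row["subgroup_dose_value"]} {units}'.strip()
    if row.get("arm_type") != "control":   # CASE WHEN aoe.arm_type != 'control'
        return row.get("pub_dose_min")
    return None                            # control arm: no pub-level fallback

control = {"arm_type": "control", "pub_dose_min": "1.6 mg/kg"}
experimental = {"arm_type": "experimental", "pub_dose_min": "1.6 mg/kg"}
dose_comparison = {"arm_type": "control", "subgroup_dose_min": "5.4 mg/kg"}

assert resolve_dose_min_v19(control) is None                   # leak blocked
assert resolve_dose_min_v19(experimental) == "1.6 mg/kg"       # unaffected
assert resolve_dose_min_v19(dose_comparison) == "5.4 mg/kg"    # tier 1 preserved
```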
Solution applied
(empty — pending implementation)
32. TTP (time to progression) misclassified as PFS
Short summary
The LLM extraction pipeline (classify_publications) maps TTP (time to progression) values to PFS (progression-free survival) when the abstract reports TTP but not PFS. These are distinct endpoints — TTP censors deaths while PFS counts them as events. Additionally, in some cases (e.g., pub 29737), TTP values reported for a best-response subpopulation (e.g., SD patients only) are attributed to the entire cohort.
Where this sits in the current pipeline
- `app/tasks/publications_llm_classification/subgroup_extraction.rb`: Identifies endpoints from the abstract. May correctly identify TTP, but it gets mapped to PFS downstream.
- `app/tasks/publications_llm_classification/task.rb`: Extracts endpoint values. The LLM treats TTP as PFS when extracting, or the endpoint mapping normalizes TTP→PFS.
- Endpoint normalization: If TTP is not in the standard endpoint list, the LLM may substitute the closest recognized endpoint (PFS).
Exact restriction causing the drop
The classify_publications prompt and/or endpoint schema does not distinguish TTP from PFS. When an abstract reports “median TTP = X months”, the LLM maps this to the PFS endpoint because TTP is not available as a separate extraction target. The LLM lacks instruction to leave PFS null when only TTP is reported.
Concrete examples
Pub 29737 (IMMU-132 in GI cancers):
- Abstract: “time to progression (TTP) … median of 4.8+ mo for the SD pts”
- Extracted: PFS=4.8 months, patient_count=29 (entire CRC cohort)
- Correct: TTP=4.8+ months, applicable to 14 SD patients only — PFS should be null
- Two compounding errors: (1) TTP→PFS confusion, (2) SD-subpopulation value → full cohort
Pub 29737 KRAS-mutated subgroup:
- Abstract: “median TTP = 4.4+ mo” for 7 SD patients
- Extracted: PFS=4.4 months, patient_count=13 (all KRAS-mutated)
- Correct: TTP=4.4+ months for 7 SD patients — PFS should be null
Downstream impact
- Clinical Evidence report: PFS column shows TTP values, overstating the evidence (PFS is a stronger endpoint than TTP)
- Cross-study comparisons: TTP values mixed with genuine PFS values make comparisons unreliable
- Patient counts: When TTP is reported only for responders/SD patients, attributing it to the full cohort inflates the denominator
- 149 publications mention TTP (but not PFS) in their abstract yet have PFS as an extracted endpoint
- 1,150 publications have TTP correctly extracted as TTP (suggesting the pipeline CAN handle TTP in many cases)
- The SD-subpopulation misattribution is harder to quantify systematically but likely affects a subset of phase I/II publications reporting outcomes by best response category
Explored solution direction
- Extraction prompt fix (forward): Add explicit instruction to `classify_publications`: “TTP (time to progression) and PFS (progression-free survival) are distinct endpoints. If the abstract reports TTP but not PFS, extract TTP only — do NOT map TTP values to PFS. Leave PFS null when only TTP is reported.”
- Subpopulation guard: Add instruction: “When a time-based endpoint (TTP, PFS, DoR) is reported only for a best-response subgroup (e.g., ‘median TTP for SD patients’), do not attribute it to the parent population. Extract it under the response-specific subgroup or leave the parent’s value null.”
- Backfill: Re-extract PFS values for the 149 affected publications with updated prompt. Scope: publications where abstract contains TTP/time to progression but NOT PFS/progression-free survival, and a PFS endpoint was extracted.
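The backfill scoping condition can be sketched as a text check (a hypothetical helper for illustration; the real scoping would run as SQL against the publications table):

```python
import re

# Flag a publication for re-extraction when its abstract mentions TTP but not
# PFS, yet a PFS endpoint was extracted (the Issue 32 pattern).

TTP_RE = re.compile(r"\bTTP\b|time to progression", re.IGNORECASE)
PFS_RE = re.compile(r"\bPFS\b|progression[- ]free survival", re.IGNORECASE)

def needs_pfs_reextraction(abstract: str, extracted_endpoints: set) -> bool:
    return (bool(TTP_RE.search(abstract))
            and not PFS_RE.search(abstract)
            and "PFS" in extracted_endpoints)

# Pub 29737 pattern: abstract reports TTP only, but a PFS value was extracted.
abstract = "time to progression (TTP) ... median of 4.8+ mo for the SD pts"
assert needs_pfs_reextraction(abstract, {"PFS", "ORR"}) is True
assert needs_pfs_reextraction("median PFS was 6.1 mo", {"PFS"}) is False
```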
Solution applied
(empty — pending implementation)
33. Cross-tabulated subgroups not identified in basket trials
Short summary
When basket trial abstracts report efficacy in a table structured as tumor type × biomarker status (e.g., CRC × HER2 IHC 3+/2+/1+), extract_subgroups identifies the single-dimension subgroups (tumor types and biomarker statuses separately) but not the cross-product subgroups (CRC IHC 3+, CRC IHC 2+, etc.). This means disease-specific biomarker-stratified efficacy data is lost — only the overall tumor-type and overall biomarker-status rows are extracted.
Where this sits in the current pipeline
- `app/tasks/publications_llm_classification/subgroup_extraction.rb`: Identifies subgroups and their endpoint associations from the abstract. The LLM prompt identifies subgroups as a flat list, and the hierarchical naming convention (e.g., “Non-breast STs → CRC”) captures one level of nesting but not cross-dimensional nesting.
Exact restriction causing the drop
The subgroup extraction prompt produces subgroups along each dimension independently:
- By tumor type: BTC, UC, GC/GEJA, CRC
- By biomarker: HER2 IHC3+, IHC2+, IHC1+
But it does not produce the cross-product: CRC IHC3+, CRC IHC2+, etc. The table data in the abstract contains these values, but the extraction doesn’t recognize the need to create nested subgroups for each cell in a tumor type × biomarker matrix.
Concrete examples
Pub 72043 (SHR-A1811 in non-breast solid tumors):
- Abstract table reports ORR for each tumor type × HER2 IHC status combination
- Extracted subgroups: CRC (36.4%), IHC3+ (54.1%), IHC2+ (41.7%), IHC1+ (50.0%)
- Missing: CRC IHC3+ (100%, 3/3), CRC IHC2+ (0%, 0/3), CRC IHC1+ (0%, 0/1), CRC HER2 mut/amp (0%, 0/3)
- 4 audit issues (8402-8405) all flagging missing cross-tabulated CRC subgroups
Downstream impact
- Clinical Evidence report: Disease-specific biomarker-stratified efficacy data missing — can only show overall CRC ORR, not CRC by HER2 status
- Granularity loss: The most clinically relevant data in basket trials is often the cross-tabulation (e.g., “does HER2 IHC 3+ predict response in CRC specifically?”)
- ~366 publications have both disease-type and biomarker-type subgroups with common biomarkers (HER2, EGFR, KRAS, BRAF, PD-L1, MSI, MMR)
- Not all 366 will have cross-tabulated data in the abstract — many will have separate analyses rather than a matrix table
- The issue primarily affects basket/platform trials reporting across multiple tumor types with biomarker stratification
What the issue is not
- This is NOT about missing biomarker context on existing subgroups (that’s Issue 19)
- This is NOT about dropped subgroups at the classify step (Issue 10) — the cross-product subgroups are never identified in the first place
- Parent-level tumor type and biomarker subgroups ARE correctly extracted
Explored solution direction
- Extraction prompt enhancement: Update the `extract_subgroups` prompt to recognize tabular cross-tabulation patterns: “When the abstract contains a table or matrix reporting efficacy by tumor type × biomarker status, create cross-product subgroups (e.g., ‘CRC → HER2 IHC 3+’) for each cell with reported data, in addition to the single-dimension subgroups.”
- Scope: Focus on publications with ≥2 disease subgroups AND ≥1 biomarker subgroup, and re-run extraction with the enhanced prompt.
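The post-extraction cross-product idea can be sketched as follows (hypothetical data shapes; the `reported_cells` filter stands in for checking which matrix cells the abstract table actually populates):

```python
from itertools import product

# Given single-dimension subgroups along tumor type and biomarker status, emit
# a nested cross-product subgroup for each cell that has reported data.

def cross_product_subgroups(tumor_types, biomarkers, reported_cells):
    return [f"{t} → {b}" for t, b in product(tumor_types, biomarkers)
            if (t, b) in reported_cells]

tumor_types = ["BTC", "UC", "GC/GEJA", "CRC"]
biomarkers = ["HER2 IHC3+", "HER2 IHC2+", "HER2 IHC1+"]
# Cells with data in the pub 72043 abstract table (illustrative subset):
reported = {("CRC", "HER2 IHC3+"), ("CRC", "HER2 IHC2+"), ("CRC", "HER2 IHC1+")}

assert cross_product_subgroups(tumor_types, biomarkers, reported) == [
    "CRC → HER2 IHC3+", "CRC → HER2 IHC2+", "CRC → HER2 IHC1+"]
```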
Solution applied
(empty — pending implementation)
34. “Immature” endpoints extracted as “Not Reached”
Short summary
When an abstract states that an endpoint (OS, PFS, DoR) is “not yet mature”, “data immature”, or “results are immature”, the LLM extraction maps this to “Not Reached”. These are clinically distinct concepts: “Not Reached” means the Kaplan-Meier curve hasn’t crossed the 50% mark (a real finding indicating the median exceeds current follow-up), while “immature” means insufficient events or follow-up to perform the analysis (no median can be estimated — value should be null).
Where this sits in the current pipeline
- `app/tasks/publications_llm_classification/task.rb`: The `classify_publications` prompt doesn’t distinguish between “Not Reached” and “immature/not yet mature”. The LLM treats both as equivalent and extracts “Not Reached” for either.
Exact restriction causing the drop
The extraction prompt has no instruction to differentiate “Not Reached” (endpoint was analyzed, median exceeds follow-up) from “immature” (endpoint was NOT formally analyzed, insufficient data). Both get mapped to the string “Not Reached”.
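The distinction the prompt needs can be sketched with a small classifier (a hypothetical helper for illustration, not pipeline code):

```python
import re

# "Not Reached" only when the abstract says the median was not reached;
# null when the endpoint is merely described as immature.

IMMATURE_RE = re.compile(r"not yet mature|\bimmature\b", re.IGNORECASE)
NOT_REACHED_RE = re.compile(r"not (?:been )?reached", re.IGNORECASE)

def endpoint_value(abstract_phrase: str):
    if NOT_REACHED_RE.search(abstract_phrase):
        return "Not Reached"   # median analyzed, exceeds follow-up
    if IMMATURE_RE.search(abstract_phrase):
        return None            # no median estimable: leave null
    return "numeric"           # placeholder for a reported median value

assert endpoint_value("The median overall survival (OS) was not yet mature") is None
assert endpoint_value("median OS was not reached") == "Not Reached"
# Both terms used: "not reached" wins, matching the tracker's guidance.
assert endpoint_value("OS data are immature; median was not reached") == "Not Reached"
```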
Concrete examples
Pub 114571 (JSKN003 in HER2+ mCRC):
- Abstract: “The median overall survival (OS) was not yet mature”
- Extracted: OS = “Not Reached”
- Correct: OS should be null — data immature, no median estimated
Pub 115389 (from job 1594):
- Abstract: PFS described as “immature”
- Extracted: PFS = “Not Reached”
- Correct: PFS should be null
Downstream impact
- Clinical Evidence report: “Not Reached” implies a favorable outcome (median exceeds follow-up), while “immature” is neutral (no data yet). Reporting “Not Reached” when the data is simply immature overstates the evidence.
- Cross-study comparisons: “Not Reached” OS is treated as a positive signal, biasing comparisons against studies that honestly report immature data.
- ~71 publications have “immature”/“not yet mature” language in the abstract (without “not reached”) but have “Not Reached” extracted for OS, PFS, or DoR
- Breakdown: OS (~214 total “Not Reached” pubs with immature language, ~71 without “not reached” in abstract), PFS (~107), DoR (~68)
- Many abstracts legitimately say BOTH “immature” and “not reached” — these are correct and not affected
What the issue is not
- Abstracts that say “median OS was not reached” — these ARE correct as “Not Reached”
- Abstracts that say “OS data are immature; median was not reached” — also correct (both terms used)
- Only affects abstracts where “immature” is used WITHOUT “not reached” for the same endpoint
Explored solution direction
- Extraction prompt fix (forward): Add instruction to `classify_publications`: “Distinguish between ‘Not Reached’ (endpoint was analyzed but median exceeds follow-up — extract as ‘Not Reached’) and ‘immature/not yet mature’ (insufficient data to analyze the endpoint — extract as null/omit). Only use ‘Not Reached’ when the abstract explicitly states the median was not reached.”
- Backfill: Re-extract OS/PFS/DoR for the ~71 affected publications. Scope query:

```sql
SELECT DISTINCT v.publication_id
FROM vw_publication_efficacy_data v
JOIN publications p ON p.id = v.publication_id
WHERE v.measure_value = 'Not Reached'
  AND v.endpoint_abbreviation IN ('OS', 'PFS', 'DOR')
  AND (p.abstract ILIKE '%not yet mature%'
       OR p.abstract ILIKE '%data immature%'
       OR p.abstract ILIKE '%data are immature%'
       OR p.abstract ILIKE '%results are immature%')
  AND p.abstract NOT ILIKE '%not reached%'
  AND p.abstract NOT ILIKE '%not been reached%'
```
Solution applied
(empty — pending implementation)