Publication Issues Tracker Archive

Temporary working document for tracking publication-processing issues identified during investigation.

This document is motivated by sheet 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8, in which the client has collected clinical data across disease areas and drugs. Its purpose is to identify the gaps in the publications database that prevent us from reconstructing that sheet from structured data alone (the bioloupe data lake database).

Last updated: 2026-03-28 (Issues 31-34 added from job 1635 CRC+ADC audit triage — dose cross-contamination to control arms, TTP→PFS misclassification, cross-tabulated subgroups, immature→Not Reached confusion)

| # | Title | Short description | Status |
| --- | --- | --- | --- |
| 1 | Trial subgroup disease propagation gap | Non-disease subgroups with disease-like labels (e.g. MSS-CRC, NSCLC) never get disease_id populated because propagation is gated on subgroup_type = 'disease' | Complete — 1,924 subgroups pending term match resolution |
| 2 | ASCO API content type blind spot | PresentationContentItem publications silently dropped — search filter, detail query, and NCT ID search all restricted to AbstractContentItem only | Complete |
| 3 | Publication dose context gap | Linked publications use trial-derived dose; publication-specific dose extraction only runs for unlinked publications; no structured dose fields (min/max/RP2D/units/frequency) | Complete — extraction + view join fixed by Issue 20 (v16 view) |
| 4 | AE grade classification gap | Individual named AE rows lack grade category (all_grade vs grade_gte3), preventing ranked “Most Frequent AE” export columns | Complete — superseded by Issue 7 full re-run |
| 5 | Prior therapy context not extracted | Min/max/median prior lines and prior therapy exposure (e.g. prior taxane, prior IO) not captured from publication abstracts despite being available in text | Complete — max_prior_lines data quality cleanup needed |
| 6 | Data cutoff date not extracted | Publication data cutoff date is stated in ~6K abstracts but not persisted as a structured field — needed for worksheet Data Cut column | Implementation complete — backfill complete |
| 7 | AE grade category too coarse | Binary all_grade/grade_gte3 enum forces grade 1-2 rows into all_grade, causing 50 inverted AE pairs and 7.9K misclassified rows — expand to 6-value enum and re-run backfill ($10) | Complete — enum expanded, backfill run, inverted pairs reduced from ~50 to 33 |
| 8 | max_prior_lines zero-sentinel contamination | LLM outputs 0 instead of null for unstated max prior lines, producing 124K unusable values including 12.9K logically impossible min > max rows | Complete — cleanup applied, 0 contradictions remain, residual zeros in 1L/Adj/Neo populations only |
| 9 | All-grade AE extraction gap | Originally ~13K pubs suspected; after investigation only ~14 have genuine misclassification (any-grade values labeled as grade≥3). Already fixed by Issue 7 enum expansion — re-extraction produces correct results | Complete — fixed by Issue 7 prompt, no additional changes needed |
| 10 | classify_publications drops identified subgroups | LLM drops ~15% of subgroups identified by extract_subgroups — ~9,700 publications affected across all sources. Prompt + schema + validation fix implemented and full pipeline re-extraction completed in prod | Complete |
| 11 | Empty endpoint extractions | All 102 publications with empty outcome_measures are correctly empty: trial designs, safety-only, biomarker studies, or truncated abstracts. Original worksheet gaps explained by Issue 10 + data availability | Closed — not an issue |
| 12 | Legacy Emerging Clinical Data query collapses subgroup-level results | Legacy EmergingClinicalDataQuery groups by [pub_id, disease_id, line, arm] and prefers “Overall” subgroup, hiding dose-level and biomarker-stratified data; the current ClinicalEvidenceQuery already preserves subgroup rows | Stale — superseded by ClinicalEvidenceQuery; legacy EmergingClinicalDataQuery still collapses subgroups |
| 13 | Technology filter excludes combination partner drugs | Query filters view rows by technology_id, removing combo partner drugs with different technologies — e.g. paclitaxel (chemo) filtered out when querying for BsAb, so Amivantamab+paclitaxel shows no combo partner | Complete — switched to fetch_combination_partners |
| 14 | Basket trial disease subgroups not extracted for minority cohorts | BNT324/DB-1311 abstract mentions SCLC, CRPC, NSCLC by name but not HNSCC (only “1 pt with BTC” style mentions) — HNSCC N=3 data was in poster/presentation only, not abstract text | Investigation complete — data availability limit |
| 15 | Disease extraction drops parent disease when subtype matches exist | build_match_set early-returns when subtype TermMatches succeed, skipping parent disease-name match — e.g. HNSCC (6200) dropped when H&N sub-sites match, making 1,856 pubs invisible under umbrella diseases | Complete — backfill ran 2026-03-18 |
| 16 | Confirmed ORR is not exported by EmergingClinicalDataQuery | Query/report endpoint whitelist omits cORR, so worksheet rows with Confirmed ORR (cORR) cannot be reconstructed even when ORR is present — folded into Issue 12 | Complete — confirmed boolean added, backfilled 3,061 rows |
| 17 | ASCO abstract + presentation copies create duplicate publication rows | ASCO ingestion saves AbstractContentItem and PresentationContentItem separately by source_id, so the same DOI can appear twice in the report | Investigation complete |
| 18 | PubMed-indexed journal article missing from publication corpus | The sqNSCLC worksheet row for Cofetuzumab now points to 10.1016/j.lungcan.2025.108492, but that article is absent from publications, so the row is still missing despite a valid journal source | Implementation complete — 2025 PubMed backfill pending |
| 19 | Biomarker context missing at subgroup level | Biomarkers are extracted at trial_disease_details level (disease scope), not per subgroup — ~13K biomarker-type subgroups like “EGFR-mutant” and “PD-L1 TPS≥1%” have no structured biomarker link, preventing biomarker-stratified export | Complete — extraction backfilled (52K records, 99%), matching pipeline run (67.3% matched), query layer aggregates multi-biomarker subgroups |
| 20 | study_plan_arm link is fragile and causes dose/drug/arm issues | vw_publication_efficacy_data joins through study_plan_arms for arm roles AND drug resolution — causing arm role failures (62% of rows), dose evidence drop (76% lost via drug_id mismatch), and row triplication. Merges Issue 3 dose gap. | Complete — v16 view deployed + arm_type backfill run in prod (2026-03-24) |
| 21 | Phase 1 basket trials report response counts, not ORR percentages | LLM extracts PR (count) faithfully from phase 1 abstracts reporting “1 PR in 9 HNSCC patients”, but the query only recognizes ORR (percentage). No ORR is derived, and fallback patient count inflates to the cross-tumor total | Complete — derived ORR in post_process + backfill |
| 22 | extract_subgroups doesn’t identify response counts as endpoints | When abstracts report best response narratively (“1 PR and 14 SD out of 29 patients”) without a formal ORR, extract_subgroups only identifies DCR and TTP as endpoints — individual response counts (PR, CR) are missed, so classify_publications can’t extract them | Complete — forward fix v2 + backfill v1+v2 run; 759→498 DCR-only pubs (remaining 498 verified clean) |
| 23 | Dose extraction misses implicit RP2D in phase I/II trials | When a phase I/II abstract says “dose levels of X and Y were chosen for phase II”, the dose extractor classifies this as a range (dose_min/dose_max) rather than RP2D — but in phase I/II trials, doses chosen for phase II ARE the RP2D by definition | Complete — backfill ran 2026-03-23 |
| 24 | Subgroup participant count wrong for biomarker sub-cohorts | KRAS-mutated CRC subgroup (pub 29737) reports n=7 but abstract states 13 KRAS-mutated patients with 7 having SD — LLM confused the SD count with the total KRAS cohort size | Complete — backfill ran 2026-03-23 |
| 25 | Confirmed vs unconfirmed ORR confusion in classify_publications | When abstracts report both confirmed and unconfirmed ORR (common in ADC trials), the LLM extracts the unconfirmed value but marks confirmed: true, or omits the confirmed ORR entirely — producing wrong cORR values and missing cORR endpoints | Incomplete — extraction residual post-fix, see 2026-03-26 audit findings |
| 26 | Parent population N propagated to child subgroups | classify_publications copies the parent subgroup’s number_of_participants to child subgroups instead of extracting the subset-specific N — ~5,058 child subgroups across 1,174 publications affected | Complete |
| 27 | extract_efficacy_metrics picks confirmed ORR as plain ORR | When both confirmed and unconfirmed ORR rows exist with the same N, max_by(number_of_participants) picks the confirmed row for the plain ORR metric — making ORR and cORR identical and the ORR value wrong | Investigation complete |
| 28 | build_result_rows collapses dose-level arms when study_plan_arm_id is null | Grouping key uses study_plan_arm_id which is null for publication-extracted arms — distinct dose cohorts (e.g. “8.0 mg/kg” vs “10.0 mg/kg”) sharing the same subgroup collapse into one row, silently dropping the lower-N arm | Investigation complete |
| 29 | Dose extraction captures study-level range, not efficacy population range | In dose-escalation studies, LLM extracts the full dose range (e.g. 1.0–8.3 mg/kg) even when efficacy is reported only for a subset (e.g. ≥4.0 mg/kg) — dose_min on the efficacy row is too low | Investigation complete |
| 30 | Cross-study data contamination from abstract background sections | LLM extracts efficacy values from a referenced prior study cited in the abstract’s background, attributing them to the current publication which has no efficacy data yet | Investigation complete |
| 31 | Investigational drug dose data bleeds onto control/comparator arms | pub_dose_lookup COALESCE fallback propagates investigational drug dose fields to control arms when publication_interventions.study_plan_arm_id is NULL — 2,890 rows across 566 publications | Investigation complete |
| 32 | TTP (time to progression) misclassified as PFS | LLM extraction maps TTP values to PFS endpoint — 149 publications mention TTP (not PFS) in abstract but have PFS extracted; additionally SD-subpopulation TTP values get attributed to full cohort | Investigation complete |
| 33 | Cross-tabulated subgroups not identified in basket trials | extract_subgroups identifies single-dimension subgroups (tumor type OR biomarker) but not the cross-product (tumor type × biomarker) when tabular data is present — ~366 pubs have both disease + biomarker subgroups that could have cross-tabulated data | Investigation complete |
| 34 | “Immature” endpoints extracted as “Not Reached” | LLM maps “not yet mature” / “data immature” to “Not Reached” — but immature means no median can be estimated (should be null), while “Not Reached” means median exceeds follow-up. ~71 pubs have immature language without “not reached” but have “Not Reached” extracted | Investigation complete |

Each issue entry should keep analysis and remediation separate.

Recommended issue structure:

  • Short summary
  • Where this sits in the current pipeline
  • Exact restriction causing the drop
  • Concrete examples
  • Downstream impact
  • What the issue is not
  • Scale
  • Spot checks
  • Open characterization questions
  • Explored solution direction
  • Solution applied

The “Solution applied” section should remain empty until an actual fix is agreed and implemented.

Backfill pattern: When an issue requires backfilling historical data, see the “One-Off Backfill Tasks” section in .claude/skills/backend-expert/SKILL.md.

1. Trial subgroup disease propagation gap

Publication subgroup rows can contain disease-like cohort labels in trial_subgroups.subgroup_value, but the current disease propagation path only assigns trial_subgroups.disease_id to subgroups whose subgroup_type is exactly disease.

If a subgroup is classified as analysis population or another non-disease type, its disease_id remains NULL even when:

  • the subgroup label is clearly disease-like, and
  • a high-confidence TermMatch already exists for that label.

This means disease-specific publication rows can fail to surface in reporting even though the publication contains a disease cohort tied to outcomes.

Current publication flow:

  1. extract_subgroups identifies subgroup labels and endpoint associations.
  2. classify_publications emits subgroup_outcome_measures, including:
    • type
    • value
    • linked outcome measures
  3. post_process_publications destroys and recreates publication.trial_subgroups from the LLM output.
  4. The separate publication disease workflow creates TermMatch records for subgroup disease strings and later post-processes them back into trial_subgroups.disease_id.

Relevant code paths:

The subgroup disease term population and subgroup disease post-processing are both restricted to subgroup_type = 'disease'.

In the model:

  • TrialSubgroup.disease_type is defined as where(subgroup_type: 'disease')
  • TrialSubgroup.populate_term_matches only iterates disease_type.with_subgroup_value

In the Thor task:

  • post_process_disease_matches builds the scope as:
    • TrialSubgroup.disease_type.with_subgroup_value.without_disease_id

So any subgroup classified as:

  • analysis population
  • clinical feature
  • mutation
  • patient characteristic
  • or any other non-disease type

is excluded from disease propagation, even if its subgroup_value is disease-like.
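
The gate can be sketched in plain Ruby (this mirrors the scopes described above but is not the real ActiveRecord code; the predicate name is hypothetical):

```ruby
# Plain-Ruby sketch of the type gate; fields mirror trial_subgroups columns.
Subgroup = Struct.new(:subgroup_type, :subgroup_value, :disease_id,
                      keyword_init: true)

def eligible_for_disease_propagation?(sg)
  sg.subgroup_type == 'disease' &&            # the disease_type gate
    !sg.subgroup_value.to_s.strip.empty? &&   # with_subgroup_value
    sg.disease_id.nil?                        # without_disease_id
end

mss_crc = Subgroup.new(subgroup_type: 'analysis population',
                       subgroup_value: 'MSS-CRC', disease_id: nil)
eligible_for_disease_propagation?(mss_crc)  # => false, despite the TermMatch
```

The MSS-CRC row fails the first conjunct, so it never enters term population or post-processing.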

Publication:

  • publications.id = 114077
  • title: A phase I study of INCA33890, a PD-1/TGFβR2 bispecific antibody, for advanced solid tumours
  • linked trial: NCT05836324

Publication-level disease rows:

  • trial_disease_details contains only:
    • 4116 = Solid Tumors

Subgroup row:

  • trial_subgroups.id = 210858
  • source_type = 'Publication'
  • source_id = 114077
  • subgroup_type = 'analysis population'
  • subgroup_value = 'MSS-CRC'
  • disease_id = NULL

But the disease matcher already knows what this means:

  • term_matches.id = 100095
  • subject_type = 'TrialSubgroup'
  • field = 'disease_name'
  • strategy = 'DiseaseMatching'
  • term = 'mss-crc'
  • final_result.id = 4345
  • final_result.score = 0.95
  • disease 4345 = Colorectal Cancer

So the system has a validated disease match for the normalized term, but it is never propagated to trial_subgroups.disease_id because the subgroup is analysis population, not disease.
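
What the gated post-processing step would do for this row, if it were allowed to run, amounts to a one-step copy. A plain-Ruby illustration (hashes stand in for the records; the 0.75 threshold is an inference from the “at threshold” note under the remaining-gap analysis, not confirmed):

```ruby
# Copy an existing high-confidence TermMatch result onto the subgroup row —
# the step that never runs for non-disease subgroup types.
def propagate_disease(subgroup, term_match, threshold: 0.75)
  return subgroup unless term_match[:score] >= threshold
  subgroup.merge(disease_id: term_match[:final_result_id])
end

subgroup   = { id: 210858, subgroup_value: 'MSS-CRC', disease_id: nil }
term_match = { id: 100095, final_result_id: 4345, score: 0.95 }

propagate_disease(subgroup, term_match)
# => { id: 210858, subgroup_value: 'MSS-CRC', disease_id: 4345 }
```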

The publication efficacy view uses subgroup disease from trial_subgroups, not from the linked clinical trial and not from trial_disease_details.

In /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql:

  • treatment_line_mapping reads trial_subgroups where source_type = 'Publication'
  • subgroup_disease_id is set directly from trial_subgroups.disease_id

The view does not join:

  • clinical_trial_end_diseases
  • trial_disease_details

for subgroup disease attribution.

So if a publication subgroup is disease-like but trial_subgroups.disease_id stays null, the view row does not carry that disease.

Later, in /Users/tomor/Sites/bioloupe-data-gov/app/queries/tpp/emerging_clinical_data_query.rb, filtering works like this:

  • prefer v.subgroup_disease_id
  • if that is null, fall back to trial_disease_details

For publication 114077, that fallback disease is only Solid Tumors, so the publication does not surface as CRC.
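
The preference order reduces to a two-branch rule. A sketch (function name and shapes are assumptions, not the real query class API):

```ruby
# Prefer the view's subgroup_disease_id; otherwise fall back to the
# publication-level trial_disease_details diseases.
def effective_disease_ids(subgroup_disease_id, trial_disease_detail_ids)
  return [subgroup_disease_id] if subgroup_disease_id
  trial_disease_detail_ids
end

# Pub 114077: subgroup_disease_id is NULL, so only Solid Tumors (4116) is
# reachable; CRC (4345) never surfaces even though the cohort is in the abstract.
effective_disease_ids(nil, [4116])   # => [4116]
effective_disease_ids(4345, [4116])  # => [4345]
```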

This is not primarily a missing TermMatch problem.

For the MSS-CRC example, the TermMatch already exists and is high-confidence. The failure is in propagation from the normalized term match back onto the subgroup record.

This is also not a clinical-trial disease issue. In this path, the effective disease used by the publication efficacy view comes from publication subgroup records, not from clinical_trials.

The system currently behaves as if:

  • trial_subgroups.disease_id means: “this subgroup is explicitly a disease subgroup”

But many real publication subgroup labels behave more like:

  • disease cohort embedded inside another subgroup class
  • disease-shaped analysis population
  • disease-plus-qualifier cohort

Examples:

  • MSS-CRC
  • Overall → RCC
  • Relapsed/Refractory AML
  • BCG-refractory NMIBC
  • Stage I NSCLC

These can carry real disease meaning even when the LLM classified the subgroup as analysis population or another non-disease type.

Scale of the issue in publication-sourced subgroup rows

For trial_subgroups.source_type = 'Publication' with disease_id IS NULL:

  • total null-disease subgroup rows: 140,057
  • distinct null-disease subgroup strings: 92,854

For subgroup_type = 'analysis population':

  • rows with non-empty subgroup_value and null disease_id: 88,623
  • distinct subgroup strings: 51,343

For non-disease subgroup rows overall:

  • rows with non-empty subgroup_value and null disease_id: 134,211
  • distinct subgroup strings: 88,382

Among publication analysis population rows specifically:

  • 1,720 rows already have an existing exact normalized high-confidence DiseaseMatching result available by term

Among all publication non-disease subgroup rows:

  • 2,422 rows already have an existing exact normalized high-confidence DiseaseMatching result available by term

This shows two things at once:

  1. there is recoverable disease signal being left unused
  2. most non-disease subgroup rows are not pre-validated disease matches

Many analysis population values are obviously not disease cohorts:

  • Overall
  • Responders
  • Evaluable patients
  • Monotherapy
  • Cohort 1
  • Placebo
  • Dose escalation
  • Healthy Volunteers
  • First-line
  • Japanese patients

So “map all non-disease subgroup types through disease matching” would push large volumes of junk terms into a disease-normalization process that was not designed for them.

These publication subgroup values look meaningfully disease-like and appear useful for disease attribution:

  • MSS-CRC -> Colorectal Cancer
  • Overall → RCC -> Renal Cell Carcinoma (RCC)
  • Overall → GIST -> Gastrointestinal Stromal Tumor (GIST)
  • Relapsed/Refractory AML -> Acute Myeloid Leukemia (AML)
  • BCG-refractory NMIBC -> Non-Muscle Invasive Bladder Cancer
  • Head and Neck Squamous Cell Carcinoma -> Head and Neck Squamous Cell Carcinoma (HNSCC)
  • NSCLC -> Non-Small Cell Lung Cancer (NSCLC)
  • Colorectal cancer -> Colorectal Cancer

These are the kinds of subgroups that currently fail to contribute disease-specific reachability if their subgroup_type is not disease.

Spot checks showing noise or semantic drift

These examples show why broad disease assignment on subgroup labels can produce incorrect or misleading disease attribution:

  • Previously untreated mPDAC -> matched to Multiple Myeloma at score 0.75
    • abbreviation collision
  • Relapsed/refractory cHL -> matched to Chronic Leukemia at score 0.825
    • clearly wrong
  • Overall → Carcinoma In Situ -> matched to Breast Ductal Carcinoma In Situ at score 0.85
    • wrong in a bladder-cancer context
  • Bone metastases -> matched to Bone Metastasis
    • may be useful as a retrieval concept but not necessarily the publication’s disease cohort

These are not hypothetical edge cases. They already exist in the term-matching results.

Because subgroup_disease_id from publication subgroups is preferred when present, this issue affects:

  • disease-specific publication discovery
  • disease-specific efficacy row inclusion
  • downstream CSV/report completeness for basket and umbrella studies
  • publications whose abstract reports disease cohorts under non-disease subgroup types

The observed failure mode is:

  1. publication contains a disease cohort in subgroup results
  2. subgroup gets created with a non-disease type
  3. subgroup disease propagation never runs
  4. vw_publication_efficacy_data row has subgroup_disease_id = NULL
  5. reporting falls back to publication-level disease or misses the disease entirely

The system currently treats subgroup disease attribution as a type-gated post-processing step:

  • only subgroup_type = 'disease' is eligible

But in actual publication abstracts, disease-bearing cohort labels are often emitted under other subgroup types, especially analysis population.

As a result, the pipeline loses disease information that is already present in subgroup text and, in some cases, already normalized in term_matches.

These are not proposed fixes. They are the unresolved aspects of the issue:

  • Is trial_subgroups.disease_id intended to mean “authoritative disease cohort” or “retrieval-relevant disease tag”?
  • Should disease-bearing analysis population subgroups be treated differently from clearly non-disease analysis population values like Responders or Cohort 1?
  • Should metastatic-site labels such as Bone metastases count for publication reachability, given that the ontology already contains the corresponding disease concept?
  • When subgroup disease is null, should the fallback to trial_disease_details be interpreted as publication-level disease rather than subgroup-level disease?

The explored direction is not to map every subgroup directly into trial_subgroups.disease_id.

That would continue to create incorrect disease assignments, just with a different error pattern:

  • fewer abbreviation-only failures
  • more context-overreach failures

The better conceptual shape is:

  extract_subgroups
    → classify_publications
    → subgroup disease adjudication (LLM, contextual)
    → post_process / disease matching

The key idea is to separate two questions that are currently blurred together:

  1. Is this subgroup actually disease-like?
  2. If yes, which disease concept should it map to?

The explored adjudication step would analyze subgroup rows in publication context and emit something like:

  • semantic class:
    • disease_cohort
    • disease_related_context
    • not_disease
  • normalized disease phrase, if applicable
  • evidence quote/span
  • confidence

Behavioral intent of those outputs:

  • disease_cohort
    • subgroup is a real disease-bearing cohort
    • eligible to write into authoritative trial_subgroups.disease_id
  • disease_related_context
    • subgroup contains disease signal that may help publication reachability or filtering
    • should not automatically overwrite authoritative subgroup disease semantics
    • may belong in a separate retrieval/tag field rather than trial_subgroups.disease_id
  • not_disease
    • subgroup remains unmapped for disease attribution
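
The behavioral intent above can be summarized as one payload plus one routing rule. An illustrative sketch (hash keys follow the fields listed above; the values and helper name are made up for the MSS-CRC example):

```ruby
# Example adjudication result for the MSS-CRC subgroup (illustrative values).
adjudication = {
  semantic_class: 'disease_cohort',  # or 'disease_related_context' / 'not_disease'
  normalized_disease_phrase: 'Colorectal Cancer',
  evidence: 'MSS-CRC',
  confidence: 0.95
}

# Only disease_cohort results may write the authoritative
# trial_subgroups.disease_id; the other two classes never do.
def may_write_disease_id?(result)
  result[:semantic_class] == 'disease_cohort'
end

may_write_disease_id?(adjudication)                               # => true
may_write_disease_id?(semantic_class: 'disease_related_context')  # => false
```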

This distinction matters because the current system uses trial_subgroups.disease_id as an authoritative signal in reporting, not just as a search helper.

So if all subgroup strings are pushed directly into the existing disease_id field, the reports inherit those assignments as if they were true disease cohorts.

That is acceptable for:

  • MSS-CRC
  • Overall → RCC
  • Relapsed/Refractory AML
  • BCG-refractory NMIBC

But not acceptable for:

  • Responders
  • Cohort 1
  • Placebo
  • Evaluable patients
  • ambiguous or mis-normalized strings like Relapsed/refractory cHL
  • context-sensitive strings like Carcinoma In Situ

Placement options explored:

  1. In the main publication workflow:

    • after classify_publications
    • before post_process_publications
    • this would affect subgroup creation semantics earlier
  2. In the publication disease branch:

Current preferred exploration direction:

  • yes, a new LLM subgroup adjudication step makes sense
  • no, it should not directly map all subgroups into the existing authoritative disease_id field
  • analysis population is the best first expansion target
  • the main gain comes from separating:
    • “is this disease-like?”
    • from
    • “which disease is it?” using publication context rather than term-only matching

Solution applied:

Implemented contextual LLM subgroup disease adjudication for all non-disease publication subgroups (~132K rows, ~89K distinct values).

Scope: All publication-sourced subgroups where subgroup_type != 'disease', including analysis population (89K rows), clinical feature (25K), mutation (10K), patient characteristic (2.4K), and smaller types. Spot checks confirmed disease-bearing labels appear across all these types (e.g. Metastatic Urothelial Carcinoma → PD-L1- under clinical feature, Relapsed/refractory multiple myeloma → del17p under mutation).

Estimated cost: ~$50 with gpt-5-mini for full backfill.

New code:

  • app/tasks/subgroup_disease_adjudication/task.rb — LLM adjudication task that classifies publication subgroup labels as disease_cohort, disease_related_context, or not_disease, with a normalized disease phrase, evidence span, and confidence score.
  • app/tasks/subgroup_disease_adjudication/response.rb — JSON schema for the adjudication response (StoreModel + DataTasks::JsonSchema).

Modified code:

  • app/models/trial_subgroup.rb — Added adjudicated_disease_cohort scope. Updated populate_term_matches to also generate TermMatch entries for adjudicated disease_cohort subgroups using the LLM-provided normalized_disease_phrase.
  • lib/tasks/clinical_trials/trial_subgroups.thor — Added adjudicate_subgroup_diseases Thor task for CLI access. Updated post_process_disease_matches to process both explicit disease-type subgroups and adjudicated disease cohort subgroups.
  • app/workflows/publication_disease_workflow.rb — Added adjudicate_subgroup_diseases step before populate_disease_terms_for_trial_subgroups in the workflow graph.

How it works:

  1. Adjudication runs on all publication-sourced non-disease subgroups and persists the result on trial_subgroups.llm_data['subgroup_disease_adjudication'].
  2. Only semantic_class = 'disease_cohort' subgroups enter the DiseaseMatching term population and post-processing paths.
  3. disease_related_context and not_disease subgroups remain excluded from trial_subgroups.disease_id.
  4. No changes to vw_publication_efficacy_data or Tpp::EmergingClinicalDataQuery — they consume the newly populated subgroup_disease_id automatically.

Initial spot check (15 random subgroups): All classifications correct. Disease cohorts (AML, mCRPC, CML, melanoma, solid tumors) correctly identified. Metastatic sites, biomarkers, treatment arms, dose levels, and healthy controls correctly excluded.

Pending at initial review: manual verification on a curated sample before the broad backfill. The backfill has since run; coverage figures follow.

Coverage: 134,061 / 134,211 non-disease subgroups adjudicated (99.9%). 34,652 classified as disease_cohort, of which 32,811 (94.7%) received disease_id.

Tracker example verified: Pub 114077, MSS-CRC subgroup (id 210858) correctly resolved: disease_id = 4345 (Colorectal Cancer), flows through vw_publication_efficacy_data.

Remaining gap — 1,841 disease_cohort subgroups without disease_id:

The populate_term_matches step has already run after adjudication. TermMatch rows exist for these terms — the gap is in the DiseaseMatching resolution pipeline itself, which is expected behavior in most cases.

The unresolved terms fall into categories that are inherent to the disease ontology design:

  1. Broad disease concepts not in simplified tree (e.g. “lymphoma”, “mesothelioma”).

    • Disease 4668 = “Lymphoma” exists in diseases but has simplified = false — intentionally excluded from the matchable disease set.
    • The DiseaseMatching pipeline correctly found only subtypes (Follicular, Hodgkin, etc.) as candidates, rejected them as too narrow, and returned null.
    • Verified against abstracts: these publications genuinely reference “lymphoma” without specifying a subtype (e.g. pub 90447: “relapsed/refractory lymphomas”; pub 119434: “newly diagnosed lymphoma”). The LLM adjudication correctly normalized to “Lymphoma” because the abstracts don’t provide enough context to be more specific.
    • Same pattern for “mucosal melanoma”, “mesothelioma” — the broad concept isn’t in the simplified tree, and the abstracts don’t specify further.
  2. Non-oncology diseases correctly absent from ontology.

    • “Polycystic Ovary Syndrome” (41 subgroups), “Uterine leiomyoma” (10), “Sepsis” (9): not in our hemonc-focused disease ontology. These subgroups were correctly adjudicated as disease_cohort by the LLM (they are disease cohorts), but the diseases themselves are out of scope.
  3. Too-generic terms below matching threshold.

    • “Cancer” (21 subgroups): score 0.35, too broad. “Advanced cancer” (19): score 0.75, at threshold. “Pediatric cancer” (13): score 0.7, below threshold.
  4. Finalization pipeline edge cases.

    • “Muscle-invasive urothelial carcinoma” (20 subgroups): Round 1 and Round 2 both agreed on disease 4424 (Muscle Invasive Bladder Cancer), judgment accepted with 0.9 confidence, but the majority-vote finalization step still produced null. This may warrant investigation as a potential finalization bug.
    • “Gastric and gastroesophageal junction adenocarcinoma” (12): compound disease phrase where the matcher couldn’t resolve to a single disease.

Assessment: The 1,841 gap is largely expected — broad/generic/out-of-scope terms that the disease tree intentionally doesn’t cover. The only potentially actionable subset is the ~32 subgroups affected by the finalization edge case (pattern 4), which may be a bug in the majority-vote logic.

2. ASCO API content type blind spot drops PresentationContentItem publications

The ASCO GraphQL API classifies conference content into multiple __typename variants: AbstractContentItem, PresentationContentItem, PosterContentItem, VideosSlidesContentItem, JournalContentItem, and SessionContentItem. Our ingestion pipeline only handles AbstractContentItem — in both the search filter and the detail query. Publications typed as PresentationContentItem (and potentially PosterContentItem) are silently dropped.

ASCO ingestion flow in app/services/publications/asco_api_service.rb:

  1. fetch_abstract_hits sends a GraphQL Search query with filters: { contentTypes: ['Abstract'] }.
  2. fetch_full_abstract_detail sends getContentByUID with a single inline fragment: ... on AbstractContentItem { uid title body doi ... }.
  3. save_publication receives the detail result and persists it.

Triggered from lib/tasks/clinical_trials/publications.thor via:

bundle exec thor clinical_trials:publications:import_from asco [options]

Three failure points, any one of which is sufficient to lose a publication:

1. Search filter excludes non-Abstract content types

filters_hash = { contentTypes: ['Abstract'] }

For wildcard searches (userInput: '*'), the ASCO API strictly filters by contentTypes. A PresentationContentItem is not returned when contentTypes: ['Abstract'] is used with a wildcard query.

Verified via API:

  • userInput: '*', contentTypes: ['Abstract'], years: [2025] → returns only hex UIDs (AbstractContentItem)
  • userInput: '*', contentTypes: ['Presentation'], years: [2025] → returns only PRESENTATION* UIDs
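
Given the verified behavior above, one hedged fix for the search gap is to run the Search query once per content type instead of hard-coding ['Abstract']. A sketch (the helper is hypothetical, not the real asco_api_service API; the filter shape mirrors filters_hash above):

```ruby
# Build one Search filters hash per content type of interest.
CONTENT_TYPES = %w[Abstract Presentation].freeze

def search_filter_variants(years:)
  CONTENT_TYPES.map { |type| { contentTypes: [type], years: years } }
end

search_filter_variants(years: [2025])
# => [{ contentTypes: ["Abstract"], years: [2025] },
#     { contentTypes: ["Presentation"], years: [2025] }]
```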

2. NCT ID text search returns zero hits for PresentationContentItem records

The ASCO search API does not index the clinicalTrialRegistryNumber field for search. Searching userInput: 'NCT05701709' returns zero hits regardless of contentTypes filter, even though the record has clinicalTrialRegistryNumber: 'NCT05701709' in its data.

Verified:

search(userInput: "NCT05701709", filters: {}) → 0 hits
search(userInput: "NCT05701709", filters: {contentTypes: ["Abstract"]}) → 0 hits
search(userInput: "SHR A2102", filters: {contentTypes: ["Abstract"]}) → finds PRESENTATION245980

This means the disease-specific ingestion path (which searches by NCT ID) can never discover this publication.

3. Detail query GraphQL fragment only matches AbstractContentItem

... on AbstractContentItem { uid title body doi clinicalTrialRegistryNumber ... }

When getContentByUID returns a PresentationContentItem, the fragment does not match. The result is {}. save_publication then sees a blank title and silently skips the record:

if publication_data[:title].blank?
  increment_stat(:skipped)
  Rails.logger.warn("ASCO Abstract #{abstract_data['uid']} has no title")
  return :skipped
end
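The failure chain can be reproduced in isolation: an unmatched fragment yields an empty detail hash, and the blank-title guard turns that into a silent skip. A minimal sketch in plain Ruby (`save_publication_stub` is a hypothetical stand-in for the Rails service method, not the real code):

```ruby
# Simulates the guard in save_publication: a detail result whose GraphQL
# fragment did not match comes back as {}, so the title is blank and the
# record is skipped rather than raising an error.
def save_publication_stub(abstract_data)
  return :skipped if abstract_data['title'].to_s.strip.empty?
  :saved
end

# AbstractContentItem: fragment matched, full payload persisted
matched = { 'uid' => '12345abc', 'title' => 'Phase 1 trial of ...' }
# PresentationContentItem: fragment did not match, empty payload
unmatched = {}

results = [matched, unmatched].map { |data| save_publication_stub(data) }
# => [:saved, :skipped] (the presentation vanishes with only a log warning)
```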

Publication: DOI 10.1200/JCO.2025.43.16_suppl.107

  • Title: “Phase 1 trial of SHR-A2102, a nectin-4-directed antibody drug conjugate (ADC), in advanced solid tumors.”
  • ASCO UID: PRESENTATION245980
  • __typename: PresentationContentItem
  • clinicalTrialRegistryNumber: NCT05701709
  • Drug: SHR-A2102 (drug_id 13643, known in our system)
  • Trial: NCT05701709 (clinical_trial_id 51789, linked to “Solid Tumors” disease)
  • ESMO version of same study: publication_id 65886, successfully ingested and linked to trial

API verification:

# Search finds nothing by NCT ID
search(userInput: "NCT05701709") → 0 hits
# Search finds it by drug name
search(userInput: "SHR A2102") → PRESENTATION245980 (score 19.66)
# Detail with AbstractContentItem fragment → empty
getContentByUID("PRESENTATION245980") with ... on AbstractContentItem → result: {}
# Detail with PresentationContentItem fragment → full data
getContentByUID("PRESENTATION245980") with ... on PresentationContentItem → title, body, doi, NCT ID, authors ✓
  • Missing ASCO publications for trials where the abstract is classified as Presentation
  • This particularly affects oral presentations and plenary sessions (low abstract numbers like 107), which are often the highest-impact results
  • Disease-specific reporting misses these publications entirely
  • Trial publication counts are understated
  • Not a disease-mapping problem — the drug and trial are correctly linked in our system
  • Not a timing/availability problem — the abstract is live in the ASCO API
  • Not specific to Chinese trials or specific sponsors — this is a content classification issue on the ASCO API side
  • Not a one_off_jobs issue — job 1022 (Dec 31 wildcard run) did run but could not discover these due to the contentTypes filter

ASCO API schema introspection reveals 6 content item types. Four have DOI + clinicalTrialRegistryNumber + body fields:

| Type | Has DOI | Has NCT ID field | Has Body | Currently handled |
| --- | --- | --- | --- | --- |
| AbstractContentItem | yes | yes | yes | yes |
| PresentationContentItem | yes | yes | yes | no |
| PosterContentItem | yes | yes | yes | no |
| VideosSlidesContentItem | yes | yes | yes | no |
| JournalContentItem | yes | no | yes | no |
| SessionContentItem | no | no | yes | no |

The exact count of PresentationContentItem records in ASCO is not easily determined (the API returns paginated results of 10 per page), but a drug-name search returning PRESENTATION UIDs alongside Abstract UIDs confirms they represent a meaningful fraction of conference content.

Our ASCO 2025 Annual Meeting coverage: 1,102 abstracts out of an estimated 5,000-6,000+ total — the gap is likely partly explained by this issue.

  • PRESENTATION245980 (DOI 10.1200/JCO.2025.43.16_suppl.107): SHR-A2102 Phase 1 in solid tumors — missing
  • PRESENTATION243121 (DOI 10.1200/JCO.2025.43.5_suppl.657): SHR-A2102 in urothelial carcinoma — missing

Both are PresentationContentItem with full abstract text, NCT IDs, authors, and DOIs available.

  • What fraction of ASCO Annual Meeting oral presentations are classified as PresentationContentItem vs AbstractContentItem?
  • Are PosterContentItem records also carrying unique abstracts we’re missing, or do they duplicate AbstractContentItem records?
  • Should VideosSlidesContentItem be ingested (they carry DOI and NCT ID fields)?

The fix is contained entirely in app/services/publications/asco_api_service.rb. Two methods need changes:

1. fetch_abstract_hits — broaden the search contentTypes filter

Current (line 93):

filters_hash = { contentTypes: ['Abstract'] }

Change to:

filters_hash = { contentTypes: ['Abstract', 'Presentation'] }

This ensures the wildcard search (userInput: '*') returns both AbstractContentItem and PresentationContentItem records. The ASCO API enforces contentTypes strictly for wildcard queries, so without adding 'Presentation' these records never appear in search results.

PosterContentItem is excluded for now — open question whether posters carry unique abstract content or duplicate what’s already in AbstractContentItem records. Can be added later if spot checks show unique content.

2. fetch_full_abstract_detail — add a PresentationContentItem inline fragment

Current query (lines 130–157) uses only:

... on AbstractContentItem { uid title body doi clinicalTrialRegistryNumber ... }

Add a second fragment with the shared fields that both types expose:

... on PresentationContentItem {
  uid
  title
  body
  doi
  clinicalTrialRegistryNumber
  journalCitation
  taxonomy { subjectsThes drugsThes }
  publishDate { start }
  authors { displayName role publicationOrganization }
}

These are the same fields already requested from AbstractContentItem. The PresentationContentItem schema exposes all of them (verified via schema introspection). GraphQL will match whichever fragment corresponds to the returned __typename and populate the result identically.
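Structurally, the combined query looks like the sketch below. `SHARED_FIELDS` is a hypothetical constant standing in for the full field list both content types expose (authors, taxonomy, etc. omitted for brevity):

```ruby
# Sketch of a detail query carrying both inline fragments. GraphQL resolves
# at most one fragment per returned __typename, so both branches can request
# identical field names without conflict.
SHARED_FIELDS = 'uid title body doi clinicalTrialRegistryNumber'

def content_detail_query
  <<~GRAPHQL
    query($uid: String!) {
      getContentByUID(uid: $uid) {
        __typename
        ... on AbstractContentItem { #{SHARED_FIELDS} }
        ... on PresentationContentItem { #{SHARED_FIELDS} }
      }
    }
  GRAPHQL
end
```

The unmatched branch simply contributes nothing, which is why adding the second fragment is backward compatible for AbstractContentItem records.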

No changes needed in save_publication — the downstream code reads abstract_data['title'], abstract_data['body'], etc. by string key. As long as the GraphQL fragment returns the same field names, save_publication works unchanged.

Deduplication — save_publication already uses Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id]), where source_id is the ASCO uid. Since PresentationContentItem records have distinct UIDs (e.g. PRESENTATION245980), they will not collide with existing AbstractContentItem records. If a presentation and an abstract share the same DOI but different UIDs, both would be saved — but find_or_initialize_by on source_id prevents true duplicates.
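The dedup behavior can be modeled with an in-memory store keyed the same way (a plain-Ruby stand-in for `find_or_initialize_by`, not the ActiveRecord implementation):

```ruby
# In-memory stand-in for Publication.find_or_initialize_by(source:, source_id:):
# records are keyed by (source, source_id), so re-importing the same ASCO uid
# updates in place while distinct PRESENTATION* uids create new rows.
STORE = {}

def upsert_publication(source:, source_id:, title:)
  key = [source, source_id]
  created = !STORE.key?(key)
  STORE[key] = { source: source, source_id: source_id, title: title }
  created ? :created : :updated
end

upsert_publication(source: 'ASCO', source_id: 'PRESENTATION245980', title: 'SHR-A2102 phase 1')
# Re-import of the same uid is an update, not a duplicate:
upsert_publication(source: 'ASCO', source_id: 'PRESENTATION245980', title: 'SHR-A2102 phase 1')
# An abstract sharing the DOI but carrying a different uid is a separate row:
upsert_publication(source: 'ASCO', source_id: '12345abc', title: 'SHR-A2102 phase 1')
# STORE.size => 2
```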

What this does not fix — the NCT ID search blind spot (failure point 2). The ASCO API does not index clinicalTrialRegistryNumber for text search regardless of content type. So the disease-specific ingestion path (userInput: 'NCT05701709') will still return zero hits for PresentationContentItem records. This is an ASCO API limitation outside our control. The fix works because the wildcard path (userInput: '*') will now find these records, and they will be correctly saved and linked to trials via clinicalTrialRegistryNumber at save time.

Updated app/services/publications/asco_api_service.rb with three changes:

  1. Search filter: contentTypes: ['Abstract'] → contentTypes: ['Abstract', 'Presentation'] in fetch_abstract_hits.
  2. Detail query: Added ... on PresentationContentItem { ... } inline fragment with identical fields to fetch_full_abstract_detail.
  3. Performance: Parallelized detail fetches using Parallel.map(hits, in_threads: 5) in fetch_publications_by_criteria.

No changes to save_publication — fields are identical across both content types.

Verification: Test run confirmed PRESENTATION-prefixed UIDs are returned by search, detail query resolves fields correctly, and publications save to the database with source: 'ASCO', category: 'ASCO Abstract', and correct titles/metadata.

3. Publication dose context is trial-derived for linked result publications and still too unstructured for worksheet parity


The disease clinical evidence worksheet needs publication dose fields with substantially more precision than our current publication pipeline can provide:

  • Dose (if only one dose was used)
  • Dose Min
  • Dose Max
  • RP2D
  • Dose Units
  • Dose Frequency

Today, most linked result publications never get publication-specific arm/intervention extraction at all. They still surface a dose in /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql, but that value is usually coming from trial study-plan interventions, not from the publication abstract.

Even when publication-specific intervention extraction does run, it only persists a free-text publication_interventions.dose string. That is enough to display a single dose blob, but not enough to reproduce the worksheet columns the client is maintaining manually in spreadsheet 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8.

Current publication flow:

  1. /Users/tomor/Sites/bioloupe-data-gov/app/workflows/publications_workflow.rb runs extract_interventions before endpoint and AE processing.
  2. /Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/intervention_extraction.rb writes llm_data['intervention_arms'].
  3. /Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/drug_linker.rb persists publication_interventions and publication_arm_interventions.
  4. /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql builds drug_interventions for reporting:
    • linked publications use vw_bioloupe_interventions
    • only unlinked publications use publication_interventions
  5. /Users/tomor/Sites/bioloupe-data-gov/app/queries/tpp/emerging_clinical_data_query.rb reads v.dose as a single free-text field.

There are two separate restrictions, and they compound.

Restriction 1: intervention extraction is scoped to unlinked publications

In /Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/intervention_extraction.rb, base_scope is:

Publication.workflow_eligible
.unlinked_to_trials
.hematology_oncology_relevant
.where("(llm_data -> 'intervention_arms') is null")

So once a result publication is linked to a trial, it normally never enters publication arm extraction.

Restriction 2: the efficacy view only uses publication_interventions for publications without a trial link

In /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql, drug_interventions explicitly says:

  • sources 1a/1b/1c use vw_bioloupe_interventions for linked publications
  • source 2 uses publication_interventions
  • source 2 is restricted by:
WHERE pct.clinical_trial_id IS NULL and pi.source_type='Publication'

That means linked publications can show a dose, but it is almost always trial-derived.
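The view's source selection boils down to a conditional on the trial link. A toy model in plain Ruby (all names and records illustrative, not the real SQL):

```ruby
# Toy model of how vw_publication_efficacy_data picks a dose source today:
# a trial link forces the trial-derived path even when the publication has
# its own dose language.
def reported_dose(pub, trial_interventions, publication_interventions)
  if pub[:clinical_trial_id]
    # sources 1a/1b/1c: trial study-plan interventions
    trial_interventions.dig(pub[:clinical_trial_id], :dose) || 'not specified'
  else
    # source 2: publication-derived interventions (unlinked pubs only)
    publication_interventions.dig(pub[:id], :dose)
  end
end

trial_doses = { 51789 => { dose: nil } } # study plan has no usable dose text
pub_doses   = { 75999 => { dose: '0.1–3.0 mg/kg (dose-escalation cohorts)' } }

linked   = { id: 66552, clinical_trial_id: 51789 }
unlinked = { id: 75999, clinical_trial_id: nil }

reported_dose(linked, trial_doses, pub_doses)   # => "not specified"
reported_dose(unlinked, trial_doses, pub_doses) # => "0.1–3.0 mg/kg (dose-escalation cohorts)"
```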

Example 1: publication 66552 (BL-B01D1 in ESCC, ESMO 2024)


Publication:

  • publications.id = 66552
  • title: BL-B01D1, an EGFR x her3 bispecific antibody-drug conjugate (ADC), in patients with locally advanced or metastatic esophageal squamous cell carcinoma (ESCC)
  • linked trial: NCT05262491

Abstract dose language:

  • 2.0, 2.5 and 3.0 mg/kg D1D8 Q3W
  • 2.5mg/kg (RP2D)

Current persisted state:

  • jsonb_array_length(publications.llm_data->'intervention_arms') = 0
  • no publication_interventions rows
  • vw_publication_efficacy_data.dose = 'not specified'

But the worksheet row in the client spreadsheet is manually decomposed into:

  • Dose Min = 2
  • Dose Max = 2.5
  • RP2D = 2.5
  • Dose Units = mg/kg
  • Dose Frequency = 2Q3W

So the publication abstract contains the dose context the worksheet needs, but the current linked-publication path discards it and falls back to trial-level not specified.

Example 2: publication 133793 (simmitinib, ASCO 2024)


Publication:

  • publications.id = 133793
  • title: First-in-human study of simmitinib, a novel tyrosine kinase inhibitor targeting FGFR1-3, KDR and CSF-1R.
  • linked trial: NCT04058587

Abstract dose language:

  • dose escalation 1 to 9 mg orally
  • expansion regimens 4 mg QD, 6 mg QD, and 6 mg 3 weeks on 1 week off

Current persisted state:

  • no llm_data['intervention_arms']
  • no publication_interventions
  • vw_publication_efficacy_data.dose = 'starting dose 1mg/d'

This is not just incomplete. It is directionally misleading for reporting because the publication result set includes later expansion regimens and the worksheet needs to distinguish min/max/RP2D/schedule.

Example 3: publication 75999 (MRG003, ESMO 2021) shows the partial success case


Publication:

  • publications.id = 75999
  • title: FIH phase I dose escalation and dose expansion study of anti-EGFR ADC MRG003 in patients with advanced solid tumors
  • no linked trial

Current persisted state:

  • jsonb_array_length(publications.llm_data->'intervention_arms') = 5
  • publication_interventions.dose = '0.1–3.0 mg/kg (dose-escalation cohorts)'
  • publication_interventions.schedule = 'Q3W'
  • vw_publication_efficacy_data.dose echoes the same free-text dose

This proves the existing publication arm extraction can capture publication-derived dosing when the publication is unlinked.

But it also shows the second gap:

  • the persisted output is still one free-text dose blob
  • the expansion dose 2.5 mg/kg Q3W is not decomposed into worksheet-ready columns
  • RP2D is not persisted separately

So broadening extraction scope alone will improve provenance, but not worksheet parity.

  • The disease clinical evidence export cannot reliably recreate the client worksheet from our publication database.
  • Linked publication rows can carry a dose string that looks structured enough to trust, while actually reflecting trial-plan interventions rather than the publication cohort being reported.
  • Basket, dose-escalation, dose-expansion, and subgroup-specific publications are especially exposed because publication dose often differs from the trial’s broad intervention description.
  • The existing add-dose-column-to-emerging-data direction is useful for visibility, but it does not solve the worksheet problem because the export needs decomposed dose fields, not only a single free-text dose.

This is not just a missing CSV column problem.

Exposing vw_publication_efficacy_data.dose more widely would still leave us with:

  • linked publications whose dose came from trial interventions instead of the publication
  • free-text values like not specified, specified dose, dose escalation, and starting dose 1mg/d
  • no reliable dose_min, dose_max, rp2d, dose_units, or dose_frequency

This is also not purely a trial curation problem.

In many cases the trial registry is doing exactly what it should: storing planned intervention doses at the study-plan level. The problem is that the publication is often talking about:

  • a subset of dose-escalation cohorts
  • a specific expansion dose
  • a weight-banded administration rule
  • a recommended phase 2 dose selected after escalation
  • a disease-specific cohort inside a broader trial

That context exists in the publication narrative, not necessarily in the linked study plan.

This is also not a good regex problem.

Dose strings in the worksheet and in publication text mix:

  • ranges
  • RP2D statements
  • schedules like Q3W, 2Q3W, QD, days 1, 8, and 15 of a 28-day cycle
  • weight-banded doses
  • escalation plus expansion language in the same abstract

We should not try to derive worksheet fields from vw_publication_efficacy_data.dose with string-splitting heuristics.
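To make this concrete, here is a deliberately naive range heuristic (purely illustrative, not proposed code): it handles the simple simmitinib range and then returns nothing for the BL-B01D1 enumerated dose levels, because "X, Y and Z mg/kg" has no "X to Y" shape:

```ruby
# A naive range-extraction heuristic, shown only to illustrate why string
# splitting is not sufficient for worksheet fields.
NAIVE_RANGE = /(\d+(?:\.\d+)?)\s*(?:to|-|–)\s*(\d+(?:\.\d+)?)\s*(mg\/kg|mg)/

def naive_min_max(text)
  m = NAIVE_RANGE.match(text)
  m && { min: m[1].to_f, max: m[2].to_f, units: m[3] }
end

naive_min_max('dose escalation 1 to 9 mg orally')
# => {min: 1.0, max: 9.0, units: "mg"}

# Enumerated dose levels with a schedule defeat the pattern entirely, so the
# real min/max (2.0 and 3.0 mg/kg) are never found:
naive_min_max('2.0, 2.5 and 3.0 mg/kg D1D8 Q3W')
# => nil
```

RP2D statements, weight bands, and mixed escalation/expansion text each need their own patterns, and the combinations multiply; that is the core argument for schema-constrained LLM extraction instead.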

Current warehouse counts:

  • linked result publications: 53,701
  • linked result publications with any publication_interventions: 79
  • linked result publications with publication-derived dose in publication_interventions: 50
  • linked result publications with llm_data['intervention_arms']: 87
  • linked result publications with a nonblank vw_publication_efficacy_data.dose: 36,840
  • linked result publications with view dose but no publication-derived dose: 36,803

This is the key shape of the issue:

  • dose appears broadly in reporting
  • publication-specific dose provenance is almost absent for linked results

Contrast:

  • unlinked result publications with publication-derived dose in publication_interventions: 2,374

The field shape is also not export-ready even when populated:

  • vw_publication_efficacy_data rows with nonblank dose: 489,397
  • distinct dose strings in the view: 18,002
  • rows with obviously ambiguous values like not specified, not reported, or escalation-only labels: 45,236

Representative high-frequency values in the view:

  • not specified (28,138 rows)
  • specified dose (6,927 rows)
  • escalating doses (1,434 rows)
  • dose escalation (1,287 rows)

For publication-derived doses specifically:

  • publication_interventions rows with nonblank dose: 4,668
  • distinct publication-derived dose strings: 3,123
  • rows with structurally complex dose text (ranges, RP2D text, schedules): 775

Examples of currently persisted publication-derived dose strings:

  • 0.1–0.9 mg/m2 (administered over 1–10 minutes); RP2D 0.7 mg/m2 over 10 minutes
  • 0.05 mg/kg rounded to nearest 1.5 mg; weight-band doses used: 1.5 mg (<30 kg), 3 mg (30–60 kg), 4.5 mg (60–90 kg)
  • 1000 mg/m2 on days 1 and 8 every 3 weeks

These are useful raw evidence strings, but they are not already normalized worksheet fields.

Linked publications where publication text clearly contains richer dose context than the current export path:

  • 66552 (BL-B01D1, ESCC): publication says 2.0, 2.5 and 3.0 mg/kg D1D8 Q3W; view says not specified
  • 133793 (simmitinib): publication says 1 to 9 mg, 4 mg QD, 6 mg QD, 6 mg 3 weeks on 1 week off; view says starting dose 1mg/d
  • 240515 (amivantamab OrigAMI-1): worksheet needs the weight-based regimen; current linked-publication path has no publication intervention extraction at all

Unlinked publication showing the existing extraction path works but is still too shallow:

  • 75999 (MRG003): publication-derived dose and schedule are persisted, but only as raw text rather than decomposed worksheet fields
  • The authoritative persistence grain should be publication + arm + subgroup, interpreted as the smallest defensible publication-result scope.
  • We should not force false precision. Some dose evidence will legitimately be:
    • publication-level
    • publication + arm
    • publication + subgroup
    • publication + arm + subgroup
  • publication + disease is too coarse for dose evidence because dose usually follows treatment context, not just disease context.
  • Publication intervention extraction should run for all result publications, not just unlinked publications and not just records currently missing rows.
  • Operationally, reruns can still be versioned/idempotent so we only refresh missing, stale, or schema-changed records.
  • When a publication reports both escalation and expansion cohorts, we should persist:
    • the raw dose evidence text
    • a structured cohort array
    • and derive a preferred export dose per report row from the matching publication context
  • We should not persist one publication-wide preferred dose detached from arm/subgroup context.
  • Publication-derived dose should be treated as the source of truth for publication-backed rows when it matches the same or narrower context than the row being exported.
  • Linked trial dose remains fallback context only when the publication is silent or too vague to support a row-level dose assignment.
  • We do want evidence quotes/spans and confidence for extracted dose claims such as RP2D, units, schedule, or frequency. This is primarily for analyst review and debugging.
  • How should the persistence model represent scope when an abstract supports only publication-level or arm-level dose evidence and no subgroup is reported?
  • Should disease be denormalized onto the dose evidence row for easier querying, or resolved later from subgroup / publication disease context?
  • What exact cohort labels do we want to persist for dose context classification:
    • escalation
    • expansion
    • rp2d_or_fixed_dose
    • mixed_or_unclear
  • Should full text, when available, be allowed to override abstract-derived dose evidence, or only supplement it?

The direction that emerges from the worksheet and the warehouse evidence has several layers.

1. Use smallest-scope publication evidence as the persistence model

The target grain should be publication-result context, not publication-wide text blobs.

Preferred direction:

  • persist dose evidence at publication + arm + subgroup scope when supported
  • allow nullable arm/subgroup keys for publication-level and arm-only evidence
  • derive disease-facing exports from these scoped evidence rows instead of trying to back-infer scope later

2. Expand publication arm/intervention extraction to all result publications, including linked ones

The current unlinked_to_trials restriction is too aggressive for dose-sensitive reporting.

Preferred direction:

  • run publication arm/intervention extraction for all result publications, including linked ones
  • persist publication_interventions and publication_arm_interventions even when a publication is linked to a trial
  • keep trial-linked study-plan interventions as fallback context, not as the only source of dose

This addresses provenance.

3. Add a separate LLM-backed publication evidence extraction for worksheet dose fields

publication_interventions.dose should remain the raw publication dose phrase, but it should not be the final reporting shape.

Preferred structured output for the disease clinical evidence export:

  • raw publication dose text
  • structured cohort array
  • single_dose
  • dose_min
  • dose_max
  • rp2d
  • dose_units
  • dose_frequency
  • dose_context_type such as escalation, expansion, RP2D/fixed-dose, or mixed/unclear
  • evidence quote/span
  • confidence
  • optional cohort / arm note explaining whether the values come from escalation, expansion, or a disease-specific subset

This should be extracted from publication text with publication context using an LLM-backed schema, not reverse-parsed from the existing free-text dose field and not derived through substring / regex heuristics.

The current early extract_interventions step is still useful, but it is probably not sufficient on its own for dose attribution. The authoritative dose extraction likely belongs later in the workflow, after subgroup / arm / endpoint context exists, so the dose evidence can be attached to the correct publication result scope.

4. Use publication-derived dose as the preferred export source when it matches the publication result context

Source precedence for dose should likely be:

  1. publication-specific structured dose evidence
  2. publication raw intervention dose text
  3. linked trial intervention dose as fallback only

The important nuance is that we should derive the preferred export dose per output row from the matching publication context. We should not store or trust a single publication-wide preferred dose when the abstract contains multiple cohorts.

That is different from the current efficacy view, where linked publications are effectively forced into the trial-derived path.
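A minimal sketch of that precedence, assuming per-row nullable inputs (names illustrative, not the real schema):

```ruby
# Per-row dose source precedence: structured publication evidence, then the
# raw publication dose phrase, then the linked trial dose as last resort.
def export_dose(structured_evidence, raw_pub_dose_text, trial_dose)
  if structured_evidence && structured_evidence.values.any?
    { value: structured_evidence, provenance: :publication_structured }
  elsif raw_pub_dose_text && !raw_pub_dose_text.strip.empty?
    { value: raw_pub_dose_text, provenance: :publication_raw }
  else
    { value: trial_dose, provenance: :trial_fallback }
  end
end

export_dose({ dose_min: '2.0 mg/kg', dose_max: '3.0 mg/kg' }, '2.0, 2.5 and 3.0 mg/kg', 'not specified')
# provenance: :publication_structured
export_dose(nil, nil, 'not specified')
# provenance: :trial_fallback
```

In the real pipeline the function would be evaluated per export row against the matching publication context (arm/subgroup), never once per publication.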

5. Keep this as an export/evidence enrichment concern, not a generic trial-study-plan rewrite

The problem we are solving is:

  • can we recreate the worksheet from publication-backed evidence?

The answer does not require fully normalizing every historical publication intervention into canonical pharmacology. It requires a publication evidence layer that preserves what the publication actually says at the arm/cohort level.

Implemented 2026-03-11. Change: publication-dose-context-gap.

Four-part fix:

  1. Broadened intervention extraction scope — Removed .unlinked_to_trials from InterventionExtraction#base_scope. Previously ~53K linked publications were skipped because the therapeutic_area_filter step also had .unlinked_to_trials, so linked publications never got classified as hematology_oncology_relevant and never entered the intervention extraction scope — even though all trials in our database are hemonc by definition.

  2. Target disease scope for cost control — Running intervention + dose extraction across all 53K linked pubs would cost ~$480. Instead, scoped the backfill to publications linked to trials in target disease areas via clinical_trial_end_diseases:

    • Solid Tumors (4116), HNSCC (6200), ESCC (4260), sqNSCLC (4174), CRC (4345), Cholangiocarcinoma (6228/6229/4298)
    • Plus all existing hemonc-classified unlinked publications
    • Implemented as reusable scope Publication.target_disease_or_hemonc_relevant on the model
    • Reduces backfill from 53K to ~10K publications, estimated cost ~$66
    • These disease IDs are hardcoded for the initial backfill; scope can be broadened later by adding more disease IDs to Publication::TARGET_DISEASE_IDS
  3. New dose evidence extraction step — Created DoseEvidenceExtraction LLM task (app/tasks/publications_llm_classification/dose_evidence_extraction.rb) that decomposes free-text publication_interventions.dose into structured fields stored in publication_interventions.dose_evidence JSONB:

    • single_dose, dose_min, dose_max, rp2d, dose_units, dose_frequency, dose_context_type
    • evidence_quote, confidence, version
    • Uses gpt-5-mini at ~$0.004/publication — sufficient quality, no model upgrade needed
    • Prompt sends publication_intervention.id per intervention for deterministic persistence (no name matching)
    • Integrated into PublicationsWorkflow as a skippable step after extract_subgroups
  4. Efficacy view + export updated — vw_publication_efficacy_data v08 adds dose_min, dose_max, rp2d, dose_units, dose_frequency columns via a pub_dose_lookup CTE that reads publication_interventions.dose_evidence. emerging_clinical_data_query.rb includes these in export output.
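For reference, a plausible shape of one dose_evidence payload, using the field names listed above and the BL-B01D1 (pub 66552) values from the spot-check; the exact JSON layout and the confidence/version values are assumptions, not a schema dump:

```ruby
require 'json'

# Illustrative publication_interventions.dose_evidence payload. Keys follow
# the field list above; values echo the BL-B01D1 example.
dose_evidence = {
  'single_dose'       => nil,
  'dose_min'          => '2.0',
  'dose_max'          => '3.0',
  'rp2d'              => '2.5',
  'dose_units'        => 'mg/kg',
  'dose_frequency'    => 'D1D8 Q3W',
  'dose_context_type' => 'escalation',
  'evidence_quote'    => '2.0, 2.5 and 3.0 mg/kg D1D8 Q3W; 2.5mg/kg (RP2D)',
  'confidence'        => 0.95,
  'version'           => 1
}

JSON.parse(JSON.generate(dose_evidence)) == dose_evidence # round-trips like JSONB
```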

Key discovery during implementation: The therapeutic_area_filter task also has .unlinked_to_trials in its scope, so 65,152 linked publications were never classified for hemonc relevance. Since all trials in our DB are hemonc, the classification gate is meaningless for linked pubs. Rather than running the LLM therapeutic area filter on 65K pubs unnecessarily, we bypass it with target_disease_or_hemonc_relevant which uses trial disease metadata for linked pubs and LLM classification for unlinked pubs.

Files changed:

  • app/models/publication.rb (target_disease_or_hemonc_relevant scope + TARGET_DISEASE_IDS)
  • app/tasks/publications_llm_classification/dose_evidence_extraction.rb (new)
  • app/tasks/publications_llm_classification/intervention_extraction.rb (scope changed to target_disease_or_hemonc_relevant)
  • app/workflows/publications_workflow.rb (new step added)
  • app/admin/services/publication_console/publication_workflow_registry.rb (registry entries)
  • app/admin/services/publication_console/publication_workflow_overview_service.rb (scope methods)
  • lib/tasks/clinical_trials/publications.thor (Thor task wiring)
  • db/migrate/20260311220054_add_dose_evidence_to_publication_interventions.rb (JSONB column + GIN index)
  • db/views/vw_publication_efficacy_data_v08.sql (structured dose columns)
  • db/migrate/20260311220657_update_vw_publication_efficacy_data_to_version8.rb (view migration)
  • app/queries/tpp/emerging_clinical_data_query.rb (export columns)

Smoke test results (4 publications, gpt-5-mini):

  • Pub 75999 (MRG003): dose_min=0.1 mg/kg, dose_max=3.0 mg/kg, rp2d=2.5 mg/kg, Q3W, context=escalation, confidence=0.95
  • Pub 117 (Olanzapine/Pregabalin): fixed doses correctly extracted (5mg, 75mg, 8mg)
  • Pub 88446 (21 interventions): all 21 matched by ID, non-drug interventions correctly got confidence=0.0
  • Structured dose columns confirmed flowing through materialized view after refresh

Backfill completed 2026-03-12. Four steps ran in production:

  1. thor clinical_trials:publications:extract_interventions --batched --parallelism=4 --batch-size=2000
  2. thor clinical_trials:publications:link_publication_drugs --parallelism=5
  3. thor clinical_trials:publications:extract_dose_evidence --batched --parallelism=4 --batch-size=2000 (ran twice — first pass covered unlinked pubs only; second pass covered newly materialized linked-pub interventions)
  4. REFRESH MATERIALIZED VIEW CONCURRENTLY vw_publication_efficacy_data

Backfill results:

  • 44,778 / 44,780 publication_interventions rows have dose_evidence populated
  • Actual cost: ~$8 total across both dose evidence runs (gpt-5-mini batch API, ~$0.0004/pub — 10x cheaper than pre-implementation estimate)
  • Extraction quality verified across random samples: high-confidence extractions accurate, RP2D correctly identified in escalation studies, weight-based/BSA-based classification correct, low-confidence calibration appropriate (no hallucinated doses)

Post-backfill cleanup:

  • ~1.1% of rows (513) had LLM garbage in string fields — placeholder text, chain-of-thought leaking, field-name rotation, escaped JSON fragments. All correlated with non-drug interventions (surgery, imaging, lifestyle). Root cause: system prompt redundantly described JSON format when structured outputs already constrain it.
  • ~5,500 rows had string "null" variants instead of JSON null.
  • Both issues fixed by one_off:cleanup_dose_evidence_garbage:execute (one-off Thor task, 6,545 rows cleaned).
  • Prevention added: sanitize_dose_evidence! in DoseEvidenceExtraction#persist_dose_evidence strips garbage on persist. System prompt simplified to avoid redundant format instructions with structured outputs.
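The string-"null" normalization described above can be sketched as follows (hypothetical helper; the real sanitize_dose_evidence! implementation is not shown here):

```ruby
# Convert "null"-style placeholder strings to real nils and strip blanks so
# downstream SQL sees JSON null instead of the literal word "null".
NULL_VARIANTS = %w[null NULL None none n/a N/A].freeze

def sanitize_dose_evidence(evidence)
  evidence.transform_values do |v|
    next v unless v.is_a?(String)
    stripped = v.strip
    stripped.empty? || NULL_VARIANTS.include?(stripped) ? nil : stripped
  end
end

sanitize_dose_evidence('rp2d' => 'null', 'dose_units' => ' mg/kg ', 'dose_min' => '')
# => {"rp2d"=>nil, "dose_units"=>"mg/kg", "dose_min"=>nil}
```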

Spot-check verification (tracker examples now resolved):

| Pub | Drug | Before | After |
| --- | --- | --- | --- |
| 66552 | BL-B01D1 | not specified | dose_min 2.0 mg/kg, dose_max 3.0 mg/kg, rp2d 2.5 mg/kg, D1D8 Q3W |
| 133793 | simmitinib | starting dose 1mg/d | dose_min 1 mg, dose_max 9 mg, rp2d 6 mg 3 weeks on 1 week off |
| 75999 | MRG003 | raw text only | dose_min 0.1 mg/kg, dose_max 3.0 mg/kg, rp2d 2.5 mg/kg, Q3W |
| 240515 | amivantamab | no intervention rows | no intervention_arms in llm_data (abstract may lack dose detail) |

Issue reopened: pub_dose_lookup view join drops 76% of extracted dose evidence (2026-03-23)


The extraction and persistence steps from the 2026-03-11 fix are working correctly — 23,503 publications have dose_evidence populated in publication_interventions. However, only 8,764 publications (37%) have structured dose fields flowing through to vw_publication_efficacy_data. The remaining 17,826 publications (76%) have dose evidence silently dropped by the view’s pub_dose_lookup join.

The pub_dose_lookup CTE joins on (publication_id, drug_id):

LEFT JOIN pub_dose_lookup pdl
ON po.publication_id = pdl.publication_id
AND di.drug_id = pdl.drug_id
  • di.drug_id comes from the drug_interventions CTE, which for linked publications sources from vw_bioloupe_interventions (trial registry drugs)
  • pdl.drug_id comes from publication_interventions.drug_id (LLM-extracted and drug-linked)

This join fails in two ways:

Failure mode 1: NULL drug_id on publication_interventions (~13,600 pubs, 58%)

When link_publication_drugs doesn’t find a matching drug record, publication_interventions.drug_id stays NULL. In SQL, a comparison against NULL never evaluates to true, so the join condition di.drug_id = pdl.drug_id can never match and the dose evidence is silently dropped even though it was correctly extracted.
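A toy Ruby model of the drop, using the SHR-A1811 rows (illustrative). Note that `sql_eq` mimics SQL's three-valued equality, since Ruby's `nil == nil` is true while SQL's `NULL = NULL` is not:

```ruby
# SQL three-valued logic: `x = NULL` is never true, so a NULL drug_id can
# never satisfy the join. Modeled explicitly because Ruby nil compares equal.
def sql_eq(a, b)
  return false if a.nil? || b.nil? # comparison with NULL is never true
  a == b
end

view_rows = [{ publication_id: 70960, drug_id: 10733 }] # registry-derived side
dose_rows = [{ publication_id: 70960, drug_id: nil,
               dose_evidence: { 'rp2d' => '6.4 mg/kg' } }]

joined = view_rows.map do |vr|
  match = dose_rows.find do |dr|
    dr[:publication_id] == vr[:publication_id] && sql_eq(dr[:drug_id], vr[:drug_id])
  end
  vr.merge(dose_evidence: match && match[:dose_evidence])
end
# joined.first[:dose_evidence] => nil (evidence extracted but dropped)
```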

Failure mode 2: drug_id mismatch between registry and publication (~2,148 pubs, 9%)

The trial registry and the LLM-extracted publication interventions can resolve to different drug records for the same compound:

  • ADC vs naked antibody: Zanidatamab (10432) vs Zanidatamab zovodotin (15231)
  • Unresolved drug matching: SHR-A1811 has drug_id=NULL in publication_interventions but drug_id=10733 (Trastuzumab rezetecan) in the trial registry
  • Biosimilar/brand aliases: SCT510 (15900) vs Bevacizumab (9022)

Concrete examples from CRC ADC audit (disease 4345, technology 708)

| Pub | Drug | PI drug_id | View drug_id | Dose evidence | View dose fields |
| --- | --- | --- | --- | --- | --- |
| 66516 | Zanidatamab | 10432 (Zanidatamab) | 15231 (Zanidatamab zovodotin) | single_dose=1200 mg | all NULL |
| 70960 | SHR-A1811 | NULL | 10733 (Trastuzumab rezetecan) | dose_min=3.2, dose_max=8.0, rp2d=6.4 mg/kg | all NULL |
| 114758 | Zanidatamab | 10432 (Zanidatamab) | 15231 (Zanidatamab zovodotin) | single_dose=1200 mg | all NULL |

The unstructured dose column (from trial registry study_plan_components) still shows generic protocol text like “dose levels and schedules determined by the Safety Monitoring Committee (SMC)” for these publications.

  • 23,503 publications with dose_evidence extracted
  • 8,764 publications with structured dose in view (37%)
  • 17,826 publications with dose evidence silently dropped (76%)

Breakdown of dropped:

  • ~13,600 with NULL drug_id on publication_interventions (58%)
  • ~2,148 with drug_id mismatch between registry and publication (9%)
  • ~2,078 other (pub not in view, dose_evidence has no usable fields, etc.)

Resolved by Issue 20 fix (2026-03-23). The root cause was the drug_interventions CTE sourcing drug_id from vw_bioloupe_interventions (registry) while pub_dose_lookup used publication_interventions drug_id. The v16 view restructuring (see Issue 20 solution) fixes this by:

  1. Using publication_interventions as the primary drug source (Source 0), so di.drug_id and pdl.drug_id come from the same table.
  2. Threading publication_intervention_id through both CTEs for exact 1:1 join matching — eliminating the drug_id mismatch entirely, including for NULL drug_id interventions.
  3. Allowing NULL drug_id interventions through Source 0 (if we extracted them, they’re the source of truth — don’t fall back to registry).

Result: dose evidence coverage went from 8,764 pubs (71% of extracted publications in view scope) to 11,902 pubs (96.6% of extracted publications in view scope).
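The v16 join can be sketched the same way: once both CTEs source from publication_interventions, rows pair on the surrogate publication_intervention_id and the drug_id comparison disappears entirely (hypothetical data; field names follow the issue text):

```ruby
# Sketch of the v16 join: both sides carry publication_intervention_id,
# so the pairing is exact 1:1 even for NULL drug_id interventions.
interventions = [
  { publication_intervention_id: 1, publication_id: 66516, drug_id: 10432 },
  { publication_intervention_id: 2, publication_id: 70960, drug_id: nil }, # unlinked drug passes through
]

pub_doses = [
  { publication_intervention_id: 1, dose: 'single_dose=1200 mg' },
  { publication_intervention_id: 2, dose: 'rp2d=6.4 mg/kg' },
]

joined = interventions.map do |di|
  pdl = pub_doses.find { |p| p[:publication_intervention_id] == di[:publication_intervention_id] }
  di.merge(dose: pdl && pdl[:dose])
end
# Every extracted dose survives, including the NULL drug_id row.
```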

4. Most frequent AE columns lack grade-classified ranked export fields


The disease clinical evidence worksheet has two AE columns per row:

  • Most Frequent AE All Grade — e.g. Anemia (85.4%), Leukopenia (53.7%), Thrombocytopenia (53.7%)
  • Most Frequent AE >=Gr3 — e.g. Anemia (28.0%), Leukopenia (15.9%), Thrombocytopenia (14.6%)

These are ranked lists of the top individual named adverse events by incidence, separated into all-grade vs grade ≥3 buckets.

The current pipeline extracts individual named AE rows with numeric values but does not:

  1. Classify each AE row by grade category (all-grade vs ≥grade 3)
  2. Rank AEs by incidence within each grade bucket
  3. Produce a formatted summary string for export

As a result, the worksheet AE columns cannot be populated from structured data today.

Publication AE flow:

  1. classify_publications extracts llm_data['adverse_events'] from the abstract. The LLM schema (details.rb:AdverseEvent) captures adverse_event (name), measure_unit, observation (free text), and arms[].measure_value (numeric). There is no grade_category field — grade information lands in observation as unstructured text or gets embedded in the AE name.
  2. post_process_publications creates adverse_events rows and trial_arm_outcomes rows with numeric measure_value.
  3. standardize_adverse_events does rule-based name standardization.
  4. classify_adverse_events LLM-matches AEs to safety endpoint categories.

Relevant code paths:

Two separate restrictions:

1. The LLM extraction schema has no grade classification field

The AdverseEvent schema in details.rb captures:

attribute :adverse_event, :string # name
attribute :measure_unit, :string # percentage/count
attribute :observation, :string # free text — grade info lands here
attribute :arms, Arm.to_array_type # numeric values per arm

There is no grade_category enum. The LLM puts grade context into observation as free text (e.g. "Grade ≥3", "Grade 3 treatment-related", "Any grade", "Most common adverse event", or empty).

2. The downstream safety extraction only handles aggregate metrics

classify_safety_metric in emerging_clinical_data_query.rb classifies AEs into aggregate categories (:grade3_traes, :grade3_teaes, :discontinuation) and returns nil for individual named AEs like Nausea or Neutropenia. These individual AEs are stored but never surfaced in any export path.

Worksheet row: Izalontamab brengitecan in ESCC (ESCC tab, row 3)


The worksheet contains:

  • Most Frequent AE All Grade: Anemia (85.4%), Leukopenia (53.7%), Thrombocytopenia (53.7%), Neutropenia (42.7%)
  • Most Frequent AE >=Gr3: Anemia (28.0%), Leukopenia (15.9%), Thrombocytopenia (14.6%), Neutropenia (14.6%)

Our database has the individual AE rows and numeric values for this publication, but no way to classify which rows are all-grade vs ≥grade 3, and no export field that produces the ranked formatted string.

Worksheet row: Micvotabart pelidotin in HNSCC (HNSCC tab, row 4)


The worksheet contains:

  • Most Frequent AE All Grade: Cutaneous (44%); Neuropathy (34%); Neutropenia (22%); Anemia (17%)
  • Most Frequent AE >=Gr3: Neuropathy (28%), Neutropenia (11%)

The pattern is consistent: top 2–4 AEs ranked by incidence, with percentages, semicolon or comma separated.

Current database state for publication-sourced AE rows:

  • Total publications with AE rows: 36,802
  • Publications with AE rows that have numeric trial_arm_outcomes.measure_value: 33,835
  • Total AE rows with numeric values: 156,325
  • Average AE rows per publication: 4.6 (median 3, p90 8)

Grade context distribution across the 156K rows:

| Grade signal | Rows | % | Source |
|---|---|---|---|
| Clearly grade ≥3 (in observation) | 19,079 | 12% | `observation ~* 'grade.*(3\|≥3\|3/4)'` |
| Clearly grade ≥3 (in name) | 16,862 | 11% | `name ~* 'grade.*(3\|≥3\|3/4)'` |
| Subtotal grade ≥3 identifiable | 57,206 | 37% | Combined name + observation |
| Explicitly all-grade | 4,054 | 3% | `observation ~* '(any grade\|all grade)'` |
| No grade context at all | 75,616 | 48% | Neither name nor observation mentions grade |
| Low grade only (1–2) | ~4,024 | 3% | |
| Other grade context | ~2,797 | 2% | |

At the publication level:

| Category | Publications |
|---|---|
| Has BOTH all-grade and grade ≥3 rows | 4,053 |
| Has grade ≥3 rows only | 6,547 |
| Has any-grade rows only | 7,719 |
| Ambiguous (no clear grade signals) | 1,287 |

The key finding: 48% of individual AE rows (75K) have no grade context in either name or observation. These are likely all-grade AEs but cannot be reliably classified without the abstract context that was available at extraction time.
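The distribution above was derived with regex probes; a compact Ruby version of the same heuristics (patterns approximated from the table’s Source column, not the exact audit SQL) illustrates why the 48% bucket stays unclassifiable from row text alone:

```ruby
# Heuristic grade-signal probe mirroring the audit queries above.
GTE3 = /grade\s*(?:≥\s*)?3(?:\s*[\/\-–]\s*[45])?|≥\s*grade\s*3/i
ALL_GRADE = /any[- ]grade|all[- ]grade/i

def grade_signal(name, observation)
  text = [name, observation].compact.join(' ')
  return :grade_gte3 if text =~ GTE3
  return :all_grade  if text =~ ALL_GRADE
  :no_grade_context # the 48% bucket — needs abstract context to classify
end

grade_signal('Anemia', 'Grade ≥3 treatment-related') # :grade_gte3
grade_signal('Grade 3/4 neutropenia', nil)           # :grade_gte3 (signal in name)
grade_signal('Nausea', 'Any grade')                  # :all_grade
grade_signal('Fatigue', nil)                         # :no_grade_context
```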

From spot-checking across HNSCC and ESCC tabs:

  • All Grade column: typically 2–4 AEs, sometimes just names without % when percentages aren’t reported
  • ≥Gr3 column: typically 1–3 AEs, usually fewer than all-grade

  • Some cells include (NR) for “not reported”
  • Separator style varies: semicolons and commas both used
  • Format: AE_name (value%)
  • The disease clinical evidence export cannot populate the two most-frequent-AE columns
  • The existing safety extraction only surfaces aggregate TRAE/TEAE/discontinuation metrics
  • Individual named AEs with percentages exist in the database but are invisible to reporting
  • Publications where the abstract reports specific high-frequency AEs (the most clinically relevant safety signal) cannot be compared to the manually curated worksheet

This is not a missing AE extraction problem. The pipeline already extracts individual named AEs with numeric values for ~34K publications. The AE data exists — it just lacks grade classification and a ranked export format.

This is also not an aggregate safety metric problem. TRAE ≥Gr3, TEAE ≥Gr3, and discontinuation rates are already handled by extract_safety_metrics_for_publication. The gap is specifically in individual named AE ranking.

  • Should the ranked summary be persisted as pre-formatted strings (like the worksheet cells), or as structured arrays that the export formats at query time?
  • When a publication has AE rows for multiple arms, should the ranked summary use the experimental arm only (current behavior for aggregate metrics) or present the arm that matches the export row context?

The solution has two parts: a schema enhancement for future publications and a backfill for existing data.

1. Modify classify_publications extraction to include grade classification (going forward)

Add a grade_category enum field to the AdverseEvent schema in details.rb:

class AdverseEvent
  include StoreModel::Model
  include DataTasks::JsonSchema

  desc 'The name of the adverse event reported in the trial.'
  attribute :adverse_event, :string

  desc 'Grade category of this adverse event. Use all_grade for any-grade or unspecified-grade AEs, grade_gte3 for grade ≥3/grade 3-4/grade 3-5 AEs.'
  attribute :grade_category, :string # enum: all_grade, grade_gte3

  # ... existing fields ...
end

Update the extraction prompt (section 4 in task.rb) to instruct the LLM to classify grade at extraction time. The LLM already reads the abstract in full — it knows whether “Nausea (75.3%)” is reported as all-grade or ≥grade 3 from surrounding context. Adding one enum field is nearly free in token cost.

Add a grade_category column to the adverse_events table (migration). Update post_process.rb:process_adverse_events to persist the new field.

2. Backfill existing AE rows with LLM grade classification

A separate one-time LLM task that reads existing adverse_events rows + the publication abstract and classifies grade_category for each row.

Scope: ~33,835 publications, ~156K AE rows.

Input per publication prompt:

  • Publication title + abstract (~2,750 chars avg)
  • Existing AE rows with name, observation, and measure_value (~217 chars avg)

Output per AE row:

  • grade_category: all_grade | grade_gte3

Estimated cost with gpt-5-mini batched: ~$15–25 for the full backfill.

The backfill task would update adverse_events.grade_category directly. After completion, all AE rows (both historical and future) have grade classification from the same source of truth.
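The backfill loop can be sketched as follows (plain Ruby; `classify_with_llm` and `backfill_publication` are hypothetical stand-ins for the batched LLM call and task body, not the actual implementation):

```ruby
# One-time backfill sketch: classify grade_category for existing AE rows
# using the publication abstract plus the row's name/observation text.
def backfill_publication(pub, classify_with_llm:)
  prompt_rows = pub[:adverse_events].map do |ae|
    { id: ae[:id], name: ae[:name], observation: ae[:observation] }
  end
  labels = classify_with_llm.call(abstract: pub[:abstract], rows: prompt_rows)
  pub[:adverse_events].each do |ae|
    # Only persist canonical values; anything else leaves the row untouched.
    label = labels[ae[:id]]
    ae[:grade_category] = label if %w[all_grade grade_gte3].include?(label)
  end
end

# Usage with a stubbed classifier:
pub = { abstract: '... Grade ≥3 anemia occurred in 28% ...',
        adverse_events: [{ id: 7, name: 'Anemia', observation: 'Grade ≥3' }] }
stub = ->(abstract:, rows:) { rows.to_h { |r| [r[:id], 'grade_gte3'] } }
backfill_publication(pub, classify_with_llm: stub)
pub[:adverse_events].first[:grade_category] # => "grade_gte3"
```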

3. Ranked summary derivation (query-time)

Once all AE rows have grade_category, producing the worksheet columns is a straightforward query:

-- For a given publication + arm context:
SELECT ae.name, tao.measure_value
FROM adverse_events ae
JOIN trial_arm_outcomes tao ON tao.adverse_event_id = ae.id
WHERE ae.source_id = :publication_id
AND ae.source_type = 'Publication'
AND ae.grade_category = 'all_grade' -- or 'grade_gte3'
AND ae.measure_unit = 'percentage'
AND tao.measure_value IS NOT NULL
AND tao.measure_value::numeric > 0
ORDER BY tao.measure_value::numeric DESC
LIMIT 4

Format as: AE_name (value%); AE_name (value%); ...

This can be computed at export time from the grade-tagged rows without a separate persistence step.

4. Workflow placement

No new workflow step needed for the going-forward path — grade classification happens inside the existing classify_publications step and is persisted by post_process_publications.

The backfill task runs independently as a one-time Thor task, similar in pattern to the subgroup disease adjudication backfill (Issue 1).

Implemented 2026-03-11. Change: add-publication-ae-grade-classification.

Status: Implementation complete. Full historical backfill has not yet been run across the remaining eligible publication AE rows.

Applied fix:

  1. Persisted AE grade category on adverse_events — Added adverse_events.grade_category with canonical values all_grade and grade_gte3, plus model normalization/validation so downstream readers have a stable field instead of re-parsing free text.

  2. Extended publication extraction for new rows — Updated the publication LLM schema and prompt so classify_publications emits grade_category for each adverse event row, and updated post_process_publications to persist it when creating publication-sourced AE rows.

  3. Added historical backfill task — Created PublicationsLlmClassification::AdverseEventGradeBackfill and wired a Thor task:

    • thor clinical_trials:publications:backfill_adverse_event_grade_categories
    • supports non-batched execution, --publication-ids, --limit, --source, --model, and --overwrite
    • default validation model: gpt-5-mini
  4. Added ranked named-AE export derivation — Implemented query-time ranking of named adverse events by grade_category and wired worksheet-style outputs into the reporting path:

    • Most Frequent AE All Grade
    • Most Frequent AE >=Gr3
  5. Hardened ranked summary filtering after manual spot checks — Updated the summary helper so it:

    • prefers the actual adverse-event name over standardized bucket labels
    • excludes aggregate rollup rows such as TRAE, TEAE, SAE, AESI, irAE, discontinuation, fatal/grade-5 rollups
    • excludes zero-value / not reported rows from named-AE summaries

Files changed:

  • app/models/adverse_event.rb
  • app/tasks/publications_llm_classification/details.rb
  • app/tasks/publications_llm_classification/task.rb
  • app/tasks/publications_llm_classification/post_process.rb
  • app/tasks/publications_llm_classification/adverse_event_grade_backfill.rb
  • lib/tasks/clinical_trials/publications.thor
  • app/queries/clinical_trials/publications_query.rb
  • app/queries/tpp/emerging_clinical_data_query.rb
  • app/services/tpp/reports/emerging_clinical_data_report.rb
  • db/migrate/20260311222107_add_grade_category_to_adverse_events.rb

Manual validation completed:

  • Non-batched gpt-5-mini run on 4 hand-picked publications: 4 publications processed, 33 rows updated
  • Confirmed persisted all_grade vs grade_gte3, default skip behavior, overwrite behavior, and arm fallback
  • Additional non-batched gpt-5-mini run on 8 random publications: 8 publications processed, 33 rows updated
  • Random spot checks confirmed:
    • named grade 3/4 and >=3 rows classify as grade_gte3
    • named any-grade / grade-1 rows classify as all_grade
    • aggregate safety rows are excluded from ranked named-AE summaries
    • zero/not reported rows no longer emit bogus ranked summary strings

Model outcome: gpt-5-mini was good enough on the manual validation slices; no progression to a stronger model was needed.

Operational follow-up: run the full historical backfill for the remaining eligible publication AE rows before marking this issue fully complete.

5. Publication prior therapy context is not extracted — min/max prior lines and prior therapy exposure are missing


The disease clinical evidence worksheet has four columns that describe the prior therapy context of a publication’s study population:

  • Min Prior Lines — minimum number of prior lines of therapy (e.g. 1)
  • Max Prior Lines — maximum number of prior lines (e.g. 7)
  • Treatment Line — e.g. 2L+, 3L+ (already extracted, this issue does not cover treatment line)
  • Prior Taxane Use — e.g. Yes, No, Allowed, Required

Treatment line is already extracted and persisted on trial_subgroups.treatment_lines (see TreatmentContextExtraction task, renamed from TreatmentLineExtraction). But min_prior_lines, max_prior_lines, and prior therapy exposure are not extracted from publications at all.

The trial side has partial analogues:

  • trial_eligibility_criteria with modifier = 'prior_treatment_lines' stores min/max for ~62K trial records

Note: indicated_prior_therapies is related to drug approval indications, not trials or publications. It captures required/excluded prior therapies for regulatory label context, not clinical study populations.

Publication-sourced rows have no equivalent for either prior line counts or prior therapy exposure. When the worksheet reports “median 4 prior therapies (range 0–7)” or “52% had prior taxane therapy for mCRPC,” that context exists only in the publication abstract and is not captured by the pipeline.

Treatment line extraction:

  1. TreatmentContextExtraction in app/tasks/publications_llm_classification/treatment_context_extraction.rb maps abstracts to enum values (1L, 2L+, 3L+, etc.) and extracts prior therapy context
  2. Results persist on trial_subgroups.treatment_lines (JSONB array) and trial_subgroups.llm_data['treatment_lines']
  3. The efficacy view normalizes to effective_line (numeric 0–4) and treatment_settings

The treatment line extraction already reads prior therapy language to determine the line (e.g. “median of 4 prior therapies” → 3L+). But the numeric counts and specific therapy exposures are consumed as reasoning inputs, not persisted as structured data.

There is no extraction step for:

  • publication-level prior line counts (min, max, median)
  • prior therapy exposure flags (prior taxane, prior checkpoint inhibitor, etc.)

1. Treatment line extraction discards numeric prior therapy counts

The TreatmentLineExtraction system prompt instructs the LLM to use prior therapy counts for line determination:

- "median prior lines = N":
- N ≥ 2 → "3L+"
- N = 1 (or range includes 1–2) → "2L+"

But the output schema (TreatmentLineDetails) only captures treatment_lines (enum array) and evidence (free text). The actual numbers (median = 4, range 0–7) are consumed during reasoning but not persisted as structured fields.
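The prompt’s mapping rule, written out as code (a sketch of the reasoning the LLM applies, not an actual pipeline function):

```ruby
# Map a stated median prior line count (and optional range) to a line bucket,
# following the prompt rule quoted above.
def line_bucket(median:, min: nil, max: nil)
  return '3L+' if median && median >= 2
  return '2L+' if median == 1 || (min && max && (min..max).cover?(1))
  nil # not determinable from counts alone
end

line_bucket(median: 4)                   # BOLD-100: median 4 prior therapies => "3L+"
line_bucket(median: 1)                   # => "2L+"
line_bucket(median: nil, min: 1, max: 2) # range includes 1–2 => "2L+"
```

The numbers feeding this rule (median = 4, range 0–7) are exactly the values the current schema discards.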

2. Prior therapy exposure is completely out of scope

The treatment line extraction prompt explicitly states:

Out of scope: Dosing, endpoints, safety, biomarkers (unless they clarify line), efficacy stats.

Prior therapy exposure (e.g. “52% had prior taxane,” “required prior platinum,” “prior CAR-T allowed”) is not captured by any extraction step.

3. The efficacy view has no prior-line or prior-therapy columns from publications

vw_publication_efficacy_data exposes effective_line, treatment_settings, and raw_treatment_lines but has no min_prior_lines, max_prior_lines, or prior therapy fields. The trial efficacy view (vw_trial_efficacy_data) does have min_line and max_line from trial_eligibility_criteria, but the publication view has no equivalent.

Example 1: publication 152908 (BOLD-100 in gastric cancer)


Abstract states:

“Patients had a median of 4 prior systemic therapies [0, 7], 1 with no prior therapy, 2 had 2 prior therapies, 5 with 3 prior therapies, and 13 patients with 4 or more prior therapies. 20/21 patients received prior platinum with 18/21 receiving prior FOLFOX/CAPOX.”

Current extraction result: treatment_lines: ["3L+"] — correct, but we lose:

  • min_prior_lines: 0
  • max_prior_lines: 7
  • median_prior_lines: 4
  • prior platinum: 20/21 (95%)
  • prior FOLFOX/CAPOX: 18/21 (86%)

Example 2: publication 162733 (sEphB4-HSA in mCRPC)


Abstract states:

  • “treatment with at least one second generation androgen receptor (AR)-targeted therapy but no more than three prior therapies for mCRPC”
  • “received a median of three prior therapies (range 1-3)”
  • “Ten patients received prior taxane for mCRPC or hormone sensitive prostate cancer”

Current extraction result: treatment_lines: ["2L+"] — correct, but we lose:

  • min_prior_lines: 1
  • max_prior_lines: 3
  • median_prior_lines: 3
  • prior taxane: 10/14 (71%)
  • prior AR-targeted therapy: 14/14 (100%, required)

Example 3: publication 53818 (PROfound — olaparib by prior taxane)


This is the paradigmatic case — the entire publication is organized around prior taxane use as a stratification factor. The abstract reports efficacy by prior taxane yes/no subgroups. The worksheet needs Prior Taxane Use: Yes/No (stratified).

Current extraction captures treatment_lines: ["2L+"] but does not capture that prior taxane is the defining subgroup variable.

  • The disease clinical evidence export cannot populate Min Prior Lines, Max Prior Lines, or Prior Taxane Use columns from publication data
  • Researchers manually fill these from abstracts — exactly the kind of structured extraction the pipeline should automate
  • Prior therapy context is clinically important for interpreting efficacy results (a drug showing ORR of 30% in a post-taxane population is very different from 30% in a treatment-naïve population)
  • Without structured prior therapy data, comparative analyses across publications in the same disease are unreliable

This is not a treatment line problem. Treatment line extraction works well and correctly maps abstracts to 1L, 2L+, 3L+, etc. The issue is that treatment line is a categorical bucket, while prior therapy context includes:

  • numeric counts (min, max, median, range)
  • specific therapy exposure flags
  • exposure requirements (required, allowed, excluded)

This is also not a trial eligibility criteria problem. The trial side has prior_treatment_lines and indicated_prior_therapies, but these describe trial enrollment criteria, not the actual population characteristics reported in the publication abstract.

Prior therapy language in ~71K result publications:

| Pattern | Publications mentioning |
|---|---|
| Mentions median prior line count | 1,936 |
| Mentions prior line threshold (≥N) | 1,839 |
| Mentions prior line range | 883 |
| Mentions any specific prior therapy class | 1,458 |

Specific prior therapy class mentions (non-exclusive):

| Prior therapy class | Publications |
|---|---|
| Prior checkpoint/IO therapy | 572 |
| Prior platinum | 302 |
| Prior anti-VEGF | 241 |
| Prior radiation | 190 |
| Prior CDK4/6i | 152 |
| Prior hormonal/endocrine | 151 |
| Prior taxane | 148 |
| Prior CAR-T | 143 |
| Prior transplant | 84 |
| Prior HMA | 80 |
| Prior surgery | 55 |
| Prior PI/bortezomib | 54 |
| Prior IMiD/lenalidomide | 43 |
| Prior anthracycline | 39 |
| Prior fluoropyrimidine | 33 |
| Prior gemcitabine | 27 |
| Prior irinotecan | 21 |
| Prior bispecific | 12 |
| Prior BCG | 10 |
| Prior ADC | 3 |

Key observations:

  1. “Prior taxane” (148 publications) is just one instance of a general pattern — at least 15 therapy classes appear routinely
  2. The highest-frequency classes (checkpoint/IO, platinum, anti-VEGF) reflect current oncology practice where these are standard earlier-line therapies
  3. ~1,900 publications contain explicit numeric prior line counts that are currently consumed during treatment line reasoning but discarded

Publications with rich prior therapy context that is currently lost:

  • 152908 (BOLD-100 in gastric cancer): median 4 prior therapies (range 0–7), 95% prior platinum — extracted as 3L+ only
  • 162733 (sEphB4-HSA in mCRPC): median 3 prior therapies (range 1–3), 71% prior taxane, 100% prior AR-targeted — extracted as 2L+ only
  • 53818 (PROfound olaparib): entire study stratified by prior taxane yes/no — extracted as 2L+ only, taxane context not captured
  • 65484 (givastomig in GEC): median 3 prior lines, 74% prior PD-(L)1 inhibitor — extracted as 3L+ only
  • 147778 (GSK2636771 in mCRPC): median 4 prior lines, 83% prior taxane — extracted as 3L+ only

The pipeline currently models treatment context as:

trial_subgroups.treatment_lines → ["2L+"] (categorical bucket)

The worksheet needs:

Treatment Line → 2L+ (categorical — already have)
Min Prior Lines → 1 (numeric — don't have)
Max Prior Lines → 3 (numeric — don't have)
Prior Taxane Use → Yes (71%) (therapy exposure flag — don't have)
Prior Platinum Use → Yes (95%) (therapy exposure flag — don't have)
Prior IO Use → No (therapy exposure flag — don't have)

The worksheet column is labeled “Prior Taxane Use” specifically, but the underlying data pattern is general: researchers track prior exposure to whatever therapy class is clinically relevant for the disease area. In breast cancer it’s taxane and anthracycline; in mCRPC it’s taxane and AR-targeted therapy; in myeloma it’s IMiD, PI, and anti-CD38; in lymphoma it’s CAR-T and bispecific.

  • Should we distinguish between required/allowed/excluded prior therapies, or just report exposure percentages?
    • “At least one prior platinum” (required) vs “prior taxane was allowed” (optional) vs “52% had prior taxane” (reported)
    • These carry different clinical meaning
    • The indicated_prior_therapies optionality enum on the indications side uses: must_have_received, progressed_on_after, not_previously_treated_with, After Failure Of, refractory_to, ineligible_for, inadequate_response_to, Intolerant to — these are richer than what abstracts typically state, but the pattern is informative
  • How should the persistence model represent scope when prior therapy context applies to the overall population but individual subgroups break it down differently?

1. Subgroup-level, not publication-level

Prior therapy context should persist at the subgroup level (on trial_subgroups), not at the publication level. Evidence:

  • Treatment line is already subgroup-level, and prior therapy context is tightly coupled to treatment line
  • 1,827 publications have multiple disease subgroups with treatment lines; 119 of those have different treatment lines across subgroups (e.g. pub 1703: “treatment-naive” subgroup at 1L vs “previously treated” at 2L+)
  • When treatment lines differ across subgroups, prior therapy context necessarily differs too — a 1L subgroup has 0 prior lines while a 2L+ subgroup has ≥1
  • The PROfound example (pub 53818) shows prior taxane as a subgroup stratification variable — some subgroups are “prior taxane yes” and others “prior taxane no”

For publications where the abstract only states population-level prior therapy characteristics (the common case), all subgroups inherit the same values. The subgroup-level model handles both cases correctly.

2. Rename to TreatmentContextExtraction

The existing TreatmentLineExtraction should be renamed to TreatmentContextExtraction (or similar) to reflect its expanded scope. The task already reads all prior therapy language for treatment line reasoning — it just discards the structured details. Expanding the output schema is natural.

This is not “mixing concerns” — treatment line, prior line counts, and prior therapy exposure are all facets of the same clinical context question: “Where does this population sit in the treatment sequence?”

3. Strict enum for therapy_class + free text for therapy_name (two-field design)

The key design insight is separating what the abstract says from what we query on:

therapy_name: "taxane-based chemotherapy" ← free text, what the abstract says (evidence)
therapy_class: "taxane" ← strict enum, what we filter/query on

This avoids the disease_stages antipattern in ParticipationCriterion where an initial predefined list grew unbounded through LLM and import drift, producing duplicates like Stage I / Stage 1 / Stage IA with no normalization layer.

The therapy_class enum is fixed in the schema. The LLM must pick from the list or use other. If other accumulates a meaningful cluster over time, that’s signal to add a new enum value — a conscious schema change, not drift.
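A minimal sketch of that discipline (plain Ruby; in practice the guard would live in the StoreModel validation, and `normalize_therapy_class` is a hypothetical helper name):

```ruby
# Fixed enum of therapy classes; the LLM must pick from this list or "other".
THERAPY_CLASSES = %w[
  checkpoint_inhibitor surgery transplant platinum endocrine_therapy
  anti_vegf taxane radiation car_t cdk_inhibitor anti_her2
  imid hma proteasome_inhibitor anthracycline fluoropyrimidine
  bispecific anti_cd38 adc bcg chemotherapy other
].freeze

# Anything emitted outside the fixed list collapses to "other" — drift shows
# up as an "other" cluster to review, not as new ad-hoc enum values.
def normalize_therapy_class(value)
  THERAPY_CLASSES.include?(value) ? value : 'other'
end

normalize_therapy_class('taxane')          # => "taxane"
normalize_therapy_class('Taxane-based CT') # => "other" (signal, not drift)
```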

The enum covers ~20 therapy classes based on publication frequency analysis:

| therapy_class | Pubs mentioning | Example abstract phrases |
|---|---|---|
| checkpoint_inhibitor | 865 | “prior anti-PD-1”, “prior IO”, “prior pembrolizumab” |
| surgery | 572 | “prior resection”, “prior nephrectomy” |
| transplant | 533 | “prior HSCT”, “prior auto-SCT”, “prior allo-SCT” |
| platinum | 506 | “prior platinum”, “prior cisplatin”, “prior carboplatin” |
| endocrine_therapy | 477 | “prior ARPI”, “prior endocrine therapy”, “prior enzalutamide” |
| anti_vegf | 361 | “prior bevacizumab”, “prior anti-VEGF”, “prior anti-angiogenic” |
| taxane | 344 | “prior taxane”, “prior docetaxel”, “prior paclitaxel” |
| radiation | 340 | “prior radiation”, “prior radiotherapy”, “prior chemoradiation” |
| car_t | 313 | “prior CAR-T”, “prior CAR T-cell therapy” |
| cdk_inhibitor | 204 | “prior CDK4/6 inhibitor”, “prior palbociclib” |
| anti_her2 | 184 | “prior trastuzumab”, “prior T-DXd”, “prior pertuzumab” |
| imid | 141 | “prior lenalidomide”, “prior IMiD”, “prior pomalidomide” |
| hma | 121 | “prior azacitidine”, “prior HMA”, “prior decitabine” |
| proteasome_inhibitor | 121 | “prior bortezomib”, “prior PI”, “prior carfilzomib” |
| anthracycline | 90 | “prior anthracycline”, “prior doxorubicin” |
| fluoropyrimidine | 75 | “prior 5-FU”, “prior capecitabine” |
| bispecific | 48 | “prior bispecific antibody” |
| anti_cd38 | 43 | “prior daratumumab”, “prior anti-CD38” |
| adc | 38 | “prior ADC”, “prior antibody-drug conjugate” |
| bcg | 16 | “prior BCG” |
| chemotherapy | — | “prior chemotherapy” (generic, when no specific class stated) |
| other | — | Catch-all for anything not above |

The long tail drops off fast — only 20 classes cover virtually all clinically meaningful prior therapy mentions in oncology/hematology publications.

Compound semantics are manageable: most publications (2,101 out of 2,349 mentioning specific priors) reference only a single prior therapy class. Only 231 mention two, and 17 mention three or more.

4. Compound prior therapy semantics (“prior X and Y”, “prior X or Y”)

Real abstract patterns:

  • Conjunctive (AND): "who received prior taxane, endocrine therapy, CDK4/6 inhibitor, and 2-4 prior chemotherapies" (TROPiCS-02) — all four are required
  • Disjunctive (OR): "prior platinum and/or fluoropyrimidine chemotherapy" — either qualifies
  • Mixed: "prior checkpoint inhibitor and platinum-based chemotherapy" — both required

The simplest model that handles all cases: extract each therapy as a separate row in the prior_therapies array. Each row has its own exposure_status. For compound requirements like TROPiCS-02, that becomes:

[
  { "therapy_class": "taxane", "exposure_status": "required", "evidence": "..." },
  { "therapy_class": "endocrine_therapy", "exposure_status": "required", "evidence": "..." },
  { "therapy_class": "cdk_inhibitor", "exposure_status": "required", "evidence": "..." }
]

We do NOT need to model the logical relationship (AND/OR) between therapies explicitly. Each therapy entry stands on its own with its exposure status. This is sufficient for worksheet export (“Prior Taxane Use: Yes”) and for filtering (“show publications requiring prior CDK4/6i”).
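With each therapy as its own row, both export and filtering reduce to simple scans over the array (a sketch over a hypothetical in-memory shape of llm_data['prior_therapies']):

```ruby
subgroups = [
  { id: 1, prior_therapies: [
    { 'therapy_class' => 'taxane',        'exposure_status' => 'required' },
    { 'therapy_class' => 'cdk_inhibitor', 'exposure_status' => 'required' },
  ] },
  { id: 2, prior_therapies: [
    { 'therapy_class' => 'platinum', 'exposure_status' => 'reported', 'exposure_percentage' => 95.0 },
  ] },
]

# "Show subgroups requiring prior CDK4/6i" — no AND/OR modeling needed,
# because each therapy row stands on its own with its exposure status.
requiring_cdk = subgroups.select do |sg|
  sg[:prior_therapies].any? do |t|
    t['therapy_class'] == 'cdk_inhibitor' && t['exposure_status'] == 'required'
  end
end
requiring_cdk.map { |sg| sg[:id] } # => [1]
```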

Rename TreatmentLineExtractionTreatmentContextExtraction

class Subgroup
  include StoreModel::Model
  include DataTasks::JsonSchema

  desc 'ID of the subgroup from the input'
  attribute :id, :integer
  attribute :subgroup_type, :string, ignore: true
  attribute :subgroup_value, :string, ignore: true

  # Existing
  attribute :treatment_lines, ArrayType.new, enum: Indication::TREATMENT_LINES
  desc 'Textual evidence or reasoning that supports the treatment line(s)'
  attribute :evidence, :string

  # New: prior line counts
  desc 'Minimum number of prior lines of therapy for this population (from eligibility criteria or reported range). Null if not stated.'
  attribute :min_prior_lines, :integer
  desc 'Maximum number of prior lines of therapy for this population. Null if not stated.'
  attribute :max_prior_lines, :integer
  desc 'Median number of prior lines of therapy, if explicitly stated in the abstract.'
  attribute :median_prior_lines, :integer

  # New: prior therapy exposures
  attribute :prior_therapies, PriorTherapyExposure.to_array_type
end

class PriorTherapyExposure
  include StoreModel::Model
  include DataTasks::JsonSchema

  THERAPY_CLASSES = %w[
    checkpoint_inhibitor surgery transplant platinum endocrine_therapy
    anti_vegf taxane radiation car_t cdk_inhibitor anti_her2
    imid hma proteasome_inhibitor anthracycline fluoropyrimidine
    bispecific anti_cd38 adc bcg chemotherapy other
  ].freeze

  desc 'Normalized therapy class for filtering/querying. Must be one of the enum values.'
  attribute :therapy_class, :string # enum: THERAPY_CLASSES
  desc 'Therapy name as stated in the abstract (e.g. "taxane-based chemotherapy", "prior anti-PD-1 therapy", "lenalidomide"). Preserves original phrasing for evidence.'
  attribute :therapy_name, :string
  desc 'How this prior therapy relates to the study population'
  attribute :exposure_status, :string # enum: required, allowed, excluded, reported
  desc 'Percentage of patients with this prior exposure, if reported (e.g. 71.4). Null if not stated.'
  attribute :exposure_percentage, :float
  desc 'Evidence quote from the abstract'
  attribute :evidence, :string
end

New columns on trial_subgroups:

  • min_prior_lines (integer, nullable)
  • max_prior_lines (integer, nullable)
  • median_prior_lines (integer, nullable)

Prior therapy exposures persist in trial_subgroups.llm_data['prior_therapies'] (JSONB array), consistent with how treatment line evidence is already stored in trial_subgroups.llm_data['treatment_lines'].

The efficacy view would expose min_prior_lines and max_prior_lines alongside effective_line. The emerging clinical data query would format prior therapies for export.

This requires a full backfill since we’re expanding the extraction schema. The renamed TreatmentContextExtraction task re-runs on all result publications that have subgroups.

Options to reduce cost:

  • Only backfill publications where the abstract contains prior therapy language (~3K–5K publications based on regex estimates) for the prior therapy fields
  • Use gpt-5-mini for the backfill since the extraction is well-defined
  • Batch processing with the existing DataTasks::Task infrastructure
  • The prior line count fields can be extracted in the same pass as treatment lines since the LLM already reasons about them

Estimated cost: ~$31 batched with gpt-5-mini for a full backfill of all 62K publications with subgroups. No regex pre-filter — the LLM returns empty arrays when no prior therapy context exists, and the cost per publication ($0.001) makes filtering unnecessary.

For the worksheet columns:

  • Min Prior Lines → trial_subgroups.min_prior_lines (direct)
  • Max Prior Lines → trial_subgroups.max_prior_lines (direct)
  • Prior Taxane Use → derived from llm_data['prior_therapies'] array, filtering for therapy_class = 'taxane':
    • If exposure_status = 'required' → Yes (required)
    • If exposure_status = 'reported' with percentage → Yes (71%)
    • If exposure_status = 'excluded' → No (excluded)
    • If exposure_status = 'allowed' → Allowed
    • If no entry with therapy_class = 'taxane' → NR
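
The mapping above can be sketched as a small helper. This is a hypothetical illustration, not the actual export code (which lives in emerging_clinical_data_query.rb); `entries` is assumed to be the llm_data['prior_therapies'] array of string-keyed hashes, and the 'Yes (reported)' fallback for a reported entry without a percentage is an assumption the rules above do not specify.

```ruby
# Illustrative sketch of the "Prior <class> Use" worksheet formatting rules.
def prior_therapy_use(entries, therapy_class)
  entry = entries.find { |e| e['therapy_class'] == therapy_class }
  return 'NR' unless entry # no entry for this class -> not reported

  case entry['exposure_status']
  when 'required' then 'Yes (required)'
  when 'reported'
    pct = entry['exposure_percentage']
    # Assumption: fall back to a plain "Yes (reported)" when no percentage given
    pct ? format('Yes (%d%%)', pct.round) : 'Yes (reported)'
  when 'excluded' then 'No (excluded)'
  when 'allowed'  then 'Allowed'
  else 'NR'
  end
end
```

Adding a "Prior IO Use" column is then just a different `therapy_class` argument ('checkpoint_inhibitor'), with no schema change.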

The worksheet currently labels this column “Prior Taxane Use” but the extraction captures all therapy classes via the strict therapy_class enum. The export filters by enum value — therapy_class = 'taxane' for this column, therapy_class = 'checkpoint_inhibitor' for “Prior IO Use”, etc. No schema changes needed to add new worksheet columns for different disease areas.

The therapy_name free text field preserves the original abstract phrasing for display and evidence review (e.g. “prior docetaxel-based chemotherapy” rather than just “taxane”).

Implemented as the TreatmentContextExtraction task, which expands the former TreatmentLineExtraction to extract prior therapy context alongside treatment lines in a single LLM call.

New columns on trial_subgroups:

  • min_prior_lines (integer, nullable) — minimum number of prior lines of therapy
  • max_prior_lines (integer, nullable) — maximum number of prior lines
  • median_prior_lines (integer, nullable) — median number of prior lines

New JSONB key in trial_subgroups.llm_data:

  • prior_therapies — array of PriorTherapyExposure objects, each with:
    • therapy_class — strict enum of 22 values (checkpoint_inhibitor, taxane, platinum, endocrine_therapy, anti_vegf, car_t, cdk_inhibitor, anti_her2, imid, hma, proteasome_inhibitor, anthracycline, fluoropyrimidine, bispecific, anti_cd38, adc, bcg, surgery, transplant, radiation, chemotherapy, other)
    • therapy_name — free text preserving original abstract phrasing
    • exposure_status — enum: required, allowed, excluded, reported
    • exposure_percentage — float, nullable (e.g. 71.4 for “71% had prior taxane”)
    • evidence — quote from abstract
Files changed:

  1. app/tasks/publications_llm_classification/treatment_context_extraction.rb — renamed from treatment_line_extraction.rb. Expanded Subgroup schema adds min_prior_lines, max_prior_lines, median_prior_lines, and prior_therapies array. System prompt extended with prior line count extraction rules and therapy class mapping with the 22-value enum.

  2. app/tasks/publications_llm_classification/post_process.rb — updated to write min_prior_lines, max_prior_lines, median_prior_lines columns and prior_therapies JSONB key during subgroup creation.

  3. db/views/vw_publication_efficacy_data_v09.sql — added min_prior_lines, max_prior_lines, median_prior_lines from trial_subgroups to the materialized view output.

  4. app/queries/tpp/emerging_clinical_data_query.rb — added min_prior_lines, max_prior_lines, median_prior_lines to result rows. Added prior_therapy_class parameter; when specified, includes a prior_therapy_use column formatted as: required → “Yes (required)”, reported with percentage → “Yes (71%)”, excluded → “No (excluded)”, allowed → “Allowed”, no entry → “NR”.

  5. lib/tasks/one_off/backfill_prior_therapy_context.thor — self-contained one-off backfill task processing all ~62K publications with subgroups (no regex pre-filter). Uses gpt-5-mini. Only writes prior therapy fields (min_prior_lines, max_prior_lines, median_prior_lines, llm_data['prior_therapies']) — does not overwrite existing treatment_lines. Delete when backfill is complete.

  6. lib/tasks/one_off/cleanup_prior_therapy_values.thor — one-off cleanup that nulls out invalid values from the backfill. Delete when done.

  7. Data validation: sanitize_line_count added to treatment_context_extraction.rb and post_process.rb to reject negative sentinel values (-1, -999) the LLM uses instead of null. sanitize_prior_therapies rejects negative exposure_percentage values.

Backfill results:

  • 62,008 publications processed via gpt-5-mini (synchronous)
  • 40,278 subgroups have non-zero prior line counts
  • 61,895 subgroups have at least one prior therapy entry

Post-backfill cleanup: LLM used sentinel values (-1, -999, -2147483648) instead of null for ~9K subgroups. Additionally ~25K subgroups had median outside [min, max] range. All cleaned via cleanup_prior_therapy_values.thor.
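
A minimal sketch of the two cleanup rules, assuming the thor task applies them per subgroup (this is illustrative, not the actual cleanup_prior_therapy_values.thor logic):

```ruby
# Negative sentinels (-1, -999, -2147483648) become nil; a median falling
# outside the [min, max] range is dropped as untrustworthy.
def clean_line_counts(min, max, median)
  min, max, median = [min, max, median].map { |v| v.nil? || v.negative? ? nil : v }
  median = nil if median && min && max && !(min..max).cover?(median)
  { min: min, max: max, median: median }
end
```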

| Publication | Expected | Extracted | Status |
| --- | --- | --- | --- |
| 152908 (BOLD-100, gastric) | min=0, max=7, median=4, 95% platinum | min=0, max=7, median=4, platinum 95.2% | Correct |
| 162733 (sEphB4-HSA, mCRPC) | min=1, max=3, median=3, 71% taxane, 100% AR-targeted | min=1, max=3, median=3, taxane 71.4%, endocrine_therapy required | Correct |
| 53818 (PROfound, olaparib) | Stratified by prior taxane yes/no | "Prior taxane Yes" subgroups: taxane required 100%; "Prior taxane No": taxane excluded | Correct |
| 147778 (GSK2636731, mCRPC) | median=4, 83% taxane | median=4, taxane 83% | Correct |

Known limitations:
  • Subgroup-defining therapies: The LLM sometimes classifies subgroup-defining therapy characteristics (e.g. “Prior taxane Yes”) as reported instead of required/excluded. Full backfill showed inconsistency vs spot-check runs — likely due to temperature: 1 (gpt-5-mini constraint). A prompt improvement could help but is not blocking.
  • Endocrine therapy ambiguity in mCRPC: Background ADT (universally required) and novel AR agents (often excluded) both map to endocrine_therapy, creating apparent contradictions (both required and excluded on the same subgroup). Could be addressed by splitting into separate therapy classes in a future iteration.
  • max_prior_lines zero-sentinel contamination: See Issue 8. The LLM outputs 0 instead of null for unstated max prior lines, producing 124K unusable values. This is a separate issue from the prior therapy extraction itself (which works correctly).

Coverage confirmed:

  • 150,689 subgroups have min_prior_lines (95%)
  • 149,952 have max_prior_lines (94%) — but see Issue 8 for data quality concern
  • 124,264 have median_prior_lines (78%)
  • 61,895 have at least one prior_therapies entry (39%)

Prior therapy class enum distribution is healthy. All 22 enum values are used. Top classes: chemotherapy (19.9K), surgery (6.7K), platinum (6K), checkpoint_inhibitor (5.5K), endocrine_therapy (5.5K). other has 25.4K entries (26%) — high but acceptable given the long tail of therapy types not covered by the 22 predefined classes.

Tracker examples all re-verified correct:

  • Pub 152908 (BOLD-100): min=0, max=7, median=4, platinum 95.2%, fluoropyrimidine 85.7%
  • Pub 162733 (sEphB4-HSA): min=1, max=3, median=3, taxane 71.4%, endocrine_therapy required
  • Pub 53818 (PROfound): both "Prior taxane Yes" and "Prior taxane No" subgroups have taxane with exposure_status = reported rather than required/excluded (see known limitation above)
  • Pub 147778 (GSK2636731): median=4, taxane 83%

Report-readiness: Prior therapy class data and min_prior_lines are usable for reports. max_prior_lines is not usable without the cleanup described in Issue 8.

6. Data cutoff date is not extracted from publication abstracts


The disease clinical evidence worksheet has a Data Cut column that records the date when trial data collection was frozen for analysis (e.g. Jun 26, 2024, Mar 20, 2025).

Data cutoff date is not currently extracted or persisted as structured data. The pipeline already reads this language during endpoint and treatment line extraction but discards it. Data cutoff dates appear in ~6,100 publication abstracts with an extractable date in ~3,800 of those.

Current publication flow:

  1. classify_publications extracts endpoints and adverse events from the abstract. The system prompt references data cutoff incidentally (e.g. for maturity determination) but does not extract the date.
  2. The not_reached boolean on outcome measures captures the consequence of an immature data cutoff but not the cutoff date itself.
  3. is_partial_result / is_partial flags on publications signal interim results but not the specific cutoff date.

Relevant code paths:

1. No extraction schema field for data cutoff date

The Details schema in details.rb captures endpoints, arms, adverse events, study design, and partial result flags — but has no data_cutoff_date field. The LLM reads the cutoff date in the abstract for reasoning (e.g. to determine endpoint maturity via not_reached) but has no output slot to persist it.

2. The efficacy view and export have no data cutoff column

vw_publication_efficacy_data exposes effective_line, treatment_settings, dose, but has no data_cutoff_date. The CSV export (emerging_clinical_data_report.rb) has 37 columns but none for data cutoff.

Example 1: publication 241657 (belzutifan + lenvatinib in RCC)


Abstract states:

“for the first (IA1; data cutoff Jun 26, 2024) and second (IA2; data cutoff Apr 9, 2025) interim analysis”

This publication reports two separate data cutoffs for two interim analyses. The worksheet needs at minimum the most recent cutoff date (Apr 9, 2025). Neither date is captured.

Example 2: publication 116878 (BURAN — buparlisib in HNSCC)


Abstract states:

“data cut-off date of 15 March 2025, with a median follow up of 27 months”

Cutoff date (2025-03-15) is clearly stated. Not captured. The worksheet Data Cut column for this publication would be 15 Mar 2025.

Example 3: publication 240450 (BREAKWATER — encorafenib in mCRC)


Abstract states:

“At data cutoff (Mar 1, 2025), EC+FOLFIRI demonstrated a clinically meaningful and statistically significant improvement…”

Cutoff date (2025-03-01) is stated parenthetically. Not captured.

Example 4: publication 191190 (pembrolizumab + nab-paclitaxel in HNSCC)


Abstract states:

“data cutoff (February 27, 2025; median follow-up 23 months)”

Cutoff date (2025-02-27) is clearly stated. Not captured.

Data cutoff date is clinically essential for interpreting results. When the same trial publishes multiple analyses, each with a different cutoff date, the cutoff distinguishes which analysis the reported endpoints belong to. Without it:

  • The worksheet Data Cut column cannot be populated
  • Analysts cannot distinguish interim from final analysis results for the same trial
  • Publications reporting updated OS at longer follow-up cannot be correctly ordered or attributed

This is not a not_reached problem. The not_reached flag captures whether a median was estimable at all. Data cutoff date describes when the analysis was performed, not whether an endpoint was reached.

This is also not a publication dating problem. publication_date is when the paper was presented or published. Data cutoff date is when the trial database was locked for that analysis — typically months before publication. The two dates serve different purposes and should not be conflated.

Across all publications with abstracts:

| Signal | Publications | % of all pubs with abstracts (194K) |
| --- | --- | --- |
| Mentions data cutoff language | 6,148 | 3.2% |
| Data cutoff with extractable date (month/year or full date) | 3,849 | 2.0% |
| Single cutoff mention | 5,314 | 86% of cutoff pubs |
| Multiple cutoff mentions (2+) | 834 | 14% of cutoff pubs |

For the target worksheet diseases specifically:

| Disease | Total pubs | With data cutoff |
| --- | --- | --- |
| Colorectal Cancer | 2,878 | 208 (7%) |
| HNSCC | 974 | 83 (9%) |
| NSCLC | 4,079 | 623 (15%) |

Key observations:

  1. Data cutoff dates are most common in NSCLC publications (15%), likely reflecting the higher proportion of large randomized trials in lung cancer
  2. ~14% of publications with cutoff language mention multiple cutoffs (e.g. different interim analyses), confirming that subgroup-level persistence is needed
  3. ~63% of publications with cutoff language include an extractable date with at least month + year precision

From spot-checking 30 recent abstracts with data cutoff language:

| Format | Example | Frequency |
| --- | --- | --- |
| Month DD, YYYY | "data cutoff date of October 27, 2025" | Common |
| Mon DD, YYYY | "data cutoff Jun 26, 2024" | Common |
| DD Mon YYYY | "data cut-off (18 Sept 2025)" | Common |
| Mon YYYY (no day) | "data cut-off (July 2025)" | Moderate |
| MM/DD/YYYY | "data cut-off (06/13/2025)" | Rare |
| Month YYYY only | "data cutoff (Oct 2025)" | Moderate |

Some abstracts state only month + year without a specific day. The LLM should extract whatever precision is available.
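
Normalizing these phrasings to ISO 8601 can be sketched with a format-cascade; the format list here is illustrative (not the actual extraction, which is done by the LLM), and it relies on Date.strptime defaulting a missing day to 01, which matches the "use YYYY-MM-01 for month-plus-year" convention:

```ruby
require 'date'

# Try each cutoff date format in turn; return nil if none match.
CUTOFF_FORMATS = ['%B %d, %Y', '%b %d, %Y', '%d %B %Y', '%d %b %Y',
                  '%m/%d/%Y', '%B %Y', '%b %Y'].freeze

def normalize_cutoff(text)
  CUTOFF_FORMATS.each do |fmt|
    begin
      return Date.strptime(text, fmt).iso8601
    rescue ArgumentError
      next # format did not match; try the next one
    end
  end
  nil
end
```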

Some publications report multiple cutoff dates for interim analyses (e.g. publication 241657: IA1 cutoff Jun 26, 2024 and IA2 cutoff Apr 9, 2025). For the worksheet, the most recent cutoff associated with the reported results should be used.
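
The "most recent cutoff" rule can be sketched as a one-line helper (hypothetical name, assuming ISO 8601 date strings as input):

```ruby
require 'date'

# Pick the most recent of several interim-analysis cutoff dates for the
# worksheet Data Cut column; returns nil when no dates are present.
def worksheet_data_cut(cutoff_dates)
  cutoff_dates.compact.map { |d| Date.parse(d) }.max&.iso8601
end
```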

The system already captures related but insufficient signals:

  • not_reached (boolean): Whether a time-to-event median was estimable. Captures endpoint maturity but not the temporal context.
  • is_partial_result (boolean): Whether the publication reports interim results. Related to data cutoff (interim = earlier cutoff) but does not carry the date.
  • publication_date: When the paper was published. Distinct from data cutoff — typically the cutoff is 3–12 months before publication.
  • LLM evidence text: Data cutoff dates appear embedded in llm_data observation/evidence free text (~2,400 publications) but are not structured or queryable.

Open questions:

  • When a publication reports multiple interim analyses with different cutoffs (e.g. IA1, IA2) and different subgroups share the same cutoff, should the cutoff be denormalized onto each subgroup or stored once with an analysis label?
  • Should the LLM extract an analysis label (e.g. “IA1”, “IA2”, “primary analysis”) alongside the cutoff date?

1. Subgroup-level persistence, not publication-level

Data cutoff date belongs on trial_subgroups, not on publications. Evidence:

  • ~14% of publications with cutoff language mention multiple cutoffs (e.g. pub 241657: PFS cutoff Jun 2024, OS cutoff Apr 2025 for different interim analyses)
  • Different subgroups or endpoint sets within the same publication can reference different analysis cutoffs
  • The common case (~86%) is a single cutoff — all subgroups inherit it, so subgroup-level handles both cases

This is consistent with how treatment_lines already persists on trial_subgroups.

2. Bake into classify_publications for going-forward extraction

The classify_publications task (PublicationsLlmClassification::Task) already reads the full abstract and encounters data cutoff language naturally. Add data_cutoff_date to the SubgroupOutcome schema in details.rb:

class SubgroupOutcome
  # ... existing fields ...
  desc 'Data cutoff date for results reported under this subgroup, in ISO 8601 format (YYYY-MM-DD). ' \
       'Use YYYY-MM-01 when only month and year are stated. Null if not mentioned in the abstract.'
  attribute :data_cutoff_date, :string, nullable: true
end

System prompt addition in task.rb (extend section 3, Endpoints and Outcome Measures):

** Data Cutoff Date:
- If the abstract states a data cutoff date for the results in this subgroup
(e.g. "data cutoff Jun 26, 2024", "data cut-off date was Mar 20, 2025"),
extract it as data_cutoff_date in YYYY-MM-DD format.
- Use YYYY-MM-01 when only month and year are given.
- If the publication reports a single cutoff for all results, apply it to every
subgroup_outcome_measures entry.
- If different analyses have different cutoffs, assign each cutoff to the
subgroup(s) whose results it covers.
- Leave null if not explicitly stated — do not infer from publication date.

Why this is better than a separate task:

  • Zero marginal cost — one nullable string field per subgroup entry adds negligible tokens
  • The LLM already has the full abstract in context and already reasons about data maturity (not_reached, is_partial_result)
  • No new task class, no new Thor command, no new workflow step for going-forward publications
  • Schema stays co-located with the subgroup outcome data it describes

3. Post-processing: propagate to trial_subgroups

post_process_publications already creates/updates trial_subgroups from llm_data['subgroup_outcome_measures']. Add data_cutoff_date to the attributes written during post-processing. This requires a migration to add data_cutoff_date (date, nullable) to the trial_subgroups table.

4. Backfill task for all existing result publications

A separate backfill task extracts data_cutoff_date from all existing result publications that already have llm_data['subgroup_outcome_measures'] (~63K publications). No regex pre-filtering — the LLM decides whether a cutoff date is present, not a pattern match. Regex would silently miss publications that state cutoff dates in unexpected phrasing.

The backfill task:

  • Reads the publication abstract and its existing trial_subgroups records
  • Extracts data_cutoff_date per subgroup
  • Writes directly to trial_subgroups.data_cutoff_date and trial_subgroups.llm_data, same pattern as TreatmentContextExtraction (which finds each trial_subgroup by ID and updates in place)
  • Does NOT re-run post_process_publications — that would destroy and recreate all trial_subgroups, wiping treatment lines and disease adjudication data
  • Runs as a one-time Thor task, similar in pattern to adjudicate_subgroup_diseases (Issue 1)

Estimated cost: ~$30–50 with gpt-5-mini for ~63K publications (single nullable date field per subgroup, minimal output tokens).

After backfill, the going-forward path (classify_publications) handles all new publications automatically.

Status: Implemented — backfill complete (validated 2026-03-13)

All code changes are in place. Backfill has been run.

Going-forward extraction (classify_publications)

  • Added data_cutoff_date (string, nullable) to SubgroupOutcome in details.rb
  • Added data cutoff extraction instructions to the system prompt in task.rb
  • Updated post_process.rb to propagate data_cutoff_date from llm_data['subgroup_outcome_measures'] to trial_subgroups.data_cutoff_date

New publications processed through classify_publications → post_process_publications will automatically have data cutoff dates extracted and persisted.

  • Added data_cutoff_date (date, nullable) column to trial_subgroups
  • Added data_cutoff_date to vw_publication_efficacy_data (v10) sourced from trial_subgroups.data_cutoff_date
  • Added Data Cut column to EmergingClinicalDataQuery output and CSV export

One-off backfill task at lib/tasks/one_off/backfill_data_cutoff_dates.thor extracts data cutoff dates from all existing result publications with trial_subgroups. No regex pre-filter — all ~62K publications are sent to gpt-5-mini (estimated cost ~$6-10). The LLM returns null for publications without cutoff language.

Run with:

bundle exec thor one_off:backfill_data_cutoff_dates:extract --batched --parallelism=4

Tested on 6 publications with known cutoff dates (5 extracted correctly; 1 correctly returned null for an abstract that says “at data cut-off” without stating the date):

| Pub ID | Abstract says | Extracted | Correct? |
| --- | --- | --- | --- |
| 116878 | "data cut-off date of 15 March 2025" | 2025-03-15 | Yes |
| 163930 | "data cutoff" Feb 4, 2021 | 2021-02-04 | Yes |
| 190005 | "at data cut-off" (no date) | null | Yes |
| 190016 | cutoff Sept 16, 2024 | 2024-09-16 | Yes |
| 190620 | "data cutoff, 01 Aug 2025" | 2025-08-01 (all 14 subgroups) | Yes |
| 190677 | "data cutoff (07 Oct 24)" | 2024-10-07 | Yes |

Coverage: 30,369 subgroups across 11,203 distinct publications have data_cutoff_date populated (19.1% of all publication subgroups), confirming the backfill ran. Coverage exceeds the pre-implementation estimate of ~6K abstracts with cutoff language, suggesting the LLM finds cutoff dates in phrasings the keyword-based estimate missed.

Tracker spot-check pubs re-verified:

| Pub | Expected | Actual | Correct? |
| --- | --- | --- | --- |
| 116878 (BURAN) | 2025-03-15 | 2025-03-15 | Yes |
| 190016 (SERENA-1) | 2024-09-16 | 2024-09-16 | Yes |
| 190620 (POD1UM-303) | 2025-08-01 (all 17 subgroups) | 17/17 populated | Yes |
| 190677 (CAPItello-281) | 2024-10-07 | 2024-10-07 | Yes |
| 190005 (TROPION-Breast01) | null (no date in text) | null | Yes |

Tracker examples 241657 and 240450 have zero subgroups — they are newly ingested ASCO 2025 publications (created 2026-03-10) that haven’t been through classify_publications yet. Once the publication workflow runs, cutoff dates will be extracted automatically by the going-forward path.

Minor data quality issues:

  • 9 subgroups have cutoff dates before 2000 — verified as legitimate (e.g. pub 144506 is a 1988 pilot study in Qidong County).
  • 2 subgroups (pub 109543) have cutoff date 2028-12-01 — hallucinated future date. Should be cleaned.

7. AE grade category enum is too coarse — grade 1-2 rows misclassified as all_grade


The grade_category field on adverse_events only supports two values: all_grade and grade_gte3. Many publication abstracts report AEs in finer grade buckets (grade 1-2, grade 3-4, grade 5/fatal, SAE). When forced into the binary, grade 1-2 rows get shoehorned into all_grade, which is incorrect — true all-grade incidence includes all grades, while grade 1-2 is a strict subset.

This produces ~50 AE pairs where the grade_gte3 value is higher than the all_grade value for the same AE name, which is counter-intuitive but affects <0.3% of publications with AE data.

  • 36,545 publications have AE rows with grade_category
  • 312 publications (0.9%) have the same AE name under both grade categories
  • 50 AE pairs across those 312 pubs show inverted values (grade_gte3 > all_grade)
  • 92 of the 312 are in target disease areas
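
The inversion check described above can be sketched as follows (illustrative helper, not the actual audit query; `rows` is assumed to be symbol-keyed hashes with per-AE percentages):

```ruby
# For a given AE name, grade >=3 incidence can never logically exceed the
# all-grade incidence; report names where the extracted values are inverted.
def inverted_ae_pairs(rows)
  rows.group_by { |r| r[:name] }.filter_map do |name, group|
    all  = group.find { |r| r[:grade_category] == 'all_grade' }
    gte3 = group.find { |r| r[:grade_category] == 'grade_gte3' }
    next unless all && gte3 # need both buckets to compare
    name if gte3[:percentage] > all[:percentage]
  end
end
```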

Current misclassification breakdown from observation text analysis:

| Observation pattern | Classified as all_grade | Classified as grade_gte3 | Issue |
| --- | --- | --- | --- |
| Explicitly "all grade" / "any grade" | 5,143 | 401 | 401 wrong |
| Grade 1-2 specific | 7,883 | 221 | 7,883 should be grade_1_2 |
| Grade 3-4 specific | 1,679 | 15,204 | 1,679 wrong |
| Grade 5 / fatal | 176 | 3,803 | Should be own category |
| SAE context | 1,287 | 1,804 | Should be own category |
| No observation | 20,152 | 16,413 | Ambiguous |
| Other | 35,053 | 13,173 | Mixed |

The grade 1-2 → all_grade misclassification (7,883 rows) is the largest single issue. The grade ≥3 column is mostly correct, so the clinically important safety signal is preserved. The all-grade column underreports in affected cases.

Expand grade_category to a richer enum and re-run the backfill:

# Current: all_grade, grade_gte3
# Proposed:
attribute :grade_category, :string
# enum: all_grade, grade_1_2, grade_gte3, grade_3_4, grade_5_fatal, sae
| Value | Meaning | Ranked summary use |
| --- | --- | --- |
| all_grade | True all-grade / any-grade / unspecified | "Most Frequent AE All Grade" |
| grade_1_2 | Grade 1-2 only (low-grade bucket) | Excluded from ranked summaries |
| grade_gte3 | Grade ≥3 / grade 3+ | "Most Frequent AE >=Gr3" |
| grade_3_4 | Grade 3-4 specifically | Treated same as grade_gte3 for ranking |
| grade_5_fatal | Grade 5 / fatal / treatment-related death | Separate or excluded |
| sae | Serious adverse event (any grade) | Excluded from ranked summaries |

The ranked summary helper would then:

  • “Most Frequent AE All Grade” → filter to all_grade only (not grade_1_2 or sae)
  • “Most Frequent AE >=Gr3” → filter to grade_gte3 + grade_3_4

This eliminates the inversion problem because grade 1-2 and SAE rows no longer contaminate the all-grade bucket.
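
A minimal sketch of the proposed filtering, with an illustrative constant name (the actual ranked summary helper is not shown here; note the implemented version later also adds grade_5_fatal to the ≥Gr3 bucket):

```ruby
# Which grade_category values feed each ranked summary column (proposal).
RANKED_SUMMARY_CATEGORIES = {
  'Most Frequent AE All Grade' => %w[all_grade],
  'Most Frequent AE >=Gr3'     => %w[grade_gte3 grade_3_4]
}.freeze

def rankable?(column, grade_category)
  RANKED_SUMMARY_CATEGORIES.fetch(column).include?(grade_category)
end
```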

Cost: Re-running the full AE grade backfill at ~$10 with gpt-5-mini. The schema change to the extraction prompt and the AdverseEventGradeBackfill task already exist — just need to expand the enum, update the prompt, and re-run.

Downstream changes: Update AdverseEvent model normalization, ranked summary helper, and export query to handle the expanded enum.

Implemented 2026-03-13. Commit ef8bcfa8.

Enum expansion: adverse_events.grade_category expanded from 2 values to 6: all_grade, grade_1_2, grade_gte3, grade_3_4, grade_5_fatal, sae. All model normalization, extraction schema, backfill task, and export queries updated.

Backfill completed. Current distribution across 148,084 classified rows:

| grade_category | Count | % |
| --- | --- | --- |
| all_grade | 60,450 | 40.8% |
| grade_gte3 | 37,747 | 25.5% |
| grade_3_4 | 21,685 | 14.6% |
| grade_1_2 | 12,907 | 8.7% |
| sae | 6,865 | 4.6% |
| grade_5_fatal | 6,430 | 4.3% |
| NULL | 2,305 | 1.6% |

Ranked summary updated: “Most Frequent AE All Grade” filters to all_grade only (excluding grade_1_2 and sae). “Most Frequent AE >=Gr3” filters to grade_gte3 + grade_3_4 + grade_5_fatal.

Residual: 2,305 rows (1.6%) across 1,257 publications still have NULL grade_category. Inverted AE pairs reduced from ~50 to 33 — remaining inversions likely reflect genuine data complexity (e.g. subgroup-level AE rates where a smaller subgroup has higher grade ≥3 than the overall all-grade rate).

8. max_prior_lines zero-sentinel contamination


The TreatmentContextExtraction LLM task outputs 0 instead of null for max_prior_lines when the abstract does not state a maximum number of prior therapies. This produces 124,446 subgroups (78% of all publication subgroups) with max_prior_lines = 0, of which 12,924 are logically impossible (min_prior_lines > max_prior_lines).

TreatmentContextExtraction (app/tasks/publications_llm_classification/treatment_context_extraction.rb):

  • Schema declares attribute :max_prior_lines, :integer, nullable: true with desc "Null if not stated."
  • System prompt (line 150): "Leave null if not stated. Do not infer counts that are not explicitly stated."
  • sanitize_line_count rejects negative values (value.negative?) but passes 0 through

Two contributing factors:

  1. Structured outputs integer default: When the LLM generates structured JSON with an integer field and the value is conceptually “not applicable,” many models default to 0 rather than null, even when the schema allows nullable and the prompt says “null if not stated.” This is a known behavior pattern with OpenAI structured outputs.

  2. Sanitizer gap: sanitize_line_count (line 411) was designed to catch the -1/-999 sentinel pattern discovered during the initial backfill, but did not anticipate 0 as a sentinel because 0 is a valid value for treatment-naïve (1L) populations.

| max_prior_lines | Count | % |
| --- | --- | --- |
| 0 | 124,446 | 78.4% |
| 1-3 | 13,527 | 8.5% |
| 4-10 | 6,851 | 4.3% |
| >10 | 5,128 | 3.2% |
| NULL | 8,757 | 5.5% |

Logically impossible rows (min > max): 12,924

Breakdown by treatment line for max_prior_lines = 0:

| Treatment line | min=0 & max=0 | min>0 & max=0 (contradictory) |
| --- | --- | --- |
| 2L+ | 17,225 | 6,282 |
| 3L+ | 5,614 | 3,411 |
| 1L only | 25,827 | 129 |
| Other (Adj/Neo/Ind/etc.) | 51,922 | 771 |

For 1L publications, min=0, max=0 is valid (treatment-naïve = zero prior lines). For 2L+ and 3L+ publications, max=0 is always wrong — by definition these populations have ≥1 prior line.
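
That rule can be sketched as a predicate (illustrative names, not the actual cleanup code; `treatment_lines` is assumed to be an array of line labels as stored on the subgroup):

```ruby
# A max_prior_lines of 0 is a sentinel whenever min > 0 (contradiction) or
# the population is pretreated by definition; 0 is valid only for 1L /
# treatment-naive settings.
PRETREATED_LINES = %w[2L 2L+ 3L 3L+ 4L 4L+ 5L 5L+].freeze

def sentinel_zero_max?(min_prior_lines, max_prior_lines, treatment_lines)
  return false unless max_prior_lines == 0
  return true if min_prior_lines.to_i > 0 # min > max contradiction
  (treatment_lines & PRETREATED_LINES).any?
end
```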

| Pub | Subgroup | Treatment line | min | max | Abstract says |
| --- | --- | --- | --- | --- | --- |
| 69513 | Asian pts | 3L+ | 2 | 0 | "at least 2 prior lines" (no max stated) |
| 45604 | Overall | 2L+ | 1 | 0 | "previously treated" (no max stated) |
| 101698 | HRAS-mutated UC → Evaluable | 2L+ | 1 | 0 | "at least one prior therapy" (no max stated) |
| 121922 | Overall | 3L+ | 2 | 0 | "≥2 prior systemic therapies" (no max stated) |

In all cases the abstract provides a minimum threshold but no maximum. The LLM correctly extracted min_prior_lines but output 0 instead of null for max_prior_lines.

  • max_prior_lines is not usable for reports in its current state — 78% of values are sentinel zeros
  • The Max Prior Lines column in the worksheet export will show 0 for the vast majority of rows, which is misleading
  • The efficacy view (vw_publication_efficacy_data) exposes max_prior_lines directly from trial_subgroups, so the bad values propagate to all downstream consumers
  • min_prior_lines is less affected — 0 is valid for 1L populations, and the contradictory cases (min > 0 with max = 0) are identifiable

Part 1: Cleanup existing data

The cleanup is not straightforward because 0 is valid for 1L populations. Possible approaches:

  1. Conservative (rule-based): Set max_prior_lines = NULL where min_prior_lines > max_prior_lines (12,924 rows — clearly wrong). This fixes the worst cases but leaves ~111K ambiguous max=0 rows untouched.

  2. Moderate (rule-based with treatment line context): Additionally set max_prior_lines = NULL where max_prior_lines = 0 AND treatment_lines contains 2L+ or 3L+ (these populations by definition have ≥1 prior line, so max=0 is impossible). This would cover ~23K additional rows.

  3. Aggressive (re-extract via LLM): Re-run TreatmentContextExtraction on all affected publications. Most accurate but costs another ~$30 and risks other field drift. Could be scoped to only publications where max_prior_lines = 0 AND treatment_lines is not 1L.

Recommendation: Start with approach 2 (rule-based cleanup of clearly wrong values), then evaluate whether the remaining 1L + max=0 population needs LLM re-extraction or if 0 is acceptable there.

Part 2: Prevent recurrence

Two changes needed:

  1. Update sanitize_line_count to also reject 0 for max_prior_lines when min_prior_lines > 0:

    def sanitize_line_count(value)
      return nil if value.nil? || value.negative?
      value
    end

    This alone is insufficient because the sanitizer doesn’t have cross-field context. Better to add a post-persist validation.

  2. Update the system prompt to be more explicit about the 0-vs-null distinction:

    - IMPORTANT: Use null (not 0) when no maximum is stated. 0 means "zero prior lines"
    (treatment-naïve only). If the abstract says "at least 2 prior lines" with no upper
    bound, set min=2 and max=null, NOT max=0.
  3. Add a cross-field sanitizer in persist_results that nulls max_prior_lines when min > max:

    subgroup.max_prior_lines = nil if subgroup.min_prior_lines.present? &&
      subgroup.max_prior_lines.present? &&
      subgroup.min_prior_lines > subgroup.max_prior_lines

Implemented 2026-03-13. Two-part fix:

Part 1: Prevent recurrence

  • Updated TreatmentContextExtraction system prompt with explicit zero-vs-null disambiguation and concrete example
  • Added cross-field validation (min > max → max = nil) in all three persist paths: TreatmentContextExtraction#persist_results, PostProcess outcome measure building, and backfill_prior_therapy_context.thor
  • Added MAX_PLAUSIBLE_PRIOR_LINES = 25 threshold to all three sanitize_line_count methods — values above 25 are nulled on persist (verified via spot-checking that real abstracts top out at ~20 prior lines in heavily pretreated myeloma/phase 1 basket trials)
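
A minimal sketch of the hardened sanitizer described above, assuming the same method shape as the original (negatives and implausibly large values are nulled; 0 passes through because it is valid for 1L populations):

```ruby
# Values above this are treated as LLM sentinels (999, INT_MAX, etc.).
MAX_PLAUSIBLE_PRIOR_LINES = 25

def sanitize_line_count(value)
  return nil if value.nil? || value.negative? || value > MAX_PLAUSIBLE_PRIOR_LINES
  value
end
```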

Part 2: Historical data cleanup

  • Extended cleanup_prior_therapy_values.thor with three new cleanup rules:
    1. Nulled max_prior_lines where min_prior_lines > max_prior_lines (12,924 rows)
    2. Nulled max_prior_lines = 0 where treatment_lines contains 2L+/3L+/2L/3L/4L/4L+/5L/5L+ (23,193 additional rows)
    3. Nulled sentinel junk (values > 25) across all three fields: min_prior_lines (95 rows), max_prior_lines (3,013 rows), median_prior_lines (1,281 rows). Common sentinels included INT_MAX (2,147,483,647), 999, 999999, 65535, 32767, 123456789, etc.
  • Total cleaned: 40,506 rows across all rules

Post-cleanup validation:

  • Zero contradictory rows (min > max) remain
  • Zero impossible zeros for 2L+/3L+ populations remain
  • Spot-checked 4 example publications (69513, 45604, 101698, 121922) — all now have max_prior_lines = NULL
  • 1L populations with min=0, max=0 preserved correctly
  • All three fields now cap at plausible values (max observed: min=14, max=20, median=18)
  • ~40 rows in the 21-25 range were nulled that may have been valid; unrecoverable without LLM re-extraction, but impact is negligible

Post-cleanup state confirmed:

  • 0 contradictory rows (min > max) remain
  • 90,476 subgroups still have max_prior_lines = 0 — all are in 1L (5,564) or non-line-specific settings (Adjuvant, Neoadjuvant, Induction, etc.: 84,912). No 2L+/3L+ zeros remain.
  • The remaining zeros in non-line-specific settings (e.g. Adjuvant with max=0) are likely still sentinel zeros, but these populations have no treatment line context anyway, so the downstream impact is negligible.
  • max_prior_lines is now usable for reports where treatment line context exists. For 1L populations, max=0 is valid. For populations without treatment lines, max_prior_lines should be treated as unreliable.

9. All-grade AE extraction gap — originally ~13K publications, revised to ~14 after investigation

Originally suspected that classify_publications fails to extract all-grade named AEs for ~13,000 publications. After deep investigation (2026-03-16), the issue is much narrower than initially estimated.

The 12,986 publications with only grade≥3 AE rows break down as:

| Category | Publications | Genuine extraction failure? |
| --- | --- | --- |
| Abstract genuinely only reports grade ≥3 AEs | ~11,200 | No — abstract has no all-grade named AE data |
| Abstract mentions “any grade” in aggregate context only (e.g. “discontinuation due to any grade TRAE”) | ~400 | No — “any grade” appears as an aggregate stat, not per-AE |
| Abstract has grade_1_2 AEs separately (not combined all-grade) | 1,744 | No — abstract reports low/high grade separately, not combined |
| Abstract has clear two-column AE table (Any grade + Grade ≥3) but LLM misclassified | ~14 | Yes — any-grade values extracted but labeled as grade_gte3 |

The LLM extracts numeric values from embedded AE tables but reads the first column (any-grade) and labels it as grade≥3, completely ignoring the second column (the actual grade≥3 values). This was caused by the old binary grade_category enum (all_grade/grade_gte3), which didn’t give the LLM enough guidance to distinguish columns.

Confirmed example: pub 60886 (Debio 0123 + carboplatin, phase 1)

Abstract table:

| TEAE | Any grade n (%) | Grade ≥3 n (%) |
| --- | --- | --- |
| Thrombocytopenia | 12 (31.6) | 3 (7.9) |
| Nausea | 12 (31.6) | 0 |
| Anemia | 8 (21.1) | 1 (2.6) |
| Fatigue | 7 (18.4) | 0 |

Before (old extraction): 7 rows, all grade_gte3 — values 31.6%, 31.6%, 21.1% are the ANY-GRADE column mislabeled. Grade≥3 column (7.9%, 0%, 2.6%, 0%) completely missing.

After re-extraction (current prompt with 6-value enum): 14 rows — 7 all_grade (31.6%, 31.6%, 21.1%, 18.4%, 13.2%, 13.2%, 10.5%) + 7 grade_gte3 (7.9%, 2.6%, 2.6%, 2.6%, 0%, 0%, 0%). All correct.
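
The corrected behavior can be pictured as one extracted row per grade column rather than one mislabeled row — a hypothetical illustration of the target row shape, not the actual extraction code:

```ruby
# For a two-column AE table row, emit an all_grade row AND a grade_gte3 row.
def ae_rows_from_table_row(ae_name, any_grade_pct, grade_gte3_pct)
  [
    { ae_name: ae_name, grade_category: 'all_grade',  percentage: any_grade_pct },
    { ae_name: ae_name, grade_category: 'grade_gte3', percentage: grade_gte3_pct }
  ]
end
```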

Confirmed example: pub 56057 (Debio 0123 + carbo + etoposide, phase 1)

Same pattern — any-grade column extracted as grade≥3. After re-extraction: 4 all_grade + 4 grade_gte3, all values correct.

The expanded 6-value grade_category enum (Issue 7, implemented 2026-03-13) and the detailed grade classification instructions in the prompt give the LLM enough context to correctly distinguish table columns. Re-running classify_publications on the affected pubs with the current prompt produces correct results — verified on 2/2 test pubs.

  • ~11,200 pubs where the abstract genuinely only reports grade≥3: The all-grade data is not in the abstract. It may be in the full paper, poster, or oral presentation. This is a data availability limitation, not an extraction failure.
  • ~400 pubs where “any grade” appears in aggregate context: The abstract says things like “discontinuation due to any grade TRAE occurred in 7.5%” — this is an aggregate stat correctly handled by the safety metrics extraction, not individual named AEs.
  • 1,744 pubs with grade_1_2 + grade≥3 but no all_grade: The abstract reports grades separately (grade 1-2 and grade 3-4), not as a combined “any grade” bucket. This is correct — the “Most Frequent AE All Grade” column should only use true all-grade data, not sum of grade buckets.

35 AE pairs across pubs WITH both all_grade and grade≥3 rows show grade≥3 > all_grade for the same AE name. These are likely the same column-swap bug in pubs that DID get partial all-grade extraction. The Issue 10 re-extraction (2,182 pubs through full classify_publications) will fix any that overlap.

Going forward: The Issue 7 enum expansion (2026-03-13) and current prompt instructions are sufficient — re-running classify_publications on affected pubs produces correct two-column extraction. Verified on pubs 60886 and 56057: before=7 rows all grade_gte3 (any-grade values mislabeled), after=14 rows (7 all_grade + 7 grade_gte3, all values correct).

Why the Issue 7 AE grade backfill didn’t fix existing data: The backfill (AdverseEventGradeBackfill) can only reclassify existing AE rows — it cannot create new rows. For the ~14 affected pubs:

  • The original classify_publications extracted only the any-grade column values and labeled them grade_gte3 (wrong)
  • The grade≥3 column values (Nausea 0%, Thrombocytopenia 7.9%, etc.) were never extracted as rows at all
  • The backfill skipped these rows because grade_category was already non-null (set incorrectly by the original extraction)
  • Even with --overwrite, the backfill would at best reclassify the 7 rows from grade_gte3 → all_grade, but the 7 missing grade≥3 rows still wouldn’t exist

Fix requires re-running classify_publications on the affected pubs — only full re-extraction creates both sets of rows. The ~14 pubs will be fixed by either:

  1. The Issue 10 re-extraction (2,182 pubs) if they overlap, or
  2. The next full publications workflow run

No additional prompt changes or backfill tasks needed.

10. classify_publications drops subgroups identified by extract_subgroups

The classify_publications LLM task receives a list of subgroups with endpoint associations from the upstream extract_subgroups step, but sometimes drops subgroups entirely — producing subgroup_outcome_measures entries for only a subset of the provided subgroups. The subgroup extraction step correctly identifies the subgroup, the schema enum correctly includes it, and the endpoint association is correctly passed — but the main classification LLM simply doesn’t create an output entry for it.

This was discovered during worksheet validation against the client sheet 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8.

Publication classification runs in two LLM steps:

  1. extract_subgroups (subgroup_extraction.rb) reads the abstract and identifies subgroup labels with their endpoint associations. Output is llm_data['subgroup_endpoints'].

  2. classify_publications (task.rb) receives subgroup_endpoints, derives distinct_subgroups, and passes them to the main LLM as:

    • A subgroup_endpoints field in the user prompt (subgroup → endpoint mapping)
    • An enum constraint on subgroup_outcome_measures[].value in the structured output schema (details.rb line 185)
    • A system prompt instruction: “Look at the provided ‘subgroup_endpoints’, keep the associations between the endpoints and subgroups as they are.” (line 31)

The schema constraint (details.rb line 185) enforces that subgroup_outcome_measures[].value MUST be one of the distinct_subgroups values — the LLM cannot hallucinate new subgroups. But the schema does not enforce that every enum value must appear at least once. The LLM is free to produce output with only a subset of the provided subgroups, and it does.

The structured output schema makes subgroup entries optional, not required.

The subgroup_outcome_measures field is an array of objects. Each object has a value field constrained to the enum. But the array itself has no minimum length and no constraint requiring each enum value to appear. The LLM is structurally allowed to produce output with 1 subgroup entry out of N provided.
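
The gap can be sketched as a JSON-schema fragment (hypothetical shape, not the exact details.rb code): the enum constrains each entry’s value, but nothing requires every enum member to appear.

```ruby
def subgroup_outcome_measures_schema(distinct_subgroups)
  {
    type: 'array',
    # No minItems here — the LLM may emit fewer entries than enum values.
    items: {
      type: 'object',
      properties: {
        value: { type: 'string', enum: distinct_subgroups }
      },
      required: ['value']
    }
  }
end
```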

The system prompt says to “keep the associations as they are” but this is a soft instruction. With structured outputs, the LLM’s tendency to minimize output length can override soft prompt instructions, especially when one subgroup has much more data in the abstract than another.

Example 1: pub 47147 (sigvotatug vedotin + pembrolizumab, ASCO 2025) — confirmed LLM drop

Abstract text (verbatim):

“In 7 efficacy-evaluable pts with TPS≥1 NSCLC, 1 confirmed (c) complete response (CR), 1 c partial response (PR), and 2 PRs pending confirmation were observed (ORR 57%; cORR 29%). In 8 efficacy-evaluable pts with 1L HNSCC, 2 cCR and 1 cPR were observed (cORR 37.5%).”

Both disease cohorts are in the same sentence block with explicit efficacy values.

extract_subgroups output (llm_data['subgroup_endpoints']):

[
  {
    "endpoint": "Objective response Rate",
    "subgroups": ["NSCLC (PD-L1 TPS ≥1)"]
  },
  {
    "endpoint": "Confirmed Objective Response Rate",
    "subgroups": ["NSCLC (PD-L1 TPS ≥1)", "1L HNSCC"]
  }
]

Step 1 correctly identified both subgroups. 1L HNSCC is associated with Confirmed Objective Response Rate.

distinct_subgroups passed to schema enum: ["NSCLC (PD-L1 TPS ≥1)", "1L HNSCC"]

Both subgroups were available as valid enum values in the structured output schema.

classify_publications output (llm_data['subgroup_outcome_measures']):

[
  {
    "type": "disease",
    "value": "NSCLC (PD-L1 TPS ≥1)",
    "outcome_measures": [
      { "endpoint": "ORR", "measure_value": 57, "number_of_participants": 7 },
      { "endpoint": "cORR", "measure_value": 29, "number_of_participants": 7 }
    ]
  }
]

Only the NSCLC subgroup was created. The 1L HNSCC subgroup with cORR=37.5% was completely dropped despite being:

  • explicitly mentioned in the abstract with a numeric value
  • correctly identified by extract_subgroups
  • present in the schema enum
  • associated with Confirmed Objective Response Rate in the input

Worksheet impact: The sheet row for HNSCC says ORR=37.5% from this trial (NCT04389632). Our database has no HNSCC efficacy row for this publication.

Example 2: pub 71934 (cofetuzumab pelidotin, ESMO 2023) — data not in abstract table

Abstract embedded table has two columns:

| Parameter | NSQ EGFR WT, PTK7 ≥90%/≥2+ N=21 | Overall N=56 |
| --- | --- | --- |
| ORR | 30.0% | 19.6% |
| CBR | 90.0% | 78.6% |
| mDOR | 5.8 mo | 7.2 mo |
| mPFS | 5.5 mo | 5.3 mo |

The LLM correctly extracted both columns as subgroups: PTK7-expressing rNSCLC (Overall) and NSQ EGFR WT → PTK7 ≥90%.

The abstract narrative mentions three histology cohorts: “27 NSQ EGFR WT, 13 NSQ EGFR mutant, and 16 squamous (SQ)” and states “Enrollment of SQ and NSQ EGFR mutant pts was halted to prioritize NSQ EGFR WT accrual due to response rates in each subgroup.”

However, the per-histology ORR values (including sqNSCLC ORR=12.5% from the worksheet) are not present in the abstract’s table or narrative text. The abstract only shows the overall and NSQ EGFR WT results. The squamous-specific data was likely in the poster or supplementary material, not the abstract.

This is NOT an LLM extraction failure — the data isn’t in the text we have. The worksheet’s sqNSCLC ORR=12.5% comes from a source outside our abstract corpus.

The two examples show different failure modes:

  1. Pub 47147 (HNSCC cORR=37.5%): Pure LLM output quality failure. The data is in the abstract, the subgroup was correctly identified upstream, the schema allowed it — but the LLM still dropped it. This is the actionable issue.

  2. Pub 71934 (sqNSCLC ORR=12.5%): Not an extraction failure. The data isn’t in the abstract. The worksheet references data from a source we don’t have.

For the actionable case (pub 47147 pattern), the root cause is:

  • The structured output schema does not require completeness — the LLM can produce fewer subgroup_outcome_measures entries than there are enum values
  • The system prompt instruction (“keep the associations as they are”) is not strong enough to override the LLM’s tendency to minimize output when one subgroup has much less data than another
  • The HNSCC subgroup had only one endpoint value (cORR=37.5%) while NSCLC had two (ORR=57%, cORR=29%), making it a “smaller” subgroup that the LLM is more likely to drop

Measured by comparing llm_data['subgroup_endpoints'] (distinct subgroups identified by extract_subgroups) against llm_data['subgroup_outcome_measures'] (entries with non-empty outcome_measures produced by classify_publications):

| Status | Publications | % |
| --- | --- | --- |
| All subgroups used | 55,683 | 84.8% |
| Partial drop (some subgroups lost) | 9,245 | 14.1% |
| Total drop (all subgroups lost — zero outcome measures) | 473 | 0.7% |
| More used than identified (LLM created extra) | 293 | 0.4% |
| Total with dropped subgroups | 9,718 | 14.8% |

Note: initial measurement (2,760) undercounted due to a category filter that excluded PubMed, EHA, and other non-ASCO sources. The corrected count uses result = true across all sources.

1. Strengthen the prompt instruction (going-forward prevention)

Add explicit language to the system prompt in task.rb:

IMPORTANT: You MUST create a subgroup_outcome_measures entry for EVERY subgroup in the
provided list that has associated endpoints. Do not skip subgroups even if they have fewer
results than others. If a subgroup has only one endpoint value, still create the entry.
Every subgroup provided to you was identified because the abstract contains results for it.

This won’t guarantee compliance (the current prompt already says “keep the associations as they are” and the LLM ignores it), but it raises the bar.

2. Schema-level enforcement

Add minItems: distinct_subgroups.length to the subgroup_outcome_measures array in to_json_schema. OpenAI structured outputs may or may not honor this — needs testing. If it works, it forces the LLM to produce at least N entries, preventing the drop.

schema[:properties]['subgroup_outcome_measures'][:minItems] = distinct_subgroups.length

3. Post-extraction validation + selective re-extraction (fix existing data)

The detection query is cheap (no LLM needed):

-- Compare identified vs used subgroup counts.
-- (HAVING on select aliases without GROUP BY is invalid in Postgres;
--  compute the counts in a subquery and filter in the outer query.)
SELECT id, identified, used
FROM (
  SELECT p.id,
         (SELECT count(DISTINCT sg)
          FROM jsonb_array_elements(p.llm_data -> 'subgroup_endpoints') e,
               jsonb_array_elements_text(e -> 'subgroups') sg) AS identified,
         (SELECT count(*)
          FROM jsonb_array_elements(p.llm_data -> 'subgroup_outcome_measures') s
          WHERE jsonb_array_length(s -> 'outcome_measures') > 0) AS used
  FROM publications p
  WHERE ...
) counts
WHERE identified > used

For the affected publications (2,760 in the initial measurement; ~9,700 after the corrected count), re-run classify_publications with the strengthened prompt. Estimated cost: ~$20 with o4-mini for 2,760 pubs.

This could also be wired as a permanent validation step in post_process_publications that flags mismatches for automatic re-extraction (with a retry limit to prevent infinite loops on genuinely ambiguous abstracts).

4. For the total-drop publications with zero outcome_measures (701 in the initial measurement; 473 after the corrected count)

These need separate investigation — likely a mix of:

  • Trial-in-progress abstracts (correct behavior, no results to extract)
  • Genuine extraction failures where the LLM returned empty outcomes
  • Abstracts too short or ambiguous for the LLM to extract anything

A quick filter: check if partial_result_tags contains ‘Trial Design/Enrollment’ — if yes, the empty outcome is expected.
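
That filter might look like the following (hypothetical helper; assumes partial_result_tags is an array of tag strings):

```ruby
# Empty outcomes are expected for trial-in-progress abstracts.
def expected_empty_outcomes?(partial_result_tags)
  Array(partial_result_tags).include?('Trial Design/Enrollment')
end
```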

5. Re-run chain for the affected publications (initially 2,760; ~9,700 after the corrected count)

Because post_process_publications destroys and recreates trial_subgroups (line 138 of post_process.rb), re-running classify_publications requires re-running downstream steps that write to subgroup rows. The full chain:

  1. classify_publications --publication_ids <ids> --batched — re-extracts subgroup_outcome_measures with the fixed prompt. Reads llm_data['subgroup_endpoints'] (already correct from extract_subgroups). ~$20 with o4-mini.

  2. post_process_publications --publication_ids <ids> --overwrite — destroys all trial_subgroups, trial_outcome_measures, adverse_events, trial_disease_details for these pubs and recreates from llm_data. Re-persists treatment lines and prior therapy context for subgroups that match by subgroup_type + subgroup_value against llm_data['treatment_lines']['subgroups']. New subgroups (the ones previously dropped) will get null treatment context because treatment_context_extraction never ran on them.

  3. extract_treatment_lines --publication_ids <ids> — re-runs TreatmentContextExtraction on the new subgroups. Reads existing trial_subgroups by ID and writes treatment lines, min/max/median prior lines, and prior therapies. ~$20 with gpt-5-mini. Note: extract_treatment_lines scope (line 294) filters to llm_data->'treatment_lines' IS NULL — but since post_process writes llm_data['treatment_lines'] on the publication (not null), we need to either pass --publication_ids to bypass the scope or temporarily null out the field. Alternatively, since post_process matched existing subgroups correctly, only the new subgroups lack treatment context. A targeted approach: after step 2, query for the newly created trial_subgroups that have null treatment_lines and run treatment context extraction on just those publications.

  4. Disease workflow steps — re-run for these pubs:

    • adjudicate_subgroup_diseases — re-adjudicate new non-disease subgroups
    • populate_disease_terms_for_trial_subgroups + post_process_disease_matches — re-populate trial_subgroups.disease_id

Steps that do NOT need re-running: extract_subgroups (input is already correct), extract_interventions, link_publication_drugs, tag_investigational_interventions, extract_dose_evidence, therapeutic_area_filter — all write to llm_data on the publication or to publication_interventions, not to trial_subgroups.

Full downstream chain: Since post_process_publications destroys and recreates trial_subgroups, trial_endpoints, trial_outcome_measures, adverse_events, and trial_disease_details, all downstream steps need to re-run: extract_treatment_lines, standardize_adverse_events, classify_adverse_events, llm_classify_publication_endpoints_domains, llm_match_publication_endpoints, plus the publication_disease_workflow for disease_id. The simplest approach is to re-run the full publications_workflow from classify_publications onward, then the publication_disease_workflow.

Estimated cost: ~$40 for classify_publications re-extraction with o4-mini + ~$20 for extract_treatment_lines with gpt-5-mini + minor costs for other LLM steps.

Implemented 2026-03-16. Three-part fix:

1. Prompt hardening (task.rb): Added explicit instruction to the classify_publications system prompt:

IMPORTANT: You MUST create a subgroup_outcome_measures entry for EVERY subgroup in the
provided list that has associated endpoints. Do not skip subgroups even if they have fewer
results than others. If a subgroup has only one endpoint value, still create the entry.
Every subgroup provided to you was identified because the abstract contains results for it.

2. Schema enforcement (details.rb): Added minItems: distinct_subgroups.length to the subgroup_outcome_measures array in the structured output JSON schema. This prevents the LLM from producing fewer entries than there are identified subgroups.

3. Post-extraction validation logging (task.rb): After each publication is persisted, compares the set of subgroups from extract_subgroups against the set produced by classify_publications. Logs a warning if any subgroups were dropped.
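
The part-3 validation amounts to a set difference between identified and produced subgroup labels — a sketch under the assumption that both are arrays of strings (helper name is hypothetical):

```ruby
require 'logger'

# Warn when classify_publications produced fewer subgroups than
# extract_subgroups identified; returns the dropped labels.
def log_dropped_subgroups(publication_id, identified, produced, logger: Logger.new($stdout))
  dropped = identified - produced
  logger.warn("pub #{publication_id}: dropped subgroups #{dropped.inspect}") if dropped.any?
  dropped
end
```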

Local test results: 6/6 publications with previously dropped subgroups now have all subgroups populated after re-extraction:

  • Pub 47147 (sigvotatug vedotin): 1L HNSCC subgroup with cORR=37.5% now extracted (previously dropped)
  • Pubs 51804, 53951, 56337, 60242, 144841: all dropped subgroups recovered

One-off re-extraction task: lib/tasks/one_off/reextract_dropped_subgroups.thor identifies the ~9,700 affected publications and creates OneOffJob records for the re-extraction. After classify_publications completes, re-run the full publications_workflow from post_process_publications onward, then publication_disease_workflow.

Production re-extraction completed 2026-03-21. Full pipeline re-run (extract_subgroups → classify_publications → post_process) executed across all affected publications. Issue is now closed.

11. Recently ingested publications have empty endpoint extractions — Closed: not an issue

Initially suspected that recently ingested publications (ASCO 2025, ESMO 2025) had llm_data['subgroup_outcome_measures'] with subgroup entries but empty outcome_measures: [] arrays, suggesting extraction failures.

Systematic analysis of all 102 publications with subgroup_outcome_measures containing only empty outcome_measures arrays:

| Category | Count | Genuine extraction failure? |
| --- | --- | --- |
| Trial Design/Enrollment (no results in abstract) | 61 | No — correct behavior |
| Safety/AE-focused publications (no efficacy endpoints) | ~15 | No — correct behavior |
| Biomarker/correlative science (no clinical endpoints) | ~8 | No — correct behavior |
| Truncated abstracts (data in figure/table not captured in text) | ~2 | Data availability limit, not bug |
| Mistagged pubs (tagged “Interim Result” but actually TDE) | ~7 | No — tagging wrong, extraction correct |
| Genuinely missed efficacy data | 0 | — |

Regex scan for standard efficacy keywords (ORR, mPFS, mOS, HR with numeric values) across all 41 non-TDE pubs found ~7 with keyword matches, but manual inspection confirmed all were false positives:

  • Pub 53912: pCR prediction AUROC values, not clinical endpoints
  • Pub 63720 (BNT327/PM8002): abstract text truncated — Results section jumps from enrollment stats to Conclusions, efficacy data was in an embedded figure not captured in text
  • Remainder: biomarker studies where HR/ORR appears in passing context, not as reported results

The worksheet rows that couldn’t be matched (MICVO ORR=46%, sigvotatug HNSCC cORR=37.5%, cofetuzumab sqNSCLC ORR=12.5%) were caused by:

  • MICVO ORR=46%: data from a Nov 2025 corporate presentation not in our publication corpus
  • Sigvotatug HNSCC cORR=37.5%: Issue 10 — data was in the abstract but the subgroup was dropped by classify_publications (now fixed)
  • Cofetuzumab sqNSCLC ORR=12.5%: data not in the abstract at all — squamous-specific results were in the poster/supplementary material

Closed — not an issue. The empty outcome_measures are correct in all 102 cases. The original concern was caused by confounding with Issue 10 (subgroup drops) and data availability limitations (corporate presentations, poster-only data).

12. Legacy Emerging Clinical Data query collapses subgroup-level results into Overall-preferred rows

Legacy Tpp::EmergingClinicalDataQuery groups all view rows by [publication_id, disease_id, effective_line, study_plan_arm_id] and then picks the “Overall” subgroup when extracting efficacy metrics. This means dose-level cohorts, biomarker-stratified subgroups, and other clinically meaningful splits are hidden behind the Overall population row — even when the data is correctly extracted and present in vw_publication_efficacy_data.

Status note: No further work planned for now. Subgroup-preserving behavior is available via Tpp::ClinicalEvidenceQuery, which is the current client-facing path for this use case. The remaining collapse behavior exists only on the legacy EmergingClinicalDataQuery path.

app/queries/tpp/emerging_clinical_data_query.rb:

  • build_result_rows (line 913): groups by [pub_id, disease_id, effective_line, study_plan_arm_id]
  • extract_efficacy_metrics (line 1057): overall_rows = matching_rows.select { |r| r['subgroup_value'] == 'Overall' } — prefers Overall when present
  • All subgroups with the same disease_id (including via trial_disease_details fallback) collapse into a single output row
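
The collapse reduces to a few lines — a simplified sketch of the selection behavior described above, assuming view rows are hashes with string keys:

```ruby
# Prefer the 'Overall' subgroup row; otherwise fall back to the row with
# the largest N — exactly the behavior that hides HPV-neg / TPS<1 splits.
def pick_efficacy_row(matching_rows)
  overall = matching_rows.select { |r| r['subgroup_value'] == 'Overall' }
  return overall.first if overall.any?
  matching_rows.max_by { |r| r['number_of_participants'].to_i }
end
```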

Example 1: Ficlatuzumab HPV-negative subgroup (pub 43175)

Publication: “Randomized Phase II Trial of Ficlatuzumab With or Without Cetuximab in Pan-Refractory HNSCC” (NCT03422536)

Correctly extracted subgroups:

  • Overall → ORR=19%, PFS=3.7, N=32
  • Overall → HPV-negative → ORR=38%, PFS=4.1, N=16
  • Overall → HPV-negative → cMet overexpression → PFS data
  • Overall → HPV-positive → ORR=0%, PFS=2.3, N=16

All four subgroups are in vw_publication_efficacy_data with subgroup_disease_id = NULL. The fallback to trial_disease_details.disease_id = 6200 (HNSCC) gives them all the same disease_id. So they all land in the same group key [43175, 6200, 3L, nil].

extract_efficacy_metrics then picks subgroup_value = 'Overall' (ORR=19%), discarding the HPV-negative result (ORR=38%) that the worksheet expects.

Worksheet says: Ficlatuzumab HPV-neg N=16 ORR=38%
Query returns: Ficlatuzumab Overall N=32 ORR=19%

Example 2: PF-08046054 dose-level cohorts (pub 65346, ESMO 2024)

The ESMO 2024 abstract for this solid-tumor basket trial extracted a single subgroup: PDL1-expressing solid tumors with N=55 ORR=27.3%. The sheet expects HNSCC-specific dose-level splits (N=19 at 1.5mg/kg ORR=10.5%, N=7 at 1.75mg/kg ORR=42.9%).

This is a compound issue:

  1. The abstract itself is a cross-tumor overview — HNSCC-specific dose-level data was in the poster/slides, not the abstract text (data availability)
  2. Even if separate subgroups existed, the query would collapse them into one row

Example 3: IBI363 TPS<1 squamous subgroup (pub 139344 / 237445, ASCO 2025)

The sqNSCLC worksheet keeps two rows for the same IBI363 abstract:

  • SqNSCLC 3 mg/kg Q3W: ORR = 43.3%, mPFS = 7.3, N = 30
  • SqNSCLC with TPS <1: ORR = 45.5%, N = 22

Both rows are correctly present in vw_publication_efficacy_data:

  • Advanced NSCLC → Squamous cell carcinoma → 3 mg/kg Q3W: ORR = 43.3, PFS = 7.3, N = 30
  • Advanced NSCLC → TPS <1 → Squamous cell carcinoma: ORR = 45.5, N = 22

But EmergingClinicalDataQuery groups both into the same key:

  • abstract copy: [139344, 4174, 1, nil]
  • presentation copy: [237445, 4174, 1, nil]

There is no subgroup_value = 'Overall', so extract_efficacy_metrics falls back to max_by(number_of_participants) and picks the 30-patient row. The TPS <1 row is hidden even though it is already structured and disease-linked.

Worksheet says: IBI363 TPS <1 SqNSCLC ORR = 45.5%, N = 22
Query returns: IBI363 SqNSCLC ORR = 43.3%, N = 30

The query was designed for one-row-per-publication summary display, not for subgroup-level comparisons. The “prefer Overall” logic (line 1057) is intentional — it prevents small subgroup analyses from overriding the main population result in summary tables. But for worksheet reconstruction, the subgroup-level detail IS the desired output.

Difficult to quantify precisely, but any publication with biomarker-stratified results (HPV+/-, PD-L1 CPS levels, mutation status) or dose-level cohorts will lose the subgroup-level detail. This affects basket trials and biomarker-enriched studies disproportionately.

From the HNSCC sheet comparison:

  • Ficlatuzumab HPV-neg (N=16 ORR=38%) — data present, hidden by Overall preference
  • PF-08046054 dose-levels (N=19, N=7) — data not in abstract, but would be hidden even if extracted
  • Becotatug vedotin 2.3mg/kg (N=32 ORR=43%) — data IS extracted (pub 71438 subgroup 2.3 mg/kg → 2/3-line prior platinum & PD-1/L1 inhibitor failure has 4 outcome measures) but invisible due to Issue 15 (disease mapping)
  • IBI363 TPS <1 SqNSCLC (N=22 ORR=45.5%) — data present, hidden behind the larger SqNSCLC 3 mg/kg row (N=30 ORR=43.3%)

This is not an extraction failure. The LLM correctly identifies and extracts subgroup-level data. The data exists in trial_subgroups, trial_outcome_measures, and vw_publication_efficacy_data. The loss happens at query time in the Ruby layer.

No additional implementation is planned at this time. The legacy EmergingClinicalDataQuery behavior remains documented below for reference, but this issue is currently superseded by ClinicalEvidenceQuery, which already preserves subgroup-level rows and surfaces cORR.

Two possible approaches:

1. Subgroup-aware grouping: Change build_result_rows to group by [pub_id, disease_id, effective_line, study_plan_arm_id, subgroup_value] instead of collapsing subgroups. This would produce multiple rows per publication — one for Overall, one for HPV-neg, one for each dose level. Downstream consumers (the TPP React component) would need to handle multiple rows per publication.

2. Subgroup expansion mode: Add an optional parameter (e.g. expand_subgroups: true) that preserves subgroup-level rows when set. Default behavior stays unchanged for summary display, but worksheet reconstruction can request the expanded view.

Option 2 would be the lower-risk approach if the legacy Emerging Clinical Data report needs to be revived without adopting ClinicalEvidenceQuery.
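
Option 2 could be little more than a change to the grouping key — a sketch assuming build_result_rows groups hash rows, with expand_subgroups as the hypothetical parameter name from the approach above:

```ruby
# Default key collapses subgroups into one row per publication/disease/line;
# expand_subgroups: true keeps one row per subgroup_value instead.
def group_key(row, expand_subgroups: false)
  key = row.values_at('publication_id', 'disease_id', 'effective_line', 'study_plan_arm_id')
  key << row['subgroup_value'] if expand_subgroups
  key
end
```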

3. Confirmed ORR (cORR) not surfaced as a separate column

The worksheet has separate columns for ORR and Confirmed ORR (cORR). Our query only exports ORR. The data IS in the database — but it’s not distinguishable at query time.

Current state:

  • The endpoints catalog has no cORR entry — only ORR (ids 10, 64)
  • The EndpointMatcher maps all confirmed ORR extractions to the catalog ORR endpoint
  • When the LLM extracts “cORR” or “Confirmed Objective Response Rate”, it becomes a regular ORR row with “confirmed” noted in the trial_outcome_measures.observation text
  • 2,377 ORR rows have “confirmed” in their observation text
  • Only 7 rows in the entire DB have an explicit cORR/Confirmed ORR abbreviation on trial_endpoints
  • When an abstract reports both ORR and cORR (e.g. pub 47147: ORR=57%, cORR=29%), both are extracted as separate ORR rows — but the query picks one

Explored approach — adding cORR as a separate catalog endpoint: Not recommended. Confirmed ORR is not a different clinical endpoint — it’s the same ORR with confirmation scans. Splitting the catalog would create ambiguity in the matching step (should “ORR 35%” map to ORR or cORR?) and wouldn’t help for the 2,377 rows that already have “confirmed” buried in observation text.

Recommended approach — structured confirmed boolean on outcome measures:

Add a confirmed boolean field to the outcome measure schema in classify_publications. The LLM already knows whether a response is confirmed (it writes “confirmed” in the observation) — we should ask it to put that in a proper field rather than relying on substring/regex matching at query time.

The field would sit on:

  1. The outcome measure in llm_data['subgroup_outcome_measures'][].outcome_measures[] — set by the LLM during classify_publications
  2. trial_outcome_measures — persisted by post_process_publications

Then EmergingClinicalDataQuery can pull ORR rows where confirmed = true for the cORR column and confirmed = false/null for the regular ORR column.

Implementation steps:

  1. Add confirmed boolean to the outcome measure JSON schema in details.rb
  2. Add prompt instruction to task.rb: “Set confirmed: true when the response has been confirmed by follow-up assessment (e.g. cORR, confirmed CR/PR). Set confirmed: false or omit when unconfirmed or not stated.”
  3. Add confirmed column to trial_outcome_measures (migration)
  4. Persist the field in post_process.rb
  5. Expose in vw_publication_efficacy_data
  6. Use in EmergingClinicalDataQuery to populate a separate cORR column
  7. Backfill: re-run classify_publications on affected pubs, or run a lightweight AdverseEventGradeBackfill-style task that re-classifies existing ORR rows using the observation text

Solution applied (2026-03-18):

  1. Migration: Added confirmed boolean column to trial_outcome_measures (nullable, no default)
  2. Schema: Added confirmed attribute to Outcome StoreModel class in details.rb with description guiding the LLM
  3. Prompt: Added “Confirmed Response” instruction to task.rb system prompt — confirmed: true for cORR/confirmed CR/PR, false for unconfirmed, null when not stated
  4. Persistence: Added confirmed: om['confirmed'] to post_process.rb trial_outcome_measures.create! call
  5. View: Created vw_publication_efficacy_data_v11.sql exposing tom.confirmed column
  6. Backfill: Created lib/tasks/one_off/backfill_confirmed_orr.thor — rule-based detection from observation text and endpoint name (no LLM cost). Results:
    • 3,061 rows updated (2,722 confirmed=true, 339 confirmed=false)
    • 62,207 rows left as null (no signal in text)
    • 2,076 publications had llm_data synced
    • View refreshed: 4,332 confirmed rows, 517 unconfirmed rows visible
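The rule-based detection can be sketched as a small three-way classifier over the observation text. This is a hypothetical reconstruction for illustration — the actual patterns live in backfill_confirmed_orr.thor and may differ:

```ruby
# Hypothetical sketch of the rule-based detection in backfill_confirmed_orr.thor:
# derive confirmed true/false/nil from observation text plus endpoint name.
# The real task's patterns may differ; this shows only the three-way outcome.
def confirmed_signal(observation, endpoint_name = nil)
  text = [observation, endpoint_name].compact.join(' ').downcase
  return false if text.match?(/\bunconfirmed\b/)        # explicit "unconfirmed"
  return true  if text.match?(/\bcorr\b|\bconfirmed\b/) # cORR / "confirmed ..."
  nil # no signal in text -> leave confirmed as null
end
```

The `nil` branch matters: most rows (62,207 in the backfill) have no signal and must stay null rather than defaulting to false.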

Verified on pub 47147 (sigvotatug vedotin):

  • HNSCC cORR=37.5% → confirmed=true
  • NSCLC ORR=57% → confirmed=null
  • NSCLC cORR=29% → confirmed=true

Going forward: New publications processed via classify_publications will have the confirmed field set by the LLM during extraction. The legacy EmergingClinicalDataQuery can now filter by confirmed = true for a cORR column, but no further query-layer work is planned here because subgroup-preserving behavior is already available in ClinicalEvidenceQuery.

Problem: Dose evidence was stored at publication_interventions level (one record per publication+drug), not per subgroup. When a publication reports multiple dose cohorts (e.g. Becotatug 2.0 mg/kg vs 2.3 mg/kg), efficacy is split into separate subgroups but they all share the same publication-wide dose_min/dose_max. ~17K publications with dose evidence have subgroups that could carry dose context.

Solution applied:

  1. Migration: Added 6 dose columns to trial_subgroups: dose_value, dose_min, dose_max, rp2d, dose_units, dose_frequency (all nullable strings)
  2. Schema: Added dose attributes to SubgroupOutcome class in details.rb — numeric values only, units separate in dose_units
  3. Prompt: Added “Subgroup Dose Context” instruction to task.rb system prompt — extract dose into subgroup fields for dose cohorts, leave null for non-dose subgroups
  4. Persistence: Added dose field mapping in post_process.rb trial_subgroups.create!
  5. View: Created vw_publication_efficacy_data_v12.sql — COALESCEs subgroup-level dose over publication-level dose: COALESCE(ts.dose_min, pdl.pub_dose_min) AS dose_min, etc. Also surfaces single_dose column via COALESCE(ts.dose_value, pdl.pub_single_dose)
  6. Backfill: Created lib/tasks/one_off/backfill_subgroup_dose.thor — sends all subgroups for publications with dose_evidence to gpt-5-mini, LLM determines which are dose-specific

Scope: 17,170 publications, 50,403 subgroups. Estimated cost ~$15 with gpt-5-mini batched.
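The v12 precedence rule is simple: for each dose field, the subgroup-level value wins and the publication-wide value is the fallback. A minimal Ruby equivalent of the view's COALESCE (field names from the migration; row shapes illustrative):

```ruby
# Ruby equivalent of the v12 view's COALESCE precedence: subgroup-level
# dose fields override the publication-wide fallback, field by field.
DOSE_FIELDS = %i[dose_value dose_min dose_max rp2d dose_units dose_frequency].freeze

def effective_dose(subgroup, publication)
  DOSE_FIELDS.to_h { |f| [f, subgroup[f] || publication[f]] }
end
```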

Key design decisions:

  • Dose value fields are numeric-only (e.g. "2.3") with units in separate dose_units field (e.g. "mg/kg"). Initial run had 45/47 values with units leaked into numeric fields; fixed by making schema descriptions explicit (“WITHOUT units”)
  • Backfill scope is all publications with dose_evidence on publication_interventions, not regex-filtered by subgroup name. Earlier regex approach (mg|mg/kg|...) missed Gy, IU, U/kg, cell therapy doses (×10^N), DLT/MTD keywords, and schedule-only cohorts (QD/BID)
  • The LLM correctly nulls non-dose subgroups (disease cohorts, biomarker subgroups, “Overall”) even when they’re sent in the same prompt
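The units-leak fix amounts to enforcing a split between the numeric part and the units part. A minimal sketch of that convention (helper name hypothetical; the real enforcement is in the schema descriptions read by the LLM):

```ruby
# Hypothetical helper illustrating the "numeric value WITHOUT units"
# convention: split a raw dose string so units never leak into dose_value.
def split_dose(raw)
  m = raw.to_s.strip.match(/\A(\d+(?:\.\d+)?)\s*(.*)\z/)
  return [nil, nil] unless m
  units = m[2].strip
  [m[1], units.empty? ? nil : units]
end
```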

Prod deployment:

  1. Run migrations (add columns + update view to v12)
  2. Run backfill: thor one_off:backfill_subgroup_dose:backfill --batched
  3. Refresh materialized view

Going forward: New publications processed via classify_publications → post_process will automatically populate subgroup dose fields. No additional work is planned on the legacy Emerging Clinical Data path; ClinicalEvidenceQuery is the subgroup-preserving query for current use.

13. Technology filter excludes combination partner drugs


EmergingClinicalDataQuery filters vw_publication_efficacy_data rows by technology_id, which removes view rows for combination partner drugs that have a different technology than the investigational drug. This means extract_combination_partners_from_rows (which works from the filtered view rows) cannot see the combo partner, so the combination_partners field is blank even when the partner is correctly recorded in publication_interventions.

app/queries/tpp/emerging_clinical_data_query.rb:

  • build_base_query (line 501): AND v.technology_id = ANY(ARRAY[:technology_ids]::integer[]) filters ALL view rows by technology
  • extract_combination_partners_from_rows (line 1495): scans the filtered rows for investigational_component = false — but those rows were already removed by the technology filter
  • The older fetch_combination_partners method (line 1560) queries publication_interventions directly and would work, but it’s not used by build_single_row; extract_combination_partners_from_rows is used instead

Example 1: Amivantamab + Paclitaxel (pub 114606, ESMO 2025)


publication_interventions correctly records:

  • Amivantamab: drug_id=10180, intervention_role='investigational', technology = Bispecific Antibody (235)
  • Paclitaxel: drug_id=10109, intervention_role='supportive', technology = (chemotherapy/small molecule)

When the query runs with technology_id = 235 (Bispecific Antibody):

  • View rows for Amivantamab pass the filter (technology_id = 235) ✓
  • View rows for Paclitaxel are filtered OUT (different technology) ✗
  • extract_combination_partners_from_rows sees only Amivantamab rows → combination_partners = nil

Worksheet says: Combination Partner = “Paclitaxel”
Query returns: combination_partners = nil

Example 2: Petosemtamab + Pembrolizumab (pub 30362/209252)


publication_interventions correctly records Pembrolizumab as intervention_role='supportive'. When querying with technology_id = 235 (Bispecific Antibody), Pembrolizumab (Monoclonal Antibody, technology 230) is filtered out.

Note: even when running a separate query with technology_id = 230, Pembrolizumab rows would appear but Petosemtamab rows would be filtered out — so the combination context is lost in both directions.

The technology filter is applied to view rows before drug role analysis. The filter is correct for identifying the investigational drug’s technology, but it eliminates combo partner rows that inherently have a different technology. This is a fundamental design tension: the technology filter serves to scope results to a technology of interest, but combination therapy inherently crosses technology boundaries.

Affects any publication where the investigational drug and combination partner have different technologies. Common patterns:

  • ADC + checkpoint inhibitor (e.g. sigvotatug + pembrolizumab)
  • BsAb + chemotherapy (e.g. amivantamab + paclitaxel)
  • BsAb + checkpoint inhibitor (e.g. petosemtamab + pembrolizumab)

These are increasingly common in oncology clinical trials.

Additionally, the Amivantamab + Pembrolizumab 1L row from the MHNCS Feb 2026 conference is missing entirely — this publication does not exist in our database. The “Multidisciplinary Head and Neck Cancers Symposium” is not an ingested source. This is a data availability gap, not an extraction or query issue.

Option 1: Fall back to publication_interventions for combo partners. Instead of relying on filtered view rows, use the existing fetch_combination_partners method (line 1560) which queries publication_interventions directly. This method already exists and handles both publication-based and trial-based combo partner lookup. Change build_single_row to call fetch_combination_partners instead of extract_combination_partners_from_rows.

Option 2: Remove technology filter from combo partner extraction. Run a secondary unfiltered query for publication_interventions where investigational_component = false for the matched publication_ids.

Option 1 is simplest — the method already exists, just needs to be wired in.

Implemented 2026-03-18. Two changes in app/queries/tpp/emerging_clinical_data_query.rb:

  1. Fixed fetch_combination_partners SQL bug (line 1567): Changed pi.publication_id to pi.source_id — the column was renamed during the polymorphize migration but the SQL was never updated, so this method silently failed for all publications.

  2. Switched build_single_row to use fetch_combination_partners (line 951): Replaced extract_combination_partners_from_rows(rows) with fetch_combination_partners(publication_id, clinical_trial_id, primary_drug_id, primary_drug_name). This queries publication_interventions directly, bypassing the technology_id filter on the view. Falls back to extract_combination_partners_from_rows for non-publication rows.
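The effect of the switch can be modeled without the SQL: fetch_combination_partners works from publication_interventions rows, so the view's technology_id filter never touches the partner. A simplified stand-in (row shape illustrative, not the real method body):

```ruby
# Simplified model of fetch_combination_partners: read partners straight
# from publication_interventions rows, so the view's technology_id filter
# cannot drop them. Row shape is illustrative.
def combination_partners(interventions, primary_drug_id)
  interventions
    .reject { |pi| pi[:drug_id] == primary_drug_id }
    .select { |pi| pi[:intervention_role] == 'supportive' }
    .map    { |pi| pi[:drug_name] }
end
```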

Verified:

  • Amivantamab + Paclitaxel (pub 114606): now shows combo=Paclitaxel (was blank)
  • Petosemtamab + Pembrolizumab (pub 30362): now shows combo=Pembrolizumab (was blank)
  • Monotherapy publications: correctly show no combo partner

14. Basket trial disease subgroups not extracted for minority cohorts


BNT324/DB-1311 (NCT05914116) is a solid-tumor basket trial. The ESMO abstract (pub 64328) reports results for 77 evaluable patients across multiple tumor types but only names SCLC (ORR=45.5%, n=33), CRPC (3 PRs), NSCLC (3 PRs), and BTC (1 PR) explicitly. HNSCC is never mentioned in the abstract text. The client sheet lists HNSCC N=3 ORR=100% from this trial — this data was in the poster/presentation, not the abstract.

Publication corpus: 5 publications linked to NCT05914116:

| Pub ID | Source | Disease focus | HNSCC mentioned? |
| --- | --- | --- | --- |
| 64328 | ESMO | Broad solid tumors (SCLC emphasis) | No |
| 137185 | ASCO | CRPC | No |
| 190691 | ESMO | Cervical cancer / ovarian | No |
| 236643 | ASCO | CRPC | No |
| 241480 | ASCO | mCRPC + Lu-177 analysis | No |

Abstract text analysis (pub 64328):

The abstract mentions PRs by tumor type: “In pts with SCLC (n=33), unconfirmed ORR was 45.5%… PRs were also observed in 3 pts with CRPC, 3 pts with NSCLC and 1 pt with BTC.” HNSCC is not in this list. The HNSCC N=3 data likely appeared in the ESMO Asia poster/supplementary materials.

Database state:

  • trial_subgroups for this trial with disease_id = 6200 (HNSCC): 2 records, both source_type = 'News' / 'NewsTrialMention' — NOT from publication extraction
  • publication_interventions: BNT324 (drug_id=12964) correctly linked with technology_id = 708 (ADC) ✓
  • No publication-sourced trial_subgroups have disease_id = 6200 for this trial

Data availability limitation. The LLM extraction is correct — it cannot extract HNSCC data that isn’t in the abstract text. The HNSCC results for this basket trial were only available in the poster/presentation at ESMO Asia 2024, which is not captured in our abstract corpus.

This is a common pattern for basket trials: the main abstract reports overall + top-responding tumor types, while per-tumor breakdowns for minority cohorts appear only in the poster, supplementary slides, or corporate presentations.

  1. Full poster/presentation ingestion — if ESMO Asia poster PDFs were ingested and processed, the per-tumor-type data would be extractable
  2. Corporate presentation ingestion — the sheet source “ESMO Asia 2024” may reference a BioNTech R&D day presentation rather than an abstract
  3. News-sourced subgroup promotion — the HNSCC subgroups exist from News/NewsTrialMention sources; these could potentially be surfaced alongside publication data, but this would require view/query changes to accept non-publication sources

This pattern affects any basket trial where minority cohort data is only in supplementary materials. Likely affects dozens of phase 1 solid-tumor basket trials in the database.

15. Disease extraction favors subtype matches over parent disease, losing the umbrella disease


The disease_extraction.rb matching logic tries subtype-level matches first, and if they succeed, skips the parent disease-level match entirely (early return on line 219). For pub 71438 (Becotatug vedotin, ESMO), the LLM correctly extracted name = "squamous cell carcinoma of the head and neck" with subtypes ["oral cavity", "oropharynx", "hypopharynx", "larynx"]. The subtype combos matched to Oropharyngeal Cancer (5040), Hypopharyngeal Cancer (5031), etc. via TermMatch. Because those subtype matches succeeded, the disease-name-level match to HNSCC (6200) was never attempted. The publication ends up with 4 trial_disease_details rows for sub-site cancers but none for HNSCC itself.

app/tasks/publications_llm_classification/disease_extraction.rb:

  • build_match_set (line 207): Takes a disease name and subtype values
  • Lines 212-216: For each subtype, builds combo "squamous cell carcinoma of the head and neck - oral cavity" and looks up TermMatch with field = 'disease_subtypes'
  • Line 219: return matches if matches.any? — if ANY subtype matched, skip the disease-name match entirely
  • Lines 221-223: Only reached if no subtype matches — looks up "squamous cell carcinoma of the head and neck" as disease_name, which resolves to HNSCC (6200)

Then in post_process.rb:

  • Lines 401-436: Iterates over processed diseases, uses matched_disease.matched_disease_id to find the Disease record
  • Creates one trial_disease_details row per entry — since there are 4 subtype-matched entries (not the parent), 4 sub-site disease rows are created

Step 1 — LLM extraction (extract_diseases):

The LLM correctly extracted ONE disease:

{
  "name": { "value": "squamous cell carcinoma of the head and neck" },
  "subtypes": [
    { "value": "oral cavity" },
    { "value": "oropharynx" },
    { "value": "hypopharynx" },
    { "value": "larynx" }
  ]
}

Step 2 — Disease matching (disease_extraction.rb):

build_match_set receives disease_name = "squamous cell carcinoma of the head and neck", subtype_values = ["oral cavity", "oropharynx", "hypopharynx", "larynx"].

For each subtype, it builds a combo and finds a TermMatch:

| Combo term | TermMatch ID | Matched disease | Confidence |
| --- | --- | --- | --- |
| squamous cell carcinoma of the head and neck - oral cavity | 51046 | Lip and Oral Cavity Cancer (5047) | 0.925 |
| squamous cell carcinoma of the head and neck - oropharynx | 50748 | Oropharyngeal Cancer (5040) | 0.9 |
| squamous cell carcinoma of the head and neck - hypopharynx | 50744 | Hypopharyngeal Cancer (5031) | 0.975 |
| squamous cell carcinoma of the head and neck - larynx | 50745 | Laryngeal Cancer (5023) | 0.9 |

All 4 subtype matches succeed → line 219 early return → disease-name match to HNSCC (6200) never runs.

The ONE input disease entry is split into 4 output entries, each with a subtype-matched disease and matched_disease.matched_disease_id pointing to the sub-site cancer (not HNSCC).

Step 3 — Post-processing (post_process.rb):

The 4 processed disease entries become 4 trial_disease_details rows:

| TDD ID | disease_id | disease_name | subtypes |
| --- | --- | --- | --- |
| 94126 | 5047 | Lip and Oral Cavity Cancer | ["oral cavity"] |
| 94127 | 5040 | Oropharyngeal Cancer | ["oropharynx"] |
| 94128 | 5031 | Hypopharyngeal Cancer | ["hypopharynx"] |
| 94129 | 5023 | Laryngeal Cancer | ["larynx"] |

HNSCC (6200) is nowhere in trial_disease_details for this publication.

Step 4 — Query (EmergingClinicalDataQuery):

The query uses Disease.subtree_for([6200]) which returns only [6200] (HNSCC has no descendants, all_descendants = []). None of the sub-site diseases (5047, 5040, 5031, 5023) are in this set. The publication is invisible.
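A minimal model of the invisibility, assuming the query keeps rows by subtree membership (IDs from this issue; the real filter is SQL-side):

```ruby
require 'set'

# Minimal model of the query-side filter: a publication row survives only
# if its disease_id is inside the requested disease subtree. With HNSCC's
# subtree being just [6200], all four sub-site rows are excluded.
def visible_rows(rows, subtree_ids)
  allowed = subtree_ids.to_set
  rows.select { |r| allowed.include?(r[:disease_id]) }
end
```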

Comparison with pub 242943 (PubMed, same trial)


Pub 242943 for the same trial (NCT04868162) has trial_disease_details.disease_id = 6200 (HNSCC directly). This works because the PubMed abstract either:

  • Did not have subtypes, so the disease-name fallback (line 221) ran and matched HNSCC
  • Or had different subtype values that didn’t match any disease_subtypes TermMatch

Why patient_population_diseases shows the correct match


The llm_data['patient_population_diseases'] for pub 71438 shows matched_disease.matched_disease_id = 6200 with confidence 1.0. But this is stale data — it was set before disease_extraction.rb re-processed the entries. The extraction step replaces the matched_disease field on each cloned entry (line 174), overwriting the original HNSCC match with the subtype-level match.

The early return on line 219 of disease_extraction.rb treats subtype matches as replacing the parent disease match, rather than supplementing it. When the LLM extracts “squamous cell carcinoma of the head and neck” with anatomical subtypes, the system should create BOTH:

  1. A parent disease record for HNSCC (6200) — so the publication is discoverable under the umbrella term
  2. Subtype records for the anatomical sub-sites — for more granular filtering

Instead, it creates ONLY the subtype records and drops the parent entirely.

Even if the subtype records were the only ones created, the publication would still be discoverable IF the sub-site diseases were descendants of HNSCC in the disease hierarchy. But they are all root-level siblings:

| Disease ID | Name | Parent | all_descendants |
| --- | --- | --- | --- |
| 6200 | Head and Neck Squamous Cell Carcinoma (HNSCC) | NULL | [] |
| 5040 | Oropharyngeal Cancer | NULL | (separate tree) |
| 5031 | Hypopharyngeal Cancer | NULL | (separate tree) |
| 5023 | Laryngeal Cancer | NULL | (separate tree) |
| 5047 | Lip and Oral Cavity Cancer | NULL | (separate tree) |

So Disease.subtree_for([6200]) returns only [6200], excluding all sub-site diseases.

Not yet quantified. Affects any publication where:

  1. The LLM extracts a disease with anatomical subtypes
  2. Those subtypes have disease_subtypes TermMatches to separate diseases
  3. The separate diseases are not descendants of the umbrella disease

This pattern is common for:

  • Head & neck cancers (HNSCC → oropharyngeal, laryngeal, hypopharyngeal, oral cavity)
  • Lung cancers (NSCLC → adenocarcinoma, squamous)
  • Potentially others with anatomical sub-site taxonomy

Option 1 (recommended): Always include parent disease match alongside subtype matches.

In disease_extraction.rb build_match_set, after collecting subtype matches, also run the disease-name match and include it in the result. Remove the early return on line 219:

# Current (line 219):
return matches if matches.any?

# Proposed: always also try the disease-name match
term_match = lookup_term_match('disease_name', disease_name)
if valid_match?(term_match)
  # Only add parent match if it resolved to a different disease than the subtypes
  parent_disease_id = term_match.final_result['id']
  subtype_disease_ids = matches.filter_map { |m| m['matched_disease_id'] }
  unless subtype_disease_ids.include?(parent_disease_id)
    matches << format_match_data(disease_name, subtype_values, term_match, matched_subtype: nil)
  end
end

This ensures HNSCC (6200) gets a trial_disease_details row alongside the sub-site rows. The deduplication check prevents creating a duplicate if the parent and subtype resolve to the same disease.

Option 2: Fix disease hierarchy. Make sub-site H&N cancers descendants of HNSCC. This is conceptually correct but clinically nuanced — not all oropharyngeal cancers are squamous cell carcinomas. Would need expert review.

Option 3: Both. Fix the extraction to always include the parent, AND fix the hierarchy for confirmed relationships. Belt and suspenders.

Implemented 2026-03-18.

1. Fixed disease_extraction.rb build_match_set: Removed the early return on line 219 that skipped the parent disease-name match when subtype matches existed. The method now always also tries the disease_name TermMatch lookup and includes it in the result set if it resolves to a different disease_id than any of the subtype matches. Added deduplication in merge_disease_matches to prevent the parent disease from being added multiple times when multiple sibling subtypes share the same parent.

2. Created backfill task lib/tasks/one_off/backfill_parent_disease_matches.thor:

  • identify — finds 1,856 publications with subtype-only disease matches
  • backfill — re-runs disease matching with the fixed logic, then destroys and recreates trial_disease_details only (does not touch subgroups, endpoints, or AEs)
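The dedup added in merge_disease_matches (item 1 above) can be sketched as follows; data shapes are illustrative, the real method works on match hashes:

```ruby
# Sketch of the merge_disease_matches dedup: append the parent disease-name
# match only when no subtype match already resolved to the same disease_id,
# so sibling subtypes sharing a parent add it exactly once.
def merge_parent_match(subtype_matches, parent_match)
  return subtype_matches if parent_match.nil?
  ids = subtype_matches.map { |m| m[:matched_disease_id] }
  return subtype_matches if ids.include?(parent_match[:matched_disease_id])
  subtype_matches + [parent_match]
end
```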

Verified on pub 71438 (Becotatug vedotin, ESMO):

  • Before: trial_disease_details had 4 sub-site diseases (5047, 5040, 5031, 5023), no HNSCC
  • After: 5 entries — 4 sub-sites + HNSCC (6200)
  • Publication now surfaces in HNSCC queries via EmergingClinicalDataQuery

Scale: 1,856 publications affected. Top disease names: breast cancer (593 entries across case variants), NSCLC (205), prostate cancer (127), lymphoma (130), mesothelioma (70), renal cell carcinoma (36), H&N SCC (29).

Pending: Production backfill of the 1,856 affected publications.

16. Confirmed ORR is not exported by EmergingClinicalDataQuery


The disease worksheet has a dedicated Confirmed ORR (cORR) column, but EmergingClinicalDataQuery only exports OS, PFS, ORR, DoR, DFS, and DCR. Even when a worksheet row distinguishes confirmed from unconfirmed response, the query output has no place to carry that metric.

This means worksheet rows can look “partially matched” because the main ORR is present while the confirmed-response column is always blank.

app/queries/tpp/emerging_clinical_data_query.rb:

  • PRIMARY_EFFICACY_ABBREVIATIONS is defined as %w[OS PFS ORR DOR DoR DFS DCR]
  • extract_efficacy_metrics iterates only that whitelist
  • the result hash has no :corr or :confirmed_orr key
  • summary_statistics, orr_ranking, and CSV export all inherit the same endpoint set

This is a reporting-layer omission. It sits after publication ingestion and after subgroup extraction.

The query hard-codes the primary efficacy endpoint set:

PRIMARY_EFFICACY_ABBREVIATIONS = %w[OS PFS ORR DOR DoR DFS DCR].freeze

Because cORR is not in that list:

  • extract_efficacy_metrics never reads confirmed-response rows even if they exist upstream
  • build_single_row never exposes a confirmed-response field
  • downstream consumers cannot distinguish:
    • unconfirmed ORR
    • confirmed ORR
    • rows where both are reported

Concrete examples from sqNSCLC sheet validation


Worksheet row:

  • ORR = 33.3%
  • cORR = 33.3%
  • N = 6

Query row:

  • ORR = 33.3%
  • no cORR field

Example 2: Ifinatamab deruxtecan (ESMO 2023)


Worksheet row:

  • ORR = 31%
  • cORR = 31%
  • mDoR = 4.1

Query row:

  • ORR = 31%
  • mDoR = 4.1
  • no cORR

Worksheet rows:

  • SqNSCLC ORR = 43.3%, cORR = 36.7%
  • SqNSCLC TPS <1 ORR = 45.5%, cORR = 36.7%

Query rows:

  • ORR = 43.3% on the main SqNSCLC row
  • no cORR
  • the TPS <1 row is additionally hidden by Issue 12
Impact:

  • The worksheet Confirmed ORR (cORR) column cannot be reconstructed from structured output
  • studies that report both ORR and cORR appear more complete than they really are because only one of the two response metrics survives
  • comparisons between abstracts that emphasize unconfirmed responses versus confirmed responses become unreliable

This is not primarily a data-availability problem.

For the sqNSCLC examples above, the worksheet values are tied to concrete conference/journal records that we already ingest or otherwise match on the main ORR metric. The missing part is the confirmed-response export path.

This is also not the same as Issue 12. Issue 12 hides subgroup rows; Issue 16 removes an entire metric family from the report shape.

In the current sqNSCLC worksheet:

  • 5 / 10 populated rows include a cORR value
  • these rows cover at least 4 distinct studies

So this is not an edge case for the worksheet format.

Add confirmed response as a first-class efficacy metric:

  1. Expand the endpoint whitelist to include the confirmed-response abbreviation actually used in the data (cORR / normalized equivalent)
  2. Store it in the row hash alongside :orr
  3. Add a Confirmed ORR column to CSV/export formatting
  4. Keep ORR and cORR separate rather than trying to merge or overwrite one with the other
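Steps 1 and 2 can be sketched against the constant named above; the metric_key helper and row shapes are hypothetical, shown only to make the "separate key, no overwrite" point concrete:

```ruby
# Sketch of steps 1-2: widen the whitelist and give cORR its own key in
# the row hash instead of colliding with :orr. metric_key is hypothetical.
PRIMARY_EFFICACY_ABBREVIATIONS = %w[OS PFS ORR cORR DOR DoR DFS DCR].freeze

def metric_key(abbr)
  abbr == 'cORR' ? :corr : abbr.downcase.to_sym
end

def row_metrics(endpoint_rows)
  endpoint_rows
    .select { |r| PRIMARY_EFFICACY_ABBREVIATIONS.include?(r[:abbreviation]) }
    .to_h   { |r| [metric_key(r[:abbreviation]), r[:value]] }
end
```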

17. ASCO abstract and presentation copies create duplicate publication rows


After broadening ASCO ingestion to include both AbstractContentItem and PresentationContentItem, the same scientific abstract can now be stored twice under different ASCO uids. EmergingClinicalDataQuery groups by publication_id, not DOI/title, so both copies surface as separate rows.

This showed up repeatedly during the sqNSCLC pass and makes the local output look larger and noisier than the sheet.

app/services/publications/asco_api_service.rb:

  • fetch_abstract_hits requests contentTypes: ['Abstract', 'Presentation']
  • save_publication persists records using Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])

app/queries/tpp/emerging_clinical_data_query.rb:

  • build_result_rows groups by publication_id, disease_id, effective_line, and study_plan_arm_id

There is no DOI-level or title-level deduplication step between ingestion and reporting.

The ASCO fix for Issue 2 intentionally broadened the search and detail query to include PresentationContentItem. That solved the “missing presentation” problem, but persistence still keys uniqueness on source_id:

publication = Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])

So if ASCO exposes both:

  • ABSTRACT492030
  • PRESENTATION251481

with the same DOI and same text, both are considered distinct publications locally.

Same DOI:

  • 10.1200/JCO.2025.43.16_suppl.8611

Stored twice:

  • publication 48035 — source_id ABSTRACT492030
  • publication 238708 — source_id PRESENTATION251481

Both produce the same sqNSCLC row (ORR = 33.3%, N = 6).

Same DOI:

  • 10.1200/JCO.2025.43.16_suppl.8509

Stored twice:

  • publication 139344 — source_id ABSTRACT500470
  • publication 237445 — source_id PRESENTATION246467

Both produce the same main sqNSCLC 3 mg/kg Q3W row.

Example 3: Additional duplicate DOI pairs in the same sqNSCLC slice

  • Datopotamab deruxtecan: 10.1200/JCO.2025.43.16_suppl.8501
  • Sacituzumab govitecan: 10.1200/JCO.2025.43.16_suppl.8599
Impact:

  • one worksheet row can correspond to two local rows
  • counts for “how many publication-backed rows do we have?” are overstated
  • manual comparison against the sheet becomes noisy
  • any future ranking or aggregation that does not dedupe by DOI/title risks double-counting conference data

This is not a disease-mapping issue and not a subgroup-extraction issue.

The data itself is usually valid in both copies. The problem is that they are the same scientific result represented twice because ASCO exposes two content-item types.

This is also not an argument to undo Issue 2 entirely. We needed PresentationContentItem support to recover records like SHR-A2102. The gap is specifically the lack of a deduplication strategy after broadening the source.

In the sqNSCLC ADC/fusion slice alone, there are 4 duplicate DOI pairs:

  • PF-08046054
  • IBI363
  • Datopotamab deruxtecan
  • Sacituzumab govitecan

So the effect is already material in a small disease/technology slice.

Two reasonable options:

1. Query/report deduplication

Keep both source records in publications, but dedupe in EmergingClinicalDataQuery or the TPP report by a stable key such as:

  • DOI + disease + subgroup/arm
  • or DOI + publication title

This is lower risk for ingestion history.
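Option 1 can be sketched as a grouping pass over the report rows, keeping the abstract copy when both content types are present. The key and the preference order are illustrative, not a committed design:

```ruby
# Sketch of report-level dedup: group rows on a stable DOI-based key and
# keep one copy per group, preferring ABSTRACT over PRESENTATION source_ids.
def dedupe_rows(rows)
  rows
    .group_by { |r| [r[:doi], r[:disease_id], r[:arm]] }
    .map { |_, dups| dups.min_by { |r| r[:source_id].to_s.start_with?('ABSTRACT') ? 0 : 1 } }
end
```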

2. Ingestion-time merge

When saving ASCO records, detect that an incoming presentation and an existing abstract share the same DOI/title/NCT tuple and merge them into one canonical Publication.

This is cleaner downstream but riskier because it changes persistence semantics for already-ingested ASCO records.

18. PubMed-indexed journal article missing from publication corpus


The current sqNSCLC worksheet row for Cofetuzumab pelidotin points to the 2025 journal article:

  • DOI: 10.1016/j.lungcan.2025.108492
  • PMID: 40086026

That article exists on PubMed and contains the sqNSCLC result the sheet uses, but there is no corresponding Publication row in the local database. As a result, the row is completely absent from EmergingClinicalDataQuery.

This drop happens before EmergingClinicalDataQuery.

During validation:

  • Publication.where(doi: '10.1016/j.lungcan.2025.108492') returned no rows
  • Publication.where(source_id: '40086026') returned no rows

So the publication never entered the local corpus, or it was dropped before persistence.

Root cause isolated.

There are two distinct PubMed ingestion limitations affecting this paper:

  • the disease-specific path depends on PubMed exposing a ClinicalTrials.gov / NCT... databank entry, and this record does not appear to expose that linking metadata even though PubMed marks it as a clinical trial
  • the broad PubMed path in Publications::PubmedApiService built one giant combined query for the oncology MeSH clause plus the recovery clause; that combined search term excluded qualifying records that PubMed returned when the intended criteria were tested separately

What was verified live for PMID 40086026:

  • PubMed resolves DOI 10.1016/j.lungcan.2025.108492 to PMID 40086026
  • the record has Clinical Trial, Phase I
  • the record has oncology MeSH including Carcinoma, Non-Small-Cell Lung and Lung Neoplasms
  • a search combining 40086026[uid] with the oncology MeSH clause, the clinical-trial publication-type clause, and the 2025 date range returned 1 hit
  • a search combining 40086026[uid] with the full previous combined search term returned 0 hits

So the missing publication was not due to missing PubMed record metadata for the broad query. It was due to our query construction.

Worksheet row: Cofetuzumab pelidotin in sqNSCLC


Worksheet entry:

  • Drug: Cofetuzumab pelidotin
  • Publication: Lung Cancer (Journal), 2025
  • Link: https://doi.org/10.1016/j.lungcan.2025.108492
  • ORR = 12.5%
  • cORR = 12.5%
  • mPFS = 5.3
  • mDoR = 2.2

Local database state:

  • no Publication row for DOI 10.1016/j.lungcan.2025.108492
  • no Publication row for PMID 40086026
  • only older cofetuzumab records exist:
    • publication 150086 — ASCO 2021
    • publication 71934 — ESMO 2023
    • publication 101600 — Clinical Cancer Research 2021

External confirmation:

  • PubMed lists the paper as “A phase 1b study of cofetuzumab pelidotin monotherapy in patients with PTK7-expressing recurrent non-small cell lung cancer” with PMID 40086026
  • the sqNSCLC worksheet still has one fully missing non-investor row even after the backfills and corrections
  • the earlier tracker note that the cofetuzumab sqNSCLC value was poster-only is now stale for the current worksheet version
  • the publication will remain absent until a 2025 PubMed run without --disease-specific is executed against the fixed query logic
  • the --disease-specific path alone is still insufficient for this class of paper because PubMed does not appear to expose the ClinicalTrials.gov linking metadata we rely on

This does not contradict the earlier ESMO 2023 analysis in Issue 11.

That earlier note was about publication 71934, where the squamous-specific value was not in the 2023 abstract text. The current worksheet has since moved to a later 2025 journal article. That newer source should be representable if it is ingested.

This leaves one confirmed missing sqNSCLC worksheet row from the original worksheet discrepancy.

For 2025-01-01 through 2025-12-31, after fixing the PubMed query construction:

  • the broad oncology/malignant-heme PubMed query returns 6,013 PMIDs
  • 3,831 of those are not already in local publications
  • compared with the old Clinical Trial[pt] path, there are 435 additional PMIDs
  • 431 of those additional PMIDs are not already in local publications

So this is not just one missing-paper edge case. The broken combined query was suppressing a non-trivial number of 2025 PubMed records.

  • Publication.where(doi: '10.1016/j.lungcan.2025.108492') returned no rows before the fix
  • Publication.where(source_id: '40086026') returned no rows before the fix
  • after the PubmedApiService query change, fetch_uids_by_date('2025/01/01', '2025/12/31', nct_ids: []) includes PMID 40086026
  • live verification after the fix returned:
    • includes_pmid_40086026 = true
    • total = 6013

Open questions:

  • After the 2025 backfill, how many of the 431 incremental publications are truly result publications versus broader cancer-clinical-trial noise?
  • Do we want to keep the broad non-disease-specific PubMed run as a regular sync, or use it only as a periodic coverage backfill?

Characterize the missing publication upstream of the query, then narrow the fix to the actual failure point:

  1. Trace the PubMed/journal ingestion path for DOI 10.1016/j.lungcan.2025.108492 / PMID 40086026
  2. Compare direct PubMed criteria matches against the full generated search term
  3. Split the broad PubMed search into separate query terms and union PMIDs in Ruby instead of relying on one giant combined PubMed query

Fix implemented:

  • updated Publications::PubmedApiService so the broad PubMed path now runs separate search terms for:
    • oncology/malignant-heme MeSH + clinical-trial publication types
    • oncology/malignant-heme MeSH + recovery result terms for the recent recovery window
  • changed PubMed UID fetching to execute each term separately and union the PMIDs in Ruby
  • aligned total-count logic with the split-query approach
  • verified live that the fixed 2025 query now includes PMID 40086026
  • syntax check passed: ruby -c app/services/publications/pubmed_api_service.rb
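A simplified sketch of the split-and-union pattern (the real fetch_uids_by_date takes start/end dates and an nct_ids keyword; the term strings, the search_pubmed helper, and the stubbed results below are invented for illustration):

```ruby
# Simplified sketch of the fix: run each broad search term as its own PubMed
# ESearch call and union the PMIDs in Ruby, instead of one combined term.
# `search_pubmed` is a hypothetical stand-in for the real ESearch request,
# stubbed here with invented results.
def search_pubmed(term, _date_range)
  {
    "mesh AND clinical-trial-types" => %w[40086026 40086027],
    "mesh AND recovery-terms"       => %w[40086027 40099999],
  }.fetch(term)
end

def fetch_uids_by_date(terms, date_range)
  # Array#| unions while preserving order and removing duplicates.
  terms.map { |term| search_pubmed(term, date_range) }.reduce([], :|)
end

p fetch_uids_by_date(["mesh AND clinical-trial-types", "mesh AND recovery-terms"],
                     "2025/01/01..2025/12/31")
```

Total-count logic then simply counts the unioned array, which is what "aligned total-count logic with the split-query approach" amounts to.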

19. Biomarker context missing at subgroup level

The worksheet slices data three ways: dose, treatment line, and biomarker. Treatment line and dose are now structured on trial_subgroups (issues 5/12), but biomarker context is not. Biomarkers are extracted at the trial_disease_details level (publication + disease scope) via disease_extraction.rb, stored in trial_disease_biomarkers. There is no link between a biomarker-type subgroup (e.g. “EGFR-mutant → ORR=45%”) and the structured biomarker record (EGFR = positive).

~13,177 subgroups have biomarker-type classifications (mutation: 11,850, biomarker: 913, molecular subtype: 118, etc.). ~94% are single-biomarker subgroups; ~6% are multi-biomarker (e.g. “EGFR/ALK-negative”, “KRAS wild-type + BRAF-mutated”).

Biomarker extraction (disease level):

  1. disease_extraction.rb → LLM extracts patient_population_diseases[].biomarkers[] from abstract
  2. post_process.rb lines 445-463 → creates trial_disease_biomarkers linked to trial_disease_details
  3. Matching via Biomarker.flexifind(biomarker_name, 'synonyms') → biomarker_id

Subgroup extraction (no biomarker logic):

  1. subgroup_extraction.rb → identifies subgroup labels (e.g. “EGFR-mutant”), classifies subgroup_type = 'mutation'
  2. classify_publications (task.rb) → extracts outcome measures per subgroup
  3. post_process.rb lines 251-260 → creates trial_subgroups with subgroup_type, subgroup_value — no biomarker fields

No biomarker usage in query/view:

  • vw_publication_efficacy_data does not join or expose biomarker data
  • EmergingClinicalDataQuery does not query trial_disease_biomarkers

trial_subgroups has no biomarker columns. Biomarker information is only available as:

  • Unstructured text in subgroup_value (e.g. “EGFR-mutant”, “PD-L1 TPS≥1%”, “TMB high”)
  • Structured records in trial_disease_biomarkers — but these are linked to trial_disease_details, not to trial_subgroups

Example 1: EGFR-mutant subgroup (pub 176313)

  • trial_subgroups: subgroup_type='mutation', subgroup_value='EGFR-mutant', biomarker_id=NULL
  • trial_disease_biomarkers: biomarker_name='EGFR', value='positive', biomarker_id=656 — attached to trial_disease_detail, no link to the subgroup

Example 2: PD-L1 TPS≥1% subgroup

  • Subgroup value: “Non-squamous NSCLC → PD-L1 TPS≥1%”
  • Needs: biomarker_id → PD-L1, biomarker_value → “≥1%”
  • Currently: only unstructured text in subgroup_value

Example 3: Multi-biomarker (6% of cases)

  • Subgroup value: “KRAS wild-type + BRAF-mutated”
  • Contains two biomarkers — single biomarker_id column would capture only one
  • 13,177 subgroups with biomarker-type subgroup_type
  • ~5,117 (39%) contain a single recognized biomarker name
  • ~811 (6%) contain multiple biomarker names
  • ~7,361 (55%) contain less common markers not in the top-40 list but still single-biomarker (e.g. AKT1, VHL, DNMT3A, EZH2)
  • Total: ~94% single biomarker per subgroup
  • Not an extraction failure — biomarkers ARE extracted, just at the wrong granularity (disease level, not subgroup level)
  • Not a matching failure — Biomarker.flexifind works well, and BiomarkerMatchingService provides advanced LLM-based matching
  • Not a view/query issue — the data simply doesn’t exist on trial_subgroups yet
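To make the multi-biomarker point concrete, here is a sketch of splitting a compound subgroup label into per-biomarker entries. The implemented pipeline uses LLM extraction, not string heuristics; the regex below is illustrative only:

```ruby
# Illustrative only: the real pipeline extracts biomarkers with an LLM, not
# regexes. This sketch just shows why a single biomarker_id column is not
# enough: one subgroup label can yield several (name, value) pairs, which is
# what a 1:N join table accommodates.
def split_biomarker_label(label)
  label.split(/\s*\+\s*/).map do |part|
    if part =~ /\A([A-Z0-9\/]+)[\s-]*(wild-type|mutated|mutant|negative|positive)\z/i
      { name: Regexp.last_match(1), value: Regexp.last_match(2).downcase }
    else
      { name: part, value: nil } # unrecognized pattern, keep raw text
    end
  end
end

p split_biomarker_label("KRAS wild-type + BRAF-mutated")
```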

Resolution: Partially addressed by subgroup tagging

Phase 1 (complete): Subgroup tagging (openspec/changes/subgroup-tagging/) added a biomarker tag to trial_subgroups.tags, solving the filtering problem — users can find biomarker subgroups via tags @> '["biomarker"]'. Tags are multi-valued (“EGFR-mutant NSCLC” gets ["biomarker", "disease"]), exposed in vw_publication_efficacy_data and admin UI.

Phase 2 (implemented): Structured biomarker link for display and matching.

What was implemented:

  1. Join table trial_subgroup_biomarkers — mirrors trial_disease_biomarkers schema:

    • trial_subgroup_id → FK to trial_subgroups (cascade delete)
    • biomarker_name → LLM-extracted name (e.g., “KRAS”)
    • value → status/value (e.g., “mutated”, “wild-type”, “TPS≥1%”)
    • numeric_value → threshold if applicable (e.g., “1” for TPS≥1%)
    • biomarker_id → FK to biomarkers (populated by BiomarkerMatchingService, not flexifind)
  2. LLM extraction — two paths:

    • Backfill: lib/tasks/one_off/backfill_subgroup_biomarkers.thor — sends abstract + biomarker-tagged subgroups to GPT-5-mini per-publication. Extracts biomarker name + value. Handles multi-biomarker (e.g., “BRCA1/2” → two entries). ~13K subgroups, ~$5-10.
    • Forward pipeline: SubgroupBiomarker schema added to SubgroupOutcome in details.rb. post_process.rb creates trial_subgroup_biomarkers records when tags.include?('biomarker').
  3. No flexifind — biomarker_id is left NULL at extraction time. All matching goes through BiomarkerMatchingService pipeline in PublicationDiseaseWorkflow:

    • populate_term_matches → creates TermMatch entries with strategy: 'BiomarkerMatching', field: 'name'
    • Deduplicates with 6,151 existing BiomarkerMatching term matches (3,186 already resolved from ParticipationCriterionBiomarker runs)
    • suggest_keywords → find_candidates (semantic) → pick_best_match → qa_best_match → judge (gpt-5) → post_process writes biomarker_id
    • Also applied to trial_disease_biomarkers — removed flexifind from post_process.rb disease biomarker creation. Same matching pipeline now handles both subgroup-level and disease-level biomarkers.
  4. Workflow steps — added to PublicationDiseaseWorkflow as two parallel branches from the first node:

    • Subgroup biomarker branch: populate_term_matches_for_subgroup_biomarkers → 6 matching steps → post_process_subgroup_biomarkers
    • Disease biomarker branch: populate_term_matches_for_disease_biomarkers → 6 matching steps → post_process_disease_biomarkers
    • Both run in parallel with existing disease/subtype matching branches.
  5. View v15 — vw_publication_efficacy_data now exposes trial_subgroup_id for query-layer joins.

  6. Query updates — ClinicalEvidenceQuery and EmergingClinicalDataQuery now COALESCE subgroup-level biomarkers over disease-level:

    LEFT JOIN trial_subgroup_biomarkers tsb ON tsb.trial_subgroup_id = v.trial_subgroup_id
    LEFT JOIN biomarkers sb ON tsb.biomarker_id = sb.id
    -- ...existing disease-level joins...
    COALESCE(tsb.biomarker_id, tdb.biomarker_id) AS biomarker_id,
    COALESCE(sb.name, tsb.biomarker_name, b.name, tdb.biomarker_name) AS biomarker_name,
    COALESCE(tsb.value, tdb.value) AS biomarker_value,

Production deployment:

# 1. Run migration (create trial_subgroup_biomarkers table + view v15) ✅
# 2. Backfill subgroup biomarker extraction ✅ (2026-03-24, gpt-5.4-mini)
# Results: 52,063 records across 44,725 subgroups (99% of 45,184 biomarker-tagged)
# 1.16 records/subgroup avg. Top markers: HER2 (3,728), PD-L1 (3,188), EGFR (1,964)
thor one_off:backfill_subgroup_biomarkers:backfill --batched --parallelism 4 --model=gpt-5.4-mini
# 3. Run PublicationDiseaseWorkflow — biomarker branches match both subgroup + disease biomarkers ✅ (2026-03-25)
# Results: 3,439 TermMatches created for TrialSubgroupBiomarker (3,191 resolved, 248 pending)
# 35,026 / 52,063 records matched to biomarker_id (67.3%)
# Unmatched breakdown: 7,038 resolved no-match (long tail), 8,005 deduped via PCB no-match, 477 PCB match not propagated, 1,517 unknown
# 4. Query layer fix: LEFT JOIN LATERAL with STRING_AGG to aggregate multi-biomarker subgroups ✅ (2026-03-25)
# Prevents row multiplication for ~5,810 multi-biomarker subgroups
# All biomarker names surface (matched or raw) via COALESCE

Design notes:

  • TermMatch field: 'name' is shared across all biomarker sources (ParticipationCriterionBiomarker, TrialSubgroupBiomarker, TrialDiseaseBiomarker) for deduplication
  • ~6% of biomarker subgroups are multi-biomarker — join table handles 1:N cleanly
  • Judge step uses gpt-5 (temperature=nil, since gpt-5 only supports default temperature)
  • Query layer uses LEFT JOIN LATERAL with STRING_AGG to aggregate multiple biomarkers per subgroup into comma-separated strings, avoiding row multiplication while preserving all biomarker names/values
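The LATERAL aggregation referenced above can be sketched as follows. Table and column names follow the descriptions in this section; the production view SQL may differ in detail:

```sql
-- Sketch: collapse N biomarker rows per subgroup into one aggregated row,
-- so multi-biomarker subgroups do not multiply rows in the report output.
-- COALESCE(b.name, tsb.biomarker_name) surfaces the matched name when
-- present, falling back to the raw LLM-extracted name.
SELECT
  v.*,
  bio.biomarker_names,
  bio.biomarker_values
FROM vw_publication_efficacy_data v
LEFT JOIN LATERAL (
  SELECT
    STRING_AGG(COALESCE(b.name, tsb.biomarker_name), ', ') AS biomarker_names,
    STRING_AGG(tsb.value, ', ')                            AS biomarker_values
  FROM trial_subgroup_biomarkers tsb
  LEFT JOIN biomarkers b ON b.id = tsb.biomarker_id
  WHERE tsb.trial_subgroup_id = v.trial_subgroup_id
) bio ON TRUE;
```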

20. study_plan_arm link is fragile and causes dose/drug/arm issues (merges Issue 3)

The vw_publication_efficacy_data materialized view depends on study_plan_arms (trial registry) for two critical functions: resolving arm roles (EXPERIMENTAL vs COMPARATOR) and resolving drug attribution (via vw_bioloupe_interventions). This dependency is the root cause of three cascading problems:

  1. Arm role failures — 62% of view rows have no study_plan_arm match and default to EXPERIMENTAL
  2. Dose evidence drop (Issue 3) — The pub_dose_lookup CTE joins on drug_id, but the view’s drug_id comes from the registry while dose evidence drug_id comes from publication_interventions. This mismatch causes 76% of extracted dose evidence (17,826 of 23,503 pubs) to silently drop.
  3. Row triplication — Multiple study_plan_arms per trial create duplicate rows in the drug_interventions CTE

The fix is to drop the study_plan_arm dependency entirely and use publication_interventions as the primary drug source, with LLM-classified arm roles replacing the registry lookup.

The study_plan_arm link flows through:

  1. publication_clinical_trials links a publication to a clinical_trial
  2. trial_arm_outcomes.study_plan_arm_id links an LLM-extracted outcome row to a registry arm
  3. study_plan_arms.arm_type provides the registry’s classification (EXPERIMENTAL, ACTIVE_COMPARATOR, PLACEBO_COMPARATOR, etc.)
  4. vw_publication_efficacy_data resolves resolved_group_type via: COALESCE(UPPER(spa.arm_type), CASE WHEN arm_type = 'experimental' THEN 'EXPERIMENTAL' ... END)
  5. The drug_interventions CTE in the view joins vw_bioloupe_interventions (trial registry drug data) to the correct arm via study_plan_arm_id

Relevant code paths:

trial_arm_outcomes.arm_type is always NULL for publication-sourced data. The LLM extraction pipeline (classify_publications) extracts arm names but does not classify arm roles. The only path to arm role classification is the study_plan_arm_id foreign key, which requires:

  1. The publication is linked to a trial (publication_clinical_trials exists)
  2. The LLM-extracted arm name was matched to a registry arm (study_plan_arm_id is set)

Both conditions frequently fail.

Coverage analysis of vw_publication_efficacy_data (total ~1.04M rows):

| Category | Row count | % of total |
|---|---|---|
| Trial + arm linked (has study_plan_arm_id) | 399,373 | 38% |
| Trial linked, no arm match | 447,912 | 43% |
| Unlinked (uses publication_interventions) | 196,723 | 19% |

So 62% of view rows have no study_plan_arm link and default to EXPERIMENTAL.

For HNSCC specifically (14,660 rows):

  • 1,463 rows (10%) have comparator identification via the arm link
  • 12,360 rows are marked EXPERIMENTAL
  • 569 rows have NULL resolved_group_type

Dose evidence impact (from Issue 3 reopened investigation, 2026-03-23)

The same study_plan_arm dependency causes the drug_interventions CTE to use registry drug_ids. The pub_dose_lookup CTE then fails to join because publication_interventions.drug_id (LLM-extracted) doesn’t match:

  • 23,503 publications with dose_evidence extracted
  • 8,764 publications with structured dose in view (37%)
  • 17,826 publications with dose evidence silently dropped (76%)
  • Breakdown of dropped:
    • ~13,600 NULL drug_id on publication_interventions (58%)
    • ~2,148 drug_id mismatch: registry vs LLM-extracted (9%)
    • ~2,078 other (pub not in view, no usable fields, etc.)

Concrete examples from CRC ADC audit (disease 4345, technology 708):

| Pub | Drug | PI drug_id | View drug_id | Dose evidence | View dose |
|---|---|---|---|---|---|
| 66516 | Zanidatamab | 10432 (antibody) | 15231 (ADC: zovodotin) | 1200 mg | NULL |
| 70960 | SHR-A1811 | NULL | 10733 (Trastuzumab rezetecan) | rp2d=6.4 mg/kg | NULL |
| 114758 | Zanidatamab | 10432 (antibody) | 15231 (ADC: zovodotin) | 1200 mg | NULL |

Dropping the study_plan_arm dependency and using publication_interventions as the primary drug source would fix this automatically — drug_id and pub_dose_lookup would use the same source.

LLM-extracted arm names that clearly indicate their role without registry lookup:

| arm_name (LLM-extracted) | resolved_group_type (from registry) | Obvious from name? |
|---|---|---|
| Cetuximab + Chemotherapy (Control) | ACTIVE_COMPARATOR | Yes — “(Control)” |
| Standard Treatment | ACTIVE_COMPARATOR | Yes — “Standard” |
| Placebo | PLACEBO_COMPARATOR | Yes — “Placebo” |
| Extreme Regimen | ACTIVE_COMPARATOR | Ambiguous — SOC regimen name |
| Experimental group | EXPERIMENTAL | Yes — “Experimental” |
| Non-Randomized Single-Arm | EXPERIMENTAL | Yes — single-arm |
| BCA101 + pembrolizumab | NO_INTERVENTION | Registry is wrong — this is clearly experimental |
| Arm B: Cetuximab/Methotrexate/Docetaxel | ACTIVE_COMPARATOR | Ambiguous — needs context |
| 1, 2, Arm I | varies | Not classifiable from name alone |

What the client worksheet actually needs from the trial link

Mapping each worksheet column against its data source:

| Sheet column | Data source | Needs trial link? | Needs study_plan_arm? |
|---|---|---|---|
| Drug | publication_interventions | No | No |
| Technology | drugs → technologies | No (via drug_id) | No |
| Target(s) | drug_target_actions | No (via drug_id) | No |
| Company | drug_ownerships | No (via drug_id) | No |
| Clinical Trial (NCT ID) | publication_clinical_trials → clinical_trials | Yes (trial ID only) | No |
| Clinical Trial Name | clinical_trials.brief_title | Yes (trial ID only) | No |
| Clinical Trial Location | locations table (country rollup) | Yes (trial ID only) | No |
| Combination Partner | publication_interventions | No | No |
| Comparator | study_plan_arms (COMPARATOR type) | Yes | Yes (current path) |
| Disease | trial_disease_details / trial_subgroups | No | No |
| Publication Date | publications | No | No |
| Data Cut Date | trial_subgroups (pub-extracted) | No | No |
| Prior Lines (min/max/median) | trial_subgroups (pub-extracted) | No | No |
| Biomarker | subgroup tags (pub-extracted) | No | No |
| Dose fields | trial_subgroups + publication_interventions | No | No |
| Efficacy (mOS, mPFS, ORR, etc.) | trial_outcome_measures / trial_arm_outcomes | No | No |
| Safety (TRAE, TEAE, etc.) | adverse_events | No | No |
| Phase (internal filter) | clinical_trials.phase | Yes (trial ID only) | No |
| Randomized (internal) | study_designs.allocation | Yes (trial ID only) | No |
| Is Basket Trial (internal) | clinical_trial_end_diseases (computed) | Yes (trial ID only) | No |

Conclusion: The study_plan_arm link is only needed for the “Comparator” column and for resolved_group_type (experimental vs comparator arm selection). All other trial-derived fields only need publication_clinical_trials.clinical_trial_id.

publication_interventions currently only exists for publications processed through the target-disease extraction pipeline (~17K publications). For the remaining ~45K linked publications, drug resolution still flows through vw_bioloupe_interventions via the trial link and arm join.

However, ClinicalEvidenceQuery is always scoped to a specific disease, which means its publications will have gone through the target-disease pipeline and will have publication_interventions. This is not a blocker for the clinical evidence report specifically.

  1. Efficacy extraction — extract_efficacy_metrics prefers EXPERIMENTAL rows. Without arm role classification, randomized trial publications would have both experimental and comparator values lumped together, and the “best” row would be picked by patient count rather than arm role.

  2. Comparator value — The query extracts comparator_value (e.g., comparator mPFS) from rows with resolved_group_type containing COMPARATOR. Without this, the comparator column and comparator efficacy values would be empty.

  3. Safety extraction — extract_safety_metrics_for_publication filters to EXPERIMENTAL arm for safety. Less critical since most single-arm studies (majority of the corpus) only have one arm anyway.

Drop study_plan_arm dependency; add LLM arm role classification.

The proposed approach has two parts:

Part 1: Classify arm roles from LLM-extracted arm names

Add an arm_role field to trial_arm_outcomes (or arm_type — currently always NULL for publication data). Populate it via one of:

Option A: LLM classification during classify_publications — Add arm role to the extraction schema so the LLM outputs "arm_role": "experimental" or "arm_role": "comparator" alongside the arm name. This is the most reliable since the LLM has the full abstract context and knows which drug is investigational.

Option B: Post-hoc heuristic — Pattern match on arm names: keywords like “control”, “placebo”, “standard of care”, “SOC”, “comparator” → COMPARATOR; “experimental”, “investigational”, “study drug”, “treatment” → EXPERIMENTAL. This catches ~70% of cases but fails on regimen names like “Extreme Regimen” (HNSCC SOC) or numbered arms like “Arm B”.

Option A is recommended because the LLM already has the context to make this classification, and the marginal cost per publication is negligible.
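For comparison, Option B's heuristic would look roughly like the sketch below. The keyword lists are illustrative, not a production set; note how SOC regimen names and numbered arms fall through, which is exactly why Option A is preferred:

```ruby
# Sketch of the Option B post-hoc heuristic (illustrative keyword lists).
# It catches the obvious names but returns nil for the ambiguous cases the
# text calls out, such as "Extreme Regimen" and "Arm B".
COMPARATOR_HINTS   = /\b(control|placebo|standard(?: of care)?|soc|comparator)\b/i
EXPERIMENTAL_HINTS = /\b(experimental|investigational|study drug|single-arm)\b/i

def heuristic_arm_role(arm_name)
  return "comparator"   if arm_name.match?(COMPARATOR_HINTS)
  return "experimental" if arm_name.match?(EXPERIMENTAL_HINTS)
  nil # ambiguous: SOC regimen names, numbered arms, bare drug names
end

p heuristic_arm_role("Cetuximab + Chemotherapy (Control)") # prints "comparator"
p heuristic_arm_role("Extreme Regimen")                    # prints nil
```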

Once arm roles are self-classified:

  1. vw_publication_efficacy_data: Remove the study_plan_arms join from arm_outcomes_expanded. Use the new arm_role field on trial_arm_outcomes instead.

  2. drug_interventions CTE: Remove the publication_arm_links → vw_bioloupe_interventions join path entirely for clinical evidence queries. Use publication_interventions as the sole drug source (acceptable since clinical evidence queries are disease-scoped).

  3. fetch_trial_enrichments: Keep the enrichment query but simplify — it only needs clinical_trials + locations + study_designs for metadata. Remove the study_plan_arms subquery for comparator arm names; instead, derive comparator name from the LLM-extracted arm names where arm_role = 'comparator'.

  4. fetch_combination_partners: Already uses publication_interventions as primary path. No change needed.

  • NCT ID, trial name, phase, location, randomized, basket trial detection — all via publication_clinical_trials → clinical_trials (no arm join)
  • Correct experimental vs comparator arm selection — via LLM-classified arm_role
  • Comparator name in the report — derived from arm names where arm_role = 'comparator'
  • Dependency on study_plan_arm_id matching (currently fails for 62% of rows)
  • Registry arm type overriding LLM context (sometimes wrong, e.g., BCA101 + pembrolizumab tagged NO_INTERVENTION)
  • Drug resolution via vw_bioloupe_interventions for linked publications (replaced by publication_interventions)

Implemented 2026-03-23. Change: fix-study-plan-arm-dependency.

Four-part fix:

  1. vw_publication_efficacy_data v16 — restructured drug_interventions CTE:

    • Added Source 0: publication_interventions as primary drug source for all pubs that have them (linked AND unlinked). Includes NULL drug_id interventions — if we extracted them, that’s the source of truth.
    • Sources 1a/1a-fallback/1b/1c gated with NOT EXISTS (pubs_with_pi) — only fire as fallback for publications without publication_interventions (non-target-disease pubs used by EmergingClinicalDataQuery).
    • Removed Source 2 (unlinked-only path) — subsumed by Source 0.
    • Threaded publication_intervention_id through Source 0 and pub_dose_lookup for exact join matching, eliminating the drug_id mismatch that dropped 76% of dose evidence.
  2. vw_publication_efficacy_data v16 — inverted arm_outcomes_expanded priority:

    • LLM-classified tao.arm_type now preferred over registry spa.arm_type via CASE expression.
    • Maps control/active_comparator → ACTIVE_COMPARATOR, placebo/placebo_comparator → PLACEBO_COMPARATOR.
    • Falls back to spa.arm_type only when LLM value is NULL.
  3. Safety queries in clinical_evidence_query.rb:

    • Updated both inline safety SQL queries to use the same LLM-first arm role logic.
  4. Arm role classification improvements (going-forward + backfill):

    • Expanded arm_type enum in details.rb from [investigational, control] to [investigational, control, active_comparator, placebo_comparator].
    • LLM-based backfill task lib/tasks/one_off/backfill_arm_type_from_name.thor:
      • Phase 1 fast-path: single-arm publications (39K pubs, 239K rows) → investigational directly.
      • Phase 2 LLM: multi-arm publications (28K pubs) sent to GPT-5-mini with abstract context for classification. Estimated cost ~$17.
    • Tested on 65 publications with 0 errors. LLM correctly classifies drug-name arms (e.g. “Sorafenib” → control, “Chemotherapy” → control), ambiguous labels (e.g. “Arm B”, “Group 1”), and placebo variants.
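The inverted precedence in arm_outcomes_expanded can be sketched as the following CASE expression (column names as described above; the production SQL may differ slightly):

```sql
-- Sketch: prefer the LLM-classified arm type on trial_arm_outcomes and map
-- its enum values to registry-style labels; fall back to the registry's
-- study_plan_arms.arm_type only when the LLM value is NULL.
SELECT
  CASE tao.arm_type
    WHEN 'investigational'    THEN 'EXPERIMENTAL'
    WHEN 'control'            THEN 'ACTIVE_COMPARATOR'
    WHEN 'active_comparator'  THEN 'ACTIVE_COMPARATOR'
    WHEN 'placebo_comparator' THEN 'PLACEBO_COMPARATOR'
    ELSE UPPER(spa.arm_type)  -- registry fallback when LLM value is NULL
  END AS resolved_group_type
FROM trial_arm_outcomes tao
LEFT JOIN study_plan_arms spa ON spa.id = tao.study_plan_arm_id;
```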

Results (prod, post-backfill, 2026-03-24):

| Metric | Before v16 | After v16 + backfill |
|---|---|---|
| Pubs with structured dose in view | 8,764 | 11,916 (+36%) |
| Coverage of extracted dose evidence | 71% | 96.7% |
| ACTIVE_COMPARATOR rows | sparse (registry-only) | 124,346 (12.6% of view) |
| PLACEBO_COMPARATOR rows | sparse (registry-only) | 32,383 (3.3% of view) |
| Total comparator identification | ~38% coverage when arm linked | 15.9% of all rows (up from near-zero for LLM-sourced pubs) |
| Stale registry values (PLACEHOLDER/NO_INTERVENTION/OTHER) | 7,806 rows | 17 rows |

Prod verification (2026-03-24): Spot-checked 55+ publications across multiple categories:

  • Combo arms with “placebo” in name (e.g. “Nivo+Ipi+Placebo for Nivo”) → correctly EXPERIMENTAL
  • Drug-name comparators (e.g. “Sorafenib”, “FOLFIRI”, “Chemotherapy”) → correctly ACTIVE_COMPARATOR
  • Novel drug monotherapy vs combo (EV mono vs EV+pembro) → correctly identified mono as comparator
  • Phase I multi-arm dose trials → correctly all EXPERIMENTAL
  • Randomized dose-finding (same drug, different schedules) → correctly all EXPERIMENTAL
  • No false positives or misclassifications found

Tracker spot-checks resolved:

| Pub | Drug | Before | After |
|---|---|---|---|
| 66516 | Zanidatamab | all NULL (drug_id mismatch: 10432 vs 15231) | single_dose=1200 mg, dose_units=mg |
| 114758 | Zanidatamab | all NULL (same mismatch) | single_dose=1200 mg, dose_frequency=on days 1 and 15 |
| 70960 | SHR-A1811 | all NULL (drug_id was NULL) | dose_min=3.2 mg/kg, dose_max=8.0 mg/kg, rp2d=6.4 mg/kg |

Files changed:

  • db/views/vw_publication_efficacy_data_v16.sql (new)
  • db/migrate/20260323212725_update_vw_publication_efficacy_data_to_version_16.rb (new)
  • app/queries/tpp/clinical_evidence_query.rb (safety query arm role logic)
  • app/tasks/publications_llm_classification/details.rb (arm_type enum expansion)
  • lib/tasks/one_off/backfill_arm_type_from_name.thor (new — LLM arm type backfill)

Deployment steps:

  1. rake db:migrate (creates v16 view + materializes)
  2. REFRESH MATERIALIZED VIEW CONCURRENTLY vw_publication_efficacy_data
  3. thor one_off:backfill_arm_type_from_name:backfill --batched --parallelism 4 --batch-size 2000
  4. REFRESH MATERIALIZED VIEW CONCURRENTLY vw_publication_efficacy_data (again after backfill)

21. Phase 1 basket trials report response counts, not ORR percentages

Phase 1 dose-escalation and basket trial abstracts often report efficacy as response counts per tumor type (e.g. “1 PR in 9 HNSCC patients”) rather than ORR percentages. The LLM faithfully extracts these as PR endpoint with measure_unit = count, but the query only recognizes ORR with measure_unit = percentage. This causes two downstream problems:

  1. No efficacy shown — the publication surfaces in the report with empty ORR/PFS/OS columns despite having extractable response data
  2. Inflated patient count — when no recognized efficacy endpoint exists for the disease subgroup, extract_patient_count falls back to the largest number_of_participants across all rows, which is typically the cross-tumor Overall population (e.g. N=92 instead of N=9)

Publication 29759 — Praluzatamab ravtansine (CX-2009) first-in-human phase 1 (NCT03504488), ASCO 2020.

Abstract reports: “92 patients … 5 PRs in breast cancer (n=39), 2 PRs in ovarian (n=22), 1 PR in HNSCC (n=9)”

Extracted data (correct):

| Subgroup | Endpoint | Value | Unit | N |
|---|---|---|---|---|
| Overall | SD | 21 | count | 92 |
| Overall → HNSCC | PR | 1 | count | 9 |
| Overall → Breast Cancer | PR | 5 | count | 39 |
| Overall → Ovarian Cancer | PR | 2 | count | 22 |

Query output for HNSCC (incorrect):

  • ORR: empty (no ORR endpoint exists)
  • Patient count: 92 (fallback to Overall N, should be 9)
  • The row appears in the report with no efficacy and a misleading N

Two gaps in the query layer:

  1. extract_efficacy_metrics only looks for PRIMARY_EFFICACY_ABBREVIATIONS (OS, PFS, ORR, DOR, DFS, DCR). PR and CR counts are not recognized. No logic derives ORR from PR count / N.

  2. extract_patient_count takes the max number_of_participants across all rows in the group. For basket trials where the Overall subgroup (N=92) and disease subgroup (N=9) coexist in the same group key, the fallback picks N=92.

Phase 1 dose-escalation trials commonly report response counts rather than ORR. Basket trials with disease-specific cohorts are particularly affected since they report per-tumor-type counts. The exact count of affected publications needs characterization, but this pattern is common in early-phase oncology abstracts.

Option 1: Derive ORR from PR/CR counts at query time. When no ORR endpoint exists for a subgroup but PR and/or CR counts exist with number_of_participants > 0, compute ORR = (PR + CR) / N * 100. This is clinically correct and matches how the client sheet manually computes these values.

Option 2: Have the LLM compute ORR during extraction. Add a prompt instruction: when only response counts are reported, also emit a derived ORR endpoint with measure_unit = percentage. Risk: the LLM might hallucinate percentages or miscount.

Option 3: Filter out publications with no recognized efficacy endpoints. If a publication has no ORR/PFS/OS/DoR for the disease subgroup, don’t surface it in the report. This avoids misleading rows but loses legitimate phase 1 data.

Option 1 is most reliable — the data is already correctly extracted, just needs a calculation step in the query.

Implemented (2026-03-20):

  1. Going-forward fix in post_process.rb: Added derive_orr_for_subgroup — after persisting outcome measures for each subgroup, checks if PR/CR counts exist with N > 0 but no ORR percentage. If so, creates a derived ORR row: (PR + CR) / N * 100 with measure_unit = 'percentage' and observation = 'Derived from PR + CR counts'. Skips subgroups tagged response_status (response-defined subgroups where derivation is meaningless).

  2. Backfill task lib/tasks/one_off/backfill_derived_orr.thor: Finds all publication subgroups with PR/CR counts but no ORR percentage and creates derived ORR rows. Results: 753 ORR rows created across 512 publications.

  3. Empty efficacy filter in ClinicalEvidenceQuery: Rows with no recognized efficacy endpoints (empty efficacy hash) are now filtered out in build_result_rows, preventing publications with only safety/DLT data from appearing as empty rows with misleading patient counts.
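The derivation itself is a one-liner; here is a sketch of the calculation described for derive_orr_for_subgroup (method name and keyword arguments are illustrative, not the production signature):

```ruby
# Sketch of the derived-ORR calculation: when PR/CR counts exist with N > 0
# but no ORR percentage, ORR = (PR + CR) / N * 100.
def derive_orr(pr_count:, n:, cr_count: 0)
  return nil unless n.positive? # no denominator, nothing to derive
  ((pr_count + cr_count).to_f / n * 100).round(1)
end

# Publication 29759's cohorts: 1 PR in 9 HNSCC patients, 5 PRs in 39 breast
# cancer patients.
p derive_orr(pr_count: 1, n: 9)  # prints 11.1
p derive_orr(pr_count: 5, n: 39) # prints 12.8
```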

Prod deployment:

  1. Run response_status backfill first: thor one_off:backfill_response_status_tags:backfill --batched
  2. Run derived ORR backfill: thor one_off:backfill_derived_orr:backfill
  3. Refresh materialized view

22. extract_subgroups doesn’t identify response counts as endpoints

When abstracts report best response as narrative counts (“1 PR and 14 SD out of 29 CRC patients”, “1 PR and 4 SD among 8 esophageal cancer patients”) without computing an explicit ORR percentage, the upstream extract_subgroups step only identifies formal endpoints like DCR and TTP. Individual response counts (PR, CR) are not recognized as extractable endpoints. Since classify_publications constrains its endpoint_abbreviation enum to the abbreviations identified upstream, the LLM cannot create PR/CR endpoint rows even though it sees the data in the abstract.

  1. extract_subgroups (step 7 in PublicationsWorkflow) scans the abstract and identifies subgroups + their associated endpoints → stored in llm_data['subgroup_endpoints']
  2. classify_publications (step 9) receives subgroup_endpoints as input, builds a JSON schema with endpoint_abbreviation constrained to the upstream list, and extracts structured outcome measures
  3. If PR/CR aren’t in the upstream endpoint list, classify_publications can’t output them

Publication 29737 — IMMU-132 (sacituzumab govitecan) phase I/II in GI cancers (NCT01631552), ASCO 2020.

Abstract text: “Of 29 CRC pts… 1 had a PR and 14 had SD as the best response by RECIST, with a time to progression (TTP) of 11.5+ months for the PR… This is a disease control rate (DCR) of 51.7%.”

subgroup_endpoints identified upstream:

  • Time to progression → 5 subgroups
  • Disease control rate → 3 subgroups

Missing: Partial Response / PR was not identified as an endpoint despite being explicitly reported per disease cohort.

LLM output: Extracted DCR=51.7% (N=29) and TTP values. The PR count (1/29) was noted in the DCR observation text (“1 PR and 14 SD out of 29 evaluable CRC patients”) but not as a separate endpoint row.

Result: No ORR can be derived (Issue 21’s derivation requires PR/CR rows to exist), and the publication shows DCR but no ORR in the report.

  • 759 publications have DCR but no ORR, PR, or CR endpoints
  • 287 of those have response counts (PR/CR) mentioned in the DCR observation text — confirming the data was seen by the LLM but not extracted as separate endpoints
  • 414 publications have SD counts but no ORR/PR/CR/DCR — similar pattern with stable disease

extract_subgroups identifies endpoints by looking for formal endpoint patterns in the abstract (named endpoints with abbreviations, table headings, structured results). Narrative best-response descriptions like “1 had a PR and 14 had SD” are not recognized as formal endpoints because:

  1. They don’t follow the endpoint = value pattern
  2. PR/SD/CR appear as best overall response categories, not as measured endpoints
  3. The abstract often only computes a summary metric (DCR) from these counts

The classify_publications schema then constrains the LLM to only the identified abbreviations, preventing it from creating PR/CR rows even though it clearly reads the counts (as evidenced by the observation text).

Option 1: Expand extract_subgroups to detect response count patterns. Add pattern matching for narrative response descriptions: “N had a PR”, “X partial responses”, “CR in Y patients”, etc. When detected, add PR/CR as endpoints alongside DCR/TTP.

Option 2: Allow classify_publications to add endpoints not in the upstream list. Remove or relax the endpoint_abbreviation enum constraint so the LLM can create PR/CR rows when it sees response counts. Risk: the LLM might hallucinate endpoints.

Option 3: Post-processing derivation from DCR observation text. Parse the observation strings like “1 PR and 14 SD out of 29 evaluable CRC patients” to extract PR/CR counts. This is fragile (regex on LLM-generated text) but catches the 287 publications where the data is already captured.
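A minimal sketch of the Option 3 parse, pulling response counts out of an observation string like the one stored for pub 29737. The method name and regex are illustrative, not a shipped implementation, and as noted above this approach is fragile against LLM phrasing variation:

```ruby
# Matches "1 PR", "1 had a PR", "14 had SD", etc. (illustrative regex)
RESPONSE_COUNT = /(\d+)\s+(?:had\s+(?:a\s+|an\s+)?)?(PR|CR|SD)\b/i

def parse_response_counts(observation)
  observation.scan(RESPONSE_COUNT)
             .each_with_object(Hash.new(0)) { |(n, cat), h| h[cat.upcase] += n.to_i }
end

parse_response_counts("1 PR and 14 SD out of 29 evaluable CRC patients")
# => {"PR"=>1, "SD"=>14}
```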

Option 4: Prompt instruction in classify_publications. Add an explicit instruction: “When the abstract reports individual response counts (e.g. ‘1 PR’, ‘2 CR’) per subgroup without an explicit ORR, also extract these as separate PR/CR endpoints with measure_unit=count.” Combined with relaxing the enum constraint for response-type abbreviations.

Option 4 is cleanest — it works within the existing pipeline, the LLM already sees the data, and combined with Issue 21’s derivation logic, the ORR gets computed automatically.

Forward fix (v1): Updated task.rb classify_publications prompt to instruct LLM to extract PR/CR counts. See Issue 21 for the ORR derivation that consumes these counts.

Forward fix (v2): Updated task.rb classify_publications prompt to also extract PR/CR/ORR percentages from DCR breakdowns (e.g. “DCR was 54% (CR 8%, PR 15%, SD 31%)”). Added dCR (durable CR) and pCR/MPR exclusions to prevent misidentification as standard CR.

Backfill v1 (2026-03-20/21): screen_missing_response_counts:screen (job 1568) screened candidates and flagged pubs with narrative response counts (e.g. “1 PR, 14 SD”). Re-extraction via classify_publications (job 1570) on flagged pubs. Reduced DCR-only population from 759→620 (~139 fixed).

Backfill v1 gap: The v1 screener explicitly excluded percentage-based response rates (“ORR was 35%” → NO), missing a second pattern where abstracts report ORR/PR/CR as percentages — either standalone (“ORR was 33%”, “BOR rate 18.2%”) or embedded in DCR breakdowns (“DCR was 54% (CR 8%, PR 15%, SD 31%)”). Prod analysis (2026-03-24) found ~92 publications with extractable response rate percentages but no response endpoint, of which 73 were never re-processed (pre-fix) and 19 ran with the v1 prompt but were missed.

Backfill v1 screener (historical): screen_missing_response_counts.thor was used to identify candidates for v1 re-extraction. Its prompt only detected integer counts and explicitly excluded percentage-based ORR — this is why the v1 gap exists. The screener is no longer needed for v2 since the targeted backfill scopes structurally via SQL.

Backfill v2 (complete, job 1604, 2026-03-24): Targeted backfill task backfill_missing_response_endpoints.thor — sent a focused LLM prompt (o4-mini) per publication extracting ORR/PR/CR values anchored to existing subgroups. Created trial_endpoint + trial_outcome_measure + trial_arm_outcome records directly without re-running the full classify pipeline. ORR derived inline from PR% + CR% when LLM didn’t return explicit ORR. Guards: skips zero values, excludes dCR/pCR/MPR, skips response-status subgroups, idempotent (skips existing records).
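The inline ORR derivation mentioned above amounts to simple arithmetic; a hedged sketch (method name hypothetical — the real logic lives inside the backfill task):

```ruby
# Derive ORR from PR% + CR% only when the LLM returned no explicit ORR.
def derive_orr(pr_pct:, cr_pct:, orr_pct: nil)
  return orr_pct if orr_pct          # prefer an explicitly reported ORR
  return nil unless pr_pct && cr_pct # both components required to derive
  (pr_pct + cr_pct).round(1)
end

derive_orr(pr_pct: 15.0, cr_pct: 8.0)                # => 23.0
derive_orr(pr_pct: 15.0, cr_pct: 8.0, orr_pct: 24.0) # => 24.0 (explicit wins)
```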

Results: 97 new records created (41 PR counts, 22 ORR percentages, 14 PR percentages, 12 CR counts, 8 CR percentages). DCR-only population reduced from 550 → 498 (~52 pubs fixed). Combined with v1 backfill: 759 → 498 total (~261 pubs fixed, ~34% reduction).

Verified: 10 random remaining DCR-only pubs manually checked against full abstract text — all 10 genuinely report only DCR with no PR/CR/ORR breakdown (phase I safety studies, PK/biomarker analyses, maintenance trials with DCR as primary endpoint, composite response rates ≠ ORR). The remaining 498 are clean.

23. Dose extraction misses implicit RP2D in phase I/II trials

The dose extraction LLM classifies “dose levels of 8 and 10 mg/kg were chosen for phase II” as a range (dose_min/dose_max) rather than RP2D. In phase I/II trials, doses selected for phase II expansion ARE the recommended phase 2 dose by definition — this is the entire purpose of the phase I dose escalation.

Publication 29737 — “Phase I/II trial of IMMU-132 (sacituzumab govitecan)”

Abstract states: “starting at a dose of 8 mg/kg given on days 1 and 8 of a 3-week cycle. Dose levels of 8 and 10 mg/kg were chosen for phase II”

Current extraction:

{
"dose_min": "8 mg/kg",
"dose_max": "10 mg/kg",
"rp2d": null,
"dose_context_type": "range"
}

Expected: rp2d should capture that 8 and 10 mg/kg are the RP2D levels. The phrase “chosen for phase II” in a phase I/II trial is semantically equivalent to “recommended phase 2 dose.”

Update the dose extraction prompt to recognize implicit RP2D language in phase I/II trials:

  • “doses chosen/selected for phase II”
  • “phase II dose levels”
  • “expansion cohort dose”
  • “dose carried forward to phase II”
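As a rough screen, the phrase families above can be pattern-matched — purely illustrative, since the actual detection is done by the LLM prompt rather than regex:

```ruby
# Illustrative phrase screen for implicit RP2D language (not the shipped logic).
IMPLICIT_RP2D_PATTERNS = [
  /(?:chosen|selected)\s+for\s+phase\s*(?:II|2)/i,
  /phase\s*(?:II|2)\s+dose\s+level/i,
  /expansion\s+cohort\s+dose/i,
  /carried\s+forward\s+to\s+phase\s*(?:II|2)/i
].freeze

def implicit_rp2d?(text)
  IMPLICIT_RP2D_PATTERNS.any? { |re| text.match?(re) }
end

implicit_rp2d?("Dose levels of 8 and 10 mg/kg were chosen for phase II") # => true
implicit_rp2d?("Dose escalation from 2 to 10 mg/kg")                     # => false
```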

The challenge is that RP2D is currently a single value field. When two dose levels are selected (8 and 10 mg/kg), storing both requires either a comma-separated value or keeping dose_min/dose_max AND setting rp2d.

Note: this publication also has a secondary issue — publication_interventions.drug_id is NULL, so the dose evidence can’t join to the view via pub_dose_lookup even if the extraction were correct.

~4,617 interventions across ~4,100 publications have dose_context_type of range, escalation, or rp2d (typed but value missing) with no rp2d value. LLM verification on a sample of 17 publications found implicit RP2D in ~20% of candidates (MTD declarations, phase II dose selections, expansion cohort doses).

Forward fix: Updated dose_evidence_extraction.rb system prompt to recognize implicit RP2D language: MTD declarations, “chosen/selected for phase II”, expansion cohort doses, “recommended for further study”.

Backfill: lib/tasks/one_off/backfill_implicit_rp2d.thor — sends abstract + current dose evidence for ~4,100 publications to GPT-5-mini. LLM determines if an implicit RP2D exists and extracts the value. Only updates rp2d and dose_context_type fields — does not overwrite existing dose_min/dose_max/units/frequency. Corrections tagged with rp2d_source: 'implicit_backfill' for audit. Estimated ~800 RP2Ds to be found. Cost: ~$2.

thor one_off:backfill_implicit_rp2d:backfill --batched --parallelism 4

24. Subgroup participant count wrong for biomarker sub-cohorts

When abstracts report results for a biomarker-defined sub-cohort within a disease subgroup, the LLM sometimes confuses the count of patients with a specific outcome with the total sub-cohort size.

Publication 29737, KRAS-mutated CRC subgroup:

Abstract states: “Thirteen CRC pts had KRAS mutations, 7 with SD (median TTP = 4.4+ mo)”

Current extraction: subgroup "Advanced GI cancers → Colorectal cancer → KRAS-mutated" with TTP endpoint, n=7

Expected: n=13 (the KRAS-mutated cohort size), with 7 being the count of patients with SD (stable disease).

The LLM set number_of_participants=7 (the SD count) instead of 13 (the KRAS cohort size). This is a pattern likely to recur wherever abstracts report “N patients had X, Y with outcome Z.”

~112 highly suspicious subgroups identified via heuristic (response count = N, and N < 30% of publication max N). True scope is likely larger but hard to detect structurally — confirmed by LLM verification on a sample of 12 publications finding 11 corrections across 104 verified arms (~10.6% correction rate).
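The heuristic can be sketched as a predicate (names illustrative): an arm is suspicious when its recorded N equals one of its own response counts and is small relative to the publication's largest cohort:

```ruby
# Screening heuristic sketch: N equals a response count AND N < 30% of pub max N.
def suspicious_n?(arm_n:, response_counts:, pub_max_n:)
  return false if arm_n.nil? || pub_max_n.nil?
  response_counts.include?(arm_n) && arm_n < 0.3 * pub_max_n
end

suspicious_n?(arm_n: 7, response_counts: [7], pub_max_n: 29)  # => true (pub 29737 KRAS pattern)
suspicious_n?(arm_n: 13, response_counts: [7], pub_max_n: 29) # => false
```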

Affected patterns:

  • Basket trials: per-tumor-type enrollment vs SD/PR counts (pub 53427: CRC N=6 should be 14, PDAC N=6 should be 25, etc.)
  • Biomarker sub-cohorts: mutation cohort size vs outcome count (pub 29737: KRAS N=7 should be 13)
  • Response cohorts: assessable patients vs responder count (pub 3674: Cohort 2 N=13 should be 17 — 13 was cCR count, 17 was assessable)
  • Disease sub-cohorts in phase I: per-histology enrollment vs outcome count (pub 5024: DIPG N=7 should be 9, sDMG N=7 should be 2)

Forward fix: Updated classify_publications prompt in task.rb with explicit anti-example: “CRITICAL: Set number_of_participants to the TOTAL evaluable patients in that subgroup/cohort — NOT the count of patients with a specific outcome.”

Backfill: lib/tasks/one_off/backfill_subgroup_participant_counts.thor — sends abstract + all arm outcomes for ~1,240 publications with PR/CR/SD count endpoints to GPT-5-mini for verification. LLM compares current N against abstract and corrects where wrong. All corrections logged in trial_subgroups.llm_data['n_corrections'] for audit/revert. Estimated cost: ~$1.50.

thor one_off:backfill_subgroup_participant_counts:backfill --batched --parallelism 4

25. Confirmed vs unconfirmed ORR confusion in classify_publications

When abstracts report both confirmed and unconfirmed ORR (a common pattern in ADC oncology trials), classify_publications either (a) extracts the unconfirmed ORR value and incorrectly marks it confirmed: true, or (b) extracts only the unconfirmed ORR and omits the confirmed value entirely. This produces wrong cORR values in the report and missing cORR endpoints.

classify_publications (app/tasks/publications_llm_classification/task.rb) — the LLM extraction step that produces subgroup_outcome_measures. The confirmed boolean on ORR endpoints was added by Issue 16, but the extraction prompt doesn’t instruct the LLM on how to handle abstracts that report both confirmed and unconfirmed ORR.

The extraction schema allows a single ORR record per subgroup arm with a confirmed boolean. When an abstract reports “unconfirmed ORR was X% (confirmed: Y%)”, the LLM extracts one ORR record with measure_value=X (the unconfirmed value) and sets confirmed: true because the word “confirmed” appears in the abstract context. The actual confirmed value (Y%) is only captured in the free-text observation field.

The prompt does not instruct the LLM to:

  1. Create TWO separate ORR records when both confirmed and unconfirmed values are reported
  2. Distinguish which numeric value corresponds to confirmed vs unconfirmed status

Publication 192026 (Precemtabart tocentecan, PROCEADE-CRC-01 dose optimization):

Abstract states: “The unconfirmed objective response rate (ORR) at 2.8 mg/kg was 24.1% (95% CI: 10.3, 43.5) (confirmed: 13.8% [95% CI: 3.9, 31.7]).”

Extracted: ORR confirmed=true, measure_value=24.1

  • observation: “Unconfirmed ORR; confirmed ORR was 13.8%”

Expected: Two records:

  • ORR confirmed=false, measure_value=24.1 (unconfirmed)
  • ORR confirmed=true, measure_value=13.8 (confirmed)

Same pattern in pubs 237309, 49900 (same drug, different data cuts). Also confirmed in pub 190845 (missing cORR entirely) and pub 116824 (missing cORR for dose subgroups).
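The desired split for the pub 192026 sentence can be shown as one pass producing two records. The regex and record shape below are illustrative only (the real fix is prompt-level, not regex):

```ruby
# Parse "unconfirmed ... ORR ... was X% ... confirmed: Y%" into two records.
def split_orr(text)
  m = text.match(/unconfirmed.*?ORR.*?was\s+([\d.]+)%.*?confirmed:\s*([\d.]+)%/i)
  return [] unless m
  [{ endpoint: "ORR", confirmed: false, measure_value: m[1].to_f },
   { endpoint: "ORR", confirmed: true,  measure_value: m[2].to_f }]
end

sentence = "The unconfirmed objective response rate (ORR) at 2.8 mg/kg was " \
           "24.1% (95% CI: 10.3, 43.5) (confirmed: 13.8% [95% CI: 3.9, 31.7])."
split_orr(sentence).map { |r| [r[:confirmed], r[:measure_value]] }
# => [[false, 24.1], [true, 13.8]]
```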

  • Wrong cORR values: The Clinical Evidence report shows unconfirmed ORR in the cORR column. For Precemtabart at 2.8 mg/kg, report shows cORR=24.1% when it should be 13.8% — a 75% overstatement.
  • Missing cORR endpoints: Some publications have no confirmed ORR extracted at all, leaving the cORR column blank when the abstract does report it.
  • Audit failures: 24 of 145 open audit issues in the CRC ADC scope (disease 4345, technology 708) are caused by this pattern: 18 incorrect_value issues on efficacy.corr.value and 6 missing_endpoint issues on efficacy.corr.value.

Directly confirmed in 5 publications (192026, 237309, 49900, 190845, 116824) across the CRC ADC audit scope. Likely affects any ADC trial publication reporting both confirmed and unconfirmed ORR — estimated dozens across the full corpus.

-- Publications with confirmed=true ORR that may have unconfirmed values stored as confirmed
SELECT DISTINCT ts.source_id as publication_id
FROM trial_subgroups ts
JOIN trial_outcome_measures tom ON tom.trial_subgroup_id = ts.id
JOIN trial_arm_outcomes tao ON tao.trial_outcome_measure_id = tom.id
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
WHERE ts.source_type = 'Publication'
AND te.abbreviation = 'ORR'
AND tom.confirmed = true
AND tom.observation ILIKE '%unconfirmed%'

Forward fix: Update the classify_publications prompt in task.rb to explicitly handle confirmed/unconfirmed ORR:

“When an abstract reports both confirmed and unconfirmed ORR for the same subgroup/arm, create TWO separate ORR records: one with confirmed: false and the unconfirmed value, and one with confirmed: true and the confirmed value. The unconfirmed ORR is typically the larger number. Do NOT set confirmed: true on the unconfirmed ORR value.”

Backfill: Re-extract affected publications with the updated prompt. Scope can be identified by querying for ORR records where confirmed=true and the observation mentions “unconfirmed”. Estimated cost: minimal (small number of publications).

Forward fix (2026-03-24): Updated the classify_publications prompt in app/tasks/publications_llm_classification/task.rb to:

  1. Instruct the LLM to create TWO separate ORR endpoint records when an abstract reports both confirmed and unconfirmed values
  2. Not confuse different RECIST assessment criteria (RECIST 1.1 vs mRECIST) with confirmation status — use RECIST 1.1 as primary measure_value, note other criteria in observation

Targeted backfill v1 (2026-03-24): Created lib/tasks/one_off/backfill_confirmed_unconfirmed_orr.thor. Initial run (job 1603) fixed the most obvious cases but missed ~398 records due to overly conservative guardrails and narrow scope.

Backfill v2 (2026-03-24): Expanded the task to address gaps found in v1:

Problems found in v1:

  • 131 confirmed=true ORR records with “unconfirmed” in observation but no confirmed=false pair created — guardrail required LLM to return a full pair, skipping cases where it only returned one side
  • 23 confirmed=true ORR records where the abstract never mentions response confirmation — LLM hallucinated the flag
  • 244 confirmed=false ORR records missing their confirmed=true sibling
  • ~50 publications with PR/CR confirmed=null where abstract says “confirmed PR” — wrong flag propagates to derived ORR via post_process, making it invisible to the cORR metric in clinical_evidence_query

Changes in v2:

  1. Scope widened: Now covers 2,530 pubs — incomplete ORR pairs (1,596) + derived ORR pubs with PR/CR that may need confirmed flags (~934)
  2. PR/CR coverage: LLM now evaluates confirmed flags on PR, CR, and ORR in one pass (was ORR-only)
  3. Guardrail relaxed: Acts when LLM returns any non-null confirmed entry (was: required both true+false pair)
  4. Null upgrade: Can upgrade confirmed=null records to true/false instead of only creating new records
  5. Derived ORR fix: After correcting PR/CR flags, surgically updates derived ORR confirmed to match source PR/CR — no post_process re-run needed
  6. Prompt improved: Instructs LLM to derive both confirmed/unconfirmed ORR from response counts (e.g. “6 confirmed PRs and 2 unconfirmed PRs among 40 patients”)

Forward fix for derived ORR (2026-03-24): Updated post_process.rb derive_orr_for_subgroup to propagate the confirmed flag from source PR/CR records. If all PR/CR are confirmed=true, derived ORR gets confirmed=true. If mixed, derives both a confirmed and unconfirmed ORR.
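The propagation rule above reduces to a small decision on the source flags. A simplified sketch (the real logic lives in post_process.rb derive_orr_for_subgroup and operates on records, not bare flags):

```ruby
# Given the confirmed flags of the source PR/CR records, decide which
# derived ORR record(s) to create: [true], [false], [nil], or both.
def derived_orr_confirmed(flags)
  return [nil]   if flags.empty? || flags.all?(&:nil?)
  return [true]  if flags.all? { |f| f == true }
  return [false] if flags.all? { |f| f == false }
  [true, false] # mixed -> derive both a confirmed and an unconfirmed ORR
end

derived_orr_confirmed([true, true])  # => [true]
derived_orr_confirmed([true, false]) # => [true, false]
derived_orr_confirmed([nil, nil])    # => [nil]
```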

Commands:

# Preview scope
bundle exec thor one_off:backfill_confirmed_unconfirmed_orr:identify
# Run (use --batched for large runs)
bundle exec thor one_off:backfill_confirmed_unconfirmed_orr:backfill --batched

Validation (2026-03-24):

  • v1 tested on 20 random publications across two rounds — correctly handled split ORR, single confirmed, RECIST criteria, ambiguous cases
  • v2 tested on 36 publications (30 dry run + 6 real run). Verified against abstracts:
    • Pub 1246: “unconfirmed partial response” → PR confirmed=false, derived ORR confirmed=false
    • Pub 31619: mixed confirmed/unconfirmed PRs across disease subgroups — each subgroup’s derived ORR correctly matched its PR flag ✓
    • Pubs 1527, 5024, 7313, 7499: no confirmation language → all flags left as confirmed=null
    • Zero spurious changes on pubs without confirmation language

Backfill v2 production run (2026-03-24, job 1608): 2,530 publications processed.

Results verified in prod:

  • confirmed=false ORR records: 575 → 744 (+169 new unconfirmed pairs created)
  • confirmed=true ORR records: 2,240 → 2,583 (+343 flags upgraded or new records)
  • Derived ORR with confirmed flag: 94 true + 28 false (was all null)
  • Known broken pubs verified correct: 47342 (27.6%/34.5%), 65504 (13.0%/17.4%), 74897 (46.7%/60.0%), 234678 (15.0%/20.0%)
  • Remaining 55 confirmed=true ORR (45 pubs) with “unconfirmed” in observation but no pair — root cause: the backfill was sending the existing confirmed flag to the LLM, which anchored on it and echoed it back instead of making a fresh determination from the abstract.

Backfill v3 fix (2026-03-24): Two changes to the prompt/input:

  1. Removed confirmed field from existing records sent to the LLM — forces fresh determination from abstract text only
  2. Added explicit instruction: “The existing records may have WRONG confirmed flags. Do NOT trust the existing confirmed value.”

Scope: ~2,000 pubs still in scope (1,553 incomplete pairs + 691 derived ORR with null confirmed). The pubs fixed by v2 are excluded (they now have complete pairs).

Command: bundle exec thor one_off:backfill_confirmed_unconfirmed_orr:backfill --batched

Backfill v3 production run (2026-03-24, job 1612): ~2,000 publications processed.

Results: 55 → 26 remaining records with “unconfirmed” in observation but no pair. The 26 remaining break down as:

  • ~20 combined rates where abstract reports “confirmed and unconfirmed responses” as a single number — confirmed=true is wrong (should be null) but can’t be split into two rows. Not a data loss since the value itself is correct.
  • 3 truncated abstracts (30362, 59711, 209569) — response breakdown is in the missing portion of the abstract text
  • 2 genuine LLM misses (116973, 236929) — abstract has the data but LLM didn’t split

2026-03-26 audit findings — Issue reopened

A Clinical Evidence audit (publications:audit_clinical_evidence) on HNSCC publications identified 7 open cORR-related audit issues across 5 publications that demonstrate the extraction fix is insufficient. Three categories of residual failure:

Category 1: LLM counts all responses as confirmed (post-fix)

Publication 30362 (Petosemtamab, updated_at: 2026-03-23 — processed AFTER the v3 backfill):

  • Abstract: “1 confirmed complete response, 2 confirmed and 3 unconfirmed partial responses” among 10 evaluable patients
  • Expected: cORR = 30% (3/10 confirmed), ORR = 60% (6/10 total)
  • Extracted: ORR = 60.0% with confirmed: true — LLM counted ALL responses as confirmed
  • Note: v3 backfill categorized this as “truncated abstract” but the abstract is NOT truncated — full response breakdown is present. The backfill LLM erroneously classified it as truncated.

Category 2: “cORR” terminology not recognized as confirmed flag

Publication 29660 (Tisotumab vedotin):

  • Abstract explicitly uses “confirmed objective response rate (cORR)” as primary endpoint throughout
  • Values: cORR = 32.5% (full cohort), cORR = 40.0% (≤2 prior lines)
  • Extracted: ORR with confirmed: null for both subgroups — correct values but missing confirmed flag
  • Impact: cORR column is empty in the report despite values being correctly extracted

Category 3: Total ORR mislabeled as confirmed

Publication 65575 (Ozuriftamab vedotin):

  • Abstract: “ORR was 32% including confirmed and unconfirmed responses”
  • Extracted: ORR = 32.0% with confirmed: true — the total ORR (including unconfirmed) is marked as confirmed
  • Only confirmed: true record exists; no confirmed: false pair

Additional confirmed cases with correct extraction but wrong audit flags (false positives from audit LLM):

  • Pubs 65346, 151763, 237727: Both confirmed and unconfirmed ORR rows exist with correct values and flags. The ClinicalEvidenceQuery cORR extraction at lines 658–675 correctly filters confirmed=true. These audit findings appear to be audit LLM errors (confusing which row is ORR vs cORR).

Remaining scope estimate:

-- Publications with only confirmed=true ORR (no unconfirmed counterpart)
-- that might have wrong confirmed attribution
SELECT count(DISTINCT ts.source_id)
FROM trial_subgroups ts
JOIN trial_outcome_measures tom ON tom.trial_subgroup_id = ts.id
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
WHERE ts.source_type = 'Publication'
AND te.abbreviation = 'ORR'
AND tom.confirmed = true
AND NOT EXISTS (
SELECT 1 FROM trial_outcome_measures tom2
JOIN trial_endpoints te2 ON te2.id = tom2.trial_endpoint_id
WHERE tom2.trial_subgroup_id = ts.id
AND te2.abbreviation = 'ORR'
AND tom2.confirmed = false
);
-- Returns 1,178 publications — subset may have wrong attribution

Forward fix needed: The classify_publications prompt needs stronger instructions for three specific failure modes:

  1. When abstract lists individual confirmed + unconfirmed responses by count (e.g., “2 confirmed PR, 3 unconfirmed PR”), derive both cORR and ORR from counts — don’t sum them into one value
  2. When abstract uses “cORR” or “confirmed ORR” terminology, set confirmed: true on the endpoint even if no separate unconfirmed value is stated
  3. When abstract says “ORR X% (including confirmed and unconfirmed)”, set confirmed: false or confirmed: null — not confirmed: true
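The arithmetic behind failure mode 1, using the pub 30362 breakdown (1 confirmed CR + 2 confirmed PR + 3 unconfirmed PR among 10 evaluable). Method name is illustrative:

```ruby
# Derive both rates from response counts instead of summing all as confirmed.
def rates_from_counts(confirmed:, unconfirmed:, evaluable:)
  { corr: (100.0 * confirmed / evaluable).round(1),
    orr:  (100.0 * (confirmed + unconfirmed) / evaluable).round(1) }
end

r = rates_from_counts(confirmed: 3, unconfirmed: 3, evaluable: 10)
# cORR = 30.0, ORR = 60.0 — matching the expected values for pub 30362
```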

Related: See also Issue 27 — even when extraction is correct, extract_efficacy_metrics in ClinicalEvidenceQuery can pick the confirmed ORR value for the plain ORR metric

Forward fix v4 (2026-03-26): Added two additional prompt instructions to app/tasks/publications_llm_classification/task.rb:

  1. Explicit example for deriving TWO ORR records from mixed confirmed/unconfirmed response counts (e.g., “1 confirmed CR, 2 confirmed PR, and 3 unconfirmed PR among 10 patients” → cORR=30%, ORR=60%). Addresses the pattern where the LLM sums all responses and marks as confirmed.
  2. Instruction that when the primary endpoint is described as “cORR” or “confirmed ORR”, the value IS confirmed and confirmed: true must be set — do not leave as null.

Backfill scope (2026-03-26): Structural scope (no text matching) — all ~25.5K publications with ORR that don’t already have both a confirmed=true AND confirmed=false ORR record. The LLM determines from the abstract whether confirmation language exists; apply_result is a no-op for pubs where the LLM returns confirmed=null.

Estimated affected (will actually change): ~1,000-1,500 publications based on text analysis showing ~985 with “confirmed ORR”/“cORR” language + null flag, ~77 with wrong confirmed=true, ~21 v3 remnants.

-- V4 structural scope: all ORR pubs without complete confirmed pair
SELECT DISTINCT ts.source_id
FROM trial_subgroups ts
JOIN trial_outcome_measures tom ON tom.trial_subgroup_id = ts.id
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
WHERE ts.source_type = 'Publication'
AND te.abbreviation = 'ORR'
AND NOT EXISTS (
SELECT 1 FROM trial_outcome_measures t1
JOIN trial_endpoints e1 ON e1.id = t1.trial_endpoint_id
JOIN trial_outcome_measures t2 ON t2.trial_subgroup_id = t1.trial_subgroup_id
JOIN trial_endpoints e2 ON e2.id = t2.trial_endpoint_id
WHERE t1.trial_subgroup_id = tom.trial_subgroup_id
AND e1.abbreviation = 'ORR' AND t1.confirmed = true
AND e2.abbreviation = 'ORR' AND t2.confirmed = false
);

Cost: ~$6 using gpt-4o-mini in batch mode (simple classification, no reasoning model needed).

Backfill v4 production run (2026-03-26, job 1626): 25,594 publications processed (full structural scope, gpt-5-mini batch).

Results:

  • confirmed=true records: 2,685 → 5,948 (+3,263 new confirmed flags)
  • confirmed=false records: 744 → 1,424 (+680 new unconfirmed records)
  • Complete confirmed/unconfirmed pairs: 477 → 783 pubs (+306)
  • Pubs with any confirmed flag: 2,240 → 3,436 (+1,196)

Spot-checked 12 random publications against full abstracts — 11 correct, 1 pre-existing extraction error:

Round 1:

  • Pub 29807 (AZD9291): abstract says “confirmed+unconfirmed ORR 51%” → correctly split to cORR=33.9%, ORR=51%
  • Pub 56237 (IMO+ipi): abstract says “6 PR (3 confirmed)” among 15 → correctly split to cORR=20%, ORR=46.7%
  • Pub 76478 (Pralsetinib): abstract says “all confirmed” for naïve subgroup → correct 73.7% confirmed; overall has small gap (63.3% vs 64.6%)
  • Pub 65763 (Belrestotug): complete pairs for all 4 arms, confirmed < unconfirmed in each (expected)
  • Pub 59860 (Pazopanib GCT): pre-existing extraction error — abstract reports marker response (4/5 AFP/HCG decrease), not RECIST ORR. The 80% “ORR” is a marker response rate, not a true ORR. Backfill correctly split confirmed/unconfirmed given the existing data, but the underlying extraction is wrong. Not a backfill bug.

Round 2 (full abstract read → compare):

  • Pub 58824 (Fruquintinib+S-1): 1 confirmed PR at 4mg, 2 unconfirmed at 5mg → cORR=16.67% (1/6), ORR=50% (3/6) ✓
  • Pub 62418 (Zongertinib GI): abstract explicitly states “confirmed ORR 17.2%” and “regardless of confirmation 20.7%” → exact match ✓
  • Pub 70313 (D-1553 KRAS G12C): “1 confirmed CR, 3 confirmed PR, 1 unconfirmed PR” → cORR=40% (4/10), ORR=50% (5/10) ✓
  • Pub 234635 (Ficerafusp SCAC): “6 of 7 responses confirmed” → cORR=27.3% (6/22), ORR=31.8% (7/22) ✓
  • Pub 238559 (BC3195 ADC): “4 confirmed PR (cPR)” out of 31 at 2.4mg, 5 total PR → cORR=12.9%, ORR=16.1% ✓

Zero spurious changes on pubs without confirmation language (24K+ no-ops)

26. Parent population N propagated to child subgroups

When classify_publications extracts data for hierarchical subgroups (e.g., “Phase 1b dose expansion → SCCHN”), the LLM copies the parent subgroup’s number_of_participants to all child subgroups instead of extracting the subset-specific N. This produces incorrect patient counts for ~5,058 child subgroups across 1,174 publications.

classify_publications (app/tasks/publications_llm_classification/task.rb)

The prompt currently instructs (line 120–123):

“CRITICAL: Set number_of_participants to the TOTAL evaluable patients in that subgroup/cohort — NOT the count of patients with a specific outcome.”

This instruction was added for Issue 24 (confusing outcome counts with cohort size), but it has a side effect: the LLM interprets “total evaluable patients in that subgroup” as the parent population total when it doesn’t know the child-specific N.

No prompt instruction distinguishes between:

  • The parent population N (e.g., 39 patients in Phase 1b)
  • The child subgroup N (e.g., the SCCHN subset of those 39)

The LLM defaults to the known parent N rather than outputting null when the child N isn’t explicitly stated.

Publication 134450 — MRG003 Phase 1 (SCCHN/NPC/CRC basket):

  • “Phase 1b dose expansion” → N=39 (correct, total)
  • “Phase 1b dose expansion → SCCHN” → N=39 (WRONG — SCCHN is a subset)
  • “Phase 1b dose expansion → NPC” → N=39 (WRONG — NPC is a subset)
  • “Phase 1b dose expansion → CRC” → N=39 (WRONG — CRC is a subset)

All three disease children show the parent’s N instead of the actual per-disease cohort size.

Publication 5799 — Neoadjuvant hormonal therapy in prostate cancer:

  • “Overall” → N=62 (correct)
  • “Overall → Baseline tumor burden: Low” → N=62 (WRONG — subset)
  • “Overall → Baseline tumor burden: High” → N=62 (WRONG — subset)
  • “Overall → PTEN/ERG immunostatus: Altered” → N=62 (WRONG — subset)
  • “Overall → PTEN/ERG immunostatus: Wild-type” → N=62 (WRONG — subset)

Impact:

  • Clinical Evidence report shows inflated patient counts for sub-cohort rows
  • ORR percentages combined with wrong N produce misleading responder counts (e.g., 40% ORR with N=39 implies 15.6 responders, but the actual SCCHN cohort may only have 10 patients)
  • Undermines per-cohort comparisons in basket trial reporting

~5,058 child subgroup-endpoint rows across 1,174 publications where 2+ siblings all share the parent’s N.

-- Identify affected parent-child groups
WITH parent_child AS (
SELECT DISTINCT
ts_parent.source_id as pub_id,
ts_parent.subgroup_value as parent,
ts_child.subgroup_value as child,
tao_child.number_of_participants as child_n,
tao_parent.number_of_participants as parent_n,
te_child.abbreviation as endpoint
FROM trial_subgroups ts_child
JOIN trial_subgroups ts_parent ON ts_parent.source_id = ts_child.source_id
AND ts_parent.source_type = ts_child.source_type
AND ts_child.subgroup_value LIKE ts_parent.subgroup_value || ' → %'
AND ts_child.subgroup_value NOT LIKE ts_parent.subgroup_value || ' → % → %'
JOIN trial_outcome_measures tom_child ON tom_child.trial_subgroup_id = ts_child.id
JOIN trial_arm_outcomes tao_child ON tao_child.trial_outcome_measure_id = tom_child.id
JOIN trial_outcome_measures tom_parent ON tom_parent.trial_subgroup_id = ts_parent.id
JOIN trial_endpoints te_child ON tom_child.trial_endpoint_id = te_child.id
JOIN trial_endpoints te_parent ON tom_parent.trial_endpoint_id = te_parent.id
JOIN trial_arm_outcomes tao_parent ON tao_parent.trial_outcome_measure_id = tom_parent.id
WHERE ts_child.source_type = 'Publication'
AND te_child.abbreviation = te_parent.abbreviation
AND tao_child.number_of_participants = tao_parent.number_of_participants
AND tao_child.number_of_participants > 0
)
SELECT pub_id, parent, COUNT(DISTINCT child) as num_siblings
FROM parent_child
GROUP BY pub_id, parent
HAVING COUNT(DISTINCT child) >= 2;
-- Returns 1,776 parent groups across 1,174 publications

Forward fix: Add a prompt instruction to classify_publications in task.rb:

“When extracting number_of_participants for a child subgroup (e.g., ‘Overall → NSCLC’, ‘Phase 1b → SCCHN’), use the N specific to that sub-cohort, NOT the parent population’s total. If the abstract does not explicitly state how many patients are in the child sub-cohort, set number_of_participants to null rather than copying the parent’s N. For example, if ‘Phase 1b’ enrolled 39 patients across SCCHN, NPC, and CRC, do NOT set N=39 for each disease — set N to null unless the abstract specifies the per-disease count.”

Backfill: Re-extract the ~1,174 affected publications with the updated prompt. Alternatively, a cheaper post-processing cleanup could null out child N values that match the parent N when 2+ siblings exist — but this may also catch legitimate cases (e.g., crossover designs where all patients go through each arm), so prompt fix + re-extraction is safer.

Related issues: Issue 24 (subgroup participant count wrong for biomarker sub-cohorts) is a specific instance of this broader pattern.

Three-part fix:

  1. Forward prompt fix (app/tasks/publications_llm_classification/task.rb): Added instruction telling the LLM to use null for child subgroup number_of_participants when the abstract doesn’t explicitly state the per-subset count, rather than copying the parent’s N. Includes concrete right/wrong examples.

  2. Post-processing guard (app/tasks/publications_llm_classification/post_process.rb): Added null_out_propagated_parent_n method that runs after process_outcome_measures. Detects parent-child pairs where 2+ siblings share the parent’s N for the same endpoint and nulls out those child N values. Acts as a permanent safety net regardless of LLM behavior.

  3. One-off backfill (lib/tasks/one_off/null_propagated_parent_n.thor): SQL-based fix for existing affected records. Identifies child trial_arm_outcomes where 2+ siblings share the parent’s N and sets number_of_participants to NULL. No LLM re-runs needed — the correct answer is NULL since these abstracts don’t state the per-subset N.

    • Run: thor one_off:null_propagated_parent_n:identify to preview scope
    • Run: thor one_off:null_propagated_parent_n:backfill --no-dry-run to apply

Scope note: Only the 2+ siblings case is addressed. Single-child cases are ambiguous — the child could legitimately be the full parent population — and are left untouched.
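The 2+-siblings guard can be sketched in a few lines. This is a hedged illustration only: the method shape and field names (`:parent_n`, `:n`) are illustrative, not the actual post_process.rb code, and it uses the simpler v1 behavior of comparing against a single parent N.

```ruby
# Hedged sketch of the 2+-siblings guard; field names are illustrative.
# Null a child subgroup's N when it equals the parent's N and at least one
# sibling shares that same value, the signature of blind propagation.
def null_out_propagated_parent_n(children_by_parent)
  children_by_parent.each_value do |children|
    next if children.size < 2 # single-child cases are ambiguous, leave untouched

    shared = children.select { |c| c[:n] == c[:parent_n] }
    shared.each { |c| c[:n] = nil } if shared.size >= 2
  end
  children_by_parent
end
```

For the "Phase 1b" example above, the two children carrying the propagated N=39 would be nulled, while a child with an explicitly stated per-disease N would survive.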

Backfill v1 (prod): Ran successfully, nulled out all same-endpoint matches (0 remaining for same-endpoint check).

Backfill v2 fix (2026-03-24): v1 only matched child N against parent N on the same endpoint (e.g., child ORR N vs parent ORR N). But N propagation happens at the subgroup level — a child DCR can have the parent’s N even if the parent only has ORR. Found 2,495 pubs / 6,820 children still affected. Fixed both the one-off task and the post-process guard to match child N against ANY parent N across all endpoints.

Command: bundle exec thor one_off:null_propagated_parent_n:backfill --no-dry-run

27. extract_efficacy_metrics picks confirmed ORR as plain ORR


When both confirmed (confirmed=true) and unconfirmed (confirmed=false) ORR rows exist for the same subgroup in the view, ClinicalEvidenceQuery#extract_efficacy_metrics can pick the confirmed row as the plain ORR metric value. This happens because the ORR extraction loop does not exclude confirmed rows, and when both rows have the same number_of_participants, max_by returns whichever comes first — often the confirmed row.

ClinicalEvidenceQuery#extract_efficacy_metrics — app/queries/tpp/clinical_evidence_query.rb, lines 590–628.

The cORR extraction (lines 658–675) correctly filters confirmed == true and is unaffected. The problem is exclusively in the general efficacy extraction loop that handles ORR alongside OS, PFS, DOR, etc.

Lines 600–611:

PRIMARY_EFFICACY_ABBREVIATIONS.each do |abbr|
  matching = grouped[abbr] || grouped[abbr.downcase]
  next if matching.nil? || matching.empty?

  matching = filter_by_valid_unit(matching, abbr)
  next if matching.empty?

  experimental = matching.select { |r| r['resolved_group_type'] == 'EXPERIMENTAL' }
  experimental = matching if experimental.empty?
  best_row = experimental.max_by { |r| r['number_of_participants'].to_i } || matching.first

When abbr == 'ORR', matching includes ALL ORR rows regardless of confirmed flag. If both confirmed=true (value=26.7%) and confirmed=false (value=43.3%) exist with the same N, max_by picks the first match. The result: metrics[:orr] gets the confirmed value, making it identical to metrics[:corr] and wrong as a standalone ORR.
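The tie-break is easy to reproduce in isolation. Enumerable#max_by keeps the first of equal-valued elements, so row order decides the winner (values below mirror pub 117228):

```ruby
# Minimal repro of the tie-break: both rows have N=30, so max_by returns
# the first element, here the confirmed row, which then becomes the
# "plain" ORR in the report.
rows = [
  { 'confirmed' => true,  'measure_value' => 26.7, 'number_of_participants' => 30 },
  { 'confirmed' => false, 'measure_value' => 43.3, 'number_of_participants' => 30 }
]
best_row = rows.max_by { |r| r['number_of_participants'].to_i }
best_row['measure_value'] # => 26.7, not the unconfirmed 43.3
```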

Publication 117228 (RM-1929 photoimmunotherapy in rHNSCC):

Abstract states:

  • “unconfirmed objective response rate (ORR) 43.3%”
  • “confirmed ORR 26.7%”

View correctly has both rows (subgroup “Heavily pretreated rHNSCC → Part 2”):

  • confirmed=true, measure_value=26.7, number_of_participants=30
  • confirmed=false, measure_value=43.3, number_of_participants=30

Report output: efficacy.orr.value = 26.7 (should be 43.3)

The cORR extraction correctly returns 26.7%, but the ORR extraction ALSO returns 26.7% instead of 43.3%.

Impact:

  • Understated ORR: When confirmed ORR is lower than unconfirmed ORR (the typical pattern), the report shows the lower confirmed value as the headline ORR. For pub 117228, ORR is understated from 43.3% to 26.7%.
  • Duplicate values: ORR and cORR columns show the same value, making the cORR column appear redundant and hiding the existence of a lower confirmed rate.
  • Audit noise: The audit correctly flags these as incorrect_value on efficacy.orr.value, generating true-positive findings that overlap with Issue 25 audit findings.

477 publications currently have both confirmed=true and confirmed=false ORR rows (the correct Issue 25 extraction pattern). When both rows have the same N (which is common — confirmed and unconfirmed ORR are computed from the same denominator), the confirmed value gets picked as plain ORR.

-- Publications where confirmed and unconfirmed ORR have the same N
-- (susceptible to the wrong-pick bug)
SELECT count(DISTINCT ts.source_id)
FROM trial_subgroups ts
JOIN trial_outcome_measures tom_c ON tom_c.trial_subgroup_id = ts.id AND tom_c.confirmed = true
JOIN trial_outcome_measures tom_u ON tom_u.trial_subgroup_id = ts.id AND tom_u.confirmed = false
JOIN trial_endpoints te_c ON te_c.id = tom_c.trial_endpoint_id AND te_c.abbreviation = 'ORR'
JOIN trial_endpoints te_u ON te_u.id = tom_u.trial_endpoint_id AND te_u.abbreviation = 'ORR'
JOIN trial_arm_outcomes tao_c ON tao_c.trial_outcome_measure_id = tom_c.id
JOIN trial_arm_outcomes tao_u ON tao_u.trial_outcome_measure_id = tom_u.id
WHERE ts.source_type = 'Publication'
AND tao_c.number_of_participants = tao_u.number_of_participants;

Forward fix: In extract_efficacy_metrics, when processing ORR, exclude confirmed=true rows if confirmed=false rows also exist for the same subgroup. This ensures the plain ORR metric always uses the unconfirmed/total ORR:

# Inside the PRIMARY_EFFICACY_ABBREVIATIONS.each loop, after filtering matching:
if abbr == 'ORR'
unconfirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
matching = unconfirmed if unconfirmed.any?
end

This is a ~3 line change in clinical_evidence_query.rb. No backfill needed — fixing the query immediately fixes all report output.

No backfill required: This is a query-layer bug, not a data issue. The underlying data (trial_outcome_measures with correct confirmed flags) is correct. Fixing the Ruby code fixes all publications instantly.

Forward fix (2026-03-26): Added a 5-line guard in app/queries/tpp/clinical_evidence_query.rb's extract_efficacy_metrics method (lines 610–613). When processing ORR, it rejects confirmed=true rows if non-confirmed rows exist. This ensures the plain ORR metric uses the unconfirmed/total ORR, while the cORR extraction (lines 667–683) independently picks confirmed=true rows.

if abbr == 'ORR'
  non_confirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  matching = non_confirmed if non_confirmed.any?
end

Edge cases handled:

  • Both confirmed + unconfirmed exist → ORR gets unconfirmed, cORR gets confirmed (correct)
  • Only confirmed exists (no unconfirmed) → ORR falls back to confirmed value (safe fallback — same as cORR)
  • Only unconfirmed/null exists → no change (correct)

No backfill needed — query-layer fix applies immediately to all report output

28. build_result_rows collapses dose-level arms when study_plan_arm_id is null


ClinicalEvidenceQuery.build_result_rows groups view rows by [publication_id, disease_id, effective_line, study_plan_arm_id, subgroup_value]. When study_plan_arm_id is null — which it is for all publication-extracted arms that haven’t been matched to a clinical trial study plan arm — distinct dose-level arms (e.g. “8.0 mg/kg” and “10.0 mg/kg”) sharing the same subgroup_value collapse into a single group. extract_efficacy_metrics then picks one arm by max_by(number_of_participants), silently dropping the other.

app/queries/tpp/clinical_evidence_query.rb, build_result_rows method (line 306).

The grouping key at line 306 is:

grouped = enriched_data.group_by { |row|
  [row['publication_id'], row['disease_id'], row['effective_line'],
   row['study_plan_arm_id'], row['subgroup_value']]
}

When study_plan_arm_id is null for both dose arms (common for unlinked publications), they group together. extract_efficacy_metrics (line 619) then picks one via max_by(number_of_participants).

Pub 190656 (ARTEMIS-001, HS-20093 B7-H3 ADC in NSCLC):

  • View has 6 rows for “NSCLC → Squamous cell carcinoma” (3 endpoints × 2 dose arms: 8.0 mg/kg N=32 and 10.0 mg/kg N=26)
  • Both arms have study_plan_arm_id = null
  • Query collapses to 1 row, picks 8.0 mg/kg (N=32 > N=26)
  • Lost data: Sq 10.0 mg/kg cORR 26.9%, PFS 5.7, DOR 7.0

Dose-level subgroup data is silently dropped from the Clinical Evidence report. For dose-escalation studies where different dose levels have meaningfully different efficacy, only the higher-N cohort appears.

Affects dose-escalation/expansion publications where arms aren’t matched to trial study plan arms. The view correctly distinguishes arms by arm_name, but the query ignores arm_name in its grouping key.

Add arm_name to the grouping key in build_result_rows, or fall back to arm_name when study_plan_arm_id is null. This preserves dose-level arm distinctions without breaking publications where study_plan_arm_id correctly differentiates arms.
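The fallback variant can be sketched as follows. This is a sketch only, assuming the view rows expose arm_name as described above; sample row fields mirror the pub 190656 example rather than real view output.

```ruby
# Sample rows shaped like the pub 190656 case: two dose-level arms,
# both with study_plan_arm_id = nil.
enriched_data = [
  { 'publication_id' => 190656, 'disease_id' => 1, 'effective_line' => nil,
    'study_plan_arm_id' => nil, 'arm_name' => '8.0 mg/kg',  'subgroup_value' => 'Squamous' },
  { 'publication_id' => 190656, 'disease_id' => 1, 'effective_line' => nil,
    'study_plan_arm_id' => nil, 'arm_name' => '10.0 mg/kg', 'subgroup_value' => 'Squamous' }
]

# Fall back to arm_name when study_plan_arm_id is null, so unmatched
# dose-level arms no longer collapse into one group.
grouped = enriched_data.group_by do |row|
  [row['publication_id'], row['disease_id'], row['effective_line'],
   row['study_plan_arm_id'] || row['arm_name'], row['subgroup_value']]
end

grouped.size # => 2: the dose arms stay distinct instead of collapsing to 1
```

Publications where study_plan_arm_id is populated are unaffected, since the `||` fallback never fires for them.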

Related to Issue 20 (study_plan_arm link is fragile) — same root cause of over-reliance on study_plan_arm_id.

29. Dose extraction captures study-level range, not efficacy population range


In dose-escalation studies, classify_publications extracts the full dose range stated in the abstract (e.g. dose_min=1.0, dose_max=8.3 mg/kg) as a property of the subgroup. But when the abstract restricts efficacy reporting to a dose subset (e.g. “results for patients who received ≥4.0 mg/kg”), the dose_min on the efficacy row is too low, creating a mismatch between the dose range and the efficacy population.

app/tasks/publications_llm_classification/task.rb — dose fields extracted as subgroup-level properties.

Dose extraction treats dose as a study-level attribute (“what doses were used?”) rather than scoping to the efficacy analysis population (“what doses did the patients in the reported results actually receive?”). The LLM prompt doesn’t instruct it to scope dose to the efficacy population.

Pub 238709 (MYTX-011 KisMET-01 updated):

  • Abstract: “85 pts received 1.0–8.3 mg/kg; 59 pts received ≥4.0 mg/kg” — efficacy reported only for ≥4.0 mg/kg subset
  • Extracted: dose_min=1.0, dose_max=8.3
  • Expected: dose_min=4.0, dose_max=8.3 (matching the efficacy population)
  • RP2D correctly extracted as “5.0 mg/kg Q3W (2-on 1-off) and 4.0 mg/kg Q3W”

Report rows show a broader dose range than the actual efficacy population received. Minor impact on report accuracy but misleading for dose-response interpretation.

Affects phase I dose-escalation studies where efficacy is reported for a dose subset. Relatively uncommon pattern — most studies report efficacy at a single dose or clearly per-dose-level.

Update the classify_publications dose extraction prompt to instruct the LLM: “When the abstract reports efficacy for a specific dose subset, use that subset’s dose range, not the full escalation range.” Alternatively, accept this as a known limitation since RP2D (when present) correctly reflects the clinically relevant dose.

30. Cross-study data contamination from abstract background sections


When a publication abstract references efficacy results from a prior study as background context (e.g. “In our previous study NCT05029882, ORR was 24.4%”), classify_publications extracts those values as if they belong to the current study. This produces fabricated efficacy data for publications that may have no efficacy results of their own yet.

app/tasks/publications_llm_classification/task.rb — efficacy extraction from abstract text.

The LLM extraction prompt does not distinguish between efficacy results reported as outcomes of the current study vs. results cited from external/prior studies as background context. The abstract structure (Background → Methods → Results → Conclusions) is not enforced.

Pub 29705 (ABBV-400/Telisotuzumab adizutecan signal-seeking study, NCT06084481):

  • Abstract background: “Initial results from the ongoing first-in-human study (NCT05029882) of ABBV-400… an overall response rate of 24.4%”
  • Current study status: “As of 19 January 2024, 24 patients have been enrolled” — no efficacy data reported
  • Extracted: ORR=24.4%, N=24 (enrollment count misinterpreted as efficacy N)
  • Expected: No efficacy data (null)

The 24.4% ORR belongs to NCT05029882, not NCT06084481. The N=24 is enrollment, not an efficacy population.

Publications appear in the Clinical Evidence report with fabricated efficacy data from unrelated studies. This is particularly misleading for signal-seeking or early-enrollment publications where the abstract previews prior results to motivate the new study.

Affects publications whose abstracts cite efficacy results from prior/companion studies. Common in: signal-seeking study designs, follow-up studies referencing parent trials, and publications describing study rationale with prior data. Exact count unknown — requires systematic detection.

  1. Audit prompt guard (deployed): Added “CROSS-STUDY REFERENCES” instruction to the audit prompt so future audits flag these correctly.
  2. Extraction prompt fix (forward): Update classify_publications prompt to instruct: “Only extract efficacy values reported as results of THIS study (typically in the Results section). Do not extract values cited from prior/external studies in the Background or Introduction.”
  3. Detection query: Publications where llm_data has efficacy values but the abstract contains phrases like “previous study”, “prior study”, “first-in-human study (NCT…)” with efficacy values in the same sentence could be flagged for review.

Audit prompt updated with cross-study reference guard (2026-03-27). Extraction-level fix pending.
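The detection idea in item 3 could be prototyped as a sentence-level heuristic. This is a sketch only: the phrase lists and regexes are illustrative and would need tuning against real abstracts before use.

```ruby
# Flag abstracts where a prior-study reference and an efficacy value
# appear in the same sentence (phrase lists are illustrative).
PRIOR_STUDY_RE = /previous study|prior study|first-in-human study \(NCT\d+\)/i
EFFICACY_RE   = /\b(?:ORR|overall response rate|PFS|OS)\b.*\d/i

def cross_study_suspect?(abstract)
  abstract.split(/(?<=[.?!])\s+/).any? do |sentence|
    sentence.match?(PRIOR_STUDY_RE) && sentence.match?(EFFICACY_RE)
  end
end
```

For pub 29705, the background sentence citing NCT05029882 with "overall response rate of 24.4%" would be flagged, while a results sentence reporting this study's own ORR would not.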


Job 1594 Triage Log (HNSCC + ADC, disease_id=6200, technology_ids=708)

| Audit ID | Pub ID | Type | Field | Classification | Notes |
| --- | --- | --- | --- | --- | --- |
| 8338 | 29660 | incorrect_value | efficacy.dor.value | True issue — extraction (minor) | LLM appended spurious “(4.55)” to “Not Reached” DOR |
| 8341 | 29705 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 30) | ORR from referenced prior study NCT05029882, not current study |
| 8342 | 29705 | incorrect_value | efficacy.orr.patient_count | True issue — extraction (Issue 30) | Enrollment count (24) misinterpreted as efficacy N |
| 8339 | 44216 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Dose-escalation range (0.3) on dose-expansion efficacy row (RP2D=2.0) |
| 8340 | 44216 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Dose-escalation range (2.2) on dose-expansion efficacy row (RP2D=2.0) |
| 8343 | 115389 | incorrect_value | efficacy.pfs.value | True issue — extraction | “Not Reached” should be null; abstract says “immature” (insufficient data) |
| 8344 | 134450 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8 residual) | Zero-sentinel: N=0 instead of null for unstated SCCHN-specific N |
| 8345 | 134450 | incorrect_value | dose_min | True issue — extraction (Issue 29 variant) | Child subgroup inherited phase 1a escalation dose (0.1) instead of parent’s fixed dose (2.5) |
| 8346 | 134450 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Dose range from escalation phase on expansion subgroup |
| 8347 | 75542 | missing_subgroup | | False positive — audit LLM | ctDNA abundance is a Cox model correlation, not a tabulated efficacy subgroup |
| 8348 | 75542 | missing_subgroup | | False positive — audit LLM | VAF persistence is a statistical correlation, not a reportable subgroup |
| 8349 | 114973 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Full escalation range (0.3) on efficacy row; efficacy population was 3.6-5.4 |
| 8350 | 114973 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Full escalation range (8.0) on efficacy row; efficacy population was 3.6-5.4 |

Job 1635 Triage Log (CRC + ADC, disease_id=4345, technology_ids=708)

| Audit ID | Pub ID | Type | Field | Classification | Notes |
| --- | --- | --- | --- | --- | --- |
| 8360 | 241259 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 2.0 mg/kg arm; per-arm N not stated |
| 8361 | 241259 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 2.4 mg/kg arm; per-arm N not stated |
| 8362 | 241259 | incorrect_value | dose_min | True issue — view (Issue 31) | SOC arm has Temab-A dose_min=1.6; SOC is trifluridine/tipiracil+BEV |
| 8363 | 241259 | incorrect_value | dose_max | True issue — view (Issue 31) | SOC arm has Temab-A dose_max=2.4 |
| 8364 | 241259 | incorrect_value | dose_units | True issue — view (Issue 31) | SOC arm has mg/kg (Temab-A units) |
| 8365 | 241259 | incorrect_value | dose_frequency | True issue — view (Issue 31) | SOC arm has Q3W (Temab-A schedule) |
| 8366 | 241259 | incorrect_value | rp2d | True issue — view (Issue 31) | SOC arm has Temab-A RP2D |
| 8352 | 29699 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 8) | Zero-sentinel: ORR=0% for overall mCRC; no numeric ORR in abstract (E-R paper) |
| 8353 | 29699 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 2.4 mg/kg arm |
| 8354 | 29699 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 8) | Zero-sentinel: ORR=0% for 2.4 mg/kg; E-R correlations only |
| 8355 | 29699 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 3.0 mg/kg arm |
| 8356 | 29699 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 8) | Zero-sentinel: ORR=0% for 3.0 mg/kg; E-R correlations only |
| 8368 | 29737 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 4.8+ mo (SD pts only) mapped to PFS for full CRC cohort |
| 8369 | 29737 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 32) | N=29 (full CRC) but TTP was for 14 SD patients only |
| 8370 | 29737 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 4.4+ mo (SD pts only) mapped to PFS for KRAS-mutated |
| 8371 | 29737 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 32) | N=13 (full KRAS) but TTP was for 7 SD patients only |
| 8411 | 134450 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for CRC phase 1b; ORR/DCR reported |
| 8412 | 134450 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Phase 1a escalation min (0.1) on phase 1b efficacy row (RP2D=2.5) |
| 8413 | 134450 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for SCCHN phase 1b; ORR/DCR reported |
| 8414 | 134450 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Same as 8412 for SCCHN child subgroup |
| 8402 | 72043 | missing_subgroup | | True issue — subgroup identification (Issue 33) | CRC × HER2 IHC 3+ cross-tabulated subgroup missing |
| 8403 | 72043 | missing_subgroup | | True issue — subgroup identification (Issue 33) | CRC × HER2 IHC 2+ cross-tabulated subgroup missing |
| 8404 | 72043 | missing_subgroup | | True issue — subgroup identification (Issue 33) | CRC × HER2 IHC 1+ cross-tabulated subgroup missing |
| 8405 | 72043 | missing_subgroup | | True issue — subgroup identification (Issue 33) | CRC × HER2 mut/amp cross-tabulated subgroup missing |
| 8386 | 74193 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 1.6 mo mapped to PFS |
| 8387 | 74193 | incorrect_value | patient_number_efficacy | True issue — extraction | ctDNA retained subgroup: N=3 (tested) but only 2 had retention |
| 8388 | 74193 | incorrect_value | efficacy.orr.patient_count | True issue — extraction | Same: ORR denominator=3 should be 2 |
| 8389 | 74193 | incorrect_value | efficacy.dcr.patient_count | True issue — extraction | Same: DCR denominator=3 should be 2 |
| 8380 | 200353 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 26) | Parent N=97 propagated to “Absent MR” child subgroup |
| 8381 | 200353 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 26) | Parent N=97 propagated to “Complete MR” child subgroup |
| 8382 | 200353 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for EGFR amplification subgroup |
| 8383 | 200353 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 8) | Zero-sentinel: PFS patient_count=0 for EGFR amp |
| 8373 | 48880 | incorrect_value | single_dose | True issue — extraction | Pooled Overall row shows single_dose=5.4; study had both 5.4 and 6.4 mg/kg |
| 8374 | 48880 | incorrect_value | dose_min | False positive — audit LLM | dose_min=5.4 IS the minimum dose; audit confused by dose_max also being 5.4 |
| 8375 | 48880 | incorrect_value | dose_max | True issue — extraction | dose_max=5.4 should be 6.4 (second arm omitted from pub-level dose) |
| 8407 | 135119 | incorrect_value | patient_number_safety | True issue — extraction | Safety N=28 (Q2W-LD only); full study N=43 includes Q3W arm |
| 8408 | 135119 | incorrect_value | dose_max | True issue — extraction | dose_max=170 but Q3W arm went to 190 mg/m² |
| 8409 | 135119 | incorrect_value | dose_frequency | True issue — extraction | Q2W only; study used both Q2W and Q3W schedules |
| 8397 | 66892 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 0.8 on efficacy row; efficacy population ≥6 mg/kg |
| 8398 | 66892 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Same for IHC 2+/FISH+ child subgroup |
| 8399 | 66892 | missing_subgroup | | True issue — subgroup identification (Issue 33) | HER2 IHC 3+ subgroup (ORR 16/30=53.3%) not extracted |
| 8377 | 48926 | incorrect_value | patient_number_efficacy | True issue — query/view | Disease-scoped IHC2+/ISH+ duplicate has N=0; non-scoped row has correct N=13 |
| 8378 | 48926 | incorrect_value | patient_number_efficacy | True issue — query/view | Disease-scoped IHC3+ duplicate has N=0; non-scoped row has correct N=40 |
| 8379 | 48926 | incorrect_value | patient_number_efficacy | True issue — query/view | Disease-scoped prior anti-HER2 duplicate has N=0; non-scoped row has correct N=16 |
| 8390 | 49899 | incorrect_value | patient_number_efficacy | True issue — extraction | N=40 (overall) for ≥2.4 mg/kg subgroup; should be 34 per abstract |
| 8391 | 49899 | incorrect_value | efficacy.orr.patient_count | True issue — extraction | ORR denominator=40 should be 34 |
| 8392 | 49899 | incorrect_value | efficacy.corr.patient_count | True issue — extraction | cORR denominator=40 should be 34 |
| 8393 | 49900 | incorrect_value | patient_number_safety | True issue — extraction | Safety N=29 for 2.4 mg/kg arm; abstract says 31 treated |
| 8351 | 100 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 2.70 mo mapped to PFS |
| 8394 | 51436 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 1.5 on ≥6 mg/kg efficacy subgroup |
| 8396 | 52543 | incorrect_value | efficacy.orr.patient_count | False positive — audit LLM | patient_count=3 is denominator (correct); audit confused numerator/denominator |
| 8384 | 67379 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for hTMB/MSS; PFS+HR reported |
| 8385 | 67379 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 8) | Zero-sentinel: PFS patient_count=0 for same |
| 8400 | 70960 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 3.2 on RP2D (6.4) subgroup |
| 8401 | 70960 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Escalation max 8.0 on RP2D (6.4) subgroup |
| 8406 | 73299 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 1.8 mo mapped to PFS for CRC cohort |
| 8410 | 75999 | spurious_row | | True issue — query scoping | NPC subgroup in CRC-scoped report (basket trial leak) |
| 8415 | 114571 | incorrect_value | efficacy.os.value | True issue — extraction | OS=“Not Reached” but abstract says “not yet mature” → should be null |
| 8358 | 116843 | incorrect_value | rp2d | True issue — view (Issue 31) | SOC arm has Temab-A RP2D (dose cross-contamination) |
| 8417 | 152942 | spurious_row | | True issue — query scoping | PDA subgroup in CRC-scoped report (basket trial leak) |
| 8418 | 162304 | incorrect_value | efficacy.orr.value | True issue — extraction | ORR=35% is “any tumor reduction” rate; actual ORR≈1.5% (1/66 PR) |
| 8359 | 235204 | incorrect_value | patient_number_efficacy | True issue — extraction | N=23 is PFS event count, not patient count; should be 31 |
| 8416 | 238377 | incorrect_value | efficacy.dor.value | True issue — extraction | DoR=11.03 mo from “>48 weeks” (lower bound, not median) |
| 8395 | 240052 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 1.5 on ≥6 mg/kg efficacy subgroup |
| 8357 | 29700 | missing_endpoint | efficacy.dor.value | True issue — extraction | DoR=5.5 mo in abstract for 3.0 mg/kg but not extracted |
| 8367 | 29735 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 5.1 mo mapped to PFS for CRC |
| 8372 | 29738 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 18 wks → 4.14 mo converted and mapped to PFS |
| 8376 | 48903 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Part 1 max (8.0) on Part 2 expansion row (5.4/6.4) |

31. Investigational drug dose data bleeds onto control/comparator arms


When publication_interventions.study_plan_arm_id is NULL (the common case for publication-extracted drugs via Source 0), the drug_interventions CTE in vw_publication_efficacy_data joins the investigational drug to ALL arms — including control/comparator arms. The pub_dose_lookup COALESCE fallback then propagates the investigational drug’s dose fields (dose_min, dose_max, rp2d, dose_units, dose_frequency) onto control arm rows that have no subgroup-level dose override. This makes it appear that the comparator arm received the investigational drug’s dosing.

db/views/vw_publication_efficacy_data_v18.sql:

  • drug_interventions CTE (Source 0): Joins publication_interventions to arms. When both clinical_trial_id and study_plan_arm_id are NULL, the drug matches all arms via the OR di.study_plan_arm_id IS NULL fallback.
  • pub_dose_lookup CTE: Pulls dose_evidence from publication_interventions. Joined to raw_rows via publication_intervention_id match from drug_interventions.
  • raw_rows COALESCE chain (lines 449–469): Falls through subgroup-level dose → pub-level dose. No arm_type guard prevents control arms from inheriting investigational drug dose.

In raw_rows, the dose COALESCE chain:

COALESCE(tlm.subgroup_dose_min, ..., pdl.pub_dose_min) AS dose_min,
COALESCE(tlm.subgroup_dose_max, ..., pdl.pub_dose_max) AS dose_max,
COALESCE(tlm.subgroup_rp2d, pdl.pub_rp2d) AS rp2d,

has no guard for aoe.arm_type or aoe.resolved_group_type. When a control arm’s subgroup has no dose fields, the COALESCE falls through to pub_dose_lookup, which contains the investigational drug’s dose evidence.

Pub 241259 (Temab-A exposure-response in mCRC):

  • SOC arm = trifluridine/tipiracil+BEV (N=20)
  • View shows: dose_min=1.6 mg/kg, dose_max=2.4 mg/kg, rp2d=2.4 mg/kg Q3W, dose_units=mg/kg, dose_frequency=Q3W
  • These are Temab-A doses from publication_interventions id=51068 (study_plan_arm_id=NULL)
  • Abstract explicitly states SOC is “trifluridine/tipiracil+BEV” — no Temab-A dosing

Pub 241978 (Enfortumab vedotin):

  • “No upfront dose reduction” control arm shows dose_min=0.75 mg/kg, dose_max=1.25 mg/kg
Impact:

  • Clinical Evidence report: Control arms display investigational drug dose fields, misleading reviewers into thinking comparator arms received the ADC
  • Audit findings: Audit LLM correctly flags these as incorrect (5 of 7 issues on pub 241259 are this pattern)
  • Data quality: Dose fields on control arms are nonsensical — they describe a drug the arm didn’t receive

Scope:

  • 2,890 view rows across 566 publications have dose data from pub_dose_lookup on control/comparator arms
  • 1,197 additional control rows have subgroup-level dose (potentially legitimate for dose-comparison arms)
  • Within ADC technology scope: 14 rows across 5 publications (smaller because most ADC trials are single-arm)

Not a bug:

  • Drug NAME attribution to control arms is intentional — the report needs to show what drug the control is being compared against
  • Subgroup-level dose on control arms may be correct (e.g., dose-comparison trials where the control is a different dose of the same drug)
  • This does NOT affect experimental/investigational arm rows

Forward fix — view v19: Add an arm_type guard to the pub_dose_lookup COALESCE in raw_rows. When aoe.arm_type = 'control' (or aoe.resolved_group_type = 'ACTIVE_COMPARATOR'), skip the pub_dose_lookup fallback:

COALESCE(
  tlm.subgroup_dose_min,
  CASE WHEN tlm.subgroup_dose_value IS NOT NULL
       THEN tlm.subgroup_dose_value || ' ' || COALESCE(tlm.subgroup_dose_units, '')
  END,
  CASE WHEN aoe.arm_type != 'control' THEN pdl.pub_dose_min END
) AS dose_min,

Apply the same pattern to dose_max, rp2d, dose_units, dose_frequency, and single_dose. This preserves subgroup-level dose (tier 1) for all arms but blocks the publication-level fallback (tier 3) for control arms only.

No backfill needed — rematerializing the view after deploying v19 will fix all affected rows.

Related to Issue 20: The v16 Source 0 fix (using publication_interventions as primary drug source) introduced this side effect by broadening the drug_interventions join. The drug join itself is correct; only the dose COALESCE fallback needs the arm_type guard.

(empty — pending implementation)


32. TTP (time to progression) misclassified as PFS


The LLM extraction pipeline (classify_publications) maps TTP (time to progression) values to PFS (progression-free survival) when the abstract reports TTP but not PFS. These are distinct endpoints — TTP censors deaths while PFS counts them as events. Additionally, in some cases (e.g., pub 29737), TTP values reported for a best-response subpopulation (e.g., SD patients only) are attributed to the entire cohort.

  • app/tasks/publications_llm_classification/subgroup_extraction.rb: Identifies endpoints from the abstract. May correctly identify TTP but it gets mapped to PFS downstream.
  • app/tasks/publications_llm_classification/task.rb: Extracts endpoint values. The LLM treats TTP as PFS when extracting, or the endpoint mapping normalizes TTP→PFS.
  • Endpoint normalization: If TTP is not in the standard endpoint list, the LLM may substitute the closest recognized endpoint (PFS).

The classify_publications prompt and/or endpoint schema does not distinguish TTP from PFS. When an abstract reports “median TTP = X months”, the LLM maps this to the PFS endpoint because TTP is not available as a separate extraction target. The LLM lacks instruction to leave PFS null when only TTP is reported.

Pub 29737 (IMMU-132 in GI cancers):

  • Abstract: “time to progression (TTP) … median of 4.8+ mo for the SD pts”
  • Extracted: PFS=4.8 months, patient_count=29 (entire CRC cohort)
  • Correct: TTP=4.8+ months, applicable to 14 SD patients only — PFS should be null
  • Two compounding errors: (1) TTP→PFS confusion, (2) SD-subpopulation value → full cohort

Pub 29737 KRAS-mutated subgroup:

  • Abstract: “median TTP = 4.4+ mo” for 7 SD patients
  • Extracted: PFS=4.4 months, patient_count=13 (all KRAS-mutated)
  • Correct: TTP=4.4+ months for 7 SD patients — PFS should be null
Impact:

  • Clinical Evidence report: PFS column shows TTP values, overstating the evidence (PFS is a stronger endpoint than TTP)
  • Cross-study comparisons: TTP values mixed with genuine PFS values make comparisons unreliable
  • Patient counts: When TTP is reported only for responders/SD patients, attributing it to the full cohort inflates the denominator

Scope:

  • 149 publications mention TTP (but not PFS) in their abstract yet have PFS as an extracted endpoint
  • 1,150 publications have TTP correctly extracted as TTP (suggesting the pipeline CAN handle TTP in many cases)
  • The SD-subpopulation misattribution is harder to quantify systematically but likely affects a subset of phase I/II publications reporting outcomes by best response category

Proposed fix:
  1. Extraction prompt fix (forward): Add explicit instruction to classify_publications: “TTP (time to progression) and PFS (progression-free survival) are distinct endpoints. If the abstract reports TTP but not PFS, extract TTP only — do NOT map TTP values to PFS. Leave PFS null when only TTP is reported.”
  2. Subpopulation guard: Add instruction: “When a time-based endpoint (TTP, PFS, DoR) is reported only for a best-response subgroup (e.g., ‘median TTP for SD patients’), do not attribute it to the parent population. Extract it under the response-specific subgroup or leave the parent’s value null.”
  3. Backfill: Re-extract PFS values for the 149 affected publications with updated prompt. Scope: publications where abstract contains TTP/time to progression but NOT PFS/progression-free survival, and a PFS endpoint was extracted.
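The backfill scope in item 3 could be pre-screened with a simple abstract check. This is a sketch: the regexes are illustrative, and the "PFS endpoint was extracted" half of the filter would come from the structured data, not the abstract text.

```ruby
# True when the abstract mentions TTP but never PFS: the population whose
# extracted PFS values are suspect under this issue.
def ttp_only_abstract?(abstract)
  mentions_ttp = abstract.match?(/\bTTP\b|time to progression/i)
  mentions_pfs = abstract.match?(/\bPFS\b|progression[- ]free survival/i)
  mentions_ttp && !mentions_pfs
end
```

Publications passing this check that also have a PFS row in trial_endpoints would form the re-extraction candidate set.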

(empty — pending implementation)


33. Cross-tabulated subgroups not identified in basket trials


When basket trial abstracts report efficacy in a table structured as tumor type × biomarker status (e.g., CRC × HER2 IHC 3+/2+/1+), extract_subgroups identifies the single-dimension subgroups (tumor types and biomarker statuses separately) but not the cross-product subgroups (CRC IHC 3+, CRC IHC 2+, etc.). This means disease-specific biomarker-stratified efficacy data is lost — only the overall tumor-type and overall biomarker-status rows are extracted.

app/tasks/publications_llm_classification/subgroup_extraction.rb: Identifies subgroups and their endpoint associations from the abstract. The LLM prompt identifies subgroups as a flat list, and the hierarchical naming convention (e.g., “Non-breast STs → CRC”) captures one level of nesting but not cross-dimensional nesting.

The subgroup extraction prompt produces subgroups along each dimension independently:

  • By tumor type: BTC, UC, GC/GEJA, CRC
  • By biomarker: HER2 IHC3+, IHC2+, IHC1+

But it does not produce the cross-product: CRC IHC3+, CRC IHC2+, etc. The table data in the abstract contains these values, but the extraction doesn’t recognize the need to create nested subgroups for each cell in a tumor type × biomarker matrix.

Pub 72043 (SHR-A1811 in non-breast solid tumors):

  • Abstract table reports ORR for each tumor type × HER2 IHC status combination
  • Extracted subgroups: CRC (36.4%), IHC3+ (54.1%), IHC2+ (41.7%), IHC1+ (50.0%)
  • Missing: CRC IHC3+ (100%, 3/3), CRC IHC2+ (0%, 0/3), CRC IHC1+ (0%, 0/1), CRC HER2 mut/amp (0%, 0/3)
  • 4 audit issues (8402-8405) all flagging missing cross-tabulated CRC subgroups
  • Clinical Evidence report: Disease-specific biomarker-stratified efficacy data missing — can only show overall CRC ORR, not CRC by HER2 status
  • Granularity loss: The most clinically relevant data in basket trials is often the cross-tabulation (e.g., “does HER2 IHC 3+ predict response in CRC specifically?”)
  • ~366 publications have both disease-type and biomarker-type subgroups with common biomarkers (HER2, EGFR, KRAS, BRAF, PD-L1, MSI, MMR)
  • Not all 366 will have cross-tabulated data in the abstract — many will have separate analyses rather than a matrix table
  • The issue primarily affects basket/platform trials reporting across multiple tumor types with biomarker stratification
  • This is NOT about missing biomarker context on existing subgroups (that’s Issue 19)
  • This is NOT about dropped subgroups at the classify step (Issue 10) — the cross-product subgroups are never identified in the first place
  • Parent-level tumor type and biomarker subgroups ARE correctly extracted
  1. Extraction prompt enhancement: Update extract_subgroups prompt to recognize tabular cross-tabulation patterns: “When the abstract contains a table or matrix reporting efficacy by tumor type × biomarker status, create cross-product subgroups (e.g., ‘CRC → HER2 IHC 3+’) for each cell with reported data, in addition to the single-dimension subgroups.”
  2. Post-extraction cross-product generation: After extracting single-dimension subgroups, detect when a table exists with both dimensions and generate cross-product subgroups programmatically.
  3. Scope: Focus on publications with ≥2 disease subgroups AND ≥1 biomarker subgroup, and re-run extraction with the enhanced prompt.

(empty — pending implementation)


34. “Immature” endpoints extracted as “Not Reached”

When an abstract states that an endpoint (OS, PFS, DoR) is “not yet mature”, “data immature”, or “results are immature”, the LLM extraction maps this to “Not Reached”. These are clinically distinct concepts: “Not Reached” means the Kaplan-Meier survival curve has not dropped below 50% within available follow-up (a real finding indicating the median exceeds current follow-up), while “immature” means there were insufficient events or follow-up to perform the analysis (no median can be estimated — the value should be null).

app/tasks/publications_llm_classification/task.rb: The classify_publications prompt doesn’t distinguish between “Not Reached” and “immature/not yet mature”. The LLM treats both as equivalent and extracts “Not Reached” for either.

The extraction prompt has no instruction to differentiate “Not Reached” (endpoint was analyzed, median exceeds follow-up) from “immature” (endpoint was NOT formally analyzed, insufficient data). Both get mapped to the string “Not Reached”.

Pub 114571 (JSKN003 in HER2+ mCRC):

  • Abstract: “The median overall survival (OS) was not yet mature”
  • Extracted: OS = “Not Reached”
  • Correct: OS should be null — data immature, no median estimated

Pub 115389 (from job 1594):

  • Abstract: PFS described as “immature”
  • Extracted: PFS = “Not Reached”
  • Correct: PFS should be null
  • Clinical Evidence report: “Not Reached” implies a favorable outcome (median exceeds follow-up), while “immature” is neutral (no data yet). Reporting “Not Reached” when the data is simply immature overstates the evidence.
  • Cross-study comparisons: “Not Reached” OS is treated as a positive signal, biasing comparisons against studies that honestly report immature data.
  • ~71 publications have “immature”/“not yet mature” language in the abstract (without “not reached”) but have “Not Reached” extracted for OS, PFS, or DoR
  • Breakdown: OS (~214 total “Not Reached” pubs with immature language, ~71 without “not reached” in abstract), PFS (~107), DoR (~68)
  • Many abstracts legitimately say BOTH “immature” and “not reached” — these are correct and not affected
  • Abstracts that say “median OS was not reached” — these ARE correct as “Not Reached”
  • Abstracts that say “OS data are immature; median was not reached” — also correct (both terms used)
  • Only affects abstracts where “immature” is used WITHOUT “not reached” for the same endpoint
  1. Extraction prompt fix (forward): Add instruction to classify_publications: “Distinguish between ‘Not Reached’ (endpoint was analyzed but median exceeds follow-up — extract as ‘Not Reached’) and ‘immature/not yet mature’ (insufficient data to analyze the endpoint — extract as null/omit). Only use ‘Not Reached’ when the abstract explicitly states the median was not reached.”
  2. Backfill: Re-extract OS/PFS/DoR for the ~71 affected publications. Scope query:
    SELECT DISTINCT v.publication_id
    FROM vw_publication_efficacy_data v
    JOIN publications p ON p.id = v.publication_id
    WHERE v.measure_value = 'Not Reached'
    AND v.endpoint_abbreviation IN ('OS', 'PFS', 'DOR')
    AND (p.abstract ILIKE '%not yet mature%' OR p.abstract ILIKE '%data immature%'
    OR p.abstract ILIKE '%data are immature%' OR p.abstract ILIKE '%results are immature%')
    AND p.abstract NOT ILIKE '%not reached%'
    AND p.abstract NOT ILIKE '%not been reached%'
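The same triage rule as the SQL above can also be expressed as a post-extraction guard, should the fix be applied in-pipeline rather than (or in addition to) via prompt. This is a sketch; the method name and calling convention are assumptions, not the actual task API:

```ruby
# Hypothetical guard mirroring the SQL scope query: null out "Not Reached"
# when only immaturity language appears in the abstract.
IMMATURE_PATTERN    = /not yet mature|data (?:are )?immature|results are immature/i
NOT_REACHED_PATTERN = /not (?:been )?reached/i

# Keep "Not Reached" only when the abstract explicitly states the median
# was not reached; return nil when only immaturity language is present.
def corrected_not_reached(measure_value, abstract)
  return measure_value unless measure_value == "Not Reached"
  return measure_value if abstract.match?(NOT_REACHED_PATTERN)
  abstract.match?(IMMATURE_PATTERN) ? nil : measure_value
end
```

Note that abstracts using both terms (“OS data are immature; median was not reached”) keep “Not Reached”, matching the carve-outs listed above.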

(empty — pending implementation)