Publication Issues Tracker Archive
Publication Issues Tracker
Temporary working document for tracking publication-processing issues identified during investigation.
The main motivation for this doc is the sheet: 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8, where the client has collected clinical data for different disease areas and drugs. The purpose of this document is to identify gaps in the publications database that are preventing us from being able to correctly reconstruct this sheet in the future using structured data only (from the bioloupe data lake database).
Last updated: 2026-03-28 (Issues 31-34 added from job 1635 CRC+ADC audit triage — dose cross-contamination to control arms, TTP→PFS misclassification, cross-tabulated subgroups, immature→Not Reached confusion)
Issue index
| # | Title | Short description | Status |
|---|---|---|---|
| 1 | Trial subgroup disease propagation gap | Non-disease subgroups with disease-like labels (e.g. MSS-CRC, NSCLC) never get disease_id populated because propagation is gated on subgroup_type = 'disease' | Complete — 1,924 subgroups pending term match resolution |
| 2 | ASCO API content type blind spot | PresentationContentItem publications silently dropped — search filter, detail query, and NCT ID search all restricted to AbstractContentItem only | Complete |
| 3 | Publication dose context gap | Linked publications use trial-derived dose; publication-specific dose extraction only runs for unlinked publications; no structured dose fields (min/max/RP2D/units/frequency) | Complete — extraction + view join fixed by Issue 20 (v16 view) |
| 4 | AE grade classification gap | Individual named AE rows lack grade category (all_grade vs grade_gte3), preventing ranked “Most Frequent AE” export columns | Complete — superseded by Issue 7 full re-run |
| 5 | Prior therapy context not extracted | Min/max/median prior lines and prior therapy exposure (e.g. prior taxane, prior IO) not captured from publication abstracts despite being available in text | Complete — max_prior_lines data quality cleanup needed |
| 6 | Data cutoff date not extracted | Publication data cutoff date is stated in ~6K abstracts but not persisted as a structured field — needed for worksheet Data Cut column | Implementation complete — backfill complete |
| 7 | AE grade category too coarse | Binary all_grade/grade_gte3 enum forces grade 1-2 rows into all_grade, producing inverted all-grade/grade≥3 value pairs | Complete — enum expanded, backfill run, inverted pairs reduced from ~50 to 33 |
| 8 | max_prior_lines zero-sentinel contamination | LLM outputs 0 instead of null for unstated max prior lines, producing 124K unusable values including 12.9K logically impossible min > max rows | Complete — cleanup applied, 0 contradictions remain, residual zeros in 1L/Adj/Neo populations only |
| 9 | All-grade AE extraction gap | Originally ~13K pubs suspected; after investigation only ~14 have genuine misclassification (any-grade values labeled as grade≥3). Already fixed by Issue 7 enum expansion — re-extraction produces correct results | Complete — fixed by Issue 7 prompt, no additional changes needed |
| 10 | classify_publications drops identified subgroups | LLM drops ~15% of subgroups identified by extract_subgroups — ~9,700 publications affected across all sources. Prompt + schema + validation fix implemented and full pipeline re-extraction completed in prod | Complete |
| 11 | Empty outcome_measures verified correct | All 102 publications with empty outcome_measures are correctly empty: trial designs, safety-only, biomarker studies, or truncated abstracts. Original worksheet gaps explained by Issue 10 + data availability | Closed — not an issue |
| 12 | Legacy Emerging Clinical Data query collapses subgroup-level results | Legacy EmergingClinicalDataQuery groups by [pub_id, disease_id, line, arm] and prefers “Overall” subgroup, hiding dose-level and biomarker-stratified data; the current ClinicalEvidenceQuery already preserves subgroup rows | Stale — superseded by ClinicalEvidenceQuery; legacy EmergingClinicalDataQuery still collapses subgroups |
| 13 | Technology filter excludes combination partner drugs | Query filters view rows by technology_id, removing combo partner drugs with different technologies — e.g. paclitaxel (chemo) filtered out when querying for BsAb, so Amivantamab+paclitaxel shows no combo partner | Complete — switched to fetch_combination_partners |
| 14 | Basket trial disease subgroups not extracted for minority cohorts | BNT324/DB-1311 abstract mentions SCLC, CRPC, NSCLC by name but not HNSCC (only “1 pt with BTC” style mentions) — HNSCC N=3 data was in poster/presentation only, not abstract text | Investigation complete — data availability limit |
| 15 | Disease extraction drops parent disease when subtype matches exist | build_match_set early-returns when subtype TermMatches succeed, skipping parent disease-name match — e.g. HNSCC (6200) dropped when H&N sub-sites match, making 1,856 pubs invisible under umbrella diseases | Complete — backfill ran 2026-03-18 |
| 16 | Confirmed ORR is not exported by EmergingClinicalDataQuery | Query/report endpoint whitelist omits cORR, so worksheet rows with Confirmed ORR (cORR) cannot be reconstructed even when ORR is present — folded into Issue 12 | Complete — confirmed boolean added, backfilled 3,061 rows |
| 17 | ASCO abstract + presentation copies create duplicate publication rows | ASCO ingestion saves AbstractContentItem and PresentationContentItem separately by source_id, so the same DOI can appear twice in the report | Investigation complete |
| 18 | PubMed-indexed journal article missing from publication corpus | The sqNSCLC worksheet row for Cofetuzumab now points to 10.1016/j.lungcan.2025.108492, but that article is absent from publications, so the row is still missing despite a valid journal source | Implementation complete — 2025 PubMed backfill pending |
| 19 | Biomarker context missing at subgroup level | Biomarkers are extracted at trial_disease_details level (disease scope), not per subgroup — ~13K biomarker-type subgroups like “EGFR-mutant” and “PD-L1 TPS≥1%” have no structured biomarker link, preventing biomarker-stratified export | Complete — extraction backfilled (52K records, 99%), matching pipeline run (67.3% matched), query layer aggregates multi-biomarker subgroups |
| 20 | study_plan_arm link is fragile and causes dose/drug/arm issues | vw_publication_efficacy_data joins through study_plan_arms for arm roles AND drug resolution — causing arm role failures (62% of rows), dose evidence drop (76% lost via drug_id mismatch), and row triplication. Merges Issue 3 dose gap. | Complete — v16 view deployed + arm_type backfill run in prod (2026-03-24) |
| 21 | Phase 1 basket trials report response counts, not ORR percentages | LLM extracts PR (count) faithfully from phase 1 abstracts reporting “1 PR in 9 HNSCC patients”, but the query only recognizes ORR (percentage). No ORR is derived, and fallback patient count inflates to the cross-tumor total | Complete — derived ORR in post_process + backfill |
| 22 | extract_subgroups doesn’t identify response counts as endpoints | When abstracts report best response narratively (“1 PR and 14 SD out of 29 patients”) without a formal ORR, extract_subgroups only identifies DCR and TTP as endpoints — individual response counts (PR, CR) are missed, so classify_publications can’t extract them | Complete — forward fix v2 + backfill v1+v2 run; 759→498 DCR-only pubs (remaining 498 verified clean) |
| 23 | Dose extraction misses implicit RP2D in phase I/II trials | When a phase I/II abstract says “dose levels of X and Y were chosen for phase II”, the dose extractor classifies this as a range (dose_min/dose_max) rather than RP2D — but in phase I/II trials, doses chosen for phase II ARE the RP2D by definition | Complete — backfill ran 2026-03-23 |
| 24 | Subgroup participant count wrong for biomarker sub-cohorts | KRAS-mutated CRC subgroup (pub 29737) reports n=7 but abstract states 13 KRAS-mutated patients with 7 having SD — LLM confused the SD count with the total KRAS cohort size | Complete — backfill ran 2026-03-23 |
| 25 | Confirmed vs unconfirmed ORR confusion in classify_publications | When abstracts report both confirmed and unconfirmed ORR (common in ADC trials), the LLM extracts the unconfirmed value but marks confirmed: true, or omits the confirmed ORR entirely — producing wrong cORR values and missing cORR endpoints | Incomplete — extraction residual post-fix, see 2026-03-26 audit findings |
| 26 | Parent population N propagated to child subgroups | classify_publications copies the parent subgroup’s number_of_participants to child subgroups instead of extracting the subset-specific N — ~5,058 child subgroups across 1,174 publications affected | Complete |
| 27 | extract_efficacy_metrics picks confirmed ORR as plain ORR | When both confirmed and unconfirmed ORR rows exist with the same N, max_by(number_of_participants) picks the confirmed row for the plain ORR metric — making ORR and cORR identical and the ORR value wrong | Investigation complete |
| 28 | build_result_rows collapses dose-level arms when study_plan_arm_id is null | Grouping key uses study_plan_arm_id which is null for publication-extracted arms — distinct dose cohorts (e.g. “8.0 mg/kg” vs “10.0 mg/kg”) sharing the same subgroup collapse into one row, silently dropping the lower-N arm | Investigation complete |
| 29 | Dose extraction captures study-level range, not efficacy population range | In dose-escalation studies, LLM extracts the full dose range (e.g. 1.0–8.3 mg/kg) even when efficacy is reported only for a subset (e.g. ≥4.0 mg/kg) — dose_min on the efficacy row is too low | Investigation complete |
| 30 | Cross-study data contamination from abstract background sections | LLM extracts efficacy values from a referenced prior study cited in the abstract’s background, attributing them to the current publication which has no efficacy data yet | Investigation complete |
| 31 | Investigational drug dose data bleeds onto control/comparator arms | pub_dose_lookup COALESCE fallback propagates investigational drug dose fields to control arms when publication_interventions.study_plan_arm_id is NULL — 2,890 rows across 566 publications | Investigation complete |
| 32 | TTP (time to progression) misclassified as PFS | LLM extraction maps TTP values to PFS endpoint — 149 publications mention TTP (not PFS) in abstract but have PFS extracted; additionally SD-subpopulation TTP values get attributed to full cohort | Investigation complete |
| 33 | Cross-tabulated subgroups not identified in basket trials | extract_subgroups identifies single-dimension subgroups (tumor type OR biomarker) but not the cross-product (tumor type × biomarker) when tabular data is present — ~366 pubs have both disease + biomarker subgroups that could have cross-tabulated data | Investigation complete |
| 34 | “Immature” endpoints extracted as “Not Reached” | LLM maps “not yet mature” / “data immature” to “Not Reached” — but immature means no median can be estimated (should be null), while “Not Reached” means median exceeds follow-up. ~71 pubs have immature language without “not reached” but have “Not Reached” extracted | Investigation complete |
Each issue entry should keep analysis and remediation separate.
Recommended issue structure:
- Short summary
- Where this sits in the current pipeline
- Exact restriction causing the drop
- Concrete examples
- Downstream impact
- What the issue is not
- Scale
- Spot checks
- Open characterization questions
- Explored solution direction
- Solution applied
Solution applied should remain empty until an actual fix is agreed and implemented.
Backfill pattern: When an issue requires backfilling historical data, see the “One-Off Backfill Tasks” section in `.claude/skills/backend-expert/SKILL.md`.
1. Trial subgroup disease propagation gap
Short summary
Publication subgroup rows can contain disease-like cohort labels in `trial_subgroups.subgroup_value`, but the current disease propagation path only assigns `trial_subgroups.disease_id` for subgroups whose `subgroup_type` is exactly `disease`.
If a subgroup is classified as analysis population or another non-disease type, its disease_id remains NULL even when:
- the subgroup label is clearly disease-like, and
- a high-confidence `TermMatch` already exists for that label.
This means disease-specific publication rows can fail to surface in reporting even though the publication contains a disease cohort tied to outcomes.
Where this sits in the current pipeline
Current publication flow:
- `extract_subgroups` identifies subgroup labels and endpoint associations.
- `classify_publications` emits `subgroup_outcome_measures`, including `type`, `value`, and linked outcome measures.
- `post_process_publications` destroys and recreates `publication.trial_subgroups` from the LLM output.
- The separate publication disease workflow creates `TermMatch` records for subgroup disease strings and later post-processes them back into `trial_subgroups.disease_id`.
Relevant code paths:
- `/Users/tomor/Sites/bioloupe-data-gov/app/workflows/publications_workflow.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/subgroup_extraction.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/task.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/post_process.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/app/models/trial_subgroup.rb`
- `/Users/tomor/Sites/bioloupe-data-gov/lib/tasks/clinical_trials/trial_subgroups.thor`
- `/Users/tomor/Sites/bioloupe-data-gov/app/workflows/publication_disease_workflow.rb`
Exact restriction causing the drop
The subgroup disease term population and subgroup disease post-processing are both restricted to `subgroup_type = 'disease'`.
In the model:
- `TrialSubgroup.disease_type` is defined as `where(subgroup_type: 'disease')`
- `TrialSubgroup.populate_term_matches` only iterates `disease_type.with_subgroup_value`
In the Thor task:
- `post_process_disease_matches` builds the scope as `TrialSubgroup.disease_type.with_subgroup_value.without_disease_id`
So any subgroup classified as:
- `analysis population`
- `clinical feature`
- `mutation`
- `patient characteristic`
- or any other non-`disease` type
is excluded from disease propagation, even if its subgroup_value is disease-like.
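The effect of this gate can be sketched in plain Ruby. This is a minimal illustration with hash records and an illustrative method name, not the actual ActiveRecord scopes on `TrialSubgroup`:

```ruby
# Mirrors TrialSubgroup.disease_type.with_subgroup_value.without_disease_id:
# only rows typed exactly 'disease' are ever considered for propagation.
def eligible_for_disease_propagation(subgroups)
  subgroups.select do |sg|
    sg[:subgroup_type] == 'disease' &&
      !sg[:subgroup_value].to_s.strip.empty? &&
      sg[:disease_id].nil?
  end
end

subgroups = [
  # Disease-like label under a non-disease type: never propagated (this issue).
  { id: 210858, subgroup_type: 'analysis population', subgroup_value: 'MSS-CRC', disease_id: nil },
  { id: 1, subgroup_type: 'disease', subgroup_value: 'NSCLC', disease_id: nil },
]

eligible_for_disease_propagation(subgroups).map { |sg| sg[:id] }
# => [1] — the MSS-CRC row is skipped despite its high-confidence TermMatch
```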
Example: publication 114077
Publication:
- `publications.id = 114077`
- title: “A phase I study of INCA33890, a PD-1/TGFβR2 bispecific antibody, for advanced solid tumours”
- linked trial: `NCT05836324`
Publication-level disease rows:
- `trial_disease_details` contains only `4116 = Solid Tumors`
Subgroup row:
- `trial_subgroups.id = 210858`
- `source_type = 'Publication'`
- `source_id = 114077`
- `subgroup_type = 'analysis population'`
- `subgroup_value = 'MSS-CRC'`
- `disease_id = NULL`
But the disease matcher already knows what this means:
- `term_matches.id = 100095`
- `subject_type = 'TrialSubgroup'`
- `field = 'disease_name'`
- `strategy = 'DiseaseMatching'`
- `term = 'mss-crc'`
- `final_result.id = 4345`
- `final_result.score = 0.95`
- disease `4345 = Colorectal Cancer`
So the system has a validated disease match for the normalized term, but it is never propagated to trial_subgroups.disease_id because the subgroup is analysis population, not disease.
Why this matters downstream
The publication efficacy view uses subgroup disease from `trial_subgroups`, not from the linked clinical trial and not from `trial_disease_details`.
In /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql:
- `treatment_line_mapping` reads `trial_subgroups` where `source_type = 'Publication'`
- `subgroup_disease_id` is set directly from `trial_subgroups.disease_id`
The view does not join:
- `clinical_trial_end_diseases`
- `trial_disease_details`
for subgroup disease attribution.
So if a publication subgroup is disease-like but trial_subgroups.disease_id stays null, the view row does not carry that disease.
Later, in /Users/tomor/Sites/bioloupe-data-gov/app/queries/tpp/emerging_clinical_data_query.rb, filtering works like this:
- prefer `v.subgroup_disease_id`
- if that is null, fall back to `trial_disease_details`
For publication 114077, that fallback disease is only Solid Tumors, so the publication does not surface as CRC.
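That preference order can be sketched as a small function (illustrative names, not the query's actual code):

```ruby
# Prefer the subgroup-level disease; otherwise fall back to the
# publication-level diseases from trial_disease_details.
def effective_disease_ids(subgroup_disease_id, trial_disease_detail_ids)
  subgroup_disease_id ? [subgroup_disease_id] : trial_disease_detail_ids
end

# Publication 114077: subgroup disease was never propagated, so the row can
# only surface under publication-level Solid Tumors (4116), not CRC (4345).
effective_disease_ids(nil, [4116])   # => [4116]
effective_disease_ids(4345, [4116])  # => [4345]
```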
What the issue is not
This is not primarily a missing `TermMatch` problem.
For the MSS-CRC example, the TermMatch already exists and is high-confidence. The failure is in propagation from the normalized term match back onto the subgroup record.
This is also not a clinical-trial disease issue. In this path, the effective disease used by the publication efficacy view comes from publication subgroup records, not from clinical_trials.
Current semantic mismatch
The system currently behaves as if:
- `trial_subgroups.disease_id` means: “this subgroup is explicitly a disease subgroup”
But many real publication subgroup labels behave more like:
- disease cohort embedded inside another subgroup class
- disease-shaped analysis population
- disease-plus-qualifier cohort
Examples:
- `MSS-CRC`
- `Overall → RCC`
- `Relapsed/Refractory AML`
- `BCG-refractory NMIBC`
- `Stage I NSCLC`
These can carry real disease meaning even when the LLM classified the subgroup as analysis population or another non-disease type.
Scale of the issue in publication-sourced subgroup rows
For `trial_subgroups.source_type = 'Publication'` with `disease_id IS NULL`:
- total null-disease subgroup rows: 140,057
- distinct null-disease subgroup strings: 92,854
For subgroup_type = 'analysis population':
- rows with non-empty `subgroup_value` and null `disease_id`: 88,623
- distinct subgroup strings: 51,343
For non-disease subgroup rows overall:
- rows with non-empty `subgroup_value` and null `disease_id`: 134,211
- distinct subgroup strings: 88,382
Among publication analysis population rows specifically:
- 1,720 rows already have an existing exact normalized high-confidence `DiseaseMatching` result available by term
Among all publication non-disease subgroup rows:
- 2,422 rows already have an existing exact normalized high-confidence `DiseaseMatching` result available by term
This shows two things at once:
- there is recoverable disease signal being left unused
- most non-`disease` subgroup rows are not pre-validated disease matches
Why broadening this blindly is risky
Many `analysis population` values are obviously not disease cohorts:
- `Overall`
- `Responders`
- `Evaluable patients`
- `Monotherapy`
- `Cohort 1`
- `Placebo`
- `Dose escalation`
- `Healthy Volunteers`
- `First-line`
- `Japanese patients`
So “map all non-disease subgroup types through disease matching” would push large volumes of junk terms into a disease-normalization process that was not designed for them.
Spot checks showing recoverable signal
These publication subgroup values look meaningfully disease-like and appear useful for disease attribution:
- `MSS-CRC` -> Colorectal Cancer
- `Overall → RCC` -> Renal Cell Carcinoma (RCC)
- `Overall → GIST` -> Gastrointestinal Stromal Tumor (GIST)
- `Relapsed/Refractory AML` -> Acute Myeloid Leukemia (AML)
- `BCG-refractory NMIBC` -> Non-Muscle Invasive Bladder Cancer
- `Head and Neck Squamous Cell Carcinoma` -> Head and Neck Squamous Cell Carcinoma (HNSCC)
- `NSCLC` -> Non-Small Cell Lung Cancer (NSCLC)
- `Colorectal cancer` -> Colorectal Cancer
These are the kinds of subgroups that currently fail to contribute disease-specific reachability if their subgroup_type is not disease.
Spot checks showing noise or semantic drift
These examples show why broad disease assignment on subgroup labels can produce incorrect or misleading disease attribution:
- `Previously untreated mPDAC` -> matched to Multiple Myeloma at score 0.75 (abbreviation collision)
- `Relapsed/refractory cHL` -> matched to Chronic Leukemia at score 0.825 (clearly wrong)
- `Overall → Carcinoma In Situ` -> matched to Breast Ductal Carcinoma In Situ at score 0.85 (wrong in a bladder-cancer context)
- `Bone metastases` -> matched to Bone Metastasis (may be useful as a retrieval concept but not necessarily the publication’s disease cohort)
These are not hypothetical edge cases. They already exist in the term-matching results.
Reporting impact
Because `subgroup_disease_id` from publication subgroups is preferred when present, this issue affects:
- disease-specific publication discovery
- disease-specific efficacy row inclusion
- downstream CSV/report completeness for basket and umbrella studies
- publications whose abstract reports disease cohorts under non-`disease` subgroup types
The observed failure mode is:
- publication contains a disease cohort in subgroup results
- subgroup gets created with a non-`disease` type
- subgroup disease propagation never runs
- `vw_publication_efficacy_data` row has `subgroup_disease_id = NULL`
- reporting falls back to publication-level disease or misses the disease entirely
Core problem statement
The system currently treats subgroup disease attribution as a type-gated post-processing step:
- only `subgroup_type = 'disease'` is eligible
But in actual publication abstracts, disease-bearing cohort labels are often emitted under other subgroup types, especially analysis population.
As a result, the pipeline loses disease information that is already present in subgroup text and, in some cases, already normalized in term_matches.
Open characterization questions
These are not proposed fixes. They are the unresolved aspects of the issue:
- Is `trial_subgroups.disease_id` intended to mean “authoritative disease cohort” or “retrieval-relevant disease tag”?
- Should disease-bearing `analysis population` subgroups be treated differently from clearly non-disease `analysis population` values like `Responders` or `Cohort 1`?
Working assumptions from discussion
- Metastatic-site labels such as `Bone metastases` may be valid for publication reachability if the ontology already contains the corresponding disease concept.
- When subgroup disease is null, fallback to `trial_disease_details` should be interpreted as publication-level disease rather than subgroup-level disease.
Explored solution direction
The explored direction is not to map every subgroup directly into `trial_subgroups.disease_id`.
That would continue to create incorrect disease assignments, just with a different error pattern:
- fewer abbreviation-only failures
- more context-overreach failures
The better conceptual shape is:
```
extract_subgroups
  ↓
classify_publications
  ↓
subgroup disease adjudication (LLM, contextual)
  ↓
post_process / disease matching
```

The key idea is to separate two questions that are currently blurred together:
- Is this subgroup actually disease-like?
- If yes, which disease concept should it map to?
The explored adjudication step would analyze subgroup rows in publication context and emit something like:
- semantic class: `disease_cohort`, `disease_related_context`, or `not_disease`
- normalized disease phrase, if applicable
- evidence quote/span
- confidence
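An illustrative shape for one adjudication output (field names follow the list above; the exact schema and the evidence text are assumptions):

```ruby
# Hypothetical adjudication payload for the MSS-CRC example; illustrative only.
adjudication = {
  semantic_class: 'disease_cohort', # or 'disease_related_context' / 'not_disease'
  normalized_disease_phrase: 'Colorectal Cancer',
  evidence_quote: 'patients with MSS-CRC ...', # span copied from the abstract
  confidence: 0.95,
}

adjudication[:semantic_class] # => "disease_cohort"
```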
Behavioral intent of those outputs:
`disease_cohort`
- subgroup is a real disease-bearing cohort
- eligible to write into authoritative `trial_subgroups.disease_id`

`disease_related_context`
- subgroup contains disease signal that may help publication reachability or filtering
- should not automatically overwrite authoritative subgroup disease semantics
- may belong in a separate retrieval/tag field rather than `trial_subgroups.disease_id`

`not_disease`
- subgroup remains unmapped for disease attribution
This distinction matters because the current system uses trial_subgroups.disease_id as an authoritative signal in reporting, not just as a search helper.
So if all subgroup strings are pushed directly into the existing disease_id field, the reports inherit those assignments as if they were true disease cohorts.
That is acceptable for:
- `MSS-CRC`
- `Overall → RCC`
- `Relapsed/Refractory AML`
- `BCG-refractory NMIBC`
But not acceptable for:
- `Responders`
- `Cohort 1`
- `Placebo`
- `Evaluable patients`
- ambiguous or mis-normalized strings like `Relapsed/refractory cHL`
- context-sensitive strings like `Carcinoma In Situ`
Placement options explored:
- In the main publication workflow:
  - after `classify_publications`, before `post_process_publications`
  - this would affect subgroup creation semantics earlier
- In the publication disease branch:
  - near `/Users/tomor/Sites/bioloupe-data-gov/app/workflows/publication_disease_workflow.rb`
  - this keeps subgroup extraction/classification separate from disease enrichment
Current preferred exploration direction:
- yes, a new LLM subgroup adjudication step makes sense
- no, it should not directly map all subgroups into the existing authoritative `disease_id` field
- `analysis population` is the best first expansion target
- the main gain comes from separating “is this disease-like?” from “which disease is it?”, using publication context rather than term-only matching
Solution applied
Implemented contextual LLM subgroup disease adjudication for all non-disease publication subgroups (~132K rows, ~89K distinct values).
Scope: All publication-sourced subgroups where subgroup_type != 'disease', including analysis population (89K rows), clinical feature (25K), mutation (10K), patient characteristic (2.4K), and smaller types. Spot checks confirmed disease-bearing labels appear across all these types (e.g. Metastatic Urothelial Carcinoma → PD-L1- under clinical feature, Relapsed/refractory multiple myeloma → del17p under mutation).
Estimated cost: ~$50 with gpt-5-mini for full backfill.
New code:
- `app/tasks/subgroup_disease_adjudication/task.rb` — LLM adjudication task that classifies publication subgroup labels as `disease_cohort`, `disease_related_context`, or `not_disease`, with a normalized disease phrase, evidence span, and confidence score.
- `app/tasks/subgroup_disease_adjudication/response.rb` — JSON schema for the adjudication response (StoreModel + `DataTasks::JsonSchema`).
Modified code:
- `app/models/trial_subgroup.rb` — Added `adjudicated_disease_cohort` scope. Updated `populate_term_matches` to also generate TermMatch entries for adjudicated `disease_cohort` subgroups using the LLM-provided `normalized_disease_phrase`.
- `lib/tasks/clinical_trials/trial_subgroups.thor` — Added `adjudicate_subgroup_diseases` Thor task for CLI access. Updated `post_process_disease_matches` to process both explicit disease-type subgroups and adjudicated disease cohort subgroups.
- `app/workflows/publication_disease_workflow.rb` — Added `adjudicate_subgroup_diseases` step before `populate_disease_terms_for_trial_subgroups` in the workflow graph.
How it works:
- Adjudication runs on all publication-sourced non-`disease` subgroups and persists the result on `trial_subgroups.llm_data['subgroup_disease_adjudication']`.
- Only `semantic_class = 'disease_cohort'` subgroups enter the DiseaseMatching term population and post-processing paths.
- `disease_related_context` and `not_disease` subgroups remain excluded from `trial_subgroups.disease_id`.
- No changes to `vw_publication_efficacy_data` or `Tpp::EmergingClinicalDataQuery` — they consume the newly populated `subgroup_disease_id` automatically.
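The gating above reduces to a single predicate, sketched here (the real gating lives in the scopes and Thor task named earlier):

```ruby
# Only disease_cohort adjudications may feed the authoritative disease_id path;
# the other two classes are kept out by design. Sketch, not production code.
def enters_disease_matching?(semantic_class)
  semantic_class == 'disease_cohort'
end

enters_disease_matching?('disease_cohort')          # => true
enters_disease_matching?('disease_related_context') # => false
enters_disease_matching?('not_disease')             # => false
```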
Initial spot check (15 random subgroups): All classifications correct. Disease cohorts (AML, mCRPC, CML, melanoma, solid tumors) correctly identified. Metastatic sites, biomarkers, treatment arms, dose levels, and healthy controls correctly excluded.
Pending: Manual verification on a curated sample before broad backfill.
Validation (2026-03-13)
Coverage: 134,061 / 134,211 non-disease subgroups adjudicated (99.9%). 34,652 classified as disease_cohort, of which 32,811 (94.7%) received disease_id.
Tracker example verified: Pub 114077, MSS-CRC subgroup (id 210858) correctly resolved: disease_id = 4345 (Colorectal Cancer), flows through vw_publication_efficacy_data.
Remaining gap — 1,841 disease_cohort subgroups without disease_id:
The populate_term_matches step has already run after adjudication. TermMatch rows exist for these terms — the gap is in the DiseaseMatching resolution pipeline itself, which is expected behavior in most cases.
The unresolved terms fall into categories that are inherent to the disease ontology design:
1. Broad disease concepts not in the simplified tree (e.g. “lymphoma”, “mesothelioma”).
   - Disease 4668 = “Lymphoma” exists in `diseases` but has `simplified = false` — intentionally excluded from the matchable disease set.
   - The DiseaseMatching pipeline correctly found only subtypes (Follicular, Hodgkin, etc.) as candidates, rejected them as too narrow, and returned `null`.
   - Verified against abstracts: these publications genuinely reference “lymphoma” without specifying a subtype (e.g. pub 90447: “relapsed/refractory lymphomas”; pub 119434: “newly diagnosed lymphoma”). The LLM adjudication correctly normalized to “Lymphoma” because the abstracts don’t provide enough context to be more specific.
   - Same pattern for “mucosal melanoma” and “mesothelioma” — the broad concept isn’t in the simplified tree, and the abstracts don’t specify further.
2. Non-oncology diseases correctly absent from the ontology.
   - “Polycystic Ovary Syndrome” (41 subgroups), “Uterine leiomyoma” (10), “Sepsis” (9): not in our hemonc-focused disease ontology. These subgroups were correctly adjudicated as `disease_cohort` by the LLM (they are disease cohorts), but the diseases themselves are out of scope.
3. Too-generic terms below the matching threshold.
   - “Cancer” (21 subgroups): score 0.35, too broad. “Advanced cancer” (19): score 0.75, at threshold. “Pediatric cancer” (13): score 0.7, below threshold.
4. Finalization pipeline edge cases.
   - “Muscle-invasive urothelial carcinoma” (20 subgroups): Round 1 and Round 2 both agreed on disease 4424 (Muscle Invasive Bladder Cancer), judgment accepted with 0.9 confidence, but the majority-vote finalization step still produced `null`. This may warrant investigation as a potential finalization bug.
   - “Gastric and gastroesophageal junction adenocarcinoma” (12): compound disease phrase where the matcher couldn’t resolve to a single disease.
Assessment: The 1,841 gap is largely expected — broad/generic/out-of-scope terms that the disease tree intentionally doesn’t cover. The only potentially actionable subset is the ~32 subgroups affected by the finalization edge case (pattern 4), which may be a bug in the majority-vote logic.
2. ASCO API content type blind spot drops PresentationContentItem publications
Short summary
The ASCO GraphQL API classifies conference content into multiple `__typename` variants: `AbstractContentItem`, `PresentationContentItem`, `PosterContentItem`, `VideosSlidesContentItem`, `JournalContentItem`, and `SessionContentItem`. Our ingestion pipeline only handles `AbstractContentItem` — in both the search filter and the detail query. Publications typed as `PresentationContentItem` (and potentially `PosterContentItem`) are silently dropped.
Where this sits in the current pipeline
ASCO ingestion flow in `app/services/publications/asco_api_service.rb`:
- `fetch_abstract_hits` sends a GraphQL `Search` query with `filters: { contentTypes: ['Abstract'] }`.
- `fetch_full_abstract_details` sends `getContentByUID` with a single inline fragment: `... on AbstractContentItem { uid title body doi ... }`.
- `save_publication` receives the detail result and persists it.
Triggered from lib/tasks/clinical_trials/publications.thor via:
bundle exec thor clinical_trials:publications:import_from asco [options]Exact restrictions causing the drop
Three failure points, any one of which is sufficient to lose a publication:
1. Search filter excludes non-Abstract content types
filters_hash = { contentTypes: ['Abstract'] }
For wildcard searches (userInput: '*'), the ASCO API strictly filters by contentTypes. A PresentationContentItem is not returned when contentTypes: ['Abstract'] is used with a wildcard query.
Verified via API:
- userInput: '*', contentTypes: ['Abstract'], years: [2025] → returns only hex UIDs (AbstractContentItem)
- userInput: '*', contentTypes: ['Presentation'], years: [2025] → returns only PRESENTATION* UIDs
2. NCT ID text search returns zero hits for PresentationContentItem records
The ASCO search API does not index the clinicalTrialRegistryNumber field for search. Searching userInput: 'NCT05701709' returns zero hits regardless of contentTypes filter, even though the record has clinicalTrialRegistryNumber: 'NCT05701709' in its data.
Verified:
- search(userInput: "NCT05701709", filters: {}) → 0 hits
- search(userInput: "NCT05701709", filters: {contentTypes: ["Abstract"]}) → 0 hits
- search(userInput: "SHR A2102", filters: {contentTypes: ["Abstract"]}) → finds PRESENTATION245980
This means the disease-specific ingestion path (which searches by NCT ID) can never discover this publication.
3. Detail query GraphQL fragment only matches AbstractContentItem
... on AbstractContentItem { uid title body doi clinicalTrialRegistryNumber ... }
When getContentByUID returns a PresentationContentItem, the fragment does not match. The result is {}. save_publication then sees a blank title and silently skips the record:

if publication_data[:title].blank?
  increment_stat(:skipped)
  Rails.logger.warn("ASCO Abstract #{abstract_data['uid']} has no title")
  return :skipped
end

Concrete example
Publication: DOI 10.1200/JCO.2025.43.16_suppl.107
- Title: “Phase 1 trial of SHR-A2102, a nectin-4-directed antibody drug conjugate (ADC), in advanced solid tumors.”
- ASCO UID: PRESENTATION245980
- __typename: PresentationContentItem
- clinicalTrialRegistryNumber: NCT05701709
- Drug: SHR-A2102 (drug_id 13643, known in our system)
- Trial: NCT05701709 (clinical_trial_id 51789, linked to “Solid Tumors” disease)
- ESMO version of same study: publication_id 65886, successfully ingested and linked to trial
API verification:
# Search finds nothing by NCT ID
search(userInput: "NCT05701709") → 0 hits

# Search finds it by drug name
search(userInput: "SHR A2102") → PRESENTATION245980 (score 19.66)

# Detail with AbstractContentItem fragment → empty
getContentByUID("PRESENTATION245980") with ... on AbstractContentItem → result: {}

# Detail with PresentationContentItem fragment → full data
getContentByUID("PRESENTATION245980") with ... on PresentationContentItem → title, body, doi, NCT ID, authors ✓

Downstream impact
- Missing ASCO publications for trials where the abstract is classified as Presentation
- This particularly affects oral presentations and plenary sessions (low abstract numbers like 107), which are often the highest-impact results
- Disease-specific reporting misses these publications entirely
- Trial publication counts are understated
What the issue is not
- Not a disease-mapping problem — the drug and trial are correctly linked in our system
- Not a timing/availability problem — the abstract is live in the ASCO API
- Not specific to Chinese trials or specific sponsors — this is a content classification issue on the ASCO API side
- Not a one_off_jobs issue — job 1022 (Dec 31 wildcard run) did run but could not discover these due to the contentTypes filter
ASCO API schema introspection reveals 6 content item types. Four have DOI + clinicalTrialRegistryNumber + body fields:
| Type | Has DOI | Has NCT ID field | Has Body | Currently handled |
|---|---|---|---|---|
| AbstractContentItem | yes | yes | yes | yes |
| PresentationContentItem | yes | yes | yes | no |
| PosterContentItem | yes | yes | yes | no |
| VideosSlidesContentItem | yes | yes | yes | no |
| JournalContentItem | yes | no | yes | no |
| SessionContentItem | no | no | yes | no |
The exact count of PresentationContentItem records in ASCO is not easily determined (the API returns paginated results of 10 per page), but a drug-name search returning PRESENTATION UIDs alongside Abstract UIDs confirms they represent a meaningful fraction of conference content.
Our ASCO 2025 Annual Meeting coverage: 1,102 abstracts out of an estimated 5,000-6,000+ total — the gap is likely partly explained by this issue.
Spot checks
- PRESENTATION245980 (DOI 10.1200/JCO.2025.43.16_suppl.107): SHR-A2102 Phase 1 in solid tumors — missing
- PRESENTATION243121 (DOI 10.1200/JCO.2025.43.5_suppl.657): SHR-A2102 in urothelial carcinoma — missing
Both are PresentationContentItem with full abstract text, NCT IDs, authors, and DOIs available.
Open characterization questions
- What fraction of ASCO Annual Meeting oral presentations are classified as PresentationContentItem vs AbstractContentItem?
- Are PosterContentItem records also carrying unique abstracts we’re missing, or do they duplicate AbstractContentItem records?
- Should VideosSlidesContentItem be ingested (they carry DOI and NCT ID fields)?
Explored solution direction
The fix is contained entirely in app/services/publications/asco_api_service.rb. Two methods need changes:
1. fetch_abstract_hits — broaden the search contentTypes filter
Current (line 93):
filters_hash = { contentTypes: ['Abstract'] }Change to:
filters_hash = { contentTypes: ['Abstract', 'Presentation'] }This ensures the wildcard search (userInput: '*') returns both AbstractContentItem and PresentationContentItem records. The ASCO API enforces contentTypes strictly for wildcard queries, so without adding 'Presentation' these records never appear in search results.
PosterContentItem is excluded for now — open question whether posters carry unique abstract content or duplicate what’s already in AbstractContentItem records. Can be added later if spot checks show unique content.
2. fetch_full_abstract_detail — add a PresentationContentItem inline fragment
Current query (lines 130–157) uses only:
... on AbstractContentItem { uid title body doi clinicalTrialRegistryNumber ... }
Add a second fragment with the shared fields that both types expose:
... on PresentationContentItem { uid title body doi clinicalTrialRegistryNumber journalCitation taxonomy { subjectsThes drugsThes } publishDate { start } authors { displayName role publicationOrganization } }
These are the same fields already requested from AbstractContentItem. The PresentationContentItem schema exposes all of them (verified via schema introspection). GraphQL will match whichever fragment corresponds to the returned __typename and populate the result identically.
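For clarity, the fragment-matching behavior can be simulated in plain Ruby (the response hash is illustrative, not a real ASCO API payload):

```ruby
# Simulates GraphQL inline-fragment matching on a getContentByUID response:
# only fields from a fragment whose type condition matches __typename are
# populated. The response hash below is illustrative.
def apply_fragment(response, fragment_type, fields)
  return {} unless response['__typename'] == fragment_type

  response.slice(*fields)
end

presentation = {
  '__typename' => 'PresentationContentItem',
  'uid'        => 'PRESENTATION245980',
  'title'      => 'Phase 1 trial of SHR-A2102 ...',
}

# Fragment only on AbstractContentItem → empty hash, record later skipped
p apply_fragment(presentation, 'AbstractContentItem', %w[uid title])      # → {}

# A PresentationContentItem fragment recovers the same fields
p apply_fragment(presentation, 'PresentationContentItem', %w[uid title])
```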
No changes needed in save_publication — the downstream code reads abstract_data['title'], abstract_data['body'], etc. by string key. As long as the GraphQL fragment returns the same field names, save_publication works unchanged.
Deduplication — save_publication already uses Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id]), where source_id is the ASCO uid. Since PresentationContentItem records have distinct UIDs (e.g. PRESENTATION245980), they will not collide with existing AbstractContentItem records. If a presentation and an abstract share the same DOI but different UIDs, both would be saved — but find_or_initialize_by on source_id prevents true duplicates.
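The dedup behavior can be sketched in plain Ruby, with an in-memory Hash standing in for ActiveRecord’s find_or_initialize_by (the hex UID is hypothetical):

```ruby
# In-memory stand-in for Publication.find_or_initialize_by(source:, source_id:).
# ActiveRecord is not loaded here; a Hash keyed on [source, source_id] mimics
# the uniqueness behavior described above.
class FakePublicationStore
  def initialize
    @rows = {}
  end

  def find_or_initialize_by(source:, source_id:)
    @rows[[source, source_id]] ||= { source: source, source_id: source_id }
  end

  def count
    @rows.size
  end
end

store = FakePublicationStore.new
store.find_or_initialize_by(source: 'ASCO', source_id: 'PRESENTATION245980')
store.find_or_initialize_by(source: 'ASCO', source_id: '2e6f19c4')           # abstract-style hex UID (hypothetical)
store.find_or_initialize_by(source: 'ASCO', source_id: 'PRESENTATION245980') # re-ingest: no new row

puts store.count # → 2
```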
What this does not fix — the NCT ID search blind spot (failure point 2). The ASCO API does not index clinicalTrialRegistryNumber for text search regardless of content type. So the disease-specific ingestion path (userInput: 'NCT05701709') will still return zero hits for PresentationContentItem records. This is an ASCO API limitation outside our control. The fix works because the wildcard path (userInput: '*') will now find these records, and they will be correctly saved and linked to trials via clinicalTrialRegistryNumber at save time.
Solution applied
Updated app/services/publications/asco_api_service.rb with the following changes:
- Search filter: contentTypes: ['Abstract'] → contentTypes: ['Abstract', 'Presentation'] in fetch_abstract_hits.
- Detail query: Added ... on PresentationContentItem { ... } inline fragment with identical fields to fetch_full_abstract_detail.
- Performance: Parallelized detail fetches using Parallel.map(hits, in_threads: 5) in fetch_publications_by_criteria.
No changes to save_publication — fields are identical across both content types.
Verification: Test run confirmed PRESENTATION-prefixed UIDs are returned by search, detail query resolves fields correctly, and publications save to the database with source: 'ASCO', category: 'ASCO Abstract', and correct titles/metadata.
3. Publication dose context is trial-derived for linked result publications and still too unstructured for worksheet parity
Short summary
The disease clinical evidence worksheet needs publication dose fields with substantially more precision than our current publication pipeline can provide:
- Dose (if only one dose was used)
- Dose Min
- Dose Max
- RP2D
- Dose Units
- Dose Frequency
Today, most linked result publications never get publication-specific arm/intervention extraction at all. They still surface a dose in /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql, but that value is usually coming from trial study-plan interventions, not from the publication abstract.
Even when publication-specific intervention extraction does run, it only persists a free-text publication_interventions.dose string. That is enough to display a single dose blob, but not enough to reproduce the worksheet columns the client is maintaining manually in spreadsheet 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8.
Where this sits in the current pipeline
Current publication flow:
- /Users/tomor/Sites/bioloupe-data-gov/app/workflows/publications_workflow.rb runs extract_interventions before endpoint and AE processing.
- /Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/intervention_extraction.rb writes llm_data['intervention_arms'].
- /Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/drug_linker.rb persists publication_interventions and publication_arm_interventions.
- /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql builds drug_interventions for reporting:
  - linked publications use vw_bioloupe_interventions
  - only unlinked publications use publication_interventions
- /Users/tomor/Sites/bioloupe-data-gov/app/queries/tpp/emerging_clinical_data_query.rb reads v.dose as a single free-text field.
Exact restriction causing the drop
There are two separate restrictions, and they compound.
Restriction 1: intervention extraction is scoped to unlinked publications
In /Users/tomor/Sites/bioloupe-data-gov/app/tasks/publications_llm_classification/intervention_extraction.rb, base_scope is:
Publication.workflow_eligible
  .unlinked_to_trials
  .hematology_oncology_relevant
  .where("(llm_data -> 'intervention_arms') is null")
So once a result publication is linked to a trial, it normally never enters publication arm extraction.
Restriction 2: the efficacy view only uses publication_interventions for publications without a trial link
In /Users/tomor/Sites/bioloupe-data-gov/db/views/vw_publication_efficacy_data_v07.sql, drug_interventions explicitly says:
- sources 1a/1b/1c use vw_bioloupe_interventions for linked publications
- source 2 uses publication_interventions
- source 2 is restricted by:
WHERE pct.clinical_trial_id IS NULL and pi.source_type = 'Publication'
That means linked publications can show a dose, but it is almost always trial-derived.
Concrete examples
Section titled “Concrete examples”Example 1: publication 66552 (BL-B01D1 in ESCC, ESMO 2024)
Publication:
- publications.id = 66552
- title: BL-B01D1, an EGFR x her3 bispecific antibody-drug conjugate (ADC), in patients with locally advanced or metastatic esophageal squamous cell carcinoma (ESCC)
- linked trial: NCT05262491
Abstract dose language:
- 2.0, 2.5 and 3.0 mg/kg D1D8 Q3W
- 2.5 mg/kg (RP2D)
Current persisted state:
- jsonb_array_length(publications.llm_data -> 'intervention_arms') = 0
- no publication_interventions rows
- vw_publication_efficacy_data.dose = 'not specified'
But the worksheet row in the client spreadsheet is manually decomposed into:
- Dose Min = 2
- Dose Max = 2.5
- RP2D = 2.5
- Dose Units = mg/kg
- Dose Frequency = 2Q3W
So the publication abstract contains the dose context the worksheet needs, but the current linked-publication path discards it and falls back to trial-level not specified.
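As an illustration of the target decomposition, the worksheet row above can be modeled as a small struct (the struct and field names are hypothetical, mirroring the worksheet columns; values are the client’s manual entries for publication 66552):

```ruby
# Hypothetical worksheet-row model: the six dose columns the export needs,
# filled with the client's manual values for publication 66552. This mirrors
# the worksheet, not any existing model in the codebase.
WorksheetDoseRow = Struct.new(:single_dose, :dose_min, :dose_max, :rp2d,
                              :dose_units, :dose_frequency, keyword_init: true)

row_66552 = WorksheetDoseRow.new(
  single_dose:    nil,      # multiple dose levels reported, so no single dose
  dose_min:       2.0,
  dose_max:       2.5,
  rp2d:           2.5,
  dose_units:     'mg/kg',
  dose_frequency: '2Q3W'
)

puts row_66552.to_h
```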
Example 2: publication 133793 (simmitinib, ASCO 2024)
Publication:
- publications.id = 133793
- title: First-in-human study of simmitinib, a novel tyrosine kinase inhibitor targeting FGFR1-3, KDR and CSF-1R.
- linked trial: NCT04058587
Abstract dose language:
- dose escalation: 1 to 9 mg orally
- expansion regimens: 4 mg QD, 6 mg QD, and 6 mg 3 weeks on 1 week off
Current persisted state:
- no llm_data['intervention_arms']
- no publication_interventions
- vw_publication_efficacy_data.dose = 'starting dose 1mg/d'
This is not just incomplete. It is directionally misleading for reporting because the publication result set includes later expansion regimens and the worksheet needs to distinguish min/max/RP2D/schedule.
Example 3: publication 75999 (MRG003, ESMO 2021) shows the partial success case
Publication:
- publications.id = 75999
- title: FIH phase I dose escalation and dose expansion study of anti-EGFR ADC MRG003 in patients with advanced solid tumors
- no linked trial
Current persisted state:
- jsonb_array_length(publications.llm_data -> 'intervention_arms') = 5
- publication_interventions.dose = '0.1–3.0 mg/kg (dose-escalation cohorts)'
- publication_interventions.schedule = 'Q3W'
- vw_publication_efficacy_data.dose echoes the same free-text dose
This proves the existing publication arm extraction can capture publication-derived dosing when the publication is unlinked.
But it also shows the second gap:
- the persisted output is still one free-text dose blob
- the expansion dose 2.5 mg/kg Q3W is not decomposed into worksheet-ready columns
- RP2D is not persisted separately
So broadening extraction scope alone will improve provenance, but not worksheet parity.
Downstream impact
- The disease clinical evidence export cannot reliably recreate the client worksheet from our publication database.
- Linked publication rows can carry a dose string that looks structured enough to trust, while actually reflecting trial-plan interventions rather than the publication cohort being reported.
- Basket, dose-escalation, dose-expansion, and subgroup-specific publications are especially exposed because publication dose often differs from the trial’s broad intervention description.
- The existing add-dose-column-to-emerging-data direction is useful for visibility, but it does not solve the worksheet problem because the export needs decomposed dose fields, not only a single free-text dose.
What the issue is not
This is not just a missing CSV column problem.
Exposing vw_publication_efficacy_data.dose more widely would still leave us with:
- linked publications whose dose came from trial interventions instead of the publication
- free-text values like not specified, specified dose, dose escalation, and starting dose 1mg/d
- no reliable dose_min, dose_max, rp2d, dose_units, or dose_frequency
This is also not purely a trial curation problem.
In many cases the trial registry is doing exactly what it should: storing planned intervention doses at the study-plan level. The problem is that the publication is often talking about:
- a subset of dose-escalation cohorts
- a specific expansion dose
- a weight-banded administration rule
- a recommended phase 2 dose selected after escalation
- a disease-specific cohort inside a broader trial
That context exists in the publication narrative, not necessarily in the linked study plan.
This is also not a good regex problem.
Dose strings in the worksheet and in publication text mix:
- ranges
- RP2D statements
- schedules like Q3W, 2Q3W, QD, days 1, 8, and 15 of a 28-day cycle
- weight-banded doses
- escalation plus expansion language in the same abstract
We should not try to derive worksheet fields from vw_publication_efficacy_data.dose with string-splitting heuristics.
Current warehouse counts:
- linked result publications: 53,701
- linked result publications with any publication_interventions: 79
- linked result publications with publication-derived dose in publication_interventions: 50
- linked result publications with llm_data['intervention_arms']: 87
- linked result publications with a nonblank vw_publication_efficacy_data.dose: 36,840
- linked result publications with view dose but no publication-derived dose: 36,803
This is the key shape of the issue:
- dose appears broadly in reporting
- publication-specific dose provenance is almost absent for linked results
Contrast:
- unlinked result publications with publication-derived dose in publication_interventions: 2,374
The field shape is also not export-ready even when populated:
- vw_publication_efficacy_data rows with nonblank dose: 489,397
- distinct dose strings in the view: 18,002
- rows with obviously ambiguous values like not specified, not reported, or escalation-only labels: 45,236
Representative high-frequency values in the view:
- not specified (28,138 rows)
- specified dose (6,927 rows)
- escalating doses (1,434 rows)
- dose escalation (1,287 rows)
For publication-derived doses specifically:
- publication_interventions rows with nonblank dose: 4,668
- distinct publication-derived dose strings: 3,123
- rows with structurally complex dose text (ranges, RP2D text, schedules): 775
Examples of currently persisted publication-derived dose strings:
- 0.1–0.9 mg/m2 (administered over 1–10 minutes); RP2D 0.7 mg/m2 over 10 minutes
- 0.05 mg/kg rounded to nearest 1.5 mg; weight-band doses used: 1.5 mg (<30 kg), 3 mg (30–60 kg), 4.5 mg (60–90 kg)
- 1000 mg/m2 on days 1 and 8 every 3 weeks
These are useful raw evidence strings, but they are not already normalized worksheet fields.
Spot checks
Linked publications where publication text clearly contains richer dose context than the current export path:
- 66552 (BL-B01D1, ESCC): publication says 2.0, 2.5 and 3.0 mg/kg D1D8 Q3W; view says not specified
- 133793 (simmitinib): publication says 1 to 9 mg, 4 mg QD, 6 mg QD, 6 mg 3 weeks on 1 week off; view says starting dose 1mg/d
- 240515 (amivantamab, OrigAMI-1): worksheet needs the weight-based regimen; current linked-publication path has no publication intervention extraction at all
Unlinked publication showing the existing extraction path works but is still too shallow:
- 75999 (MRG003): publication-derived dose and schedule are persisted, but only as raw text rather than decomposed worksheet fields
Working assumptions from discussion
- The authoritative persistence grain should be publication + arm + subgroup, interpreted as the smallest defensible publication-result scope.
- We should not force false precision. Some dose evidence will legitimately be:
- publication-level
- publication + arm
- publication + subgroup
- publication + arm + subgroup
- publication + disease is too coarse for dose evidence because dose usually follows treatment context, not just disease context.
- Publication intervention extraction should run for all result publications, not just unlinked publications and not just records currently missing rows.
- Operationally, reruns can still be versioned/idempotent so we only refresh missing, stale, or schema-changed records.
- When a publication reports both escalation and expansion cohorts, we should persist:
- the raw dose evidence text
- a structured cohort array
- and derive a preferred export dose per report row from the matching publication context
- We should not persist one publication-wide preferred dose detached from arm/subgroup context.
- Publication-derived dose should be treated as the source of truth for publication-backed rows when it matches the same or narrower context than the row being exported.
- Linked trial dose remains fallback context only when the publication is silent or too vague to support a row-level dose assignment.
- We do want evidence quotes/spans and confidence for extracted dose claims such as RP2D, units, schedule, or frequency. This is primarily for analyst review and debugging.
Open characterization questions
Section titled “Open characterization questions”- How should the persistence model represent scope when an abstract supports only publication-level or arm-level dose evidence and no subgroup is reported?
- Should disease be denormalized onto the dose evidence row for easier querying, or resolved later from subgroup / publication disease context?
- What exact cohort labels do we want to persist for dose context classification:
  - escalation
  - expansion
  - rp2d_or_fixed_dose
  - mixed_or_unclear
- Should full text, when available, be allowed to override abstract-derived dose evidence, or only supplement it?
Explored solution direction
The direction that emerges from the worksheet and the warehouse evidence has several layers.
1. Use smallest-scope publication evidence as the persistence model
The target grain should be publication-result context, not publication-wide text blobs.
Preferred direction:
- persist dose evidence at publication + arm + subgroup scope when supported
- allow nullable arm/subgroup keys for publication-level and arm-only evidence
- derive disease-facing exports from these scoped evidence rows instead of trying to back-infer scope later
2. Expand publication arm/intervention extraction to all result publications, including linked ones
The current unlinked_to_trials restriction is too aggressive for dose-sensitive reporting.
Preferred direction:
- run publication arm/intervention extraction for all result publications, including linked ones
- persist publication_interventions and publication_arm_interventions even when a publication is linked to a trial
- keep trial-linked study-plan interventions as fallback context, not as the only source of dose
This addresses provenance.
3. Add a separate LLM-backed publication evidence extraction for worksheet dose fields
publication_interventions.dose should remain the raw publication dose phrase, but it should not be the final reporting shape.
Preferred structured output for the disease clinical evidence export:
- raw publication dose text
- structured cohort array
  - single_dose
  - dose_min
  - dose_max
  - rp2d
  - dose_units
  - dose_frequency
  - dose_context_type such as escalation, expansion, RP2D/fixed-dose, or mixed/unclear
- evidence quote/span
- confidence
- optional cohort / arm note explaining whether the values come from escalation, expansion, or a disease-specific subset
This should be extracted from publication text with publication context using an LLM-backed schema, not reverse-parsed from the existing free-text dose field and not derived through substring / regex heuristics.
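A minimal sketch of one structured cohort record under this schema (field names from the list above; the validator and enum check are illustrative, not production extraction code; values follow the MRG003 example):

```ruby
require 'json'

# Illustrative target schema for one extracted dose cohort. The field names
# come from the list above; valid_dose_cohort? is a hypothetical helper.
DOSE_COHORT_FIELDS = %w[
  single_dose dose_min dose_max rp2d dose_units dose_frequency
  dose_context_type evidence_quote confidence
].freeze

CONTEXT_TYPES = %w[escalation expansion rp2d_or_fixed_dose mixed_or_unclear].freeze

def valid_dose_cohort?(cohort)
  DOSE_COHORT_FIELDS.all? { |k| cohort.key?(k) } &&
    CONTEXT_TYPES.include?(cohort['dose_context_type'])
end

cohort = JSON.parse(<<~JSON)
  {
    "single_dose": null,
    "dose_min": "0.1 mg/kg",
    "dose_max": "3.0 mg/kg",
    "rp2d": "2.5 mg/kg",
    "dose_units": "mg/kg",
    "dose_frequency": "Q3W",
    "dose_context_type": "escalation",
    "evidence_quote": "0.1–3.0 mg/kg (dose-escalation cohorts)",
    "confidence": 0.95
  }
JSON

puts valid_dose_cohort?(cohort) # → true
```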
The current early extract_interventions step is still useful, but it is probably not sufficient on its own for dose attribution. The authoritative dose extraction likely belongs later in the workflow, after subgroup / arm / endpoint context exists, so the dose evidence can be attached to the correct publication result scope.
4. Use publication-derived dose as the preferred export source when it matches the publication result context
Source precedence for dose should likely be:
- publication-specific structured dose evidence
- publication raw intervention dose text
- linked trial intervention dose as fallback only
The important nuance is that we should derive the preferred export dose per output row from the matching publication context. We should not store or trust a single publication-wide preferred dose when the abstract contains multiple cohorts.
That is different from the current efficacy view, where linked publications are effectively forced into the trial-derived path.
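A plain-Ruby sketch of that per-row precedence, under the assumption that each candidate is nil when absent (the method name and the 'not specified' vagueness guard are hypothetical):

```ruby
# Hypothetical per-row resolver for the precedence above: structured
# publication dose evidence, then raw publication dose text, then the
# linked-trial dose as fallback only.
def preferred_dose(structured_evidence:, publication_dose_text:, trial_dose:)
  return structured_evidence if structured_evidence
  if publication_dose_text && publication_dose_text != 'not specified'
    return publication_dose_text
  end
  trial_dose # fallback only when the publication is silent or too vague
end

# Pub 66552-style row: publication evidence beats the trial-derived value
puts preferred_dose(
  structured_evidence:   nil,
  publication_dose_text: '2.0, 2.5 and 3.0 mg/kg D1D8 Q3W',
  trial_dose:            'not specified'
) # → 2.0, 2.5 and 3.0 mg/kg D1D8 Q3W

# Publication silent → trial dose as fallback
puts preferred_dose(structured_evidence: nil, publication_dose_text: nil, trial_dose: '1200 mg Q3W')
# → 1200 mg Q3W
```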
5. Keep this as an export/evidence enrichment concern, not a generic trial-study-plan rewrite
The problem we are solving is:
- can we recreate the worksheet from publication-backed evidence?
The answer does not require fully normalizing every historical publication intervention into canonical pharmacology. It requires a publication evidence layer that preserves what the publication actually says at the arm/cohort level.
Solution applied
Implemented 2026-03-11. Change: publication-dose-context-gap.
Four-part fix:
- Broadened intervention extraction scope — Removed .unlinked_to_trials from InterventionExtraction#base_scope. Previously ~53K linked publications were skipped because the therapeutic_area_filter step also had .unlinked_to_trials, so linked publications never got classified as hematology_oncology_relevant and never entered the intervention extraction scope — even though all trials in our database are hemonc by definition.
- Target disease scope for cost control — Running intervention + dose extraction across all 53K linked pubs would cost ~$480. Instead, scoped the backfill to publications linked to trials in target disease areas via clinical_trial_end_diseases:
  - Solid Tumors (4116), HNSCC (6200), ESCC (4260), sqNSCLC (4174), CRC (4345), Cholangiocarcinoma (6228/6229/4298)
  - Plus all existing hemonc-classified unlinked publications
  - Implemented as reusable scope Publication.target_disease_or_hemonc_relevant on the model
  - Reduces backfill from 53K to ~10K publications, estimated cost ~$66
  - These disease IDs are hardcoded for the initial backfill; scope can be broadened later by adding more disease IDs to Publication::TARGET_DISEASE_IDS
- New dose evidence extraction step — Created DoseEvidenceExtraction LLM task (app/tasks/publications_llm_classification/dose_evidence_extraction.rb) that decomposes free-text publication_interventions.dose into structured fields stored in publication_interventions.dose_evidence JSONB:
  - single_dose, dose_min, dose_max, rp2d, dose_units, dose_frequency, dose_context_type
  - evidence_quote, confidence, version
  - Uses gpt-5-mini at ~$0.004/publication — sufficient quality, no model upgrade needed
  - Prompt sends publication_intervention.id per intervention for deterministic persistence (no name matching)
  - Integrated into PublicationsWorkflow as a skippable step after extract_subgroups
- Efficacy view + export updated — vw_publication_efficacy_data v08 adds dose_min, dose_max, rp2d, dose_units, dose_frequency columns via a pub_dose_lookup CTE that reads publication_interventions.dose_evidence. emerging_clinical_data_query.rb includes these in export output.
Key discovery during implementation: The therapeutic_area_filter task also has .unlinked_to_trials in its scope, so 65,152 linked publications were never classified for hemonc relevance. Since all trials in our DB are hemonc, the classification gate is meaningless for linked pubs. Rather than running the LLM therapeutic area filter on 65K pubs unnecessarily, we bypass it with target_disease_or_hemonc_relevant which uses trial disease metadata for linked pubs and LLM classification for unlinked pubs.
Files changed:
- app/models/publication.rb (target_disease_or_hemonc_relevant scope + TARGET_DISEASE_IDS)
- app/tasks/publications_llm_classification/dose_evidence_extraction.rb (new)
- app/tasks/publications_llm_classification/intervention_extraction.rb (scope changed to target_disease_or_hemonc_relevant)
- app/workflows/publications_workflow.rb (new step added)
- app/admin/services/publication_console/publication_workflow_registry.rb (registry entries)
- app/admin/services/publication_console/publication_workflow_overview_service.rb (scope methods)
- lib/tasks/clinical_trials/publications.thor (Thor task wiring)
- db/migrate/20260311220054_add_dose_evidence_to_publication_interventions.rb (JSONB column + GIN index)
- db/views/vw_publication_efficacy_data_v08.sql (structured dose columns)
- db/migrate/20260311220657_update_vw_publication_efficacy_data_to_version8.rb (view migration)
- app/queries/tpp/emerging_clinical_data_query.rb (export columns)
Smoke test results (4 publications, gpt-5-mini):
- Pub 75999 (MRG003): dose_min=0.1 mg/kg, dose_max=3.0 mg/kg, rp2d=2.5 mg/kg, Q3W, context=escalation, confidence=0.95
- Pub 117 (Olanzapine/Pregabalin): fixed doses correctly extracted (5mg, 75mg, 8mg)
- Pub 88446 (21 interventions): all 21 matched by ID, non-drug interventions correctly got confidence=0.0
- Structured dose columns confirmed flowing through materialized view after refresh
Backfill completed 2026-03-12. Four steps ran in production:
1. thor clinical_trials:publications:extract_interventions --batched --parallelism=4 --batch-size=2000
2. thor clinical_trials:publications:link_publication_drugs --parallelism=5
3. thor clinical_trials:publications:extract_dose_evidence --batched --parallelism=4 --batch-size=2000 (ran twice — first pass covered unlinked pubs only; second pass covered newly materialized linked-pub interventions)
4. REFRESH MATERIALIZED VIEW CONCURRENTLY vw_publication_efficacy_data
Backfill results:
- 44,778 / 44,780 publication_interventions rows have dose_evidence populated
- Actual cost: ~$8 total across both dose evidence runs (gpt-5-mini batch API, ~$0.0004/pub — 10x cheaper than pre-implementation estimate)
- Extraction quality verified across random samples: high-confidence extractions accurate, RP2D correctly identified in escalation studies, weight-based/BSA-based classification correct, low-confidence calibration appropriate (no hallucinated doses)
Post-backfill cleanup:
- ~1.1% of rows (513) had LLM garbage in string fields — placeholder text, chain-of-thought leaking, field-name rotation, escaped JSON fragments. All correlated with non-drug interventions (surgery, imaging, lifestyle). Root cause: system prompt redundantly described JSON format when structured outputs already constrain it.
- ~5,500 rows had string "null" variants instead of JSON null.
- Both issues fixed by one_off:cleanup_dose_evidence_garbage:execute (one-off Thor task, 6,545 rows cleaned).
- Prevention added: sanitize_dose_evidence! in DoseEvidenceExtraction#persist_dose_evidence strips garbage on persist. System prompt simplified to avoid redundant format instructions with structured outputs.
Spot-check verification (tracker examples now resolved):
| Pub | Drug | Before | After |
|---|---|---|---|
| 66552 | BL-B01D1 | not specified | dose_min 2.0 mg/kg, dose_max 3.0 mg/kg, rp2d 2.5 mg/kg, D1D8 Q3W |
| 133793 | simmitinib | starting dose 1mg/d | dose_min 1 mg, dose_max 9 mg, rp2d 6 mg 3 weeks on 1 week off |
| 75999 | MRG003 | raw text only | dose_min 0.1 mg/kg, dose_max 3.0 mg/kg, rp2d 2.5 mg/kg, Q3W |
| 240515 | amivantamab | no intervention rows | no intervention_arms in llm_data (abstract may lack dose detail) |
Issue reopened: pub_dose_lookup view join drops 76% of extracted dose evidence (2026-03-23)
The extraction and persistence steps from the 2026-03-11 fix are working correctly — 23,503 publications have dose_evidence populated in publication_interventions. However, only 8,764 publications (37%) have structured dose fields flowing through to vw_publication_efficacy_data. The remaining 17,826 publications (76%) have dose evidence silently dropped by the view’s pub_dose_lookup join.
Root cause
The pub_dose_lookup CTE joins on (publication_id, drug_id):
```sql
LEFT JOIN pub_dose_lookup pdl
  ON po.publication_id = pdl.publication_id
 AND di.drug_id = pdl.drug_id
```

- `di.drug_id` comes from the `drug_interventions` CTE, which for linked publications sources from `vw_bioloupe_interventions` (trial registry drugs)
- `pdl.drug_id` comes from `publication_interventions.drug_id` (LLM-extracted and drug-linked)
This join fails in two ways:
Failure mode 1: NULL drug_id on publication_interventions (~13,600 pubs, 58%)
When link_publication_drugs doesn’t find a matching drug record, publication_interventions.drug_id stays NULL. The SQL predicate `di.drug_id = NULL` never evaluates to TRUE (equality against NULL yields UNKNOWN), so the dose evidence is silently dropped even though it was correctly extracted.
Failure mode 2: Drug_id mismatch between registry and publication (~2,148 pubs, 9%)
The trial registry and the LLM-extracted publication interventions can resolve to different drug records for the same compound:
- ADC vs naked antibody: Zanidatamab (10432) vs Zanidatamab zovodotin (15231)
- Unresolved drug matching: SHR-A1811 has drug_id=NULL in publication_interventions but drug_id=10733 (Trastuzumab rezetecan) in the trial registry
- Biosimilar/brand aliases: SCT510 (15900) vs Bevacizumab (9022)
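Both failure modes reduce to the same join semantics: the predicate `di.drug_id = pdl.drug_id` never matches when either side is NULL, and it mismatches when the registry and the publication resolved the same compound to different drug records. A minimal Ruby sketch with illustrative stand-in rows (not the real CTE output):

```ruby
# Stand-in rows for the two join sides (illustrative, not real CTE data).
registry_interventions = [
  { publication_id: 66516, drug_id: 15231 }, # registry resolved the ADC record
  { publication_id: 70960, drug_id: 10733 }, # registry resolved Trastuzumab rezetecan
]
pub_dose_lookup = [
  { publication_id: 66516, drug_id: 10432, dose: 'single_dose=1200 mg' }, # naked antibody record
  { publication_id: 70960, drug_id: nil,   dose: 'rp2d=6.4 mg/kg' },      # drug linking failed
]

# SQL equality is never TRUE when either side is NULL.
sql_eq = ->(a, b) { !a.nil? && !b.nil? && a == b }

joined = registry_interventions.map do |ri|
  match = pub_dose_lookup.find do |pdl|
    ri[:publication_id] == pdl[:publication_id] && sql_eq.call(ri[:drug_id], pdl[:drug_id])
  end
  { publication_id: ri[:publication_id], dose: match && match[:dose] }
end
# Both rows join to no dose: 66516 via drug_id mismatch, 70960 via NULL drug_id.
```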
Concrete examples from CRC ADC audit (disease 4345, technology 708)
| Pub | Drug | PI drug_id | View drug_id | Dose evidence | View dose fields |
|---|---|---|---|---|---|
| 66516 | Zanidatamab | 10432 (Zanidatamab) | 15231 (Zanidatamab zovodotin) | single_dose=1200 mg | all NULL |
| 70960 | SHR-A1811 | NULL | 10733 (Trastuzumab rezetecan) | dose_min=3.2, dose_max=8.0, rp2d=6.4 mg/kg | all NULL |
| 114758 | Zanidatamab | 10432 (Zanidatamab) | 15231 (Zanidatamab zovodotin) | single_dose=1200 mg | all NULL |
The unstructured dose column (from trial registry study_plan_components) still shows generic protocol text like “dose levels and schedules determined by the Safety Monitoring Committee (SMC)” for these publications.
- 23,503 publications with dose_evidence extracted
- 8,764 publications with structured dose in view (37%)
- 17,826 publications with dose evidence silently dropped (76%)

Breakdown of dropped:

- ~13,600 NULL drug_id on publication_interventions (58%)
- ~2,148 drug_id mismatch between registry and publication (9%)
- ~2,078 other (pub not in view, dose_evidence has no usable fields, etc.)

Fix applied
Resolved by Issue 20 fix (2026-03-23). The root cause was the drug_interventions CTE sourcing drug_id from vw_bioloupe_interventions (registry) while pub_dose_lookup used publication_interventions drug_id. The v16 view restructuring (see Issue 20 solution) fixes this by:
- Using `publication_interventions` as the primary drug source (Source 0), so `di.drug_id` and `pdl.drug_id` come from the same table.
- Threading `publication_intervention_id` through both CTEs for exact 1:1 join matching — eliminating the drug_id mismatch entirely, including for NULL drug_id interventions.
- Allowing NULL drug_id interventions through Source 0 (if we extracted them, they’re the source of truth — don’t fall back to registry).
Result: dose evidence coverage went from 8,764 pubs (71% of extracted) to 11,902 pubs (96.6% of extracted).
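The effect of the v16 keying can be sketched in Ruby: joining on a shared publication_intervention_id makes NULL or mismatched drug_ids irrelevant (row shapes are illustrative, not the real view):

```ruby
# Illustrative sketch of the v16 join key: both sides carry the
# publication_interventions primary key, so drug_id no longer participates.
interventions = [
  { publication_intervention_id: 1, publication_id: 66516, drug_id: 10432 },
  { publication_intervention_id: 2, publication_id: 70960, drug_id: nil }, # allowed through Source 0
]
pub_dose_lookup = [
  { publication_intervention_id: 1, dose: 'single_dose=1200 mg' },
  { publication_intervention_id: 2, dose: 'rp2d=6.4 mg/kg' },
]

lookup = pub_dose_lookup.to_h { |r| [r[:publication_intervention_id], r[:dose]] }
joined = interventions.map { |i| i.merge(dose: lookup[i[:publication_intervention_id]]) }
# Every extracted dose survives the join, including the NULL drug_id row.
```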
4. Most frequent AE columns lack grade-classified ranked export fields
Short summary
The disease clinical evidence worksheet has two AE columns per row:
- `Most Frequent AE All Grade` — e.g. `Anemia (85.4%), Leukopenia (53.7%), Thrombocytopenia (53.7%)`
- `Most Frequent AE >=Gr3` — e.g. `Anemia (28.0%), Leukopenia (15.9%), Thrombocytopenia (14.6%)`
These are ranked lists of the top individual named adverse events by incidence, separated into all-grade vs grade ≥3 buckets.
The current pipeline extracts individual named AE rows with numeric values but does not:
- Classify each AE row by grade category (all-grade vs ≥grade 3)
- Rank AEs by incidence within each grade bucket
- Produce a formatted summary string for export
As a result, the worksheet AE columns cannot be populated from structured data today.
Where this sits in the current pipeline
Publication AE flow:
- `classify_publications` extracts `llm_data['adverse_events']` from the abstract. The LLM schema (details.rb: `AdverseEvent`) captures `adverse_event` (name), `measure_unit`, `observation` (free text), and `arms[].measure_value` (numeric). There is no `grade_category` field — grade information lands in `observation` as unstructured text or gets embedded in the AE name.
- `post_process_publications` creates `adverse_events` rows and `trial_arm_outcomes` rows with numeric `measure_value`.
- `standardize_adverse_events` does rule-based name standardization.
- `classify_adverse_events` LLM-matches AEs to safety endpoint categories.
Relevant code paths:
- `app/tasks/publications_llm_classification/task.rb` — extraction prompt (section 4: Adverse Events)
- `app/tasks/publications_llm_classification/details.rb` — `AdverseEvent` schema (lines 111–134)
- `app/tasks/publications_llm_classification/post_process.rb` — `process_adverse_events` persists rows
- `app/queries/tpp/emerging_clinical_data_query.rb` — `extract_safety_metrics_for_publication` only handles aggregate metrics (TRAE ≥Gr3, TEAE ≥Gr3, discontinuation), not individual named AEs
Exact restriction causing the drop
Two separate restrictions:
1. The LLM extraction schema has no grade classification field
The AdverseEvent schema in details.rb captures:
```ruby
attribute :adverse_event, :string   # name
attribute :measure_unit, :string    # percentage/count
attribute :observation, :string     # free text — grade info lands here
attribute :arms, Arm.to_array_type  # numeric values per arm
```

There is no grade_category enum. The LLM puts grade context into observation as free text (e.g. "Grade ≥3", "Grade 3 treatment-related", "Any grade", "Most common adverse event", or empty).
2. The downstream safety extraction only handles aggregate metrics
classify_safety_metric in emerging_clinical_data_query.rb classifies AEs into aggregate categories (:grade3_traes, :grade3_teaes, :discontinuation) and returns nil for individual named AEs like Nausea or Neutropenia. These individual AEs are stored but never surfaced in any export path.
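The aggregate-only behavior can be illustrated with a simplified stand-in (the real classify_safety_metric lives in emerging_clinical_data_query.rb; the regexes here are assumptions that only mirror the behavior described above):

```ruby
# Simplified stand-in: aggregate rollups classify, named AEs fall through to nil.
def classify_safety_metric(ae_name)
  return :discontinuation if ae_name.match?(/discontinuation/i)

  grade3 = ae_name.match?(/(grade|gr)\s*(≥|>=)?\s*3/i)
  return :grade3_traes if grade3 && ae_name.match?(/TRAE|treatment[- ]related/i)
  return :grade3_teaes if grade3 && ae_name.match?(/TEAE|treatment[- ]emergent/i)

  nil # individual named AEs (Nausea, Neutropenia, ...) are never surfaced
end
```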
Concrete examples
Worksheet row: Izalontamab brengitecan in ESCC (ESCC tab, row 3)
The worksheet contains:
- `Most Frequent AE All Grade`: `Anemia (85.4%), Leukopenia (53.7%), Thrombocytopenia (53.7%), Neutropenia (42.7%)`
- `Most Frequent AE >=Gr3`: `Anemia (28.0%), Leukopenia (15.9%), Thrombocytopenia (14.6%), Neutropenia (14.6%)`
Our database has the individual AE rows and numeric values for this publication, but no way to classify which rows are all-grade vs ≥grade 3, and no export field that produces the ranked formatted string.
Worksheet row: Micvotabart pelidotin in HNSCC (HNSCC tab, row 4)
The worksheet contains:
- `Most Frequent AE All Grade`: `Cutaneous (44%); Neuropathy (34%); Neutropenia (22%); Anemia (17%)`
- `Most Frequent AE >=Gr3`: `Neuropathy (28%), Neutropenia (11%)`
The pattern is consistent: top 2–4 AEs ranked by incidence, with percentages, semicolon or comma separated.
Current database state for publication-sourced AE rows:
- Total publications with AE rows: 36,802
- Publications with AE rows that have numeric trial_arm_outcomes.measure_value: 33,835
- Total AE rows with numeric values: 156,325
- Average AE rows per publication: 4.6 (median 3, p90 8)
Grade context distribution across the 156K rows:
| Grade signal | Rows | % | Source |
|---|---|---|---|
| Clearly grade ≥3 (in observation) | 19,079 | 12% | `observation ~* 'grade.*(3\|≥3\|3/4)'` |
| Clearly grade ≥3 (in name) | 16,862 | 11% | `name ~* 'grade.*(3\|≥3\|3/4)'` |
| Subtotal grade ≥3 identifiable | 57,206 | 37% | Combined name + observation |
| Explicitly all-grade | 4,054 | 3% | `observation ~* '(any grade\|all grade)'` |
| No grade context at all | 75,616 | 48% | Neither name nor observation mentions grade |
| Low grade only (1-2) | ~4,024 | 3% | |
| Other grade context | ~2,797 | 2% | |
At the publication level:
| Category | Publications |
|---|---|
| Has BOTH all-grade and grade ≥3 rows | 4,053 |
| Has grade ≥3 rows only | 6,547 |
| Has any-grade rows only | 7,719 |
| Ambiguous (no clear grade signals) | 1,287 |
The key finding: 48% of individual AE rows (75K) have no grade context in either name or observation. These are likely all-grade AEs but cannot be reliably classified without the abstract context that was available at extraction time.
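The grade-signal counts above were derived with SQL pattern matching; a Ruby mirror of the same heuristic (illustrative regexes, not the production classifier) shows why the unlabeled 48% cannot be bucketed from the stored text alone:

```ruby
# Heuristic mirror of the SQL patterns in the table above (illustrative).
GRADE_GTE3 = /grade.*?(≥\s*3|>=\s*3|3\/4|\b3\b)/i
ALL_GRADE  = /any grade|all grade/i

def grade_signal(name, observation)
  text = [name, observation].compact.join(' ')
  return :grade_gte3 if text.match?(GRADE_GTE3)
  return :all_grade  if text.match?(ALL_GRADE)
  :no_grade_context # the 48% bucket — no signal in name or observation
end
```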
Worksheet AE column patterns
From spot-checking across HNSCC and ESCC tabs:
- All Grade column: typically 2–4 AEs, sometimes just names without % when percentages aren’t reported
- >=Gr3 column: typically 1–3 AEs, usually fewer than all-grade
- Some cells include `(NR)` for “not reported”
- Separator style varies: semicolons and commas both used
- Format: `AE_name (value%)`
Downstream impact
- The disease clinical evidence export cannot populate the two most-frequent-AE columns
- The existing safety extraction only surfaces aggregate TRAE/TEAE/discontinuation metrics
- Individual named AEs with percentages exist in the database but are invisible to reporting
- Publications where the abstract reports specific high-frequency AEs (the most clinically relevant safety signal) cannot be compared to the manually curated worksheet
What the issue is not
This is not a missing AE extraction problem. The pipeline already extracts individual named AEs with numeric values for ~34K publications. The AE data exists — it just lacks grade classification and a ranked export format.
This is also not an aggregate safety metric problem. TRAE ≥Gr3, TEAE ≥Gr3, and discontinuation rates are already handled by extract_safety_metrics_for_publication. The gap is specifically in individual named AE ranking.
Open characterization questions
- Should the ranked summary be persisted as pre-formatted strings (like the worksheet cells), or as structured arrays that the export formats at query time?
- When a publication has AE rows for multiple arms, should the ranked summary use the experimental arm only (current behavior for aggregate metrics) or present the arm that matches the export row context?
Explored solution direction
The solution has two parts: a schema enhancement for future publications and a backfill for existing data.
1. Modify classify_publications extraction to include grade classification (going forward)
Add a grade_category enum field to the AdverseEvent schema in details.rb:
```ruby
class AdverseEvent
  include StoreModel::Model
  include DataTasks::JsonSchema

  desc 'The name of the adverse event reported in the trial.'
  attribute :adverse_event, :string

  desc 'Grade category of this adverse event. Use all_grade for any-grade or ' \
       'unspecified-grade AEs, grade_gte3 for grade ≥3/grade 3-4/grade 3-5 AEs.'
  attribute :grade_category, :string # enum: all_grade, grade_gte3

  # ... existing fields ...
end
```

Update the extraction prompt (section 4 in task.rb) to instruct the LLM to classify grade at extraction time. The LLM already reads the abstract in full — it knows whether “Nausea (75.3%)” is reported as all-grade or ≥grade 3 from surrounding context. Adding one enum field is nearly free in token cost.
Add a grade_category column to the adverse_events table (migration). Update post_process.rb:process_adverse_events to persist the new field.
2. Backfill existing AE rows with LLM grade classification
A separate one-time LLM task that reads existing adverse_events rows + the publication abstract and classifies grade_category for each row.
Scope: ~33,835 publications, ~156K AE rows.
Input per publication prompt:
- Publication title + abstract (~2,750 chars avg)
- Existing AE rows with name, observation, and measure_value (~217 chars avg)
Output per AE row:
grade_category:all_grade|grade_gte3
Estimated cost with gpt-5-mini batched: ~$15–25 for the full backfill.
The backfill task would update adverse_events.grade_category directly. After completion, all AE rows (both historical and future) have grade classification from the same source of truth.
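The validate-then-update flow of such a backfill can be sketched in Ruby. All class, method, and field names below are hypothetical illustrations, not the actual AdverseEventGradeBackfill implementation:

```ruby
# Hypothetical sketch of the backfill loop: ask a classifier for a
# { ae_row_id => grade_category } map per publication, discard invalid
# values, and apply valid ones to the in-memory rows.
class GradeBackfillSketch
  VALID = %w[all_grade grade_gte3].freeze

  def initialize(classifier:)
    @classifier = classifier # callable: (abstract, ae_rows) -> { id => category }
  end

  # Returns the number of rows updated across all publications.
  def run(publications)
    publications.sum do |pub|
      updates = @classifier.call(pub[:abstract], pub[:ae_rows])
      updates.count do |ae_id, category|
        next false unless VALID.include?(category) # guard against LLM drift
        row = pub[:ae_rows].find { |r| r[:id] == ae_id }
        row && (row[:grade_category] = category; true)
      end
    end
  end
end
```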
3. Ranked summary derivation (query-time)
Once all AE rows have grade_category, producing the worksheet columns is a straightforward query:
```sql
-- For a given publication + arm context:
SELECT ae.name, tao.measure_value
FROM adverse_events ae
JOIN trial_arm_outcomes tao ON tao.adverse_event_id = ae.id
WHERE ae.source_id = :publication_id
  AND ae.source_type = 'Publication'
  AND ae.grade_category = 'all_grade' -- or 'grade_gte3'
  AND ae.measure_unit = 'percentage'
  AND tao.measure_value IS NOT NULL
  AND tao.measure_value::numeric > 0
ORDER BY tao.measure_value::numeric DESC
LIMIT 4
```

Format as: `AE_name (value%); AE_name (value%); ...`
This can be computed at export time from the grade-tagged rows without a separate persistence step.
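The final formatting step can be sketched in Ruby (helper name and row shape are illustrative, not the reporting-path implementation):

```ruby
# Formats grade-bucketed AE rows into the worksheet cell style
# "AE_name (value%); AE_name (value%); ..." — names here are illustrative.
def format_most_frequent_aes(rows, grade_category:, limit: 4)
  rows.select { |r| r[:grade_category] == grade_category && r[:value].to_f > 0 }
      .sort_by { |r| -r[:value].to_f }
      .first(limit)
      .map { |r| "#{r[:name]} (#{r[:value]}%)" }
      .join('; ')
end

rows = [
  { name: 'Anemia',     value: 28.0, grade_category: 'grade_gte3' },
  { name: 'Leukopenia', value: 15.9, grade_category: 'grade_gte3' },
  { name: 'Anemia',     value: 85.4, grade_category: 'all_grade' },
]
format_most_frequent_aes(rows, grade_category: 'grade_gte3')
# => "Anemia (28.0%); Leukopenia (15.9%)"
```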
4. Workflow placement
No new workflow step needed for the going-forward path — grade classification happens inside the existing classify_publications step and is persisted by post_process_publications.
The backfill task runs independently as a one-time Thor task, similar in pattern to the subgroup disease adjudication backfill (Issue 1).
Solution applied
Implemented 2026-03-11. Change: add-publication-ae-grade-classification.
Status: Implementation complete. Full historical backfill has not yet been run across the remaining eligible publication AE rows.
Applied fix:
- Persisted AE grade category on `adverse_events` — Added `adverse_events.grade_category` with canonical values `all_grade` and `grade_gte3`, plus model normalization/validation so downstream readers have a stable field instead of re-parsing free text.
- Extended publication extraction for new rows — Updated the publication LLM schema and prompt so `classify_publications` emits `grade_category` for each adverse event row, and updated `post_process_publications` to persist it when creating publication-sourced AE rows.
- Added historical backfill task — Created `PublicationsLlmClassification::AdverseEventGradeBackfill` and wired a Thor task: `thor clinical_trials:publications:backfill_adverse_event_grade_categories`
  - supports non-batched execution, `--publication-ids`, `--limit`, `--source`, `--model`, and `--overwrite`
  - default validation model: `gpt-5-mini`
- Added ranked named-AE export derivation — Implemented query-time ranking of named adverse events by `grade_category` and wired worksheet-style outputs into the reporting path: `Most Frequent AE All Grade`, `Most Frequent AE >=Gr3`
- Hardened ranked summary filtering after manual spot checks — Updated the summary helper so it:
  - prefers the actual adverse-event name over standardized bucket labels
  - excludes aggregate rollup rows such as `TRAE`, `TEAE`, `SAE`, `AESI`, `irAE`, discontinuation, fatal/grade-5 rollups
  - excludes zero-value / `not reported` rows from named-AE summaries
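A minimal sketch of the rollup exclusion, assuming a regex-based filter (the actual helper and its exact patterns live in the reporting path and may differ):

```ruby
# Illustrative filter: keep named AEs, drop aggregate rollup rows.
# The pattern list is an assumption based on the exclusions described above.
AGGREGATE_ROLLUP = /\b(TRAEs?|TEAEs?|SAEs?|AESIs?|irAEs?)\b|discontinuation|grade\s*5|fatal/i

def named_ae?(ae_name)
  !ae_name.match?(AGGREGATE_ROLLUP)
end
```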
Files changed:
- `app/models/adverse_event.rb`
- `app/tasks/publications_llm_classification/details.rb`
- `app/tasks/publications_llm_classification/task.rb`
- `app/tasks/publications_llm_classification/post_process.rb`
- `app/tasks/publications_llm_classification/adverse_event_grade_backfill.rb`
- `lib/tasks/clinical_trials/publications.thor`
- `app/queries/clinical_trials/publications_query.rb`
- `app/queries/tpp/emerging_clinical_data_query.rb`
- `app/services/tpp/reports/emerging_clinical_data_report.rb`
- `db/migrate/20260311222107_add_grade_category_to_adverse_events.rb`
Manual validation completed:
- Non-batched `gpt-5-mini` run on 4 hand-picked publications: `4` publications processed, `33` rows updated
- Confirmed persisted `all_grade` vs `grade_gte3`, default skip behavior, overwrite behavior, and arm fallback
- Additional non-batched `gpt-5-mini` run on 8 random publications: `8` publications processed, `33` rows updated
- Random spot checks confirmed:
  - named grade `3/4` and `>=3` rows classify as `grade_gte3`
  - named any-grade / grade-1 rows classify as `all_grade`
  - aggregate safety rows are excluded from ranked named-AE summaries
  - zero/`not reported` rows no longer emit bogus ranked summary strings
Model outcome: gpt-5-mini was good enough on the manual validation slices; no progression to a stronger model was needed.
Operational follow-up: run the full historical backfill for the remaining eligible publication AE rows before marking this issue fully complete.
5. Publication prior therapy context is not extracted — min/max prior lines and prior therapy exposure are missing
Short summary
The disease clinical evidence worksheet has four columns that describe the prior therapy context of a publication’s study population:
- `Min Prior Lines` — minimum number of prior lines of therapy (e.g. `1`)
- `Max Prior Lines` — maximum number of prior lines (e.g. `7`)
- `Treatment Line` — e.g. `2L+`, `3L+` (already extracted; this issue does not cover treatment line)
- `Prior Taxane Use` — e.g. `Yes`, `No`, `Allowed`, `Required`
Treatment line is already extracted and persisted on trial_subgroups.treatment_lines (see TreatmentContextExtraction task, renamed from TreatmentLineExtraction). But min_prior_lines, max_prior_lines, and prior therapy exposure are not extracted from publications at all.
The trial side has partial analogues:
- `trial_eligibility_criteria` with `modifier = 'prior_treatment_lines'` stores `min`/`max` for ~62K trial records
Note: indicated_prior_therapies is related to drug approval indications, not trials or publications. It captures required/excluded prior therapies for regulatory label context, not clinical study populations.
Publication-sourced rows have no equivalent for either prior line counts or prior therapy exposure. When the worksheet reports “median 4 prior therapies (range 0–7)” or “52% had prior taxane therapy for mCRPC,” that context exists only in the publication abstract and is not captured by the pipeline.
Where this sits in the current pipeline
Treatment line extraction:
- `TreatmentContextExtraction` in `app/tasks/publications_llm_classification/treatment_context_extraction.rb` maps abstracts to enum values (`1L`, `2L+`, `3L+`, etc.) and extracts prior therapy context
- Results persist on `trial_subgroups.treatment_lines` (JSONB array) and `trial_subgroups.llm_data['treatment_lines']`
- The efficacy view normalizes to `effective_line` (numeric 0–4) and `treatment_settings`
The treatment line extraction already reads prior therapy language to determine the line (e.g. “median of 4 prior therapies” → 3L+). But the numeric counts and specific therapy exposures are consumed as reasoning inputs, not persisted as structured data.
There is no extraction step for:
- publication-level prior line counts (min, max, median)
- prior therapy exposure flags (prior taxane, prior checkpoint inhibitor, etc.)
Exact restriction causing the gap
1. Treatment line extraction discards numeric prior therapy counts
The TreatmentLineExtraction system prompt instructs the LLM to use prior therapy counts for line determination:
- "median prior lines = N":
  - N ≥ 2 → "3L+"
  - N = 1 (or range includes 1–2) → "2L+"

But the output schema (TreatmentLineDetails) only captures treatment_lines (enum array) and evidence (free text). The actual numbers (median = 4, range 0–7) are consumed during reasoning but not persisted as structured fields.
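The discarded-numbers point can be made concrete with a tiny sketch of the prompt rule: the numeric median decides the bucket, and only the bucket survives to the output schema:

```ruby
# Sketch of the prompt rule: the median determines the line bucket,
# then the numbers themselves are discarded from the structured output.
def treatment_line_from_median(median_prior_lines)
  median_prior_lines >= 2 ? '3L+' : '2L+'
end

treatment_line_from_median(4) # => "3L+" (median 4 and range 0-7 are then lost)
treatment_line_from_median(1) # => "2L+"
```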
2. Prior therapy exposure is completely out of scope
The treatment line extraction prompt explicitly states:
> Out of scope: Dosing, endpoints, safety, biomarkers (unless they clarify line), efficacy stats.

Prior therapy exposure (e.g. “52% had prior taxane,” “required prior platinum,” “prior CAR-T allowed”) is not captured by any extraction step.
3. The efficacy view has no prior-line or prior-therapy columns from publications
vw_publication_efficacy_data exposes effective_line, treatment_settings, and raw_treatment_lines but has no min_prior_lines, max_prior_lines, or prior therapy fields. The trial efficacy view (vw_trial_efficacy_data) does have min_line and max_line from trial_eligibility_criteria, but the publication view has no equivalent.
Concrete examples
Example 1: publication 152908 (BOLD-100 in gastric cancer)
Abstract states:
“Patients had a median of 4 prior systemic therapies [0, 7], 1 with no prior therapy, 2 had 2 prior therapies, 5 with 3 prior therapies, and 13 patients with 4 or more prior therapies. 20/21 patients received prior platinum with 18/21 receiving prior FOLFOX/CAPOX.”
Current extraction result: treatment_lines: ["3L+"] — correct, but we lose:
- `min_prior_lines: 0`
- `max_prior_lines: 7`
- `median_prior_lines: 4`
- prior FOLFOX/CAPOX: 18/21 (86%)
Example 2: publication 162733 (sEphB4-HSA in mCRPC)
Abstract states:
“treatment with at least one second generation androgen receptor (AR)-targeted therapy but no more than three prior therapies for mCRPC” “received a median of three prior therapies (range 1-3)” “Ten patients received prior taxane for mCRPC or hormone sensitive prostate cancer”
Current extraction result: treatment_lines: ["2L+"] — correct, but we lose:
- `min_prior_lines: 1`
- `max_prior_lines: 3`
- `median_prior_lines: 3`
- prior AR-targeted therapy: 14/14 (100%, required)
Example 3: publication 53818 (PROfound — olaparib by prior taxane)
This is the paradigmatic case — the entire publication is organized around prior taxane use as a stratification factor. The abstract reports efficacy by prior taxane yes/no subgroups. The worksheet needs Prior Taxane Use: Yes/No (stratified).
Current extraction captures treatment_lines: ["2L+"] but does not capture that prior taxane is the defining subgroup variable.
Downstream impact
- The disease clinical evidence export cannot populate `Min Prior Lines`, `Max Prior Lines`, or `Prior Taxane Use` columns from publication data
- Researchers manually fill these from abstracts — exactly the kind of structured extraction the pipeline should automate
- Prior therapy context is clinically important for interpreting efficacy results (a drug showing ORR of 30% in a post-taxane population is very different from 30% in a treatment-naïve population)
- Without structured prior therapy data, comparative analyses across publications in the same disease are unreliable
What the issue is not
This is not a treatment line problem. Treatment line extraction works well and correctly maps abstracts to 1L, 2L+, 3L+, etc. The issue is that treatment line is a categorical bucket, while prior therapy context includes:
- numeric counts (min, max, median, range)
- specific therapy exposure flags
- exposure requirements (required, allowed, excluded)
This is also not a trial eligibility criteria problem. The trial side has prior_treatment_lines and indicated_prior_therapies, but these describe trial enrollment criteria, not the actual population characteristics reported in the publication abstract.
Prior therapy language in ~71K result publications:
| Pattern | Publications mentioning |
|---|---|
| Mentions median prior line count | 1,936 |
| Mentions prior line threshold (≥N) | 1,839 |
| Mentions prior line range | 883 |
| Mentions any specific prior therapy class | 1,458 |
Specific prior therapy class mentions (non-exclusive):
| Prior therapy class | Publications |
|---|---|
| Prior checkpoint/IO therapy | 572 |
| Prior platinum | 302 |
| Prior anti-VEGF | 241 |
| Prior radiation | 190 |
| Prior CDK4/6i | 152 |
| Prior hormonal/endocrine | 151 |
| Prior taxane | 148 |
| Prior CAR-T | 143 |
| Prior transplant | 84 |
| Prior HMA | 80 |
| Prior surgery | 55 |
| Prior PI/bortezomib | 54 |
| Prior IMiD/lenalidomide | 43 |
| Prior anthracycline | 39 |
| Prior fluoropyrimidine | 33 |
| Prior gemcitabine | 27 |
| Prior irinotecan | 21 |
| Prior bispecific | 12 |
| Prior BCG | 10 |
| Prior ADC | 3 |
Key observations:
- “Prior taxane” (148 publications) is just one instance of a general pattern — at least 15 therapy classes appear routinely
- The highest-frequency classes (checkpoint/IO, platinum, anti-VEGF) reflect current oncology practice where these are standard earlier-line therapies
- ~1,900 publications contain explicit numeric prior line counts that are currently consumed during treatment line reasoning but discarded
Spot checks
Publications with rich prior therapy context that is currently lost:
- `152908` (BOLD-100 in gastric cancer): median 4 prior therapies (range 0–7), 95% prior platinum — extracted as `3L+` only
- `162733` (sEphB4-HSA in mCRPC): median 3 prior therapies (range 1–3), 71% prior taxane, 100% prior AR-targeted — extracted as `2L+` only
- `53818` (PROfound olaparib): entire study stratified by prior taxane yes/no — extracted as `2L+` only, taxane context not captured
- `65484` (givastomig in GEC): median 3 prior lines, 74% prior PD-(L)1 inhibitor — extracted as `3L+` only
- `147778` (GSK2636771 in mCRPC): median 4 prior lines, 83% prior taxane — extracted as `3L+` only
Current semantic model vs what’s needed
The pipeline currently models treatment context as:
`trial_subgroups.treatment_lines → ["2L+"]` (categorical bucket)

The worksheet needs:

- `Treatment Line → 2L+` (categorical — already have)
- `Min Prior Lines → 1` (numeric — don’t have)
- `Max Prior Lines → 3` (numeric — don’t have)
- `Prior Taxane Use → Yes (71%)` (therapy exposure flag — don’t have)
- `Prior Platinum Use → Yes (95%)` (therapy exposure flag — don’t have)
- `Prior IO Use → No` (therapy exposure flag — don’t have)

The worksheet column is labeled “Prior Taxane Use” specifically, but the underlying data pattern is general: researchers track prior exposure to whatever therapy class is clinically relevant for the disease area. In breast cancer it’s taxane and anthracycline; in mCRPC it’s taxane and AR-targeted therapy; in myeloma it’s IMiD, PI, and anti-CD38; in lymphoma it’s CAR-T and bispecific.
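Once subgroups carry structured prior-therapy rows, deriving the worksheet cells is mechanical. A Ruby sketch, assuming the field names from the schema proposed later in this issue (exposure_percent is a hypothetical field, not part of the current schema):

```ruby
# Hypothetical derivation of a "Prior <class> Use" worksheet cell from
# structured prior-therapy rows. Field names are illustrative assumptions.
def prior_use_cell(prior_therapies, therapy_class)
  row = prior_therapies.find { |t| t[:therapy_class] == therapy_class }
  return 'No' if row.nil?

  pct = row[:exposure_percent]
  pct ? "Yes (#{pct.round}%)" : 'Yes'
end

subgroup = {
  min_prior_lines: 1,
  max_prior_lines: 3,
  prior_therapies: [
    { therapy_class: 'taxane',            exposure_percent: 71.0 },
    { therapy_class: 'endocrine_therapy', exposure_percent: 100.0 },
  ],
}
prior_use_cell(subgroup[:prior_therapies], 'taxane')   # => "Yes (71%)"
prior_use_cell(subgroup[:prior_therapies], 'platinum') # => "No"
```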
Open characterization questions
- Should we distinguish between required/allowed/excluded prior therapies, or just report exposure percentages?
- “At least one prior platinum” (required) vs “prior taxane was allowed” (optional) vs “52% had prior taxane” (reported)
- These carry different clinical meaning
- The `indicated_prior_therapies` optionality enum on the indications side uses: `must_have_received`, `progressed_on_after`, `not_previously_treated_with`, `After Failure Of`, `refractory_to`, `ineligible_for`, `inadequate_response_to`, `Intolerant to` — these are richer than what abstracts typically state, but the pattern is informative
- How should the persistence model represent scope when prior therapy context applies to the overall population but individual subgroups break it down differently?
Explored solution direction
Key design decisions from investigation
1. Subgroup-level, not publication-level
Prior therapy context should persist at the subgroup level (on trial_subgroups), not at the publication level. Evidence:
- Treatment line is already subgroup-level, and prior therapy context is tightly coupled to treatment line
- 1,827 publications have multiple disease subgroups with treatment lines; 119 of those have different treatment lines across subgroups (e.g. pub `1703`: “treatment-naive” subgroup at `1L` vs “previously treated” at `2L+`)
- When treatment lines differ across subgroups, prior therapy context necessarily differs too — a 1L subgroup has 0 prior lines while a 2L+ subgroup has ≥1
- The PROfound example (pub `53818`) shows prior taxane as a subgroup stratification variable — some subgroups are “prior taxane yes” and others “prior taxane no”
For publications where the abstract only states population-level prior therapy characteristics (the common case), all subgroups inherit the same values. The subgroup-level model handles both cases correctly.
2. Rename to TreatmentContextExtraction
The existing TreatmentLineExtraction should be renamed to TreatmentContextExtraction (or similar) to reflect its expanded scope. The task already reads all prior therapy language for treatment line reasoning — it just discards the structured details. Expanding the output schema is natural.
This is not “mixing concerns” — treatment line, prior line counts, and prior therapy exposure are all facets of the same clinical context question: “Where does this population sit in the treatment sequence?”
3. Strict enum for therapy_class + free text for therapy_name (two-field design)
The key design insight is separating what the abstract says from what we query on:
- `therapy_name: "taxane-based chemotherapy"` ← free text, what the abstract says (evidence)
- `therapy_class: "taxane"` ← strict enum, what we filter/query on

This avoids the disease_stages antipattern in ParticipationCriterion where an initial predefined list grew unbounded through LLM and import drift, producing duplicates like Stage I / Stage 1 / Stage IA with no normalization layer.
The therapy_class enum is fixed in the schema. The LLM must pick from the list or use other. If other accumulates a meaningful cluster over time, that’s signal to add a new enum value — a conscious schema change, not drift.
The enum covers ~20 therapy classes based on publication frequency analysis:
| therapy_class | Pubs mentioning | Example abstract phrases |
|---|---|---|
| checkpoint_inhibitor | 865 | “prior anti-PD-1”, “prior IO”, “prior pembrolizumab” |
| surgery | 572 | “prior resection”, “prior nephrectomy” |
| transplant | 533 | “prior HSCT”, “prior auto-SCT”, “prior allo-SCT” |
| platinum | 506 | “prior platinum”, “prior cisplatin”, “prior carboplatin” |
| endocrine_therapy | 477 | “prior ARPI”, “prior endocrine therapy”, “prior enzalutamide” |
| anti_vegf | 361 | “prior bevacizumab”, “prior anti-VEGF”, “prior anti-angiogenic” |
| taxane | 344 | “prior taxane”, “prior docetaxel”, “prior paclitaxel” |
| radiation | 340 | “prior radiation”, “prior radiotherapy”, “prior chemoradiation” |
| car_t | 313 | “prior CAR-T”, “prior CAR T-cell therapy” |
| cdk_inhibitor | 204 | “prior CDK4/6 inhibitor”, “prior palbociclib” |
| anti_her2 | 184 | “prior trastuzumab”, “prior T-DXd”, “prior pertuzumab” |
| imid | 141 | “prior lenalidomide”, “prior IMiD”, “prior pomalidomide” |
| hma | 121 | “prior azacitidine”, “prior HMA”, “prior decitabine” |
| proteasome_inhibitor | 121 | “prior bortezomib”, “prior PI”, “prior carfilzomib” |
| anthracycline | 90 | “prior anthracycline”, “prior doxorubicin” |
| fluoropyrimidine | 75 | “prior 5-FU”, “prior capecitabine” |
| bispecific | 48 | “prior bispecific antibody” |
| anti_cd38 | 43 | “prior daratumumab”, “prior anti-CD38” |
| adc | 38 | “prior ADC”, “prior antibody-drug conjugate” |
| bcg | 16 | “prior BCG” |
| chemotherapy | — | “prior chemotherapy” (generic, when no specific class stated) |
| other | — | Catch-all for anything not above |
The long tail drops off fast — only 20 classes cover virtually all clinically meaningful prior therapy mentions in oncology/hematology publications.
Compound semantics are manageable: most publications (2,101 out of 2,349 mentioning specific priors) reference only a single prior therapy class. Only 231 mention two, and 17 mention three or more.
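The strict-enum guard described in this section can be sketched as a one-line normalization (the class list is abbreviated from the table above; the fallback-to-other behavior is the design decision, the code itself is illustrative):

```ruby
# Sketch of the strict-enum guard: therapy_class must come from the fixed
# list; anything else collapses to "other" instead of drifting into the data.
THERAPY_CLASSES = %w[
  checkpoint_inhibitor platinum taxane anti_vegf radiation car_t
  endocrine_therapy cdk_inhibitor chemotherapy other
].freeze

def normalize_therapy_class(value)
  THERAPY_CLASSES.include?(value) ? value : 'other'
end

normalize_therapy_class('taxane')           # => "taxane"
normalize_therapy_class('Taxane therapy')   # => "other" (drift caught)
```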
4. Compound prior therapy semantics (“prior X and Y”, “prior X or Y”)
Real abstract patterns:
- Conjunctive (AND): `"who received prior taxane, endocrine therapy, CDK4/6 inhibitor, and 2-4 prior chemotherapies"` (TROPiCS-02) — all four are required
- Disjunctive (OR): `"prior platinum and/or fluoropyrimidine chemotherapy"` — either qualifies
- Mixed: `"prior checkpoint inhibitor and platinum-based chemotherapy"` — both required
The simplest model that handles all cases: extract each therapy as a separate row in the prior_therapies array. Each row has its own exposure_status. For compound requirements like TROPiCS-02, that becomes:
```json
[
  { "therapy_name": "taxane",            "exposure_status": "required", "evidence": "..." },
  { "therapy_name": "endocrine therapy", "exposure_status": "required", "evidence": "..." },
  { "therapy_name": "CDK4/6 inhibitor",  "exposure_status": "required", "evidence": "..." }
]
```

We do NOT need to model the logical relationship (AND/OR) between therapies explicitly. Each therapy entry stands on its own with its exposure status. This is sufficient for worksheet export ("Prior Taxane Use: Yes") and for filtering ("show publications requiring prior CDK4/6i").
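The filtering use case can be sketched as a plain-Ruby predicate over this per-row shape (hypothetical helper and illustrative data — not actual pipeline code):

```ruby
# Hypothetical filter over the per-row prior_therapies model above.
# The hash shape mirrors the JSON array; names are illustrative.
def requires_prior?(prior_therapies, therapy_name)
  prior_therapies.any? do |t|
    t["therapy_name"] == therapy_name && t["exposure_status"] == "required"
  end
end

# TROPiCS-02-style compound requirement, one row per therapy:
tropics_02 = [
  { "therapy_name" => "taxane",            "exposure_status" => "required" },
  { "therapy_name" => "endocrine therapy", "exposure_status" => "required" },
  { "therapy_name" => "CDK4/6 inhibitor",  "exposure_status" => "required" }
]
```

A query like "show publications requiring prior CDK4/6i" then reduces to calling the predicate per publication — no AND/OR modeling needed.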
Proposed schema
Rename TreatmentLineExtraction → TreatmentContextExtraction
```ruby
class Subgroup
  include StoreModel::Model
  include DataTasks::JsonSchema

  desc 'ID of the subgroup from the input'
  attribute :id, :integer
  attribute :subgroup_type, :string, ignore: true
  attribute :subgroup_value, :string, ignore: true

  # Existing
  attribute :treatment_lines, ArrayType.new, enum: Indication::TREATMENT_LINES
  desc 'Textual evidence or reasoning that supports the treatment line(s)'
  attribute :evidence, :string

  # New: prior line counts
  desc 'Minimum number of prior lines of therapy for this population (from eligibility criteria or reported range). Null if not stated.'
  attribute :min_prior_lines, :integer

  desc 'Maximum number of prior lines of therapy for this population. Null if not stated.'
  attribute :max_prior_lines, :integer

  desc 'Median number of prior lines of therapy, if explicitly stated in the abstract.'
  attribute :median_prior_lines, :integer

  # New: prior therapy exposures
  attribute :prior_therapies, PriorTherapyExposure.to_array_type
end

class PriorTherapyExposure
  include StoreModel::Model
  include DataTasks::JsonSchema

  THERAPY_CLASSES = %w[
    checkpoint_inhibitor surgery transplant platinum endocrine_therapy
    anti_vegf taxane radiation car_t cdk_inhibitor anti_her2 imid hma
    proteasome_inhibitor anthracycline fluoropyrimidine bispecific
    anti_cd38 adc bcg chemotherapy other
  ].freeze

  desc 'Normalized therapy class for filtering/querying. Must be one of the enum values.'
  attribute :therapy_class, :string # enum: THERAPY_CLASSES

  desc 'Therapy name as stated in the abstract (e.g. "taxane-based chemotherapy", "prior anti-PD-1 therapy", "lenalidomide"). Preserves original phrasing for evidence.'
  attribute :therapy_name, :string

  desc 'How this prior therapy relates to the study population'
  attribute :exposure_status, :string # enum: required, allowed, excluded, reported

  desc 'Percentage of patients with this prior exposure, if reported (e.g. 71.4). Null if not stated.'
  attribute :exposure_percentage, :float

  desc 'Evidence quote from the abstract'
  attribute :evidence, :string
end
```

Persistence

New columns on trial_subgroups:
- `min_prior_lines` (integer, nullable)
- `max_prior_lines` (integer, nullable)
- `median_prior_lines` (integer, nullable)
Prior therapy exposures persist in trial_subgroups.llm_data['prior_therapies'] (JSONB array), consistent with how treatment line evidence is already stored in trial_subgroups.llm_data['treatment_lines'].
The efficacy view would expose min_prior_lines and max_prior_lines alongside effective_line. The emerging clinical data query would format prior therapies for export.
Backfill
This requires a full backfill since we’re expanding the extraction schema. The renamed TreatmentContextExtraction task re-runs on all result publications that have subgroups.
Options to reduce cost:
- Only backfill publications where the abstract contains prior therapy language (~3K–5K publications based on regex estimates) for the prior therapy fields
- Use `gpt-5-mini` for the backfill since the extraction is well-defined
- Batch processing with the existing `DataTasks::Task` infrastructure
- The prior line count fields can be extracted in the same pass as treatment lines since the LLM already reasons about them
Estimated cost: ~$31 batched with gpt-5-mini for a full backfill of all 62K publications with subgroups. No regex pre-filter — the LLM returns empty arrays when no prior therapy context exists, and the cost per publication ($0.001) makes filtering unnecessary.
Export formatting
For the worksheet columns:
- `Min Prior Lines` → `trial_subgroups.min_prior_lines` (direct)
- `Max Prior Lines` → `trial_subgroups.max_prior_lines` (direct)
- `Prior Taxane Use` → derived from the `llm_data['prior_therapies']` array, filtering for `therapy_class = 'taxane'`:
  - If `exposure_status = 'required'` → `Yes (required)`
  - If `exposure_status = 'reported'` with percentage → `Yes (71%)`
  - If `exposure_status = 'excluded'` → `No (excluded)`
  - If `exposure_status = 'allowed'` → `Allowed`
  - If no entry with `therapy_class = 'taxane'` → `NR`
The worksheet currently labels this column “Prior Taxane Use” but the extraction captures all therapy classes via the strict therapy_class enum. The export filters by enum value — therapy_class = 'taxane' for this column, therapy_class = 'checkpoint_inhibitor' for “Prior IO Use”, etc. No schema changes needed to add new worksheet columns for different disease areas.
The therapy_name free text field preserves the original abstract phrasing for display and evidence review (e.g. “prior docetaxel-based chemotherapy” rather than just “taxane”).
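The column derivation above can be sketched as a small helper (hypothetical method name and data shape mirroring `llm_data['prior_therapies']` — the real formatting lives in the emerging clinical data query):

```ruby
# Sketch of the "Prior <class> Use" export mapping described above.
# Assumes entries shaped like llm_data['prior_therapies']; illustrative only.
def prior_therapy_use(prior_therapies, therapy_class)
  entry = prior_therapies.find { |t| t["therapy_class"] == therapy_class }
  return "NR" if entry.nil?

  case entry["exposure_status"]
  when "required" then "Yes (required)"
  when "reported"
    pct = entry["exposure_percentage"]
    pct ? "Yes (#{pct.round}%)" : "Yes"
  when "excluded" then "No (excluded)"
  when "allowed"  then "Allowed"
  end
end
```

Swapping the `therapy_class` argument yields the other worksheet columns ("Prior IO Use" via `checkpoint_inhibitor`, etc.) with no further changes.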
Solution applied
Implemented as the TreatmentContextExtraction task, which expands the former TreatmentLineExtraction to extract prior therapy context alongside treatment lines in a single LLM call.
Schema changes
New columns on trial_subgroups:
- `min_prior_lines` (integer, nullable) — minimum number of prior lines of therapy
- `max_prior_lines` (integer, nullable) — maximum number of prior lines
- `median_prior_lines` (integer, nullable) — median number of prior lines
New JSONB key in trial_subgroups.llm_data:
- `prior_therapies` — array of `PriorTherapyExposure` objects, each with:
  - `therapy_class` — strict enum of 22 values (`checkpoint_inhibitor`, `taxane`, `platinum`, `endocrine_therapy`, `anti_vegf`, `car_t`, `cdk_inhibitor`, `anti_her2`, `imid`, `hma`, `proteasome_inhibitor`, `anthracycline`, `fluoropyrimidine`, `bispecific`, `anti_cd38`, `adc`, `bcg`, `surgery`, `transplant`, `radiation`, `chemotherapy`, `other`)
  - `therapy_name` — free text preserving original abstract phrasing
  - `exposure_status` — enum: `required`, `allowed`, `excluded`, `reported`
  - `exposure_percentage` — float, nullable (e.g. 71.4 for "71% had prior taxane")
  - `evidence` — quote from abstract
Code changes
- `app/tasks/publications_llm_classification/treatment_context_extraction.rb` — renamed from `treatment_line_extraction.rb`. Expanded `Subgroup` schema adds `min_prior_lines`, `max_prior_lines`, `median_prior_lines`, and the `prior_therapies` array. System prompt extended with prior line count extraction rules and therapy class mapping with the 22-value enum.
- `app/tasks/publications_llm_classification/post_process.rb` — updated to write the `min_prior_lines`, `max_prior_lines`, `median_prior_lines` columns and the `prior_therapies` JSONB key during subgroup creation.
- `db/views/vw_publication_efficacy_data_v09.sql` — added `min_prior_lines`, `max_prior_lines`, `median_prior_lines` from `trial_subgroups` to the materialized view output.
- `app/queries/tpp/emerging_clinical_data_query.rb` — added `min_prior_lines`, `max_prior_lines`, `median_prior_lines` to result rows. Added a `prior_therapy_class` parameter; when specified, includes a `prior_therapy_use` column formatted as: `required` → "Yes (required)", `reported` with percentage → "Yes (71%)", `excluded` → "No (excluded)", `allowed` → "Allowed", no entry → "NR".
- `lib/tasks/one_off/backfill_prior_therapy_context.thor` — self-contained one-off backfill task processing all ~62K publications with subgroups (no regex pre-filter). Uses `gpt-5-mini`. Only writes prior therapy fields (`min_prior_lines`, `max_prior_lines`, `median_prior_lines`, `llm_data['prior_therapies']`) — does not overwrite existing `treatment_lines`. Delete when backfill is complete.
- `lib/tasks/one_off/cleanup_prior_therapy_values.thor` — one-off cleanup that nulls out invalid values from the backfill. Delete when done.
- Data validation — `sanitize_line_count` added to `treatment_context_extraction.rb` and `post_process.rb` to reject negative sentinel values (-1, -999) the LLM uses instead of null. `sanitize_prior_therapies` rejects negative `exposure_percentage` values.
Backfill results (2026-03-12)
- 62,008 publications processed via `gpt-5-mini` (synchronous)
- 40,278 subgroups have non-zero prior line counts
- 61,895 subgroups have at least one prior therapy entry
Post-backfill cleanup: LLM used sentinel values (-1, -999, -2147483648) instead of null for ~9K subgroups. Additionally ~25K subgroups had median outside [min, max] range. All cleaned via cleanup_prior_therapy_values.thor.
Spot-check verification (2026-03-12)
| Publication | Expected | Extracted | Status |
|---|---|---|---|
| 152908 (BOLD-100, gastric) | min=0, max=7, median=4, 95% platinum | min=0, max=7, median=4, platinum 95.2% | Correct |
| 162733 (sEphB4-HSA, mCRPC) | min=1, max=3, median=3, 71% taxane, 100% AR-targeted | min=1, max=3, median=3, taxane 71.4%, endocrine_therapy required | Correct |
| 53818 (PROfound, olaparib) | Stratified by prior taxane yes/no | "Prior taxane Yes" subgroups: taxane required 100%; "Prior taxane No": taxane excluded | Correct |
| 147778 (GSK2636731, mCRPC) | median=4, 83% taxane | median=4, taxane 83% | Correct |
Known limitations
- Subgroup-defining therapies: The LLM sometimes classifies subgroup-defining therapy characteristics (e.g. "Prior taxane Yes") as `reported` instead of `required`/`excluded`. The full backfill showed inconsistency vs spot-check runs — likely due to `temperature: 1` (a gpt-5-mini constraint). A prompt improvement could help but is not blocking.
- Endocrine therapy ambiguity in mCRPC: Background ADT (universally required) and novel AR agents (often excluded) both map to `endocrine_therapy`, creating apparent contradictions (both `required` and `excluded` on the same subgroup). Could be addressed by splitting into separate therapy classes in a future iteration.
- `max_prior_lines` zero-sentinel contamination: See Issue 8. The LLM outputs `0` instead of null for unstated max prior lines, producing 124K unusable values. This is a separate issue from the prior therapy extraction itself (which works correctly).
Validation (2026-03-13)
Coverage confirmed:
- 150,689 subgroups have `min_prior_lines` (95%)
- 149,952 have `max_prior_lines` (94%) — but see Issue 8 for a data quality concern
- 124,264 have `median_prior_lines` (78%)
- 61,895 have at least one `prior_therapies` entry (39%)
Prior therapy class enum distribution is healthy. All 22 enum values are used. Top classes: chemotherapy (19.9K), surgery (6.7K), platinum (6K), checkpoint_inhibitor (5.5K), endocrine_therapy (5.5K). `other` has 25.4K entries (26%) — high but acceptable given the long tail of therapy types not covered by the named classes.
Tracker examples all re-verified correct:
- Pub 152908 (BOLD-100): min=0, max=7, median=4, platinum 95.2%, fluoropyrimidine 85.7%
- Pub 162733 (sEphB4-HSA): min=1, max=3, median=3, taxane 71.4%, endocrine_therapy required
- Pub 53818 (PROfound): "Prior taxane Yes" subgroups have taxane reported, "Prior taxane No" subgroups have taxane reported — exposure_status is `reported` rather than `required`/`excluded` (see known limitation above)
- Pub 147778 (GSK2636731): median=4, taxane 83%
Report-readiness: Prior therapy class data and min_prior_lines are usable for reports. max_prior_lines is not usable without the cleanup described in Issue 8.
6. Data cutoff date is not extracted from publication abstracts
Short summary
The disease clinical evidence worksheet has a Data Cut column that records the date when trial data collection was frozen for analysis (e.g. Jun 26, 2024, Mar 20, 2025).
Data cutoff date is not currently extracted or persisted as structured data. The pipeline already reads this language during endpoint and treatment line extraction but discards it. Data cutoff dates appear in ~6,100 publication abstracts with an extractable date in ~3,800 of those.
Where this sits in the current pipeline
Current publication flow:
- `classify_publications` extracts endpoints and adverse events from the abstract. The system prompt references data cutoff incidentally (e.g. for maturity determination) but does not extract the date.
- The `not_reached` boolean on outcome measures captures the consequence of an immature data cutoff but not the cutoff date itself.
- `is_partial_result` / `is_partial` flags on publications signal interim results but not the specific cutoff date.
Relevant code paths:
- `app/tasks/publications_llm_classification/task.rb` — main extraction prompt
- `app/tasks/publications_llm_classification/details.rb` — `Details` schema (has `not_reached` but no cutoff date)
- `db/views/vw_publication_efficacy_data_v07.sql` — no data cutoff column
- `app/queries/tpp/emerging_clinical_data_query.rb` — no data cutoff in output
Exact restriction causing the gap
1. No extraction schema field for data cutoff date
The Details schema in details.rb captures endpoints, arms, adverse events, study design, and partial result flags — but has no data_cutoff_date field. The LLM reads the cutoff date in the abstract for reasoning (e.g. to determine endpoint maturity via not_reached) but has no output slot to persist it.
2. The efficacy view and export have no data cutoff column
vw_publication_efficacy_data exposes effective_line, treatment_settings, dose, but has no data_cutoff_date. The CSV export (emerging_clinical_data_report.rb) has 37 columns but none for data cutoff.
Concrete examples
Example 1: publication 241657 (belzutifan + lenvatinib in RCC)
Abstract states:
“for the first (IA1; data cutoff Jun 26, 2024) and second (IA2; data cutoff Apr 9, 2025) interim analysis”
This publication reports two separate data cutoffs for two interim analyses. The worksheet needs at minimum the most recent cutoff date (Apr 9, 2025). Neither date is captured.
Example 2: publication 116878 (BURAN — buparlisib in HNSCC)
Abstract states:
“data cut-off date of 15 March 2025, with a median follow up of 27 months”
Cutoff date (2025-03-15) is clearly stated. Not captured. The worksheet Data Cut column for this publication would be 15 Mar 2025.
Example 3: publication 240450 (BREAKWATER — encorafenib in mCRC)
Abstract states:
“At data cutoff (Mar 1, 2025), EC+FOLFIRI demonstrated a clinically meaningful and statistically significant improvement…”
Cutoff date (2025-03-01) is stated parenthetically. Not captured.
Example 4: publication 191190 (pembrolizumab + nab-paclitaxel in HNSCC)
Abstract states:
“data cutoff (February 27, 2025; median follow-up 23 months)”
Cutoff date (2025-02-27) is clearly stated. Not captured.
Why this matters downstream
Data cutoff date is clinically essential for interpreting results. When the same trial publishes multiple analyses, each with a different cutoff date, the cutoff distinguishes which analysis the reported endpoints belong to. Without it:
- The worksheet `Data Cut` column cannot be populated
- Analysts cannot distinguish interim from final analysis results for the same trial
- Publications reporting updated OS at longer follow-up cannot be correctly ordered or attributed
What the issue is not
This is not a not_reached problem. The not_reached flag captures whether a median was estimable at all. Data cutoff date describes when the analysis was performed, not whether an endpoint was reached.
This is also not a publication dating problem. publication_date is when the paper was presented or published. Data cutoff date is when the trial database was locked for that analysis — typically months before publication. The two dates serve different purposes and should not be conflated.
Across all publications with abstracts:
| Signal | Publications | % of all pubs with abstracts (194K) |
|---|---|---|
| Mentions data cutoff language | 6,148 | 3.2% |
| Data cutoff with extractable date (month/year or full date) | 3,849 | 2.0% |
| Single cutoff mention | 5,314 | 86% of cutoff pubs |
| Multiple cutoff mentions (2+) | 834 | 14% of cutoff pubs |
For the target worksheet diseases specifically:
| Disease | Total pubs | With data cutoff |
|---|---|---|
| Colorectal Cancer | 2,878 | 208 (7%) |
| HNSCC | 974 | 83 (9%) |
| NSCLC | 4,079 | 623 (15%) |
Key observations:
- Data cutoff dates are most common in NSCLC publications (15%), likely reflecting the higher proportion of large randomized trials in lung cancer
- ~14% of publications with cutoff language mention multiple cutoffs (e.g. different interim analyses), confirming that subgroup-level persistence is needed
- ~63% of publications with cutoff language include an extractable date with at least month + year precision
Data cutoff date formats in abstracts
From spot-checking 30 recent abstracts with data cutoff language:
| Format | Example | Frequency |
|---|---|---|
| Month DD, YYYY | data cutoff date of October 27, 2025 | Common |
| Mon DD, YYYY | data cutoff Jun 26, 2024 | Common |
| DD Mon YYYY | data cut-off (18 Sept 2025) | Common |
| Mon YYYY (no day) | data cut-off (July 2025) | Moderate |
| MM/DD/YYYY | data cut-off (06/13/2025) | Rare |
| Month YYYY only | data cutoff (Oct 2025) | Moderate |
Some abstracts state only month + year without a specific day. The LLM should extract whatever precision is available.
Some publications report multiple cutoff dates for interim analyses (e.g. publication 241657: IA1 cutoff Jun 26, 2024 and IA2 cutoff Apr 9, 2025). For the worksheet, the most recent cutoff associated with the reported results should be used.
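The precision handling above can be sketched in plain Ruby (hypothetical helper — in the proposed design the LLM emits YYYY-MM-DD directly; this only illustrates how the observed formats map to the convention):

```ruby
require "date"

# Formats observed in abstracts, most specific first. Month-only formats
# fall through to day 1 (Date.strptime defaults a missing day to 1),
# matching the YYYY-MM-01 convention above.
CUTOFF_FORMATS = [
  "%B %d, %Y", # October 27, 2025
  "%b %d, %Y", # Jun 26, 2024
  "%d %b %Y",  # 18 Sep 2025
  "%m/%d/%Y",  # 06/13/2025
  "%B %Y",     # July 2025
  "%b %Y"      # Oct 2025
].freeze

def normalize_cutoff(raw)
  CUTOFF_FORMATS.each do |fmt|
    begin
      return Date.strptime(raw, fmt).iso8601
    rescue ArgumentError
      next # try the next format
    end
  end
  nil # no recognizable date
end
```

A regex-first approach would need one pattern per format; delegating the recognition to the LLM (as proposed below for the backfill) sidesteps the format zoo entirely.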
Existing partial signals
The system already captures related but insufficient signals:
- `not_reached` (boolean): Whether a time-to-event median was estimable. Captures endpoint maturity but not the temporal context.
- `is_partial_result` (boolean): Whether the publication reports interim results. Related to data cutoff (interim = earlier cutoff) but does not carry the date.
- `publication_date`: When the paper was published. Distinct from data cutoff — typically the cutoff is 3–12 months before publication.
- LLM evidence text: Data cutoff dates appear embedded in `llm_data` observation/evidence free text (~2,400 publications) but are not structured or queryable.
Open characterization questions
- When a publication reports multiple interim analyses with different cutoffs (e.g. IA1, IA2) and different subgroups share the same cutoff, should the cutoff be denormalized onto each subgroup or stored once with an analysis label?
- Should the LLM extract an analysis label (e.g. “IA1”, “IA2”, “primary analysis”) alongside the cutoff date?
Explored solution direction
1. Subgroup-level persistence, not publication-level
Data cutoff date belongs on trial_subgroups, not on publications. Evidence:
- ~14% of publications with cutoff language mention multiple cutoffs (e.g. pub 241657: PFS cutoff Jun 2024, OS cutoff Apr 2025 for different interim analyses)
- Different subgroups or endpoint sets within the same publication can reference different analysis cutoffs
- The common case (~86%) is a single cutoff — all subgroups inherit it, so subgroup-level handles both cases
This is consistent with how treatment_lines already persists on trial_subgroups.
2. Bake into classify_publications for going-forward extraction
The classify_publications task (PublicationsLlmClassification::Task) already reads the full abstract and encounters data cutoff language naturally. Add data_cutoff_date to the SubgroupOutcome schema in details.rb:
```ruby
class SubgroupOutcome
  # ... existing fields ...

  desc 'Data cutoff date for results reported under this subgroup, in ISO 8601 format (YYYY-MM-DD). ' \
       'Use YYYY-MM-01 when only month and year are stated. Null if not mentioned in the abstract.'
  attribute :data_cutoff_date, :string, nullable: true
end
```

System prompt addition in task.rb (extend section 3, Endpoints and Outcome Measures):

```
** Data Cutoff Date:
- If the abstract states a data cutoff date for the results in this subgroup (e.g. "data cutoff Jun 26, 2024", "data cut-off date was Mar 20, 2025"), extract it as data_cutoff_date in YYYY-MM-DD format.
- Use YYYY-MM-01 when only month and year are given.
- If the publication reports a single cutoff for all results, apply it to every subgroup_outcome_measures entry.
- If different analyses have different cutoffs, assign each cutoff to the subgroup(s) whose results it covers.
- Leave null if not explicitly stated — do not infer from publication date.
```

Why this is better than a separate task:
- Zero marginal cost — one nullable string field per subgroup entry adds negligible tokens
- The LLM already has the full abstract in context and already reasons about data maturity (`not_reached`, `is_partial_result`)
- No new task class, no new Thor command, no new workflow step for going-forward publications
- Schema stays co-located with the subgroup outcome data it describes
3. Post-processing: propagate to trial_subgroups
post_process_publications already creates/updates trial_subgroups from llm_data['subgroup_outcome_measures']. Add data_cutoff_date to the attributes written during post-processing. This requires a migration to add data_cutoff_date (date, nullable) to the trial_subgroups table.
4. Backfill task for all existing result publications
A separate backfill task extracts data_cutoff_date from all existing result publications that already have llm_data['subgroup_outcome_measures'] (~63K publications). No regex pre-filtering — the LLM decides whether a cutoff date is present, not a pattern match. Regex would silently miss publications that state cutoff dates in unexpected phrasing.
The backfill task:
- Reads the publication abstract and its existing `trial_subgroups` records
- Extracts `data_cutoff_date` per subgroup
- Writes directly to `trial_subgroups.data_cutoff_date` and `trial_subgroups.llm_data`, same pattern as `TreatmentContextExtraction` (which finds each `trial_subgroup` by ID and updates in place)
- Does NOT re-run `post_process_publications` — that would destroy and recreate all trial_subgroups, wiping treatment lines and disease adjudication data
- Runs as a one-time Thor task, similar in pattern to `adjudicate_subgroup_diseases` (Issue 1)
Estimated cost: ~$30–50 with gpt-5-mini for ~63K publications (single nullable date field per subgroup, minimal output tokens).
After backfill, the going-forward path (classify_publications) handles all new publications automatically.
Solution applied
Status: Implemented — backfill complete (validated 2026-03-13)
All code changes are in place. Backfill has been run.
Going-forward extraction (classify_publications)
- Added `data_cutoff_date` (string, nullable) to `SubgroupOutcome` in `details.rb`
- Added data cutoff extraction instructions to the system prompt in `task.rb`
- Updated `post_process.rb` to propagate `data_cutoff_date` from `llm_data['subgroup_outcome_measures']` to `trial_subgroups.data_cutoff_date`
New publications processed through classify_publications → post_process_publications will automatically have data cutoff dates extracted and persisted.
Schema and view
- Added `data_cutoff_date` (date, nullable) column to `trial_subgroups`
- Added `data_cutoff_date` to `vw_publication_efficacy_data` (v10) sourced from `trial_subgroups.data_cutoff_date`
- Added `Data Cut` column to `EmergingClinicalDataQuery` output and CSV export
Backfill task
One-off backfill task at lib/tasks/one_off/backfill_data_cutoff_dates.thor extracts data cutoff dates from all existing result publications with trial_subgroups. No regex pre-filter — all ~62K publications are sent to gpt-5-mini (estimated cost ~$6-10). The LLM returns null for publications without cutoff language.
Run with:
```shell
bundle exec thor one_off:backfill_data_cutoff_dates:extract --batched --parallelism=4
```

Spot-check validation

Tested on 6 publications with known cutoff dates (5 extracted correctly, 1 correctly returned null for an abstract that says "at data cut-off" without stating the date):
| Pub ID | Abstract says | Extracted | Correct? |
|---|---|---|---|
| 116878 | "data cut-off date of 15 March 2025" | 2025-03-15 | Yes |
| 163930 | "data cutoff" Feb 4, 2021 | 2021-02-04 | Yes |
| 190005 | "at data cut-off" (no date) | null | Yes |
| 190016 | cutoff Sept 16, 2024 | 2024-09-16 | Yes |
| 190620 | "data cutoff, 01 Aug 2025" | 2025-08-01 (all 14 subgroups) | Yes |
| 190677 | "data cutoff (07 Oct 24)" | 2024-10-07 | Yes |
Validation (2026-03-13)
Coverage: 30,369 subgroups across 11,203 distinct publications have data_cutoff_date populated (19.1% of all publication subgroups). This exceeds the pre-implementation estimate of ~6K abstracts with cutoff language, confirming the backfill has been run.
Tracker spot-check pubs re-verified:
| Pub | Expected | Actual | Correct? |
|---|---|---|---|
| 116878 (BURAN) | 2025-03-15 | 2025-03-15 | Yes |
| 190016 (SERENA-1) | 2024-09-16 | 2024-09-16 | Yes |
| 190620 (POD1UM-303) | 2025-08-01 (all 17 subgroups) | 17/17 populated | Yes |
| 190677 (CAPItello-281) | 2024-10-07 | 2024-10-07 | Yes |
| 190005 (TROPION-Breast01) | null (no date in text) | null | Yes |
Tracker examples 241657 and 240450 have zero subgroups — they are newly ingested ASCO 2025 publications (created 2026-03-10) that haven’t been through classify_publications yet. Once the publication workflow runs, cutoff dates will be extracted automatically by the going-forward path.
Minor data quality issues:
- 9 subgroups have cutoff dates before 2000 — verified as legitimate (e.g. pub 144506 is a 1988 pilot study in Qidong County).
- 2 subgroups (pub 109543) have cutoff date 2028-12-01 — a hallucinated future date. Should be cleaned.
7. AE grade category enum is too coarse — grade 1-2 rows misclassified as all_grade
Short summary
The grade_category field on adverse_events only supports two values: all_grade and grade_gte3. Many publication abstracts report AEs in finer grade buckets (grade 1-2, grade 3-4, grade 5/fatal, SAE). When forced into the binary, grade 1-2 rows get shoehorned into all_grade, which is incorrect — true all-grade incidence includes all grades, while grade 1-2 is a strict subset.
This produces ~50 AE pairs where the grade_gte3 value is higher than the all_grade value for the same AE name, which is counter-intuitive but affects <0.3% of publications with AE data.
Scale and severity: low
- 36,545 publications have AE rows with `grade_category`
- 312 publications (0.9%) have the same AE name under both grade categories
- 50 AE pairs across those 312 pubs show inverted values (grade_gte3 > all_grade)
- 92 of the 312 are in target disease areas
Current misclassification breakdown from observation text analysis:
| Observation pattern | Classified as all_grade | Classified as grade_gte3 | Issue |
|---|---|---|---|
| Explicitly “all grade” / “any grade” | 5,143 | 401 | 401 wrong |
| Grade 1-2 specific | 7,883 | 221 | 7,883 should be grade_1_2 |
| Grade 3-4 specific | 1,679 | 15,204 | 1,679 wrong |
| Grade 5 / fatal | 176 | 3,803 | Should be own category |
| SAE context | 1,287 | 1,804 | Should be own category |
| No observation | 20,152 | 16,413 | Ambiguous |
| Other | 35,053 | 13,173 | Mixed |
The grade 1-2 → all_grade misclassification (7,883 rows) is the largest single issue. The grade ≥3 column is mostly correct, so the clinically important safety signal is preserved. The all-grade column underreports in affected cases.
Explored solution direction
Expand grade_category to a richer enum and re-run the backfill:
```ruby
# Current: all_grade, grade_gte3
# Proposed:
attribute :grade_category, :string
# enum: all_grade, grade_1_2, grade_gte3, grade_3_4, grade_5_fatal, sae
```

| Value | Meaning | Ranked summary use |
|---|---|---|
| all_grade | True all-grade / any-grade / unspecified | "Most Frequent AE All Grade" |
| grade_1_2 | Grade 1-2 only (low-grade bucket) | Excluded from ranked summaries |
| grade_gte3 | Grade ≥3 / grade 3+ | "Most Frequent AE >=Gr3" |
| grade_3_4 | Grade 3-4 specifically | Treated same as grade_gte3 for ranking |
| grade_5_fatal | Grade 5 / fatal / treatment-related death | Separate or excluded |
| sae | Serious adverse event (any grade) | Excluded from ranked summaries |
The ranked summary helper would then:
- "Most Frequent AE All Grade" → filter to `all_grade` only (not `grade_1_2` or `sae`)
- "Most Frequent AE >=Gr3" → filter to `grade_gte3` + `grade_3_4`
This eliminates the inversion problem because grade 1-2 and SAE rows no longer contaminate the all-grade bucket.
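The two filters can be sketched as follows (hypothetical helper and sample rows — the real logic lives in the ranked summary helper):

```ruby
# Buckets from the proposed enum. Per the rules above, grade_1_2 and sae
# never feed the all-grade ranking, and grade_3_4 counts toward >=Gr3.
ALL_GRADE_BUCKET = %w[all_grade].freeze
GTE3_BUCKET      = %w[grade_gte3 grade_3_4].freeze

def most_frequent_aes(rows, bucket, top_n: 3)
  rows.select  { |r| bucket.include?(r[:grade_category]) }
      .sort_by { |r| -r[:percentage] }
      .first(top_n)
      .map     { |r| r[:name] }
end

# Illustrative AE rows (not real publication data):
rows = [
  { name: "nausea",      grade_category: "all_grade",  percentage: 62.0 },
  { name: "nausea",      grade_category: "grade_1_2",  percentage: 55.0 },
  { name: "fatigue",     grade_category: "all_grade",  percentage: 48.0 },
  { name: "neutropenia", grade_category: "grade_gte3", percentage: 41.0 },
  { name: "anemia",      grade_category: "grade_3_4",  percentage: 12.0 }
]
```

Here the grade_1_2 nausea row no longer competes with the true all-grade rate, which is exactly the inversion fix.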
Cost: Re-running the full AE grade backfill at ~$10 with gpt-5-mini. The schema change to the extraction prompt and the AdverseEventGradeBackfill task already exist — just need to expand the enum, update the prompt, and re-run.
Downstream changes: Update AdverseEvent model normalization, ranked summary helper, and export query to handle the expanded enum.
Solution applied
Implemented 2026-03-13. Commit ef8bcfa8.
Enum expansion: adverse_events.grade_category expanded from 2 values to 6: all_grade, grade_1_2, grade_gte3, grade_3_4, grade_5_fatal, sae. All model normalization, extraction schema, backfill task, and export queries updated.
Backfill completed. Current distribution across 148,084 classified rows:
| grade_category | Count | % |
|---|---|---|
| all_grade | 60,450 | 40.8% |
| grade_gte3 | 37,747 | 25.5% |
| grade_3_4 | 21,685 | 14.6% |
| grade_1_2 | 12,907 | 8.7% |
| sae | 6,865 | 4.6% |
| grade_5_fatal | 6,430 | 4.3% |
| NULL | 2,305 | 1.6% |
Ranked summary updated: “Most Frequent AE All Grade” filters to all_grade only (excluding grade_1_2 and sae). “Most Frequent AE >=Gr3” filters to grade_gte3 + grade_3_4 + grade_5_fatal.
Residual: 2,305 rows (1.6%) across 1,257 publications still have NULL grade_category. Inverted AE pairs reduced from ~50 to 33 — remaining inversions likely reflect genuine data complexity (e.g. subgroup-level AE rates where a smaller subgroup has higher grade ≥3 than the overall all-grade rate).
8. max_prior_lines zero-sentinel contamination
Short summary
The TreatmentContextExtraction LLM task outputs 0 instead of null for max_prior_lines when the abstract does not state a maximum number of prior therapies. This produces 124,446 subgroups (78% of all publication subgroups) with max_prior_lines = 0, of which 12,924 are logically impossible (min_prior_lines > max_prior_lines).
Where this sits in the current pipeline
TreatmentContextExtraction (app/tasks/publications_llm_classification/treatment_context_extraction.rb):
- Schema declares `attribute :max_prior_lines, :integer, nullable: true` with desc `"Null if not stated."`
- System prompt (line 150): `"Leave null if not stated. Do not infer counts that are not explicitly stated."`
- `sanitize_line_count` rejects negative values (`value.negative?`) but passes `0` through
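A standalone reconstruction of the sanitizer behavior described above (the real method lives in `treatment_context_extraction.rb`; this self-contained version is for illustration only):

```ruby
# Reconstruction of sanitize_line_count as described above: negative
# sentinels are rejected, but the 0 sentinel passes straight through.
def sanitize_line_count(value)
  return nil if value.nil? || value.negative?
  value
end

sanitize_line_count(-999) # => nil (negative sentinel caught)
sanitize_line_count(0)    # => 0 (zero sentinel NOT caught: the gap behind this issue)
```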
Root cause
Two contributing factors:
1. Structured outputs integer default: When the LLM generates structured JSON with an `integer` field and the value is conceptually “not applicable,” many models default to `0` rather than `null`, even when the schema allows nullable and the prompt says “null if not stated.” This is a known behavior pattern with OpenAI structured outputs.
2. Sanitizer gap: `sanitize_line_count` (line 411) was designed to catch the `-1`/`-999` sentinel pattern discovered during the initial backfill, but did not anticipate `0` as a sentinel because `0` is a valid value for treatment-naïve (1L) populations.
| max_prior_lines | Count | % |
|---|---|---|
| 0 | 124,446 | 78.4% |
| 1-3 | 13,527 | 8.5% |
| 4-10 | 6,851 | 4.3% |
| >10 | 5,128 | 3.2% |
| NULL | 8,757 | 5.5% |
Logically impossible rows (min > max): 12,924
Breakdown by treatment line for max_prior_lines = 0:
| Treatment line | min=0 & max=0 | min>0 & max=0 (contradictory) |
|---|---|---|
| 2L+ | 17,225 | 6,282 |
| 3L+ | 5,614 | 3,411 |
| 1L only | 25,827 | 129 |
| Other (Adj/Neo/Ind/etc.) | 51,922 | 771 |
For 1L publications, min=0, max=0 is valid (treatment-naïve = zero prior lines). For 2L+ and 3L+ publications, max=0 is always wrong — by definition these populations have ≥1 prior line.
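The plausibility rule in the paragraph above can be sketched as a small predicate (the helper name and regex are assumptions for illustration, not code from the repo):

```ruby
# Hypothetical predicate: is max_prior_lines = 0 plausible for a given
# treatment line string? 2L+/3L+ etc. imply at least one prior line,
# so max = 0 is contradictory there; 1L and non-line-specific settings
# may legitimately have zero prior lines.
PRETREATED_LINE = /\b[2-9]L\+?\b/

def max_zero_plausible?(treatment_line)
  !treatment_line.to_s.match?(PRETREATED_LINE)
end

max_zero_plausible?("1L")  # => true (treatment-naive: zero prior lines is valid)
max_zero_plausible?("3L+") # => false
```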
Concrete examples
| Pub | Subgroup | Treatment line | min | max | Abstract says |
|---|---|---|---|---|---|
| 69513 | Asian pts | 3L+ | 2 | 0 | “at least 2 prior lines” (no max stated) |
| 45604 | Overall | 2L+ | 1 | 0 | “previously treated” (no max stated) |
| 101698 | HRAS-mutated UC → Evaluable | 2L+ | 1 | 0 | “at least one prior therapy” (no max stated) |
| 121922 | Overall | 3L+ | 2 | 0 | “≥2 prior systemic therapies” (no max stated) |
In all cases the abstract provides a minimum threshold but no maximum. The LLM correctly extracted min_prior_lines but output 0 instead of null for max_prior_lines.
Downstream impact
- `max_prior_lines` is not usable for reports in its current state — 78% of values are sentinel zeros
- The `Max Prior Lines` column in the worksheet export will show `0` for the vast majority of rows, which is misleading
- The efficacy view (`vw_publication_efficacy_data`) exposes `max_prior_lines` directly from `trial_subgroups`, so the bad values propagate to all downstream consumers
- `min_prior_lines` is less affected — `0` is valid for 1L populations, and the contradictory cases (min > 0 with max = 0) are identifiable
Recommended fix — two parts
Part 1: Cleanup existing data
The cleanup is not straightforward because 0 is valid for 1L populations. Possible approaches:
1. Conservative (rule-based): Set `max_prior_lines = NULL` where `min_prior_lines > max_prior_lines` (12,924 rows — clearly wrong). This fixes the worst cases but leaves ~111K ambiguous `max=0` rows untouched.
2. Moderate (rule-based with treatment line context): Additionally set `max_prior_lines = NULL` where `max_prior_lines = 0` AND `treatment_lines` contains `2L+` or `3L+` (these populations by definition have ≥1 prior line, so max=0 is impossible). This would cover ~23K additional rows.
3. Aggressive (re-extract via LLM): Re-run `TreatmentContextExtraction` on all affected publications. Most accurate, but costs another ~$30 and risks other field drift. Could be scoped to only publications where `max_prior_lines = 0` AND `treatment_lines` is not `1L`.
Recommendation: Start with approach 2 (rule-based cleanup of clearly wrong values), then evaluate whether the remaining 1L + max=0 population needs LLM re-extraction or whether 0 is acceptable there.
Part 2: Prevent recurrence
Three changes needed:
1. Update `sanitize_line_count` to also reject `0` for `max_prior_lines` when `min_prior_lines > 0`:

```ruby
def sanitize_line_count(value)
  return nil if value.nil? || value.negative?
  value
end
```

This alone is insufficient because the sanitizer doesn’t have cross-field context. Better to add a post-persist validation.

2. Update the system prompt to be more explicit about the 0-vs-null distinction:

```
- IMPORTANT: Use null (not 0) when no maximum is stated. 0 means "zero prior lines"
  (treatment-naïve only). If the abstract says "at least 2 prior lines" with no upper
  bound, set min=2 and max=null, NOT max=0.
```

3. Add a cross-field sanitizer in `persist_results` that nulls `max_prior_lines` when `min > max`:

```ruby
subgroup.max_prior_lines = nil if subgroup.min_prior_lines.present? &&
  subgroup.max_prior_lines.present? && subgroup.min_prior_lines > subgroup.max_prior_lines
```
Solution applied
Implemented 2026-03-13. Two-part fix:
Part 1: Prevent recurrence
- Updated `TreatmentContextExtraction` system prompt with explicit zero-vs-null disambiguation and a concrete example
- Added cross-field validation (`min > max → max = nil`) in all three persist paths: `TreatmentContextExtraction#persist_results`, `PostProcess` outcome measure building, and `backfill_prior_therapy_context.thor`
- Added a `MAX_PLAUSIBLE_PRIOR_LINES = 25` threshold to all three `sanitize_line_count` methods — values above 25 are nulled on persist (verified via spot-checking that real abstracts top out at ~20 prior lines in heavily pretreated myeloma/phase 1 basket trials)

Part 2: Historical data cleanup
- Extended `cleanup_prior_therapy_values.thor` with three new cleanup rules:
  - Nulled `max_prior_lines` where `min_prior_lines > max_prior_lines` (12,924 rows)
  - Nulled `max_prior_lines = 0` where `treatment_lines` contains 2L+/3L+/2L/3L/4L/4L+/5L/5L+ (23,193 additional rows)
  - Nulled sentinel junk (values > 25) across all three fields: `min_prior_lines` (95 rows), `max_prior_lines` (3,013 rows), `median_prior_lines` (1,281 rows). Common sentinels included INT_MAX (2,147,483,647), 999, 999999, 65535, 32767, 123456789, etc.
- Total cleaned: 40,506 rows across all rules
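The three cleanup rules can be sketched as a single pure-Ruby pass over a row hash (illustrative only: the real cleanup is a thor task issuing SQL updates; the field names mirror `trial_subgroups`, everything else is an assumption):

```ruby
# Pure-Ruby sketch of the three cleanup rules applied to one row hash.
MAX_PLAUSIBLE_PRIOR_LINES = 25
PRETREATED = /\b[2-9]L\+?\b/

def clean_prior_lines(row)
  r = row.dup
  # Rule 3: null sentinel junk (values above the plausibility cap)
  %i[min_prior_lines max_prior_lines median_prior_lines].each do |f|
    r[f] = nil if r[f] && r[f] > MAX_PLAUSIBLE_PRIOR_LINES
  end
  # Rule 1: null max where min > max (logically impossible)
  r[:max_prior_lines] = nil if r[:min_prior_lines] && r[:max_prior_lines] &&
                               r[:min_prior_lines] > r[:max_prior_lines]
  # Rule 2: max = 0 is impossible for pretreated (2L+/3L+/...) populations
  r[:max_prior_lines] = nil if r[:max_prior_lines] == 0 &&
                               r[:treatment_lines].to_s.match?(PRETREATED)
  r
end
```

A 3L+ row with `min=2, max=0` comes back with `max_prior_lines: nil`, while a 1L row with `min=0, max=0` is preserved, matching the behavior described above.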
Post-cleanup validation:
- Zero contradictory rows (`min > max`) remain
- Zero impossible zeros for 2L+/3L+ populations remain
- Spot-checked 4 example publications (69513, 45604, 101698, 121922) — all now have `max_prior_lines = NULL`
- 1L populations with `min=0, max=0` preserved correctly
- All three fields now cap at plausible values (max observed: min=14, max=20, median=18)
- ~40 rows in the 21-25 range were nulled that may have been valid; unrecoverable without LLM re-extraction, but the impact is negligible
Validation (2026-03-16)
Post-cleanup state confirmed:
- 0 contradictory rows (`min > max`) remain
- 90,476 subgroups still have `max_prior_lines = 0` — all are in 1L (5,564) or non-line-specific settings (Adjuvant, Neoadjuvant, Induction, etc.: 84,912). No 2L+/3L+ zeros remain.
- The remaining zeros in non-line-specific settings (e.g. Adjuvant with `max=0`) are likely still sentinel zeros, but these populations have no treatment line context anyway, so the downstream impact is negligible.
- `max_prior_lines` is now usable for reports where treatment line context exists. For 1L populations, `max=0` is valid. For populations without treatment lines, `max_prior_lines` should be treated as unreliable.
9. All-grade AE extraction gap — originally ~13K publications, revised to ~14 after investigation
Short summary
Originally suspected that classify_publications fails to extract all-grade named AEs for ~13,000 publications. After deep investigation (2026-03-16), the issue is much narrower than initially estimated.
Investigation (2026-03-16)
The 12,986 publications with only grade≥3 AE rows break down as:
| Category | Publications | Genuine extraction failure? |
|---|---|---|
| Abstract genuinely only reports grade ≥3 AEs | ~11,200 | No — abstract has no all-grade named AE data |
| Abstract mentions “any grade” in aggregate context only (e.g. “discontinuation due to any grade TRAE”) | ~400 | No — “any grade” appears as an aggregate stat, not per-AE |
| Abstract has grade_1_2 AEs separately (not combined all-grade) | 1,744 | No — abstract reports low/high grade separately, not combined |
| Abstract has clear two-column AE table (Any grade + Grade ≥3) but LLM misclassified | ~14 | Yes — any-grade values extracted but labeled as grade_gte3 |
Root cause for the ~14 genuine failures
The LLM extracts numeric values from embedded AE tables but reads the first column (any-grade) and labels it as grade≥3, completely ignoring the second column (actual grade≥3 values). This was caused by the old binary grade_category enum (all_grade/grade_gte3) which didn’t give the LLM enough guidance to distinguish columns.
Confirmed example: pub 60886 (Debio 0123 + carboplatin, phase 1)
Abstract table:

| TEAE | Any grade n (%) | Grade ≥3 n (%) |
|---|---|---|
| Thrombocytopenia | 12 (31.6) | 3 (7.9) |
| Nausea | 12 (31.6) | 0 |
| Anemia | 8 (21.1) | 1 (2.6) |
| Fatigue | 7 (18.4) | 0 |

Before (old extraction): 7 rows, all grade_gte3 — the values 31.6%, 31.6%, 21.1% are the ANY-GRADE column mislabeled. The grade≥3 column (7.9%, 0%, 2.6%, 0%) was completely missing.
After re-extraction (current prompt with 6-value enum): 14 rows — 7 all_grade (31.6%, 31.6%, 21.1%, 18.4%, 13.2%, 13.2%, 10.5%) + 7 grade_gte3 (7.9%, 2.6%, 2.6%, 2.6%, 0%, 0%, 0%). All correct.
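The target shape, one any-grade row plus one grade ≥3 row per named AE from a two-column table, can be sketched as follows (field names are assumptions for illustration, not the real schema):

```ruby
# Sketch: one two-column table row expands into one all_grade row plus
# one grade_gte3 row (field names assumed for illustration).
def expand_ae_row(name:, any_grade_pct:, grade_gte3_pct:)
  [
    { ae_name: name, grade_category: "all_grade",  percentage: any_grade_pct },
    { ae_name: name, grade_category: "grade_gte3", percentage: grade_gte3_pct }
  ]
end

rows = expand_ae_row(name: "Thrombocytopenia", any_grade_pct: 31.6, grade_gte3_pct: 7.9)
# Two rows per named AE; the old extraction emitted only one, mislabeled grade_gte3.
```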
Confirmed example: pub 56057 (Debio 0123 + carbo + etoposide, phase 1)
Same pattern — any-grade column extracted as grade≥3. After re-extraction: 4 all_grade + 4 grade_gte3, all values correct.
Why the current prompt already fixes this
The expanded 6-value grade_category enum (Issue 7, implemented 2026-03-13) and the detailed grade classification instructions in the prompt give the LLM enough context to correctly distinguish table columns. Re-running classify_publications on the affected pubs with the current prompt produces correct results — verified on 2/2 test pubs.
What does NOT need fixing
- ~11,200 pubs where the abstract genuinely only reports grade≥3: The all-grade data is not in the abstract. It may be in the full paper, poster, or oral presentation. This is a data availability limitation, not an extraction failure.
- ~400 pubs where “any grade” appears in aggregate context: The abstract says things like “discontinuation due to any grade TRAE occurred in 7.5%” — this is an aggregate stat correctly handled by the safety metrics extraction, not individual named AEs.
- 1,744 pubs with grade_1_2 + grade≥3 but no all_grade: The abstract reports grades separately (grade 1-2 and grade 3-4), not as a combined “any grade” bucket. This is correct — the “Most Frequent AE All Grade” column should only use true all-grade data, not sum of grade buckets.
Remaining 35 inverted AE pairs
35 AE pairs across pubs WITH both all_grade and grade≥3 rows show grade≥3 > all_grade for the same AE name. These are likely the same column-swap bug in pubs that DID get partial all-grade extraction. The Issue 10 re-extraction (2,182 pubs through full classify_publications) will fix any that overlap.
Solution applied
Going forward: The Issue 7 enum expansion (2026-03-13) and current prompt instructions are sufficient — re-running classify_publications on affected pubs produces correct two-column extraction. Verified on pubs 60886 and 56057: before=7 rows all grade_gte3 (any-grade values mislabeled), after=14 rows (7 all_grade + 7 grade_gte3, all values correct).
Why the Issue 7 AE grade backfill didn’t fix existing data: The backfill (AdverseEventGradeBackfill) can only reclassify existing AE rows — it cannot create new rows. For the ~14 affected pubs:
- The original `classify_publications` extracted only the any-grade column values and labeled them `grade_gte3` (wrong)
- The grade≥3 column values (Nausea 0%, Thrombocytopenia 7.9%, etc.) were never extracted as rows at all
- The backfill skipped these rows because `grade_category` was already non-null (set incorrectly by the original extraction)
- Even with `--overwrite`, the backfill would at best reclassify the 7 rows from `grade_gte3` → `all_grade`, but the 7 missing grade≥3 rows still wouldn’t exist
Fix requires re-running classify_publications on the affected pubs — only full re-extraction creates both sets of rows. The ~14 pubs will be fixed by either:
- The Issue 10 re-extraction (2,182 pubs) if they overlap, or
- The next full publications workflow run
No additional prompt changes or backfill tasks needed.
10. classify_publications drops subgroups identified by extract_subgroups
Short summary
The classify_publications LLM task receives a list of subgroups with endpoint associations from the upstream extract_subgroups step, but sometimes drops subgroups entirely — producing subgroup_outcome_measures entries for only a subset of the provided subgroups. The subgroup extraction step correctly identifies the subgroup, the schema enum correctly includes it, and the endpoint association is correctly passed — but the main classification LLM simply doesn’t create an output entry for it.
This was discovered during worksheet validation against the client sheet 1reh2-9Xpxd9DF7EB-73JfSXH8-MLtWI3zUDEOTgxPV8.
Where this sits in the current pipeline
Publication classification runs in two LLM steps:
1. `extract_subgroups` (`subgroup_extraction.rb`) reads the abstract and identifies subgroup labels with their endpoint associations. Output is `llm_data['subgroup_endpoints']`.
2. `classify_publications` (`task.rb`) receives `subgroup_endpoints`, derives `distinct_subgroups`, and passes them to the main LLM as:
   - A `subgroup_endpoints` field in the user prompt (subgroup → endpoint mapping)
   - An `enum` constraint on `subgroup_outcome_measures[].value` in the structured output schema (`details.rb` line 185)
   - A system prompt instruction: “Look at the provided ‘subgroup_endpoints’, keep the associations between the endpoints and subgroups as they are.” (line 31)
The schema constraint (details.rb line 185) enforces that subgroup_outcome_measures[].value MUST be one of the distinct_subgroups values — the LLM cannot hallucinate new subgroups. But the schema does not enforce that every enum value must appear at least once. The LLM is free to produce output with only a subset of the provided subgroups, and it does.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”The structured output schema makes subgroup entries optional, not required.
The subgroup_outcome_measures field is an array of objects. Each object has a value field constrained to the enum. But the array itself has no minimum length and no constraint requiring each enum value to appear. The LLM is structurally allowed to produce output with 1 subgroup entry out of N provided.
The system prompt says to “keep the associations as they are” but this is a soft instruction. With structured outputs, the LLM’s tendency to minimize output length can override soft prompt instructions, especially when one subgroup has much more data in the abstract than another.
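Why a partial output is still schema-valid can be shown with a simplified sketch of the schema fragment (this is an illustration of the constraint shape, not the actual `details.rb` schema):

```ruby
# Simplified sketch: the enum restricts WHICH subgroup values may appear,
# but nothing requires that EVERY enum value appears, so dropping
# "1L HNSCC" still validates.
distinct_subgroups = ["NSCLC (PD-L1 TPS >=1)", "1L HNSCC"]

schema = {
  type: "array",
  items: {
    type: "object",
    properties: {
      value: { type: "string", enum: distinct_subgroups },
      outcome_measures: { type: "array" }
    },
    required: %w[value outcome_measures]
  }
  # no minItems and no per-enum-value requirement: 1 entry out of 2 passes
}

partial_output = [{ "value" => distinct_subgroups.first, "outcome_measures" => [] }]
partial_output.size # => 1, yet valid against the schema
```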
Concrete examples
Example 1: pub 47147 (sigvotatug vedotin + pembrolizumab, ASCO 2025) — confirmed LLM drop
Abstract text (verbatim):
“In 7 efficacy-evaluable pts with TPS≥1 NSCLC, 1 confirmed (c) complete response (CR), 1 c partial response (PR), and 2 PRs pending confirmation were observed (ORR 57%; cORR 29%). In 8 efficacy-evaluable pts with 1L HNSCC, 2 cCR and 1 cPR were observed (cORR 37.5%).”
Both disease cohorts are in the same sentence block with explicit efficacy values.
extract_subgroups output (llm_data['subgroup_endpoints']):
```json
[
  { "endpoint": "Objective response Rate", "subgroups": ["NSCLC (PD-L1 TPS ≥1)"] },
  { "endpoint": "Confirmed Objective Response Rate", "subgroups": ["NSCLC (PD-L1 TPS ≥1)", "1L HNSCC"] }
]
```

Step 1 correctly identified both subgroups. 1L HNSCC is associated with Confirmed Objective Response Rate.
distinct_subgroups passed to schema enum: ["NSCLC (PD-L1 TPS ≥1)", "1L HNSCC"]
Both subgroups were available as valid enum values in the structured output schema.
classify_publications output (llm_data['subgroup_outcome_measures']):
```json
[
  {
    "type": "disease",
    "value": "NSCLC (PD-L1 TPS ≥1)",
    "outcome_measures": [
      { "endpoint": "ORR", "measure_value": 57, "number_of_participants": 7 },
      { "endpoint": "cORR", "measure_value": 29, "number_of_participants": 7 }
    ]
  }
]
```

Only the NSCLC subgroup was created. The 1L HNSCC subgroup with cORR=37.5% was completely dropped despite being:
- explicitly mentioned in the abstract with a numeric value
- correctly identified by `extract_subgroups`
- present in the schema enum
- associated with `Confirmed Objective Response Rate` in the input
Worksheet impact: The sheet row for HNSCC says ORR=37.5% from this trial (NCT04389632). Our database has no HNSCC efficacy row for this publication.
Example 2: pub 71934 (cofetuzumab pelidotin, ESMO 2023) — data not in abstract table
Abstract embedded table has two columns:
| Parameter | NSQ EGFR WT, PTK7 ≥90%/≥2+ N=21 | Overall N=56 |
|---|---|---|
| ORR | 30.0% | 19.6% |
| CBR | 90.0% | 78.6% |
| mDOR | 5.8 mo | 7.2 mo |
| mPFS | 5.5 mo | 5.3 mo |
The LLM correctly extracted both columns as subgroups: PTK7-expressing rNSCLC (Overall) and NSQ EGFR WT → PTK7 ≥90%.
The abstract narrative mentions three histology cohorts: “27 NSQ EGFR WT, 13 NSQ EGFR mutant, and 16 squamous (SQ)” and states “Enrollment of SQ and NSQ EGFR mutant pts was halted to prioritize NSQ EGFR WT accrual due to response rates in each subgroup.”
However, the per-histology ORR values (including sqNSCLC ORR=12.5% from the worksheet) are not present in the abstract’s table or narrative text. The abstract only shows the overall and NSQ EGFR WT results. The squamous-specific data was likely in the poster or supplementary material, not the abstract.
This is NOT an LLM extraction failure — the data isn’t in the text we have. The worksheet’s sqNSCLC ORR=12.5% comes from a source outside our abstract corpus.
Root cause analysis
The two examples show different failure modes:
1. Pub 47147 (HNSCC cORR=37.5%): Pure LLM output quality failure. The data is in the abstract, the subgroup was correctly identified upstream, the schema allowed it — but the LLM still dropped it. This is the actionable issue.
2. Pub 71934 (sqNSCLC ORR=12.5%): Not an extraction failure. The data isn’t in the abstract. The worksheet references data from a source we don’t have.

For the actionable case (pub 47147 pattern), the root cause is:
- The structured output schema does not require completeness — the LLM can produce fewer `subgroup_outcome_measures` entries than there are enum values
- The system prompt instruction (“keep the associations as they are”) is not strong enough to override the LLM’s tendency to minimize output when one subgroup has much less data than another
- The HNSCC subgroup had only one endpoint value (cORR=37.5%) while NSCLC had two (ORR=57%, cORR=29%), making it a “smaller” subgroup that the LLM is more likely to drop
The scale of the problem was measured by comparing `llm_data['subgroup_endpoints']` (distinct subgroups identified by `extract_subgroups`) against `llm_data['subgroup_outcome_measures']` (entries with non-empty `outcome_measures` produced by `classify_publications`):
| Status | Publications | % |
|---|---|---|
| All subgroups used | 55,683 | 84.8% |
| Partial drop (some subgroups lost) | 9,245 | 14.1% |
| Total drop (all subgroups lost — zero outcome measures) | 473 | 0.7% |
| More used than identified (LLM created extra) | 293 | 0.4% |
| Total with dropped subgroups | 9,718 | 14.8% |
Note: initial measurement (2,760) undercounted due to a category filter that excluded PubMed, EHA, and other non-ASCO sources. The corrected count uses result = true across all sources.
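The same comparison can be sketched per publication in plain Ruby over the `llm_data` keys described above (simplified; the production measurement runs in SQL):

```ruby
# Per-publication drop classification over llm_data (keys as described
# above; simplified relative to the SQL measurement).
def drop_status(llm_data)
  identified = llm_data.fetch("subgroup_endpoints", [])
                       .flat_map { |e| e["subgroups"] }.uniq
  used = llm_data.fetch("subgroup_outcome_measures", [])
                 .select { |s| Array(s["outcome_measures"]).any? }
                 .map { |s| s["value"] }.uniq
  if used.empty? then :total_drop
  elsif used.size > identified.size then :extra_created
  elsif used.size < identified.size then :partial_drop
  else :all_used
  end
end

# Pub 47147 shape: two subgroups identified, only one used
pub = {
  "subgroup_endpoints" => [
    { "endpoint" => "cORR", "subgroups" => ["NSCLC (PD-L1 TPS >=1)", "1L HNSCC"] }
  ],
  "subgroup_outcome_measures" => [
    { "value" => "NSCLC (PD-L1 TPS >=1)", "outcome_measures" => [{ "endpoint" => "ORR" }] }
  ]
}
drop_status(pub) # => :partial_drop
```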
Explored solution direction
1. Strengthen the prompt instruction (going-forward prevention)
Add explicit language to the system prompt in task.rb:

```
- IMPORTANT: You MUST create a subgroup_outcome_measures entry for EVERY subgroup in the
  provided list that has associated endpoints. Do not skip subgroups even if they have fewer
  results than others. If a subgroup has only one endpoint value, still create the entry.
  Every subgroup provided to you was identified because the abstract contains results for it.
```

This won’t guarantee compliance (the current prompt already says “keep the associations as they are” and the LLM ignores it), but it raises the bar.
2. Schema-level enforcement
Add `minItems: distinct_subgroups.length` to the `subgroup_outcome_measures` array in `to_json_schema`. OpenAI structured outputs may or may not honor this — needs testing. If it works, it forces the LLM to produce at least N entries, preventing the drop.

```ruby
schema[:properties]['subgroup_outcome_measures'][:minItems] = distinct_subgroups.length
```

3. Post-extraction validation + selective re-extraction (fix existing data)
The detection query is cheap (no LLM needed):
```sql
-- Compare identified vs used subgroup counts
SELECT p.id,
  (SELECT count(DISTINCT sg)
     FROM jsonb_array_elements(p.llm_data -> 'subgroup_endpoints') e,
          jsonb_array_elements_text(e -> 'subgroups') sg) AS identified,
  (SELECT count(*)
     FROM jsonb_array_elements(p.llm_data -> 'subgroup_outcome_measures') s
    WHERE jsonb_array_length(s -> 'outcome_measures') > 0) AS used
FROM publications p
WHERE ...
HAVING identified > used
```

For the 2,760 affected publications, re-run classify_publications with the strengthened prompt. Estimated cost: ~$20 with o4-mini for 2,760 pubs.
This could also be wired as a permanent validation step in post_process_publications that flags mismatches for automatic re-extraction (with a retry limit to prevent infinite loops on genuinely ambiguous abstracts).
4. For the 701 total-drop publications (zero outcome_measures)
These need separate investigation — likely a mix of:
- Trial-in-progress abstracts (correct behavior, no results to extract)
- Genuine extraction failures where the LLM returned empty outcomes
- Abstracts too short or ambiguous for the LLM to extract anything
A quick filter: check if partial_result_tags contains ‘Trial Design/Enrollment’ — if yes, the empty outcome is expected.
5. Re-run chain for the 2,760 affected publications
Because post_process_publications destroys and recreates trial_subgroups (line 138 of post_process.rb), re-running classify_publications requires re-running downstream steps that write to subgroup rows. The full chain:
1. `classify_publications --publication_ids <ids> --batched` — re-extracts `subgroup_outcome_measures` with the fixed prompt. Reads `llm_data['subgroup_endpoints']` (already correct from `extract_subgroups`). ~$20 with o4-mini.
2. `post_process_publications --publication_ids <ids> --overwrite` — destroys all `trial_subgroups`, `trial_outcome_measures`, `adverse_events`, `trial_disease_details` for these pubs and recreates from `llm_data`. Re-persists treatment lines and prior therapy context for subgroups that match by `subgroup_type + subgroup_value` against `llm_data['treatment_lines']['subgroups']`. New subgroups (the ones previously dropped) will get null treatment context because `treatment_context_extraction` never ran on them.
3. `extract_treatment_lines --publication_ids <ids>` — re-runs `TreatmentContextExtraction` on the new subgroups. Reads existing `trial_subgroups` by ID and writes treatment lines, min/max/median prior lines, and prior therapies. ~$20 with gpt-5-mini. Note: the `extract_treatment_lines` scope (line 294) filters to `llm_data->'treatment_lines' IS NULL` — but since `post_process` writes `llm_data['treatment_lines']` on the publication (not null), we need to either pass `--publication_ids` to bypass the scope or temporarily null out the field. Alternatively, since `post_process` matched existing subgroups correctly, only the new subgroups lack treatment context. A targeted approach: after step 2, query for the newly created `trial_subgroups` that have null `treatment_lines` and run treatment context extraction on just those publications.
4. Disease workflow steps — re-run for these pubs:
   - `adjudicate_subgroup_diseases` — re-adjudicate new non-disease subgroups
   - `populate_disease_terms_for_trial_subgroups` + `post_process_disease_matches` — re-populate `trial_subgroups.disease_id`
Steps that do NOT need re-running: extract_subgroups (input is already correct), extract_interventions, link_publication_drugs, tag_investigational_interventions, extract_dose_evidence, therapeutic_area_filter — all write to llm_data on the publication or to publication_interventions, not to trial_subgroups.
Full downstream chain: Since post_process_publications destroys and recreates trial_subgroups, trial_endpoints, trial_outcome_measures, adverse_events, and trial_disease_details, all downstream steps need to re-run: extract_treatment_lines, standardize_adverse_events, classify_adverse_events, llm_classify_publication_endpoints_domains, llm_match_publication_endpoints, plus the publication_disease_workflow for disease_id. The simplest approach is to re-run the full publications_workflow from classify_publications onward, then the publication_disease_workflow.
Estimated cost: ~$40 for classify_publications re-extraction with o4-mini + ~$20 for extract_treatment_lines with gpt-5-mini + minor costs for other LLM steps.
Solution applied
Implemented 2026-03-16. Three-part fix:
1. Prompt hardening (task.rb): Added explicit instruction to the classify_publications system prompt:

```
IMPORTANT: You MUST create a subgroup_outcome_measures entry for EVERY subgroup in the
provided list that has associated endpoints. Do not skip subgroups even if they have fewer
results than others. If a subgroup has only one endpoint value, still create the entry.
Every subgroup provided to you was identified because the abstract contains results for it.
```

2. Schema enforcement (details.rb): Added `minItems: distinct_subgroups.length` to the `subgroup_outcome_measures` array in the structured output JSON schema. This prevents the LLM from producing fewer entries than there are identified subgroups.
3. Post-extraction validation logging (task.rb): After each publication is persisted, compares the set of subgroups from extract_subgroups against the set produced by classify_publications. Logs a warning if any subgroups were dropped.
Local test results: 6/6 publications with previously dropped subgroups now have all subgroups populated after re-extraction:
- Pub 47147 (sigvotatug vedotin): `1L HNSCC` subgroup with cORR=37.5% now extracted (previously dropped)
- Pubs 51804, 53951, 56337, 60242, 144841: all dropped subgroups recovered
One-off re-extraction task: lib/tasks/one_off/reextract_dropped_subgroups.thor identifies the ~9,700 affected publications and creates OneOffJob records for the re-extraction. After classify_publications completes, re-run the full publications_workflow from post_process_publications onward, then publication_disease_workflow.
Production re-extraction completed 2026-03-21. Full pipeline re-run (extract_subgroups → classify_publications → post_process) executed across all affected publications. Issue is now closed.
11. Recently ingested publications have empty endpoint extractions — Closed: not an issue
Short summary
Initially suspected that recently ingested publications (ASCO 2025, ESMO 2025) had llm_data['subgroup_outcome_measures'] with subgroup entries but empty outcome_measures: [] arrays, suggesting extraction failures.
Investigation (2026-03-16)
Systematic analysis of all 102 publications with subgroup_outcome_measures containing only empty outcome_measures arrays:
| Category | Count | Genuine extraction failure? |
|---|---|---|
| Trial Design/Enrollment (no results in abstract) | 61 | No — correct behavior |
| Safety/AE-focused publications (no efficacy endpoints) | ~15 | No — correct behavior |
| Biomarker/correlative science (no clinical endpoints) | ~8 | No — correct behavior |
| Truncated abstracts (data in figure/table not captured in text) | ~2 | Data availability limit, not bug |
| Mistagged pubs (tagged “Interim Result” but actually TDE) | ~7 | No — tagging wrong, extraction correct |
| Genuinely missed efficacy data | 0 | — |
Regex scan for standard efficacy keywords (ORR, mPFS, mOS, HR with numeric values) across all 41 non-TDE pubs found ~7 with keyword matches, but manual inspection confirmed all were false positives:
- Pub 53912: pCR prediction AUROC values, not clinical endpoints
- Pub 63720 (BNT327/PM8002): abstract text truncated — Results section jumps from enrollment stats to Conclusions, efficacy data was in an embedded figure not captured in text
- Remainder: biomarker studies where HR/ORR appears in passing context, not as reported results
The worksheet rows that couldn’t be matched (MICVO ORR=46%, sigvotatug HNSCC cORR=37.5%, cofetuzumab sqNSCLC ORR=12.5%) were caused by:
- MICVO ORR=46%: data from a Nov 2025 corporate presentation not in our publication corpus
- Sigvotatug HNSCC cORR=37.5%: Issue 10 — data was in the abstract but the subgroup was dropped by `classify_publications` (now fixed)
- Cofetuzumab sqNSCLC ORR=12.5%: data not in the abstract at all — squamous-specific results were in the poster/supplementary material
Resolution
Closed — not an issue. The empty outcome_measures are correct in all 102 cases. The original concern was caused by confounding with Issue 10 (subgroup drops) and data availability limitations (corporate presentations, poster-only data).
12. Legacy Emerging Clinical Data query collapses subgroup-level results into Overall-preferred rows
Short summary
Legacy Tpp::EmergingClinicalDataQuery groups all view rows by [publication_id, disease_id, effective_line, study_plan_arm_id] and then picks the “Overall” subgroup when extracting efficacy metrics. This means dose-level cohorts, biomarker-stratified subgroups, and other clinically meaningful splits are hidden behind the Overall population row — even when the data is correctly extracted and present in vw_publication_efficacy_data.
Status note: No further work planned for now. Subgroup-preserving behavior is available via Tpp::ClinicalEvidenceQuery, which is the current client-facing path for this use case. The remaining collapse behavior exists only on the legacy EmergingClinicalDataQuery path.
Where this sits in the current pipeline
app/queries/tpp/emerging_clinical_data_query.rb:
- `build_result_rows` (line 913): groups by `[pub_id, disease_id, effective_line, study_plan_arm_id]`
- `extract_efficacy_metrics` (line 1057): `overall_rows = matching_rows.select { |r| r['subgroup_value'] == 'Overall' }` — prefers Overall when present
- All subgroups with the same `disease_id` (including via the `trial_disease_details` fallback) collapse into a single output row
Concrete examples
Example 1: Ficlatuzumab HPV-negative subgroup (pub 43175)
Publication: “Randomized Phase II Trial of Ficlatuzumab With or Without Cetuximab in Pan-Refractory HNSCC” (NCT03422536)
Correctly extracted subgroups:
- Overall → ORR=19%, PFS=3.7, N=32
- Overall → HPV-negative → ORR=38%, PFS=4.1, N=16
- Overall → HPV-negative → cMet overexpression → PFS data
- Overall → HPV-positive → ORR=0%, PFS=2.3, N=16
All four subgroups are in vw_publication_efficacy_data with subgroup_disease_id = NULL. The fallback to trial_disease_details.disease_id = 6200 (HNSCC) gives them all the same disease_id. So they all land in the same group key [43175, 6200, 3L, nil].
extract_efficacy_metrics then picks subgroup_value = 'Overall' (ORR=19%), discarding the HPV-negative result (ORR=38%) that the worksheet expects.
Worksheet says: Ficlatuzumab HPV-neg N=16 ORR=38%
Query returns: Ficlatuzumab Overall N=32 ORR=19%
Example 2: PF-08046054 dose-level cohorts (pub 65346, ESMO 2024)
The ESMO 2024 abstract for this solid-tumor basket trial extracted a single subgroup: PDL1-expressing solid tumors with N=55 ORR=27.3%. The sheet expects HNSCC-specific dose-level splits (N=19 at 1.5mg/kg ORR=10.5%, N=7 at 1.75mg/kg ORR=42.9%).
This is a compound issue:
- The abstract itself is a cross-tumor overview — HNSCC-specific dose-level data was in the poster/slides, not the abstract text (data availability)
- Even if separate subgroups existed, the query would collapse them into one row
Example 3: IBI363 TPS<1 squamous subgroup (pub 139344 / 237445, ASCO 2025)
The sqNSCLC worksheet keeps two rows for the same IBI363 abstract:
- SqNSCLC 3 mg/kg Q3W: ORR = 43.3%, mPFS = 7.3, N = 30
- SqNSCLC with TPS <1: ORR = 45.5%, N = 22
Both rows are correctly present in vw_publication_efficacy_data:
- Advanced NSCLC → Squamous cell carcinoma → 3 mg/kg Q3W → ORR = 43.3, PFS = 7.3, N = 30
- Advanced NSCLC → TPS <1 → Squamous cell carcinoma → ORR = 45.5, N = 22
But EmergingClinicalDataQuery groups both into the same key:
- abstract copy: [139344, 4174, 1, nil]
- presentation copy: [237445, 4174, 1, nil]
There is no subgroup_value = 'Overall', so extract_efficacy_metrics falls back to max_by(number_of_participants) and picks the 30-patient row. The TPS <1 row is hidden even though it is already structured and disease-linked.
Worksheet says: IBI363 TPS <1 SqNSCLC ORR = 45.5%, N = 22
Query returns: IBI363 SqNSCLC ORR = 43.3%, N = 30
Root cause
The query was designed for one-row-per-publication summary display, not for subgroup-level comparisons. The “prefer Overall” logic (line 1057) is intentional — it prevents small subgroup analyses from overriding the main population result in summary tables. But for worksheet reconstruction, the subgroup-level detail IS the desired output.
Difficult to quantify precisely, but any publication with biomarker-stratified results (HPV+/-, PD-L1 CPS levels, mutation status) or dose-level cohorts will lose the subgroup-level detail. This affects basket trials and biomarker-enriched studies disproportionately.
From the HNSCC sheet comparison:
- Ficlatuzumab HPV-neg (N=16 ORR=38%) — data present, hidden by Overall preference
- PF-08046054 dose-levels (N=19, N=7) — data not in abstract, but would be hidden even if extracted
- Becotatug vedotin 2.3mg/kg (N=32 ORR=43%) — data IS extracted (pub 71438 subgroup 2.3 mg/kg → 2/3-line prior platinum & PD-1/L1 inhibitor failure has 4 outcome measures) but invisible due to Issue 15 (disease mapping)
- IBI363 TPS <1 SqNSCLC (N=22 ORR=45.5%) — data present, hidden behind the larger SqNSCLC 3 mg/kg row (N=30 ORR=43.3%)
What the issue is not
This is not an extraction failure. The LLM correctly identifies and extracts subgroup-level data. The data exists in trial_subgroups, trial_outcome_measures, and vw_publication_efficacy_data. The loss happens at query time in the Ruby layer.
Explored solution direction
No additional implementation is planned at this time. The legacy EmergingClinicalDataQuery behavior remains documented below for reference, but this issue is currently superseded by ClinicalEvidenceQuery, which already preserves subgroup-level rows and surfaces cORR.
Two possible approaches:
1. Subgroup-aware grouping: Change build_result_rows to group by [pub_id, disease_id, effective_line, study_plan_arm_id, subgroup_value] instead of collapsing subgroups. This would produce multiple rows per publication — one for Overall, one for HPV-neg, one for each dose level. Downstream consumers (the TPP React component) would need to handle multiple rows per publication.
2. Subgroup expansion mode: Add an optional parameter (e.g. expand_subgroups: true) that preserves subgroup-level rows when set. Default behavior stays unchanged for summary display, but worksheet reconstruction can request the expanded view.
Option 2 would be the lower-risk approach if the legacy Emerging Clinical Data report needs to be revived without adopting ClinicalEvidenceQuery.
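Option 2 can be sketched as follows (the method body is reduced to the grouping behavior; the real build_result_rows does much more, and the flag name is illustrative):

```ruby
# Sketch of an opt-in flag that preserves subgroup rows while keeping the
# legacy collapse as the default.
def build_result_rows(rows, expand_subgroups: false)
  key = lambda do |r|
    base = [r['publication_id'], r['disease_id'], r['effective_line']]
    expand_subgroups ? base + [r['subgroup_value']] : base
  end
  rows.group_by(&key).values.map do |group|
    group.find { |r| r['subgroup_value'] == 'Overall' } ||
      group.max_by { |r| r['number_of_participants'] }
  end
end

rows = [
  { 'publication_id' => 1, 'disease_id' => 6200, 'effective_line' => '3L',
    'subgroup_value' => 'Overall', 'number_of_participants' => 32 },
  { 'publication_id' => 1, 'disease_id' => 6200, 'effective_line' => '3L',
    'subgroup_value' => 'HPV-negative', 'number_of_participants' => 16 },
]

build_result_rows(rows).size                          # => 1 (legacy collapse)
build_result_rows(rows, expand_subgroups: true).size  # => 2 (subgroups preserved)
```

Default callers see no change; worksheet reconstruction opts into the expanded shape.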
3. Confirmed ORR (cORR) not surfaced as a separate column
The worksheet has separate columns for ORR and Confirmed ORR (cORR). Our query only exports ORR. The data IS in the database — but it’s not distinguishable at query time.
Current state:
- The endpoints catalog has no cORR entry — only ORR (ids 10, 64)
- The EndpointMatcher maps all confirmed ORR extractions to the catalog ORR endpoint
- When the LLM extracts “cORR” or “Confirmed Objective Response Rate”, it becomes a regular ORR row with “confirmed” noted in the trial_outcome_measures.observation text
- 2,377 ORR rows have “confirmed” in their observation text
- Only 7 rows in the entire DB have an explicit cORR / Confirmed ORR abbreviation on trial_endpoints
- When an abstract reports both ORR and cORR (e.g. pub 47147: ORR=57%, cORR=29%), both are extracted as separate ORR rows — but the query picks one
Explored approach — adding cORR as a separate catalog endpoint: Not recommended. Confirmed ORR is not a different clinical endpoint — it’s the same ORR with confirmation scans. Splitting the catalog would create ambiguity in the matching step (should “ORR 35%” map to ORR or cORR?) and wouldn’t help for the 2,377 rows that already have “confirmed” buried in observation text.
Recommended approach — structured confirmed boolean on outcome measures:
Add a confirmed boolean field to the outcome measure schema in classify_publications. The LLM already knows whether a response is confirmed (it writes “confirmed” in the observation) — we should ask it to put that in a proper field rather than relying on substring/regex matching at query time.
The field would sit on:
- The outcome measure in
llm_data['subgroup_outcome_measures'][].outcome_measures[]— set by the LLM duringclassify_publications trial_outcome_measures— persisted bypost_process_publications
Then EmergingClinicalDataQuery can pull ORR rows where confirmed = true for the cORR column and confirmed = false/null for the regular ORR column.
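A minimal sketch of that query-time split, assuming the confirmed field exists (row shape illustrative, values from pub 47147 as cited above):

```ruby
# With a structured `confirmed` flag, one pass over the ORR rows can populate
# separate ORR and cORR columns instead of picking a single row.
orr_rows = [
  { 'endpoint' => 'ORR', 'confirmed' => nil,  'value' => 57.0 },
  { 'endpoint' => 'ORR', 'confirmed' => true, 'value' => 29.0 },
]

corr = orr_rows.find { |r| r['confirmed'] == true }&.fetch('value')
orr  = orr_rows.find { |r| r['confirmed'] != true }&.fetch('value')

# orr => 57.0, corr => 29.0 — mirrors pub 47147 (ORR=57%, cORR=29%)
```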
Implementation steps:
- Add confirmed boolean to the outcome measure JSON schema in details.rb
- Add prompt instruction to task.rb: “Set confirmed: true when the response has been confirmed by follow-up assessment (e.g. cORR, confirmed CR/PR). Set confirmed: false or omit when unconfirmed or not stated.”
- Add confirmed column to trial_outcome_measures (migration)
- Persist the field in post_process.rb
- Expose in vw_publication_efficacy_data
- Use in EmergingClinicalDataQuery to populate a separate cORR column
- Backfill: re-run classify_publications on affected pubs, or run a lightweight AdverseEventGradeBackfill-style task that re-classifies existing ORR rows using the observation text
Solution applied (2026-03-18):
- Migration: Added confirmed boolean column to trial_outcome_measures (nullable, no default)
- Schema: Added confirmed attribute to the Outcome StoreModel class in details.rb with a description guiding the LLM
- Prompt: Added “Confirmed Response” instruction to the task.rb system prompt — confirmed: true for cORR/confirmed CR/PR, false for unconfirmed, null when not stated
- Persistence: Added confirmed: om['confirmed'] to the post_process.rb trial_outcome_measures.create! call
- View: Created vw_publication_efficacy_data_v11.sql exposing the tom.confirmed column
- Backfill: Created lib/tasks/one_off/backfill_confirmed_orr.thor — rule-based detection from observation text and endpoint name (no LLM cost). Results:
  - 3,061 rows updated (2,722 confirmed=true, 339 confirmed=false)
  - 62,207 rows left as null (no signal in text)
  - 2,076 publications had llm_data synced
- View refreshed: 4,332 confirmed rows, 517 unconfirmed rows visible
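The rule-based detection can be sketched like this (the patterns here are illustrative, not the thor task's exact regexes):

```ruby
# Illustrative rule-based classification of confirmed-response rows from
# endpoint naming plus observation text. Returns true/false/nil, matching the
# nullable column: nil means "no signal, leave as null".
def detect_confirmed(endpoint_abbreviation, observation)
  text = [endpoint_abbreviation, observation].compact.join(' ')
  return false if text.match?(/\bunconfirmed\b/i) # check first: "unconfirmed" contains "confirmed"
  return true  if text.match?(/\bcORR\b|confirmed/i)
  nil
end

detect_confirmed('cORR', nil)                       # => true
detect_confirmed('ORR', 'unconfirmed ORR was 45.5%') # => false
detect_confirmed('ORR', 'ORR 33% in 6 pts')          # => nil
```

Ordering the unconfirmed check first matters, since a substring match on "confirmed" would otherwise misclassify "unconfirmed" rows.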
Verified on pub 47147 (sigvotatug vedotin):
- HNSCC cORR=37.5% → confirmed=true ✓
- NSCLC ORR=57% → confirmed=null ✓
- NSCLC cORR=29% → confirmed=true ✓
Going forward: New publications processed via classify_publications will have the confirmed field set by the LLM during extraction. The legacy EmergingClinicalDataQuery can now filter by confirmed = true for a cORR column, but no further query-layer work is planned here because subgroup-preserving behavior is already available in ClinicalEvidenceQuery.
Subgroup-level dose fields (2026-03-18)
Problem: Dose evidence was stored at publication_interventions level (one record per publication+drug), not per subgroup. When a publication reports multiple dose cohorts (e.g. Becotatug 2.0 mg/kg vs 2.3 mg/kg), efficacy is split into separate subgroups but they all share the same publication-wide dose_min/dose_max. ~17K publications with dose evidence have subgroups that could carry dose context.
Solution applied:
- Migration: Added 6 dose columns to trial_subgroups: dose_value, dose_min, dose_max, rp2d, dose_units, dose_frequency (all nullable strings)
- Schema: Added dose attributes to the SubgroupOutcome class in details.rb — numeric values only, units separate in dose_units
- Prompt: Added “Subgroup Dose Context” instruction to the task.rb system prompt — extract dose into subgroup fields for dose cohorts, leave null for non-dose subgroups
- Persistence: Added dose field mapping in the post_process.rb trial_subgroups.create! call
- View: Created vw_publication_efficacy_data_v12.sql — COALESCEs subgroup-level dose over publication-level dose: COALESCE(ts.dose_min, pdl.pub_dose_min) AS dose_min, etc. Also surfaces a single_dose column via COALESCE(ts.dose_value, pdl.pub_single_dose)
- Backfill: Created lib/tasks/one_off/backfill_subgroup_dose.thor — sends all subgroups for publications with dose_evidence to gpt-5-mini; the LLM determines which are dose-specific
Scope: 17,170 publications, 50,403 subgroups. Estimated cost ~$15 with gpt-5-mini batched.
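The view's COALESCE precedence can be mirrored in plain Ruby (column names follow the view; the helper itself is illustrative):

```ruby
# Mirror of the v12 view's dose precedence: subgroup-level dose, when present,
# overrides the publication-wide dose (COALESCE semantics over nullable strings).
def effective_dose(subgroup, publication)
  {
    dose_min:    subgroup[:dose_min]   || publication[:pub_dose_min],
    dose_max:    subgroup[:dose_max]   || publication[:pub_dose_max],
    single_dose: subgroup[:dose_value] || publication[:pub_single_dose],
  }
end

dose_cohort = { dose_value: '2.3', dose_min: nil, dose_max: nil }
pub_level   = { pub_single_dose: nil, pub_dose_min: '2.0', pub_dose_max: '2.3' }

effective_dose(dose_cohort, pub_level)
# => { dose_min: '2.0', dose_max: '2.3', single_dose: '2.3' }
```

A dose cohort like Becotatug 2.3 mg/kg gets its own single_dose while still inheriting the publication-wide range.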
Key design decisions:
- Dose value fields are numeric-only (e.g. “2.3”) with units in a separate dose_units field (e.g. “mg/kg”). The initial run had 45/47 values with units leaked into the numeric fields; fixed by making the schema descriptions explicit (“WITHOUT units”)
- Backfill scope is all publications with dose_evidence on publication_interventions, not regex-filtered by subgroup name. The earlier regex approach (mg|mg/kg|...) missed Gy, IU, U/kg, cell therapy doses (×10^N), DLT/MTD keywords, and schedule-only cohorts (QD/BID)
- The LLM correctly nulls non-dose subgroups (disease cohorts, biomarker subgroups, “Overall”) even when they’re sent in the same prompt
Prod deployment:
- Run migrations (add columns + update view to v12)
- Run backfill: thor one_off:backfill_subgroup_dose:backfill --batched
- Refresh materialized view
Going forward: New publications processed via classify_publications → post_process will automatically populate subgroup dose fields. No additional work is planned on the legacy Emerging Clinical Data path; ClinicalEvidenceQuery is the subgroup-preserving query for current use.
13. Technology filter excludes combination partner drugs
Short summary
EmergingClinicalDataQuery filters vw_publication_efficacy_data rows by technology_id, which removes view rows for combination partner drugs that have a different technology than the investigational drug. This means extract_combination_partners_from_rows (which works from the filtered view rows) cannot see the combo partner, so the combination_partners field is blank even when the partner is correctly recorded in publication_interventions.
Where this sits in the current pipeline
app/queries/tpp/emerging_clinical_data_query.rb:
- build_base_query (line 501): AND v.technology_id = ANY(ARRAY[:technology_ids]::integer[]) filters ALL view rows by technology
- extract_combination_partners_from_rows (line 1495): scans the filtered rows for investigational_component = false — but those rows were already removed by the technology filter
- The older fetch_combination_partners method (line 1560) queries publication_interventions directly and would work, but it’s not used by build_single_row — extract_combination_partners_from_rows is used instead
Concrete examples
Example 1: Amivantamab + Paclitaxel (pub 114606, ESMO 2025)
publication_interventions correctly records:
- Amivantamab: drug_id=10180, intervention_role='investigational', technology = Bispecific Antibody (235)
- Paclitaxel: drug_id=10109, intervention_role='supportive', technology = (chemotherapy/small molecule)
When the query runs with technology_id = 235 (Bispecific Antibody):
- View rows for Amivantamab pass the filter (technology_id = 235) ✓
- View rows for Paclitaxel are filtered OUT (different technology) ✗
- extract_combination_partners_from_rows sees only Amivantamab rows → combination_partners = nil
Worksheet says: Combination Partner = “Paclitaxel”
Query returns: combination_partners = nil
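A minimal reproduction of the failure mode (row shapes simplified to the two fields that matter):

```ruby
# Filtering view rows by technology_id BEFORE combo-partner extraction removes
# the partner drug, so partner extraction has nothing left to find.
view_rows = [
  { drug: 'Amivantamab', technology_id: 235, investigational_component: true  },
  { drug: 'Paclitaxel',  technology_id: nil, investigational_component: false },
]

filtered = view_rows.select { |r| r[:technology_id] == 235 }
partners = filtered.select { |r| !r[:investigational_component] }.map { |r| r[:drug] }
partners # => [] — Paclitaxel was filtered out before partner extraction ran
```

Running the same partner extraction over the unfiltered rows would correctly return Paclitaxel, which is why the fix queries publication_interventions directly.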
Example 2: Petosemtamab + Pembrolizumab (pub 30362/209252)
publication_interventions correctly records Pembrolizumab as intervention_role='supportive'. When querying with technology_id = 235 (Bispecific Antibody), Pembrolizumab (Monoclonal Antibody, technology 230) is filtered out.
Note: even when running a separate query with technology_id = 230, Pembrolizumab rows would appear but Petosemtamab rows would be filtered out — so the combination context is lost in both directions.
Root cause
The technology filter is applied to view rows before drug role analysis. The filter is correct for identifying the investigational drug’s technology, but it eliminates combo partner rows that necessarily have a different technology. This is a fundamental design tension: the filter scopes results to a technology of interest, while combination therapy inherently crosses technology boundaries.
Affects any publication where the investigational drug and combination partner have different technologies. Common patterns:
- ADC + checkpoint inhibitor (e.g. sigvotatug + pembrolizumab)
- BsAb + chemotherapy (e.g. amivantamab + paclitaxel)
- BsAb + checkpoint inhibitor (e.g. petosemtamab + pembrolizumab)
These are increasingly common in oncology clinical trials.
Additionally, the Amivantamab + Pembrolizumab 1L row from the MHNCS Feb 2026 conference is missing entirely — this publication does not exist in our database. The “Multidisciplinary Head and Neck Cancers Symposium” is not an ingested source. This is a data availability gap, not an extraction or query issue.
Explored solution direction
Option 1: Fall back to publication_interventions for combo partners. Instead of relying on filtered view rows, use the existing fetch_combination_partners method (line 1560) which queries publication_interventions directly. This method already exists and handles both publication-based and trial-based combo partner lookup. Change build_single_row to call fetch_combination_partners instead of extract_combination_partners_from_rows.
Option 2: Remove technology filter from combo partner extraction. Run a secondary unfiltered query for publication_interventions where investigational_component = false for the matched publication_ids.
Option 1 is simplest — the method already exists, just needs to be wired in.
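The wiring can be sketched with stubbed lookups (both lambdas stand in for the real methods on EmergingClinicalDataQuery; pub 114606 is the Amivantamab example above):

```ruby
# Sketch of the Option 1 wiring: prefer the publication_interventions lookup
# (not subject to the view's technology filter), falling back to row scanning
# for non-publication rows.
fetch_combination_partners = lambda do |publication_id|
  # stand-in for the direct publication_interventions query
  { 114606 => ['Paclitaxel'] }.fetch(publication_id, [])
end

extract_from_rows = lambda do |rows|
  rows.reject { |r| r[:investigational_component] }.map { |r| r[:drug] }
end

partners_for = lambda do |publication_id, rows|
  publication_id ? fetch_combination_partners.call(publication_id) : extract_from_rows.call(rows)
end

partners_for.call(114606, []) # => ["Paclitaxel"]
```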
Solution applied
Implemented 2026-03-18. Two changes in app/queries/tpp/emerging_clinical_data_query.rb:
1. Fixed fetch_combination_partners SQL bug (line 1567): Changed pi.publication_id to pi.source_id — the column was renamed during the polymorphize migration but the SQL was never updated, so this method silently failed for all publications.
2. Switched build_single_row to use fetch_combination_partners (line 951): Replaced extract_combination_partners_from_rows(rows) with fetch_combination_partners(publication_id, clinical_trial_id, primary_drug_id, primary_drug_name). This queries publication_interventions directly, bypassing the technology_id filter on the view. Falls back to extract_combination_partners_from_rows for non-publication rows.
Verified:
- Amivantamab + Paclitaxel (pub 114606): now shows combo=Paclitaxel (was blank)
- Petosemtamab + Pembrolizumab (pub 30362): now shows combo=Pembrolizumab (was blank)
- Monotherapy publications: correctly show no combo partner
14. Basket trial disease subgroups not extracted for minority cohorts
Short summary
BNT324/DB-1311 (NCT05914116) is a solid-tumor basket trial. The ESMO abstract (pub 64328) reports results for 77 evaluable patients across multiple tumor types but only names SCLC (ORR=45.5%, n=33), CRPC (3 PRs), NSCLC (3 PRs), and BTC (1 PR) explicitly. HNSCC is never mentioned in the abstract text. The client sheet lists HNSCC N=3 ORR=100% from this trial — this data was in the poster/presentation, not the abstract.
Investigation (2026-03-17)
Publication corpus: 5 publications linked to NCT05914116:
| Pub ID | Source | Disease focus | HNSCC mentioned? |
|---|---|---|---|
| 64328 | ESMO | Broad solid tumors (SCLC emphasis) | No |
| 137185 | ASCO | CRPC | No |
| 190691 | ESMO | Cervical cancer / ovarian | No |
| 236643 | ASCO | CRPC | No |
| 241480 | ASCO | mCRPC + Lu-177 analysis | No |
Abstract text analysis (pub 64328):
The abstract mentions PRs by tumor type: “In pts with SCLC (n=33), unconfirmed ORR was 45.5%… PRs were also observed in 3 pts with CRPC, 3 pts with NSCLC and 1 pt with BTC.” HNSCC is not in this list. The HNSCC N=3 data likely appeared in the ESMO Asia poster/supplementary materials.
Database state:
- trial_subgroups for this trial with disease_id = 6200 (HNSCC): 2 records, both source_type = 'News'/'NewsTrialMention' — NOT from publication extraction
- publication_interventions: BNT324 (drug_id=12964) correctly linked with technology_id = 708 (ADC) ✓
- No publication-sourced trial_subgroups have disease_id = 6200 for this trial
Root cause
Data availability limitation. The LLM extraction is correct — it cannot extract HNSCC data that isn’t in the abstract text. The HNSCC results for this basket trial were only available in the poster/presentation at ESMO Asia 2024, which is not captured in our abstract corpus.
This is a common pattern for basket trials: the main abstract reports overall + top-responding tumor types, while per-tumor breakdowns for minority cohorts appear only in the poster, supplementary slides, or corporate presentations.
What would fix this
- Full poster/presentation ingestion — if ESMO Asia poster PDFs were ingested and processed, the per-tumor-type data would be extractable
- Corporate presentation ingestion — the sheet source “ESMO Asia 2024” may reference a BioNTech R&D day presentation rather than an abstract
- News-sourced subgroup promotion — the HNSCC subgroups exist from News/NewsTrialMention sources; these could potentially be surfaced alongside publication data, but this would require view/query changes to accept non-publication sources
This pattern affects any basket trial where minority cohort data is only in supplementary materials. Likely affects dozens of phase 1 solid-tumor basket trials in the database.
15. Disease extraction favors subtype matches over parent disease, losing the umbrella disease
Short summary
The disease_extraction.rb matching logic tries subtype-level matches first, and if they succeed, skips the parent disease-level match entirely (early return on line 219). For pub 71438 (Becotatug vedotin, ESMO), the LLM correctly extracted name = "squamous cell carcinoma of the head and neck" with subtypes ["oral cavity", "oropharynx", "hypopharynx", "larynx"]. The subtype combos matched to Oropharyngeal Cancer (5040), Hypopharyngeal Cancer (5031), etc. via TermMatch. Because those subtype matches succeeded, the disease-name-level match to HNSCC (6200) was never attempted. The publication ends up with 4 trial_disease_details rows for sub-site cancers but none for HNSCC itself.
Where this sits in the current pipeline
app/tasks/publications_llm_classification/disease_extraction.rb:
- build_match_set (line 207): Takes a disease name and subtype values
- Lines 212-216: For each subtype, builds the combo "squamous cell carcinoma of the head and neck - oral cavity" and looks up a TermMatch with field = 'disease_subtypes'
- Line 219: return matches if matches.any? — if ANY subtype matched, skip the disease-name match entirely
- Lines 221-223: Only reached if no subtype matches — looks up "squamous cell carcinoma of the head and neck" as disease_name, which resolves to HNSCC (6200)
Then in post_process.rb:
- Lines 401-436: Iterates over processed diseases, uses matched_disease.matched_disease_id to find the Disease record
- Creates one trial_disease_details row per entry — since there are 4 subtype-matched entries (not the parent), 4 sub-site disease rows are created
Exact data flow for pub 71438
Step 1 — LLM extraction (extract_diseases):
The LLM correctly extracted ONE disease:
{ "name": {"value": "squamous cell carcinoma of the head and neck"}, "subtypes": [{"value": "oral cavity"}, {"value": "oropharynx"}, {"value": "hypopharynx"}, {"value": "larynx"}]}Step 2 — Disease matching (disease_extraction.rb):
build_match_set receives disease_name = "squamous cell carcinoma of the head and neck", subtype_values = ["oral cavity", "oropharynx", "hypopharynx", "larynx"].
For each subtype, it builds a combo and finds a TermMatch:
| Combo term | TermMatch ID | Matched disease | Confidence |
|---|---|---|---|
squamous cell carcinoma of the head and neck - oral cavity | 51046 | Lip and Oral Cavity Cancer (5047) | 0.925 |
squamous cell carcinoma of the head and neck - oropharynx | 50748 | Oropharyngeal Cancer (5040) | 0.9 |
squamous cell carcinoma of the head and neck - hypopharynx | 50744 | Hypopharyngeal Cancer (5031) | 0.975 |
squamous cell carcinoma of the head and neck - larynx | 50745 | Laryngeal Cancer (5023) | 0.9 |
All 4 subtype matches succeed → line 219 early return → disease-name match to HNSCC (6200) never runs.
The ONE input disease entry is split into 4 output entries, each with a subtype-matched disease and matched_disease.matched_disease_id pointing to the sub-site cancer (not HNSCC).
Step 3 — Post-processing (post_process.rb):
The 4 processed disease entries become 4 trial_disease_details rows:
| TDD ID | disease_id | disease_name | subtypes |
|---|---|---|---|
| 94126 | 5047 | Lip and Oral Cavity Cancer | ["oral cavity"] |
| 94127 | 5040 | Oropharyngeal Cancer | ["oropharynx"] |
| 94128 | 5031 | Hypopharyngeal Cancer | ["hypopharynx"] |
| 94129 | 5023 | Laryngeal Cancer | ["larynx"] |
HNSCC (6200) is nowhere in trial_disease_details for this publication.
Step 4 — Query (EmergingClinicalDataQuery):
The query uses Disease.subtree_for([6200]) which returns only [6200] (HNSCC has no descendants, all_descendants = []). None of the sub-site diseases (5047, 5040, 5031, 5023) are in this set. The publication is invisible.
Comparison with pub 242943 (PubMed, same trial)
Pub 242943 for the same trial (NCT04868162) has trial_disease_details.disease_id = 6200 (HNSCC directly). This works because the PubMed abstract either:
- Did not have subtypes, so the disease-name fallback (line 221) ran and matched HNSCC
- Or had different subtype values that didn’t match any disease_subtypes TermMatch
Why patient_population_diseases shows the correct match
The llm_data['patient_population_diseases'] for pub 71438 shows matched_disease.matched_disease_id = 6200 with confidence 1.0. But this is stale data — it was set before disease_extraction.rb re-processed the entries. The extraction step replaces the matched_disease field on each cloned entry (line 174), overwriting the original HNSCC match with the subtype-level match.
Root cause
The early return on line 219 of disease_extraction.rb treats subtype matches as replacing the parent disease match, rather than supplementing it. When the LLM extracts “squamous cell carcinoma of the head and neck” with anatomical subtypes, the system should create BOTH:
- A parent disease record for HNSCC (6200) — so the publication is discoverable under the umbrella term
- Subtype records for the anatomical sub-sites — for more granular filtering
Instead, it creates ONLY the subtype records and drops the parent entirely.
Disease ontology contributing factor
Even if the subtype records were the only ones created, the publication would still be discoverable IF the sub-site diseases were descendants of HNSCC in the disease hierarchy. But they are all root-level siblings:
| Disease ID | Name | Parent | all_descendants |
|---|---|---|---|
| 6200 | Head and Neck Squamous Cell Carcinoma (HNSCC) | NULL | [] |
| 5040 | Oropharyngeal Cancer | NULL | (separate tree) |
| 5031 | Hypopharyngeal Cancer | NULL | (separate tree) |
| 5023 | Laryngeal Cancer | NULL | (separate tree) |
| 5047 | Lip and Oral Cavity Cancer | NULL | (separate tree) |
So Disease.subtree_for([6200]) returns only [6200], excluding all sub-site diseases.
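A plain-Ruby mirror of the subtree lookup (hierarchy stubbed from the table above) shows why the publication is invisible:

```ruby
# The sub-site diseases are root-level siblings, not descendants of HNSCC,
# so a subtree expansion of 6200 never reaches them.
DESCENDANTS = { 6200 => [], 5040 => [], 5031 => [], 5023 => [], 5047 => [] }.freeze

def subtree_for(ids)
  ids.flat_map { |id| [id, *DESCENDANTS.fetch(id, [])] }.uniq
end

subtree_for([6200]) # => [6200] — excludes 5040, 5031, 5023, 5047
```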
Not yet quantified. Affects any publication where:
- The LLM extracts a disease with anatomical subtypes
- Those subtypes have disease_subtypes TermMatches to separate diseases
- The separate diseases are not descendants of the umbrella disease
This pattern is common for:
- Head & neck cancers (HNSCC → oropharyngeal, laryngeal, hypopharyngeal, oral cavity)
- Lung cancers (NSCLC → adenocarcinoma, squamous)
- Potentially others with anatomical sub-site taxonomy
Explored solution direction
Option 1 (recommended): Always include parent disease match alongside subtype matches.
In disease_extraction.rb build_match_set, after collecting subtype matches, also run the disease-name match and include it in the result. Remove the early return on line 219:
```ruby
# Current (line 219):
return matches if matches.any?
```
```ruby
# Proposed: always also try the disease-name match
term_match = lookup_term_match('disease_name', disease_name)
if valid_match?(term_match)
  # Only add parent match if it resolved to a different disease than the subtypes
  parent_disease_id = term_match.final_result['id']
  subtype_disease_ids = matches.filter_map { |m| m['matched_disease_id'] }
  unless subtype_disease_ids.include?(parent_disease_id)
    matches << format_match_data(disease_name, subtype_values, term_match, matched_subtype: nil)
  end
end
```

This ensures HNSCC (6200) gets a trial_disease_details row alongside the sub-site rows. The deduplication check prevents creating a duplicate if the parent and subtype resolve to the same disease.
Option 2: Fix disease hierarchy. Make sub-site H&N cancers descendants of HNSCC. This is conceptually correct but clinically nuanced — not all oropharyngeal cancers are squamous cell carcinomas. Would need expert review.
Option 3: Both. Fix the extraction to always include the parent, AND fix the hierarchy for confirmed relationships. Belt and suspenders.
Solution applied
Implemented 2026-03-18.
1. Fixed disease_extraction.rb build_match_set: Removed the early return on line 219 that skipped the parent disease-name match when subtype matches existed. The method now always also tries the disease_name TermMatch lookup and includes it in the result set if it resolves to a different disease_id than any of the subtype matches. Added deduplication in merge_disease_matches to prevent the parent disease from being added multiple times when multiple sibling subtypes share the same parent.
2. Created backfill task lib/tasks/one_off/backfill_parent_disease_matches.thor:
- identify — finds 1,856 publications with subtype-only disease matches
- backfill — re-runs disease matching with the fixed logic, then destroys and recreates trial_disease_details only (does not touch subgroups, endpoints, or AEs)
Verified on pub 71438 (Becotatug vedotin, ESMO):
- Before: trial_disease_details had 4 sub-site diseases (5047, 5040, 5031, 5023), no HNSCC
- After: 5 entries — 4 sub-sites + HNSCC (6200)
- Publication now surfaces in HNSCC queries via EmergingClinicalDataQuery
Scale: 1,856 publications affected. Top disease names: breast cancer (593 entries across case variants), NSCLC (205), prostate cancer (127), lymphoma (130), mesothelioma (70), renal cell carcinoma (36), H&N SCC (29).
Pending: Production backfill of the 1,856 affected publications.
16. Confirmed ORR is not exported by EmergingClinicalDataQuery
Short summary
The disease worksheet has a dedicated Confirmed ORR (cORR) column, but EmergingClinicalDataQuery only exports OS, PFS, ORR, DoR, DFS, and DCR. Even when a worksheet row distinguishes confirmed from unconfirmed response, the query output has no place to carry that metric.
This means worksheet rows can look “partially matched” because the main ORR is present while the confirmed-response column is always blank.
Where this sits in the current pipeline
app/queries/tpp/emerging_clinical_data_query.rb:
- PRIMARY_EFFICACY_ABBREVIATIONS is defined as %w[OS PFS ORR DOR DoR DFS DCR]
- extract_efficacy_metrics iterates only that whitelist
- the result hash has no :corr or :confirmed_orr key
- summary_statistics, orr_ranking, and CSV export all inherit the same endpoint set
This is a reporting-layer omission. It sits after publication ingestion and after subgroup extraction.
Exact restriction causing the drop
The query hard-codes the primary efficacy endpoint set:
```ruby
PRIMARY_EFFICACY_ABBREVIATIONS = %w[OS PFS ORR DOR DoR DFS DCR].freeze
```

Because cORR is not in that list:
- extract_efficacy_metrics never reads confirmed-response rows even if they exist upstream
- build_single_row never exposes a confirmed-response field
- downstream consumers cannot distinguish:
  - unconfirmed ORR
  - confirmed ORR
  - rows where both are reported
Concrete examples from sqNSCLC sheet validation
Example 1: PF-08046054 (ASCO 2025)
Worksheet row:
- ORR = 33.3%
- cORR = 33.3%
- N = 6
Query row:
- `ORR = 33.3%`
- no `cORR` field
Example 2: Ifinatamab deruxtecan (ESMO 2023)
Worksheet row:
- `ORR = 31%`
- `cORR = 31%`
- `mDoR = 4.1`
Query row:
- `ORR = 31%`
- `mDoR = 4.1`
- no `cORR`
Example 3: IBI363 (ASCO 2025)
Worksheet rows:
- SqNSCLC: `ORR = 43.3%`, `cORR = 36.7%`
- SqNSCLC `TPS <1`: `ORR = 45.5%`, `cORR = 36.7%`
Query rows:
- `ORR = 43.3%` on the main SqNSCLC row
- no `cORR`
- the `TPS <1` row is additionally hidden by Issue 12
Downstream impact
- the worksheet `Confirmed ORR (cORR)` column cannot be reconstructed from structured output
- studies that report both ORR and cORR appear more complete than they really are because only one of the two response metrics survives
- comparisons between abstracts that emphasize unconfirmed responses versus confirmed responses become unreliable
What the issue is not
This is not primarily a data-availability problem.
For the sqNSCLC examples above, the worksheet values are tied to concrete conference/journal records that we already ingest or otherwise match on the main ORR metric. The missing part is the confirmed-response export path.
This is also not the same as Issue 12. Issue 12 hides subgroup rows; Issue 16 removes an entire metric family from the report shape.
In the current sqNSCLC worksheet:
- `5 / 10` populated rows include a `cORR` value
- these rows cover at least `4` distinct studies
So this is not an edge case for the worksheet format.
Explored solution direction
Add confirmed response as a first-class efficacy metric:
- Expand the endpoint whitelist to include the confirmed-response abbreviation actually used in the data (`cORR` / normalized equivalent)
- Store it in the row hash alongside `:orr`
- Add a `Confirmed ORR` column to CSV/export formatting
- Keep `ORR` and `cORR` separate rather than trying to merge or overwrite one with the other
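As a sketch of that direction — constant and method names mirror the query, but the alias normalization list is an assumption:

```ruby
# Illustrative sketch only: expand the endpoint whitelist so confirmed ORR
# survives extraction. CONFIRMED_ORR_ALIASES is an assumed normalization list.
PRIMARY_EFFICACY_ABBREVIATIONS = %w[OS PFS ORR cORR DOR DoR DFS DCR].freeze
CONFIRMED_ORR_ALIASES = %w[cORR CORR confirmed_orr].freeze

def extract_efficacy_metrics(outcome_rows)
  outcome_rows.each_with_object({}) do |row, metrics|
    abbrev = CONFIRMED_ORR_ALIASES.include?(row[:abbreviation]) ? 'cORR' : row[:abbreviation]
    next unless PRIMARY_EFFICACY_ABBREVIATIONS.include?(abbrev)

    # Keep ORR and cORR under separate keys; never overwrite one with the other.
    metrics[abbrev.downcase.to_sym] ||= row[:value]
  end
end

extract_efficacy_metrics([
  { abbreviation: 'ORR',  value: '43.3%' },
  { abbreviation: 'cORR', value: '36.7%' },
])
# => { orr: "43.3%", corr: "36.7%" }
```

With this shape, the IBI363 row above would carry both response metrics instead of silently dropping the confirmed one.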
17. ASCO abstract and presentation copies create duplicate publication rows
Short summary
Section titled “Short summary”After broadening ASCO ingestion to include both AbstractContentItem and PresentationContentItem, the same scientific abstract can now be stored twice under different ASCO uids. EmergingClinicalDataQuery groups by publication_id, not DOI/title, so both copies surface as separate rows.
This showed up repeatedly during the sqNSCLC pass and makes the local output look larger and noisier than the sheet.
Where this sits in the current pipeline
`app/services/publications/asco_api_service.rb`:
- `fetch_abstract_hits` requests `contentTypes: ['Abstract', 'Presentation']`
- `save_publication` persists records using `Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])`
`app/queries/tpp/emerging_clinical_data_query.rb`:
- `build_result_rows` groups by `publication_id`, `disease_id`, `effective_line`, and `study_plan_arm_id`
There is no DOI-level or title-level deduplication step between ingestion and reporting.
Exact restriction causing the duplication
The ASCO fix for Issue 2 intentionally broadened the search and detail query to include PresentationContentItem. That solved the “missing presentation” problem, but persistence still keys uniqueness on source_id:
`publication = Publication.find_or_initialize_by(source: 'ASCO', source_id: publication_data[:source_id])`

So if ASCO exposes both:
- `ABSTRACT492030`
- `PRESENTATION251481`
with the same DOI and same text, both are considered distinct publications locally.
Concrete examples from sqNSCLC validation
Section titled “Concrete examples from sqNSCLC validation”Example 1: PF-08046054
Same DOI:
10.1200/JCO.2025.43.16_suppl.8611
Stored twice:
- publication `48035` — source_id `ABSTRACT492030`
- publication `238708` — source_id `PRESENTATION251481`
Both produce the same sqNSCLC row (ORR = 33.3%, N = 6).
Example 2: IBI363
Same DOI:
10.1200/JCO.2025.43.16_suppl.8509
Stored twice:
- publication `139344` — source_id `ABSTRACT500470`
- publication `237445` — source_id `PRESENTATION246467`
Both produce the same main sqNSCLC 3 mg/kg Q3W row.
Example 3: Additional duplicate DOI pairs in the same sqNSCLC slice
- Datopotamab deruxtecan: `10.1200/JCO.2025.43.16_suppl.8501`
- Sacituzumab govitecan: `10.1200/JCO.2025.43.16_suppl.8599`
Downstream impact
- one worksheet row can correspond to two local rows
- counts for “how many publication-backed rows do we have?” are overstated
- manual comparison against the sheet becomes noisy
- any future ranking or aggregation that does not dedupe by DOI/title risks double-counting conference data
What the issue is not
This is not a disease-mapping issue and not a subgroup-extraction issue.
The data itself is usually valid in both copies. The problem is that they are the same scientific result represented twice because ASCO exposes two content-item types.
This is also not an argument to undo Issue 2 entirely. We needed PresentationContentItem support to recover records like SHR-A2102. The gap is specifically the lack of a deduplication strategy after broadening the source.
In the sqNSCLC ADC/fusion slice alone, there are 4 duplicate DOI pairs:
- PF-08046054
- IBI363
- Datopotamab deruxtecan
- Sacituzumab govitecan
So the effect is already material in a small disease/technology slice.
Explored solution direction
Two reasonable options:
1. Query/report deduplication
Keep both source records in `publications`, but dedupe in `EmergingClinicalDataQuery` or the TPP report by a stable key such as:
- DOI + disease + subgroup/arm
- or DOI + publication title
This is lower risk for ingestion history.
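A minimal sketch of the report-layer dedupe, assuming plain row hashes with `doi`/`disease_id`/`trial_subgroup_id` keys — the key choice and the abstract-over-presentation preference are assumptions, not the implemented behavior:

```ruby
# Illustrative sketch: collapse ASCO abstract/presentation twins at report time.
# Prefers the abstract copy when both rows share the dedupe key.
def dedupe_publication_rows(rows)
  prefer = { 'ABSTRACT' => 0, 'PRESENTATION' => 1 }
  rows
    .group_by { |r| [r[:doi], r[:disease_id], r[:trial_subgroup_id]] }
    .map { |_key, copies| copies.min_by { |r| prefer.fetch(r[:source_id][/\A[A-Z]+/], 9) } }
end

rows = [
  { doi: '10.1200/JCO.2025.43.16_suppl.8611', disease_id: 1, trial_subgroup_id: 7,
    source_id: 'ABSTRACT492030', publication_id: 48035 },
  { doi: '10.1200/JCO.2025.43.16_suppl.8611', disease_id: 1, trial_subgroup_id: 7,
    source_id: 'PRESENTATION251481', publication_id: 238708 },
]
dedupe_publication_rows(rows).map { |r| r[:publication_id] }
# => [48035]
```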
2. Ingestion-time merge
When saving ASCO records, detect that an incoming presentation and an existing abstract share the same DOI/title/NCT tuple and merge them into one canonical Publication.
This is cleaner downstream but riskier because it changes persistence semantics for already-ingested ASCO records.
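The detection step for option 2 can be sketched with plain hashes standing in for `Publication` records — the matching tuple here is DOI-only for brevity, where the real check would also compare title/NCT as described above:

```ruby
# Illustrative sketch: before persisting an incoming ASCO record, look for an
# existing ASCO publication with the same DOI but a different source_id and
# reuse it as the canonical record instead of creating a twin.
def find_canonical_asco_publication(incoming, existing)
  twin = existing.find do |pub|
    pub[:source] == 'ASCO' &&
      pub[:doi] == incoming[:doi] &&
      pub[:source_id] != incoming[:source_id]
  end
  twin || incoming
end

existing = [{ source: 'ASCO', source_id: 'ABSTRACT492030',
              doi: '10.1200/JCO.2025.43.16_suppl.8611', id: 48035 }]
incoming = { source: 'ASCO', source_id: 'PRESENTATION251481',
             doi: '10.1200/JCO.2025.43.16_suppl.8611' }
find_canonical_asco_publication(incoming, existing)[:id]
# => 48035
```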
18. PubMed-indexed journal article missing from publication corpus
Short summary
Section titled “Short summary”The current sqNSCLC worksheet row for Cofetuzumab pelidotin points to the 2025 journal article:
- DOI: `10.1016/j.lungcan.2025.108492`
- PMID: `40086026`
That article exists on PubMed and contains the sqNSCLC result the sheet uses, but there is no corresponding Publication row in the local database. As a result, the row is completely absent from EmergingClinicalDataQuery.
Where this sits in the current pipeline
This drop happens before EmergingClinicalDataQuery.
During validation:
- `Publication.where(doi: '10.1016/j.lungcan.2025.108492')` returned no rows
- `Publication.where(source_id: '40086026')` returned no rows
So the publication never entered the local corpus, or it was dropped before persistence.
Exact restriction causing the drop
Root cause isolated.
There are two distinct PubMed ingestion limitations affecting this paper:
- the disease-specific path depends on PubMed exposing a `ClinicalTrials.gov/NCT...` databank entry, and this record does not appear to expose that linking metadata even though PubMed marks it as a clinical trial
- the broad PubMed path in `Publications::PubmedApiService` built one giant combined query for the oncology MeSH clause plus the recovery clause; that combined search term excluded qualifying records that PubMed returned when the intended criteria were tested separately
What was verified live for PMID 40086026:
- PubMed resolves DOI `10.1016/j.lungcan.2025.108492` to PMID `40086026`
- the record has `Clinical Trial, Phase I`
- the record has oncology MeSH including `Carcinoma, Non-Small-Cell Lung` and `Lung Neoplasms`
- `40086026[uid] AND mesh AND clinical-trial publication types AND 2025 date` returned `1`
- `40086026[uid] AND full previous combined search term` returned `0`
So the missing publication was not due to missing PubMed record metadata for the broad query. It was due to our query construction.
Concrete example
Section titled “Concrete example”Worksheet row: Cofetuzumab pelidotin in sqNSCLC
Worksheet entry:
- Drug: Cofetuzumab pelidotin
- Publication: Lung Cancer (Journal), 2025
- Link: `https://doi.org/10.1016/j.lungcan.2025.108492`
- `ORR = 12.5%`
- `cORR = 12.5%`
- `mPFS = 5.3`
- `mDoR = 2.2`
Local database state:
- no `Publication` row for DOI `10.1016/j.lungcan.2025.108492`
- no `Publication` row for PMID `40086026`
- only older cofetuzumab records exist:
  - publication `150086` — ASCO 2021
  - publication `71934` — ESMO 2023
  - publication `101600` — Clinical Cancer Research 2021
External confirmation:
- PubMed lists the paper as “A phase 1b study of cofetuzumab pelidotin monotherapy in patients with PTK7-expressing recurrent non-small cell lung cancer” with PMID `40086026`
Downstream impact
- the sqNSCLC worksheet still has one fully missing non-investor row even after the backfills and corrections
- the earlier tracker note that the cofetuzumab sqNSCLC value was poster-only is now stale for the current worksheet version
- the publication will remain absent until a non-`--disease-specific` 2025 PubMed run is executed against the fixed query logic
- `--disease-specific` alone is still insufficient for this class of paper because PubMed does not appear to expose the `ClinicalTrials.gov` linking metadata we rely on
What the issue is not
This does not contradict the earlier ESMO 2023 analysis in Issue 11.
That earlier note was about publication 71934, where the squamous-specific value was not in the 2023 abstract text. The current worksheet has since moved to a later 2025 journal article. That newer source should be representable if it is ingested.
This remains the one confirmed missing sqNSCLC worksheet row from the original worksheet discrepancy.
For 2025-01-01 through 2025-12-31, after fixing the PubMed query construction:
- the broad oncology/malignant-heme PubMed query returns `6,013` PMIDs
- `3,831` of those are not already in local `publications`
- compared with the old `Clinical Trial[pt]` path, there are `435` additional PMIDs
- `431` of those additional PMIDs are not already in local `publications`
So this is not just one missing-paper edge case. The broken combined query was suppressing a non-trivial number of 2025 PubMed records.
Spot checks
Section titled “Spot checks”Publication.where(doi: '10.1016/j.lungcan.2025.108492')returned no rows before the fixPublication.where(source_id: '40086026')returned no rows before the fix- after the
PubmedApiServicequery change,fetch_uids_by_date('2025/01/01', '2025/12/31', nct_ids: [])includes PMID40086026 - live verification after the fix returned:
includes_pmid_40086026 = truetotal = 6013
Open characterization questions
- After the 2025 backfill, how many of the `431` incremental publications are truly result publications versus broader cancer-clinical-trial noise?
- Do we want to keep the broad non-`--disease-specific` PubMed run as a regular sync, or use it only as a periodic coverage backfill?
Explored solution direction
Characterize the missing publication upstream of the query, then narrow the fix to the actual failure point:
- Trace the PubMed/journal ingestion path for DOI `10.1016/j.lungcan.2025.108492` / PMID `40086026`
- Compare direct PubMed criteria matches against the full generated search term
- Split the broad PubMed search into separate query terms and union PMIDs in Ruby instead of relying on one giant combined PubMed query
Solution applied
Section titled “Solution applied”- updated
Publications::PubmedApiServiceso the broad PubMed path now runs separate search terms for:- oncology/malignant-heme MeSH + clinical-trial publication types
- oncology/malignant-heme MeSH + recovery result terms for the recent recovery window
- changed PubMed UID fetching to execute each term separately and union the PMIDs in Ruby
- aligned total-count logic with the split-query approach
- verified live that the fixed 2025 query now includes PMID
40086026 - syntax check passed:
ruby -c app/services/publications/pubmed_api_service.rb
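The split-and-union shape can be sketched like this — `esearch` is an illustrative stand-in for the real E-utilities call, not the service’s actual method:

```ruby
# Illustrative sketch: run each PubMed search term separately and union the
# PMIDs in Ruby, instead of relying on one giant combined search term.
def fetch_uids_for_terms(terms, esearch:)
  terms.flat_map { |term| esearch.call(term) }.uniq
end

fake_esearch = lambda do |term|
  {
    'oncology mesh AND clinical trial [pt]' => %w[40086026 40000001],
    'oncology mesh AND recovery terms'      => %w[40000001 40000002],
  }.fetch(term)
end

fetch_uids_for_terms(
  ['oncology mesh AND clinical trial [pt]', 'oncology mesh AND recovery terms'],
  esearch: fake_esearch
)
# => ["40086026", "40000001", "40000002"]
```

Unioning in Ruby keeps each search term simple enough that PubMed evaluates the intended criteria independently, which is what the live verification above exercised.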
19. Biomarker context missing at subgroup level
Short summary
The worksheet slices data three ways: dose, treatment line, and biomarker. Treatment line and dose are now structured on `trial_subgroups` (Issues 5/12), but biomarker context is not. Biomarkers are extracted at the `trial_disease_details` level (publication + disease scope) via `disease_extraction.rb`, stored in `trial_disease_biomarkers`. There is no link between a biomarker-type subgroup (e.g. “EGFR-mutant → ORR=45%”) and the structured biomarker record (EGFR = positive).
~13,177 subgroups have biomarker-type classifications (mutation: 11,850, biomarker: 913, molecular subtype: 118, etc.). ~94% are single-biomarker subgroups; ~6% are multi-biomarker (e.g. “EGFR/ALK-negative”, “KRAS wild-type + BRAF-mutated”).
Where this sits in the current pipeline
Biomarker extraction (disease level):
- `disease_extraction.rb` → LLM extracts `patient_population_diseases[].biomarkers[]` from abstract
- `post_process.rb` lines 445-463 → creates `trial_disease_biomarkers` linked to `trial_disease_details`
- Matching via `Biomarker.flexifind(biomarker_name, 'synonyms')` → `biomarker_id`
Subgroup extraction (no biomarker logic):
- `subgroup_extraction.rb` → identifies subgroup labels (e.g. “EGFR-mutant”), classifies `subgroup_type = 'mutation'`
- `classify_publications` (`task.rb`) → extracts outcome measures per subgroup
- `post_process.rb` lines 251-260 → creates `trial_subgroups` with `subgroup_type`, `subgroup_value` — no biomarker fields
No biomarker usage in query/view:
- `vw_publication_efficacy_data` does not join or expose biomarker data
- `EmergingClinicalDataQuery` does not query `trial_disease_biomarkers`
Exact restriction causing the gap
`trial_subgroups` has no biomarker columns. Biomarker information is only available as:
- Unstructured text in `subgroup_value` (e.g. “EGFR-mutant”, “PD-L1 TPS≥1%”, “TMB high”)
- Structured records in `trial_disease_biomarkers` — but these are linked to `trial_disease_details`, not to `trial_subgroups`
Concrete examples
Example 1: EGFR-mutant subgroup (pub 176313)
- `trial_subgroups`: `subgroup_type='mutation'`, `subgroup_value='EGFR-mutant'`, `biomarker_id=NULL`
- `trial_disease_biomarkers`: `biomarker_name='EGFR'`, `value='positive'`, `biomarker_id=656` — attached to `trial_disease_detail`, no link to the subgroup
Example 2: PD-L1 TPS≥1% subgroup
- Subgroup value: “Non-squamous NSCLC → PD-L1 TPS≥1%”
- Needs: `biomarker_id` → PD-L1, `biomarker_value` → “≥1%”
- Currently: only unstructured text in `subgroup_value`
Example 3: Multi-biomarker (6% of cases)
- Subgroup value: “KRAS wild-type + BRAF-mutated”
- Contains two biomarkers — a single `biomarker_id` column would capture only one
- 13,177 subgroups with biomarker-type `subgroup_type`
- ~5,117 (39%) contain a single recognized biomarker name
- ~811 (6%) contain multiple biomarker names
- ~7,361 (55%) contain less common markers not in the top-40 list but still single-biomarker (e.g. AKT1, VHL, DNMT3A, EZH2)
- Total: ~94% single biomarker per subgroup
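For intuition, the 1:N shape a join table needs to hold can be sketched with a naive label splitter — the real extraction is LLM-based, and this regex split plus the assumed label format are illustrative only:

```ruby
# Illustrative sketch only: split a multi-biomarker subgroup label into
# per-biomarker (name, value) pairs — the 1:N shape the join table stores.
def split_biomarker_label(label)
  label.split(/\s*\+\s*/).map do |part|
    m = part.match(/\A([A-Za-z0-9]+)[- ](.+)\z/)
    m ? { biomarker_name: m[1], value: m[2] } : { biomarker_name: part, value: nil }
  end
end

split_biomarker_label('KRAS wild-type + BRAF-mutated')
# => [{ biomarker_name: "KRAS", value: "wild-type" },
#     { biomarker_name: "BRAF", value: "mutated" }]
```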
What the issue is not
- Not an extraction failure — biomarkers ARE extracted, just at the wrong granularity (disease level, not subgroup level)
- Not a matching failure — `Biomarker.flexifind` works well, and `BiomarkerMatchingService` provides advanced LLM-based matching
- Not a view/query issue — the data simply doesn’t exist on `trial_subgroups` yet
Resolution: Partially addressed by subgroup tagging
Phase 1 (complete): Subgroup tagging (`openspec/changes/subgroup-tagging/`) added a `biomarker` tag to `trial_subgroups.tags`, solving the filtering problem — users can find biomarker subgroups via `tags @> '["biomarker"]'`. Tags are multi-valued (“EGFR-mutant NSCLC” gets `["biomarker", "disease"]`), exposed in `vw_publication_efficacy_data` and admin UI.
Phase 2 (implemented): Structured biomarker link for display and matching.
What was implemented:
- Join table `trial_subgroup_biomarkers` — mirrors `trial_disease_biomarkers` schema:
  - `trial_subgroup_id` → FK to `trial_subgroups` (cascade delete)
  - `biomarker_name` → LLM-extracted name (e.g., “KRAS”)
  - `value` → status/value (e.g., “mutated”, “wild-type”, “TPS≥1%”)
  - `numeric_value` → threshold if applicable (e.g., “1” for TPS≥1%)
  - `biomarker_id` → FK to `biomarkers` (populated by BiomarkerMatchingService, not flexifind)
- LLM extraction — two paths:
  - Backfill: `lib/tasks/one_off/backfill_subgroup_biomarkers.thor` — sends abstract + biomarker-tagged subgroups to GPT-5-mini per-publication. Extracts biomarker name + value. Handles multi-biomarker (e.g., “BRCA1/2” → two entries). ~13K subgroups, ~$5-10.
  - Forward pipeline: `SubgroupBiomarker` schema added to `SubgroupOutcome` in `details.rb`. `post_process.rb` creates `trial_subgroup_biomarkers` records when `tags.include?('biomarker')`.
- No flexifind — `biomarker_id` is left NULL at extraction time. All matching goes through the `BiomarkerMatchingService` pipeline in `PublicationDiseaseWorkflow`:
  - `populate_term_matches` → creates TermMatch entries with `strategy: 'BiomarkerMatching'`, `field: 'name'`
  - Deduplicates with 6,151 existing BiomarkerMatching term matches (3,186 already resolved from ParticipationCriterionBiomarker runs)
  - `suggest_keywords` → `find_candidates` (semantic) → `pick_best_match` → `qa_best_match` → `judge` (gpt-5) → `post_process` writes `biomarker_id`
  - Also applied to `trial_disease_biomarkers` — removed flexifind from `post_process.rb` disease biomarker creation. Same matching pipeline now handles both subgroup-level and disease-level biomarkers.
- Workflow steps — added to `PublicationDiseaseWorkflow` as two parallel branches from the first node:
  - Subgroup biomarker branch: `populate_term_matches_for_subgroup_biomarkers` → 6 matching steps → `post_process_subgroup_biomarkers`
  - Disease biomarker branch: `populate_term_matches_for_disease_biomarkers` → 6 matching steps → `post_process_disease_biomarkers`
  - Both run in parallel with existing disease/subtype matching branches.
- View v15 — `vw_publication_efficacy_data` now exposes `trial_subgroup_id` for query-layer joins.
- Query updates — `ClinicalEvidenceQuery` and `EmergingClinicalDataQuery` now COALESCE subgroup-level biomarkers over disease-level:

```sql
LEFT JOIN trial_subgroup_biomarkers tsb ON tsb.trial_subgroup_id = v.trial_subgroup_id
LEFT JOIN biomarkers sb ON tsb.biomarker_id = sb.id
-- ...existing disease-level joins...
COALESCE(tsb.biomarker_id, tdb.biomarker_id) AS biomarker_id,
COALESCE(sb.name, tsb.biomarker_name, b.name, tdb.biomarker_name) AS biomarker_name,
COALESCE(tsb.value, tdb.value) AS biomarker_value,
```
Production deployment:
```sh
# 1. Run migration (create trial_subgroup_biomarkers table + view v15) ✅
# 2. Backfill subgroup biomarker extraction ✅ (2026-03-24, gpt-5.4-mini)
#    Results: 52,063 records across 44,725 subgroups (99% of 45,184 biomarker-tagged)
#    1.16 records/subgroup avg. Top markers: HER2 (3,728), PD-L1 (3,188), EGFR (1,964)
thor one_off:backfill_subgroup_biomarkers:backfill --batched --parallelism 4 --model=gpt-5.4-mini
# 3. Run PublicationDiseaseWorkflow — biomarker branches match both subgroup + disease biomarkers ✅ (2026-03-25)
#    Results: 3,439 TermMatches created for TrialSubgroupBiomarker (3,191 resolved, 248 pending)
#    35,026 / 52,063 records matched to biomarker_id (67.3%)
#    Unmatched breakdown: 7,038 resolved no-match (long tail), 8,005 deduped via PCB no-match, 477 PCB match not propagated, 1,517 unknown
# 4. Query layer fix: LEFT JOIN LATERAL with STRING_AGG to aggregate multi-biomarker subgroups ✅ (2026-03-25)
#    Prevents row multiplication for ~5,810 multi-biomarker subgroups
#    All biomarker names surface (matched or raw) via COALESCE
```

Design notes:
- TermMatch `field: 'name'` is shared across all biomarker sources (ParticipationCriterionBiomarker, TrialSubgroupBiomarker, TrialDiseaseBiomarker) for deduplication
field: 'name'is shared across all biomarker sources (ParticipationCriterionBiomarker, TrialSubgroupBiomarker, TrialDiseaseBiomarker) for deduplication - ~6% of biomarker subgroups are multi-biomarker — join table handles 1:N cleanly
- Judge step uses gpt-5 (temperature=nil, since gpt-5 only supports default temperature)
- Query layer uses `LEFT JOIN LATERAL` with `STRING_AGG` to aggregate multiple biomarkers per subgroup into comma-separated strings, avoiding row multiplication while preserving all biomarker names/values
LEFT JOIN LATERALwithSTRING_AGGto aggregate multiple biomarkers per subgroup into comma-separated strings, avoiding row multiplication while preserving all biomarker names/values
20. study_plan_arm link is fragile and causes dose/drug/arm issues (merges Issue 3)
Section titled “20. study_plan_arm link is fragile and causes dose/drug/arm issues (merges Issue 3)”Short summary
Section titled “Short summary”The vw_publication_efficacy_data materialized view depends on study_plan_arms (trial registry) for two critical functions: resolving arm roles (EXPERIMENTAL vs COMPARATOR) and resolving drug attribution (via vw_bioloupe_interventions). This dependency is the root cause of three cascading problems:
- Arm role failures — 62% of view rows have no
study_plan_armmatch and default to EXPERIMENTAL - Dose evidence drop (Issue 3) — The
pub_dose_lookupCTE joins ondrug_id, but the view’s drug_id comes from the registry while dose evidence drug_id comes frompublication_interventions. This mismatch causes 76% of extracted dose evidence (17,826 of 23,503 pubs) to silently drop. - Row triplication — Multiple
study_plan_armsper trial create duplicate rows in thedrug_interventionsCTE
The fix is to drop the study_plan_arm dependency entirely and use publication_interventions as the primary drug source, with LLM-classified arm roles replacing the registry lookup.
Where this sits in the current pipeline
The `study_plan_arm` link flows through:
- `publication_clinical_trials` links a publication to a `clinical_trial`
- `trial_arm_outcomes.study_plan_arm_id` links an LLM-extracted outcome row to a registry arm
- `study_plan_arms.arm_type` provides the registry’s classification (EXPERIMENTAL, ACTIVE_COMPARATOR, PLACEBO_COMPARATOR, etc.)
- `vw_publication_efficacy_data` resolves `resolved_group_type` via `COALESCE(UPPER(spa.arm_type), CASE WHEN arm_type = 'experimental' THEN 'EXPERIMENTAL' ... END)`
- The `drug_interventions` CTE in the view joins `vw_bioloupe_interventions` (trial registry drug data) to the correct arm via `study_plan_arm_id`
Relevant code paths:
- `app/queries/tpp/clinical_evidence_query.rb` — uses `resolved_group_type` to prefer EXPERIMENTAL rows for efficacy/safety, extract comparator values
- `db/views/vw_publication_efficacy_data_v18.sql` — the materialized view definition
- `fetch_trial_enrichments` in the query — fetches comparator arm names from `study_plan_arms`
Exact restriction causing the issue
`trial_arm_outcomes.arm_type` is always NULL for publication-sourced data. The LLM extraction pipeline (`classify_publications`) extracts arm names but does not classify arm roles. The only path to arm role classification is the `study_plan_arm_id` foreign key, which requires:
- The publication is linked to a trial (`publication_clinical_trials` exists)
- The LLM-extracted arm name was matched to a registry arm (`study_plan_arm_id` is set)
Both conditions frequently fail.
Coverage analysis of vw_publication_efficacy_data (total ~1.04M rows):
| Category | Row count | % of total |
|---|---|---|
| Trial + arm linked (has `study_plan_arm_id`) | 399,373 | 38% |
| Trial linked, no arm match | 447,912 | 43% |
| Unlinked (uses `publication_interventions`) | 196,723 | 19% |
So 62% of view rows have no study_plan_arm link and default to EXPERIMENTAL.
For HNSCC specifically (14,660 rows):
- 1,463 rows (10%) have comparator identification via the arm link
- 12,360 rows are marked EXPERIMENTAL
- 569 rows have NULL `resolved_group_type`
Dose evidence impact (from Issue 3 reopened investigation, 2026-03-23)
The same study_plan_arm dependency causes the `drug_interventions` CTE to use registry drug_ids. The `pub_dose_lookup` CTE then fails to join because `publication_interventions.drug_id` (LLM-extracted) doesn’t match:
- `23,503` publications with dose_evidence extracted
- `8,764` publications with structured dose in view (37%)
- `17,826` publications with dose evidence silently dropped (76%)
Breakdown of dropped:

- ~13,600 NULL drug_id on `publication_interventions` (58%)
- ~2,148 drug_id mismatch: registry vs LLM-extracted (9%)
- ~2,078 other (pub not in view, no usable fields, etc.)

Concrete examples from CRC ADC audit (disease 4345, technology 708):
| Pub | Drug | PI drug_id | View drug_id | Dose evidence | View dose |
|---|---|---|---|---|---|
| 66516 | Zanidatamab | 10432 (antibody) | 15231 (ADC: zovodotin) | 1200 mg | NULL |
| 70960 | SHR-A1811 | NULL | 10733 (Trastuzumab rezetecan) | rp2d=6.4 mg/kg | NULL |
| 114758 | Zanidatamab | 10432 (antibody) | 15231 (ADC: zovodotin) | 1200 mg | NULL |
Dropping the study_plan_arm dependency and using publication_interventions as the primary drug source would fix this automatically — drug_id and pub_dose_lookup would use the same source.
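The join-key change can be sketched in miniature — row shapes and the `publication_intervention_id` plumbing here are illustrative:

```ruby
# Illustrative sketch: attach dose evidence to view rows via
# publication_intervention_id so both sides come from the same LLM-extracted
# source, instead of joining on registry-vs-LLM drug_ids that disagree.
def attach_dose(view_rows, dose_evidence)
  by_pi = dose_evidence.group_by { |d| d[:publication_intervention_id] }
  view_rows.map do |row|
    dose = by_pi[row[:publication_intervention_id]]&.first
    row.merge(dose: dose && dose[:dose])
  end
end

rows  = [{ publication_id: 70960, publication_intervention_id: 901 }]
doses = [{ publication_intervention_id: 901, dose: 'rp2d=6.4 mg/kg' }]
attach_dose(rows, doses).first[:dose]
# => "rp2d=6.4 mg/kg"
```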
Concrete examples
LLM-extracted arm names that clearly indicate their role without registry lookup:
| arm_name (LLM-extracted) | resolved_group_type (from registry) | Obvious from name? |
|---|---|---|
| Cetuximab + Chemotherapy (Control) | ACTIVE_COMPARATOR | Yes — “(Control)” |
| Standard Treatment | ACTIVE_COMPARATOR | Yes — “Standard” |
| Placebo | PLACEBO_COMPARATOR | Yes — “Placebo” |
| Extreme Regimen | ACTIVE_COMPARATOR | Ambiguous — SOC regimen name |
| Experimental group | EXPERIMENTAL | Yes — “Experimental” |
| Non-Randomized Single-Arm | EXPERIMENTAL | Yes — single-arm |
| BCA101 + pembrolizumab | NO_INTERVENTION | Registry is wrong — this is clearly experimental |
| Arm B: Cetuximab/Methotrexate/Docetaxel | ACTIVE_COMPARATOR | Ambiguous — needs context |
| 1, 2, Arm I | varies | Not classifiable from name alone |
What the client worksheet actually needs from the trial link
Mapping each worksheet column against its data source:
| Sheet column | Data source | Needs trial link? | Needs study_plan_arm? |
|---|---|---|---|
| Drug | publication_interventions | No | No |
| Technology | drugs → technologies | No (via drug_id) | No |
| Target(s) | drug_target_actions | No (via drug_id) | No |
| Company | drug_ownerships | No (via drug_id) | No |
| Clinical Trial (NCT ID) | publication_clinical_trials → clinical_trials | Yes (trial ID only) | No |
| Clinical Trial Name | clinical_trials.brief_title | Yes (trial ID only) | No |
| Clinical Trial Location | locations table (country rollup) | Yes (trial ID only) | No |
| Combination Partner | publication_interventions | No | No |
| Comparator | study_plan_arms (COMPARATOR type) | Yes | Yes (current path) |
| Disease | trial_disease_details / trial_subgroups | No | No |
| Publication Date | publications | No | No |
| Data Cut Date | trial_subgroups (pub-extracted) | No | No |
| Prior Lines (min/max/median) | trial_subgroups (pub-extracted) | No | No |
| Biomarker | subgroup tags (pub-extracted) | No | No |
| Dose fields | trial_subgroups + publication_interventions | No | No |
| Efficacy (mOS, mPFS, ORR, etc.) | trial_outcome_measures / trial_arm_outcomes | No | No |
| Safety (TRAE, TEAE, etc.) | adverse_events | No | No |
| Phase (internal filter) | clinical_trials.phase | Yes (trial ID only) | No |
| Randomized (internal) | study_designs.allocation | Yes (trial ID only) | No |
| Is Basket Trial (internal) | clinical_trial_end_diseases (computed) | Yes (trial ID only) | No |
Conclusion: The study_plan_arm link is only needed for the “Comparator” column and for resolved_group_type (experimental vs comparator arm selection). All other trial-derived fields only need publication_clinical_trials.clinical_trial_id.
Drug resolution dependency
`publication_interventions` currently only exists for publications processed through the target-disease extraction pipeline (~17K publications). For the remaining ~45K linked publications, drug resolution still flows through `vw_bioloupe_interventions` via the trial link and arm join.
However, ClinicalEvidenceQuery is always scoped to a specific disease, which means its publications will have gone through the target-disease pipeline and will have publication_interventions. This is not a blocker for the clinical evidence report specifically.
Downstream impact
- Efficacy extraction — `extract_efficacy_metrics` prefers `EXPERIMENTAL` rows. Without arm role classification, randomized trial publications would have both experimental and comparator values lumped together, and the “best” row would be picked by patient count rather than arm role.
- Comparator value — the query extracts `comparator_value` (e.g., comparator mPFS) from rows with `resolved_group_type` containing `COMPARATOR`. Without this, the comparator column and comparator efficacy values would be empty.
- Safety extraction — `extract_safety_metrics_for_publication` filters to the `EXPERIMENTAL` arm for safety. Less critical since most single-arm studies (the majority of the corpus) only have one arm anyway.
Explored solution direction
Drop the `study_plan_arm` dependency; add LLM arm role classification.
The proposed approach has two parts:
Part 1: Classify arm roles from LLM-extracted arm names
Add an `arm_role` field to `trial_arm_outcomes` (or `arm_type` — currently always NULL for publication data). Populate it via one of:
Option A: LLM classification during classify_publications — Add arm role to the extraction schema so the LLM outputs "arm_role": "experimental" or "arm_role": "comparator" alongside the arm name. This is the most reliable since the LLM has the full abstract context and knows which drug is investigational.
Option B: Post-hoc heuristic — Pattern match on arm names: keywords like “control”, “placebo”, “standard of care”, “SOC”, “comparator” → COMPARATOR; “experimental”, “investigational”, “study drug”, “treatment” → EXPERIMENTAL. This catches ~70% of cases but fails on regimen names like “Extreme Regimen” (HNSCC SOC) or numbered arms like “Arm B”.
Option A is recommended because the LLM already has the context to make this classification, and the marginal cost per publication is negligible.
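For comparison, option B’s heuristic is easy to sketch — and its failure mode shows up immediately (the keyword lists here are illustrative, not exhaustive):

```ruby
# Illustrative sketch of the post-hoc heuristic (option B). Returns nil for
# names that need abstract context, which is exactly where the LLM wins.
COMPARATOR_HINTS   = /control|placebo|standard of care|\bSOC\b|standard treatment|comparator/i
EXPERIMENTAL_HINTS = /experimental|investigational|study drug|single-arm/i

def classify_arm_role(arm_name)
  return 'comparator'   if arm_name.match?(COMPARATOR_HINTS)
  return 'experimental' if arm_name.match?(EXPERIMENTAL_HINTS)

  nil # ambiguous — e.g. "Extreme Regimen" or "Arm B"
end

classify_arm_role('Cetuximab + Chemotherapy (Control)') # => "comparator"
classify_arm_role('Non-Randomized Single-Arm')          # => "experimental"
classify_arm_role('Extreme Regimen')                    # => nil
```

Note the sketch deliberately omits the bare keyword “treatment” from the experimental list, since it would misfire on names like “Standard Treatment”.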
Part 2: Simplify the view and query
Once arm roles are self-classified:
- `vw_publication_efficacy_data`: Remove the `study_plan_arms` join from `arm_outcomes_expanded`. Use the new `arm_role` field on `trial_arm_outcomes` instead.
- `drug_interventions` CTE: Remove the `publication_arm_links` → `vw_bioloupe_interventions` join path entirely for clinical evidence queries. Use `publication_interventions` as the sole drug source (acceptable since clinical evidence queries are disease-scoped).
- `fetch_trial_enrichments`: Keep the enrichment query but simplify — it only needs `clinical_trials` + `locations` + `study_designs` for metadata. Remove the `study_plan_arms` subquery for comparator arm names; instead, derive comparator name from the LLM-extracted arm names where `arm_role = 'comparator'`.
- `fetch_combination_partners`: Already uses `publication_interventions` as primary path. No change needed.
What this preserves
- NCT ID, trial name, phase, location, randomized, basket trial detection — all via `publication_clinical_trials` → `clinical_trials` (no arm join)
- Correct experimental vs comparator arm selection — via LLM-classified `arm_role`
- Comparator name in the report — derived from arm names where `arm_role = 'comparator'`
What this removes
- Dependency on `study_plan_arm_id` matching (currently fails for 62% of rows)
- Registry arm type overriding LLM context (sometimes wrong, e.g., `BCA101 + pembrolizumab` tagged `NO_INTERVENTION`)
- Drug resolution via `vw_bioloupe_interventions` for linked publications (replaced by `publication_interventions`)
Solution applied
Implemented 2026-03-23. Change: `fix-study-plan-arm-dependency`.
Four-part fix:
1. `vw_publication_efficacy_data` v16 — restructured the `drug_interventions` CTE:
   - Added Source 0: `publication_interventions` as the primary drug source for all pubs that have them (linked AND unlinked). Includes NULL `drug_id` interventions — if we extracted them, that’s the source of truth.
   - Sources 1a/1a-fallback/1b/1c gated with `NOT EXISTS (pubs_with_pi)` — they only fire as fallback for publications without `publication_interventions` (non-target-disease pubs used by `EmergingClinicalDataQuery`).
   - Removed Source 2 (unlinked-only path) — subsumed by Source 0.
   - Threaded `publication_intervention_id` through Source 0 and `pub_dose_lookup` for exact join matching, eliminating the drug_id mismatch that dropped 76% of dose evidence.
2. `vw_publication_efficacy_data` v16 — inverted `arm_outcomes_expanded` priority:
   - LLM-classified `tao.arm_type` is now preferred over the registry `spa.arm_type` via a CASE expression.
   - Maps `control`/`active_comparator` → ACTIVE_COMPARATOR, `placebo`/`placebo_comparator` → PLACEBO_COMPARATOR.
   - Falls back to `spa.arm_type` only when the LLM value is NULL.
3. Safety queries in `clinical_evidence_query.rb`:
   - Updated both inline safety SQL queries to use the same LLM-first arm role logic.
4. Arm role classification improvements (going-forward + backfill):
   - Expanded the `arm_type` enum in `details.rb` from `[investigational, control]` to `[investigational, control, active_comparator, placebo_comparator]`.
   - LLM-based backfill task `lib/tasks/one_off/backfill_arm_type_from_name.thor`:
     - Phase 1 fast-path: single-arm publications (39K pubs, 239K rows) → `investigational` directly.
     - Phase 2 LLM: multi-arm publications (28K pubs) sent to GPT-5-mini with abstract context for classification. Estimated cost ~$17.
   - Tested on 65 publications with 0 errors. The LLM correctly classifies drug-name arms (e.g. “Sorafenib” → control, “Chemotherapy” → control), ambiguous labels (e.g. “Arm B”, “Group 1”), and placebo variants.
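The LLM-first priority implemented by the v16 CASE expression is equivalent to this Ruby sketch (the real logic is SQL in the view; the mapping of `investigational` to EXPERIMENTAL is an assumption consistent with the view labels used elsewhere in this section):

```ruby
# Sketch of the v16 arm-type resolution: prefer the LLM-classified value on
# trial_arm_outcomes, map it to view labels, and fall back to the registry
# study_plan_arms value only when the LLM value is nil. Illustrative only.
LLM_ARM_TYPE_MAP = {
  "investigational"    => "EXPERIMENTAL",        # assumed mapping
  "control"            => "ACTIVE_COMPARATOR",
  "active_comparator"  => "ACTIVE_COMPARATOR",
  "placebo"            => "PLACEBO_COMPARATOR",
  "placebo_comparator" => "PLACEBO_COMPARATOR"
}.freeze

def resolve_arm_type(llm_arm_type, registry_arm_type)
  LLM_ARM_TYPE_MAP[llm_arm_type] || registry_arm_type
end
```

This ordering is what lets the LLM override stale registry values like `NO_INTERVENTION` while still using the registry when no LLM classification exists.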
Results (prod, post-backfill, 2026-03-24):
| Metric | Before v16 | After v16 + backfill |
|---|---|---|
| Pubs with structured dose in view | 8,764 | 11,916 (+36%) |
| Coverage of extracted dose evidence | 71% | 96.7% |
| ACTIVE_COMPARATOR rows | sparse (registry-only) | 124,346 (12.6% of view) |
| PLACEBO_COMPARATOR rows | sparse (registry-only) | 32,383 (3.3% of view) |
| Total comparator identification | ~38% coverage when arm linked | 15.9% of all rows (up from near-zero for LLM-sourced pubs) |
| Stale registry values (PLACEHOLDER/NO_INTERVENTION/OTHER) | 7,806 rows | 17 rows |
Prod verification (2026-03-24): Spot-checked 55+ publications across multiple categories:
- Combo arms with “placebo” in name (e.g. “Nivo+Ipi+Placebo for Nivo”) → correctly EXPERIMENTAL
- Drug-name comparators (e.g. “Sorafenib”, “FOLFIRI”, “Chemotherapy”) → correctly ACTIVE_COMPARATOR
- Novel drug monotherapy vs combo (EV mono vs EV+pembro) → correctly identified mono as comparator
- Phase I multi-arm dose trials → correctly all EXPERIMENTAL
- Randomized dose-finding (same drug, different schedules) → correctly all EXPERIMENTAL
- No false positives or misclassifications found
Tracker spot-checks resolved:
| Pub | Drug | Before | After |
|---|---|---|---|
| 66516 | Zanidatamab | all NULL (drug_id mismatch: 10432 vs 15231) | single_dose=1200 mg, dose_units=mg |
| 114758 | Zanidatamab | all NULL (same mismatch) | single_dose=1200 mg, dose_frequency=on days 1 and 15 |
| 70960 | SHR-A1811 | all NULL (drug_id was NULL) | dose_min=3.2 mg/kg, dose_max=8.0 mg/kg, rp2d=6.4 mg/kg |
Files changed:
- `db/views/vw_publication_efficacy_data_v16.sql` (new)
- `db/migrate/20260323212725_update_vw_publication_efficacy_data_to_version_16.rb` (new)
- `app/queries/tpp/clinical_evidence_query.rb` (safety query arm role logic)
- `app/tasks/publications_llm_classification/details.rb` (arm_type enum expansion)
- `lib/tasks/one_off/backfill_arm_type_from_name.thor` (new — LLM arm type backfill)
Deployment steps:
1. `rake db:migrate` (creates v16 view + materializes)
2. `REFRESH MATERIALIZED VIEW CONCURRENTLY vw_publication_efficacy_data`
3. `thor one_off:backfill_arm_type_from_name:backfill --batched --parallelism 4 --batch-size 2000`
4. `REFRESH MATERIALIZED VIEW CONCURRENTLY vw_publication_efficacy_data` (again after backfill)
21. Phase 1 basket trials report response counts, not ORR percentages
Short summary
Phase 1 dose-escalation and basket trial abstracts often report efficacy as response counts per tumor type (e.g. “1 PR in 9 HNSCC patients”) rather than ORR percentages. The LLM faithfully extracts these as a PR endpoint with `measure_unit = count`, but the query only recognizes ORR with `measure_unit = percentage`. This causes two downstream problems:
- No efficacy shown — the publication surfaces in the report with empty ORR/PFS/OS columns despite having extractable response data
- Inflated patient count — when no recognized efficacy endpoint exists for the disease subgroup, `extract_patient_count` falls back to the largest `number_of_participants` across all rows, which is typically the cross-tumor Overall population (e.g. N=92 instead of N=9)
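The inflation mechanism can be seen in a minimal sketch (a simplified stand-in for the `extract_patient_count` fallback, using illustrative row hashes rather than the real query rows):

```ruby
# Simplified illustration: with no recognized efficacy endpoint in the group,
# the fallback takes the max number_of_participants across all rows — which
# for a basket trial is the cross-tumor Overall population, not the cohort.
rows = [
  { subgroup: "Overall",         endpoint: "SD", n: 92 },
  { subgroup: "Overall → HNSCC", endpoint: "PR", n: 9  }
]
fallback_n = rows.map { |r| r[:n] }.max
# fallback_n is 92, although the HNSCC cohort is only 9 patients
```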
Concrete example
Publication 29759 — Praluzatamab ravtansine (CX-2009) first-in-human phase 1 (NCT03504488), ASCO 2020.
Abstract reports: “92 patients … 5 PRs in breast cancer (n=39), 2 PRs in ovarian (n=22), 1 PR in HNSCC (n=9)”
Extracted data (correct):
| Subgroup | Endpoint | Value | Unit | N |
|---|---|---|---|---|
| Overall | SD | 21 | count | 92 |
| Overall → HNSCC | PR | 1 | count | 9 |
| Overall → Breast Cancer | PR | 5 | count | 39 |
| Overall → Ovarian Cancer | PR | 2 | count | 22 |
Query output for HNSCC (incorrect):
- ORR: empty (no `ORR` endpoint exists)
- Patient count: 92 (fallback to Overall N, should be 9)
- The row appears in the report with no efficacy and a misleading N
Root cause
Two gaps in the query layer:
- `extract_efficacy_metrics` only looks for `PRIMARY_EFFICACY_ABBREVIATIONS` (OS, PFS, ORR, DOR, DFS, DCR). PR and CR counts are not recognized. No logic derives ORR from `PR count / N`.
- `extract_patient_count` takes the max `number_of_participants` across all rows in the group. For basket trials where the Overall subgroup (N=92) and disease subgroup (N=9) coexist in the same group key, the fallback picks N=92.
Phase 1 dose-escalation trials commonly report response counts rather than ORR. Basket trials with disease-specific cohorts are particularly affected since they report per-tumor-type counts. The exact count of affected publications needs characterization, but this pattern is common in early-phase oncology abstracts.
Explored solution direction
Option 1: Derive ORR from PR/CR counts at query time. When no ORR endpoint exists for a subgroup but PR and/or CR counts exist with `number_of_participants > 0`, compute `ORR = (PR + CR) / N * 100`. This is clinically correct and matches how the client sheet manually computes these values.
Option 2: Have the LLM compute ORR during extraction. Add a prompt instruction: when only response counts are reported, also emit a derived ORR endpoint with measure_unit = percentage. Risk: the LLM might hallucinate percentages or miscount.
Option 3: Filter out publications with no recognized efficacy endpoints. If a publication has no ORR/PFS/OS/DoR for the disease subgroup, don’t surface it in the report. This avoids misleading rows but loses legitimate phase 1 data.
Option 1 is most reliable — the data is already correctly extracted, just needs a calculation step in the query.
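The Option 1 calculation is a one-liner; a minimal sketch (method name hypothetical):

```ruby
# Sketch of Option 1: derive ORR (%) from PR/CR counts when no explicit ORR
# percentage was reported. Returns nil when N is missing or zero.
def derived_orr(pr_count:, n:, cr_count: 0)
  return nil if n.nil? || n.zero?
  ((pr_count + cr_count).to_f / n * 100).round(1)
end
```

For the HNSCC example above, `derived_orr(pr_count: 1, n: 9)` yields 11.1 — the value the client sheet would compute by hand.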
Solution applied
Implemented (2026-03-20):
- Going-forward fix in `post_process.rb`: Added `derive_orr_for_subgroup` — after persisting outcome measures for each subgroup, checks whether PR/CR counts exist with N > 0 but no ORR percentage. If so, creates a derived ORR row: `(PR + CR) / N * 100` with `measure_unit = 'percentage'` and `observation = 'Derived from PR + CR counts'`. Skips subgroups tagged `response_status` (response-defined subgroups where derivation is meaningless).
- Backfill task `lib/tasks/one_off/backfill_derived_orr.thor`: Finds all publication subgroups with PR/CR counts but no ORR percentage and creates derived ORR rows. Results: 753 ORR rows created across 512 publications.
- Empty efficacy filter in `ClinicalEvidenceQuery`: Rows with no recognized efficacy endpoints (empty `efficacy` hash) are now filtered out in `build_result_rows`, preventing publications with only safety/DLT data from appearing as empty rows with misleading patient counts.
Prod deployment:
- Run the response_status backfill first: `thor one_off:backfill_response_status_tags:backfill --batched`
- Run the derived ORR backfill: `thor one_off:backfill_derived_orr:backfill`
- Refresh the materialized view
22. extract_subgroups doesn’t identify response counts as endpoints
Short summary
When abstracts report best response as narrative counts (“1 PR and 14 SD out of 29 CRC patients”, “1 PR and 4 SD among 8 esophageal cancer patients”) without computing an explicit ORR percentage, the upstream `extract_subgroups` step only identifies formal endpoints like DCR and TTP. Individual response counts (PR, CR) are not recognized as extractable endpoints. Since `classify_publications` constrains its `endpoint_abbreviation` enum to the abbreviations identified upstream, the LLM cannot create PR/CR endpoint rows even though it sees the data in the abstract.
Where this sits in the current pipeline
- `extract_subgroups` (step 7 in `PublicationsWorkflow`) scans the abstract and identifies subgroups + their associated endpoints → stored in `llm_data['subgroup_endpoints']`
- `classify_publications` (step 9) receives `subgroup_endpoints` as input, builds a JSON schema with `endpoint_abbreviation` constrained to the upstream list, and extracts structured outcome measures
- If PR/CR aren’t in the upstream endpoint list, `classify_publications` can’t output them
Concrete example
Publication 29737 — IMMU-132 (sacituzumab govitecan) phase I/II in GI cancers (NCT01631552), ASCO 2020.
Abstract text: “Of 29 CRC pts… 1 had a PR and 14 had SD as the best response by RECIST, with a time to progression (TTP) of 11.5+ months for the PR… This is a disease control rate (DCR) of 51.7%.”
subgroup_endpoints identified upstream:
- `Time to progression` → 5 subgroups
- `Disease control rate` → 3 subgroups
Missing: Partial Response / PR was not identified as an endpoint despite being explicitly reported per disease cohort.
LLM output: Extracted DCR=51.7% (N=29) and TTP values. The PR count (1/29) was noted in the DCR observation text (“1 PR and 14 SD out of 29 evaluable CRC patients”) but not as a separate endpoint row.
Result: No ORR can be derived (Issue 21’s derivation requires PR/CR rows to exist), and the publication shows DCR but no ORR in the report.
- 759 publications have DCR but no ORR, PR, or CR endpoints
- 287 of those have response counts (PR/CR) mentioned in the DCR observation text — confirming the data was seen by the LLM but not extracted as separate endpoints
- 414 publications have SD counts but no ORR/PR/CR/DCR — similar pattern with stable disease
Root cause
`extract_subgroups` identifies endpoints by looking for formal endpoint patterns in the abstract (named endpoints with abbreviations, table headings, structured results). Narrative best-response descriptions like “1 had a PR and 14 had SD” are not recognized as formal endpoints because:
- They don’t follow the `endpoint = value` pattern
- PR/SD/CR appear as best overall response categories, not as measured endpoints
- The abstract often only computes a summary metric (DCR) from these counts
The classify_publications schema then constrains the LLM to only the identified abbreviations, preventing it from creating PR/CR rows even though it clearly reads the counts (as evidenced by the observation text).
Explored solution direction
Option 1: Expand `extract_subgroups` to detect response count patterns. Add pattern matching for narrative response descriptions: “N had a PR”, “X partial responses”, “CR in Y patients”, etc. When detected, add PR/CR as endpoints alongside DCR/TTP.
Option 2: Allow classify_publications to add endpoints not in the upstream list. Remove or relax the endpoint_abbreviation enum constraint so the LLM can create PR/CR rows when it sees response counts. Risk: the LLM might hallucinate endpoints.
Option 3: Post-processing derivation from DCR observation text. Parse the observation strings like “1 PR and 14 SD out of 29 evaluable CRC patients” to extract PR/CR counts. This is fragile (regex on LLM-generated text) but catches the 287 publications where the data is already captured.
Option 4: Prompt instruction in classify_publications. Add an explicit instruction: “When the abstract reports individual response counts (e.g. ‘1 PR’, ‘2 CR’) per subgroup without an explicit ORR, also extract these as separate PR/CR endpoints with measure_unit=count.” Combined with relaxing the enum constraint for response-type abbreviations.
Option 4 is cleanest — it works within the existing pipeline, the LLM already sees the data, and combined with Issue 21’s derivation logic, the ORR gets computed automatically.
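For contrast with the chosen Option 4, here is what the Option 3 parse would look like — a sketch with illustrative regexes, fragile by design, which is precisely the objection raised above:

```ruby
# Sketch of Option 3: pull response counts out of observation strings like
# "1 PR and 14 SD out of 29 evaluable CRC patients". Regexes are illustrative;
# this brittleness is why the prompt-based Option 4 was preferred.
def parse_response_counts(text)
  counts = {}
  text.scan(/(\d+)\s+(PR|CR|SD)s?\b/i) { |count, label| counts[label.upcase] = count.to_i }
  if (m = text.match(/out of (\d+)/i))
    counts["N"] = m[1].to_i # denominator, when phrased as "out of N"
  end
  counts
end
```

Even this narrow pattern would miss phrasings like “among 8 esophageal cancer patients”, underscoring why regexing LLM-generated observation text only catches a subset of the 287 publications.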
Solution applied
Forward fix (v1): Updated the `task.rb` classify_publications prompt to instruct the LLM to extract PR/CR counts. See Issue 21 for the ORR derivation that consumes these counts.
Forward fix (v2): Updated task.rb classify_publications prompt to also extract PR/CR/ORR percentages from DCR breakdowns (e.g. “DCR was 54% (CR 8%, PR 15%, SD 31%)”). Added dCR (durable CR) and pCR/MPR exclusions to prevent misidentification as standard CR.
Backfill v1 (2026-03-20/21): screen_missing_response_counts:screen (job 1568) screened candidates and flagged pubs with narrative response counts (e.g. “1 PR, 14 SD”). Re-extraction via classify_publications (job 1570) on flagged pubs. Reduced DCR-only population from 759→620 (~139 fixed).
Backfill v1 gap: The v1 screener explicitly excluded percentage-based response rates (“ORR was 35%” → NO), missing a second pattern where abstracts report ORR/PR/CR as percentages — either standalone (“ORR was 33%”, “BOR rate 18.2%”) or embedded in DCR breakdowns (“DCR was 54% (CR 8%, PR 15%, SD 31%)”). Prod analysis (2026-03-24) found ~92 publications with extractable response rate percentages but no response endpoint, of which 73 were never re-processed (pre-fix) and 19 ran with the v1 prompt but were missed.
Backfill v1 screener (historical): screen_missing_response_counts.thor was used to identify candidates for v1 re-extraction. Its prompt only detected integer counts and explicitly excluded percentage-based ORR — this is why the v1 gap exists. The screener is no longer needed for v2 since the targeted backfill scopes structurally via SQL.
Backfill v2 (complete, job 1604, 2026-03-24): Targeted backfill task backfill_missing_response_endpoints.thor — sent a focused LLM prompt (o4-mini) per publication extracting ORR/PR/CR values anchored to existing subgroups. Created trial_endpoint + trial_outcome_measure + trial_arm_outcome records directly without re-running the full classify pipeline. ORR derived inline from PR% + CR% when LLM didn’t return explicit ORR. Guards: skips zero values, excludes dCR/pCR/MPR, skips response-status subgroups, idempotent (skips existing records).
Results: 97 new records created (41 PR counts, 22 ORR percentages, 14 PR percentages, 12 CR counts, 8 CR percentages). DCR-only population reduced from 550 → 498 (~52 pubs fixed). Combined with v1 backfill: 759 → 498 total (~261 pubs fixed, ~34% reduction).
Verified: 10 random remaining DCR-only pubs manually checked against full abstract text — all 10 genuinely report only DCR with no PR/CR/ORR breakdown (phase I safety studies, PK/biomarker analyses, maintenance trials with DCR as primary endpoint, composite response rates ≠ ORR). The remaining 498 are clean.
23. Dose extraction misses implicit RP2D in phase I/II trials
Short summary
The dose extraction LLM classifies “dose levels of 8 and 10 mg/kg were chosen for phase II” as a range (`dose_min`/`dose_max`) rather than RP2D. In phase I/II trials, doses selected for phase II expansion ARE the recommended phase 2 dose by definition — this is the entire purpose of the phase I dose escalation.
Concrete example
Publication 29737 — “Phase I/II trial of IMMU-132 (sacituzumab govitecan)”
Abstract states: “starting at a dose of 8 mg/kg given on days 1 and 8 of a 3-week cycle. Dose levels of 8 and 10 mg/kg were chosen for phase II”
Current extraction:
```json
{ "dose_min": "8 mg/kg", "dose_max": "10 mg/kg", "rp2d": null, "dose_context_type": "range" }
```
Expected: `rp2d` should capture that 8 and 10 mg/kg are the RP2D levels. The phrase “chosen for phase II” in a phase I/II trial is semantically equivalent to “recommended phase 2 dose.”
Explored solution direction
Update the dose extraction prompt to recognize implicit RP2D language in phase I/II trials:
- “doses chosen/selected for phase II”
- “phase II dose levels”
- “expansion cohort dose”
- “dose carried forward to phase II”
The challenge is that RP2D is currently a single-value field. When two dose levels are selected (8 and 10 mg/kg), storing both requires either a comma-separated value or keeping `dose_min`/`dose_max` AND setting `rp2d`.
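The implicit-RP2D phrases listed above can be encoded as a small detector. This is a sketch only — the actual fix lives in the extraction prompt, not in code, and the patterns are illustrative:

```ruby
# Sketch: detect implicit-RP2D language in phase I/II abstract text.
# Patterns mirror the phrase list above; purely illustrative.
IMPLICIT_RP2D_PATTERNS = [
  /(chosen|selected)\s+for\s+phase\s*(ii|2)\b/i,
  /phase\s*(ii|2)\s+dose\s+level/i,
  /expansion\s+cohort\s+dose/i,
  /carried\s+forward\s+to\s+phase\s*(ii|2)\b/i
].freeze

def implicit_rp2d?(abstract_text)
  IMPLICIT_RP2D_PATTERNS.any? { |re| abstract_text.match?(re) }
end
```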
Note: this publication also has a secondary issue — `publication_interventions.drug_id` is NULL, so the dose evidence can’t join to the view via `pub_dose_lookup` even if the extraction were correct.
~4,617 interventions across ~4,100 publications have a `dose_context_type` of range, escalation, or rp2d (i.e. typed as RP2D but with the value missing) and no `rp2d` value. LLM verification on a sample of 17 publications found implicit RP2D in ~20% of candidates (MTD declarations, phase II dose selections, expansion cohort doses).
Solution applied
Forward fix: Updated the `dose_evidence_extraction.rb` system prompt to recognize implicit RP2D language: MTD declarations, “chosen/selected for phase II”, expansion cohort doses, “recommended for further study”.
Backfill: `lib/tasks/one_off/backfill_implicit_rp2d.thor` — sends the abstract + current dose evidence for ~4,100 publications to GPT-5-mini. The LLM determines whether an implicit RP2D exists and extracts the value. Only updates the `rp2d` and `dose_context_type` fields — does not overwrite existing dose_min/dose_max/units/frequency. Corrections tagged with `rp2d_source: 'implicit_backfill'` for audit. Estimated ~800 RP2Ds to be found. Cost: ~$2.
`thor one_off:backfill_implicit_rp2d:backfill --batched --parallelism 4`
24. Subgroup participant count wrong for biomarker sub-cohorts
Short summary
When abstracts report results for a biomarker-defined sub-cohort within a disease subgroup, the LLM sometimes confuses the count of patients with a specific outcome with the total sub-cohort size.
Concrete example
Publication 29737, KRAS-mutated CRC subgroup:
Abstract states: “Thirteen CRC pts had KRAS mutations, 7 with SD (median TTP = 4.4+ mo)”
Current extraction: subgroup "Advanced GI cancers → Colorectal cancer → KRAS-mutated" with TTP endpoint, n=7
Expected: n=13 (the KRAS-mutated cohort size), with 7 being the count of patients with SD (stable disease).
The LLM set number_of_participants=7 (the SD count) instead of 13 (the KRAS cohort size). This is a pattern likely to recur wherever abstracts report “N patients had X, Y with outcome Z.”
~112 highly suspicious subgroups identified via a heuristic (response count = N, and N < 30% of the publication max N). The true scope is likely larger but hard to detect structurally — confirmed by LLM verification on a sample of 12 publications, which found 11 corrections across 104 verified arms (~10.6% correction rate).
Affected patterns:
- Basket trials: per-tumor-type enrollment vs SD/PR counts (pub 53427: CRC N=6 should be 14, PDAC N=6 should be 25, etc.)
- Biomarker sub-cohorts: mutation cohort size vs outcome count (pub 29737: KRAS N=7 should be 13)
- Response cohorts: assessable patients vs responder count (pub 3674: Cohort 2 N=13 should be 17 — 13 was cCR count, 17 was assessable)
- Disease sub-cohorts in phase I: per-histology enrollment vs outcome count (pub 5024: DIPG N=7 should be 9, sDMG N=7 should be 2)
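The detection heuristic used to find the ~112 suspicious subgroups can be sketched as follows (method and parameter names are hypothetical):

```ruby
# Sketch of the suspicious-N heuristic: a subgroup is flagged when its
# reported N equals a response count AND is under 30% of the publication's
# largest N — suggesting the outcome count was stored as the cohort size.
def suspicious_n?(subgroup_n:, response_count:, publication_max_n:)
  subgroup_n == response_count && subgroup_n < 0.3 * publication_max_n
end
```

For the KRAS example, a subgroup with N=7 and an SD count of 7 in a publication whose largest cohort is far bigger trips the flag, while the corrected N=13 would not.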
Solution applied
Forward fix: Updated the `classify_publications` prompt in `task.rb` with an explicit anti-example: “CRITICAL: Set number_of_participants to the TOTAL evaluable patients in that subgroup/cohort — NOT the count of patients with a specific outcome.”
Backfill: `lib/tasks/one_off/backfill_subgroup_participant_counts.thor` — sends the abstract + all arm outcomes for ~1,240 publications with PR/CR/SD count endpoints to GPT-5-mini for verification. The LLM compares the current N against the abstract and corrects it where wrong. All corrections are logged in `trial_subgroups.llm_data['n_corrections']` for audit/revert. Estimated cost: ~$1.50.
`thor one_off:backfill_subgroup_participant_counts:backfill --batched --parallelism 4`
25. Confirmed vs unconfirmed ORR confusion in classify_publications
Short summary
When abstracts report both confirmed and unconfirmed ORR (a common pattern in ADC oncology trials), classify_publications either (a) extracts the unconfirmed ORR value and incorrectly marks it `confirmed: true`, or (b) extracts only the unconfirmed ORR and omits the confirmed value entirely. This produces wrong cORR values in the report and missing cORR endpoints.
Where this sits in the current pipeline
classify_publications (`app/tasks/publications_llm_classification/task.rb`) — the LLM extraction step that produces `subgroup_outcome_measures`. The `confirmed` boolean on ORR endpoints was added by Issue 16, but the extraction prompt doesn’t instruct the LLM on how to handle abstracts that report both confirmed and unconfirmed ORR.
Exact restriction causing the drop
The extraction schema allows a single ORR record per subgroup arm with a `confirmed` boolean. When an abstract reports “unconfirmed ORR was X% (confirmed: Y%)”, the LLM extracts one ORR record with `measure_value=X` (the unconfirmed value) and sets `confirmed: true` because the word “confirmed” appears in the abstract context. The actual confirmed value (Y%) is only captured in the free-text `observation` field.
The prompt does not instruct the LLM to:
- Create TWO separate ORR records when both confirmed and unconfirmed values are reported
- Distinguish which numeric value corresponds to confirmed vs unconfirmed status
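Why the mis-set flag matters downstream can be shown with a minimal sketch — a simplified stand-in for the report's cORR selection (which filters to `confirmed=true`), not the actual `ClinicalEvidenceQuery` code:

```ruby
# Hypothetical cORR selector: pick the measure_value of the confirmed ORR
# record, mirroring the report's confirmed=true filter. Illustrative only.
def corr_value(orr_records)
  rec = orr_records.find { |r| r[:confirmed] == true }
  rec && rec[:measure_value]
end

# Single mis-flagged record: the unconfirmed value surfaces as cORR.
wrong   = [{ measure_value: 24.1, confirmed: true }]
# Correct two-record split: the true confirmed value surfaces.
desired = [{ measure_value: 24.1, confirmed: false },
           { measure_value: 13.8, confirmed: true }]
# corr_value(wrong)   → 24.1 (overstated)
# corr_value(desired) → 13.8
```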
Concrete examples
Publication 192026 (Precemtabart tocentecan, PROCEADE-CRC-01 dose optimization):
Abstract states: “The unconfirmed objective response rate (ORR) at 2.8 mg/kg was 24.1% (95% CI: 10.3, 43.5) (confirmed: 13.8% [95% CI: 3.9, 31.7]).”
Extracted: ORR confirmed=true, measure_value=24.1
- observation: “Unconfirmed ORR; confirmed ORR was 13.8%”
Expected: Two records:
- `ORR confirmed=false, measure_value=24.1` (unconfirmed)
- `ORR confirmed=true, measure_value=13.8` (confirmed)
Same pattern in pubs 237309, 49900 (same drug, different data cuts). Also confirmed in pub 190845 (missing cORR entirely) and pub 116824 (missing cORR for dose subgroups).
Downstream impact
- Wrong cORR values: The Clinical Evidence report shows unconfirmed ORR in the cORR column. For Precemtabart at 2.8 mg/kg, the report shows cORR=24.1% when it should be 13.8% — a 75% overstatement.
- Missing cORR endpoints: Some publications have no confirmed ORR extracted at all, leaving the cORR column blank when the abstract does report it.
- Audit failures: 24 of 145 open audit issues in the CRC ADC scope (disease 4345, technology 708) are caused by this pattern: 18 `incorrect_value` issues on `efficacy.corr.value` and 6 `missing_endpoint` issues on `efficacy.corr.value`.
Directly confirmed in 5 publications (192026, 237309, 49900, 190845, 116824) across the CRC ADC audit scope. Likely affects any ADC trial publication reporting both confirmed and unconfirmed ORR — estimated dozens across the full corpus.
```sql
-- Publications with confirmed=true ORR that may have unconfirmed values stored as confirmed
SELECT DISTINCT ts.source_id AS publication_id
FROM trial_subgroups ts
JOIN trial_outcome_measures tom ON tom.trial_subgroup_id = ts.id
JOIN trial_arm_outcomes tao ON tao.trial_outcome_measure_id = tom.id
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
WHERE ts.source_type = 'Publication'
  AND te.abbreviation = 'ORR'
  AND tom.confirmed = true
  AND tom.observation ILIKE '%unconfirmed%'
```
Explored solution direction
Forward fix: Update the classify_publications prompt in task.rb to explicitly handle confirmed/unconfirmed ORR:
“When an abstract reports both confirmed and unconfirmed ORR for the same subgroup/arm, create TWO separate ORR records: one with `confirmed: false` and the unconfirmed value, and one with `confirmed: true` and the confirmed value. The unconfirmed ORR is typically the larger number. Do NOT set `confirmed: true` on the unconfirmed ORR value.”
Backfill: Re-extract affected publications with the updated prompt. Scope can be identified by querying for ORR records where confirmed=true and the observation mentions “unconfirmed”. Estimated cost: minimal (small number of publications).
Solution applied
Forward fix (2026-03-24): Updated the classify_publications prompt in `app/tasks/publications_llm_classification/task.rb` to:
- Instruct the LLM to create TWO separate ORR endpoint records when an abstract reports both confirmed and unconfirmed values
- Not confuse different RECIST assessment criteria (RECIST 1.1 vs mRECIST) with confirmation status — use RECIST 1.1 as the primary `measure_value`, note other criteria in `observation`
Targeted backfill v1 (2026-03-24): Created `lib/tasks/one_off/backfill_confirmed_unconfirmed_orr.thor`. The initial run (job 1603) fixed the most obvious cases but missed ~398 records due to overly conservative guardrails and narrow scope.
Backfill v2 (2026-03-24): Expanded the task to address gaps found in v1:
Problems found in v1:
- 131 `confirmed=true` ORR records with “unconfirmed” in the observation but no `confirmed=false` pair created — the guardrail required the LLM to return a full pair, skipping cases where it only returned one side
- 23 `confirmed=true` ORR records where the abstract never mentions response confirmation — the LLM hallucinated the flag
- 244 `confirmed=false` ORR records missing their `confirmed=true` sibling
- ~50 publications with PR/CR `confirmed=null` where the abstract says “confirmed PR” — the wrong flag propagates to derived ORR via post_process, making it invisible to the cORR metric in clinical_evidence_query
Changes in v2:
- Scope widened: Now covers 2,530 pubs — incomplete ORR pairs (1,596) + derived ORR pubs with PR/CR that may need confirmed flags (~934)
- PR/CR coverage: LLM now evaluates confirmed flags on PR, CR, and ORR in one pass (was ORR-only)
- Guardrail relaxed: Acts when LLM returns any non-null confirmed entry (was: required both true+false pair)
- Null upgrade: Can upgrade `confirmed=null` records to true/false instead of only creating new records
- Derived ORR fix: After correcting PR/CR flags, surgically updates the derived ORR `confirmed` to match the source PR/CR — no post_process re-run needed
- Prompt improved: Instructs the LLM to derive both confirmed and unconfirmed ORR from response counts (e.g. “6 confirmed PRs and 2 unconfirmed PRs among 40 patients”)
Forward fix for derived ORR (2026-03-24): Updated `post_process.rb` `derive_orr_for_subgroup` to propagate the `confirmed` flag from the source PR/CR records. If all PR/CR are `confirmed=true`, the derived ORR gets `confirmed=true`. If mixed, derives both a confirmed and an unconfirmed ORR.
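The propagation rule can be sketched as follows (method name hypothetical; the real logic lives in `derive_orr_for_subgroup`):

```ruby
# Sketch: which confirmed flag(s) should the derived ORR row(s) carry,
# given the confirmed flags on the source PR/CR records? Illustrative only.
def derived_orr_confirmed_flags(source_flags)
  flags = source_flags.compact.uniq
  return [nil]  if flags.empty?      # no confirmation info on any source record
  return flags  if flags.length == 1 # all confirmed, or all unconfirmed
  [true, false]                      # mixed → derive both a confirmed and an unconfirmed ORR
end
```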
Commands:
```shell
# Preview scope
bundle exec thor one_off:backfill_confirmed_unconfirmed_orr:identify

# Run (use --batched for large runs)
bundle exec thor one_off:backfill_confirmed_unconfirmed_orr:backfill --batched
```
Validation (2026-03-24):
- v1 tested on 20 random publications across two rounds — correctly handled split ORR, single confirmed, RECIST criteria, ambiguous cases
- v2 tested on 36 publications (30 dry run + 6 real run). Verified against abstracts:
  - Pub 1246: “unconfirmed partial response” → PR `confirmed=false`, derived ORR `confirmed=false` ✓
  - Pub 31619: mixed confirmed/unconfirmed PRs across disease subgroups — each subgroup’s derived ORR correctly matched its PR flag ✓
  - Pubs 1527, 5024, 7313, 7499: no confirmation language → all flags left as `confirmed=null` ✓
  - Zero spurious changes on pubs without confirmation language
Backfill v2 production run (2026-03-24, job 1608): 2,530 publications processed.
Results verified in prod:
- `confirmed=false` ORR records: 575 → 744 (+169 new unconfirmed pairs created)
- `confirmed=true` ORR records: 2,240 → 2,583 (+343 flags upgraded or new records)
- Derived ORR with confirmed flag: 94 `true` + 28 `false` (was all `null`)
- Known broken pubs verified correct: 47342 (27.6%/34.5%), 65504 (13.0%/17.4%), 74897 (46.7%/60.0%), 234678 (15.0%/20.0%)
- Remaining 55 `confirmed=true` ORR (45 pubs) with “unconfirmed” in the observation but no pair — root cause: the backfill was sending the existing `confirmed` flag to the LLM, which anchored on it and echoed it back instead of making a fresh determination from the abstract.
Backfill v3 fix (2026-03-24): Two changes to the prompt/input:
- Removed the `confirmed` field from existing records sent to the LLM — forces a fresh determination from abstract text only
Scope: ~2,000 pubs still in scope (1,553 incomplete pairs + 691 derived ORR with null confirmed). The pubs fixed by v2 are excluded (they now have complete pairs).
Command: `bundle exec thor one_off:backfill_confirmed_unconfirmed_orr:backfill --batched`
Backfill v3 production run (2026-03-24, job 1612): ~2,000 publications processed.
Results: 55 → 26 remaining records with “unconfirmed” in observation but no pair. The 26 remaining break down as:
- ~20 combined rates where the abstract reports “confirmed and unconfirmed responses” as a single number — `confirmed=true` is wrong (should be `null`) but can’t be split into two rows. Not a data loss since the value itself is correct.
- 2 genuine LLM misses (116973, 236929) — abstract has the data but LLM didn’t split
2026-03-26 audit findings — Issue reopened
A Clinical Evidence audit (`publications:audit_clinical_evidence`) on HNSCC publications identified 7 open cORR-related audit issues across 5 publications, demonstrating that the extraction fix is insufficient. Three categories of residual failure:
Category 1: LLM counts all responses as confirmed (post-fix)
Publication 30362 (Petosemtamab, updated_at: 2026-03-23 — processed AFTER the v3 backfill):
- Abstract: “1 confirmed complete response, 2 confirmed and 3 unconfirmed partial responses” among 10 evaluable patients
- Expected: cORR = 30% (3/10 confirmed), ORR = 60% (6/10 total)
- Extracted: ORR = 60.0% with `confirmed: true` — LLM counted ALL responses as confirmed
- Note: v3 backfill categorized this as “truncated abstract” but the abstract is NOT truncated — full response breakdown is present. The backfill LLM erroneously classified it as truncated.
Category 2: “cORR” terminology not recognized as confirmed flag
Publication 29660 (Tisotumab vedotin):
- Abstract explicitly uses “confirmed objective response rate (cORR)” as primary endpoint throughout
- Values: cORR = 32.5% (full cohort), cORR = 40.0% (≤2 prior lines)
- Extracted: ORR with `confirmed: null` for both subgroups — correct values but missing confirmed flag
- Impact: cORR column is empty in the report despite values being correctly extracted
Category 3: Total ORR mislabeled as confirmed
Publication 65575 (Ozuriftamab vedotin):
- Abstract: “ORR was 32% including confirmed and unconfirmed responses”
- Extracted: ORR = 32.0% with `confirmed: true` — the total ORR (including unconfirmed) is marked as confirmed
- Only a `confirmed: true` record exists; no `confirmed: false` pair
Additional confirmed cases with correct extraction but wrong audit flags (false positives from audit LLM):
- Pubs 65346, 151763, 237727: Both confirmed and unconfirmed ORR rows exist with correct values and flags. The `ClinicalEvidenceQuery` cORR extraction at lines 658–675 correctly filters `confirmed=true`. These audit findings appear to be audit LLM errors (confusing which row is ORR vs cORR).
Remaining scope estimate:
```sql
-- Publications with only confirmed=true ORR (no unconfirmed counterpart)
-- that might have wrong confirmed attribution
SELECT count(DISTINCT ts.source_id)
FROM trial_subgroups ts
JOIN trial_outcome_measures tom ON tom.trial_subgroup_id = ts.id
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
WHERE ts.source_type = 'Publication'
  AND te.abbreviation = 'ORR'
  AND tom.confirmed = true
  AND NOT EXISTS (
    SELECT 1
    FROM trial_outcome_measures tom2
    JOIN trial_endpoints te2 ON te2.id = tom2.trial_endpoint_id
    WHERE tom2.trial_subgroup_id = ts.id
      AND te2.abbreviation = 'ORR'
      AND tom2.confirmed = false
  );
-- Returns 1,178 publications — subset may have wrong attribution
```

Forward fix needed: The classify_publications prompt needs stronger instructions for three specific failure modes:
- When abstract lists individual confirmed + unconfirmed responses by count (e.g., “2 confirmed PR, 3 unconfirmed PR”), derive both cORR and ORR from counts — don’t sum them into one value
- When abstract uses “cORR” or “confirmed ORR” terminology, set `confirmed: true` on the endpoint even if no separate unconfirmed value is stated
- When abstract says “ORR X% (including confirmed and unconfirmed)”, set `confirmed: false` or `confirmed: null` — not `confirmed: true`
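The first failure mode — deriving two records from mixed response counts — is plain arithmetic over the counts. A minimal sketch (hypothetical helper for illustration, not part of the pipeline):

```ruby
# Hypothetical helper illustrating the cORR/ORR split the prompt asks for:
# derive TWO rates from per-category response counts instead of one summed value.
def split_orr(confirmed_responses:, unconfirmed_responses:, evaluable:)
  corr = (confirmed_responses.to_f / evaluable * 100).round(1)
  orr  = ((confirmed_responses + unconfirmed_responses).to_f / evaluable * 100).round(1)
  { corr: corr, orr: orr }
end

# Pub 30362 pattern: "1 confirmed CR, 2 confirmed PR, 3 unconfirmed PR" among 10 evaluable
split_orr(confirmed_responses: 3, unconfirmed_responses: 3, evaluable: 10)
# → { corr: 30.0, orr: 60.0 }
```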
Related: See also Issue 27 — even when extraction is correct, extract_efficacy_metrics in ClinicalEvidenceQuery can pick the confirmed ORR value for the plain ORR metric
Forward fix v4 (2026-03-26): Added two additional prompt instructions to app/tasks/publications_llm_classification/task.rb:
- Explicit example for deriving TWO ORR records from mixed confirmed/unconfirmed response counts (e.g., “1 confirmed CR, 2 confirmed PR, and 3 unconfirmed PR among 10 patients” → cORR=30%, ORR=60%). Addresses the pattern where the LLM sums all responses and marks as confirmed.
- Instruction that when the primary endpoint is described as “cORR” or “confirmed ORR”, the value IS confirmed and `confirmed: true` must be set — do not leave as null.
Backfill scope (2026-03-26): Structural scope (no text matching) — all ~25.5K publications with ORR that don’t already have both a confirmed=true AND confirmed=false ORR record. The LLM determines from the abstract whether confirmation language exists; apply_result is a no-op for pubs where the LLM returns confirmed=null.
Estimated affected (will actually change): ~1,000-1,500 publications based on text analysis showing ~985 with “confirmed ORR”/“cORR” language + null flag, ~77 with wrong confirmed=true, ~21 v3 remnants.
```sql
-- V4 structural scope: all ORR pubs without complete confirmed pair
SELECT DISTINCT ts.source_id
FROM trial_subgroups ts
JOIN trial_outcome_measures tom ON tom.trial_subgroup_id = ts.id
JOIN trial_endpoints te ON te.id = tom.trial_endpoint_id
WHERE ts.source_type = 'Publication'
  AND te.abbreviation = 'ORR'
  AND NOT EXISTS (
    SELECT 1
    FROM trial_outcome_measures t1
    JOIN trial_endpoints e1 ON e1.id = t1.trial_endpoint_id
    JOIN trial_outcome_measures t2 ON t2.trial_subgroup_id = t1.trial_subgroup_id
    JOIN trial_endpoints e2 ON e2.id = t2.trial_endpoint_id
    WHERE t1.trial_subgroup_id = tom.trial_subgroup_id
      AND e1.abbreviation = 'ORR'
      AND t1.confirmed = true
      AND e2.abbreviation = 'ORR'
      AND t2.confirmed = false
  );
```

Cost: ~$6 using gpt-4o-mini in batch mode (simple classification, no reasoning model needed).
Backfill v4 production run (2026-03-26, job 1626): 25,594 publications processed (full structural scope, gpt-5-mini batch).
Results:
- `confirmed=true` records: 2,685 → 5,948 (+3,263 new confirmed flags)
- `confirmed=false` records: 744 → 1,424 (+680 new unconfirmed records)
- Complete confirmed/unconfirmed pairs: 477 → 783 pubs (+306)
- Pubs with any confirmed flag: 2,240 → 3,436 (+1,196)
Spot-checked 12 random publications against full abstracts — 11 correct, 1 pre-existing extraction error:
Round 1:
- Pub 29807 (AZD9291): abstract says “confirmed+unconfirmed ORR 51%” → correctly split to cORR=33.9%, ORR=51%
- Pub 56237 (IMO+ipi): abstract says “6 PR (3 confirmed)” among 15 → correctly split to cORR=20%, ORR=46.7%
- Pub 76478 (Pralsetinib): abstract says “all confirmed” for naïve subgroup → correct 73.7% confirmed; overall has small gap (63.3% vs 64.6%)
- Pub 65763 (Belrestotug): complete pairs for all 4 arms, confirmed < unconfirmed in each (expected)
- Pub 59860 (Pazopanib GCT): pre-existing extraction error — abstract reports marker response (4/5 AFP/HCG decrease), not RECIST ORR. The 80% “ORR” is a marker response rate, not a true ORR. Backfill correctly split confirmed/unconfirmed given the existing data, but the underlying extraction is wrong. Not a backfill bug.
Round 2 (full abstract read → compare):
- Pub 58824 (Fruquintinib+S-1): 1 confirmed PR at 4mg, 2 unconfirmed at 5mg → cORR=16.67% (1/6), ORR=50% (3/6) ✓
- Pub 62418 (Zongertinib GI): abstract explicitly states “confirmed ORR 17.2%” and “regardless of confirmation 20.7%” → exact match ✓
- Pub 70313 (D-1553 KRAS G12C): “1 confirmed CR, 3 confirmed PR, 1 unconfirmed PR” → cORR=40% (4/10), ORR=50% (5/10) ✓
- Pub 234635 (Ficerafusp SCAC): “6 of 7 responses confirmed” → cORR=27.3% (6/22), ORR=31.8% (7/22) ✓
- Pub 238559 (BC3195 ADC): “4 confirmed PR (cPR)” out of 31 at 2.4mg, 5 total PR → cORR=12.9%, ORR=16.1% ✓
Zero spurious changes on pubs without confirmation language (24K+ no-ops)
26. Parent population N propagated to child subgroups
Section titled “26. Parent population N propagated to child subgroups”Short summary
Section titled “Short summary”When classify_publications extracts data for hierarchical subgroups (e.g., “Phase 1b dose expansion → SCCHN”), the LLM copies the parent subgroup’s number_of_participants to all child subgroups instead of extracting the subset-specific N. This produces incorrect patient counts for ~5,058 child subgroups across 1,174 publications.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”classify_publications — app/tasks/publications_llm_classification/task.rb
The prompt currently instructs (line 120–123):
“CRITICAL: Set number_of_participants to the TOTAL evaluable patients in that subgroup/cohort — NOT the count of patients with a specific outcome.”
This instruction was added for Issue 24 (confusing outcome counts with cohort size), but it has a side effect: the LLM interprets “total evaluable patients in that subgroup” as the parent population total when it doesn’t know the child-specific N.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”No prompt instruction distinguishes between:
- The parent population N (e.g., 39 patients in Phase 1b)
- The child subgroup N (e.g., the SCCHN subset of those 39)
The LLM defaults to the known parent N rather than outputting null when the child N isn’t explicitly stated.
Concrete examples
Section titled “Concrete examples”Publication 134450 — MRG003 Phase 1 (SCCHN/NPC/CRC basket):
- “Phase 1b dose expansion” → N=39 (correct, total)
- “Phase 1b dose expansion → SCCHN” → N=39 (WRONG — SCCHN is a subset)
- “Phase 1b dose expansion → NPC” → N=39 (WRONG — NPC is a subset)
- “Phase 1b dose expansion → CRC” → N=39 (WRONG — CRC is a subset)
All three disease children show the parent’s N instead of the actual per-disease cohort size.
Publication 5799 — Neoadjuvant hormonal therapy in prostate cancer:
- “Overall” → N=62 (correct)
- “Overall → Baseline tumor burden: Low” → N=62 (WRONG — subset)
- “Overall → Baseline tumor burden: High” → N=62 (WRONG — subset)
- “Overall → PTEN/ERG immunostatus: Altered” → N=62 (WRONG — subset)
- “Overall → PTEN/ERG immunostatus: Wild-type” → N=62 (WRONG — subset)
Downstream impact
Section titled “Downstream impact”- Clinical Evidence report shows inflated patient counts for sub-cohort rows
- ORR percentages combined with wrong N produce misleading responder counts (e.g., 40% ORR with N=39 implies 15.6 responders, but the actual SCCHN cohort may only have 10 patients)
- Undermines per-cohort comparisons in basket trial reporting
~5,058 child subgroup-endpoint rows across 1,174 publications where 2+ siblings all share the parent’s N.
```sql
-- Identify affected parent-child groups
WITH parent_child AS (
  SELECT DISTINCT
    ts_parent.source_id AS pub_id,
    ts_parent.subgroup_value AS parent,
    ts_child.subgroup_value AS child,
    tao_child.number_of_participants AS child_n,
    tao_parent.number_of_participants AS parent_n,
    te_child.abbreviation AS endpoint
  FROM trial_subgroups ts_child
  JOIN trial_subgroups ts_parent
    ON ts_parent.source_id = ts_child.source_id
    AND ts_parent.source_type = ts_child.source_type
    AND ts_child.subgroup_value LIKE ts_parent.subgroup_value || ' → %'
    AND ts_child.subgroup_value NOT LIKE ts_parent.subgroup_value || ' → % → %'
  JOIN trial_outcome_measures tom_child ON tom_child.trial_subgroup_id = ts_child.id
  JOIN trial_arm_outcomes tao_child ON tao_child.trial_outcome_measure_id = tom_child.id
  JOIN trial_outcome_measures tom_parent ON tom_parent.trial_subgroup_id = ts_parent.id
  JOIN trial_endpoints te_child ON tom_child.trial_endpoint_id = te_child.id
  JOIN trial_endpoints te_parent ON tom_parent.trial_endpoint_id = te_parent.id
  JOIN trial_arm_outcomes tao_parent ON tao_parent.trial_outcome_measure_id = tom_parent.id
  WHERE ts_child.source_type = 'Publication'
    AND te_child.abbreviation = te_parent.abbreviation
    AND tao_child.number_of_participants = tao_parent.number_of_participants
    AND tao_child.number_of_participants > 0
)
SELECT pub_id, parent, COUNT(DISTINCT child) AS num_siblings
FROM parent_child
GROUP BY pub_id, parent
HAVING COUNT(DISTINCT child) >= 2;
-- Returns 1,776 parent groups across 1,174 publications
```

Explored solution direction
Section titled “Explored solution direction”Forward fix: Add a prompt instruction to classify_publications in task.rb:
“When extracting `number_of_participants` for a child subgroup (e.g., ‘Overall → NSCLC’, ‘Phase 1b → SCCHN’), use the N specific to that sub-cohort, NOT the parent population’s total. If the abstract does not explicitly state how many patients are in the child sub-cohort, set `number_of_participants` to null rather than copying the parent’s N. For example, if ‘Phase 1b’ enrolled 39 patients across SCCHN, NPC, and CRC, do NOT set N=39 for each disease — set N to null unless the abstract specifies the per-disease count.”
Backfill: Re-extract the ~1,174 affected publications with the updated prompt. Alternatively, a cheaper post-processing cleanup could null out child N values that match the parent N when 2+ siblings exist — but this may also catch legitimate cases (e.g., crossover designs where all patients go through each arm), so prompt fix + re-extraction is safer.
Related issues: Issue 24 (subgroup participant count wrong for biomarker sub-cohorts) is a specific instance of this broader pattern.
Solution applied
Section titled “Solution applied”Three-part fix:
- Forward prompt fix (`app/tasks/publications_llm_classification/task.rb`): Added instruction telling the LLM to use null for child subgroup `number_of_participants` when the abstract doesn’t explicitly state the per-subset count, rather than copying the parent’s N. Includes concrete right/wrong examples.
- Post-processing guard (`app/tasks/publications_llm_classification/post_process.rb`): Added `null_out_propagated_parent_n` method that runs after `process_outcome_measures`. Detects parent-child pairs where 2+ siblings share the parent’s N for the same endpoint and nulls out those child N values. Acts as a permanent safety net regardless of LLM behavior.
- One-off backfill (`lib/tasks/one_off/null_propagated_parent_n.thor`): SQL-based fix for existing affected records. Identifies child `trial_arm_outcomes` where 2+ siblings share the parent’s N and sets `number_of_participants` to NULL. No LLM re-runs needed — the correct answer is NULL since these abstracts don’t state the per-subset N.
  - Run `thor one_off:null_propagated_parent_n:identify` to preview scope
  - Run `thor one_off:null_propagated_parent_n:backfill --no-dry-run` to apply
Scope note: Only the 2+ siblings case is addressed. Single-child cases are ambiguous — the child could legitimately be the full parent population — and are left untouched.
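The 2+ siblings heuristic can be sketched in plain Ruby (the row shape and helper name are assumptions for illustration, not the actual post_process.rb code):

```ruby
# Flag child subgroups whose N equals the parent's N when 2+ siblings match —
# the signature of propagated parent N. Rows: { subgroup: String, n: Integer or nil }.
# Single-child matches are left alone, mirroring the scope note above.
def propagated_child_ns(rows)
  suspects = []
  rows.each do |parent|
    # Direct children only: one more " → " segment than the parent
    children = rows.select do |r|
      r[:subgroup].start_with?("#{parent[:subgroup]} → ") &&
        !r[:subgroup].delete_prefix("#{parent[:subgroup]} → ").include?(' → ')
    end
    matches = children.select { |c| c[:n] && c[:n] == parent[:n] }
    suspects.concat(matches.map { |c| c[:subgroup] }) if matches.size >= 2
  end
  suspects.uniq
end

# Pub 134450 pattern: parent N=39 copied onto all three disease children
rows = [
  { subgroup: 'Phase 1b dose expansion',         n: 39 },
  { subgroup: 'Phase 1b dose expansion → SCCHN', n: 39 },
  { subgroup: 'Phase 1b dose expansion → NPC',   n: 39 },
  { subgroup: 'Phase 1b dose expansion → CRC',   n: 39 }
]
propagated_child_ns(rows) # flags all three disease children
```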
Backfill v1 (prod): Ran successfully, nulled out all same-endpoint matches (0 remaining for same-endpoint check).
Backfill v2 fix (2026-03-24): v1 only matched child N against parent N on the same endpoint (e.g., child ORR N vs parent ORR N). But N propagation happens at the subgroup level — a child DCR can have the parent’s N even if the parent only has ORR. Found 2,495 pubs / 6,820 children still affected. Fixed both the one-off task and the post-process guard to match child N against ANY parent N across all endpoints.
Command: `bundle exec thor one_off:null_propagated_parent_n:backfill --no-dry-run`
27. extract_efficacy_metrics picks confirmed ORR as plain ORR
Section titled “27. extract_efficacy_metrics picks confirmed ORR as plain ORR”Short summary
Section titled “Short summary”When both confirmed (confirmed=true) and unconfirmed (confirmed=false) ORR rows exist for the same subgroup in the view, ClinicalEvidenceQuery#extract_efficacy_metrics can pick the confirmed row as the plain ORR metric value. This happens because the ORR extraction loop does not exclude confirmed rows, and when both rows have the same number_of_participants, max_by returns whichever comes first — often the confirmed row.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”ClinicalEvidenceQuery#extract_efficacy_metrics — app/queries/tpp/clinical_evidence_query.rb, lines 590–628.
The cORR extraction (lines 658–675) correctly filters confirmed == true and is unaffected. The problem is exclusively in the general efficacy extraction loop that handles ORR alongside OS, PFS, DOR, etc.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”Lines 600–611:
```ruby
PRIMARY_EFFICACY_ABBREVIATIONS.each do |abbr|
  matching = grouped[abbr] || grouped[abbr.downcase]
  next if matching.nil? || matching.empty?

  matching = filter_by_valid_unit(matching, abbr)
  next if matching.empty?

  experimental = matching.select { |r| r['resolved_group_type'] == 'EXPERIMENTAL' }
  experimental = matching if experimental.empty?

  best_row = experimental.max_by { |r| r['number_of_participants'].to_i } || matching.first
```

When `abbr == 'ORR'`, `matching` includes ALL ORR rows regardless of confirmed flag. If both `confirmed=true` (value=26.7%) and `confirmed=false` (value=43.3%) exist with the same N, `max_by` picks the first match. The result: `metrics[:orr]` gets the confirmed value, making it identical to `metrics[:corr]` and wrong as a standalone ORR.
Concrete examples
Section titled “Concrete examples”Publication 117228 (RM-1929 photoimmunotherapy in rHNSCC):
Abstract states:
- “unconfirmed objective response rate (ORR) 43.3%”
- “confirmed ORR 26.7%”
View correctly has both rows (subgroup “Heavily pretreated rHNSCC → Part 2”):
- `confirmed=true, measure_value=26.7, number_of_participants=30`
- `confirmed=false, measure_value=43.3, number_of_participants=30`
Report output: efficacy.orr.value = 26.7 (should be 43.3)
The cORR extraction correctly returns 26.7%, but the ORR extraction ALSO returns 26.7% instead of 43.3%.
Downstream impact
Section titled “Downstream impact”- Understated ORR: When confirmed ORR is lower than unconfirmed ORR (the typical pattern), the report shows the lower confirmed value as the headline ORR. For pub 117228, ORR is understated from 43.3% to 26.7%.
- Duplicate values: ORR and cORR columns show the same value, making the cORR column appear redundant and hiding the existence of a lower confirmed rate.
- Audit noise: The audit correctly flags these as `incorrect_value` on `efficacy.orr.value`, generating true-positive findings that overlap with Issue 25 audit findings.
477 publications currently have both confirmed=true and confirmed=false ORR rows (the correct Issue 25 extraction pattern). When both rows have the same N (which is common — confirmed and unconfirmed ORR are computed from the same denominator), the confirmed value gets picked as plain ORR.
```sql
-- Publications where confirmed and unconfirmed ORR have the same N
-- (susceptible to the wrong-pick bug)
SELECT count(DISTINCT ts.source_id)
FROM trial_subgroups ts
JOIN trial_outcome_measures tom_c ON tom_c.trial_subgroup_id = ts.id AND tom_c.confirmed = true
JOIN trial_outcome_measures tom_u ON tom_u.trial_subgroup_id = ts.id AND tom_u.confirmed = false
JOIN trial_endpoints te_c ON te_c.id = tom_c.trial_endpoint_id AND te_c.abbreviation = 'ORR'
JOIN trial_endpoints te_u ON te_u.id = tom_u.trial_endpoint_id AND te_u.abbreviation = 'ORR'
JOIN trial_arm_outcomes tao_c ON tao_c.trial_outcome_measure_id = tom_c.id
JOIN trial_arm_outcomes tao_u ON tao_u.trial_outcome_measure_id = tom_u.id
WHERE ts.source_type = 'Publication'
  AND tao_c.number_of_participants = tao_u.number_of_participants;
```

Explored solution direction
Section titled “Explored solution direction”Forward fix: In extract_efficacy_metrics, when processing ORR, exclude confirmed=true rows if confirmed=false rows also exist for the same subgroup. This ensures the plain ORR metric always uses the unconfirmed/total ORR:
```ruby
# Inside the PRIMARY_EFFICACY_ABBREVIATIONS.each loop, after filtering matching:
if abbr == 'ORR'
  unconfirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  matching = unconfirmed if unconfirmed.any?
end
```

This is a ~3 line change in `clinical_evidence_query.rb`. No backfill needed — fixing the query immediately fixes all report output.
No backfill required: This is a query-layer bug, not a data issue. The underlying data (trial_outcome_measures with correct confirmed flags) is correct. Fixing the Ruby code fixes all publications instantly.
Solution applied
Section titled “Solution applied”Forward fix (2026-03-26): Added 5-line guard in app/queries/tpp/clinical_evidence_query.rb extract_efficacy_metrics method (line 610-613). When processing ORR, rejects confirmed=true rows if non-confirmed rows exist. This ensures the plain ORR metric uses the unconfirmed/total ORR, while the cORR extraction (line 667-683) independently picks confirmed=true rows.
```ruby
if abbr == 'ORR'
  non_confirmed = matching.reject { |r| [true, 't'].include?(r['confirmed']) }
  matching = non_confirmed if non_confirmed.any?
end
```

Edge cases handled:
- Both confirmed + unconfirmed exist → ORR gets unconfirmed, cORR gets confirmed (correct)
- Only confirmed exists (no unconfirmed) → ORR falls back to confirmed value (safe fallback — same as cORR)
- Only unconfirmed/null exists → no change (correct)
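To illustrate the guard's behavior on pub 117228's two rows (hash shapes assumed to mirror the view columns referenced in this issue):

```ruby
# Two ORR rows with equal N — without the guard, max_by ties on N and can
# return the confirmed row, understating the plain ORR.
rows = [
  { 'confirmed' => true,  'measure_value' => 26.7, 'number_of_participants' => 30 },
  { 'confirmed' => false, 'measure_value' => 43.3, 'number_of_participants' => 30 }
]

non_confirmed = rows.reject { |r| [true, 't'].include?(r['confirmed']) }
matching = non_confirmed.any? ? non_confirmed : rows
best_row = matching.max_by { |r| r['number_of_participants'].to_i }
best_row['measure_value'] # → 43.3 (plain ORR gets the unconfirmed/total value)
```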
No backfill needed — query-layer fix applies immediately to all report output
28. build_result_rows collapses dose-level arms when study_plan_arm_id is null
Section titled “28. build_result_rows collapses dose-level arms when study_plan_arm_id is null”Short summary
Section titled “Short summary”ClinicalEvidenceQuery.build_result_rows groups view rows by [publication_id, disease_id, effective_line, study_plan_arm_id, subgroup_value]. When study_plan_arm_id is null — which it is for all publication-extracted arms that haven’t been matched to a clinical trial study plan arm — distinct dose-level arms (e.g. “8.0 mg/kg” and “10.0 mg/kg”) sharing the same subgroup_value collapse into a single group. extract_efficacy_metrics then picks one arm by max_by(number_of_participants), silently dropping the other.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/queries/tpp/clinical_evidence_query.rb, build_result_rows method (line 306).
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”The grouping key at line 306 is:
```ruby
grouped = enriched_data.group_by { |row|
  [row['publication_id'], row['disease_id'], row['effective_line'],
   row['study_plan_arm_id'], row['subgroup_value']]
}
```

When `study_plan_arm_id` is null for both dose arms (common for unlinked publications), they group together. `extract_efficacy_metrics` (line 619) then picks one via `max_by(number_of_participants)`.
Concrete examples
Section titled “Concrete examples”Pub 190656 (ARTEMIS-001, HS-20093 B7-H3 ADC in NSCLC):
- View has 6 rows for “NSCLC → Squamous cell carcinoma” (3 endpoints × 2 dose arms: 8.0 mg/kg N=32 and 10.0 mg/kg N=26)
- Both arms have `study_plan_arm_id = null`
- Query collapses to 1 row, picks 8.0 mg/kg (N=32 > N=26)
- Lost data: Sq 10.0 mg/kg cORR 26.9%, PFS 5.7, DOR 7.0
Downstream impact
Section titled “Downstream impact”Dose-level subgroup data is silently dropped from the Clinical Evidence report. For dose-escalation studies where different dose levels have meaningfully different efficacy, only the higher-N cohort appears.
Affects dose-escalation/expansion publications where arms aren’t matched to trial study plan arms. The view correctly distinguishes arms by arm_name, but the query ignores arm_name in its grouping key.
Explored solution direction
Section titled “Explored solution direction”Add arm_name to the grouping key in build_result_rows, or fall back to arm_name when study_plan_arm_id is null. This preserves dose-level arm distinctions without breaking publications where study_plan_arm_id correctly differentiates arms.
Related to Issue 20 (study_plan_arm link is fragile) — same root cause of over-reliance on study_plan_arm_id.
Solution applied
Section titled “Solution applied”29. Dose extraction captures study-level range, not efficacy population range
Section titled “29. Dose extraction captures study-level range, not efficacy population range”Short summary
Section titled “Short summary”In dose-escalation studies, classify_publications extracts the full dose range stated in the abstract (e.g. dose_min=1.0, dose_max=8.3 mg/kg) as a property of the subgroup. But when the abstract restricts efficacy reporting to a dose subset (e.g. “results for patients who received ≥4.0 mg/kg”), the dose_min on the efficacy row is too low, creating a mismatch between the dose range and the efficacy population.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — dose fields extracted as subgroup-level properties.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”Dose extraction treats dose as a study-level attribute (“what doses were used?”) rather than scoping to the efficacy analysis population (“what doses did the patients in the reported results actually receive?”). The LLM prompt doesn’t instruct it to scope dose to the efficacy population.
Concrete examples
Section titled “Concrete examples”Pub 238709 (MYTX-011 KisMET-01 updated):
- Abstract: “85 pts received 1.0–8.3 mg/kg; 59 pts received ≥4.0 mg/kg” — efficacy reported only for ≥4.0 mg/kg subset
- Extracted: `dose_min=1.0, dose_max=8.3`
- Expected: `dose_min=4.0, dose_max=8.3` (matching the efficacy population)
- RP2D correctly extracted as “5.0 mg/kg Q3W (2-on 1-off) and 4.0 mg/kg Q3W”
Downstream impact
Section titled “Downstream impact”Report rows show a broader dose range than the actual efficacy population received. Minor impact on report accuracy but misleading for dose-response interpretation.
Affects phase I dose-escalation studies where efficacy is reported for a dose subset. Relatively uncommon pattern — most studies report efficacy at a single dose or clearly per-dose-level.
Explored solution direction
Section titled “Explored solution direction”Update the classify_publications dose extraction prompt to instruct the LLM: “When the abstract reports efficacy for a specific dose subset, use that subset’s dose range, not the full escalation range.” Alternatively, accept this as a known limitation since RP2D (when present) correctly reflects the clinically relevant dose.
Solution applied
Section titled “Solution applied”30. Cross-study data contamination from abstract background sections
Section titled “30. Cross-study data contamination from abstract background sections”Short summary
Section titled “Short summary”When a publication abstract references efficacy results from a prior study as background context (e.g. “In our previous study NCT05029882, ORR was 24.4%”), classify_publications extracts those values as if they belong to the current study. This produces fabricated efficacy data for publications that may have no efficacy results of their own yet.
Where this sits in the current pipeline
Section titled “Where this sits in the current pipeline”app/tasks/publications_llm_classification/task.rb — efficacy extraction from abstract text.
Exact restriction causing the drop
Section titled “Exact restriction causing the drop”The LLM extraction prompt does not distinguish between efficacy results reported as outcomes of the current study vs. results cited from external/prior studies as background context. The abstract structure (Background → Methods → Results → Conclusions) is not enforced.
Concrete examples
Section titled “Concrete examples”Pub 29705 (ABBV-400/Telisotuzumab adizutecan signal-seeking study, NCT06084481):
- Abstract background: “Initial results from the ongoing first-in-human study (NCT05029882) of ABBV-400… an overall response rate of 24.4%”
- Current study status: “As of 19 January 2024, 24 patients have been enrolled” — no efficacy data reported
- Extracted: ORR=24.4%, N=24 (enrollment count misinterpreted as efficacy N)
- Expected: No efficacy data (null)
The 24.4% ORR belongs to NCT05029882, not NCT06084481. The N=24 is enrollment, not an efficacy population.
Downstream impact
Section titled “Downstream impact”Publications appear in the Clinical Evidence report with fabricated efficacy data from unrelated studies. This is particularly misleading for signal-seeking or early-enrollment publications where the abstract previews prior results to motivate the new study.
Affects publications whose abstracts cite efficacy results from prior/companion studies. Common in: signal-seeking study designs, follow-up studies referencing parent trials, and publications describing study rationale with prior data. Exact count unknown — requires systematic detection.
Explored solution direction
Section titled “Explored solution direction”- Audit prompt guard (deployed): Added “CROSS-STUDY REFERENCES” instruction to the audit prompt so future audits flag these correctly.
- Extraction prompt fix (forward): Update the `classify_publications` prompt to instruct: “Only extract efficacy values reported as results of THIS study (typically in the Results section). Do not extract values cited from prior/external studies in the Background or Introduction.”
- Detection query: Publications where `llm_data` has efficacy values but the abstract contains phrases like “previous study”, “prior study”, “first-in-human study (NCT…)” with efficacy values in the same sentence could be flagged for review.
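The detection idea can be sketched with a sentence-level regex (hypothetical helper; the phrase list and pattern would need tuning against real abstracts):

```ruby
# Flag abstracts that cite prior-study efficacy in the same sentence —
# candidates for Issue 30 review.
PRIOR_STUDY_EFFICACY = /\b(previous|prior|first-in-human)\s+study\b.*\b(ORR|response rate)\b.*\d+(\.\d+)?%/i

def cites_prior_efficacy?(abstract)
  # Naive sentence split; any sentence mixing prior-study language with a rate is flagged
  abstract.split(/(?<=[.!?])\s+/).any? { |s| s =~ PRIOR_STUDY_EFFICACY }
end

# Pub 29705 pattern: background cites the parent trial's ORR
cites_prior_efficacy?(
  "Initial results from the ongoing first-in-human study (NCT05029882) " \
  "of ABBV-400 showed an overall response rate of 24.4%."
) # → true
```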
Solution applied
Section titled “Solution applied”Audit prompt updated with cross-study reference guard (2026-03-27). Extraction-level fix pending.
Job 1594 Triage Log (HNSCC + ADC, disease_id=6200, technology_ids=708)
Section titled “Job 1594 Triage Log (HNSCC + ADC, disease_id=6200, technology_ids=708)”| Audit ID | Pub ID | Type | Field | Classification | Notes |
|---|---|---|---|---|---|
| 8338 | 29660 | incorrect_value | efficacy.dor.value | True issue — extraction (minor) | LLM appended spurious “(4.55)” to “Not Reached” DOR |
| 8341 | 29705 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 30) | ORR from referenced prior study NCT05029882, not current study |
| 8342 | 29705 | incorrect_value | efficacy.orr.patient_count | True issue — extraction (Issue 30) | Enrollment count (24) misinterpreted as efficacy N |
| 8339 | 44216 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Dose-escalation range (0.3) on dose-expansion efficacy row (RP2D=2.0) |
| 8340 | 44216 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Dose-escalation range (2.2) on dose-expansion efficacy row (RP2D=2.0) |
| 8343 | 115389 | incorrect_value | efficacy.pfs.value | True issue — extraction | “Not Reached” should be null; abstract says “immature” (insufficient data) |
| 8344 | 134450 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8 residual) | Zero-sentinel: N=0 instead of null for unstated SCCHN-specific N |
| 8345 | 134450 | incorrect_value | dose_min | True issue — extraction (Issue 29 variant) | Child subgroup inherited phase 1a escalation dose (0.1) instead of parent’s fixed dose (2.5) |
| 8346 | 134450 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Dose range from escalation phase on expansion subgroup |
| 8347 | 75542 | missing_subgroup | — | False positive — audit LLM | ctDNA abundance is a Cox model correlation, not a tabulated efficacy subgroup |
| 8348 | 75542 | missing_subgroup | — | False positive — audit LLM | VAF persistence is a statistical correlation, not a reportable subgroup |
| 8349 | 114973 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Full escalation range (0.3) on efficacy row; efficacy population was 3.6-5.4 |
| 8350 | 114973 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Full escalation range (8.0) on efficacy row; efficacy population was 3.6-5.4 |
Job 1635 Triage Log (CRC + ADC, disease_id=4345, technology_ids=708)
Section titled “Job 1635 Triage Log (CRC + ADC, disease_id=4345, technology_ids=708)”| Audit ID | Pub ID | Type | Field | Classification | Notes |
|---|---|---|---|---|---|
| 8360 | 241259 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 2.0 mg/kg arm; per-arm N not stated |
| 8361 | 241259 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 2.4 mg/kg arm; per-arm N not stated |
| 8362 | 241259 | incorrect_value | dose_min | True issue — view (Issue 31) | SOC arm has Temab-A dose_min=1.6; SOC is trifluridine/tipiracil+BEV |
| 8363 | 241259 | incorrect_value | dose_max | True issue — view (Issue 31) | SOC arm has Temab-A dose_max=2.4 |
| 8364 | 241259 | incorrect_value | dose_units | True issue — view (Issue 31) | SOC arm has mg/kg (Temab-A units) |
| 8365 | 241259 | incorrect_value | dose_frequency | True issue — view (Issue 31) | SOC arm has Q3W (Temab-A schedule) |
| 8366 | 241259 | incorrect_value | rp2d | True issue — view (Issue 31) | SOC arm has Temab-A RP2D |
| 8352 | 29699 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 8) | Zero-sentinel: ORR=0% for overall mCRC; no numeric ORR in abstract (E-R paper) |
| 8353 | 29699 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 2.4 mg/kg arm |
| 8354 | 29699 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 8) | Zero-sentinel: ORR=0% for 2.4 mg/kg; E-R correlations only |
| 8355 | 29699 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for 3.0 mg/kg arm |
| 8356 | 29699 | incorrect_value | efficacy.orr.value | True issue — extraction (Issue 8) | Zero-sentinel: ORR=0% for 3.0 mg/kg; E-R correlations only |
| 8368 | 29737 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 4.8+ mo (SD pts only) mapped to PFS for full CRC cohort |
| 8369 | 29737 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 32) | N=29 (full CRC) but TTP was for 14 SD patients only |
| 8370 | 29737 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 4.4+ mo (SD pts only) mapped to PFS for KRAS-mutated |
| 8371 | 29737 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 32) | N=13 (full KRAS) but TTP was for 7 SD patients only |
| 8411 | 134450 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for CRC phase 1b; ORR/DCR reported |
| 8412 | 134450 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Phase 1a escalation min (0.1) on phase 1b efficacy row (RP2D=2.5) |
| 8413 | 134450 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for SCCHN phase 1b; ORR/DCR reported |
| 8414 | 134450 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Same as 8412 for SCCHN child subgroup |
| 8402 | 72043 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | CRC × HER2 IHC 3+ cross-tabulated subgroup missing |
| 8403 | 72043 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | CRC × HER2 IHC 2+ cross-tabulated subgroup missing |
| 8404 | 72043 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | CRC × HER2 IHC 1+ cross-tabulated subgroup missing |
| 8405 | 72043 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | CRC × HER2 mut/amp cross-tabulated subgroup missing |
| 8386 | 74193 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 1.6 mo mapped to PFS |
| 8387 | 74193 | incorrect_value | patient_number_efficacy | True issue — extraction | ctDNA retained subgroup: N=3 (tested) but only 2 had retention |
| 8388 | 74193 | incorrect_value | efficacy.orr.patient_count | True issue — extraction | Same: ORR denominator=3 should be 2 |
| 8389 | 74193 | incorrect_value | efficacy.dcr.patient_count | True issue — extraction | Same: DCR denominator=3 should be 2 |
| 8380 | 200353 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 26) | Parent N=97 propagated to “Absent MR” child subgroup |
| 8381 | 200353 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 26) | Parent N=97 propagated to “Complete MR” child subgroup |
| 8382 | 200353 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for EGFR amplification subgroup |
| 8383 | 200353 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 8) | Zero-sentinel: PFS patient_count=0 for EGFR amp |
| 8373 | 48880 | incorrect_value | single_dose | True issue — extraction | Pooled Overall row shows single_dose=5.4; study had both 5.4 and 6.4 mg/kg |
| 8374 | 48880 | incorrect_value | dose_min | False positive — audit LLM | dose_min=5.4 IS the minimum dose; audit confused by dose_max also being 5.4 |
| 8375 | 48880 | incorrect_value | dose_max | True issue — extraction | dose_max=5.4 should be 6.4 (second arm omitted from pub-level dose) |
| 8407 | 135119 | incorrect_value | patient_number_safety | True issue — extraction | Safety N=28 (Q2W-LD only); full study N=43 includes Q3W arm |
| 8408 | 135119 | incorrect_value | dose_max | True issue — extraction | dose_max=170 but Q3W arm went to 190 mg/m² |
| 8409 | 135119 | incorrect_value | dose_frequency | True issue — extraction | Q2W only; study used both Q2W and Q3W schedules |
| 8397 | 66892 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 0.8 on efficacy row; efficacy population ≥6 mg/kg |
| 8398 | 66892 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Same for IHC 2+/FISH+ child subgroup |
| 8399 | 66892 | missing_subgroup | — | True issue — subgroup identification (Issue 33) | HER2 IHC 3+ subgroup (ORR 16/30=53.3%) not extracted |
| 8377 | 48926 | incorrect_value | patient_number_efficacy | True issue — query/view | Disease-scoped IHC2+/ISH+ duplicate has N=0; non-scoped row has correct N=13 |
| 8378 | 48926 | incorrect_value | patient_number_efficacy | True issue — query/view | Disease-scoped IHC3+ duplicate has N=0; non-scoped row has correct N=40 |
| 8379 | 48926 | incorrect_value | patient_number_efficacy | True issue — query/view | Disease-scoped prior anti-HER2 duplicate has N=0; non-scoped row has correct N=16 |
| 8390 | 49899 | incorrect_value | patient_number_efficacy | True issue — extraction | N=40 (overall) for ≥2.4 mg/kg subgroup; should be 34 per abstract |
| 8391 | 49899 | incorrect_value | efficacy.orr.patient_count | True issue — extraction | ORR denominator=40 should be 34 |
| 8392 | 49899 | incorrect_value | efficacy.corr.patient_count | True issue — extraction | cORR denominator=40 should be 34 |
| 8393 | 49900 | incorrect_value | patient_number_safety | True issue — extraction | Safety N=29 for 2.4 mg/kg arm; abstract says 31 treated |
| 8351 | 100 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 2.70 mo mapped to PFS |
| 8394 | 51436 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 1.5 on ≥6 mg/kg efficacy subgroup |
| 8396 | 52543 | incorrect_value | efficacy.orr.patient_count | False positive — audit LLM | patient_count=3 is denominator (correct); audit confused numerator/denominator |
| 8384 | 67379 | incorrect_value | patient_number_efficacy | True issue — extraction (Issue 8) | Zero-sentinel: N=0 for hTMB/MSS; PFS+HR reported |
| 8385 | 67379 | incorrect_value | efficacy.pfs.patient_count | True issue — extraction (Issue 8) | Zero-sentinel: PFS patient_count=0 for same |
| 8400 | 70960 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 3.2 on RP2D (6.4) subgroup |
| 8401 | 70960 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Escalation max 8.0 on RP2D (6.4) subgroup |
| 8406 | 73299 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 1.8 mo mapped to PFS for CRC cohort |
| 8410 | 75999 | spurious_row | — | True issue — query scoping | NPC subgroup in CRC-scoped report (basket trial leak) |
| 8415 | 114571 | incorrect_value | efficacy.os.value | True issue — extraction | OS=“Not Reached” but abstract says “not yet mature” → should be null |
| 8358 | 116843 | incorrect_value | rp2d | True issue — view (Issue 31) | SOC arm has Temab-A RP2D (dose cross-contamination) |
| 8417 | 152942 | spurious_row | — | True issue — query scoping | PDA subgroup in CRC-scoped report (basket trial leak) |
| 8418 | 162304 | incorrect_value | efficacy.orr.value | True issue — extraction | ORR=35% is “any tumor reduction” rate; actual ORR≈1.5% (1/66 PR) |
| 8359 | 235204 | incorrect_value | patient_number_efficacy | True issue — extraction | N=23 is PFS event count, not patient count; should be 31 |
| 8416 | 238377 | incorrect_value | efficacy.dor.value | True issue — extraction | DoR=11.03mo from “>48 weeks” (lower bound, not median) |
| 8395 | 240052 | incorrect_value | dose_min | True issue — extraction (Issue 29) | Escalation min 1.5 on ≥6 mg/kg efficacy subgroup |
| 8357 | 29700 | missing_endpoint | efficacy.dor.value | True issue — extraction | DoR=5.5 mo in abstract for 3.0 mg/kg but not extracted |
| 8367 | 29735 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 5.1 mo mapped to PFS for CRC |
| 8372 | 29738 | incorrect_value | efficacy.pfs.value | True issue — extraction (Issue 32) | TTP 18 wks → 4.14 mo converted and mapped to PFS |
| 8376 | 48903 | incorrect_value | dose_max | True issue — extraction (Issue 29) | Part 1 max (8.0) on Part 2 expansion row (5.4/6.4) |
31. Investigational drug dose data bleeds onto control/comparator arms
Short summary
When publication_interventions.study_plan_arm_id is NULL (the common case for publication-extracted drugs via Source 0), the drug_interventions CTE in vw_publication_efficacy_data joins the investigational drug to ALL arms — including control/comparator arms. The pub_dose_lookup COALESCE fallback then propagates the investigational drug’s dose fields (dose_min, dose_max, rp2d, dose_units, dose_frequency) onto control arm rows that have no subgroup-level dose override. This makes it appear that the comparator arm received the investigational drug’s dosing.
Where this sits in the current pipeline
`db/views/vw_publication_efficacy_data_v18.sql`:
- `drug_interventions` CTE (Source 0): Joins `publication_interventions` to arms. When both `clinical_trial_id` and `study_plan_arm_id` are NULL, the drug matches all arms via the `OR di.study_plan_arm_id IS NULL` fallback.
- `pub_dose_lookup` CTE: Pulls dose_evidence from `publication_interventions`. Joined to `raw_rows` via a `publication_intervention_id` match from `drug_interventions`.
- `raw_rows` COALESCE chain (lines 449–469): Falls through subgroup-level dose → pub-level dose. No arm_type guard prevents control arms from inheriting the investigational drug’s dose.
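The arm-matching fallback can be sketched as a predicate (a minimal Python sketch with hypothetical dict shapes, not the actual view SQL): when `study_plan_arm_id` is NULL, the drug matches every arm of the publication, control arms included.

```python
# Hypothetical sketch of the Source 0 arm-matching predicate. With
# study_plan_arm_id NULL, the OR fallback makes the drug match all arms.

def drug_matches_arm(di, arm):
    """Mimics: di.study_plan_arm_id = arm.id OR di.study_plan_arm_id IS NULL."""
    return di.get("study_plan_arm_id") == arm["id"] or di.get("study_plan_arm_id") is None

di = {"study_plan_arm_id": None}  # common case for publication-extracted drugs
arms = [{"id": 1, "arm_type": "experimental"}, {"id": 2, "arm_type": "control"}]
assert all(drug_matches_arm(di, a) for a in arms)  # matches the control arm too
```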
Exact restriction causing the drop
In `raw_rows`, the dose COALESCE chain:

```sql
COALESCE(tlm.subgroup_dose_min, ..., pdl.pub_dose_min) AS dose_min,
COALESCE(tlm.subgroup_dose_max, ..., pdl.pub_dose_max) AS dose_max,
COALESCE(tlm.subgroup_rp2d, pdl.pub_rp2d) AS rp2d,
```

has no guard on `aoe.arm_type` or `aoe.resolved_group_type`. When a control arm’s subgroup has no dose fields, the COALESCE falls through to `pub_dose_lookup`, which contains the investigational drug’s dose evidence.
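The fallthrough can be illustrated with a small sketch (hypothetical field names mirroring the view columns, not pipeline code): with no subgroup-level dose, a control arm row lands on the publication-level dose of the investigational drug.

```python
# Minimal sketch of the three-tier dose fallback in raw_rows: subgroup dose,
# derived subgroup dose string, then the publication-level dose from
# pub_dose_lookup. Nothing checks arm_type, so a control arm with no
# subgroup-level dose inherits the investigational drug's dose.

def resolve_dose_min(row):
    """Mimics COALESCE(tlm.subgroup_dose_min, derived, pdl.pub_dose_min)."""
    if row.get("subgroup_dose_min") is not None:
        return row["subgroup_dose_min"]            # tier 1: subgroup override
    if row.get("subgroup_dose_value") is not None:
        units = row.get("subgroup_dose_units") or ""
        return f'{row["subgroup_dose_value"]} {units}'.strip()  # tier 2: derived
    return row.get("pub_dose_min")                 # tier 3: pub-level fallback

# Control arm (SOC, pub 241259): no subgroup dose, so tier 3 leaks Temab-A's dose.
control = {"arm_type": "control", "pub_dose_min": "1.6 mg/kg"}
assert resolve_dose_min(control) == "1.6 mg/kg"    # the bug: SOC shows an ADC dose
```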
Concrete examples
Pub 241259 (Temab-A exposure-response in mCRC):
- SOC arm = trifluridine/tipiracil+BEV (N=20)
- View shows: dose_min=1.6 mg/kg, dose_max=2.4 mg/kg, rp2d=2.4 mg/kg Q3W, dose_units=mg/kg, dose_frequency=Q3W
- These are Temab-A doses from `publication_interventions` id=51068 (study_plan_arm_id=NULL)
- Abstract explicitly states SOC is “trifluridine/tipiracil+BEV” — no Temab-A dosing
Pub 241978 (Enfortumab vedotin):
- “No upfront dose reduction” control arm shows dose_min=0.75 mg/kg, dose_max=1.25 mg/kg
Downstream impact
- Clinical Evidence report: Control arms display investigational drug dose fields, misleading reviewers into thinking comparator arms received the ADC
- Audit findings: Audit LLM correctly flags these as incorrect (5 of 7 issues on pub 241259 are this pattern)
- Data quality: Dose fields on control arms are nonsensical — they describe a drug the arm didn’t receive
- 2,890 view rows across 566 publications have dose data from pub_dose_lookup on control/comparator arms
- 1,197 additional control rows have subgroup-level dose (potentially legitimate for dose-comparison arms)
- Within ADC technology scope: 14 rows across 5 publications (smaller because most ADC trials are single-arm)
What the issue is not
- Drug NAME attribution to control arms is intentional — the report needs to show what drug the control is being compared against
- Subgroup-level dose on control arms may be correct (e.g., dose-comparison trials where the control is a different dose of the same drug)
- This does NOT affect experimental/investigational arm rows
Explored solution direction
Forward fix — view v19: Add an arm_type guard to the `pub_dose_lookup` COALESCE in `raw_rows`. When `aoe.arm_type = 'control'` (or `aoe.resolved_group_type = 'ACTIVE_COMPARATOR'`), skip the `pub_dose_lookup` fallback:

```sql
COALESCE(
  tlm.subgroup_dose_min,
  CASE WHEN tlm.subgroup_dose_value IS NOT NULL
       THEN tlm.subgroup_dose_value || ' ' || COALESCE(tlm.subgroup_dose_units, '')
  END,
  CASE WHEN aoe.arm_type != 'control' THEN pdl.pub_dose_min END
) AS dose_min,
```

Apply the same pattern to dose_max, rp2d, dose_units, dose_frequency, and single_dose. This preserves subgroup-level dose (tier 1) for all arms but blocks the publication-level fallback (tier 3) for control arms only.
No backfill needed — rematerializing the view after deploying v19 will fix all affected rows.
Related to Issue 20: The v16 Source 0 fix (using publication_interventions as primary drug source) introduced this side effect by broadening the drug_interventions join. The drug join itself is correct; only the dose COALESCE fallback needs the arm_type guard.
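The proposed v19 guard behaves as follows (a sketch under the same hypothetical field names, not the deployed SQL): subgroup-level dose still wins for every arm, but the publication-level fallback is suppressed on control arms.

```python
# Sketch of the v19 guard: tiers 1-2 (subgroup dose) apply to every arm;
# tier 3 (pub-level dose from pub_dose_lookup) is skipped for control arms.

def resolve_dose_min_v19(row):
    if row.get("subgroup_dose_min") is not None:
        return row["subgroup_dose_min"]
    if row.get("subgroup_dose_value") is not None:
        units = row.get("subgroup_dose_units") or ""
        return f'{row["subgroup_dose_value"]} {units}'.strip()
    if row.get("arm_type") != "control":   # CASE WHEN aoe.arm_type != 'control'
        return row.get("pub_dose_min")
    return None                            # control arm: no pub-level fallback

control = {"arm_type": "control", "pub_dose_min": "1.6 mg/kg"}
experimental = {"arm_type": "experimental", "pub_dose_min": "1.6 mg/kg"}
dose_comparison = {"arm_type": "control", "subgroup_dose_min": "5.4 mg/kg"}

assert resolve_dose_min_v19(control) is None                   # leak blocked
assert resolve_dose_min_v19(experimental) == "1.6 mg/kg"       # unaffected
assert resolve_dose_min_v19(dose_comparison) == "5.4 mg/kg"    # tier 1 preserved
```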
Solution applied
(empty — pending implementation)
32. TTP (time to progression) misclassified as PFS
Short summary
The LLM extraction pipeline (classify_publications) maps TTP (time to progression) values to PFS (progression-free survival) when the abstract reports TTP but not PFS. These are distinct endpoints — TTP censors deaths while PFS counts them as events. Additionally, in some cases (e.g., pub 29737), TTP values reported for a best-response subpopulation (e.g., SD patients only) are attributed to the entire cohort.
Where this sits in the current pipeline
- `app/tasks/publications_llm_classification/subgroup_extraction.rb`: Identifies endpoints from the abstract. May correctly identify TTP, but it gets mapped to PFS downstream.
- `app/tasks/publications_llm_classification/task.rb`: Extracts endpoint values. The LLM treats TTP as PFS when extracting, or the endpoint mapping normalizes TTP→PFS.
- Endpoint normalization: If TTP is not in the standard endpoint list, the LLM may substitute the closest recognized endpoint (PFS).
Exact restriction causing the drop
The classify_publications prompt and/or endpoint schema does not distinguish TTP from PFS. When an abstract reports “median TTP = X months”, the LLM maps this to the PFS endpoint because TTP is not available as a separate extraction target. The LLM lacks instruction to leave PFS null when only TTP is reported.
Concrete examples
Pub 29737 (IMMU-132 in GI cancers):
- Abstract: “time to progression (TTP) … median of 4.8+ mo for the SD pts”
- Extracted: PFS=4.8 months, patient_count=29 (entire CRC cohort)
- Correct: TTP=4.8+ months, applicable to 14 SD patients only — PFS should be null
- Two compounding errors: (1) TTP→PFS confusion, (2) SD-subpopulation value → full cohort
Pub 29737 KRAS-mutated subgroup:
- Abstract: “median TTP = 4.4+ mo” for 7 SD patients
- Extracted: PFS=4.4 months, patient_count=13 (all KRAS-mutated)
- Correct: TTP=4.4+ months for 7 SD patients — PFS should be null
Downstream impact
- Clinical Evidence report: PFS column shows TTP values, overstating the evidence (PFS is a stronger endpoint than TTP)
- Cross-study comparisons: TTP values mixed with genuine PFS values make comparisons unreliable
- Patient counts: When TTP is reported only for responders/SD patients, attributing it to the full cohort inflates the denominator
- 149 publications mention TTP (but not PFS) in their abstract yet have PFS as an extracted endpoint
- 1,150 publications have TTP correctly extracted as TTP (suggesting the pipeline CAN handle TTP in many cases)
- The SD-subpopulation misattribution is harder to quantify systematically but likely affects a subset of phase I/II publications reporting outcomes by best response category
Explored solution direction
- Extraction prompt fix (forward): Add explicit instruction to `classify_publications`: “TTP (time to progression) and PFS (progression-free survival) are distinct endpoints. If the abstract reports TTP but not PFS, extract TTP only — do NOT map TTP values to PFS. Leave PFS null when only TTP is reported.”
- Subpopulation guard: Add instruction: “When a time-based endpoint (TTP, PFS, DoR) is reported only for a best-response subgroup (e.g., ‘median TTP for SD patients’), do not attribute it to the parent population. Extract it under the response-specific subgroup or leave the parent’s value null.”
- Backfill: Re-extract PFS values for the 149 affected publications with updated prompt. Scope: publications where abstract contains TTP/time to progression but NOT PFS/progression-free survival, and a PFS endpoint was extracted.
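The backfill scoping condition can be sketched as a text check (a hypothetical helper for illustration; the real scoping would run as SQL against the publications table):

```python
import re

# Flag a publication for re-extraction when its abstract mentions TTP but not
# PFS, yet a PFS endpoint was extracted (the Issue 32 pattern).

TTP_RE = re.compile(r"\bTTP\b|time to progression", re.IGNORECASE)
PFS_RE = re.compile(r"\bPFS\b|progression[- ]free survival", re.IGNORECASE)

def needs_pfs_reextraction(abstract: str, extracted_endpoints: set) -> bool:
    return (bool(TTP_RE.search(abstract))
            and not PFS_RE.search(abstract)
            and "PFS" in extracted_endpoints)

# Pub 29737 pattern: abstract reports TTP only, but a PFS value was extracted.
abstract = "time to progression (TTP) ... median of 4.8+ mo for the SD pts"
assert needs_pfs_reextraction(abstract, {"PFS", "ORR"}) is True
assert needs_pfs_reextraction("median PFS was 6.1 mo", {"PFS"}) is False
```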
Solution applied
(empty — pending implementation)
33. Cross-tabulated subgroups not identified in basket trials
Short summary
When basket trial abstracts report efficacy in a table structured as tumor type × biomarker status (e.g., CRC × HER2 IHC 3+/2+/1+), extract_subgroups identifies the single-dimension subgroups (tumor types and biomarker statuses separately) but not the cross-product subgroups (CRC IHC 3+, CRC IHC 2+, etc.). This means disease-specific biomarker-stratified efficacy data is lost — only the overall tumor-type and overall biomarker-status rows are extracted.
Where this sits in the current pipeline
- `app/tasks/publications_llm_classification/subgroup_extraction.rb`: Identifies subgroups and their endpoint associations from the abstract. The LLM prompt identifies subgroups as a flat list, and the hierarchical naming convention (e.g., “Non-breast STs → CRC”) captures one level of nesting but not cross-dimensional nesting.
Exact restriction causing the drop
The subgroup extraction prompt produces subgroups along each dimension independently:
- By tumor type: BTC, UC, GC/GEJA, CRC
- By biomarker: HER2 IHC3+, IHC2+, IHC1+
But it does not produce the cross-product: CRC IHC3+, CRC IHC2+, etc. The table data in the abstract contains these values, but the extraction doesn’t recognize the need to create nested subgroups for each cell in a tumor type × biomarker matrix.
Concrete examples
Pub 72043 (SHR-A1811 in non-breast solid tumors):
- Abstract table reports ORR for each tumor type × HER2 IHC status combination
- Extracted subgroups: CRC (36.4%), IHC3+ (54.1%), IHC2+ (41.7%), IHC1+ (50.0%)
- Missing: CRC IHC3+ (100%, 3/3), CRC IHC2+ (0%, 0/3), CRC IHC1+ (0%, 0/1), CRC HER2 mut/amp (0%, 0/3)
- 4 audit issues (8402-8405) all flagging missing cross-tabulated CRC subgroups
Downstream impact
- Clinical Evidence report: Disease-specific biomarker-stratified efficacy data missing — can only show overall CRC ORR, not CRC by HER2 status
- Granularity loss: The most clinically relevant data in basket trials is often the cross-tabulation (e.g., “does HER2 IHC 3+ predict response in CRC specifically?”)
- ~366 publications have both disease-type and biomarker-type subgroups with common biomarkers (HER2, EGFR, KRAS, BRAF, PD-L1, MSI, MMR)
- Not all 366 will have cross-tabulated data in the abstract — many will have separate analyses rather than a matrix table
- The issue primarily affects basket/platform trials reporting across multiple tumor types with biomarker stratification
What the issue is not
- This is NOT about missing biomarker context on existing subgroups (that’s Issue 19)
- This is NOT about dropped subgroups at the classify step (Issue 10) — the cross-product subgroups are never identified in the first place
- Parent-level tumor type and biomarker subgroups ARE correctly extracted
Explored solution direction
- Extraction prompt enhancement: Update the `extract_subgroups` prompt to recognize tabular cross-tabulation patterns: “When the abstract contains a table or matrix reporting efficacy by tumor type × biomarker status, create cross-product subgroups (e.g., ‘CRC → HER2 IHC 3+’) for each cell with reported data, in addition to the single-dimension subgroups.”
- Scope: Focus on publications with ≥2 disease subgroups AND ≥1 biomarker subgroup, and re-run extraction with the enhanced prompt.
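The post-extraction cross-product idea can be sketched as follows (hypothetical data shapes; the `reported_cells` filter stands in for checking which matrix cells the abstract table actually populates):

```python
from itertools import product

# Given single-dimension subgroups along tumor type and biomarker status, emit
# a nested cross-product subgroup for each cell that has reported data.

def cross_product_subgroups(tumor_types, biomarkers, reported_cells):
    return [f"{t} → {b}" for t, b in product(tumor_types, biomarkers)
            if (t, b) in reported_cells]

tumor_types = ["BTC", "UC", "GC/GEJA", "CRC"]
biomarkers = ["HER2 IHC3+", "HER2 IHC2+", "HER2 IHC1+"]
# Cells with data in the pub 72043 abstract table (illustrative subset):
reported = {("CRC", "HER2 IHC3+"), ("CRC", "HER2 IHC2+"), ("CRC", "HER2 IHC1+")}

assert cross_product_subgroups(tumor_types, biomarkers, reported) == [
    "CRC → HER2 IHC3+", "CRC → HER2 IHC2+", "CRC → HER2 IHC1+"]
```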
Solution applied
(empty — pending implementation)
34. “Immature” endpoints extracted as “Not Reached”
Short summary
When an abstract states that an endpoint (OS, PFS, DoR) is “not yet mature”, “data immature”, or “results are immature”, the LLM extraction maps this to “Not Reached”. These are clinically distinct concepts: “Not Reached” means the Kaplan-Meier curve hasn’t crossed the 50% mark (a real finding indicating the median exceeds current follow-up), while “immature” means insufficient events or follow-up to perform the analysis (no median can be estimated — value should be null).
Where this sits in the current pipeline
- `app/tasks/publications_llm_classification/task.rb`: The `classify_publications` prompt doesn’t distinguish between “Not Reached” and “immature/not yet mature”. The LLM treats both as equivalent and extracts “Not Reached” for either.
Exact restriction causing the drop
The extraction prompt has no instruction to differentiate “Not Reached” (endpoint was analyzed, median exceeds follow-up) from “immature” (endpoint was NOT formally analyzed, insufficient data). Both get mapped to the string “Not Reached”.
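The distinction the prompt needs can be sketched with a small classifier (a hypothetical helper for illustration, not pipeline code):

```python
import re

# "Not Reached" only when the abstract says the median was not reached;
# null when the endpoint is merely described as immature.

IMMATURE_RE = re.compile(r"not yet mature|\bimmature\b", re.IGNORECASE)
NOT_REACHED_RE = re.compile(r"not (?:been )?reached", re.IGNORECASE)

def endpoint_value(abstract_phrase: str):
    if NOT_REACHED_RE.search(abstract_phrase):
        return "Not Reached"   # median analyzed, exceeds follow-up
    if IMMATURE_RE.search(abstract_phrase):
        return None            # no median estimable: leave null
    return "numeric"           # placeholder for a reported median value

assert endpoint_value("The median overall survival (OS) was not yet mature") is None
assert endpoint_value("median OS was not reached") == "Not Reached"
# Both terms used: "not reached" wins, matching the tracker's guidance.
assert endpoint_value("OS data are immature; median was not reached") == "Not Reached"
```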
Concrete examples
Pub 114571 (JSKN003 in HER2+ mCRC):
- Abstract: “The median overall survival (OS) was not yet mature”
- Extracted: OS = “Not Reached”
- Correct: OS should be null — data immature, no median estimated
Pub 115389 (from job 1594):
- Abstract: PFS described as “immature”
- Extracted: PFS = “Not Reached”
- Correct: PFS should be null
Downstream impact
- Clinical Evidence report: “Not Reached” implies a favorable outcome (median exceeds follow-up), while “immature” is neutral (no data yet). Reporting “Not Reached” when the data is simply immature overstates the evidence.
- Cross-study comparisons: “Not Reached” OS is treated as a positive signal, biasing comparisons against studies that honestly report immature data.
- ~71 publications have “immature”/“not yet mature” language in the abstract (without “not reached”) but have “Not Reached” extracted for OS, PFS, or DoR
- Breakdown: OS (~214 total “Not Reached” pubs with immature language, ~71 without “not reached” in abstract), PFS (~107), DoR (~68)
- Many abstracts legitimately say BOTH “immature” and “not reached” — these are correct and not affected
What the issue is not
- Abstracts that say “median OS was not reached” — these ARE correct as “Not Reached”
- Abstracts that say “OS data are immature; median was not reached” — also correct (both terms used)
- Only affects abstracts where “immature” is used WITHOUT “not reached” for the same endpoint
Explored solution direction
- Extraction prompt fix (forward): Add instruction to `classify_publications`: “Distinguish between ‘Not Reached’ (endpoint was analyzed but median exceeds follow-up — extract as ‘Not Reached’) and ‘immature/not yet mature’ (insufficient data to analyze the endpoint — extract as null/omit). Only use ‘Not Reached’ when the abstract explicitly states the median was not reached.”
- Backfill: Re-extract OS/PFS/DoR for the ~71 affected publications. Scope query:

```sql
SELECT DISTINCT v.publication_id
FROM vw_publication_efficacy_data v
JOIN publications p ON p.id = v.publication_id
WHERE v.measure_value = 'Not Reached'
  AND v.endpoint_abbreviation IN ('OS', 'PFS', 'DOR')
  AND (p.abstract ILIKE '%not yet mature%'
       OR p.abstract ILIKE '%data immature%'
       OR p.abstract ILIKE '%data are immature%'
       OR p.abstract ILIKE '%results are immature%')
  AND p.abstract NOT ILIKE '%not reached%'
  AND p.abstract NOT ILIKE '%not been reached%'
```
Solution applied
(empty — pending implementation)