
Publication Data Model

How publication efficacy data is stored, from abstract text through LLM extraction to the clinical evidence report.

Last updated: 2026-04-03

Publication (root)
├── trial_arms                      Treatment groups: each arm is a group of patients who
│   │                               received the same treatment (e.g., "8.0 mg/kg",
│   │                               "Pembrolizumab + Chemo"). Includes an "All Arms" entry
│   │                               for pooled results. Created by extract_interventions.
│   │
│   └── trial_arm_interventions     What drugs/interventions were given in this arm, at what
│                                   dose. Each intervention has its own drug_id, dose fields,
│                                   and intervention_role (investigational, combination,
│                                   comparator, supportive).
├── trial_disease_details           What diseases this publication studies: disease name,
│   │                               stage, subtype, risk, treatment setting. Linked to the
│   │                               diseases table via disease_id.
│   │
│   └── trial_disease_biomarkers    Biomarkers associated with the disease context (e.g.,
│                                   EGFR mutation for NSCLC). Linked to biomarkers table.
├── trial_endpoints                 What endpoints were measured (ORR, PFS, OS, DOR...).
│                                   Definitions only; no values here.
├── trial_subgroups                 Who was studied: patient populations (disease,
│   │                               biomarker, dose cohort, demographics).
│   │
│   ├── trial_subgroup_biomarkers   Structured biomarker details for biomarker-tagged
│   │                               subgroups (name, value, numeric threshold).
│   │
│   └── trial_outcome_measures      What was measured for this subgroup: the intersection
│       │                           of a subgroup × endpoint. Defines WHAT we're looking at
│       │                           (e.g., "ORR for Squamous, confirmed, percentage, primary").
│       │                           Holds metadata: outcome_type, measure_unit, confirmed,
│       │                           time_point. No actual result values.
│       │
│       └── trial_arm_outcomes      The actual numbers: per-arm results within an outcome
│                                   measure. Each row is one arm's result (e.g., "8.0 mg/kg
│                                   arm: N=32, ORR=15.6%"). Linked to trial_arms via
│                                   trial_arm_id FK. Holds measure_value, N, p_value,
│                                   hazard_ratio, odds_ratio.
├── adverse_events                  Safety endpoints (neutropenia, ILD, etc.).
│   │                               Holds grade_category, standardized_name.
│   │
│   └── trial_arm_outcomes          Per-arm safety results (same table as efficacy arm
│                                   outcomes, linked via adverse_event_id instead of
│                                   trial_outcome_measure_id).
└── publication_interventions       LEGACY (News only). Study-level drug records. Still used
                                    by NewsTrialMention. No longer created for Publications;
                                    replaced by trial_arms + trial_arm_interventions.

Several tables use polymorphic source_type + source_id columns instead of a direct publication_id foreign key. This allows the same table to store data sourced from publications OR clinical trial registries:

| Table | Polymorphic columns | Typical source_type |
| --- | --- | --- |
| trial_subgroups | source_type, source_id | 'Publication' |
| trial_endpoints | source_type, source_id | 'Publication' |
| trial_outcome_measures | source_type, source_id | 'Publication' |
| trial_arms | source_type, source_id | 'Publication' |
| trial_disease_details | source_type, source_id | 'Publication' |
| adverse_events | source_type, source_id | 'Publication' |
| publication_interventions | source_type, source_id | 'Publication' or 'NewsTrialMention' |

To query all subgroups for a publication: WHERE source_type = 'Publication' AND source_id = <pub_id>.

trial_arm_outcomes is NOT polymorphic — it belongs directly to a trial_outcome_measure (via trial_outcome_measure_id) or an adverse_event (via adverse_event_id).
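The polymorphic lookup can be illustrated with a minimal, self-contained Ruby sketch. It uses in-memory hashes rather than the real models, and the `'ClinicalTrial'` source_type value and the `rows_for_source` helper are assumptions made for illustration only.

```ruby
# Hypothetical in-memory rows standing in for trial_subgroups records.
# 'ClinicalTrial' is an assumed source_type value for registry-sourced rows.
SUBGROUP_ROWS = [
  { id: 1, source_type: "Publication",   source_id: 190656, subgroup_value: "Squamous" },
  { id: 2, source_type: "Publication",   source_id: 190656, subgroup_value: "Adenocarcinoma" },
  { id: 3, source_type: "ClinicalTrial", source_id: 190656, subgroup_value: "Overall" }
].freeze

# Equivalent of: WHERE source_type = ? AND source_id = ?
# Both columns must match, because source_id alone is not unique across sources.
def rows_for_source(rows, source_type, source_id)
  rows.select { |row| row[:source_type] == source_type && row[:source_id] == source_id }
end

publication_subgroups = rows_for_source(SUBGROUP_ROWS, "Publication", 190656)
puts publication_subgroups.map { |row| row[:subgroup_value] }.inspect
```

Filtering on source_id alone would also return row 3, which belongs to a different kind of source that happens to share the same numeric ID.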

Disease and biomarker information is stored at two different levels, serving different purposes:

Path 1: Publication-level — “What does this study cover?”

Publication → trial_disease_details → diseases
                                    → trial_disease_biomarkers → biomarkers

Populated by the extract_diseases pipeline step. Describes the study’s overall disease context (e.g., “EGFR-mutant advanced NSCLC”). Used for filtering and categorization.

Path 2: Subgroup-level — “What population does this specific result apply to?”

Publication → trial_subgroups → diseases (via disease_id)
                              → trial_subgroup_biomarkers → biomarkers

Populated by classify_publications + post_process. These are the data-carrying entities that link down to outcome measures and arm results.

The two levels can diverge. A study might cover “advanced NSCLC” at the disease detail level but report results for “Squamous” and “Adenocarcinoma → AGA-negative” subgroups. A basket trial studying “solid tumors” might have disease-specific subgroups for NSCLC, CRC, and breast cancer.


The Publication record is the root entity. Key fields for the data model:

| Column | Purpose |
| --- | --- |
| abstract | Source text; ground truth for all extraction |
| llm_data (JSONB) | Raw LLM extraction output before materialization |
| llm_data_processed (bool) | Whether post_process has materialized llm_data into child tables |
| result (bool) | Whether this publication reports clinical results |
| clinical_trial_id | FK to linked clinical trial (nullable) |
| total_number_of_participants | Denormalized from llm_data |
| trial_outcome | positive / negative / unclear |

A trial_subgroup is a patient population or cohort within a publication. Subgroups can represent different things depending on subgroup_type:

| subgroup_type | Example subgroup_value | What it means |
| --- | --- | --- |
| disease | NSCLC → Squamous cell carcinoma | Histology/disease subpopulation |
| biomarker | PD-L1 TPS ≥50% | Biomarker-selected subgroup |
| dose | Dose 100-300mg | Dose-defined cohort |
| overall | Overall | Full study population |

Key columns:

| Column | Purpose |
| --- | --- |
| subgroup_value | Human-readable label (hierarchical with → as the separator) |
| subgroup_type | Semantic category |
| number_of_participants | N for this subgroup (from abstract) |
| population_role | Denominator semantics: overall, partition, selected_subset, etc. |
| tags (JSONB) | Semantic dimension tags |
| dose_value, dose_min, dose_max | Dose fields; only populated when the subgroup itself IS a dose cohort |
| dose_units, dose_frequency, rp2d | Additional dose context |
| data_cutoff_date | When results were cut off |
| treatment_lines (JSONB) | Prior therapy context |
| min_prior_lines, max_prior_lines | Sanitized treatment line counts |
| disease_id | FK to matched disease entity |

trial_endpoints holds the endpoint definitions extracted from the publication:

| Column | Purpose |
| --- | --- |
| endpoint_name | Full name (e.g., “Overall Survival”) |
| abbreviation | Short form (e.g., “OS”) |
| endpoint_id | FK to master endpoints table |

A trial_outcome_measure is the intersection of a subgroup and an endpoint, for example “ORR for the Squamous subgroup”:

| Column | Purpose |
| --- | --- |
| trial_subgroup_id | FK to trial_subgroups |
| trial_endpoint_id | FK to trial_endpoints |
| outcome_type | primary / secondary / exploratory |
| measure_unit | percentage, months, count |
| confirmed (bool, nullable) | Confirmed vs unconfirmed response (e.g., cORR vs ORR) |
| p_value, hazard_ratio, odds_ratio | Outcome-level statistics |
| time_point | For landmark analyses (e.g., “12 months”) |

trial_arm_outcomes holds per-arm results within an outcome measure, for example “ORR for Squamous in the 8.0 mg/kg arm”:

| Column | Purpose |
| --- | --- |
| trial_outcome_measure_id | FK to trial_outcome_measures |
| arm_name | Arm label (e.g., “8.0 mg/kg”, “Pembrolizumab + Chemo”) |
| arm_type | investigational, control, active_comparator, placebo_comparator |
| number_of_participants | N for this arm |
| measure_value | The result value (e.g., “33.3”, “Not Reached”) |
| study_plan_arm_id | FK to registry study_plan_arms (nullable) |
| p_value, hazard_ratio, odds_ratio | Arm-level statistics |

No dose columns. Arm-specific dose is only captured in the arm_name string. See Dose Data below.


Each trial_subgroup has two classification fields set during LLM extraction: tags and population_role.

A subgroup can have multiple tags. For example, “EGFR-mutant NSCLC” would get ["overall", "biomarker", "disease"].

| Tag | Description | Count |
| --- | --- | --- |
| disease | Disease type, histology, subtype (NSCLC, AML, DLBCL) | 63,553 |
| population | Specific analysis populations (per-protocol, safety-evaluable, responders); NOT the unsliced overall | 62,262 |
| biomarker | Mutations, expression markers, molecular subtypes (EGFR, PD-L1, HER2, TMB) | 44,559 |
| overall | Top-line study population. A disease-specific cohort can still be “overall” when it is the single top-line cohort being reported | 33,529 |
| treatment_arm | Treatment arms, regimen groupings | 23,076 |
| dose | Dose levels, cohorts, schedules | 17,168 |
| prior_therapy | Specific prior treatments (prior platinum, prior IO) | 15,436 |
| stage | Disease stage (early, advanced, metastatic) | 14,755 |
| other | Only if no other tag fits | 13,378 |
| risk_group | Cytogenetic risk, IMDC risk, prognostic groups | 6,947 |
| line_of_therapy | Treatment line (1L, 2L+, treatment-naive) | 6,443 |
| age | Age demographic splits | 5,913 |
| response_status | Subgroups defined by achieved response (responders, CR, PR, SD, PD, pCR, MRD-negative) | 2,396 |
| gender | Sex/gender splits | 1,824 |
| geography | Region/country splits | 1,604 |
| race_ethnicity | Race/ethnicity demographic splits | 1,154 |
| performance_status | ECOG PS, KPS | 1,059 |

Defined in TrialSubgroup::SUBGROUP_TAGS (app/models/trial_subgroup.rb).

population_role clarifies what the subgroup’s N represents as a denominator:

| Role | Description |
| --- | --- |
| overall | The unsliced top-line population for the full reported cohort |
| analysis_population | ITT, mITT, safety, evaluable, assessable, tested, treated populations |
| partition | An ordinary subgroup bucket: dose cohort, treatment arm, age band, sex split, stage bucket. Disease is “partition” only when it is one bucket among multiple disease cohorts side by side |
| selected_subset | A filtered subset defined by a qualifying condition (biomarker-positive, prior-therapy-exposed, condition-present) |
| response_subset | A subgroup defined by achieved response status |

Defined in TrialSubgroup::SUBGROUP_POPULATION_ROLES (app/models/trial_subgroup.rb).
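As a rough illustration of how the two classification fields constrain a subgroup record, the sketch below checks a tag list and a population role against local copies of the values listed in the tables above. The constants `ALLOWED_TAGS` and `ALLOWED_ROLES` and the `classification_errors` helper are hypothetical; the authoritative lists are the ones in `TrialSubgroup`, and the actual model validations may differ.

```ruby
# Local copies for illustration only, assuming the tables above are complete.
# The authoritative lists live in TrialSubgroup::SUBGROUP_TAGS and
# TrialSubgroup::SUBGROUP_POPULATION_ROLES.
ALLOWED_TAGS = %w[
  disease population biomarker overall treatment_arm dose prior_therapy stage
  other risk_group line_of_therapy age response_status gender geography
  race_ethnicity performance_status
].freeze

ALLOWED_ROLES = %w[
  overall analysis_population partition selected_subset response_subset
].freeze

# Returns a list of human-readable problems; an empty list means the pair is acceptable.
def classification_errors(tags:, population_role:)
  errors = []
  unknown = Array(tags) - ALLOWED_TAGS
  errors << "unknown tags: #{unknown.join(', ')}" unless unknown.empty?
  unless ALLOWED_ROLES.include?(population_role)
    errors << "unknown population_role: #{population_role.inspect}"
  end
  errors
end

# The tag list is the document's own example; the role is an illustrative choice.
puts classification_errors(tags: %w[overall biomarker disease], population_role: "overall").inspect
puts classification_errors(tags: %w[histology], population_role: "subset").inspect
```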

trial_disease_details holds the disease context for the publication: what disease(s) were studied and their clinical characteristics.

| Column | Purpose |
| --- | --- |
| disease_name | Extracted disease name |
| disease_id | FK to matched diseases entity |
| subtypes (JSONB) | Disease subtypes (e.g., adenocarcinoma, squamous) |
| stages (JSONB) | Disease stages (e.g., advanced, metastatic, stage III) |
| extents (JSONB) | Disease extent descriptors |
| statuses (JSONB) | Disease status (e.g., relapsed, refractory) |
| risks (JSONB) | Risk classifications (e.g., high-risk cytogenetics) |
| treatment_settings (JSONB) | Treatment setting context |
| number_of_prior_treatment_lines | Prior therapy line count |

trial_disease_biomarkers holds biomarkers associated with a disease context. Belongs to trial_disease_details.

| Column | Purpose |
| --- | --- |
| trial_disease_detail_id | FK to trial_disease_details |
| biomarker_id | FK to matched biomarkers entity (nullable) |
| biomarker_name | Extracted biomarker name (e.g., “EGFR”) |
| value | Biomarker status (e.g., “mutated”, “positive”) |
| numeric_value | Threshold if applicable (e.g., “50” for TPS ≥50%) |
| alternatives_names (JSONB) | Alternative names for matching |

publication_interventions (DEPRECATED for Publications — News only)


Drug/intervention records. No longer created for Publication sources — replaced by trial_arms + trial_arm_interventions. Still used by NewsTrialMention through the News pipeline.

| Column | Purpose |
| --- | --- |
| intervention_name | Drug name |
| drug_id | FK to matched drug entity |
| intervention_role | investigational, comparator, combination, supportive |
| intervention_type | drug, biological, procedure |
| dose | Free-text dose string from abstract |
| dose_evidence (JSONB) | Structured dose extraction (see below) |
| study_plan_arm_id | FK to study_plan_arms (always NULL in practice) |

Dose information exists at three levels, but there is a structural gap at the arm level.

Level 1: Study-level dose (publication_interventions)


extract_dose_evidence runs a separate LLM pass over each publication_intervention to populate the dose_evidence JSONB:

{
  "single_dose": "400 mg",
  "dose_min": "8.0 mg/kg",
  "dose_max": "10.0 mg/kg",
  "rp2d": "8.0 mg/kg",
  "dose_units": "mg/kg",
  "dose_frequency": "Q3W",
  "dose_context_type": "weight_based",
  "confidence": 0.95
}

There is one publication_intervention (PI) record per drug per publication, so this captures the study-level dose range. For a multi-dose-arm study like “8.0 mg/kg vs 10.0 mg/kg”, it records dose_min=8.0 and dose_max=10.0: the range, not per-arm values.
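A small sketch of how a consumer might tell a study-level range apart from a single dose when reading dose_evidence. The payload below is a hypothetical record for a two-dose-arm study, and `study_level_range?` is an illustrative helper, not part of the codebase.

```ruby
require "json"

# Hypothetical dose_evidence payload for a study with 8.0 and 10.0 mg/kg arms.
raw = <<~PAYLOAD
  {
    "single_dose": null,
    "dose_min": "8.0 mg/kg",
    "dose_max": "10.0 mg/kg",
    "dose_units": "mg/kg",
    "dose_frequency": "Q3W",
    "dose_context_type": "weight_based",
    "confidence": 0.95
  }
PAYLOAD

# True when the record describes a range: min and max are both present and differ.
def study_level_range?(evidence)
  min, max = evidence.values_at("dose_min", "dose_max")
  !min.nil? && !max.nil? && min != max
end

evidence = JSON.parse(raw)
puts "dose_evidence describes a range across arms" if study_level_range?(evidence)
```

A record for which this returns true cannot say which arm received which dose; it only bounds the doses used in the study.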

Level 2: Subgroup-level dose (trial_subgroups)


When a subgroup IS a dose cohort (e.g., “Dose 100-300mg”), the LLM extraction populates dose_min, dose_max, dose_value on the trial_subgroup record. ~1,200 subgroups out of ~200k have these fields set.

trial_arm_outcomes has NO dose columns. When a publication reports efficacy by dose arm (e.g., “8.0 mg/kg arm: ORR 15.6%” and “10.0 mg/kg arm: ORR 26.9%”), the dose is only captured in arm_name as an unstructured string.

The classify_publications LLM extraction already identifies each arm by dose name but the arm schema only has: name, arm_type, measure_value, number_of_participants. No dose fields.

How the view resolves dose (COALESCE chain)


The vw_publication_efficacy_data view uses a fallback chain to populate dose_min/dose_max/single_dose on each row:

1. trial_subgroups.dose_min/dose_max (subgroup is a dose cohort)
2. trial_subgroups.dose_value (single-dose subgroup, formatted with units)
3. publication_interventions.dose_evidence (study-level fallback, with guards)

The pub-level fallback is gated:

  • Skipped for control/comparator arms (Issue 31)
  • Skipped for escalation/range/rp2d context types (Issue 35)
  • single_dose only falls back when pub_dose_min = pub_dose_max (single dose study)

The problem: For multi-dose-arm studies where subgroups are disease-defined (not dose-defined), the COALESCE chain falls through to study-level dose, which propagates the full dose range to every arm row. The “8.0 mg/kg” arm shows dose_min=8.0, dose_max=10.0 — misleading, because that arm only received 8.0.

For a study like ARTEMIS-001 (pub 190656) with:

  • Subgroups: Squamous, Adenocarcinoma (disease-defined)
  • Arms: 8.0 mg/kg, 10.0 mg/kg (dose-defined)

Every row in the view gets the same dose_min=8.0, dose_max=10.0 regardless of which arm it belongs to. The arm_name has the correct dose but it’s a text string, not queryable as structured data.
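The fallback behavior described above can be approximated in plain Ruby. This is a sketch of the logic only, not the SQL in the view: `resolve_dose` is a hypothetical function, and the literal arm_type and dose_context_type strings used for the guards are assumptions, since the text names the categories rather than the exact values.

```ruby
# Assumed literal values for the guards; the text names categories, not exact strings.
COMPARATOR_ARM_TYPES = %w[control active_comparator placebo_comparator].freeze
GATED_CONTEXT_TYPES  = %w[escalation range rp2d].freeze

def resolve_dose(subgroup:, arm_type:, pub_dose:)
  # 1. Subgroup is a dose cohort with min/max.
  if subgroup[:dose_min] || subgroup[:dose_max]
    return { dose_min: subgroup[:dose_min], dose_max: subgroup[:dose_max], source: :subgroup_range }
  end

  # 2. Single-dose subgroup, formatted with units.
  if subgroup[:dose_value]
    single = [subgroup[:dose_value], subgroup[:dose_units]].compact.join(" ")
    return { single_dose: single, source: :subgroup_value }
  end

  # 3. Study-level fallback, skipped when a guard applies.
  return { source: :none } if pub_dose.nil?
  return { source: :none } if COMPARATOR_ARM_TYPES.include?(arm_type)
  return { source: :none } if GATED_CONTEXT_TYPES.include?(pub_dose[:dose_context_type])

  resolved = { dose_min: pub_dose[:dose_min], dose_max: pub_dose[:dose_max], source: :publication }
  # single_dose only falls back for a single-dose study (min equals max).
  if pub_dose[:dose_min] && pub_dose[:dose_min] == pub_dose[:dose_max]
    resolved[:single_dose] = pub_dose[:dose_min]
  end
  resolved
end

# Disease-defined subgroup (no dose fields) with two dose-defined arms.
squamous = { dose_min: nil, dose_max: nil, dose_value: nil }
pub_dose = { dose_min: "8.0 mg/kg", dose_max: "10.0 mg/kg", dose_context_type: "weight_based" }

rows = ["8.0 mg/kg", "10.0 mg/kg"].map do |arm_name|
  resolve_dose(subgroup: squamous, arm_type: "investigational", pub_dose: pub_dose)
    .merge(arm_name: arm_name)
end
rows.each { |row| puts "#{row[:arm_name]}: #{row[:dose_min]} to #{row[:dose_max]}" }
```

The arm name never enters the resolution, which is why both arms end up with the same study-level range.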

The fix requires adding dose fields to the arm extraction and storage:

  1. Add dose fields to the arm schema in classify_publications (so the LLM extracts per-arm dose)
  2. Add dose columns to trial_arm_outcomes (to store it)
  3. Update post_process.rb to persist arm-level dose during materialization
  4. Update the view to use arm-level dose as the first COALESCE choice
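Step 4 could look roughly like the following once arm-level dose exists. This is a sketch under the assumption that each arm row carries its own dose hash; `arm_dose` and `resolve_dose_arm_first` are hypothetical names, and the existing guards on the study-level fallback are omitted for brevity.

```ruby
# arm_dose is hypothetical: it stands for the per-arm dose columns proposed above.
# The first source with at least one non-nil field wins.
def resolve_dose_arm_first(arm_dose:, subgroup_dose:, pub_dose:)
  candidates = { arm: arm_dose, subgroup: subgroup_dose, publication: pub_dose }
  candidates.each do |source, dose|
    next if dose.nil? || dose.values.all?(&:nil?)
    return dose.merge(source: source)
  end
  { source: :none }
end

study_range = { dose_min: "8.0 mg/kg", dose_max: "10.0 mg/kg" }
no_subgroup_dose = { dose_min: nil, dose_max: nil }

with_arm = resolve_dose_arm_first(
  arm_dose: { dose_min: "8.0 mg/kg", dose_max: "8.0 mg/kg" },
  subgroup_dose: no_subgroup_dose,
  pub_dose: study_range
)
without_arm = resolve_dose_arm_first(arm_dose: nil, subgroup_dose: no_subgroup_dose, pub_dose: study_range)

puts with_arm.inspect
puts without_arm.inspect
```

Rows that have no arm-level dose keep today's behavior, so the change would be additive for publications that have not been reprocessed.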

The publications workflow (app/workflows/publications_workflow.rb) runs these steps in order:

 1. extract_trial_identifiers          Find NCT IDs and registry links in abstract
 2. web_search_identifiers             Web search for missing trial IDs (disabled)
 3. relink_to_clinical_trials          Match publications to clinical trials
 4. therapeutic_area_filter            Filter to target therapeutic areas
 5. extract_interventions              LLM: extract arms and their interventions → trial_arms + trial_arm_interventions
 6. link_publication_drugs             Match intervention names to drug entities
 7. tag_investigational_interventions  Classify intervention roles
 8. extract_subgroups                  LLM: identify subgroups and endpoints from abstract
 9. extract_dose_evidence              LLM: extract structured dose per intervention
10. classify_publications              LLM: full efficacy/safety extraction → llm_data
11. extract_diseases                   LLM: identify diseases
12. post_process_publications          Materialize llm_data → normalized tables
13. classify_intent                    LLM: classify publication intent
14. extract_treatment_lines            LLM: extract prior therapy context
15. standardize_adverse_events         Normalize AE names
16. classify_adverse_events            LLM: classify AEs

extract_subgroups (step 8): Identifies what subgroups and endpoints exist in the abstract. Stores in llm_data['subgroup_endpoints']. This runs BEFORE classify_publications to guide extraction.

extract_dose_evidence (step 9): Separate LLM pass per publication_intervention. Produces structured dose_evidence JSONB on the PI record. This is study-level dose per drug.

classify_publications (step 10): The main extraction. Takes the abstract plus known subgroups/endpoints and extracts:

{
  "subgroup_outcome_measures": [
    {
      "type": "disease",
      "value": "NSCLC → Squamous cell carcinoma",
      "number_of_participants": null,
      "outcome_measures": [
        {
          "endpoint": "Overall Response Rate",
          "endpoint_abbreviation": "ORR",
          "confirmed": true,
          "arms": [
            {
              "name": "8.0 mg/kg",
              "arm_type": "investigational",
              "measure_value": 15.6,
              "number_of_participants": 32
            },
            {
              "name": "10.0 mg/kg",
              "arm_type": "investigational",
              "measure_value": 26.9,
              "number_of_participants": 26
            }
          ]
        }
      ]
    }
  ]
}

Note: arms have name and measure_value but no structured dose fields.

post_process_publications (step 12): Materializes llm_data into normalized tables:

  • subgroup_outcome_measures entries → trial_subgroups rows
  • Each outcome_measures entry → trial_outcome_measure row (linked to subgroup + endpoint)
  • Each arms entry → trial_arm_outcome row (linked to outcome measure)
  • Guards: N=0 → nil (zero-sentinel), all-zero percentage endpoints with nil N → nil
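The materialization mapping can be sketched as a pure function that flattens the nested llm_data structure into three row lists. This is an illustration, not the code in post_process.rb: `materialize` and `zero_to_nil` are hypothetical names, IDs are assigned from list positions, and only the N=0 guard is implemented.

```ruby
# N=0 is treated as a sentinel for "not reported" and stored as nil.
def zero_to_nil(n)
  (n.nil? || n.zero?) ? nil : n
end

# Flattens llm_data into row hashes for the three normalized tables.
def materialize(llm_data)
  subgroups = []
  measures = []
  arm_outcomes = []

  Array(llm_data["subgroup_outcome_measures"]).each do |sg|
    subgroup_id = subgroups.length + 1
    subgroups << {
      id: subgroup_id,
      subgroup_type: sg["type"],
      subgroup_value: sg["value"],
      number_of_participants: zero_to_nil(sg["number_of_participants"])
    }

    Array(sg["outcome_measures"]).each do |om|
      measure_id = measures.length + 1
      measures << {
        id: measure_id,
        trial_subgroup_id: subgroup_id,
        endpoint_abbreviation: om["endpoint_abbreviation"],
        confirmed: om["confirmed"]
      }

      Array(om["arms"]).each do |arm|
        arm_outcomes << {
          trial_outcome_measure_id: measure_id,
          arm_name: arm["name"],
          arm_type: arm["arm_type"],
          measure_value: arm["measure_value"],
          number_of_participants: zero_to_nil(arm["number_of_participants"])
        }
      end
    end
  end

  { trial_subgroups: subgroups, trial_outcome_measures: measures, trial_arm_outcomes: arm_outcomes }
end

llm_data = {
  "subgroup_outcome_measures" => [
    {
      "type" => "disease",
      "value" => "NSCLC → Squamous cell carcinoma",
      "number_of_participants" => nil,
      "outcome_measures" => [
        {
          "endpoint" => "Overall Response Rate",
          "endpoint_abbreviation" => "ORR",
          "confirmed" => true,
          "arms" => [
            { "name" => "8.0 mg/kg", "arm_type" => "investigational", "measure_value" => 15.6, "number_of_participants" => 32 },
            { "name" => "10.0 mg/kg", "arm_type" => "investigational", "measure_value" => 26.9, "number_of_participants" => 0 }
          ]
        }
      ]
    }
  ]
}

tables = materialize(llm_data)
tables.each { |name, rows| puts "#{name}: #{rows.length} row(s)" }
```

The second arm's N is set to 0 here purely to exercise the zero-sentinel guard; in the document's example it is 26.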

vw_publication_efficacy_data_v22 joins everything together:

trial_subgroups (with dose, treatment lines)
← trial_outcome_measures (subgroup × endpoint)
← trial_arm_outcomes (per-arm results)
← drug_interventions (drug/technology from publication_interventions or registry)
← pub_dose_lookup (dose_evidence from publication_interventions)

The view outputs one row per: publication × subgroup × endpoint × arm × drug, with resolved dose fields via COALESCE fallback.

Tpp::ClinicalEvidenceQuery filters the view by disease and technology, enriches with biomarker data, and groups results by drug for the clinical evidence report.


The same dose split can be modeled two ways, depending on how the abstract presents data:

Dose as subgroup — when the abstract reports each dose cohort independently:

trial_subgroup: "8.0 mg/kg cohort" (dose_value = "8.0", dose_units = "mg/kg")
└── trial_outcome_measure: ORR
    └── trial_arm_outcome: arm_name = "HS-20093"
trial_subgroup: "10.0 mg/kg cohort" (dose_value = "10.0")
└── trial_outcome_measure: ORR
    └── trial_arm_outcome: arm_name = "HS-20093"

Dose as arm — when the abstract cross-tabulates dose × subgroup:

trial_subgroup: "Squamous cell carcinoma" (dose fields = NULL)
└── trial_outcome_measure: ORR
    ├── trial_arm_outcome: arm_name = "8.0 mg/kg" (no structured dose)
    └── trial_arm_outcome: arm_name = "10.0 mg/kg" (no structured dose)

The LLM picks whichever matches the abstract structure. In the second case, per-arm dose is lost as structured data.

Subgroup values use → as a hierarchy separator:

  • NSCLC (parent)
  • NSCLC → Adenocarcinoma (child)
  • NSCLC → Adenocarcinoma → AGA-negative (grandchild)

Each level is a separate trial_subgroup record. The parent serves as “Overall” for its children.
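Because each level is its own record, it can be useful to derive the ancestor labels from a subgroup_value. A minimal sketch, assuming the → separator seen in the examples; `subgroup_lineage` and `parent_value` are hypothetical helpers.

```ruby
# Separator between hierarchy levels, tolerant of surrounding whitespace.
HIERARCHY_SEPARATOR = /\s*→\s*/

# Returns every level from the root down to the given value.
def subgroup_lineage(value)
  parts = value.split(HIERARCHY_SEPARATOR)
  parts.each_index.map { |i| parts[0..i].join(" → ") }
end

# Returns the label of the immediate parent, or nil for a top-level subgroup.
def parent_value(value)
  lineage = subgroup_lineage(value)
  lineage.length > 1 ? lineage[-2] : nil
end

puts subgroup_lineage("NSCLC → Adenocarcinoma → AGA-negative").inspect
puts parent_value("NSCLC → Adenocarcinoma").inspect
```

Looking up the parent label among the publication's other subgroups is one way to find the record that serves as “Overall” for a child.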

When a publication reports both confirmed ORR (cORR) and unconfirmed ORR:

  • Two trial_outcome_measure records are created for the same subgroup + endpoint
  • Distinguished by confirmed = true vs confirmed = false
  • Issue 27: the query layer previously picked the wrong one via max_by(number_of_participants)
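The failure mode in Issue 27 can be reproduced with two hypothetical records. The sketch assumes the caller states which variant it wants; the participant counts are invented for illustration, and `pick_by_size` and `pick_by_confirmation` are not names from the codebase.

```ruby
# Two outcome measures for the same subgroup + endpoint (hypothetical values).
ORR_MEASURES = [
  { id: 1, endpoint: "ORR", confirmed: true,  number_of_participants: 30 },
  { id: 2, endpoint: "ORR", confirmed: false, number_of_participants: 32 }
].freeze

# Selecting by largest N ignores the confirmed flag entirely.
def pick_by_size(measures)
  measures.max_by { |m| m[:number_of_participants].to_i }
end

# Selecting by the flag returns the requested variant, or nil if it is absent.
def pick_by_confirmation(measures, confirmed:)
  measures.find { |m| m[:confirmed] == confirmed }
end

puts pick_by_size(ORR_MEASURES)[:id]
puts pick_by_confirmation(ORR_MEASURES, confirmed: true)[:id]
```

Here the unconfirmed measure happens to have the larger N, so size-based selection returns it even when the confirmed result is the one being reported.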

Arms as First-Class Entities (implemented 2026-04-02)


A publication’s efficacy result is: subgroup × endpoint × arm = value, where each dimension is a first-class entity linked by FK.

  1. extract_interventions creates trial_arms (with IDs) + trial_arm_interventions (drugs, dose per arm)
  2. classify_publications receives arm IDs in the prompt, assigns them to each outcome
  3. post_process reads arm_data['id'] as trial_arm_id — direct FK, no name matching

An “All Arms” entry is always created for pooled results.
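Step 3 can be pictured as follows. This is a sketch, not the code in post_process.rb: the arm IDs are invented, `link_arm_outcomes` is a hypothetical helper, and raising on an unknown ID is an assumption about error handling that the source does not describe.

```ruby
# Hypothetical arms created earlier by extract_interventions, keyed by ID.
KNOWN_ARMS = {
  101 => "8.0 mg/kg",
  102 => "10.0 mg/kg",
  103 => "All Arms"
}.freeze

# Builds arm-outcome rows using the ID supplied in the LLM output.
# No arm-name matching is involved.
def link_arm_outcomes(arms_payload, known_arm_ids)
  arms_payload.map do |arm_data|
    arm_id = arm_data["id"]
    unless known_arm_ids.include?(arm_id)
      raise ArgumentError, "unknown trial_arm id: #{arm_id.inspect}"
    end

    {
      trial_arm_id: arm_id,
      measure_value: arm_data["measure_value"],
      number_of_participants: arm_data["number_of_participants"]
    }
  end
end

payload = [
  { "id" => 101, "measure_value" => 15.6, "number_of_participants" => 32 },
  { "id" => 102, "measure_value" => 26.9, "number_of_participants" => 26 }
]

linked = link_arm_outcomes(payload, KNOWN_ARMS.keys)
linked.each { |row| puts "#{KNOWN_ARMS[row[:trial_arm_id]]}: #{row[:measure_value]}" }
```

Because the link is an ID, renaming an arm or having two arms with similar labels does not affect which outcome rows attach to it.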

trial_arms columns:

| Column | Purpose |
| --- | --- |
| name | Arm label (e.g., “8.0 mg/kg”, “Pembrolizumab + Chemo”, “All Arms”) |
| arm_type | investigational, control, active_comparator, placebo_comparator, combination |
| number_of_participants | Arm-level N |
| position | Preserves LLM output ordering |

No clinical_trial_id or study_plan_arm_id — trial_arms are self-contained publication entities.

trial_arm_interventions columns:

| Column | Purpose |
| --- | --- |
| trial_arm_id | FK to trial_arms |
| drug_id | FK to matched drug entity (nullable) |
| ncit_concept_id | FK to NCI Thesaurus concept (nullable) |
| intervention_name | Drug/intervention name |
| intervention_type | drug, biological, procedure, device, other |
| intervention_role | investigational, combination, comparator, supportive (per arm, not per publication) |
| dose, dose_min, dose_max, single_dose, rp2d | Structured dose fields |
| dose_units, dose_frequency, dose_context_type | Dose context |
| dose_evidence (JSONB) | Full dose extraction audit trail |

| Old approach | New approach |
| --- | --- |
| publication_interventions: one record per drug per pub, study-level dose | trial_arm_interventions: one record per drug per arm, arm-level dose |
| Drug-arm linkage via name-substring matching in 600-line SQL view | Direct FK: trial_arm_outcomes.trial_arm_id → trial_arm_interventions |
| intervention_role per publication (same drug, one role) | intervention_role per arm (same drug can be “combination” in one arm, “comparator” in another) |
| study_plan_arms from registry passed to LLM | trial_arms from our own extraction passed to LLM |

The publication_interventions table still exists and is used by NewsTrialMention (News pipeline). It is no longer created for Publications. The efficacy view v23 reads from trial_arm_interventions instead.

  • Production backfill: Run extract_interventions → classify_publications → post_process on all target-scope publications to get ID-based linking
  • Legacy data: ~43k pre-pipeline pubs have trial_arms created from arm outcomes (no interventions). These need reprocessing through extract_interventions to get drug/dose data
  • Issue 50: DrugLinker false-matches non-drug interventions — needs intervention_type guard

| Purpose | Path |
| --- | --- |
| Publication model | app/models/publication.rb |
| TrialArm model | app/models/trial_arm.rb |
| TrialArmIntervention model | app/models/trial_arm_intervention.rb |
| TrialSubgroup model | app/models/trial_subgroup.rb |
| TrialOutcomeMeasure model | app/models/trial_outcome_measure.rb |
| TrialArmOutcome model | app/models/trial_arm_outcome.rb |
| PublicationIntervention model (legacy, News only) | app/models/publication_intervention.rb |
| Intervention extraction | app/tasks/publications_llm_classification/intervention_extraction.rb |
| Trial arm materializer | app/tasks/publications_llm_classification/trial_arm_materializer.rb |
| Subgroup extraction | app/tasks/publications_llm_classification/subgroup_extraction.rb |
| Main LLM extraction | app/tasks/publications_llm_classification/task.rb |
| Extraction schema | app/tasks/publications_llm_classification/details.rb |
| Dose evidence extraction | app/tasks/publications_llm_classification/dose_evidence_extraction.rb |
| Post-process materialization | app/tasks/publications_llm_classification/post_process.rb |
| Efficacy view (latest) | db/views/vw_publication_efficacy_data_v23.sql |
| Clinical evidence query | app/queries/tpp/clinical_evidence_query.rb |
| Pipeline workflow | app/workflows/publications_workflow.rb |
| Backfill task | lib/tasks/one_off/backfill_trial_arms.thor |
| Issues tracker | docs/publication_issues_tracker.md |