Publication Data Model

How publication efficacy data is stored, from abstract text through LLM extraction to the clinical evidence report.

Last updated: 2026-04-03

Entity Relationship Overview

Publication (root)
│
├── trial_arms                       Treatment groups — each arm is a group of patients who
│   │                                received the same treatment (e.g., "8.0 mg/kg",
│   │                                "Pembrolizumab + Chemo"). Includes an "All Arms" entry
│   │                                for pooled results. Created by extract_interventions.
│   │
│   └── trial_arm_interventions      What drugs/interventions were given in this arm, at what
│                                    dose. Each intervention has its own drug_id, dose fields,
│                                    and intervention_role (investigational, combination,
│                                    comparator, supportive).
│
├── trial_disease_details            What diseases this publication studies — disease name,
│   │                                stage, subtype, risk, treatment setting. Linked to the
│   │                                diseases table via disease_id.
│   │
│   └── trial_disease_biomarkers     Biomarkers associated with the disease context (e.g.,
│                                    EGFR mutation for NSCLC). Linked to biomarkers table.
│
├── trial_endpoints                  What endpoints were measured (ORR, PFS, OS, DOR...).
│                                    Definitions only — no values here.
│
├── trial_subgroups                  Who was studied — patient populations (disease,
│   │                                biomarker, dose cohort, demographics).
│   │
│   ├── trial_subgroup_biomarkers    Structured biomarker details for biomarker-tagged
│   │                                subgroups (name, value, numeric threshold).
│   │
│   └── trial_outcome_measures       What was measured for this subgroup — the intersection
│       │                            of a subgroup × endpoint. Defines WHAT we're looking at
│       │                            (e.g., "ORR for Squamous, confirmed, percentage, primary").
│       │                            Holds metadata: outcome_type, measure_unit, confirmed,
│       │                            time_point. No actual result values.
│       │
│       └── trial_arm_outcomes       The actual numbers — per-arm results within an outcome
│                                    measure. Each row is one arm's result (e.g., "8.0 mg/kg
│                                    arm: N=32, ORR=15.6%"). Linked to trial_arms via
│                                    trial_arm_id FK. Holds measure_value, N, p_value,
│                                    hazard_ratio, odds_ratio.
│
├── adverse_events                   Safety endpoints (neutropenia, ILD, etc.).
│   │                                Holds grade_category, standardized_name.
│   │
│   └── trial_arm_outcomes           Per-arm safety results (same table as efficacy arm
│                                    outcomes, linked via adverse_event_id instead of
│                                    trial_outcome_measure_id).
│
└── publication_interventions        LEGACY (News only). Study-level drug records. Still used
                                     by NewsTrialMention. No longer created for Publications
                                     — replaced by trial_arms + trial_arm_interventions.

Polymorphic ownership

Several tables use polymorphic source_type + source_id columns instead of a direct publication_id foreign key. This allows the same table to store data sourced from publications OR clinical trial registries:

Table	Polymorphic columns	Typical source_type
`trial_subgroups`	`source_type`, `source_id`	`'Publication'`
`trial_endpoints`	`source_type`, `source_id`	`'Publication'`
`trial_outcome_measures`	`source_type`, `source_id`	`'Publication'`
`trial_arms`	`source_type`, `source_id`	`'Publication'`
`trial_disease_details`	`source_type`, `source_id`	`'Publication'`
`adverse_events`	`source_type`, `source_id`	`'Publication'`
`publication_interventions`	`source_type`, `source_id`	`'Publication'` or `'NewsTrialMention'`

To query all subgroups for a publication: WHERE source_type = 'Publication' AND source_id = <pub_id>.

trial_arm_outcomes is NOT polymorphic — it belongs directly to a trial_outcome_measure (via trial_outcome_measure_id) or an adverse_event (via adverse_event_id).
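
A minimal sketch of the polymorphic lookup in plain Ruby, with no Rails dependency. The `Row` struct, the `rows_for` helper, and the sample rows are hypothetical and exist only for illustration; in the application the same filter is an ordinary WHERE clause on source_type and source_id.

```ruby
# Hypothetical in-memory stand-in for a polymorphic table such as
# publication_interventions. The rows are made up for illustration.
Row = Struct.new(:id, :source_type, :source_id, :label, keyword_init: true)

# Return every row owned by the given source (type + id pair).
# Both columns are needed: ids can collide across source types.
def rows_for(rows, source_type:, source_id:)
  rows.select { |r| r.source_type == source_type && r.source_id == source_id }
end

INTERVENTIONS = [
  Row.new(id: 1, source_type: "Publication",      source_id: 42, label: "Drug A"),
  Row.new(id: 2, source_type: "Publication",      source_id: 42, label: "Drug B"),
  Row.new(id: 3, source_type: "NewsTrialMention", source_id: 42, label: "Drug A"),
  Row.new(id: 4, source_type: "Publication",      source_id: 7,  label: "Drug C")
].freeze

pub_rows = rows_for(INTERVENTIONS, source_type: "Publication", source_id: 42)
puts pub_rows.map(&:label).inspect
```

Row 3 shares source_id 42 with the publication rows but belongs to a different source type, which is why filtering on source_id alone would return the wrong set.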

Two paths to disease and biomarker data

Disease and biomarker information is stored at two different levels, serving different purposes:

Path 1: Publication-level — “What does this study cover?”

Publication → trial_disease_details → diseases
                                    → trial_disease_biomarkers → biomarkers

Populated by the extract_diseases pipeline step. Describes the study’s overall disease context (e.g., “EGFR-mutant advanced NSCLC”). Used for filtering and categorization.

Path 2: Subgroup-level — “What population does this specific result apply to?”

Publication → trial_subgroups → diseases  (via disease_id)
                              → trial_subgroup_biomarkers → biomarkers

Populated by classify_publications + post_process. These are the data-carrying entities that link down to outcome measures and arm results.

The two levels can diverge. A study might cover “advanced NSCLC” at the disease detail level but report results for “Squamous” and “Adenocarcinoma → AGA-negative” subgroups. A basket trial studying “solid tumors” might have disease-specific subgroups for NSCLC, CRC, and breast cancer.

Core Tables

publications

The root entity. Key fields for the data model:

Column	Purpose
`abstract`	Source text — ground truth for all extraction
`llm_data` (JSONB)	Raw LLM extraction output before materialization
`llm_data_processed` (bool)	Whether post_process has materialized llm_data into child tables
`result` (bool)	Whether this publication reports clinical results
`clinical_trial_id`	FK to linked clinical trial (nullable)
`total_number_of_participants`	Denormalized from llm_data
`trial_outcome`	positive / negative / unclear

trial_subgroups

A patient population or cohort within a publication. Subgroups can represent different things depending on subgroup_type:

subgroup_type	Example subgroup_value	What it means
`disease`	`NSCLC → Squamous cell carcinoma`	Histology/disease subpopulation
`biomarker`	`PD-L1 TPS ≥50%`	Biomarker-selected subgroup
`dose`	`Dose 100-300mg`	Dose-defined cohort
`overall`	`Overall`	Full study population

Key columns:

Column	Purpose
`subgroup_value`	Human-readable label (hierarchical with `→`)
`subgroup_type`	Semantic category
`number_of_participants`	N for this subgroup (from abstract)
`population_role`	Denominator semantics: overall, partition, selected_subset, etc.
`tags` (JSONB)	Semantic dimension tags
`dose_value`, `dose_min`, `dose_max`	Dose fields — only populated when the subgroup itself IS a dose cohort
`dose_units`, `dose_frequency`, `rp2d`	Additional dose context
`data_cutoff_date`	When results were cut off
`treatment_lines` (JSONB)	Prior therapy context
`min_prior_lines`, `max_prior_lines`	Sanitized treatment line counts
`disease_id`	FK to matched disease entity

trial_endpoints

Endpoint definitions extracted from the publication:

Column	Purpose
`endpoint_name`	Full name (e.g., “Overall Survival”)
`abbreviation`	Short form (e.g., “OS”)
`endpoint_id`	FK to master `endpoints` table

trial_outcome_measures

The intersection of a subgroup and an endpoint — “ORR for the Squamous subgroup”:

Column	Purpose
`trial_subgroup_id`	FK to trial_subgroups
`trial_endpoint_id`	FK to trial_endpoints
`outcome_type`	primary / secondary / exploratory
`measure_unit`	percentage, months, count
`confirmed` (bool, nullable)	Confirmed vs unconfirmed response (e.g., cORR vs ORR)
`p_value`, `hazard_ratio`, `odds_ratio`	Outcome-level statistics
`time_point`	For landmark analyses (e.g., “12 months”)

trial_arm_outcomes

Per-arm results within an outcome measure — “ORR for Squamous in the 8.0 mg/kg arm”:

Column	Purpose
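`trial_outcome_measure_id`	FK to trial_outcome_measures (efficacy results)
`adverse_event_id`	FK to adverse_events (safety results; set instead of trial_outcome_measure_id)
`trial_arm_id`	FK to trial_arms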
`arm_name`	Arm label (e.g., “8.0 mg/kg”, “Pembrolizumab + Chemo”)
`arm_type`	investigational, control, active_comparator, placebo_comparator
`number_of_participants`	N for this arm
`measure_value`	The result value (e.g., “33.3”, “Not Reached”)
`study_plan_arm_id`	FK to registry study_plan_arms (nullable)
`p_value`, `hazard_ratio`, `odds_ratio`	Arm-level statistics

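No dose columns on this table. Within trial_arm_outcomes, arm-specific dose appears only in the arm_name string; structured per-arm dose is stored on trial_arm_interventions, reached through trial_arm_id. See Dose Data and Its Gaps below.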

Subgroup Tags and Population Roles

Each trial_subgroup has two classification fields set during LLM extraction:

Tags (multi-select)

A subgroup can have multiple tags. For example, “EGFR-mutant NSCLC” would get ["overall", "biomarker", "disease"].

Tag	Description	Count
`disease`	Disease type, histology, subtype (NSCLC, AML, DLBCL)	63,553
`population`	Specific analysis populations (per-protocol, safety-evaluable, responders) — NOT the unsliced overall	62,262
`biomarker`	Mutations, expression markers, molecular subtypes (EGFR, PD-L1, HER2, TMB)	44,559
`overall`	Top-line study population. A disease-specific cohort can still be “overall” when it is the single top-line cohort being reported	33,529
`treatment_arm`	Treatment arms, regimen groupings	23,076
`dose`	Dose levels, cohorts, schedules	17,168
`prior_therapy`	Specific prior treatments (prior platinum, prior IO)	15,436
`stage`	Disease stage (early, advanced, metastatic)	14,755
`other`	Only if no other tag fits	13,378
`risk_group`	Cytogenetic risk, IMDC risk, prognostic groups	6,947
`line_of_therapy`	Treatment line (1L, 2L+, treatment-naive)	6,443
`age`	Age demographic splits	5,913
`response_status`	Subgroups defined by achieved response (responders, CR, PR, SD, PD, pCR, MRD-negative)	2,396
`gender`	Sex/gender splits	1,824
`geography`	Region/country splits	1,604
`race_ethnicity`	Race/ethnicity demographic splits	1,154
`performance_status`	ECOG PS, KPS	1,059

Defined in TrialSubgroup::SUBGROUP_TAGS (app/models/trial_subgroup.rb).

Population Role (single-select, nullable)

Clarifies what the subgroup’s N represents as a denominator:

Role	Description
`overall`	The unsliced top-line population for the full reported cohort
`analysis_population`	ITT, mITT, safety, evaluable, assessable, tested, treated populations
`partition`	An ordinary subgroup bucket — dose cohort, treatment arm, age band, sex split, stage bucket. Disease is “partition” only when it is one bucket among multiple disease cohorts side by side
`selected_subset`	A filtered subset defined by a qualifying condition (biomarker-positive, prior-therapy-exposed, condition-present)
`response_subset`	A subgroup defined by achieved response status

Defined in TrialSubgroup::SUBGROUP_POPULATION_ROLES (app/models/trial_subgroup.rb).
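
The difference between the multi-select tags and the single-select, nullable population role can be sketched as a small validation routine. This is an illustration only: `ALLOWED_TAGS`, `ALLOWED_ROLES`, and `classification_errors` are hypothetical names, and the values are copied from the two tables above rather than read from the real constants.

```ruby
# Values copied from the tables above. The authoritative lists live in
# TrialSubgroup::SUBGROUP_TAGS and TrialSubgroup::SUBGROUP_POPULATION_ROLES.
ALLOWED_TAGS = %w[
  disease population biomarker overall treatment_arm dose prior_therapy stage
  other risk_group line_of_therapy age response_status gender geography
  race_ethnicity performance_status
].freeze

ALLOWED_ROLES = %w[
  overall analysis_population partition selected_subset response_subset
].freeze

# Hypothetical check: tags may hold several values, population_role holds
# at most one value and may be nil.
def classification_errors(tags:, population_role:)
  errors = []
  unknown = Array(tags) - ALLOWED_TAGS
  errors << "unknown tags: #{unknown.join(', ')}" unless unknown.empty?
  unless population_role.nil? || ALLOWED_ROLES.include?(population_role)
    errors << "unknown population_role: #{population_role}"
  end
  errors
end

# "EGFR-mutant NSCLC" from the example above; the role shown is assumed.
puts classification_errors(tags: %w[overall biomarker disease],
                           population_role: "overall").inspect
```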

trial_disease_details

Disease context for the publication — what disease(s) were studied and their clinical characteristics.

Column	Purpose
`disease_name`	Extracted disease name
`disease_id`	FK to matched `diseases` entity
`subtypes` (JSONB)	Disease subtypes (e.g., adenocarcinoma, squamous)
`stages` (JSONB)	Disease stages (e.g., advanced, metastatic, stage III)
`extents` (JSONB)	Disease extent descriptors
`statuses` (JSONB)	Disease status (e.g., relapsed, refractory)
`risks` (JSONB)	Risk classifications (e.g., high-risk cytogenetics)
`treatment_settings` (JSONB)	Treatment setting context
`number_of_prior_treatment_lines`	Prior therapy line count

trial_disease_biomarkers

Biomarkers associated with a disease context. Belongs to trial_disease_details.

Column	Purpose
`trial_disease_detail_id`	FK to trial_disease_details
`biomarker_id`	FK to matched `biomarkers` entity (nullable)
`biomarker_name`	Extracted biomarker name (e.g., “EGFR”)
`value`	Biomarker status (e.g., “mutated”, “positive”)
`numeric_value`	Threshold if applicable (e.g., “50” for TPS ≥50%)
`alternatives_names` (JSONB)	Alternative names for matching

publication_interventions (DEPRECATED for Publications — News only)

Drug/intervention records. No longer created for Publication sources — replaced by trial_arms + trial_arm_interventions. Still used by NewsTrialMention through the News pipeline.

Column	Purpose
`intervention_name`	Drug name
`drug_id`	FK to matched drug entity
`intervention_role`	investigational, comparator, combination, supportive
`intervention_type`	drug, biological, procedure
`dose`	Free-text dose string from abstract
`dose_evidence` (JSONB)	Structured dose extraction (see below)
`study_plan_arm_id`	FK to study_plan_arms — always NULL in practice

Dose Data and Its Gaps

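Dose information exists at three levels, but there is a structural gap at the arm level: trial_arm_outcomes carries no dose columns of its own. (Per-arm dose recorded on trial_arm_interventions is described under Arms as First-Class Entities.)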

Level 1: Study-level dose (publication_interventions)

extract_dose_evidence runs a separate LLM pass over each publication_intervention to populate the dose_evidence JSONB:

{
  "single_dose": "400 mg",
  "dose_min": "8.0 mg/kg",
  "dose_max": "10.0 mg/kg",
  "rp2d": "8.0 mg/kg",
  "dose_units": "mg/kg",
  "dose_frequency": "Q3W",
  "dose_context_type": "weight_based",
  "confidence": 0.95
}

There is one publication_intervention (PI) record per drug per publication, so this captures the study-level dose range. For a multi-dose-arm study like “8.0 mg/kg vs 10.0 mg/kg”, it records dose_min=8.0 and dose_max=10.0: the range across arms, not per-arm values.

Level 2: Subgroup-level dose (trial_subgroups)

When a subgroup IS a dose cohort (e.g., “Dose 100-300mg”), the LLM extraction populates dose_min, dose_max, dose_value on the trial_subgroup record. ~1,200 subgroups out of ~200k have these fields set.

Level 3: Arm-level dose — THE GAP

trial_arm_outcomes has NO dose columns. When a publication reports efficacy by dose arm (e.g., “8.0 mg/kg arm: ORR 15.6%” and “10.0 mg/kg arm: ORR 26.9%”), the dose is only captured in arm_name as an unstructured string.

The classify_publications LLM extraction already identifies each arm by its dose label, but the arm schema has only name, arm_type, measure_value, and number_of_participants, with no dose fields.

How the view resolves dose (COALESCE chain)

The vw_publication_efficacy_data view uses a fallback chain to populate dose_min/dose_max/single_dose on each row:

1. trial_subgroups.dose_min/dose_max        (subgroup is a dose cohort)
2. trial_subgroups.dose_value               (single-dose subgroup, formatted with units)
3. publication_interventions.dose_evidence   (study-level fallback, with guards)

The pub-level fallback is gated:

Skipped for control/comparator arms (Issue 31)
Skipped for escalation/range/rp2d context types (Issue 35)
single_dose only falls back when pub_dose_min = pub_dose_max (single dose study)

The problem: For multi-dose-arm studies where subgroups are disease-defined (not dose-defined), the COALESCE chain falls through to study-level dose, which propagates the full dose range to every arm row. The “8.0 mg/kg” arm shows dose_min=8.0, dose_max=10.0 — misleading, because that arm only received 8.0.

What this means in practice

For a study like ARTEMIS-001 (pub 190656) with:

Subgroups: Squamous, Adenocarcinoma (disease-defined)
Arms: 8.0 mg/kg, 10.0 mg/kg (dose-defined)

Every row in the view gets the same dose_min=8.0, dose_max=10.0 regardless of which arm it belongs to. The arm_name has the correct dose but it’s a text string, not queryable as structured data.
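
The fallback chain and its guards can be approximated in Ruby to show how the range ends up on every arm row. This is a sketch under assumptions, not the view's SQL: `resolve_dose` is a hypothetical helper, the guard string values are guesses at the literals the view uses, and the handling of single_dose is one reading of the rule above.

```ruby
# Assumed string values for the guards; the view's actual literals may differ.
COMPARATOR_ARM_TYPES  = %w[control active_comparator placebo_comparator].freeze
SKIPPED_CONTEXT_TYPES = %w[escalation range rp2d].freeze
EMPTY_DOSE = { dose_min: nil, dose_max: nil, single_dose: nil }.freeze

# Resolve dose for one view row, mirroring the three-step fallback.
def resolve_dose(subgroup:, pub_dose:, arm_type:)
  # 1. Subgroup is a dose cohort with an explicit range.
  if subgroup[:dose_min] || subgroup[:dose_max]
    return EMPTY_DOSE.merge(dose_min: subgroup[:dose_min], dose_max: subgroup[:dose_max])
  end

  # 2. Single-dose subgroup, formatted with units.
  if subgroup[:dose_value]
    formatted = [subgroup[:dose_value], subgroup[:dose_units]].compact.join(" ")
    return EMPTY_DOSE.merge(single_dose: formatted)
  end

  # 3. Study-level fallback, with guards.
  return EMPTY_DOSE.dup if pub_dose.nil?
  return EMPTY_DOSE.dup if COMPARATOR_ARM_TYPES.include?(arm_type)
  return EMPTY_DOSE.dup if SKIPPED_CONTEXT_TYPES.include?(pub_dose[:dose_context_type])

  # single_dose falls back only when the study-level range is a single point.
  single = pub_dose[:dose_min] == pub_dose[:dose_max] ? pub_dose[:single_dose] : nil
  { dose_min: pub_dose[:dose_min], dose_max: pub_dose[:dose_max], single_dose: single }
end

# Disease-defined subgroup with two dose-defined arms, as in the example above.
pub_dose = { dose_min: "8.0 mg/kg", dose_max: "10.0 mg/kg", single_dose: nil,
             dose_context_type: "weight_based" }
squamous = { dose_min: nil, dose_max: nil, dose_value: nil, dose_units: nil }

arm_rows = ["8.0 mg/kg", "10.0 mg/kg"].map do |arm_name|
  resolve_dose(subgroup: squamous, pub_dose: pub_dose, arm_type: "investigational")
    .merge(arm_name: arm_name)
end
arm_rows.each { |row| puts row.inspect }
```

Both rows come back with the same study-level range because nothing in the inputs to the chain distinguishes one arm from the other.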

Fixing the gap

The fix requires adding dose fields to the arm extraction and storage:

Add dose fields to the arm schema in classify_publications (so the LLM extracts per-arm dose)
Add dose columns to trial_arm_outcomes (to store it)
Update post_process.rb to persist arm-level dose during materialization
Update the view to use arm-level dose as the first COALESCE choice
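
The last step, preferring arm-level dose over the existing chain, might look like the sketch below. `resolve_with_arm_first` and the shape of `arm_dose` are hypothetical; the proposed arm-level columns do not exist yet, so the field names are assumptions.

```ruby
DOSE_KEYS = %i[dose_min dose_max single_dose].freeze

# arm_dose: proposed per-arm dose fields (nil when nothing was extracted).
# fallback: whatever the existing subgroup/study-level chain produced.
def resolve_with_arm_first(arm_dose:, fallback:)
  arm_has_dose = !arm_dose.nil? && DOSE_KEYS.any? { |k| arm_dose[k] }
  source = arm_has_dose ? arm_dose : fallback
  DOSE_KEYS.map { |k| [k, source[k]] }.to_h
end

study_range = { dose_min: "8.0 mg/kg", dose_max: "10.0 mg/kg", single_dose: nil }

with_arm    = resolve_with_arm_first(arm_dose: { single_dose: "8.0 mg/kg" }, fallback: study_range)
without_arm = resolve_with_arm_first(arm_dose: nil, fallback: study_range)

puts with_arm.inspect
puts without_arm.inspect
```

This sketch picks one source for all three fields rather than coalescing column by column. A per-column COALESCE would combine the arm's single_dose with the study-level dose_min and dose_max, which reintroduces the misleading range.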

Data Flow: Abstract to Report

Pipeline Steps

The publications workflow (app/workflows/publications_workflow.rb) runs these steps in order:

1. extract_trial_identifiers     Find NCT IDs and registry links in abstract
2. web_search_identifiers        Web search for missing trial IDs (disabled)
3. relink_to_clinical_trials     Match publications to clinical trials
4. therapeutic_area_filter       Filter to target therapeutic areas
5. extract_interventions         LLM: extract arms and their interventions → trial_arms + trial_arm_interventions
6. link_publication_drugs        Match intervention names to drug entities
7. tag_investigational_interventions  Classify intervention roles
8. extract_subgroups             LLM: identify subgroups and endpoints from abstract
9. extract_dose_evidence         LLM: extract structured dose per intervention
10. classify_publications        LLM: full efficacy/safety extraction → llm_data
11. extract_diseases             LLM: identify diseases
12. post_process_publications    Materialize llm_data → normalized tables
13. classify_intent              LLM: classify publication intent
14. extract_treatment_lines      LLM: extract prior therapy context
15. standardize_adverse_events   Normalize AE names
16. classify_adverse_events      LLM: classify AEs

What happens at each key step

extract_subgroups (step 8): Identifies which subgroups and endpoints exist in the abstract and stores them in llm_data['subgroup_endpoints']. This runs BEFORE classify_publications to guide extraction.

extract_dose_evidence (step 9): Separate LLM pass per publication_intervention. Produces structured dose_evidence JSONB on the publication_intervention record. This is study-level dose per drug.

classify_publications (step 10): The main extraction. Takes the abstract plus known subgroups/endpoints and extracts:

{
  "subgroup_outcome_measures": [
    {
      "type": "disease",
      "value": "NSCLC → Squamous cell carcinoma",
      "number_of_participants": null,
      "outcome_measures": [
        {
          "endpoint": "Overall Response Rate",
          "endpoint_abbreviation": "ORR",
          "confirmed": true,
          "arms": [
            {
              "name": "8.0 mg/kg",
              "arm_type": "investigational",
              "measure_value": 15.6,
              "number_of_participants": 32
            },
            {
              "name": "10.0 mg/kg",
              "arm_type": "investigational",
              "measure_value": 26.9,
              "number_of_participants": 26
            }
          ]
        }
      ]
    }
  ]
}

Note: arms have name and measure_value but no structured dose fields.

post_process_publications (step 12): Materializes llm_data into normalized tables:

subgroup_outcome_measures entries → trial_subgroups rows
Each outcome_measures entry → trial_outcome_measure row (linked to subgroup + endpoint)
Each arms entry → trial_arm_outcome row (linked to outcome measure)
Guards: N=0 → nil (zero-sentinel), all-zero percentage endpoints with nil N → nil
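
The materialization mapping can be sketched as a plain-Ruby flattening of the nested structure into three row sets. This is a simplified illustration, not the logic in post_process.rb: `materialize` and `sanitize_n` are hypothetical names, ids are sequential stand-ins for primary keys, endpoint linking is reduced to the abbreviation, and only the N=0 guard is shown (the all-zero percentage guard is omitted).

```ruby
require "json"

# Guard from the text: N=0 is treated as a missing value (zero-sentinel).
def sanitize_n(n)
  (n.nil? || n.zero?) ? nil : n
end

# Flatten the nested extraction into one array of row hashes per table.
def materialize(llm_data)
  subgroups = []
  measures = []
  arm_outcomes = []

  llm_data.fetch("subgroup_outcome_measures", []).each do |sg|
    subgroup_id = subgroups.size + 1
    subgroups << { id: subgroup_id, subgroup_type: sg["type"], subgroup_value: sg["value"],
                   number_of_participants: sanitize_n(sg["number_of_participants"]) }

    sg.fetch("outcome_measures", []).each do |om|
      measure_id = measures.size + 1
      measures << { id: measure_id, trial_subgroup_id: subgroup_id,
                    endpoint: om["endpoint_abbreviation"], confirmed: om["confirmed"] }

      om.fetch("arms", []).each do |arm|
        arm_outcomes << { trial_outcome_measure_id: measure_id, arm_name: arm["name"],
                          measure_value: arm["measure_value"],
                          number_of_participants: sanitize_n(arm["number_of_participants"]) }
      end
    end
  end

  { trial_subgroups: subgroups, trial_outcome_measures: measures, trial_arm_outcomes: arm_outcomes }
end

# Trimmed copy of the example above; the second arm's N is changed to 0
# here only to exercise the zero-sentinel guard.
llm_data = JSON.parse(<<~JSON)
  {
    "subgroup_outcome_measures": [
      {
        "type": "disease",
        "value": "NSCLC → Squamous cell carcinoma",
        "number_of_participants": null,
        "outcome_measures": [
          {
            "endpoint": "Overall Response Rate",
            "endpoint_abbreviation": "ORR",
            "confirmed": true,
            "arms": [
              { "name": "8.0 mg/kg",  "measure_value": 15.6, "number_of_participants": 32 },
              { "name": "10.0 mg/kg", "measure_value": 26.9, "number_of_participants": 0 }
            ]
          }
        ]
      }
    ]
  }
JSON

tables = materialize(llm_data)
tables.each { |name, rows| puts "#{name}: #{rows.size} row(s)" }
```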

View materialization

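vw_publication_efficacy_data_v22 joins everything together. (The latest version, v23, reads drug data from trial_arm_interventions instead of publication_interventions.) The v22 join structure: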

trial_subgroups (with dose, treatment lines)
  ← trial_outcome_measures (subgroup × endpoint)
    ← trial_arm_outcomes (per-arm results)
      ← drug_interventions (drug/technology from publication_interventions or registry)
        ← pub_dose_lookup (dose_evidence from publication_interventions)

The view outputs one row per publication × subgroup × endpoint × arm × drug, with dose fields resolved through the COALESCE fallback.

Query layer

Tpp::ClinicalEvidenceQuery filters the view by disease and technology, enriches with biomarker data, and groups results by drug for the clinical evidence report.

Common Modeling Patterns

Dose as subgroup vs dose as arm

The same dose split can be modeled two ways, depending on how the abstract presents data:

Dose as subgroup — when the abstract reports each dose cohort independently:

trial_subgroup: "8.0 mg/kg cohort"  (dose_value = "8.0", dose_units = "mg/kg")
  └── trial_outcome_measure: ORR
        └── trial_arm_outcome: arm_name = "HS-20093"

trial_subgroup: "10.0 mg/kg cohort"  (dose_value = "10.0")
  └── trial_outcome_measure: ORR
        └── trial_arm_outcome: arm_name = "HS-20093"

Dose as arm — when the abstract cross-tabulates dose × subgroup:

trial_subgroup: "Squamous cell carcinoma"  (dose fields = NULL)
  └── trial_outcome_measure: ORR
        ├── trial_arm_outcome: arm_name = "8.0 mg/kg"   (no structured dose)
        └── trial_arm_outcome: arm_name = "10.0 mg/kg"  (no structured dose)

The LLM picks whichever matches the abstract structure. In the second case, per-arm dose is lost as structured data.

Hierarchical subgroups

Subgroup values use → as a hierarchy separator:

NSCLC (parent)
NSCLC → Adenocarcinoma (child)
NSCLC → Adenocarcinoma → AGA-negative (grandchild)

Each level is a separate trial_subgroup record. The parent serves as “Overall” for its children.
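
Working with the separator is a matter of splitting and rejoining the label. A small sketch, assuming the separator is always the `→` character surrounded by optional whitespace; `hierarchy_levels` and `parent_value` are hypothetical helpers, not methods on TrialSubgroup.

```ruby
SUBGROUP_SEPARATOR = "→"

# Split a subgroup_value into its hierarchy levels, trimming whitespace.
def hierarchy_levels(subgroup_value)
  subgroup_value.split(SUBGROUP_SEPARATOR).map(&:strip)
end

# Label of the parent subgroup, or nil for a top-level value.
def parent_value(subgroup_value)
  levels = hierarchy_levels(subgroup_value)
  return nil if levels.size < 2

  levels[0...-1].join(" #{SUBGROUP_SEPARATOR} ")
end

value = "NSCLC → Adenocarcinoma → AGA-negative"
puts hierarchy_levels(value).inspect
puts parent_value(value)
```

Since each level is stored as its own record, the parent label returned here could be used to look up the trial_subgroup that acts as “Overall” for a given child.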

Confirmed vs unconfirmed response

When a publication reports both confirmed ORR (cORR) and unconfirmed ORR:

Two trial_outcome_measure records are created for the same subgroup + endpoint
Distinguished by confirmed = true vs confirmed = false
Issue 27: the query layer previously picked the wrong one via max_by(number_of_participants)
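
The failure mode can be reproduced with two in-memory records. The sketch below is illustrative and is not necessarily how Issue 27 was resolved: `Measure` and `pick_measure` are hypothetical, and the participant counts are invented to show why selecting by size is unreliable.

```ruby
Measure = Struct.new(:endpoint, :confirmed, :number_of_participants, keyword_init: true)

# Pick the measure whose confirmed flag matches the caller's preference.
# If there is no exact match, fall back to a measure with confirmed = nil
# (flag not stated in the abstract).
def pick_measure(measures, prefer_confirmed: true)
  measures.find { |m| m.confirmed == prefer_confirmed } ||
    measures.find { |m| m.confirmed.nil? }
end

# Invented numbers: the unconfirmed analysis happens to have the larger N.
measures = [
  Measure.new(endpoint: "ORR", confirmed: false, number_of_participants: 58),
  Measure.new(endpoint: "ORR", confirmed: true,  number_of_participants: 52)
]

by_size = measures.max_by(&:number_of_participants) # selects the unconfirmed record
by_flag = pick_measure(measures)                    # selects the confirmed record

puts "max_by picks confirmed=#{by_size.confirmed}"
puts "flag-based pick has confirmed=#{by_flag.confirmed}"
```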

Arms as First-Class Entities (implemented 2026-04-02)

The result fact

A publication’s efficacy result is: subgroup × endpoint × arm = value, where each dimension is a first-class entity linked by FK.

How it works

extract_interventions creates trial_arms (with IDs) + trial_arm_interventions (drugs, dose per arm)
classify_publications receives arm IDs in the prompt, assigns them to each outcome
post_process reads arm_data['id'] as trial_arm_id — direct FK, no name matching

An “All Arms” entry is always created for pooled results.
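
ID-based linking can be sketched as follows. The arm ids and the `build_arm_outcome` helper are hypothetical, and rejecting ids that were not offered in the prompt is an assumption added for the example, not a documented behaviour of post_process.

```ruby
# Hypothetical arms, as extract_interventions might have created them.
TRIAL_ARMS = [
  { id: 101, name: "8.0 mg/kg" },
  { id: 102, name: "10.0 mg/kg" },
  { id: 103, name: "All Arms" }
].freeze

# Build a trial_arm_outcomes row from one extracted arm entry.
# The id supplied in the prompt comes back as arm_data["id"] and is used
# directly as the foreign key; no matching on the arm name is involved.
def build_arm_outcome(arm_data, known_arm_ids:)
  arm_id = arm_data["id"]
  unless known_arm_ids.include?(arm_id)
    raise ArgumentError, "unknown trial_arm id: #{arm_id.inspect}"
  end

  { trial_arm_id: arm_id,
    measure_value: arm_data["measure_value"],
    number_of_participants: arm_data["number_of_participants"] }
end

known_ids = TRIAL_ARMS.map { |arm| arm[:id] }
outcome = build_arm_outcome(
  { "id" => 101, "measure_value" => 15.6, "number_of_participants" => 32 },
  known_arm_ids: known_ids
)
puts outcome.inspect
```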

trial_arms

Column	Purpose
`name`	Arm label (e.g., “8.0 mg/kg”, “Pembrolizumab + Chemo”, “All Arms”)
`arm_type`	investigational, control, active_comparator, placebo_comparator, combination
`number_of_participants`	Arm-level N
`position`	Preserves LLM output ordering

No clinical_trial_id or study_plan_arm_id — trial_arms are self-contained publication entities.

trial_arm_interventions

Column	Purpose
`trial_arm_id`	FK to trial_arms
`drug_id`	FK to matched drug entity (nullable)
`ncit_concept_id`	FK to NCI Thesaurus concept (nullable)
`intervention_name`	Drug/intervention name
`intervention_type`	drug, biological, procedure, device, other
`intervention_role`	investigational, combination, comparator, supportive — per arm, not per publication
`dose`, `dose_min`, `dose_max`, `single_dose`, `rp2d`	Structured dose fields
`dose_units`, `dose_frequency`, `dose_context_type`	Dose context
`dose_evidence` (JSONB)	Full dose extraction audit trail

What this replaced

Old approach	New approach
`publication_interventions` — one record per drug per pub, study-level dose	`trial_arm_interventions` — one record per drug per arm, arm-level dose
Drug-arm linkage via name-substring matching in 600-line SQL view	Direct FK: `trial_arm_outcomes.trial_arm_id` → `trial_arms`, joined to `trial_arm_interventions` on `trial_arm_id`
`intervention_role` per publication (same drug, one role)	`intervention_role` per arm (same drug can be “combination” in one arm, “comparator” in another)
`study_plan_arms` from registry passed to LLM	`trial_arms` from our own extraction passed to LLM

publication_interventions (legacy)

The table still exists and is used by NewsTrialMention (News pipeline). No longer created for Publications. The efficacy view v23 reads from trial_arm_interventions instead.

Remaining work

Production backfill: Run extract_interventions → classify_publications → post_process on all target-scope publications to get ID-based linking
Legacy data: ~43k pre-pipeline pubs have trial_arms created from arm outcomes (no interventions). These need reprocessing through extract_interventions to get drug/dose data
Issue 50: DrugLinker false-matches non-drug interventions — needs intervention_type guard

Key Files

Purpose	Path
Publication model	`app/models/publication.rb`
TrialArm model	`app/models/trial_arm.rb`
TrialArmIntervention model	`app/models/trial_arm_intervention.rb`
TrialSubgroup model	`app/models/trial_subgroup.rb`
TrialOutcomeMeasure model	`app/models/trial_outcome_measure.rb`
TrialArmOutcome model	`app/models/trial_arm_outcome.rb`
PublicationIntervention model (legacy, News only)	`app/models/publication_intervention.rb`
Intervention extraction	`app/tasks/publications_llm_classification/intervention_extraction.rb`
Trial arm materializer	`app/tasks/publications_llm_classification/trial_arm_materializer.rb`
Subgroup extraction	`app/tasks/publications_llm_classification/subgroup_extraction.rb`
Main LLM extraction	`app/tasks/publications_llm_classification/task.rb`
Extraction schema	`app/tasks/publications_llm_classification/details.rb`
Dose evidence extraction	`app/tasks/publications_llm_classification/dose_evidence_extraction.rb`
Post-process materialization	`app/tasks/publications_llm_classification/post_process.rb`
Efficacy view (latest)	`db/views/vw_publication_efficacy_data_v23.sql`
Clinical evidence query	`app/queries/tpp/clinical_evidence_query.rb`
Pipeline workflow	`app/workflows/publications_workflow.rb`
Backfill task	`lib/tasks/one_off/backfill_trial_arms.thor`
Issues tracker	`docs/publication_issues_tracker.md`