Publications Workflow
Pipeline that transforms raw publication records into structured clinical evidence: 18 sequential steps, 13 LLM calls, processing ~76k result publications from a pool of ~216k.
Data Flow Overview
```
Raw Publication (title, abstract, source metadata)
        │
        ▼
PHASE 1: Identity & Linking         (Which trial does this paper belong to?)
        │
        ▼
PHASE 2: Interventions              (What drugs/treatments were studied?)
        │
        ▼
PHASE 3: Subgroups & Classification (What populations, endpoints, and outcomes were reported?)
        │
        ▼
PHASE 4: Post-processing & Intent   (Materialize rows, classify publication purpose)
        │
        ▼
PHASE 5: Adverse Events             (Standardize safety data)
        │
        ▼
PHASE 6: Endpoint Enrichment        (Map outcomes to canonical endpoint catalog)
        │
        ▼
Structured Evidence (publication_interventions, trial_subgroups, trial_outcome_measures,
                     trial_disease_details, adverse_events, trial_endpoints)
```
Step-by-Step Reference
Phase 1: Identity & Linking
Step 1 — Extract Trial Identifiers (`extract_trial_identifier`)
- Purpose: Determine if a publication reports clinical trial results, and extract registry IDs (NCT, ISRCTN, EudraCT, etc.) and endpoint names from the abstract.
- Model: OpenAI (configurable)
- Reads: Publication abstract
- Writes: `llm_data['registry_records']`, `llm_data['endpoints']`, `llm_data['result_study']`
- Service: `PublicationsLlmClassification::IdentifierExtraction`
- Notes: Gates the entire pipeline — only publications marked as `result = true` proceed.
Step 2 — Web Search for Trial IDs (`web_search_nctids`)
- Purpose: Find registry IDs via web search for result publications that couldn’t be matched from the abstract alone.
- Model: LLM-assisted web search
- Reads: Publications marked as results but lacking registry records
- Writes: `llm_data['web_search_registry_records']`
- Service: `PublicationsLlmClassification::WebSearchIdentifiers`
- Notes: Currently disabled (`limit=1`). 17k legacy records exist. Reserved for a future semantic-search replacement. `can_skip: true`.
Step 3 — Relink to Clinical Trials (`relink_to_clinical_trials`)
- Purpose: Match extracted registry IDs to clinical trial records in the database, creating publication-trial links.
- Model: None (algorithmic)
- Reads: Registry records from multiple sources (`api_registry_records`, `registry_records`, `web_search_registry_records`, PubMed references)
- Writes: `publication_clinical_trials` join table (77k links)
- Service: Direct implementation in the Thor task
- Notes: Deterministic matching. Checks all registry ID sources and deduplicates.
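The collect-normalize-deduplicate logic described above can be sketched in plain Ruby. This is an illustrative sketch, not the actual Thor task: it assumes registry IDs are plain strings, and the normalization rules shown are hypothetical.

```ruby
# Sketch: gather registry IDs from every llm_data source plus PubMed
# references, normalize, deduplicate, then look up trials by ID.
REGISTRY_ID_SOURCES = %w[
  api_registry_records
  registry_records
  web_search_registry_records
].freeze

def candidate_registry_ids(llm_data, pubmed_references = [])
  ids = REGISTRY_ID_SOURCES.flat_map { |key| Array(llm_data[key]) }
  ids += pubmed_references
  # Hypothetical normalization: trim whitespace, upcase, drop blanks,
  # then deduplicate across all sources.
  ids.map { |id| id.to_s.strip.upcase }.reject(&:empty?).uniq
end

def link_publication_to_trials(llm_data, trials_by_registry_id, pubmed_references = [])
  candidate_registry_ids(llm_data, pubmed_references)
    .filter_map { |id| trials_by_registry_id[id] }
    .uniq
end
```

Because the IDs are normalized before lookup, the same trial referenced as `NCT01234567` and `nct01234567` in different sources produces a single link.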
Step 4 — Therapeutic Area Filter (`therapeutic_area_filter`)
- Purpose: For publications that couldn’t be linked to any trial, determine if they’re relevant to hematology/oncology.
- Model: gpt-5-nano
- Reads: Unlinked publications with endpoints marked as results
- Writes: `llm_data['hematology_oncology_relevant']`, `llm_data['therapeutic_areas']`
- Service: `PublicationsLlmClassification::TherapeuticAreaFilter`
- Notes: Only processes unlinked publications. 17k processed. Acts as a relevance gate for unlinked pubs entering the rest of the pipeline.
Phase 2: Interventions
Step 5 — Extract Interventions (`extract_interventions`)
- Purpose: Extract treatment arms and interventions (drug names, doses, schedules) from the abstract.
- Model: gpt-5-mini
- Reads: Result publications with clinical relevance (linked to trial OR hematology/oncology relevant)
- Writes: `llm_data['intervention_arms']`
- Service: `PublicationsLlmClassification::InterventionExtraction`
- Notes: 19k publications have `intervention_arms`. `can_skip: true`.
Step 6 — Link Publication Drugs (`link_publication_drugs`)
- Purpose: Match extracted intervention names to canonical Drug and NCIt Concept records in the database.
- Model: Algorithmic with LLM fallback (`SimpleCandidateMatchingService`)
- Reads: Publication interventions without drug or NCIt mappings
- Writes: `publication_interventions.drug_id`, `publication_interventions.ncit_concept_id`
- Service: `PublicationsLlmClassification::DrugLinker`
- Notes: 29k of 45k interventions have drug links. `can_skip: true`. Higher resource requirements (2 vCPUs, 4 GB memory).
Step 7 — Tag Investigational Interventions (`tag_investigational_interventions`)
- Purpose: Classify each intervention’s role in the trial (investigational, comparator, combination, supportive).
- Model: gpt-5-mini
- Reads: Publications with `intervention_arms` where interventions lack role assignments
- Writes: `publication_interventions.intervention_role`
- Service: `PublicationsLlmClassification::InvestigationalTagger`
- Notes: 45k/45k interventions have roles assigned. `can_skip: true`.
Phase 3: Subgroups & Classification
Step 8 — Extract Subgroups (`extract_subgroups`)
- Purpose: Identify patient subgroups in the abstract and map which endpoints apply to each subgroup.
- Model: o4-mini (reasoning)
- Reads: Publications with endpoints, linked or relevant to clinical trials
- Writes: `llm_data['subgroup_endpoints']`, `llm_data['baseline_subgroup']`, `llm_data['overall_population']`
- Service: `PublicationsLlmClassification::SubgroupExtraction`
- Notes: 66k processed. Uses arrow notation for nested subgroups (e.g., “NSCLC → Squamous”). Feeds directly into step 10.
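The arrow notation implies a parent/child chain per subgroup label. A minimal parsing sketch follows; the real service's handling may differ.

```ruby
# Sketch: unpack an arrow-notation subgroup label into its ancestry.
# "NSCLC → Squamous" means the "Squamous" subgroup nested under "NSCLC".
def subgroup_chain(label)
  label.split("→").map(&:strip)
end

def subgroup_name(label)
  subgroup_chain(label).last
end

def parent_subgroup(label)
  chain = subgroup_chain(label)
  chain.length > 1 ? chain[-2] : nil
end
```

A label without an arrow (e.g. "Overall") is its own top-level subgroup with no parent.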
Step 9 — Extract Dose Evidence (`extract_dose_evidence`)
- Purpose: Extract structured dosing information per intervention from the abstract.
- Model: Configurable (gpt-5-mini default)
- Reads: Publications with interventions lacking dose evidence
- Writes: `publication_interventions.dose_evidence` (JSONB: single_dose, dose_min, dose_max, rp2d, units, frequency, context_type, evidence_quote, confidence)
- Service: `PublicationsLlmClassification::DoseEvidenceExtraction`
- Notes: New step, not yet populated. `can_skip: true`. Operates per intervention, not per publication.
Step 10 — Classify Publications (`classify_publications`)
- Purpose: The main extraction step. Extracts trial design, outcomes per subgroup, adverse events, trial conclusions, and partial result indicators.
- Model: o4-mini (reasoning, `reasoning_effort='medium'`)
- Reads: Publications with subgroup_endpoints from step 8
- Writes: `llm_data['subgroup_outcome_measures']`, `llm_data['trial_conclusion']`, `llm_data['patient_population']`, `llm_data['study_design']`, `llm_data['adverse_events']`, `llm_data['is_partial_result']`, `llm_data['partial_result_tags']`, `llm_data['total_number_of_participants']`
- Service: `PublicationsLlmClassification::Task`
- Notes: 66k processed. Heaviest prompt (157 lines). Validates that ALL subgroups from step 8 are present in the output — rejects and retries if any are dropped. Also extracts dose context per subgroup (dose cohorts) and data cutoff dates.
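The completeness validation amounts to a set difference between step 8's subgroups and step 10's output. A sketch under assumed key names (taken from the fields listed above); the rejection/retry plumbing in the service is omitted.

```ruby
# Sketch: a step-10 response is rejected when any subgroup identified
# in step 8 is absent from the extracted outcome measures.
def missing_subgroups(subgroup_endpoints, subgroup_outcome_measures)
  expected = subgroup_endpoints.map { |s| s["subgroup"] }
  reported = subgroup_outcome_measures.map { |m| m["subgroup"] }
  expected - reported
end

def extraction_complete?(subgroup_endpoints, subgroup_outcome_measures)
  missing_subgroups(subgroup_endpoints, subgroup_outcome_measures).empty?
end
```

Returning the missing labels (rather than a bare boolean) lets the retry prompt tell the model exactly which subgroups it dropped.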
Step 11 — Extract Diseases (`extract_diseases`)
- Purpose: Structure the patient population into canonical disease entities with extents, stages, biomarkers, and treatment settings.
- Model: o4-mini (reasoning)
- Reads: Publications with outcome measures but no disease structure
- Writes: `llm_data['patient_population_diseases']`
- Service: `PublicationsLlmClassification::DiseaseExtraction`
- Notes: 67k processed. Simplest prompt (17 lines). Uses the shared schema from `ParticipationCriteriaExtraction::Details` (reused for clinical trial participation criteria). Post-processes with disease matching against the canonical database.
Phase 4: Post-processing & Intent
Step 12 — Post-process Publications (`post_process_publications`)
- Purpose: Materialize LLM-extracted data from `llm_data` JSONB into normalized database tables.
- Model: None (algorithmic)
- Reads: All `llm_data` fields from steps 1-11
- Writes: `trial_subgroups`, `trial_outcome_measures`, `trial_endpoints`, `trial_disease_details`, `adverse_events` rows. Sets `llm_data_processed = true`.
- Service: `PublicationsLlmClassification::PostProcess`
- Notes: 66k processed. Critical materialization boundary — everything before this writes to JSONB; everything after reads from normalized tables.
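A minimal sketch of crossing the materialization boundary, in plain Ruby. The real PostProcess service persists through the application's models; the row fields shown here are illustrative.

```ruby
# Sketch: transform one llm_data JSONB payload into row hashes destined
# for the normalized trial_subgroups table.
def build_trial_subgroup_rows(publication_id, llm_data)
  Array(llm_data["subgroup_outcome_measures"]).map do |entry|
    {
      publication_id: publication_id,
      name: entry["subgroup"],
      outcome_measure_count: Array(entry["outcome_measures"]).size
    }
  end
end
```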
Step 13 — Classify Intent (`classify_intent`)
- Purpose: Classify the publication’s scientific intent and relationship to the trial.
- Model: Configurable (gpt-5-mini default)
- Reads: Post-processed publications with title and abstract
- Writes: `llm_data['intent_classification']` (primary_intent, secondary_intents[], trial_relationship)
- Service: `PublicationsLlmClassification::IntentClassification`
- Notes: 66k processed. Enum-heavy: 10 primary intents, 60 secondary intents, 15 trial relationships.
Step 14 — Extract Treatment Lines (`extract_treatment_lines`)
- Purpose: Determine treatment line (1L, 2L+, etc.) and prior therapy exposure per subgroup.
- Model: gpt-5-mini
- Reads: Publications with intervention arms
- Writes: `trial_subgroups.treatment_lines` (JSONB), `trial_subgroups.min_prior_lines`, `trial_subgroups.max_prior_lines`, `trial_subgroups.median_prior_lines`
- Service: `PublicationsLlmClassification::TreatmentContextExtraction`
- Notes: New step (writes to the `treatment_context` key). Most complex prompt (288 lines). Operates per subgroup with 22 therapy-class enums. The `treatment_lines` key has 66k records from an older implementation.
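Treatment-line labels translate mechanically into prior-line bounds. A sketch under assumed label conventions ("1L", "2L+", and so on); this is not the actual prompt or service logic.

```ruby
# Sketch: derive min/max prior lines from a treatment-line label.
# "1L" = first line (0 prior lines); "2L+" = second line or later
# (at least 1 prior line, no upper bound).
def prior_line_bounds(label)
  if (m = label.match(/\A(\d+)L(\+)?\z/))
    prior = m[1].to_i - 1
    { min_prior_lines: prior, max_prior_lines: m[2] ? nil : prior }
  else
    # Unrecognized labels (e.g. "maintenance") leave the bounds unset.
    { min_prior_lines: nil, max_prior_lines: nil }
  end
end
```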
Phase 5: Adverse Events
Step 15 — Standardize Adverse Events (`standardize_adverse_events`)
- Purpose: Match adverse event names to standardized Endpoint records using deterministic string matching.
- Model: None (algorithmic)
- Reads: Adverse event records from publications without standardized names
- Writes: `adverse_events.standardized_name`, `adverse_events.endpoint_id`
- Service: `PublicationsLlmClassification::AdverseEventStandardization`
- Notes: 150k/150k matched. High-coverage deterministic step.
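Deterministic matching of this kind usually reduces to normalize-then-lookup. A sketch; the actual normalization rules in the service are assumptions here.

```ruby
# Sketch: match a reported adverse event name to a standardized
# Endpoint id via case- and whitespace-insensitive comparison.
def normalize_ae_name(name)
  name.to_s.downcase.strip.gsub(/\s+/, " ")
end

def match_adverse_event(ae_name, endpoint_ids_by_name)
  endpoint_ids_by_name[normalize_ae_name(ae_name)]
end
```

Names that fail this lookup fall through to the LLM classification in step 16.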
Step 16 — Classify Adverse Events (LLM) (`classify_adverse_events`)
- Purpose: LLM fallback for adverse events that couldn’t be matched deterministically.
- Model: gpt-5-nano
- Reads: Unmatched adverse events from publications
- Writes: `adverse_events.standardized_name`, `adverse_events.endpoint_id`, `adverse_events.classification_source`
- Service: `PublicationsLlmClassification::AdverseEventLlmClassification`
- Notes: 90k of 150k have `classification_source` set (indicating the LLM was needed).
Phase 6: Endpoint Enrichment
Step 17 — Classify Endpoint Domains (`llm_classify_publication_endpoints_domains`)
- Purpose: Classify each endpoint into clinical domain groups (efficacy, safety, pharmacokinetics, etc.) and detect milestone endpoints.
- Model: gpt-5-nano
- Reads: Publications with structured endpoints
- Writes: `llm_data['endpoint_domain_classifier']` (domain_groups[] with confidence, is_milestone flag)
- Service: `PublicationsLlmClassification::DomainClassifier`
- Notes: 63k processed. Dynamic enum — domain groups are loaded from the database at runtime. Strict three-part test for milestone detection. `can_skip: true`.
Step 18 — Match Publication Endpoints (`llm_match_publication_endpoints`)
- Purpose: Match publication-reported outcome measures to canonical catalog endpoints with confidence scores.
- Model: gpt-5-mini
- Reads: Publications with domain-classified endpoints (from step 17)
- Writes: `llm_data['endpoint_matcher']` (outcome_measure → catalog endpoint matches with confidence)
- Service: `PublicationsLlmClassification::EndpointMatcher`
- Notes: 52k processed. Depends on step 17’s domain classifications to filter candidate endpoints. Conservative matching — returns empty if ambiguous. `can_skip: true`.
Data Scale Summary
| Table | Row Count | Source |
|---|---|---|
| publications | 216k total, 76k results | PubMed, Europe PMC |
| publication_clinical_trials | 77k | Step 3 |
| publication_interventions | 45k | Steps 5-7 |
| trial_endpoints | 238k | Step 12 |
| trial_subgroups | 188k | Step 12 |
| trial_outcome_measures | 403k | Step 12 |
| trial_disease_details | 83k | Step 12 |
| adverse_events | 150k | Step 12 |
LLM Usage by Model
| Model | Steps | Calls per pub |
|---|---|---|
| o4-mini (reasoning) | 8, 10, 11 | 3 |
| gpt-5-mini | 5, 7, 9, 13, 14, 18 | 6 |
| gpt-5-nano | 4, 16, 17 | 3 |
| OpenAI (configurable) | 1 | 1 |
| LLM web search | 2 (disabled) | 0 |
| Algorithmic (no LLM) | 3, 6*, 12, 15 | 0 |
*Step 6 uses LLM as fallback only.
Proposed Changes
Proposal 1: Unified Triage Step (Steps 1+4+13 → 1 step)
Current state: Three separate LLM calls at different points in the pipeline:
- Step 1 (`extract_trial_identifier`): Reads abstract → determines result status, extracts registry IDs and endpoints.
- Step 4 (`therapeutic_area_filter`): Reads abstract again → binary hematology/oncology relevance. Only runs on unlinked pubs, after step 3.
- Step 13 (`classify_intent`): Reads abstract again → classifies publication intent (10 primary, 60 secondary, 15 trial relationships). Runs after post-processing.
All three read the same abstract and make high-level determinations. Three separate LLM round-trips for what is fundamentally one question: “What kind of paper is this?”
Proposed: Single triage call that extracts everything needed to route the publication through the pipeline:
- Result status (is this a clinical trial result paper?)
- Registry IDs (NCT, ISRCTN, EudraCT, etc.)
- Endpoint names
- Therapeutic area relevance (hematology/oncology: yes/no)
- Intent classification (primary intent, secondary intents, trial relationship)
Schema changes:
- Output combines the current `IdentifierExtraction`, `TherapeuticAreaFilter`, and `IntentClassification` schemas
- All flat fields and enums — no deep nesting
- The `trial_linked` hint for intent classification is dropped (the model can infer it from the registry IDs it just extracted)
What changes downstream:
- Step 4 is removed as a workflow step. TA relevance is extracted for all result pubs in step 1; downstream logic continues to use it only for unlinked pubs.
- Step 13 is removed as a workflow step. Intent is extracted early; downstream consumers read from `llm_data['intent_classification']` as before.
- Step 3 (`relink_to_clinical_trials`) and all subsequent steps are unaffected.
Trade-offs:
- TA relevance is now extracted for all ~76k result pubs instead of ~17k unlinked ones. Marginal cost is near zero since it’s a single field in an already-happening call.
- Intent classification loses the `trial_linked` boolean input. This was a minor contextual hint — the model already has the registry IDs it extracted, which is a stronger signal.
- The prompt grows by ~30 lines (TA filter rules) plus the intent classification enums. Still well within single-call bounds.
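For concreteness, a merged triage payload might look like the following. This is a hypothetical shape combining the three current schemas; the enum values shown are illustrative, not the final contract.

```ruby
# Hypothetical combined triage output: the identifier, TA-relevance,
# and intent fields returned by one call instead of three.
def example_triage_payload
  {
    "result_study" => true,
    "registry_records" => ["NCT01234567"],
    "endpoints" => ["overall survival", "objective response rate"],
    "hematology_oncology_relevant" => true,
    "intent_classification" => {
      "primary_intent" => "trial_results",          # illustrative enum value
      "secondary_intents" => ["subgroup_analysis"], # illustrative enum value
      "trial_relationship" => "primary_report"      # illustrative enum value
    }
  }
end
```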
Proposed workflow (16 steps, 11 LLM calls):
1. triage_publication ← NEW merged step (steps 1+4+13)
2. web_search_nctids
3. relink_to_clinical_trials
4. extract_interventions ← was step 5
5. link_publication_drugs ← was step 6
6. tag_investigational ← was step 7
7. extract_subgroups ← was step 8
8. extract_dose_evidence ← was step 9
9. classify_publications ← was step 10
10. extract_diseases ← was step 11
11. post_process_publications ← was step 12
12. extract_treatment_lines ← was step 14
13. standardize_adverse_events ← was step 15
14. classify_adverse_events ← was step 16
15. classify_and_match_endpoints ← merged steps 17+18 (proposal 2)
Evaluated & Rejected: Merge Endpoint Domain Classification + Endpoint Matching (Steps 17+18)
Current state: Two sequential LLM calls — step 17 classifies endpoints into domain groups (nano), then step 18 uses those classifications to filter catalog candidates and match (mini).
Why keep separate: Domain classification is not just intermediate context — it serves as a meaningful pre-filter and provides semantic guidance to the matcher:
- Candidate filtering: 144 total catalog endpoints are reduced to ~25-70 domain-relevant candidates per publication. Without this, the matching prompt would include all 144 endpoints every time.
- Semantic steering: The matcher prompt includes domain labels (e.g. `[response_remission]`) next to each publication endpoint, guiding the model toward the right catalog section.
- Cost efficiency: The nano call is cheap and meaningfully reduces prompt size and cognitive load for the more expensive mini matching call.
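The candidate-filtering effect can be sketched as a simple select over the catalog. Illustrative only; the actual filtering lives in the matcher service.

```ruby
# Sketch: reduce the full endpoint catalog to candidates whose domain
# group appears among the publication's step-17 classifications.
def candidate_endpoints(catalog, publication_domain_groups)
  catalog.select { |endpoint| publication_domain_groups.include?(endpoint[:domain_group]) }
end
```

With 144 catalog entries and a publication classified into one or two domains, this is what shrinks the matching prompt to the ~25-70 candidates noted above.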
Future Consideration: Merge SubgroupExtraction + classify_publications (Steps 8+10)
Current state: Two o4-mini calls. Step 8 identifies subgroups and maps endpoints to them. Step 10 consumes those subgroups and extracts outcome data, with a validation check ensuring no subgroups are dropped.
Rationale for keeping separate (for now):
- Debuggability: `llm_data['subgroup_endpoints']` provides an inspectable intermediate state showing what the model identified vs. what it extracted results for.
- The validation contract between steps catches extraction errors.
- Step 8’s focused prompt (76 lines) gives the model one clear job.
When to reconsider: If subgroup identification proves stable and the validation check rarely triggers rejections, merging into a single o4-mini call would eliminate ~66k reasoning-model API calls.