
Publications Workflow

A pipeline that transforms raw publication records into structured clinical evidence: 18 sequential steps, 13 of them LLM calls, processing ~76k result publications from a pool of ~216k.

Raw Publication (title, abstract, source metadata)
  ↓
PHASE 1: Identity & Linking (which trial does this paper belong to?)
  ↓
PHASE 2: Interventions (what drugs/treatments were studied?)
  ↓
PHASE 3: Subgroups & Classification (what populations, endpoints, and outcomes were reported?)
  ↓
PHASE 4: Post-processing & Intent (materialize rows, classify publication purpose)
  ↓
PHASE 5: Adverse Events (standardize safety data)
  ↓
PHASE 6: Endpoint Enrichment (map outcomes to canonical endpoint catalog)
  ↓
Structured Evidence (publication_interventions, trial_subgroups, trial_outcome_measures,
trial_disease_details, adverse_events, trial_endpoints)

Step 1 — Extract Trial Identifiers extract_trial_identifier

  • Purpose: Determine if a publication reports clinical trial results, and extract registry IDs (NCT, ISRCTN, EudraCT, etc.) and endpoint names from the abstract.
  • Model: OpenAI (configurable)
  • Reads: Publication abstract
  • Writes: llm_data['registry_records'], llm_data['endpoints'], llm_data['result_study']
  • Service: PublicationsLlmClassification::IdentifierExtraction
  • Notes: Gates the entire pipeline — only publications marked as result = true proceed.
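The gate can be sketched as a one-line predicate (helper name is hypothetical; `llm_data` is the publication's JSONB blob):

```ruby
# Hypothetical sketch of the step-1 gate: only publications that the
# identifier-extraction call marked as results proceed down the pipeline.
def proceeds_to_pipeline?(llm_data)
  llm_data["result_study"] == true
end
```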

Step 2 — Web Search for Trial IDs web_search_nctids

  • Purpose: Find registry IDs via web search for result publications that couldn’t be matched from the abstract alone.
  • Model: LLM-assisted web search
  • Reads: Publications marked as results but lacking registry records
  • Writes: llm_data['web_search_registry_records']
  • Service: PublicationsLlmClassification::WebSearchIdentifiers
  • Notes: Currently disabled (limit=1). 17k legacy records exist. Reserved for future semantic search replacement. can_skip: true.

Step 3 — Relink to Clinical Trials relink_to_clinical_trials

  • Purpose: Match extracted registry IDs to clinical trial records in the database, creating publication-trial links.
  • Model: None (algorithmic)
  • Reads: Registry records from multiple sources (api_registry_records, registry_records, web_search_registry_records, PubMed references)
  • Writes: publication_clinical_trials join table (77k links)
  • Service: Direct implementation in Thor task
  • Notes: Deterministic matching. Checks all registry ID sources and deduplicates.
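A minimal sketch of that gather-and-deduplicate pass, assuming each source holds a flat array of raw IDs in the publication's llm_data (helper names are illustrative, not the Thor task's actual API):

```ruby
# Sketch: collect registry IDs from every source, normalize so that
# "nct01234567 " and "NCT01234567" dedupe together, and drop blanks.
REGISTRY_SOURCES = %w[
  api_registry_records
  registry_records
  web_search_registry_records
  pubmed_references
].freeze

def normalize_registry_id(raw)
  raw.to_s.strip.upcase
end

def candidate_registry_ids(llm_data)
  REGISTRY_SOURCES
    .flat_map { |source| Array(llm_data[source]) }
    .map { |id| normalize_registry_id(id) }
    .reject(&:empty?)
    .uniq
end
```

The deduplicated IDs would then be joined against trial registry records to populate publication_clinical_trials.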

Step 4 — Therapeutic Area Filter therapeutic_area_filter

  • Purpose: For publications that couldn’t be linked to any trial, determine if they’re relevant to hematology/oncology.
  • Model: gpt-5-nano
  • Reads: Unlinked publications with endpoints marked as results
  • Writes: llm_data['hematology_oncology_relevant'], llm_data['therapeutic_areas']
  • Service: PublicationsLlmClassification::TherapeuticAreaFilter
  • Notes: Only processes unlinked publications. 17k processed. Acts as a relevance gate for unlinked pubs entering the rest of the pipeline.

Step 5 — Extract Interventions extract_interventions

  • Purpose: Extract treatment arms and interventions (drug names, doses, schedules) from the abstract.
  • Model: gpt-5-mini
  • Reads: Result publications with clinical relevance (linked to trial OR hematology/oncology relevant)
  • Writes: llm_data['intervention_arms']
  • Service: PublicationsLlmClassification::InterventionExtraction
  • Notes: 19k publications have intervention_arms. can_skip: true.

Step 6 — Link Publication Drugs link_publication_drugs

  • Purpose: Match extracted intervention names to canonical Drug and NCIt Concept records in the database.
  • Model: Algorithmic with LLM fallback (SimpleCandidateMatchingService)
  • Reads: Publication interventions without drug or NCIt mappings
  • Writes: publication_interventions.drug_id, publication_interventions.ncit_concept_id
  • Service: PublicationsLlmClassification::DrugLinker
  • Notes: 29k of 45k interventions have drug links. can_skip: true. Higher resource requirements (2 vCPUs, 4GB memory).

Step 7 — Tag Investigational Interventions tag_investigational_interventions

  • Purpose: Classify each intervention’s role in the trial (investigational, comparator, combination, supportive).
  • Model: gpt-5-mini
  • Reads: Publications with intervention_arms where interventions lack role assignments
  • Writes: publication_interventions.intervention_role
  • Service: PublicationsLlmClassification::InvestigationalTagger
  • Notes: 45k/45k interventions have roles assigned. can_skip: true.

Step 8 — Extract Subgroups extract_subgroups

  • Purpose: Identify patient subgroups in the abstract and map which endpoints apply to each subgroup.
  • Model: o4-mini (reasoning)
  • Reads: Publications with endpoints, linked or relevant to clinical trials
  • Writes: llm_data['subgroup_endpoints'], llm_data['baseline_subgroup'], llm_data['overall_population']
  • Service: PublicationsLlmClassification::SubgroupExtraction
  • Notes: 66k processed. Uses arrow notation for nested subgroups (e.g., “NSCLC → Squamous”). Feeds directly into step 10.
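The arrow notation splits into a nesting path with a trivial parse (hypothetical helper, shown only to make the convention concrete):

```ruby
# Sketch: turn "NSCLC → Squamous" into its nesting path ["NSCLC", "Squamous"].
def subgroup_path(label)
  label.split("→").map(&:strip)
end
```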

Step 9 — Extract Dose Evidence extract_dose_evidence

  • Purpose: Extract structured dosing information per intervention from the abstract.
  • Model: Configurable (gpt-5-mini default)
  • Reads: Publications with interventions lacking dose evidence
  • Writes: publication_interventions.dose_evidence (JSONB: single_dose, dose_min, dose_max, rp2d, units, frequency, context_type, evidence_quote, confidence)
  • Service: PublicationsLlmClassification::DoseEvidenceExtraction
  • Notes: New step, not yet populated. can_skip: true. Operates per-intervention, not per-publication.
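An illustrative payload matching the JSONB keys listed above (all values are invented for the example):

```ruby
# Example dose_evidence record; key names mirror the documented schema,
# values are made up purely for illustration.
DOSE_EVIDENCE_EXAMPLE = {
  "single_dose"    => nil,
  "dose_min"       => 200,
  "dose_max"       => 400,
  "rp2d"           => 400,
  "units"          => "mg",
  "frequency"      => "QD",
  "context_type"   => "dose_escalation",
  "evidence_quote" => "patients received 200-400 mg once daily",
  "confidence"     => 0.9
}.freeze
```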

Step 10 — Classify Publications classify_publications

  • Purpose: The main extraction step. Extracts trial design, outcomes per subgroup, adverse events, trial conclusions, and partial result indicators.
  • Model: o4-mini (reasoning, reasoning_effort='medium')
  • Reads: Publications with subgroup_endpoints from step 8
  • Writes: llm_data['subgroup_outcome_measures'], llm_data['trial_conclusion'], llm_data['patient_population'], llm_data['study_design'], llm_data['adverse_events'], llm_data['is_partial_result'], llm_data['partial_result_tags'], llm_data['total_number_of_participants']
  • Service: PublicationsLlmClassification::Task
  • Notes: 66k processed. Heaviest prompt (157 lines). Validates that ALL subgroups from step 8 are present in output — rejects and retries if any are dropped. Also extracts dose context per subgroup (dose cohorts) and data cutoff dates.
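The validation contract reduces to a set difference over subgroup names. A sketch, assuming the real shapes of subgroup_endpoints and subgroup_outcome_measures may differ (helper names are hypothetical):

```ruby
# Sketch of the step-10 check: every subgroup identified in step 8 must
# appear in the outcome output, otherwise the extraction is rejected and retried.
def missing_subgroups(expected_subgroups, outcome_measures)
  reported = outcome_measures.map { |om| om["subgroup"] }.uniq
  expected_subgroups - reported
end

def valid_extraction?(expected_subgroups, outcome_measures)
  missing_subgroups(expected_subgroups, outcome_measures).empty?
end
```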

Step 11 — Extract Diseases extract_diseases

  • Purpose: Structure the patient population into canonical disease entities with extents, stages, biomarkers, and treatment settings.
  • Model: o4-mini (reasoning)
  • Reads: Publications with outcome measures but no disease structure
  • Writes: llm_data['patient_population_diseases']
  • Service: PublicationsLlmClassification::DiseaseExtraction
  • Notes: 67k processed. Simplest prompt (17 lines). Uses shared schema from ParticipationCriteriaExtraction::Details (reused for clinical trial participation criteria). Post-processes with disease matching against canonical database.

Step 12 — Post-process Publications post_process_publications

  • Purpose: Materialize LLM-extracted data from llm_data JSONB into normalized database tables.
  • Model: None (algorithmic)
  • Reads: All llm_data fields from steps 1-11
  • Writes: trial_subgroups, trial_outcome_measures, trial_endpoints, trial_disease_details, adverse_events rows. Sets llm_data_processed = true.
  • Service: PublicationsLlmClassification::PostProcess
  • Notes: 66k processed. Critical materialization boundary — everything before this writes to JSONB, everything after reads from normalized tables.
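A sketch of that materialization boundary, assuming subgroup_outcome_measures nests outcome measures under named subgroups (field names are illustrative, not the actual PostProcess schema):

```ruby
# Sketch: flatten the llm_data JSONB into one normalized row per
# subgroup/outcome pair; downstream steps read only these rows.
def materialize_outcome_rows(publication_id, llm_data)
  Array(llm_data["subgroup_outcome_measures"]).flat_map do |subgroup|
    Array(subgroup["outcome_measures"]).map do |om|
      {
        publication_id: publication_id,
        subgroup_name: subgroup["name"],
        endpoint: om["endpoint"],
        value: om["value"]
      }
    end
  end
end
```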

Step 13 — Classify Intent classify_intent

  • Purpose: Classify the publication’s scientific intent and relationship to the trial.
  • Model: Configurable (gpt-5-mini default)
  • Reads: Post-processed publications with title and abstract
  • Writes: llm_data['intent_classification'] (primary_intent, secondary_intents[], trial_relationship)
  • Service: PublicationsLlmClassification::IntentClassification
  • Notes: 66k processed. Enum-heavy: 10 primary intents, 60 secondary intents, 15 trial relationships.

Step 14 — Extract Treatment Lines extract_treatment_lines

  • Purpose: Determine treatment line (1L, 2L+, etc.) and prior therapy exposure per subgroup.
  • Model: gpt-5-mini
  • Reads: Publications with intervention arms
  • Writes: trial_subgroups.treatment_lines (JSONB), trial_subgroups.min_prior_lines, trial_subgroups.max_prior_lines, trial_subgroups.median_prior_lines
  • Service: PublicationsLlmClassification::TreatmentContextExtraction
  • Notes: New step (writes under the treatment_context key). Most complex prompt in the pipeline (288 lines). Operates per-subgroup with 22 therapy-class enums. The legacy treatment_lines key still holds 66k records from an older implementation.
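The scalar prior-line columns could be derived from raw per-subgroup line counts roughly like this (a simplification: for even-sized groups this picks the upper median, and the real service may compute these differently):

```ruby
# Sketch: collapse a subgroup's prior-therapy line counts into the
# min/max/median columns written on trial_subgroups.
def prior_line_summary(prior_lines)
  sorted = prior_lines.sort
  {
    min_prior_lines: sorted.first,
    max_prior_lines: sorted.last,
    median_prior_lines: sorted[sorted.length / 2] # upper median when even
  }
end
```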

Step 15 — Standardize Adverse Events standardize_adverse_events

  • Purpose: Match adverse event names to standardized Endpoint records using deterministic string matching.
  • Model: None (algorithmic)
  • Reads: Adverse event records from publications without standardized names
  • Writes: adverse_events.standardized_name, adverse_events.endpoint_id
  • Service: PublicationsLlmClassification::AdverseEventStandardization
  • Notes: 150k/150k matched. High coverage deterministic step.
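The deterministic match might look like this (the normalization here, lowercase plus collapsed whitespace with an exact lookup, is an assumption rather than the service's actual rules):

```ruby
# Sketch: normalize an adverse event name, then exact-match it against a
# prebuilt index of standardized Endpoint names.
def normalize_ae_name(name)
  name.to_s.downcase.gsub(/\s+/, " ").strip
end

def match_adverse_event(ae_name, endpoint_index)
  endpoint_index[normalize_ae_name(ae_name)]
end
```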

Step 16 — Classify Adverse Events (LLM) classify_adverse_events

  • Purpose: LLM fallback for adverse events that couldn’t be matched deterministically.
  • Model: gpt-5-nano
  • Reads: Unmatched adverse events from publications
  • Writes: adverse_events.standardized_name, adverse_events.endpoint_id, adverse_events.classification_source
  • Service: PublicationsLlmClassification::AdverseEventLlmClassification
  • Notes: 90k of 150k have classification_source set (indicating LLM was needed).

Step 17 — Classify Endpoint Domains llm_classify_publication_endpoints_domains

  • Purpose: Classify each endpoint into clinical domain groups (efficacy, safety, pharmacokinetics, etc.) and detect milestone endpoints.
  • Model: gpt-5-nano
  • Reads: Publications with structured endpoints
  • Writes: llm_data['endpoint_domain_classifier'] (domain_groups[] with confidence, is_milestone flag)
  • Service: PublicationsLlmClassification::DomainClassifier
  • Notes: 63k processed. Dynamic enum — domain groups loaded from database at runtime. Strict 3-part test for milestone detection. can_skip: true.

Step 18 — Match Publication Endpoints llm_match_publication_endpoints

  • Purpose: Match publication-reported outcome measures to canonical catalog endpoints with confidence scores.
  • Model: gpt-5-mini
  • Reads: Publications with domain-classified endpoints (from step 17)
  • Writes: llm_data['endpoint_matcher'] (outcome_measure → catalog endpoint matches with confidence)
  • Service: PublicationsLlmClassification::EndpointMatcher
  • Notes: 52k processed. Depends on step 17’s domain classifications to filter candidate endpoints. Conservative matching — returns empty if ambiguous. can_skip: true.

Table                        Row count                Source
publications                 216k total, 76k results  PubMed, Europe PMC
publication_clinical_trials  77k                      Step 3
publication_interventions    45k                      Steps 5-7
trial_endpoints              238k                     Step 12
trial_subgroups              188k                     Step 12
trial_outcome_measures       403k                     Step 12
trial_disease_details        83k                      Step 12
adverse_events               150k                     Step 12

Model                  Steps                LLM calls per pub
o4-mini (reasoning)    8, 10, 11            3
gpt-5-mini             5, 7, 9, 13, 14, 18  6
gpt-5-nano             4, 16, 17            3
OpenAI (configurable)  1                    1
LLM web search         2 (disabled)         0
Algorithmic (no LLM)   3, 6*, 12, 15        0

*Step 6 uses LLM as fallback only.

Proposal 1: Unified Triage Step (Steps 1+4+13 → 1 step)


Current state: Three separate LLM calls at different points in the pipeline:

  • Step 1 (extract_trial_identifier): Reads abstract → determines result status, extracts registry IDs and endpoints.
  • Step 4 (therapeutic_area_filter): Reads abstract again → binary hematology/oncology relevance. Only runs on unlinked pubs, after step 3.
  • Step 13 (classify_intent): Reads abstract again → classifies publication intent (10 primary, 60 secondary, 15 trial relationships). Runs after post-processing.

All three read the same abstract and make high-level determinations. Three separate LLM round-trips for what is fundamentally one question: “What kind of paper is this?”

Proposed: Single triage call that extracts everything needed to route the publication through the pipeline:

  • Result status (is this a clinical trial result paper?)
  • Registry IDs (NCT, ISRCTN, EudraCT, etc.)
  • Endpoint names
  • Therapeutic area relevance (hematology/oncology: yes/no)
  • Intent classification (primary intent, secondary intents, trial relationship)

Schema changes:

  • Output combines current IdentifierExtraction, TherapeuticAreaFilter, and IntentClassification schemas
  • All flat fields and enums — no deep nesting
  • trial_linked hint for intent classification is dropped (model can infer from registry IDs it just extracted)
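
A hypothetical sketch of the merged output, every field flat (key names are borrowed from the existing llm_data keys; the enum sizes come from step 13's notes):

```ruby
# Illustrative shape of the merged triage schema: the union of the
# IdentifierExtraction, TherapeuticAreaFilter, and IntentClassification
# outputs, with no nesting. Field names are an assumption.
MERGED_TRIAGE_SCHEMA = {
  result_study: :boolean,                 # from IdentifierExtraction
  registry_records: :string_array,        # NCT / ISRCTN / EudraCT IDs
  endpoints: :string_array,
  hematology_oncology_relevant: :boolean, # from TherapeuticAreaFilter
  therapeutic_areas: :string_array,
  primary_intent: :enum,                  # from IntentClassification (10 values)
  secondary_intents: :enum_array,         # 60 values
  trial_relationship: :enum               # 15 values
}.freeze
```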

What changes downstream:

  • Step 4 is removed as a workflow step. TA relevance is extracted for all result pubs in step 1; downstream logic continues to use it only for unlinked pubs.
  • Step 13 is removed as a workflow step. Intent is extracted early; downstream consumers read from llm_data['intent_classification'] as before.
  • Step 3 (relink_to_clinical_trials) and all subsequent steps are unaffected.

Trade-offs:

  • TA relevance is now extracted for all ~76k result pubs instead of only the ~17k unlinked ones. The marginal cost is near zero: it is one extra field in a call that already happens.
  • Intent classification loses the trial_linked boolean input. This was a minor contextual hint — the model already has the registry IDs it extracted, which is stronger signal.
  • Prompt grows by ~30 lines (TA filter rules) + intent classification enums. Still well within single-call bounds.

Proposed workflow (16 steps, 11 LLM calls):

1. triage_publication ← NEW merged step (steps 1+4+13)
2. web_search_nctids
3. relink_to_clinical_trials
4. extract_interventions ← was step 5
5. link_publication_drugs ← was step 6
6. tag_investigational_interventions ← was step 7
7. extract_subgroups ← was step 8
8. extract_dose_evidence ← was step 9
9. classify_publications ← was step 10
10. extract_diseases ← was step 11
11. post_process_publications ← was step 12
12. extract_treatment_lines ← was step 14
13. standardize_adverse_events ← was step 15
14. classify_adverse_events ← was step 16
15. llm_classify_publication_endpoints_domains ← was step 17
16. llm_match_publication_endpoints ← was step 18

Evaluated & Rejected: Merge Endpoint Domain Classification + Endpoint Matching (Steps 17+18)


Current state: Two sequential LLM calls — step 17 classifies endpoints into domain groups (nano), then step 18 uses those classifications to filter catalog candidates and match (mini).

Why keep separate: Domain classification is not just intermediate context — it serves as a meaningful pre-filter and provides semantic guidance to the matcher:

  • Candidate filtering: 144 total catalog endpoints are reduced to ~25-70 domain-relevant candidates per publication. Without this, the matching prompt would include all 144 endpoints every time.
  • Semantic steering: The matcher prompt includes domain labels (e.g. [response_remission]) next to each publication endpoint, guiding the model toward the right catalog section.
  • Cost efficiency: The nano call is cheap and meaningfully reduces prompt size and cognitive load for the more expensive mini matching call.
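
The pre-filter amounts to a domain intersection over the catalog. A sketch with illustrative data shapes:

```ruby
require "set"

# Sketch: keep only catalog endpoints that share at least one domain group
# with the publication's step-17 classifications, so the step-18 matcher
# sees a domain-relevant slice of the catalog rather than all 144 endpoints.
def filter_catalog_candidates(catalog_endpoints, publication_domain_groups)
  wanted = publication_domain_groups.to_set
  catalog_endpoints.select do |endpoint|
    endpoint[:domain_groups].any? { |group| wanted.include?(group) }
  end
end
```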

Future Consideration: Merge SubgroupExtraction + classify_publications (Steps 8+10)


Current state: Two o4-mini calls. Step 8 identifies subgroups and maps endpoints to them. Step 10 consumes those subgroups and extracts outcome data, with a validation check ensuring no subgroups are dropped.

Rationale for keeping separate (for now):

  • Debuggability: llm_data['subgroup_endpoints'] provides an inspectable intermediate state showing what the model identified vs what it extracted results for.
  • The validation contract between steps catches extraction errors.
  • Step 8’s focused prompt (76 lines) gives the model one clear job.

When to reconsider: If subgroup identification proves stable and the validation check rarely triggers rejections, merging into a single o4-mini call would eliminate ~66k reasoning-model API calls.