Publications Workflow
Pipeline that transforms raw publication records into structured clinical evidence: 18 sequential steps, 13 LLM calls, processing ~76k result publications from a pool of ~216k.
Data Flow Overview
```
Raw Publication (title, abstract, source metadata)
        │
        ▼
PHASE 1: Identity & Linking         (Which trial does this paper belong to?)
        │
        ▼
PHASE 2: Interventions              (What drugs/treatments were studied?)
        │
        ▼
PHASE 3: Subgroups & Classification (What populations, endpoints, and outcomes were reported?)
        │
        ▼
PHASE 4: Post-processing & Intent   (Materialize rows, classify publication purpose)
        │
        ▼
PHASE 5: Adverse Events             (Standardize safety data)
        │
        ▼
PHASE 6: Endpoint Enrichment        (Map outcomes to canonical endpoint catalog)
        │
        ▼
Structured Evidence (publication_interventions, trial_subgroups, trial_outcome_measures,
                     trial_disease_details, adverse_events, trial_endpoints)
```
Step-by-Step Reference
Phase 1: Identity & Linking
Step 1 — Extract Trial Identifiers (`extract_trial_identifier`)
- Purpose: Determine if a publication reports clinical trial results, and extract registry IDs (NCT, ISRCTN, EudraCT, etc.) and endpoint names from the abstract.
- Model: OpenAI (configurable)
- Reads: Publication abstract
- Writes: `llm_data['registry_records']`, `llm_data['endpoints']`, `llm_data['result_study']`
- Service: `PublicationsLlmClassification::IdentifierExtraction`
- Notes: Gates the entire pipeline — only publications marked as `result = true` proceed.
Step 2 — Web Search for Trial IDs (`web_search_nctids`)
- Purpose: Find registry IDs via web search for result publications that couldn’t be matched from the abstract alone.
- Model: LLM-assisted web search
- Reads: Publications marked as results but lacking registry records
- Writes: `llm_data['web_search_registry_records']`
- Service: `PublicationsLlmClassification::WebSearchIdentifiers`
- Notes: Currently disabled (`limit=1`). 17k legacy records exist. Reserved for a future semantic-search replacement. `can_skip: true`.
Step 3 — Relink to Clinical Trials (`relink_to_clinical_trials`)
- Purpose: Match extracted registry IDs to clinical trial records in the database, creating publication-trial links.
- Model: None (algorithmic)
- Reads: Registry records from multiple sources (`api_registry_records`, `registry_records`, `web_search_registry_records`, PubMed references)
- Writes: `publication_clinical_trials` join table (77k links)
- Service: Direct implementation in the Thor task
- Notes: Deterministic matching. Checks all registry ID sources and deduplicates.
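The collect-normalize-deduplicate logic described above can be sketched in plain Ruby. This is an illustrative sketch, not the actual Thor task: it assumes registry IDs are plain strings, and the normalization rules shown are hypothetical.

```ruby
# Sketch: gather registry IDs from every llm_data source plus PubMed
# references, normalize, deduplicate, then look up trials by ID.
REGISTRY_ID_SOURCES = %w[
  api_registry_records
  registry_records
  web_search_registry_records
].freeze

def candidate_registry_ids(llm_data, pubmed_references = [])
  ids = REGISTRY_ID_SOURCES.flat_map { |key| Array(llm_data[key]) }
  ids += pubmed_references
  # Hypothetical normalization: trim whitespace, upcase, drop blanks,
  # then deduplicate across all sources.
  ids.map { |id| id.to_s.strip.upcase }.reject(&:empty?).uniq
end

def link_publication_to_trials(llm_data, trials_by_registry_id, pubmed_references = [])
  candidate_registry_ids(llm_data, pubmed_references)
    .filter_map { |id| trials_by_registry_id[id] }
    .uniq
end
```

Because the IDs are normalized before lookup, the same trial referenced as `NCT01234567` and `nct01234567` in different sources produces a single link.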
Step 4 — Therapeutic Area Filter (`therapeutic_area_filter`)
- Purpose: For publications that couldn’t be linked to any trial, determine if they’re relevant to hematology/oncology.
- Model: gpt-5-nano
- Reads: Unlinked publications with endpoints marked as results
- Writes: `llm_data['hematology_oncology_relevant']`, `llm_data['therapeutic_areas']`
- Service: `PublicationsLlmClassification::TherapeuticAreaFilter`
- Notes: Only processes unlinked publications. 17k processed. Acts as a relevance gate for unlinked pubs entering the rest of the pipeline.
Phase 2: Interventions
Step 5 — Extract Interventions (`extract_interventions`)
- Purpose: Extract treatment arms and interventions (drug names, doses, schedules) from the abstract.
- Model: gpt-5-mini
- Reads: Result publications with clinical relevance (linked to trial OR hematology/oncology relevant)
- Writes: `llm_data['intervention_arms']`
- Service: `PublicationsLlmClassification::InterventionExtraction`
- Notes: 19k publications have `intervention_arms`. `can_skip: true`.
Step 6 — Link Publication Drugs (`link_publication_drugs`)
- Purpose: Match extracted intervention names to canonical Drug and NCIt Concept records in the database.
- Model: Algorithmic with LLM fallback (`SimpleCandidateMatchingService`)
- Reads: Publication interventions without drug or NCIt mappings
- Writes: `publication_interventions.drug_id`, `publication_interventions.ncit_concept_id`
- Service: `PublicationsLlmClassification::DrugLinker`
- Notes: 29k of 45k interventions have drug links. `can_skip: true`. Higher resource requirements (2 vCPUs, 4 GB memory).
Step 7 — Tag Investigational Interventions (`tag_investigational_interventions`)
- Purpose: Classify each intervention’s role in the trial (investigational, comparator, combination, supportive).
- Model: gpt-5-mini
- Reads: Publications with `intervention_arms` where interventions lack role assignments
- Writes: `publication_interventions.intervention_role`
- Service: `PublicationsLlmClassification::InvestigationalTagger`
- Notes: 45k/45k interventions have roles assigned. `can_skip: true`.
Phase 3: Subgroups & Classification
Step 8 — Extract Subgroups (`extract_subgroups`)
- Purpose: Identify patient subgroups in the abstract and map which endpoints apply to each subgroup.
- Model: o4-mini (reasoning)
- Reads: Publications with endpoints, linked or relevant to clinical trials
- Writes: `llm_data['subgroup_endpoints']`, `llm_data['baseline_subgroup']`, `llm_data['overall_population']`
- Service: `PublicationsLlmClassification::SubgroupExtraction`
- Notes: 66k processed. Uses arrow notation for nested subgroups (e.g., “NSCLC → Squamous”). Feeds directly into step 10.
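The arrow notation implies a parent/child chain per subgroup label. A minimal parsing sketch follows; the real service's handling may differ.

```ruby
# Sketch: unpack an arrow-notation subgroup label into its ancestry.
# "NSCLC → Squamous" means the "Squamous" subgroup nested under "NSCLC".
def subgroup_chain(label)
  label.split("→").map(&:strip)
end

def subgroup_name(label)
  subgroup_chain(label).last
end

def parent_subgroup(label)
  chain = subgroup_chain(label)
  chain.length > 1 ? chain[-2] : nil
end
```

A label without an arrow (e.g. "Overall") is its own top-level subgroup with no parent.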
Step 9 — Extract Dose Evidence (`extract_dose_evidence`)
- Purpose: Extract structured dosing information per intervention from the abstract.
- Model: Configurable (gpt-5-mini default)
- Reads: Publications with interventions lacking dose evidence
- Writes: `publication_interventions.dose_evidence` (JSONB: single_dose, dose_min, dose_max, rp2d, units, frequency, context_type, evidence_quote, confidence)
- Service: `PublicationsLlmClassification::DoseEvidenceExtraction`
- Notes: New step, not yet populated. `can_skip: true`. Operates per intervention, not per publication.
Step 10 — Classify Publications (`classify_publications`)
- Purpose: The main extraction step. Extracts trial design, outcomes per subgroup, adverse events, trial conclusions, and partial result indicators.
- Model: o4-mini (reasoning, `reasoning_effort='medium'`)
- Reads: Publications with subgroup_endpoints from step 8
- Writes: `llm_data['subgroup_outcome_measures']`, `llm_data['trial_conclusion']`, `llm_data['patient_population']`, `llm_data['study_design']`, `llm_data['adverse_events']`, `llm_data['is_partial_result']`, `llm_data['partial_result_tags']`, `llm_data['total_number_of_participants']`
- Service: `PublicationsLlmClassification::Task`
- Notes: 66k processed. Heaviest prompt (157 lines). Validates that ALL subgroups from step 8 are present in the output — rejects and retries if any are dropped. Also extracts dose context per subgroup (dose cohorts) and data cutoff dates.
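The completeness validation amounts to a set difference between step 8's subgroups and step 10's output. A sketch under assumed key names (taken from the fields listed above); the rejection/retry plumbing in the service is omitted.

```ruby
# Sketch: a step-10 response is rejected when any subgroup identified
# in step 8 is absent from the extracted outcome measures.
def missing_subgroups(subgroup_endpoints, subgroup_outcome_measures)
  expected = subgroup_endpoints.map { |s| s["subgroup"] }
  reported = subgroup_outcome_measures.map { |m| m["subgroup"] }
  expected - reported
end

def extraction_complete?(subgroup_endpoints, subgroup_outcome_measures)
  missing_subgroups(subgroup_endpoints, subgroup_outcome_measures).empty?
end
```

Returning the missing labels (rather than a bare boolean) lets the retry prompt tell the model exactly which subgroups it dropped.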
Step 11 — Extract Diseases (`extract_diseases`)
- Purpose: Structure the patient population into canonical disease entities with extents, stages, biomarkers, and treatment settings.
- Model: o4-mini (reasoning)
- Reads: Publications with outcome measures but no disease structure
- Writes: `llm_data['patient_population_diseases']`
- Service: `PublicationsLlmClassification::DiseaseExtraction`
- Notes: 67k processed. Simplest prompt (17 lines). Uses the shared schema from `ParticipationCriteriaExtraction::Details` (reused for clinical trial participation criteria). Post-processes with disease matching against the canonical database.
Phase 4: Post-processing & Intent
Step 12 — Post-process Publications (`post_process_publications`)
- Purpose: Materialize LLM-extracted data from `llm_data` JSONB into normalized database tables.
- Model: None (algorithmic)
- Reads: All `llm_data` fields from steps 1-11
- Writes: `trial_subgroups`, `trial_outcome_measures`, `trial_endpoints`, `trial_disease_details`, `adverse_events` rows. Sets `llm_data_processed = true`.
- Service: `PublicationsLlmClassification::PostProcess`
- Notes: 66k processed. Critical materialization boundary — everything before this writes to JSONB; everything after reads from normalized tables.
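A minimal sketch of crossing the materialization boundary, in plain Ruby. The real PostProcess service persists through the application's models; the row fields shown here are illustrative.

```ruby
# Sketch: transform one llm_data JSONB payload into row hashes destined
# for the normalized trial_subgroups table.
def build_trial_subgroup_rows(publication_id, llm_data)
  Array(llm_data["subgroup_outcome_measures"]).map do |entry|
    {
      publication_id: publication_id,
      name: entry["subgroup"],
      outcome_measure_count: Array(entry["outcome_measures"]).size
    }
  end
end
```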
Step 13 — Classify Intent (`classify_intent`)
- Purpose: Classify the publication’s scientific intent and relationship to the trial.
- Model: Configurable (gpt-5-mini default)
- Reads: Post-processed publications with title and abstract
- Writes: `llm_data['intent_classification']` (primary_intent, secondary_intents[], trial_relationship)
- Service: `PublicationsLlmClassification::IntentClassification`
- Notes: 66k processed. Enum-heavy: 10 primary intents, 60 secondary intents, 15 trial relationships.
Step 14 — Extract Treatment Lines (`extract_treatment_lines`)
- Purpose: Determine treatment line (1L, 2L+, etc.) and prior therapy exposure per subgroup.
- Model: gpt-5-mini
- Reads: Publications with intervention arms
- Writes: `trial_subgroups.treatment_lines` (JSONB), `trial_subgroups.min_prior_lines`, `trial_subgroups.max_prior_lines`, `trial_subgroups.median_prior_lines`
- Service: `PublicationsLlmClassification::TreatmentContextExtraction`
- Notes: New step (writes to the `treatment_context` key). Most complex prompt (288 lines). Operates per subgroup with 22 therapy-class enums. The `treatment_lines` key has 66k records from an older implementation.
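Treatment-line labels translate mechanically into prior-line bounds. A sketch under assumed label conventions ("1L", "2L+", and so on); this is not the actual prompt or service logic.

```ruby
# Sketch: derive min/max prior lines from a treatment-line label.
# "1L" = first line (0 prior lines); "2L+" = second line or later
# (at least 1 prior line, no upper bound).
def prior_line_bounds(label)
  if (m = label.match(/\A(\d+)L(\+)?\z/))
    prior = m[1].to_i - 1
    { min_prior_lines: prior, max_prior_lines: m[2] ? nil : prior }
  else
    # Unrecognized labels (e.g. "maintenance") leave the bounds unset.
    { min_prior_lines: nil, max_prior_lines: nil }
  end
end
```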
Phase 5: Adverse Events
Step 15 — Standardize Adverse Events (`standardize_adverse_events`)
- Purpose: Match adverse event names to standardized Endpoint records using deterministic string matching.
- Model: None (algorithmic)
- Reads: Adverse event records from publications without standardized names
- Writes: `adverse_events.standardized_name`, `adverse_events.endpoint_id`
- Service: `PublicationsLlmClassification::AdverseEventStandardization`
- Notes: 150k/150k matched. High-coverage deterministic step.
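Deterministic matching of this kind usually reduces to normalize-then-lookup. A sketch; the actual normalization rules in the service are assumptions here.

```ruby
# Sketch: match a reported adverse event name to a standardized
# Endpoint id via case- and whitespace-insensitive comparison.
def normalize_ae_name(name)
  name.to_s.downcase.strip.gsub(/\s+/, " ")
end

def match_adverse_event(ae_name, endpoint_ids_by_name)
  endpoint_ids_by_name[normalize_ae_name(ae_name)]
end
```

Names that fail this lookup fall through to the LLM classification in step 16.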
Step 16 — Classify Adverse Events (LLM) (`classify_adverse_events`)
- Purpose: LLM fallback for adverse events that couldn’t be matched deterministically.
- Model: gpt-5-nano
- Reads: Unmatched adverse events from publications
- Writes: `adverse_events.standardized_name`, `adverse_events.endpoint_id`, `adverse_events.classification_source`
- Service: `PublicationsLlmClassification::AdverseEventLlmClassification`
- Notes: 90k of 150k have `classification_source` set (indicating the LLM was needed).
Phase 6: Endpoint Enrichment
Step 17 — Classify Endpoint Domains (`llm_classify_publication_endpoints_domains`)
- Purpose: Classify each endpoint into clinical domain groups (efficacy, safety, pharmacokinetics, etc.) and detect milestone endpoints.
- Model: gpt-5-nano
- Reads: Publications with structured endpoints
- Writes: `llm_data['endpoint_domain_classifier']` (domain_groups[] with confidence, is_milestone flag)
- Service: `PublicationsLlmClassification::DomainClassifier`
- Notes: 63k processed. Dynamic enum — domain groups are loaded from the database at runtime. Strict three-part test for milestone detection. `can_skip: true`.
Step 18 — Match Publication Endpoints (`llm_match_publication_endpoints`)
- Purpose: Match publication-reported outcome measures to canonical catalog endpoints with confidence scores.
- Model: gpt-5-mini
- Reads: Publications with domain-classified endpoints (from step 17)
- Writes: `llm_data['endpoint_matcher']` (outcome_measure → catalog endpoint matches with confidence)
- Service: `PublicationsLlmClassification::EndpointMatcher`
- Notes: 52k processed. Depends on step 17’s domain classifications to filter candidate endpoints. Conservative matching — returns empty if ambiguous. `can_skip: true`.
Data Scale Summary
| Table | Row Count | Source |
|---|---|---|
| publications | 216k total, 76k results | PubMed, Europe PMC |
| publication_clinical_trials | 77k | Step 3 |
| publication_interventions | 45k | Steps 5-7 |
| trial_endpoints | 238k | Step 12 |
| trial_subgroups | 188k | Step 12 |
| trial_outcome_measures | 403k | Step 12 |
| trial_disease_details | 83k | Step 12 |
| adverse_events | 150k | Step 12 |
LLM Usage by Model
| Model | Steps | Calls per pub |
|---|---|---|
| o4-mini (reasoning) | 8, 10, 11 | 3 |
| gpt-5-mini | 5, 7, 9, 13, 14, 18 | 6 |
| gpt-5-nano | 4, 16, 17 | 3 |
| OpenAI (configurable) | 1 | 1 |
| LLM web search | 2 (disabled) | 0 |
| Algorithmic (no LLM) | 3, 6*, 12, 15 | 0 |
*Step 6 uses LLM as fallback only.
Proposed Changes
Proposal 1: Unified Triage Step (Steps 1+4+13 → 1 step)
Current state: Three separate LLM calls at different points in the pipeline:
- Step 1 (`extract_trial_identifier`): Reads abstract → determines result status, extracts registry IDs and endpoints.
- Step 4 (`therapeutic_area_filter`): Reads abstract again → binary hematology/oncology relevance. Only runs on unlinked pubs, after step 3.
- Step 13 (`classify_intent`): Reads abstract again → classifies publication intent (10 primary, 60 secondary, 15 trial relationships). Runs after post-processing.
All three read the same abstract and make high-level determinations. Three separate LLM round-trips for what is fundamentally one question: “What kind of paper is this?”
Proposed: Single triage call that extracts everything needed to route the publication through the pipeline:
- Result status (is this a clinical trial result paper?)
- Registry IDs (NCT, ISRCTN, EudraCT, etc.)
- Endpoint names
- Therapeutic area relevance (hematology/oncology: yes/no)
- Intent classification (primary intent, secondary intents, trial relationship)
Schema changes:
- Output combines the current `IdentifierExtraction`, `TherapeuticAreaFilter`, and `IntentClassification` schemas
- All flat fields and enums — no deep nesting
- The `trial_linked` hint for intent classification is dropped (the model can infer it from the registry IDs it just extracted)
What changes downstream:
- Step 4 is removed as a workflow step. TA relevance is extracted for all result pubs in step 1; downstream logic continues to use it only for unlinked pubs.
- Step 13 is removed as a workflow step. Intent is extracted early; downstream consumers read from `llm_data['intent_classification']` as before.
- Step 3 (`relink_to_clinical_trials`) and all subsequent steps are unaffected.
Trade-offs:
- TA relevance is now extracted for all ~76k result pubs instead of ~17k unlinked ones. Marginal cost is near zero since it’s a single field in an already-happening call.
- Intent classification loses the `trial_linked` boolean input. This was a minor contextual hint — the model already has the registry IDs it extracted, which is a stronger signal.
- The prompt grows by ~30 lines (TA filter rules) plus the intent classification enums. Still well within single-call bounds.
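For concreteness, a merged triage payload might look like the following. This is a hypothetical shape combining the three current schemas; the enum values shown are illustrative, not the final contract.

```ruby
# Hypothetical combined triage output: the identifier, TA-relevance,
# and intent fields returned by one call instead of three.
def example_triage_payload
  {
    "result_study" => true,
    "registry_records" => ["NCT01234567"],
    "endpoints" => ["overall survival", "objective response rate"],
    "hematology_oncology_relevant" => true,
    "intent_classification" => {
      "primary_intent" => "trial_results",          # illustrative enum value
      "secondary_intents" => ["subgroup_analysis"], # illustrative enum value
      "trial_relationship" => "primary_report"      # illustrative enum value
    }
  }
end
```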
Proposed workflow (16 steps, 11 LLM calls):
1. triage_publication ← NEW merged step (steps 1+4+13)
2. web_search_nctids
3. relink_to_clinical_trials
4. extract_interventions ← was step 5
5. link_publication_drugs ← was step 6
6. tag_investigational ← was step 7
7. extract_subgroups ← was step 8
8. extract_dose_evidence ← was step 9
9. classify_publications ← was step 10
10. extract_diseases ← was step 11
11. post_process_publications ← was step 12
12. extract_treatment_lines ← was step 14
13. standardize_adverse_events ← was step 15
14. classify_adverse_events ← was step 16
15. classify_and_match_endpoints ← merged steps 17+18 (proposal 2)
Evaluated & Rejected: Merge Endpoint Domain Classification + Endpoint Matching (Steps 17+18)
Current state: Two sequential LLM calls — step 17 classifies endpoints into domain groups (nano), then step 18 uses those classifications to filter catalog candidates and match (mini).
Why keep separate: Domain classification is not just intermediate context — it serves as a meaningful pre-filter and provides semantic guidance to the matcher:
- Candidate filtering: 144 total catalog endpoints are reduced to ~25-70 domain-relevant candidates per publication. Without this, the matching prompt would include all 144 endpoints every time.
- Semantic steering: The matcher prompt includes domain labels (e.g. `[response_remission]`) next to each publication endpoint, guiding the model toward the right catalog section.
- Cost efficiency: The nano call is cheap and meaningfully reduces prompt size and cognitive load for the more expensive mini matching call.
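The candidate-filtering effect can be sketched as a simple select over the catalog. Illustrative only; the actual filtering lives in the matcher service.

```ruby
# Sketch: reduce the full endpoint catalog to candidates whose domain
# group appears among the publication's step-17 classifications.
def candidate_endpoints(catalog, publication_domain_groups)
  catalog.select { |endpoint| publication_domain_groups.include?(endpoint[:domain_group]) }
end
```

With 144 catalog entries and a publication classified into one or two domains, this is what shrinks the matching prompt to the ~25-70 candidates noted above.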
Future Consideration: Merge SubgroupExtraction + classify_publications (Steps 8+10)
Current state: Two o4-mini calls. Step 8 identifies subgroups and maps endpoints to them. Step 10 consumes those subgroups and extracts outcome data, with a validation check ensuring no subgroups are dropped.
Rationale for keeping separate (for now):
- Debuggability: `llm_data['subgroup_endpoints']` provides an inspectable intermediate state showing what the model identified vs. what it extracted results for.
- The validation contract between steps catches extraction errors.
- Step 8’s focused prompt (76 lines) gives the model one clear job.
When to reconsider: If subgroup identification proves stable and the validation check rarely triggers rejections, merging into a single o4-mini call would eliminate ~66k reasoning-model API calls.