Clinical trials pipeline
Every week, Data Gov collects, reconciles, and enriches 72,000+ clinical trials from two registries. This page follows a trial from raw registry data to a fully structured record in the knowledge graph. It covers the pipeline architecture, entity resolution, and the domain knowledge you need to understand what the data means.
Why clinical trials matter
Clinical trials are the experimental evidence behind every drug. When Pfizer’s Paxlovid enters Phase III, that trial generates data about which diseases it targets, which biomarkers it requires, what endpoints it measures, and what results it produces. Data Gov captures all of this.
But trial registries are messy. ClinicalTrials.gov accepts free-text, self-reported data. The same drug appears as “Pembrolizumab”, “KEYTRUDA”, “MK-3475”, and “lambrolizumab” across different trial records. Disease names are inconsistent. Eligibility criteria are unstructured paragraphs. Converting this into structured, linked knowledge is the core challenge.
Two data sources
Data Gov collects trials from two registries.
AACT (Aggregate Analysis of ClinicalTrials.gov) mirrors the US federal registry as a PostgreSQL database. Data Gov connects to it as a read-only external database. Trials carry NCT identifiers (e.g., NCT04380636). AACT covers the vast majority of global oncology trials.
ChiCTR (Chinese Clinical Trial Registry) has no API. Data Gov scrapes it weekly using headless Chrome (Ferrum gem). Trials carry ChiCTR identifiers (e.g., ChiCTR2500108641). The registry covers 22,000+ oncology trials not in ClinicalTrials.gov — a significant data source that most competing platforms miss.
| Registry | Access method | Identifier | Approximate scale |
|---|---|---|---|
| ClinicalTrials.gov (AACT) | Read-only PostgreSQL connection | NCT{number} | ~50,000 oncology trials |
| ChiCTR | Headless Chrome scraping | ChiCTR{number} | ~22,000 oncology trials |
Pipeline overview
The pipeline runs as two chained workflows every Tuesday at 08:00 UTC. ClinicalTrialsWorkflow (22 steps) handles collection and extraction. ClinicalTrialEligibilitiesWorkflow (30 steps) handles disease and biomarker entity resolution. The second triggers automatically when the first completes, via an after_complete hook.
```mermaid
flowchart TB
subgraph Phase1["Phase 1: Reconciliation"]
SV["Sync version\nhistories"]
RE["Diff against\nAACT"]
DE["Detect eligibility\nchanges (LLM)"]
CL["Clean stale\nflags"]
RS["Reset changed\ncriteria"]
end
subgraph Phase2["Phase 2: Collection"]
CA["Collect AACT\n(Node.js)"]
CC["Collect ChiCTR\n(Ferrum scraper)"]
end
subgraph Phase3["Phase 3: Study Plans"]
GM["Generate ChiCTR\nmarkdown"]
EPA["Extract plans\n(AACT)"]
EPC["Extract plans\n(ChiCTR)"]
MSP["Match plan\ncomponents"]
end
subgraph Phase4["Phase 4: Shared Processing"]
SO["Sync outcomes"]
SR["Collect results"]
LS["Link sponsors\nto orgs"]
BC["Build conditions"]
end
subgraph Phase5["Phase 5: Participation Criteria"]
PC["Extract criteria\n(LLM)"]
PP["Post-process\nand link"]
end
subgraph Phase6["Phase 6: Finalization"]
PR["Pull PubMed\nreferences"]
UV["Update version\nmarkers"]
end
SV --> RE --> DE --> CL --> RS
RS --> CA & CC
CA --> EPA & SO & SR & BC
CC --> GM --> EPC
EPA & EPC --> MSP
BC --> PC --> PP --> PR --> UV
subgraph Eligibilities["Auto-triggered: Eligibilities Workflow"]
DM["Match diseases\n(4-stage cascade)"]
BM["Match biomarkers\n(4-stage cascade)"]
SM["Match subtypes"]
end
UV --> DM --> BM --> SM
```
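The after_complete chaining between the two workflows can be sketched as follows. This is a minimal stand-in, not the real workflow engine: the Workflow base class and RUN_LOG are illustrative; only the two workflow class names and the after_complete hook name come from the pipeline.

```ruby
# Toy workflow base class demonstrating an after_complete hook.
class Workflow
  # Register a block to run once the workflow finishes.
  def self.after_complete(&block)
    @after_complete = block
  end

  def self.completion_hook
    @after_complete
  end

  def run
    execute_steps
    self.class.completion_hook&.call
  end

  def execute_steps
    # placeholder for the workflow's step runner
  end
end

RUN_LOG = []

class ClinicalTrialEligibilitiesWorkflow < Workflow
  def execute_steps
    RUN_LOG << :eligibilities  # 30 entity-resolution steps in the real system
  end
end

class ClinicalTrialsWorkflow < Workflow
  # Chain the second workflow onto the first, as the pipeline does.
  after_complete { ClinicalTrialEligibilitiesWorkflow.new.run }

  def execute_steps
    RUN_LOG << :trials  # 22 collection/extraction steps in the real system
  end
end

ClinicalTrialsWorkflow.new.run
RUN_LOG # => [:trials, :eligibilities]
```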
Phase 1: Reconciliation
Before collecting new data, the pipeline detects what changed since the last run. This avoids reprocessing 72,000 trials every week.
Sync versions. The CtGovService calls the ClinicalTrials.gov API to fetch version histories for every tracked NCT ID. Version arrays are stored in the versions JSONB column on clinical_trials.
Reconcile existing trials. Compares the persisted current_version against new versions. For each trial with changes, fetches updated eligibility, conditions, and references from AACT. Processes in parallel: 5 threads, batches of 100. Tracks which modules changed (Eligibility, Conditions, Adverse Events, Study Design).
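The fan-out shape of this step (batches of 100, a pool of 5 worker threads) can be sketched with plain Ruby threads. This is an illustrative skeleton, not the real reconciler: reconcile_in_batches and the block it yields to are hypothetical names.

```ruby
BATCH_SIZE  = 100
THREAD_POOL = 5

# Process trial IDs in batches of 100, each batch drained by 5 worker threads.
def reconcile_in_batches(trial_ids, &reconcile_trial)
  trial_ids.each_slice(BATCH_SIZE) do |batch|
    queue = Queue.new
    batch.each { |id| queue << id }

    workers = THREAD_POOL.times.map do
      Thread.new do
        loop do
          id = begin
                 queue.pop(true)   # non-blocking pop; raises when the queue is empty
               rescue ThreadError
                 break
               end
          reconcile_trial.call(id) # per-trial diff-and-update would go here
        end
      end
    end
    workers.each(&:join)          # finish the batch before starting the next
  end
end

processed = Queue.new
reconcile_in_batches((1..250).to_a) { |id| processed << id }
processed.size # => 250
```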
Detect eligibility changes (LLM). Not every text change is meaningful. GPT-4.1 reads the old and new eligibility text and determines if the change is semantically significant. This prevents unnecessary reprocessing when a site merely fixes a typo.
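The significance gate can be sketched as a single yes/no question to the model. The prompt wording and the ask_llm injection point below are assumptions for illustration; only the idea (GPT-4.1 compares old and new text and vetoes trivial edits) comes from the pipeline.

```ruby
# Ask an LLM whether an eligibility-text change is semantically meaningful.
# `ask_llm` is any callable that takes a prompt string and returns a string.
def eligibility_change_significant?(old_text, new_text, ask_llm:)
  prompt = <<~PROMPT
    Compare the two eligibility criteria below. Answer SIGNIFICANT if the
    change alters who can enroll, or TRIVIAL if it only fixes wording or typos.

    OLD:
    #{old_text}

    NEW:
    #{new_text}
  PROMPT
  # Parse the verdict strictly so odd model output defaults to "not significant".
  ask_llm.call(prompt).strip.upcase.start_with?("SIGNIFICANT")
end

typo_fix = ->(_prompt) { "TRIVIAL" }
eligibility_change_significant?("Age >= 18", "Age ≥ 18", ask_llm: typo_fix)
# => false (typo-level change is not reprocessed)
```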
Clean and reset. Stale pending flags are cleared. Participation criteria for trials with confirmed meaningful changes are reset and queued for re-extraction in Phase 5.
Phase 2: Data collection
AACT and ChiCTR collectors run in parallel.
AACT collection
A Node.js script (data-collection-job:6 on AWS Batch) fetches new and updated trials from the AACT database. It creates ClinicalTrial records and generates markdown summaries in-stream. The markdown feeds into study plan extraction.
ChiCTR collection
The ChictrTrialsService scrapes the ChiCTR website using Ferrum (headless Chrome). It runs 140+ cancer keyword searches, rate-limited to 2 requests/second with user agent rotation. Raw XML files are stored in S3 at silver/chictr/{run_id}/. The scraper uses advisory locks to prevent concurrent imports from colliding.
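A 2-requests-per-second throttle with user agent rotation can be sketched like this. The Throttle class and USER_AGENTS pool are illustrative, not the ChictrTrialsService internals, and the commented-out Ferrum call is an assumption about where the header would be set.

```ruby
# Enforce a minimum interval between requests (here, 0.5 s for 2 req/sec).
class Throttle
  def initialize(per_second:)
    @interval = 1.0 / per_second
    @last = nil
  end

  def wait
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    sleep(@interval - (now - @last)) if @last && now - @last < @interval
    @last = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

USER_AGENTS = [
  "Mozilla/5.0 (X11; Linux x86_64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
].freeze

throttle = Throttle.new(per_second: 2)
3.times do |i|
  throttle.wait                               # blocks so we never exceed 2 req/sec
  ua = USER_AGENTS[i % USER_AGENTS.size]      # rotate user agents per request
  # browser.headers.set("User-Agent" => ua)   # a Ferrum call would go here
end
```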
Phase 3: Study plan extraction
Every trial describes its treatment arms in free text. Study plan extraction turns this into structured data: which drugs, at what doses, in which arms.
GPT-4.1 reads the trial markdown and extracts StudyPlanArm records (EXPERIMENTAL, COMPARATOR, CONTROL) and StudyPlanComponent records (individual drugs with dosing JSONB). Each component carries an investigational_component flag marking whether it is the experimental drug.
Component matching maps extracted drug names to Bioloupe drug entities. It uses a 4-stage fallback chain:
- Database match against drugs.name and drugs.all_synonyms
- NCI Thesaurus exact match
- NCI fuzzy match + LLM confirmation
- Create a new NcitConcept for manual resolution
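The fallback chain above amounts to "try each stage in order, first hit wins." A minimal sketch, with placeholder stage bodies standing in for the real database, NCI Thesaurus, and LLM lookups:

```ruby
# A matching stage: a name plus a callable that returns a match or nil.
Stage = Struct.new(:name, :lookup)

# Walk the stages in order; the first stage that returns a match wins.
def match_component(drug_name, stages)
  stages.each do |stage|
    hit = stage.lookup.call(drug_name)
    return { stage: stage.name, match: hit } if hit
  end
  nil
end

# Toy stand-in for drugs.name / drugs.all_synonyms lookups.
KNOWN_DRUGS = { "pembrolizumab" => "Pembrolizumab", "mk-3475" => "Pembrolizumab" }

stages = [
  Stage.new(:database,    ->(n) { KNOWN_DRUGS[n.downcase] }),
  Stage.new(:ncit_exact,  ->(n) { nil }),                  # NCI Thesaurus exact match
  Stage.new(:ncit_fuzzy,  ->(n) { nil }),                  # fuzzy match + LLM confirmation
  Stage.new(:create_stub, ->(n) { "NcitConcept(#{n})" })   # queue for manual resolution
]

match_component("MK-3475", stages)
# => { stage: :database, match: "Pembrolizumab" }
```

Because the last stage always produces a stub, every extracted drug name ends up either linked or queued for a human.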
Phase 4: Shared processing
These steps run for both AACT and ChiCTR trials.
Sync outcomes. Fetches outcome measure definitions from AACT. Each outcome has a title, time frame, and type (Primary, Secondary, Other).
Collect study results. A Node.js script (4 vCPUs, 8 GB) pulls detailed statistical results: result groups, outcome measurements, baseline data, adverse events, and participant flows.
Link sponsors to organizations. Maps the free-text sponsor names from trial records to canonical Organisation records. Uses fuzzy matching against organisations.name and organisations.branch_names.
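The matching idea can be illustrated with a toy exact-after-normalization matcher: strip punctuation and legal suffixes, then compare against both the organisation name and its branch names. The real step uses fuzzy matching; the SUFFIXES list and hash shape here are assumptions.

```ruby
# Legal-entity suffixes that sponsor strings often add or omit.
SUFFIXES = /\s+\b(inc|ltd|llc|corp|gmbh|plc)\b$/i

# Lowercase, drop punctuation, and trim a trailing legal suffix.
def normalize(name)
  name.downcase.gsub(/[.,]/, "").sub(SUFFIXES, "").strip
end

# Find the organisation whose name or branch names normalize to the same key.
def link_sponsor(sponsor, organisations)
  key = normalize(sponsor)
  organisations.find do |org|
    ([org[:name]] + org.fetch(:branch_names, [])).any? { |n| normalize(n) == key }
  end
end

orgs = [{ name: "Pfizer Inc.", branch_names: ["Pfizer Oncology"] }]
link_sponsor("Pfizer", orgs) # matches despite suffix and punctuation noise
```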
Build conditions. Creates ClinicalTrialCondition records from AACT condition data. These raw condition strings feed into the eligibilities workflow for disease matching.
Phase 5: Participation criteria extraction
Free-text eligibility criteria contain structured information buried in paragraphs. GPT-4.1 parses each trial’s eligibility text into structured fields.
What the LLM extracts from eligibility text:
- Disease names and subtypes
- Required biomarkers (e.g., “HER2-positive”, “EGFR mutation”)
- Disease stages and extents
- Treatment lines (“second-line or later”)
- Treatment settings (neoadjuvant, adjuvant, metastatic)
- Age ranges and ECOG performance scores
Results land in participation_criteria rows — one per disease-trial pair. Post-processing normalizes extracted values and prepares them for entity resolution.
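The fan-out from one extraction to participation_criteria rows can be sketched as follows. The hash fields mirror the bullets above, but the exact shape and the criteria_rows helper are illustrative, not the real schema.

```ruby
# Turn one LLM extraction into one row per disease-trial pair.
def criteria_rows(trial_id, extraction)
  extraction[:diseases].map do |disease|
    {
      clinical_trial_id: trial_id,
      disease: disease,
      biomarkers: extraction[:biomarkers],
      treatment_line: extraction[:treatment_line],
      ecog_max: extraction[:ecog_max]
    }
  end
end

rows = criteria_rows("NCT04380636",
  diseases: ["NSCLC", "Adenocarcinoma of lung"],
  biomarkers: ["EGFR mutation"],
  treatment_line: "second-line or later",
  ecog_max: 1)
rows.size # => 2, one row per disease-trial pair
```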
The eligibilities workflow: entity resolution
When ClinicalTrialsWorkflow completes, ClinicalTrialEligibilitiesWorkflow starts automatically. Its 30 steps resolve the free-text disease and biomarker names into canonical entities.
The resolution uses a 4-stage cascade for each term type (diseases, subtypes, biomarkers). Each stage runs in order. The first confident match wins.
```mermaid
flowchart LR
Pop["Populate\nterms"] --> KW["Suggest\nkeywords\n(LLM)"]
KW --> Sem["Semantic\ncandidates\n(pgvector)"]
Sem --> NCI["NCI Thesaurus\ncandidates"]
NCI --> Pick["Rank\ncandidates\n(LLM)"]
Pick --> QA["Quality check\n(LLM)"]
QA --> Judge["Final judge\n(GPT-4.1)"]
Judge --> Post["Post-process\nand link"]
```
Stage 1: Keyword suggestion. GPT-4.1 generates search keywords from the raw term. “Cancer of lung, non small cell” produces keywords like “NSCLC”, “non-small cell lung cancer.”
Stage 2: Candidate search. Two parallel searches find candidates. Semantic search uses pgvector nearest-neighbor against OpenAI embeddings. NCI Thesaurus search uses the NcitService API with exact and fuzzy matching.
Stage 3: Ranking. GPT-4.1 ranks the candidates by relevance. The best match gets a confidence score.
Stage 4: Judging. A final GPT-4.1 call validates the top match. Ambiguous cases get consensus voting from multiple LLM calls. High-confidence matches link automatically. Low-confidence matches queue for human review.
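The consensus vote for ambiguous cases can be sketched as a simple majority over several independent model calls. The threshold value and the vote format (here, NCI concept codes as strings) are assumptions; the pipeline only specifies that multiple LLM calls vote.

```ruby
# Accept a match only when a clear majority of independent votes agree.
def consensus(votes, threshold: 0.6)
  winner, count = votes.tally.max_by { |_, c| c }
  count.to_f / votes.size >= threshold ? winner : nil
end

consensus(["C2926", "C2926", "C4878"]) # => "C2926" (2 of 3 agree)
consensus(["C2926", "C4878"])          # => nil (no majority: goes to human review)
```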
Schedule and triggers
| Detail | Value |
|---|---|
| Cron schedule | 0 8 * * 2 (Tuesday 08:00 UTC) |
| Job class | ClinicalTrialsWorkflowSchedulerJob |
| Concurrency guard | Skips if a previous run is still active |
| Manual trigger | rails runner "ClinicalTrialsWorkflowSchedulerJob.perform_now" |
| AACT connection | Read-only. Configured via AACT_DB_* env vars. |
| ChiCTR scraping | Rate-limited to 2 req/sec. 140+ cancer keyword searches. |
| LLM model | GPT-4.1 for extraction, matching, and judging |
Key services
| Service | Purpose |
|---|---|
| CtGovService | Fetches version histories from ClinicalTrials.gov API |
| ChictrTrialsService | Scrapes ChiCTR with headless Chrome |
| AactEligibilityTransformer | Transforms AACT eligibility data |
| OpenAiService | LLM calls for extraction and matching |
| DiseaseMatchingService | Disease entity resolution (NCI-coded) |
| BiomarkerMatchingService | Biomarker entity resolution |
| TermMatchingService | Generic 4-stage term matching |
| SimpleCandidateMatchingService | Strategy-based candidate search |
Common problems
| Symptom | Likely cause | Fix |
|---|---|---|
| Workflow stuck at “Sync Versions” | CT.gov API rate limiting (HTTP 429) | Wait 10-15 min. The service retries with backoff. |
| ChiCTR returns 0 results | Anti-bot detection triggered | Check user agent rotation. Try smaller keyword set. |
| Eligibility detection timeout | Too many pending trials | Run with --limit flag to process in chunks. |
| Term matching low quality | NCI Thesaurus API down | Check NCI API. Fall back to semantic-only. |
| “Workflow already running” skip | Previous run did not complete | Mark the stale workflow as complete in ActiveAdmin. |
| Study plans missing ChiCTR | Markdown generation skipped | Re-run clinical_trials:chictr:generate_markdown. |
Next steps
- Drug approvals — How regulatory data from four agencies becomes unified approval records
- Data model — Deep dive into the trial tables and their relationships