
Clinical trials pipeline

Every week, Data Gov collects, reconciles, and enriches 72,000+ clinical trials from two registries. This page follows a trial from raw registry data to a fully structured record in the knowledge graph. It covers the pipeline architecture, entity resolution, and the domain knowledge you need to understand what the data means.

Clinical trials are the experimental evidence behind every drug. When Pfizer’s Paxlovid enters Phase III, that trial generates data about which diseases it targets, which biomarkers it requires, what endpoints it measures, and what results it produces. Data Gov captures all of this.

But trial registries are messy. ClinicalTrials.gov accepts free-text, self-reported data. The same drug appears as “Pembrolizumab”, “KEYTRUDA”, “MK-3475”, and “lambrolizumab” across different trial records. Disease names are inconsistent. Eligibility criteria are unstructured paragraphs. Converting this into structured, linked knowledge is the core challenge.

Data Gov collects trials from two registries.

AACT (Aggregate Analysis of ClinicalTrials.gov) mirrors the US federal registry as a PostgreSQL database. Data Gov connects to it as a read-only external database. Trials carry NCT identifiers (e.g., NCT04380636). AACT covers the vast majority of global oncology trials.

ChiCTR (Chinese Clinical Trial Registry) has no API. Data Gov scrapes it weekly using headless Chrome (Ferrum gem). Trials carry ChiCTR identifiers (e.g., ChiCTR2500108641). The registry covers 22,000+ oncology trials not in ClinicalTrials.gov — a significant data source that most competing platforms miss.

| Registry | Access method | Identifier | Approximate scale |
| --- | --- | --- | --- |
| ClinicalTrials.gov (AACT) | Read-only PostgreSQL connection | NCT{number} | ~50,000 oncology trials |
| ChiCTR | Headless Chrome scraping | ChiCTR{number} | ~22,000 oncology trials |

The pipeline runs as two chained workflows every Tuesday at 08:00 UTC. ClinicalTrialsWorkflow (22 steps) handles collection and extraction. ClinicalTrialEligibilitiesWorkflow (30 steps) handles disease and biomarker entity resolution. The second triggers automatically when the first completes via an after_complete hook.

flowchart TB
  subgraph Phase1["Phase 1: Reconciliation"]
    SV["Sync version\nhistories"]
    RE["Diff against\nAACT"]
    DE["Detect eligibility\nchanges (LLM)"]
    CL["Clean stale\nflags"]
    RS["Reset changed\ncriteria"]
  end

  subgraph Phase2["Phase 2: Collection"]
    CA["Collect AACT\n(Node.js)"]
    CC["Collect ChiCTR\n(Ferrum scraper)"]
  end

  subgraph Phase3["Phase 3: Study Plans"]
    GM["Generate ChiCTR\nmarkdown"]
    EPA["Extract plans\n(AACT)"]
    EPC["Extract plans\n(ChiCTR)"]
    MSP["Match plan\ncomponents"]
  end

  subgraph Phase4["Phase 4: Shared Processing"]
    SO["Sync outcomes"]
    SR["Collect results"]
    LS["Link sponsors\nto orgs"]
    BC["Build conditions"]
  end

  subgraph Phase5["Phase 5: Participation Criteria"]
    PC["Extract criteria\n(LLM)"]
    PP["Post-process\nand link"]
  end

  subgraph Phase6["Phase 6: Finalization"]
    PR["Pull PubMed\nreferences"]
    UV["Update version\nmarkers"]
  end

  SV --> RE --> DE --> CL --> RS
  RS --> CA & CC
  CA --> EPA & SO & SR & BC
  CC --> GM --> EPC
  EPA & EPC --> MSP
  BC --> PC --> PP --> PR --> UV

  subgraph Eligibilities["Auto-triggered: Eligibilities Workflow"]
    DM["Match diseases\n(4-stage cascade)"]
    BM["Match biomarkers\n(4-stage cascade)"]
    SM["Match subtypes"]
  end

  UV --> DM --> BM --> SM
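The after_complete chaining described above can be sketched with a minimal, hypothetical workflow class. The class and method names below are illustrative stand-ins, not the actual Data Gov implementation:

```ruby
# Minimal sketch of one workflow triggering the next via an
# after_complete hook. Names are illustrative, not the real classes.
class Workflow
  attr_reader :name, :log

  def initialize(name, log)
    @name = name
    @log = log
    @after_complete = nil
  end

  # Register a callback that fires once every step has finished.
  def after_complete(&block)
    @after_complete = block
  end

  def run
    @log << "#{name}: running steps"
    @after_complete&.call
  end
end

log = []
trials = Workflow.new("ClinicalTrialsWorkflow", log)
eligibilities = Workflow.new("ClinicalTrialEligibilitiesWorkflow", log)

# Chain the workflows: eligibilities starts only when trials completes.
trials.after_complete { eligibilities.run }
trials.run
```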

Phase 1: Reconciliation

Before collecting new data, the pipeline detects what changed since the last run. This avoids reprocessing all 72,000+ trials every week.

Sync versions. The CtGovService calls the ClinicalTrials.gov API to fetch version histories for every tracked NCT ID. The version arrays are stored in the versions JSONB column on clinical_trials.

Reconcile existing trials. Compares the persisted current_version against new versions. For each trial with changes, fetches updated eligibility, conditions, and references from AACT. Processes in parallel: 5 threads, batches of 100. Tracks which modules changed (Eligibility, Conditions, Adverse Events, Study Design).
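The diff step boils down to comparing each trial's persisted version against the freshly fetched version list. A plain-Ruby sketch (field names are assumptions; the real comparison lives in the workflow):

```ruby
# Return the NCT IDs whose fetched version history has advanced past the
# persisted current_version, i.e. the trials that need re-fetching.
def changed_trials(persisted, fetched_versions)
  persisted.filter_map do |nct_id, current_version|
    latest = fetched_versions.fetch(nct_id, []).max
    nct_id if latest && latest > current_version
  end
end

persisted = { "NCT04380636" => 3, "NCT01234567" => 7 }
fetched   = { "NCT04380636" => [1, 2, 3, 4], "NCT01234567" => [5, 6, 7] }
changed_trials(persisted, fetched) # => ["NCT04380636"]
```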

Detect eligibility changes (LLM). Not every text change is meaningful. GPT-4.1 reads the old and new eligibility text and determines if the change is semantically significant. This prevents unnecessary reprocessing when a site merely fixes a typo.
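The shape of that check can be sketched as a prompt builder plus a yes/no parse. The prompt wording below is hypothetical; only the old-text/new-text-in, boolean-out shape is taken from the pipeline description:

```ruby
# Hypothetical sketch of the significance check sent to GPT-4.1.
def significance_prompt(old_text, new_text)
  <<~PROMPT
    Compare the two eligibility criteria below. Answer SIGNIFICANT if the
    change alters who can enroll (diseases, biomarkers, stages, prior
    therapy), or TRIVIAL if it is only a typo or formatting fix.

    OLD:
    #{old_text}

    NEW:
    #{new_text}
  PROMPT
end

# Parse the model's one-word verdict into a boolean.
def significant?(llm_answer)
  llm_answer.strip.upcase.start_with?("SIGNIFICANT")
end

significant?("TRIVIAL - punctuation only") # => false
```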

Clean and reset. Stale pending flags are cleared. Participation criteria for trials with confirmed meaningful changes are reset and queued for re-extraction in Phase 5.

Phase 2: Collection

The AACT and ChiCTR collectors run in parallel.

A Node.js script (data-collection-job:6 on AWS Batch) fetches new and updated trials from the AACT database. It creates ClinicalTrial records and generates markdown summaries in-stream. The markdown feeds into study plan extraction.

The ChictrTrialsService scrapes the ChiCTR website using Ferrum (headless Chrome). It runs 140+ cancer keyword searches, rate-limited to 2 requests/second with user agent rotation. Raw XML files are stored in S3 at silver/chictr/{run_id}/. The scraper uses advisory locks to prevent concurrent imports from colliding.
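The scraper's politeness controls can be sketched as a small throttle-plus-rotation helper. This is a minimal illustration (the real ChictrTrialsService drives Ferrum; the class below is hypothetical):

```ruby
# Sketch of a 2 req/sec throttle with rotating user agents.
class PoliteFetcher
  USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
  ].freeze

  def initialize(min_interval: 0.5) # 0.5 s gap => 2 requests/second
    @min_interval = min_interval
    @last_request_at = nil
    @ua_index = 0
  end

  # Return the next user agent, cycling through the pool.
  def next_user_agent
    ua = USER_AGENTS[@ua_index % USER_AGENTS.size]
    @ua_index += 1
    ua
  end

  # Sleep just long enough to keep requests at least min_interval apart.
  def throttle
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @last_request_at
      wait = @min_interval - (now - @last_request_at)
      sleep(wait) if wait.positive?
    end
    @last_request_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```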

Phase 3: Study plans

Every trial describes its treatment arms in free text. Study plan extraction turns this into structured data: which drugs, at what doses, in which arms.

GPT-4.1 reads the trial markdown and extracts StudyPlanArm records (EXPERIMENTAL, COMPARATOR, CONTROL) and StudyPlanComponent records (individual drugs with dosing JSONB). Each component carries an investigational_component flag marking whether it is the experimental drug.
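A sketch of what an extracted plan might look like, with the kind of sanity check a pipeline would run afterward. The field names follow the models named above, but the exact JSONB schema is an assumption:

```ruby
# Illustrative shape of an extracted study plan.
plan = {
  arms: [
    {
      arm_type: "EXPERIMENTAL",
      components: [
        { name: "Pembrolizumab", investigational_component: true,
          dosing: { amount: 200, unit: "mg", frequency: "every 3 weeks" } },
      ],
    },
    {
      arm_type: "COMPARATOR",
      components: [
        { name: "Docetaxel", investigational_component: false,
          dosing: { amount: 75, unit: "mg/m2", frequency: "every 3 weeks" } },
      ],
    },
  ],
}

ARM_TYPES = %w[EXPERIMENTAL COMPARATOR CONTROL].freeze

# Sanity check: every arm has a recognized type and at least one component.
def valid_plan?(plan)
  plan[:arms].all? do |arm|
    ARM_TYPES.include?(arm[:arm_type]) && arm[:components].any?
  end
end

valid_plan?(plan) # => true
```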

Component matching maps extracted drug names to Bioloupe drug entities. It uses a 4-stage fallback chain:

  1. Database match against drugs.name and drugs.all_synonyms
  2. NCI Thesaurus exact match
  3. NCI fuzzy match + LLM confirmation
  4. Create a new NcitConcept for manual resolution

These steps run for both AACT and ChiCTR trials.
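The first-match-wins chain above can be sketched as a list of stages tried in order. The stage lookups here are stubbed lambdas; only the cascade shape reflects the pipeline:

```ruby
# Try each stage in order; the first stage returning a match wins.
# Stage 4 (create a new NcitConcept) is the terminal fallback.
def match_component(name, stages)
  stages.each do |stage_name, stage|
    result = stage.call(name)
    return { stage: stage_name, match: result } if result
  end
  { stage: :create_ncit_concept, match: nil } # park for manual resolution
end

stages = {
  database_match: ->(name) { { "pembrolizumab" => "Drug#42" }[name.downcase] },
  ncit_exact:     ->(name) { nil },  # stubbed: no NCI Thesaurus exact hit
  ncit_fuzzy_llm: ->(name) { nil },  # stubbed: no fuzzy + LLM hit
}

match_component("Pembrolizumab", stages)
# => { stage: :database_match, match: "Drug#42" }
match_component("mystery-compound", stages)
# => { stage: :create_ncit_concept, match: nil }
```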

Phase 4: Shared processing

Sync outcomes. Fetches outcome measure definitions from AACT. Each outcome has a title, time frame, and type (Primary, Secondary, Other).

Collect study results. A Node.js script (4 vCPUs, 8 GB) pulls detailed statistical results: result groups, outcome measurements, baseline data, adverse events, and participant flows.

Link sponsors to organizations. Maps the free-text sponsor names from trial records to canonical Organisation records. Uses fuzzy matching against organisations.name and organisations.branch_names.
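Fuzzy matching of sponsor names usually starts with aggressive normalization. A sketch, assuming a suffix list like the one below (the real matching logic against organisations.name is internal):

```ruby
# Normalize a sponsor name before fuzzy-matching it to an Organisation.
# Strips punctuation and common corporate suffixes; the suffix list is
# an assumption for illustration.
CORPORATE_SUFFIXES = /\b(inc|ltd|llc|gmbh|corp|corporation|limited)\.?\z/i

def normalize_org(name)
  name.downcase
      .gsub(/[[:punct:]]/, " ")
      .squeeze(" ")
      .strip
      .sub(CORPORATE_SUFFIXES, "")
      .strip
end

normalize_org("Pfizer, Inc.")   # => "pfizer"
normalize_org("PFIZER LIMITED") # => "pfizer"
```

With both sides normalized, an exact or edit-distance comparison becomes far more reliable than matching the raw free-text strings.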

Build conditions. Creates ClinicalTrialCondition records from AACT condition data. These raw condition strings feed into the eligibilities workflow for disease matching.

Phase 5: Participation criteria extraction

Free-text eligibility criteria contain structured information buried in paragraphs. GPT-4.1 parses each trial’s eligibility text into structured fields.

What the LLM extracts from eligibility text:

  • Disease names and subtypes
  • Required biomarkers (e.g., “HER2-positive”, “EGFR mutation”)
  • Disease stages and extents
  • Treatment lines (“second-line or later”)
  • Treatment settings (neoadjuvant, adjuvant, metastatic)
  • Age ranges and ECOG performance scores

Results land in participation_criteria rows — one per disease-trial pair. Post-processing normalizes extracted values and prepares them for entity resolution.
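One normalization step can be sketched concretely: turning a free-text treatment line like "second-line or later" into comparable fields. The mapping below is an assumption about what post-processing looks like, not the exact code:

```ruby
# Normalize a free-text treatment line into an integer plus an
# open-ended flag. Word list is illustrative.
LINE_WORDS = {
  "first" => 1, "1st" => 1,
  "second" => 2, "2nd" => 2,
  "third" => 3, "3rd" => 3,
}.freeze

def parse_treatment_line(text)
  word = text.downcase[/\b(first|1st|second|2nd|third|3rd)\b/]
  return nil unless word
  { line: LINE_WORDS[word], or_later: text.downcase.include?("or later") }
end

parse_treatment_line("second-line or later") # => { line: 2, or_later: true }
parse_treatment_line("First-line")           # => { line: 1, or_later: false }
```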

The eligibilities workflow: entity resolution

When ClinicalTrialsWorkflow completes, ClinicalTrialEligibilitiesWorkflow starts automatically. Its 30 steps resolve the free-text disease and biomarker names into canonical entities.

The resolution uses a 4-stage cascade for each term type (diseases, subtypes, biomarkers). Each stage runs in order. The first confident match wins.

flowchart LR
  Pop["Populate\nterms"] --> KW["Suggest\nkeywords\n(LLM)"]
  KW --> Sem["Semantic\ncandidates\n(pgvector)"]
  Sem --> NCI["NCI Thesaurus\ncandidates"]
  NCI --> Pick["Rank\ncandidates\n(LLM)"]
  Pick --> QA["Quality check\n(LLM)"]
  QA --> Judge["Final judge\n(GPT-4.1)"]
  Judge --> Post["Post-process\nand link"]

Stage 1: Keyword suggestion. GPT-4.1 generates search keywords from the raw term. “Cancer of lung, non small cell” produces keywords like “NSCLC”, “non-small cell lung cancer.”

Stage 2: Candidate search. Two parallel searches find candidates. Semantic search uses pgvector nearest-neighbor against OpenAI embeddings. NCI Thesaurus search uses the NcitService API with exact and fuzzy matching.
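What the pgvector search does can be shown in plain Ruby: rank candidates by cosine distance to the query embedding. In production this is a SQL ORDER BY with pgvector's distance operator and a LIMIT; the 3-dimensional vectors below are toy stand-ins for OpenAI embeddings:

```ruby
# Cosine distance between two equal-length vectors.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  1.0 - dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Return the k candidate names closest to the query embedding.
def nearest(query, candidates, k: 2)
  candidates.min_by(k) { |_name, vec| cosine_distance(query, vec) }.map(&:first)
end

candidates = {
  "non-small cell lung cancer" => [0.9, 0.1, 0.0],
  "small cell lung cancer"     => [0.7, 0.3, 0.0],
  "breast cancer"              => [0.0, 0.1, 0.9],
}

nearest([1.0, 0.1, 0.0], candidates)
# => ["non-small cell lung cancer", "small cell lung cancer"]
```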

Stage 3: Ranking. GPT-4.1 ranks the candidates by relevance. The best match gets a confidence score.

Stage 4: Judging. A final GPT-4.1 call validates the top match. Ambiguous cases get consensus voting from multiple LLM calls. High-confidence matches link automatically. Low-confidence matches queue for human review.
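The consensus step can be sketched as a simple majority vote over repeated judge calls. The threshold is an assumption; only the vote-then-escalate shape comes from the description above:

```ruby
# Accept the top candidate only if a clear majority of the judge calls
# agree; otherwise escalate to human review.
def consensus(votes, threshold: 0.6)
  winner, count = votes.tally.max_by { |_candidate, n| n }
  count.fdiv(votes.size) >= threshold ? winner : :needs_human_review
end

consensus(%w[C2926 C2926 C4872]) # => "C2926" (2/3 agree)
consensus(%w[C2926 C4872 C9133]) # => :needs_human_review
```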

| Detail | Value |
| --- | --- |
| Cron schedule | 0 8 * * 2 (Tuesday 08:00 UTC) |
| Job class | ClinicalTrialsWorkflowSchedulerJob |
| Concurrency guard | Skips if a previous run is still active |
| Manual trigger | rails runner "ClinicalTrialsWorkflowSchedulerJob.perform_now" |
| AACT connection | Read-only. Configured via AACT_DB_* env vars. |
| ChiCTR scraping | Rate-limited to 2 req/sec. 140+ cancer keyword searches. |
| LLM model | GPT-4.1 for extraction, matching, and judging |
| Service | Purpose |
| --- | --- |
| CtGovService | Fetches version histories from ClinicalTrials.gov API |
| ChictrTrialsService | Scrapes ChiCTR with headless Chrome |
| AactEligibilityTransformer | Transforms AACT eligibility data |
| OpenAiService | LLM calls for extraction and matching |
| DiseaseMatchingService | Disease entity resolution (NCI-coded) |
| BiomarkerMatchingService | Biomarker entity resolution |
| TermMatchingService | Generic 4-stage term matching |
| SimpleCandidateMatchingService | Strategy-based candidate search |
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Workflow stuck at “Sync Versions” | CT.gov API rate limiting (HTTP 429) | Wait 10-15 min. The service retries with backoff. |
| ChiCTR returns 0 results | Anti-bot detection triggered | Check user agent rotation. Try smaller keyword set. |
| Eligibility detection timeout | Too many pending trials | Run with --limit flag to process in chunks. |
| Term matching low quality | NCI Thesaurus API down | Check NCI API. Fall back to semantic-only. |
| “Workflow already running” skip | Previous run did not complete | Mark stale workflow as complete in ActiveAdmin. |
| Study plans missing ChiCTR | Markdown generation skipped | Re-run clinical_trials:chictr:generate_markdown. |
  • Drug approvals — How regulatory data from four agencies becomes unified approval records
  • Data model — Deep dive into the trial tables and their relationships