
Clinical trials pipeline

Every week, Data Gov collects, reconciles, and enriches 72,000+ clinical trials from two registries. This page follows a trial from raw registry data to a fully structured record in the knowledge graph. It covers the pipeline architecture, entity resolution, and the domain knowledge you need to understand what the data means.

Clinical trials are the experimental evidence behind every drug. When Pfizer’s Paxlovid enters Phase III, that trial generates data about which diseases it targets, which biomarkers it requires, what endpoints it measures, and what results it produces. Data Gov captures all of this.

But trial registries are messy. ClinicalTrials.gov accepts free-text, self-reported data. The same drug appears as “Pembrolizumab”, “KEYTRUDA”, “MK-3475”, and “lambrolizumab” across different trial records. Disease names are inconsistent. Eligibility criteria are unstructured paragraphs. Converting this into structured, linked knowledge is the core challenge.

Data Gov collects trials from two registries.

AACT (Aggregate Analysis of ClinicalTrials.gov) mirrors the US federal registry as a PostgreSQL database. Data Gov connects to it as a read-only external database. Trials carry NCT identifiers (e.g., NCT04380636). AACT covers the vast majority of global oncology trials.

ChiCTR (Chinese Clinical Trial Registry) has no API. Data Gov scrapes it weekly using headless Chrome (Ferrum gem). Trials carry ChiCTR identifiers (e.g., ChiCTR2500108641). The registry covers 22,000+ oncology trials not in ClinicalTrials.gov — a significant data source that most competing platforms miss.

| Registry | Access method | Identifier | Approximate scale |
| --- | --- | --- | --- |
| ClinicalTrials.gov (AACT) | Read-only PostgreSQL connection | NCT{number} | ~50,000 oncology trials |
| ChiCTR | Headless Chrome scraping | ChiCTR{number} | ~22,000 oncology trials |

The pipeline runs as two chained workflows every Tuesday at 08:00 UTC. ClinicalTrialsWorkflow (22 steps) handles collection and extraction. ClinicalTrialEligibilitiesWorkflow (30 steps) handles disease and biomarker entity resolution. The second triggers automatically when the first completes via an after_complete hook.

flowchart TB
  subgraph Phase1["Phase 1: Reconciliation"]
    SV["Sync version\nhistories"]
    RE["Diff against\nAACT"]
    DE["Detect eligibility\nchanges (LLM)"]
    CL["Clean stale\nflags"]
    RS["Reset changed\ncriteria"]
  end

  subgraph Phase2["Phase 2: Collection"]
    CA["Collect AACT\n(Node.js)"]
    CC["Collect ChiCTR\n(Ferrum scraper)"]
  end

  subgraph Phase3["Phase 3: Study Plans"]
    GM["Generate ChiCTR\nmarkdown"]
    EPA["Extract plans\n(AACT)"]
    EPC["Extract plans\n(ChiCTR)"]
    MSP["Match plan\ncomponents"]
  end

  subgraph Phase4["Phase 4: Shared Processing"]
    SO["Sync outcomes"]
    SR["Collect results"]
    LS["Link sponsors\nto orgs"]
    BC["Build conditions"]
  end

  subgraph Phase5["Phase 5: Participation Criteria"]
    PC["Extract criteria\n(LLM)"]
    PP["Post-process\nand link"]
  end

  subgraph Phase6["Phase 6: Finalization"]
    PR["Pull PubMed\nreferences"]
    UV["Update version\nmarkers"]
  end

  SV --> RE --> DE --> CL --> RS
  RS --> CA & CC
  CA --> EPA & SO & SR & BC
  CC --> GM --> EPC
  EPA & EPC --> MSP
  BC --> PC --> PP --> PR --> UV

  subgraph Eligibilities["Auto-triggered: Eligibilities Workflow"]
    DM["Match diseases\n(4-stage cascade)"]
    BM["Match biomarkers\n(4-stage cascade)"]
    SM["Match subtypes"]
  end

  UV --> DM --> BM --> SM
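The after_complete chaining described above can be sketched with a minimal, hypothetical workflow class. The class and method names below are illustrative stand-ins, not the actual Data Gov implementation:

```ruby
# Minimal sketch of one workflow triggering the next via an
# after_complete hook. Names are illustrative, not the real classes.
class Workflow
  attr_reader :name, :log

  def initialize(name, log)
    @name = name
    @log = log
    @after_complete = nil
  end

  # Register a callback that fires once every step has finished.
  def after_complete(&block)
    @after_complete = block
  end

  def run
    @log << "#{name}: running steps"
    @after_complete&.call
  end
end

log = []
trials = Workflow.new("ClinicalTrialsWorkflow", log)
eligibilities = Workflow.new("ClinicalTrialEligibilitiesWorkflow", log)

# Chain the workflows: eligibilities starts only when trials completes.
trials.after_complete { eligibilities.run }
trials.run
```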

Phase 1: Reconciliation

Before collecting new data, the pipeline detects what changed since the last run. This avoids reprocessing all 72,000+ trials every week.

Sync versions. The CtGovService calls the ClinicalTrials.gov API to fetch version histories for every tracked NCT ID. The version arrays are stored in the versions JSONB column on clinical_trials.

Reconcile existing trials. Compares the persisted current_version against new versions. For each trial with changes, fetches updated eligibility, conditions, and references from AACT. Processes in parallel: 5 threads, batches of 100. Tracks which modules changed (Eligibility, Conditions, Adverse Events, Study Design).
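The diff step boils down to comparing each trial's persisted version against the freshly fetched version list. A plain-Ruby sketch (field names are assumptions; the real comparison lives in the workflow):

```ruby
# Return the NCT IDs whose fetched version history has advanced past the
# persisted current_version, i.e. the trials that need re-fetching.
def changed_trials(persisted, fetched_versions)
  persisted.filter_map do |nct_id, current_version|
    latest = fetched_versions.fetch(nct_id, []).max
    nct_id if latest && latest > current_version
  end
end

persisted = { "NCT04380636" => 3, "NCT01234567" => 7 }
fetched   = { "NCT04380636" => [1, 2, 3, 4], "NCT01234567" => [5, 6, 7] }
changed_trials(persisted, fetched) # => ["NCT04380636"]
```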

Detect eligibility changes (LLM). Not every text change is meaningful. GPT-4.1 reads the old and new eligibility text and determines if the change is semantically significant. This prevents unnecessary reprocessing when a site merely fixes a typo.
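The shape of that check can be sketched as a prompt builder plus a yes/no parse. The prompt wording below is hypothetical; only the old-text/new-text-in, boolean-out shape is taken from the pipeline description:

```ruby
# Hypothetical sketch of the significance check sent to GPT-4.1.
def significance_prompt(old_text, new_text)
  <<~PROMPT
    Compare the two eligibility criteria below. Answer SIGNIFICANT if the
    change alters who can enroll (diseases, biomarkers, stages, prior
    therapy), or TRIVIAL if it is only a typo or formatting fix.

    OLD:
    #{old_text}

    NEW:
    #{new_text}
  PROMPT
end

# Parse the model's one-word verdict into a boolean.
def significant?(llm_answer)
  llm_answer.strip.upcase.start_with?("SIGNIFICANT")
end

significant?("TRIVIAL - punctuation only") # => false
```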

Clean and reset. Stale pending flags are cleared. Participation criteria for trials with confirmed meaningful changes are reset and queued for re-extraction in Phase 5.

Phase 2: Collection

The AACT and ChiCTR collectors run in parallel.

A Node.js script (data-collection-job:6 on AWS Batch) fetches new and updated trials from the AACT database. It creates ClinicalTrial records and generates markdown summaries in-stream. The markdown feeds into study plan extraction.

The ChictrTrialsService scrapes the ChiCTR website using Ferrum (headless Chrome). It runs 140+ cancer keyword searches, rate-limited to 2 requests/second with user agent rotation. Raw XML files are stored in S3 at silver/chictr/{run_id}/. The scraper uses advisory locks to prevent concurrent imports from colliding.
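The scraper's politeness controls can be sketched as a small throttle-plus-rotation helper. This is a minimal illustration (the real ChictrTrialsService drives Ferrum; the class below is hypothetical):

```ruby
# Sketch of a 2 req/sec throttle with rotating user agents.
class PoliteFetcher
  USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
  ].freeze

  def initialize(min_interval: 0.5) # 0.5 s gap => 2 requests/second
    @min_interval = min_interval
    @last_request_at = nil
    @ua_index = 0
  end

  # Return the next user agent, cycling through the pool.
  def next_user_agent
    ua = USER_AGENTS[@ua_index % USER_AGENTS.size]
    @ua_index += 1
    ua
  end

  # Sleep just long enough to keep requests at least min_interval apart.
  def throttle
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @last_request_at
      wait = @min_interval - (now - @last_request_at)
      sleep(wait) if wait.positive?
    end
    @last_request_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```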

Phase 3: Study plans

Every trial describes its treatment arms in free text. Study plan extraction turns this into structured data: which drugs, at what doses, in which arms.

GPT-4.1 reads the trial markdown and extracts StudyPlanArm records (EXPERIMENTAL, COMPARATOR, CONTROL) and StudyPlanComponent records (individual drugs with dosing JSONB). Each component carries an investigational_component flag marking whether it is the experimental drug.
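A sketch of what an extracted plan might look like, with the kind of sanity check a pipeline would run afterward. The field names follow the models named above, but the exact JSONB schema is an assumption:

```ruby
# Illustrative shape of an extracted study plan.
plan = {
  arms: [
    {
      arm_type: "EXPERIMENTAL",
      components: [
        { name: "Pembrolizumab", investigational_component: true,
          dosing: { amount: 200, unit: "mg", frequency: "every 3 weeks" } },
      ],
    },
    {
      arm_type: "COMPARATOR",
      components: [
        { name: "Docetaxel", investigational_component: false,
          dosing: { amount: 75, unit: "mg/m2", frequency: "every 3 weeks" } },
      ],
    },
  ],
}

ARM_TYPES = %w[EXPERIMENTAL COMPARATOR CONTROL].freeze

# Sanity check: every arm has a recognized type and at least one component.
def valid_plan?(plan)
  plan[:arms].all? do |arm|
    ARM_TYPES.include?(arm[:arm_type]) && arm[:components].any?
  end
end

valid_plan?(plan) # => true
```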

Component matching maps extracted drug names to Bioloupe drug entities. It uses a 4-stage fallback chain:

  1. Database match against drugs.name and drugs.all_synonyms
  2. NCI Thesaurus exact match
  3. NCI fuzzy match + LLM confirmation
  4. Create a new NcitConcept for manual resolution

These steps run for both AACT and ChiCTR trials.
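The first-match-wins chain above can be sketched as a list of stages tried in order. The stage lookups here are stubbed lambdas; only the cascade shape reflects the pipeline:

```ruby
# Try each stage in order; the first stage returning a match wins.
# Stage 4 (create a new NcitConcept) is the terminal fallback.
def match_component(name, stages)
  stages.each do |stage_name, stage|
    result = stage.call(name)
    return { stage: stage_name, match: result } if result
  end
  { stage: :create_ncit_concept, match: nil } # park for manual resolution
end

stages = {
  database_match: ->(name) { { "pembrolizumab" => "Drug#42" }[name.downcase] },
  ncit_exact:     ->(name) { nil },  # stubbed: no NCI Thesaurus exact hit
  ncit_fuzzy_llm: ->(name) { nil },  # stubbed: no fuzzy + LLM hit
}

match_component("Pembrolizumab", stages)
# => { stage: :database_match, match: "Drug#42" }
match_component("mystery-compound", stages)
# => { stage: :create_ncit_concept, match: nil }
```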

Phase 4: Shared processing

Sync outcomes. Fetches outcome measure definitions from AACT. Each outcome has a title, time frame, and type (Primary, Secondary, Other).

Collect study results. A Node.js script (4 vCPUs, 8 GB) pulls detailed statistical results: result groups, outcome measurements, baseline data, adverse events, and participant flows.

Link sponsors to organizations. Maps the free-text sponsor names from trial records to canonical Organisation records. Uses fuzzy matching against organisations.name and organisations.branch_names.
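Fuzzy matching of sponsor names usually starts with aggressive normalization. A sketch, assuming a suffix list like the one below (the real matching logic against organisations.name is internal):

```ruby
# Normalize a sponsor name before fuzzy-matching it to an Organisation.
# Strips punctuation and common corporate suffixes; the suffix list is
# an assumption for illustration.
CORPORATE_SUFFIXES = /\b(inc|ltd|llc|gmbh|corp|corporation|limited)\.?\z/i

def normalize_org(name)
  name.downcase
      .gsub(/[[:punct:]]/, " ")
      .squeeze(" ")
      .strip
      .sub(CORPORATE_SUFFIXES, "")
      .strip
end

normalize_org("Pfizer, Inc.")   # => "pfizer"
normalize_org("PFIZER LIMITED") # => "pfizer"
```

With both sides normalized, an exact or edit-distance comparison becomes far more reliable than matching the raw free-text strings.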

Build conditions. Creates ClinicalTrialCondition records from AACT condition data. These raw condition strings feed into the eligibilities workflow for disease matching.

Phase 5: Participation criteria extraction

Free-text eligibility criteria contain structured information buried in paragraphs. GPT-4.1 parses each trial’s eligibility text into structured fields.

What the LLM extracts from eligibility text:

  • Disease names and subtypes
  • Required biomarkers (e.g., “HER2-positive”, “EGFR mutation”)
  • Disease stages and extents
  • Treatment lines (“second-line or later”)
  • Treatment settings (neoadjuvant, adjuvant, metastatic)
  • Age ranges and ECOG performance scores

Results land in participation_criteria rows — one per disease-trial pair. Post-processing normalizes extracted values and prepares them for entity resolution.
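One normalization step can be sketched concretely: turning a free-text treatment line like "second-line or later" into comparable fields. The mapping below is an assumption about what post-processing looks like, not the exact code:

```ruby
# Normalize a free-text treatment line into an integer plus an
# open-ended flag. Word list is illustrative.
LINE_WORDS = {
  "first" => 1, "1st" => 1,
  "second" => 2, "2nd" => 2,
  "third" => 3, "3rd" => 3,
}.freeze

def parse_treatment_line(text)
  word = text.downcase[/\b(first|1st|second|2nd|third|3rd)\b/]
  return nil unless word
  { line: LINE_WORDS[word], or_later: text.downcase.include?("or later") }
end

parse_treatment_line("second-line or later") # => { line: 2, or_later: true }
parse_treatment_line("First-line")           # => { line: 1, or_later: false }
```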

The eligibilities workflow: entity resolution

When ClinicalTrialsWorkflow completes, ClinicalTrialEligibilitiesWorkflow starts automatically. Its 30 steps resolve the free-text disease and biomarker names into canonical entities.

The resolution uses a 4-stage cascade for each term type (diseases, subtypes, biomarkers). Each stage runs in order. The first confident match wins.

flowchart LR
  Pop["Populate\nterms"] --> KW["Suggest\nkeywords\n(LLM)"]
  KW --> Sem["Semantic\ncandidates\n(pgvector)"]
  Sem --> NCI["NCI Thesaurus\ncandidates"]
  NCI --> Pick["Rank\ncandidates\n(LLM)"]
  Pick --> QA["Quality check\n(LLM)"]
  QA --> Judge["Final judge\n(GPT-4.1)"]
  Judge --> Post["Post-process\nand link"]

Stage 1: Keyword suggestion. GPT-4.1 generates search keywords from the raw term. “Cancer of lung, non small cell” produces keywords like “NSCLC”, “non-small cell lung cancer.”

Stage 2: Candidate search. Two parallel searches find candidates. Semantic search uses pgvector nearest-neighbor against OpenAI embeddings. NCI Thesaurus search uses the NcitService API with exact and fuzzy matching.
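What the pgvector search does can be shown in plain Ruby: rank candidates by cosine distance to the query embedding. In production this is a SQL ORDER BY with pgvector's distance operator and a LIMIT; the 3-dimensional vectors below are toy stand-ins for OpenAI embeddings:

```ruby
# Cosine distance between two equal-length vectors.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  1.0 - dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Return the k candidate names closest to the query embedding.
def nearest(query, candidates, k: 2)
  candidates.min_by(k) { |_name, vec| cosine_distance(query, vec) }.map(&:first)
end

candidates = {
  "non-small cell lung cancer" => [0.9, 0.1, 0.0],
  "small cell lung cancer"     => [0.7, 0.3, 0.0],
  "breast cancer"              => [0.0, 0.1, 0.9],
}

nearest([1.0, 0.1, 0.0], candidates)
# => ["non-small cell lung cancer", "small cell lung cancer"]
```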

Stage 3: Ranking. GPT-4.1 ranks the candidates by relevance. The best match gets a confidence score.

Stage 4: Judging. A final GPT-4.1 call validates the top match. Ambiguous cases get consensus voting from multiple LLM calls. High-confidence matches link automatically. Low-confidence matches queue for human review.
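The consensus step can be sketched as a simple majority vote over repeated judge calls. The threshold is an assumption; only the vote-then-escalate shape comes from the description above:

```ruby
# Accept the top candidate only if a clear majority of the judge calls
# agree; otherwise escalate to human review.
def consensus(votes, threshold: 0.6)
  winner, count = votes.tally.max_by { |_candidate, n| n }
  count.fdiv(votes.size) >= threshold ? winner : :needs_human_review
end

consensus(%w[C2926 C2926 C4872]) # => "C2926" (2/3 agree)
consensus(%w[C2926 C4872 C9133]) # => :needs_human_review
```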

| Detail | Value |
| --- | --- |
| Cron schedule | 0 8 * * 2 (Tuesday 08:00 UTC) |
| Job class | ClinicalTrialsWorkflowSchedulerJob |
| Concurrency guard | Skips if a previous run is still active |
| Manual trigger | rails runner "ClinicalTrialsWorkflowSchedulerJob.perform_now" |
| AACT connection | Read-only. Configured via AACT_DB_* env vars. |
| ChiCTR scraping | Rate-limited to 2 req/sec. 140+ cancer keyword searches. |
| LLM model | GPT-4.1 for extraction, matching, and judging |
| Service | Purpose |
| --- | --- |
| CtGovService | Fetches version histories from ClinicalTrials.gov API |
| ChictrTrialsService | Scrapes ChiCTR with headless Chrome |
| AactEligibilityTransformer | Transforms AACT eligibility data |
| OpenAiService | LLM calls for extraction and matching |
| DiseaseMatchingService | Disease entity resolution (NCI-coded) |
| BiomarkerMatchingService | Biomarker entity resolution |
| TermMatchingService | Generic 4-stage term matching |
| SimpleCandidateMatchingService | Strategy-based candidate search |
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Workflow stuck at “Sync Versions” | CT.gov API rate limiting (HTTP 429) | Wait 10-15 min. The service retries with backoff. |
| ChiCTR returns 0 results | Anti-bot detection triggered | Check user agent rotation. Try smaller keyword set. |
| Eligibility detection timeout | Too many pending trials | Run with --limit flag to process in chunks. |
| Term matching low quality | NCI Thesaurus API down | Check NCI API. Fall back to semantic-only. |
| “Workflow already running” skip | Previous run did not complete | Mark stale workflow as complete in ActiveAdmin. |
| Study plans missing ChiCTR | Markdown generation skipped | Re-run clinical_trials:chictr:generate_markdown. |
  • Drug approvals — How regulatory data from four agencies becomes unified approval records
  • Data model — Deep dive into the trial tables and their relationships