
ChiCTR pipeline

ChiCTR (Chinese Clinical Trial Registry) has no public API. All data is scraped via Ferrum browser automation with anti-bot evasion, S3-based deduplication, checkpoint recovery, and budget-controlled weekly runs. The pipeline covers collection, import into 10 database tables, LLM-powered study plan extraction, and Chinese organisation sponsor linking.

For how ChiCTR fits into the broader trial pipeline, see Clinical trials pipeline.

Key services: app/services/chictr/
Thor tasks: lib/tasks/clinical_trials/chictr_weekly.thor, chictr_import.thor, chictr_collector.thor, chictr_trials_scrapper.thor, chictr_study_plan.thor


| Use case | Command |
| --- | --- |
| Full weekly run | bundle exec thor clinical_trials:chictr_weekly:weekly_run |
| Fetch phase only | bundle exec thor clinical_trials:chictr_weekly:weekly_run --phase fetch |
| Import phase only | bundle exec thor clinical_trials:chictr_weekly:weekly_run --phase import |
| Custom budget | bundle exec thor clinical_trials:chictr_weekly:weekly_run --page-budget 2000 --max-pages-per-disease 50 |
| Resume after crash | bundle exec thor clinical_trials:chictr_weekly:weekly_run --run-id SAME_RUN_ID |
| Manual single disease | bundle exec thor clinical_trials:chictr:collect_disease diabetes --pages 5 |
| Import from S3 | bundle exec thor clinical_trials:chictr_import:import_xml --directory tmp/chictr/run |
| Generate markdown | bundle exec thor clinical_trials:chictr:generate_markdown --run_id=batch_001 --storage=s3 |
| LLM extraction (ChiCTR) | bundle exec thor clinical_trials:study_plan_refactor:extraction --run-id=batch_001 --chictr-only |
| Import organisations | bundle exec thor one_off:chictr_organisations:import --limit 500 |
| Link sponsors | bundle exec thor clinical_trials:sponsors:link_all_chictr |
| Test single trial | bundle exec thor clinical_trials:chictr_scrapper:test 280689 |

The scraper runs reverse pagination (newest trials first, page N → 1) with a budget of 4,000 pages/week and 30 pages/disease max. Rate: 4-6 seconds per trial.

Checkpoints. Every 50 pages (~8 min of work), the scraper saves its state to both S3 and local disk. On SIGTERM/SIGINT, it finishes the current page, saves a checkpoint, and exits gracefully.
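
The checkpoint behaviour can be illustrated with a minimal sketch. It is not the code in weekly_scraper.rb: the class name CheckpointedRun, its methods, and the JSON checkpoint shape are all hypothetical, and an in-memory array stands in for the S3 and local-disk writes. The signal handler only sets a flag, so the page in progress always completes before the checkpoint is written.

```ruby
require "json"

# Hypothetical sketch: class, method names and checkpoint format are assumptions.
class CheckpointedRun
  CHECKPOINT_INTERVAL = 50 # pages, as documented

  attr_reader :pages_done, :checkpoints

  def initialize(interval: CHECKPOINT_INTERVAL)
    @interval = interval
    @pages_done = 0
    @checkpoints = [] # stands in for the S3 + local disk writes
    @stop_requested = false
  end

  # Safe to call from a signal handler: it only flips a flag.
  def request_stop!
    @stop_requested = true
  end

  def install_signal_handlers
    %w[TERM INT].each { |sig| Signal.trap(sig) { request_stop! } }
  end

  # Processes pages one by one; the block does the actual fetching.
  def run(pages)
    pages.each do |page|
      yield page
      @pages_done += 1
      if @stop_requested
        save_checkpoint # final checkpoint after the current page finished
        break
      end
      save_checkpoint if (@pages_done % @interval).zero?
    end
    self
  end

  private

  def save_checkpoint
    @checkpoints << JSON.generate(pages_done: @pages_done)
  end
end
```

In a real run, install_signal_handlers would be called once before run, so that SIGTERM or SIGINT leads to a clean exit at the next page boundary.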

Resume. Same run_id = crash recovery (loads checkpoint, skips already-fetched trials). New run_id = incremental run (loads summary_latest.json from previous week, adds new diseases automatically). A sanity check aborts keyword refresh if the new keyword list is less than 50% of the existing size (data source failure protection).
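
The keyword sanity check amounts to a single ratio comparison. The sketch below assumes a hypothetical helper name and treats an empty existing list as always safe, which the source does not specify.

```ruby
# Hypothetical helper: returns false when the refreshed keyword list has
# shrunk below min_ratio of the existing list (likely a data source failure).
def keyword_refresh_safe?(existing_count, new_count, min_ratio: 0.5)
  return true if existing_count.zero? # assumption: nothing to protect yet
  new_count >= existing_count * min_ratio
end
```

With 800 existing keywords, a refresh that yields 399 would be rejected, while one that yields 400 or more would proceed.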

Budget-efficient change detection. Every week, page 1 of every disease is fetched without counting against the page budget. For completed diseases, page 1 is compared against saved snapshots to detect new, updated, and removed trials. Only changed trials are re-fetched (0 budget pages consumed).
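
One way to picture the snapshot comparison is as a diff between two maps. The sketch assumes a snapshot is a hash of registration number to project ID (the same pair used as the dedup key); the real snapshot format and the function name diff_snapshot are assumptions.

```ruby
# Hypothetical sketch: snapshots are assumed to be { reg_num => proj_id } hashes.
def diff_snapshot(saved, current)
  {
    # present on page 1 now, absent from the saved snapshot
    new: current.keys - saved.keys,
    # same registration number, different project ID
    updated: current.select { |reg, proj| saved.key?(reg) && saved[reg] != proj }.keys,
    # in the saved snapshot, no longer on page 1
    removed: saved.keys - current.keys
  }
end
```

Only the registration numbers under :new and :updated would need to be fetched again.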

| Error type | Strategy |
| --- | --- |
| Network (scraping) | 8 retries, exponential backoff + jitter (2s base, 10s cap) |
| S3 upload | 3 retries, delete local file on failure (S3 is source of truth) |
| Database | Batch insert_all with individual retry fallback on duplicate/validation |
| Advisory lock | 30s timeout, abort if busy |

| Recovery scenario | Action |
| --- | --- |
| XML parse error | Log + skip trial, continue |
| OOM crash | Restart with same run_id, registry rebuilds from S3 in ~30s |
| S3 upload failure | Delete local file (maintains S3 source of truth) |
| FK violation on re-import | Uses destroy_all to trigger Rails cascade deletion |
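
The network retry strategy can be sketched as follows, using the documented values (8 attempts, 2s base, 10s cap). The jitter formula is an assumption: here each capped delay is scaled by a random factor between 0.5 and 1.0, which may differ from the scraper's actual scheme. The helper names and the sleeper parameter are hypothetical, and the ceiling of 8 is treated as total attempts.

```ruby
RETRY_MAX_ATTEMPTS = 8
RETRY_BASE_DELAY = 2.0   # seconds
RETRY_MAX_DELAY = 10.0   # seconds

# Delay before the next try: exponential growth, capped, then jittered.
# Assumption: jitter scales the capped delay by a factor in 0.5..1.0.
def backoff_delay(attempt, rng: Random.new)
  capped = [RETRY_BASE_DELAY * (2**(attempt - 1)), RETRY_MAX_DELAY].min
  capped * rng.rand(0.5..1.0)
end

# Runs the block, retrying on StandardError until max_attempts is reached.
# The sleeper is injectable so the logic can be exercised without waiting.
def with_retries(max_attempts: RETRY_MAX_ATTEMPTS, sleeper: ->(s) { sleep(s) }, rng: Random.new)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts # give up and surface the last error
    sleeper.call(backoff_delay(attempt, rng: rng))
    retry
  end
end
```

A caller would wrap a single page fetch in with_retries { ... }; the final failure is re-raised so the surrounding run can decide whether to skip or abort.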

ChiCTR uses Cloudflare protection, browser fingerprinting, and request pattern detection. Evasion is implemented in app/services/chictr/web_scraper.rb:

| Technique | Implementation |
| --- | --- |
| Full browser (Ferrum) | Headless Chrome with JS execution |
| UA rotation | 5 realistic signatures, rotated in lib/browser_manager.rb |
| Rate limiting | 1.0s mandatory sleep between requests |
| Human-like scrolling | 300px scroll simulation |
| Network idle waiting | Wait for all AJAX/JS before extraction |
| Cloudflare detection | Check for “Just a moment” / “Checking your browser” pages |
| Browser health rotation | Replace browser every 15 requests or 30 minutes |
| Automation flag hiding | Disable AutomationControlled blink feature |
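
Two of these rules, challenge-page detection and browser health rotation, are simple enough to sketch without touching the browser itself. The code below deliberately avoids any Ferrum calls; the names cloudflare_challenge? and BrowserHealth are hypothetical, and only the marker strings and the 15-request / 30-minute limits come from the table above.

```ruby
CHALLENGE_MARKERS = ["Just a moment", "Checking your browser"].freeze

# True when the page body looks like a Cloudflare interstitial.
def cloudflare_challenge?(html)
  CHALLENGE_MARKERS.any? { |marker| html.include?(marker) }
end

# Hypothetical tracker deciding when a browser instance should be replaced.
class BrowserHealth
  MAX_REQUESTS = 15
  MAX_AGE_SECONDS = 30 * 60

  def initialize(started_at: Time.now)
    @started_at = started_at
    @requests = 0
  end

  def record_request!
    @requests += 1
  end

  # Stale once either limit is reached, whichever comes first.
  def stale?(now: Time.now)
    @requests >= MAX_REQUESTS || (now - @started_at) >= MAX_AGE_SECONDS
  end
end
```

The scraper would check stale? before each request and start a fresh browser (and a fresh BrowserHealth) when it returns true.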

app/services/chictr/disease_keywords_loader.rb builds the search keyword list:

3 sources (thousands raw) → normalize → length filter (>4 chars) → blacklist → substring dedup → hundreds of keywords
  1. Database diseases: Disease.where(simplified: true, deleted_at: nil) — names and synonym variations (largest source)
  2. MeSH terms: From lib/tasks/data/mesh_terms_by_therapeutic_areas.csv, filtered by Oncology/Hematology
  3. Cancer keywords: Constants in app/services/chictr/cancer_keywords_processor.rb

Substring deduplication keeps only the shortest parent terms: for example, “leukemia” removes hundreds of subtypes that contain it. Abbreviated forms that do not contain a kept term survive (e.g. “aids-related nhl”).
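
The filter chain can be sketched in a few lines. This is an illustration, not the loader's code: the function name build_keywords, the exact normalization (lowercase, trimmed, collapsed whitespace), and the sample blacklist are assumptions. Processing terms from shortest to longest means a term is dropped as soon as any already-kept term is a substring of it.

```ruby
# Hypothetical sketch of: normalize -> length filter -> blacklist -> substring dedup.
def build_keywords(raw_terms, blacklist: [])
  normalized = raw_terms.map { |t| t.to_s.strip.downcase.gsub(/\s+/, " ") }.uniq
  candidates = normalized.select { |t| t.length > 4 } - blacklist

  kept = []
  # Shortest first, so parent terms are kept before their longer subtypes.
  candidates.sort_by { |t| [t.length, t] }.each do |term|
    kept << term unless kept.any? { |parent| term.include?(parent) }
  end
  kept
end

terms = [
  "Leukemia", "acute myeloid leukemia", "chronic lymphocytic  leukemia",
  "AIDS-related NHL", "ALL", "lymphoma", "Hodgkin lymphoma", "study"
]
p build_keywords(terms, blacklist: ["study"])
# => ["leukemia", "lymphoma", "aids-related nhl"]
```

The pairwise check is quadratic in the number of candidates, which is acceptable for a list of a few thousand terms built once per run.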

The deduplication registry is an in-memory hash with O(1) lookups, built from the S3 file listing. Composite key: (registration_number, project_id).

flowchart TD
    Start([Trial]) --> CheckExact{Exact match?<br/>reg_num AND proj_id}
    CheckExact -->|Yes| Skip[SKIP]
    CheckExact -->|No| CheckUpdate{Same reg_num<br/>different proj_id?}
    CheckUpdate -->|Yes| Update[UPDATE - Fetch new version]
    CheckUpdate -->|No| New[NEW - Fetch]

Cross-disease deduplication: shared registry prevents duplicate fetches within a run. Crash recovery: registry rebuilds from S3 in ~30s for 5K trials.
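
The decision flow above maps onto a small lookup structure. The sketch below is an assumption about shape, not the actual registry class: TrialRegistry and its methods are hypothetical names, and the keys are parsed from the documented {reg_num}__{proj_id}.xml file naming. The sample identifiers in the usage lines are made up.

```ruby
require "set"

# Hypothetical registry: registration number => set of known project IDs.
class TrialRegistry
  def initialize(s3_keys = [])
    @by_reg = Hash.new { |hash, reg| hash[reg] = Set.new }
    s3_keys.each do |key|
      reg, proj = File.basename(key, ".xml").split("__", 2)
      next if reg.nil? || proj.nil? # ignore files that do not follow the naming scheme
      register(reg, proj)
    end
  end

  def register(reg_num, proj_id)
    @by_reg[reg_num] << proj_id
  end

  # :skip   exact (reg_num, proj_id) match
  # :update same reg_num, different proj_id
  # :new    reg_num never seen
  def classify(reg_num, proj_id)
    return :new unless @by_reg.key?(reg_num)
    @by_reg[reg_num].include?(proj_id) ? :skip : :update
  end
end

registry = TrialRegistry.new(["silver/chictr/run1/ChiCTR0000001__1001.xml"])
registry.classify("ChiCTR0000001", "1001") # => :skip
registry.classify("ChiCTR0000001", "1002") # => :update
registry.classify("ChiCTR0000002", "2001") # => :new
```

Registering each trial right after it is fetched is what would give cross-disease deduplication within a run: the second disease to surface the same trial gets :skip.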


app/services/chictr/import_service.rb and app/services/chictr/xml_parser.rb parse XML (Nokogiri) into:

| Table | Relationship | Key fields |
| --- | --- | --- |
| clinical_trials | main | nct_id, titles, phase, status, dates, enrollment, llm_data (JSONB) |
| eligibilities | 1:1 | Inclusion/exclusion criteria, age, gender |
| sponsors | 1:N | Name, lead_or_collaborator, agency_class |
| overall_officials | 1:N | Name, affiliation, role |
| locations | 1:N | Parsed from Chinese addresses (provinces, municipalities, Hong Kong) |
| outcomes | 1:N | Primary/secondary/other with measure, time_frame |
| arms | 1:N | Title, description, group_type |
| clinical_trial_conditions | M:N | Condition names |
| study_plan_extractions | 1:1 | Parent record with run_id, extraction_confidence |
| study_plan_arms | 1:N | Arms with interventions JSONB |

Import is orchestrated by app/services/chictr/phase_two_importer.rb in memory-safe batches of 1,000 files with PostgreSQL advisory locks.

| Trial state | clinical_trials | Related tables | llm_data |
| --- | --- | --- | --- |
| New | INSERT | INSERT | NULL (populated later by LLM pipeline) |
| Existing | UPDATE metadata | SKIP | Preserved (excluded from upsert hash) |

The llm_data exclusion is intentional: it preserves roughly $0.50 per trial of LLM processing. The manual override --update_related=true forces a full re-import but destroys the existing LLM data.
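
Excluding a column from the upsert hash can be as simple as dropping its key before the write. The sketch below is a plain-Ruby illustration under assumed names (PRESERVED_COLUMNS, upsert_attributes) and does not reproduce the import service; the sample values are made up.

```ruby
# Columns that must never be overwritten for an existing trial (assumed constant name).
PRESERVED_COLUMNS = %i[llm_data].freeze

# For existing trials, strip preserved columns so an UPDATE leaves them untouched.
# New trials keep the full hash, so llm_data is inserted as NULL.
def upsert_attributes(parsed, existing:)
  return parsed unless existing
  parsed.reject { |column, _value| PRESERVED_COLUMNS.include?(column) }
end
```

Because the key is absent rather than set to nil, an update built from this hash cannot null out previously extracted LLM data.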

ChiCTR populates BOTH arms (raw import layer, legacy compatibility) and study_plan_arms (Study Plan v2 processing layer with LLM-ready structure). Study Plan v2 is the current primary processing approach. Intervention types are inferred via pattern matching on arm descriptions (drug/mg/tablet → DRUG, surgery/procedure → PROCEDURE, etc.), though the LLM now classifies types directly for better accuracy.
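
The pattern-matching fallback can be sketched as an ordered list of regular expressions. Only the drug/mg/tablet and surgery/procedure cues come from the text above; the constant and function names, the exact expressions, and the "OTHER" default are assumptions.

```ruby
# Ordered: the first matching pattern wins. Patterns are illustrative only.
INTERVENTION_PATTERNS = {
  "DRUG" => /(?:\b|\d)mg\b|\b(?:drug|tablet)s?\b/i,
  "PROCEDURE" => /\b(?:surgery|procedure)s?\b/i
}.freeze

def infer_intervention_type(description, default: "OTHER")
  match = INTERVENTION_PATTERNS.find { |_type, pattern| description.to_s.match?(pattern) }
  match ? match.first : default
end

infer_intervention_type("Metformin 500mg tablet twice daily") # => "DRUG"
infer_intervention_type("Laparoscopic surgery")               # => "PROCEDURE"
infer_intervention_type("Health education")                   # => "OTHER"
```

Keyword rules like these misfire on mixed descriptions (a surgical arm that mentions a dose, for instance), which is consistent with the move to direct LLM classification.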


Three data sources feed into app/services/chictr/import_organisations_service.rb:

  1. CSV (lib/tasks/data/organisations/chictr_organisations.csv): Curated organisations with semicolon-separated synonyms
  2. ORG_MAP (lib/tasks/clinical_trials/sponsors.thor): Hardcoded name variation mappings handling preposition, case, word order, and abbreviation differences
  3. Database discovery: Dynamic querying of ChiCTR sponsor frequencies for top-N organisations

Organisation.flexifind in app/models/concerns/searchable.rb runs a cascade:

  1. Exact (~5ms): name ILIKE :term OR branch_names JSONB contains term
  2. Fuzzy (~10-50ms): strict_word_similarity + similarity via GiST-indexed trigrams
  3. Levenshtein validation: Return only if similarity >= 85%

Processing: unlinked sponsors are handled in batches of 1,000 across 5 parallel threads.
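
The final validation step can be illustrated in plain Ruby. The sketch assumes similarity is defined as 1 minus the edit distance divided by the longer name's length, compared case-insensitively; the actual definition inside flexifind may differ, and the function names are hypothetical.

```ruby
# Classic two-row dynamic-programming Levenshtein distance.
def levenshtein(a, b)
  a_chars = a.chars
  b_chars = b.chars
  prev = (0..b_chars.length).to_a
  a_chars.each_with_index do |ca, i|
    curr = [i + 1]
    b_chars.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed definition: 1.0 means identical, 0.0 means nothing in common.
def name_similarity(a, b)
  a = a.downcase.strip
  b = b.downcase.strip
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

def confident_match?(query, candidate, threshold: 0.85)
  name_similarity(query, candidate) >= threshold
end
```

Under this definition, “West China Hospital, Sichuan University” would pass against “West China Hospital of Sichuan University”, while an unrelated institution that merely shares the word “University” would not.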

Chinese institution names follow patterns: university-affiliated hospitals (“West China Hospital of Sichuan University”), academy-affiliated hospitals, military institutions. The ORG_MAP handles comma variations, case, word order, and abbreviations.

Country handling: headquarters: "CN" (ISO 3166-1 alpha-2) + country: "China" (legacy). Character encoding: UTF-8 with non-breaking space (\xA0) handling.


| Service | File | Purpose |
| --- | --- | --- |
| Chictr::WebScraper | app/services/chictr/web_scraper.rb | Ferrum browser automation with anti-bot evasion |
| Chictr::ImportService | app/services/chictr/import_service.rb | XML-to-database import with dedup |
| Chictr::XmlParser | app/services/chictr/xml_parser.rb | Nokogiri XML extraction |
| Chictr::PhaseTwoImporter | app/services/chictr/phase_two_importer.rb | S3-to-database batch orchestration |
| Chictr::DiseaseKeywordsLoader | app/services/chictr/disease_keywords_loader.rb | Keyword generation from 3 sources |
| Chictr::ImportOrganisationsService | app/services/chictr/import_organisations_service.rb | Chinese organisation import |
| Chictr::WeeklyScraper | app/services/chictr/weekly_scraper.rb | Weekly run lifecycle and checkpoints |
| Chictr::SummaryManager | app/services/chictr/summary_manager.rb | Checkpoint persistence and resume |

| Content | S3 path | Local path |
| --- | --- | --- |
| XML files | silver/chictr/{run_id}/{reg_num}__{proj_id}.xml | tmp/chictr/{run_id}/ |
| Run summary | silver/chictr/{run_id}/summary_{run_id}.json | tmp/chictr/{run_id}/summary_{run_id}.json |
| Latest summary | silver/chictr/summary_latest.json | |
| Markdown | silver/study-plans/{run_id}/markdown/{nct_id}.md | tmp/study_plans/markdown/ |

| Constant | Value | Purpose |
| --- | --- | --- |
| RETRY_MAX_ATTEMPTS | 8 | Network retry ceiling |
| RETRY_BASE_DELAY | 2.0s | Exponential backoff base |
| RETRY_MAX_DELAY | 10.0s | Backoff cap |
| MAX_S3_RETRIES | 3 | S3 upload retries |
| SNAPSHOT_OVERLAP_THRESHOLD | 0.3 (30%) | Resume point integrity check |
| CHECKPOINT_INTERVAL | 50 pages (~8 min) | Checkpoint frequency |
| studytype | 1 | Interventional studies only |