ChiCTR pipeline
ChiCTR (Chinese Clinical Trial Registry) has no public API. All data is scraped via Ferrum browser automation with anti-bot evasion, S3-based deduplication, checkpoint recovery, and budget-controlled weekly runs. The pipeline covers collection, import into 10 database tables, LLM-powered study plan extraction, and Chinese organisation sponsor linking.
For how ChiCTR fits into the broader trial pipeline, see Clinical trials pipeline.
Key services: app/services/chictr/
Thor tasks: lib/tasks/clinical_trials/chictr_weekly.thor, chictr_import.thor, chictr_collector.thor, chictr_trials_scrapper.thor, chictr_study_plan.thor
Operator reference
Command reference
| Use case | Command |
|---|---|
| Full weekly run | bundle exec thor clinical_trials:chictr_weekly:weekly_run |
| Fetch phase only | bundle exec thor clinical_trials:chictr_weekly:weekly_run --phase fetch |
| Import phase only | bundle exec thor clinical_trials:chictr_weekly:weekly_run --phase import |
| Custom budget | bundle exec thor clinical_trials:chictr_weekly:weekly_run --page-budget 2000 --max-pages-per-disease 50 |
| Resume after crash | bundle exec thor clinical_trials:chictr_weekly:weekly_run --run-id SAME_RUN_ID |
| Manual single disease | bundle exec thor clinical_trials:chictr:collect_disease diabetes --pages 5 |
| Import from S3 | bundle exec thor clinical_trials:chictr_import:import_xml --directory tmp/chictr/run |
| Generate markdown | bundle exec thor clinical_trials:chictr:generate_markdown --run_id=batch_001 --storage=s3 |
| LLM extraction (ChiCTR) | bundle exec thor clinical_trials:study_plan_refactor:extraction --run-id=batch_001 --chictr-only |
| Import organisations | bundle exec thor one_off:chictr_organisations:import --limit 500 |
| Link sponsors | bundle exec thor clinical_trials:sponsors:link_all_chictr |
| Test single trial | bundle exec thor clinical_trials:chictr_scrapper:test 280689 |
Weekly run lifecycle
The scraper runs reverse pagination (newest trials first, page N → 1) with a budget of 4,000 pages/week and 30 pages/disease max. Rate: 4-6 seconds per trial.
Checkpoints. Every 50 pages (~8 min of work), the scraper saves state to both S3 and local disk. On SIGTERM/SIGINT, it finishes the current page, saves checkpoint, and exits gracefully.
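The checkpoint and graceful-shutdown behaviour can be sketched as below. This is a minimal illustration, not the code in Chictr::WeeklyScraper: the class name `CheckpointedRun`, its methods, and the injected `save_checkpoint` callable are all hypothetical. The signal handler only sets a flag, so the page in progress always completes before state is saved.

```ruby
# Hypothetical sketch: names are illustrative, not the pipeline's real API.
class CheckpointedRun
  CHECKPOINT_INTERVAL = 50 # pages between checkpoints, as documented above

  attr_reader :pages_done

  # save_checkpoint: any callable that persists a state hash
  # (the real pipeline writes to S3 and local disk).
  def initialize(save_checkpoint)
    @save_checkpoint = save_checkpoint
    @stop_requested = false
    @pages_done = 0
    @last_saved = 0
  end

  # Trap handlers only flip a flag; the main loop does the actual work.
  def install_signal_handlers
    %w[TERM INT].each { |sig| Signal.trap(sig) { @stop_requested = true } }
  end

  def request_stop
    @stop_requested = true
  end

  def run(pages)
    pages.each do |page|
      yield page # finish the current page before honouring a stop request
      @pages_done += 1
      if @stop_requested
        checkpoint
        return :interrupted
      end
      checkpoint if (@pages_done % CHECKPOINT_INTERVAL).zero?
    end
    checkpoint
    :completed
  end

  private

  def checkpoint
    return if @last_saved == @pages_done # nothing new to persist

    @save_checkpoint.call({ pages_done: @pages_done })
    @last_saved = @pages_done
  end
end
```

Injecting the persistence step as a callable keeps the loop testable without S3 access.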
Resume. Same run_id = crash recovery (loads checkpoint, skips already-fetched trials). New run_id = incremental run (loads summary_latest.json from previous week, adds new diseases automatically). A sanity check aborts keyword refresh if the new keyword list is less than 50% of the existing size (data source failure protection).
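A minimal sketch of the two decisions described above, assuming the run mode is derived from whether a checkpoint already exists for the given run_id. The method names and the `MIN_KEYWORD_RATIO` constant are hypothetical; only the 50% threshold comes from this page.

```ruby
# Hypothetical helpers; only the 50% threshold is taken from the docs.
MIN_KEYWORD_RATIO = 0.5

# Same run_id as an existing checkpoint means crash recovery;
# an unseen run_id means a new incremental run.
def resume_mode(run_id, checkpointed_run_ids)
  checkpointed_run_ids.include?(run_id) ? :crash_recovery : :incremental
end

# Returns false when the refreshed keyword list shrank below 50% of the
# existing list, which suggests the keyword source failed.
def keyword_refresh_safe?(existing_count, new_count)
  return true if existing_count.zero?

  new_count >= existing_count * MIN_KEYWORD_RATIO
end
```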
Budget-efficient change detection. Every week, page 1 of every disease is fetched without counting against the page budget. For completed diseases, page 1 is compared against saved snapshots to detect new/updated/removed trials. Only changed trials are re-fetched (0 budget pages consumed).
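The snapshot comparison might look like the sketch below. It assumes a snapshot is a hash of registration number to project id, mirroring the composite key used for S3 deduplication; the real snapshot format and the method name `diff_page_one` are assumptions. The identifiers in the usage example are made up.

```ruby
# Hypothetical sketch: snapshots are assumed to be { reg_num => proj_id } hashes.
def diff_page_one(saved, current)
  {
    new: current.keys - saved.keys,
    removed: saved.keys - current.keys,
    # Same registration number with a different project id counts as an update.
    updated: current.select { |reg, proj| saved.key?(reg) && saved[reg] != proj }.keys
  }
end

saved   = { "ChiCTR-A" => "100", "ChiCTR-B" => "200", "ChiCTR-C" => "300" }
current = { "ChiCTR-A" => "100", "ChiCTR-B" => "250", "ChiCTR-D" => "400" }
changes = diff_page_one(saved, current)
# changes[:new]     => ["ChiCTR-D"]
# changes[:removed] => ["ChiCTR-C"]
# changes[:updated] => ["ChiCTR-B"]
```

A trial missing from page 1 may simply have moved to a later page; how the pipeline treats that case is not specified here.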
Error handling
| Error type | Strategy |
|---|---|
| Network (scraping) | 8 retries, exponential backoff + jitter (2s base, 10s cap) |
| S3 upload | 3 retries, delete local file on failure (S3 is source of truth) |
| Database | Batch insert_all with individual retry fallback on duplicate/validation |
| Advisory lock | 30s timeout, abort if busy |

| Recovery scenario | Action |
|---|---|
| XML parse error | Log + skip trial, continue |
| OOM crash | Restart with same run_id, registry rebuilds from S3 in ~30s |
| S3 upload failure | Delete local file (maintains S3 source of truth) |
| FK violation on re-import | Uses destroy_all to trigger Rails cascade deletion |
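The network retry strategy from the first table can be sketched as follows. The constants match the documented values (8 attempts, 2s base, 10s cap); the jitter formula (50-100% of the capped delay) and the method names `backoff_delay` and `with_retries` are assumptions, since the page does not say how jitter is applied.

```ruby
RETRY_MAX_ATTEMPTS = 8
RETRY_BASE_DELAY = 2.0  # seconds
RETRY_MAX_DELAY = 10.0  # seconds

# Exponential backoff capped at RETRY_MAX_DELAY.
# Assumption: jitter scales the capped delay to 50-100% of its value.
def backoff_delay(attempt, rng: Random.new)
  capped = [RETRY_BASE_DELAY * (2**(attempt - 1)), RETRY_MAX_DELAY].min
  capped * (0.5 + rng.rand * 0.5)
end

# Runs the block, retrying on StandardError until it succeeds or the
# attempt ceiling is reached. The sleeper is injectable for testing.
def with_retries(sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= RETRY_MAX_ATTEMPTS

    sleeper.call(backoff_delay(attempt))
    retry
  end
end
```

With these values the uncapped delay reaches the 10s cap from the fourth attempt onward.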
Collection system
Anti-bot strategy
ChiCTR uses Cloudflare protection, browser fingerprinting, and request pattern detection. Evasion is implemented in app/services/chictr/web_scraper.rb:
| Technique | Implementation |
|---|---|
| Full browser (Ferrum) | Headless Chrome with JS execution |
| UA rotation | 5 realistic signatures, rotated in lib/browser_manager.rb |
| Rate limiting | 1.0s mandatory sleep between requests |
| Human-like scrolling | 300px scroll simulation |
| Network idle waiting | Wait for all AJAX/JS before extraction |
| Cloudflare detection | Check for “Just a moment” / “Checking your browser” pages |
| Browser health rotation | Replace browser every 15 requests or 30 minutes |
| Automation flag hiding | Disable AutomationControlled blink feature |
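Two of the rows above, browser health rotation and Cloudflare detection, reduce to simple checks that can be sketched without a browser. The class `BrowserHealth` and its methods are hypothetical; the thresholds (15 requests, 30 minutes) and the challenge-page phrases come from the table. Nothing here calls Ferrum itself.

```ruby
# Hypothetical sketch of the rotation policy; not the code in web_scraper.rb.
class BrowserHealth
  MAX_REQUESTS = 15
  MAX_AGE_SECONDS = 30 * 60
  CHALLENGE_MARKERS = ["Just a moment", "Checking your browser"].freeze

  # clock is injectable so the age check can be tested without waiting.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    reset
  end

  def record_request
    @requests += 1
  end

  # True once the browser has served 15 requests or lived for 30 minutes.
  def rotate?
    @requests >= MAX_REQUESTS || (@clock.call - @started_at) >= MAX_AGE_SECONDS
  end

  # Call after replacing the browser instance.
  def reset
    @requests = 0
    @started_at = @clock.call
  end

  # Detects a Cloudflare interstitial by its visible text.
  def self.cloudflare_challenge?(html)
    CHALLENGE_MARKERS.any? { |marker| html.include?(marker) }
  end
end
```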
Disease keywords
app/services/chictr/disease_keywords_loader.rb builds the search keyword list:
3 sources (thousands raw) → normalize → length filter (>4 chars) → blacklist → substring dedup → hundreds of keywords
- Database diseases: `Disease.where(simplified: true, deleted_at: nil)`, names and synonym variations (largest source)
- MeSH terms: From `lib/tasks/data/mesh_terms_by_therapeutic_areas.csv`, filtered by Oncology/Hematology
- Cancer keywords: Constants in `app/services/chictr/cancer_keywords_processor.rb`
Substring deduplication keeps only the shortest parent terms: for example, “leukemia” removes hundreds of subtypes whose names contain it. Terms that do not contain a kept parent term survive, including abbreviations (e.g. “aids-related nhl”).
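The filter chain can be sketched in a few lines. This is an illustration under assumptions, not the logic in disease_keywords_loader.rb: the method name `build_keywords`, the exact normalisation (trim, downcase, collapse whitespace), and the sample blacklist are hypothetical.

```ruby
# Hypothetical sketch of: normalize → length filter → blacklist → substring dedup.
def build_keywords(raw_terms, blacklist: [])
  normalized = raw_terms.map { |term| term.to_s.strip.downcase.gsub(/\s+/, " ") }.uniq
  candidates = normalized.select { |term| term.length > 4 } - blacklist

  # Visit shortest terms first so a parent is kept before its subtypes are checked.
  candidates.sort_by { |term| [term.length, term] }.each_with_object([]) do |term, kept|
    kept << term unless kept.any? { |parent| term.include?(parent) }
  end
end

keywords = build_keywords(
  ["Leukemia", "acute myeloid leukemia", "Chronic  Lymphocytic Leukemia",
   "AIDS-related NHL", "lymphoma", "b-cell lymphoma", "flu", "cancer"],
  blacklist: ["cancer"]
)
# keywords => ["leukemia", "lymphoma", "aids-related nhl"]
```

Because a search for a parent term also returns its subtypes, dropping the subtypes reduces the number of search pages without losing trials.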
S3 deduplication
In-memory hash with O(1) lookups built from S3 file listing. Composite key: (registration_number, project_id).
```mermaid
flowchart TD
    Start([Trial]) --> CheckExact{Exact match?<br/>reg_num AND proj_id}
    CheckExact -->|Yes| Skip[SKIP]
    CheckExact -->|No| CheckUpdate{Same reg_num<br/>different proj_id?}
    CheckUpdate -->|Yes| Update[UPDATE - Fetch new version]
    CheckUpdate -->|No| New[NEW - Fetch]
```
Cross-disease deduplication: shared registry prevents duplicate fetches within a run. Crash recovery: registry rebuilds from S3 in ~30s for 5K trials.
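A minimal sketch of such a registry, assuming it is rebuilt by parsing the `{reg_num}__{proj_id}.xml` file names from an S3 listing. The class `TrialRegistry` and its methods are hypothetical, and the registration numbers in the usage example are made up.

```ruby
require "set"

# Hypothetical sketch of the in-memory dedup registry.
class TrialRegistry
  def initialize(s3_keys = [])
    # reg_num => Set of proj_ids already stored in S3
    @project_ids = Hash.new { |hash, reg_num| hash[reg_num] = Set.new }
    s3_keys.each { |key| add_key(key) }
  end

  # Parses keys shaped like "silver/chictr/{run_id}/{reg_num}__{proj_id}.xml".
  def add_key(s3_key)
    reg_num, proj_id = File.basename(s3_key, ".xml").split("__", 2)
    add(reg_num, proj_id) if reg_num && proj_id
  end

  def add(reg_num, proj_id)
    @project_ids[reg_num] << proj_id.to_s
  end

  # :skip for an exact match, :update for a known reg_num with a new
  # proj_id, :new otherwise. Hash and Set lookups are O(1) on average.
  def decision(reg_num, proj_id)
    return :new unless @project_ids.key?(reg_num)

    @project_ids[reg_num].include?(proj_id.to_s) ? :skip : :update
  end
end

registry = TrialRegistry.new(["silver/chictr/batch_001/ChiCTR0000001__280689.xml"])
registry.decision("ChiCTR0000001", "280689") # => :skip
registry.decision("ChiCTR0000001", "290000") # => :update
registry.decision("ChiCTR0000002", "111111") # => :new
```

Sharing one registry instance across all diseases in a run gives the cross-disease deduplication described above.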
Import layer
Database tables populated (10 total)
app/services/chictr/import_service.rb and app/services/chictr/xml_parser.rb parse XML (Nokogiri) into:
| Table | Relationship | Key fields |
|---|---|---|
| `clinical_trials` | main | nct_id, titles, phase, status, dates, enrollment, llm_data (JSONB) |
| `eligibilities` | 1:1 | Inclusion/exclusion criteria, age, gender |
| `sponsors` | 1:N | Name, lead_or_collaborator, agency_class |
| `overall_officials` | 1:N | Name, affiliation, role |
| `locations` | 1:N | Parsed from Chinese addresses (provinces, municipalities, Hong Kong) |
| `outcomes` | 1:N | Primary/secondary/other with measure, time_frame |
| `arms` | 1:N | Title, description, group_type |
| `clinical_trial_conditions` | M:N | Condition names |
| `study_plan_extractions` | 1:1 | Parent record with run_id, extraction_confidence |
| `study_plan_arms` | 1:N | Arms with interventions JSONB |
Import is orchestrated by app/services/chictr/phase_two_importer.rb in memory-safe batches of 1,000 files with PostgreSQL advisory locks.
Data preservation rules
| Trial state | clinical_trials | Related tables | llm_data |
|---|---|---|---|
| New | INSERT | INSERT | NULL (populated later by LLM pipeline) |
| Existing | UPDATE metadata | SKIP | Preserved (excluded from upsert hash) |
The llm_data exclusion is intentional — preserves ~$0.50/trial of LLM processing. Manual override with --update_related=true forces full re-import but destroys LLM data.
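The preservation rule amounts to dropping llm_data from the attribute hash before an existing trial is updated. The sketch below shows that idea only; the method `import_plan`, the `PRESERVED_COLUMNS` constant, and the returned hash shape are hypothetical and do not reflect the actual import_service.rb code.

```ruby
# Hypothetical sketch of the data preservation rules in the table above.
PRESERVED_COLUMNS = %i[llm_data].freeze

def import_plan(parsed_attrs, existing:)
  if existing
    # Existing trial: update metadata only, never touch preserved columns.
    { clinical_trial: parsed_attrs.reject { |column, _| PRESERVED_COLUMNS.include?(column) },
      related_tables: :skip }
  else
    # New trial: llm_data starts as NULL and is filled by the LLM pipeline later.
    { clinical_trial: parsed_attrs.merge(llm_data: nil), related_tables: :insert }
  end
end
```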
Dual arms architecture
ChiCTR populates BOTH arms (raw import layer, legacy compatibility) and study_plan_arms (Study Plan v2 processing layer with LLM-ready structure). Study Plan v2 is the current primary processing approach. Intervention types are inferred via pattern matching on arm descriptions (drug/mg/tablet → DRUG, surgery/procedure → PROCEDURE, etc.), though the LLM now classifies types directly for better accuracy.
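The pattern-matching fallback can be sketched as an ordered list of regular expressions. Only the two mappings named above come from this page; the exact patterns, the `"OTHER"` fallback, and the method name `infer_intervention_type` are assumptions.

```ruby
# Hypothetical sketch; the real pattern list is longer ("etc." above).
INTERVENTION_PATTERNS = {
  "DRUG" => /\b(?:drug|tablets?)\b|\d\s*mg\b/i,
  "PROCEDURE" => /\b(?:surgery|procedure)\b/i
}.freeze

# Returns the first type whose pattern matches the arm description.
def infer_intervention_type(description)
  INTERVENTION_PATTERNS.each do |type, pattern|
    return type if description.match?(pattern)
  end
  "OTHER" # assumed fallback when nothing matches
end
```

Patterns are checked in insertion order, so a description that mentions both a drug and a procedure is classified as DRUG in this sketch.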
Chinese organisations & sponsor linking
Import sources
Three data sources feed into app/services/chictr/import_organisations_service.rb:
- CSV (`lib/tasks/data/organisations/chictr_organisations.csv`): Curated organisations with semicolon-separated synonyms
- ORG_MAP (`lib/tasks/clinical_trials/sponsors.thor`): Hardcoded name variation mappings handling preposition, case, word order, and abbreviation differences
- Database discovery: Dynamic querying of ChiCTR sponsor frequencies for top-N organisations
Three-tier matching
Organisation.flexifind in app/models/concerns/searchable.rb runs a cascade:
- Exact (~5ms): `name ILIKE :term` OR `branch_names` JSONB contains term
- Fuzzy (~10-50ms): `strict_word_similarity` + `similarity` via GiST-indexed trigrams
- Levenshtein validation: Return only if similarity >= 85%
Processing: unlinked sponsors in batches of 1,000, 5 parallel threads.
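The final validation tier can be illustrated in plain Ruby. The sketch assumes similarity is defined as 1 minus the edit distance divided by the longer string's length, compared case-insensitively; the page does not state the formula, and the method names are hypothetical. The first two tiers run in PostgreSQL and are not reproduced here.

```ruby
# Classic dynamic-programming edit distance, keeping only two rows.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed definition: 100 * (1 - distance / length of the longer string).
def similarity_percent(a, b)
  longest = [a.length, b.length].max
  return 100.0 if longest.zero?

  (1.0 - levenshtein(a, b).to_f / longest) * 100
end

# Accepts a fuzzy candidate only if it clears the 85% threshold.
def accept_match?(term, candidate, threshold: 85.0)
  similarity_percent(term.downcase, candidate.downcase) >= threshold
end
```

Under this definition a single-character typo in a long hospital name passes, while two different universities sharing only the word "University" do not.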
Naming conventions
Chinese institution names follow patterns: university-affiliated hospitals (“West China Hospital of Sichuan University”), academy-affiliated hospitals, military institutions. The ORG_MAP handles comma variations, case, word order, and abbreviations.
Country handling: headquarters: "CN" (ISO 3166-1 alpha-2) + country: "China" (legacy). Character encoding: UTF-8 with non-breaking space (\xA0) handling.
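Non-breaking space handling might look like the sketch below, which treats U+00A0 as ordinary whitespace before names are compared. The method name `normalize_org_name` and the decision to collapse whitespace runs are assumptions.

```ruby
# Hypothetical sketch: normalise whitespace in scraped organisation names.
def normalize_org_name(raw)
  raw.scrub("")                   # drop invalid byte sequences
     .gsub(/[\u00A0\s]+/, " ")    # NBSP and ASCII whitespace become one space
     .strip
end

normalize_org_name("West China\u00A0Hospital \u00A0 of Sichuan University\u00A0")
# => "West China Hospital of Sichuan University"
```

Ruby's `\s` and `String#strip` do not match U+00A0, which is why the character is listed explicitly.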
Configuration & key services
Key services
| Service | File | Purpose |
|---|---|---|
| `Chictr::WebScraper` | app/services/chictr/web_scraper.rb | Ferrum browser automation with anti-bot evasion |
| `Chictr::ImportService` | app/services/chictr/import_service.rb | XML-to-database import with dedup |
| `Chictr::XmlParser` | app/services/chictr/xml_parser.rb | Nokogiri XML extraction |
| `Chictr::PhaseTwoImporter` | app/services/chictr/phase_two_importer.rb | S3-to-database batch orchestration |
| `Chictr::DiseaseKeywordsLoader` | app/services/chictr/disease_keywords_loader.rb | Keyword generation from 3 sources |
| `Chictr::ImportOrganisationsService` | app/services/chictr/import_organisations_service.rb | Chinese organisation import |
| `Chictr::WeeklyScraper` | app/services/chictr/weekly_scraper.rb | Weekly run lifecycle and checkpoints |
| `Chictr::SummaryManager` | app/services/chictr/summary_manager.rb | Checkpoint persistence and resume |
S3 file paths
| Content | S3 path | Local path |
|---|---|---|
| XML files | silver/chictr/{run_id}/{reg_num}__{proj_id}.xml | tmp/chictr/{run_id}/ |
| Run summary | silver/chictr/{run_id}/summary_{run_id}.json | tmp/chictr/{run_id}/summary_{run_id}.json |
| Latest summary | silver/chictr/summary_latest.json | — |
| Markdown | silver/study-plans/{run_id}/markdown/{nct_id}.md | tmp/study_plans/markdown/ |
Key constants
| Constant | Value | Purpose |
|---|---|---|
| `RETRY_MAX_ATTEMPTS` | 8 | Network retry ceiling |
| `RETRY_BASE_DELAY` | 2.0s | Exponential backoff base |
| `RETRY_MAX_DELAY` | 10.0s | Backoff cap |
| `MAX_S3_RETRIES` | 3 | S3 upload retries |
| `SNAPSHOT_OVERLAP_THRESHOLD` | 0.3 (30%) | Resume point integrity check |
| `CHECKPOINT_INTERVAL` | 50 pages (~8 min) | Checkpoint frequency |
| `studytype` | 1 | Interventional studies only |
Next steps
- Clinical trials pipeline: the broader trial pipeline that ChiCTR feeds into
- Drug approvals: CDE (China) drug approval pipeline
- Data model: trial tables and their relationships