ChiCTR pipeline

ChiCTR (Chinese Clinical Trial Registry) has no public API. All data is scraped via Ferrum browser automation with anti-bot evasion, S3-based deduplication, checkpoint recovery, and budget-controlled weekly runs. The pipeline covers collection, import into 10 database tables, LLM-powered study plan extraction, and Chinese organisation sponsor linking.

For how ChiCTR fits into the broader trial pipeline, see Clinical trials pipeline.

Key services: app/services/chictr/

Thor tasks: lib/tasks/clinical_trials/chictr_weekly.thor, chictr_import.thor, chictr_collector.thor, chictr_trials_scrapper.thor, chictr_study_plan.thor

Operator reference

Command reference

Use case	Command
Full weekly run	`bundle exec thor clinical_trials:chictr_weekly:weekly_run`
Fetch phase only	`bundle exec thor clinical_trials:chictr_weekly:weekly_run --phase fetch`
Import phase only	`bundle exec thor clinical_trials:chictr_weekly:weekly_run --phase import`
Custom budget	`bundle exec thor clinical_trials:chictr_weekly:weekly_run --page-budget 2000 --max-pages-per-disease 50`
Resume after crash	`bundle exec thor clinical_trials:chictr_weekly:weekly_run --run-id SAME_RUN_ID`
Manual single disease	`bundle exec thor clinical_trials:chictr:collect_disease diabetes --pages 5`
Import from S3	`bundle exec thor clinical_trials:chictr_import:import_xml --directory tmp/chictr/run`
Generate markdown	`bundle exec thor clinical_trials:chictr:generate_markdown --run_id=batch_001 --storage=s3`
LLM extraction (ChiCTR)	`bundle exec thor clinical_trials:study_plan_refactor:extraction --run-id=batch_001 --chictr-only`
Import organisations	`bundle exec thor one_off:chictr_organisations:import --limit 500`
Link sponsors	`bundle exec thor clinical_trials:sponsors:link_all_chictr`
Test single trial	`bundle exec thor clinical_trials:chictr_scrapper:test 280689`

Weekly run lifecycle

The scraper paginates in reverse (newest trials first, page N → 1) with a default weekly budget of 4,000 pages and at most 30 pages per disease. Each trial takes 4-6 seconds to fetch.
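The budget rules can be sketched as follows. This is a minimal illustration, not the pipeline's code: `pages_to_fetch` and `plan_run` are hypothetical names, and the order in which diseases are visited is an assumption.

```ruby
# Illustrative sketch only: method names are hypothetical, not the pipeline's API.
PAGE_BUDGET = 4_000          # weekly page budget (from the text above)
MAX_PAGES_PER_DISEASE = 30   # per-disease cap (from the text above)

# Pages to fetch for one disease, in reverse order (page N down towards 1),
# limited by the per-disease cap and by whatever budget remains.
def pages_to_fetch(total_pages, remaining_budget, cap: MAX_PAGES_PER_DISEASE)
  count = [total_pages, cap, remaining_budget].min
  return [] if count <= 0

  total_pages.downto(total_pages - count + 1).to_a
end

# Walks diseases in the given order until the weekly budget is exhausted.
# page_counts maps disease name => number of result pages.
def plan_run(page_counts, budget: PAGE_BUDGET)
  remaining = budget
  page_counts.each_with_object({}) do |(disease, total), plan|
    pages = pages_to_fetch(total, remaining)
    remaining -= pages.size
    plan[disease] = pages
  end
end
```

With a budget of 35 pages, a disease with 40 result pages receives its 30-page cap and the next disease receives the remaining 5 pages.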

Checkpoints. Every 50 pages (~8 min of work), the scraper saves state to both S3 and local disk. On SIGTERM/SIGINT, it finishes the current page, saves checkpoint, and exits gracefully.
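A minimal sketch of this checkpoint loop, assuming a hypothetical `CheckpointedRun` class. It writes to local disk only (the S3 copy is omitted), and the JSON layout is invented for illustration.

```ruby
require "json"
require "fileutils"

CHECKPOINT_INTERVAL = 50 # pages, per the text above

# Hypothetical runner: saves state every CHECKPOINT_INTERVAL pages and stops
# cleanly after the current page once a stop has been requested.
class CheckpointedRun
  attr_reader :pages_done

  def initialize(path)
    @path = path
    @pages_done = 0
    @stop_requested = false
  end

  def request_stop!
    @stop_requested = true
  end

  # SIGTERM/SIGINT only set a flag; the loop below finishes the current page.
  def install_signal_handlers
    %w[TERM INT].each { |sig| Signal.trap(sig) { request_stop! } }
  end

  def run(pages)
    pages.each do |page|
      yield page
      @pages_done += 1
      save_checkpoint if (@pages_done % CHECKPOINT_INTERVAL).zero?
      break if @stop_requested
    end
    save_checkpoint # final state, also written on graceful shutdown
  end

  def save_checkpoint
    FileUtils.mkdir_p(File.dirname(@path))
    File.write(@path, JSON.generate(pages_done: @pages_done))
  end
end
```

Keeping the signal handler down to a flag assignment means the page in progress is never interrupted halfway through.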

Resume. Reusing the same run_id triggers crash recovery: the scraper loads the checkpoint and skips already-fetched trials. A new run_id starts an incremental run: it loads summary_latest.json from the previous week and adds new diseases automatically. As protection against data source failures, a sanity check aborts the keyword refresh if the new keyword list is smaller than 50% of the existing one.
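The keyword sanity check amounts to a size comparison. A minimal sketch, where `keyword_refresh_safe?` and `MIN_REFRESH_RATIO` are hypothetical names and the behaviour for an empty existing list is an assumption:

```ruby
MIN_REFRESH_RATIO = 0.5 # refresh is rejected below 50% of the existing size

# Returns false when the refreshed list shrank suspiciously, which usually
# points at a failed data source rather than a real change.
def keyword_refresh_safe?(existing, refreshed, min_ratio: MIN_REFRESH_RATIO)
  return true if existing.empty? # assumption: nothing to protect on a first run

  refreshed.size >= existing.size * min_ratio
end
```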

Budget-efficient change detection. Every week, page 1 of every disease is fetched without counting against the page budget. For completed diseases, page 1 is compared against saved snapshots to detect new, updated, or removed trials, and only the changed trials are re-fetched (0 budget pages consumed).
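The snapshot comparison can be sketched as a diff of two hashes. This assumes a snapshot maps registration number to project ID and that a changed project ID marks an update; both are illustrative assumptions, as is the `diff_page_one` name.

```ruby
# saved and current map registration_number => project_id for page 1.
def diff_page_one(saved, current)
  {
    new: current.keys - saved.keys,
    removed: saved.keys - current.keys,
    updated: current.keys.select { |reg| saved.key?(reg) && saved[reg] != current[reg] }
  }
end
```

Only the trials listed under `new` and `updated` would need to be re-fetched.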

Error handling

Error type	Strategy
Network (scraping)	8 retries, exponential backoff + jitter (2s base, 10s cap)
S3 upload	3 retries, delete local file on failure (S3 is source of truth)
Database	Batch `insert_all` with individual retry fallback on duplicate/validation
Advisory lock	30s timeout, abort if busy

Recovery scenario	Action
XML parse error	Log + skip trial, continue
OOM crash	Restart with same `run_id`, registry rebuilds from S3 in ~30s
S3 upload failure	Delete local file (maintains S3 source of truth)
FK violation on re-import	Uses `destroy_all` to trigger Rails cascade deletion
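The network retry strategy can be sketched as below. The constants mirror the values in the table, but the jitter formula (a random delay between half of the capped exponential value and the full value) and the `with_retries` helper are assumptions, and the constant is read here as a ceiling on total attempts.

```ruby
RETRY_MAX_ATTEMPTS = 8
RETRY_BASE_DELAY = 2.0  # seconds
RETRY_MAX_DELAY = 10.0  # seconds

# Exponential backoff capped at RETRY_MAX_DELAY, with jitter so that
# parallel workers do not retry in lockstep.
def backoff_delay(attempt, rng: Random.new)
  ceiling = [RETRY_BASE_DELAY * (2**(attempt - 1)), RETRY_MAX_DELAY].min
  ceiling * (0.5 + rng.rand * 0.5)
end

# Runs the block, retrying on StandardError until max_attempts is reached.
# The sleeper is injectable so the logic can be exercised without waiting.
def with_retries(max_attempts: RETRY_MAX_ATTEMPTS, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts

    sleeper.call(backoff_delay(attempt))
    retry
  end
end
```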

Collection system

Anti-bot strategy

ChiCTR uses Cloudflare protection, browser fingerprinting, and request pattern detection. Evasion is implemented in app/services/chictr/web_scraper.rb:

Technique	Implementation
Full browser (Ferrum)	Headless Chrome with JS execution
UA rotation	5 realistic signatures, rotated in `lib/browser_manager.rb`
Rate limiting	1.0s mandatory sleep between requests
Human-like scrolling	300px scroll simulation
Network idle waiting	Wait for all AJAX/JS before extraction
Cloudflare detection	Check for “Just a moment” / “Checking your browser” pages
Browser health rotation	Replace browser every 15 requests or 30 minutes
Automation flag hiding	Disable `AutomationControlled` blink feature
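Two of these checks are simple enough to sketch without a browser. The helper names below are hypothetical; only the marker strings and thresholds come from the table.

```ruby
CHALLENGE_MARKERS = ["Just a moment", "Checking your browser"].freeze
MAX_REQUESTS_PER_BROWSER = 15
MAX_BROWSER_AGE_SECONDS = 30 * 60

# True when the page body looks like a Cloudflare interstitial.
def cloudflare_challenge?(html)
  CHALLENGE_MARKERS.any? { |marker| html.include?(marker) }
end

# True when the browser should be replaced: too many requests or too old.
def rotate_browser?(request_count, started_at, now: Time.now)
  request_count >= MAX_REQUESTS_PER_BROWSER ||
    (now - started_at) >= MAX_BROWSER_AGE_SECONDS
end
```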

Disease keywords

app/services/chictr/disease_keywords_loader.rb builds the search keyword list:

3 sources (thousands raw) → normalize → length filter (>4 chars) → blacklist → substring dedup → hundreds of keywords

Database diseases: Disease.where(simplified: true, deleted_at: nil) — names and synonym variations (largest source)
MeSH terms: From lib/tasks/data/mesh_terms_by_therapeutic_areas.csv, filtered by Oncology/Hematology
Cancer keywords: Constants in app/services/chictr/cancer_keywords_processor.rb

Substring deduplication keeps only shortest parent terms — e.g. “leukemia” removes hundreds of subtypes. Abbreviations that are not substrings survive (e.g. “aids-related nhl”).
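The filtering steps can be sketched as one method. This is an illustration under assumptions: `build_keywords` is a hypothetical name, normalisation is reduced to lowercasing and whitespace cleanup, and the blacklist is passed in by the caller.

```ruby
# normalize -> length filter (> 4 chars) -> blacklist -> substring dedup
def build_keywords(raw_terms, blacklist: [])
  candidates = raw_terms
    .map { |term| term.to_s.downcase.strip.gsub(/\s+/, " ") }
    .select { |term| term.length > 4 }
    .reject { |term| blacklist.include?(term) }
    .uniq
    .sort_by { |term| [term.length, term] } # shortest first, so parents win

  # Keep a term only if no already-kept (shorter) term is a substring of it.
  candidates.each_with_object([]) do |term, kept|
    kept << term unless kept.any? { |parent| term.include?(parent) }
  end
end
```

For example, given "Leukemia", "acute myeloid leukemia", and "AIDS-related NHL", the sketch keeps "leukemia" and "aids-related nhl" and drops the subtype.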

S3 deduplication

In-memory hash with O(1) lookups built from S3 file listing. Composite key: (registration_number, project_id).

flowchart TD
    Start([Trial]) --> CheckExact{Exact match?<br/>reg_num AND proj_id}
    CheckExact -->|Yes| Skip[SKIP]
    CheckExact -->|No| CheckUpdate{Same reg_num<br/>different proj_id?}
    CheckUpdate -->|Yes| Update[UPDATE - Fetch new version]
    CheckUpdate -->|No| New[NEW - Fetch]

Cross-disease deduplication: shared registry prevents duplicate fetches within a run. Crash recovery: registry rebuilds from S3 in ~30s for 5K trials.
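The decision flow can be sketched with two sets. `TrialRegistry` is a hypothetical class; it assumes keys follow the `{reg_num}__{proj_id}.xml` naming used for XML files in S3 and ignores everything else.

```ruby
require "set"

# In-memory registry built from an S3 key listing; lookups are O(1) on average.
class TrialRegistry
  def initialize(s3_keys = [])
    @pairs = Set.new     # exact (reg_num, proj_id) combinations
    @reg_nums = Set.new  # registration numbers seen with any proj_id
    s3_keys.each do |key|
      next unless key.end_with?(".xml")

      reg_num, proj_id = File.basename(key, ".xml").split("__", 2)
      add(reg_num, proj_id) if reg_num && proj_id
    end
  end

  def add(reg_num, proj_id)
    @pairs << [reg_num, proj_id]
    @reg_nums << reg_num
  end

  # :skip for an exact match, :update for a known reg_num with a new proj_id,
  # :new otherwise.
  def classify(reg_num, proj_id)
    return :skip if @pairs.include?([reg_num, proj_id])
    return :update if @reg_nums.include?(reg_num)

    :new
  end
end
```

Calling `add` after each successful fetch gives the cross-disease deduplication described above, since later diseases see the same registry.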

Import layer

Database tables populated (10 total)

app/services/chictr/import_service.rb and app/services/chictr/xml_parser.rb parse XML (Nokogiri) into:

Table	Relationship	Key fields
`clinical_trials`	main	`nct_id`, titles, phase, status, dates, enrollment, `llm_data` (JSONB)
`eligibilities`	1:1	Inclusion/exclusion criteria, age, gender
`sponsors`	1:N	Name, `lead_or_collaborator`, `agency_class`
`overall_officials`	1:N	Name, affiliation, role
`locations`	1:N	Parsed from Chinese addresses (provinces, municipalities, Hong Kong)
`outcomes`	1:N	Primary/secondary/other with measure, time_frame
`arms`	1:N	Title, description, group_type
`clinical_trial_conditions`	M:N	Condition names
`study_plan_extractions`	1:1	Parent record with `run_id`, `extraction_confidence`
`study_plan_arms`	1:N	Arms with `interventions` JSONB

Import orchestrated by app/services/chictr/phase_two_importer.rb in memory-safe batches of 1,000 files with PostgreSQL advisory locks.

Data preservation rules

Trial state	`clinical_trials`	Related tables	`llm_data`
New	INSERT	INSERT	NULL (populated later by LLM pipeline)
Existing	UPDATE metadata	SKIP	Preserved (excluded from upsert hash)

The llm_data exclusion is intentional: it preserves roughly $0.50 per trial of LLM processing. The manual override --update_related=true forces a full re-import but destroys LLM data.
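The preservation rule comes down to dropping protected columns from the attributes written for an existing trial. A minimal sketch with hypothetical names (`PRESERVED_COLUMNS`, `upsert_attributes`), using plain hashes rather than the real import code:

```ruby
PRESERVED_COLUMNS = %i[llm_data].freeze

# For an existing trial, strip preserved columns so an update cannot
# overwrite them; a new trial is inserted with whatever was parsed.
def upsert_attributes(parsed, existing:)
  return parsed unless existing

  parsed.reject { |column, _value| PRESERVED_COLUMNS.include?(column) }
end
```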

Dual arms architecture

ChiCTR populates BOTH arms (raw import layer, legacy compatibility) and study_plan_arms (Study Plan v2 processing layer with LLM-ready structure). Study Plan v2 is the current primary processing approach. Intervention types are inferred via pattern matching on arm descriptions (drug/mg/tablet → DRUG, surgery/procedure → PROCEDURE, etc.), though the LLM now classifies types directly for better accuracy.
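The pattern-matching fallback could look like the sketch below. Only the drug/mg/tablet and surgery/procedure rules come from the text; the exact regular expressions, the "OTHER" default, and the method name are assumptions.

```ruby
# First matching pattern wins; insertion order of the hash sets precedence.
INTERVENTION_PATTERNS = {
  "DRUG" => /drug|tablet|\d\s*mg\b/i,
  "PROCEDURE" => /surgery|procedure/i
}.freeze

def infer_intervention_type(description)
  text = description.to_s
  INTERVENTION_PATTERNS.each do |type, pattern|
    return type if text.match?(pattern)
  end
  "OTHER" # assumption: fallback label when nothing matches
end
```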

Organisation import sources

Three data sources feed into app/services/chictr/import_organisations_service.rb:

CSV (lib/tasks/data/organisations/chictr_organisations.csv): Curated organisations with semicolon-separated synonyms
ORG_MAP (lib/tasks/clinical_trials/sponsors.thor): Hardcoded name variation mappings handling preposition, case, word order, and abbreviation differences
Database discovery: Dynamic querying of ChiCTR sponsor frequencies for top-N organisations

Three-tier matching

Organisation.flexifind in app/models/concerns/searchable.rb runs a cascade:

Exact (~5ms): name ILIKE :term OR branch_names JSONB contains term
Fuzzy (~10-50ms): strict_word_similarity + similarity via GiST-indexed trigrams
Levenshtein validation: Return only if similarity >= 85%

Processing: unlinked sponsors in batches of 1,000, 5 parallel threads.
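The final validation tier can be sketched in plain Ruby. The exact and trigram tiers run in PostgreSQL and are not reproduced here; the similarity formula (1 minus edit distance over the longer length) and the helper names are assumptions.

```ruby
# Classic dynamic-programming edit distance, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Similarity in 0.0..1.0, case-insensitive.
def similarity(a, b)
  a = a.downcase.strip
  b = b.downcase.strip
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein(a, b).to_f / longest
end

# Returns the best candidate only if it clears the 85% threshold.
def validated_match(term, candidates, threshold: 0.85)
  best = candidates.max_by { |candidate| similarity(term, candidate) }
  best if best && similarity(term, best) >= threshold
end
```

A misspelt "Peking Univeristy" still resolves to "Peking University", while a different institution falls below the threshold and returns nil.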

Naming conventions

Chinese institution names follow patterns: university-affiliated hospitals (“West China Hospital of Sichuan University”), academy-affiliated hospitals, military institutions. The ORG_MAP handles comma variations, case, word order, and abbreviations.

Country handling: headquarters: "CN" (ISO 3166-1 alpha-2) + country: "China" (legacy). Character encoding: UTF-8 with non-breaking space (\xA0) handling.
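Non-breaking spaces need explicit handling because Ruby's `\s` and `String#strip` do not treat U+00A0 as whitespace. A minimal sketch, with `normalize_org_name` as a hypothetical helper:

```ruby
NBSP = "\u00A0"

# Replaces non-breaking spaces, collapses whitespace runs, and trims the ends.
def normalize_org_name(name)
  name.to_s.scrub.gsub(NBSP, " ").gsub(/\s+/, " ").strip
end
```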

Configuration & key services

Key services

Service	File	Purpose
`Chictr::WebScraper`	`app/services/chictr/web_scraper.rb`	Ferrum browser automation with anti-bot evasion
`Chictr::ImportService`	`app/services/chictr/import_service.rb`	XML-to-database import with dedup
`Chictr::XmlParser`	`app/services/chictr/xml_parser.rb`	Nokogiri XML extraction
`Chictr::PhaseTwoImporter`	`app/services/chictr/phase_two_importer.rb`	S3-to-database batch orchestration
`Chictr::DiseaseKeywordsLoader`	`app/services/chictr/disease_keywords_loader.rb`	Keyword generation from 3 sources
`Chictr::ImportOrganisationsService`	`app/services/chictr/import_organisations_service.rb`	Chinese organisation import
`Chictr::WeeklyScraper`	`app/services/chictr/weekly_scraper.rb`	Weekly run lifecycle and checkpoints
`Chictr::SummaryManager`	`app/services/chictr/summary_manager.rb`	Checkpoint persistence and resume

S3 file paths

Content	S3 path	Local path
XML files	`silver/chictr/{run_id}/{reg_num}__{proj_id}.xml`	`tmp/chictr/{run_id}/`
Run summary	`silver/chictr/{run_id}/summary_{run_id}.json`	`tmp/chictr/{run_id}/summary_{run_id}.json`
Latest summary	`silver/chictr/summary_latest.json`	—
Markdown	`silver/study-plans/{run_id}/markdown/{nct_id}.md`	`tmp/study_plans/markdown/`

Key constants

Constant	Value	Purpose
`RETRY_MAX_ATTEMPTS`	8	Network retry ceiling
`RETRY_BASE_DELAY`	2.0s	Exponential backoff base
`RETRY_MAX_DELAY`	10.0s	Backoff cap
`MAX_S3_RETRIES`	3	S3 upload retries
`SNAPSHOT_OVERLAP_THRESHOLD`	0.3 (30%)	Resume point integrity check
`CHECKPOINT_INTERVAL`	50 pages (~8 min)	Checkpoint frequency
`studytype`	1	Interventional studies only

Next steps

Clinical trials pipeline — The broader trial pipeline that ChiCTR feeds into
Drug approvals — CDE (China) drug approval pipeline
Data model — Trial tables and their relationships