Drug approvals pipeline
When the FDA approves a new cancer drug, that approval appears in Data Gov within days — matched to the correct Bioloupe drug, linked to structured indications, and enriched with label text. This page follows a drug approval from raw regulatory data through the pipeline to a unified DrugApproval record.
Four agencies, one unified model
The same drug receives separate approvals in each region. Pembrolizumab is approved by the FDA (USA), EMA (EU), KEGG/PMDA (Japan), and CDE/NMPA (China). Each agency publishes data in a different format, with different identifiers, at different times.
Data Gov collects from all four and assembles a single drug_approvals record per region per drug. Polymorphic source_type/source_id columns trace every approval back to its raw regulatory source.
```mermaid
flowchart TB
    subgraph Sources["Four Regulatory Agencies"]
        FDA["FDA\n(openFDA API)\nNDA/BLA numbers"]
        EMA["EMA\n(XLSX export)\nProduct numbers"]
        KEGG["KEGG\n(Node.js scraper)\nKEGG identifiers"]
        CDE["CDE/NMPA\n(Database)\nCDE codes"]
    end
    subgraph Match["Match to Bioloupe Drugs"]
        MFDA["Match FDA records"]
        MEMA["Match EMA records"]
        MKEGG["Match KEGG records"]
    end
    subgraph Assembly["Assembly"]
        ASM["Create unified\nDrugApproval records"]
        PB["Purple Book\nbiosimilar approvals"]
        DIT["Determine\ninnovation type"]
        LNK["Link organisations\nto approvals"]
    end
    subgraph Enrich["Enrichment"]
        LBL["Fetch FDA labels\n(DailyMed)"]
        BW["Parse boxed\nwarnings"]
        REMS["Fetch REMS\nsafety data"]
        CDX["Sync companion\ndiagnostics"]
        STR["Structure labels\n(LLM)"]
    end
    subgraph Indications["Indication Extraction"]
        PI["Parse indications\n(LLM)"]
        PPD["Link diseases"]
        PPB["Link biomarkers"]
        ETA["Extract therapeutic\nareas"]
    end
    FDA --> MFDA --> ASM
    EMA --> MEMA --> ASM
    KEGG --> MKEGG --> ASM
    CDE --> ASM
    ASM --> PB --> DIT --> LNK
    LNK --> LBL --> BW --> REMS
    LBL --> STR --> PI --> PPD & PPB
    STR --> ETA
    LNK --> CDX
```
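The polymorphic source_type/source_id pair mentioned above can be pictured with a small plain-Ruby sketch. It is not the application's model code: the in-memory SOURCES hash, the product_number field, the region strings, and the sample values are all placeholders chosen for illustration.

```ruby
# Hypothetical in-memory stand-ins for the raw source tables.
# Class names come from this page; fields and values are illustrative only.
SOURCES = {
  "FdaDatum"        => { 1 => { application_number: "BLA000001", brand_name: "EXAMPLEDRUG" } },
  "EmaApprovedDrug" => { 7 => { product_number: "EXAMPLE-7", brand_name: "Exampledrug" } }
}.freeze

DrugApproval = Struct.new(:region, :source_type, :source_id, keyword_init: true) do
  # Resolve the polymorphic pair back to the raw record it was assembled from.
  def source
    SOURCES.fetch(source_type, {})[source_id]
  end
end

us = DrugApproval.new(region: "USA", source_type: "FdaDatum", source_id: 1)
eu = DrugApproval.new(region: "EU", source_type: "EmaApprovedDrug", source_id: 7)

puts us.source[:application_number]
puts eu.source[:product_number]
```

In a Rails model the same idea would normally be expressed with a polymorphic association; whether the application does so is not stated here.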
Data sources in detail
FDA (USA)
Source: openFDA download API (api.fda.gov/download.json)
The Regulatory::Fda Thor task downloads partitioned ZIP files containing JSON records for drugs and labels. Each record represents an FDA application (NDA or BLA). The pipeline:
- Downloads drug partition files. Extracts records matching oncology/hematology therapeutic areas.
- Downloads label partition files. Matches labels to drugs by application number.
- Upserts FdaDatum records keyed on application_number + brand_name.
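The filter-then-upsert behaviour in the steps above could look roughly like the following sketch. It works on plain hashes rather than the real FdaDatum model, and the therapeutic_area field, the area labels, and the sample records are assumptions made for illustration.

```ruby
# Assumed labels; this page only says "oncology/hematology therapeutic areas".
TARGET_AREAS = %w[oncology hematology].freeze

# store is a Hash keyed on [application_number, brand_name], mimicking the
# composite upsert key. Running the same batch twice leaves one row per key.
def upsert_fda_records(store, records)
  records.each do |record|
    # Skip records outside the target therapeutic areas.
    next unless TARGET_AREAS.include?(record[:therapeutic_area])

    key = [record[:application_number], record[:brand_name]]
    # Merge new values over any existing row for the same key.
    store[key] = store.fetch(key, {}).merge(record)
  end
  store
end

store = {}
batch = [
  { application_number: "NDA000001", brand_name: "ALPHA", therapeutic_area: "oncology" },
  { application_number: "NDA000002", brand_name: "BETA", therapeutic_area: "cardiology" }
]
upsert_fda_records(store, batch)
upsert_fda_records(store, batch)
puts store.size
```

Keying on both fields means one application with several brand names yields several rows, which matches the key described above.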
FDA data also feeds supplementary tables:
- Orange Book (orangebook_records): patent expiry dates and exclusivity periods
- Purple Book (purplebook_records): biological products, biosimilar reference products
- Breakthrough Therapy (breakthroughs): FDA Breakthrough Therapy designations
- Companion Diagnostics (fda_approved_companion_diagnostics): approved diagnostic tests paired with drugs and biomarkers
- REMS (rems): Risk Evaluation and Mitigation Strategies
EMA (EU)
Source: EMA public XLSX export
The Regulatory::Ema Thor task downloads the spreadsheet and creates EmaApprovedDrug records. Only category: 'Human' records are processed. Fields include authorization status, ATC code, therapeutic area, and regulatory flags (orphan, conditional, accelerated, PRIME, advanced therapy).
KEGG (Japan)
Source: KEGG drug database via Node.js scraper
Creates KeggApprovedDrug records for Japanese approvals. Remark codes indicate regulatory categories: G (generic), B (biosimilar), C (conditional), E/A (expedited review).
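One way to read the remark codes listed above is as a lookup table. The sketch below assumes the codes arrive as an array of single-letter strings and uses made-up symbol names for the categories; how KeggApprovedDrug actually stores remarks is not described here.

```ruby
# Remark code meanings as listed on this page; the symbol names are illustrative.
KEGG_REMARK_CATEGORIES = {
  "G" => :generic,
  "B" => :biosimilar,
  "C" => :conditional,
  "E" => :expedited_review,
  "A" => :expedited_review
}.freeze

# Translate a list of remark codes into regulatory categories,
# ignoring codes that are not in the table.
def kegg_regulatory_categories(remark_codes)
  remark_codes.filter_map { |code| KEGG_REMARK_CATEGORIES[code] }.uniq
end

p kegg_regulatory_categories(%w[B E A])
```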
CDE (China)
Source: China CDE/NMPA database
Creates ChinaApprovedDrug records. Stores both original Chinese-language data (original_data JSONB) and machine-translated versions (translated_original_data JSONB). Currently run on-demand rather than as part of the automated workflow.
Matching: from source records to Bioloupe drugs
Each source table carries a nullable drug_id FK. The matching step links source records to canonical Drug records using name matching against drugs.name and drugs.all_synonyms.
The matching Thor tasks (regulatory:matching:match_fda_to_bioloupe, etc.) run after collection and before assembly. Unmatched records remain with drug_id: null and queue for human review in ActiveAdmin.
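As a rough illustration of name matching against drugs.name and drugs.all_synonyms, the sketch below builds a lookup index and returns nil for unmatched names, mirroring drug_id: null. It assumes exact, case-insensitive comparison and an array-valued all_synonyms; the real matching tasks may normalize or rank candidates differently.

```ruby
Drug = Struct.new(:id, :name, :all_synonyms, keyword_init: true)

# Assumed normalization: trim whitespace and ignore case.
def normalize_name(name)
  name.to_s.strip.downcase
end

# Map every normalized name and synonym to a drug id (first writer wins).
def build_name_index(drugs)
  index = {}
  drugs.each do |drug|
    [drug.name, *drug.all_synonyms].each do |candidate|
      index[normalize_name(candidate)] ||= drug.id
    end
  end
  index
end

# Returns the matching drug id, or nil so the record can wait for review.
def match_drug_id(index, source_name)
  index[normalize_name(source_name)]
end

drugs = [Drug.new(id: 1, name: "pembrolizumab", all_synonyms: ["Keytruda", "MK-3475"])]
index = build_name_index(drugs)
p match_drug_id(index, " KEYTRUDA ")
p match_drug_id(index, "unknown brand")
```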
Assembly: creating unified approvals
The regulatory:approvals:assemble task creates unified DrugApproval records from all source tables. For each source record, it:
- Maps source-specific fields to a common schema (brand name, approval status, region, dates, innovation type)
- Looks up an existing DrugApproval by source_type + source_id
- Falls back to matching by application_number + brand_name for manually created records
- Upserts with lockable attribute support; locked fields are never overwritten
- Tracks indication changes (old vs. current) in supplementary_info.indications JSONB
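The lookup order and the locked-field rule from the list above can be sketched with plain hashes. The :locked key, the helper names, and the sample values are hypothetical; the real task works on DrugApproval records and its locking mechanism may be implemented differently.

```ruby
# Find an approval by source key first, then fall back to
# application_number + brand_name (the path for manually created records).
def find_existing(approvals, attrs)
  by_source = approvals.find do |a|
    a[:source_type] == attrs[:source_type] && a[:source_id] == attrs[:source_id]
  end
  return by_source if by_source

  approvals.find do |a|
    a[:application_number] == attrs[:application_number] &&
      a[:brand_name] == attrs[:brand_name]
  end
end

# Insert a new approval, or update an existing one without touching locked fields.
def upsert_approval(approvals, attrs)
  existing = find_existing(approvals, attrs)
  if existing.nil?
    record = { locked: [] }.merge(attrs)
    approvals << record
    return record
  end

  locked = existing.fetch(:locked, [])
  attrs.each { |field, value| existing[field] = value unless locked.include?(field) }
  existing
end

approvals = [
  { application_number: "BLA000001", brand_name: "EXAMPLEDRUG",
    approval_status: "approved", locked: [:approval_status] }
]
incoming = { source_type: "FdaDatum", source_id: 1, application_number: "BLA000001",
             brand_name: "EXAMPLEDRUG", approval_status: "withdrawn", region: "USA" }
result = upsert_approval(approvals, incoming)
puts result[:approval_status]
puts approvals.size
```

After the first run the manually created record carries the source key, so later runs find it through the primary lookup.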
Innovation types determine whether an approval is for an original drug, a generic, or a biosimilar. The pipeline derives this from FDA submission class codes, EMA generic/biosimilar flags, and KEGG remark codes.
| Innovation type | Meaning |
|---|---|
| original | First approval of a new molecular entity |
| generic | Small molecule copy of an approved drug |
| biosimilar | Biological product similar to an approved biologic |
| biosimilar_interchangeable | Biosimilar approved as interchangeable with the reference product |
| non_original | Other non-original products |
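A hedged sketch of how innovation types might be derived for the EMA and KEGG sources follows. The method names, the boolean flag arguments, and the choice to default to "original" when no flag is set are assumptions; the FDA submission class code mapping is left out because it is not given here.

```ruby
INNOVATION_TYPES = %w[
  original generic biosimilar biosimilar_interchangeable non_original
].freeze

# EMA: assumes two boolean flags taken from the spreadsheet export.
def innovation_type_from_ema(generic:, biosimilar:)
  return "biosimilar" if biosimilar
  return "generic" if generic

  "original" # assumed default when neither flag is set
end

# KEGG: assumes remark codes arrive as an array such as ["B"] or ["G"].
def innovation_type_from_kegg(remark_codes)
  return "biosimilar" if remark_codes.include?("B")
  return "generic" if remark_codes.include?("G")

  "original" # assumed default when no relevant code is present
end

puts innovation_type_from_ema(generic: false, biosimilar: true)
puts innovation_type_from_kegg(["G"])
```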
Label enrichment
After assembly, the pipeline enriches FDA approvals with structured label content.
Fetch FDA labels. Fda::LabelSync calls the DailyMed API to retrieve label metadata and sections. Labels contain the official indication text, dosing, warnings, and adverse reactions.
Parse boxed warnings. Fetches SPL XML from DailyMed and extracts the boxed warning HTML. Boxed warnings are the highest-severity safety alerts the FDA issues.
Fetch REMS. Downloads REMS CSV files from FDA. Creates Rems records with program details, materials, and linked application numbers.
Sync companion diagnostics. Updates fda_approved_companion_diagnostics with paired drug-biomarker-test relationships.
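For the boxed warning step above, the sketch below shows one way to locate the boxed warning section in SPL XML using REXML. It assumes the section is identified by LOINC code 34066-1 on a code element that is a direct child of the section, and it returns plain text rather than HTML; the pipeline's own extraction logic is not shown on this page and may differ.

```ruby
require "rexml/document"

# LOINC code commonly used for the boxed warning section in SPL documents
# (treated as an assumption here).
BOXED_WARNING_LOINC = "34066-1"

# Depth-first search for the section whose <code> child carries the LOINC code.
def find_boxed_warning_section(node)
  node.children.each do |child|
    next unless child.is_a?(REXML::Element)
    return child.parent if child.name == "code" && child.attributes["code"] == BOXED_WARNING_LOINC

    found = find_boxed_warning_section(child)
    return found if found
  end
  nil
end

# Concatenate all text nodes beneath an element.
def collect_text(node)
  node.children.map do |child|
    if child.is_a?(REXML::Text)
      child.value
    elsif child.is_a?(REXML::Element)
      collect_text(child)
    else
      ""
    end
  end.join(" ")
end

# Returns the boxed warning as whitespace-normalized text, or nil if absent.
def boxed_warning_text(spl_xml)
  section = find_boxed_warning_section(REXML::Document.new(spl_xml))
  return nil if section.nil?

  body = section.children.find { |c| c.is_a?(REXML::Element) && c.name == "text" }
  return nil if body.nil?

  collect_text(body).gsub(/\s+/, " ").strip
end
```

Labels without a boxed warning return nil, which a caller can treat as "no warning on file" rather than an error.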
Indication extraction
This is where drug approvals connect to diseases. It runs as a separate IndicationWorkflow (37 steps) triggered after the drugs workflow.
Step 1: Structure labels (LLM)
GPT-4.1 reads raw FDA label text and structures it into parsed sections. This transforms free-text paragraphs into machine-readable fields.
Step 2: Parse indications (LLM)
GPT-4.1 extracts individual Indication records from structured label text. Each indication captures:
- Disease name and subtypes
- Treatment lines (first-line, second-line, maintenance)
- Required biomarkers
- Indication type (treatment, diagnosis, prevention)
- Accelerated vs. full approval dates
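To make the captured fields concrete, here is a sketch that turns a JSON array into Indication structs. The JSON key names, the Struct layout, and the sample payload are invented for illustration; the actual LLM response schema and the Indication model columns are not documented on this page.

```ruby
require "json"
require "date"

# Hypothetical in-memory shape for one extracted indication.
Indication = Struct.new(
  :disease, :subtypes, :treatment_lines, :biomarkers,
  :indication_type, :accelerated_approval_date, :full_approval_date,
  keyword_init: true
)

INDICATION_TYPES = %w[treatment diagnosis prevention].freeze

# Parse an ISO 8601 date string, passing nil through unchanged.
def parse_optional_date(value)
  value.nil? ? nil : Date.iso8601(value)
end

# Convert a JSON array of indication hashes into Indication structs,
# rejecting indication types outside the three listed above.
def parse_indications(json_text)
  JSON.parse(json_text).map do |row|
    type = row.fetch("indication_type")
    raise ArgumentError, "unknown indication type: #{type}" unless INDICATION_TYPES.include?(type)

    Indication.new(
      disease: row.fetch("disease"),
      subtypes: row.fetch("subtypes", []),
      treatment_lines: row.fetch("treatment_lines", []),
      biomarkers: row.fetch("biomarkers", []),
      indication_type: type,
      accelerated_approval_date: parse_optional_date(row["accelerated_approval_date"]),
      full_approval_date: parse_optional_date(row["full_approval_date"])
    )
  end
end
```

Validating the indication type at parse time keeps malformed model output from reaching the linking steps that follow.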
Step 3: Post-process
Links extracted disease names to canonical Disease records using the same entity resolution cascade from the clinical trials pipeline. Links biomarker mentions to Biomarker records. Extracts therapeutic areas and updates the parent drug.
Schedule and triggers
| Component | Schedule | Job class |
|---|---|---|
| Full drugs workflow | Manual via admin UI | WorkflowRunnerJob |
| Indication workflow | Auto-triggered after drugs workflow | WorkflowRunnerJob |
| FDA approval notifications | Sat 12:00 UTC | FdaApprovalNotificationsJob |
Trigger manually:
```ruby
# Full drugs pipeline
WorkflowRunnerJob.perform_now(workflow_type: "DrugsWorkflow")

# Just the indication extraction
WorkflowRunnerJob.perform_now(workflow_type: "IndicationWorkflow")
```
Key services
| Service | Purpose |
|---|---|
| FdaService | Downloads and unzips FDA data files |
| Fda::LabelSync | Fetches label metadata from DailyMed |
| Fda::DailyMedClient | HTTP client for DailyMed API |
| DrugMerger | Merges duplicate drug records |
| OpenAiService | LLM calls for label structuring and indication parsing |
Key Thor tasks
| Task | Purpose |
|---|---|
| regulatory:fda:download_and_extract | Download FDA drugs and labels |
| regulatory:ema:download_and_extract | Download EMA XLSX export |
| regulatory:approvals:assemble | Create unified DrugApproval records |
| regulatory:matching:match_fda_to_bioloupe | Match FDA records to Bioloupe drugs |
| regulatory:indications:structure_labels | LLM-structure label text |
| regulatory:indications:parse | LLM-parse individual indications |
| regulatory:fda:fetch_labels | Fetch labels from DailyMed |
| regulatory:fda:fetch_boxed_warnings | Pull boxed warnings from SPL XML |
| regulatory:purplebook:sync | Sync Purple Book biological products |
| regulatory:companion_diagnostics:sync | Sync FDA companion diagnostics |
Common problems
| Symptom | Likely cause | Fix |
|---|---|---|
| FDA download returns 403 | API key expired or rate limited | Update API key in the FDA Thor task |
| EMA XLSX structure changed | EMA updated their export format | Check column names in the EMA Thor task |
| Label fetch hangs | DailyMed API timeout | Restart with --skip-completed to resume |
| Duplicate DrugApprovals | Source key mismatch | Check source_type/source_id uniqueness |
| Innovation type is nil | New submission class code | Add mapping in approvals.thor |
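For the duplicate DrugApprovals symptom in the table above, a quick way to see which source keys are affected is to group approvals by source_type and source_id. The sketch below does this over plain hashes with made-up rows; against the database the same check would typically be a grouped count query.

```ruby
# Return every [source_type, source_id] pair that appears more than once.
def duplicate_source_keys(approvals)
  approvals
    .group_by { |a| [a[:source_type], a[:source_id]] }
    .select { |_key, rows| rows.size > 1 }
    .keys
end

rows = [
  { id: 1, source_type: "FdaDatum", source_id: 10 },
  { id: 2, source_type: "FdaDatum", source_id: 10 },
  { id: 3, source_type: "EmaApprovedDrug", source_id: 10 }
]
p duplicate_source_keys(rows)
```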
Next steps
- News and intelligence: how press releases and publications become structured intelligence
- Data model: deep dive into the approval and indication tables