Drug approvals pipeline

When the FDA approves a new cancer drug, that approval appears in Data Gov within days — matched to the correct Bioloupe drug, linked to structured indications, and enriched with label text. This page follows a drug approval from raw regulatory data through the pipeline to a unified DrugApproval record.

Four agencies, one unified model

The same drug receives a separate approval in each region. Pembrolizumab, for example, is approved in the USA (FDA), the EU (EMA), Japan (PMDA, collected from the KEGG drug database), and China (CDE/NMPA). Each source publishes data in a different format, with different identifiers, at different times.

Data Gov collects from all four and assembles a single drug_approvals record per region per drug. Polymorphic source_type/source_id columns trace every approval back to its raw regulatory source.
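
A minimal sketch of how the polymorphic pair can be resolved back to a raw record. It uses plain Ruby structs rather than the real ActiveRecord models, assumes source_type holds the source class name (the usual Rails convention), and uses placeholder identifiers and a hypothetical in-memory SOURCES lookup.

```ruby
# Plain-Ruby stand-ins for the source tables (the real models are ActiveRecord classes).
FdaDatum        = Struct.new(:id, :application_number, :brand_name, keyword_init: true)
EmaApprovedDrug = Struct.new(:id, :product_number, :brand_name, keyword_init: true)
DrugApproval    = Struct.new(:region, :brand_name, :source_type, :source_id, keyword_init: true)

# Hypothetical in-memory "tables", indexed by class name and then by id.
# All identifiers below are placeholders, not real application numbers.
SOURCES = {
  "FdaDatum" => {
    1 => FdaDatum.new(id: 1, application_number: "BLA000000", brand_name: "EXAMPLE")
  },
  "EmaApprovedDrug" => {
    7 => EmaApprovedDrug.new(id: 7, product_number: "EMEA/H/C/000000", brand_name: "Example")
  }
}.freeze

# Resolve the polymorphic pair back to the raw regulatory record.
def source_for(approval)
  SOURCES.fetch(approval.source_type).fetch(approval.source_id)
end
```

Because every approval carries both columns, one lookup is enough to answer "where did this record come from?" regardless of region.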

flowchart TB
  subgraph Sources["Four Regulatory Agencies"]
    FDA["FDA\n(openFDA API)\nNDA/BLA numbers"]
    EMA["EMA\n(XLSX export)\nProduct numbers"]
    KEGG["KEGG\n(Node.js scraper)\nKEGG identifiers"]
    CDE["CDE/NMPA\n(Database)\nCDE codes"]
  end

  subgraph Match["Match to Bioloupe Drugs"]
    MFDA["Match FDA records"]
    MEMA["Match EMA records"]
    MKEGG["Match KEGG records"]
  end

  subgraph Assembly["Assembly"]
    ASM["Create unified\nDrugApproval records"]
    PB["Purple Book\nbiosimilar approvals"]
    DIT["Determine\ninnovation type"]
    LNK["Link organisations\nto approvals"]
  end

  subgraph Enrich["Enrichment"]
    LBL["Fetch FDA labels\n(DailyMed)"]
    BW["Parse boxed\nwarnings"]
    REMS["Fetch REMS\nsafety data"]
    CDX["Sync companion\ndiagnostics"]
    STR["Structure labels\n(LLM)"]
  end

  subgraph Indications["Indication Extraction"]
    PI["Parse indications\n(LLM)"]
    PPD["Link diseases"]
    PPB["Link biomarkers"]
    ETA["Extract therapeutic\nareas"]
  end

  FDA --> MFDA --> ASM
  EMA --> MEMA --> ASM
  KEGG --> MKEGG --> ASM
  CDE --> ASM
  ASM --> PB --> DIT --> LNK
  LNK --> LBL --> BW --> REMS
  LBL --> STR --> PI --> PPD & PPB
  STR --> ETA
  LNK --> CDX

Data sources in detail

FDA (USA)

Source: openFDA download API (api.fda.gov/download.json)

The Regulatory::Fda Thor task downloads partitioned ZIP files containing JSON records for drugs and labels. Each record represents an FDA application (NDA or BLA). The pipeline:

Downloads drug partition files. Extracts records matching oncology/hematology therapeutic areas.
Downloads label partition files. Matches labels to drugs by application number.
Upserts FdaDatum records keyed on application_number + brand_name.
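
The upsert and label-matching steps above can be sketched in plain Ruby as follows. The record shapes are simplified assumptions (real openFDA JSON is richer), and the in-memory Hash stands in for the FdaDatum table.

```ruby
# Composite key used for the upsert: application_number + brand_name.
def upsert_key(record)
  [record.fetch(:application_number), record.fetch(:brand_name)]
end

# Insert new rows and merge updates into existing rows with the same key.
def upsert_all(store, records)
  records.each do |record|
    key = upsert_key(record)
    store[key] = (store[key] || {}).merge(record)
  end
  store
end

# Attach labels to every stored row that shares their application number.
def attach_labels(store, labels)
  labels_by_app = labels.group_by { |label| label.fetch(:application_number) }
  store.each_value do |row|
    row[:labels] = labels_by_app.fetch(row[:application_number], [])
  end
  store
end
```

Keying on both fields means one application that covers several brand names yields one row per brand, and re-running the download updates rows instead of duplicating them.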

FDA data also feeds supplementary tables:

Orange Book (orangebook_records) — Patent expiry dates and exclusivity periods
Purple Book (purplebook_records) — Biological products, biosimilar reference products
Breakthrough Therapy (breakthroughs) — FDA Breakthrough Therapy designations
Companion Diagnostics (fda_approved_companion_diagnostics) — Approved diagnostic tests paired with drugs and biomarkers
REMS (rems) — Risk Evaluation and Mitigation Strategies

EMA (EU)

Source: EMA public XLSX export

The Regulatory::Ema Thor task downloads the spreadsheet and creates EmaApprovedDrug records. Only category: 'Human' records are processed. Fields include authorization status, ATC code, therapeutic area, and regulatory flags (orphan, conditional, accelerated, PRIME, advanced therapy).
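
A short sketch of the filtering step, with a header check added because the export format can change. Rows are assumed to be already parsed into hashes; apart from category, the column names in REQUIRED_COLUMNS are hypothetical and would need to match the actual export.

```ruby
# Hypothetical column names; only "category" is taken from the description above.
REQUIRED_COLUMNS = %w[category name authorisation_status atc_code therapeutic_area].freeze

# Fail fast when the export no longer contains the columns the task relies on.
def check_columns!(header)
  missing = REQUIRED_COLUMNS - header
  raise ArgumentError, "EMA export is missing columns: #{missing.join(', ')}" unless missing.empty?
end

# Keep only rows for human medicines.
def human_rows(rows)
  rows.select { |row| row["category"] == "Human" }
end
```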

KEGG (Japan)

Source: KEGG drug database via Node.js scraper

Creates KeggApprovedDrug records for Japanese approvals. Remark codes indicate regulatory categories: G (generic), B (biosimilar), C (conditional), E/A (expedited review).

CDE (China)

Source: China CDE/NMPA database

Creates ChinaApprovedDrug records. Each record stores both the original Chinese-language data (original_data JSONB) and a machine-translated version (translated_original_data JSONB). This collection currently runs on demand rather than as part of the automated workflow.

Matching: from source records to Bioloupe drugs

Each source table carries a nullable drug_id FK. The matching step links source records to canonical Drug records using name matching against drugs.name and drugs.all_synonyms.

The matching Thor tasks (regulatory:matching:match_fda_to_bioloupe, etc.) run after collection and before assembly. Unmatched records remain with drug_id: null and queue for human review in ActiveAdmin.
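
As an illustration of name matching against drugs.name and drugs.all_synonyms, here is a minimal sketch. The normalization rule and the decision to leave ambiguous names unmatched are assumptions, not the documented behaviour of the Thor tasks.

```ruby
Drug = Struct.new(:id, :name, :all_synonyms, keyword_init: true)

# Lowercase and collapse punctuation so "MK-3475" and "mk 3475" compare equal.
def normalize(name)
  name.to_s.downcase.gsub(/[^a-z0-9]+/, " ").strip
end

# Map every normalized name and synonym to the ids of the drugs that use it.
def build_index(drugs)
  index = Hash.new { |hash, key| hash[key] = [] }
  drugs.each do |drug|
    [drug.name, *drug.all_synonyms].each do |term|
      key = normalize(term)
      index[key] << drug.id unless key.empty? || index[key].include?(drug.id)
    end
  end
  index
end

# Return a drug id only for an unambiguous hit; nil leaves the record for human review.
def match_drug_id(index, source_name)
  ids = index.fetch(normalize(source_name), [])
  ids.size == 1 ? ids.first : nil
end
```

Returning nil for both misses and ambiguous names keeps uncertain cases in the review queue rather than linking them to the wrong drug.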

Assembly: creating unified approvals

The regulatory:approvals:assemble task creates unified DrugApproval records from all source tables. For each source record, it:

Maps source-specific fields to a common schema (brand name, approval status, region, dates, innovation type)
Looks up an existing DrugApproval by source_type + source_id
Falls back to matching by application_number + brand_name for manually created records
Upserts with lockable attribute support — locked fields are never overwritten
Tracks indication changes (old vs. current) in supplementary_info.indications JSONB
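
The lookup, fallback, and locked-field behaviour above can be sketched as follows. This is a plain-Ruby approximation under stated assumptions: the Approval struct and the locked_attributes field are hypothetical stand-ins, and manually created records are assumed to have no source_id.

```ruby
Approval = Struct.new(:source_type, :source_id, :application_number, :brand_name,
                      :approval_status, :locked_attributes, keyword_init: true)

# Look up by source key first, then fall back to application_number + brand_name
# for records that were created by hand and have no source link yet.
def find_existing(approvals, attrs)
  approvals.find { |a| a.source_type == attrs[:source_type] && a.source_id == attrs[:source_id] } ||
    approvals.find do |a|
      a.source_id.nil? &&
        a.application_number == attrs[:application_number] &&
        a.brand_name == attrs[:brand_name]
    end
end

# Create the record if missing; otherwise update every field that is not locked.
def upsert_approval(approvals, attrs)
  record = find_existing(approvals, attrs)
  if record.nil?
    record = Approval.new(**attrs, locked_attributes: [])
    approvals << record
    return record
  end
  attrs.each do |field, value|
    next if record.locked_attributes.include?(field)
    record[field] = value
  end
  record
end
```

In this sketch the fallback match also fills in source_type and source_id, so later runs find the same record through the primary lookup.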

The innovation type records whether an approval covers an original drug, a generic, or a biosimilar. The pipeline derives it from FDA submission class codes, EMA generic/biosimilar flags, and KEGG remark codes.

Innovation type	Meaning
`original`	First approval of a new molecular entity
`generic`	Small molecule copy of an approved drug
`biosimilar`	Biological product similar to an approved biologic
`biosimilar_interchangeable`	Biosimilar approved as interchangeable with the reference product
`non_original`	Other non-original products
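
A hedged sketch of the derivation for the two sources whose inputs are described on this page. The KEGG mapping uses the G and B remark codes listed earlier; treating a record with no generic or biosimilar marker as original is an assumption, and FDA submission class codes are left out because their mapping is not listed here.

```ruby
# Remark codes G (generic) and B (biosimilar) come from the KEGG section above.
KEGG_REMARK_TO_INNOVATION = { "G" => "generic", "B" => "biosimilar" }.freeze

# Assumption: a KEGG record with no generic/biosimilar remark is an original product.
def innovation_type_from_kegg(remark_codes)
  hit = remark_codes.find { |code| KEGG_REMARK_TO_INNOVATION.key?(code) }
  hit ? KEGG_REMARK_TO_INNOVATION.fetch(hit) : "original"
end

# Assumption: the EMA flags are booleans and biosimilar takes precedence over generic.
def innovation_type_from_ema(generic:, biosimilar:)
  return "biosimilar" if biosimilar
  return "generic" if generic
  "original"
end
```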

Label enrichment

After assembly, the pipeline enriches FDA approvals with structured label content.

Fetch FDA labels. Fda::LabelSync calls the DailyMed API to retrieve label metadata and sections. Labels contain the official indication text, dosing, warnings, and adverse reactions.

Parse boxed warnings. Fetches SPL XML from DailyMed and extracts the boxed warning HTML. Boxed warnings are the highest-severity safety alerts the FDA issues.

Fetch REMS. Downloads REMS CSV files from FDA. Creates Rems records with program details, materials, and linked application numbers.

Sync companion diagnostics. Updates fda_approved_companion_diagnostics with paired drug-biomarker-test relationships.

Indication extraction

This is where drug approvals connect to diseases. It runs as a separate IndicationWorkflow (37 steps) triggered after the drugs workflow. The three steps below summarise its main stages.

Step 1: Structure labels (LLM)

GPT-4.1 reads raw FDA label text and structures it into parsed sections. This transforms free-text paragraphs into machine-readable fields.

Step 2: Parse indications (LLM)

GPT-4.1 extracts individual Indication records from structured label text. Each indication captures:

Disease name and subtypes
Treatment lines (first-line, second-line, maintenance)
Required biomarkers
Indication type (treatment, diagnosis, prevention)
Accelerated vs. full approval dates
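
To make the shape of an extracted indication concrete, here is a sketch of a value object with a small validity check for LLM output. The field names are hypothetical, and the list of indication types is assumed to be exactly the three named above.

```ruby
# Assumed to be the complete set of indication types.
INDICATION_TYPES = %w[treatment diagnosis prevention].freeze

# Hypothetical field names mirroring the list above.
ParsedIndication = Struct.new(:disease_name, :subtypes, :treatment_lines, :biomarkers,
                              :indication_type, :accelerated_approval_date,
                              :full_approval_date, keyword_init: true)

# Collect problems instead of raising, so a bad extraction can be logged and reviewed.
def validation_errors(indication)
  errors = []
  errors << "disease_name is blank" if indication.disease_name.to_s.strip.empty?
  unless INDICATION_TYPES.include?(indication.indication_type)
    errors << "unknown indication_type: #{indication.indication_type.inspect}"
  end
  errors
end
```

Checking the parsed output before saving it is one way to catch malformed model responses early; whether the real workflow validates at this point is not stated here.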

Step 3: Post-process

Links extracted disease names to canonical Disease records using the same entity resolution cascade as the clinical trials pipeline. Links biomarker mentions to Biomarker records. Extracts therapeutic areas and updates the parent drug.

Schedule and triggers

Component	Schedule	Job class
Full drugs workflow	Manual via admin UI	`WorkflowRunnerJob`
Indication workflow	Auto-triggered after drugs workflow	`WorkflowRunnerJob`
FDA approval notifications	Sat 12:00 UTC	`FdaApprovalNotificationsJob`

Trigger manually:

# Full drugs pipeline
WorkflowRunnerJob.perform_now(workflow_type: "DrugsWorkflow")

# Just the indication extraction
WorkflowRunnerJob.perform_now(workflow_type: "IndicationWorkflow")

Key services

Service	Purpose
`FdaService`	Downloads and unzips FDA data files
`Fda::LabelSync`	Fetches label metadata from DailyMed
`Fda::DailyMedClient`	HTTP client for DailyMed API
`DrugMerger`	Merges duplicate drug records
`OpenAiService`	LLM calls for label structuring and indication parsing

Key Thor tasks

Task	Purpose
`regulatory:fda:download_and_extract`	Download FDA drugs and labels
`regulatory:ema:download_and_extract`	Download EMA XLSX export
`regulatory:approvals:assemble`	Create unified DrugApproval records
`regulatory:matching:match_fda_to_bioloupe`	Match FDA records to Bioloupe drugs
`regulatory:indications:structure_labels`	LLM-structure label text
`regulatory:indications:parse`	LLM-parse individual indications
`regulatory:fda:fetch_labels`	Fetch labels from DailyMed
`regulatory:fda:fetch_boxed_warnings`	Pull boxed warnings from SPL XML
`regulatory:purplebook:sync`	Sync Purple Book biological products
`regulatory:companion_diagnostics:sync`	Sync FDA companion diagnostics

Common problems

Symptom	Likely cause	Fix
FDA download returns 403	API key expired or rate limited	Update API key in the FDA Thor task
EMA XLSX structure changed	EMA updated their export format	Check column names in the EMA Thor task
Label fetch hangs	DailyMed API timeout	Restart with `--skip-completed` to resume
Duplicate DrugApprovals	Source key mismatch	Check `source_type`/`source_id` uniqueness
Innovation type is nil	New submission class code	Add mapping in `approvals.thor`
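
For the duplicate DrugApprovals case, a quick way to check source key uniqueness is to group by the key and count. The sketch below works on plain hashes; the ActiveRecord query in the comment is an assumed equivalent for a Rails console, not a documented command of this project.

```ruby
# Assumed Rails console equivalent:
#   DrugApproval.group(:source_type, :source_id).having("COUNT(*) > 1").count
#
# Returns a hash of { [source_type, source_id] => number_of_rows } for duplicated keys.
# Rows without a source_id (for example manually created ones) are ignored.
def duplicate_source_keys(approvals)
  approvals
    .reject { |approval| approval[:source_id].nil? }
    .group_by { |approval| [approval[:source_type], approval[:source_id]] }
    .select { |_key, rows| rows.size > 1 }
    .transform_values(&:size)
end
```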

Next steps

News and intelligence — How press releases and publications become structured intelligence
Data model — Deep dive into the approval and indication tables