Drug approvals pipeline

When the FDA approves a new cancer drug, that approval appears in Data Gov within days — matched to the correct Bioloupe drug, linked to structured indications, and enriched with label text. This page follows a drug approval from raw regulatory data through the pipeline to a unified DrugApproval record.

The same drug receives a separate approval in each region. Pembrolizumab, for example, is approved by the FDA (USA), the EMA (EU), the PMDA (Japan, collected via KEGG), and CDE/NMPA (China). Each agency publishes data in a different format, with different identifiers, at different times.

Data Gov collects from all four and assembles a single drug_approvals record per region per drug. Polymorphic source_type/source_id columns trace every approval back to its raw regulatory source.

flowchart TB
  subgraph Sources["Four Regulatory Agencies"]
    FDA["FDA\n(openFDA API)\nNDA/BLA numbers"]
    EMA["EMA\n(XLSX export)\nProduct numbers"]
    KEGG["KEGG\n(Node.js scraper)\nKEGG identifiers"]
    CDE["CDE/NMPA\n(Database)\nCDE codes"]
  end

  subgraph Match["Match to Bioloupe Drugs"]
    MFDA["Match FDA records"]
    MEMA["Match EMA records"]
    MKEGG["Match KEGG records"]
  end

  subgraph Assembly["Assembly"]
    ASM["Create unified\nDrugApproval records"]
    PB["Purple Book\nbiosimilar approvals"]
    DIT["Determine\ninnovation type"]
    LNK["Link organisations\nto approvals"]
  end

  subgraph Enrich["Enrichment"]
    LBL["Fetch FDA labels\n(DailyMed)"]
    BW["Parse boxed\nwarnings"]
    REMS["Fetch REMS\nsafety data"]
    CDX["Sync companion\ndiagnostics"]
    STR["Structure labels\n(LLM)"]
  end

  subgraph Indications["Indication Extraction"]
    PI["Parse indications\n(LLM)"]
    PPD["Link diseases"]
    PPB["Link biomarkers"]
    ETA["Extract therapeutic\nareas"]
  end

  FDA --> MFDA --> ASM
  EMA --> MEMA --> ASM
  KEGG --> MKEGG --> ASM
  CDE --> ASM
  ASM --> PB --> DIT --> LNK
  LNK --> LBL --> BW --> REMS
  LBL --> STR --> PI --> PPD & PPB
  STR --> ETA
  LNK --> CDX

Source: openFDA download API (api.fda.gov/download.json)

The Regulatory::Fda Thor task downloads partitioned ZIP files containing JSON records for drugs and labels. Each record represents an FDA application (NDA or BLA). The pipeline:

  1. Downloads drug partition files. Extracts records matching oncology/hematology therapeutic areas.
  2. Downloads label partition files. Matches labels to drugs by application number.
  3. Upserts FdaDatum records keyed on application_number + brand_name.
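
The upsert in step 3 can be sketched in plain Ruby. This is a minimal illustration rather than the pipeline's code: `FdaStore` is a hypothetical in-memory stand-in for the table behind `FdaDatum`, and the application number and `label_id` values are illustrative.

```ruby
# Hypothetical in-memory stand-in for the table behind FdaDatum.
class FdaStore
  def initialize
    @rows = {}
  end

  # Insert a new row or merge into the existing one,
  # keyed on application_number + brand_name.
  def upsert(attrs)
    key = [attrs.fetch(:application_number), attrs.fetch(:brand_name)]
    @rows[key] = (@rows[key] || {}).merge(attrs)
  end

  def find(application_number, brand_name)
    @rows[[application_number, brand_name]]
  end

  def size
    @rows.size
  end
end

store = FdaStore.new
# First pass: the drug partition supplies the application record.
store.upsert(application_number: "BLA125514", brand_name: "KEYTRUDA", label_id: nil)
# Second pass: the label partition fills in the label for the same key.
store.upsert(application_number: "BLA125514", brand_name: "KEYTRUDA", label_id: "abc-123")
```

Because both passes share one key, a re-run updates the existing row instead of creating a second one.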

FDA data also feeds supplementary tables:

  • Orange Book (orangebook_records) — Patent expiry dates and exclusivity periods
  • Purple Book (purplebook_records) — Biological products, biosimilar reference products
  • Breakthrough Therapy (breakthroughs) — FDA Breakthrough Therapy designations
  • Companion Diagnostics (fda_approved_companion_diagnostics) — Approved diagnostic tests paired with drugs and biomarkers
  • REMS (rems) — Risk Evaluation and Mitigation Strategies

Source: EMA public XLSX export

The Regulatory::Ema Thor task downloads the spreadsheet and creates EmaApprovedDrug records. Only category: 'Human' records are processed. Fields include authorization status, ATC code, therapeutic area, and regulatory flags (orphan, conditional, accelerated, PRIME, advanced therapy).
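
A minimal sketch of the row filter described above, assuming each spreadsheet row has already been read into a Hash. The key names (`:category`, `:orphan`, `:prime`, and so on) are hypothetical; the real export's column headers may differ.

```ruby
# Hypothetical flag keys; the real export's column headers may differ.
REGULATORY_FLAGS = %i[orphan conditional accelerated prime advanced_therapy].freeze

# Keep only rows for human medicines.
def human_rows(rows)
  rows.select { |row| row[:category] == "Human" }
end

# List the regulatory flags that are set on a row.
def active_flags(row)
  REGULATORY_FLAGS.select { |flag| row[flag] == true }
end

rows = [
  { category: "Human", name: "Example A", orphan: true, prime: true },
  { category: "Veterinary", name: "Example B", orphan: false }
]

human = human_rows(rows)
flags = active_flags(human.first)
```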

Source: KEGG drug database via Node.js scraper

Creates KeggApprovedDrug records for Japanese approvals. Remark codes indicate regulatory categories: G (generic), B (biosimilar), C (conditional), E/A (expedited review).
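
The remark codes above can be decoded with a small lookup table. This sketch assumes that "E/A" means two separate codes, E and A, that both indicate expedited review; the method name and symbols are illustrative.

```ruby
# Remark codes as listed above. Treating E and A as two separate codes
# with the same meaning is an assumption.
REMARK_CATEGORIES = {
  "G" => :generic,
  "B" => :biosimilar,
  "C" => :conditional,
  "E" => :expedited_review,
  "A" => :expedited_review
}.freeze

# Returns the known categories for a list of remark codes.
# Unknown codes are ignored and duplicates are collapsed.
def remark_categories(codes)
  codes.map { |code| REMARK_CATEGORIES[code.to_s.strip.upcase] }.compact.uniq
end

categories = remark_categories(["B", "E", "A", "X"])
```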

Source: China CDE/NMPA database

The CDE collector creates ChinaApprovedDrug records. It stores both the original Chinese-language data (original_data JSONB) and a machine-translated version (translated_original_data JSONB). It currently runs on demand rather than as part of the automated workflow.

Matching: from source records to Bioloupe drugs

Each source table carries a nullable drug_id FK. The matching step links source records to canonical Drug records using name matching against drugs.name and drugs.all_synonyms.

The matching Thor tasks (regulatory:matching:match_fda_to_bioloupe, etc.) run after collection and before assembly. Unmatched records remain with drug_id: null and queue for human review in ActiveAdmin.
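
The matching step could look roughly like the sketch below. It is not the task's actual implementation: the `Drug` struct stands in for the `drugs` table, and the normalization rule (lowercase, punctuation collapsed to spaces) and the "exactly one candidate" rule are assumptions.

```ruby
# Hypothetical stand-in for a row in the drugs table.
Drug = Struct.new(:id, :name, :all_synonyms)

# Assumed normalization: lowercase and collapse punctuation to single spaces.
def normalize(name)
  name.to_s.downcase.gsub(/[^a-z0-9]+/, " ").strip
end

# Maps every normalized name or synonym to the ids of the drugs that use it.
def build_index(drugs)
  index = Hash.new { |hash, key| hash[key] = [] }
  drugs.each do |drug|
    ([drug.name] + drug.all_synonyms).each do |candidate|
      key = normalize(candidate)
      index[key] << drug.id unless key.empty? || index[key].include?(drug.id)
    end
  end
  index
end

# Returns a drug id only when exactly one drug matches; otherwise nil,
# which leaves the source record unmatched for human review.
def match_drug_id(index, source_name)
  key = normalize(source_name)
  ids = index.key?(key) ? index[key] : []
  ids.size == 1 ? ids.first : nil
end

drugs = [
  Drug.new(1, "Pembrolizumab", ["Keytruda", "MK-3475"]),
  Drug.new(2, "Nivolumab", ["Opdivo"])
]
index = build_index(drugs)
```

Returning nil for ambiguous names, instead of picking the first candidate, keeps wrong links out of the data at the cost of more manual review.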

The regulatory:approvals:assemble task creates unified DrugApproval records from all source tables. For each source record, it:

  1. Maps source-specific fields to a common schema (brand name, approval status, region, dates, innovation type)
  2. Looks up an existing DrugApproval by source_type + source_id
  3. Falls back to matching by application_number + brand_name for manually created records
  4. Upserts with lockable attribute support — locked fields are never overwritten
  5. Tracks indication changes (old vs. current) in supplementary_info.indications JSONB
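
Steps 2 to 4 can be sketched as follows. The `Approval` struct, the `locked_fields` list, and the sample values are hypothetical; the real DrugApproval model and its lockable attribute support may work differently.

```ruby
# Hypothetical in-memory approval; the real DrugApproval is a database-backed model.
Approval = Struct.new(:source_type, :source_id, :application_number, :brand_name,
                      :attributes, :locked_fields, keyword_init: true)

# Step 2, then step 3: look up by source key first, then fall back to
# application_number + brand_name (covers manually created records).
def find_existing(approvals, mapped)
  approvals.find do |a|
    a.source_type == mapped[:source_type] && a.source_id == mapped[:source_id]
  end || approvals.find do |a|
    a.application_number == mapped[:application_number] && a.brand_name == mapped[:brand_name]
  end
end

# Step 4: apply incoming values, skipping any field the record has locked.
def apply_unlocked(approval, incoming)
  incoming.each do |field, value|
    next if approval.locked_fields.include?(field)
    approval.attributes[field] = value
  end
  approval
end

# A manually created record with no source key and one locked field.
manual = Approval.new(source_type: nil, source_id: nil,
                      application_number: "BLA125514", brand_name: "KEYTRUDA",
                      attributes: { approval_status: "approved", region: "USA" },
                      locked_fields: [:approval_status])
mapped = { source_type: "FdaDatum", source_id: 42,
           application_number: "BLA125514", brand_name: "KEYTRUDA" }

existing = find_existing([manual], mapped)
apply_unlocked(existing, { approval_status: "withdrawn", region: "US" })
```

In this example the fallback finds the manual record, the locked `approval_status` keeps its value, and only `region` is overwritten.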

Innovation types determine whether an approval is for an original drug, a generic, or a biosimilar. The pipeline derives this from FDA submission class codes, EMA generic/biosimilar flags, and KEGG remark codes.

| Innovation type | Meaning |
|---|---|
| original | First approval of a new molecular entity |
| generic | Small molecule copy of an approved drug |
| biosimilar | Biological product similar to an approved biologic |
| biosimilar_interchangeable | Biosimilar approved as interchangeable with the reference product |
| non_original | Other non-original products |
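
A hedged sketch of how innovation types might be derived. The EMA rule (biosimilar flag wins, neither flag means original) is an assumption, as are the method names. FDA submission class codes are left out because this page does not list them.

```ruby
INNOVATION_TYPES = %w[original generic biosimilar biosimilar_interchangeable non_original].freeze

# Assumed rule for EMA: the biosimilar flag wins over the generic flag,
# and a product with neither flag is treated as original.
def ema_innovation_type(generic:, biosimilar:)
  return "biosimilar" if biosimilar
  return "generic" if generic
  "original"
end

# Lookup for code-based sources. Returns nil when a code has no mapping yet,
# so unmapped codes surface as a nil innovation type.
def mapped_innovation_type(mapping, code)
  type = mapping[code]
  INNOVATION_TYPES.include?(type) ? type : nil
end

# KEGG remark codes G and B, as described in the KEGG section.
kegg_mapping = { "G" => "generic", "B" => "biosimilar" }

ema_type = ema_innovation_type(generic: false, biosimilar: true)
kegg_type = mapped_innovation_type(kegg_mapping, "G")
unmapped = mapped_innovation_type(kegg_mapping, "Z")
```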

After assembly, the pipeline enriches FDA approvals with structured label content.

Fetch FDA labels. Fda::LabelSync calls the DailyMed API to retrieve label metadata and sections. Labels contain the official indication text, dosing, warnings, and adverse reactions.

Parse boxed warnings. Fetches SPL XML from DailyMed and extracts the boxed warning HTML. Boxed warnings are the highest-severity safety alerts the FDA issues.

Fetch REMS. Downloads REMS CSV files from FDA. Creates Rems records with program details, materials, and linked application numbers.

Sync companion diagnostics. Updates fda_approved_companion_diagnostics with paired drug-biomarker-test relationships.

Indication extraction is where drug approvals connect to diseases. It runs as a separate IndicationWorkflow (37 steps), triggered after the drugs workflow.

GPT-4.1 reads raw FDA label text and structures it into parsed sections. This transforms free-text paragraphs into machine-readable fields.

GPT-4.1 extracts individual Indication records from structured label text. Each indication captures:

  • Disease name and subtypes
  • Treatment lines (first-line, second-line, maintenance)
  • Required biomarkers
  • Indication type (treatment, diagnosis, prevention)
  • Accelerated vs. full approval dates
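
The fields listed above could be modelled as shown below. This is a sketch under assumptions: the JSON shape, key names, and sample values are invented for illustration and are not real model output or the real Indication schema.

```ruby
require "json"

# Hypothetical shape; the fields mirror the list above, not the real schema.
Indication = Struct.new(:disease_name, :subtypes, :treatment_lines, :biomarkers,
                        :indication_type, :accelerated_approval_date,
                        :full_approval_date, keyword_init: true)

ALLOWED_INDICATION_TYPES = %w[treatment diagnosis prevention].freeze

# Turns a JSON array (assumed LLM output format) into Indication structs,
# rejecting indication types outside the allowed set.
def parse_indications(json_text)
  JSON.parse(json_text).map do |item|
    type = item["indication_type"]
    unless ALLOWED_INDICATION_TYPES.include?(type)
      raise ArgumentError, "unexpected indication_type: #{type.inspect}"
    end

    Indication.new(
      disease_name: item.fetch("disease_name"),
      subtypes: item.fetch("subtypes", []),
      treatment_lines: item.fetch("treatment_lines", []),
      biomarkers: item.fetch("biomarkers", []),
      indication_type: type,
      accelerated_approval_date: item["accelerated_approval_date"],
      full_approval_date: item["full_approval_date"]
    )
  end
end

# Illustrative input only.
sample = <<~JSON
  [{"disease_name": "non-small cell lung cancer",
    "treatment_lines": ["first-line"],
    "biomarkers": ["PD-L1"],
    "indication_type": "treatment"}]
JSON

indications = parse_indications(sample)
```

Validating `indication_type` at parse time means a malformed response fails loudly instead of producing a partially filled record.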

The final steps link extracted disease names to canonical Disease records using the same entity resolution cascade as the clinical trials pipeline, link biomarker mentions to Biomarker records, and extract therapeutic areas to update the parent drug.

| Component | Schedule | Job class |
|---|---|---|
| Full drugs workflow | Manual via admin UI | WorkflowRunnerJob |
| Indication workflow | Auto-triggered after drugs workflow | WorkflowRunnerJob |
| FDA approval notifications | Sat 12:00 UTC | FdaApprovalNotificationsJob |

Trigger manually:

# Full drugs pipeline
WorkflowRunnerJob.perform_now(workflow_type: "DrugsWorkflow")
# Just the indication extraction
WorkflowRunnerJob.perform_now(workflow_type: "IndicationWorkflow")

| Service | Purpose |
|---|---|
| FdaService | Downloads and unzips FDA data files |
| Fda::LabelSync | Fetches label metadata from DailyMed |
| Fda::DailyMedClient | HTTP client for DailyMed API |
| DrugMerger | Merges duplicate drug records |
| OpenAiService | LLM calls for label structuring and indication parsing |


| Task | Purpose |
|---|---|
| regulatory:fda:download_and_extract | Download FDA drugs and labels |
| regulatory:ema:download_and_extract | Download EMA XLSX export |
| regulatory:approvals:assemble | Create unified DrugApproval records |
| regulatory:matching:match_fda_to_bioloupe | Match FDA records to Bioloupe drugs |
| regulatory:indications:structure_labels | LLM-structure label text |
| regulatory:indications:parse | LLM-parse individual indications |
| regulatory:fda:fetch_labels | Fetch labels from DailyMed |
| regulatory:fda:fetch_boxed_warnings | Pull boxed warnings from SPL XML |
| regulatory:purplebook:sync | Sync Purple Book biological products |
| regulatory:companion_diagnostics:sync | Sync FDA companion diagnostics |


| Symptom | Likely cause | Fix |
|---|---|---|
| FDA download returns 403 | API key expired or rate limited | Update API key in the FDA Thor task |
| EMA XLSX structure changed | EMA updated their export format | Check column names in the EMA Thor task |
| Label fetch hangs | DailyMed API timeout | Restart with --skip-completed to resume |
| Duplicate DrugApprovals | Source key mismatch | Check source_type/source_id uniqueness |
| Innovation type is nil | New submission class code | Add mapping in approvals.thor |

  • News and intelligence — How press releases and publications become structured intelligence
  • Data model — Deep dive into the approval and indication tables