Skip to content

What is Bioloupe Data?

Bioloupe Data is a 59-table pharmaceutical intelligence database that tracks drug lifecycles from discovery through clinical trials to regulatory approval and market launch. It replaces a legacy 291-table system with a clean, normalized schema designed for AI agent text-to-SQL queries. Every table has a clear responsibility. Every relationship is explicit. Every column carries a machine-readable comment describing its business meaning.

Pharmaceutical data is fragmented across dozens of sources. The same drug appears as “Pembrolizumab”, “KEYTRUDA”, “MK-3475”, and “lambrolizumab” depending on who filed the paperwork. ClinicalTrials.gov accepts free-text self-reported data with no naming standards. The FDA, EMA, and KEGG all use different identifiers for the same approvals. Press releases bury clinical trial results in marketing language.

No single source tells the full story of a drug’s lifecycle — from target identification through preclinical work, clinical trials, regulatory review, and commercial launch. Bioloupe Data exists to assemble that story into a single queryable schema.

flowchart LR
  subgraph Sources["Public Data Sources"]
    AACT["ClinicalTrials.gov"]
    Reg["FDA / EMA / KEGG"]
    PubMed["PubMed\n+ Conferences"]
    News["News Wires\n+ Financial APIs"]
  end

  subgraph Facts["59 Fact Tables"]
    Curated["Scraped & curated\nraw facts"]
  end

  subgraph Intelligence["5 Materialized Views"]
    MV["Pre-computed\nintelligence"]
  end

  subgraph Consumer["Consumers"]
    Agent["AI Agent\n(text-to-SQL)"]
  end

  Sources --> Curated --> MV --> Agent

Data flows in one direction. Public sources are scraped and curated into 59 fact tables — each holding one kind of truth. Five materialized views then compute intelligence on top of those facts: development programs, drug-trial mappings, efficacy evidence, competitive landscapes, and pipeline overviews. The AI agent queries the materialized views for common questions and drops down to base tables when it needs detail.

The schema separates facts from intelligence.

Fact tables hold accurately scraped or curated raw data. A drug’s name, its mechanism of action, which organization owns it, which trials study it, what the FDA approved it for. Each fact lives in exactly one place. No redundancy, no computed columns in fact tables, no editable derived data.

Materialized views compute intelligence by joining and aggregating fact tables. The highest development phase for a drug-disease pair. The competitive landscape for a therapeutic area. Efficacy evidence unified across trial results and publications. Views are read-only, refreshed on a schedule, and designed so the AI agent can answer common questions without writing multi-table JOINs.

The schema spans 16 domains covering the full drug development lifecycle.

DomainTablesWhat it captures
Drugs8Drug identity, ownership, target actions, approvals, brand names, milestones
Clinical trials14Trial design, arms, interventions, diseases, sponsors, locations, results, adverse events, eligibility
Diseases4Disease ontology (DAG with closure table), population statistics, biomarkers
Targets1Biological targets (proteins, receptors, pathways) that drugs act on
Technologies1Drug modalities — small molecule, monoclonal antibody, ADC, cell therapy, etc.
Biomarkers1Molecular markers used in trial eligibility and treatment selection
Organisations4Companies, licensing deals, deal participants, deal-drug associations
Regimens3Treatment regimens, individual interventions, regimen composition
Indications4Approved/investigational indications, therapeutic approaches, prior therapy lines
Guidelines1Clinical practice guidelines referencing drugs and diseases
Publications4Journal articles, conference abstracts, reported outcomes, studied interventions
News7News articles with structured entity mentions (drugs, diseases, orgs, trials, targets, technologies)
Lookups6Reference data — countries, endpoints, modes of action, NCI Thesaurus concepts
Audit1Change tracking across all entities

Five materialized views sit on top of these tables:

ViewWhat it computes
mv_development_programsHighest phase per drug-disease pair
mv_drug_trialsDrug-to-trial bridge with derived role
mv_efficacy_evidenceUnified trial results + publication outcomes
mv_competitive_landscapeDrug-disease-phase matrix with current owners
mv_drug_pipelinePre-joined drug overview for zero-JOIN queries

These docs are designed to read in order. Each section builds on the last.

Start with architecture to understand the design philosophy before diving into specific domains:

  1. Data Philosophy — Why tables hold raw facts and materialized views compute intelligence
  2. Schema Overview — Bird’s-eye ERD, domain map, full entity inventory
  3. Materialized Views — The five views, what they pre-compute, the refresh strategy

Then read the domains that matter to your work, starting from the center of the graph and moving outward:

  1. Drugs — The center of the schema. Everything connects through drugs.
  2. Clinical Trials — The largest domain (14 tables). Trial design, results, eligibility.
  3. Diseases — The ontology that organizes all therapeutic work.
  4. Organisations — Who develops what, and who bought it from whom.
  5. Indications — Treatment landscape, therapeutic approaches, prior therapy lines.
  6. Regimens — How drugs combine into treatment regimens.
  7. Publications — Literature evidence linking trials to published outcomes.
  8. News — Structured intelligence extracted from press releases and news wires.
  9. Lookups — Reference data that standardizes vocabulary across domains.

Finish with the AI agent guide:

  1. AI Agent Querying — Which views to hit, when to use base tables, query patterns and anti-patterns.

Dagster orchestrates all data ingestion, enrichment, and the computed layer (dbt). The pipeline code lives in pipelines/ as a Python package alongside the TypeScript schema layer. See the data pipeline design for the full architecture.

Pre-existing tables (drugs, organisations, targets, technologies, biomarkers, indications, drug_ownerships) are declared as external AssetSpecs — Dagster tracks them without managing their schema.