What is Bioloupe Data?

Bioloupe Data is a 59-table pharmaceutical intelligence database that tracks drug lifecycles from discovery through clinical trials to regulatory approval and market launch. It replaces a legacy 291-table system with a clean, normalized schema designed for AI agent text-to-SQL queries. Every table has a clear responsibility. Every relationship is explicit. Every column carries a machine-readable comment describing its business meaning.

The problem it solves

Pharmaceutical data is fragmented across dozens of sources. The same drug appears as “Pembrolizumab”, “KEYTRUDA”, “MK-3475”, and “lambrolizumab” depending on who filed the paperwork. ClinicalTrials.gov accepts free-text self-reported data with no naming standards. The FDA, EMA, and KEGG all use different identifiers for the same approvals. Press releases bury clinical trial results in marketing language.

No single source tells the full story of a drug’s lifecycle — from target identification through preclinical work, clinical trials, regulatory review, and commercial launch. Bioloupe Data exists to assemble that story into a single queryable schema.

How it works

flowchart LR
  subgraph Sources["Public Data Sources"]
    AACT["ClinicalTrials.gov"]
    Reg["FDA / EMA / KEGG"]
    PubMed["PubMed\n+ Conferences"]
    News["News Wires\n+ Financial APIs"]
  end

  subgraph Facts["59 Fact Tables"]
    Curated["Scraped & curated\nraw facts"]
  end

  subgraph Intelligence["5 Materialized Views"]
    MV["Pre-computed\nintelligence"]
  end

  subgraph Consumer["Consumers"]
    Agent["AI Agent\n(text-to-SQL)"]
  end

  Sources --> Curated --> MV --> Agent

Data flows in one direction. Public sources are scraped and curated into 59 fact tables — each holding one kind of truth. Five materialized views then compute intelligence on top of those facts: development programs, drug-trial mappings, efficacy evidence, competitive landscapes, and pipeline overviews. The AI agent queries the materialized views for common questions and drops down to base tables when it needs detail.

Two-layer design

The schema separates facts from intelligence.

Fact tables hold accurately scraped or curated raw data. A drug’s name, its mechanism of action, which organization owns it, which trials study it, what the FDA approved it for. Each fact lives in exactly one place. No redundancy, no computed columns in fact tables, no editable derived data.

Materialized views compute intelligence by joining and aggregating fact tables. The highest development phase for a drug-disease pair. The competitive landscape for a therapeutic area. Efficacy evidence unified across trial results and publications. Views are read-only, refreshed on a schedule, and designed so the AI agent can answer common questions without writing multi-table JOINs.

What it tracks

The schema spans 16 domains covering the full drug development lifecycle.

Domain	Tables	What it captures
Drugs	8	Drug identity, ownership, target actions, approvals, brand names, milestones
Clinical trials	14	Trial design, arms, interventions, diseases, sponsors, locations, results, adverse events, eligibility
Diseases	4	Disease ontology (DAG with closure table), population statistics, biomarkers
Targets	1	Biological targets (proteins, receptors, pathways) that drugs act on
Technologies	1	Drug modalities — small molecule, monoclonal antibody, ADC, cell therapy, etc.
Biomarkers	1	Molecular markers used in trial eligibility and treatment selection
Organisations	4	Companies, licensing deals, deal participants, deal-drug associations
Regimens	3	Treatment regimens, individual interventions, regimen composition
Indications	4	Approved/investigational indications, therapeutic approaches, prior therapy lines
Guidelines	1	Clinical practice guidelines referencing drugs and diseases
Publications	4	Journal articles, conference abstracts, reported outcomes, studied interventions
News	7	News articles with structured entity mentions (drugs, diseases, orgs, trials, targets, technologies)
Lookups	6	Reference data — countries, endpoints, modes of action, NCI Thesaurus concepts
Audit	1	Change tracking across all entities

Five materialized views sit on top of these tables:

View	What it computes
`mv_development_programs`	Highest phase per drug-disease pair
`mv_drug_trials`	Drug-to-trial bridge with derived role
`mv_efficacy_evidence`	Unified trial results + publication outcomes
`mv_competitive_landscape`	Drug-disease-phase matrix with current owners
`mv_drug_pipeline`	Pre-joined drug overview for zero-JOIN queries

How to read these docs

These docs are designed to read in order. Each section builds on the last.

Start with architecture to understand the design philosophy before diving into specific domains:

Data Philosophy — Why tables hold raw facts and materialized views compute intelligence
Schema Overview — Bird’s-eye ERD, domain map, full entity inventory
Materialized Views — The five views, what they pre-compute, the refresh strategy

Then read the domains that matter to your work, starting from the center of the graph and moving outward:

Drugs — The center of the schema. Everything connects through drugs.
Clinical Trials — The largest domain (14 tables). Trial design, results, eligibility.
Diseases — The ontology that organizes all therapeutic work.
Organisations — Who develops what, and who bought it from whom.
Indications — Treatment landscape, therapeutic approaches, prior therapy lines.
Regimens — How drugs combine into treatment regimens.
Publications — Literature evidence linking trials to published outcomes.
News — Structured intelligence extracted from press releases and news wires.
Lookups — Reference data that standardizes vocabulary across domains.

Finish with the AI agent guide:

AI Agent Querying — Which views to hit, when to use base tables, query patterns and anti-patterns.

Pipeline layer

Dagster orchestrates all data ingestion, enrichment, and the computed layer (dbt). The pipeline code lives in pipelines/ as a Python package alongside the TypeScript schema layer. See the data pipeline design for the full architecture.

Pre-existing tables (drugs, organisations, targets, technologies, biomarkers, indications, drug_ownerships) are declared as external AssetSpecs — Dagster tracks them without managing their schema.