What is Bioloupe Data?
Bioloupe Data is a 59-table pharmaceutical intelligence database that tracks drug lifecycles from discovery through clinical trials to regulatory approval and market launch. It replaces a legacy 291-table system with a clean, normalized schema designed for AI agent text-to-SQL queries. Every table has a clear responsibility. Every relationship is explicit. Every column carries a machine-readable comment describing its business meaning.
The problem it solves
Section titled “The problem it solves”Pharmaceutical data is fragmented across dozens of sources. The same drug appears as “Pembrolizumab”, “KEYTRUDA”, “MK-3475”, and “lambrolizumab” depending on who filed the paperwork. ClinicalTrials.gov accepts free-text self-reported data with no naming standards. The FDA, EMA, and KEGG all use different identifiers for the same approvals. Press releases bury clinical trial results in marketing language.
No single source tells the full story of a drug’s lifecycle — from target identification through preclinical work, clinical trials, regulatory review, and commercial launch. Bioloupe Data exists to assemble that story into a single queryable schema.
How it works
Section titled “How it works”flowchart LR
subgraph Sources["Public Data Sources"]
AACT["ClinicalTrials.gov"]
Reg["FDA / EMA / KEGG"]
PubMed["PubMed\n+ Conferences"]
News["News Wires\n+ Financial APIs"]
end
subgraph Facts["59 Fact Tables"]
Curated["Scraped & curated\nraw facts"]
end
subgraph Intelligence["5 Materialized Views"]
MV["Pre-computed\nintelligence"]
end
subgraph Consumer["Consumers"]
Agent["AI Agent\n(text-to-SQL)"]
end
Sources --> Curated --> MV --> Agent
Data flows in one direction. Public sources are scraped and curated into 59 fact tables — each holding one kind of truth. Five materialized views then compute intelligence on top of those facts: development programs, drug-trial mappings, efficacy evidence, competitive landscapes, and pipeline overviews. The AI agent queries the materialized views for common questions and drops down to base tables when it needs detail.
Two-layer design
Section titled “Two-layer design”The schema separates facts from intelligence.
Fact tables hold accurately scraped or curated raw data. A drug’s name, its mechanism of action, which organization owns it, which trials study it, what the FDA approved it for. Each fact lives in exactly one place. No redundancy, no computed columns in fact tables, no editable derived data.
Materialized views compute intelligence by joining and aggregating fact tables. The highest development phase for a drug-disease pair. The competitive landscape for a therapeutic area. Efficacy evidence unified across trial results and publications. Views are read-only, refreshed on a schedule, and designed so the AI agent can answer common questions without writing multi-table JOINs.
What it tracks
Section titled “What it tracks”The schema spans 16 domains covering the full drug development lifecycle.
| Domain | Tables | What it captures |
|---|---|---|
| Drugs | 8 | Drug identity, ownership, target actions, approvals, brand names, milestones |
| Clinical trials | 14 | Trial design, arms, interventions, diseases, sponsors, locations, results, adverse events, eligibility |
| Diseases | 4 | Disease ontology (DAG with closure table), population statistics, biomarkers |
| Targets | 1 | Biological targets (proteins, receptors, pathways) that drugs act on |
| Technologies | 1 | Drug modalities — small molecule, monoclonal antibody, ADC, cell therapy, etc. |
| Biomarkers | 1 | Molecular markers used in trial eligibility and treatment selection |
| Organisations | 4 | Companies, licensing deals, deal participants, deal-drug associations |
| Regimens | 3 | Treatment regimens, individual interventions, regimen composition |
| Indications | 4 | Approved/investigational indications, therapeutic approaches, prior therapy lines |
| Guidelines | 1 | Clinical practice guidelines referencing drugs and diseases |
| Publications | 4 | Journal articles, conference abstracts, reported outcomes, studied interventions |
| News | 7 | News articles with structured entity mentions (drugs, diseases, orgs, trials, targets, technologies) |
| Lookups | 6 | Reference data — countries, endpoints, modes of action, NCI Thesaurus concepts |
| Audit | 1 | Change tracking across all entities |
Five materialized views sit on top of these tables:
| View | What it computes |
|---|---|
mv_development_programs | Highest phase per drug-disease pair |
mv_drug_trials | Drug-to-trial bridge with derived role |
mv_efficacy_evidence | Unified trial results + publication outcomes |
mv_competitive_landscape | Drug-disease-phase matrix with current owners |
mv_drug_pipeline | Pre-joined drug overview for zero-JOIN queries |
How to read these docs
Section titled “How to read these docs”These docs are designed to read in order. Each section builds on the last.
Start with architecture to understand the design philosophy before diving into specific domains:
- Data Philosophy — Why tables hold raw facts and materialized views compute intelligence
- Schema Overview — Bird’s-eye ERD, domain map, full entity inventory
- Materialized Views — The five views, what they pre-compute, the refresh strategy
Then read the domains that matter to your work, starting from the center of the graph and moving outward:
- Drugs — The center of the schema. Everything connects through drugs.
- Clinical Trials — The largest domain (14 tables). Trial design, results, eligibility.
- Diseases — The ontology that organizes all therapeutic work.
- Organisations — Who develops what, and who bought it from whom.
- Indications — Treatment landscape, therapeutic approaches, prior therapy lines.
- Regimens — How drugs combine into treatment regimens.
- Publications — Literature evidence linking trials to published outcomes.
- News — Structured intelligence extracted from press releases and news wires.
- Lookups — Reference data that standardizes vocabulary across domains.
Finish with the AI agent guide:
- AI Agent Querying — Which views to hit, when to use base tables, query patterns and anti-patterns.
Pipeline layer
Section titled “Pipeline layer”Dagster orchestrates all data ingestion, enrichment, and the computed layer (dbt). The pipeline code lives in pipelines/ as a Python package alongside the TypeScript schema layer. See the data pipeline design for the full architecture.
Pre-existing tables (drugs, organisations, targets, technologies, biomarkers, indications, drug_ownerships) are declared as external AssetSpecs — Dagster tracks them without managing their schema.