Data model
The Data Gov database tells the story of every tracked drug — from discovery through clinical trials to regulatory approval and market launch. This page explains what data lives in the system, why it is structured the way it is, and how 206 tables connect into a coherent knowledge graph.
The big picture
Section titled “The big picture”Four core domains sit at the center of the graph. Every other table exists to connect, enrich, or contextualize these domains.
erDiagram
drugs ||--o{ drug_approvals : "approved as"
drugs ||--o{ drug_target_actions : "acts on"
drugs ||--o{ drug_ownerships : "owned by"
drugs ||--o{ drug_statuses : "status per area"
drugs ||--o{ brand_drugs : "marketed as"
drugs }o--|| technologies : "modality"
drug_approvals ||--o{ indications : "indicated for"
indications }o--o{ diseases : "treats"
drug_target_actions }o--|| targets : "targets"
drug_ownerships }o--|| organisations : "company"
drugs }o--o{ clinical_trials : "tested in"
clinical_trials ||--o{ participation_criteria : "eligibility"
clinical_trials ||--o{ study_plan_extractions : "study plans"
clinical_trials ||--o{ result_outcomes : "results"
clinical_trials ||--o{ sponsors : "sponsored by"
sponsors }o--|| organisations : "company"
diseases ||--o{ disease_parents : "hierarchy"
diseases ||--o{ disease_biomarkers : "characterized by"
disease_biomarkers }o--|| biomarkers : "biomarker"
diseases ||--o{ guidelines : "treated per"
organisations ||--o{ organisation_histories : "deal history"
news ||--o{ news_drug_mentions : "mentions"
news ||--o{ news_fda_submissions : "tracks"
publications ||--o{ publication_clinical_trials : "references"
Three schemas, one primary
Section titled “Three schemas, one primary”The database uses three PostgreSQL schemas. Only public is writable.
| Schema | Tables | Purpose |
|---|---|---|
public | ~170 | Core entities, relationships, pipeline staging, audit trails |
analytics | ~25 | Dimensional model (star/snowflake) for BI tools. No FK constraints. |
forecasting | ~10 | Django application tables for forecasting models |
Six PostgreSQL extensions power the search and matching infrastructure: pg_trgm (trigram similarity), fuzzystrmatch (Levenshtein distance), vector (pgvector embeddings), pg_stat_statements (query monitoring), pgstattuple (storage stats), and plpgsql (stored procedures).
Domain 1: Drugs
Section titled “Domain 1: Drugs”The drugs table is the center of the knowledge graph. One row per compound, tracked at the INN (International Nonproprietary Name) level — the active ingredient, not the brand.
Key columns on drugs:
| Column | Purpose |
|---|---|
bioloupe_id | Stable external identifier (e.g., B1234). Auto-generated. |
name | Primary name (INN or common name) |
all_synonyms | JSONB array of all known names. GIN-indexed for search. |
drug_type | New Molecular Entity, New Active Ingredient, New Formulation, Generic, Biosimilar, Interchangeable, Original |
drug_status | drug_marketed, drug_development, drug_development_stopped, drug_withdrawn, drug_unknown |
development_phase | discovery through approved |
is_combination | Fixed-dose combination drugs. Mutually exclusive with single-target drugs. |
technology_id | FK to technologies. Drug modality (small molecule, ADC, CAR-T, etc.) |
organisation_id | FK to organisations. Primary associated company. |
therapeutic_areas | JSONB array. Oncology, Hematology, etc. |
pdufa_date | FDA target action date |
inn_key | Normalized INN for cross-source matching |
Drugs connect outward to every other domain through relationship tables.
drug_approvals — Regulatory approvals. One row per region per drug. Polymorphic source_type/source_id links back to the raw regulatory record (FDA, EMA, KEGG, or CDE). Each approval has many indications.
drug_ownerships — Which company owns or partners on the drug, with geographic scope (geography_include/geography_exclude JSONB arrays).
drug_target_actions — Links drugs to molecular targets with mechanism of action (Inhibitor, Agonist, etc.). Supports mutation-specific targeting and allosteric binding flags.
drug_statuses — Per-therapeutic-area status overrides. One drug can be drug_marketed in Oncology but still drug_development in Hematology.
brand_drugs — Marketed brand names. Aggregates regional approvals with per-region approval FKs (usa_approval_id, eu_approval_id, jp_approval_id, cn_approval_id).
combination_drug_components — Self-join linking combination drugs to their component compounds.
technologies — Hierarchical drug modality taxonomy (Small Molecule > Kinase Inhibitor). Self-referential parent_id. CAR-T, ADC, and cell/gene therapy drugs carry additional attributes through master_records FKs on the drug itself.
Indications: where drugs meet diseases
Section titled “Indications: where drugs meet diseases”The indications table is the critical junction between drugs and diseases. Each drug_approval has many indications. Each indication captures:
disease_name— Raw text from the labeldisease_stages,disease_subtypes,disease_statuses,disease_extents— JSONB arrays of faceted disease characteristicstreatment_lines— JSONB array (e.g.,["1L", "2L+"])indication_type—treatment,diagnosis,prevention,mitigationaccelerated_approval_date,full_approval_date— Track the approval pathway
Indications link to diseases through indication_diseases. They carry sub-tables for biomarker requirements (indicated_biomarkers), therapeutic approaches (indicated_therapeutic_approaches), combination partners (indicated_combination_partners), and prior therapy requirements (indicated_prior_therapies).
Domain 2: Clinical trials
Section titled “Domain 2: Clinical trials”Clinical trials capture the experimental evidence for every drug. 72,000+ trials from two registries, each decomposed into structured components.
clinical_trials — One row per registered trial. Key columns: nct_id (unique identifier), phase, overall_status (recruitment status), collection_source (aact or chictr), llm_data (JSONB for LLM extraction results), and versions (JSONB for CT.gov version history).
The trial domain splits into four sub-areas.
Study plans
Section titled “Study plans”Study plans decompose trial arms into individual drug components with dosing details. Three tables form the hierarchy:
study_plan_extractions— One per trial. Tracks confidence and match statistics.study_plan_arms— Individual arms (EXPERIMENTAL, COMPARATOR, CONTROL) with intervention data.study_plan_components— Individual drugs or procedures within an arm. StoresdosingJSONB,matched_resourcefor linking to drugs, andinvestigational_componentflag.
Participation criteria
Section titled “Participation criteria”participation_criteria stores LLM-extracted disease-level criteria per trial. One row per disease-trial pair. Fields include disease_id, faceted JSONB arrays (disease_stages, disease_subtypes, treatment_settings), and biomarker requirements through participation_criterion_biomarkers.
Trial results
Section titled “Trial results”Results follow a strict AACT hierarchy. Each result table has a _stage twin for pipeline staging.
result_groups— Treatment arms within resultsresult_outcomes— Outcome definitions (Primary, Secondary, Other)outcome_measurements— Statistical values per outcome per groupoutcome_analyses— p-values, confidence intervals, hazard ratiosreported_events/reported_event_totals— Adverse events by organ systembaseline_counts/baseline_measurements— Demographics and baseline characteristics
Endpoints
Section titled “Endpoints”endpoints define standard clinical endpoints (Overall Survival, Progression-Free Survival). clinical_trial_endpoints links trials to their endpoints. endpoint_groups provide hierarchical categorization (Survival, Tumor Response, Quality of Life).
Domain 3: Diseases and biomarkers
Section titled “Domain 3: Diseases and biomarkers”Diseases form a directed acyclic graph (DAG). A disease can have multiple parents through disease_parents. This captures the reality that NSCLC is both a “Lung Cancer” and a “Solid Tumor.”
Key columns on diseases:
| Column | Purpose |
|---|---|
name | Disease name (e.g., “Non-Small Cell Lung Cancer”) |
nci_code | NCI Thesaurus code. Unique, not null. |
keywords, excluded_keywords | JSONB arrays for condition-matching pipelines |
synonyms | JSONB array of alternative names |
all_descendants | JSONB array of descendant IDs. Computed on save. |
hierarchy_level | Depth from root |
simplified | Whether this disease is in the curated tree |
standard_of_care | Whether SoC data has been configured |
disease_biomarkers — Links biomarkers to diseases with prevalence data. Each carries prevalence ranges and potential_values JSONB.
biomarkers — Master biomarker entities (EGFR, PD-L1, HER2). Hierarchical via parent_id. Stores known_values JSONB (Positive, Negative, High) and carries NCI Thesaurus codes.
disease_nodes — NCI Thesaurus-aligned classification nodes within a disease. Each carries matching rules (inclusion_rule, exclusion_rule) for automated condition parsing.
guidelines — Standard-of-care treatment guidelines per disease. Rich JSONB data model with line_of_therapy, disease_subtype, stages, prognosis, raw_biomarkers, and results. Links to drugs, drug groups, and chemo combinations through HABTM join tables.
disease_treatment_lines — Links diseases to applicable treatment lines (First-line, Second-line, Maintenance) with ordering.
disease_stats — SEER-style epidemiological data: incidence rates, death rates, survival rates.
Segments
Section titled “Segments”Diseases partition into patient segments for standard-of-care mapping. disease_base_segments define segmentation by stage or subtype. disease_biomarker_segments add biomarker-based stratification. These connect to treatment guidelines and prevalence data.
Domain 4: Organizations
Section titled “Domain 4: Organizations”Organizations represent pharmaceutical companies, biotechs, CROs, academic institutions, and diagnostics companies.
Key columns on organisations:
| Column | Purpose |
|---|---|
bioloupe_id | Stable identifier (e.g., BC567). Auto-generated. |
name | Company name. Unique. |
type | Pharmaceutical/Biotechnology, Diagnostic, Academic Institution/Hospital, CRO, etc. |
ownership_status | Publicly Held, Privately Held, Subsidiary |
branch_names | JSONB array of subsidiary and division names for matching |
headquarters | ISO country code |
subsidiary_of_id | Self-referential FK. Parent company. |
Financial columns: market_cap, enterprise_value, revenue_ttm, profit_margin, total_cash_mrq, cash_runaway_month. Stock data: us_stock_exchange, us_stock_symbol.
organisation_histories — M&A, licensing, and partnership deals. Carries financial terms (total_value, upfront_payment, milestone_payment), geographic scope, and deal classification. Links to drugs through drugs_organisation_histories.
organisation_history_participants — Links organizations to deals with roles (acquirer, target, licensor, licensee).
pipeline_snapshots — Automated screenshots of company pipeline pages. current_screenshot_hash vs previous_screenshot_hash detect changes. change_explanation stores LLM-generated description of what changed.
Supporting domains
Section titled “Supporting domains”Regulatory source tables
Section titled “Regulatory source tables”Raw source records from each regulatory agency live in dedicated tables. These are the unprocessed imports before entity resolution.
| Table | Source | Key identifier |
|---|---|---|
fda_data | FDA Drugs@FDA API | application_number |
ema_approved_drugs | EMA XLSX export | product_number |
kegg_approved_drugs | KEGG Japan database | kegg identifier |
china_approved_drugs | CDE/NMPA China | cde_code |
purplebook_records | FDA Purple Book | bla_number |
orangebook_records | FDA Orange Book | appl_no + product_no |
breakthroughs | FDA Breakthrough Therapy | agent + indication |
Each source table carries a drug_id FK (nullable before matching) and serves as a polymorphic source for drug_approvals.
News and publications
Section titled “News and publications”news — Press releases from Business Wire, GlobeNewsWire, Cision, and financial sources. LLM pipelines classify articles by category (Regulatory, Business Deal, Trial Results), extract drug and disease mentions, and identify FDA submission events.
news_chunks — Chunked segments for RAG and embedding search. Uses the Vectorizable concern.
publications — Scientific articles from PubMed and conferences (ASCO, AACR, ASH, EHA, ESMO). Stores abstracts, full text when available, and LLM extraction results. Links to trials through publication_clinical_trials.
news_fda_submissions — FDA submission events (NDA, BLA, sNDA) extracted from news. Tracks PDUFA dates and links to matched indications.
Reference data
Section titled “Reference data”source_drugs — External drug records from ChEMBL, NCI Thesaurus, RxNorm, and Adis. Used for entity resolution. Drugs link via dedicated FKs: chembl_id, ncit_id, rx_norm_id, adis_id.
ncit_concepts — NCI Thesaurus concepts broader than drugs. Covers procedures, diseases, and biomedical entities. Drug-like concepts can be converted to Drug records.
master_lists / master_records — Dropdown values for controlled vocabularies. A MasterList defines a category (Mechanism of Action). MasterRecord entries store individual values (Inhibitor, Agonist). Many drug fields reference master records via FKs.
System tables
Section titled “System tables”users — Admin users with roles and Devise authentication.
workflow_instances / workflow_steps — Pipeline orchestration state. Each workflow instance tracks a single run. Steps carry status, retry counts, and AWS Batch integration.
versions — PaperTrail audit trail. Tracks changes to drugs, approvals, diseases, organizations, biomarkers, indications, and guidelines.
lockable_attributes — Polymorphic table recording which fields on which records are locked against automated overwrite.
assignments — Polymorphic user assignments for QA workflows.
embeddables — Polymorphic table storing vector embeddings for any model.
Cross-cutting patterns
Section titled “Cross-cutting patterns”These patterns appear throughout the schema. Understand them once and you understand every table.
JSONB: staging and flexibility
Section titled “JSONB: staging and flexibility”JSONB columns serve two purposes:
- LLM staging —
llm_dataholds AI extraction results before human QA promotes them to relational columns. Theraw_datacolumn preserves the original API response as an immutable audit trail. - Flexible storage —
all_synonyms,keywords,therapeutic_areas, and similar columns store arrays or objects that do not warrant dedicated tables.
JSONB fields use GIN indexes for containment queries. The StoreModel gem types some JSONB columns as validated Ruby objects. The jsonb_accessor gem provides typed accessors for nested keys.
Stage tables
Section titled “Stage tables”Many clinical-trial-result tables have a _stage twin (e.g., milestones / milestones_stage). Stage tables receive raw AACT pipeline data. After QA, records promote to the production table. Both share the same schema and ID sequence.
Lockable attributes
Section titled “Lockable attributes”The Lockable concern lets any model protect manually edited fields from automated overwrites. When an analyst manually sets a drug’s name, a lockable_attributes row locks that field. Pipelines check drug.name_locked? before overwriting. This preserves human curation across import cycles.
Claim-based QA
Section titled “Claim-based QA”The Claimable concern implements 24-hour review windows. An analyst claims a record (claimed_by / claimed_at). Claims expire after 24 hours if not completed. The analyst reviews, approves, and the data flows to authoritative tables.
Polymorphic associations
Section titled “Polymorphic associations”Several tables use source_type/source_id pairs: drug_approvals (five source types), aggregations, embeddables, audit_issues, and workflow_resource_links. This pattern supports multiple parent types without schema changes.
Bioloupe IDs
Section titled “Bioloupe IDs”Drugs (B1234), organizations (BC567), and targets (BT123) carry auto-generated, human-readable bioloupe_id strings. These are stable external references used across all Bioloupe products.
Naming conventions
Section titled “Naming conventions”Standard Rails conventions throughout: plural snake_case table names, id bigint primary keys, {model}_id foreign keys, created_at/updated_at timestamps, deleted_at for soft deletes (via Paranoia gem), and {table_a}_{table_b} for join tables.
Next steps
Section titled “Next steps”Now that you understand what data lives in Data Gov, read how it gets there.
- Clinical trials — How 72,000 trials flow from registries into the knowledge graph
- Drug approvals — How four regulatory agencies become unified approval records
- News and intelligence — How press releases become structured intelligence