Data model

The Data Gov database tells the story of every tracked drug — from discovery through clinical trials to regulatory approval and market launch. This page explains what data lives in the system, why it is structured the way it is, and how 206 tables connect into a coherent knowledge graph.

The big picture

Four core domains sit at the center of the graph. Every other table exists to connect, enrich, or contextualize these domains.

erDiagram
    drugs ||--o{ drug_approvals : "approved as"
    drugs ||--o{ drug_target_actions : "acts on"
    drugs ||--o{ drug_ownerships : "owned by"
    drugs ||--o{ drug_statuses : "status per area"
    drugs ||--o{ brand_drugs : "marketed as"
    drugs }o--|| technologies : "modality"
    drug_approvals ||--o{ indications : "indicated for"
    indications }o--o{ diseases : "treats"
    drug_target_actions }o--|| targets : "targets"
    drug_ownerships }o--|| organisations : "company"
    drugs }o--o{ clinical_trials : "tested in"
    clinical_trials ||--o{ participation_criteria : "eligibility"
    clinical_trials ||--o{ study_plan_extractions : "study plans"
    clinical_trials ||--o{ result_outcomes : "results"
    clinical_trials ||--o{ sponsors : "sponsored by"
    sponsors }o--|| organisations : "company"
    diseases ||--o{ disease_parents : "hierarchy"
    diseases ||--o{ disease_biomarkers : "characterized by"
    disease_biomarkers }o--|| biomarkers : "biomarker"
    diseases ||--o{ guidelines : "treated per"
    organisations ||--o{ organisation_histories : "deal history"
    news ||--o{ news_drug_mentions : "mentions"
    news ||--o{ news_fda_submissions : "tracks"
    publications ||--o{ publication_clinical_trials : "references"

Three schemas, one primary

The database uses three PostgreSQL schemas. Only public is writable.

Schema	Tables	Purpose
`public`	~170	Core entities, relationships, pipeline staging, audit trails
`analytics`	~25	Dimensional model (star/snowflake) for BI tools. No FK constraints.
`forecasting`	~10	Django application tables for forecasting models

Six PostgreSQL extensions power the search and matching infrastructure: pg_trgm (trigram similarity), fuzzystrmatch (Levenshtein distance), vector (pgvector embeddings), pg_stat_statements (query monitoring), pgstattuple (storage stats), and plpgsql (stored procedures).

Domain 1: Drugs

The drugs table is the center of the knowledge graph. One row per compound, tracked at the INN (International Nonproprietary Name) level — the active ingredient, not the brand.

Key columns on drugs:

Column	Purpose
`bioloupe_id`	Stable external identifier (e.g., `B1234`). Auto-generated.
`name`	Primary name (INN or common name)
`all_synonyms`	JSONB array of all known names. GIN-indexed for search.
`drug_type`	`New Molecular Entity`, `New Active Ingredient`, `New Formulation`, `Generic`, `Biosimilar`, `Interchangeable`, `Original`
`drug_status`	`drug_marketed`, `drug_development`, `drug_development_stopped`, `drug_withdrawn`, `drug_unknown`
`development_phase`	`discovery` through `approved`
`is_combination`	Fixed-dose combination drugs. Mutually exclusive with single-target drugs.
`technology_id`	FK to `technologies`. Drug modality (small molecule, ADC, CAR-T, etc.)
`organisation_id`	FK to `organisations`. Primary associated company.
`therapeutic_areas`	JSONB array. Oncology, Hematology, etc.
`pdufa_date`	FDA target action date
`inn_key`	Normalized INN for cross-source matching

Drugs connect outward to every other domain through relationship tables.

drug_approvals — Regulatory approvals. One row per region per drug. Polymorphic source_type/source_id links back to the raw regulatory record (FDA, EMA, KEGG, or CDE). Each approval has many indications.

drug_ownerships — Which company owns or partners on the drug, with geographic scope (geography_include/geography_exclude JSONB arrays).

drug_target_actions — Links drugs to molecular targets with mechanism of action (Inhibitor, Agonist, etc.). Supports mutation-specific targeting and allosteric binding flags.

drug_statuses — Per-therapeutic-area status overrides. One drug can be drug_marketed in Oncology but still drug_development in Hematology.

brand_drugs — Marketed brand names. Aggregates regional approvals with per-region approval FKs (usa_approval_id, eu_approval_id, jp_approval_id, cn_approval_id).

combination_drug_components — Self-join linking combination drugs to their component compounds.

technologies — Hierarchical drug modality taxonomy (Small Molecule > Kinase Inhibitor). Self-referential parent_id. CAR-T, ADC, and cell/gene therapy drugs carry additional attributes through master_records FKs on the drug itself.

Indications: where drugs meet diseases

The indications table is the critical junction between drugs and diseases. Each drug_approval has many indications. Each indication captures:

disease_name — Raw text from the label
disease_stages, disease_subtypes, disease_statuses, disease_extents — JSONB arrays of faceted disease characteristics
treatment_lines — JSONB array (e.g., ["1L", "2L+"])
indication_type — treatment, diagnosis, prevention, mitigation
accelerated_approval_date, full_approval_date — Track the approval pathway

Indications link to diseases through indication_diseases. They carry sub-tables for biomarker requirements (indicated_biomarkers), therapeutic approaches (indicated_therapeutic_approaches), combination partners (indicated_combination_partners), and prior therapy requirements (indicated_prior_therapies).

Domain 2: Clinical trials

Clinical trials capture the experimental evidence for every drug. 72,000+ trials from two registries, each decomposed into structured components.

clinical_trials — One row per registered trial. Key columns: nct_id (unique identifier), phase, overall_status (recruitment status), collection_source (aact or chictr), llm_data (JSONB for LLM extraction results), and versions (JSONB for CT.gov version history).

The trial domain splits into four sub-areas.

Study plans

Study plans decompose trial arms into individual drug components with dosing details. Three tables form the hierarchy:

study_plan_extractions — One per trial. Tracks confidence and match statistics.
study_plan_arms — Individual arms (EXPERIMENTAL, COMPARATOR, CONTROL) with intervention data.
study_plan_components — Individual drugs or procedures within an arm. Stores dosing JSONB, matched_resource for linking to drugs, and investigational_component flag.

Participation criteria

participation_criteria stores LLM-extracted disease-level criteria per trial. One row per disease-trial pair. Fields include disease_id, faceted JSONB arrays (disease_stages, disease_subtypes, treatment_settings), and biomarker requirements through participation_criterion_biomarkers.

Trial results

Results follow a strict AACT hierarchy. Each result table has a _stage twin for pipeline staging.

result_groups — Treatment arms within results
result_outcomes — Outcome definitions (Primary, Secondary, Other)
outcome_measurements — Statistical values per outcome per group
outcome_analyses — p-values, confidence intervals, hazard ratios
reported_events / reported_event_totals — Adverse events by organ system
baseline_counts / baseline_measurements — Demographics and baseline characteristics

Endpoints

endpoints define standard clinical endpoints (Overall Survival, Progression-Free Survival). clinical_trial_endpoints links trials to their endpoints. endpoint_groups provide hierarchical categorization (Survival, Tumor Response, Quality of Life).

Domain 3: Diseases and biomarkers

Diseases form a directed acyclic graph (DAG). A disease can have multiple parents through disease_parents. This captures the reality that NSCLC is both a “Lung Cancer” and a “Solid Tumor.”

Key columns on diseases:

Column	Purpose
`name`	Disease name (e.g., “Non-Small Cell Lung Cancer”)
`nci_code`	NCI Thesaurus code. Unique, not null.
`keywords`, `excluded_keywords`	JSONB arrays for condition-matching pipelines
`synonyms`	JSONB array of alternative names
`all_descendants`	JSONB array of descendant IDs. Computed on save.
`hierarchy_level`	Depth from root
`simplified`	Whether this disease is in the curated tree
`standard_of_care`	Whether SoC data has been configured

disease_biomarkers — Links biomarkers to diseases with prevalence data. Each carries prevalence ranges and potential_values JSONB.

biomarkers — Master biomarker entities (EGFR, PD-L1, HER2). Hierarchical via parent_id. Stores known_values JSONB (Positive, Negative, High) and carries NCI Thesaurus codes.

disease_nodes — NCI Thesaurus-aligned classification nodes within a disease. Each carries matching rules (inclusion_rule, exclusion_rule) for automated condition parsing.

guidelines — Standard-of-care treatment guidelines per disease. Rich JSONB data model with line_of_therapy, disease_subtype, stages, prognosis, raw_biomarkers, and results. Links to drugs, drug groups, and chemo combinations through HABTM join tables.

disease_treatment_lines — Links diseases to applicable treatment lines (First-line, Second-line, Maintenance) with ordering.

disease_stats — SEER-style epidemiological data: incidence rates, death rates, survival rates.

Segments

Diseases partition into patient segments for standard-of-care mapping. disease_base_segments define segmentation by stage or subtype. disease_biomarker_segments add biomarker-based stratification. These connect to treatment guidelines and prevalence data.

Domain 4: Organizations

Organizations represent pharmaceutical companies, biotechs, CROs, academic institutions, and diagnostics companies.

Key columns on organisations:

Column	Purpose
`bioloupe_id`	Stable identifier (e.g., `BC567`). Auto-generated.
`name`	Company name. Unique.
`type`	`Pharmaceutical/Biotechnology`, `Diagnostic`, `Academic Institution/Hospital`, `CRO`, etc.
`ownership_status`	`Publicly Held`, `Privately Held`, `Subsidiary`
`branch_names`	JSONB array of subsidiary and division names for matching
`headquarters`	ISO country code
`subsidiary_of_id`	Self-referential FK. Parent company.

Financial columns: market_cap, enterprise_value, revenue_ttm, profit_margin, total_cash_mrq, cash_runaway_month. Stock data: us_stock_exchange, us_stock_symbol.

organisation_histories — M&A, licensing, and partnership deals. Carries financial terms (total_value, upfront_payment, milestone_payment), geographic scope, and deal classification. Links to drugs through drugs_organisation_histories.

organisation_history_participants — Links organizations to deals with roles (acquirer, target, licensor, licensee).

pipeline_snapshots — Automated screenshots of company pipeline pages. current_screenshot_hash vs previous_screenshot_hash detect changes. change_explanation stores LLM-generated description of what changed.

Supporting domains

Regulatory source tables

Raw source records from each regulatory agency live in dedicated tables. These are the unprocessed imports before entity resolution.

Table	Source	Key identifier
`fda_data`	FDA Drugs@FDA API	`application_number`
`ema_approved_drugs`	EMA XLSX export	`product_number`
`kegg_approved_drugs`	KEGG Japan database	`kegg` identifier
`china_approved_drugs`	CDE/NMPA China	`cde_code`
`purplebook_records`	FDA Purple Book	`bla_number`
`orangebook_records`	FDA Orange Book	`appl_no` + `product_no`
`breakthroughs`	FDA Breakthrough Therapy	`agent` + `indication`

Each source table carries a drug_id FK (nullable before matching) and serves as a polymorphic source for drug_approvals.

News and publications

news — Press releases from Business Wire, GlobeNewsWire, Cision, and financial sources. LLM pipelines classify articles by category (Regulatory, Business Deal, Trial Results), extract drug and disease mentions, and identify FDA submission events.

news_chunks — Chunked segments for RAG and embedding search. Uses the Vectorizable concern.

publications — Scientific articles from PubMed and conferences (ASCO, AACR, ASH, EHA, ESMO). Stores abstracts, full text when available, and LLM extraction results. Links to trials through publication_clinical_trials.

news_fda_submissions — FDA submission events (NDA, BLA, sNDA) extracted from news. Tracks PDUFA dates and links to matched indications.

Reference data

source_drugs — External drug records from ChEMBL, NCI Thesaurus, RxNorm, and Adis. Used for entity resolution. Drugs link via dedicated FKs: chembl_id, ncit_id, rx_norm_id, adis_id.

ncit_concepts — NCI Thesaurus concepts broader than drugs. Covers procedures, diseases, and biomedical entities. Drug-like concepts can be converted to Drug records.

master_lists / master_records — Dropdown values for controlled vocabularies. A MasterList defines a category (Mechanism of Action). MasterRecord entries store individual values (Inhibitor, Agonist). Many drug fields reference master records via FKs.

System tables

users — Admin users with roles and Devise authentication.

workflow_instances / workflow_steps — Pipeline orchestration state. Each workflow instance tracks a single run. Steps carry status, retry counts, and AWS Batch integration.

versions — PaperTrail audit trail. Tracks changes to drugs, approvals, diseases, organizations, biomarkers, indications, and guidelines.

lockable_attributes — Polymorphic table recording which fields on which records are locked against automated overwrite.

assignments — Polymorphic user assignments for QA workflows.

embeddables — Polymorphic table storing vector embeddings for any model.

Cross-cutting patterns

These patterns appear throughout the schema. Understand them once and you understand every table.

JSONB: staging and flexibility

JSONB columns serve two purposes:

LLM staging — llm_data holds AI extraction results before human QA promotes them to relational columns. The raw_data column preserves the original API response as an immutable audit trail.
Flexible storage — all_synonyms, keywords, therapeutic_areas, and similar columns store arrays or objects that do not warrant dedicated tables.

JSONB fields use GIN indexes for containment queries. The StoreModel gem types some JSONB columns as validated Ruby objects. The jsonb_accessor gem provides typed accessors for nested keys.

Stage tables

Many clinical-trial-result tables have a _stage twin (e.g., milestones / milestones_stage). Stage tables receive raw AACT pipeline data. After QA, records promote to the production table. Both share the same schema and ID sequence.

Lockable attributes

The Lockable concern lets any model protect manually edited fields from automated overwrites. When an analyst manually sets a drug’s name, a lockable_attributes row locks that field. Pipelines check drug.name_locked? before overwriting. This preserves human curation across import cycles.

Claim-based QA

The Claimable concern implements 24-hour review windows. An analyst claims a record (claimed_by / claimed_at). Claims expire after 24 hours if not completed. The analyst reviews, approves, and the data flows to authoritative tables.

Polymorphic associations

Several tables use source_type/source_id pairs: drug_approvals (five source types), aggregations, embeddables, audit_issues, and workflow_resource_links. This pattern supports multiple parent types without schema changes.

Bioloupe IDs

Drugs (B1234), organizations (BC567), and targets (BT123) carry auto-generated, human-readable bioloupe_id strings. These are stable external references used across all Bioloupe products.

Naming conventions

Standard Rails conventions throughout: plural snake_case table names, id bigint primary keys, {model}_id foreign keys, created_at/updated_at timestamps, deleted_at for soft deletes (via Paranoia gem), and {table_a}_{table_b} for join tables.

Next steps

Now that you understand what data lives in Data Gov, read how it gets there.

Clinical trials — How 72,000 trials flow from registries into the knowledge graph
Drug approvals — How four regulatory agencies become unified approval records
News and intelligence — How press releases become structured intelligence