Skip to content

Data model

The Data Gov database tells the story of every tracked drug — from discovery through clinical trials to regulatory approval and market launch. This page explains what data lives in the system, why it is structured the way it is, and how 206 tables connect into a coherent knowledge graph.

Four core domains sit at the center of the graph. Every other table exists to connect, enrich, or contextualize these domains.

erDiagram
    drugs ||--o{ drug_approvals : "approved as"
    drugs ||--o{ drug_target_actions : "acts on"
    drugs ||--o{ drug_ownerships : "owned by"
    drugs ||--o{ drug_statuses : "status per area"
    drugs ||--o{ brand_drugs : "marketed as"
    drugs }o--|| technologies : "modality"
    drug_approvals ||--o{ indications : "indicated for"
    indications }o--o{ diseases : "treats"
    drug_target_actions }o--|| targets : "targets"
    drug_ownerships }o--|| organisations : "company"
    drugs }o--o{ clinical_trials : "tested in"
    clinical_trials ||--o{ participation_criteria : "eligibility"
    clinical_trials ||--o{ study_plan_extractions : "study plans"
    clinical_trials ||--o{ result_outcomes : "results"
    clinical_trials ||--o{ sponsors : "sponsored by"
    sponsors }o--|| organisations : "company"
    diseases ||--o{ disease_parents : "hierarchy"
    diseases ||--o{ disease_biomarkers : "characterized by"
    disease_biomarkers }o--|| biomarkers : "biomarker"
    diseases ||--o{ guidelines : "treated per"
    organisations ||--o{ organisation_histories : "deal history"
    news ||--o{ news_drug_mentions : "mentions"
    news ||--o{ news_fda_submissions : "tracks"
    publications ||--o{ publication_clinical_trials : "references"

The database uses three PostgreSQL schemas. Only public is writable.

SchemaTablesPurpose
public~170Core entities, relationships, pipeline staging, audit trails
analytics~25Dimensional model (star/snowflake) for BI tools. No FK constraints.
forecasting~10Django application tables for forecasting models

Six PostgreSQL extensions power the search and matching infrastructure: pg_trgm (trigram similarity), fuzzystrmatch (Levenshtein distance), vector (pgvector embeddings), pg_stat_statements (query monitoring), pgstattuple (storage stats), and plpgsql (stored procedures).

The drugs table is the center of the knowledge graph. One row per compound, tracked at the INN (International Nonproprietary Name) level — the active ingredient, not the brand.

Key columns on drugs:

ColumnPurpose
bioloupe_idStable external identifier (e.g., B1234). Auto-generated.
namePrimary name (INN or common name)
all_synonymsJSONB array of all known names. GIN-indexed for search.
drug_typeNew Molecular Entity, New Active Ingredient, New Formulation, Generic, Biosimilar, Interchangeable, Original
drug_statusdrug_marketed, drug_development, drug_development_stopped, drug_withdrawn, drug_unknown
development_phasediscovery through approved
is_combinationFixed-dose combination drugs. Mutually exclusive with single-target drugs.
technology_idFK to technologies. Drug modality (small molecule, ADC, CAR-T, etc.)
organisation_idFK to organisations. Primary associated company.
therapeutic_areasJSONB array. Oncology, Hematology, etc.
pdufa_dateFDA target action date
inn_keyNormalized INN for cross-source matching

Drugs connect outward to every other domain through relationship tables.

drug_approvals — Regulatory approvals. One row per region per drug. Polymorphic source_type/source_id links back to the raw regulatory record (FDA, EMA, KEGG, or CDE). Each approval has many indications.

drug_ownerships — Which company owns or partners on the drug, with geographic scope (geography_include/geography_exclude JSONB arrays).

drug_target_actions — Links drugs to molecular targets with mechanism of action (Inhibitor, Agonist, etc.). Supports mutation-specific targeting and allosteric binding flags.

drug_statuses — Per-therapeutic-area status overrides. One drug can be drug_marketed in Oncology but still drug_development in Hematology.

brand_drugs — Marketed brand names. Aggregates regional approvals with per-region approval FKs (usa_approval_id, eu_approval_id, jp_approval_id, cn_approval_id).

combination_drug_components — Self-join linking combination drugs to their component compounds.

technologies — Hierarchical drug modality taxonomy (Small Molecule > Kinase Inhibitor). Self-referential parent_id. CAR-T, ADC, and cell/gene therapy drugs carry additional attributes through master_records FKs on the drug itself.

The indications table is the critical junction between drugs and diseases. Each drug_approval has many indications. Each indication captures:

  • disease_name — Raw text from the label
  • disease_stages, disease_subtypes, disease_statuses, disease_extents — JSONB arrays of faceted disease characteristics
  • treatment_lines — JSONB array (e.g., ["1L", "2L+"])
  • indication_typetreatment, diagnosis, prevention, mitigation
  • accelerated_approval_date, full_approval_date — Track the approval pathway

Indications link to diseases through indication_diseases. They carry sub-tables for biomarker requirements (indicated_biomarkers), therapeutic approaches (indicated_therapeutic_approaches), combination partners (indicated_combination_partners), and prior therapy requirements (indicated_prior_therapies).

Clinical trials capture the experimental evidence for every drug. 72,000+ trials from two registries, each decomposed into structured components.

clinical_trials — One row per registered trial. Key columns: nct_id (unique identifier), phase, overall_status (recruitment status), collection_source (aact or chictr), llm_data (JSONB for LLM extraction results), and versions (JSONB for CT.gov version history).

The trial domain splits into four sub-areas.

Study plans decompose trial arms into individual drug components with dosing details. Three tables form the hierarchy:

  • study_plan_extractions — One per trial. Tracks confidence and match statistics.
  • study_plan_arms — Individual arms (EXPERIMENTAL, COMPARATOR, CONTROL) with intervention data.
  • study_plan_components — Individual drugs or procedures within an arm. Stores dosing JSONB, matched_resource for linking to drugs, and investigational_component flag.

participation_criteria stores LLM-extracted disease-level criteria per trial. One row per disease-trial pair. Fields include disease_id, faceted JSONB arrays (disease_stages, disease_subtypes, treatment_settings), and biomarker requirements through participation_criterion_biomarkers.

Results follow a strict AACT hierarchy. Each result table has a _stage twin for pipeline staging.

  • result_groups — Treatment arms within results
  • result_outcomes — Outcome definitions (Primary, Secondary, Other)
  • outcome_measurements — Statistical values per outcome per group
  • outcome_analyses — p-values, confidence intervals, hazard ratios
  • reported_events / reported_event_totals — Adverse events by organ system
  • baseline_counts / baseline_measurements — Demographics and baseline characteristics

endpoints define standard clinical endpoints (Overall Survival, Progression-Free Survival). clinical_trial_endpoints links trials to their endpoints. endpoint_groups provide hierarchical categorization (Survival, Tumor Response, Quality of Life).

Diseases form a directed acyclic graph (DAG). A disease can have multiple parents through disease_parents. This captures the reality that NSCLC is both a “Lung Cancer” and a “Solid Tumor.”

Key columns on diseases:

ColumnPurpose
nameDisease name (e.g., “Non-Small Cell Lung Cancer”)
nci_codeNCI Thesaurus code. Unique, not null.
keywords, excluded_keywordsJSONB arrays for condition-matching pipelines
synonymsJSONB array of alternative names
all_descendantsJSONB array of descendant IDs. Computed on save.
hierarchy_levelDepth from root
simplifiedWhether this disease is in the curated tree
standard_of_careWhether SoC data has been configured

disease_biomarkers — Links biomarkers to diseases with prevalence data. Each carries prevalence ranges and potential_values JSONB.

biomarkers — Master biomarker entities (EGFR, PD-L1, HER2). Hierarchical via parent_id. Stores known_values JSONB (Positive, Negative, High) and carries NCI Thesaurus codes.

disease_nodes — NCI Thesaurus-aligned classification nodes within a disease. Each carries matching rules (inclusion_rule, exclusion_rule) for automated condition parsing.

guidelines — Standard-of-care treatment guidelines per disease. Rich JSONB data model with line_of_therapy, disease_subtype, stages, prognosis, raw_biomarkers, and results. Links to drugs, drug groups, and chemo combinations through HABTM join tables.

disease_treatment_lines — Links diseases to applicable treatment lines (First-line, Second-line, Maintenance) with ordering.

disease_stats — SEER-style epidemiological data: incidence rates, death rates, survival rates.

Diseases partition into patient segments for standard-of-care mapping. disease_base_segments define segmentation by stage or subtype. disease_biomarker_segments add biomarker-based stratification. These connect to treatment guidelines and prevalence data.

Organizations represent pharmaceutical companies, biotechs, CROs, academic institutions, and diagnostics companies.

Key columns on organisations:

ColumnPurpose
bioloupe_idStable identifier (e.g., BC567). Auto-generated.
nameCompany name. Unique.
typePharmaceutical/Biotechnology, Diagnostic, Academic Institution/Hospital, CRO, etc.
ownership_statusPublicly Held, Privately Held, Subsidiary
branch_namesJSONB array of subsidiary and division names for matching
headquartersISO country code
subsidiary_of_idSelf-referential FK. Parent company.

Financial columns: market_cap, enterprise_value, revenue_ttm, profit_margin, total_cash_mrq, cash_runaway_month. Stock data: us_stock_exchange, us_stock_symbol.

organisation_histories — M&A, licensing, and partnership deals. Carries financial terms (total_value, upfront_payment, milestone_payment), geographic scope, and deal classification. Links to drugs through drugs_organisation_histories.

organisation_history_participants — Links organizations to deals with roles (acquirer, target, licensor, licensee).

pipeline_snapshots — Automated screenshots of company pipeline pages. current_screenshot_hash vs previous_screenshot_hash detect changes. change_explanation stores LLM-generated description of what changed.

Raw source records from each regulatory agency live in dedicated tables. These are the unprocessed imports before entity resolution.

TableSourceKey identifier
fda_dataFDA Drugs@FDA APIapplication_number
ema_approved_drugsEMA XLSX exportproduct_number
kegg_approved_drugsKEGG Japan databasekegg identifier
china_approved_drugsCDE/NMPA Chinacde_code
purplebook_recordsFDA Purple Bookbla_number
orangebook_recordsFDA Orange Bookappl_no + product_no
breakthroughsFDA Breakthrough Therapyagent + indication

Each source table carries a drug_id FK (nullable before matching) and serves as a polymorphic source for drug_approvals.

news — Press releases from Business Wire, GlobeNewsWire, Cision, and financial sources. LLM pipelines classify articles by category (Regulatory, Business Deal, Trial Results), extract drug and disease mentions, and identify FDA submission events.

news_chunks — Chunked segments for RAG and embedding search. Uses the Vectorizable concern.

publications — Scientific articles from PubMed and conferences (ASCO, AACR, ASH, EHA, ESMO). Stores abstracts, full text when available, and LLM extraction results. Links to trials through publication_clinical_trials.

news_fda_submissions — FDA submission events (NDA, BLA, sNDA) extracted from news. Tracks PDUFA dates and links to matched indications.

source_drugs — External drug records from ChEMBL, NCI Thesaurus, RxNorm, and Adis. Used for entity resolution. Drugs link via dedicated FKs: chembl_id, ncit_id, rx_norm_id, adis_id.

ncit_concepts — NCI Thesaurus concepts broader than drugs. Covers procedures, diseases, and biomedical entities. Drug-like concepts can be converted to Drug records.

master_lists / master_records — Dropdown values for controlled vocabularies. A MasterList defines a category (Mechanism of Action). MasterRecord entries store individual values (Inhibitor, Agonist). Many drug fields reference master records via FKs.

users — Admin users with roles and Devise authentication.

workflow_instances / workflow_steps — Pipeline orchestration state. Each workflow instance tracks a single run. Steps carry status, retry counts, and AWS Batch integration.

versions — PaperTrail audit trail. Tracks changes to drugs, approvals, diseases, organizations, biomarkers, indications, and guidelines.

lockable_attributes — Polymorphic table recording which fields on which records are locked against automated overwrite.

assignments — Polymorphic user assignments for QA workflows.

embeddables — Polymorphic table storing vector embeddings for any model.

These patterns appear throughout the schema. Understand them once and you understand every table.

JSONB columns serve two purposes:

  1. LLM stagingllm_data holds AI extraction results before human QA promotes them to relational columns. The raw_data column preserves the original API response as an immutable audit trail.
  2. Flexible storageall_synonyms, keywords, therapeutic_areas, and similar columns store arrays or objects that do not warrant dedicated tables.

JSONB fields use GIN indexes for containment queries. The StoreModel gem types some JSONB columns as validated Ruby objects. The jsonb_accessor gem provides typed accessors for nested keys.

Many clinical-trial-result tables have a _stage twin (e.g., milestones / milestones_stage). Stage tables receive raw AACT pipeline data. After QA, records promote to the production table. Both share the same schema and ID sequence.

The Lockable concern lets any model protect manually edited fields from automated overwrites. When an analyst manually sets a drug’s name, a lockable_attributes row locks that field. Pipelines check drug.name_locked? before overwriting. This preserves human curation across import cycles.

The Claimable concern implements 24-hour review windows. An analyst claims a record (claimed_by / claimed_at). Claims expire after 24 hours if not completed. The analyst reviews, approves, and the data flows to authoritative tables.

Several tables use source_type/source_id pairs: drug_approvals (five source types), aggregations, embeddables, audit_issues, and workflow_resource_links. This pattern supports multiple parent types without schema changes.

Drugs (B1234), organizations (BC567), and targets (BT123) carry auto-generated, human-readable bioloupe_id strings. These are stable external references used across all Bioloupe products.

Standard Rails conventions throughout: plural snake_case table names, id bigint primary keys, {model}_id foreign keys, created_at/updated_at timestamps, deleted_at for soft deletes (via Paranoia gem), and {table_a}_{table_b} for join tables.

Now that you understand what data lives in Data Gov, read how it gets there.