Skip to content

Data model

The Data Gov database tells the story of every tracked drug — from discovery through clinical trials to regulatory approval and market launch. This page explains what data lives in the system, why it is structured the way it is, and how 206 tables connect into a coherent knowledge graph.

Four core domains sit at the center of the graph. Every other table exists to connect, enrich, or contextualize these domains.

erDiagram
    drugs ||--o{ drug_approvals : "approved as"
    drugs ||--o{ drug_target_actions : "acts on"
    drugs ||--o{ drug_ownerships : "owned by"
    drugs ||--o{ drug_statuses : "status per area"
    drugs ||--o{ brand_drugs : "marketed as"
    drugs }o--|| technologies : "modality"
    drug_approvals ||--o{ indications : "indicated for"
    indications }o--o{ diseases : "treats"
    drug_target_actions }o--|| targets : "targets"
    drug_ownerships }o--|| organisations : "company"
    drugs }o--o{ clinical_trials : "tested in"
    clinical_trials ||--o{ participation_criteria : "eligibility"
    clinical_trials ||--o{ study_plan_extractions : "study plans"
    clinical_trials ||--o{ result_outcomes : "results"
    clinical_trials ||--o{ sponsors : "sponsored by"
    sponsors }o--|| organisations : "company"
    diseases ||--o{ disease_parents : "hierarchy"
    diseases ||--o{ disease_biomarkers : "characterized by"
    disease_biomarkers }o--|| biomarkers : "biomarker"
    diseases ||--o{ guidelines : "treated per"
    organisations ||--o{ organisation_histories : "deal history"
    news ||--o{ news_drug_mentions : "mentions"
    news ||--o{ news_fda_submissions : "tracks"
    publications ||--o{ publication_clinical_trials : "references"

The database uses three PostgreSQL schemas. Only public is writable.

SchemaTablesPurpose
public~170Core entities, relationships, pipeline staging, audit trails
analytics~25Dimensional model (star/snowflake) for BI tools. No FK constraints.
forecasting~10Django application tables for forecasting models

Six PostgreSQL extensions power the search and matching infrastructure: pg_trgm (trigram similarity), fuzzystrmatch (Levenshtein distance), vector (pgvector embeddings), pg_stat_statements (query monitoring), pgstattuple (storage stats), and plpgsql (stored procedures).

The drugs table is the center of the knowledge graph. One row per compound, tracked at the INN (International Nonproprietary Name) level — the active ingredient, not the brand.

Key columns on drugs:

ColumnPurpose
bioloupe_idStable external identifier (e.g., B1234). Auto-generated.
namePrimary name (INN or common name)
all_synonymsJSONB array of all known names. GIN-indexed for search.
drug_typeNew Molecular Entity, New Active Ingredient, New Formulation, Generic, Biosimilar, Interchangeable, Original
drug_statusdrug_marketed, drug_development, drug_development_stopped, drug_withdrawn, drug_unknown
development_phasediscovery through approved
is_combinationFixed-dose combination drugs. May carry their own technology_id and drug_target_actions in addition to component links via combination_drug_components.
technology_idFK to technologies. Drug modality (small molecule, ADC, CAR-T, etc.)
organisation_idFK to organisations. Primary associated company.
therapeutic_areasJSONB array. Oncology, Hematology, etc.
recent_development_activity3-year activity signal: active, inactive, unknown. Manually editable.
pdufa_dateFDA target action date
inn_keyNormalized INN for cross-source matching

Drugs connect outward to every other domain through relationship tables.

drug_approvals — Regulatory approvals. One row per region per drug. Polymorphic source_type/source_id links back to the raw regulatory record (FDA, EMA, KEGG, or CDE). Each approval has many indications.

drug_ownerships — Which company owns or partners on the drug, with geographic scope (geography_include/geography_exclude JSONB arrays).

drug_target_actions — Links drugs to molecular targets with mechanism of action (Inhibitor, Agonist, etc.). Supports mutation-specific targeting and allosteric binding flags.

drug_statuses — Per-therapeutic-area status overrides. One drug can be drug_marketed in Oncology but still drug_development in Hematology.

brand_drugs — Marketed brand names. Aggregates regional approvals with per-region approval FKs (usa_approval_id, eu_approval_id, jp_approval_id, cn_approval_id).

combination_drug_components — Self-join linking combination drugs to their component compounds.

technologies — Hierarchical drug modality taxonomy (Small Molecule > Kinase Inhibitor). Self-referential parent_id. CAR-T, ADC, and cell/gene therapy drugs carry additional attributes through master_records FKs on the drug itself.

The indications table is the critical junction between drugs and diseases. Each drug_approval has many indications. Each indication captures:

  • disease_name — Raw text from the label
  • disease_stages, disease_subtypes, disease_statuses, disease_extents — JSONB arrays of faceted disease characteristics
  • treatment_lines — JSONB array (e.g., ["1L", "2L+"])
  • indication_typetreatment, diagnosis, prevention, mitigation
  • accelerated_approval_date, full_approval_date — Track the approval pathway

Indications link to diseases through indication_diseases. They carry sub-tables for biomarker requirements (indicated_biomarkers), therapeutic approaches (indicated_therapeutic_approaches), combination partners (indicated_combination_partners), and prior therapy requirements (indicated_prior_therapies).

Clinical trials capture the experimental evidence for every drug. 72,000+ trials from two registries, each decomposed into structured components.

clinical_trials — One row per registered trial. Key columns: nct_id (unique identifier), phase, overall_status (recruitment status), collection_source (aact or chictr), llm_data (JSONB for LLM extraction results), and versions (JSONB for CT.gov version history).

The trial domain splits into four sub-areas.

Study plans decompose trial arms into individual drug components with dosing details. Three tables form the hierarchy:

  • study_plan_extractions — One per trial. Tracks confidence and match statistics.
  • study_plan_arms — Individual arms (EXPERIMENTAL, COMPARATOR, CONTROL) with intervention data.
  • study_plan_components — Individual drugs or procedures within an arm. Stores dosing JSONB, matched_resource for linking to drugs, and investigational_component flag.

participation_criteria stores LLM-extracted disease-level criteria per trial. One row per disease-trial pair. Fields include disease_id, faceted JSONB arrays (disease_stages, disease_subtypes, treatment_settings), and biomarker requirements through participation_criterion_biomarkers.

Results follow a strict AACT hierarchy. Each result table has a _stage twin for pipeline staging.

  • result_groups — Treatment arms within results
  • result_outcomes — Outcome definitions (Primary, Secondary, Other)
  • outcome_measurements — Statistical values per outcome per group
  • outcome_analyses — p-values, confidence intervals, hazard ratios
  • reported_events / reported_event_totals — Adverse events by organ system
  • baseline_counts / baseline_measurements — Demographics and baseline characteristics

endpoints define standard clinical endpoints (Overall Survival, Progression-Free Survival). clinical_trial_endpoints links trials to their endpoints. endpoint_groups provide hierarchical categorization (Survival, Tumor Response, Quality of Life).

Diseases form a directed acyclic graph (DAG). A disease can have multiple parents through disease_parents. This captures the reality that NSCLC is both a “Lung Cancer” and a “Solid Tumor.”

Key columns on diseases:

ColumnPurpose
nameDisease name (e.g., “Non-Small Cell Lung Cancer”)
nci_codeNCI Thesaurus code. Unique, not null.
keywords, excluded_keywordsJSONB arrays for condition-matching pipelines
synonymsJSONB array of alternative names
all_descendantsJSONB array of descendant IDs. Computed on save.
hierarchy_levelDepth from root
simplifiedWhether this disease is in the curated tree
standard_of_careWhether SoC data has been configured

disease_biomarkers — Links biomarkers to diseases with prevalence data. Each carries prevalence ranges and potential_values JSONB.

biomarkers — Master biomarker entities (EGFR, PD-L1, HER2). Hierarchical via parent_id. Stores known_values JSONB (Positive, Negative, High) and carries NCI Thesaurus codes.

disease_nodes — NCI Thesaurus-aligned classification nodes within a disease. Each carries matching rules (inclusion_rule, exclusion_rule) for automated condition parsing.

guidelines — Standard-of-care treatment guidelines per disease. Rich JSONB data model with line_of_therapy, disease_subtype, stages, prognosis, raw_biomarkers, and results. Links to drugs, drug groups, and chemo combinations through HABTM join tables.

disease_treatment_lines — Links diseases to applicable treatment lines (First-line, Second-line, Maintenance) with ordering.

disease_stats — SEER-style epidemiological data: incidence rates, death rates, survival rates.

Diseases partition into patient segments for standard-of-care mapping. disease_base_segments define segmentation by stage or subtype. disease_biomarker_segments add biomarker-based stratification. These connect to treatment guidelines and prevalence data.

Organizations represent pharmaceutical companies, biotechs, CROs, academic institutions, and diagnostics companies.

Key columns on organisations:

ColumnPurpose
bioloupe_idStable identifier (e.g., BC567). Auto-generated.
nameCompany name. Unique.
typePharmaceutical/Biotechnology, Diagnostic, Academic Institution/Hospital, CRO, etc.
ownership_statusPublicly Held, Privately Held, Subsidiary
branch_namesJSONB array of subsidiary and division names for matching
headquartersISO country code
subsidiary_of_idSelf-referential FK. Parent company.

Financial columns: market_cap, enterprise_value, revenue_ttm, profit_margin, total_cash_mrq, cash_runaway_month. Stock data: us_stock_exchange, us_stock_symbol, non_us_stock_symbol. Regulatory identifier: sec_cik (SEC Central Index Key, populated from the stock symbol via Financial Modeling Prep).

organisation_histories — M&A, licensing, and partnership deals. Carries financial terms (total_value, upfront_payment, milestone_payment), geographic scope, and deal classification. Links to drugs through drugs_organisation_histories.

organisation_history_participants — Links organizations to deals with roles (acquirer, target, licensor, licensee).

pipeline_snapshots — Automated screenshots of company pipeline pages. current_screenshot_hash vs previous_screenshot_hash detect changes. change_explanation stores LLM-generated description of what changed.

investor_presentations — Investor-facing presentation PDFs linked to organisations. Stores extracted content (presentation_content), category (deck, clinical, financial_results, annual_report, other_investor_event), LLM-detected content markers (content_markers JSONB), and structured extraction warnings (content_extraction_warnings JSONB; e.g. missing_page_sections when placeholder sections were inserted for unrecoverable PDF pages). Presentations in the deck and clinical categories are materialized into publications via InvestorPresentations::PublicationMaterializationTask; other categories are skipped.

organisation_sec_filings — SEC filing metadata (10-K, 10-Q, 20-F and amended variants) captured per organisation from Financial Modeling Prep. Stores filing_type, filed_at, accepted_at, filing_url (direct document), sec_index_url, raw source_payload JSONB, and the original FMP source_symbol / cik. Deduplication keys on (organisation_id, source_uid) where source_uid is the SEC accession number when present, otherwise a SHA-256 of filing identifiers. first_seen_at / last_seen_at track sync history. content_extraction_status (succeeded / retryable_failed / permanent_failed / null), content_extraction_attempt_count, content_extraction_failure_kind, content_extraction_error, and content_extraction_attempted_at track the discovery pipeline’s SEC body extraction (see Evidence Discovery).

Raw source records from each regulatory agency live in dedicated tables. These are the unprocessed imports before entity resolution.

TableSourceKey identifier
fda_dataFDA Drugs@FDA APIapplication_number
ema_approved_drugsEMA XLSX exportproduct_number
kegg_approved_drugsKEGG Japan databasekegg identifier
china_approved_drugsCDE/NMPA Chinacde_code
purplebook_recordsFDA Purple Bookbla_number
orangebook_recordsFDA Orange Bookappl_no + product_no
breakthroughsFDA Breakthrough Therapyagent + indication

Each source table carries a drug_id FK (nullable before matching) and serves as a polymorphic source for drug_approvals.

news — Press releases from Business Wire, GlobeNewsWire, Cision, and financial sources. LLM pipelines classify articles by category (Regulatory, Business Deal, Trial Results), extract drug and disease mentions, and identify FDA submission events.

news_chunks — Chunked segments for RAG and embedding search. Uses the Vectorizable concern.

publications — Scientific articles from PubMed, conferences (ASCO, AACR, ASH, EHA, ESMO), and materialized investor presentations. Stores abstracts, full text when available, and LLM extraction results. Links to trials through publication_clinical_trials.

news_fda_submissions — FDA submission events (NDA, BLA, sNDA) extracted from news. Tracks PDUFA dates and links to matched indications.

source_drugs — External drug records from ChEMBL, NCI Thesaurus, RxNorm, and Adis. Used for entity resolution. Drugs link via dedicated FKs: chembl_id, ncit_id, rx_norm_id, adis_id.

ncit_concepts — NCI Thesaurus concepts broader than drugs. Covers procedures, diseases, and biomedical entities. Drug-like concepts can be converted to Drug records.

master_lists / master_records — Dropdown values for controlled vocabularies. A MasterList defines a category (Mechanism of Action). MasterRecord entries store individual values (Inhibitor, Agonist). Many drug fields reference master records via FKs.

forecasting_statistics — Reference values powering the pharmaceutical forecasting app. Each row is a single data point for a {geo, indication, line, name} variable (e.g., incidenceBase for USA / Breast Cancer / no line). The source JSONB array lists citations ({name, url, sampleSize, comment}); year is pure metadata (the citation year of the source) and does not participate in uniqueness.

A single {geo, indication, line, name} variable has exactly one canonical row (is_guidance = TRUE), enforced by a partial unique index scoped to live rows (UNIQUE (geo, indication, line, name) WHERE is_guidance = TRUE AND deleted_at IS NULL). Additional rows with is_guidance = FALSE represent historical or alternative values surfaced in the frontend’s source tooltip. Soft-deleted rows release their guidance slot, allowing a new canonical row for the same key. Restoring a soft-deleted guidance row via Paranoia’s restore triggers a before_restore callback (demote_if_guidance_conflict) that demotes it to is_guidance = false when a live guidance sibling already occupies the slot, preventing partial unique index violations.

  • Uniqueness: exactly one row per {geo, indication, line, name} variable with is_guidance = TRUE. YEAR is pure metadata (citation year of the source) and does not participate in uniqueness.
  • year is nullable. Label/enum rows (type, displayName, category, biomarkerName) and any sourceless row can have a null year — see ForecastingStatistic::YEAR_OPTIONAL_NAMES in app/models/forecasting_statistic.rb (the Thor importer and GuidanceService both reference this model constant). Admin/API creates auto-stamp the current year when blank (ForecastingStatistics::GuidanceService); the Thor import honors null years per the allowlist and records real publication years on alternatives.
  • Bulk import accepts an optional IS_GUIDANCE column. When multiple CSV rows share the same {geo, indication, line, name} variable, exactly one must be marked TRUE (canonical); the rest are stored as alternatives. Blank or missing defaults to TRUE. The pre-validation pass rejects imports with 0 or >1 TRUE rows per key.
  • indication and line are NOT NULL DEFAULT ''. Empty string replaces the previous NULL semantics (“no line”, “no indication”). Required because Postgres treats NULL as distinct in unique indexes, which would otherwise let duplicate-guidance rows through.
  • popGrowth is retained in the table and API but not currently consumed by the forecasting app.
  • Stage distribution rows (earlyStagePercent, metStagePercent, unknownStagePercent, earlyStageLocalizedPercent, earlyStageLocallyAdvancedPercent) are gated by ForecastingStatistics::StageDistributionValidator on every admin-API save (via GuidanceService) and on the CSV import preflight in forecasting_statistics:import. Three invariants apply to the canonical guidance state per (geo, indication): early sub-buckets cannot exceed parent earlyStagePercent (block), top-level buckets must sum to ~100% within ±1.5pp (acknowledgeable warn), and no single value may exceed 100% (block). The service returns :stage_violation / :stage_warning alongside the existing :conflict / :invalid statuses. The reconcile_stages batch endpoint validates the proposed end state once and writes all five buckets in a single transaction (it sets skip_stage_validation: true per-row to avoid re-running validation against in-flight state).
  • Disease linkage is restricted to the curated whitelist in ForecastingStatistics::AnalyticsDiseases (lib/forecasting_statistics/analytics_diseases.rb). The Thor link_diseases task and the /api/forecasting/diseases dropdown both source from IDS_TO_LABELS; normalize_indications rewrites indication to the xlsx label and raises if the linked disease is not on the whitelist. The import task itself pre-flights every non-blank CSV indication against INDICATION_TO_DISEASE + the whitelist, validates stage-distribution invariants (INV-1, INV-3) per (geo, indication, line) group via StageDistributionValidator.evaluate, and aborts before wiping any data on any violation. INDICATION_TO_DISEASE also carries explicit aliases for the xlsx parent labels (e.g. Colorectal Cancer, Endometrial Cancer, Other Hodgkin Lymphoma) so a re-import of normalized export data resolves to the same analytics parents. Pre-whitelist subtypes are mapped to their analytics parents via COLLAPSE_MAP (e.g. ESCC → Esophageal Cancer); the collapse_to_analytics Thor task applies this map to already-imported subtype rows (with a pre-flight duplicate-key check, idempotent, acts_as_paranoid-aware).
  • Selection: the single is_guidance = TRUE row is the canonical. Alternatives (FALSE rows) are sorted newest-year first for UI display but do not participate in selection.
  • Biomarker rows use line: "bm:<slug>" (e.g. bm:hpvPositive) and come as a 3-name triplet per (disease, geo, slug): biomarkerName, biomarkerPercentage, optional biomarkerTestingRate. Writes go through ForecastingStatistics::BiomarkerService.upsert_triplet which sets indication = disease.name on all rows and wraps create/update/delete in a transaction. BiomarkerService.list_for(disease, geo) returns native rows for the requested geo plus inherited USA rows when a non-USA geo has no local triplet for a given slug (matching the forecasting client’s geo fallback). StageDistributionValidator.stage_row? only matches blank-line rows, so biomarker rows bypass the stage invariants — no contradiction with the stage-distribution story above; the validator simply doesn’t scope biomarker triplets. Percentage values (biomarkerPercentage, biomarkerTestingRate) are constrained to 0..100 at the model layer via ForecastingStatistic::BIOMARKER_PERCENTAGE_NAMES, so the bound holds regardless of write path. Slugs must match \A[a-zA-Z][a-zA-Z0-9]{0,63}\z (letter-leading, 1–64 chars), enforced in both BiomarkerService::SLUG_REGEX and BiomarkerModal.jsx’s client-side validation. Write protection is enforced at the model: ForecastingStatistic#reject_unmanaged_biomarker_write (validator) and #reject_unmanaged_biomarker_destroy (before_destroy) reject any save or destroy of a row whose current or previous line starts with bm: unless the writer flips biomarker_managed_write. The flag is set by the two authorized seed paths: BiomarkerService.upsert_triplet / .destroy_triplet (admin/API edits) and the wipe-and-load forecasting_statistics:import Thor task (lib/tasks/forecasting_statistics.thor:281), which flips it per-instance only for rows whose line starts with bm:. Every other surface — ad-hoc Thor tasks, Rails console, ActiveAdmin form, direct ForecastingStatistic.create! — hits the guard, preventing accidental single-row writes that would break the triplet invariant.

Stage-distribution variables (earlyStagePercent, metStagePercent, unknownStagePercent, earlyStageLocalizedPercent, earlyStageLocallyAdvancedPercent) have three on-save invariants enforced by ForecastingStatistics::StageDistributionValidator:

  • INV-1: earlyStageLocalizedPercent + earlyStageLocallyAdvancedPercent ≤ earlyStagePercent (blocking).
  • INV-2: top-level stage buckets must sum to ~100% within ±1.5pp (acknowledgeable warning — overridable by user; INV-1 and INV-3 are not).
  • INV-3: any single stage value ≤ 100% (blocking, strict — no tolerance).

The React Statistics console surfaces violations via StageReconciliationModal, which commits all five values atomically via POST /api/forecasting/admin/statistics/reconcile_stages. The endpoint loads the canonical state for the (geo, indication, line) tuple (disease_id-agnostic so validator scope structurally matches the upsert_stage_set writer key), merges the payload on top, validates the final state, then upserts each row inside a transaction via GuidanceService.upsert_stage_set.

The data is read through two API surfaces. The client endpoint at /api/forecasting/statistics gates by ForecastingStatistics::ReadinessChecker.ready? — only diseases whose USA-geo rows satisfy the essentials in KEY_DATA_REQUIREMENTS (common + type-specific for Solid Tumor / Hematology) are returned, plus global geo markers (disease_id IS NULL AND indication = '') the client needs to render its geo selector. The admin endpoint at /api/forecasting/admin/statistics returns all rows flat for the curation UI. The forecasting frontend selects the canonical is_guidance = TRUE row and surfaces alternatives (sorted newest-year first) in the source tooltip. Conflict resolution happens in the admin panel (create/edit/delete surface an interactive 409 + resolution flow, plus a stage-reconciliation flow when StageDistributionValidator blocks the save) or in the Thor import (pre-validated atomic failure on any duplicate-guidance key). The partial unique index is the last line of defense — it should never surface as a user-facing error in normal operation.

forecasting_pricings — Drug pricing points per geo/indication/line. Populated via lib/tasks/forecasting_pricing.thor.

users — Admin users with roles and Devise authentication.

workflow_instances / workflow_steps — Pipeline orchestration state. Each workflow instance tracks a single run. Steps carry status, retry counts, and AWS Batch integration.

versions — PaperTrail audit trail. Tracks changes to drugs, approvals, diseases, organizations, biomarkers, indications, and guidelines.

lockable_attributes — Polymorphic table recording which fields on which records are locked against automated overwrite.

assignments — Polymorphic user assignments for QA workflows.

embeddables — Polymorphic table storing vector embeddings for any model.

These patterns appear throughout the schema. Understand them once and you understand every table.

JSONB columns serve two purposes:

  1. LLM stagingllm_data holds AI extraction results before human QA promotes them to relational columns. The raw_data column preserves the original API response as an immutable audit trail.
  2. Flexible storageall_synonyms, keywords, therapeutic_areas, and similar columns store arrays or objects that do not warrant dedicated tables.

JSONB fields use GIN indexes for containment queries. The StoreModel gem types some JSONB columns as validated Ruby objects. The jsonb_accessor gem provides typed accessors for nested keys.

Many clinical-trial-result tables have a _stage twin (e.g., milestones / milestones_stage). Stage tables receive raw AACT pipeline data. After QA, records promote to the production table. Both share the same schema and ID sequence.

The Lockable concern lets any model protect manually edited fields from automated overwrites. When an analyst manually sets a drug’s name, a lockable_attributes row locks that field. Pipelines check drug.name_locked? before overwriting. This preserves human curation across import cycles.

The Claimable concern implements 24-hour review windows. An analyst claims a record (claimed_by / claimed_at). Claims expire after 24 hours if not completed. The analyst reviews, approves, and the data flows to authoritative tables.

Several tables use source_type/source_id pairs: drug_approvals (five source types), aggregations, embeddables, audit_issues, and workflow_resource_links. This pattern supports multiple parent types without schema changes.

Drugs (B1234), organizations (BC567), and targets (BT123) carry auto-generated, human-readable bioloupe_id strings. These are stable external references used across all Bioloupe products.

Standard Rails conventions throughout: plural snake_case table names, id bigint primary keys, {model}_id foreign keys, created_at/updated_at timestamps, deleted_at for soft deletes (via Paranoia gem), and {table_a}_{table_b} for join tables.

Now that you understand what data lives in Data Gov, read how it gets there.