Data model

The Data Gov database tells the story of every tracked drug — from discovery through clinical trials to regulatory approval and market launch. This page explains what data lives in the system, why it is structured the way it is, and how 206 tables connect into a coherent knowledge graph.

The big picture

Four core domains sit at the center of the graph. Every other table exists to connect, enrich, or contextualize these domains.

erDiagram
    drugs ||--o{ drug_approvals : "approved as"
    drugs ||--o{ drug_target_actions : "acts on"
    drugs ||--o{ drug_ownerships : "owned by"
    drugs ||--o{ drug_statuses : "status per area"
    drugs ||--o{ brand_drugs : "marketed as"
    drugs }o--|| technologies : "modality"
    drug_approvals ||--o{ indications : "indicated for"
    indications }o--o{ diseases : "treats"
    drug_target_actions }o--|| targets : "targets"
    drug_ownerships }o--|| organisations : "company"
    drugs }o--o{ clinical_trials : "tested in"
    clinical_trials ||--o{ participation_criteria : "eligibility"
    clinical_trials ||--o{ study_plan_extractions : "study plans"
    clinical_trials ||--o{ result_outcomes : "results"
    clinical_trials ||--o{ sponsors : "sponsored by"
    sponsors }o--|| organisations : "company"
    diseases ||--o{ disease_parents : "hierarchy"
    diseases ||--o{ disease_biomarkers : "characterized by"
    disease_biomarkers }o--|| biomarkers : "biomarker"
    diseases ||--o{ guidelines : "treated per"
    organisations ||--o{ organisation_histories : "deal history"
    news ||--o{ news_drug_mentions : "mentions"
    news ||--o{ news_fda_submissions : "tracks"
    publications ||--o{ publication_clinical_trials : "references"

Three schemas, one primary

The database uses three PostgreSQL schemas. Only public is writable.

Schema	Tables	Purpose
`public`	~170	Core entities, relationships, pipeline staging, audit trails
`analytics`	~25	Dimensional model (star/snowflake) for BI tools. No FK constraints.
`forecasting`	~10	Django application tables for forecasting models

Six PostgreSQL extensions power the search and matching infrastructure: pg_trgm (trigram similarity), fuzzystrmatch (Levenshtein distance), vector (pgvector embeddings), pg_stat_statements (query monitoring), pgstattuple (storage stats), and plpgsql (stored procedures).

Domain 1: Drugs

The drugs table is the center of the knowledge graph. One row per compound, tracked at the INN (International Nonproprietary Name) level — the active ingredient, not the brand.

Key columns on drugs:

Column	Purpose
`bioloupe_id`	Stable external identifier (e.g., `B1234`). Auto-generated.
`name`	Primary name (INN or common name)
`all_synonyms`	JSONB array of all known names. GIN-indexed for search.
`drug_type`	`New Molecular Entity`, `New Active Ingredient`, `New Formulation`, `Generic`, `Biosimilar`, `Interchangeable`, `Original`
`drug_status`	`drug_marketed`, `drug_development`, `drug_development_stopped`, `drug_withdrawn`, `drug_unknown`
`development_phase`	`discovery` through `approved`
`is_combination`	Fixed-dose combination drugs. May carry their own `technology_id` and `drug_target_actions` in addition to component links via `combination_drug_components`.
`technology_id`	FK to `technologies`. Drug modality (small molecule, ADC, CAR-T, etc.)
`organisation_id`	FK to `organisations`. Primary associated company.
`therapeutic_areas`	JSONB array. Oncology, Hematology, etc.
`recent_development_activity`	3-year activity signal: `active`, `inactive`, `unknown`. Manually editable.
`pdufa_date`	FDA target action date
`inn_key`	Normalized INN for cross-source matching

Drugs connect outward to every other domain through relationship tables.

drug_approvals — Regulatory approvals. One row per region per drug. Polymorphic source_type/source_id links back to the raw regulatory record (FDA, EMA, KEGG, or CDE). Each approval has many indications.

drug_ownerships — Which company owns or partners on the drug, with geographic scope (geography_include/geography_exclude JSONB arrays).

drug_target_actions — Links drugs to molecular targets with mechanism of action (Inhibitor, Agonist, etc.). Supports mutation-specific targeting and allosteric binding flags.

drug_statuses — Per-therapeutic-area status overrides. One drug can be drug_marketed in Oncology but still drug_development in Hematology.

brand_drugs — Marketed brand names. Aggregates regional approvals with per-region approval FKs (usa_approval_id, eu_approval_id, jp_approval_id, cn_approval_id).

combination_drug_components — Self-join linking combination drugs to their component compounds.

technologies — Hierarchical drug modality taxonomy (Small Molecule > Kinase Inhibitor). Self-referential parent_id. CAR-T, ADC, and cell/gene therapy drugs carry additional attributes through master_records FKs on the drug itself.

Indications: where drugs meet diseases

The indications table is the critical junction between drugs and diseases. Each drug_approval has many indications. Each indication captures:

disease_name — Raw text from the label
disease_stages, disease_subtypes, disease_statuses, disease_extents — JSONB arrays of faceted disease characteristics
treatment_lines — JSONB array (e.g., ["1L", "2L+"])
indication_type — treatment, diagnosis, prevention, mitigation
accelerated_approval_date, full_approval_date — Track the approval pathway

Indications link to diseases through indication_diseases. They carry sub-tables for biomarker requirements (indicated_biomarkers), therapeutic approaches (indicated_therapeutic_approaches), combination partners (indicated_combination_partners), and prior therapy requirements (indicated_prior_therapies).

Domain 2: Clinical trials

Clinical trials capture the experimental evidence for every drug. 72,000+ trials from two registries, each decomposed into structured components.

clinical_trials — One row per registered trial. Key columns: nct_id (unique identifier), phase, overall_status (recruitment status), collection_source (aact or chictr), llm_data (JSONB for LLM extraction results), and versions (JSONB for CT.gov version history).

The trial domain splits into four sub-areas.

Study plans

Study plans decompose trial arms into individual drug components with dosing details. Three tables form the hierarchy:

study_plan_extractions — One per trial. Tracks confidence and match statistics.
study_plan_arms — Individual arms (EXPERIMENTAL, COMPARATOR, CONTROL) with intervention data.
study_plan_components — Individual drugs or procedures within an arm. Stores dosing JSONB, matched_resource for linking to drugs, and investigational_component flag.

Participation criteria

participation_criteria stores LLM-extracted disease-level criteria per trial. One row per disease-trial pair. Fields include disease_id, faceted JSONB arrays (disease_stages, disease_subtypes, treatment_settings), and biomarker requirements through participation_criterion_biomarkers.

Trial results

Results follow a strict AACT hierarchy. Each result table has a _stage twin for pipeline staging.

result_groups — Treatment arms within results
result_outcomes — Outcome definitions (Primary, Secondary, Other)
outcome_measurements — Statistical values per outcome per group
outcome_analyses — p-values, confidence intervals, hazard ratios
reported_events / reported_event_totals — Adverse events by organ system
baseline_counts / baseline_measurements — Demographics and baseline characteristics

Endpoints

endpoints define standard clinical endpoints (Overall Survival, Progression-Free Survival). clinical_trial_endpoints links trials to their endpoints. endpoint_groups provide hierarchical categorization (Survival, Tumor Response, Quality of Life).

Domain 3: Diseases and biomarkers

Diseases form a directed acyclic graph (DAG). A disease can have multiple parents through disease_parents. This captures the reality that NSCLC is both a “Lung Cancer” and a “Solid Tumor.”

Key columns on diseases:

Column	Purpose
`name`	Disease name (e.g., “Non-Small Cell Lung Cancer”)
`nci_code`	NCI Thesaurus code. Unique, not null.
`keywords`, `excluded_keywords`	JSONB arrays for condition-matching pipelines
`synonyms`	JSONB array of alternative names
`all_descendants`	JSONB array of descendant IDs. Computed on save.
`hierarchy_level`	Depth from root
`simplified`	Whether this disease is in the curated tree
`standard_of_care`	Whether SoC data has been configured

disease_biomarkers — Links biomarkers to diseases with prevalence data. Each carries prevalence ranges and potential_values JSONB.

biomarkers — Master biomarker entities (EGFR, PD-L1, HER2). Hierarchical via parent_id. Stores known_values JSONB (Positive, Negative, High) and carries NCI Thesaurus codes.

disease_nodes — NCI Thesaurus-aligned classification nodes within a disease. Each carries matching rules (inclusion_rule, exclusion_rule) for automated condition parsing.

guidelines — Standard-of-care treatment guidelines per disease. Rich JSONB data model with line_of_therapy, disease_subtype, stages, prognosis, raw_biomarkers, and results. Links to drugs, drug groups, and chemo combinations through HABTM join tables.

disease_treatment_lines — Links diseases to applicable treatment lines (First-line, Second-line, Maintenance) with ordering.

disease_stats — SEER-style epidemiological data: incidence rates, death rates, survival rates.

Segments

Diseases partition into patient segments for standard-of-care mapping. disease_base_segments define segmentation by stage or subtype. disease_biomarker_segments add biomarker-based stratification. These connect to treatment guidelines and prevalence data.

Domain 4: Organizations

Organizations represent pharmaceutical companies, biotechs, CROs, academic institutions, and diagnostics companies.

Key columns on organisations:

Column	Purpose
`bioloupe_id`	Stable identifier (e.g., `BC567`). Auto-generated.
`name`	Company name. Unique.
`type`	`Pharmaceutical/Biotechnology`, `Diagnostic`, `Academic Institution/Hospital`, `CRO`, etc.
`ownership_status`	`Publicly Held`, `Privately Held`, `Subsidiary`
`branch_names`	JSONB array of subsidiary and division names for matching
`headquarters`	ISO country code
`subsidiary_of_id`	Self-referential FK. Parent company.

Financial columns: market_cap, enterprise_value, revenue_ttm, profit_margin, total_cash_mrq, cash_runaway_month. Stock data: us_stock_exchange, us_stock_symbol, non_us_stock_symbol. Regulatory identifier: sec_cik (SEC Central Index Key, populated from the stock symbol via Financial Modeling Prep).

organisation_histories — M&A, licensing, and partnership deals. Carries financial terms (total_value, upfront_payment, milestone_payment), geographic scope, and deal classification. Links to drugs through drugs_organisation_histories.

organisation_history_participants — Links organizations to deals with roles (acquirer, target, licensor, licensee).

pipeline_snapshots — Automated screenshots of company pipeline pages. current_screenshot_hash vs previous_screenshot_hash detect changes. change_explanation stores LLM-generated description of what changed.

investor_presentations — Investor-facing presentation PDFs linked to organisations. Stores extracted content (presentation_content), category (deck, clinical, financial_results, annual_report, other_investor_event), LLM-detected content markers (content_markers JSONB), and structured extraction warnings (content_extraction_warnings JSONB; e.g. missing_page_sections when placeholder sections were inserted for unrecoverable PDF pages). Presentations in the deck and clinical categories are materialized into publications via InvestorPresentationPublications::Materializer; other categories are skipped.

organisation_sec_filings — SEC filing metadata (10-K, 10-Q, 20-F and amended variants) captured per organisation from Financial Modeling Prep. Stores filing_type, filed_at, accepted_at, filing_url (direct document), sec_index_url, raw source_payload JSONB, and the original FMP source_symbol / cik. Deduplication keys on (organisation_id, source_uid) where source_uid is the SEC accession number when present, otherwise a SHA-256 of filing identifiers. first_seen_at / last_seen_at track sync history. content_extraction_status (succeeded / retryable_failed / permanent_failed / null), content_extraction_attempt_count, content_extraction_failure_kind, content_extraction_error, and content_extraction_attempted_at track the discovery pipeline’s SEC body extraction (see Evidence Discovery).

Supporting domains

Regulatory source tables

Raw source records from each regulatory agency live in dedicated tables. These are the unprocessed imports before entity resolution.

Table	Source	Key identifier
`fda_data`	FDA Drugs@FDA API	`application_number`
`ema_approved_drugs`	EMA XLSX export	`product_number`
`kegg_approved_drugs`	KEGG Japan database	`kegg` identifier
`china_approved_drugs`	CDE/NMPA China	`cde_code`
`purplebook_records`	FDA Purple Book	`bla_number`
`orangebook_records`	FDA Orange Book	`appl_no` + `product_no`
`breakthroughs`	FDA Breakthrough Therapy	`agent` + `indication`

Each source table carries a drug_id FK (nullable before matching) and serves as a polymorphic source for drug_approvals.

News and publications

news — Press releases from Business Wire, GlobeNewsWire, Cision, and financial sources. LLM pipelines classify articles by category (Regulatory, Business Deal, Trial Results), extract drug and disease mentions, and identify FDA submission events.

news_chunks — Chunked segments for RAG and embedding search. Uses the Vectorizable concern.

publications — Scientific articles from PubMed, conferences (ASCO, AACR, ASH, EHA, ESMO), and materialized investor presentations. Stores abstracts, full text when available, and LLM extraction results. Links to trials through publication_clinical_trials.

news_fda_submissions — FDA submission events (NDA, BLA, sNDA) extracted from news. Tracks PDUFA dates and links to matched indications.

Reference data

ncit_concepts — NCI Thesaurus concepts broader than drugs. Covers procedures, diseases, and biomedical entities. Drug-like concepts can be converted to Drug records.

master_lists / master_records — Dropdown values for controlled vocabularies. A MasterList defines a category (Mechanism of Action). MasterRecord entries store individual values (Inhibitor, Agonist). Many drug fields reference master records via FKs.

Forecasting reference data

forecasting_statistics — Reference values powering the pharmaceutical forecasting app. Each row is a single data point for a {geo, indication, line, name} variable (e.g., incidenceBase for USA / Breast Cancer / no line). The source JSONB array lists citations ({name, url, sampleSize, comment}); year is pure metadata (the citation year of the source) and does not participate in uniqueness.

A single {geo, indication, line, name} variable has exactly one canonical row (is_guidance = TRUE), enforced by a partial unique index scoped to live rows (UNIQUE (geo, indication, line, name) WHERE is_guidance = TRUE AND deleted_at IS NULL). Additional rows with is_guidance = FALSE represent historical or alternative values surfaced in the frontend’s source tooltip. Soft-deleted rows release their guidance slot, allowing a new canonical row for the same key. Restoring a soft-deleted guidance row via Paranoia’s restore triggers a before_restore callback (demote_if_guidance_conflict) that demotes it to is_guidance = false when a live guidance sibling already occupies the slot, preventing partial unique index violations.

Uniqueness: exactly one row per {geo, indication, line, name} variable with is_guidance = TRUE. YEAR is pure metadata (citation year of the source) and does not participate in uniqueness.
year is nullable. Label/enum rows (type, displayName, category, biomarkerName) and any sourceless row can have a null year — see ForecastingStatistic::YEAR_OPTIONAL_NAMES in app/models/forecasting_statistic.rb (the Thor importer and GuidanceService both reference this model constant). Admin/API creates auto-stamp the current year when blank (ForecastingStatistics::GuidanceService); the Thor import honors null years per the allowlist and records real publication years on alternatives.
Bulk import accepts an optional IS_GUIDANCE column. When multiple CSV rows share the same {geo, indication, line, name} variable, exactly one must be marked TRUE (canonical); the rest are stored as alternatives. Blank or missing defaults to TRUE. The pre-validation pass rejects imports with 0 or >1 TRUE rows per key.
indication and line are NOT NULL DEFAULT ''. Empty string replaces the previous NULL semantics (“no line”, “no indication”). Required because Postgres treats NULL as distinct in unique indexes, which would otherwise let duplicate-guidance rows through.
popGrowth is retained in the table and API but not currently consumed by the forecasting app.
Stage distribution rows (earlyStagePercent, metStagePercent, unknownStagePercent, earlyStageLocalizedPercent, earlyStageLocallyAdvancedPercent) are gated by ForecastingStatistics::StageDistributionValidator on every admin-API save (via GuidanceService) and on the CSV import preflight in forecasting_statistics:import. Three invariants apply to the canonical guidance state per (geo, indication): early sub-buckets cannot exceed parent earlyStagePercent (block), top-level buckets must sum to ~100% within ±1.5pp (acknowledgeable warn), and no single value may exceed 100% (block). The service returns :stage_violation / :stage_warning alongside the existing :conflict / :invalid statuses. The reconcile_stages batch endpoint validates the proposed end state once and writes all five buckets in a single transaction (it sets skip_stage_validation: true per-row to avoid re-running validation against in-flight state).
Indication taxonomy is the forecasting_indications table (ForecastingIndication) — the single source of truth for the canonical forecasting indication labels and their disease_type (solid/heme) classification. Forecasting rows store the indication label verbatim (the older disease_id linkage was decoupled and dropped); there is no disease-id whitelist. The /api/forecasting/diseases dropdown sources from ForecastingIndication.options, and the import Thor task pre-flights every non-blank CSV indication against the table via ForecastingIndication.allowed_label? (lib/tasks/forecasting_statistics.thor:85), validates stage-distribution invariants (INV-1, INV-3) per (geo, indication, line) group via StageDistributionValidator.evaluate, and aborts before wiping any data on any violation. An unknown / non-taxonomy indication aborts the import preflight instead of loading stale data. The taxonomy is seeded by the one-off thor seed_forecasting_indications:seed (lib/tasks/one_off/seed_forecasting_indications.thor). Pricing has an analogous publish table, forecasting_pricing_publications (ForecastingPricingPublication), which mirrors forecasting_indications.published_at but is keyed by the (brand_name, indication) tuple rather than a single label.
Pricing indication taxonomy extends the statistics taxonomy with curated parent labels for the Pricing console. ForecastingIndication.allowed_for_pricing? is the single source of truth: it accepts a label when allowed_label? passes OR the label is one of the code-only ForecastingIndication::PRICING_PARENTS (currently ["Non-Small Cell Lung Cancer"]) — there is no DB row for the curated parents; the list lives only in the model constant. Two write boundaries enforce it: the API create/update path rejects an off-taxonomy indication with a 422 (Api::ForecastingPricingsController#reject_indication_not_for_pricing!, body {error: "Indication not allowed for pricing: <label>"}), and the ActiveAdmin create/update path flashes the same message and re-renders the form (app/admin/forecasting_pricings.rb). The forecasting_pricings:import Thor task SKIPS (not errors) any row whose titleized indication is neither a taxonomy label nor a curated parent — it logs the row and counts it under :skipped. The Pricing disease picker opts in via GET /api/forecasting/diseases?include_pricing_parents=1, which appends each curated parent (tagged type: "parent") on top of ForecastingIndication.options(include_parents:); the param is default-off and the unflagged response is byte-identical by construction (the statistics picker never sends it). Publish state is decoupled from these write boundaries: it lives on the separate forecasting_pricing_publications table (keyed by (brand_name, indication)), not on forecasting_pricings, so the wipe-and-load importer never touches it.
Selection: the single is_guidance = TRUE row is the canonical. Alternatives (FALSE rows) are sorted newest-year first for UI display but do not participate in selection.
Biomarker rows use line: "bm:<slug>" (e.g. bm:hpvPositive) and come as a 3-name triplet per (disease, geo, slug): biomarkerName, biomarkerPercentage, optional biomarkerTestingRate. Writes go through ForecastingStatistics::BiomarkerService.upsert_triplet which sets indication = disease.name on all rows and wraps create/update/delete in a transaction. BiomarkerService.list_for(disease, geo) returns native rows for the requested geo plus inherited USA rows when a non-USA geo has no local triplet for a given slug (matching the forecasting client’s geo fallback). StageDistributionValidator.stage_row? only matches blank-line rows, so biomarker rows bypass the stage invariants — no contradiction with the stage-distribution story above; the validator simply doesn’t scope biomarker triplets. Percentage values (biomarkerPercentage, biomarkerTestingRate) are constrained to 0..100 at the model layer via ForecastingStatistic::BIOMARKER_PERCENTAGE_NAMES, so the bound holds regardless of write path. Slugs must match \A[a-zA-Z][a-zA-Z0-9]{0,63}\z (letter-leading, 1–64 chars), enforced in both BiomarkerService::SLUG_REGEX and BiomarkerModal.jsx’s client-side validation. Write protection is enforced at the model: ForecastingStatistic#reject_unmanaged_biomarker_write (validator) and #reject_unmanaged_biomarker_destroy (before_destroy) reject any save or destroy of a row whose current or previous line starts with bm: unless the writer flips biomarker_managed_write. The flag is set by the two authorized seed paths: BiomarkerService.upsert_triplet / .destroy_triplet (admin/API edits) and the wipe-and-load forecasting_statistics:import Thor task (lib/tasks/forecasting_statistics.thor:281), which flips it per-instance only for rows whose line starts with bm:. Every other surface — ad-hoc Thor tasks, Rails console, ActiveAdmin form, direct ForecastingStatistic.create! — hits the guard, preventing accidental single-row writes that would break the triplet invariant.

Stage-distribution invariants

Stage-distribution variables (earlyStagePercent, metStagePercent, unknownStagePercent, earlyStageLocalizedPercent, earlyStageLocallyAdvancedPercent) have three on-save invariants enforced by ForecastingStatistics::StageDistributionValidator:

INV-1: earlyStageLocalizedPercent + earlyStageLocallyAdvancedPercent ≤ earlyStagePercent (blocking).
INV-2: top-level stage buckets must sum to ~100% within ±1.5pp (acknowledgeable warning — overridable by user; INV-1 and INV-3 are not).
INV-3: any single stage value ≤ 100% (blocking, strict — no tolerance).

The React Statistics console surfaces violations via StageReconciliationModal, which commits all five values atomically via POST /api/forecasting/admin/statistics/reconcile_stages. The endpoint loads the canonical state for the (geo, indication, line) tuple (disease_id-agnostic so validator scope structurally matches the upsert_stage_set writer key), merges the payload on top, validates the final state, then upserts each row inside a transaction via GuidanceService.upsert_stage_set.

The data is read through two API surfaces. The client endpoint at /api/forecasting/statistics gates by ForecastingStatistics::ReadinessChecker.ready? — only diseases whose USA-geo rows satisfy the essentials in KEY_DATA_REQUIREMENTS (common + type-specific for Solid Tumor / Hematology) are returned, plus blank-indication geo markers at indication = '': population markers (popBase/popGrowth) the client reads to render its geo selector, and healthcareAccess geo-defaults (USA 95 / EU5 100 / Japan 100) it resolves per-geo. These are the only names served at a blank indication — BLANK_INDICATION_ALLOWED_NAMES (forecasting_statistic.rb:56); any other name at indication = '' is excluded (statistics_controller.rb:29-31). The admin endpoint at /api/forecasting/admin/statistics returns all rows flat for the curation UI. The forecasting frontend selects the canonical is_guidance = TRUE row and surfaces alternatives (sorted newest-year first) in the source tooltip. Conflict resolution happens in the admin panel (create/edit/delete surface an interactive 409 + resolution flow, plus a stage-reconciliation flow when StageDistributionValidator blocks the save) or in the Thor import (pre-validated atomic failure on any duplicate-guidance key). The partial unique index is the last line of defense — it should never surface as a user-facing error in normal operation. Pricing applies the analogous serve gate: GET /api/forecasting_pricings returns a row to a JWT client only when its (brand_name, indication) tuple has a forecasting_pricing_publications row (composite-key INNER JOIN); publish-capable sessions (admin-session cookie AND the :publish ability — admins + editors) are ungated, while viewer admin sessions get the published-only feed, same as JWT clients. The per-row GET /api/forecasting_pricings/:id applies the same gate (unpublished tuple → 404 unless the session is publish-capable).

forecasting_pricings — Drug pricing points per brand/indication. Populated via lib/tasks/forecasting_pricing.thor (wipe-and-load; delete_all then create!, which never touches publish state). Publish state lives on the separate forecasting_pricing_publications table, keyed by the import-stable (brand_name, indication) tuple — presence of a row == Published. The client-served feed (GET /api/forecasting_pricings, JWT clients) and the per-row GET /api/forecasting_pricings/:id are gated by a composite-key INNER JOIN to that table (the feed silently omits an unpublished tuple; only the per-row show 404s it); publish-capable sessions (the :publish ability — admins + editors) stay ungated so they can see and publish unpublished tuples, while viewer admin sessions get the published-only feed, same as JWT clients. A model after_commit (ForecastingPricing::PRICING_REVERT_TRIGGER_ATTRS) auto-unpublishes a tuple when its pricing data changes (price, brand, indication, target, technology, date_approved) — but NOT on price_as_of / source edits, and NEVER on create (so re-import preserves Published). “Outdated” is a console-only informational badge derived from price_as_of (> 1 year old); it never hides or unpublishes. The publish flag is server-only and never serialized (mirrors deleted_at), so the cross-repo payload shape is unchanged.

System tables

users — Admin users with roles and Devise authentication.

workflow_instances / workflow_steps — Pipeline orchestration state. Each workflow instance tracks a single run. Steps carry status, retry counts, and AWS Batch integration.

versions — PaperTrail audit trail. Tracks changes to drugs, approvals, diseases, organizations, biomarkers, indications, and guidelines.

lockable_attributes — Polymorphic table recording which fields on which records are locked against automated overwrite.

assignments — Polymorphic user assignments for QA workflows.

embeddables — Polymorphic table storing vector embeddings for any model.

Cross-cutting patterns

These patterns appear throughout the schema. Understand them once and you understand every table.

JSONB: staging and flexibility

JSONB columns serve two purposes:

LLM staging — llm_data holds AI extraction results before human QA promotes them to relational columns. The raw_data column preserves the original API response as an immutable audit trail.
Flexible storage — all_synonyms, keywords, therapeutic_areas, and similar columns store arrays or objects that do not warrant dedicated tables.

JSONB fields use GIN indexes for containment queries. The StoreModel gem types some JSONB columns as validated Ruby objects. The jsonb_accessor gem provides typed accessors for nested keys.

Stage tables

Many clinical-trial-result tables have a _stage twin (e.g., milestones / milestones_stage). Stage tables receive raw AACT pipeline data. After QA, records promote to the production table. Both share the same schema and ID sequence.

Lockable attributes

The Lockable concern lets any model protect manually edited fields from automated overwrites. When an analyst manually sets a drug’s name, a lockable_attributes row locks that field. Pipelines check drug.name_locked? before overwriting. This preserves human curation across import cycles.

Claim-based QA

The Claimable concern implements 24-hour review windows. An analyst claims a record (claimed_by / claimed_at). Claims expire after 24 hours if not completed. The analyst reviews, approves, and the data flows to authoritative tables.

Polymorphic associations

Several tables use source_type/source_id pairs: drug_approvals (five source types), aggregations, embeddables, audit_issues, and workflow_resource_links. This pattern supports multiple parent types without schema changes.

Bioloupe IDs

Drugs (B1234), organizations (BC567), and targets (BT123) carry auto-generated, human-readable bioloupe_id strings. These are stable external references used across all Bioloupe products.

Naming conventions

Standard Rails conventions throughout: plural snake_case table names, id bigint primary keys, {model}_id foreign keys, created_at/updated_at timestamps, deleted_at for soft deletes (via Paranoia gem), and {table_a}_{table_b} for join tables.

Next steps

Now that you understand what data lives in Data Gov, read how it gets there.

Clinical trials — How 72,000 trials flow from registries into the knowledge graph
Drug approvals — How four regulatory agencies become unified approval records
News and intelligence — How press releases become structured intelligence