Diseases

Diseases are the second axis of the knowledge graph (drugs being the first). The domain models a hierarchical ontology of ~2,000 diseases using four tables: diseases, disease_closures, disease_stats, and disease_biomarkers. Every disease carries an NCI Thesaurus code as its natural key, and the hierarchy is stored as a directed acyclic graph (DAG) via a closure table.

Entities

erDiagram
    diseases ||--o{ disease_closures : "ancestor"
    diseases ||--o{ disease_closures : "descendant"
    diseases ||--o| disease_stats : "has"
    diseases ||--o{ disease_biomarkers : "characterized by"
    disease_biomarkers }o--|| biomarkers : "biomarker"
    diseases }o--o| therapeutic_areas : "belongs to"

diseases

One row per disease in the ontology. The canonical source of truth for disease identity.

Column	Type	Purpose
`id`	uuid	Primary key
`name`	text	Canonical disease name (e.g., “Non-Small Cell Lung Cancer”)
`ncit_code`	text	NCI Thesaurus code (e.g., C3262). Unique natural key for ontology mapping
`description`	text	Free-text description
`synonyms`	jsonb	Array of alternative names and abbreviations (e.g., `["NSCLC"]`)
`therapeutic_area_id`	uuid	FK to `therapeutic_areas`. Primary therapeutic area
`clinical_segments`	jsonb	Stage/demographic sub-populations (e.g., `["early_stage", "elderly"]`)
`biomarker_segments`	jsonb	Molecular marker segmentations (e.g., `["EGFR_mutant", "PD-L1_high"]`)
`key_endpoints`	jsonb	Standard clinical endpoints for this disease (OS, PFS, ORR) with measurement criteria
`indication_level`	integer	Clinical specificity depth (0—9). Higher = more specific indication
`show_in_analytics`	boolean	Whether this disease appears in analytics rollup groupings
`embedding`	text	Vector embedding for semantic search

Indexes: idx_diseases_area (therapeutic area), idx_diseases_name (name).

disease_closures

Pre-computed transitive closure of the disease DAG. Every ancestor-descendant pair appears as a row, including self-references at depth 0. This enables single-query subtree lookups without recursive CTEs.

Column	Type	Purpose
`id`	uuid	Primary key
`ancestor_id`	uuid	FK to `diseases`. The ancestor in the pair
`descendant_id`	uuid	FK to `diseases`. The descendant in the pair
`depth`	integer	Hops from ancestor to descendant. 0 = self, 1 = direct child, 2 = grandchild

Unique constraint: (ancestor_id, descendant_id). Indexes on ancestor_id, descendant_id, and depth.

disease_stats

Epidemiological statistics per disease. One-to-one with diseases. Sourced primarily from SEER. US-centric rates.

Column	Type	Purpose
`id`	uuid	Primary key
`disease_id`	uuid	FK to `diseases` (unique — one row per disease)
`estimated_new_cases`	integer	Estimated new cases per year (US)
`estimated_new_cases_year`	integer	Year of the estimate
`rate_of_new_cases`	decimal(8,4)	Age-adjusted incidence rate per 100,000/year
`death_rate`	decimal(8,4)	Age-adjusted death rate per 100,000/year
`survival_rate`	decimal(8,4)	5-year relative survival rate (0—1)
`metastatic_survival_rate`	decimal(8,4)	5-year survival, distant/metastatic stage
`localized_survival_rate`	decimal(8,4)	5-year survival, localized stage
`regional_survival_rate`	decimal(8,4)	5-year survival, regional stage
`source`	text	Data source identifier (e.g., “SEER 2024”)

disease_biomarkers

Links biomarkers to diseases with prevalence data and clinical context. One row per disease-biomarker pair.

Column	Type	Purpose
`id`	uuid	Primary key
`disease_id`	uuid	FK to `diseases`
`biomarker_id`	uuid	FK to `biomarkers`
`prevalence_rate`	decimal(5,4)	Point estimate of prevalence in this disease (0—1)
`prevalence_minimum`	decimal(5,4)	Lower bound of prevalence range
`prevalence_maximum`	decimal(5,4)	Upper bound of prevalence range
`potential_values`	jsonb	Possible test results (e.g., `["positive", "negative", "equivocal"]`)
`biomarker_context`	text	Clinical role: predictive, prognostic, diagnostic, or screening
`reference`	jsonb	Citation references (PMIDs, DOIs, URLs)

Unique constraint: (disease_id, biomarker_id). Check constraint ensures prevalence_minimum <= prevalence_maximum.

Design decisions

Closure table instead of JSONB descendants array

The legacy schema stored an all_descendants JSONB array on each disease row. This made subtree queries fast but introduced write amplification: inserting a single leaf disease required updating every ancestor’s array. It also could not answer “all ancestors of X” without a separate scan.

The closure table (disease_closures) solves both problems. Reads are equally fast (indexed lookups), writes are local (insert one row per ancestor), and both ancestor and descendant queries use the same table. The tradeoff is storage — the closure table has O(n * d) rows where d is average depth — but for ~2,000 diseases with average depth ~4, this is negligible.

Separated disease_stats

Epidemiological data is periodic (SEER publishes annually) and comes from a completely different source than the ontology itself. Keeping it in a separate one-to-one table means:

Stats can be refreshed independently without touching the disease hierarchy.
Null stats do not bloat the core disease row.
Future support for historical snapshots (multiple years) requires only relaxing the unique constraint.

Materialized views

Two materialized views join through the diseases domain:

mv_development_programs — Highest development phase per drug-disease pair. Unions approved indications (via indications.disease_id) with clinical trial mappings (via trial_diseases.disease_id). One row per (drug, disease) with denormalized disease_name.
mv_competitive_landscape — Drug-disease-phase matrix with drug metadata and owner info. Built on top of mv_development_programs. Adds technology, organization, and approval data. One row per (drug, disease).

Both views denormalize disease_name to avoid joining back to diseases in downstream queries.

Example queries

All subtypes of lung cancer:

SELECT d.name, dc.depth
FROM disease_closures dc
JOIN diseases d ON d.id = dc.descendant_id
WHERE dc.ancestor_id = (
    SELECT id FROM diseases WHERE name = 'Lung Cancer'
)
AND dc.depth > 0
ORDER BY dc.depth, d.name;

Biomarker prevalence for NSCLC:

SELECT
    b.name           AS biomarker,
    db.prevalence_rate,
    db.prevalence_minimum,
    db.prevalence_maximum,
    db.biomarker_context
FROM disease_biomarkers db
JOIN biomarkers b ON b.id = db.biomarker_id
WHERE db.disease_id = (
    SELECT id FROM diseases WHERE name = 'Non-Small Cell Lung Cancer'
)
ORDER BY db.prevalence_rate DESC NULLS LAST;

Incidence and survival for breast cancer:

SELECT
    d.name,
    ds.estimated_new_cases,
    ds.estimated_new_cases_year,
    ds.rate_of_new_cases,
    ds.survival_rate,
    ds.metastatic_survival_rate,
    ds.source
FROM disease_stats ds
JOIN diseases d ON d.id = ds.disease_id
WHERE d.name = 'Breast Cancer';