Diseases
Diseases are the second axis of the knowledge graph (drugs being the first). The domain models a hierarchical ontology of ~2,000 diseases using four tables: diseases, disease_closures, disease_stats, and disease_biomarkers. Every disease carries an NCI Thesaurus code as its natural key, and the hierarchy is stored as a directed acyclic graph (DAG) via a closure table.
Entities
Section titled “Entities”erDiagram
diseases ||--o{ disease_closures : "ancestor"
diseases ||--o{ disease_closures : "descendant"
diseases ||--o| disease_stats : "has"
diseases ||--o{ disease_biomarkers : "characterized by"
disease_biomarkers }o--|| biomarkers : "biomarker"
diseases }o--o| therapeutic_areas : "belongs to"
diseases
Section titled “diseases”One row per disease in the ontology. The canonical source of truth for disease identity.
| Column | Type | Purpose |
|---|---|---|
id | uuid | Primary key |
name | text | Canonical disease name (e.g., “Non-Small Cell Lung Cancer”) |
ncit_code | text | NCI Thesaurus code (e.g., C3262). Unique natural key for ontology mapping |
description | text | Free-text description |
synonyms | jsonb | Array of alternative names and abbreviations (e.g., ["NSCLC"]) |
therapeutic_area_id | uuid | FK to therapeutic_areas. Primary therapeutic area |
clinical_segments | jsonb | Stage/demographic sub-populations (e.g., ["early_stage", "elderly"]) |
biomarker_segments | jsonb | Molecular marker segmentations (e.g., ["EGFR_mutant", "PD-L1_high"]) |
key_endpoints | jsonb | Standard clinical endpoints for this disease (OS, PFS, ORR) with measurement criteria |
indication_level | integer | Clinical specificity depth (0—9). Higher = more specific indication |
show_in_analytics | boolean | Whether this disease appears in analytics rollup groupings |
embedding | text | Vector embedding for semantic search |
Indexes: idx_diseases_area (therapeutic area), idx_diseases_name (name).
disease_closures
Section titled “disease_closures”Pre-computed transitive closure of the disease DAG. Every ancestor-descendant pair appears as a row, including self-references at depth 0. This enables single-query subtree lookups without recursive CTEs.
| Column | Type | Purpose |
|---|---|---|
id | uuid | Primary key |
ancestor_id | uuid | FK to diseases. The ancestor in the pair |
descendant_id | uuid | FK to diseases. The descendant in the pair |
depth | integer | Hops from ancestor to descendant. 0 = self, 1 = direct child, 2 = grandchild |
Unique constraint: (ancestor_id, descendant_id). Indexes on ancestor_id, descendant_id, and depth.
disease_stats
Section titled “disease_stats”Epidemiological statistics per disease. One-to-one with diseases. Sourced primarily from SEER. US-centric rates.
| Column | Type | Purpose |
|---|---|---|
id | uuid | Primary key |
disease_id | uuid | FK to diseases (unique — one row per disease) |
estimated_new_cases | integer | Estimated new cases per year (US) |
estimated_new_cases_year | integer | Year of the estimate |
rate_of_new_cases | decimal(8,4) | Age-adjusted incidence rate per 100,000/year |
death_rate | decimal(8,4) | Age-adjusted death rate per 100,000/year |
survival_rate | decimal(8,4) | 5-year relative survival rate (0—1) |
metastatic_survival_rate | decimal(8,4) | 5-year survival, distant/metastatic stage |
localized_survival_rate | decimal(8,4) | 5-year survival, localized stage |
regional_survival_rate | decimal(8,4) | 5-year survival, regional stage |
source | text | Data source identifier (e.g., “SEER 2024”) |
disease_biomarkers
Section titled “disease_biomarkers”Links biomarkers to diseases with prevalence data and clinical context. One row per disease-biomarker pair.
| Column | Type | Purpose |
|---|---|---|
id | uuid | Primary key |
disease_id | uuid | FK to diseases |
biomarker_id | uuid | FK to biomarkers |
prevalence_rate | decimal(5,4) | Point estimate of prevalence in this disease (0—1) |
prevalence_minimum | decimal(5,4) | Lower bound of prevalence range |
prevalence_maximum | decimal(5,4) | Upper bound of prevalence range |
potential_values | jsonb | Possible test results (e.g., ["positive", "negative", "equivocal"]) |
biomarker_context | text | Clinical role: predictive, prognostic, diagnostic, or screening |
reference | jsonb | Citation references (PMIDs, DOIs, URLs) |
Unique constraint: (disease_id, biomarker_id). Check constraint ensures prevalence_minimum <= prevalence_maximum.
Design decisions
Section titled “Design decisions”Closure table instead of JSONB descendants array
Section titled “Closure table instead of JSONB descendants array”The legacy schema stored an all_descendants JSONB array on each disease row. This made subtree queries fast but introduced write amplification: inserting a single leaf disease required updating every ancestor’s array. It also could not answer “all ancestors of X” without a separate scan.
The closure table (disease_closures) solves both problems. Reads are equally fast (indexed lookups), writes are local (insert one row per ancestor), and both ancestor and descendant queries use the same table. The tradeoff is storage — the closure table has O(n * d) rows where d is average depth — but for ~2,000 diseases with average depth ~4, this is negligible.
Separated disease_stats
Section titled “Separated disease_stats”Epidemiological data is periodic (SEER publishes annually) and comes from a completely different source than the ontology itself. Keeping it in a separate one-to-one table means:
- Stats can be refreshed independently without touching the disease hierarchy.
- Null stats do not bloat the core disease row.
- Future support for historical snapshots (multiple years) requires only relaxing the unique constraint.
Materialized views
Section titled “Materialized views”Two materialized views join through the diseases domain:
-
mv_development_programs— Highest development phase per drug-disease pair. Unions approved indications (viaindications.disease_id) with clinical trial mappings (viatrial_diseases.disease_id). One row per (drug, disease) with denormalizeddisease_name. -
mv_competitive_landscape— Drug-disease-phase matrix with drug metadata and owner info. Built on top ofmv_development_programs. Adds technology, organization, and approval data. One row per (drug, disease).
Both views denormalize disease_name to avoid joining back to diseases in downstream queries.
Example queries
Section titled “Example queries”All subtypes of lung cancer:
SELECT d.name, dc.depthFROM disease_closures dcJOIN diseases d ON d.id = dc.descendant_idWHERE dc.ancestor_id = ( SELECT id FROM diseases WHERE name = 'Lung Cancer')AND dc.depth > 0ORDER BY dc.depth, d.name;Biomarker prevalence for NSCLC:
SELECT b.name AS biomarker, db.prevalence_rate, db.prevalence_minimum, db.prevalence_maximum, db.biomarker_contextFROM disease_biomarkers dbJOIN biomarkers b ON b.id = db.biomarker_idWHERE db.disease_id = ( SELECT id FROM diseases WHERE name = 'Non-Small Cell Lung Cancer')ORDER BY db.prevalence_rate DESC NULLS LAST;Incidence and survival for breast cancer:
SELECT d.name, ds.estimated_new_cases, ds.estimated_new_cases_year, ds.rate_of_new_cases, ds.survival_rate, ds.metastatic_survival_rate, ds.sourceFROM disease_stats dsJOIN diseases d ON d.id = ds.disease_idWHERE d.name = 'Breast Cancer';