Skip to content

Diseases

Diseases are the second axis of the knowledge graph (drugs being the first). The domain models a hierarchical ontology of ~2,000 diseases using four tables: diseases, disease_closures, disease_stats, and disease_biomarkers. Every disease carries an NCI Thesaurus code as its natural key, and the hierarchy is stored as a directed acyclic graph (DAG) via a closure table.

erDiagram
    diseases ||--o{ disease_closures : "ancestor"
    diseases ||--o{ disease_closures : "descendant"
    diseases ||--o| disease_stats : "has"
    diseases ||--o{ disease_biomarkers : "characterized by"
    disease_biomarkers }o--|| biomarkers : "biomarker"
    diseases }o--o| therapeutic_areas : "belongs to"

One row per disease in the ontology. The canonical source of truth for disease identity.

ColumnTypePurpose
iduuidPrimary key
nametextCanonical disease name (e.g., “Non-Small Cell Lung Cancer”)
ncit_codetextNCI Thesaurus code (e.g., C3262). Unique natural key for ontology mapping
descriptiontextFree-text description
synonymsjsonbArray of alternative names and abbreviations (e.g., ["NSCLC"])
therapeutic_area_iduuidFK to therapeutic_areas. Primary therapeutic area
clinical_segmentsjsonbStage/demographic sub-populations (e.g., ["early_stage", "elderly"])
biomarker_segmentsjsonbMolecular marker segmentations (e.g., ["EGFR_mutant", "PD-L1_high"])
key_endpointsjsonbStandard clinical endpoints for this disease (OS, PFS, ORR) with measurement criteria
indication_levelintegerClinical specificity depth (0—9). Higher = more specific indication
show_in_analyticsbooleanWhether this disease appears in analytics rollup groupings
embeddingtextVector embedding for semantic search

Indexes: idx_diseases_area (therapeutic area), idx_diseases_name (name).

Pre-computed transitive closure of the disease DAG. Every ancestor-descendant pair appears as a row, including self-references at depth 0. This enables single-query subtree lookups without recursive CTEs.

ColumnTypePurpose
iduuidPrimary key
ancestor_iduuidFK to diseases. The ancestor in the pair
descendant_iduuidFK to diseases. The descendant in the pair
depthintegerHops from ancestor to descendant. 0 = self, 1 = direct child, 2 = grandchild

Unique constraint: (ancestor_id, descendant_id). Indexes on ancestor_id, descendant_id, and depth.

Epidemiological statistics per disease. One-to-one with diseases. Sourced primarily from SEER. US-centric rates.

ColumnTypePurpose
iduuidPrimary key
disease_iduuidFK to diseases (unique — one row per disease)
estimated_new_casesintegerEstimated new cases per year (US)
estimated_new_cases_yearintegerYear of the estimate
rate_of_new_casesdecimal(8,4)Age-adjusted incidence rate per 100,000/year
death_ratedecimal(8,4)Age-adjusted death rate per 100,000/year
survival_ratedecimal(8,4)5-year relative survival rate (0—1)
metastatic_survival_ratedecimal(8,4)5-year survival, distant/metastatic stage
localized_survival_ratedecimal(8,4)5-year survival, localized stage
regional_survival_ratedecimal(8,4)5-year survival, regional stage
sourcetextData source identifier (e.g., “SEER 2024”)

Links biomarkers to diseases with prevalence data and clinical context. One row per disease-biomarker pair.

ColumnTypePurpose
iduuidPrimary key
disease_iduuidFK to diseases
biomarker_iduuidFK to biomarkers
prevalence_ratedecimal(5,4)Point estimate of prevalence in this disease (0—1)
prevalence_minimumdecimal(5,4)Lower bound of prevalence range
prevalence_maximumdecimal(5,4)Upper bound of prevalence range
potential_valuesjsonbPossible test results (e.g., ["positive", "negative", "equivocal"])
biomarker_contexttextClinical role: predictive, prognostic, diagnostic, or screening
referencejsonbCitation references (PMIDs, DOIs, URLs)

Unique constraint: (disease_id, biomarker_id). Check constraint ensures prevalence_minimum <= prevalence_maximum.

Closure table instead of JSONB descendants array

Section titled “Closure table instead of JSONB descendants array”

The legacy schema stored an all_descendants JSONB array on each disease row. This made subtree queries fast but introduced write amplification: inserting a single leaf disease required updating every ancestor’s array. It also could not answer “all ancestors of X” without a separate scan.

The closure table (disease_closures) solves both problems. Reads are equally fast (indexed lookups), writes are local (insert one row per ancestor), and both ancestor and descendant queries use the same table. The tradeoff is storage — the closure table has O(n * d) rows where d is average depth — but for ~2,000 diseases with average depth ~4, this is negligible.

Epidemiological data is periodic (SEER publishes annually) and comes from a completely different source than the ontology itself. Keeping it in a separate one-to-one table means:

  • Stats can be refreshed independently without touching the disease hierarchy.
  • Null stats do not bloat the core disease row.
  • Future support for historical snapshots (multiple years) requires only relaxing the unique constraint.

Two materialized views join through the diseases domain:

  • mv_development_programs — Highest development phase per drug-disease pair. Unions approved indications (via indications.disease_id) with clinical trial mappings (via trial_diseases.disease_id). One row per (drug, disease) with denormalized disease_name.

  • mv_competitive_landscape — Drug-disease-phase matrix with drug metadata and owner info. Built on top of mv_development_programs. Adds technology, organization, and approval data. One row per (drug, disease).

Both views denormalize disease_name to avoid joining back to diseases in downstream queries.

All subtypes of lung cancer:

SELECT d.name, dc.depth
FROM disease_closures dc
JOIN diseases d ON d.id = dc.descendant_id
WHERE dc.ancestor_id = (
SELECT id FROM diseases WHERE name = 'Lung Cancer'
)
AND dc.depth > 0
ORDER BY dc.depth, d.name;

Biomarker prevalence for NSCLC:

SELECT
b.name AS biomarker,
db.prevalence_rate,
db.prevalence_minimum,
db.prevalence_maximum,
db.biomarker_context
FROM disease_biomarkers db
JOIN biomarkers b ON b.id = db.biomarker_id
WHERE db.disease_id = (
SELECT id FROM diseases WHERE name = 'Non-Small Cell Lung Cancer'
)
ORDER BY db.prevalence_rate DESC NULLS LAST;

Incidence and survival for breast cancer:

SELECT
d.name,
ds.estimated_new_cases,
ds.estimated_new_cases_year,
ds.rate_of_new_cases,
ds.survival_rate,
ds.metastatic_survival_rate,
ds.source
FROM disease_stats ds
JOIN diseases d ON d.id = ds.disease_id
WHERE d.name = 'Breast Cancer';