Skip to content

Materialized Views

Materialized views are the “computed intelligence” layer of Data Gov. They are derived, refreshed, and never manually edited. Their purpose is simple: AI agent text-to-SQL accuracy drops to roughly 15% once a query requires 4-5 JOINs. Matviews pre-join common access patterns so agents can answer pipeline questions with zero-JOIN queries.

Five materialized views exist today. Each one answers a specific family of questions that would otherwise require expensive multi-table joins against the base schema.

Matviews form a two-level dependency graph. Base-table-only views refresh first; views that depend on other matviews refresh second.

flowchart TD
  drugs([drugs])
  drug_approvals([drug_approvals])
  indications([indications])
  diseases([diseases])
  clinical_trials([clinical_trials])
  trial_arm_interventions([trial_arm_interventions])
  interventions([interventions])
  trial_arms([trial_arms])
  trial_diseases([trial_diseases])
  trial_arm_results([trial_arm_results])
  publication_outcomes([publication_outcomes])
  publications([publications])
  drug_ownerships([drug_ownerships])
  technologies([technologies])
  targets([targets])
  drug_target_actions([drug_target_actions])
  organisations([organisations])

  drugs --> mv_development_programs
  drug_approvals --> mv_development_programs
  indications --> mv_development_programs
  diseases --> mv_development_programs
  clinical_trials --> mv_development_programs
  trial_arm_interventions --> mv_development_programs
  interventions --> mv_development_programs
  trial_arms --> mv_development_programs
  trial_diseases --> mv_development_programs

  trial_arm_interventions --> mv_drug_trials
  interventions --> mv_drug_trials
  trial_arms --> mv_drug_trials
  clinical_trials --> mv_drug_trials

  trial_arm_results --> mv_efficacy_evidence
  clinical_trials --> mv_efficacy_evidence
  publication_outcomes --> mv_efficacy_evidence
  publications --> mv_efficacy_evidence

  mv_development_programs --> mv_competitive_landscape
  drugs --> mv_competitive_landscape
  technologies --> mv_competitive_landscape
  drug_ownerships --> mv_competitive_landscape
  organisations --> mv_competitive_landscape

  mv_development_programs --> mv_drug_pipeline
  drugs --> mv_drug_pipeline
  technologies --> mv_drug_pipeline
  targets --> mv_drug_pipeline
  drug_target_actions --> mv_drug_pipeline
  drug_ownerships --> mv_drug_pipeline
  organisations --> mv_drug_pipeline

  mv_development_programs:::matview
  mv_drug_trials:::matview
  mv_efficacy_evidence:::matview
  mv_competitive_landscape:::matview
  mv_drug_pipeline:::matview

  classDef matview fill:#4f46e5,color:#fff,stroke:#312e81

mv_development_programs and mv_drug_trials sit at the first level — they read only from base tables. mv_competitive_landscape and mv_drug_pipeline sit at the second level — they depend on mv_development_programs. mv_efficacy_evidence is first-level and independent.

Question it answers: “What is the highest development phase for drug X in disease Y?”

How it works: A CTE called phase_sources produces two sets of rows via UNION ALL:

  1. Approvals branch — joins drug_approvals to indications and diseases. Every approved indication gets phase rank 100 and phase_source = 'approval'.
  2. Clinical trials branch — joins trial_arm_interventions through interventions, trial_arms, clinical_trials, and trial_diseases. Each trial-disease pair maps its phase string to an integer rank (phase3 = 3, phase2 = 2, phase1 = 1). Withdrawn trials are excluded.

An aggregated CTE groups by drug, disease, phase, and source — collecting distinct NCT IDs into an array. A cleaned CTE removes NULL array entries. Finally, DISTINCT ON (drug_id, disease_id) ordered by highest_phase_rank DESC picks the single highest-phase row per drug-disease pair.

Key columns:

ColumnTypeDescription
drug_iduuidPK (composite with disease_id). FK to drugs.
disease_iduuidPK (composite with drug_id). FK to diseases.
drug_nametextDenormalized INN name for zero-join queries.
disease_nametextDenormalized disease name.
highest_phasetextapproved, phase4, phase3, phase2_3, phase2, phase1_2, phase1, early_phase1.
highest_phase_rankintegerSortable integer. 100 = approved, 4 = phase4, 3 = phase3, 2 = phase2, 1 = phase1, 0 = other.
phase_sourcetextapproval or clinical_trial.
nct_idstext[]Array of ClinicalTrials.gov NCT IDs contributing to the phase determination.
approval_datetimestamptzEarliest approval date (if source is approval).
earliest_completion_datetimestamptzEarliest trial completion date (if source is clinical_trial).

Indexes:

IndexColumns
uq_mv_dev_programs (unique)drug_id, disease_id
idx_mv_dev_programs_phasehighest_phase_rank
idx_mv_dev_programs_diseasedisease_id
idx_mv_dev_programs_drugdrug_id

Example query:

-- Highest phase for pembrolizumab across all diseases
SELECT disease_name, highest_phase, phase_source
FROM mv_development_programs
WHERE drug_name = 'pembrolizumab'
ORDER BY highest_phase_rank DESC;

Question it answers: “Which trials is drug X in, and what role does it play?”

How it works: A single SELECT DISTINCT joins trial_arm_interventions through interventions, trial_arms, and clinical_trials. The role is derived from a CASE expression that checks, in order:

  1. is_investigational = true on the trial-arm-intervention row — investigational
  2. arm_type = 'sham_comparator'sham_comparator
  3. arm_type = 'active_comparator'active_comparator
  4. arm_type = 'placebo_comparator'placebo_comparator
  5. arm_type = 'experimental' AND is_investigational = falsecombination_partner
  6. Everything else — other

Only rows where drug_id IS NOT NULL are included (interventions that resolved to a known drug).

Key columns:

ColumnTypeDescription
drug_iduuidPK (composite). FK to drugs.
clinical_trial_iduuidFK to clinical_trials.
roletextinvestigational, combination_partner, active_comparator, placebo_comparator, sham_comparator, other.

Indexes:

IndexColumns
uq_mv_drug_trials (unique)drug_id, clinical_trial_id, role
idx_mv_drug_trials_drugdrug_id
idx_mv_drug_trials_trialclinical_trial_id

Example query:

-- All trials where nivolumab is the investigational agent
SELECT dt.clinical_trial_id, ct.nct_id, ct.phase, ct.overall_status
FROM mv_drug_trials dt
JOIN clinical_trials ct ON ct.id = dt.clinical_trial_id
WHERE dt.drug_id = '<nivolumab-uuid>'
AND dt.role = 'investigational';

Question it answers: “What efficacy data exists for drug X in disease Y?”

How it works: A UNION ALL merges two sources into a single table of endpoint results:

  1. Trial resultstrial_arm_results joined to clinical_trials for the NCT ID. source_type = 'trial', source_id points to the clinical trial.
  2. Publication outcomespublication_outcomes joined to publications and optionally to clinical_trials (via the publication’s clinical_trial_id). source_type = 'publication', source_id points to the publication.

Both branches share the same output columns: endpoint name, endpoint type, measure value, unit, p-value, hazard ratio, confidence interval, median follow-up, and subgroup description.

Key columns:

ColumnTypeDescription
iduuidPK (composite with source_type). Source record ID.
source_typetexttrial or publication.
source_iduuidFK to source table (clinical_trials or publications).
endpoint_nametextClinical endpoint (e.g., Overall Survival, PFS, ORR).
endpoint_typetextprimary, secondary, exploratory.
measure_valuetextMeasured value (e.g., 12.5, 45.2%, NR).
unittextUnit of measurement (months, %, ratio).
p_valuetextStatistical significance (e.g., 0.001, <0.0001, NS).
hazard_ratiotextHazard ratio (e.g., 0.73).
confidence_intervaltextCI as text (e.g., 0.58-0.92).
median_follow_uptextMedian follow-up duration (e.g., 24.5 months).
subgroup_descriptiontextSubgroup label if this is a subgroup analysis.
clinical_trial_iduuidFK to clinical_trials. Present for both sources when linkable.
nct_idtextDenormalized NCT ID.
publication_iduuidFK to publications. Only set for publication-sourced rows.

Indexes:

IndexColumns
uq_mv_efficacy (unique)id, source_type
idx_mv_efficacy_evidence_endpointendpoint_name
idx_mv_efficacy_evidence_trialclinical_trial_id
idx_mv_efficacy_evidence_pubpublication_id

Example query:

-- All primary endpoint results for a given trial
SELECT endpoint_name, measure_value, unit, p_value, hazard_ratio, source_type
FROM mv_efficacy_evidence
WHERE nct_id = 'NCT04380636'
AND endpoint_type = 'primary';

Question it answers: “Who is competing in disease Y at Phase 3?”

How it works: Starts from mv_development_programs (second-level dependency) and enriches each drug-disease pair with drug metadata. Joins drugs for drug_type, technologies for modality/class, and uses a LATERAL subquery to aggregate current owner names from drug_ownerships and organisations into an array.

The LATERAL pattern is deliberate: a plain LEFT JOIN to drug_ownerships would duplicate rows when a drug has multiple current owners, breaking the unique index. The LATERAL subquery with array_agg(DISTINCT ...) collapses owners into a single array per row.

Key columns:

ColumnTypeDescription
drug_iduuidPK (composite with disease_id). FK to drugs.
disease_iduuidFK to diseases.
drug_nametextDenormalized INN name.
disease_nametextDenormalized disease name.
highest_phasetextFrom mv_development_programs.
highest_phase_rankintegerSortable rank.
phase_sourcetextapproval or clinical_trial.
drug_typetextMolecular classification (small_molecule, biologic, antibody, adc, cell_therapy, etc.).
technology_nametextDrug modality name from technologies.
drug_classtextDrug class from technologies.
current_owner_namestext[]Array of current owner organisation names.

Indexes:

IndexColumns
uq_mv_competitive (unique)drug_id, disease_id
idx_mv_comp_landscape_diseasedisease_id
idx_mv_comp_landscape_phasehighest_phase_rank
idx_mv_comp_landscape_techtechnology_name

Example query:

-- All Phase 3+ competitors in NSCLC
SELECT drug_name, drug_type, technology_name, current_owner_names
FROM mv_competitive_landscape
WHERE disease_name = 'Non-Small Cell Lung Cancer'
AND highest_phase_rank >= 3
ORDER BY highest_phase_rank DESC, drug_name;

Question it answers: “Give me a complete profile of drug X with zero JOINs.”

How it works: One row per drug. Starts from drugs and enriches with four LATERAL subqueries:

  1. Primary target — picks the primary target (is_primary DESC) from drug_target_actions joined to targets, falling back to alphabetical order. Returns a single name.
  2. All targetsarray_agg(DISTINCT name) of all targets for the drug.
  3. Current ownersarray_agg(DISTINCT name) of organisations with is_current = true from drug_ownerships.
  4. Highest phase overallMAX(highest_phase_rank) across all diseases from mv_development_programs, with the corresponding phase label.

Also joins technologies for modality name and drug class, and computes is_combination by checking for child drugs (EXISTS(SELECT 1 FROM drugs c WHERE c.parent_id = d.id)).

Key columns:

ColumnTypeDescription
drug_iduuidPK. FK to drugs.
bloupe_idtextInternal Bioloupe identifier (e.g., B1234).
drug_nametextINN name.
drug_statustextmarketed, in_development, discontinued, withdrawn, suspended.
drug_typetextMolecular classification.
first_in_classbooleanTRUE if first drug approved for its mechanism.
is_combinationbooleanTRUE if fixed-dose combination product.
technology_nametextModality name from technologies.
drug_classtextDrug class from technologies.
primary_target_nametextName of the primary molecular target.
target_namestext[]Array of all target names.
current_owner_namestext[]Array of current owner organisation names.
highest_phase_overalltextHighest development phase across all diseases.
highest_phase_rank_overallintegerSortable rank of the highest phase.

Indexes:

IndexColumns
uq_mv_pipeline (unique)drug_id
idx_mv_drug_pipeline_statusdrug_status
idx_mv_drug_pipeline_phasehighest_phase_rank_overall
idx_mv_drug_pipeline_techtechnology_name

Example query:

-- Full profile of sotorasib, zero JOINs
SELECT *
FROM mv_drug_pipeline
WHERE drug_name = 'sotorasib';

Matview refresh is catalog-driven. Three database objects, installed via migration, handle dependency resolution and ordered refresh automatically.

A recursive CTE that walks pg_depend and pg_rewrite to discover which relations (tables, views, matviews) feed into which matviews. Produces rows like:

start_relnamerelnamedepth
drugsmv_development_programs1
mv_development_programsmv_competitive_landscape1
drugsmv_competitive_landscape2

Consumes mat_view_dependencies and produces a topological sort: MAX(depth) per matview, ordered ascending. Views with no matview dependencies (depth 0) refresh first.

relnamerefresh_order
mv_development_programs1
mv_drug_trials1
mv_efficacy_evidence1
mv_competitive_landscape2
mv_drug_pipeline2

A PL/pgSQL function that iterates mat_view_refresh_order and runs REFRESH MATERIALIZED VIEW CONCURRENTLY for each view in order. A single call refreshes the entire graph:

SELECT refresh_all_matviews();

All matviews are created WITH NO DATA. They contain no rows until explicitly refreshed. The first refresh must use a non-concurrent REFRESH MATERIALIZED VIEW (without CONCURRENTLY) because concurrent refresh requires at least one unique index to already be populated. After the first population, all subsequent refreshes use CONCURRENTLY for zero-downtime updates.

Add a materialized view when all four criteria are met:

  1. The query requires 3+ JOINs. If a question can be answered with 1-2 JOINs, a matview adds complexity without meaningful accuracy gains.
  2. AI agents ask it frequently. Check agent query logs. If the same multi-join pattern appears repeatedly, it is a candidate.
  3. Underlying data changes infrequently. Matviews are stale between refreshes. If the base data changes every few minutes, a matview may serve outdated results.
  4. The result set is bounded. Matviews that would produce billions of rows are impractical. The drug-disease cross-product is large but bounded by the number of drugs and diseases in the system.

When adding a new matview, define the SQL expression, declare indexes (including at least one unique index for concurrent refresh), and run a refresh. The mat_view_dependencies infrastructure will automatically discover the new view and slot it into the correct refresh order.