Materialized Views

Materialized views are the “computed intelligence” layer of Data Gov. They are derived, refreshed, and never manually edited. Their purpose is simple: AI agent text-to-SQL accuracy drops to roughly 15% once a query requires 4-5 JOINs. Matviews pre-join common access patterns so agents can answer pipeline questions with zero-JOIN queries.

Five materialized views exist today. Each one answers a specific family of questions that would otherwise require expensive multi-table joins against the base schema.

Dependency diagram

Matviews form a two-level dependency graph. Base-table-only views refresh first; views that depend on other matviews refresh second.

flowchart TD
  drugs([drugs])
  drug_approvals([drug_approvals])
  indications([indications])
  diseases([diseases])
  clinical_trials([clinical_trials])
  trial_arm_interventions([trial_arm_interventions])
  interventions([interventions])
  trial_arms([trial_arms])
  trial_diseases([trial_diseases])
  trial_arm_results([trial_arm_results])
  publication_outcomes([publication_outcomes])
  publications([publications])
  drug_ownerships([drug_ownerships])
  technologies([technologies])
  targets([targets])
  drug_target_actions([drug_target_actions])
  organisations([organisations])

  drugs --> mv_development_programs
  drug_approvals --> mv_development_programs
  indications --> mv_development_programs
  diseases --> mv_development_programs
  clinical_trials --> mv_development_programs
  trial_arm_interventions --> mv_development_programs
  interventions --> mv_development_programs
  trial_arms --> mv_development_programs
  trial_diseases --> mv_development_programs

  trial_arm_interventions --> mv_drug_trials
  interventions --> mv_drug_trials
  trial_arms --> mv_drug_trials
  clinical_trials --> mv_drug_trials

  trial_arm_results --> mv_efficacy_evidence
  clinical_trials --> mv_efficacy_evidence
  publication_outcomes --> mv_efficacy_evidence
  publications --> mv_efficacy_evidence

  mv_development_programs --> mv_competitive_landscape
  drugs --> mv_competitive_landscape
  technologies --> mv_competitive_landscape
  drug_ownerships --> mv_competitive_landscape
  organisations --> mv_competitive_landscape

  mv_development_programs --> mv_drug_pipeline
  drugs --> mv_drug_pipeline
  technologies --> mv_drug_pipeline
  targets --> mv_drug_pipeline
  drug_target_actions --> mv_drug_pipeline
  drug_ownerships --> mv_drug_pipeline
  organisations --> mv_drug_pipeline

  mv_development_programs:::matview
  mv_drug_trials:::matview
  mv_efficacy_evidence:::matview
  mv_competitive_landscape:::matview
  mv_drug_pipeline:::matview

  classDef matview fill:#4f46e5,color:#fff,stroke:#312e81

mv_development_programs and mv_drug_trials sit at the first level — they read only from base tables. mv_competitive_landscape and mv_drug_pipeline sit at the second level — they depend on mv_development_programs. mv_efficacy_evidence is first-level and independent.

Deep dive: mv_development_programs

Question it answers: “What is the highest development phase for drug X in disease Y?”

How it works: A CTE called phase_sources produces two sets of rows via UNION ALL:

Approvals branch — joins drug_approvals to indications and diseases. Every approved indication gets phase rank 100 and phase_source = 'approval'.
Clinical trials branch — joins trial_arm_interventions through interventions, trial_arms, clinical_trials, and trial_diseases. Each trial-disease pair maps its phase string to an integer rank (phase3 = 3, phase2 = 2, phase1 = 1). Withdrawn trials are excluded.

An aggregated CTE groups by drug, disease, phase, and source — collecting distinct NCT IDs into an array. A cleaned CTE removes NULL array entries. Finally, DISTINCT ON (drug_id, disease_id) ordered by highest_phase_rank DESC picks the single highest-phase row per drug-disease pair.

Key columns:

Column	Type	Description
`drug_id`	uuid	PK (composite with `disease_id`). FK to `drugs`.
`disease_id`	uuid	PK (composite with `drug_id`). FK to `diseases`.
`drug_name`	text	Denormalized INN name for zero-join queries.
`disease_name`	text	Denormalized disease name.
`highest_phase`	text	`approved`, `phase4`, `phase3`, `phase2_3`, `phase2`, `phase1_2`, `phase1`, `early_phase1`.
`highest_phase_rank`	integer	Sortable integer. 100 = approved, 4 = phase4, 3 = phase3, 2 = phase2, 1 = phase1, 0 = other.
`phase_source`	text	`approval` or `clinical_trial`.
`nct_ids`	text[]	Array of ClinicalTrials.gov NCT IDs contributing to the phase determination.
`approval_date`	timestamptz	Earliest approval date (if source is approval).
`earliest_completion_date`	timestamptz	Earliest trial completion date (if source is clinical_trial).

Indexes:

Index	Columns
`uq_mv_dev_programs` (unique)	`drug_id`, `disease_id`
`idx_mv_dev_programs_phase`	`highest_phase_rank`
`idx_mv_dev_programs_disease`	`disease_id`
`idx_mv_dev_programs_drug`	`drug_id`

Example query:

-- Highest phase for pembrolizumab across all diseases
SELECT disease_name, highest_phase, phase_source
FROM mv_development_programs
WHERE drug_name = 'pembrolizumab'
ORDER BY highest_phase_rank DESC;

Deep dive: mv_drug_trials

Question it answers: “Which trials is drug X in, and what role does it play?”

How it works: A single SELECT DISTINCT joins trial_arm_interventions through interventions, trial_arms, and clinical_trials. The role is derived from a CASE expression that checks, in order:

is_investigational = true on the trial-arm-intervention row — investigational
arm_type = 'sham_comparator' — sham_comparator
arm_type = 'active_comparator' — active_comparator
arm_type = 'placebo_comparator' — placebo_comparator
arm_type = 'experimental' AND is_investigational = false — combination_partner
Everything else — other

Only rows where drug_id IS NOT NULL are included (interventions that resolved to a known drug).

Key columns:

Column	Type	Description
`drug_id`	uuid	PK (composite). FK to `drugs`.
`clinical_trial_id`	uuid	FK to `clinical_trials`.
`role`	text	`investigational`, `combination_partner`, `active_comparator`, `placebo_comparator`, `sham_comparator`, `other`.

Indexes:

Index	Columns
`uq_mv_drug_trials` (unique)	`drug_id`, `clinical_trial_id`, `role`
`idx_mv_drug_trials_drug`	`drug_id`
`idx_mv_drug_trials_trial`	`clinical_trial_id`

Example query:

-- All trials where nivolumab is the investigational agent
SELECT dt.clinical_trial_id, ct.nct_id, ct.phase, ct.overall_status
FROM mv_drug_trials dt
JOIN clinical_trials ct ON ct.id = dt.clinical_trial_id
WHERE dt.drug_id = '<nivolumab-uuid>'
  AND dt.role = 'investigational';

Deep dive: mv_efficacy_evidence

Question it answers: “What efficacy data exists for drug X in disease Y?”

How it works: A UNION ALL merges two sources into a single table of endpoint results:

Trial results — trial_arm_results joined to clinical_trials for the NCT ID. source_type = 'trial', source_id points to the clinical trial.
Publication outcomes — publication_outcomes joined to publications and optionally to clinical_trials (via the publication’s clinical_trial_id). source_type = 'publication', source_id points to the publication.

Both branches share the same output columns: endpoint name, endpoint type, measure value, unit, p-value, hazard ratio, confidence interval, median follow-up, and subgroup description.

Key columns:

Column	Type	Description
`id`	uuid	PK (composite with `source_type`). Source record ID.
`source_type`	text	`trial` or `publication`.
`source_id`	uuid	FK to source table (`clinical_trials` or `publications`).
`endpoint_name`	text	Clinical endpoint (e.g., Overall Survival, PFS, ORR).
`endpoint_type`	text	`primary`, `secondary`, `exploratory`.
`measure_value`	text	Measured value (e.g., 12.5, 45.2%, NR).
`unit`	text	Unit of measurement (months, %, ratio).
`p_value`	text	Statistical significance (e.g., 0.001, <0.0001, NS).
`hazard_ratio`	text	Hazard ratio (e.g., 0.73).
`confidence_interval`	text	CI as text (e.g., 0.58-0.92).
`median_follow_up`	text	Median follow-up duration (e.g., 24.5 months).
`subgroup_description`	text	Subgroup label if this is a subgroup analysis.
`clinical_trial_id`	uuid	FK to `clinical_trials`. Present for both sources when linkable.
`nct_id`	text	Denormalized NCT ID.
`publication_id`	uuid	FK to `publications`. Only set for publication-sourced rows.

Indexes:

Index	Columns
`uq_mv_efficacy` (unique)	`id`, `source_type`
`idx_mv_efficacy_evidence_endpoint`	`endpoint_name`
`idx_mv_efficacy_evidence_trial`	`clinical_trial_id`
`idx_mv_efficacy_evidence_pub`	`publication_id`

Example query:

-- All primary endpoint results for a given trial
SELECT endpoint_name, measure_value, unit, p_value, hazard_ratio, source_type
FROM mv_efficacy_evidence
WHERE nct_id = 'NCT04380636'
  AND endpoint_type = 'primary';

Deep dive: mv_competitive_landscape

Question it answers: “Who is competing in disease Y at Phase 3?”

How it works: Starts from mv_development_programs (second-level dependency) and enriches each drug-disease pair with drug metadata. Joins drugs for drug_type, technologies for modality/class, and uses a LATERAL subquery to aggregate current owner names from drug_ownerships and organisations into an array.

The LATERAL pattern is deliberate: a plain LEFT JOIN to drug_ownerships would duplicate rows when a drug has multiple current owners, breaking the unique index. The LATERAL subquery with array_agg(DISTINCT ...) collapses owners into a single array per row.

Key columns:

Column	Type	Description
`drug_id`	uuid	PK (composite with `disease_id`). FK to `drugs`.
`disease_id`	uuid	FK to `diseases`.
`drug_name`	text	Denormalized INN name.
`disease_name`	text	Denormalized disease name.
`highest_phase`	text	From `mv_development_programs`.
`highest_phase_rank`	integer	Sortable rank.
`phase_source`	text	`approval` or `clinical_trial`.
`drug_type`	text	Molecular classification (small_molecule, biologic, antibody, adc, cell_therapy, etc.).
`technology_name`	text	Drug modality name from `technologies`.
`drug_class`	text	Drug class from `technologies`.
`current_owner_names`	text[]	Array of current owner organisation names.

Indexes:

Index	Columns
`uq_mv_competitive` (unique)	`drug_id`, `disease_id`
`idx_mv_comp_landscape_disease`	`disease_id`
`idx_mv_comp_landscape_phase`	`highest_phase_rank`
`idx_mv_comp_landscape_tech`	`technology_name`

Example query:

-- All Phase 3+ competitors in NSCLC
SELECT drug_name, drug_type, technology_name, current_owner_names
FROM mv_competitive_landscape
WHERE disease_name = 'Non-Small Cell Lung Cancer'
  AND highest_phase_rank >= 3
ORDER BY highest_phase_rank DESC, drug_name;

Deep dive: mv_drug_pipeline

Question it answers: “Give me a complete profile of drug X with zero JOINs.”

How it works: One row per drug. Starts from drugs and enriches with four LATERAL subqueries:

Primary target — picks the primary target (is_primary DESC) from drug_target_actions joined to targets, falling back to alphabetical order. Returns a single name.
All targets — array_agg(DISTINCT name) of all targets for the drug.
Current owners — array_agg(DISTINCT name) of organisations with is_current = true from drug_ownerships.
Highest phase overall — MAX(highest_phase_rank) across all diseases from mv_development_programs, with the corresponding phase label.

Also joins technologies for modality name and drug class, and computes is_combination by checking for child drugs (EXISTS(SELECT 1 FROM drugs c WHERE c.parent_id = d.id)).

Key columns:

Column	Type	Description
`drug_id`	uuid	PK. FK to `drugs`.
`bloupe_id`	text	Internal Bioloupe identifier (e.g., B1234).
`drug_name`	text	INN name.
`drug_status`	text	`marketed`, `in_development`, `discontinued`, `withdrawn`, `suspended`.
`drug_type`	text	Molecular classification.
`first_in_class`	boolean	TRUE if first drug approved for its mechanism.
`is_combination`	boolean	TRUE if fixed-dose combination product.
`technology_name`	text	Modality name from `technologies`.
`drug_class`	text	Drug class from `technologies`.
`primary_target_name`	text	Name of the primary molecular target.
`target_names`	text[]	Array of all target names.
`current_owner_names`	text[]	Array of current owner organisation names.
`highest_phase_overall`	text	Highest development phase across all diseases.
`highest_phase_rank_overall`	integer	Sortable rank of the highest phase.

Indexes:

Index	Columns
`uq_mv_pipeline` (unique)	`drug_id`
`idx_mv_drug_pipeline_status`	`drug_status`
`idx_mv_drug_pipeline_phase`	`highest_phase_rank_overall`
`idx_mv_drug_pipeline_tech`	`technology_name`

Example query:

-- Full profile of sotorasib, zero JOINs
SELECT *
FROM mv_drug_pipeline
WHERE drug_name = 'sotorasib';

Refresh strategy

Matview refresh is catalog-driven. Three database objects, installed via migration, handle dependency resolution and ordered refresh automatically.

mat_view_dependencies (view)

A recursive CTE that walks pg_depend and pg_rewrite to discover which relations (tables, views, matviews) feed into which matviews. Produces rows like:

start_relname	relname	depth
drugs	mv_development_programs	1
mv_development_programs	mv_competitive_landscape	1
drugs	mv_competitive_landscape	2

mat_view_refresh_order (view)

Consumes mat_view_dependencies and produces a topological sort: MAX(depth) per matview, ordered ascending. Views with no matview dependencies (depth 0) refresh first.

relname	refresh_order
mv_development_programs	1
mv_drug_trials	1
mv_efficacy_evidence	1
mv_competitive_landscape	2
mv_drug_pipeline	2

refresh_all_matviews() (function)

A PL/pgSQL function that iterates mat_view_refresh_order and runs REFRESH MATERIALIZED VIEW CONCURRENTLY for each view in order. A single call refreshes the entire graph:

SELECT refresh_all_matviews();

Initial population

All matviews are created WITH NO DATA. They contain no rows until explicitly refreshed. The first refresh must use a non-concurrent REFRESH MATERIALIZED VIEW (without CONCURRENTLY) because concurrent refresh requires at least one unique index to already be populated. After the first population, all subsequent refreshes use CONCURRENTLY for zero-downtime updates.

When to add a new matview

Add a materialized view when all four criteria are met:

The query requires 3+ JOINs. If a question can be answered with 1-2 JOINs, a matview adds complexity without meaningful accuracy gains.
AI agents ask it frequently. Check agent query logs. If the same multi-join pattern appears repeatedly, it is a candidate.
Underlying data changes infrequently. Matviews are stale between refreshes. If the base data changes every few minutes, a matview may serve outdated results.
The result set is bounded. Matviews that would produce billions of rows are impractical. The drug-disease cross-product is large but bounded by the number of drugs and diseases in the system.

When adding a new matview, define the SQL expression, declare indexes (including at least one unique index for concurrent refresh), and run a refresh. The mat_view_dependencies infrastructure will automatically discover the new view and slot it into the correct refresh order.