Materialized Views
Materialized views are the “computed intelligence” layer of Data Gov. They are derived, refreshed, and never manually edited. Their purpose is simple: AI agent text-to-SQL accuracy drops to roughly 15% once a query requires 4-5 JOINs. Matviews pre-join common access patterns so agents can answer pipeline questions with zero-JOIN queries.
Five materialized views exist today. Each one answers a specific family of questions that would otherwise require expensive multi-table joins against the base schema.
Dependency diagram
Section titled “Dependency diagram”Matviews form a two-level dependency graph. Base-table-only views refresh first; views that depend on other matviews refresh second.
flowchart TD drugs([drugs]) drug_approvals([drug_approvals]) indications([indications]) diseases([diseases]) clinical_trials([clinical_trials]) trial_arm_interventions([trial_arm_interventions]) interventions([interventions]) trial_arms([trial_arms]) trial_diseases([trial_diseases]) trial_arm_results([trial_arm_results]) publication_outcomes([publication_outcomes]) publications([publications]) drug_ownerships([drug_ownerships]) technologies([technologies]) targets([targets]) drug_target_actions([drug_target_actions]) organisations([organisations]) drugs --> mv_development_programs drug_approvals --> mv_development_programs indications --> mv_development_programs diseases --> mv_development_programs clinical_trials --> mv_development_programs trial_arm_interventions --> mv_development_programs interventions --> mv_development_programs trial_arms --> mv_development_programs trial_diseases --> mv_development_programs trial_arm_interventions --> mv_drug_trials interventions --> mv_drug_trials trial_arms --> mv_drug_trials clinical_trials --> mv_drug_trials trial_arm_results --> mv_efficacy_evidence clinical_trials --> mv_efficacy_evidence publication_outcomes --> mv_efficacy_evidence publications --> mv_efficacy_evidence mv_development_programs --> mv_competitive_landscape drugs --> mv_competitive_landscape technologies --> mv_competitive_landscape drug_ownerships --> mv_competitive_landscape organisations --> mv_competitive_landscape mv_development_programs --> mv_drug_pipeline drugs --> mv_drug_pipeline technologies --> mv_drug_pipeline targets --> mv_drug_pipeline drug_target_actions --> mv_drug_pipeline drug_ownerships --> mv_drug_pipeline organisations --> mv_drug_pipeline mv_development_programs:::matview mv_drug_trials:::matview mv_efficacy_evidence:::matview mv_competitive_landscape:::matview mv_drug_pipeline:::matview classDef matview fill:#4f46e5,color:#fff,stroke:#312e81
mv_development_programs and mv_drug_trials sit at the first level — they read only from base tables. mv_competitive_landscape and mv_drug_pipeline sit at the second level — they depend on mv_development_programs. mv_efficacy_evidence is first-level and independent.
Deep dive: mv_development_programs
Section titled “Deep dive: mv_development_programs”Question it answers: “What is the highest development phase for drug X in disease Y?”
How it works: A CTE called phase_sources produces two sets of rows via UNION ALL:
- Approvals branch — joins
drug_approvalstoindicationsanddiseases. Every approved indication gets phase rank 100 andphase_source = 'approval'. - Clinical trials branch — joins
trial_arm_interventionsthroughinterventions,trial_arms,clinical_trials, andtrial_diseases. Each trial-disease pair maps its phase string to an integer rank (phase3 = 3, phase2 = 2, phase1 = 1). Withdrawn trials are excluded.
An aggregated CTE groups by drug, disease, phase, and source — collecting distinct NCT IDs into an array. A cleaned CTE removes NULL array entries. Finally, DISTINCT ON (drug_id, disease_id) ordered by highest_phase_rank DESC picks the single highest-phase row per drug-disease pair.
Key columns:
| Column | Type | Description |
|---|---|---|
drug_id | uuid | PK (composite with disease_id). FK to drugs. |
disease_id | uuid | PK (composite with drug_id). FK to diseases. |
drug_name | text | Denormalized INN name for zero-join queries. |
disease_name | text | Denormalized disease name. |
highest_phase | text | approved, phase4, phase3, phase2_3, phase2, phase1_2, phase1, early_phase1. |
highest_phase_rank | integer | Sortable integer. 100 = approved, 4 = phase4, 3 = phase3, 2 = phase2, 1 = phase1, 0 = other. |
phase_source | text | approval or clinical_trial. |
nct_ids | text[] | Array of ClinicalTrials.gov NCT IDs contributing to the phase determination. |
approval_date | timestamptz | Earliest approval date (if source is approval). |
earliest_completion_date | timestamptz | Earliest trial completion date (if source is clinical_trial). |
Indexes:
| Index | Columns |
|---|---|
uq_mv_dev_programs (unique) | drug_id, disease_id |
idx_mv_dev_programs_phase | highest_phase_rank |
idx_mv_dev_programs_disease | disease_id |
idx_mv_dev_programs_drug | drug_id |
Example query:
-- Highest phase for pembrolizumab across all diseasesSELECT disease_name, highest_phase, phase_sourceFROM mv_development_programsWHERE drug_name = 'pembrolizumab'ORDER BY highest_phase_rank DESC;Deep dive: mv_drug_trials
Section titled “Deep dive: mv_drug_trials”Question it answers: “Which trials is drug X in, and what role does it play?”
How it works: A single SELECT DISTINCT joins trial_arm_interventions through interventions, trial_arms, and clinical_trials. The role is derived from a CASE expression that checks, in order:
is_investigational = trueon the trial-arm-intervention row —investigationalarm_type = 'sham_comparator'—sham_comparatorarm_type = 'active_comparator'—active_comparatorarm_type = 'placebo_comparator'—placebo_comparatorarm_type = 'experimental'ANDis_investigational = false—combination_partner- Everything else —
other
Only rows where drug_id IS NOT NULL are included (interventions that resolved to a known drug).
Key columns:
| Column | Type | Description |
|---|---|---|
drug_id | uuid | PK (composite). FK to drugs. |
clinical_trial_id | uuid | FK to clinical_trials. |
role | text | investigational, combination_partner, active_comparator, placebo_comparator, sham_comparator, other. |
Indexes:
| Index | Columns |
|---|---|
uq_mv_drug_trials (unique) | drug_id, clinical_trial_id, role |
idx_mv_drug_trials_drug | drug_id |
idx_mv_drug_trials_trial | clinical_trial_id |
Example query:
-- All trials where nivolumab is the investigational agentSELECT dt.clinical_trial_id, ct.nct_id, ct.phase, ct.overall_statusFROM mv_drug_trials dtJOIN clinical_trials ct ON ct.id = dt.clinical_trial_idWHERE dt.drug_id = '<nivolumab-uuid>' AND dt.role = 'investigational';Deep dive: mv_efficacy_evidence
Section titled “Deep dive: mv_efficacy_evidence”Question it answers: “What efficacy data exists for drug X in disease Y?”
How it works: A UNION ALL merges two sources into a single table of endpoint results:
- Trial results —
trial_arm_resultsjoined toclinical_trialsfor the NCT ID.source_type = 'trial',source_idpoints to the clinical trial. - Publication outcomes —
publication_outcomesjoined topublicationsand optionally toclinical_trials(via the publication’sclinical_trial_id).source_type = 'publication',source_idpoints to the publication.
Both branches share the same output columns: endpoint name, endpoint type, measure value, unit, p-value, hazard ratio, confidence interval, median follow-up, and subgroup description.
Key columns:
| Column | Type | Description |
|---|---|---|
id | uuid | PK (composite with source_type). Source record ID. |
source_type | text | trial or publication. |
source_id | uuid | FK to source table (clinical_trials or publications). |
endpoint_name | text | Clinical endpoint (e.g., Overall Survival, PFS, ORR). |
endpoint_type | text | primary, secondary, exploratory. |
measure_value | text | Measured value (e.g., 12.5, 45.2%, NR). |
unit | text | Unit of measurement (months, %, ratio). |
p_value | text | Statistical significance (e.g., 0.001, <0.0001, NS). |
hazard_ratio | text | Hazard ratio (e.g., 0.73). |
confidence_interval | text | CI as text (e.g., 0.58-0.92). |
median_follow_up | text | Median follow-up duration (e.g., 24.5 months). |
subgroup_description | text | Subgroup label if this is a subgroup analysis. |
clinical_trial_id | uuid | FK to clinical_trials. Present for both sources when linkable. |
nct_id | text | Denormalized NCT ID. |
publication_id | uuid | FK to publications. Only set for publication-sourced rows. |
Indexes:
| Index | Columns |
|---|---|
uq_mv_efficacy (unique) | id, source_type |
idx_mv_efficacy_evidence_endpoint | endpoint_name |
idx_mv_efficacy_evidence_trial | clinical_trial_id |
idx_mv_efficacy_evidence_pub | publication_id |
Example query:
-- All primary endpoint results for a given trialSELECT endpoint_name, measure_value, unit, p_value, hazard_ratio, source_typeFROM mv_efficacy_evidenceWHERE nct_id = 'NCT04380636' AND endpoint_type = 'primary';Deep dive: mv_competitive_landscape
Section titled “Deep dive: mv_competitive_landscape”Question it answers: “Who is competing in disease Y at Phase 3?”
How it works: Starts from mv_development_programs (second-level dependency) and enriches each drug-disease pair with drug metadata. Joins drugs for drug_type, technologies for modality/class, and uses a LATERAL subquery to aggregate current owner names from drug_ownerships and organisations into an array.
The LATERAL pattern is deliberate: a plain LEFT JOIN to drug_ownerships would duplicate rows when a drug has multiple current owners, breaking the unique index. The LATERAL subquery with array_agg(DISTINCT ...) collapses owners into a single array per row.
Key columns:
| Column | Type | Description |
|---|---|---|
drug_id | uuid | PK (composite with disease_id). FK to drugs. |
disease_id | uuid | FK to diseases. |
drug_name | text | Denormalized INN name. |
disease_name | text | Denormalized disease name. |
highest_phase | text | From mv_development_programs. |
highest_phase_rank | integer | Sortable rank. |
phase_source | text | approval or clinical_trial. |
drug_type | text | Molecular classification (small_molecule, biologic, antibody, adc, cell_therapy, etc.). |
technology_name | text | Drug modality name from technologies. |
drug_class | text | Drug class from technologies. |
current_owner_names | text[] | Array of current owner organisation names. |
Indexes:
| Index | Columns |
|---|---|
uq_mv_competitive (unique) | drug_id, disease_id |
idx_mv_comp_landscape_disease | disease_id |
idx_mv_comp_landscape_phase | highest_phase_rank |
idx_mv_comp_landscape_tech | technology_name |
Example query:
-- All Phase 3+ competitors in NSCLCSELECT drug_name, drug_type, technology_name, current_owner_namesFROM mv_competitive_landscapeWHERE disease_name = 'Non-Small Cell Lung Cancer' AND highest_phase_rank >= 3ORDER BY highest_phase_rank DESC, drug_name;Deep dive: mv_drug_pipeline
Section titled “Deep dive: mv_drug_pipeline”Question it answers: “Give me a complete profile of drug X with zero JOINs.”
How it works: One row per drug. Starts from drugs and enriches with four LATERAL subqueries:
- Primary target — picks the primary target (
is_primary DESC) fromdrug_target_actionsjoined totargets, falling back to alphabetical order. Returns a single name. - All targets —
array_agg(DISTINCT name)of all targets for the drug. - Current owners —
array_agg(DISTINCT name)of organisations withis_current = truefromdrug_ownerships. - Highest phase overall —
MAX(highest_phase_rank)across all diseases frommv_development_programs, with the corresponding phase label.
Also joins technologies for modality name and drug class, and computes is_combination by checking for child drugs (EXISTS(SELECT 1 FROM drugs c WHERE c.parent_id = d.id)).
Key columns:
| Column | Type | Description |
|---|---|---|
drug_id | uuid | PK. FK to drugs. |
bloupe_id | text | Internal Bioloupe identifier (e.g., B1234). |
drug_name | text | INN name. |
drug_status | text | marketed, in_development, discontinued, withdrawn, suspended. |
drug_type | text | Molecular classification. |
first_in_class | boolean | TRUE if first drug approved for its mechanism. |
is_combination | boolean | TRUE if fixed-dose combination product. |
technology_name | text | Modality name from technologies. |
drug_class | text | Drug class from technologies. |
primary_target_name | text | Name of the primary molecular target. |
target_names | text[] | Array of all target names. |
current_owner_names | text[] | Array of current owner organisation names. |
highest_phase_overall | text | Highest development phase across all diseases. |
highest_phase_rank_overall | integer | Sortable rank of the highest phase. |
Indexes:
| Index | Columns |
|---|---|
uq_mv_pipeline (unique) | drug_id |
idx_mv_drug_pipeline_status | drug_status |
idx_mv_drug_pipeline_phase | highest_phase_rank_overall |
idx_mv_drug_pipeline_tech | technology_name |
Example query:
-- Full profile of sotorasib, zero JOINsSELECT *FROM mv_drug_pipelineWHERE drug_name = 'sotorasib';Refresh strategy
Section titled “Refresh strategy”Matview refresh is catalog-driven. Three database objects, installed via migration, handle dependency resolution and ordered refresh automatically.
mat_view_dependencies (view)
Section titled “mat_view_dependencies (view)”A recursive CTE that walks pg_depend and pg_rewrite to discover which relations (tables, views, matviews) feed into which matviews. Produces rows like:
| start_relname | relname | depth |
|---|---|---|
| drugs | mv_development_programs | 1 |
| mv_development_programs | mv_competitive_landscape | 1 |
| drugs | mv_competitive_landscape | 2 |
mat_view_refresh_order (view)
Section titled “mat_view_refresh_order (view)”Consumes mat_view_dependencies and produces a topological sort: MAX(depth) per matview, ordered ascending. Views with no matview dependencies (depth 0) refresh first.
| relname | refresh_order |
|---|---|
| mv_development_programs | 1 |
| mv_drug_trials | 1 |
| mv_efficacy_evidence | 1 |
| mv_competitive_landscape | 2 |
| mv_drug_pipeline | 2 |
refresh_all_matviews() (function)
Section titled “refresh_all_matviews() (function)”A PL/pgSQL function that iterates mat_view_refresh_order and runs REFRESH MATERIALIZED VIEW CONCURRENTLY for each view in order. A single call refreshes the entire graph:
SELECT refresh_all_matviews();Initial population
Section titled “Initial population”All matviews are created WITH NO DATA. They contain no rows until explicitly refreshed. The first refresh must use a non-concurrent REFRESH MATERIALIZED VIEW (without CONCURRENTLY) because concurrent refresh requires at least one unique index to already be populated. After the first population, all subsequent refreshes use CONCURRENTLY for zero-downtime updates.
When to add a new matview
Section titled “When to add a new matview”Add a materialized view when all four criteria are met:
- The query requires 3+ JOINs. If a question can be answered with 1-2 JOINs, a matview adds complexity without meaningful accuracy gains.
- AI agents ask it frequently. Check agent query logs. If the same multi-join pattern appears repeatedly, it is a candidate.
- Underlying data changes infrequently. Matviews are stale between refreshes. If the base data changes every few minutes, a matview may serve outdated results.
- The result set is bounded. Matviews that would produce billions of rows are impractical. The drug-disease cross-product is large but bounded by the number of drugs and diseases in the system.
When adding a new matview, define the SQL expression, declare indexes (including at least one unique index for concurrent refresh), and run a refresh. The mat_view_dependencies infrastructure will automatically discover the new view and slot it into the correct refresh order.