AI Agent Querying

This guide is for the AI agent (and humans) writing text-to-SQL queries against bioloupe-data. It covers how to discover the schema, which materialized views to prefer, common query patterns, and pitfalls to avoid.

How the agent discovers the schema

Every table and column in the database carries a COMMENT ON annotation. These comments describe:

What the column means — e.g., “INN or common name of the drug (e.g., pembrolizumab, trastuzumab).”
What values it takes — e.g., “Values: ‘marketed’, ‘in_development’, ‘discontinued’, ‘withdrawn’, ‘suspended’.”
Foreign key relationships — e.g., “FK to drugs.id.”
Denormalization notes — e.g., “Denormalized drug INN name for zero-join queries.”

The agent reads these comments to understand the schema without external documentation. To inspect them:

-- Table comment
SELECT obj_description('drugs'::regclass);

-- All column comments for a table
SELECT column_name, col_description('drugs'::regclass, ordinal_position)
FROM information_schema.columns
WHERE table_name = 'drugs'
ORDER BY ordinal_position;

Query strategy: matviews first

Five materialized views pre-compute the most common joins. Always check whether a matview answers the question before going to base tables.

flowchart TD
    Q["Incoming question"] --> A{"Can a single matview answer it?"}
    A -- Yes --> MV["Query that matview (zero JOINs)"]
    A -- No --> B{"Does it need 2+ matviews?"}
    B -- Yes --> MV2["JOIN matviews together"]
    B -- No --> BT["Go to base tables"]

Matview cheat sheet

Question pattern	Use this	Key columns
Drug profile / overview	`mv_drug_pipeline`	`drug_name`, `drug_status`, `drug_type`, `primary_target_name`, `target_names`, `current_owner_names`, `highest_phase_overall`
Drug x disease phase	`mv_development_programs`	`drug_name`, `disease_name`, `highest_phase`, `highest_phase_rank`, `phase_source`, `nct_ids`, `approval_date`
Competitive landscape	`mv_competitive_landscape`	`drug_name`, `disease_name`, `highest_phase`, `drug_type`, `technology_name`, `drug_class`, `current_owner_names`
Trial involvement	`mv_drug_trials`	`drug_id`, `clinical_trial_id`, `role`
Efficacy data	`mv_efficacy_evidence`	`endpoint_name`, `endpoint_type`, `measure_value`, `p_value`, `hazard_ratio`, `nct_id`, `source_type`

Phase values and sort order

All phase columns use the same enumeration and ranking:

Phase value	Rank
`approved`	100
`phase4`	4
`phase3`	3
`phase2_3`	3
`phase2`	2
`phase1_2`	2
`phase1`	1
`early_phase1`	1

Always sort by highest_phase_rank (integer), not by highest_phase (text).

Drug trial roles

mv_drug_trials.role classifies how the drug participates in a trial:

Role	Meaning
`investigational`	The drug being studied
`combination_partner`	Combined with the investigational drug
`active_comparator`	Control arm, active drug
`placebo_comparator`	Control arm, placebo
`sham_comparator`	Control arm, sham procedure
`other`	None of the above

Common query patterns

1. List all Phase 3 drugs for breast cancer with their owners

SELECT drug_name, current_owner_names, highest_phase
FROM mv_competitive_landscape
WHERE disease_name ILIKE '%breast cancer%'
  AND highest_phase = 'phase3'
ORDER BY drug_name;

Uses mv_competitive_landscape because it pre-joins drug, disease, phase, and owner data in a single row.

2. What targets does pembrolizumab act on?

SELECT drug_name, target_names, primary_target_name
FROM mv_drug_pipeline
WHERE drug_name ILIKE '%pembrolizumab%';

mv_drug_pipeline aggregates all targets into an array, so no joins to drug_target_actions or targets are needed.

3. What clinical trials are recruiting for NSCLC?

SELECT ct.nct_id, ct.brief_title, ct.phase, ct.enrollment
FROM mv_development_programs dp
JOIN clinical_trials ct ON ct.id = ANY(
  SELECT unnest(dp2.nct_ids::text[])::uuid
  FROM mv_development_programs dp2
  WHERE dp2.disease_name ILIKE '%non-small cell lung%'
)
WHERE ct.overall_status = 'recruiting';

Alternative approach using mv_drug_trials when you need trial details with drug context:

SELECT DISTINCT ct.nct_id, ct.brief_title, ct.phase, ct.overall_status
FROM mv_drug_trials dt
JOIN clinical_trials ct ON ct.id = dt.clinical_trial_id
JOIN trial_diseases td ON td.clinical_trial_id = ct.id
JOIN diseases dis ON dis.id = td.disease_id
WHERE dis.name ILIKE '%non-small cell lung%'
  AND ct.overall_status = 'recruiting'
ORDER BY ct.phase DESC;

4. Show me recent deals involving Pfizer

Deals live in base tables since no matview covers them:

SELECT oh.deal_type, oh.headline, oh.total_value, oh.announcement_date
FROM organisation_histories oh
JOIN organisation_history_participants ohp ON ohp.organisation_history_id = oh.id
JOIN organisations o ON o.id = ohp.organisation_id
WHERE o.name ILIKE '%Pfizer%'
ORDER BY oh.announcement_date DESC NULLS LAST
LIMIT 20;

5. Compare efficacy of two drugs in melanoma

SELECT
  d.name AS drug_name,
  ee.endpoint_name,
  ee.measure_value,
  ee.unit,
  ee.p_value,
  ee.hazard_ratio,
  ee.source_type
FROM mv_efficacy_evidence ee
JOIN mv_drug_trials dt ON dt.clinical_trial_id = ee.clinical_trial_id
JOIN drugs d ON d.id = dt.drug_id
JOIN trial_diseases td ON td.clinical_trial_id = ee.clinical_trial_id
JOIN diseases dis ON dis.id = td.disease_id
WHERE d.name IN ('pembrolizumab', 'nivolumab')
  AND dis.name ILIKE '%melanoma%'
  AND ee.endpoint_type = 'primary'
ORDER BY d.name, ee.endpoint_name;

6. Which drugs were approved in the last 6 months?

SELECT drug_name, disease_name, approval_date
FROM mv_development_programs
WHERE phase_source = 'approval'
  AND approval_date >= NOW() - INTERVAL '6 months'
ORDER BY approval_date DESC;

7. What biomarkers are relevant for lung cancer?

Biomarkers live in base tables:

SELECT b.name AS biomarker, db.prevalence, db.potential_values
FROM disease_biomarkers db
JOIN biomarkers b ON b.id = db.biomarker_id
JOIN diseases dis ON dis.id = db.disease_id
WHERE dis.name ILIKE '%lung cancer%'
ORDER BY b.name;

8. Full drug profile with pipeline status across all diseases

SELECT
  p.drug_name,
  p.drug_status,
  p.drug_type,
  p.primary_target_name,
  p.current_owner_names,
  dp.disease_name,
  dp.highest_phase,
  dp.phase_source
FROM mv_drug_pipeline p
LEFT JOIN mv_development_programs dp ON dp.drug_id = p.drug_id
WHERE p.drug_name ILIKE '%trastuzumab%'
ORDER BY dp.highest_phase_rank DESC;

This joins two matviews: mv_drug_pipeline for the drug profile and mv_development_programs for per-disease phase data.

Anti-patterns

Do not query base tables when a matview exists. If you need a drug’s targets, owners, and phase, use mv_drug_pipeline instead of joining drugs, drug_target_actions, targets, drug_ownerships, organisations, and mv_development_programs yourself.

Do not JOIN 4+ base tables when a matview pre-computes the answer. The matviews exist specifically to flatten complex joins into single-table queries.

Do not use SELECT *. Always select specific columns. Matviews contain array columns (target_names, current_owner_names, nct_ids) that are expensive to serialize unnecessarily.

Do not assume drug_status is computed. It is a curated value on the drugs table, not derived from approvals or trials. Values: marketed, in_development, discontinued, withdrawn, suspended.

Do not query drugs for development phase. The drugs table has no phase column. Development phase per disease lives in mv_development_programs. The highest phase across all diseases lives in mv_drug_pipeline.highest_phase_overall.

Do not sort phases alphabetically. Use highest_phase_rank (integer) for ordering. Alphabetical sorting puts approved before phase1, but early_phase1 before phase2.

Do not forget ILIKE for name searches. Drug and disease names vary in casing across sources. Use ILIKE with % wildcards for resilient matching.

Schema metadata

COMMENT ON TABLE and COMMENT ON COLUMN are the agent’s primary schema discovery mechanism. Every table and column in bioloupe-data carries a comment that follows consistent patterns:

Column purpose: what the column stores (“Drug INN name.”)
Value enumeration: valid values for constrained columns (“Values: ‘marketed’, ‘in_development’, …”)
FK references: which table and column a foreign key points to (“FK to drugs.id.”)
Denormalization flags: when a column duplicates data from another table for query convenience (“Denormalized drug INN name for zero-join queries.”)
Examples: sample values where helpful (“Example: NCT04380636”)

When encountering an unfamiliar table, the agent should read the table comment first (it summarizes the table’s role and row granularity), then read column comments to understand available filters and joins.