Data Philosophy

The Data Gov database draws a hard line between facts and intelligence. Facts are scraped from authoritative sources or entered by human analysts. Intelligence is computed by joining and aggregating those facts. The two never mix, and the boundary between them shapes every schema decision.

The core principle

Tables hold accurately scraped and, when needed, manually curated facts. Materialized views compute intelligence on top of those facts. No fact is stored in two places. Computed data is never editable — to change a derived result, fix the underlying source data.

This rule is simple but has far-reaching consequences. It eliminates data redundancy, prevents conflicting copies of the same fact, and gives every number in the system a single authoritative home.

Why this matters

flowchart LR
    A["Raw sources\n(FDA, EMA, CT.gov, PubMed)"] --> B["Scraping pipelines"]
    B --> C["Fact tables\n(curated)"]
    C --> D["Materialized views\n(computed)"]
    D --> E["AI agent queries"]
    C --> E

Raw data flows in one direction. Scraping pipelines ingest from regulatory agencies, trial registries, and literature databases. The output lands in fact tables where analysts can curate, correct, and lock fields. Materialized views then derive higher-order intelligence — pipeline landscapes, development phases, competitive positioning — by joining across those fact tables.

The AI agent gets fast, pre-joined query surfaces without the schema harboring conflicting copies of the same fact. When an agent asks “what phase is this drug in for breast cancer?” the answer comes from a single materialized view that traces back to curated trial and approval records. There is no second copy to contradict it.

Curated vs computed

The hardest schema decisions are not about what data to store but about where to store it. Two examples illustrate the reasoning.

Drug status: a curated column

Decision: drug_status is a curated column on the drugs table, not a computed value.

Why: 762 drugs in production are marked as “marketed” but have zero approval records. Their status comes from editorial knowledge — regulatory databases outside our current scraping coverage, analyst research, or legacy data that predates the approval pipeline. Computing status from approval records alone would silently downgrade these drugs to “in_development,” destroying curated knowledge.

Alternative considered: Compute status from drug_approvals (approved = marketed, no approval = in_development). Rejected because it loses curated knowledge that cannot be recovered from structured sources.

Development phase: a computed view

Decision: Development phase is computed per drug x disease in the mv_development_programs materialized view, not stored on the drugs table.

Why: A drug can be Phase 3 for breast cancer and approved for lung cancer simultaneously. Phase is inherently a property of the drug x disease pair, not the drug alone. The materialized view joins drug_approvals (with indications) and clinical_trials (with trial diseases), ranks phases, and returns the highest phase per pair.

Alternative considered: Store a single development_phase column on drugs. Rejected because it forces a false choice — which disease’s phase do you pick? — and creates a value that contradicts the underlying trial and approval data for every disease except one.

The Provenance pattern

Every fact-bearing table embeds provenance metadata tracking where data came from and how trustworthy it is.

Column	Purpose
`data_source`	System that produced the record: `bioloupe`, `clinicaltrials_gov`, `fda`, `ema`, `pubmed`, `nci`, `manual`, `llm`
`source_reference`	External ID or URL in the source system
`confidence`	Extraction confidence score, 0.00—1.00. Only set for LLM-extracted data.
`last_verified_at`	When the record was last validated against its source

Provenance answers three questions at any row: Where did this come from? How sure are we? When did we last check? These columns are embedded directly on fact tables, not stored in a separate audit table, so every query can filter by trustworthiness without an extra join.

The BaseEntity pattern

Every table shares a uniform foundation: UUID v4 primary keys, created_at, and updated_at timestamps with automatic update on write. A check constraint enforces updated_at >= created_at, preventing clock-skew anomalies.

Column	Purpose
`id`	UUID v4 primary key, generated at the database level
`created_at`	Row creation timestamp (database `now()`)
`updated_at`	Row last-update timestamp (automatically maintained)

UUIDs eliminate insert-order coupling between tables and make cross-system references safe. The timestamp pair gives every row a built-in audit trail without relying on application-level logging.

Decision framework: table vs materialized view

When adding new data to the schema, apply these rules in order.

If the data is…	Store it in…	Reason
Scraped, curated, or manually entered by humans	A table	Facts need to be editable and carry provenance
Derivable by joining or aggregating existing tables	A materialized view	Eliminates redundancy; refresh replaces recomputation
Reference data that rarely changes	A table (seeded)	Seed scripts populate it; no pipeline needed
Something that needs to be editable	A table	Computed data is never editable

When in doubt, ask: If I delete this column, can I recompute it from other tables? If yes, it belongs in a materialized view. If no — if deleting it would lose knowledge that exists nowhere else in the schema — it belongs in a table.