Data Philosophy
The Data Gov database draws a hard line between facts and intelligence. Facts are scraped from authoritative sources or entered by human analysts. Intelligence is computed by joining and aggregating those facts. The two never mix, and the boundary between them shapes every schema decision.
The core principle
Section titled “The core principle”Tables hold accurately scraped and, when needed, manually curated facts. Materialized views compute intelligence on top of those facts. No fact is stored in two places. Computed data is never editable — to change a derived result, fix the underlying source data.
This rule is simple but has far-reaching consequences. It eliminates data redundancy, prevents conflicting copies of the same fact, and gives every number in the system a single authoritative home.
Why this matters
Section titled “Why this matters”flowchart LR
A["Raw sources\n(FDA, EMA, CT.gov, PubMed)"] --> B["Scraping pipelines"]
B --> C["Fact tables\n(curated)"]
C --> D["Materialized views\n(computed)"]
D --> E["AI agent queries"]
C --> E
Raw data flows in one direction. Scraping pipelines ingest from regulatory agencies, trial registries, and literature databases. The output lands in fact tables where analysts can curate, correct, and lock fields. Materialized views then derive higher-order intelligence — pipeline landscapes, development phases, competitive positioning — by joining across those fact tables.
The AI agent gets fast, pre-joined query surfaces without the schema harboring conflicting copies of the same fact. When an agent asks “what phase is this drug in for breast cancer?” the answer comes from a single materialized view that traces back to curated trial and approval records. There is no second copy to contradict it.
Curated vs computed
Section titled “Curated vs computed”The hardest schema decisions are not about what data to store but about where to store it. Two examples illustrate the reasoning.
Drug status: a curated column
Section titled “Drug status: a curated column”Decision:
drug_statusis a curated column on thedrugstable, not a computed value.Why: 762 drugs in production are marked as “marketed” but have zero approval records. Their status comes from editorial knowledge — regulatory databases outside our current scraping coverage, analyst research, or legacy data that predates the approval pipeline. Computing status from approval records alone would silently downgrade these drugs to “in_development,” destroying curated knowledge.
Alternative considered: Compute status from
drug_approvals(approved = marketed, no approval = in_development). Rejected because it loses curated knowledge that cannot be recovered from structured sources.
Development phase: a computed view
Section titled “Development phase: a computed view”Decision: Development phase is computed per drug x disease in the
mv_development_programsmaterialized view, not stored on thedrugstable.Why: A drug can be Phase 3 for breast cancer and approved for lung cancer simultaneously. Phase is inherently a property of the drug x disease pair, not the drug alone. The materialized view joins
drug_approvals(with indications) andclinical_trials(with trial diseases), ranks phases, and returns the highest phase per pair.Alternative considered: Store a single
development_phasecolumn ondrugs. Rejected because it forces a false choice — which disease’s phase do you pick? — and creates a value that contradicts the underlying trial and approval data for every disease except one.
The Provenance pattern
Section titled “The Provenance pattern”Every fact-bearing table embeds provenance metadata tracking where data came from and how trustworthy it is.
| Column | Purpose |
|---|---|
data_source | System that produced the record: bioloupe, clinicaltrials_gov, fda, ema, pubmed, nci, manual, llm |
source_reference | External ID or URL in the source system |
confidence | Extraction confidence score, 0.00—1.00. Only set for LLM-extracted data. |
last_verified_at | When the record was last validated against its source |
Provenance answers three questions at any row: Where did this come from? How sure are we? When did we last check? These columns are embedded directly on fact tables, not stored in a separate audit table, so every query can filter by trustworthiness without an extra join.
The BaseEntity pattern
Section titled “The BaseEntity pattern”Every table shares a uniform foundation: UUID v4 primary keys, created_at, and updated_at timestamps with automatic update on write. A check constraint enforces updated_at >= created_at, preventing clock-skew anomalies.
| Column | Purpose |
|---|---|
id | UUID v4 primary key, generated at the database level |
created_at | Row creation timestamp (database now()) |
updated_at | Row last-update timestamp (automatically maintained) |
UUIDs eliminate insert-order coupling between tables and make cross-system references safe. The timestamp pair gives every row a built-in audit trail without relying on application-level logging.
Decision framework: table vs materialized view
Section titled “Decision framework: table vs materialized view”When adding new data to the schema, apply these rules in order.
| If the data is… | Store it in… | Reason |
|---|---|---|
| Scraped, curated, or manually entered by humans | A table | Facts need to be editable and carry provenance |
| Derivable by joining or aggregating existing tables | A materialized view | Eliminates redundancy; refresh replaces recomputation |
| Reference data that rarely changes | A table (seeded) | Seed scripts populate it; no pipeline needed |
| Something that needs to be editable | A table | Computed data is never editable |
When in doubt, ask: If I delete this column, can I recompute it from other tables? If yes, it belongs in a materialized view. If no — if deleting it would lose knowledge that exists nowhere else in the schema — it belongs in a table.