What is Data Gov?
Data Gov turns fragmented pharmaceutical data into a single, curated knowledge graph. It collects drug approvals, clinical trials, disease hierarchies, and industry news from dozens of public sources. AI pipelines normalize the chaos. Human analysts verify every record before it becomes authoritative.
The problem Data Gov solves
Pharmaceutical data is broken across dozens of sources. The same drug appears as “Pembrolizumab”, “KEYTRUDA”, “MK-3475”, and “lambrolizumab” depending on who filed the paperwork. ClinicalTrials.gov accepts free-text self-reported data with no naming standards. The FDA, EMA, and KEGG all use different identifiers for the same approvals. Press releases bury clinical trial results in marketing language.
No single source tells the full story of a drug’s lifecycle. Data Gov exists to assemble that story.
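The naming problem described above can be illustrated with a minimal Ruby sketch. The `ALIASES` table and the `canonical_inn` helper are hypothetical names chosen for illustration, not Data Gov's actual resolver; only the four spellings come from the example above.

```ruby
# Hypothetical alias table: every known spelling maps to one INN.
ALIASES = {
  "pembrolizumab" => "pembrolizumab",
  "keytruda"      => "pembrolizumab",
  "mk-3475"       => "pembrolizumab",
  "lambrolizumab" => "pembrolizumab"
}.freeze

# Normalize case and surrounding whitespace, then look up the canonical INN.
# Returns nil for unknown names so a caller can route them to review.
def canonical_inn(raw_name)
  ALIASES[raw_name.to_s.strip.downcase]
end

puts canonical_inn("KEYTRUDA")
puts canonical_inn(" MK-3475 ")
p canonical_inn("not-a-known-name")
```

A static lookup like this only covers spellings already seen; it does not handle typos or new code names, which is where looser matching and human review come in.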
```mermaid
flowchart LR
  subgraph Sources["Public Data Sources"]
    AACT["ClinicalTrials.gov\n72,000+ trials"]
    ChiCTR["Chinese Clinical\nTrial Registry"]
    FDA["FDA / EMA\nKEGG / CDE"]
    PubMed["PubMed\n+ Conferences"]
    News["4 News Wires\n+ Financial APIs"]
  end
  subgraph Engine["Data Gov Engine"]
    Collect["Collect\n(Thor + Sidekiq)"]
    LLM["Extract\n(GPT-4.1)"]
    Match["Resolve\n(5-strategy cascade)"]
    QA["Verify\n(Human QA)"]
  end
  subgraph Output["What comes out"]
    Graph["Knowledge\nGraph"]
    API["REST API"]
    Admin["ActiveAdmin\nDashboard"]
    Views["72 SQL Views\n+ Analytics"]
  end
  Sources --> Collect --> LLM --> Match --> QA --> Graph
  Graph --> API
  Graph --> Admin
  Graph --> Views
```
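The "Resolve" stage in the diagram is labeled a 5-strategy cascade. This page does not name the five strategies, so the sketch below uses three hypothetical ones purely to show the general shape of a cascade: try strategies from strictest to loosest and stop at the first hit. `KNOWN`, `STRATEGIES`, and `resolve` are illustrative names, not Data Gov's code.

```ruby
# Hypothetical known records; real data would come from the database.
KNOWN = { "pembrolizumab" => 1, "nivolumab" => 2 }.freeze

# Each strategy takes a raw name and returns a record id or nil.
# These three are illustrative assumptions, not the actual five strategies.
STRATEGIES = {
  exact:            ->(name) { KNOWN[name] },
  case_insensitive: ->(name) { KNOWN[name.downcase] },
  alphanumeric:     ->(name) { KNOWN[name.downcase.gsub(/[^a-z0-9]/, "")] }
}.freeze

# Run strategies in insertion order; return the first match together with
# the label of the strategy that produced it.
def resolve(name)
  STRATEGIES.each do |label, strategy|
    id = strategy.call(name)
    return { id: id, strategy: label } if id
  end
  nil # nothing matched: leave the record for human review
end

p resolve("pembrolizumab")
p resolve("Nivolumab")
p resolve("Pembro-lizumab ")
p resolve("unknown drug")
```

Returning the strategy label alongside the match is one plausible design choice: it lets a reviewer see how loose the match was before accepting it.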
Four pillars of the knowledge graph
Every entity in Data Gov connects through four core domains.
Drugs form the center of the graph. Each drug is tracked at the INN (International Nonproprietary Name) level — the active ingredient, not the brand. A drug connects outward to trials, approvals, targets, technologies, diseases, organizations, and publications.
Clinical trials capture the experimental evidence: more than 72,000 trials from AACT and ChiCTR, each decomposed into study plans, participation criteria, endpoints, and results.
Diseases form a directed acyclic graph rooted in broad categories (Solid Tumors, Leukemia) and narrowing to subtypes (EGFR-mutant NSCLC). Every disease carries NCI Thesaurus codes for standardized classification.
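As a rough illustration of why a directed acyclic graph is used rather than a tree, the sketch below walks parent links to collect every ancestor of a disease. The `PARENTS` map, the intermediate nodes ("NSCLC", "Lung Cancer", "EGFR-mutant Cancers"), and the `ancestors_of` helper are assumptions for illustration; the real schema and hierarchy may differ.

```ruby
require "set"

# Hypothetical parent links: child => list of parents. In a DAG a node may
# have more than one parent, which a simple tree cannot express.
PARENTS = {
  "EGFR-mutant NSCLC"   => ["NSCLC", "EGFR-mutant Cancers"],
  "NSCLC"               => ["Lung Cancer"],
  "Lung Cancer"         => ["Solid Tumors"],
  "EGFR-mutant Cancers" => ["Solid Tumors"],
  "Solid Tumors"        => []
}.freeze

# Collect every ancestor of a disease by walking parent links.
# The seen set ensures nodes reachable by several paths are visited once.
def ancestors_of(disease, parents = PARENTS)
  seen = Set.new
  stack = parents.fetch(disease, []).dup
  until stack.empty?
    node = stack.pop
    next unless seen.add?(node) # add? returns nil if already present
    stack.concat(parents.fetch(node, []))
  end
  seen
end

p ancestors_of("EGFR-mutant NSCLC").to_a.sort
```

Here "Solid Tumors" is reachable through two paths but appears once in the result, which is the behavior a roll-up query over the hierarchy would typically want.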
Organizations track the companies developing these drugs — from Pfizer to single-asset biotechs. Ownership records, deal history, pipeline snapshots, and financial data all connect through the organization entity.
Tech stack
| Layer | Technology |
|---|---|
| Language | Ruby 3.4.2 |
| Framework | Rails 7.1.2 |
| Primary UI | ActiveAdmin (87 resources) |
| Embedded UI | React on Rails 14.0.4 via Shadow DOM + Shakapacker 8.2 |
| Background jobs | Sidekiq + Redis + sidekiq-cron |
| Databases | PostgreSQL 17 (primary), ChEMBL (read-only), AACT (read-only) |
| Vector search | pgvector + OpenAI embeddings |
| Full-text search | pg_trgm + fuzzystrmatch extensions |
| AI/LLM | OpenAI GPT-4.1 via OpenAiService (sequential + batch) |
| CLI pipelines | Thor (82 task files across 6 domains) |
| Workflow engine | BaseWorkflow DAG orchestrator (11 workflows, 239 steps) |
| Auth | Devise + Google OAuth + JWT |
| Deployment | GitHub Actions, Docker/ECR, Kamal 2.x to EC2 |
| Monitoring | New Relic + Airbrake + Health Monitor |
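To give a feel for what the pg_trgm extension listed above contributes, here is a simplified pure-Ruby approximation of trigram similarity. It is a sketch of the general technique, not pg_trgm's exact implementation and not Data Gov code; in the database the comparison would normally be done by PostgreSQL itself. The `trigrams` and `similarity` helpers are hypothetical names.

```ruby
require "set"

# Build the trigram set for a string, roughly following pg_trgm's approach:
# lowercase, split on non-alphanumerics, pad each word with two leading
# spaces and one trailing space, then take every 3-character window.
def trigrams(text)
  text.downcase.scan(/[a-z0-9]+/).each_with_object(Set.new) do |word, set|
    padded = "  #{word} "
    (0..padded.length - 3).each { |i| set << padded[i, 3] }
  end
end

# Similarity = shared trigrams / all distinct trigrams across both strings.
def similarity(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  union = ta | tb
  return 0.0 if union.empty?
  (ta & tb).size.to_f / union.size
end

puts similarity("pembrolizumab", "pembrolizumab")
puts similarity("pembrolizumab", "pembrolizumb")
puts similarity("pembrolizumab", "nivolumab")
```

A one-letter typo scores much closer to the original than a different antibody name does, even though both share the "-mab" suffix, which is why trigram scores are useful for catching misspellings in free-text sources.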
Platform focus
Data Gov tracks drug development across three therapeutic areas:
- Oncology — solid tumors, the largest share of tracked compounds
- Malignant hematology — leukemia, lymphoma, myeloma
- Non-malignant hematology — hemophilia, sickle cell disease, thalassemia
The database at a glance
The primary PostgreSQL 17 database holds 206 tables across three schemas. Two read-only external databases provide supplementary data.
| Schema | Purpose |
|---|---|
| `public` | Core entities, relationships, transactional records |
| `analytics` | Dimensional model for BI reporting (star/snowflake) |
| `forecasting` | Django application tables for forecasting models |
72 versioned SQL view definitions (managed by Scenic) and 14 materialized views power the API and analytics layers. Views refresh four times daily.
How to read these docs
These docs are designed to be read in order. Each page builds on the last.
- Get started — Clone, configure, and run Data Gov locally
- Data model — The complete schema story: what data lives here and why
- Clinical trials — How 72,000 trials get into the system
- Drug approvals — From FDA approval to your dashboard
- News and intelligence — Turning press releases into structured intelligence
- Architecture — How the pieces fit together for someone adding code
- Operations — Running Data Gov in production
Start with Get started if you need to run the code. Start with Data model if you need to understand what the system stores.