What is Data Gov?

Data Gov turns fragmented pharmaceutical data into a single, curated knowledge graph. It collects drug approvals, clinical trials, disease hierarchies, and industry news from dozens of public sources. AI pipelines normalize the chaos. Human analysts verify every record before it becomes authoritative.

The problem Data Gov solves

Pharmaceutical data is scattered across dozens of sources. The same drug appears as “Pembrolizumab”, “KEYTRUDA”, “MK-3475”, and “lambrolizumab” depending on who filed the paperwork. ClinicalTrials.gov accepts free-text, self-reported data with no naming standards. The FDA, EMA, and KEGG all use different identifiers for the same approvals. Press releases bury clinical trial results in marketing language.

No single source tells the full story of a drug’s lifecycle. Data Gov exists to assemble that story.

```mermaid
flowchart LR
  subgraph Sources["Public Data Sources"]
    AACT["ClinicalTrials.gov\n72,000+ trials"]
    ChiCTR["Chinese Clinical\nTrial Registry"]
    FDA["FDA / EMA\nKEGG / CDE"]
    PubMed["PubMed\n+ Conferences"]
    News["4 News Wires\n+ Financial APIs"]
  end

  subgraph Engine["Data Gov Engine"]
    Collect["Collect\n(Thor + Sidekiq)"]
    LLM["Extract\n(GPT-4.1)"]
    Match["Resolve\n(5-strategy cascade)"]
    QA["Verify\n(Human QA)"]
  end

  subgraph Output["What comes out"]
    Graph["Knowledge\nGraph"]
    API["REST API"]
    Admin["ActiveAdmin\nDashboard"]
    Views["72 SQL Views\n+ Analytics"]
  end

  Sources --> Collect --> LLM --> Match --> QA --> Graph
  Graph --> API
  Graph --> Admin
  Graph --> Views
```
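
The Resolve stage is described here only as a 5-strategy cascade; this page does not name the five strategies. As a rough illustration of the cascade pattern itself, the sketch below tries a few hypothetical strategies in order and stops at the first one that yields an INN. The strategy names, the `ALIASES` table, and `resolve_drug` are assumptions made for this example, not Data Gov's actual code.

```ruby
# Hypothetical alias table, seeded with the synonyms mentioned on this page.
ALIASES = {
  "keytruda" => "pembrolizumab",
  "mk-3475" => "pembrolizumab",
  "lambrolizumab" => "pembrolizumab"
}.freeze

# Hypothetical list of canonical INNs already in the graph.
KNOWN_INNS = ["pembrolizumab"].freeze

# Each strategy takes a raw name and returns an INN or nil.
# Ordered from strictest to loosest; these three are illustrative only.
STRATEGIES = {
  exact: ->(name) { name if KNOWN_INNS.include?(name) },
  case_insensitive: ->(name) { KNOWN_INNS.find { |inn| inn.casecmp?(name.strip) } },
  alias_lookup: ->(name) { ALIASES[name.strip.downcase] }
}.freeze

# Run the cascade: the first strategy that returns a non-nil INN wins.
def resolve_drug(raw_name)
  STRATEGIES.each do |label, strategy|
    inn = strategy.call(raw_name)
    return { inn: inn, strategy: label } if inn
  end
  nil # unresolved; in a pipeline like this one, a candidate for human QA
end
```

With these assumptions, `resolve_drug("KEYTRUDA")` falls through the first two strategies and resolves via `:alias_lookup`, while an unknown name returns `nil`. Recording which strategy matched can help reviewers judge how much to trust a given link.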

Four pillars of the knowledge graph

Every entity in Data Gov connects through four core domains.

Drugs form the center of the graph. Each drug is tracked at the INN (International Nonproprietary Name) level — the active ingredient, not the brand. A drug connects outward to trials, approvals, targets, technologies, diseases, organizations, and publications.

Clinical trials capture the experimental evidence. The graph holds 72,000+ trials from AACT (the Aggregate Analysis of ClinicalTrials.gov database) and ChiCTR, each decomposed into study plans, participation criteria, endpoints, and results.

Diseases form a directed acyclic graph rooted in broad categories (Solid Tumors, Leukemia) and narrowing to subtypes (EGFR-mutant NSCLC). Every disease carries NCI Thesaurus codes for standardized classification.
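
To make the DAG shape concrete, here is a minimal sketch of walking from a subtype up to its broad categories. The `PARENTS` hash, the intermediate nodes ("NSCLC", "Lung Cancer"), and the `ancestors_of` helper are hypothetical; the real hierarchy lives in the database and may be modeled differently.

```ruby
require "set"

# Hypothetical child => parents edges. In a DAG a node may have several
# parents, so each value is an array rather than a single parent.
PARENTS = {
  "EGFR-mutant NSCLC" => ["NSCLC"],
  "NSCLC" => ["Lung Cancer"],
  "Lung Cancer" => ["Solid Tumors"],
  "Solid Tumors" => []
}.freeze

# Collect every ancestor of a disease by following parent edges
# breadth-first. The Set ensures shared ancestors are visited once.
def ancestors_of(disease, parents = PARENTS)
  seen = Set.new
  queue = parents.fetch(disease, []).dup
  until queue.empty?
    current = queue.shift
    next unless seen.add?(current) # add? returns nil if already present
    queue.concat(parents.fetch(current, []))
  end
  seen.to_a
end
```

Under these assumed edges, `ancestors_of("EGFR-mutant NSCLC")` climbs through the intermediate nodes to "Solid Tumors". The same traversal handles a node with two parents, which is what distinguishes a DAG from a simple tree.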

Organizations track the companies developing these drugs — from Pfizer to single-asset biotechs. Ownership records, deal history, pipeline snapshots, and financial data all connect through the organization entity.

Tech stack

| Layer | Technology |
| --- | --- |
| Language | Ruby 3.4.2 |
| Framework | Rails 7.1.2 |
| Primary UI | ActiveAdmin (87 resources) |
| Embedded UI | React on Rails 14.0.4 via Shadow DOM + Shakapacker 8.2 |
| Background jobs | Sidekiq + Redis + sidekiq-cron |
| Databases | PostgreSQL 17 (primary), ChEMBL (read-only), AACT (read-only) |
| Vector search | pgvector + OpenAI embeddings |
| Full-text search | pg_trgm + fuzzystrmatch extensions |
| AI/LLM | OpenAI GPT-4.1 via `OpenAiService` (sequential + batch) |
| CLI pipelines | Thor (82 task files across 6 domains) |
| Workflow engine | `BaseWorkflow` DAG orchestrator (11 workflows, 239 steps) |
| Auth | Devise + Google OAuth + JWT |
| Deployment | GitHub Actions, Docker/ECR, Kamal 2.x to EC2 |
| Monitoring | New Relic + Airbrake + Health Monitor |

Platform focus

Data Gov tracks drug development across three therapeutic areas:

Oncology — solid tumors, the largest share of tracked compounds
Malignant hematology — leukemia, lymphoma, myeloma
Non-malignant hematology — hemophilia, sickle cell disease, thalassemia

The database at a glance

The primary PostgreSQL 17 database holds 206 tables across three schemas. Two read-only external databases, ChEMBL and AACT, provide supplementary data.

| Schema | Purpose |
| --- | --- |
| `public` | Core entities, relationships, transactional records |
| `analytics` | Dimensional model for BI reporting (star/snowflake) |
| `forecasting` | Django application tables for forecasting models |

72 versioned SQL view definitions (managed by Scenic) and 14 materialized views power the API and analytics layers. Views refresh four times daily.
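
This page does not say when the four daily refreshes run or how they are scheduled. Since the stack includes sidekiq-cron, one plausible arrangement is a cron entry that fires every six hours. The sketch below shows such an entry and a small helper that expands its hour field. The job class name, the schedule key, the cron expression, and `cron_hours` are all assumptions for illustration.

```ruby
# Hypothetical schedule entry in the name => { "cron", "class" } shape
# commonly used with sidekiq-cron. The job class name is made up.
REFRESH_SCHEDULE = {
  "refresh_materialized_views" => {
    "cron" => "0 */6 * * *", # minute 0 of every sixth hour
    "class" => "RefreshMaterializedViewsJob"
  }
}.freeze

# Expand the hour field of a simple five-field cron expression.
# Handles only "*", "*/N", and comma lists, which is enough here.
def cron_hours(expression)
  hour_field = expression.split[1]
  case hour_field
  when "*" then (0..23).to_a
  when %r{\A\*/(\d+)\z} then (0..23).step(Regexp.last_match(1).to_i).to_a
  else hour_field.split(",").map(&:to_i)
  end
end
```

Calling `cron_hours` on the assumed expression yields four run hours per day, matching the stated cadence. A job like this would typically refresh each materialized view through Scenic's refresh helper; the exact view names and ordering are not given on this page.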

How to read these docs

These docs are designed to be read in order. Each page builds on the last.

Get started — Clone, configure, and run Data Gov locally
Data model — The complete schema story: what data lives here and why
Clinical trials — How 72,000 trials get into the system
Drug approvals — From FDA approval to your dashboard
News and intelligence — Turning press releases into structured intelligence
Architecture — How the pieces fit together for someone adding code
Operations — Running Data Gov in production

Start with Get started if you need to run the code. Start with Data model if you need to understand what the system stores.