What is Data Gov?

Data Gov turns fragmented pharmaceutical data into a single, curated knowledge graph. It collects drug approvals, clinical trials, disease hierarchies, and industry news from dozens of public sources. AI pipelines normalize the chaos. Human analysts verify every record before it becomes authoritative.

Pharmaceutical data is broken across dozens of sources. The same drug appears as “Pembrolizumab”, “KEYTRUDA”, “MK-3475”, and “lambrolizumab” depending on who filed the paperwork. ClinicalTrials.gov accepts free-text self-reported data with no naming standards. The FDA, EMA, and KEGG all use different identifiers for the same approvals. Press releases bury clinical trial results in marketing language.

No single source tells the full story of a drug’s lifecycle. Data Gov exists to assemble that story.

```mermaid
flowchart LR
  subgraph Sources["Public Data Sources"]
    AACT["ClinicalTrials.gov\n72,000+ trials"]
    ChiCTR["Chinese Clinical\nTrial Registry"]
    FDA["FDA / EMA\nKEGG / CDE"]
    PubMed["PubMed\n+ Conferences"]
    News["4 News Wires\n+ Financial APIs"]
  end

  subgraph Engine["Data Gov Engine"]
    Collect["Collect\n(Thor + Sidekiq)"]
    LLM["Extract\n(GPT-4.1)"]
    Match["Resolve\n(5-strategy cascade)"]
    QA["Verify\n(Human QA)"]
  end

  subgraph Output["What comes out"]
    Graph["Knowledge\nGraph"]
    API["REST API"]
    Admin["ActiveAdmin\nDashboard"]
    Views["72 SQL Views\n+ Analytics"]
  end

  Sources --> Collect --> LLM --> Match --> QA --> Graph
  Graph --> API
  Graph --> Admin
  Graph --> Views
```

Every entity in Data Gov connects through four core domains.

Drugs form the center of the graph. Each drug is tracked at the INN (International Nonproprietary Name) level — the active ingredient, not the brand. A drug connects outward to trials, approvals, targets, technologies, diseases, organizations, and publications.
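To make the INN-level tracking concrete, here is a minimal sketch of how several names for one drug could collapse to a single INN through a cascade of lookups. Everything in it is an assumption for illustration: the `ALIASES` table, the `normalize` helper, and the three strategies are hypothetical and are not the five strategies Data Gov actually uses.

```ruby
# Hypothetical alias table: every known name points at one INN.
ALIASES = {
  "pembrolizumab" => "pembrolizumab",
  "keytruda"      => "pembrolizumab",
  "mk-3475"       => "pembrolizumab",
  "lambrolizumab" => "pembrolizumab"
}.freeze

# Lowercase the name and drop everything that is not a letter or digit.
def normalize(name)
  name.downcase.gsub(/[^a-z0-9]/, "")
end

NORMALIZED_ALIASES = ALIASES.transform_keys { |key| normalize(key) }.freeze

# Illustrative strategies, tried in order; the first non-nil answer wins.
STRATEGIES = [
  ->(name) { ALIASES[name] },                       # 1. exact match
  ->(name) { ALIASES[name.strip.downcase] },        # 2. case-insensitive match
  ->(name) { NORMALIZED_ALIASES[normalize(name)] }  # 3. punctuation-insensitive match
].freeze

# Returns the INN for a name, or nil when no strategy resolves it.
def resolve_inn(name)
  STRATEGIES.each do |strategy|
    inn = strategy.call(name)
    return inn if inn
  end
  nil
end
```

Calling `resolve_inn("KEYTRUDA")` or `resolve_inn("MK 3475")` returns `"pembrolizumab"` in this sketch, while an unknown name returns `nil`. Ordering the strategies from strictest to loosest keeps cheap, unambiguous matches ahead of fuzzier ones.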

Clinical trials capture the experimental evidence. The graph holds 72,000+ trials from AACT and ChiCTR, each decomposed into study plans, participation criteria, endpoints, and results.

Diseases form a directed acyclic graph rooted in broad categories (Solid Tumors, Leukemia) and narrowing to subtypes (EGFR-mutant NSCLC). Every disease carries NCI Thesaurus codes for standardized classification.
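The DAG shape matters because a subtype can sit under more than one parent, so "all ancestors of a disease" is a graph walk rather than a single chain. The sketch below shows that walk over an in-memory hash. The edges are invented for illustration and do not reflect the real hierarchy or how Data Gov stores it.

```ruby
require "set"

# Hypothetical edges: child => parents. A node may have several parents,
# which is what makes this a DAG rather than a tree.
PARENTS = {
  "EGFR-mutant NSCLC"  => ["NSCLC", "EGFR-mutant Tumors"],
  "NSCLC"              => ["Lung Cancer"],
  "EGFR-mutant Tumors" => ["Solid Tumors"],
  "Lung Cancer"        => ["Solid Tumors"],
  "Solid Tumors"       => []
}.freeze

# Breadth-first walk toward the roots; each ancestor is visited once.
def ancestors_of(disease, parents = PARENTS)
  seen  = Set.new
  queue = parents.fetch(disease, []).dup
  until queue.empty?
    node = queue.shift
    next unless seen.add?(node) # add? returns nil if the node was already seen
    queue.concat(parents.fetch(node, []))
  end
  seen
end
```

In this sketch, `ancestors_of("EGFR-mutant NSCLC")` reaches `"Solid Tumors"` through both parents but records it only once.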

Organizations track the companies developing these drugs — from Pfizer to single-asset biotechs. Ownership records, deal history, pipeline snapshots, and financial data all connect through the organization entity.

| Layer | Technology |
| --- | --- |
| Language | Ruby 3.4.2 |
| Framework | Rails 7.1.2 |
| Primary UI | ActiveAdmin (87 resources) |
| Embedded UI | React on Rails 14.0.4 via Shadow DOM + Shakapacker 8.2 |
| Background jobs | Sidekiq + Redis + sidekiq-cron |
| Databases | PostgreSQL 17 (primary), ChEMBL (read-only), AACT (read-only) |
| Vector search | pgvector + OpenAI embeddings |
| Full-text search | pg_trgm + fuzzystrmatch extensions |
| AI/LLM | OpenAI GPT-4.1 via OpenAiService (sequential + batch) |
| CLI pipelines | Thor (82 task files across 6 domains) |
| Workflow engine | BaseWorkflow DAG orchestrator (11 workflows, 239 steps) |
| Auth | Devise + Google OAuth + JWT |
| Deployment | GitHub Actions, Docker/ECR, Kamal 2.x to EC2 |
| Monitoring | New Relic + Airbrake + Health Monitor |
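For readers unfamiliar with pg_trgm, the following pure-Ruby sketch approximates what its `similarity` function measures: the share of three-character fragments two strings have in common. It is an illustration only, not how Data Gov queries the database, and it handles ASCII letters and digits only, which is an assumption made to keep the example short.

```ruby
require "set"

# Trigram set for a string, following pg_trgm's documented convention:
# lowercase, split on non-alphanumerics, and treat each word as having
# two spaces prefixed and one space suffixed.
def trigrams(text)
  text.downcase.scan(/[a-z0-9]+/).each_with_object(Set.new) do |word, set|
    padded = "  #{word} "
    (0..padded.length - 3).each { |i| set << padded[i, 3] }
  end
end

# Shared trigrams divided by all distinct trigrams across both strings.
def trigram_similarity(left, right)
  a = trigrams(left)
  b = trigrams(right)
  union = (a | b).size
  return 0.0 if union.zero?

  (a & b).size.to_f / union
end
```

Under this definition, `trigram_similarity("word", "words")` shares 4 of 7 distinct trigrams, and strings that differ only in case score 1.0.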

Data Gov tracks drug development across three therapeutic areas:

  • Oncology — solid tumors, the largest share of tracked compounds
  • Malignant hematology — leukemia, lymphoma, myeloma
  • Non-malignant hematology — hemophilia, sickle cell disease, thalassemia

The primary PostgreSQL 17 database holds 206 tables across three schemas. Two read-only external databases provide supplementary data.

| Schema | Purpose |
| --- | --- |
| public | Core entities, relationships, transactional records |
| analytics | Dimensional model for BI reporting (star/snowflake) |
| forecasting | Django application tables for forecasting models |

72 versioned SQL view definitions (managed by Scenic) and 14 materialized views power the API and analytics layers. Views refresh four times daily.
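A four-times-daily refresh maps naturally onto a cron expression that fires every six hours. The sketch below assumes the refresh is scheduled through sidekiq-cron, which this page does not confirm. The job name and class are hypothetical, and `cron_hours` is a small helper written here only to show which hours such an expression selects.

```ruby
# Hypothetical schedule entry in the hash shape that sidekiq-cron loads
# via Sidekiq::Cron::Job.load_from_hash. Job name and class are invented.
VIEW_REFRESH_SCHEDULE = {
  "refresh_materialized_views" => {
    "cron"  => "0 */6 * * *", # minute 0 of every sixth hour
    "class" => "RefreshMaterializedViewsJob"
  }
}.freeze

# Expand the hour field of a five-field cron string.
# Supports "*", "*/n", and comma-separated lists such as "0,6,12,18".
def cron_hours(cron)
  hour_field = cron.split[1]
  case hour_field
  when "*"
    (0..23).to_a
  when %r{\A\*/(\d+)\z}
    (0..23).step(Regexp.last_match(1).to_i).to_a
  else
    hour_field.split(",").map(&:to_i)
  end
end
```

Passing the schedule's cron string to `cron_hours` yields four run hours per day, spaced six hours apart.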

These docs are designed to be read in order. Each page builds on the last.

  1. Get started — Clone, configure, and run Data Gov locally
  2. Data model — The complete schema story: what data lives here and why
  3. Clinical trials — How 72,000 trials get into the system
  4. Drug approvals — From FDA approval to your dashboard
  5. News and intelligence — Turning press releases into structured intelligence
  6. Architecture — How the pieces fit together for someone adding code
  7. Operations — Running Data Gov in production

Start with Get started if you need to run the code. Start with Data model if you need to understand what the system stores.