Architecture
This page is for someone who needs to add code to Data Gov, not just understand it. It covers the Rails app structure, the service layer patterns, the workflow engine, background job architecture, and the conventions you need to follow when building new features.
System architecture
Data Gov is an admin-first data processing platform. ActiveAdmin provides the primary UI. Thor tasks drive batch pipelines. Sidekiq handles background scheduling. Every design decision traces back to one reality: pharmaceutical data is fragmented across dozens of sources with inconsistent naming, unstructured text, and no common identifiers.
flowchart TB
subgraph External["External Data Sources"]
AACT["ClinicalTrials.gov\n(AACT database)"]
ChiCTR["ChiCTR\n(headless scraping)"]
FDA["FDA / EMA / KEGG / CDE"]
PubMed["PubMed + Conferences"]
Wires["News Wires\n(4 sources)"]
end
subgraph Processing["Processing Layer"]
Thor["Thor Tasks\n(82 task files)"]
Sidekiq["Sidekiq Workers\n(19 job classes)"]
Workflows["Workflow Engine\n(11 workflows)"]
LLM["OpenAI Service\n(batch + sequential)"]
Matching["Entity Resolution\n(5-strategy cascade)"]
end
subgraph Storage["Storage Layer"]
PG[("PostgreSQL 17\n206 tables\n3 schemas")]
ChEMBL[("ChEMBL\nread-only")]
AACTdb[("AACT DB\nread-only")]
Views["72 SQL Views\n14 Materialized Views"]
S3["AWS S3"]
end
subgraph Presentation["Presentation Layer"]
Admin["ActiveAdmin\n87 resources"]
API["REST API\ndrugs, diseases, trials,\norgs, publications, news"]
React["React Components\nShadow DOM embedded"]
end
External --> Thor
Thor --> LLM --> Matching --> PG
Sidekiq --> Thor
Workflows --> Sidekiq
PG --> Views
PG --> Admin --> React
PG --> API
ChEMBL -.->|chemical data| PG
AACTdb -.->|trial data| PG
PG --> S3
Three databases
| Database | Access | Purpose |
|---|---|---|
| Primary (PostgreSQL 17) | Read/write | 206 tables across public, analytics, forecasting |
| ChEMBL | Read-only | Chemical bioactivity data for molecular matching |
| AACT | Read-only | Raw ClinicalTrials.gov data for reconciliation |
External databases connect through dedicated database.yml entries. Namespaced models (External::Aact) isolate access.
Repository structure

    app/
      admin/         # ActiveAdmin resources (87 files, primary UI)
      controllers/   # Minimal -- mostly API endpoints
      models/        # 179 ActiveRecord models
        concerns/    # 13 cross-cutting behaviors
        external/    # Read-only external DB models
      forms/         # Form objects for complex UI interactions
      services/      # 60+ business logic classes
      workflows/     # 11 BaseWorkflow subclasses
      jobs/          # 19 Sidekiq job classes
      javascript/
        bundles/     # React components by feature domain
      views/         # ERB templates (admin, mailers)
    lib/
      tasks/         # Thor task files (82 across 6 domains)
    config/
      schedule.yml   # sidekiq-cron recurring jobs
      deploy.yml     # Kamal deployment configuration
      database.yml   # Multi-database connections
    db/
      schema.rb      # 206 tables, auto-generated
      views/         # 72 Scenic SQL view definitions
      migrate/       # Migration history

The pipeline pattern
Every data pipeline follows the same five-phase lifecycle. Understand this pattern once and every pipeline makes sense.
flowchart LR
C["1. Collect\nThor tasks pull\nfrom external APIs"]
E["2. Extract\nGPT-4.1 structures\nfree text"]
M["3. Match\n5-strategy\nentity resolution"]
Q["4. QA\nHuman review\nin ActiveAdmin"]
P["5. Present\nSQL views +\nAPI endpoints"]
C --> E --> M --> Q --> P
Phase 1: Collect. Thor tasks fetch data from external sources. Raw API responses land in raw_data JSONB fields, never modified after collection.
Phase 2: Extract. Records batch into groups of 100-1,000 and go to GPT-4.1. StoreModel classes define typed JSON schemas. Results land in llm_data JSONB fields with metadata (model version, confidence, timestamps).
Phase 3: Match. Entity resolution uses a five-strategy cascade: exact match, synonym match, fuzzy match (85% threshold), semantic search (pgvector), and LLM judge (consensus voting). Domain-specific implementations: DiseaseMatchingService, BiomarkerMatchingService, DrugMatchingService.
Phase 4: QA. The Claimable concern implements 24-hour claim-based review. The Lockable concern protects manual edits. Approved data flows to relational tables.
Phase 5: Present. 72 Scenic SQL views and 14 materialized views serve the API and analytics layers. Views refresh four times daily via RefreshViewsJob.
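
The Phase 3 cascade can be pictured as an ordered list of strategies where the first hit wins. The sketch below is a minimal plain-Ruby illustration under stated assumptions, not the actual matching services: `CascadeMatcher`, its in-memory term list, and the Levenshtein-based similarity score are hypothetical. The semantic-search and LLM-judge stages are left out because they depend on pgvector and an external API.

```ruby
# Hypothetical sketch of a first-hit-wins matching cascade.
class CascadeMatcher
  FUZZY_THRESHOLD = 0.85 # the 85% threshold described in Phase 3

  # terms: canonical names; synonyms: { "alias" => "canonical name" }
  def initialize(terms:, synonyms: {})
    @terms = terms
    @synonyms = synonyms.transform_keys { |k| normalize(k) }
  end

  # Returns [strategy, canonical_term], or nil when every strategy misses.
  def match(input)
    query = normalize(input)
    %i[exact synonym fuzzy].each do |strategy|
      hit = send(strategy, query)
      return [strategy, hit] if hit
    end
    nil
  end

  private

  def normalize(text)
    text.to_s.downcase.strip.gsub(/\s+/, " ")
  end

  def exact(query)
    @terms.find { |t| normalize(t) == query }
  end

  def synonym(query)
    @synonyms[query]
  end

  def fuzzy(query)
    best = @terms.max_by { |t| similarity(normalize(t), query) }
    best if best && similarity(normalize(best), query) >= FUZZY_THRESHOLD
  end

  # Similarity = 1 - (edit distance / length of the longer string).
  def similarity(a, b)
    longest = [a.length, b.length].max
    return 1.0 if longest.zero?
    1.0 - levenshtein(a, b).to_f / longest
  end

  # Single-row dynamic-programming edit distance.
  def levenshtein(a, b)
    row = (0..b.length).to_a
    a.each_char.with_index(1) do |ca, i|
      prev_diag = row[0]
      row[0] = i
      b.each_char.with_index(1) do |cb, j|
        cost = ca == cb ? 0 : 1
        current = [row[j] + 1, row[j - 1] + 1, prev_diag + cost].min
        prev_diag = row[j]
        row[j] = current
      end
    end
    row[b.length]
  end
end
```

Ordering cheap, deterministic strategies first means the expensive stages only see the terms that nothing simpler could resolve.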
Service layer patterns
60+ service files in app/services/ follow four patterns. Know which pattern to use for your feature.
Singleton API client
External API clients with connection reuse and retry logic.

    # Examples: NcitService, FdaService, CtGovService, RxNormService
    class NcitService
      include Singleton
      # Connection pooling, retry logic, rate limiting
    end

Use this when you need to call an external API.
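
To make the pattern concrete, here is a minimal sketch of a singleton client with bounded retries, using Ruby's standard `Singleton` module. `ExampleApiClient`, the retry limit, the injectable `transport` callable, and the choice of `IOError` as the transient failure are all assumptions for illustration. The real clients may be structured differently.

```ruby
require "singleton"

# Hypothetical client; not the real NcitService.
class ExampleApiClient
  include Singleton

  MAX_ATTEMPTS = 3 # assumed retry budget

  attr_accessor :transport # callable that performs the request
  attr_reader :calls       # total transport invocations, for inspection

  def initialize
    @calls = 0
    @transport = ->(path) { raise NotImplementedError, "no transport for #{path}" }
  end

  # Retries on IOError (a stand-in for transient network failures),
  # then re-raises once the attempt budget is exhausted.
  def get(path)
    attempts = 0
    begin
      attempts += 1
      @calls += 1
      transport.call(path)
    rescue IOError
      retry if attempts < MAX_ATTEMPTS
      raise
    end
  end
end
```

Every caller reaches the same object through `ExampleApiClient.instance`, which is what lets connection state and rate-limit counters be shared.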
Template method inheritance
A base class defines the processing pipeline. Subclasses override domain-specific behavior.

    # TermMatchingService defines the 4-stage cascade
    # DiseaseMatchingService, BiomarkerMatchingService, DrugMatchingService
    # override source-specific behavior

Use this when you need entity resolution for a new domain.
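
A minimal sketch of the template method shape follows. The class names and the normalization rule are hypothetical and far simpler than the real services. The point is only that the base class owns the sequence of steps while a subclass swaps in one hook.

```ruby
# Hypothetical base class: fixes the order of the pipeline steps.
class BaseTermMatcher
  def initialize(candidates)
    @candidates = candidates
  end

  # Template method: subclasses are not expected to override this.
  def match(raw_term)
    term = normalize(raw_term)
    @candidates.find { |c| normalize(c) == term }
  end

  private

  # Hook with a default; subclasses override it for domain rules.
  def normalize(text)
    text.to_s.downcase.strip
  end
end

# Hypothetical subclass: biomarker names ignore hyphens and spaces.
class ExampleBiomarkerMatcher < BaseTermMatcher
  private

  def normalize(text)
    super.delete("- ")
  end
end
```

With this split, a new domain only has to supply its own hooks. The cascade logic stays in one place.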
Configurable service
Constructor validates strategy or mode options at initialization. Fails fast on invalid configuration.

    # SimpleCandidateMatchingService selects strategies via constructor flags
    service = SimpleCandidateMatchingService.new(
      use_semantic: true,
      use_nci: false,
      threshold: 0.85
    )

Use this when a service needs runtime configuration.
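
Fail-fast validation can be sketched as below. `ExampleMatchingService`, its `mode` option, and the accepted values are hypothetical. Only the idea of rejecting bad options in the constructor comes from the pattern described above.

```ruby
# Hypothetical service showing fail-fast option validation.
class ExampleMatchingService
  VALID_MODES = %i[strict lenient].freeze # assumed option values

  attr_reader :mode, :threshold

  def initialize(mode: :strict, threshold: 0.85)
    unless VALID_MODES.include?(mode)
      raise ArgumentError, "mode must be one of #{VALID_MODES.inspect}, got #{mode.inspect}"
    end
    unless threshold.is_a?(Numeric) && threshold.between?(0.0, 1.0)
      raise ArgumentError, "threshold must be between 0 and 1, got #{threshold.inspect}"
    end

    @mode = mode
    @threshold = threshold
  end
end
```

Raising in the constructor surfaces a misconfigured job at startup instead of partway through a long batch run.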
LLM orchestrator
OpenAiService switches between immediate and batch processing based on its constructor flag. StoreModel schemas generate OpenAI-compatible JSON schema definitions. RetryWithBackoff handles transient failures.

    # Sequential mode: single prompt, blocking call
    service = OpenAiService.new(mode: :sequential)

    # Batch mode: multiple prompts via OpenAI Batch API
    service = OpenAiService.new(mode: :batch, parallelism: 5)

Use this when you need LLM processing. Always use batch mode for more than 100 records.
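
For readers unfamiliar with retry-with-backoff, the idea can be sketched as follows. `ExampleBackoff`, its parameter names, and the doubling delay schedule are assumptions. The real RetryWithBackoff may use different limits, jitter, or error classes.

```ruby
# Hypothetical helper; the real RetryWithBackoff may differ.
module ExampleBackoff
  # Yields up to max_attempts times. After each failure it waits
  # base * 2**(attempt - 1) seconds, then retries. `sleeper` is
  # injectable so tests do not have to wait in real time.
  def self.call(max_attempts: 3, base: 0.5, sleeper: ->(s) { sleep(s) })
    attempt = 0
    begin
      attempt += 1
      yield attempt
    rescue StandardError
      raise if attempt >= max_attempts
      sleeper.call(base * 2**(attempt - 1))
      retry
    end
  end
end
```

Doubling the delay gives a rate-limited or briefly unavailable API time to recover, and the attempt cap keeps a persistent failure from looping forever.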
Workflow orchestration
The BaseWorkflow framework coordinates multi-step pipelines. Each workflow defines a directed acyclic graph (DAG) of steps.
How workflows work
| Component | Role |
|---|---|
| `BaseWorkflow` subclass | Defines steps, dependencies, and execution parameters |
| `WorkflowInstance` | Persistent record tracking a single execution run |
| `WorkflowStep` | Individual step with status, retries, and AWS Batch integration |
| `WorkflowRunnerJob` | Sidekiq job that launches workflow instances |
Steps support three execution patterns:
- Sequential — Step A completes, then Step B starts
- Fan-out — Step A completes, then B, C, and D start in parallel
- Convergence — Step E waits for B, C, and D to all complete
Failed steps retry up to 2 times (3 total attempts). Operators can retry, skip, or manually complete steps from the ActiveAdmin workflow dashboard at /admin/workflow_instances.
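
The three execution patterns fall out of one rule: a step becomes runnable once all of its dependencies are complete. The sketch below illustrates that rule in plain Ruby. The hash-based DAG and the helper names are hypothetical and do not reflect the BaseWorkflow DSL.

```ruby
# Hypothetical DAG: step name => names of the steps it depends on.
EXAMPLE_DAG = {
  a: [],
  b: [:a], c: [:a], d: [:a], # fan-out from a
  e: %i[b c d]               # convergence on b, c, and d
}.freeze

# Steps that have not run yet and whose dependencies are all complete.
def ready_steps(dag, completed)
  dag.select { |step, deps| !completed.include?(step) && (deps - completed).empty? }.keys
end

# Groups the DAG into waves; steps within a wave could run in parallel.
def execution_waves(dag)
  completed = []
  waves = []
  until completed.size == dag.size
    wave = ready_steps(dag, completed)
    raise ArgumentError, "cycle or missing dependency" if wave.empty?
    waves << wave
    completed += wave
  end
  waves
end
```

For `EXAMPLE_DAG`, step `a` runs alone, then `b`, `c`, and `d` together, then `e` once all three have finished.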
Active workflows
| Workflow | Steps | Trigger | Purpose |
|---|---|---|---|
| `ClinicalTrialsWorkflow` | 22 | Tue 08:00 UTC | Trial collection and study plans |
| `ClinicalTrialEligibilitiesWorkflow` | 30 | After CT workflow | Disease/biomarker matching |
| `DrugsWorkflow` | 31 | Manual | Drug approval collection |
| `IndicationWorkflow` | 37 | After Drugs workflow | Indication extraction and linking |
| `NewsLlmWorkflow` | 33 | Sun 22:00 UTC | News classification and extraction |
| `PublicationsWorkflow` | 17 | Manual | Publication ingestion |
| `PublicationDiseaseWorkflow` | 22 | Manual | Disease matching for publications |
| `StandardOfCareWorkflow` | 25 | Fri 22:00 UTC | Treatment guideline processing |
| `CompaniesWorkflow` | 4 | Manual | Organisation enrichment |
| `NewsDiseaseWorkflow` | 7 | Manual | Disease linking for news |
| `PrivateWorkflow` | 11 | Manual | Internal data processing |
AWS Batch execution
Pipeline steps run on AWS Batch. Two job definitions exist:
- `data-lake-tasks:7`: Ruby/Thor tasks. Most steps use this.
- `data-collection-job:6`: Node.js scripts for legacy collectors.
Each step specifies vCPU and memory requirements. When a step completes, the workflow engine submits successor jobs based on the DAG.
Background job architecture
Sidekiq runs with 10 workers on a single default queue. sidekiq-cron manages the recurring schedule from config/schedule.yml.
Two execution patterns for jobs
In-process jobs run directly in the Sidekiq worker. Fast and low-overhead.

    # NewsCisionJob, OrgSyncFmpJob, ManualDataInsightsNotificationJob
    def perform
      CisionApiService.new.fetch_and_save_news
    end

AWS Batch jobs create a OneOffJob record that submits work to AWS Batch. The Sidekiq worker only handles submission.

    # NewsBusinessWireJob, RefreshViewsJob, FdaApprovalNotificationsJob
    def perform
      OneOffJob.create!(
        job_definition_id: 'data-lake-tasks:7',
        command: 'bundle exec thor searchful_news:collect_business_wire',
        vcpus: 1,
        memory: 2
      ).start
    end

Recurring schedule (UTC)
| Job | Schedule | Type |
|---|---|---|
| `ClinicalTrialsWorkflowSchedulerJob` | Tue 08:00 | Workflow |
| `WorkflowRunnerJob` (SOC) | Fri 22:00 | Workflow |
| `WorkflowRunnerJob` (News LLM) | Sun 22:00 | Workflow |
| `NewsBusinessWireJob` | Daily 05:00 | AWS Batch |
| `NewsCisionJob` | Twice daily 00:00, 16:00 | In-process |
| `NewsGlobalnewsWireJob` | Twice daily 03:00, 21:00 | AWS Batch |
| `NewsFinancialJob` | Daily 12:00 | In-process |
| `OrgSyncFmpJob` | Daily 01:00 | In-process |
| `RefreshViewsJob` | 4x daily 06:00, 12:00, 18:00, 22:00 | AWS Batch |
| `AggregationSchedulerJob` | Sun 03:00 | In-process |
| `FdaApprovalNotificationsJob` | Sat 12:00 | AWS Batch |
| `ManualDataInsightsNotificationJob` | Daily 06:00 | In-process |
| `FixCounterCultureCountsJob` | Daily 06:00 | In-process |
| `PipelineMonitorJob` | Sun 22:00 | AWS Batch |
Model layer
The model layer has 179 model files, plus 13 concerns that implement cross-cutting behaviors.
Key concerns
| Concern | What it does | Which models use it |
|---|---|---|
| `Claimable` | 24-hour QA claim workflow | Drug, Target |
| `Lockable` | Protect manual edits from automation | 22 models |
| `Searchable` | Full-text search with flexifind | ~20 models |
| `Vectorizable` | OpenAI embeddings via pgvector | Disease, Biomarker, Publication, NewsChunk |
| `Assignable` | Persistent user assignment | News, OrganisationHistory, StudyPlanExtraction |
| `Auditable` | Data quality issue tracking | Indication |
| `DuplicateDetectable` | Fuzzy duplicate detection | Drug, Organisation, Biomarker, Technology, NcitConcept |
| `BatchExecutable` | AWS Batch job execution | WorkflowStep, OneOffJob |
| `WorkflowResourceTrackable` | Link workflows to touched records | DrugApproval, Eligibility, Guideline, ClinicalTrialCondition, ClinicalTrialReference, LlmLog, FdaApprovalNotification |
| `Approvable` | Legacy editorial status/QA fields | Most core entities (legacy, not actively used) |
| `TherapeuticAreaFilterable` | Filter by therapeutic area JSONB | News |
| `BulkUpdatable` | Bulk update support via ActiveAdmin | Drug |
| `ReflectableAssociations` | Runtime association introspection (via extend) | Disease, Endpoint, StoreModel schemas |
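
As a rough picture of what a 24-hour claim means, here is a plain-Ruby sketch. The real Claimable is a Rails concern backed by database columns. `ExampleClaimable`, its attribute names, and the expiry rule are assumptions made for illustration.

```ruby
# Hypothetical plain-Ruby take on a 24-hour review claim.
module ExampleClaimable
  CLAIM_TTL = 24 * 60 * 60 # seconds

  attr_reader :claimed_by, :claimed_at

  # A claim is active until CLAIM_TTL seconds have passed.
  def claimed?(now: Time.now)
    !claimed_at.nil? && now - claimed_at < CLAIM_TTL
  end

  # Returns true if the claim succeeded, false if someone else holds it.
  def claim!(user, now: Time.now)
    return false if claimed?(now: now) && claimed_by != user
    @claimed_by = user
    @claimed_at = now
    true
  end
end

class ExampleReviewItem
  include ExampleClaimable
end
```

The expiry keeps an abandoned claim from blocking a record indefinitely: once the window passes, another reviewer can take it over.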
Key conventions
- Thor tasks are thin wrappers. CLI parsing, query scoping, and progress logging go in Thor. Business logic lives in services.
- Idempotent operations. Data-state checks like `WHERE llm_data IS NULL` prevent reprocessing.
- Batch processing. `find_in_batches(batch_size: 1000)` keeps memory stable.
- Services are reusable. Both Thor tasks and Sidekiq jobs invoke the same services.
- PaperTrail everywhere. The `versions` table tracks changes to all core entities.
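
The idempotency and batching conventions combine naturally. The sketch below shows the shape with in-memory objects so it runs without Rails. `ExampleRow` and `extract_missing` are hypothetical names. In the app, the equivalent selection would typically be an ActiveRecord scope such as `where(llm_data: nil)` iterated with `find_in_batches(batch_size: 1000)`, which also avoids loading every pending record at once.

```ruby
# Hypothetical in-memory stand-in for a record with an llm_data column.
ExampleRow = Struct.new(:id, :llm_data)

# Processes only rows that still lack llm_data, in fixed-size batches.
# The block computes llm_data for one row. Returns the number of rows
# processed during this run.
def extract_missing(rows, batch_size: 1000)
  pending = rows.select { |r| r.llm_data.nil? } # like WHERE llm_data IS NULL
  pending.each_slice(batch_size) do |batch|
    batch.each { |r| r.llm_data = yield(r) }
  end
  pending.size
end
```

Because the selection is driven by data state, rerunning the task after a crash picks up only the rows that were never finished.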
REST API
The API serves the Bioloupe client application. Routes live in config/routes.rb under the /api namespace.
Authenticated endpoints
| Resource | Key endpoints |
|---|---|
| `/api/drugs` | Index, show, search, bulk, by_disease_and_phase, card_details |
| `/api/diseases` | Index, show, search, epidemiology, biomarkers, standard_of_care |
| `/api/clinical_trials` | Show (by NCT ID), bulk, card_details, trial_details, trial_results |
| `/api/organisations` | Index, show, search, development, compare |
| `/api/publications` | Show, search, compare |
| `/api/news` | Index, show |
| `/api/targets` | Index, show, search, development |
| `/api/technologies` | Index, show, search |
| `/api/search` | Universal search, module-specific, column values |
Authentication
JWT tokens are issued via Devise + Google OAuth for admin users. API sessions use /api/sessions or /api/login. The OliveBranch middleware auto-converts between snake_case (Rails) and camelCase (JavaScript clients).
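
To show what the key conversion amounts to, here is a small standalone sketch. It is not OliveBranch's implementation, which runs as Rack middleware. The helper names below are hypothetical, and the example payload keys are made up.

```ruby
# Illustrative only: converts one key from snake_case to camelCase.
def camelize_key(key)
  key.to_s.gsub(/_([a-z0-9])/) { Regexp.last_match(1).upcase }
end

# Illustrative only: converts one key from camelCase to snake_case.
def snake_key(key)
  key.to_s.gsub(/([A-Z])/) { "_#{Regexp.last_match(1).downcase}" }
end

# Recursively rewrites hash keys, descending into arrays and nested hashes.
def deep_transform(value, &converter)
  case value
  when Hash  then value.to_h { |k, v| [converter.call(k), deep_transform(v, &converter)] }
  when Array then value.map { |v| deep_transform(v, &converter) }
  else value
  end
end
```

A response hash keyed with `trial_details` would reach a JavaScript client as `trialDetails`, and request parameters are converted the other way before Rails sees them.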
Why the architecture looks this way
Pharmaceutical data has properties that drive every design decision.
Unreliable naming. The same entity appears under dozens of names. The five-strategy matching cascade exists because of this.
Complex relationships. A drug connects to approvals, trials, targets, technologies, organisations, diseases, biomarkers, and publications. The 206-table schema reflects this density.
Human judgment required. LLMs extract structured data from free text, but oncology domain expertise catches errors automation cannot. The claim/review/lock system encodes this.
Continuous updates. Trial data changes weekly. Approvals arrive unpredictably. The lockable attributes system preserves human curation across import cycles.
Audit requirements. JSONB fields (raw_data, llm_data) persist as immutable audit trails. PaperTrail tracks every edit. The system never silently overwrites curated data.
Next steps
- Operations: Deployment, monitoring, and troubleshooting for production
- Data model: The complete schema reference