Architecture

This page is for someone who needs to add code to Data Gov, not just understand it. It covers the Rails app structure, the service layer patterns, the workflow engine, background job architecture, and the conventions you need to follow when building new features.

Data Gov is an admin-first data processing platform. ActiveAdmin provides the primary UI. Thor tasks drive batch pipelines. Sidekiq handles background scheduling. Every design decision traces back to one reality: pharmaceutical data is fragmented across dozens of sources with inconsistent naming, unstructured text, and no common identifiers.

```mermaid
flowchart TB
  subgraph External["External Data Sources"]
    AACT["ClinicalTrials.gov\n(AACT database)"]
    ChiCTR["ChiCTR\n(headless scraping)"]
    FDA["FDA / EMA / KEGG / CDE"]
    PubMed["PubMed + Conferences"]
    Wires["News Wires\n(4 sources)"]
  end

  subgraph Processing["Processing Layer"]
    Thor["Thor Tasks\n(82 task files)"]
    Sidekiq["Sidekiq Workers\n(19 job classes)"]
    Workflows["Workflow Engine\n(11 workflows)"]
    LLM["OpenAI Service\n(batch + sequential)"]
    Matching["Entity Resolution\n(5-strategy cascade)"]
  end

  subgraph Storage["Storage Layer"]
    PG[("PostgreSQL 17\n206 tables\n3 schemas")]
    ChEMBL[("ChEMBL\nread-only")]
    AACTdb[("AACT DB\nread-only")]
    Views["72 SQL Views\n14 Materialized Views"]
    S3["AWS S3"]
  end

  subgraph Presentation["Presentation Layer"]
    Admin["ActiveAdmin\n87 resources"]
    API["REST API\ndrugs, diseases, trials,\norgs, publications, news"]
    React["React Components\nShadow DOM embedded"]
  end

  External --> Thor
  Thor --> LLM --> Matching --> PG
  Sidekiq --> Thor
  Workflows --> Sidekiq
  PG --> Views
  PG --> Admin --> React
  PG --> API
  ChEMBL -.->|chemical data| PG
  AACTdb -.->|trial data| PG
  PG --> S3
```

| Database | Access | Purpose |
| --- | --- | --- |
| Primary (PostgreSQL 17) | Read/write | 206 tables across public, analytics, forecasting |
| ChEMBL | Read-only | Chemical bioactivity data for molecular matching |
| AACT | Read-only | Raw ClinicalTrials.gov data for reconciliation |

External databases connect through dedicated database.yml entries. Namespaced models (External::Aact) isolate access.
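As a sketch, the extra entries in database.yml look roughly like this. The database names and the `default` anchor are assumptions for illustration; check the real file for the production values.

```yaml
production:
  primary:
    <<: *default
    database: data_gov_production
  aact:
    <<: *default
    database: aact
    replica: true # read-only connection
  chembl:
    <<: *default
    database: chembl
    replica: true # read-only connection
```

The `replica: true` flag tells Rails the connection is read-only, which pairs with the namespaced `External::` models to keep writes out of ChEMBL and AACT.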

```
app/
  admin/          # ActiveAdmin resources (87 files, primary UI)
  controllers/    # Minimal -- mostly API endpoints
  models/         # 179 ActiveRecord models
    concerns/     # 13 cross-cutting behaviors
    external/     # Read-only external DB models
  forms/          # Form objects for complex UI interactions
  services/       # 60+ business logic classes
  workflows/      # 11 BaseWorkflow subclasses
  jobs/           # 19 Sidekiq job classes
  javascript/
    bundles/      # React components by feature domain
  views/          # ERB templates (admin, mailers)
lib/
  tasks/          # Thor task files (82 across 6 domains)
config/
  schedule.yml    # sidekiq-cron recurring jobs
  deploy.yml      # Kamal deployment configuration
  database.yml    # Multi-database connections
db/
  schema.rb       # 206 tables, auto-generated
  views/          # 72 Scenic SQL view definitions
  migrate/        # Migration history
```

Every data pipeline follows the same five-phase lifecycle. Understand this pattern once and every pipeline makes sense.

```mermaid
flowchart LR
  C["1. Collect\nThor tasks pull\nfrom external APIs"]
  E["2. Extract\nGPT-4.1 structures\nfree text"]
  M["3. Match\n5-strategy\nentity resolution"]
  Q["4. QA\nHuman review\nin ActiveAdmin"]
  P["5. Present\nSQL views +\nAPI endpoints"]

  C --> E --> M --> Q --> P
```

Phase 1: Collect. Thor tasks fetch data from external sources. Raw API responses land in raw_data JSONB fields, never modified after collection.

Phase 2: Extract. Records batch into groups of 100-1,000 and go to GPT-4.1. StoreModel classes define typed JSON schemas. Results land in llm_data JSONB fields with metadata (model version, confidence, timestamps).

Phase 3: Match. Entity resolution uses a five-strategy cascade: exact match, synonym match, fuzzy match (85% threshold), semantic search (pgvector), and LLM judge (consensus voting). Domain-specific implementations: DiseaseMatchingService, BiomarkerMatchingService, DrugMatchingService.
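The cascade's first-match-wins shape can be sketched in plain Ruby. This is an illustration only: `MatchCascade` is a hypothetical name, not one of the services above, and every strategy body here is a stub for the production logic.

```ruby
# Hypothetical illustration of the 5-strategy cascade shape.
class MatchCascade
  STRATEGIES = %i[exact_match synonym_match fuzzy_match semantic_match llm_judge].freeze

  def initialize(dictionary)
    @dictionary = dictionary # canonical name => [synonyms]
  end

  # Try each strategy in order; return [strategy, canonical] on the first hit.
  def match(term)
    STRATEGIES.each do |strategy|
      hit = send(strategy, term.downcase)
      return [strategy, hit] if hit
    end
    nil
  end

  private

  def exact_match(term)
    @dictionary.keys.find { |name| name.downcase == term }
  end

  def synonym_match(term)
    pair = @dictionary.find { |_, syns| syns.any? { |s| s.downcase == term } }
    pair && pair.first
  end

  def fuzzy_match(term)
    # Stand-in for trigram similarity at the 85% threshold.
    @dictionary.keys.find { |name| trigram_similarity(name.downcase, term) >= 0.85 }
  end

  def semantic_match(_term)
    nil # pgvector embedding search in the real services
  end

  def llm_judge(_term)
    nil # LLM consensus voting in the real services
  end

  def trigram_similarity(a, b)
    ta = a.chars.each_cons(3).to_a
    tb = b.chars.each_cons(3).to_a
    return 0.0 if ta.empty? || tb.empty?
    (ta & tb).size.to_f / [ta.size, tb.size].max
  end
end
```

Cheap strategies run first; the expensive ones (embeddings, LLM calls) only fire when everything earlier misses.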

Phase 4: QA. The Claimable concern implements 24-hour claim-based review. The Lockable concern protects manual edits. Approved data flows to relational tables.

Phase 5: Present. 72 Scenic SQL views and 14 materialized views serve the API and analytics layers. Views refresh four times daily via RefreshViewsJob.

60+ service files in app/services/ follow four patterns. Know which pattern to use for your feature.

External API clients with connection reuse and retry logic.

```ruby
# Examples: NcitService, FdaService, CtGovService, RxNormService
class NcitService
  include Singleton
  # Connection pooling, retry logic, rate limiting
end
```

Use this when you need to call an external API.
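A minimal self-contained sketch of the pattern, assuming a hypothetical `ExampleApiService`; the retry count and backoff curve are illustrative, not values from the real services.

```ruby
require "singleton"
require "net/http"
require "json"

# Hypothetical skeleton of a singleton API client with retry/backoff.
class ExampleApiService
  include Singleton

  MAX_RETRIES = 3 # illustrative

  def get_json(url)
    response = with_retries { Net::HTTP.get_response(URI(url)) }
    JSON.parse(response.body)
  end

  private

  # Retry transient network failures with exponential backoff.
  def with_retries
    attempts = 0
    begin
      yield
    rescue Net::OpenTimeout, Errno::ECONNRESET
      attempts += 1
      raise if attempts >= MAX_RETRIES
      sleep(2**attempts) # 2s, then 4s
      retry
    end
  end
end
```

Because the class includes `Singleton`, callers go through `ExampleApiService.instance`; `.new` is private, so one connection-owning object serves the whole process.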

A base class defines the processing pipeline. Subclasses override domain-specific behavior.

```ruby
# TermMatchingService defines the 4-stage cascade
# DiseaseMatchingService, BiomarkerMatchingService, DrugMatchingService
# override source-specific behavior
```

Use this when you need entity resolution for a new domain.

Constructor validates strategy or mode options at initialization. Fails fast on invalid configuration.

```ruby
# SimpleCandidateMatchingService selects strategies via constructor flags
service = SimpleCandidateMatchingService.new(
  use_semantic: true,
  use_nci: false,
  threshold: 0.85
)
```

Use this when a service needs runtime configuration.
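The fail-fast validation itself might look like this. `ConfigurableService` and its options are hypothetical stand-ins for the idea, not the real class.

```ruby
# Hypothetical sketch of fail-fast option validation in the constructor.
class ConfigurableService
  VALID_MODES = %i[semantic fuzzy].freeze

  def initialize(mode:, threshold: 0.85)
    unless VALID_MODES.include?(mode)
      raise ArgumentError, "unknown mode: #{mode.inspect}"
    end
    unless threshold.positive? && threshold <= 1
      raise ArgumentError, "threshold must be in (0, 1], got #{threshold}"
    end

    @mode = mode
    @threshold = threshold
  end
end
```

Invalid configuration raises at construction time, long before the service touches any data.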

OpenAiService switches between immediate and batch processing based on its constructor flag. StoreModel schemas generate OpenAI-compatible JSON schema definitions. RetryWithBackoff handles transient failures.

```ruby
# Sequential mode: single prompt, blocking call
service = OpenAiService.new(mode: :sequential)

# Batch mode: multiple prompts via OpenAI Batch API
service = OpenAiService.new(mode: :batch, parallelism: 5)
```

Use this when you need LLM processing. Always use batch mode for >100 records.

The BaseWorkflow framework coordinates multi-step pipelines. Each workflow defines a directed acyclic graph (DAG) of steps.

| Component | Role |
| --- | --- |
| BaseWorkflow subclass | Defines steps, dependencies, and execution parameters |
| WorkflowInstance | Persistent record tracking a single execution run |
| WorkflowStep | Individual step with status, retries, and AWS Batch integration |
| WorkflowRunnerJob | Sidekiq job that launches workflow instances |

Steps support three execution patterns:

  • Sequential — Step A completes, then Step B starts
  • Fan-out — Step A completes, then B, C, and D start in parallel
  • Convergence — Step E waits for B, C, and D to all complete

Failed steps retry up to 2 times (3 total attempts). Operators can retry, skip, or manually complete steps from the ActiveAdmin workflow dashboard at /admin/workflow_instances.
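A toy scheduler shows how all three patterns reduce to one rule: a step becomes runnable once every dependency is complete. This is illustrative only; the real engine persists state in WorkflowStep records and submits AWS Batch jobs.

```ruby
# Toy DAG scheduler, not the real engine.
class ToyWorkflow
  def initialize(deps)
    @deps = deps # step name => array of prerequisite steps
    @done = []
  end

  # Steps that are not finished and whose prerequisites are all finished.
  def runnable
    @deps.keys
         .reject { |step| @done.include?(step) }
         .select { |step| (@deps[step] - @done).empty? }
  end

  def complete!(step)
    @done << step
  end
end

wf = ToyWorkflow.new(
  a: [],                     # sequential root
  b: [:a], c: [:a], d: [:a], # fan-out after :a
  e: %i[b c d]               # convergence on :b, :c, :d
)
wf.complete!(:a)
wf.runnable # => [:b, :c, :d]
```

Completing `:b`, `:c`, and `:d` in any order makes `:e` runnable, which is exactly the convergence behavior described above.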

| Workflow | Steps | Trigger | Purpose |
| --- | --- | --- | --- |
| ClinicalTrialsWorkflow | 22 | Tue 08:00 UTC | Trial collection and study plans |
| ClinicalTrialEligibilitiesWorkflow | 30 | After CT workflow | Disease/biomarker matching |
| DrugsWorkflow | 31 | Manual | Drug approval collection |
| IndicationWorkflow | 37 | After Drugs workflow | Indication extraction and linking |
| NewsLlmWorkflow | 33 | Sun 22:00 UTC | News classification and extraction |
| PublicationsWorkflow | 17 | Manual | Publication ingestion |
| PublicationDiseaseWorkflow | 22 | Manual | Disease matching for publications |
| StandardOfCareWorkflow | 25 | Fri 22:00 UTC | Treatment guideline processing |
| CompaniesWorkflow | 4 | Manual | Organisation enrichment |
| NewsDiseaseWorkflow | 7 | Manual | Disease linking for news |
| PrivateWorkflow | 11 | Manual | Internal data processing |

Pipeline steps run on AWS Batch. Two job definitions exist:

  • data-lake-tasks:7 — Ruby/Thor tasks. Most steps use this.
  • data-collection-job:6 — Node.js scripts for legacy collectors.

Each step specifies vCPU and memory requirements. When a step completes, the workflow engine submits successor jobs based on the DAG.

Sidekiq runs with 10 workers on a single default queue. sidekiq-cron manages the recurring schedule from config/schedule.yml.
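A typical schedule.yml entry follows the standard sidekiq-cron shape. This entry is illustrative (the job name key and queue are assumed); consult the real file for the exact cron lines.

```yaml
news_business_wire_job:
  cron: "0 5 * * *" # daily 05:00 UTC
  class: "NewsBusinessWireJob"
  queue: default
```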

In-process jobs run directly in the Sidekiq worker. Fast and low-overhead.

```ruby
# NewsCisionJob, OrgSyncFmpJob, ManualDataInsightsNotificationJob
def perform
  CisionApiService.new.fetch_and_save_news
end
```

AWS Batch jobs create a OneOffJob record that submits work to AWS Batch. The Sidekiq worker only handles submission.

```ruby
# NewsBusinessWireJob, RefreshViewsJob, FdaApprovalNotificationsJob
def perform
  OneOffJob.create!(
    job_definition_id: 'data-lake-tasks:7',
    command: 'bundle exec thor searchful_news:collect_business_wire',
    vcpus: 1, memory: 2
  ).start
end
```

| Job | Schedule | Type |
| --- | --- | --- |
| ClinicalTrialsWorkflowSchedulerJob | Tue 08:00 | Workflow |
| WorkflowRunnerJob (SOC) | Fri 22:00 | Workflow |
| WorkflowRunnerJob (News LLM) | Sun 22:00 | Workflow |
| NewsBusinessWireJob | Daily 05:00 | AWS Batch |
| NewsCisionJob | Twice daily 00:00, 16:00 | In-process |
| NewsGlobalnewsWireJob | Twice daily 03:00, 21:00 | AWS Batch |
| NewsFinancialJob | Daily 12:00 | In-process |
| OrgSyncFmpJob | Daily 01:00 | In-process |
| RefreshViewsJob | 4x daily 06:00, 12:00, 18:00, 22:00 | AWS Batch |
| AggregationSchedulerJob | Sun 03:00 | In-process |
| FdaApprovalNotificationsJob | Sat 12:00 | AWS Batch |
| ManualDataInsightsNotificationJob | Daily 06:00 | In-process |
| FixCounterCultureCountsJob | Daily 06:00 | In-process |
| PipelineMonitorJob | Sun 22:00 | AWS Batch |

The app defines 179 model files; 13 concerns implement cross-cutting behaviors shared across them.

| Concern | What it does | Which models use it |
| --- | --- | --- |
| Claimable | 24-hour QA claim workflow | Drug, Target |
| Lockable | Protect manual edits from automation | 22 models |
| Searchable | Full-text search with flexifind | ~20 models |
| Vectorizable | OpenAI embeddings via pgvector | Disease, Biomarker, Publication, NewsChunk |
| Assignable | Persistent user assignment | News, OrganisationHistory, StudyPlanExtraction |
| Auditable | Data quality issue tracking | Indication |
| DuplicateDetectable | Fuzzy duplicate detection | Drug, Organisation, Biomarker, Technology, NcitConcept |
| BatchExecutable | AWS Batch job execution | WorkflowStep, OneOffJob |
| WorkflowResourceTrackable | Link workflows to touched records | DrugApproval, Eligibility, Guideline, ClinicalTrialCondition, ClinicalTrialReference, LlmLog, FdaApprovalNotification |
| Approvable | Legacy editorial status/QA fields | Most core entities (legacy, not actively used) |
| TherapeuticAreaFilterable | Filter by therapeutic area JSONB | News |
| BulkUpdatable | Bulk update support via ActiveAdmin | Drug |
| ReflectableAssociations | Runtime association introspection (via extend) | Disease, Endpoint, StoreModel schemas |

  • Thor tasks are thin wrappers. CLI parsing, query scoping, and progress logging go in Thor. Business logic lives in services.
  • Idempotent operations. Data-state checks like WHERE llm_data IS NULL prevent reprocessing.
  • Batch processing. find_in_batches(batch_size: 1000) keeps memory stable.
  • Services are reusable. Both Thor tasks and Sidekiq jobs invoke the same services.
  • PaperTrail everywhere. The versions table tracks changes to all core entities.
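The idempotency and batching conventions can be shown in plain Ruby. `Record` and `process_missing` are stand-ins for illustration (no ActiveRecord), not names from the app.

```ruby
# Plain-Ruby illustration of the conventions: process only records still
# missing llm_data, in fixed-size batches.
Record = Struct.new(:id, :llm_data)

def process_missing(records, batch_size: 1000)
  processed = []
  records
    .reject(&:llm_data)                # the WHERE llm_data IS NULL check
    .each_slice(batch_size) do |batch| # the find_in_batches analogue
      batch.each do |record|
        record.llm_data = { extracted: true } # a service call in real tasks
        processed << record.id
      end
    end
  processed
end
```

Running it twice is safe: the second pass finds nothing left to process, so a crashed or re-run task never reprocesses finished records.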

The API serves the Bioloupe client application. Routes live in config/routes.rb under the /api namespace.

| Resource | Key endpoints |
| --- | --- |
| /api/drugs | Index, show, search, bulk, by_disease_and_phase, card_details |
| /api/diseases | Index, show, search, epidemiology, biomarkers, standard_of_care |
| /api/clinical_trials | Show (by NCT ID), bulk, card_details, trial_details, trial_results |
| /api/organisations | Index, show, search, development, compare |
| /api/publications | Show, search, compare |
| /api/news | Index, show |
| /api/targets | Index, show, search, development |
| /api/technologies | Index, show, search |
| /api/search | Universal search, module-specific, column values |

JWT tokens are issued via Devise + Google OAuth for admin users. API sessions use /api/sessions or /api/login. The OliveBranch middleware auto-converts between snake_case (Rails) and camelCase (JavaScript clients).
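The key conversion OliveBranch performs amounts to this; the snippet is an illustration of the transform, not the gem's implementation.

```ruby
# Illustrative snake_case -> camelCase key conversion.
def camelize_keys(hash)
  hash.transform_keys { |key| key.to_s.gsub(/_([a-z])/) { $1.upcase } }
end

camelize_keys("clinical_trial_id" => 42) # => {"clinicalTrialId" => 42}
```

The middleware applies this on the way out and the inverse on the way in, so Rails code and JavaScript clients each see their native casing.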

Pharmaceutical data has properties that drive every design decision.

Unreliable naming. The same entity appears under dozens of names. The five-strategy matching cascade exists because of this.

Complex relationships. A drug connects to approvals, trials, targets, technologies, organisations, diseases, biomarkers, and publications. The 206-table schema reflects this density.

Human judgment required. LLMs extract structured data from free text, but oncology domain expertise catches errors automation cannot. The claim/review/lock system encodes this.

Continuous updates. Trial data changes weekly. Approvals arrive unpredictably. The lockable attributes system preserves human curation across import cycles.

Audit requirements. JSONB fields (raw_data, llm_data) persist as immutable audit trails. PaperTrail tracks every edit. The system never silently overwrites curated data.

  • Operations — Deployment, monitoring, and troubleshooting for production
  • Data model — The complete schema reference