Architecture

This page is for someone who needs to add code to Data Gov, not just understand it. It covers the Rails app structure, the service layer patterns, the workflow engine, background job architecture, and the conventions you need to follow when building new features.

Data Gov is an admin-first data processing platform. ActiveAdmin provides the primary UI. Thor tasks drive batch pipelines. Sidekiq handles background scheduling. Every design decision traces back to one reality: pharmaceutical data is fragmented across dozens of sources with inconsistent naming, unstructured text, and no common identifiers.

```mermaid
flowchart TB
  subgraph External["External Data Sources"]
    AACT["ClinicalTrials.gov\n(AACT database)"]
    ChiCTR["ChiCTR\n(headless scraping)"]
    FDA["FDA / EMA / KEGG / CDE"]
    PubMed["PubMed + Conferences"]
    Wires["News Wires\n(4 sources)"]
  end

  subgraph Processing["Processing Layer"]
    Thor["Thor Tasks\n(82 task files)"]
    Sidekiq["Sidekiq Workers\n(19 job classes)"]
    Workflows["Workflow Engine\n(11 workflows)"]
    LLM["OpenAI Service\n(batch + sequential)"]
    Matching["Entity Resolution\n(5-strategy cascade)"]
  end

  subgraph Storage["Storage Layer"]
    PG[("PostgreSQL 17\n206 tables\n3 schemas")]
    ChEMBL[("ChEMBL\nread-only")]
    AACTdb[("AACT DB\nread-only")]
    Views["72 SQL Views\n14 Materialized Views"]
    S3["AWS S3"]
  end

  subgraph Presentation["Presentation Layer"]
    Admin["ActiveAdmin\n87 resources"]
    API["REST API\ndrugs, diseases, trials,\norgs, publications, news"]
    React["React Components\nShadow DOM embedded"]
  end

  External --> Thor
  Thor --> LLM --> Matching --> PG
  Sidekiq --> Thor
  Workflows --> Sidekiq
  PG --> Views
  PG --> Admin --> React
  PG --> API
  ChEMBL -.->|chemical data| PG
  AACTdb -.->|trial data| PG
  PG --> S3
```

| Database | Access | Purpose |
| --- | --- | --- |
| Primary (PostgreSQL 17) | Read/write | 206 tables across public, analytics, forecasting |
| ChEMBL | Read-only | Chemical bioactivity data for molecular matching |
| AACT | Read-only | Raw ClinicalTrials.gov data for reconciliation |

External databases connect through dedicated database.yml entries. Namespaced models (External::Aact) isolate access.
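As a sketch, the extra entries in database.yml look roughly like this. The database names and the `default` anchor are assumptions for illustration; check the real file for the production values.

```yaml
production:
  primary:
    <<: *default
    database: data_gov_production
  aact:
    <<: *default
    database: aact
    replica: true # read-only connection
  chembl:
    <<: *default
    database: chembl
    replica: true # read-only connection
```

The `replica: true` flag tells Rails the connection is read-only, which pairs with the namespaced `External::` models to keep writes out of ChEMBL and AACT.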

```
app/
  admin/          # ActiveAdmin resources (87 files, primary UI)
  controllers/    # Minimal -- mostly API endpoints
  models/         # 179 ActiveRecord models
    concerns/     # 13 cross-cutting behaviors
    external/     # Read-only external DB models
  forms/          # Form objects for complex UI interactions
  services/       # 60+ business logic classes
  workflows/      # 11 BaseWorkflow subclasses
  jobs/           # 19 Sidekiq job classes
  javascript/
    bundles/      # React components by feature domain
  views/          # ERB templates (admin, mailers)
lib/
  tasks/          # Thor task files (82 across 6 domains)
config/
  schedule.yml    # sidekiq-cron recurring jobs
  deploy.yml      # Kamal deployment configuration
  database.yml    # Multi-database connections
db/
  schema.rb       # 206 tables, auto-generated
  views/          # 72 Scenic SQL view definitions
  migrate/        # Migration history
```

Every data pipeline follows the same five-phase lifecycle. Understand this pattern once and every pipeline makes sense.

```mermaid
flowchart LR
  C["1. Collect\nThor tasks pull\nfrom external APIs"]
  E["2. Extract\nGPT-4.1 structures\nfree text"]
  M["3. Match\n5-strategy\nentity resolution"]
  Q["4. QA\nHuman review\nin ActiveAdmin"]
  P["5. Present\nSQL views +\nAPI endpoints"]

  C --> E --> M --> Q --> P
```

Phase 1: Collect. Thor tasks fetch data from external sources. Raw API responses land in raw_data JSONB fields, never modified after collection.

Phase 2: Extract. Records batch into groups of 100-1,000 and go to GPT-4.1. StoreModel classes define typed JSON schemas. Results land in llm_data JSONB fields with metadata (model version, confidence, timestamps).

Phase 3: Match. Entity resolution uses a five-strategy cascade: exact match, synonym match, fuzzy match (85% threshold), semantic search (pgvector), and LLM judge (consensus voting). Domain-specific implementations: DiseaseMatchingService, BiomarkerMatchingService, DrugMatchingService.
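The cascade's first-match-wins shape can be sketched in plain Ruby. This is an illustration only: `MatchCascade` is a hypothetical name, not one of the services above, and every strategy body here is a stub for the production logic.

```ruby
# Hypothetical illustration of the 5-strategy cascade shape.
class MatchCascade
  STRATEGIES = %i[exact_match synonym_match fuzzy_match semantic_match llm_judge].freeze

  def initialize(dictionary)
    @dictionary = dictionary # canonical name => [synonyms]
  end

  # Try each strategy in order; return [strategy, canonical] on the first hit.
  def match(term)
    STRATEGIES.each do |strategy|
      hit = send(strategy, term.downcase)
      return [strategy, hit] if hit
    end
    nil
  end

  private

  def exact_match(term)
    @dictionary.keys.find { |name| name.downcase == term }
  end

  def synonym_match(term)
    pair = @dictionary.find { |_, syns| syns.any? { |s| s.downcase == term } }
    pair && pair.first
  end

  def fuzzy_match(term)
    # Stand-in for trigram similarity at the 85% threshold.
    @dictionary.keys.find { |name| trigram_similarity(name.downcase, term) >= 0.85 }
  end

  def semantic_match(_term)
    nil # pgvector embedding search in the real services
  end

  def llm_judge(_term)
    nil # LLM consensus voting in the real services
  end

  def trigram_similarity(a, b)
    ta = a.chars.each_cons(3).to_a
    tb = b.chars.each_cons(3).to_a
    return 0.0 if ta.empty? || tb.empty?
    (ta & tb).size.to_f / [ta.size, tb.size].max
  end
end
```

Cheap strategies run first; the expensive ones (embeddings, LLM calls) only fire when everything earlier misses.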

Phase 4: QA. The Claimable concern implements 24-hour claim-based review. The Lockable concern protects manual edits. Approved data flows to relational tables.

Phase 5: Present. 72 Scenic SQL views and 14 materialized views serve the API and analytics layers. Views refresh four times daily via RefreshViewsJob.

60+ service files in app/services/ follow four patterns. Know which pattern to use for your feature.

External API clients with connection reuse and retry logic.

```ruby
# Examples: NcitService, FdaService, CtGovService, RxNormService
class NcitService
  include Singleton
  # Connection pooling, retry logic, rate limiting
end
```

Use this when you need to call an external API.
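A minimal self-contained sketch of the pattern, assuming a hypothetical `ExampleApiService`; the retry count and backoff curve are illustrative, not values from the real services.

```ruby
require "singleton"
require "net/http"
require "json"

# Hypothetical skeleton of a singleton API client with retry/backoff.
class ExampleApiService
  include Singleton

  MAX_RETRIES = 3 # illustrative

  def get_json(url)
    response = with_retries { Net::HTTP.get_response(URI(url)) }
    JSON.parse(response.body)
  end

  private

  # Retry transient network failures with exponential backoff.
  def with_retries
    attempts = 0
    begin
      yield
    rescue Net::OpenTimeout, Errno::ECONNRESET
      attempts += 1
      raise if attempts >= MAX_RETRIES
      sleep(2**attempts) # 2s, then 4s
      retry
    end
  end
end
```

Because the class includes `Singleton`, callers go through `ExampleApiService.instance`; `.new` is private, so one connection-owning object serves the whole process.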

A base class defines the processing pipeline. Subclasses override domain-specific behavior.

```ruby
# TermMatchingService defines the 4-stage cascade
# DiseaseMatchingService, BiomarkerMatchingService, DrugMatchingService
# override source-specific behavior
```

Use this when you need entity resolution for a new domain.

Constructor validates strategy or mode options at initialization. Fails fast on invalid configuration.

```ruby
# SimpleCandidateMatchingService selects strategies via constructor flags
service = SimpleCandidateMatchingService.new(
  use_semantic: true,
  use_nci: false,
  threshold: 0.85
)
```

Use this when a service needs runtime configuration.
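The fail-fast validation itself might look like this. `ConfigurableService` and its options are hypothetical stand-ins for the idea, not the real class.

```ruby
# Hypothetical sketch of fail-fast option validation in the constructor.
class ConfigurableService
  VALID_MODES = %i[semantic fuzzy].freeze

  def initialize(mode:, threshold: 0.85)
    unless VALID_MODES.include?(mode)
      raise ArgumentError, "unknown mode: #{mode.inspect}"
    end
    unless threshold.positive? && threshold <= 1
      raise ArgumentError, "threshold must be in (0, 1], got #{threshold}"
    end

    @mode = mode
    @threshold = threshold
  end
end
```

Invalid configuration raises at construction time, long before the service touches any data.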

OpenAiService switches between immediate and batch processing based on its constructor flag. StoreModel schemas generate OpenAI-compatible JSON schema definitions. RetryWithBackoff handles transient failures.

```ruby
# Sequential mode: single prompt, blocking call
service = OpenAiService.new(mode: :sequential)

# Batch mode: multiple prompts via OpenAI Batch API
service = OpenAiService.new(mode: :batch, parallelism: 5)
```

Use this when you need LLM processing. Always use batch mode for >100 records.

The BaseWorkflow framework coordinates multi-step pipelines. Each workflow defines a directed acyclic graph (DAG) of steps.

| Component | Role |
| --- | --- |
| BaseWorkflow subclass | Defines steps, dependencies, and execution parameters |
| WorkflowInstance | Persistent record tracking a single execution run |
| WorkflowStep | Individual step with status, retries, and AWS Batch integration |
| WorkflowRunnerJob | Sidekiq job that launches workflow instances |

Steps support three execution patterns:

  • Sequential — Step A completes, then Step B starts
  • Fan-out — Step A completes, then B, C, and D start in parallel
  • Convergence — Step E waits for B, C, and D to all complete

Failed steps retry up to 2 times (3 total attempts). Operators can retry, skip, or manually complete steps from the ActiveAdmin workflow dashboard at /admin/workflow_instances.
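A toy scheduler shows how all three patterns reduce to one rule: a step becomes runnable once every dependency is complete. This is illustrative only; the real engine persists state in WorkflowStep records and submits AWS Batch jobs.

```ruby
# Toy DAG scheduler, not the real engine.
class ToyWorkflow
  def initialize(deps)
    @deps = deps # step name => array of prerequisite steps
    @done = []
  end

  # Steps that are not finished and whose prerequisites are all finished.
  def runnable
    @deps.keys
         .reject { |step| @done.include?(step) }
         .select { |step| (@deps[step] - @done).empty? }
  end

  def complete!(step)
    @done << step
  end
end

wf = ToyWorkflow.new(
  a: [],                     # sequential root
  b: [:a], c: [:a], d: [:a], # fan-out after :a
  e: %i[b c d]               # convergence on :b, :c, :d
)
wf.complete!(:a)
wf.runnable # => [:b, :c, :d]
```

Completing `:b`, `:c`, and `:d` in any order makes `:e` runnable, which is exactly the convergence behavior described above.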

| Workflow | Steps | Trigger | Purpose |
| --- | --- | --- | --- |
| ClinicalTrialsWorkflow | 22 | Tue 08:00 UTC | Trial collection and study plans |
| ClinicalTrialEligibilitiesWorkflow | 30 | After CT workflow | Disease/biomarker matching |
| DrugsWorkflow | 31 | Manual | Drug approval collection |
| IndicationWorkflow | 37 | After Drugs workflow | Indication extraction and linking |
| NewsLlmWorkflow | 33 | Sun 22:00 UTC | News classification and extraction |
| PublicationsWorkflow | 17 | Manual | Publication ingestion |
| PublicationDiseaseWorkflow | 22 | Manual | Disease matching for publications |
| StandardOfCareWorkflow | 25 | Fri 22:00 UTC | Treatment guideline processing |
| CompaniesWorkflow | 4 | Manual | Organisation enrichment |
| NewsDiseaseWorkflow | 7 | Manual | Disease linking for news |
| PrivateWorkflow | 11 | Manual | Internal data processing |

Pipeline steps run on AWS Batch. Two job definitions exist:

  • data-lake-tasks:7 — Ruby/Thor tasks. Most steps use this.
  • data-collection-job:6 — Node.js scripts for legacy collectors.

Each step specifies vCPU and memory requirements. When a step completes, the workflow engine submits successor jobs based on the DAG.

Sidekiq runs with 10 workers on a single default queue. sidekiq-cron manages the recurring schedule from config/schedule.yml.
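A typical schedule.yml entry follows the standard sidekiq-cron shape. This entry is illustrative (the job name key and queue are assumed); consult the real file for the exact cron lines.

```yaml
news_business_wire_job:
  cron: "0 5 * * *" # daily 05:00 UTC
  class: "NewsBusinessWireJob"
  queue: default
```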

In-process jobs run directly in the Sidekiq worker. Fast and low-overhead.

```ruby
# NewsCisionJob, OrgSyncFmpJob, ManualDataInsightsNotificationJob
def perform
  CisionApiService.new.fetch_and_save_news
end
```

AWS Batch jobs create a OneOffJob record that submits work to AWS Batch. The Sidekiq worker only handles submission.

```ruby
# NewsBusinessWireJob, RefreshViewsJob, FdaApprovalNotificationsJob
def perform
  OneOffJob.create!(
    job_definition_id: 'data-lake-tasks:7',
    command: 'bundle exec thor searchful_news:collect_business_wire',
    vcpus: 1, memory: 2
  ).start
end
```

| Job | Schedule | Type |
| --- | --- | --- |
| ClinicalTrialsWorkflowSchedulerJob | Tue 08:00 | Workflow |
| WorkflowRunnerJob (SOC) | Fri 22:00 | Workflow |
| WorkflowRunnerJob (News LLM) | Sun 22:00 | Workflow |
| NewsBusinessWireJob | Daily 05:00 | AWS Batch |
| NewsCisionJob | Twice daily 00:00, 16:00 | In-process |
| NewsGlobalnewsWireJob | Twice daily 03:00, 21:00 | AWS Batch |
| NewsFinancialJob | Daily 12:00 | In-process |
| OrgSyncFmpJob | Daily 01:00 | In-process |
| RefreshViewsJob | 4x daily 06:00, 12:00, 18:00, 22:00 | AWS Batch |
| AggregationSchedulerJob | Sun 03:00 | In-process |
| FdaApprovalNotificationsJob | Sat 12:00 | AWS Batch |
| ManualDataInsightsNotificationJob | Daily 06:00 | In-process |
| FixCounterCultureCountsJob | Daily 06:00 | In-process |
| PipelineMonitorJob | Sun 22:00 | AWS Batch |

The app defines 179 model files; 13 concerns implement cross-cutting behaviors shared across them.

| Concern | What it does | Which models use it |
| --- | --- | --- |
| Claimable | 24-hour QA claim workflow | Drug, Target |
| Lockable | Protect manual edits from automation | 22 models |
| Searchable | Full-text search with flexifind | ~20 models |
| Vectorizable | OpenAI embeddings via pgvector | Disease, Biomarker, Publication, NewsChunk |
| Assignable | Persistent user assignment | News, OrganisationHistory, StudyPlanExtraction |
| Auditable | Data quality issue tracking | Indication |
| DuplicateDetectable | Fuzzy duplicate detection | Drug, Organisation, Biomarker, Technology, NcitConcept |
| BatchExecutable | AWS Batch job execution | WorkflowStep, OneOffJob |
| WorkflowResourceTrackable | Link workflows to touched records | DrugApproval, Eligibility, Guideline, ClinicalTrialCondition, ClinicalTrialReference, LlmLog, FdaApprovalNotification |
| Approvable | Legacy editorial status/QA fields | Most core entities (legacy, not actively used) |
| TherapeuticAreaFilterable | Filter by therapeutic area JSONB | News |
| BulkUpdatable | Bulk update support via ActiveAdmin | Drug |
| ReflectableAssociations | Runtime association introspection (via extend) | Disease, Endpoint, StoreModel schemas |

  • Thor tasks are thin wrappers. CLI parsing, query scoping, and progress logging go in Thor. Business logic lives in services.
  • Idempotent operations. Data-state checks like WHERE llm_data IS NULL prevent reprocessing.
  • Batch processing. find_in_batches(batch_size: 1000) keeps memory stable.
  • Services are reusable. Both Thor tasks and Sidekiq jobs invoke the same services.
  • PaperTrail everywhere. The versions table tracks changes to all core entities.
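The idempotency and batching conventions can be shown in plain Ruby. `Record` and `process_missing` are stand-ins for illustration (no ActiveRecord), not names from the app.

```ruby
# Plain-Ruby illustration of the conventions: process only records still
# missing llm_data, in fixed-size batches.
Record = Struct.new(:id, :llm_data)

def process_missing(records, batch_size: 1000)
  processed = []
  records
    .reject(&:llm_data)                # the WHERE llm_data IS NULL check
    .each_slice(batch_size) do |batch| # the find_in_batches analogue
      batch.each do |record|
        record.llm_data = { extracted: true } # a service call in real tasks
        processed << record.id
      end
    end
  processed
end
```

Running it twice is safe: the second pass finds nothing left to process, so a crashed or re-run task never reprocesses finished records.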

The API serves the Bioloupe client application. Routes live in config/routes.rb under the /api namespace.

| Resource | Key endpoints |
| --- | --- |
| /api/drugs | Index, show, search, bulk, by_disease_and_phase, card_details |
| /api/diseases | Index, show, search, epidemiology, biomarkers, standard_of_care |
| /api/clinical_trials | Show (by NCT ID), bulk, card_details, trial_details, trial_results |
| /api/organisations | Index, show, search, development, compare |
| /api/publications | Show, search, compare |
| /api/news | Index, show |
| /api/targets | Index, show, search, development |
| /api/technologies | Index, show, search |
| /api/search | Universal search, module-specific, column values |

JWT tokens are issued via Devise + Google OAuth for admin users. API sessions use /api/sessions or /api/login. The OliveBranch middleware auto-converts between snake_case (Rails) and camelCase (JavaScript clients).
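The key conversion OliveBranch performs amounts to this; the snippet is an illustration of the transform, not the gem's implementation.

```ruby
# Illustrative snake_case -> camelCase key conversion.
def camelize_keys(hash)
  hash.transform_keys { |key| key.to_s.gsub(/_([a-z])/) { $1.upcase } }
end

camelize_keys("clinical_trial_id" => 42) # => {"clinicalTrialId" => 42}
```

The middleware applies this on the way out and the inverse on the way in, so Rails code and JavaScript clients each see their native casing.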

Pharmaceutical data has properties that drive every design decision.

Unreliable naming. The same entity appears under dozens of names. The five-strategy matching cascade exists because of this.

Complex relationships. A drug connects to approvals, trials, targets, technologies, organisations, diseases, biomarkers, and publications. The 206-table schema reflects this density.

Human judgment required. LLMs extract structured data from free text, but oncology domain expertise catches errors automation cannot. The claim/review/lock system encodes this.

Continuous updates. Trial data changes weekly. Approvals arrive unpredictably. The lockable attributes system preserves human curation across import cycles.

Audit requirements. JSONB fields (raw_data, llm_data) persist as immutable audit trails. PaperTrail tracks every edit. The system never silently overwrites curated data.

  • Operations — Deployment, monitoring, and troubleshooting for production
  • Data model — The complete schema reference