
Operations

Everything an on-call engineer needs to run Data Gov in production. This page covers deployment, monitoring, troubleshooting, and the commands you reach for at 2 AM.

Every push to main triggers a GitHub Actions workflow that builds a Docker image, pushes it to Amazon ECR, and deploys via Kamal to two EC2 servers. The full pipeline takes 5-10 minutes.

```mermaid
flowchart LR
  Push["Push to main"] --> Build["Build Docker\nimage (Buildx)"]
  Build --> ECR["Push to ECR\n:sha + :latest"]
  ECR --> Secrets["Sync secrets\nfrom AWS SM"]
  Secrets --> Deploy["kamal deploy\n--skip-push"]
  Deploy --> Web["Web server\n(98.93.194.3)"]
  Deploy --> Replica["Web replica\n(34.227.74.245)"]
  Deploy --> Job["Sidekiq worker\n(98.93.194.3)"]
```
| Component | Detail |
| --- | --- |
| CI/CD | GitHub Actions (deploy.yml) |
| Container registry | Amazon ECR (211125575460.dkr.ecr.us-east-1.amazonaws.com) |
| Image | bioloupe/datalake (unified: web, Sidekiq, Chrome) |
| Orchestration | Kamal 2.x |
| Primary server | 98.93.194.3 (web + Sidekiq) |
| Replica server | 34.227.74.245 (web replica) |
| SSH user | ubuntu |
| Deploy timeout | 1,200 seconds |
| Health check | GET /up every 1 second, 5-second timeout |
| Logging | AWS CloudWatch (/bioloupe/data-lake) |
| Secrets | AWS Secrets Manager (data-lake/prod) |

Kamal deploys three container roles from config/deploy.yml.

| Role | Host | Resources | Command |
| --- | --- | --- | --- |
| web | 98.93.194.3 | 1.5 CPUs, 1.75 GB | Rails/Puma (default) |
| web_replica | 34.227.74.245 | 2 GB | Rails/Puma (proxy enabled) |
| job | 98.93.194.3 | 0.25 CPUs, 512 MB | bundle exec sidekiq -C ./config/sidekiq.yml |

The Dockerfile uses a multi-stage build with three stages (the first two run in parallel):

  1. gems — Ruby dependencies + Bootsnap cache
  2. js-build — Node.js dependencies + Webpack build
  3. unified — Combines gems and JS assets, compiles Sprockets, installs Chrome for headless scraping

Build caching uses GitHub Actions cache (type=gha). Cached builds complete in 2-4 minutes.

Secrets sync from AWS Secrets Manager (data-lake/prod) during the deploy stage.

  1. GitHub Actions fetches the secret JSON via OIDC-federated AWS credentials
  2. Converts to KEY=VALUE format
  3. Writes to .kamal/secrets
  4. Kamal injects into containers per config/deploy.yml
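Step 2 reduces to a small transform; a minimal Ruby sketch, assuming the secret is a flat JSON object of string keys and values (the function name is illustrative, not the actual workflow code):

```ruby
require "json"

# Sketch of step 2: convert a Secrets Manager JSON payload into the
# KEY=VALUE lines Kamal reads from .kamal/secrets. Assumes a flat
# JSON object (no nested values).
def to_kamal_secrets(secret_json)
  JSON.parse(secret_json).map { |key, value| "#{key}=#{value}" }.join("\n")
end

payload = '{"DB_HOST":"db.internal","REDIS_URL":"redis://cache:6379/0"}'
puts to_kamal_secrets(payload)
# DB_HOST=db.internal
# REDIS_URL=redis://cache:6379/0
```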

Key secret categories: Database (DB_*), AACT (AACT_DB_*), ChEMBL (CHEMBL_DB_*), AWS (AWS_*, S3_*, ATHENA_*), OpenAI (OPENAI_API_KEY, OPENROUTER_API_KEY), Redis (REDIS_URL), Rails (SECRET_KEY_BASE, RAILS_MASTER_KEY), Auth (GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET), Email (BREVO_API_KEY, BREVO_*), Monitoring (Airbrake, New Relic), News APIs (Cision, FMP), Slack (SLACK_*), Client apps (APP_CLIENT_HOST, DATA_GOV_TOKEN, MULTI_APP_SECRET_KEY), Conference scrapers (ASCO_*).

```shell
# Standard deploy: automatic on every push to main; no manual action needed.

# Manual deploy from the GitHub CLI
gh workflow run deploy.yml

# Manual deploy with Kamal
kamal deploy --skip-push --verbose

# Rollback
kamal app containers          # list recent versions
kamal rollback <container_id>

# Restart a single role
kamal app boot -r job         # restart Sidekiq
kamal app boot -r web         # restart web

# Access the Rails console in production
kamal app exec -r web --interactive "rails console"

# Tail logs
kamal app logs -r web         # web logs
kamal app logs -r job         # Sidekiq logs
```

The workflow sets concurrency: deploy-production with cancel-in-progress: false, so multiple pushes to main queue their deployments and no deployment is cancelled mid-flight.

| Endpoint / Tool | Purpose |
| --- | --- |
| GET /up | Health check for Kamal proxy and load balancer |
| /admin/sidekiq | Sidekiq web UI (admin-only) |
| CloudWatch /bioloupe/data-lake | Container stdout/stderr logs |
| Airbrake | Exception tracking |
| New Relic | APM metrics |

Materialized views. RefreshViewsJob runs 4x daily. If it fails, API data goes stale. Monitor for long-running view refresh queries.
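One lightweight monitor for this is a staleness check against the 4x-daily schedule; a sketch, assuming a 6-hour interval plus a grace period (names and thresholds here are illustrative, not the actual monitoring code):

```ruby
# Illustrative staleness check for a view refreshed 4x daily:
# flag it when the last refresh is more than one interval plus
# a grace period behind the current time.
REFRESH_INTERVAL = 6 * 60 * 60 # seconds; 4x daily
GRACE            = 30 * 60     # allow a slow refresh to finish

def stale?(last_refreshed_at, now: Time.now)
  (now - last_refreshed_at) > (REFRESH_INTERVAL + GRACE)
end

stale?(Time.now - 7 * 60 * 60) # => true  (missed a cycle)
stale?(Time.now - 60 * 60)     # => false (refreshed an hour ago)
```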

Sidekiq queue. If the default queue backs up, check for stuck jobs or OOM kills on the Sidekiq container (512 MB limit).

Workflow instances. Stale workflows block scheduled runs. Check /admin/workflow_instances for workflows stuck in a running state.

AWS Batch jobs. If jobs stay in RUNNABLE, check EC2 capacity in the AWS Batch compute environment.

| Symptom | Cause | Fix |
| --- | --- | --- |
| Deploy hangs after "kamal deploy" | Health check at /up failing | SSH in and check logs: kamal app logs -r web |
| Deploy fails at "Sync secrets" | AWS OIDC role misconfigured | Verify the IAM role github-actions-datalake-deploy. Check data-lake/prod in Secrets Manager. |
| "Kamal lock" error | Previous deploy did not release the lock | kamal lock release --verbose, then retry |
| ECR login fails | OIDC token expired or region mismatch | Check that AWS_REGION is us-east-1. Retry the workflow. |
| Container returns 502 | Puma did not bind to port 3000 in time | Increase deploy_timeout in config/deploy.yml (currently 1,200 s = 20 min, though the inline comment says 10 min) |
| Asset 404 after deploy | In-flight requests hit old assets | Wait for the rolling restart to finish. Kamal bridges via asset_path. |
| Symptom | Cause | Fix |
| --- | --- | --- |
| Jobs not running | Sidekiq container not started | kamal app boot -r job, then check kamal app containers |
| Sidekiq OOM-killed | A job exceeded the 512 MB memory limit | Check which job ran. Increase memory in config/deploy.yml. |
| Duplicate job execution | Cron scheduled a run before the previous one finished | Add a concurrency guard (see ClinicalTrialsWorkflowSchedulerJob). |
| Redis::CannotConnectError | REDIS_URL wrong or Redis down | Check REDIS_URL in secrets. Verify reachability. |
| AWS Batch stuck in RUNNABLE | No EC2 capacity | Check the Batch console. Scale up maxvCpus. |
| AWS Batch FAILED | Container error | Check CloudWatch at /bioloupe/data-lake. Match the log stream to the Batch job ID. |
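The concurrency-guard pattern referenced above can be sketched as follows; the running lambda is a stand-in for the real database check inside ClinicalTrialsWorkflowSchedulerJob (something like WorkflowInstance.where(completed_at: nil).exists?), and all names are illustrative:

```ruby
# Illustrative concurrency guard: skip scheduling when an instance of
# the same workflow type is still running. `running` stands in for a
# database query in the real scheduler.
def schedule_workflow(type, running:)
  return :skipped if running.call(type)
  yield type
  :started
end

active  = ["clinical_trials"]
running = ->(type) { active.include?(type) }

schedule_workflow("clinical_trials", running: running) { |t| } # => :skipped
schedule_workflow("chembl", running: running) { |t| }          # => :started
```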
| Symptom | Cause | Fix |
| --- | --- | --- |
| "Workflow already running" | Previous instance not completed | ActiveAdmin > Workflow Instances > complete the stale run |
| CT workflow skipped | Concurrency guard found a running instance | Complete the stale workflow, then: rails runner "ClinicalTrialsWorkflowSchedulerJob.perform_now" |
| ChiCTR returns 0 trials | Anti-bot protection or site change | Check the site manually. Try --sample 5. Check the Ferrum/Chrome version. |
| CT.gov version sync fails | API rate limiting (HTTP 429) | Wait 10-15 minutes. The service uses RetryWithBackoff. |
| Study plan extraction hangs | LLM API timeout | Re-run with --limit. Check OpenAI status. |
| Term matching returns bad results | NCI API errors | Check NCI availability. Use --search-method=semantic to bypass. |
| FDA download 403 | API key expired | Update API_KEY in regulatory/fda.thor. |
| Materialized views stale | RefreshViewsJob failed | kamal app exec -r web "bundle exec thor views:refresh_all". Check for blocking queries. |
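RetryWithBackoff itself is not reproduced here; a minimal sketch of the pattern it implements (exponential delays with capped attempts; parameter names are assumptions):

```ruby
# Minimal retry-with-backoff sketch: retry a block with exponentially
# growing delays (base, 2*base, 4*base, ...) up to max_attempts.
# `sleeper` is injectable so tests can skip the real sleep.
def with_backoff(max_attempts: 5, base: 1.0, sleeper: method(:sleep))
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    raise if attempts >= max_attempts
    sleeper.call(base * (2**(attempts - 1)))
    retry
  end
end

# Succeeds on the third attempt; base: 0 skips real delays for this demo.
with_backoff(base: 0) { |n| raise "HTTP 429" if n < 3; n } # => 3
```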
| Symptom | Cause | Fix |
| --- | --- | --- |
| ConnectionNotEstablished | Pool exhausted or RDS restarted | Check DB_HOST reachability. Restart containers: kamal app boot. |
| AACT connection refused | Credentials expired (they rotate quarterly) | Get new credentials from the AACT website. Update Secrets Manager. Redeploy. |
| Slow queries on views | Views need a refresh, or indexes are missing | Run RefreshViewsJob. Check EXPLAIN ANALYZE. |
| Migration timeout | Materialized view creation is slow | Run the migration manually first: kamal app exec -r web "rails db:migrate" |
| Symptom | Cause | Fix |
| --- | --- | --- |
| OpenAI::RateLimitError | Token or request limit hit | Reduce --parallelism. Wait. Consider batch mode. |
| Malformed JSON from the LLM | Model emitted non-JSON | The service retries 3 times. Check the JSON schema definition. |
| Unexpected LLM results | Model version change or prompt drift | Pin the model version (e.g., gpt-4.1). Compare with prior runs. |
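For the malformed-JSON row, a tolerant parse that strips a Markdown code fence before handing text to JSON.parse covers the most common failure; this is a sketch with assumed names, not the actual service:

```ruby
require "json"

# Illustrative tolerant parser for LLM output: models sometimes wrap
# JSON in a Markdown code fence, so strip it before parsing. The real
# service additionally retries three times and validates the result
# against a schema.
FENCE = "`" * 3 # triple backtick

def parse_llm_json(text)
  cleaned = text.strip
  cleaned = cleaned.sub(/\A#{FENCE}(?:json)?\s*/, "")
                   .sub(/#{FENCE}\z/, "")
                   .strip
  JSON.parse(cleaned)
end

raw = ["#{FENCE}json", '{"nct_id": "NCT01234567", "phase": 3}', FENCE].join("\n")
parse_llm_json(raw) # => {"nct_id"=>"NCT01234567", "phase"=>3}
```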
```shell
# Check all container statuses
kamal app containers

# Tail Sidekiq logs in real time
kamal app logs -r job -f

# Check Sidekiq queue size
kamal app exec -r web "rails runner 'puts Sidekiq::Queue.new.size'"

# View running workflow instances
kamal app exec -r web "rails runner \"
  WorkflowInstance.where(completed_at: nil).each do |wi|
    puts [wi.id, wi.workflow_type, wi.started_at].join(' | ')
  end
\""

# Check database connectivity
kamal app exec -r web "rails runner 'puts ActiveRecord::Base.connection.active?'"

# Force-refresh a specific materialized view
kamal app exec -r web "rails runner \"
  Scenic.database.refresh_materialized_view('view_name', concurrently: true, cascade: false)
\""

# Check disk usage on servers
kamal app exec -r web "df -h"
```

If standard fixes do not resolve the issue:

  1. Airbrake — Check for exception details and stack traces
  2. CloudWatch — Check /bioloupe/data-lake for container output
  3. New Relic — Check for performance degradation or error spikes
  4. ActiveAdmin — Review workflow instances for step-level errors at /admin/workflow_instances
  5. AWS Batch console — Check job status, compute environment capacity
  • Architecture — Understand the system design behind what you are operating
  • Get started — Set up a local environment for debugging