Operations
Everything an on-call engineer needs to run Data Gov in production. This page covers deployment, monitoring, troubleshooting, and the commands you reach for at 2 AM.
Deployment
Every push to main triggers a GitHub Actions workflow that builds a Docker image, pushes it to Amazon ECR, and deploys via Kamal to two EC2 servers. The full pipeline takes 5-10 minutes.
```mermaid
flowchart LR
    Push["Push to main"] --> Build["Build Docker\nimage (Buildx)"]
    Build --> ECR["Push to ECR\n:sha + :latest"]
    ECR --> Secrets["Sync secrets\nfrom AWS SM"]
    Secrets --> Deploy["kamal deploy\n--skip-push"]
    Deploy --> Web["Web server\n(98.93.194.3)"]
    Deploy --> Replica["Web replica\n(34.227.74.245)"]
    Deploy --> Job["Sidekiq worker\n(98.93.194.3)"]
```
Infrastructure
| Component | Detail |
|---|---|
| CI/CD | GitHub Actions (deploy.yml) |
| Container registry | Amazon ECR (211125575460.dkr.ecr.us-east-1.amazonaws.com) |
| Image | bioloupe/datalake (unified: web, Sidekiq, Chrome) |
| Orchestration | Kamal 2.x |
| Primary server | 98.93.194.3 (web + Sidekiq) |
| Replica server | 34.227.74.245 (web replica) |
| SSH user | ubuntu |
| Deploy timeout | 1,200 seconds |
| Health check | GET /up every 1 second, 5-second timeout |
| Logging | AWS CloudWatch (/bioloupe/data-lake) |
| Secrets | AWS Secrets Manager (data-lake/prod) |
Server roles
Kamal deploys three container roles from config/deploy.yml.
| Role | Host | Resources | Command |
|---|---|---|---|
| web | 98.93.194.3 | 1.5 CPUs, 1.75 GB | Rails/Puma (default) |
| web_replica | 34.227.74.245 | 2 GB | Rails/Puma (proxy enabled) |
| job | 98.93.194.3 | 0.25 CPUs, 512 MB | bundle exec sidekiq -C ./config/sidekiq.yml |
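The role layout above might look roughly like this in config/deploy.yml. This is an illustrative sketch, not the actual file: hosts and the Sidekiq command come from the table, while the exact option keys are assumptions based on Kamal 2's server-role syntax.

```yaml
servers:
  web:
    hosts:
      - 98.93.194.3
    options:
      cpus: "1.5"
      memory: 1.75g
  web_replica:
    hosts:
      - 34.227.74.245
    proxy: true
    options:
      memory: 2g
  job:
    hosts:
      - 98.93.194.3
    cmd: bundle exec sidekiq -C ./config/sidekiq.yml
    options:
      cpus: "0.25"
      memory: 512m
```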
Docker build
The Dockerfile uses a multi-stage build with three parallel stages:
- gems — Ruby dependencies + Bootsnap cache
- js-build — Node.js dependencies + Webpack build
- unified — Combines gems and JS assets, compiles Sprockets, installs Chrome for headless scraping
Build caching uses GitHub Actions cache (type=gha). Cached builds complete in 2-4 minutes.
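Structurally, the three stages could be sketched as below. This is a simplified illustration only: the stage names match the list above, but the base images and individual build steps are assumptions, not the project's actual Dockerfile.

```dockerfile
# Stage 1: Ruby dependencies + Bootsnap cache
FROM ruby:3.3-slim AS gems
WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN bundle install && bundle exec bootsnap precompile --gemfile

# Stage 2: Node.js dependencies + Webpack build
FROM node:20-slim AS js-build
WORKDIR /app
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile
COPY . .
RUN yarn build

# Stage 3: combine gems and JS output, compile Sprockets assets,
# install Chrome for headless scraping (install steps omitted)
FROM ruby:3.3-slim AS unified
WORKDIR /app
COPY --from=gems /usr/local/bundle /usr/local/bundle
COPY --from=js-build /app/public ./public
COPY . .
RUN bundle exec rails assets:precompile
```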
Secrets management
Secrets sync from AWS Secrets Manager (data-lake/prod) during the deploy stage.
- GitHub Actions fetches the secret JSON via OIDC-federated AWS credentials
- Converts to KEY=VALUE format
- Writes to .kamal/secrets
- Kamal injects into containers per config/deploy.yml
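The JSON-to-dotenv conversion is a small transformation: flatten the secret payload into KEY=VALUE lines, the format Kamal reads from .kamal/secrets. A minimal Ruby sketch of the idea (a hypothetical helper, not the actual workflow step, which does this in the GitHub Actions job):

```ruby
require "json"

# Convert a Secrets Manager JSON payload into dotenv-style KEY=VALUE lines.
def secrets_to_dotenv(secret_json)
  JSON.parse(secret_json).map { |key, value| "#{key}=#{value}" }.join("\n")
end

payload = '{"REDIS_URL":"redis://localhost:6379/0","SECRET_KEY_BASE":"abc123"}'
puts secrets_to_dotenv(payload)
# REDIS_URL=redis://localhost:6379/0
# SECRET_KEY_BASE=abc123
```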
Key secret categories:
- Database (DB_*)
- AACT (AACT_DB_*)
- ChEMBL (CHEMBL_DB_*)
- AWS (AWS_*, S3_*, ATHENA_*)
- OpenAI (OPENAI_API_KEY, OPENROUTER_API_KEY)
- Redis (REDIS_URL)
- Rails (SECRET_KEY_BASE, RAILS_MASTER_KEY)
- Auth (GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET)
- Email (BREVO_API_KEY, BREVO_*)
- Monitoring (Airbrake, New Relic)
- News APIs (Cision, FMP)
- Slack (SLACK_*)
- Client apps (APP_CLIENT_HOST, DATA_GOV_TOKEN, MULTI_APP_SECRET_KEY)
- Conference scrapers (ASCO_*)
Deploy commands
```sh
# Standard deploy (automatic on push to main)
# No manual action needed.

# Manual deploy from GitHub CLI
gh workflow run deploy.yml

# Manual deploy with Kamal
kamal deploy --skip-push --verbose

# Rollback
kamal app containers          # list recent versions
kamal rollback <container_id>

# Restart a single role
kamal app boot -r job         # restart Sidekiq
kamal app boot -r web         # restart web

# Access Rails console in production
kamal app exec -r web --interactive "rails console"

# Tail logs
kamal app logs -r web         # web logs
kamal app logs -r job         # Sidekiq logs
```
CI/CD concurrency
The workflow uses concurrency: deploy-production with cancel-in-progress: false. Multiple pushes to main queue deployments. No deployment cancels mid-flight.
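In the workflow file, that setting corresponds to a standard GitHub Actions concurrency block (group name taken from the text above; the surrounding workflow is omitted):

```yaml
concurrency:
  group: deploy-production
  cancel-in-progress: false
```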
Monitoring
Monitoring endpoints
| Endpoint / Tool | Purpose |
|---|---|
| GET /up | Health check for Kamal proxy and load balancer |
| /admin/sidekiq | Sidekiq web UI (admin-only) |
| CloudWatch /bioloupe/data-lake | Container stdout/stderr logs |
| Airbrake | Exception tracking |
| New Relic | APM metrics |
What to watch
Materialized views. RefreshViewsJob runs 4x daily. If it fails, API data goes stale. Monitor for long-running view refresh queries.
Sidekiq queue. If the default queue backs up, check for stuck jobs or OOM kills on the Sidekiq container (512 MB limit).
Workflow instances. Stale workflows block scheduled runs. Check /admin/workflow_instances for workflows stuck in a running state.
AWS Batch jobs. If jobs stay in RUNNABLE, check EC2 capacity in the AWS Batch compute environment.
Troubleshooting
Deployment issues
| Symptom | Cause | Fix |
|---|---|---|
| Deploy hangs after “kamal deploy” | Health check at /up failing | SSH in, check logs: kamal app logs -r web |
| Deploy fails at “Sync secrets” | AWS OIDC role misconfigured | Verify IAM role github-actions-datalake-deploy. Check data-lake/prod in Secrets Manager. |
| "Kamal lock" error | Previous deploy did not release lock | kamal lock release --verbose, then retry |
| ECR login fails | OIDC token expired or region mismatch | Check AWS_REGION is us-east-1. Retry workflow. |
| Container returns 502 | Puma did not bind to port 3000 in time | Increase deploy_timeout in config/deploy.yml (currently 1200s = 20 min, though the inline comment says 10 min) |
| Asset 404 after deploy | In-flight requests hit old assets | Wait for rolling restart. Kamal bridges via asset_path. |
Sidekiq and job issues
| Symptom | Cause | Fix |
|---|---|---|
| Jobs not running | Sidekiq container not started | kamal app boot -r job. Check kamal app containers. |
| Sidekiq OOM killed | Job exceeded 512 MB memory | Check which job ran. Increase memory in config/deploy.yml. |
| Duplicate job execution | Cron scheduled before previous finished | Add concurrency guard (see ClinicalTrialsWorkflowSchedulerJob). |
| "Redis::CannotConnectError" | Redis URL wrong or server down | Check REDIS_URL in secrets. Verify reachability. |
| AWS Batch stuck in RUNNABLE | No EC2 capacity | Check Batch console. Scale up maxvCpus. |
| AWS Batch FAILED | Container error | Check CloudWatch at /bioloupe/data-lake. Match log stream to Batch job ID. |
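The concurrency guard mentioned for duplicate job execution follows a simple pattern: before doing work, check for an incomplete prior run and skip if one exists. Below is a self-contained sketch of the idea in plain Ruby with an in-memory registry; the real guard in ClinicalTrialsWorkflowSchedulerJob presumably checks WorkflowInstance records instead, so treat names and mechanics here as illustrative.

```ruby
# Simplified concurrency guard: skip a scheduled run while a previous
# run of the same workflow is still in flight.
class WorkflowGuard
  def initialize
    @running = {}  # workflow name => true while in flight
  end

  def run(name)
    return :skipped if @running[name]  # previous run still active

    @running[name] = true
    begin
      yield
      :ran
    ensure
      @running.delete(name)            # release the guard on completion
    end
  end
end

guard = WorkflowGuard.new
guard.run("clinical_trials") do
  # real workflow work would go here
end
```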
Pipeline issues
| Symptom | Cause | Fix |
|---|---|---|
| "Workflow already running" | Previous instance not completed | ActiveAdmin > Workflow Instances > Complete the stale run. |
| CT workflow skipped | Concurrency guard found running instance | Complete stale workflow. Then: rails runner "ClinicalTrialsWorkflowSchedulerJob.perform_now" |
| ChiCTR returns 0 trials | Anti-bot protection or site change | Check site manually. Try --sample 5. Check Ferrum/Chrome version. |
| CT.gov version sync fails | API rate limiting (HTTP 429) | Wait 10-15 min. Service uses RetryWithBackoff. |
| Study plan extraction hangs | LLM API timeout | Re-run with --limit. Check OpenAI status. |
| Term matching bad results | NCI API errors | Check NCI availability. Use --search-method=semantic to bypass. |
| FDA download 403 | API key expired | Update API_KEY in regulatory/fda.thor. |
| Materialized views stale | RefreshViewsJob failed | kamal app exec -r web "bundle exec thor views:refresh_all". Check for blocking queries. |
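RetryWithBackoff, referenced in the rate-limiting row, is an exponential backoff pattern: wait base * 2^attempt between tries. A generic sketch follows; the real service's interface and parameters are assumptions, and the sleeper is injectable so the delay schedule can be exercised without actually waiting.

```ruby
# Retry a block with exponential backoff (1x, 2x, 4x, ... the base delay).
def retry_with_backoff(max_attempts: 5, base: 1.0, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    raise if attempts >= max_attempts      # out of retries: re-raise
    sleeper.call(base * (2**(attempts - 1)))
    retry
  end
end

calls = 0
result = retry_with_backoff(base: 0.01) do
  calls += 1
  raise "HTTP 429" if calls < 3  # fail twice, then succeed
  "ok"
end
# result == "ok"
```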
Database issues
| Symptom | Cause | Fix |
|---|---|---|
| "ConnectionNotEstablished" | Pool exhausted or RDS restart | Check DB_HOST reachability. Restart containers: kamal app boot. |
| AACT connection refused | Credentials expired (they rotate quarterly) | Get new credentials from AACT website. Update Secrets Manager. Redeploy. |
| Slow queries on views | Views need refresh or missing indexes | Run RefreshViewsJob. Check EXPLAIN ANALYZE. |
| Migration timeout | Materialized view creation slow | Run migration manually first: kamal app exec -r web "rails db:migrate". |
LLM / OpenAI issues
| Symptom | Cause | Fix |
|---|---|---|
| "OpenAI::RateLimitError" | Token/request limit hit | Reduce --parallelism. Wait. Consider batch mode. |
| Malformed JSON from LLM | Model hallucinated non-JSON | Service retries 3 times. Check JSON schema definition. |
| Unexpected LLM results | Model version change or prompt drift | Pin model version (e.g., gpt-4.1). Compare with prior runs. |
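The three-retry behavior for malformed JSON boils down to: call the model, attempt to parse, and re-ask on failure. A minimal sketch of the pattern (call_llm is a hypothetical callable standing in for the actual service's API client):

```ruby
require "json"

# Ask an LLM for JSON; retry on malformed output, up to `max_attempts` calls.
def fetch_json(call_llm, max_attempts: 3)
  attempts = 0
  begin
    JSON.parse(call_llm.call)
  rescue JSON::ParserError
    attempts += 1
    raise if attempts >= max_attempts  # give up after the final attempt
    retry
  end
end

responses = ['not json at all', '{"status": "ok"}']
result = fetch_json(-> { responses.shift })
# result == {"status" => "ok"}
```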
Quick diagnostic commands
```sh
# Check all container statuses
kamal app containers

# Tail Sidekiq logs in real time
kamal app logs -r job -f

# Check Sidekiq queue size
kamal app exec -r web "rails runner 'puts Sidekiq::Queue.new.size'"

# View running workflow instances
kamal app exec -r web "rails runner \"WorkflowInstance.where(completed_at: nil).each { |wi| puts [wi.id, wi.workflow_type, wi.started_at].join(' | ') }\""

# Check database connectivity
kamal app exec -r web "rails runner 'puts ActiveRecord::Base.connection.active?'"

# Force-refresh a specific materialized view
kamal app exec -r web "rails runner \"Scenic.database.refresh_materialized_view('view_name', concurrently: true, cascade: false)\""

# Check disk usage on servers
kamal app exec -r web "df -h"
```
Escalation path
If standard fixes do not resolve the issue:
- Airbrake — Check for exception details and stack traces
- CloudWatch — Check /bioloupe/data-lake for container output
- New Relic — Check for performance degradation or error spikes
- ActiveAdmin — Review workflow instances for step-level errors at /admin/workflow_instances
- AWS Batch console — Check job status, compute environment capacity
Next steps
- Architecture — Understand the system design behind what you are operating
- Get started — Set up a local environment for debugging