Operations
Everything an on-call engineer needs to run Data Gov in production. This page covers deployment, monitoring, troubleshooting, and the commands you reach for at 2 AM.
Deployment
Every push to main triggers a GitHub Actions workflow that builds a Docker image, pushes it to Amazon ECR, and deploys via Kamal to two EC2 servers. The full pipeline takes 5-10 minutes.
```mermaid
flowchart LR
    Push["Push to main"] --> Build["Build Docker\nimage (Buildx)"]
    Build --> ECR["Push to ECR\n:sha + :latest"]
    ECR --> Secrets["Sync secrets\nfrom AWS SM"]
    Secrets --> Deploy["kamal deploy\n--skip-push"]
    Deploy --> Web["Web server\n(98.93.194.3)"]
    Deploy --> Replica["Web replica\n(34.227.74.245)"]
    Deploy --> Job["Sidekiq worker\n(98.93.194.3)"]
```
Infrastructure
| Component | Detail |
|---|---|
| CI/CD | GitHub Actions (deploy.yml) |
| Container registry | Amazon ECR (211125575460.dkr.ecr.us-east-1.amazonaws.com) |
| Image | bioloupe/datalake (unified: web, Sidekiq, Chrome) |
| Orchestration | Kamal 2.x |
| Primary server | 98.93.194.3 (web + Sidekiq) |
| Replica server | 34.227.74.245 (web replica) |
| SSH user | ubuntu |
| Deploy timeout | 1,200 seconds |
| Health check | GET /up every 1 second, 5-second timeout |
| Logging | AWS CloudWatch (/bioloupe/data-lake) |
| Secrets | AWS Secrets Manager (data-lake/prod) |
Server roles
Kamal deploys three container roles from config/deploy.yml.
| Role | Host | Resources | Command |
|---|---|---|---|
| web | 98.93.194.3 | 1.5 CPUs, 1.75 GB | Rails/Puma (default) |
| web_replica | 34.227.74.245 | 2 GB | Rails/Puma (proxy enabled) |
| job | 98.93.194.3 | 0.25 CPUs, 512 MB | bundle exec sidekiq -C ./config/sidekiq.yml |
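The role layout above might look roughly like this in config/deploy.yml. This is an illustrative sketch, not the actual file: hosts and the Sidekiq command come from the table, while the exact option keys are assumptions based on Kamal 2's server-role syntax.

```yaml
servers:
  web:
    hosts:
      - 98.93.194.3
    options:
      cpus: "1.5"
      memory: 1.75g
  web_replica:
    hosts:
      - 34.227.74.245
    proxy: true
    options:
      memory: 2g
  job:
    hosts:
      - 98.93.194.3
    cmd: bundle exec sidekiq -C ./config/sidekiq.yml
    options:
      cpus: "0.25"
      memory: 512m
```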
Docker build
The Dockerfile uses a multi-stage build with three parallel stages:
- gems — Ruby dependencies + Bootsnap cache
- js-build — Node.js dependencies + Webpack build
- unified — Combines gems and JS assets, compiles Sprockets, installs Chrome for headless scraping
Build caching uses GitHub Actions cache (type=gha). Cached builds complete in 2-4 minutes.
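Structurally, the three stages could be sketched as below. This is a simplified illustration only: the stage names match the list above, but the base images and individual build steps are assumptions, not the project's actual Dockerfile.

```dockerfile
# Stage 1: Ruby dependencies + Bootsnap cache
FROM ruby:3.3-slim AS gems
WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN bundle install && bundle exec bootsnap precompile --gemfile

# Stage 2: Node.js dependencies + Webpack build
FROM node:20-slim AS js-build
WORKDIR /app
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile
COPY . .
RUN yarn build

# Stage 3: combine gems and JS output, compile Sprockets assets,
# install Chrome for headless scraping (install steps omitted)
FROM ruby:3.3-slim AS unified
WORKDIR /app
COPY --from=gems /usr/local/bundle /usr/local/bundle
COPY --from=js-build /app/public ./public
COPY . .
RUN bundle exec rails assets:precompile
```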
Secrets management
Secrets sync from AWS Secrets Manager (data-lake/prod) during the deploy stage.
- GitHub Actions fetches the secret JSON via OIDC-federated AWS credentials
- Converts to KEY=VALUE format
- Writes to .kamal/secrets
- Kamal injects into containers per config/deploy.yml
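The JSON-to-dotenv conversion is a small transformation: flatten the secret payload into KEY=VALUE lines, the format Kamal reads from .kamal/secrets. A minimal Ruby sketch of the idea (a hypothetical helper, not the actual workflow step, which does this in the GitHub Actions job):

```ruby
require "json"

# Convert a Secrets Manager JSON payload into dotenv-style KEY=VALUE lines.
def secrets_to_dotenv(secret_json)
  JSON.parse(secret_json).map { |key, value| "#{key}=#{value}" }.join("\n")
end

payload = '{"REDIS_URL":"redis://localhost:6379/0","SECRET_KEY_BASE":"abc123"}'
puts secrets_to_dotenv(payload)
# REDIS_URL=redis://localhost:6379/0
# SECRET_KEY_BASE=abc123
```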
Key secret categories:
- Database (DB_*)
- AACT (AACT_DB_*)
- ChEMBL (CHEMBL_DB_*)
- AWS (AWS_*, S3_*, ATHENA_*)
- OpenAI (OPENAI_API_KEY, OPENROUTER_API_KEY)
- Redis (REDIS_URL)
- Rails (SECRET_KEY_BASE, RAILS_MASTER_KEY)
- Auth (GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET)
- Email (BREVO_API_KEY, BREVO_*)
- Monitoring (Airbrake, New Relic)
- News APIs (Cision, FMP)
- Slack (SLACK_*)
- Client apps (APP_CLIENT_HOST, DATA_GOV_TOKEN, MULTI_APP_SECRET_KEY)
- Conference scrapers (ASCO_*)
Deploy commands
```sh
# Standard deploy (automatic on push to main)
# No manual action needed.

# Manual deploy from GitHub CLI
gh workflow run deploy.yml

# Manual deploy with Kamal
kamal deploy --skip-push --verbose

# Rollback
kamal app containers          # list recent versions
kamal rollback <container_id>

# Restart a single role
kamal app boot -r job         # restart Sidekiq
kamal app boot -r web         # restart web

# Access Rails console in production
kamal app exec -r web --interactive "rails console"

# Tail logs
kamal app logs -r web         # web logs
kamal app logs -r job         # Sidekiq logs
```
CI/CD concurrency
The workflow uses concurrency: deploy-production with cancel-in-progress: false. Multiple pushes to main queue deployments. No deployment cancels mid-flight.
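In the workflow file, that setting corresponds to a standard GitHub Actions concurrency block (group name taken from the text above; the surrounding workflow is omitted):

```yaml
concurrency:
  group: deploy-production
  cancel-in-progress: false
```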
Monitoring
Monitoring endpoints
| Endpoint / Tool | Purpose |
|---|---|
| GET /up | Health check for Kamal proxy and load balancer |
| /admin/sidekiq | Sidekiq web UI (admin-only) |
| CloudWatch /bioloupe/data-lake | Container stdout/stderr logs |
| Airbrake | Exception tracking |
| New Relic | APM metrics |
What to watch
Materialized views. RefreshViewsJob runs 4x daily. If it fails, API data goes stale. Monitor for long-running view refresh queries.
Sidekiq queue. If the default queue backs up, check for stuck jobs or OOM kills on the Sidekiq container (512 MB limit).
Workflow instances. Stale workflows block scheduled runs. Check /admin/workflow_instances for workflows stuck in a running state.
AWS Batch jobs. If jobs stay in RUNNABLE, check EC2 capacity in the AWS Batch compute environment.
Troubleshooting
Deployment issues
| Symptom | Cause | Fix |
|---|---|---|
| Deploy hangs after “kamal deploy” | Health check at /up failing | SSH in, check logs: kamal app logs -r web |
| Deploy fails at “Sync secrets” | AWS OIDC role misconfigured | Verify IAM role github-actions-datalake-deploy. Check data-lake/prod in Secrets Manager. |
| "Kamal lock" error | Previous deploy did not release lock | kamal lock release --verbose, then retry |
| ECR login fails | OIDC token expired or region mismatch | Check AWS_REGION is us-east-1. Retry workflow. |
| Container returns 502 | Puma did not bind to port 3000 in time | Increase deploy_timeout in config/deploy.yml (currently 1200s = 20 min, though the inline comment says 10 min) |
| Asset 404 after deploy | In-flight requests hit old assets | Wait for rolling restart. Kamal bridges via asset_path. |
Sidekiq and job issues
| Symptom | Cause | Fix |
|---|---|---|
| Jobs not running | Sidekiq container not started | kamal app boot -r job. Check kamal app containers. |
| Sidekiq OOM killed | Job exceeded 512 MB memory | Check which job ran. Increase memory in config/deploy.yml. |
| Duplicate job execution | Cron scheduled before previous finished | Add concurrency guard (see ClinicalTrialsWorkflowSchedulerJob). |
| "Redis::CannotConnectError" | Redis URL wrong or server down | Check REDIS_URL in secrets. Verify reachability. |
| AWS Batch stuck in RUNNABLE | No EC2 capacity | Check Batch console. Scale up maxvCpus. |
| AWS Batch FAILED | Container error | Check CloudWatch at /bioloupe/data-lake. Match log stream to Batch job ID. |
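The concurrency guard mentioned for duplicate job execution follows a simple pattern: before doing work, check for an incomplete prior run and skip if one exists. Below is a self-contained sketch of the idea in plain Ruby with an in-memory registry; the real guard in ClinicalTrialsWorkflowSchedulerJob presumably checks WorkflowInstance records instead, so treat names and mechanics here as illustrative.

```ruby
# Simplified concurrency guard: skip a scheduled run while a previous
# run of the same workflow is still in flight.
class WorkflowGuard
  def initialize
    @running = {}  # workflow name => true while in flight
  end

  def run(name)
    return :skipped if @running[name]  # previous run still active

    @running[name] = true
    begin
      yield
      :ran
    ensure
      @running.delete(name)            # release the guard on completion
    end
  end
end

guard = WorkflowGuard.new
guard.run("clinical_trials") do
  # real workflow work would go here
end
```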
Pipeline issues
| Symptom | Cause | Fix |
|---|---|---|
| "Workflow already running" | Previous instance not completed | ActiveAdmin > Workflow Instances > Complete the stale run. |
| CT workflow skipped | Concurrency guard found running instance | Complete stale workflow. Then: rails runner "ClinicalTrialsWorkflowSchedulerJob.perform_now" |
| ChiCTR returns 0 trials | Anti-bot protection or site change | Check site manually. Try --sample 5. Check Ferrum/Chrome version. |
| CT.gov version sync fails | API rate limiting (HTTP 429) | Wait 10-15 min. Service uses RetryWithBackoff. |
| Study plan extraction hangs | LLM API timeout | Re-run with --limit. Check OpenAI status. |
| Term matching bad results | NCI API errors | Check NCI availability. Use --search-method=semantic to bypass. |
| FDA download 403 | API key expired | Update API_KEY in regulatory/fda.thor. |
| Materialized views stale | RefreshViewsJob failed | kamal app exec -r web "bundle exec thor views:refresh_all". Check for blocking queries. |
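RetryWithBackoff, referenced in the rate-limiting row, is an exponential backoff pattern: wait base * 2^attempt between tries. A generic sketch follows; the real service's interface and parameters are assumptions, and the sleeper is injectable so the delay schedule can be exercised without actually waiting.

```ruby
# Retry a block with exponential backoff (1x, 2x, 4x, ... the base delay).
def retry_with_backoff(max_attempts: 5, base: 1.0, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    raise if attempts >= max_attempts      # out of retries: re-raise
    sleeper.call(base * (2**(attempts - 1)))
    retry
  end
end

calls = 0
result = retry_with_backoff(base: 0.01) do
  calls += 1
  raise "HTTP 429" if calls < 3  # fail twice, then succeed
  "ok"
end
# result == "ok"
```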
Database issues
| Symptom | Cause | Fix |
|---|---|---|
| "ConnectionNotEstablished" | Pool exhausted or RDS restart | Check DB_HOST reachability. Restart containers: kamal app boot. |
| AACT connection refused | Credentials expired (they rotate quarterly) | Get new credentials from AACT website. Update Secrets Manager. Redeploy. |
| Slow queries on views | Views need refresh or missing indexes | Run RefreshViewsJob. Check EXPLAIN ANALYZE. |
| Migration timeout | Materialized view creation slow | Run migration manually first: kamal app exec -r web "rails db:migrate". |
LLM / OpenAI issues
| Symptom | Cause | Fix |
|---|---|---|
| "OpenAI::RateLimitError" | Token/request limit hit | Reduce --parallelism. Wait. Consider batch mode. |
| Malformed JSON from LLM | Model hallucinated non-JSON | Service retries 3 times. Check JSON schema definition. |
| Unexpected LLM results | Model version change or prompt drift | Pin model version (e.g., gpt-4.1). Compare with prior runs. |
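The three-retry behavior for malformed JSON boils down to: call the model, attempt to parse, and re-ask on failure. A minimal sketch of the pattern (call_llm is a hypothetical callable standing in for the actual service's API client):

```ruby
require "json"

# Ask an LLM for JSON; retry on malformed output, up to `max_attempts` calls.
def fetch_json(call_llm, max_attempts: 3)
  attempts = 0
  begin
    JSON.parse(call_llm.call)
  rescue JSON::ParserError
    attempts += 1
    raise if attempts >= max_attempts  # give up after the final attempt
    retry
  end
end

responses = ['not json at all', '{"status": "ok"}']
result = fetch_json(-> { responses.shift })
# result == {"status" => "ok"}
```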
Quick diagnostic commands
```sh
# Check all container statuses
kamal app containers

# Tail Sidekiq logs in real time
kamal app logs -r job -f

# Check Sidekiq queue size
kamal app exec -r web "rails runner 'puts Sidekiq::Queue.new.size'"

# View running workflow instances
kamal app exec -r web "rails runner \"WorkflowInstance.where(completed_at: nil).each { |wi| puts [wi.id, wi.workflow_type, wi.started_at].join(' | ') }\""

# Check database connectivity
kamal app exec -r web "rails runner 'puts ActiveRecord::Base.connection.active?'"

# Force-refresh a specific materialized view
kamal app exec -r web "rails runner \"Scenic.database.refresh_materialized_view('view_name', concurrently: true, cascade: false)\""

# Check disk usage on servers
kamal app exec -r web "df -h"
```
Escalation path
If standard fixes do not resolve the issue:
- Airbrake — Check for exception details and stack traces
- CloudWatch — Check /bioloupe/data-lake for container output
- New Relic — Check for performance degradation or error spikes
- ActiveAdmin — Review workflow instances for step-level errors at /admin/workflow_instances
- AWS Batch console — Check job status, compute environment capacity
Next steps
- Architecture — Understand the system design behind what you are operating
- Get started — Set up a local environment for debugging