News and intelligence

Data Gov turns press releases and scientific publications into structured intelligence. When a company announces a licensing deal, a trial readout, or an FDA submission, the news pipeline extracts the entities, classifies the event, and links everything to the knowledge graph. This page covers the full intelligence layer: news collection, LLM classification, publications ingestion, and deal extraction.

The intelligence problem

Pharmaceutical intelligence hides in unstructured text. A Business Wire press release says “Pfizer and Seagen announce FDA has accepted sBLA for ADCETRIS.” Buried in that sentence are: an organization (Pfizer), a second organization (Seagen), an FDA submission type (sBLA), a brand name (ADCETRIS), and an event type (FDA acceptance). The news pipeline extracts all of this automatically.

flowchart TB
  subgraph Collection["Daily Collection"]
    BW["Business Wire\n(daily 05:00)"]
    Cision["Cision\n(twice daily)"]
    GNW["GlobeNewsWire\n(twice daily)"]
    FIN["Financial APIs\n(daily noon)"]
  end

  subgraph LLM["Weekly LLM Processing"]
    Classify["Classify articles\n(16 categories)"]
    Entities["Extract drug, disease,\norg mentions"]
    FDA_Sub["Detect FDA\nsubmission events"]
    Trials["Identify trial\nresult mentions"]
    Deals["Extract deal\ninformation"]
  end

  subgraph Link["Entity Linking"]
    Drug_Link["Link drugs"]
    Disease_Link["Link diseases"]
    Org_Link["Link orgs"]
    Trial_Link["Link trials"]
  end

  subgraph Output["Structured Output"]
    NDM["news_drug_mentions"]
    NFS["news_fda_submissions"]
    NTM["news_trial_mentions"]
    OH["organisation_histories\n(deals)"]
    DN["diseases_news\n(disease links)"]
  end

  Collection --> Classify --> Entities --> FDA_Sub & Trials & Deals
  Entities --> Drug_Link & Disease_Link & Org_Link
  FDA_Sub --> NFS
  Trials --> NTM --> Trial_Link
  Deals --> OH
  Drug_Link --> NDM
  Disease_Link --> DN

News collection

Four sources feed the news pipeline. Collection jobs run on separate daily schedules via sidekiq-cron.

Source	Job class	Schedule (UTC)	Method
Business Wire	`NewsBusinessWireJob`	Daily 05:00	`BusinessWireService` via AWS Batch
Cision	`NewsCisionJob`	Twice daily 00:00, 16:00	`CisionApiService` (in-process)
GlobeNewsWire	`NewsGlobalnewsWireJob`	Twice daily 03:00, 21:00	`GlobalnewsWireService` via AWS Batch
Financial APIs	`NewsFinancialJob`	Daily 12:00	`FinancialApiService`
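
The schedule in the table can be written as a sidekiq-cron schedule hash. The following is a minimal sketch: the job class names come from the table, while the hash keys (`news_business_wire` and so on) and the `runs_per_day` helper are hypothetical, and the cron strings assume the scheduler evaluates them in UTC.

```ruby
# Hypothetical sidekiq-cron schedule mirroring the table above.
# Job class names come from the table; the entry names are illustrative.
NEWS_SCHEDULE = {
  "news_business_wire"   => { "cron" => "0 5 * * *",    "class" => "NewsBusinessWireJob" },
  "news_cision"          => { "cron" => "0 0,16 * * *", "class" => "NewsCisionJob" },
  "news_globalnews_wire" => { "cron" => "0 3,21 * * *", "class" => "NewsGlobalnewsWireJob" },
  "news_financial"       => { "cron" => "0 12 * * *",   "class" => "NewsFinancialJob" }
}.freeze

# In a Rails initializer this hash could be handed to sidekiq-cron:
#   Sidekiq::Cron::Job.load_from_hash(NEWS_SCHEDULE)

# Count scheduled runs per day from the comma-separated hour field.
def runs_per_day(cron)
  cron.split[1].split(",").size
end
```

A helper such as `runs_per_day` is only a sanity check that the cron strings match the "daily" and "twice daily" wording in the table.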

Each article lands in the news table with its release_id (unique per source), full HTML body in data, source metadata in JSONB columns, and the source enum (business_wire, globalnews_wire, cision, financial).

Duplicate detection runs on title match before creation. Only articles matching pharmaceutical relevance criteria (is_pharma: true) proceed to LLM processing.
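
The gate described above can be sketched in plain Ruby. This is an illustration under assumptions, not the pipeline's code: the `Article` struct, the title normalization (trim, collapse whitespace, lowercase), and both helper names are hypothetical. The source only states that duplicates are detected by title match and that `is_pharma` articles proceed.

```ruby
require "set"

# Hypothetical in-memory stand-in for a row in the news table.
Article = Struct.new(:source, :release_id, :title, :is_pharma, keyword_init: true)

# Assumed normalization: trim, collapse whitespace, lowercase.
def normalize_title(title)
  title.strip.gsub(/\s+/, " ").downcase
end

# Keep only articles whose normalized title has not been seen.
# Set#add? returns nil when the element is already present.
def reject_duplicates(incoming, seen_titles)
  incoming.select { |article| seen_titles.add?(normalize_title(article.title)) }
end

# Only pharma-relevant articles continue to LLM processing.
def llm_candidates(articles)
  articles.select(&:is_pharma)
end
```

Passing the same `seen_titles` set across batches makes the check cover earlier runs as well as duplicates inside one batch.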

LLM classification and extraction

The NewsLlmWorkflow (39 steps) runs weekly on Sunday at 22:00 UTC. It processes all unclassified news articles through GPT-4.1.

Article classification

Each article receives one or more category labels from 16 defined categories:

Regulatory, Business Development, Business Deal, Leadership Changes, Quarterly Results, Trial Development Updates, Trial Results, Publication, Research Data Presentation, Market Forecast and Analysis, Grant/Award/Recognition, Corporate and Business Updates, Stockholder/Shareholder Announcements, Funding and Financing, Product Launch and Marketing, Patents and Intellectual Property.

Classification results land in news.category (JSONB array). The therapeutic_areas JSONB array tags articles by disease area (Oncology, Malignant Hematology, Non-Malignant Hematology).
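
Because the category set is closed, labels returned by the model can be checked against it before they are written to `news.category`. The sketch below is hypothetical: the constant and the `partition_labels` helper are illustrative names, and the source does not say whether the pipeline performs this check.

```ruby
# The 16 categories listed above.
CATEGORIES = [
  "Regulatory", "Business Development", "Business Deal", "Leadership Changes",
  "Quarterly Results", "Trial Development Updates", "Trial Results", "Publication",
  "Research Data Presentation", "Market Forecast and Analysis",
  "Grant/Award/Recognition", "Corporate and Business Updates",
  "Stockholder/Shareholder Announcements", "Funding and Financing",
  "Product Launch and Marketing", "Patents and Intellectual Property"
].freeze

# Split model output into [recognized, unrecognized] labels,
# trimming whitespace and dropping repeats first.
def partition_labels(labels)
  labels.map(&:strip).uniq.partition { |label| CATEGORIES.include?(label) }
end
```

Keeping the unrecognized labels, rather than silently dropping them, makes it easier to spot prompt drift.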

Entity extraction

GPT-4.1 extracts structured mentions from the article text. Results land in news.llm_data JSONB.

Drug mentions create news_drug_mentions rows. Each carries a drug_name (raw extracted text) and nullable drug_id / brand_drug_id FKs that link after entity resolution.

Disease mentions link through the diseases_news join table after the NewsDiseaseWorkflow matches extracted disease names to canonical entities.

Organization mentions link through the news_organisations join table.
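
The nullable foreign keys on a drug mention can be illustrated with a small resolver. Everything here is a sketch under assumptions: the lookup tables and their IDs are invented for the example, and exact lowercase matching stands in for whatever entity resolution the pipeline actually uses.

```ruby
# Mirrors the columns named above: raw text plus two nullable links.
DrugMention = Struct.new(:drug_name, :drug_id, :brand_drug_id, keyword_init: true)

# Hypothetical lookup tables; the IDs are made up for illustration.
GENERIC_DRUGS = { "brentuximab vedotin" => 101 }.freeze
BRAND_DRUGS   = { "adcetris" => 7 }.freeze

# Keep the raw text as extracted; set either key only when a match exists.
def resolve_mention(raw_name)
  key = raw_name.strip.downcase
  DrugMention.new(
    drug_name: raw_name,
    drug_id: GENERIC_DRUGS[key],
    brand_drug_id: BRAND_DRUGS[key]
  )
end
```

An unmatched mention keeps both keys as `nil`, so the raw text survives for a later resolution pass.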

FDA submission detection

When an article discusses an FDA filing, the pipeline creates news_fda_submissions records. Each captures:

Application type (NDA, sNDA, BLA, sBLA, 351(k))
PDUFA target action date
Change type (new_indication, dosage_form, formulation)
Indication data with potential disease matches
Computed approval status (approved, pending, overdue, rejected)

These feed into the PDUFA tracking feature. The FdaApprovalNotificationsJob (Saturday noon) separately fetches FDA press release notifications and matches them to existing drug approvals.
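
One plausible way to derive the computed approval status is shown below. The rules are assumptions rather than the pipeline's logic: a recorded decision wins, a PDUFA date in the past with no decision counts as overdue, and everything else is pending. The `approval_status` name and its keyword arguments are hypothetical.

```ruby
require "date"

# decision: :approved, :rejected, or nil when no outcome is recorded.
# pdufa_date: a Date or nil. today is injectable to keep the method testable.
def approval_status(decision:, pdufa_date:, today: Date.today)
  return :approved if decision == :approved
  return :rejected if decision == :rejected
  return :overdue if pdufa_date && pdufa_date < today

  :pending
end
```

For example, a submission with a PDUFA date of 2024-01-01 and no recorded decision evaluates to `:overdue` when checked on 2024-06-01.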

Trial result mentions

news_trial_mentions are lightweight links between an article and a clinical trial. Post-processing stores only the extracted NCT ID and title, plus a nullable clinical_trial_id resolved by an exact match on the normalized NCT ID. Trial result fields remain in news.llm_data; the next workflow step materializes them into Publication records, which own any normalized publication result data.
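
Resolution by normalized NCT ID can be sketched as follows. The normalization rules are an assumption (uppercase, optional space or hyphen after the prefix, exactly eight digits, which is the ClinicalTrials.gov identifier format); the helper names and the lookup hash are hypothetical.

```ruby
# Return the canonical "NCT" + 8 digits form, or nil if none is present.
def normalize_nct_id(raw)
  match = raw.to_s.upcase.match(/NCT[\s\-]*(\d{8})(?!\d)/)
  match && "NCT#{match[1]}"
end

# trials_by_nct is a hypothetical lookup of canonical NCT ID to trial ID.
# An unresolved mention returns nil, matching the nullable clinical_trial_id.
def resolve_trial_id(raw, trials_by_nct)
  trials_by_nct[normalize_nct_id(raw)]
end
```

Matching only on the normalized form avoids fuzzy links: a malformed ID stays unresolved instead of attaching to the wrong trial.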

The post-processing task supports three modes. The default fill mode skips articles already marked with llm_data.clinical_trials.trials_processed. --overwrite destroys and rebuilds an article’s mentions. --sync reconciles mentions and deletes stale rows. Overwrite and sync are mutually exclusive.
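
The mode selection can be sketched as a small guard. The flag names and the `trials_processed` marker come from the description above; the method names and the `:fill` symbol are hypothetical, and the marker is assumed to be stored as a boolean.

```ruby
# Map the two flags to one mode and reject the invalid combination.
def trial_post_process_mode(overwrite: false, sync: false)
  raise ArgumentError, "--overwrite and --sync are mutually exclusive" if overwrite && sync
  return :overwrite if overwrite
  return :sync if sync

  :fill
end

# Only the default mode skips articles that already carry the marker.
def skip_article?(mode, llm_data)
  mode == :fill && llm_data.dig("clinical_trials", "trials_processed") == true
end
```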

Deal extraction

Articles flagged as relatable_deal_subject: true feed into deal extraction. The pipeline creates organisation_histories records with deal type, financial terms, geographic scope, and participant organizations. Significance scores help analysts prioritize review.

Company name change extraction

Articles announcing corporate rebrands or formal name changes feed into a separate extraction step (NewsLlmClassification::DealExtraction::NameChangeTask). The task scopes candidates by category (Business Development, Corporate and Business Updates, Stockholder/Shareholder Announcements, Funding and Financing) plus title patterns (%rebrand%, %renamed%, %name change%, %new ticker%, etc.) and uses the article’s matched organisations as candidate identifiers.
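
The candidate scope can be approximated in plain Ruby. This sketch makes three assumptions: a candidate must satisfy both the category and the title condition, the title patterns behave as case-insensitive substring matches, and category labels follow the 16-category list. Only the four patterns spelled out above are included; the source list continues beyond them.

```ruby
NAME_CHANGE_CATEGORIES = [
  "Business Development",
  "Corporate and Business Updates",
  "Stockholder/Shareholder Announcements",
  "Funding and Financing"
].freeze

# Substrings taken from the listed patterns; the full list is longer.
TITLE_KEYWORDS = ["rebrand", "renamed", "name change", "new ticker"].freeze

# True when the article is in an eligible category and its title
# contains at least one of the keywords.
def name_change_candidate?(title:, categories:)
  in_scope = (categories & NAME_CHANGE_CATEGORIES).any?
  lowered = title.downcase
  in_scope && TITLE_KEYWORDS.any? { |keyword| lowered.include?(keyword) }
end
```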

Results land in news.llm_data.company_changes. The post_process_deals task then creates organisation_histories records with change_type: 'NameChange', an initiator participant for the old organisation, and a resulting_company participant for the new name (auto-creating an Organisation row when the new name has no existing match). Changes scored as None are skipped.

Downstream ownership transfers (acquisitions, mergers, asset deals) resolve participant organisations through the chain of NameChange, Acquisition, and Merger histories so drug ownership follows the latest active legal entity rather than a stale predecessor name.
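
Following the chain to the latest entity amounts to walking successor links until none remain. The sketch below is a simplification under assumptions: the `History` struct and its fields are hypothetical, each organisation has at most one successor, and dates and active status are ignored. The company names are placeholders.

```ruby
# Hypothetical reduced view of an organisation_histories row.
History = Struct.new(:change_type, :from_org, :to_org, keyword_init: true)

SUCCESSION_TYPES = %w[NameChange Acquisition Merger].freeze

# Walk successor links from org until no further link exists.
# The visited list guards against cycles in bad data.
def latest_entity(org, histories)
  successor = histories
    .select { |history| SUCCESSION_TYPES.include?(history.change_type) }
    .to_h { |history| [history.from_org, history.to_org] }

  visited = [org]
  while (next_org = successor[visited.last]) && !visited.include?(next_org)
    visited << next_org
  end
  visited.last
end
```

With a rename from `OldCo` to `NewCo` followed by an acquisition of `NewCo` by `BigPharma`, a drug recorded against `OldCo` resolves to `BigPharma`.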

Deal review UI

Reviewers open deals via the Review Deals action on a news record. The page is a React app (app/javascript/bundles/News/components/NewsDealsReview.jsx) backed by Admin::Services::NewsDealReviewService, which exposes review_deals, update_deal_review, destroy_deal_review_entry, and destroy_duplicate_deal_review member actions on the news admin resource. The service returns the article HTML, extracted entries, and field/role configuration derived from OrganisationHistory::FIELDS_BY_DEAL_TYPE and ROLES_BY_DEAL_TYPE, so the UI shows only the inputs relevant to the entry’s change_type/deal_type.

News chunking and RAG

The news_chunks table stores sentence-aware segments of news articles for retrieval-augmented generation. Each chunk carries:

chunk_text — The segment with context overlap from adjacent chunks
chunk_index — Position in the article
metadata JSONB — Paragraph count, word count, overlap statistics

Chunks use the Vectorizable concern for OpenAI embedding generation via the EmbedRecordJob. This powers semantic search across the news corpus.
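
Sentence-aware chunking with overlap can be sketched as below. This is an illustration, not the production chunker: the sentence splitter is a naive regex, the word budget and overlap size are made-up defaults, overlap is taken only from the preceding chunk, and the metadata keys are hypothetical.

```ruby
# Split text into chunks of whole sentences, each prefixed with the last
# overlap_sentences sentences of the previous chunk for context.
def chunk_sentences(text, max_words: 120, overlap_sentences: 1)
  sentences = text.split(/(?<=[.!?])\s+/).map(&:strip).reject(&:empty?)
  chunks = []
  start = 0

  while start < sentences.size
    stop = start
    words = 0
    # Always take one sentence, then keep adding while the budget allows.
    while stop < sentences.size
      count = sentences[stop].split.size
      break if stop > start && words + count > max_words

      words += count
      stop += 1
    end

    from = [start - overlap_sentences, 0].max
    chunk_text = sentences[from...stop].join(" ")
    chunks << {
      chunk_index: chunks.size,
      chunk_text: chunk_text,
      metadata: { word_count: chunk_text.split.size, overlap_sentences: start - from }
    }
    start = stop
  end

  chunks
end
```

Cutting on sentence boundaries keeps each embedded segment readable on its own, and the overlap preserves references that span a boundary.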

Publications pipeline

Publications arrive from PubMed and medical conferences (ASCO, AACR, ASH, ASGCT, EHA, ESMO). The PublicationsWorkflow (17 steps) handles ingestion and extraction.

Collection

Publications are collected on demand through PublicationIngestionJob. Each ingestion run tracks its source, status, and parameters in publication_ingestion_runs. Full text is fetched from PubMed Central or Unpaywall when available.

Conferences that publish numbered abstract books as a PDF (ASGCT, ESGCT, ISTH, etc.) can be ingested through the generic conference_pdf source. The console operator supplies a source name (e.g. ASGCT_2026) and a public PDF URL; Publications::ConferencePdfService downloads, caches under tmp/conference_pdf/, and parses each numbered abstract into a Publication record keyed on (source, source_id).
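
Splitting extracted abstract-book text into keyed records might look like the sketch below. It assumes each abstract starts on its own line as a number followed by a period, with the title on the first line. That layout is hypothetical, real abstract books vary, and Publications::ConferencePdfService may parse differently.

```ruby
# Parse "N. Title\nBody..." blocks into hashes keyed on (source, source_id).
# A body line that itself starts with "N. " would be split incorrectly.
def parse_numbered_abstracts(text, source:)
  text.scan(/^(\d+)\.\s+(.+?)(?=^\d+\.\s|\z)/m).map do |number, body|
    title, *rest = body.strip.split("\n")
    {
      source: source,
      source_id: number,
      title: title.strip,
      abstract: rest.map(&:strip).join(" ")
    }
  end
end

# Index by the composite key so re-ingesting the same PDF overwrites
# existing entries instead of duplicating them.
def index_by_key(records)
  records.to_h { |record| [[record[:source], record[:source_id]], record] }
end
```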

LLM extraction

GPT-4.1 processes publications to extract:

Trial references (NCT IDs linked through publication_clinical_trials)
Drug interventions (publication_interventions with dosing details)
Disease populations and patient characteristics
Trial outcomes and subgroup analyses

Publications use the Vectorizable concern. Embeddings cover title, abstract, and publication date.

Disease matching

The PublicationDiseaseWorkflow (22 steps) matches disease mentions in publications to canonical entities. It uses the same 4-stage term matching cascade described in the clinical trials pipeline.

Standard of care pipeline

The StandardOfCareWorkflow (25 steps) processes treatment guidelines. It runs weekly on Friday at 22:00 UTC.

Guidelines link to diseases and describe standard treatments including:

Treatment regimens (drug names, combinations)
Applicable treatment lines and settings
Biomarker requirements
Disease subtypes and stages
Supporting clinical evidence

Results populate the guidelines table with rich JSONB data. HABTM join tables connect guidelines to drugs, drug groups, chemo combinations, biomarkers, and clinical trials.

Organization financial sync

The OrgSyncFmpJob runs daily at 01:00 UTC. It syncs financial data for tracked organizations from the Financial Modeling Prep API: market cap, enterprise value, revenue, profit margin, and cash position. This runs in-process (no AWS Batch).

SEC filings sync

A separate FMP-backed pipeline captures SEC filings (10-K, 10-Q, 20-F and amended variants) per organisation. It runs in two stages:

CIK enrichment — OrganisationSecCikPopulationService resolves an organisation’s sec_cik from its us_stock_symbol (or non_us_stock_symbol) via FMP’s company search endpoint. Triggered manually through OrganisationSecCikPopulationJob or the Thor tasks below; only runs against active organisations missing a CIK by default.
Filing sync — OrganisationSecFilingsSyncJob runs monthly on the 1st at 03:00 UTC and calls OrganisationSecFilingsSyncService#sync_all over a 60-day lookback window. The service paginates FMP’s CIK filing endpoint, filters to the configured form types, and upserts into organisation_sec_filings keyed on (organisation_id, source_uid). first_seen_at is preserved across re-runs; last_seen_at is bumped on every sighting.
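
The upsert semantics of the filing sync can be shown with an in-memory store. This is a sketch: the hash stands in for the organisation_sec_filings table, the method names are hypothetical, and the form type list assumes the amended variants are the standard "/A" forms.

```ruby
# Assumed set of configured form types, including amended variants.
FORM_TYPES = %w[10-K 10-Q 20-F 10-K/A 10-Q/A 20-F/A].freeze

def tracked_form?(form_type)
  FORM_TYPES.include?(form_type)
end

# Insert or update a filing keyed on (organisation_id, source_uid).
# first_seen_at is written once; last_seen_at moves on every sighting.
def upsert_filing(store, organisation_id:, source_uid:, form_type:, seen_at:)
  key = [organisation_id, source_uid]
  record = store[key] ||= { form_type: form_type, first_seen_at: seen_at }
  record[:form_type] = form_type
  record[:last_seen_at] = seen_at
  record
end
```

Because the lookback window overlaps between monthly runs, the same filing is expected to be seen more than once, which is why the key-based upsert matters.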

Both services raise on configuration errors (FMP_API_KEY missing) and aggregate per-organisation eligibility failures into a single PopulateAllError / SyncAllError after the run.
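
The error-handling pattern, failing fast on configuration and collecting per-organisation failures until the end, can be sketched as follows. The top-level `SyncAllError` class, the `sync_all` signature, and the use of a block for the per-organisation work are all hypothetical stand-ins for the service's internals.

```ruby
# Hypothetical aggregate error carrying every per-organisation failure.
class SyncAllError < StandardError
  attr_reader :failures

  def initialize(failures)
    @failures = failures
    super("#{failures.size} organisation(s) failed: #{failures.keys.join(', ')}")
  end
end

# Raise immediately on missing configuration; otherwise process every
# organisation and raise once at the end if any of them failed.
def sync_all(organisation_ids, api_key:)
  raise ArgumentError, "FMP_API_KEY missing" if api_key.to_s.empty?

  failures = {}
  organisation_ids.each do |id|
    yield id
  rescue StandardError => e
    failures[id] = e.message
  end

  raise SyncAllError.new(failures) unless failures.empty?
end
```

Aggregating this way means one organisation without a CIK does not stop the rest of the run, while the job still ends in a failed state that monitoring can pick up.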

Pipeline monitoring

The PipelineMonitorJob runs weekly on Sunday at 22:00 UTC. It captures pipeline health screenshots and validates system status through PipelineMonitorService. Screenshots are stored in pipeline_snapshots with hash-based change detection.
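
Hash-based change detection reduces to comparing digests of consecutive captures. The sketch below assumes SHA-256 over the raw image bytes; the actual digest algorithm and column names are not stated in this page, and `detect_change` is a hypothetical helper.

```ruby
require "digest"

# Compare the digest of a new capture with the previously stored digest.
# Returns the new digest and whether it differs from the previous one.
def detect_change(previous_digest, image_bytes)
  digest = Digest::SHA256.hexdigest(image_bytes)
  { digest: digest, changed: digest != previous_digest }
end
```

Storing only the digest for comparison avoids loading earlier screenshots into memory. Byte-level hashing flags any pixel difference, so timestamps rendered in a screenshot would register as a change.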

Key services

Service	Purpose
`BusinessWireService`	Collects Business Wire press releases
`CisionApiService`	Fetches Cision news via API
`GlobalnewsWireService`	Collects GlobeNewsWire articles
`FinancialApiService`	Collects financial news
`OrgFmpService`	Syncs organization financial data from FMP
`OrganisationSecCikPopulationService`	Resolves SEC CIK from stock symbol via FMP
`OrganisationSecFilingsSyncService`	Fetches and upserts SEC filings (10-K/Q, 20-F) by CIK from FMP
`OpenAiService`	LLM classification and entity extraction
`DiseaseMatchingService`	Disease entity resolution for news and publications
`PipelineMonitorService`	Pipeline health monitoring
`SemanticQaService`	Semantic quality checks on extracted data

Key Thor tasks

Task	Purpose
`searchful_news:collect_business_wire`	Collect Business Wire articles
`classify_news:*`	News LLM classification steps (includes `extract_name_changes` for corporate rebrands)
`news:post_process_trials`	Populate or reconcile lightweight trial mentions before publication materialization
`term_matching:*`	Entity resolution for news entities
`diseases:*`	Disease matching for news and publications
`standard_of_care:*`	SOC guideline processing
`sec_filings:populate_cik` / `sec_filings:populate_cik_all`	Resolve SEC CIK for one or all eligible organisations
`sec_filings:sync_organisation` / `sec_filings:sync_all`	Sync SEC filings for one or all organisations with a CIK

Common problems

Symptom	Likely cause	Fix
News collection returns 0 articles	API key expired or source API changed	Check `CISION_PASS_KEY` and `FMP_API_KEY`
LLM classification timeout	Too many unprocessed articles	Run with `--limit` flag
Deal extraction misses context	Article too long for context window	Check chunking. May need multi-pass extraction.
Publication full text missing	PMC or Unpaywall did not have it	Expected for paywalled journals. Abstract-only extraction.
Duplicate news articles	Source feed includes reruns	Dedup runs on `release_id`. Check for ID changes.

Next steps

Architecture — How the service layer, workflows, and background jobs fit together
Clinical trials — The pipeline that processes the trials news articles reference