News and intelligence
Data Gov turns press releases and scientific publications into structured intelligence. When a company announces a licensing deal, a trial readout, or an FDA submission, the news pipeline extracts the entities, classifies the event, and links everything to the knowledge graph. This page covers the full intelligence layer: news collection, LLM classification, publications ingestion, and deal extraction.
The intelligence problem
Pharmaceutical intelligence hides in unstructured text. A Business Wire press release says “Pfizer and Seagen announce FDA has accepted sBLA for ADCETRIS.” Buried in that sentence are: an organization (Pfizer), a second organization (Seagen), an FDA submission type (sBLA), a brand name (ADCETRIS), and an event type (FDA acceptance). The news pipeline extracts all of this automatically.
```mermaid
flowchart TB
  subgraph Collection["Daily Collection"]
    BW["Business Wire\n(daily 05:00)"]
    Cision["Cision\n(twice daily)"]
    GNW["GlobeNewsWire\n(twice daily)"]
    FIN["Financial APIs\n(daily noon)"]
  end
  subgraph LLM["Weekly LLM Processing"]
    Classify["Classify articles\n(16 categories)"]
    Entities["Extract drug, disease,\norg mentions"]
    FDA_Sub["Detect FDA\nsubmission events"]
    Trials["Identify trial\nresult mentions"]
    Deals["Extract deal\ninformation"]
  end
  subgraph Link["Entity Linking"]
    Drug_Link["Link drugs"]
    Disease_Link["Link diseases"]
    Org_Link["Link orgs"]
    Trial_Link["Link trials"]
  end
  subgraph Output["Structured Output"]
    NDM["news_drug_mentions"]
    NFS["news_fda_submissions"]
    NTM["news_trial_mentions"]
    OH["organisation_histories\n(deals)"]
    DN["diseases_news\n(disease links)"]
  end
  Collection --> Classify --> Entities --> FDA_Sub & Trials & Deals
  Entities --> Drug_Link & Disease_Link & Org_Link
  FDA_Sub --> NFS
  Trials --> NTM --> Trial_Link
  Deals --> OH
  Drug_Link --> NDM
  Disease_Link --> DN
```
News collection
Four sources feed the news pipeline. Collection jobs run on independent schedules via sidekiq-cron.
| Source | Job class | Schedule (UTC) | Method |
|---|---|---|---|
| Business Wire | NewsBusinessWireJob | Daily 05:00 | BusinessWireService via AWS Batch |
| Cision | NewsCisionJob | Twice daily 00:00, 16:00 | CisionApiService (in-process) |
| GlobeNewsWire | NewsGlobalnewsWireJob | Twice daily 03:00, 21:00 | GlobalnewsWireService via AWS Batch |
| Financial APIs | NewsFinancialJob | Daily 12:00 | FinancialApiService |
Each article lands in the news table with its release_id (unique per source), full HTML body in data, source metadata in JSONB columns, and the source enum (business_wire, globalnews_wire, cision, financial).
Duplicate detection runs on title match before creation. Only articles matching pharmaceutical relevance criteria (is_pharma: true) proceed to LLM processing.
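The pre-creation duplicate check can be sketched in plain Ruby. This is a minimal sketch assuming a simple case-insensitive title comparison; the function name and inputs are illustrative, not the actual service API, and the real check may normalize titles differently.

```ruby
require "set"

# Illustrative sketch of title-based duplicate detection before creation.
# `existing_titles` stands in for a lookup against the news table.
def deduplicate_articles(incoming, existing_titles)
  seen = existing_titles.map { |t| t.strip.downcase }.to_set
  incoming.reject do |article|
    key = article[:title].strip.downcase
    next true if seen.include?(key)   # already stored: skip
    seen.add(key)                     # also dedups within the same batch
    false
  end
end
```

Note that this also guards against duplicates within a single collection run, not just against previously stored rows.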
LLM classification and extraction
The NewsLlmWorkflow (33 steps) runs weekly on Sunday at 22:00 UTC. It processes all unclassified news articles through GPT-4.1.
Article classification
Each article receives one or more category labels from 16 defined categories:
Regulatory, Business Development, Business Deal, Leadership Changes, Quarterly Results, Trial Development Updates, Trial Results, Publication, Research Data Presentation, Market Forecast and Analysis, Grant/Award/Recognition, Corporate and Business Updates, Stockholder/Shareholder Announcements, Funding and Financing, Product Launch and Marketing, Patents and Intellectual Property.
Classification results land in news.category (JSONB array). The therapeutic_areas JSONB array tags articles by disease area (Oncology, Malignant Hematology, Non-Malignant Hematology).
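Because category is a JSONB array, filtering by label is a containment check. A plain-Ruby sketch of the same semantics, with rows represented as hashes and an illustrative function name:

```ruby
# Mirrors a JSONB containment filter over news.category in plain Ruby.
# Each row hash carries a "category" array of label strings.
def articles_in_category(articles, label)
  articles.select { |a| Array(a["category"]).include?(label) }
end
```

In the database this would presumably be a containment query such as `category @> '["Regulatory"]'`; the exact query shape in the app is an assumption here.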
Entity extraction
GPT-4.1 extracts structured mentions from the article text. Results land in news.llm_data JSONB.
Drug mentions create news_drug_mentions rows. Each carries a drug_name (raw extracted text) and nullable drug_id / brand_drug_id FKs that link after entity resolution.
Disease mentions link through the diseases_news join table after the NewsDiseaseWorkflow matches extracted disease names to canonical entities.
Organization mentions link through the news_organisations join table.
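The link-after-extraction pattern (raw text kept, FK filled only when resolution finds a match) can be sketched as follows. `drug_index` is a hypothetical name-to-id lookup; the real resolution goes through the term-matching pipeline.

```ruby
# Illustrative sketch: keep the raw drug_name as extracted, and fill
# drug_id only when the lookup resolves it; otherwise the FK stays nil.
def link_drug_mentions(mentions, drug_index)
  mentions.map do |m|
    { drug_name: m[:drug_name], drug_id: drug_index[m[:drug_name].downcase] }
  end
end
```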
FDA submission detection
When an article discusses an FDA filing, the pipeline creates news_fda_submissions records. Each captures:
- Application type (NDA, sNDA, BLA, sBLA, 351(k))
- PDUFA target action date
- Change type (new_indication, dosage_form, formulation)
- Indication data with potential disease matches
- Computed approval status (approved, pending, overdue, rejected)
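The computed status can be pictured as a small pure function. The rules below are assumptions, since the exact logic is not documented here: a recorded outcome wins, and a submission without one stays pending until the PDUFA date passes.

```ruby
require "date"

# Hypothetical sketch of the computed approval status for a submission.
# An explicit outcome (approved/rejected) takes precedence; otherwise the
# status is pending, flipping to overdue once the PDUFA date is in the past.
def approval_status(pdufa_date:, outcome: nil, today: Date.today)
  return outcome.to_s if %i[approved rejected].include?(outcome)
  return "overdue" if pdufa_date && today > pdufa_date
  "pending"
end
```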
These feed into the PDUFA tracking feature. The FdaApprovalNotificationsJob (Saturday noon) separately fetches FDA press release notifications and matches them to existing drug approvals.
Trial result mentions
Rows in news_trial_mentions capture clinical trial references from news articles. Each record stores the NCT ID, trial title, patient population, outcome summary, and enrollment numbers. These serve as polymorphic sources for curated trial result data — the same extraction model (trial_endpoints, trial_outcome_measures, adverse_events, trial_subgroups) used by publications.
Deal extraction
Articles flagged as relatable_deal_subject: true feed into deal extraction. The pipeline creates organisation_histories records with deal type, financial terms, geographic scope, and participant organizations. Significance scores help analysts prioritize review.
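A minimal sketch of the review ordering this enables, assuming illustrative field names (relatable_deal_subject flag on the article, a numeric significance score on the extracted deal):

```ruby
# Illustrative sketch: select deal-relevant articles and surface the most
# significant extractions first for analyst review.
def deals_for_review(articles)
  articles
    .select { |a| a[:relatable_deal_subject] }
    .sort_by { |a| -a[:significance].to_f }
end
```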
News chunking and RAG
The news_chunks table stores sentence-aware segments of news articles for retrieval-augmented generation. Each chunk carries:
- chunk_text — The segment with context overlap from adjacent chunks
- chunk_index — Position in the article
- metadata JSONB — Paragraph count, word count, overlap statistics
Chunks use the Vectorizable concern for OpenAI embedding generation via the EmbedRecordJob. This powers semantic search across the news corpus.
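Sentence-aware chunking with overlap can be sketched as below. This is a simplified sketch: the size limit, the sentence-splitting regex, and the returned fields are illustrative, and the real pipeline also records the paragraph/word/overlap statistics in metadata.

```ruby
# Illustrative sketch: split text on sentence boundaries, pack sentences
# into size-bounded chunks, and carry trailing sentences into the next
# chunk as context overlap.
def chunk_sentences(text, max_chars: 200, overlap_sentences: 1)
  sentences = text.split(/(?<=[.!?])\s+/)
  chunks = []
  current = []
  sentences.each do |s|
    if current.any? && (current.join(" ").length + s.length + 1) > max_chars
      chunks << current.join(" ")
      current = current.last(overlap_sentences) # overlap into next chunk
    end
    current << s
  end
  chunks << current.join(" ") unless current.empty?
  chunks.each_with_index.map { |c, i| { chunk_index: i, chunk_text: c } }
end
```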
Publications pipeline
Publications arrive from PubMed and medical conferences (ASCO, AACR, ASH, EHA, ESMO). The PublicationsWorkflow (17 steps) handles ingestion and extraction.
Collection
Publications are collected on-demand through PublicationIngestionJob. Each ingestion run tracks its source, status, and parameters in publication_ingestion_runs. Full text is fetched from PubMed Central or Unpaywall when available.
LLM extraction
GPT-4.1 processes publications to extract:
- Trial references (NCT IDs linked through publication_clinical_trials)
- Drug interventions (publication_interventions with dosing details)
- Disease populations and patient characteristics
- Trial outcomes and subgroup analyses
Publications use the Vectorizable concern. Embeddings cover title, abstract, and publication date.
Disease matching
The PublicationDiseaseWorkflow (22 steps) matches disease mentions in publications to canonical entities. It uses the same 4-stage term matching cascade described in the clinical trials pipeline.
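Since the four stages themselves are defined in the clinical trials pipeline docs, the sketch below uses illustrative stand-in stages (exact, normalized, synonym lookup, prefix fallback) purely to show the early-exit cascade shape; do not read the specific stages as the actual ones.

```ruby
# Illustrative cascade: try cheap, precise stages first and fall through
# to looser ones; return as soon as any stage produces a match.
def match_disease(term, canonical:, synonyms: {})
  return term if canonical.include?(term)                  # stage 1: exact
  norm = term.strip.downcase
  hit = canonical.find { |c| c.downcase == norm }          # stage 2: normalized
  return hit if hit
  return synonyms[norm] if synonyms.key?(norm)             # stage 3: synonym lookup
  canonical.find { |c| c.downcase.start_with?(norm) }      # stage 4: prefix fallback
end
```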
Standard of care pipeline
The StandardOfCareWorkflow (25 steps) processes treatment guidelines. It runs weekly on Friday at 22:00 UTC.
Guidelines link to diseases and describe standard treatments including:
- Treatment regimens (drug names, combinations)
- Applicable treatment lines and settings
- Biomarker requirements
- Disease subtypes and stages
- Supporting clinical evidence
Results populate the guidelines table with rich JSONB data. HABTM join tables connect guidelines to drugs, drug groups, chemo combinations, biomarkers, and clinical trials.
Organization financial sync
The OrgSyncFmpJob runs daily at 01:00 UTC. It syncs financial data for tracked organizations from the Financial Modeling Prep API: market cap, enterprise value, revenue, profit margin, and cash position. This runs in-process (no AWS Batch).
Pipeline monitoring
The PipelineMonitorJob runs weekly on Sunday at 22:00 UTC. It captures pipeline health screenshots and validates system status through PipelineMonitorService. Screenshots are stored in pipeline_snapshots with hash-based change detection.
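Hash-based change detection can be sketched with a digest per screenshot; the digest algorithm and field names here are assumptions.

```ruby
require "digest"

# Illustrative sketch: compare the current screenshot's SHA-256 digest to
# the digest stored for the previous run; a mismatch flags a change.
def snapshot_changed?(image_bytes, previous_digest)
  Digest::SHA256.hexdigest(image_bytes) != previous_digest
end
```

Storing the digest instead of diffing raw image bytes keeps the comparison cheap and avoids re-reading old snapshots.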
Key services
| Service | Purpose |
|---|---|
| BusinessWireService | Collects Business Wire press releases |
| CisionApiService | Fetches Cision news via API |
| GlobalnewsWireService | Collects GlobeNewsWire articles |
| FinancialApiService | Collects financial news |
| OrgFmpService | Syncs organization financial data from FMP |
| OpenAiService | LLM classification and entity extraction |
| DiseaseMatchingService | Disease entity resolution for news and publications |
| PipelineMonitorService | Pipeline health monitoring |
| SemanticQaService | Semantic quality checks on extracted data |
Key Thor tasks
| Task | Purpose |
|---|---|
| searchful_news:collect_business_wire | Collect Business Wire articles |
| classify_news:* | News LLM classification steps |
| term_matching:* | Entity resolution for news entities |
| diseases:* | Disease matching for news and publications |
| standard_of_care:* | SOC guideline processing |
Common problems
| Symptom | Likely cause | Fix |
|---|---|---|
| News collection returns 0 articles | API key expired or source API changed | Check CISION_PASS_KEY and FMP_API_KEY |
| LLM classification timeout | Too many unprocessed articles | Run with --limit flag |
| Deal extraction misses context | Article too long for context window | Check chunking. May need multi-pass extraction. |
| Publication full text missing | PMC or Unpaywall did not have it | Expected for paywalled journals. Abstract-only extraction. |
| Duplicate news articles | Source feed includes reruns | Dedup runs on release_id. Check for ID changes. |
Next steps
- Architecture — How the service layer, workflows, and background jobs fit together
- Clinical trials — The pipeline that processes the trials that news articles reference