Evidence Discovery

Evidence Discovery Substrate

A derived content layer that sits alongside the canonical source tables (publications, news, drug_approvals, clinical_trials, …). Its job is to give the agent — and eventually end-user search — a single place to look up evidence-bearing text regardless of origin (admin uploads, source projections, later agentic workflows).

The substrate ships with infrastructure and governance for the content graph plus an end-to-end pipeline (Discovery::Pipeline, orchestrated by DiscoveryWorkflow) that populates content items, auto-assigns facets via LLM, extracts mentions, and links them to canonical entities. Investor presentation bodies are chunked by extracted slide/page headings, and SEC filing bodies are fetched on demand and chunked by PART/ITEM headings; broader chunking strategies, embeddings, and retrieval remain future work.

Data model

All tables are prefixed discovery_. One-to-many everywhere unless noted.

Table	Purpose	Parent
`discovery_content_items`	Canonical content identity, provenance, processing status	—
`discovery_content_assets`	URL references attached to an item (uploads use Active Storage)	item
`discovery_content_bodies`	Extracted/normalised full text. Unique per `(item, source_format)`	item
`discovery_content_chunks`	Retrieval units derived from a body. Unique `position` per body	body
`discovery_content_mentions`	Resolved evidence spans inside a chunk	chunk
`discovery_entity_links`	Canonical-entity attachments on an item (polymorphic)	item
`discovery_facet_definitions`	Facet registry keyed by `(namespace, key)`	—
`discovery_facet_values`	Allowed values for controlled facets	facet_definition
`discovery_content_facets`	Facet assignments on an item	item + facet_definition

Schema invariants worth knowing:

DiscoveryContentItem#description is a short human-readable summary/snippet for browsing. It is not the extracted document body. Full normalized text belongs in discovery_content_bodies.
Each derived row carries only its nearest parent (chunk.body_id, mention.chunk_id). Everything above is reached via :through associations, e.g. content_item.discovery_content_chunks or mention.discovery_content_item. This avoids the “parent IDs drift” bug class at the cost of a join on item-scoped lookups.
DiscoveryEntityLink.entity and DiscoveryContentMention.entity are polymorphic; entity_type is a model class name (e.g. Drug, Disease, Organisation) and is nil until a resolution workflow populates it.
DiscoveryEntityLink.match_status is resolved or unresolved. Resolved links must carry an entity_type/entity_id; unresolved links must not. A link can be either an item-level attachment (discovery_content_mention_id is null) or a per-mention adjudication outcome (discovery_content_mention_id is set, unique per mention).
Facet assignments are either controlled (discovery_facet_value_id set) or freeform (freeform_value set). The two are mutually exclusive and each is enforced by a partial unique index.

Lifecycle

DiscoveryContentItem#processing_status is a string enum with four states:

pending   → created, nothing processed yet (default)
processing → a workflow is actively working on it
processed  → workflow finished; derived outputs may be attached
failed     → workflow raised; processing_error holds the message

Transitions go through the helpers on the model:

item.mark_processing!
item.mark_processed!        # sets processed_at, clears error
item.mark_failed!("reason") # stores the error string
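
The transitions can be pictured with a small plain-Ruby stand-in. This is a minimal sketch, not the model's actual code: SketchItem and its attribute handling are hypothetical, and only the status names, the helper names, and the documented side effects (processed_at set, error cleared or stored) come from the text above.

```ruby
# Hypothetical stand-in for DiscoveryContentItem; no ActiveRecord involved.
class SketchItem
  STATUSES = %w[pending processing processed failed].freeze

  attr_reader :processing_status, :processing_error, :processed_at

  def initialize
    @processing_status = "pending" # default state
    @processing_error = nil
    @processed_at = nil
  end

  # A workflow has picked the item up.
  def mark_processing!
    @processing_status = "processing"
    self
  end

  # The workflow finished: stamp the time and clear any earlier error.
  def mark_processed!
    @processing_status = "processed"
    @processed_at = Time.now
    @processing_error = nil
    self
  end

  # The workflow raised: keep the message for later inspection.
  def mark_failed!(reason)
    @processing_status = "failed"
    @processing_error = reason.to_s
    self
  end
end

item = SketchItem.new
item.mark_processing!
item.mark_failed!("extractor timed out")
item.mark_processed! # a successful retry clears the stored error
```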

Producing content items

All producers — admin console, source projections, future agentic workflows — MUST go through Discovery::ContentItemService or the submission helpers on top of it. This is the shared substrate entry point.

From scratch

service = Discovery::ContentItemService.new
result = service.create(
  title: "Phase 3 Study Protocol for Drug X",
  content_kind: "protocol",                 # must be in DiscoveryContentItem::CONTENT_KINDS
  submission_source: "source_projection",   # admin_upload, admin_url, source_projection, agentic
  source_type: "ClinicalTrial",             # optional polymorphic back-link
  source_id: trial.id,
  document_date: Date.parse("2026-01-15"),
  metadata: { version: "2.0" }
)
content_item = result.content_item if result.success?

From an existing source record

result = service.create_from_source(publication,
  content_kind: "publication",
  document_date: publication.publish_date
)

This stamps source_type/source_id from the record and sets submission_source: "source_projection". Use it for source-table projections.

Admin-triggered (file / URL)

Admin-driven paths go through Discovery::SubmissionService:

Discovery::SubmissionService.new.submit_upload(
  file: uploaded_file, content_kind: "press_release", submitted_by: current_user
)

Discovery::SubmissionService.new.submit_url(
  url: "https://example.com/deck.pdf", content_kind: "corporate_deck",
  submitted_by: current_user
)

Uploads attach via Active Storage (item.files); URL submissions also create a discovery_content_assets row for the raw URL reference. Source projections with a URL (for example investor presentation links) also mirror that URL into discovery_content_assets so link/file references are kept separate from extracted text bodies.

Extending with derived outputs

Later extraction, chunking, mention-detection, and entity-resolution workflows attach to an existing content item. Reuse these attach points — do NOT create parallel tables or new content items for derived outputs.

Bodies

body = content_item.discovery_content_bodies.create!(
  source_format: "plain_text",       # pdf | html | plain_text | markdown
  extraction_method: "tika",         # optional: tika | trafilatura | manual | llm
  body: extracted_text
)

Unique per (content_item, source_format) — if your extractor wants to store multiple formats for the same item, use different source_format values.

Chunks

body.discovery_content_chunks.create!(
  position: 0,                       # unique within the body
  content: chunk_text,
  chunk_strategy: "paragraph",       # fixed_size | paragraph | section | semantic
  token_count: 128
)

Item-scoped access via content_item.discovery_content_chunks (joins through bodies).

Mentions

chunk.discovery_content_mentions.create!(
  mention_type: "drug",              # see DiscoveryContentMention::MENTION_TYPES
  category: "Drug",                  # see DiscoveryContentMention::ENTITY_CATEGORIES
  surface_form: "pembrolizumab",
  start_offset: 42,                  # optional, character offset into chunk.content
  end_offset: 55,
  resolution_method: "llm",          # optional: ner | llm | dictionary | manual
  extraction_method: "discovery_entity_extraction_linking",
  entity_type: "Drug",               # optional — nil until resolution
  entity_id: drug.id,
  confidence: 0.92
)

The body and item are reached via :through associations (e.g. mention.discovery_content_item).

mention_type is the lowercase storage form; category is the canonical class-name form (e.g. Drug, ClinicalTrial, Organisation) used by the extraction/linking pipeline.
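
In the mention example above, end_offset minus start_offset equals the length of surface_form, which suggests an exclusive end offset. The sketch below assumes that convention and shows how offsets could be derived by searching the chunk text; locate_surface_form is a hypothetical helper, not a method from the codebase.

```ruby
# Hypothetical helper: find a surface form inside chunk text and return
# [start_offset, end_offset] as character offsets, with the end exclusive.
# Returns nil when the surface form does not occur in the text.
def locate_surface_form(chunk_text, surface_form, from: 0)
  start_offset = chunk_text.index(surface_form, from)
  return nil if start_offset.nil?

  [start_offset, start_offset + surface_form.length]
end

chunk_text = "Patients received pembrolizumab every three weeks."
start_offset, end_offset = locate_surface_form(chunk_text, "pembrolizumab")

# Slicing with an exclusive range recovers the surface form.
recovered = chunk_text[start_offset...end_offset]
```

Under this convention a consumer can highlight a mention with a plain range slice, and a nil result signals that the extracted text could not be found in the chunk.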

Entity links

Entity links can be item-level attachments (no mention) or mention-level adjudication outcomes (one link per mention).

# Item-level (no mention) — populated from aggregates or content-level classifiers
content_item.discovery_entity_links.create!(
  entity_type: "Drug",
  entity_id: drug.id,
  link_type: "primary_subject",      # mentioned | primary_subject | related
  match_status: "resolved",          # resolved | unresolved
  confidence: 0.95
)

# Mention-level — one link per mention, written by the extraction pipeline
content_item.discovery_entity_links.create!(
  discovery_content_mention: mention,
  entity_type: "Drug",               # nil for unresolved
  entity_id: drug.id,                # nil for unresolved
  link_type: "mentioned",
  match_status: "resolved",
  confidence: 0.92
)

Item-level links are unique per (content_item, entity_type, entity_id) where discovery_content_mention_id IS NULL. Mention-level links are unique per discovery_content_mention_id. Resolved links require an entity; unresolved links must leave entity_type/entity_id blank.
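
The resolved/unresolved rule can be expressed as a small check. This is a sketch under assumptions: LinkAttrs and entity_link_errors are hypothetical names, and the real model may enforce the rule through validations or database constraints that are not shown in this document.

```ruby
# Hypothetical value object mirroring the columns named above.
LinkAttrs = Struct.new(:match_status, :entity_type, :entity_id, keyword_init: true)

# Returns a list of human-readable problems; an empty list means valid.
def entity_link_errors(link)
  has_entity = !link.entity_type.nil? && !link.entity_id.nil?
  has_any_entity_field = !link.entity_type.nil? || !link.entity_id.nil?

  case link.match_status
  when "resolved"
    has_entity ? [] : ["resolved links must carry entity_type and entity_id"]
  when "unresolved"
    has_any_entity_field ? ["unresolved links must leave entity_type/entity_id blank"] : []
  else
    ["match_status must be resolved or unresolved"]
  end
end

ok_link = LinkAttrs.new(match_status: "resolved", entity_type: "Drug", entity_id: 7)
bad_link = LinkAttrs.new(match_status: "unresolved", entity_type: "Drug", entity_id: nil)
ok_errors = entity_link_errors(ok_link)   # no problems reported
bad_errors = entity_link_errors(bad_link) # one problem reported
```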

Discovery pipeline

Discovery::Pipeline (app/services/discovery/pipeline.rb) is the end-to-end workflow that turns a source record into a discovery content item with facets, mentions, and entity links. It is orchestrated by DiscoveryWorkflow (app/workflows/discovery_workflow.rb) and exposed via thor tasks under lib/tasks/discovery/pipeline.thor.

The required steps (Discovery::Pipeline::REQUIRED_STEPS):

Step	Thor task	What it does
`populate_content_item_from_source`	`discovery:pipeline:populate_content_items`	Iterates eligible source records and upserts a `DiscoveryContentItem` keyed by `(source_type, source_id)`. For most sources, writes a `plain_text` `discovery_content_bodies` row (`extraction_method: "manual"`) from the source text and keeps `description` as a bounded summary. For `investor_presentations` (scope: records with both a `presentation_link` and a non-blank `presentation_content`), it upserts the content item, upserts a `discovery_content_assets` row from `presentation_link`, writes `presentation_content` as the `plain_text` body (`extraction_method: "llm"`), and chunks the body into one `section` chunk per `## Slide N:`/`## Page N:` block. Presentations without `presentation_content` are skipped; populate them first by running `clinical_trials:investor_presentations:extract` (`InvestorPresentations::PdfContentExtractionTask`). When a `DiscoveryContentItem` already exists for the presentation, the step refreshes only its source asset and otherwise leaves it alone. For `sec_filings` (scope: forms in `Discovery::Pipeline::SEC_DISCOVERY_FORMS` (currently `10-K`, `10-Q`, `20-F`) with a non-blank `filing_url` or `sec_index_url`, restricted to filings the `eligible_for_content_extraction` scope considers retryable), it fetches the document via `Discovery::SecFilingExtractor` (HTML/plain_text, retries transient HTTP failures up to three times). HTML filings are rendered to markdown: HTML tables become markdown tables, and meaningful figures/images are extracted via `Discovery::SecFilingVisualExtractor` (LLM, default `gpt-5-mini`) and inlined as markdown; plain-text filings are written as-is. The rendered text is persisted as a `discovery_content_bodies` row with `extraction_method: "manual"` and `source_format` set to `markdown` for HTML sources or `plain_text` otherwise (recorded as `rendered_source_format` on the item metadata too). The body is chunked via `Discovery::SecFilingChunker` into `section` chunks keyed on `PART`/`ITEM` headings (≤ 12,000 chars per chunk); markdown tables are kept intact within a single chunk, and table-of-contents `ITEM` rows are preserved under the prior section instead of opening new ones. If an existing SEC filing item is missing its body or chunks, the step repairs it in place even without `--override`. Extraction state is recorded back onto the `organisation_sec_filings` row (`content_extraction_status`, `content_extraction_attempt_count`, `content_extraction_failure_kind`, `content_extraction_error`); only filings under `OrganisationSecFiling::CONTENT_EXTRACTION_MAX_ATTEMPTS` that are unattempted or in `retryable_failed` are picked up on re-runs
`assign_facets_with_llm`	`discovery:pipeline:assign_facets_with_llm`	Calls `DiscoveryLlmExtraction::FacetAssigner` with the persisted content body text plus source metadata, then applies facet assignments via `ContentItemService#assign_facet`
`extract_mentions`	`discovery:pipeline:extract_mentions`	Calls `DiscoveryLlmExtraction::EntityMentionLinking#extract_mentions!` against the persisted content body to write `DiscoveryContentMention` rows with character offsets. Investor presentations are extracted per slide/page chunk; other sources reuse existing body chunks when present, otherwise the pipeline creates a single whole-body fallback chunk first
`resolve_entity_mentions`	`discovery:pipeline:resolve_entity_mentions`	Calls `DiscoveryLlmExtraction::EntityMentionCandidateResolver#resolve_content_item!` for items whose mentions already exist. Looks up candidates per mention, picks deterministically when confidence is high, otherwise calls the LLM once per content item with all undecided mentions and their candidate shortlists, then writes resolved/unresolved `DiscoveryEntityLink` rows. Items with no prior pipeline mentions are skipped — run `extract_mentions` first
`link_mentions_to_entities`	`discovery:pipeline:link_mentions_to_entities`	Verifies that pipeline-extracted mentions and resolved links exist for each item and marks the step complete. Items with no prior mentions or no resolved links are skipped — run `extract_mentions` and `resolve_entity_mentions` first
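
The slide/page chunking described for investor presentations in the populate step can be sketched as follows. This is an illustration under assumptions, not the code in Discovery::ContentChunkingService: the method name, the returned hash shape, and the choice to drop text before the first heading are all hypothetical.

```ruby
# Heading lines such as "## Slide 3: Pipeline overview" or "## Page 12: Financials".
HEADING = /^## (?:Slide|Page) \d+:.*$/

# Split a body into one chunk per slide/page block.
# Text before the first heading is dropped in this sketch (an assumption).
def slide_chunks(body)
  starts = []
  body.scan(HEADING) { starts << Regexp.last_match.begin(0) }

  starts.each_with_index.map do |start, position|
    stop = starts[position + 1] || body.length
    {
      position: position,        # unique within the body
      chunk_strategy: "section",
      content: body[start...stop].strip
    }
  end
end

body = "Cover text\n## Slide 1: Title\nHello\n## Page 2: Data\nWorld\n"
chunks = slide_chunks(body)
```

Each chunk keeps its heading line, so a downstream mention extractor can still see which slide or page the text came from.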

Discovery::Pipeline::OPTIONAL_STEPS adds one trailing step that runs after link_mentions_to_entities:

Step	Thor task	What it does
`generate_entity_anchor_questions`	`discovery:pipeline:generate_entity_anchor_questions`	Calls `DiscoveryLlmExtraction::EntityAnchorQuestionGenerator#generate_for_content_item!` to write mention-anchored questions into `discovery_entity_anchor_questions`. Items with no pipeline-extracted mentions are skipped — run `extract_mentions` first. Workflow runs only include this step when `include_entity_questions: true` (toggled per-run from the admin console’s “Include in workflow run” checkbox); the thor task can be invoked directly without that flag

Source scopes accepted by every step (--source option): all, publications, investor_presentations, news, clinical_trials, drug_approvals, sec_filings. The --ids option narrows to specific source-record ids (combined with a non-all source). Every step also accepts --submission-source (default source_projection) to scope by DiscoveryContentItem#submission_source.

By default each step skips records whose output for that step already exists: assign_facets_with_llm excludes items that already have any discovery_content_facets, extract_mentions excludes items that already have any discovery_content_mentions, resolve_entity_mentions excludes items with at least one resolved discovery_entity_links row, and link_mentions_to_entities excludes items with any discovery_entity_links. Pass --override (boolean, default false) on any step to ignore that filter and reprocess every in-scope record; for populate_content_items --override also re-extracts and overwrites an existing DiscoveryContentItem instead of leaving it intact. The step-scoping logic lives in Discovery::Pipeline#apply_step_scope.
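
The default skip rules and the effect of --override can be summarised in a few lines. This is a simplified in-memory sketch, not the query logic in Discovery::Pipeline#apply_step_scope: ScopeItem, ALREADY_DONE, and items_for_step are hypothetical names, and row counts stand in for the real associations.

```ruby
# Hypothetical in-memory item: counts of derived rows stand in for associations.
ScopeItem = Struct.new(:id, :facets, :mentions, :resolved_links, :links, keyword_init: true)

# One "already done?" predicate per step, following the rules described above.
ALREADY_DONE = {
  "assign_facets_with_llm"    => ->(item) { item.facets.positive? },
  "extract_mentions"          => ->(item) { item.mentions.positive? },
  "resolve_entity_mentions"   => ->(item) { item.resolved_links.positive? },
  "link_mentions_to_entities" => ->(item) { item.links.positive? }
}.freeze

# Returns the items a step would process; override: true disables the filter.
def items_for_step(items, step, override: false)
  done = ALREADY_DONE.fetch(step)
  return items if override

  items.reject { |item| done.call(item) }
end

items = [
  ScopeItem.new(id: 1, facets: 2, mentions: 0, resolved_links: 0, links: 0),
  ScopeItem.new(id: 2, facets: 0, mentions: 5, resolved_links: 1, links: 3)
]
facet_ids = items_for_step(items, "assign_facets_with_llm").map(&:id)
all_ids = items_for_step(items, "assign_facets_with_llm", override: true).map(&:id)
```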

Run a single step:

bundle exec thor discovery:pipeline:populate_content_items --source=publications
bundle exec thor discovery:pipeline:assign_facets_with_llm --source=publications --ids 123 456
bundle exec thor discovery:pipeline:extract_mentions --source=publications

The LLM-driven steps (assign_facets_with_llm, extract_mentions, and resolve_entity_mentions) default to OpenAI batch mode and accept --batched (boolean, default true), --batch-size (default 1000), and --parallelism (default 1). Pass --batched=false to fall back to sequential per-item LLM calls — useful for small --ids runs or local debugging. In batched mode each step collects all in-scope LLM prompts across items and submits them via OpenAiService batch mode (see Discovery::Pipeline#discovery_llm_client): extract_mentions routes one prompt per content item (with all that item’s chunks in the payload) through the batch, and resolve_entity_mentions routes one prompt per content item (with every mention that needs LLM adjudication for that item) through the batch. link_mentions_to_entities is a verification step that does no LLM calls and only accepts --source, --submission-source, --ids, and --override.
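
To illustrate the "one prompt per content item" shape and how --batch-size slices the collected prompts, here is a minimal sketch. It does not call OpenAiService or any real batch API; prompts_per_item, batches, the custom_id field, and the payload layout are hypothetical.

```ruby
require "json"

# Build one prompt per content item, carrying all of that item's chunks.
# chunks_by_item maps an item id to an array of { chunk_id:, text: } hashes.
def prompts_per_item(chunks_by_item)
  chunks_by_item.map do |item_id, chunks|
    { custom_id: "item-#{item_id}", payload: JSON.generate({ chunks: chunks }) }
  end
end

# Slice the collected prompts into submissions of at most batch_size prompts.
def batches(prompts, batch_size: 1000)
  prompts.each_slice(batch_size).to_a
end

chunks_by_item = {
  101 => [{ chunk_id: 1, text: "Slide one" }, { chunk_id: 2, text: "Slide two" }],
  102 => [{ chunk_id: 3, text: "Whole body" }],
  103 => [{ chunk_id: 4, text: "Another body" }]
}
prompts = prompts_per_item(chunks_by_item)
submissions = batches(prompts, batch_size: 2)
```

With three items and a batch size of two, the sketch yields two submissions; an item's chunks always travel together inside a single prompt.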

The full chain is also driven from the admin Discovery Pipeline console (see Admin surface).

Entity extraction and linking

The extract_mentions and resolve_entity_mentions pipeline steps run the entity-mention pipeline. Extraction lives in DiscoveryLlmExtraction::EntityMentionLinking (app/tasks/discovery_llm_extraction/entity_mention_linking.rb); candidate lookup and disambiguation live in DiscoveryLlmExtraction::EntityMentionCandidateResolver (app/tasks/discovery_llm_extraction/entity_mention_candidate_resolver.rb). Together they take a content item plus its persisted body (markdown or plain text) and write mentions plus mention-level entity links against the selected chunks. extract_mentions! requires content_chunks to be supplied; the Discovery::Pipeline step is the integration point that prepares them (slide/page chunks for investor decks, an existing body chunk if one exists, or a whole-body fallback chunk otherwise) before invoking extraction.

flowchart LR
  text[plain_text body] --> chunks[Chunks<br/>slide/page for investor decks<br/>whole body fallback otherwise]
  chunks --> extractor["EntityMentionExtractor (LLM)"]
  extractor --> mentions[Mentions]
  mentions --> candidate[EntityMentionLinkerCandidateProvider]
  candidate --> resolver[EntityMentionCandidateResolver]
  resolver -->|resolved| link_resolved[DiscoveryEntityLink<br/>resolved]
  resolver -->|unresolved| link_unresolved[DiscoveryEntityLink<br/>unresolved]
  link_resolved --> mention_row[(DiscoveryContentMention)]
  link_unresolved --> mention_row

Stages:

EntityMentionExtractor — app/tasks/discovery_llm_extraction/entity_mention_extractor.rb. LLM call (default gpt-5-nano) with a structured-output schema. The prompt receives a chunks payload (one entry per chunk with chunk_id and text) and returns { text, category, chunk_id, start_offset, end_offset } mentions where category is one of DiscoveryContentMention::ENTITY_CATEGORIES and offsets locate the mention inside the named chunk’s text.
EntityMentionLinkerCandidateProvider — app/tasks/discovery_llm_extraction/entity_mention_linker_candidate_provider.rb. For each mention, queries the canonical models for the mention’s category (Drug, Disease, Target, Technology, Biomarker, ClinicalTrial, BioloupeIntervention for Intervention, Organisation, Endpoint). Tries exact name/synonym matches first, then multi_flexifind fuzzy matches. Categories without a canonical model (TherapeuticArea, DevelopmentPhase, Other) yield no candidates and stay unresolved.
EntityMentionCandidateResolver — app/tasks/discovery_llm_extraction/entity_mention_candidate_resolver.rb. Picks the top candidate deterministically when confidence is high and clearly separated; otherwise calls the LLM once per content item with every undecided mention and its candidate shortlist (a LinkingDecisionSet schema) and parses one decision per input mention, or marks unresolved. Persists mention-level entity links and is the entry point for the resolve_entity_mentions pipeline step (resolve_content_item! / resolve_content_items_batched!).
EntityMentionLinking — app/tasks/discovery_llm_extraction/entity_mention_linking.rb. Drives extraction: marks the item processing, requires content_chunks to already exist (the pipeline creates them — slide/page chunks for investor decks via Discovery::ContentChunkingService, otherwise a single whole-body fallback chunk with metadata.pipeline = "discovery_entity_extraction_linking" created by Discovery::Pipeline#content_chunks_for_extraction), clears prior pipeline mentions, and persists mentions in one transaction with start_offset / end_offset located by re-finding each surface form inside its chunk. Resolution lives in EntityMentionCandidateResolver; resolved mention-level links are deduped there by (entity_type, entity_id) within the item (pre-existing item-level links count too): duplicate resolutions still persist the mention row but skip the entity-link write.
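
The "pick deterministically when confidence is high and clearly separated" rule in the resolver stage can be sketched as below. The thresholds, the Candidate struct, and pick_deterministically are illustrative assumptions; the document does not state the actual cut-offs or scoring used by EntityMentionCandidateResolver.

```ruby
# Hypothetical candidate shape: a canonical entity plus a match score in 0..1.
Candidate = Struct.new(:entity_type, :entity_id, :score, keyword_init: true)

# Illustrative thresholds only; not values from the codebase.
MIN_SCORE = 0.9   # top candidate must be at least this confident
MIN_MARGIN = 0.1  # and lead the runner-up by at least this much

# Returns the winning candidate, or nil when the mention needs LLM adjudication
# (or has no candidates at all).
def pick_deterministically(candidates)
  ranked = candidates.sort_by { |c| -c.score }
  top, runner_up = ranked
  return nil if top.nil? || top.score < MIN_SCORE
  return top if runner_up.nil?

  (top.score - runner_up.score) >= MIN_MARGIN ? top : nil
end

clear = [
  Candidate.new(entity_type: "Drug", entity_id: 1, score: 0.97),
  Candidate.new(entity_type: "Drug", entity_id: 2, score: 0.40)
]
ambiguous = [
  Candidate.new(entity_type: "Drug", entity_id: 1, score: 0.95),
  Candidate.new(entity_type: "Drug", entity_id: 2, score: 0.93)
]
winner = pick_deterministically(clear)        # entity 1 wins outright
undecided = pick_deterministically(ambiguous) # nil: defer to the LLM
```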

Thor task

Mention extraction is run through the Discovery Pipeline command surface:

bundle exec thor discovery:pipeline:extract_mentions --source=investor_presentations --ids 123

The pipeline step picks the item’s markdown body if present, otherwise the plain_text body, otherwise the oldest body. Items without an extracted body are skipped.
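
The body preference order reads naturally as a chain of fallbacks. A minimal sketch, assuming bodies are available as an in-memory list and that "oldest" means the earliest created_at; BodyRow and select_body are hypothetical names.

```ruby
# Hypothetical stand-in for a discovery_content_bodies row.
BodyRow = Struct.new(:source_format, :created_at, keyword_init: true)

# Prefer markdown, then plain_text, then whichever body is oldest.
# Returns nil when the item has no bodies, in which case it would be skipped.
def select_body(bodies)
  bodies.find { |b| b.source_format == "markdown" } ||
    bodies.find { |b| b.source_format == "plain_text" } ||
    bodies.min_by(&:created_at)
end

bodies = [
  BodyRow.new(source_format: "html", created_at: Time.at(100)),
  BodyRow.new(source_format: "plain_text", created_at: Time.at(200)),
  BodyRow.new(source_format: "markdown", created_at: Time.at(300))
]
chosen = select_body(bodies)
```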

Re-runs

EntityMentionLinking#persist! deletes mentions where extraction_method = "discovery_entity_extraction_linking" under the selected body before re-inserting. Whole-body fallback chunks owned by that pipeline are removed when slide/page chunks exist, so re-runs replace this pipeline’s outputs without disturbing chunks owned by other producers.

Facets

Facets are typed dimensions attached to an item, e.g. publication.article_type=Journal Article or sec.form_type=10-Q. The registry decides which keys are valid.

Registry

DiscoveryFacetDefinition.create!(
  namespace: "publication",          # the source/domain the facet belongs to
  key: "article_type",
  label: "Article Type",
  value_type: "controlled",          # controlled | freeform
  applicable_content_kinds: ["publication"]  # optional scope gate
)

For controlled facets, add allowed values:

fd.discovery_facet_values.create!(value: "Journal Article", label: "Journal Article")

Assigning

service = Discovery::ContentItemService.new
fd = DiscoveryFacetDefinition.find_by!(namespace: "publication", key: "article_type")
fv = fd.discovery_facet_values.find_by!(value: "Journal Article")

service.assign_facet(content_item, facet_definition: fd, value: fv)
# freeform:
service.assign_facet(content_item, facet_definition: other_fd, freeform_value: "v1.2")

assign_facet uses find_or_initialize_by, so duplicate calls are no-ops. Assignments are rejected if the item’s content_kind is not in the definition’s applicable_content_kinds (when that list is non-empty).

Auto-assignment via LLM

DiscoveryLlmExtraction::FacetAssigner (app/tasks/discovery_llm_extraction/facet_assigner.rb) is the LLM-driven facet assigner used by the assign_facets_with_llm pipeline step. It gathers every DiscoveryFacetDefinition whose applicable_content_kinds matches the item’s content_kind, sends the source payload + facet catalogue to the LLM (default gpt-5-nano) with a structured-output schema, validates the returned qualified_key and (for controlled facets) the value against allowed values, and applies the surviving assignments through ContentItemService#assign_facet. Unknown keys or unknown controlled values are dropped into an ignored list, not raised.
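
The validation step, where unknown keys and unknown controlled values are set aside instead of raising, can be sketched like this. The catalogue shape, the "namespace.key" form of qualified_key, and partition_assignments are assumptions made for illustration; the real FacetAssigner may structure this differently.

```ruby
# catalogue maps a qualified key to { value_type:, allowed: [...] }.
# proposed is the list of { qualified_key:, value: } hashes returned by the LLM.
def partition_assignments(catalogue, proposed)
  accepted = []
  ignored = []

  proposed.each do |assignment|
    definition = catalogue[assignment[:qualified_key]]

    if definition.nil?
      ignored << assignment.merge(reason: "unknown key")
    elsif definition[:value_type] == "controlled" &&
          !definition[:allowed].include?(assignment[:value])
      ignored << assignment.merge(reason: "unknown controlled value")
    else
      accepted << assignment # freeform values pass through unchanged
    end
  end

  { accepted: accepted, ignored: ignored }
end

catalogue = {
  "publication.article_type" => { value_type: "controlled", allowed: ["Journal Article"] },
  "publication.version"      => { value_type: "freeform", allowed: [] }
}
proposed = [
  { qualified_key: "publication.article_type", value: "Journal Article" },
  { qualified_key: "publication.article_type", value: "Blog Post" },
  { qualified_key: "publication.version", value: "v1.2" },
  { qualified_key: "sec.unknown", value: "x" }
]
result = partition_assignments(catalogue, proposed)
```

Only the accepted entries would then be passed on to ContentItemService#assign_facet.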

Seeding

Initial facet definitions + values live in lib/tasks/one_off/data/discovery_facets.json and are seeded with:

bundle exec thor one_off:seed_discovery_facets:seed

Adding a new facet: edit the JSON, re-run the thor task. The task is idempotent — existing definitions/values are upserted, not duplicated.
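
Idempotent seeding amounts to an upsert keyed by (namespace, key). The sketch below shows that idea against an in-memory registry; the JSON layout is hypothetical (the actual structure of discovery_facets.json is not shown in this document), as is the seed_facets helper.

```ruby
require "json"

# Upsert facet definitions into an in-memory registry keyed by [namespace, key].
# Running it twice with the same JSON leaves the registry unchanged.
def seed_facets(registry, json)
  JSON.parse(json).each do |entry|
    key = [entry.fetch("namespace"), entry.fetch("key")]
    definition = (registry[key] ||= { "values" => [] })
    definition["label"] = entry["label"]
    definition["value_type"] = entry["value_type"]
    # Array union adds new values without duplicating existing ones.
    definition["values"] |= entry.fetch("values", [])
  end
  registry
end

# Hypothetical JSON layout, for illustration only.
json = <<~JSON
  [
    {
      "namespace": "publication",
      "key": "article_type",
      "label": "Article Type",
      "value_type": "controlled",
      "values": ["Journal Article", "Review"]
    }
  ]
JSON

registry = {}
seed_facets(registry, json)
seed_facets(registry, json) # second run is a no-op
```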

Admin surface

Page	Route	Purpose
Discovery Pipeline	`/admin/discovery_pipeline`	React workflow console (`DiscoveryWorkflowConsole`) for running the five pipeline steps against a chosen source. Backed by `DiscoveryWorkflow`.
Workflow Pipelines	`/admin/workflow_pipelines`	Cross-pipeline dashboard. Includes a Discovery panel that links into the Discovery Pipeline console.
Content Items	`/admin/discovery_content_items`	Browse, inspect, manage items. File and URL submission forms. Facet assignment panel.
Discovery Browse	`/admin/discovery_browse`	React-driven faceted browse view. Server-paginated via `/admin/discovery_content_items/search.json`.
Facet Definitions	`/admin/discovery_facet_definitions`	CRUD on the facet registry.
Facet Values	`/admin/discovery_facet_values`	CRUD on controlled facet values.

Adding a new source

Typical walkthrough for a new source projection (e.g. a new SEC filing stream):

Pick a content_kind. If none of the existing constants in DiscoveryContentItem::CONTENT_KINDS fit, add one there. Keep the list coarse — fine-grained classification belongs in facets.
Register source-specific facets if you need filtering dimensions that don’t apply to everything: edit lib/tasks/one_off/data/discovery_facets.json, add entries under your namespace (e.g. sec), run the seed task.
Write a projection job that iterates your source records and calls ContentItemService#create_from_source. Idempotency is the job’s responsibility — key off source_type+source_id or external_id.
Extraction workflows later attach bodies/chunks/mentions/entity_links to the existing item. They do NOT create new items; they find by (source_type, source_id) or by item id.
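
The idempotency requirement on the projection job can be sketched with an in-memory store keyed by (source_type, source_id). Everything here is a hypothetical stand-in: a real job would call ContentItemService#create_from_source and check for an existing DiscoveryContentItem instead of a Hash.

```ruby
# Hypothetical source record; a real job would iterate ActiveRecord rows.
SourceRecord = Struct.new(:id, :title, keyword_init: true)

class SketchProjection
  attr_reader :items

  def initialize
    @items = {} # keyed by [source_type, source_id]
  end

  # Creates one item per source record and skips records already projected.
  # Returns the number of items created by this call.
  def project(records, source_type:, content_kind:)
    created = 0
    records.each do |record|
      key = [source_type, record.id]
      next if @items.key?(key)

      @items[key] = {
        title: record.title,
        content_kind: content_kind,
        submission_source: "source_projection"
      }
      created += 1
    end
    created
  end
end

records = [SourceRecord.new(id: 1, title: "10-K 2025"), SourceRecord.new(id: 2, title: "10-Q Q1")]
projection = SketchProjection.new
first_run = projection.project(records, source_type: "OrganisationSecFiling", content_kind: "sec_filing")
second_run = projection.project(records, source_type: "OrganisationSecFiling", content_kind: "sec_filing")
```

The source_type and content_kind strings in the usage lines are illustrative values, not confirmed constants.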

Not yet built

These remain deferred:

General text extraction workflows for arbitrary uploads/URLs (PDF→text, HTML→text, etc.). Investor presentation PDFs and SEC filings are the current source-specific exceptions; entity extraction consumes an existing body, it does not produce one
General chunking pipelines for sources beyond investor presentation slide/page chunks and SEC filing PART/ITEM chunks
Embeddings and vector indexes (pgvector column on chunks is a reserved extension point, not yet defined)
Retrieval API for the agent
End-user semantic search UI
Strong file-level deduplication

Retry/idempotency for derived-output writers is per-pipeline — the entity extraction pipeline replaces its own mentions on re-run by filtering on extraction_method. Other future extractors decide their own rerun strategy.

Code locations

Models: app/models/discovery_*.rb
Services: app/services/discovery/
- ContentItemService — core creator + facet assignment
- SubmissionService — admin upload/URL wrappers
- ProducerContract — full attach-point documentation (source of truth for the integration contract)
- Pipeline — five-step end-to-end orchestrator (populate → facets → extract → resolve → link)
- ContentChunkingService — slide/page chunker for investor presentation bodies (one section chunk per ## Slide N: / ## Page N: heading)
- SecFilingExtractor — fetches SEC filing documents and normalises bodies for the sec_filings populate step; HTML filings are rendered to markdown (tables converted, figures inlined) and plain-text filings are written as-is
- SecFilingVisualExtractor — LLM-powered (default gpt-5-mini) figure/image extractor invoked by SecFilingExtractor to convert embedded charts and tables into markdown
- SecFilingChunker — splits a normalised SEC filing body into section chunks keyed on PART/ITEM headings (≤ 12,000 chars per chunk); markdown tables are kept inside a single chunk and table-of-contents ITEM rows are absorbed into the prior section
Workflow: app/workflows/discovery_workflow.rb (DiscoveryWorkflow)
Entity-extraction tasks: app/tasks/discovery_llm_extraction/ (EntityMentionExtractor, EntityMentionLinkerCandidateProvider, EntityMentionCandidateResolver, EntityMentionLinking, FacetAssigner, EntityAnchorQuestionGenerator)
Thor tasks:
- lib/tasks/discovery/pipeline.thor — pipeline-step entry points
Admin: app/admin/discovery_*.rb, app/admin/workflow_pipelines.rb
React UI: app/javascript/bundles/Discovery/ (includes DiscoveryWorkflowConsole)
Seed data: lib/tasks/one_off/data/discovery_facets.json + lib/tasks/one_off/seed_discovery_facets.thor
Migrations: db/migrate/20260416100000_create_discovery_tables.rb, db/migrate/20260416130000_create_discovery_bodies_and_chunks.rb, db/migrate/20260417100000_create_discovery_content_mentions.rb, db/migrate/20260422083510_add_entity_extraction_fields_to_discovery_tables.rb, db/migrate/20260511082425_add_content_extraction_tracking_to_organisation_sec_filings.rb