
Evidence Discovery

A derived content layer that sits alongside the canonical source tables (publications, news, drug_approvals, clinical_trials, …). Its job is to give the agent — and eventually end-user search — a single place to look up evidence-bearing text regardless of origin (admin uploads, source projections, later agentic workflows).

The substrate ships with infrastructure and governance for the content graph plus a five-step end-to-end pipeline (Discovery::Pipeline, orchestrated by DiscoveryWorkflow) that populates content items, auto-assigns facets via LLM, extracts mentions, resolves them against candidates, and links them to canonical entities. Investor presentation bodies are chunked by extracted slide/page headings, and SEC filing bodies are fetched on-demand and chunked by PART/ITEM headings; broader chunking strategies, embeddings, and retrieval remain future work.

All tables are prefixed discovery_. Relationships are one-to-many unless noted.

| Table | Purpose | Parent |
| --- | --- | --- |
| discovery_content_items | Canonical content identity, provenance, processing status | |
| discovery_content_assets | URL references attached to an item (uploads use Active Storage) | item |
| discovery_content_bodies | Extracted/normalised full text. Unique per (item, source_format) | item |
| discovery_content_chunks | Retrieval units derived from a body. Unique position per body | body |
| discovery_content_mentions | Resolved evidence spans inside a chunk | chunk |
| discovery_entity_links | Canonical-entity attachments on an item (polymorphic) | item |
| discovery_facet_definitions | Facet registry keyed by (namespace, key) | |
| discovery_facet_values | Allowed values for controlled facets | facet_definition |
| discovery_content_facets | Facet assignments on an item | item + facet_definition |

Schema invariants worth knowing:

  • DiscoveryContentItem#description is a short human-readable summary/snippet for browsing. It is not the extracted document body. Full normalized text belongs in discovery_content_bodies.
  • Each derived row carries only its nearest parent (chunk.body_id, mention.chunk_id). Everything above is reached via :through associations, e.g. content_item.discovery_content_chunks or mention.discovery_content_item. This avoids the “parent IDs drift” bug class at the cost of a join on item-scoped lookups.
  • DiscoveryEntityLink.entity and DiscoveryContentMention.entity are polymorphic; entity_type is a model class name (e.g. Drug, Disease, Organisation) and is nil until a resolution workflow populates it.
  • DiscoveryEntityLink.match_status is resolved or unresolved. Resolved links must carry an entity_type/entity_id; unresolved links must not. A link can be either an item-level attachment (discovery_content_mention_id is null) or a per-mention adjudication outcome (discovery_content_mention_id is set, unique per mention).
  • Facet assignments are either controlled (discovery_facet_value_id set) or freeform (freeform_value set). The two are mutually exclusive and each is enforced by a partial unique index.
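The nearest-parent rule can be illustrated without Rails. The following is a minimal plain-Ruby sketch, not the actual ActiveRecord models: the Struct names and the content_item helper are hypothetical stand-ins for the :through associations described above.

```ruby
# Plain-Ruby stand-ins for the derived rows (hypothetical, not the real models).
# Each row stores only its nearest parent, mirroring chunk.body_id / mention.chunk_id.
Item    = Struct.new(:id, :title)
Body    = Struct.new(:id, :item)
Chunk   = Struct.new(:id, :body)
Mention = Struct.new(:id, :chunk, :surface_form) do
  # Similar in spirit to a :through lookup: walk up the parent chain on demand.
  def content_item
    chunk.body.item
  end
end

item    = Item.new(1, "Phase 3 Study Protocol for Drug X")
body    = Body.new(10, item)
chunk   = Chunk.new(100, body)
mention = Mention.new(1000, chunk, "pembrolizumab")

puts mention.content_item.title
```

Because a mention never stores the item directly, re-parenting a body cannot leave a stale item reference on its chunks or mentions; the price is the extra hops (joins, in the real schema) on item-scoped lookups.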

DiscoveryContentItem#processing_status is a string enum with four states:

pending → created, nothing processed yet (default)
processing → a workflow is actively working on it
processed → workflow finished; derived outputs may be attached
failed → workflow raised; processing_error holds the message

Transitions go through the helpers on the model:

item.mark_processing!
item.mark_processed! # sets processed_at, clears error
item.mark_failed!("reason") # stores the error string
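As a rough illustration of the lifecycle, here is a plain-Ruby sketch of the four states and the three helpers. StatusSketch is a hypothetical stand-in, not DiscoveryContentItem itself, and whether the real helpers guard against invalid transitions is not stated here, so the sketch does not enforce any.

```ruby
# Hypothetical stand-in for DiscoveryContentItem; only the status fields are modelled.
class StatusSketch
  STATUSES = %w[pending processing processed failed].freeze

  attr_reader :processing_status, :processing_error, :processed_at

  def initialize
    @processing_status = "pending" # default state
  end

  def mark_processing!
    @processing_status = "processing"
  end

  def mark_processed!
    @processing_status = "processed"
    @processed_at = Time.now
    @processing_error = nil # a successful run clears any earlier error
  end

  def mark_failed!(reason)
    @processing_status = "failed"
    @processing_error = reason.to_s
  end
end

item = StatusSketch.new
item.mark_processing!
item.mark_failed!("extractor timed out")
item.mark_processing! # retry after the failure
item.mark_processed!
puts item.processing_status
```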

All producers — admin console, source projections, future agentic workflows — MUST go through Discovery::ContentItemService or the submission helpers on top of it. This is the shared substrate entry point.

# Direct creation with explicit attributes
service = Discovery::ContentItemService.new
result = service.create(
  title: "Phase 3 Study Protocol for Drug X",
  content_kind: "protocol",               # must be in DiscoveryContentItem::CONTENT_KINDS
  submission_source: "source_projection", # admin_upload, admin_url, source_projection, agentic
  source_type: "ClinicalTrial",           # optional polymorphic back-link
  source_id: trial.id,
  document_date: Date.parse("2026-01-15"),
  metadata: { version: "2.0" }
)
content_item = result.content_item if result.success?

# Projection from an existing source record
result = service.create_from_source(
  publication,
  content_kind: "publication",
  document_date: publication.publish_date
)

create_from_source stamps source_type/source_id from the record and sets submission_source: "source_projection". Use it for source-table projections.

Admin-driven paths go through Discovery::SubmissionService:

Discovery::SubmissionService.new.submit_upload(
  file: uploaded_file, content_kind: "press_release", submitted_by: current_user
)
Discovery::SubmissionService.new.submit_url(
  url: "https://example.com/deck.pdf", content_kind: "corporate_deck",
  submitted_by: current_user
)

Uploads attach via Active Storage (item.files); URL submissions also create a discovery_content_assets row for the raw URL reference. Source projections with a URL (for example investor presentation links) also mirror that URL into discovery_content_assets so link/file references are kept separate from extracted text bodies.

Later extraction, chunking, mention-detection, and entity-resolution workflows attach to an existing content item. Reuse these attach points — do NOT create parallel tables or new content items for derived outputs.

body = content_item.discovery_content_bodies.create!(
  source_format: "plain_text", # pdf | html | plain_text | markdown
  extraction_method: "tika",   # optional: tika | trafilatura | manual | llm
  body: extracted_text
)

Unique per (content_item, source_format) — if your extractor wants to store multiple formats for the same item, use different source_format values.

body.discovery_content_chunks.create!(
  position: 0,                 # unique within the body
  content: chunk_text,
  chunk_strategy: "paragraph", # fixed_size | paragraph | section | semantic
  token_count: 128
)

Item-scoped access via content_item.discovery_content_chunks (joins through bodies).

chunk.discovery_content_mentions.create!(
  mention_type: "drug",     # see DiscoveryContentMention::MENTION_TYPES
  category: "Drug",         # see DiscoveryContentMention::ENTITY_CATEGORIES
  surface_form: "pembrolizumab",
  start_offset: 42,         # optional, character offset into chunk.content
  end_offset: 55,
  resolution_method: "llm", # optional: ner | llm | dictionary | manual
  extraction_method: "discovery_entity_extraction_linking",
  entity_type: "Drug",      # optional; nil until resolution
  entity_id: drug.id,
  confidence: 0.92
)

Body and item reached via :through (mention.discovery_content_item).

mention_type is the lowercase storage form; category is the canonical class-name form (e.g. Drug, ClinicalTrial, Organisation) used by the extraction/linking pipeline.
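The start_offset/end_offset pair in the mention example can be derived by searching for the surface form inside the chunk text. The helper below is a hypothetical sketch, not part of the codebase; it assumes end_offset is exclusive, which is consistent with the 42/55 example above ("pembrolizumab" is 13 characters long).

```ruby
# Locate a surface form inside chunk text and return [start_offset, end_offset].
# Assumption: end_offset is exclusive (start_offset + surface form length).
# Returns nil when the surface form does not occur at or after `from`.
def locate_surface_form(chunk_text, surface_form, from: 0)
  start_offset = chunk_text.index(surface_form, from)
  return nil if start_offset.nil?

  [start_offset, start_offset + surface_form.length]
end

chunk_text = "Patients received pembrolizumab every three weeks."
start_offset, end_offset = locate_surface_form(chunk_text, "pembrolizumab")
puts chunk_text[start_offset...end_offset]
```

The from: argument lets a caller find a second occurrence of the same surface form by searching past the first match.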

Entity links can be item-level attachments (no mention) or mention-level adjudication outcomes (one link per mention).

# Item-level (no mention) — populated from aggregates or content-level classifiers
content_item.discovery_entity_links.create!(
entity_type: "Drug",
entity_id: drug.id,
link_type: "primary_subject", # mentioned | primary_subject | related
match_status: "resolved", # resolved | unresolved
confidence: 0.95
)
# Mention-level — one link per mention, written by the extraction pipeline
content_item.discovery_entity_links.create!(
discovery_content_mention: mention,
entity_type: "Drug", # nil for unresolved
entity_id: drug.id, # nil for unresolved
link_type: "mentioned",
match_status: "resolved",
confidence: 0.92
)

Item-level links are unique per (content_item, entity_type, entity_id) where discovery_content_mention_id IS NULL. Mention-level links are unique per discovery_content_mention_id. Resolved links require an entity; unresolved links must leave entity_type/entity_id blank.
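The validity and uniqueness rules above can be written down as a small plain-Ruby check. LinkSketch and its two methods are hypothetical names used for illustration only; the real rules are enforced by the model validations and database indexes.

```ruby
# Hypothetical plain-Ruby model of the entity-link rules described above.
LinkSketch = Struct.new(:content_item_id, :mention_id, :entity_type, :entity_id, :match_status,
                        keyword_init: true) do
  # Resolved links must carry an entity; unresolved links must not.
  def valid?
    has_entity = !entity_type.nil? && !entity_id.nil?
    no_entity  = entity_type.nil? && entity_id.nil?
    case match_status
    when "resolved"   then has_entity
    when "unresolved" then no_entity
    else false
    end
  end

  # The tuple under which a link must be unique.
  def uniqueness_key
    if mention_id.nil?
      [:item, content_item_id, entity_type, entity_id] # item-level link
    else
      [:mention, mention_id]                           # mention-level link
    end
  end
end

item_level = LinkSketch.new(content_item_id: 1, entity_type: "Drug", entity_id: 7,
                            match_status: "resolved")
unresolved = LinkSketch.new(content_item_id: 1, mention_id: 42, match_status: "unresolved")
broken     = LinkSketch.new(content_item_id: 1, mention_id: 43, entity_type: "Drug", entity_id: 7,
                            match_status: "unresolved")
puts [item_level, unresolved, broken].map(&:valid?).inspect
```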

Discovery::Pipeline (app/services/discovery/pipeline.rb) is the end-to-end workflow that turns a source record into a discovery content item with facets, mentions, and entity links. It is orchestrated by DiscoveryWorkflow (app/workflows/discovery_workflow.rb) and exposed via thor tasks under lib/tasks/discovery/pipeline.thor.

The five steps:

| Step | Thor task | What it does |
| --- | --- | --- |
| populate_content_item_from_source | discovery:pipeline:populate_content_items | Iterates eligible source records and upserts a DiscoveryContentItem keyed by (source_type, source_id). For most sources, writes a plain_text discovery_content_bodies row (extraction_method: "manual") from the source text and keeps description as a bounded summary. For investor_presentations (scope: records with both a presentation_link and a non-blank presentation_content), it upserts the content item, upserts a discovery_content_assets row from presentation_link, writes presentation_content as the plain_text body (extraction_method: "llm"), and chunks the body into one section chunk per ## Slide N:/## Page N: block. Presentations without presentation_content are skipped; populate them first by running clinical_trials:investor_presentations:extract (InvestorPresentations::PdfContentExtractionTask). When a DiscoveryContentItem already exists for the presentation, the step refreshes only its source asset and otherwise leaves it alone. For sec_filings (scope: Discovery::Pipeline::SEC_DISCOVERY_FORMS, currently 10-K, 10-Q, and 20-F; filings need a non-blank filing_url or sec_index_url and are restricted to those the eligible_for_content_extraction scope considers retryable), it fetches the document via Discovery::SecFilingExtractor (HTML/plain_text, retries transient HTTP failures up to three times), writes the normalised text as a plain_text body (extraction_method: "manual"), and chunks it via Discovery::SecFilingChunker into section chunks keyed on PART/ITEM headings (≤ 12,000 chars per chunk). Extraction state is recorded back onto the organisation_sec_filings row (content_extraction_status, content_extraction_attempt_count, content_extraction_failure_kind, content_extraction_error); only filings under OrganisationSecFiling::CONTENT_EXTRACTION_MAX_ATTEMPTS that are unattempted or in retryable_failed are picked up on re-runs. |
| assign_facets_with_llm | discovery:pipeline:assign_facets_with_llm | Calls DiscoveryLlmExtraction::FacetAssigner with the persisted content body text plus source metadata, then applies facet assignments via ContentItemService#assign_facet. |
| extract_mentions | discovery:pipeline:extract_mentions | Calls DiscoveryLlmExtraction::EntityMentionLinking#extract_mentions! against the persisted content body to write DiscoveryContentMention rows with character offsets. Investor presentations are extracted per slide/page chunk; other sources reuse existing body chunks when present, otherwise the pipeline creates a single whole-body fallback chunk first. |
| resolve_entity_mentions | discovery:pipeline:resolve_entity_mentions | Calls DiscoveryLlmExtraction::EntityMentionCandidateResolver#resolve_content_item! for items whose mentions already exist. Looks up candidates per mention, picks deterministically when confidence is high, otherwise calls the LLM once per content item with all undecided mentions and their candidate shortlists, then writes resolved/unresolved DiscoveryEntityLink rows. Items with no prior pipeline mentions are skipped; run extract_mentions first. |
| link_mentions_to_entities | discovery:pipeline:link_mentions_to_entities | Verifies that pipeline-extracted mentions and resolved links exist for each item and marks the step complete. Items with no prior mentions or no resolved links are skipped; run extract_mentions and resolve_entity_mentions first. |
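The slide/page chunking performed by the populate step can be sketched in plain Ruby. This is an illustrative approximation only: chunk_by_slide_headings is a hypothetical helper, the real Discovery::ContentChunkingService may differ, and dropping any text before the first heading is an assumption of this sketch.

```ruby
# Split a presentation body into one section chunk per "## Slide N:" / "## Page N:" block.
# Hypothetical sketch; not the actual Discovery::ContentChunkingService.
HEADING = /^## (?:Slide|Page) \d+:/

def chunk_by_slide_headings(body_text)
  # The lookahead splits before each heading, so every heading stays with its own section.
  sections = body_text.split(/(?=^## (?:Slide|Page) \d+:)/)
  sections
    .map(&:strip)
    .select { |section| section.match?(HEADING) } # assumption: preamble before the first heading is dropped
    .each_with_index
    .map { |section, position| { position: position, chunk_strategy: "section", content: section } }
end

body = <<~TEXT
  Extracted from deck.pdf
  ## Slide 1: Overview
  Drug X pipeline

  ## Slide 2: Phase 3 data
  ORR 45%
  ## Page 3: Appendix
  Notes
TEXT

chunk_by_slide_headings(body).each { |chunk| puts "#{chunk[:position]}: #{chunk[:content].lines.first}" }
```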

Source scopes accepted by every step (--source option): all, publications, investor_presentations, news, clinical_trials, drug_approvals, sec_filings. The --ids option narrows to specific source-record ids (combined with a non-all source). Every step also accepts --submission-source (default source_projection) to scope by DiscoveryContentItem#submission_source.

By default each step skips records whose output for that step already exists: assign_facets_with_llm excludes items that already have any discovery_content_facets, extract_mentions excludes items that already have any discovery_content_mentions, resolve_entity_mentions excludes items with at least one resolved discovery_entity_links row, and link_mentions_to_entities excludes items with any discovery_entity_links. Pass --override (boolean, default false) on any step to ignore that filter and reprocess every in-scope record; for populate_content_items --override also re-extracts and overwrites an existing DiscoveryContentItem instead of leaving it intact. The step-scoping logic lives in Discovery::Pipeline#apply_step_scope.
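The skip rules can be pictured as one predicate per step. The sketch below is a hypothetical in-memory version; the real filtering is done with database scopes in Discovery::Pipeline#apply_step_scope, and the hash keys used here are illustrative.

```ruby
# Hypothetical in-memory version of the per-step skip rule.
# Each item is a hash of counts for the outputs that the steps check.
SKIP_IF_PRESENT = {
  assign_facets_with_llm:    ->(item) { item[:facets] > 0 },
  extract_mentions:          ->(item) { item[:mentions] > 0 },
  resolve_entity_mentions:   ->(item) { item[:resolved_links] > 0 },
  link_mentions_to_entities: ->(item) { item[:entity_links] > 0 }
}.freeze

def items_for_step(items, step, override: false)
  return items if override # --override reprocesses every in-scope record

  items.reject { |item| SKIP_IF_PRESENT.fetch(step).call(item) }
end

items = [
  { id: 1, facets: 0, mentions: 0, resolved_links: 0, entity_links: 0 },
  { id: 2, facets: 3, mentions: 5, resolved_links: 0, entity_links: 2 } # two unresolved links
]

puts items_for_step(items, :resolve_entity_mentions).map { |item| item[:id] }.inspect
```

Item 2 illustrates the asymmetry in the rules: it has only unresolved links, so resolve_entity_mentions still picks it up while link_mentions_to_entities skips it.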

Run a single step:

bundle exec thor discovery:pipeline:populate_content_items --source=publications
bundle exec thor discovery:pipeline:assign_facets_with_llm --source=publications --ids 123 456
bundle exec thor discovery:pipeline:extract_mentions --source=publications

The LLM-driven steps (assign_facets_with_llm, extract_mentions, and resolve_entity_mentions) default to OpenAI batch mode and accept --batched (boolean, default true), --batch-size (default 1000), and --parallelism (default 1). Pass --batched=false to fall back to sequential per-item LLM calls — useful for small --ids runs or local debugging. In batched mode each step collects all in-scope LLM prompts across items and submits them via OpenAiService batch mode (see Discovery::Pipeline#discovery_llm_client): extract_mentions routes one prompt per content item (with all that item’s chunks in the payload) through the batch, and resolve_entity_mentions routes one prompt per content item (with every mention that needs LLM adjudication for that item) through the batch. link_mentions_to_entities is a verification step that does no LLM calls and only accepts --source, --submission-source, --ids, and --override.
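To make the effect of --batched and --batch-size concrete, here is a small sketch of how per-item prompts might be grouped. plan_llm_calls is a hypothetical helper; it does not model --parallelism or the OpenAiService batch API itself.

```ruby
# Group per-item prompts into LLM submissions.
# Hypothetical sketch of the --batched / --batch-size behaviour described above.
def plan_llm_calls(prompts, batched: true, batch_size: 1000)
  return prompts.map { |prompt| [prompt] } unless batched # sequential: one call per item

  prompts.each_slice(batch_size).to_a
end

prompts = (1..2500).map { |id| "prompt for content item #{id}" }

puts plan_llm_calls(prompts).map(&:size).inspect
puts plan_llm_calls(prompts, batched: false).size
```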

The full chain is also driven from the admin Discovery Pipeline console (see Admin surface).

The extract_mentions and resolve_entity_mentions pipeline steps run the entity-mention pipeline. Extraction lives in DiscoveryLlmExtraction::EntityMentionLinking (app/tasks/discovery_llm_extraction/entity_mention_linking.rb); candidate lookup and disambiguation live in DiscoveryLlmExtraction::EntityMentionCandidateResolver (app/tasks/discovery_llm_extraction/entity_mention_candidate_resolver.rb). Together they take a content item plus its persisted plain-text body and write mentions plus mention-level entity links against the selected chunks. extract_mentions! requires content_chunks to be supplied — the Discovery::Pipeline step is the integration point that prepares them (slide/page chunks for investor decks, an existing body chunk if one exists, or a whole-body fallback chunk otherwise) before invoking extraction.
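The chunk preparation described above amounts to "reuse what exists, otherwise fall back to the whole body". A minimal sketch follows, assuming chunks are plain hashes; chunks_for_extraction is a hypothetical helper, not the actual Discovery::Pipeline#content_chunks_for_extraction.

```ruby
# Hypothetical sketch of choosing chunks before mention extraction.
PIPELINE_NAME = "discovery_entity_extraction_linking"

def chunks_for_extraction(body_text, existing_chunks)
  # Slide/page chunks or other existing body chunks are reused as they are.
  return existing_chunks unless existing_chunks.empty?

  # Otherwise create a single whole-body fallback chunk, tagged as pipeline-owned.
  [{ position: 0, content: body_text, metadata: { pipeline: PIPELINE_NAME } }]
end

slide_chunks = [{ position: 0, content: "## Slide 1: Overview" }]
puts chunks_for_extraction("full body text", slide_chunks).size
puts chunks_for_extraction("full body text", []).first[:metadata].inspect
```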

flowchart LR
  text[plain_text body] --> chunks[Chunks<br/>slide/page for investor decks<br/>whole body fallback otherwise]
  chunks --> extractor["EntityMentionExtractor (LLM)"]
  extractor --> mentions[Mentions]
  mentions --> candidate[EntityMentionLinkerCandidateProvider]
  candidate --> resolver[EntityMentionCandidateResolver]
  resolver -->|resolved| link_resolved[DiscoveryEntityLink<br/>resolved]
  resolver -->|unresolved| link_unresolved[DiscoveryEntityLink<br/>unresolved]
  link_resolved --> mention_row[(DiscoveryContentMention)]
  link_unresolved --> mention_row

Stages:

  • EntityMentionExtractor (app/tasks/discovery_llm_extraction/entity_mention_extractor.rb). LLM call (default gpt-5-nano) with a structured-output schema. The prompt receives a chunks payload (one entry per chunk with chunk_id and text) and returns { text, category, chunk_id, start_offset, end_offset } mentions where category is one of DiscoveryContentMention::ENTITY_CATEGORIES and offsets locate the mention inside the named chunk’s text.
  • EntityMentionLinkerCandidateProvider (app/tasks/discovery_llm_extraction/entity_mention_linker_candidate_provider.rb). For each mention, queries the canonical models for the mention’s category (Drug, Disease, Target, Technology, Biomarker, ClinicalTrial, BioloupeIntervention for Intervention, Organisation, Endpoint). Tries exact name/synonym matches first, then multi_flexifind fuzzy matches. Categories without a canonical model (TherapeuticArea, DevelopmentPhase, Other) yield no candidates and stay unresolved.
  • EntityMentionCandidateResolver (app/tasks/discovery_llm_extraction/entity_mention_candidate_resolver.rb). Picks the top candidate deterministically when confidence is high and clearly separated; otherwise calls the LLM once per content item with every undecided mention and its candidate shortlist (a LinkingDecisionSet schema) and parses one decision per input mention, or marks unresolved. Persists mention-level entity links and is the entry point for the resolve_entity_mentions pipeline step (resolve_content_item! / resolve_content_items_batched!).
  • EntityMentionLinking (app/tasks/discovery_llm_extraction/entity_mention_linking.rb). Drives extraction: marks the item processing, requires content_chunks to already exist (the pipeline creates them: slide/page chunks for investor decks via Discovery::ContentChunkingService, otherwise a single whole-body fallback chunk with metadata.pipeline = "discovery_entity_extraction_linking" created by Discovery::Pipeline#content_chunks_for_extraction), clears prior pipeline mentions, and persists mentions in one transaction with start_offset / end_offset located by re-finding each surface form inside its chunk. Resolution lives in EntityMentionCandidateResolver; resolved mention-level links are deduped there by (entity_type, entity_id) within the item (pre-existing item-level links count too): duplicate resolutions still persist the mention row but skip the entity-link write.
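The "high and clearly separated" rule used by the resolver can be sketched as a threshold plus a margin over the runner-up. Both constants and the pick_deterministically helper are assumptions made for illustration; the actual values and logic in EntityMentionCandidateResolver are not stated here.

```ruby
# Hypothetical thresholds; the real values are not documented in this page.
MIN_CONFIDENCE = 0.9
MIN_MARGIN     = 0.1

# Returns the top candidate when it is confident and clearly ahead of the runner-up,
# or nil when the mention should go to LLM adjudication instead.
def pick_deterministically(candidates)
  ranked = candidates.sort_by { |candidate| -candidate[:confidence] }
  top, runner_up = ranked
  return nil if top.nil? || top[:confidence] < MIN_CONFIDENCE
  return top if runner_up.nil?

  (top[:confidence] - runner_up[:confidence]) >= MIN_MARGIN ? top : nil
end

clear = [
  { entity_type: "Drug", entity_id: 7, confidence: 0.97 },
  { entity_type: "Drug", entity_id: 9, confidence: 0.60 }
]
ambiguous = [
  { entity_type: "Drug", entity_id: 7, confidence: 0.95 },
  { entity_type: "Drug", entity_id: 9, confidence: 0.93 }
]

puts pick_deterministically(clear).inspect
puts pick_deterministically(ambiguous).inspect
```

Mentions for which the helper returns nil would be the ones collected into the single per-item LLM call.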

Mention extraction is run through the Discovery Pipeline command surface:

bundle exec thor discovery:pipeline:extract_mentions --source=investor_presentations --ids 123

The pipeline step picks the item’s plain_text body, falling back to the oldest body if no plain_text body exists. Items without an extracted body are skipped.

EntityMentionLinking#persist! deletes mentions where extraction_method = "discovery_entity_extraction_linking" under the selected body before re-inserting. Whole-body fallback chunks owned by that pipeline are removed when slide/page chunks exist, so re-runs replace this pipeline’s outputs without disturbing chunks owned by other producers.
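The replace-on-rerun behaviour can be shown with plain hashes. replace_pipeline_mentions is a hypothetical helper standing in for the delete-then-insert that persist! performs inside a transaction.

```ruby
PIPELINE = "discovery_entity_extraction_linking"

# Replace only the mentions this pipeline wrote; rows from other producers are kept.
# Hypothetical in-memory sketch of the delete-then-reinsert step.
def replace_pipeline_mentions(existing_mentions, new_surface_forms)
  kept  = existing_mentions.reject { |mention| mention[:extraction_method] == PIPELINE }
  fresh = new_surface_forms.map { |form| { surface_form: form, extraction_method: PIPELINE } }
  kept + fresh
end

existing = [
  { surface_form: "Drug X",  extraction_method: PIPELINE },
  { surface_form: "Merck",   extraction_method: "manual_review" } # written by another producer
]

result = replace_pipeline_mentions(existing, ["pembrolizumab", "melanoma"])
puts result.map { |mention| mention[:surface_form] }.inspect
```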

Facets are typed dimensions attached to an item — e.g. publication.article_type=Journal Article or sec.form_type=10-Q. The registry decides which keys are valid.

DiscoveryFacetDefinition.create!(
  namespace: "publication",                 # the source/domain the facet belongs to
  key: "article_type",
  label: "Article Type",
  value_type: "controlled",                 # controlled | freeform
  applicable_content_kinds: ["publication"] # optional scope gate
)

For controlled facets, add allowed values:

fd = DiscoveryFacetDefinition.find_by!(namespace: "publication", key: "article_type")
fd.discovery_facet_values.create!(value: "Journal Article", label: "Journal Article")

# Assign facets through the service. Controlled:
service = Discovery::ContentItemService.new
fv = fd.discovery_facet_values.find_by!(value: "Journal Article")
service.assign_facet(content_item, facet_definition: fd, value: fv)
# Freeform (other_fd is a freeform facet definition):
service.assign_facet(content_item, facet_definition: other_fd, freeform_value: "v1.2")

assign_facet uses find_or_initialize_by, so duplicate calls are no-ops. Assignments are rejected if the item’s content_kind is not in the definition’s applicable_content_kinds (when that list is non-empty).
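The two behaviours just described, the content-kind gate and the no-op on duplicates, can be sketched with an in-memory stand-in. FacetAssignmentSketch and its return symbols are hypothetical; the real service works on ActiveRecord rows and returns its own result object.

```ruby
# Hypothetical in-memory stand-in for ContentItemService#assign_facet.
class FacetAssignmentSketch
  attr_reader :assignments

  def initialize
    @assignments = []
  end

  # definition: { key:, applicable_content_kinds: [...] }; item: { id:, content_kind: }
  def assign_facet(item, definition, value)
    kinds = definition[:applicable_content_kinds]
    # An empty list means the facet applies to every content kind.
    return :rejected unless kinds.empty? || kinds.include?(item[:content_kind])

    row = { item_id: item[:id], key: definition[:key], value: value }
    return :unchanged if @assignments.include?(row) # duplicate call is a no-op

    @assignments << row
    :created
  end
end

article_type = { key: "publication.article_type", applicable_content_kinds: ["publication"] }
publication  = { id: 1, content_kind: "publication" }
protocol     = { id: 2, content_kind: "protocol" }

sketch = FacetAssignmentSketch.new
puts sketch.assign_facet(publication, article_type, "Journal Article")
puts sketch.assign_facet(publication, article_type, "Journal Article")
puts sketch.assign_facet(protocol, article_type, "Journal Article")
```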

DiscoveryLlmExtraction::FacetAssigner (app/tasks/discovery_llm_extraction/facet_assigner.rb) is the LLM-driven facet assigner used by the assign_facets_with_llm pipeline step. It gathers every DiscoveryFacetDefinition whose applicable_content_kinds matches the item’s content_kind, sends the source payload + facet catalogue to the LLM (default gpt-5-nano) with a structured-output schema, validates the returned qualified_key and (for controlled facets) the value against allowed values, and applies the surviving assignments through ContentItemService#assign_facet. Unknown keys or unknown controlled values are dropped into an ignored list, not raised.
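The validation stage of the assigner can be approximated as a filter over the LLM output. The sketch below is hypothetical: validate_assignments, the catalogue shape, and the "namespace.key" form of qualified_key are assumptions for illustration, not the actual FacetAssigner internals.

```ruby
# Hypothetical sketch of validating LLM-returned facet assignments against a catalogue.
# Unknown keys and unknown controlled values go to :ignored instead of raising.
def validate_assignments(returned, catalogue)
  applied = []
  ignored = []
  returned.each do |assignment|
    definition = catalogue[assignment[:qualified_key]]
    allowed = definition && (definition[:value_type] == "freeform" ||
                             definition[:values].include?(assignment[:value]))
    (allowed ? applied : ignored) << assignment
  end
  { applied: applied, ignored: ignored }
end

catalogue = {
  "publication.article_type" => { value_type: "controlled", values: ["Journal Article", "Review"] },
  "publication.version"      => { value_type: "freeform", values: [] }
}

returned = [
  { qualified_key: "publication.article_type", value: "Journal Article" },
  { qualified_key: "publication.article_type", value: "Blog Post" }, # unknown controlled value
  { qualified_key: "sec.form_type", value: "10-Q" },                 # key not in this catalogue
  { qualified_key: "publication.version", value: "v1.2" }
]

outcome = validate_assignments(returned, catalogue)
puts outcome[:applied].size
puts outcome[:ignored].size
```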

Initial facet definitions + values live in lib/tasks/one_off/data/discovery_facets.json and are seeded with:

bundle exec thor one_off:seed_discovery_facets:seed

Adding a new facet: edit the JSON, re-run the thor task. The task is idempotent — existing definitions/values are upserted, not duplicated.

| Page | Route | Purpose |
| --- | --- | --- |
| Discovery Pipeline | /admin/discovery_pipeline | React workflow console (DiscoveryWorkflowConsole) for running the five pipeline steps against a chosen source. Backed by DiscoveryWorkflow. |
| Workflow Pipelines | /admin/workflow_pipelines | Cross-pipeline dashboard. Includes a Discovery panel that links into the Discovery Pipeline console. |
| Content Items | /admin/discovery_content_items | Browse, inspect, manage items. File and URL submission forms. Facet assignment panel. |
| Discovery Browse | /admin/discovery_browse | React-driven faceted browse view. Server-paginated via /admin/discovery_content_items/search.json. |
| Facet Definitions | /admin/discovery_facet_definitions | CRUD on the facet registry. |
| Facet Values | /admin/discovery_facet_values | CRUD on controlled facet values. |

Typical walkthrough for a new source projection (e.g. a new SEC filing stream):

  1. Pick a content_kind. If none of the existing constants in DiscoveryContentItem::CONTENT_KINDS fit, add one there. Keep the list coarse — fine-grained classification belongs in facets.
  2. Register source-specific facets if you need filtering dimensions that don’t apply to everything: edit lib/tasks/one_off/data/discovery_facets.json, add entries under your namespace (e.g. sec), run the seed task.
  3. Write a projection job that iterates your source records and calls ContentItemService#create_from_source. Idempotency is the job’s responsibility — key off source_type+source_id or external_id.
  4. Extraction workflows later attach bodies/chunks/mentions/entity_links to the existing item. They do NOT create new items; they find by (source_type, source_id) or by item id.
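Step 3 puts idempotency on the projection job. One way to picture that is an upsert keyed by (source_type, source_id). ProjectionSketch below is a hypothetical in-memory stand-in for the table plus the job; a real job would look up the existing item and call ContentItemService#create_from_source only when none is found.

```ruby
# Hypothetical in-memory store standing in for the discovery_content_items table.
class ProjectionSketch
  def initialize
    @items = {}
  end

  # Upsert keyed by (source_type, source_id) so re-running the job does not duplicate items.
  def project(source_type:, source_id:, title:)
    key = [source_type, source_id]
    created = !@items.key?(key)
    @items[key] = { source_type: source_type, source_id: source_id, title: title,
                    submission_source: "source_projection" }
    created ? :created : :updated
  end

  def count
    @items.size
  end
end

filings = [
  { id: 501, title: "10-K 2025" },
  { id: 502, title: "10-Q Q1 2026" }
]

store = ProjectionSketch.new
2.times do # running the job twice must not create duplicates
  filings.each do |filing|
    store.project(source_type: "OrganisationSecFiling", source_id: filing[:id], title: filing[:title])
  end
end
puts store.count
```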

These remain deferred:

  • General text extraction workflows for arbitrary uploads/URLs (PDF→text, HTML→text, etc.). Investor presentation PDFs and SEC filing documents are the current source-specific exceptions; entity extraction consumes an existing plain_text body, it does not produce one
  • General chunking pipelines for sources beyond investor presentation slide/page chunks and SEC filing PART/ITEM chunks
  • Embeddings and vector indexes (pgvector column on chunks is a reserved extension point, not yet defined)
  • Retrieval API for the agent
  • End-user semantic search UI
  • Strong file-level deduplication

Retry/idempotency for derived-output writers is per-pipeline — the entity extraction pipeline replaces its own mentions on re-run by filtering on extraction_method. Other future extractors decide their own rerun strategy.

  • Models: app/models/discovery_*.rb
  • Services: app/services/discovery/
    • ContentItemService — core creator + facet assignment
    • SubmissionService — admin upload/URL wrappers
    • ProducerContract — full attach-point documentation (source of truth for the integration contract)
    • Pipeline: five-step end-to-end orchestrator (populate → facets → extract → resolve → link)
    • ContentChunkingService — slide/page chunker for investor presentation bodies (one section chunk per ## Slide N: / ## Page N: heading)
    • SecFilingExtractor — fetches SEC filing documents and normalises HTML/plain-text bodies for the sec_filings populate step
    • SecFilingChunker — splits a normalised SEC filing body into section chunks keyed on PART/ITEM headings (≤ 12,000 chars per chunk)
  • Workflow: app/workflows/discovery_workflow.rb (DiscoveryWorkflow)
  • Entity-extraction tasks: app/tasks/discovery_llm_extraction/ (EntityMentionExtractor, EntityMentionLinkerCandidateProvider, EntityMentionCandidateResolver, EntityMentionLinking, FacetAssigner)
  • Thor tasks:
    • lib/tasks/discovery/pipeline.thor — pipeline-step entry points
  • Admin: app/admin/discovery_*.rb, app/admin/workflow_pipelines.rb
  • React UI: app/javascript/bundles/Discovery/ (includes DiscoveryWorkflowConsole)
  • Seed data: lib/tasks/one_off/data/discovery_facets.json + lib/tasks/one_off/seed_discovery_facets.thor
  • Migrations: db/migrate/20260416100000_create_discovery_tables.rb, db/migrate/20260416130000_create_discovery_bodies_and_chunks.rb, db/migrate/20260417100000_create_discovery_content_mentions.rb, db/migrate/20260422083510_add_entity_extraction_fields_to_discovery_tables.rb, db/migrate/20260511082425_add_content_extraction_tracking_to_organisation_sec_filings.rb