Evidence Discovery Substrate

A derived content layer that sits alongside the canonical source tables
(publications, news, drug_approvals, clinical_trials, …). Its job is to
give the agent — and eventually end-user search — a single place to look up
evidence-bearing text regardless of origin (admin uploads, source projections,
later agentic workflows).
The substrate ships with infrastructure and governance for the content
graph plus a five-step end-to-end pipeline (Discovery::Pipeline,
orchestrated by DiscoveryWorkflow) that populates content items,
auto-assigns facets via LLM, extracts mentions, and links them to
canonical entities. Investor presentation bodies are chunked by extracted
slide/page headings, and SEC filing bodies are fetched on-demand and
chunked by PART/ITEM headings; broader chunking strategies,
embeddings, and retrieval remain future work.
Data model
All tables are prefixed discovery_. One-to-many everywhere unless noted.
| Table | Purpose | Parent |
|---|---|---|
| `discovery_content_items` | Canonical content identity, provenance, processing status | none |
| `discovery_content_assets` | URL references attached to an item (uploads use Active Storage) | item |
| `discovery_content_bodies` | Extracted/normalised full text. Unique per (item, source_format) | item |
| `discovery_content_chunks` | Retrieval units derived from a body. Unique position per body | body |
| `discovery_content_mentions` | Resolved evidence spans inside a chunk | chunk |
| `discovery_entity_links` | Canonical-entity attachments on an item (polymorphic) | item |
| `discovery_facet_definitions` | Facet registry keyed by (namespace, key) | none |
| `discovery_facet_values` | Allowed values for controlled facets | facet_definition |
| `discovery_content_facets` | Facet assignments on an item | item + facet_definition |

Schema invariants worth knowing:

- `DiscoveryContentItem#description` is a short human-readable summary/snippet for browsing. It is not the extracted document body. Full normalised text belongs in `discovery_content_bodies`.
- Each derived row carries only its nearest parent (`chunk.body_id`, `mention.chunk_id`). Everything above is reached via `:through` associations, e.g. `content_item.discovery_content_chunks` or `mention.discovery_content_item`. This avoids the "parent IDs drift" bug class at the cost of a join on item-scoped lookups.
- `DiscoveryEntityLink.entity` and `DiscoveryContentMention.entity` are polymorphic; `entity_type` is a model class name (e.g. `Drug`, `Disease`, `Organisation`) and is nil until a resolution workflow populates it.
- `DiscoveryEntityLink.match_status` is `resolved` or `unresolved`. Resolved links must carry an `entity_type`/`entity_id`; unresolved links must not. A link can be either an item-level attachment (`discovery_content_mention_id` is null) or a per-mention adjudication outcome (`discovery_content_mention_id` is set, unique per mention).
- Facet assignments are either controlled (`discovery_facet_value_id` set) or freeform (`freeform_value` set). The two are mutually exclusive and each is enforced by a partial unique index.

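The link and facet invariants above can be sketched as plain predicates. This is a framework-free illustration rather than the model code: `LinkRow`, `FacetRow`, `valid_link?` and `valid_facet_assignment?` are hypothetical names, and the real enforcement is described as living in model validations and partial unique indexes.

```ruby
# Hypothetical stand-ins for discovery_entity_links and discovery_content_facets rows.
LinkRow  = Struct.new(:match_status, :entity_type, :entity_id, keyword_init: true)
FacetRow = Struct.new(:discovery_facet_value_id, :freeform_value, keyword_init: true)

# Resolved links must carry both entity columns; unresolved links must carry neither.
def valid_link?(link)
  has_type = !link.entity_type.nil?
  has_id   = !link.entity_id.nil?
  case link.match_status
  when "resolved"   then has_type && has_id
  when "unresolved" then !has_type && !has_id
  else false
  end
end

# A facet assignment is controlled or freeform, never both and never neither.
def valid_facet_assignment?(facet)
  controlled = !facet.discovery_facet_value_id.nil?
  freeform   = !facet.freeform_value.to_s.empty?
  controlled ^ freeform
end
```

Checks like these are only a reading aid; the database constraints remain the authority on what can actually be stored.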
Lifecycle
`DiscoveryContentItem#processing_status` is a string enum with four states:

```
pending    → created, nothing processed yet (default)
processing → a workflow is actively working on it
processed  → workflow finished; derived outputs may be attached
failed     → workflow raised; processing_error holds the message
```

Transitions go through the helpers on the model:

```ruby
item.mark_processing!
item.mark_processed!        # sets processed_at, clears error
item.mark_failed!("reason") # stores the error string
```

Producing content items
All producers (admin console, source projections, future agentic workflows)
MUST go through Discovery::ContentItemService or the submission helpers on
top of it. This is the shared substrate entry point.
From scratch
```ruby
service = Discovery::ContentItemService.new
result = service.create(
  title: "Phase 3 Study Protocol for Drug X",
  content_kind: "protocol",               # must be in DiscoveryContentItem::CONTENT_KINDS
  submission_source: "source_projection", # admin_upload, admin_url, source_projection, agentic
  source_type: "ClinicalTrial",           # optional polymorphic back-link
  source_id: trial.id,
  document_date: Date.parse("2026-01-15"),
  metadata: { version: "2.0" }
)
content_item = result.content_item if result.success?
```

From an existing source record
```ruby
result = service.create_from_source(
  publication,
  content_kind: "publication",
  document_date: publication.publish_date
)
```

This stamps `source_type`/`source_id` from the record and sets
`submission_source: "source_projection"`. Use it for source-table projections.

Admin-triggered (file / URL)
Admin-driven paths go through `Discovery::SubmissionService`:

```ruby
Discovery::SubmissionService.new.submit_upload(
  file: uploaded_file,
  content_kind: "press_release",
  submitted_by: current_user
)

Discovery::SubmissionService.new.submit_url(
  url: "https://example.com/deck.pdf",
  content_kind: "corporate_deck",
  submitted_by: current_user
)
```

Uploads attach via Active Storage (`item.files`); URL submissions also create
a `discovery_content_assets` row for the raw URL reference.

Source projections with a URL (for example investor presentation links) also
mirror that URL into discovery_content_assets so link/file references are
kept separate from extracted text bodies.
Extending with derived outputs
Later extraction, chunking, mention-detection, and entity-resolution workflows attach to an existing content item. Reuse these attach points; do NOT create parallel tables or new content items for derived outputs.

Bodies
```ruby
body = content_item.discovery_content_bodies.create!(
  source_format: "plain_text", # pdf | html | plain_text | markdown
  extraction_method: "tika",   # optional: tika | trafilatura | manual | llm
  body: extracted_text
)
```

Unique per `(content_item, source_format)`. If your extractor wants to store
multiple formats for the same item, use different `source_format` values.

Chunks
```ruby
body.discovery_content_chunks.create!(
  position: 0,                 # unique within the body
  content: chunk_text,
  chunk_strategy: "paragraph", # fixed_size | paragraph | section | semantic
  token_count: 128
)
```

Item-scoped access goes via `content_item.discovery_content_chunks` (joins through
bodies).

Mentions
```ruby
chunk.discovery_content_mentions.create!(
  mention_type: "drug",       # see DiscoveryContentMention::MENTION_TYPES
  category: "Drug",           # see DiscoveryContentMention::ENTITY_CATEGORIES
  surface_form: "pembrolizumab",
  start_offset: 42,           # optional, character offset into chunk.content
  end_offset: 55,
  resolution_method: "llm",   # optional: ner | llm | dictionary | manual
  extraction_method: "discovery_entity_extraction_linking",
  entity_type: "Drug",        # optional; nil until resolution
  entity_id: drug.id,
  confidence: 0.92
)
```

Body and item are reached via `:through` (`mention.discovery_content_item`).
mention_type is the lowercase storage form; category is the canonical
class-name form (e.g. Drug, ClinicalTrial, Organisation) used by the
extraction/linking pipeline.
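
Offsets are character positions into `chunk.content`. A minimal sketch of locating a surface form is shown below; `locate_surface_form` is a hypothetical helper, and the exclusive `end_offset` is an assumption (it is consistent with the example above, where 55 - 42 equals the 13 characters of "pembrolizumab").

```ruby
# Hypothetical helper: find a surface form inside a chunk's text and return
# character offsets, with end_offset exclusive (start + length).
# Returns nil when the surface form does not occur at or after `from`.
def locate_surface_form(chunk_content, surface_form, from: 0)
  start_offset = chunk_content.index(surface_form, from)
  return nil if start_offset.nil?

  { start_offset: start_offset, end_offset: start_offset + surface_form.length }
end

text = "Patients received pembrolizumab every three weeks."
span = locate_surface_form(text, "pembrolizumab")
# Slicing with an exclusive range recovers the surface form.
recovered = text[span[:start_offset]...span[:end_offset]]
```

Passing `from:` lets a caller step past an earlier occurrence when the same surface form appears more than once in a chunk.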
Entity links
Entity links can be item-level attachments (no mention) or mention-level adjudication outcomes (one link per mention).

```ruby
# Item-level (no mention): populated from aggregates or content-level classifiers
content_item.discovery_entity_links.create!(
  entity_type: "Drug",
  entity_id: drug.id,
  link_type: "primary_subject", # mentioned | primary_subject | related
  match_status: "resolved",     # resolved | unresolved
  confidence: 0.95
)

# Mention-level: one link per mention, written by the extraction pipeline
content_item.discovery_entity_links.create!(
  discovery_content_mention: mention,
  entity_type: "Drug",          # nil for unresolved
  entity_id: drug.id,           # nil for unresolved
  link_type: "mentioned",
  match_status: "resolved",
  confidence: 0.92
)
```

Item-level links are unique per (content_item, entity_type, entity_id)
where discovery_content_mention_id IS NULL. Mention-level links are unique
per discovery_content_mention_id. Resolved links require an entity;
unresolved links must leave entity_type/entity_id blank.
Discovery pipeline
Discovery::Pipeline (app/services/discovery/pipeline.rb) is the
end-to-end workflow that turns a source record into a discovery content
item with facets, mentions, and entity links. It is orchestrated by
DiscoveryWorkflow (app/workflows/discovery_workflow.rb) and exposed
via thor tasks under lib/tasks/discovery/pipeline.thor.
The five steps:
| Step | Thor task | What it does |
|---|---|---|
| `populate_content_item_from_source` | `discovery:pipeline:populate_content_items` | Iterates eligible source records and upserts a `DiscoveryContentItem` keyed by `(source_type, source_id)`. For most sources, writes a `plain_text` `discovery_content_bodies` row (`extraction_method: "manual"`) from the source text and keeps `description` as a bounded summary. For `investor_presentations` (scope: records with both a `presentation_link` and a non-blank `presentation_content`), it upserts the content item, upserts a `discovery_content_assets` row from `presentation_link`, writes `presentation_content` as the `plain_text` body (`extraction_method: "llm"`), and chunks the body into one `section` chunk per `## Slide N:`/`## Page N:` block. Presentations without `presentation_content` are skipped; populate them first by running `clinical_trials:investor_presentations:extract` (`InvestorPresentations::PdfContentExtractionTask`). When a `DiscoveryContentItem` already exists for the presentation, the step refreshes only its source asset and otherwise leaves it alone. For `sec_filings` (scope: `Discovery::Pipeline::SEC_DISCOVERY_FORMS`, currently 10-K, 10-Q, 20-F, with a non-blank `filing_url` or `sec_index_url`, restricted to filings the `eligible_for_content_extraction` scope considers retryable), it fetches the document via `Discovery::SecFilingExtractor` (HTML/plain_text, retries transient HTTP failures up to three times), writes the normalised text as a `plain_text` body (`extraction_method: "manual"`), and chunks it via `Discovery::SecFilingChunker` into `section` chunks keyed on `PART`/`ITEM` headings (≤ 12,000 chars per chunk). Extraction state is recorded back onto the `organisation_sec_filings` row (`content_extraction_status`, `content_extraction_attempt_count`, `content_extraction_failure_kind`, `content_extraction_error`); only filings under `OrganisationSecFiling::CONTENT_EXTRACTION_MAX_ATTEMPTS` that are unattempted or in `retryable_failed` are picked up on re-runs. |
| `assign_facets_with_llm` | `discovery:pipeline:assign_facets_with_llm` | Calls `DiscoveryLlmExtraction::FacetAssigner` with the persisted content body text plus source metadata, then applies facet assignments via `ContentItemService#assign_facet`. |
| `extract_mentions` | `discovery:pipeline:extract_mentions` | Calls `DiscoveryLlmExtraction::EntityMentionLinking#extract_mentions!` against the persisted content body to write `DiscoveryContentMention` rows with character offsets. Investor presentations are extracted per slide/page chunk; other sources reuse existing body chunks when present, otherwise the pipeline creates a single whole-body fallback chunk first. |
| `resolve_entity_mentions` | `discovery:pipeline:resolve_entity_mentions` | Calls `DiscoveryLlmExtraction::EntityMentionCandidateResolver#resolve_content_item!` for items whose mentions already exist. Looks up candidates per mention, picks deterministically when confidence is high, otherwise calls the LLM once per content item with all undecided mentions and their candidate shortlists, then writes resolved/unresolved `DiscoveryEntityLink` rows. Items with no prior pipeline mentions are skipped; run `extract_mentions` first. |
| `link_mentions_to_entities` | `discovery:pipeline:link_mentions_to_entities` | Verifies that pipeline-extracted mentions and resolved links exist for each item and marks the step complete. Items with no prior mentions or no resolved links are skipped; run `extract_mentions` and `resolve_entity_mentions` first. |

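The slide/page chunking described in the populate step can be illustrated with a small splitter. This is a sketch under assumptions, not the code of `Discovery::ContentChunkingService`: `SLIDE_HEADING` and `chunk_by_slide_headings` are hypothetical names, and dropping any text before the first heading is a choice made here for simplicity.

```ruby
# Matches heading lines such as "## Slide 3: Pipeline" or "## Page 12:".
SLIDE_HEADING = /^## (?:Slide|Page) \d+:/

# Hypothetical splitter: returns one hash per heading block, with a 0-based
# position mirroring the "unique position per body" rule. Text that precedes
# the first heading is not emitted as a chunk in this sketch.
def chunk_by_slide_headings(body)
  starts = []
  body.scan(SLIDE_HEADING) { starts << Regexp.last_match.begin(0) }
  return [] if starts.empty?

  starts.each_with_index.map do |start, position|
    stop = starts[position + 1] || body.length
    { position: position, chunk_strategy: "section", content: body[start...stop].strip }
  end
end

deck = "## Slide 1: Overview\nIntro text\n\n## Page 2: Data\nResults\n"
deck_chunks = chunk_by_slide_headings(deck)
```

Each returned hash maps onto the `position`, `chunk_strategy` and `content` attributes shown in the Chunks attach point.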
Source scopes accepted by every step (--source option):
all, publications, investor_presentations, news,
clinical_trials, drug_approvals, sec_filings. The --ids option narrows to
specific source-record ids (combined with a non-all source). Every
step also accepts --submission-source (default source_projection)
to scope by DiscoveryContentItem#submission_source.
By default each step skips records whose output for that step already
exists: assign_facets_with_llm excludes items that already have any
discovery_content_facets, extract_mentions excludes items that already
have any discovery_content_mentions, resolve_entity_mentions excludes
items with at least one resolved discovery_entity_links row, and
link_mentions_to_entities excludes items with any
discovery_entity_links. Pass --override (boolean, default false)
on any step to ignore that filter and reprocess every in-scope record;
for populate_content_items --override also re-extracts and overwrites
an existing DiscoveryContentItem instead of leaving it intact. The
step-scoping logic lives in Discovery::Pipeline#apply_step_scope.
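
The default skip rules and the effect of `--override` can be summarised in a few lines. This is an in-memory sketch, not `Discovery::Pipeline#apply_step_scope` itself: `Item`, `ALREADY_DONE` and `items_for_step` are hypothetical names, and the counts stand in for the association checks the real scope would perform in SQL.

```ruby
# Hypothetical in-memory view of an item's derived-output counts.
Item = Struct.new(:id, :facets, :mentions, :resolved_links, :links, keyword_init: true)

# One "already done?" predicate per step, following the defaults described above.
ALREADY_DONE = {
  assign_facets_with_llm:    ->(item) { item.facets.positive? },
  extract_mentions:          ->(item) { item.mentions.positive? },
  resolve_entity_mentions:   ->(item) { item.resolved_links.positive? },
  link_mentions_to_entities: ->(item) { item.links.positive? }
}.freeze

# Returns the items a step would process; override: true disables the filter.
def items_for_step(step, items, override: false)
  return items if override

  done = ALREADY_DONE.fetch(step)
  items.reject { |item| done.call(item) }
end
```

Using `fetch` makes an unknown step name fail loudly instead of silently processing everything.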
Run a single step:

```shell
bundle exec thor discovery:pipeline:populate_content_items --source=publications
bundle exec thor discovery:pipeline:assign_facets_with_llm --source=publications --ids 123 456
bundle exec thor discovery:pipeline:extract_mentions --source=publications
```

The LLM-driven steps (assign_facets_with_llm, extract_mentions,
and resolve_entity_mentions) default to OpenAI batch mode and accept
--batched (boolean, default true), --batch-size (default 1000),
and --parallelism (default 1). Pass --batched=false to fall back
to sequential per-item LLM calls — useful for small --ids runs or
local debugging. In batched mode each step collects all in-scope LLM
prompts across items and submits them via OpenAiService batch mode
(see Discovery::Pipeline#discovery_llm_client): extract_mentions
routes one prompt per content item (with all that item’s chunks in
the payload) through the batch, and resolve_entity_mentions routes
one prompt per content item (with every mention that needs LLM
adjudication for that item) through the batch. link_mentions_to_entities is a verification step
that does no LLM calls and only accepts --source,
--submission-source, --ids, and --override.
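
How batched mode groups work can be pictured as "one prompt per content item, then slices of at most `--batch-size` prompts". The sketch below assumes that reading; `build_prompts`, `batches` and the `custom_id` field are hypothetical names, and the actual submission through `OpenAiService` is not shown.

```ruby
# Hypothetical prompt builder: one prompt per content item, carrying all of
# that item's chunks, as described for extract_mentions in batched mode.
def build_prompts(chunks_by_item)
  chunks_by_item.map do |item_id, chunks|
    {
      custom_id: "item-#{item_id}",
      chunks: chunks.map { |c| { chunk_id: c[:id], text: c[:text] } }
    }
  end
end

# Split prompts into submissions of at most batch_size entries.
def batches(prompts, batch_size: 1000)
  raise ArgumentError, "batch_size must be positive" unless batch_size.positive?

  prompts.each_slice(batch_size).to_a
end
```

With the default of 1000, a small `--ids` run ends up as a single batch, which is one reason `--batched=false` can be the simpler choice for debugging.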
The full chain is also driven from the admin Discovery Pipeline console (see Admin surface).
Entity extraction and linking
The extract_mentions and resolve_entity_mentions pipeline steps run
the entity-mention pipeline. Extraction lives in
DiscoveryLlmExtraction::EntityMentionLinking
(app/tasks/discovery_llm_extraction/entity_mention_linking.rb); candidate
lookup and disambiguation live in
DiscoveryLlmExtraction::EntityMentionCandidateResolver
(app/tasks/discovery_llm_extraction/entity_mention_candidate_resolver.rb).
Together they take a content item plus its persisted plain-text body and
write mentions plus mention-level entity links against the selected chunks.
extract_mentions! requires content_chunks to be supplied — the
Discovery::Pipeline step is the integration point that prepares them
(slide/page chunks for investor decks, an existing body chunk if one
exists, or a whole-body fallback chunk otherwise) before invoking
extraction.

```mermaid
flowchart LR
  text[plain_text body] --> chunks["Chunks<br/>slide/page for investor decks<br/>whole body fallback otherwise"]
  chunks --> extractor["EntityMentionExtractor (LLM)"]
  extractor --> mentions[Mentions]
  mentions --> candidate[EntityMentionLinkerCandidateProvider]
  candidate --> resolver[EntityMentionCandidateResolver]
  resolver -->|resolved| link_resolved["DiscoveryEntityLink<br/>resolved"]
  resolver -->|unresolved| link_unresolved["DiscoveryEntityLink<br/>unresolved"]
  link_resolved --> mention_row[(DiscoveryContentMention)]
  link_unresolved --> mention_row
```

Stages:

- EntityMentionExtractor: `app/tasks/discovery_llm_extraction/entity_mention_extractor.rb`. LLM call (default `gpt-5-nano`) with a structured-output schema. The prompt receives a `chunks` payload (one entry per chunk with `chunk_id` and `text`) and returns `{ text, category, chunk_id, start_offset, end_offset }` mentions where `category` is one of `DiscoveryContentMention::ENTITY_CATEGORIES` and offsets locate the mention inside the named chunk’s text.
- EntityMentionLinkerCandidateProvider: `app/tasks/discovery_llm_extraction/entity_mention_linker_candidate_provider.rb`. For each mention, queries the canonical models for the mention’s category (`Drug`, `Disease`, `Target`, `Technology`, `Biomarker`, `ClinicalTrial`, `BioloupeIntervention` for `Intervention`, `Organisation`, `Endpoint`). Tries exact name/synonym matches first, then `multi_flexifind` fuzzy matches. Categories without a canonical model (`TherapeuticArea`, `DevelopmentPhase`, `Other`) yield no candidates and stay unresolved.
- EntityMentionCandidateResolver: `app/tasks/discovery_llm_extraction/entity_mention_candidate_resolver.rb`. Picks the top candidate deterministically when confidence is high and clearly separated; otherwise calls the LLM once per content item with every undecided mention and its candidate shortlist (a `LinkingDecisionSet` schema) and parses one decision per input mention, or marks `unresolved`. Persists mention-level entity links and is the entry point for the `resolve_entity_mentions` pipeline step (`resolve_content_item!` / `resolve_content_items_batched!`).
- EntityMentionLinking: `app/tasks/discovery_llm_extraction/entity_mention_linking.rb`. Drives extraction: marks the item `processing`, requires `content_chunks` to already exist (the pipeline creates them: slide/page chunks for investor decks via `Discovery::ContentChunkingService`, otherwise a single whole-body fallback chunk with `metadata.pipeline = "discovery_entity_extraction_linking"` created by `Discovery::Pipeline#content_chunks_for_extraction`), clears prior pipeline mentions, and persists mentions in one transaction with `start_offset`/`end_offset` located by re-finding each surface form inside its chunk. Resolution lives in `EntityMentionCandidateResolver`; resolved mention-level links are deduped there by `(entity_type, entity_id)` within the item (pre-existing item-level links count too): duplicate resolutions still persist the mention row but skip the entity-link write.

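The "high and clearly separated" rule used before falling back to the LLM can be sketched as a threshold plus a margin over the runner-up. The thresholds below are placeholders: this document does not state the real values, and `MIN_CONFIDENCE`, `MIN_MARGIN` and `pick_deterministically` are hypothetical names rather than the resolver's API.

```ruby
# Hypothetical thresholds; the real values are not given in this document.
MIN_CONFIDENCE = 0.9
MIN_MARGIN     = 0.1

# Returns the winning candidate when the top score is high and clearly ahead
# of the runner-up; returns nil to signal "leave this mention for the LLM".
def pick_deterministically(candidates)
  ranked = candidates.sort_by { |c| -c[:confidence] }
  top, runner_up = ranked
  return nil if top.nil? || top[:confidence] < MIN_CONFIDENCE
  return top if runner_up.nil?

  (top[:confidence] - runner_up[:confidence]) >= MIN_MARGIN ? top : nil
end
```

Mentions for which this returns `nil` would be collected per content item and adjudicated in a single LLM call, matching the one-call-per-item behaviour described above.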
Thor task
Mention extraction is run through the Discovery Pipeline command surface:

```shell
bundle exec thor discovery:pipeline:extract_mentions --source=investor_presentations --ids 123
```

The pipeline step picks the item’s plain_text body, falling back to the
oldest body if none. Items without an extracted body are skipped.
Re-runs
EntityMentionLinking#persist! deletes mentions where extraction_method = "discovery_entity_extraction_linking" under the selected body before
re-inserting. Whole-body fallback chunks owned by that pipeline are removed
when slide/page chunks exist, so re-runs replace this pipeline’s outputs
without disturbing chunks owned by other producers.
Facets
Facets are typed dimensions attached to an item, e.g. publication.article_type=Journal Article
or sec.form_type=10-Q. The registry decides which keys are valid.
Registry
```ruby
fd = DiscoveryFacetDefinition.create!(
  namespace: "publication",                 # the source/domain the facet belongs to
  key: "article_type",
  label: "Article Type",
  value_type: "controlled",                 # controlled | freeform
  applicable_content_kinds: ["publication"] # optional scope gate
)
```

For controlled facets, add allowed values:

```ruby
fd.discovery_facet_values.create!(value: "Journal Article", label: "Journal Article")
```

Assigning

```ruby
service = Discovery::ContentItemService.new
fd = DiscoveryFacetDefinition.find_by!(namespace: "publication", key: "article_type")
fv = fd.discovery_facet_values.find_by!(value: "Journal Article")

service.assign_facet(content_item, facet_definition: fd, value: fv)
# freeform:
service.assign_facet(content_item, facet_definition: other_fd, freeform_value: "v1.2")
```

`assign_facet` uses `find_or_initialize_by`, so duplicate calls are no-ops.
Assignments are rejected if the item’s content_kind is not in the
definition’s applicable_content_kinds (when that list is non-empty).
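
The two behaviours just described (duplicate calls are no-ops, and the `applicable_content_kinds` gate) can be modelled without Rails. `FacetAssignments` below is a hypothetical in-memory stand-in, not the service; keying a row on item, definition and value is an assumption about what the real `find_or_initialize_by` lookup uses.

```ruby
# Hypothetical in-memory stand-in for facet assignment; not the service code.
class FacetAssignments
  def initialize
    @rows = {}
  end

  # Returns :rejected, :created or :unchanged.
  def assign(item, definition, value)
    kinds = definition[:applicable_content_kinds] || []
    # An empty list means the facet applies to every content kind.
    return :rejected unless kinds.empty? || kinds.include?(item[:content_kind])

    key = [item[:id], definition[:key], value]
    return :unchanged if @rows.key?(key)

    @rows[key] = true
    :created
  end

  def count
    @rows.size
  end
end
```

Returning a status symbol instead of raising keeps the gate easy to observe in a sketch; how the real service reports a rejection is not specified here.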
Auto-assignment via LLM
DiscoveryLlmExtraction::FacetAssigner
(app/tasks/discovery_llm_extraction/facet_assigner.rb) is the LLM-driven
facet assigner used by the assign_facets_with_llm pipeline step. It
gathers every DiscoveryFacetDefinition whose applicable_content_kinds
matches the item’s content_kind, sends the source payload + facet
catalogue to the LLM (default gpt-5-nano) with a structured-output
schema, validates the returned qualified_key and (for controlled
facets) the value against allowed values, and applies the surviving
assignments through ContentItemService#assign_facet. Unknown keys or
unknown controlled values are dropped into an ignored list, not raised.
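
The validation step can be pictured as partitioning the LLM's suggestions into "apply" and "ignore". The sketch below is illustrative only: `CATALOGUE`, `partition_suggestions`, the `publication.version` key and the `Review` value are hypothetical, and the real `FacetAssigner` works against `DiscoveryFacetDefinition` records rather than a hash.

```ruby
# Hypothetical catalogue: qualified_key => allowed values (nil means freeform).
CATALOGUE = {
  "publication.article_type" => ["Journal Article", "Review"],
  "publication.version"      => nil
}.freeze

# Splits LLM suggestions into [applied, ignored]. Unknown keys and unknown
# controlled values land in the ignored list instead of raising.
def partition_suggestions(suggestions, catalogue = CATALOGUE)
  suggestions.partition do |s|
    next false unless catalogue.key?(s[:qualified_key])

    allowed = catalogue[s[:qualified_key]]
    allowed.nil? || allowed.include?(s[:value])
  end
end
```

Only the first list would then be passed on to `ContentItemService#assign_facet`.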
Seeding
Initial facet definitions + values live in
lib/tasks/one_off/data/discovery_facets.json and are seeded with:

```shell
bundle exec thor one_off:seed_discovery_facets:seed
```

Adding a new facet: edit the JSON, re-run the thor task. The task is idempotent: existing definitions/values are upserted, not duplicated.

Admin surface
| Page | Route | Purpose |
|---|---|---|
| Discovery Pipeline | /admin/discovery_pipeline | React workflow console (DiscoveryWorkflowConsole) for running the five pipeline steps against a chosen source. Backed by DiscoveryWorkflow. |
| Workflow Pipelines | /admin/workflow_pipelines | Cross-pipeline dashboard. Includes a Discovery panel that links into the Discovery Pipeline console. |
| Content Items | /admin/discovery_content_items | Browse, inspect, manage items. File and URL submission forms. Facet assignment panel. |
| Discovery Browse | /admin/discovery_browse | React-driven faceted browse view. Server-paginated via /admin/discovery_content_items/search.json. |
| Facet Definitions | /admin/discovery_facet_definitions | CRUD on the facet registry. |
| Facet Values | /admin/discovery_facet_values | CRUD on controlled facet values. |
Adding a new source
Typical walkthrough for a new source projection (e.g. a new SEC filing stream):

- Pick a `content_kind`. If none of the existing constants in `DiscoveryContentItem::CONTENT_KINDS` fit, add one there. Keep the list coarse; fine-grained classification belongs in facets.
- Register source-specific facets if you need filtering dimensions that don’t apply to everything: edit `lib/tasks/one_off/data/discovery_facets.json`, add entries under your namespace (e.g. `sec`), run the seed task.
- Write a projection job that iterates your source records and calls `ContentItemService#create_from_source`. Idempotency is the job’s responsibility: key off `source_type` + `source_id` or `external_id`.
- Extraction workflows later attach bodies/chunks/mentions/entity_links to the existing item. They do NOT create new items; they find by `(source_type, source_id)` or by item id.

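Since idempotency is left to the projection job, a job would typically look up by `(source_type, source_id)` before creating. The sketch below shows that shape with an in-memory store; `ProjectionStore` and `project` are hypothetical names, and a real job would query `DiscoveryContentItem` and call `create_from_source` only on a miss.

```ruby
# Hypothetical in-memory store keyed by (source_type, source_id).
class ProjectionStore
  def initialize
    @items = {}
  end

  # Creates the item on first sight, otherwise returns the existing one
  # untouched, so re-running the job does not produce duplicates.
  def project(source_type, source_id, attrs)
    @items[[source_type, source_id]] ||= attrs.merge(source_type: source_type, source_id: source_id)
  end

  def size
    @items.size
  end
end
```

Whether a re-run should refresh an existing item or leave it alone is a per-source decision; this sketch leaves it alone.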
Not yet built
These remain deferred:

- General text extraction workflows for arbitrary uploads/URLs (PDF→text, HTML→text, etc.). Investor presentation PDFs and SEC filings are the current source-specific exceptions; entity extraction consumes an existing `plain_text` body, it does not produce one
- General chunking pipelines for sources beyond investor presentation slide/page chunks and SEC filing PART/ITEM chunks
- Embeddings and vector indexes (pgvector column on chunks is a reserved extension point, not yet defined)
- Retrieval API for the agent
- End-user semantic search UI
- Strong file-level deduplication
Retry/idempotency for derived-output writers is per-pipeline — the entity
extraction pipeline replaces its own mentions on re-run by filtering on
extraction_method. Other future extractors decide their own rerun
strategy.
Code locations
- Models: `app/models/discovery_*.rb`
- Services: `app/services/discovery/`
  - `ContentItemService`: core creator + facet assignment
  - `SubmissionService`: admin upload/URL wrappers
  - `ProducerContract`: full attach-point documentation (source of truth for the integration contract)
  - `Pipeline`: five-step end-to-end orchestrator (populate → facets → extract → resolve → link)
  - `ContentChunkingService`: slide/page chunker for investor presentation bodies (one `section` chunk per `## Slide N:`/`## Page N:` heading)
  - `SecFilingExtractor`: fetches SEC filing documents and normalises HTML/plain-text bodies for the `sec_filings` populate step
  - `SecFilingChunker`: splits a normalised SEC filing body into `section` chunks keyed on `PART`/`ITEM` headings (≤ 12,000 chars per chunk)
- Workflow: `app/workflows/discovery_workflow.rb` (`DiscoveryWorkflow`)
- Entity-extraction tasks: `app/tasks/discovery_llm_extraction/` (`EntityMentionExtractor`, `EntityMentionLinkerCandidateProvider`, `EntityMentionCandidateResolver`, `EntityMentionLinking`, `FacetAssigner`)
- Thor tasks:
  - `lib/tasks/discovery/pipeline.thor`: pipeline-step entry points
- Admin: `app/admin/discovery_*.rb`, `app/admin/workflow_pipelines.rb`
- React UI: `app/javascript/bundles/Discovery/` (includes `DiscoveryWorkflowConsole`)
- Seed data: `lib/tasks/one_off/data/discovery_facets.json` + `lib/tasks/one_off/seed_discovery_facets.thor`
- Migrations:
  - `db/migrate/20260416100000_create_discovery_tables.rb`
  - `db/migrate/20260416130000_create_discovery_bodies_and_chunks.rb`
  - `db/migrate/20260417100000_create_discovery_content_mentions.rb`
  - `db/migrate/20260422083510_add_entity_extraction_fields_to_discovery_tables.rb`
  - `db/migrate/20260511082425_add_content_extraction_tracking_to_organisation_sec_filings.rb`