Standard of Care pipeline

The Standard of Care pipeline turns FDA oncology and hematology approvals into disease-specific treatment entries with linked label evidence. It starts from approved US DrugApproval records, structures their FDA indication language into Indication and IndicatedTherapeuticApproach records, extracts trial efficacy and safety data from the label, then matches the right label studies and measurements back to each treatment scenario.

This page covers only the automated, FDA-label-driven process. It intentionally does not document the semi-manual or manual guideline entry workflows.

The core idea

A single FDA label can contain many approved uses. Each use can contain multiple treatment approaches: monotherapy, combination therapy, different treatment lines, or explicitly distinct phases such as induction and consolidation.

The Standard of Care process normalizes that into this shape:

flowchart LR
  DA["DrugApproval\nFDA / Purple Book approval"] --> IND["Indication\nstructured approved use"]
  IND --> ITA["IndicatedTherapeuticApproach\none regimen in one context"]
  ITA --> GL["Guideline\nSOC display entry"]
  DA --> LS["LabelStudy\nefficacy or safety evidence"]
  ITA --> ALS["ApproachLabelStudy\napproach-to-evidence match"]
  LS --> ALS
  LS --> TOM["TrialOutcomeMeasure\nendpoint result"]
  TOM --> TAO["TrialArmOutcome\narm-level value"]
  GL --> KGAO["KeyGuidelineArmOutcome\nheadline evidence"]
  TAO --> KGAO

Guideline is the disease-facing Standard of Care entry. For automated records it is not a guideline-source import; it is a normalized, label-derived SOC row linked to an FDA indication and one therapeutic approach. Guidelines marked with standard_of_care = true appear in the final SOC table and are the entries used in downstream analytical steps.

End-to-end flow

The workflow is defined in StandardOfCareWorkflow.

Task	What it does
`standard_of_care:populate_guidelines --mode=sync`	Creates or syncs Guideline rows from reviewed, approved USA indications for diseases marked as standard-of-care.
`regulatory:trials:extract_label_studies`	Reads FDA label Clinical Studies sections and writes structured trial overviews into `drug_approvals.label['structured']['clinical_trials']`.
`regulatory:trials:extract_efficacy_endpoints`, `extract_efficacy_subgroups`, `extract_efficacy_results`	Extracts endpoint definitions, patient/subgroup structure, and result values from each label trial.
`regulatory:trials:scout_trial_nctids`	Searches for missing registry IDs when the label only names a study.
`regulatory:trials:match_efficacy_arms`, `classify_endpoint_domains`, `match_endpoints`	Matches extracted result arms to database study plan arms, classifies endpoint domains, and links extracted endpoints to canonical Endpoint records.
`regulatory:trials:update_ae_sections`, `segment_ae_sections`, `extract_ae_reports`, `match_ae_arms`, `extract_ae_details`	Extracts adverse event data from the FDA label Adverse Reactions section and ties AE tables to the relevant label trials and arms.
`regulatory:trials:match_approach_studies`	Deterministically matches structured therapeutic approaches to label studies using approval-study identifiers when available.
`regulatory:trials:match_approach_studies_llm`	Uses an LLM fallback for approach-to-study matching when deterministic identifiers are not enough.
`regulatory:trials:post_process_label_studies`	Materializes label trial JSON into category-scoped LabelStudy, ApproachLabelStudy, TrialEndpoint, TrialSubgroup, TrialOutcomeMeasure, TrialArmOutcome, and AdverseEvent rows.
`regulatory:trials:match_approach_study_arms`	Selects the investigational study plan arm that best represents the exact approach/study pair.
`standard_of_care:match_segments_llm`	Matches indication language to disease base segments and biomarker segments.
`standard_of_care:post_process_guidelines --entry-types=drug_approval`	Applies the matched segment data to Guideline entries and marks completed SOC rows.
`standard_of_care:determine_key_efficacy_results`	Selects the key efficacy arm-level results for each post-processed guideline segment context.
`standard_of_care:determine_key_safety_arms`	Matches whole safety-study arms (adverse-event evidence) to each guideline segment context, writing KeyGuidelineSafetyArm rows.

FDA approvals and labels

The upstream regulatory pipeline creates DrugApproval records from FDA, Purple Book, EMA, KEGG, and CDE sources. Standard of Care only uses approved US records from FDA-like sources:

source_type in FdaDatum or PurplebookRecord
approval_status = drug_approved
region = USA
linked Drug
reviewed structured indications
disease is under the standard-of-care disease scope

DailyMed labels are fetched by Fda::LabelSync and stored in drug_approvals.label.

Important label paths:

JSON path	Role
`label['spl']['indications']`	Parsed FDA Indications and Usage section.
`label['spl']['clinical_trials']`	Parsed FDA Clinical Studies section, usually structured by section title and content.
`label['spl']['adverse_events']`	Raw Adverse Reactions section. Kept raw because AE tables vary heavily across labels.
`label['structured']['clinical_trials']`	LLM-structured label trial overviews, endpoints, subgroups, results, arm matches, endpoint matches, and audits.
`label['structured']['adverse_events']`	Segmented AE sections, matched AE reports, arm counts, and detailed AE rows before post-processing.
`label['structured']['matched_approaches']`	Approach ID to matched efficacy/safety evidence references. New entries store `efficacy` selections (`trials`, covering both single-trial and pooled studies) separately from `safety` selections (`reports`).
`label['structured']['needs_trials_update']`	Set when a label or generated SOC entry changes and label trial extraction needs to rerun.
`label['structured']['needs_ct_post_process_sync']`	Set when matched trial data changed and `LabelStudy` rows need reconciliation.
`label['structured']['needs_ae_update']` / `needs_ae_post_process_sync`	Set when AE label content changes and downstream AE extraction/post-processing needs to rerun.
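
As a rough illustration of how the rerun flags above might be read, the sketch below inspects a label hash shaped like these paths. The `pending_reruns` helper and the sample hash are hypothetical; only the flag key names come from the table.

```ruby
# Hypothetical helper: lists which rerun flags are set on a label hash.
# Only the flag key names are taken from the table above.
RERUN_FLAGS = %w[
  needs_trials_update
  needs_ct_post_process_sync
  needs_ae_update
  needs_ae_post_process_sync
].freeze

def pending_reruns(label)
  structured = label.fetch('structured', {}) || {}
  RERUN_FLAGS.select { |flag| structured[flag] == true }
end

# Illustrative label content, not real data.
label = {
  'spl' => { 'indications' => '...', 'clinical_trials' => [] },
  'structured' => {
    'clinical_trials' => [],
    'needs_trials_update' => true,
    'needs_ae_update' => false
  }
}

pending_reruns(label) # => ["needs_trials_update"]
```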

Indications and therapeutic approaches

The indication pipeline structures FDA indication text before SOC begins.

regulatory:indications:parse runs ApprovalLlmClassification::TaskV2. The LLM reads label indications and extracts:

disease name, subtypes, extents, stages, statuses, risks
eligible and non-eligible populations
biomarkers
prior therapy gates and their AND/OR logic
one or more therapeutic approaches
treatment lines and treatment settings
raw label text for traceability

regulatory:indications:post_process materializes that JSON into:

indications
indicated_biomarkers
indicated_prior_therapy_groups
indicated_prior_therapies
indicated_therapeutic_approaches
indicated_combination_partners

Disease names are resolved through TermMatch and linked through indication_diseases. Biomarker and therapy names are resolved to Biomarker, Drug, or NcitConcept where possible.

Treatment approach boundaries

An IndicatedTherapeuticApproach represents one coherent way the approved product is used in one clinical context.

The parser creates separate approaches for:

different explicit treatment lines, such as 1L vs 2L
clearly independent monotherapy vs combination indications
different required partner sets
explicitly distinct phases when partner composition differs

It does not split restatements, synonyms, or continuation monotherapy that belongs to the same course unless the label states an independent monotherapy indication.

Treatment lines are stored twice:

raw label-derived values in indicated_therapeutic_approaches.treatment_lines
parsed numeric range in min_line / max_line, plus phase-like settings in treatment_settings

Line parsing uses IndicatedTherapeuticApproach::LINE_MAPPING. The parsed range is zero-based, so 1L maps to 0:

Label value	Parsed range
`1L`	`min_line = 0`, `max_line = 0`
`2L`	`min_line = 1`, `max_line = 1`
`3L+`	`min_line = 2`, `max_line = nil`
`1L+`	`min_line = 0`, `max_line = nil`

Settings such as Induction, Consolidation, Maintenance, Bridging, Neoadjuvant, and Adjuvant are stored separately in treatment_settings.
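
The mapping and the settings split can be sketched in plain Ruby. This is a minimal illustration: the hash mirrors the table above, but the real LINE_MAPPING constant may hold more entries, and `parse_lines` plus the way several line values are merged into one range are assumptions.

```ruby
# Mirrors the table above; the real constant may differ in shape or size.
LINE_MAPPING = {
  '1L'  => { min_line: 0, max_line: 0 },
  '2L'  => { min_line: 1, max_line: 1 },
  '3L+' => { min_line: 2, max_line: nil },
  '1L+' => { min_line: 0, max_line: nil }
}.freeze

SETTINGS = %w[Induction Consolidation Maintenance Bridging Neoadjuvant Adjuvant].freeze

# Hypothetical parser: splits raw label values into a numeric range and settings.
# Assumption: several line values merge into the widest covering range.
def parse_lines(raw_values)
  ranges   = raw_values.filter_map { |value| LINE_MAPPING[value] }
  settings = raw_values & SETTINGS
  return { min_line: nil, max_line: nil, treatment_settings: settings } if ranges.empty?

  open_ended = ranges.any? { |range| range[:max_line].nil? }
  {
    min_line: ranges.map { |range| range[:min_line] }.min,
    max_line: open_ended ? nil : ranges.map { |range| range[:max_line] }.max,
    treatment_settings: settings
  }
end

parse_lines(%w[2L Maintenance])
# => { min_line: 1, max_line: 1, treatment_settings: ["Maintenance"] }
```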

What guideline entries are for

For automated SOC, Guideline is the normalized disease-facing treatment entry derived from an FDA indication and one therapeutic approach.

StandardOfCareStructuring::PopulateGuidelines creates one Guideline per:

disease in the SOC scope
reviewed Indication with indication_type = 'treatment' linked to that disease or one of its descendants
IndicatedTherapeuticApproach under that indication

Non-treatment indications (diagnosis, prevention, mitigation) are skipped. In sync mode, any existing automated guidelines tied to a non-treatment indication are removed as stale.
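
The one-row-per-combination rule can be sketched as a simple expansion. The hash keys and the `guideline_keys` helper are hypothetical stand-ins for the real records; only the filtering rule (reviewed, indication_type = 'treatment', one row per approach) comes from the text above.

```ruby
# Hypothetical expansion: one guideline key per disease, indication, and approach.
# Indications are plain hashes here; the real code works on database records.
def guideline_keys(disease_id, indications)
  indications
    .select { |ind| ind[:reviewed] && ind[:indication_type] == 'treatment' }
    .flat_map do |ind|
      ind[:approach_ids].map do |approach_id|
        {
          disease_id: disease_id,
          indication_id: ind[:id],
          indicated_therapeutic_approach_id: approach_id
        }
      end
    end
end

indications = [
  { id: 10, reviewed: true,  indication_type: 'treatment',  approach_ids: [100, 101] },
  { id: 11, reviewed: true,  indication_type: 'prevention', approach_ids: [102] },
  { id: 12, reviewed: false, indication_type: 'treatment',  approach_ids: [103] }
]

guideline_keys(1, indications).size # => 2
```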

Each automated Guideline stores:

indication_id
indicated_therapeutic_approach_id
disease_id
treatment_string
linked TreatmentLine, DiseaseBaseSegment and DiseaseBiomarkerSegment rows
approval date and accelerated/full approval status
related_diseases
source = 'Label'

The entry exists so disease pages and TPP reports can ask: “For this disease, segment, biomarker segment, and line, which approved treatments are standard of care and what label evidence supports them?”

How treatment strings and therapies are built

For monotherapy, treatment_string is the approved drug name.

For combinations, it is:

Primary drug + partner 1 + partner 2

The primary therapy is also written to guideline_therapies. The drugs_guidelines HABTM association is kept for existing consumers.
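
A minimal sketch of the string format described above. The `build_treatment_string` helper and the drug names are hypothetical; only the " + " pattern comes from the text.

```ruby
# Hypothetical helper: joins the primary drug and its partners with " + ".
def build_treatment_string(primary_drug, partners = [])
  [primary_drug, *partners].join(' + ')
end

build_treatment_string('Drug A')                       # => "Drug A"
build_treatment_string('Drug A', ['Drug B', 'Drug C']) # => "Drug A + Drug B + Drug C"
```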

How treatment lines are linked

PopulateGuidelines#link_treatment_lines reads treatment-line names from the IndicatedTherapeuticApproach. It then maps those values to TreatmentLine records associated with the guideline disease.

Important behavior:

It matches by treatment_lines.line_mapping, not only display name.
Induction, Consolidation, and Maintenance can be standalone treatment lines for some diseases. If a disease has a standalone line with that mapping, it is linked directly.
If those phase names are not standalone lines, they are appended to numeric lines as "{line}; {phase}".
If a phase appears without a numeric line, it defaults to 1L; {phase}.
Plus and non-plus mappings fall back both ways: 2L+ can match 2L, and 2L can match 2L+ if the exact disease line is missing.
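
The linking rules above can be approximated in plain Ruby. This sketch is not the PopulateGuidelines implementation: `resolve_line_mappings` and `plus_variant` are hypothetical names, the disease's available line_mapping values are passed in as strings, and it assumes that a phase appended to a numeric line replaces the bare numeric line.

```ruby
LINE_PHASES = %w[Induction Consolidation Maintenance].freeze

# Plus/non-plus fallback: "2L" <-> "2L+".
def plus_variant(mapping)
  mapping.end_with?('+') ? mapping.chomp('+') : "#{mapping}+"
end

# approach_values: raw line names on the approach.
# disease_mappings: line_mapping values available for the disease (assumed input).
def resolve_line_mappings(approach_values, disease_mappings)
  phases  = approach_values & LINE_PHASES
  numeric = approach_values - LINE_PHASES

  # Phases that exist as standalone disease lines are linked directly.
  standalone, appended = phases.partition { |phase| disease_mappings.include?(phase) }
  # A phase without a numeric line defaults to 1L.
  numeric = ['1L'] if numeric.empty? && appended.any?

  wanted = standalone.dup
  if appended.empty?
    wanted.concat(numeric)
  else
    numeric.each { |line| appended.each { |phase| wanted << "#{line}; #{phase}" } }
  end

  wanted.filter_map do |mapping|
    next mapping if disease_mappings.include?(mapping)

    # Exact mapping missing: try the plus/non-plus variant of the numeric part.
    line, phase = mapping.split('; ', 2)
    fallback = [plus_variant(line), phase].compact.join('; ')
    fallback if disease_mappings.include?(fallback)
  end.uniq
end

resolve_line_mappings(%w[2L], ['1L', '2L+'])                      # => ["2L+"]
resolve_line_mappings(%w[Maintenance], ['1L', '1L; Maintenance']) # => ["1L; Maintenance"]
```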

How entries become standard of care

PostProcessGuidelines does not mark every generated automated guideline as SOC just because it exists. After segment matching, it finalizes automated entries by disease:

A guideline is eligible only if it is post-processed and has at least one linked TreatmentLine.
Duplicate entries are suppressed when they have the same disease, base segments, biomarker segments, treatment lines, and regimen signature.
The duplicate winner is the entry with the richer evidence set: more label studies, trial results, arm outcomes, adverse events, and key result selections.
The winning entries get standard_of_care = true; suppressed or ineligible entries get standard_of_care = false.
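
The finalization rules can be sketched as a group-and-pick step. The `SocEntry` struct and `finalize_soc` are hypothetical, and `evidence_count` is a single-number stand-in for the richer evidence comparison described above.

```ruby
SocEntry = Struct.new(:id, :disease, :base_segments, :biomarker_segments,
                      :treatment_lines, :regimen, :post_processed, :evidence_count,
                      keyword_init: true)

# Returns { entry id => standard_of_care flag }.
def finalize_soc(entries)
  eligible = entries.select { |e| e.post_processed && e.treatment_lines.any? }

  winners = eligible
            .group_by do |e|
              [e.disease, e.base_segments.sort, e.biomarker_segments.sort,
               e.treatment_lines.sort, e.regimen]
            end
            .values
            .map { |group| group.max_by(&:evidence_count) }

  winner_ids = winners.map(&:id)
  entries.to_h { |e| [e.id, winner_ids.include?(e.id)] }
end

entries = [
  SocEntry.new(id: 1, disease: 'D', base_segments: [], biomarker_segments: ['HER2-Positive'],
               treatment_lines: ['1L'], regimen: 'Drug A', post_processed: true, evidence_count: 5),
  SocEntry.new(id: 2, disease: 'D', base_segments: [], biomarker_segments: ['HER2-Positive'],
               treatment_lines: ['1L'], regimen: 'Drug A', post_processed: true, evidence_count: 2),
  SocEntry.new(id: 3, disease: 'D', base_segments: [], biomarker_segments: [],
               treatment_lines: [], regimen: 'Drug B', post_processed: true, evidence_count: 9)
]

finalize_soc(entries) # => { 1 => true, 2 => false, 3 => false }
```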

Label study extraction

The label-study extraction task reads label['spl']['clinical_trials'] and identifies clinical trials that present detailed results.

For each trial it extracts:

registry ID, such as NCT########; unknown if absent
trial title, such as KEYNOTE-426
patient population and study design

The result initially lives in drug_approvals.label['structured']['clinical_trials']. It is later materialized into category-scoped LabelStudy rows.

Trials are not extracted from every approval. The default scope is approved FDA/Purple Book records whose diseases have standard-of-care treatment lines and whose label has a Clinical Studies section.

Pooled analyses

Some labels report efficacy as a pooled analysis across two or more of the identified trials (e.g. “pooled analysis of Studies 1 and 2”). Pooled evidence is not a separate extraction step: a clinical_trials[] entry that carries more than one constituent under its trials array (or is explicitly typed pooled_trials) is materialized as a pooled LabelStudy during post-processing (see Materializing label study evidence).

How indications are matched to label studies

The system matches IndicatedTherapeuticApproach records to structured label evidence in two passes. Persisted matches use:

efficacy.trials for efficacy evidence (single-trial and pooled studies).
safety.reports for safety evidence.
matched_by for the selection source.

Deterministic pass

MatchApproachTrialsTask uses approval-study fields already attached to the indication when they are available:

indications.full_approval_studies
indications.accelerated_approval_studies

For each study, it compares:

clinical_trial_number against structured label trial id
study_name against structured label trial title

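When the comparison changes an approach's matches, the task also sets needs_ct_post_process_sync = true so that LabelStudy rows are reconciled later.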
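
A minimal sketch of the comparison above, using plain hashes. The `'id'` and `'title'` keys on the label trial, the case-insensitive normalization, and the helper names are assumptions; the registry ID shown is a placeholder.

```ruby
def normalize_ref(value)
  value.to_s.strip.downcase
end

# approval_studies: hashes with 'clinical_trial_number' and 'study_name'.
# label_trials: structured label trials with 'id' and 'title' (assumed keys).
def match_label_trials(approval_studies, label_trials)
  label_trials.select do |trial|
    approval_studies.any? do |study|
      number = normalize_ref(study['clinical_trial_number'])
      name   = normalize_ref(study['study_name'])

      id_match    = !number.empty? && number == normalize_ref(trial['id'])
      title_match = !name.empty? && name == normalize_ref(trial['title'])
      id_match || title_match
    end
  end
end

approval_studies = [{ 'clinical_trial_number' => 'NCT00000001', 'study_name' => 'STUDY-A' }]
label_trials = [
  { 'id' => 'nct00000001', 'title' => 'Study A trial' }, # matches by registry ID
  { 'id' => 'unknown',     'title' => 'STUDY-A' },       # matches by title
  { 'id' => 'unknown',     'title' => 'Other' }          # no match
]

match_label_trials(approval_studies, label_trials).size # => 2
```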

LLM fallback

MatchApproachTrialsLlmTask handles approaches not matched deterministically or flagged for sync.

It can only return references from the provided enums. If no evidence passes all filters, it writes empty efficacy and safety selections with matched_by = 'llm'.

Efficacy candidates (candidate_trials) include both single-trial and pooled studies, so an approach can select pooled efficacy evidence as proof. Safety reports are offered separately as candidate_safety_reports; they carry their own trials and are not inferred from efficacy trial matches. Selected safety references are persisted by AE report section/index.

Materializing label study evidence

PostProcessLabelStudiesTask converts label['structured']['matched_approaches'], label['structured']['clinical_trials'], and AE reports into relational rows.

For each matched approach/evidence pair:

Resolve the trial JSON by identifier and title.
Find or create an efficacy or safety LabelStudy keyed by drug_approval_id, category, and the single-trial or synthetic pooled identity.
Link single-trial LabelStudy rows to ClinicalTrial when the registry ID matches clinical_trials.nct_id.
Find or create ApproachLabelStudy for the approach and label study.
Create or sync endpoints, subgroups, outcome measures, adverse events, arm outcomes, and trial disease details under the LabelStudy.

The task is sync-oriented. It removes stale approach links and stale label studies when upstream matches change, while skipping LabelStudy rows that are automation-protected (manually_touched or auditor_touched); the LabelStudy.automation_owned and LabelStudy.automation_protected scopes encode this split.

For each matched pooled analysis it materializes one pooled LabelStudy per category that references many trials but carries one result set. label_studies stores study-level metadata (study_identifier, study_title, patient_number, category, study_type), while label_study_trials stores the canonical constituent trial list with fields trial_nctid, trial_title, patient_population, patient_number, and clinical_trial_id. Single-trial studies contain at most one child trial row. Pooled rows are distinguished by label_studies.study_type = 'pooled_trials' and use a stable synthetic study_identifier built from the sorted constituent NCT IDs so reruns reconcile rather than duplicate. LabelStudy.category separates efficacy rows from safety rows; post_marketing is safety-only.
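
The stable synthetic identifier could look like the sketch below. The `pooled:` prefix and `+` separator are made up for illustration; the only property taken from the text is that the identifier is built from sorted constituent NCT IDs, so input order does not matter.

```ruby
# Hypothetical format: the real study_identifier layout is not documented here.
def pooled_study_identifier(nct_ids)
  normalized = nct_ids.map { |id| id.strip.upcase }.uniq.sort
  "pooled:#{normalized.join('+')}"
end

# Placeholder registry IDs; order of the inputs does not change the result.
pooled_study_identifier(%w[NCT00000002 NCT00000001]) # => "pooled:NCT00000001+NCT00000002"
pooled_study_identifier(%w[NCT00000001 NCT00000002]) # => "pooled:NCT00000001+NCT00000002"
```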

Safety post-process no longer looks up AE reports through label['structured']['clinical_trials']. AE overview extraction records its own trials from the safety section, and selected safety reports materialize directly as LabelStudy(category: safety) plus LabelStudyTrial rows. Post-process also removes any stale adverse_events attached to efficacy studies so safety data lives only on safety LabelStudy rows.

Efficacy rows

Label trial efficacy data is stored as:

TrialEndpoint: endpoint definitions for the label trial
TrialSubgroup: patient population or result subgroup
TrialOutcomeMeasure: one endpoint measurement for one subgroup
TrialArmOutcome: arm-level result value for that measurement

TrialOutcomeMeasure and TrialEndpoint use polymorphic source_type/source_id. For the automated FDA-label path, source_type = 'LabelStudy'.

Adverse event rows

AE extraction follows a separate label-section path because FDA Adverse Reactions sections are inconsistent.

The AE process:

AeUpdateSectionsTask detects changed AE regions when label content changes.
AeSegmentSectionsTask splits the AE section into analyzable report segments.
AeExtractReportsTask identifies which trial each AE segment belongs to and extracts report-level arm counts.
AeMatchArmsTask matches AE arms to study plan arms.
AeExtractDetailsTask extracts detailed adverse events and arm-level measurements.
PostProcessLabelStudiesTask creates AdverseEvent and TrialArmOutcome rows under the matching safety LabelStudy.

For single-trial safety LabelStudy rows, only valid trial-report AE segments are materialized. Pooled AE reports are materialized onto pooled safety LabelStudy rows. Postmarketing evidence uses study_type = 'post_marketing' when materialized.

After rows exist, regulatory:trials:standardize_adverse_events deterministically matches AE names to safety endpoints where possible, and regulatory:trials:classify_adverse_events_llm uses an LLM fallback for unmatched safety endpoint classification.

How measurements are matched to endpoints

Measurements become useful only after their endpoint language is normalized.

The extraction and post-processing layers use several endpoint matching stages:

The label trial endpoint task extracts endpoint names, abbreviations, and definitions from the FDA label trial text.
Endpoint domain classification groups endpoints into clinical domains.
Endpoint matching links extracted endpoints to canonical Endpoint rows.
During post-processing, resolve_endpoint prefers:
- explicit endpoint_id from the extraction/matching output
- exact abbreviation match
- Endpoint.flexifind against endpoint synonyms

The result is stored on trial_endpoints.endpoint_id. Downstream queries use that canonical endpoint ID when available and fall back to endpoint name or abbreviation when necessary.
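
The preference order in resolve_endpoint can be sketched as three lookups tried in sequence. This is an illustration only: `CanonicalEndpoint` and `resolve_endpoint_sketch` are hypothetical, and the synonym lookup is a simple stand-in for Endpoint.flexifind, whose actual behavior is not reproduced here.

```ruby
CanonicalEndpoint = Struct.new(:id, :abbreviation, :synonyms, keyword_init: true)

# extracted: hash with optional :endpoint_id, :abbreviation, :name.
def resolve_endpoint_sketch(extracted, endpoints)
  # 1. Explicit endpoint_id from the extraction/matching output.
  if extracted[:endpoint_id]
    by_id = endpoints.find { |e| e.id == extracted[:endpoint_id] }
    return by_id if by_id
  end

  # 2. Exact abbreviation match.
  abbreviation = extracted[:abbreviation].to_s
  unless abbreviation.empty?
    by_abbreviation = endpoints.find { |e| e.abbreviation == abbreviation }
    return by_abbreviation if by_abbreviation
  end

  # 3. Synonym lookup (stand-in for Endpoint.flexifind).
  name = extracted[:name].to_s.strip.downcase
  endpoints.find { |e| e.synonyms.any? { |synonym| synonym.downcase == name } }
end

endpoints = [
  CanonicalEndpoint.new(id: 1, abbreviation: 'OS',  synonyms: ['overall survival']),
  CanonicalEndpoint.new(id: 2, abbreviation: 'PFS', synonyms: ['progression-free survival'])
]

resolve_endpoint_sketch({ abbreviation: 'PFS' }, endpoints).id         # => 2
resolve_endpoint_sketch({ name: 'Overall Survival' }, endpoints).id    # => 1
```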

How key measurements are chosen for a guideline entry

Key results are chosen per Guideline, not globally per trial.

That distinction matters because the same label study can support several approaches or indications. The key result for a 1L monotherapy scenario may not be the same as the key result for a combination or subgroup scenario.

StandardOfCareStructuring::DetermineKeyLabelStudyResults works like this:

Choose the efficacy evidence set: prefer LabelStudy rows with category = 'efficacy' and study_type = 'single_trial'. Pooled efficacy LabelStudy evidence is used only as a fallback when no single-trial result covers the guideline’s required key endpoints.
Get disease-specific key endpoints from diseases.key_endpoints_jsonb.
If the disease has no key endpoint config, walk up parent diseases until one is found.
Pick endpoint abbreviations based on treatment setting:
- neoadjuvant endpoints when the approach has neoadjuvant treatment lines/settings
- adjuvant endpoints when the approach has adjuvant treatment lines/settings
- other endpoints for all other treatment settings
- all as fallback
Filter the label study’s outcome measures to those key endpoints.
Try deterministic selection when:
- all key measures are in one subgroup
- each key endpoint appears once
- each selected outcome has a single confident investigational arm
Otherwise use an LLM prompt to select up to one result per key endpoint, choosing the subgroup and investigational arm closest to the indication, disease, treatment line, and exact therapeutic approach.

Selections are written to key_guideline_arm_outcomes, which points to the exact trial_arm_outcomes rows that should be surfaced as headline evidence and records the treatment-line and base/biomarker segment context used during selection.
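
The endpoint lookup and the deterministic-selection check can be sketched as follows. The bucket keys (`neoadjuvant`, `adjuvant`, `other`, `all`) are assumed from the wording above and may not match the real key_endpoints_jsonb layout; the helper names, disease names, and endpoint abbreviations are illustrative.

```ruby
# diseases: { name => { parent: name_or_nil, key_endpoints: hash_or_nil } }
# Walks up parent diseases until a key endpoint config is found.
def key_endpoint_config(diseases, name)
  while name
    config = diseases.dig(name, :key_endpoints)
    return config if config && !config.empty?

    name = diseases.dig(name, :parent)
  end
  nil
end

# Picks the endpoint bucket for the approach's settings, falling back to 'all'.
def key_endpoints_for(config, settings)
  bucket = if settings.include?('Neoadjuvant')
             'neoadjuvant'
           elsif settings.include?('Adjuvant')
             'adjuvant'
           else
             'other'
           end
  config[bucket] || config['all'] || []
end

# Deterministic selection applies only when all three conditions hold.
def deterministic_selection?(measures)
  measures.map { |m| m[:subgroup] }.uniq.size == 1 &&
    measures.map { |m| m[:endpoint] }.uniq.size == measures.size &&
    measures.all? { |m| m[:investigational_arms].size == 1 }
end

diseases = {
  'Parent disease' => { parent: nil, key_endpoints: {
    'neoadjuvant' => ['pCR'], 'adjuvant' => ['DFS'], 'other' => %w[OS PFS]
  } },
  'Child disease' => { parent: 'Parent disease', key_endpoints: nil }
}

config = key_endpoint_config(diseases, 'Child disease')
key_endpoints_for(config, ['Adjuvant']) # => ["DFS"]
key_endpoints_for(config, [])           # => ["OS", "PFS"]
```

When `deterministic_selection?` returns false, the description above says an LLM prompt makes the choice instead; that step is not sketched here.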

How approach arms are matched

MatchApproachTrialArmsTask chooses the investigational study plan arm that represents the approach/study pair.

It considers only ApproachLabelStudy rows that:

are linked to an automated IndicatedTherapeuticApproach
have a single-trial LabelStudy linked to a ClinicalTrial
have label study outcome measures

The task builds candidates from investigational StudyPlanArm rows on the trial — both those linked through TrialArmOutcome rows and those attached directly to the ClinicalTrial. Investigational arms are identified by arm_type matching investigational or experimental. Result-arm outcomes that lack a linked study_plan_arm_id are still used as supporting evidence for matching arms by id.

Selection rules:

If there is exactly one candidate investigational study plan arm, persist it deterministically.
If multiple candidates exist, ask the LLM to select the one matching the approach.
The LLM prioritizes exact therapeutic fit, single-agent vs combination status, partner composition, and structured study plan interventions.
It must not choose control or comparator arms.

The selected arm is stored on:

approach_label_studies.matched_study_plan_arm_id
approach_label_studies.arm_matched_by
approach_label_studies.arm_match_reasoning

KeyGuidelineArmOutcome selection uses this arm when selecting the right arm-level result.
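
The selection rules can be sketched with the LLM step replaced by a callable. `PlanArm`, `select_arm`, and the `'deterministic'` value for arm_matched_by are assumptions made for illustration; the guard reflects the rule that control or comparator arms must never be chosen.

```ruby
PlanArm = Struct.new(:id, :arm_type, keyword_init: true)

def investigational_arm?(arm)
  arm.arm_type.to_s.match?(/investigational|experimental/i)
end

# llm_selector: callable standing in for the LLM step (hypothetical).
def select_arm(arms, llm_selector:)
  candidates = arms.select { |arm| investigational_arm?(arm) }

  case candidates.size
  when 0
    nil
  when 1
    { arm_id: candidates.first.id, arm_matched_by: 'deterministic' }
  else
    chosen = llm_selector.call(candidates)
    # Only an offered investigational candidate is accepted.
    return nil unless candidates.include?(chosen)

    { arm_id: chosen.id, arm_matched_by: 'llm' }
  end
end

arms = [
  PlanArm.new(id: 1, arm_type: 'Experimental'),
  PlanArm.new(id: 2, arm_type: 'Active Comparator')
]

select_arm(arms, llm_selector: ->(candidates) { candidates.first })
# => { arm_id: 1, arm_matched_by: "deterministic" }
```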

Disease segments

Disease segments are disease-specific filters used to organize SOC treatment entries on disease pages.

There are two segment families:

Segment type	Table	Meaning
Base segment	`disease_base_segments`	Non-biomarker disease qualifiers, such as histology, transplant eligibility, site-specific metastasis, or child disease.
Biomarker segment	`disease_biomarker_segments`	Biomarker-defined cohorts, such as HER2-positive, EGFR-mutant, PD-L1-high.

Both segment types can link to guidelines through HABTM join tables:

disease_base_segments_guidelines
disease_biomarker_segments_guidelines

Biomarker segments can also link to actual biomarkers through biomarker_segment_biomarkers.

StandardOfCareStructuring::MatchSegmentsLlm matches label indication text to these segment lists. It gives the LLM:

the disease’s allowed base segments
the disease’s allowed biomarker segments
the indication raw text
the FDA label indication section for context

The task is conservative:

It prefers exact candidate strings.
It allows deterministic abbreviation/symbol equivalences such as HER2+ to HER2-Positive.
It does not extract treatment-line phrases as base segments.
It avoids generic adult/pediatric and raw staging tokens unless the exact segment is in the candidate list.
It stores insufficient-data reasoning when the label does not support confident segment assignment.

The matched result is stored in guidelines.llm_data['matched_segments'] and later applied to the guideline’s segment associations.
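
One way to picture the candidate-list restriction is a validation step that maps extracted strings back onto allowed segment names. This is a sketch, not the MatchSegmentsLlm implementation: both helpers are hypothetical, and the trailing "+" to "-positive" rewrite is the only equivalence modeled.

```ruby
# Normalizes a segment string; a trailing "+" is treated as "-positive".
def normalize_segment(text)
  text.strip.downcase.sub(/\+\z/, '-positive').gsub(/\s+/, ' ')
end

# Keeps only extracted values that correspond to an allowed candidate,
# returning the candidate's canonical spelling.
def match_segments(extracted, candidates)
  index = candidates.to_h { |candidate| [normalize_segment(candidate), candidate] }
  extracted.filter_map { |value| index[normalize_segment(value)] }.uniq
end

match_segments(['HER2+', 'Adult'], ['HER2-Positive', 'EGFR-Mutant'])
# => ["HER2-Positive"]
```

Values outside the candidate list, such as the generic "Adult" token here, are dropped rather than stored.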

Treatment lines

TreatmentLine is the disease-specific treatment-line vocabulary used by SOC views. A treatment line can be a simple line (First line), a phase-specific line (First line Maintenance), or a disease-specific special line. The machine key used by automated matching is line_mapping.

DiseaseTreatmentLine connects lines to diseases and preserves display ordering through position.

Automated FDA indications provide treatment-line values as strings on IndicatedTherapeuticApproach. PopulateGuidelines maps those values to disease-specific TreatmentLine rows.

Treatment line handling has two layers:

On the indication/approach: line semantics are extracted from FDA label text and parsed to min_line, max_line, and treatment_settings.
On the guideline: those semantics are mapped into disease-specific TreatmentLine records for filtering and display.

This lets the system preserve label-derived line logic while still using disease-specific display lines in the SOC UI.

Table roles and associations

Regulatory and indication tables

Table	Role	Key associations
`drug_approvals`	Unified regulatory approval. Stores FDA label JSON and structured LLM/intermediate workflow data.	`has_many :indications`; `has_many :label_studies`; polymorphic source via `source_type/source_id`; `belongs_to :drug`.
`indications`	One structured approved use extracted from FDA label indication text.	`belongs_to :drug_approval`; `has_many :indicated_therapeutic_approaches`; `has_many :indication_diseases`; `has_many :diseases, through: :indication_diseases`; `has_many :guidelines`.
`indication_diseases`	Join from structured indication to canonical diseases.	`belongs_to :indication`; `belongs_to :disease`.
`indicated_therapeutic_approaches`	One approved treatment scenario under an indication. Stores line and setting semantics.	`belongs_to :indication`; `has_many :indicated_combination_partners`; `has_many :guidelines`; `has_many :approach_label_studies`; `has_many :label_studies, through: :approach_label_studies`.
`indicated_combination_partners`	Current-regimen partner therapies for a non-single-agent approach.	`belongs_to :indicated_therapeutic_approach`; optional polymorphic-like `partner_type/partner_id`.
`indicated_biomarkers`	Biomarker qualifiers extracted from indication text.	`belongs_to :indication`; optional `biomarker_id`.
`indicated_prior_therapy_groups`	AND/OR grouping for prior therapy requirements.	`belongs_to :indication`; `has_many :indicated_prior_therapies`.
`indicated_prior_therapies`	Individual prior therapy gates such as progressed after platinum therapy.	`belongs_to :indication`; optional therapy polymorphic fields; optional group.

SOC entry and segment tables

Table	Role	Key associations
`guidelines`	Disease-facing SOC treatment entry. Automated rows link one disease, one indication, and one therapeutic approach.	`belongs_to :disease`; optional `belongs_to :indication`; optional `belongs_to :indicated_therapeutic_approach`; HABTM `drugs`, `treatment_lines`, `disease_base_segments`, `disease_biomarker_segments`; `has_many :guideline_therapies`; `has_many :key_guideline_arm_outcomes`.
`guideline_therapies`	Polymorphic therapy link for the primary therapy displayed by a guideline.	`belongs_to :guideline`; `belongs_to :therapy, polymorphic: true`.
`treatment_lines`	Canonical disease-specific line/setting display vocabulary.	`has_many :disease_treatment_lines`; `has_many :diseases, through: :disease_treatment_lines`; HABTM `guidelines`.
`disease_treatment_lines`	Disease-to-treatment-line join with display order.	`belongs_to :disease`; `belongs_to :treatment_line`.
`disease_base_segments`	Non-biomarker disease segment options.	`belongs_to :disease`; optional `belongs_to :child_disease`; HABTM `guidelines`.
`disease_biomarker_segments`	Biomarker-defined disease segment options.	`belongs_to :disease`; optional `belongs_to :child_disease`; HABTM `guidelines`; `has_many :biomarker_segment_biomarkers`; `has_many :standard_of_care_prevalences`.
`biomarker_segment_biomarkers`	Biomarker composition of a biomarker segment.	`belongs_to :disease_biomarker_segment`; `belongs_to :biomarker`.
`standard_of_care_prevalences`	Prevalence metadata for biomarker segments, optionally treatment-line-specific.	`belongs_to :disease`; `belongs_to :disease_biomarker_segment`; optional `belongs_to :treatment_line`; `belongs_to :biomarker_prevalence`.
`disease_key_endpoints`	Disease-to-endpoint join for key SOC efficacy endpoints.	`belongs_to :disease`; `belongs_to :endpoint`.

Label study evidence tables

Table	Role	Key associations
`label_studies`	One category-scoped evidence container from an FDA label. `category` is `efficacy` or `safety`; `study_type` is `single_trial`, `pooled_trials`, or safety-only `post_marketing`. Study-level identity lives in `study_identifier`/`study_title`; constituent trial identity lives in `label_study_trials`.	`belongs_to :drug_approval`; `has_many :label_study_trials`; `has_many :clinical_trials, through: :label_study_trials`; `has_many :approach_label_studies`; polymorphic source for trial result tables; scopes `efficacy`/`safety`/`single_trial`/`pooled`/`post_marketing`.
`label_study_trials`	Constituent clinical trials referenced by a `LabelStudy`. Single-trial studies have one row; pooled studies can have many.	`belongs_to :label_study`; optional `belongs_to :clinical_trial`; stores `trial_nctid`, `trial_title`, `sponsor_name`, `trial_index`, and `label_reference`.
`approach_label_studies`	Join from an approach to a label study. This is where “this evidence supports this approach” is stored.	`belongs_to :indicated_therapeutic_approach`; `belongs_to :label_study`; optional `belongs_to :matched_study_plan_arm`.
`trial_endpoints`	Endpoint definitions extracted from label trial text.	Polymorphic `source`; optional `belongs_to :endpoint`; optional `belongs_to :clinical_trial`; `has_many :trial_outcome_measures`.
`trial_subgroups`	Result subgroup or analysis population.	Polymorphic `source`; optional `belongs_to :clinical_trial`; optional `belongs_to :disease`; `has_many :trial_outcome_measures`.
`trial_outcome_measures`	One measurement for one endpoint and subgroup.	Polymorphic `source`; `belongs_to :trial_endpoint`; `belongs_to :trial_subgroup`; optional `belongs_to :clinical_trial`; `has_many :trial_arm_outcomes`.
`trial_arm_outcomes`	Arm-level value for an efficacy measurement or adverse event.	Optional `belongs_to :trial_outcome_measure`; optional `belongs_to :adverse_event`; optional `belongs_to :study_plan_arm`.
`adverse_events`	Safety event measurement extracted from label AE report.	Polymorphic `source`; optional `belongs_to :clinical_trial`; optional `belongs_to :endpoint`; `has_many :trial_arm_outcomes`.
`trial_disease_details`	Disease/population details extracted from trial text.	Polymorphic `source`; optional `belongs_to :disease`; optional `belongs_to :clinical_trial`; `has_many :trial_disease_biomarkers`.
`key_guideline_arm_outcomes`	Marks the exact arm outcomes that are headline/key results for a guideline, with the treatment-line and segment context used during selection.	`belongs_to :guideline`; `belongs_to :trial_arm_outcome`; HABTM `treatment_lines`, `disease_base_segments`, `disease_biomarker_segments`.

Query consumption

Disease SOC pages use Diseases::StandardOfCareQuery.

The query starts from Guideline.standard_of_care, filters by disease, optional base segments, biomarker segments, and top endpoints, then joins through:

guidelines
  -> approach_label_studies
  -> label_studies
  -> trial_outcome_measures
  -> trial_endpoints
guidelines
  -> key_guideline_arm_outcomes (filtered by treatment line and segment context)
  -> trial_arm_outcomes

For label-study-backed entries, endpoint values are only surfaced through KeyGuidelineArmOutcome when top endpoint filtering is requested. The key-outcome join also filters by the guideline’s matched treatment line and base/biomarker segments when the key result was scoped to a specific context. Legacy guideline-sourced rows are kept as a fallback for older data, but the automated FDA-label path should use LabelStudy as the evidence source.

TPP Standard of Care reports use Tpp::StandardOfCareQuery. It loads automated guideline entries, finds matching ApproachLabelStudy rows by indicated_therapeutic_approach_id, and reads measurements from category-scoped LabelStudy rows. It uses KeyGuidelineArmOutcome IDs to prefer headline arm outcomes when summarizing efficacy, applying treatment-line and biomarker filters against the key outcome’s stored context.

The materialized SoC efficacy view is Scenic-managed as vw_soc_efficacy_data. Version 2 reads through approach_label_studies/label_studies, uses source_type = 'LabelStudy', and exposes label_study_id, label_study_category, and label_study_type so consumers can identify pooled evidence.

Sync and invalidation rules

The pipeline is designed to rerun without blindly duplicating results.

Important invalidation behavior:

If a new automated guideline is created, the approval label is marked needs_trials_update.
If guideline sync changes indication text, treatment string, treatment lines, related diseases, or accelerated status, downstream trial/segment fields can be cleared and recomputed.
If Clinical Studies label content changes, Fda::LabelSync marks needs_trials_update.
If Adverse Reactions label content changes, Fda::LabelSync records changed AE regions and marks needs_ae_update.
If approach-to-trial matches change, needs_ct_post_process_sync is set.
If post-processed trial result structure changes, persisted approach arm matches and key result selections are invalidated.
Stale ApproachLabelStudy, LabelStudy, subgroup, outcome, endpoint, and arm rows are reconciled during sync.
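
The two label-content rules above can be pictured as a comparison of section text before and after a sync. The Ruby sketch below is illustrative only: `SECTION_FLAGS`, `invalidation_flags`, and the section keys are hypothetical names, and the real Fda::LabelSync may detect changes differently (for example, per changed AE region). Only the flag names come from this page.

```ruby
# Hypothetical mapping from a label section to the flag named in the rules above.
# The section keys are assumptions made for this sketch.
SECTION_FLAGS = {
  clinical_studies: :needs_trials_update,
  adverse_reactions: :needs_ae_update
}.freeze

# Returns the flags to set, given section text before and after a label sync.
def invalidation_flags(old_sections, new_sections)
  SECTION_FLAGS
    .select { |section, _flag| old_sections[section] != new_sections[section] }
    .values
end

old_label = { clinical_studies: "Study 1 text", adverse_reactions: "Table 3 text" }
new_label = { clinical_studies: "Study 1 text", adverse_reactions: "Table 3 text, revised" }
p invalidation_flags(old_label, new_label) # prints [:needs_ae_update]
```

Keeping the section-to-flag mapping in one table makes it easy to see that an unchanged section never triggers downstream re-extraction.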

The main invariant is that label evidence should live once per category on LabelStudy, while treatment-specific relevance lives on ApproachLabelStudy and guideline-scoped headline evidence lives on KeyGuidelineArmOutcome.

Parallel.ai audits and corrections

Once SOC entries are produced, a second pass uses Parallel.ai web search to audit structured fields against official sources (FDA labels, Drugs@FDA, ClinicalTrials.gov, sponsor releases, peer-reviewed pivotal trials).

The framework lives under app/tasks/standard_of_care_structuring/parallel_soc_audits/. BaseAuditor defines the shared JSON schema, scope filters (approval_ids, disease_ids, target-specific ids, limit), Parallel.ai task-group submission, and the AuditIssue write path. Each subclass picks a target table, builds one prompt per record, and lists the editable fields in a Field Contract that constrains what the LLM may correct.

Auditor	Targets	Checks
`IndicationAuditor`	`Indication`, `IndicatedTherapeuticApproach`, `IndicatedCombinationPartner`	Treatment lines, single-agent vs combination flag, partner composition. Scoped to `indication_type = 'treatment'` only.
`LabelStudyAuditor`	`LabelStudy`, `ApproachLabelStudy`, `TrialDiseaseDetail`, `TrialSubgroup`, `TrialOutcomeMeasure`, `TrialArmOutcome`	Study identifier, title, and structured result fields against the FDA label and registry.
`ApproachSegmentAuditor`	`Guideline` segment associations	Disease base segment and biomarker segment IDs linked to each guideline row, restricted to the disease’s candidate segment lists.
`DrugLineAuditor`	`Guideline` rows grouped by disease and treatment line	Whether the drug list per disease/line matches standard-of-care drugs; reports `missing` and `should_not_be_there` only.

IndicationAuditor, LabelStudyAuditor, and ApproachSegmentAuditor write findings to audit_issues keyed by issue_type (soc_indication_audit, soc_label_study_audit, soc_approach_segment_audit). DrugLineAuditor is read-only: it logs results but does not create issues, since drug coverage corrections require human review.

Applying corrections

CorrectionApplicator and its subclasses read open AuditIssue rows for a given issue_type and apply each correction as a direct field replacement on the target record. Corrections are filtered the same way audits are scoped (issue_ids, approval_ids, disease_ids, target ids, limit), and dry_run logs intended changes without writing.
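
As a rough illustration of direct field replacement with a dry run, the following Ruby sketch uses a plain Hash in place of the target record. `apply_corrections` and its parameters are hypothetical and are not the real CorrectionApplicator API; re-checking the Field Contract at apply time is also an assumption of this sketch.

```ruby
# Hypothetical helper, not the real CorrectionApplicator interface.
# record:          a Hash standing in for the target model row
# corrections:     { field => corrected_value } taken from one audit issue
# editable_fields: fields the auditor's Field Contract allows (assumed to be re-checked here)
def apply_corrections(record, corrections, editable_fields:, dry_run: false)
  changes = {}
  corrections.each do |field, new_value|
    next unless editable_fields.include?(field) # ignore fields outside the contract
    next if record[field] == new_value          # nothing to change

    changes[field] = { from: record[field], to: new_value }
    record[field] = new_value unless dry_run    # direct field replacement
  end
  changes
end

approach = { treatment_lines: [1], combination: false }
planned = apply_corrections(
  approach,
  { treatment_lines: [1, 2], id: 99 },
  editable_fields: [:treatment_lines, :combination],
  dry_run: true
)
p planned # the intended change only; approach itself is left untouched
```

Returning the change set in both modes means a dry run and a real run can share the same logging path.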

Applicator	Issue type	Special handling
`IndicationCorrectionApplicator`	`soc_indication_audit`	Replaces `indicated_combination_partners` rows wholesale when `combination_partners` is corrected, and resolves partner names through `Therapy.find_therapy`.
`LabelStudyCorrectionApplicator`	`soc_label_study_audit`	Field replacement on the target model row.
`ApproachSegmentCorrectionApplicator`	`soc_approach_segment_audit`	Replaces guideline segment associations to match the corrected ID list.

Study-extraction audits

AuditApproachStudyMatches (app/tasks/approval_llm_classification/study_extraction/audit_approach_study_matches.rb) follows the same Parallel.ai task-group pattern but lives in the study-extraction pipeline rather than the SOC module. The --batched, --parallelism, and --accept_findings Thor options on regulatory:trials:audit_approach_study_matches are kept as no-ops for backwards compatibility; Parallel.ai task groups are always used.

NCCN recommendation classification

Guideline entries are label-derived, so the fact that a treatment is FDA approved for a disease says nothing about whether NCCN recommends it. Three columns on guidelines answer that separately:

Column	Values	Meaning
`nccn_recommended`	`true` / `false` / `NULL`	Whether the disease’s NCCN guideline recommends this treatment. `NULL` means the entry has not been classified yet.
`nccn_recommendation_classification`	`Preferred`, `Other Recommended`, `Useful in Certain Circumstances`, `NULL`	NCCN Category of Preference. `NULL` on a recommended entry is a real answer: many NCCN tables list options without stratifying them by preference.
`nccn_evidence_level`	`1`, `2A`, `2B`, `3`	NCCN Category of Evidence and Consensus. NCCN prints a category only when it is not 2A (“All recommendations are category 2A unless otherwise indicated”), so a recommended entry always carries a level.

Because NCCN prints an evidence category only when it differs from 2A, a recommendation printed without a category is category 2A; this default is why a recommended entry always carries a level.

The source is the local NCCN recommendation PDF already matched onto each disease by standard_of_care:resolve_nccn_recommendation_pdfs (stored at diseases.metadata.old_chemo_search.nccn_recommendation_pdf). Classification runs in two steps so a PDF is read once per disease instead of once per treatment.

standard_of_care:index_nccn_recommendations (IndexNccnRecommendations) Walks the disease’s PDF one page at a time and indexes every therapy recommendation printed on it, with its preference heading, its evidence category, its required and optional (±) components, and the verbatim heading/setting context. The catalog is written to diseases.metadata.nccn_recommendation_index. Pages that only carry references, staging tables, or discussion are skipped.
standard_of_care:classify_nccn_recommendations (ClassifyNccnRecommendations) Sends one text prompt per guideline, comparing the guideline’s treatment and recorded context against that disease’s indexed recommendations. A treatment is marked recommended only when an indexed recommendation is the same therapy and, when the guideline records treatment lines or restrictions, covers at least one of them. Optional components matter here: Cladribine ± Rituximab covers a Cladribine entry and a Cladribine + Rituximab entry alike.
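
The optional-component rule can be stated as a set check. The classifier itself is an LLM prompt, so the Ruby below is only a deterministic illustration of the intended rule; `covers_therapy?` and the `:required` / `:optional` keys are hypothetical names, not the stored index format.

```ruby
# Hypothetical shape: an indexed recommendation with required and optional (±) components.
# A treatment is covered when it contains every required component
# and nothing beyond the required and optional ones.
def covers_therapy?(recommendation, treatment_components)
  required = recommendation[:required].map(&:downcase)
  optional = recommendation[:optional].map(&:downcase)
  given = treatment_components.map(&:downcase)

  missing_required = required - given
  unexplained = given - required - optional
  missing_required.empty? && unexplained.empty?
end

cladribine_rec = { required: ["Cladribine"], optional: ["Rituximab"] }
p covers_therapy?(cladribine_rec, ["Cladribine"])              # prints true
p covers_therapy?(cladribine_rec, ["Cladribine", "Rituximab"]) # prints true
p covers_therapy?(cladribine_rec, ["Rituximab"])               # prints false
```

The check covers only the "same therapy" half of the rule; treatment-line and restriction coverage would need a separate comparison.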

Two properties keep the output honest:

The index is the only evidence. The classifier never sees the PDF and is told not to answer from its own knowledge of NCCN, so a treatment absent from the disease’s guideline comes back false rather than plausibly recommended.
Gradings are read off matched entries, not generated. The model reports which indexed recommendation ids it matched; the stored classification and evidence level are then resolved from those entries (strongest first: Preferred > Other Recommended > Useful in Certain Circumstances, 1 > 2A > 2B > 3). A category no matched entry carries can never be written.
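
A minimal Ruby sketch of resolving gradings from matched entries follows. The method name, the hash shape of an indexed recommendation, resolving preference and evidence independently of each other, and falling back to 2A for an entry stored without a category are all assumptions of this sketch, not confirmed details of ClassifyNccnRecommendations.

```ruby
# Orderings taken from the text above, strongest first.
PREFERENCE_ORDER = ["Preferred", "Other Recommended", "Useful in Certain Circumstances"].freeze
EVIDENCE_ORDER = ["1", "2A", "2B", "3"].freeze

# index:       array of hashes like { id:, preference:, evidence: } (assumed shape)
# matched_ids: recommendation ids the model reported as matches
def resolve_nccn_grading(index, matched_ids)
  matched = index.select { |rec| matched_ids.include?(rec[:id]) }
  return { recommended: false, classification: nil, evidence_level: nil } if matched.empty?

  preferences = matched.map { |rec| rec[:preference] }
  # Assumption: an entry with no printed category is treated as NCCN's default, 2A.
  levels = matched.map { |rec| rec[:evidence] || "2A" }

  {
    recommended: true,
    # Walking the fixed orderings means only a value carried by a matched entry can be returned.
    classification: PREFERENCE_ORDER.find { |pref| preferences.include?(pref) },
    evidence_level: EVIDENCE_ORDER.find { |level| levels.include?(level) }
  }
end

index = [
  { id: 1, preference: "Other Recommended", evidence: "1" },
  { id: 2, preference: "Preferred", evidence: nil }
]
p resolve_nccn_grading(index, [1, 2])
```

Because the result is looked up from the matched entries rather than taken from model output, a hallucinated category has no path into the stored columns.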

Provenance lands in guidelines.llm_data['nccn_classification'] (the matched recommendations with their PDF pages, the reasoning, and the source PDF), and its presence is what marks an entry as classified. Both tasks accept --disease_ids, --limit, --overwrite, and OpenAI batch options; the classifier also accepts --guideline_ids and --entry_types. Guidelines on a disease with no indexed PDF are reported as skipped missing index rather than guessed at.

Key commands

Command	Purpose
`bundle exec thor standard_of_care:populate_guidelines --mode=sync`	Create or sync automated SOC guideline entries from reviewed FDA indications.
`bundle exec thor regulatory:trials:extract_label_studies`	Identify label trial overviews from FDA Clinical Studies text.
`bundle exec thor regulatory:trials:extract_efficacy_endpoints`	Extract endpoints for label trials.
`bundle exec thor regulatory:trials:extract_efficacy_subgroups`	Extract trial subgroups and analysis populations.
`bundle exec thor regulatory:trials:extract_efficacy_results`	Extract efficacy results.
`bundle exec thor regulatory:trials:scout_trial_nctids`	Find missing trial registry IDs.
`bundle exec thor regulatory:trials:match_efficacy_arms`	Match extracted result arms to study plan arms.
`bundle exec thor regulatory:trials:classify_endpoint_domains`	Classify endpoint domains.
`bundle exec thor regulatory:trials:match_endpoints`	Link extracted endpoints to canonical endpoints.
`bundle exec thor regulatory:trials:segment_ae_sections`	Segment FDA AE label sections.
`bundle exec thor regulatory:trials:extract_ae_reports`	Match AE report overviews to label studies.
`bundle exec thor regulatory:trials:match_ae_arms`	Match AE arms to study plan arms.
`bundle exec thor regulatory:trials:extract_ae_details`	Extract detailed adverse event rows.
`bundle exec thor regulatory:trials:match_approach_studies`	Deterministically match approaches to label studies.
`bundle exec thor regulatory:trials:match_approach_studies_llm`	LLM fallback for approach-to-study matching.
`bundle exec thor regulatory:trials:post_process_label_studies`	Materialize label study evidence into relational rows.
`bundle exec thor regulatory:trials:match_approach_study_arms`	Pick the approach’s investigational arm.
`bundle exec thor standard_of_care:match_segments_llm`	Match FDA indication text to disease base and biomarker segments.
`bundle exec thor standard_of_care:post_process_guidelines --entry-types=drug_approval`	Apply automated SOC post-processing.
`bundle exec thor standard_of_care:determine_key_efficacy_results`	Pick key efficacy arm outcomes for each post-processed guideline segment context.
`bundle exec thor standard_of_care:determine_key_safety_arms`	Match whole safety-study arms to each guideline segment context (`KeyGuidelineSafetyArm`).
`bundle exec thor standard_of_care:index_nccn_recommendations`	Index every therapy recommendation (preference + evidence category) from each disease’s matched NCCN PDF.
`bundle exec thor standard_of_care:classify_nccn_recommendations`	Mark each guideline NCCN recommended or not, with its NCCN preference and evidence category.
`bundle exec thor standard_of_care:audit_indications_parallel`	Parallel.ai audit of structured indications and approach line/partner fields.
`bundle exec thor standard_of_care:apply_indication_parallel_audit_corrections`	Apply open `soc_indication_audit` corrections.
`bundle exec thor standard_of_care:audit_label_studies_parallel`	Parallel.ai audit of `LabelStudy` identifiers, titles, and result fields.
`bundle exec thor standard_of_care:apply_label_study_parallel_audit_corrections`	Apply open `soc_label_study_audit` corrections.
`bundle exec thor standard_of_care:audit_approach_segments_parallel`	Parallel.ai audit of guideline disease base/biomarker segment associations.
`bundle exec thor standard_of_care:apply_approach_segment_parallel_audit_corrections`	Apply open `soc_approach_segment_audit` corrections.
`bundle exec thor standard_of_care:audit_drug_lines_parallel`	Parallel.ai audit of SOC drug coverage by disease and treatment line (read-only report).

Mental model

Keep these ownership boundaries in mind:

DrugApproval owns the FDA label and intermediate JSON.
Indication owns the approved disease/population context.
IndicatedTherapeuticApproach owns the exact approved regimen and treatment-line semantics.
Guideline owns the disease-facing SOC row and display/filter associations.
LabelStudy owns extracted label evidence once per category and study type.
ApproachLabelStudy explains why one label study supports one approach.
KeyGuidelineArmOutcome explains which specific measurement is closest to the guideline entry, including the treatment-line and segment context used for that decision.

When debugging an SOC entry, start from guidelines.indicated_therapeutic_approach_id, then inspect the approach’s approach_label_studies, the linked label_studies, and finally the guideline’s key_guideline_arm_outcomes that select the evidence shown to users.