Publication Issues Tracker 2

Publication Issues Tracker (continued)

Continuation of docs/publication_issues_tracker.md. New issues are logged here to keep file sizes manageable.

Last updated: 2026-04-06

Issue index

#	Title	Short description	Status
52	“All Arms” view join fans out across all trial_arm_interventions, duplicating dose	View v23 lines 474-475 match every `drug_intervention` to “All Arms” outcomes regardless of which arm the intervention belongs to, creating duplicate rows with conflicting dose profiles (RP2D vs escalation range). ~2,776 pubs, ~15k row groups.	Investigation complete

Each issue entry should keep analysis and remediation separate.

Recommended issue structure:

Short summary
Where this sits in the current pipeline
Exact restriction causing the drop
Concrete examples
Downstream impact
What the issue is not
Scale
Spot checks
Open characterization questions
Explored solution direction
Solution applied

Solution applied should remain empty until an actual fix is agreed and implemented.

Backfill pattern: When an issue requires backfilling historical data, see the “One-Off Backfill Tasks” section in .claude/skills/backend-expert/SKILL.md.

52. “All Arms” view join fans out across all trial_arm_interventions, duplicating dose

Short summary

View v23’s drug join for “All Arms” outcomes uses a broad match (aoe.arm_name = 'All Arms') that pairs each outcome with every trial_arm_intervention for the publication, not just the relevant one. When a publication has multiple dose-level arms (e.g., dose escalation with range 4.8-16.0 mg/kg and dose expansion with single_dose 12.0 mg/kg), each “All Arms” outcome row is duplicated once per intervention, with conflicting dose profiles. The query layer then takes dose from first_row, so the dose shown depends on row order and is effectively non-deterministic.

Where this sits in the current pipeline

View layer: db/views/vw_publication_efficacy_data_v23.sql, lines 470-475 (drug join for Source 0 “All Arms”):

-- Source 0 (trial_arm_interventions): direct trial_arm_id match
(di.trial_arm_id IS NOT NULL AND aoe.trial_arm_id IS NOT NULL
  AND di.trial_arm_id = aoe.trial_arm_id)
-- Source 0 for "All Arms" outcomes (trial_arm_id set but no interventions on that arm)
OR (di.trial_arm_id IS NOT NULL AND aoe.trial_arm_id IS NOT NULL
  AND aoe.arm_name = 'All Arms')

Query layer: app/queries/tpp/clinical_evidence_query.rb, line 386 — build_single_row takes dose from first_row:

single_dose: first_row['single_dose'],
dose_min: first_row['dose_min'],
dose_max: first_row['dose_max'],
rp2d: first_row['rp2d'],

Exact restriction causing the drop

The aoe.arm_name = 'All Arms' clause is a catch-all: it matches any drug_intervention row that has a trial_arm_id set, regardless of which arm the intervention belongs to. When multiple arms have interventions with different dose profiles, the fan-out creates N duplicate rows per outcome (one per intervention).

The arm_dose_lookup CTE then joins on trial_arm_intervention_id, pulling each intervention’s distinct dose. The deduped_rows CTE (SELECT DISTINCT *) doesn’t collapse these because the dose columns differ.
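As a minimal sketch of this mechanism (plain Ruby over in-memory hashes, not the view's SQL), the predicate below mirrors the two Source 0 clauses quoted above and applies them to the pub 74158 rows described under Concrete examples. The method name drug_join_match?, the hash keys, and the symbolic arm ids for the two dosed arms are assumptions made for illustration; only TAI ids 33097/33098, arm 25400, and the dose values come from this tracker.

```ruby
# Illustrative model only: mirrors the two Source 0 clauses of the v23 drug join.
def drug_join_match?(di, aoe)
  both_set = !di[:trial_arm_id].nil? && !aoe[:trial_arm_id].nil?
  direct   = both_set && di[:trial_arm_id] == aoe[:trial_arm_id]
  catchall = both_set && aoe[:arm_name] == 'All Arms' # ignores which arm di belongs to
  direct || catchall
end

# Arm ids for the two dosed arms are placeholders (not given in the tracker).
interventions = [
  { tai_id: 33097, trial_arm_id: :escalation_arm, single_dose: nil,  dose_min: 4.8, dose_max: 16.0 },
  { tai_id: 33098, trial_arm_id: :expansion_arm,  single_dose: 12.0, dose_min: nil, dose_max: nil }
]
all_arms_outcome = { trial_arm_id: 25400, arm_name: 'All Arms', indication: 'sqNSCLC' }

# Emulate the join: one output row per matching intervention.
joined_rows = interventions
  .select { |di| drug_join_match?(di, all_arms_outcome) }
  .map do |di|
    { indication: all_arms_outcome[:indication],
      single_dose: di[:single_dose], dose_min: di[:dose_min], dose_max: di[:dose_max] }
  end

# Emulate SELECT DISTINCT *: rows differing in any column are kept.
deduped_rows = joined_rows.uniq

puts "joined: #{joined_rows.size}, after distinct: #{deduped_rows.size}"
```

Both rows survive the uniq step because their dose columns differ, which is the same reason SELECT DISTINCT * leaves them in place.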

Concrete examples

Pub 74158 (DS-7300 / Ifinatamab deruxtecan, B7-H3 ADC phase I/II extended follow-up):

Two trial_arm_interventions:
- TAI 33097 on arm “Dose escalation (4.8-16.0 mg/kg)”: dose_min=4.8, dose_max=16.0, dose_context_type=escalation
- TAI 33098 on arm “Dose expansion (12.0 mg/kg)”: single_dose=12.0, rp2d=12.0, dose_context_type=rp2d
sqNSCLC outcomes are on arm “All Arms” (arm 25400), which has NO interventions
View produces duplicate rows: some with single_dose=12.0 mg/kg, others with dose_min=4.8 / dose_max=16.0
Abstract explicitly states efficacy is pooled across 4.8-16.0 mg/kg cohorts — the range is correct, but the query may show 12.0 mg/kg (RP2D) depending on sort order
Audit issue 8512 flagged single_dose=12.0 mg/kg as incorrect for sqNSCLC

Downstream impact

Clinical Evidence report shows a non-deterministic dose for “All Arms” rows — whichever duplicate sorts first wins
The RP2D (single_dose) may be shown when the efficacy population spans the full dose range, misrepresenting the study population
Row duplication in the view inflates materialized view size and may cause subtle metric selection issues in extract_efficacy_metrics

What the issue is not

Not a data extraction issue — the LLM correctly identified both dose arms
Not an arm-linking issue (Issue 49) — the “All Arms” arm intentionally has no interventions
Not the same as Issue 31 (dose bleed onto control arms via COALESCE) — this is a JOIN fan-out, not a fallback chain
Not the same as Issue 51 (per-arm dose not populated) — here both arms have correct dose, but both are matched to outcomes they don’t belong to

Scale

Metric	Count
Pubs with “All Arms” outcomes + conflicting dose profiles across different arms (single/rp2d vs range)	715
Pubs where the view produces duplicate rows with different dose variants for “All Arms” outcomes	2,776
Total duplicate row groups in the view	15,025

Spot checks

Pub 74158: Confirmed — sqNSCLC rows duplicated with RP2D (12.0) and range (4.8-16.0). Abstract pools all cohorts.

Open characterization questions

How many of the 2,776 pubs have the “wrong” dose surfaced by the query (i.e., first_row picks RP2D when range would be more accurate)?
Should “All Arms” outcomes inherit dose from the broadest-range intervention, or should dose be null/aggregated?
Does dose_context_type (escalation vs rp2d) provide enough signal to pick the right intervention?

Explored solution direction

Option A — View-layer fix: When aoe.arm_name = 'All Arms', prefer the intervention with dose_context_type = 'escalation' (or the one with a dose_min/dose_max range) over single_dose/rp2d. This could be implemented with a ROW_NUMBER() window and a priority ordering, keeping only the top-ranked row per outcome.

Option B — Query-layer fix: In build_single_row, when multiple view rows exist for the same outcome with different dose profiles, prefer the range (dose_min/dose_max) over single_dose for “All Arms” rows — since pooled data is better described by the range.
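A minimal sketch of the Option B preference, assuming the view rows for one outcome arrive as an array of string-keyed hashes that include arm_name (only the dose keys are confirmed by the build_single_row excerpt above). preferred_dose_row is a hypothetical helper name, not an existing method in clinical_evidence_query.rb.

```ruby
# Hypothetical helper: choose which view row supplies the dose fields.
# For "All Arms" outcomes, prefer a row carrying a dose range; otherwise
# keep the current behaviour (first row wins).
def preferred_dose_row(rows)
  first_row = rows.first
  return first_row unless first_row && first_row['arm_name'] == 'All Arms'

  rows.find { |r| r['dose_min'] && r['dose_max'] } || first_row
end

# Duplicate rows as the view would produce them for pub 74158.
rp2d_row  = { 'arm_name' => 'All Arms', 'single_dose' => 12.0, 'dose_min' => nil, 'dose_max' => nil,  'rp2d' => 12.0 }
range_row = { 'arm_name' => 'All Arms', 'single_dose' => nil,  'dose_min' => 4.8, 'dose_max' => 16.0, 'rp2d' => nil }

# The selected row no longer depends on sort order.
chosen_a = preferred_dose_row([rp2d_row, range_row])
chosen_b = preferred_dose_row([range_row, rp2d_row])
puts chosen_a.equal?(chosen_b)
```

This only changes which duplicate supplies the dose; the duplicate rows themselves remain in the view.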

Option C — Data-model fix: Link “All Arms” outcomes to all relevant arms explicitly (via trial_arm_id or a new junction), then aggregate dose at the query level.
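One possible shape for the query-level aggregation in Option C, assuming the interventions of the linked arms are available as string-keyed hashes and all doses share a unit. aggregate_dose is a hypothetical name; how rp2d should be reported for pooled rows is left open here.

```ruby
# Hypothetical aggregation: collapse the dose profiles of all linked arms
# into one profile spanning the lowest and highest dose seen.
def aggregate_dose(interventions)
  lows  = interventions.map { |i| i['dose_min'] || i['single_dose'] }.compact
  highs = interventions.map { |i| i['dose_max'] || i['single_dose'] }.compact
  empty = { 'single_dose' => nil, 'dose_min' => nil, 'dose_max' => nil }
  return empty if lows.empty? || highs.empty?

  lo = lows.min
  hi = highs.max
  if lo == hi
    empty.merge('single_dose' => lo)            # every arm used the same dose
  else
    empty.merge('dose_min' => lo, 'dose_max' => hi)
  end
end

# Pub 74158: escalation range plus expansion at 12.0 mg/kg.
pooled = aggregate_dose([
  { 'single_dose' => nil,  'dose_min' => 4.8, 'dose_max' => 16.0 },
  { 'single_dose' => 12.0, 'dose_min' => nil, 'dose_max' => nil }
])
puts pooled.inspect
```

For pub 74158 this yields the 4.8-16.0 range, matching the pooled population the abstract describes.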

Option A is the most targeted fix and carries the lowest risk. Option B is simpler but doesn’t address the view duplication. Option C is the cleanest long-term fix but requires the most effort.