Publication Issues Tracker 2

Continuation of docs/publication_issues_tracker.md. New issues are logged here to keep file sizes manageable.

Last updated: 2026-04-06

| # | Title | Short description | Status |
|---|-------|-------------------|--------|
| 52 | “All Arms” view join fans out across all trial_arm_interventions, duplicating dose | View v23 lines 474-475 match every drug_intervention to “All Arms” outcomes regardless of which arm the intervention belongs to. Creates duplicate rows with conflicting dose profiles (RP2D vs escalation range). ~2,776 pubs, ~15k row groups. | Investigation complete |

Each issue entry should keep analysis and remediation separate.

Recommended issue structure:

  • Short summary
  • Where this sits in the current pipeline
  • Exact restriction causing the drop
  • Concrete examples
  • Downstream impact
  • What the issue is not
  • Scale
  • Spot checks
  • Open characterization questions
  • Explored solution direction
  • Solution applied

Solution applied should remain empty until an actual fix is agreed and implemented.

Backfill pattern: When an issue requires backfilling historical data, see the “One-Off Backfill Tasks” section in .claude/skills/backend-expert/SKILL.md.

52. “All Arms” view join fans out across all trial_arm_interventions, duplicating dose

Short summary

View v23’s drug join for “All Arms” outcomes uses a broad match (aoe.arm_name = 'All Arms') that matches every trial_arm_intervention for the publication, not just the relevant one. When a publication has multiple dose-level arms (e.g., a dose-escalation arm with range 4.8-16.0 mg/kg and a dose-expansion arm with single_dose 12.0 mg/kg), each “All Arms” outcome row is duplicated with conflicting dose profiles. The query layer then takes first_row, so the dose shown depends on row order and is effectively non-deterministic.

Where this sits in the current pipeline

View layer: db/views/vw_publication_efficacy_data_v23.sql, lines 470-475 (drug join for Source 0 “All Arms”):

    -- Source 0 (trial_arm_interventions): direct trial_arm_id match
    (di.trial_arm_id IS NOT NULL AND aoe.trial_arm_id IS NOT NULL
     AND di.trial_arm_id = aoe.trial_arm_id)
    -- Source 0 for "All Arms" outcomes (trial_arm_id set but no interventions on that arm)
    OR (di.trial_arm_id IS NOT NULL AND aoe.trial_arm_id IS NOT NULL
        AND aoe.arm_name = 'All Arms')

Query layer: app/queries/tpp/clinical_evidence_query.rb, line 386. build_single_row takes dose from first_row:

    single_dose: first_row['single_dose'],
    dose_min: first_row['dose_min'],
    dose_max: first_row['dose_max'],
    rp2d: first_row['rp2d'],

Exact restriction causing the drop

The aoe.arm_name = 'All Arms' clause is a catch-all: it matches any drug_intervention row that has a trial_arm_id set, regardless of which arm the intervention belongs to. When multiple arms have interventions with different dose profiles, the fan-out creates N duplicate rows per outcome (one per intervention).

The arm_dose_lookup CTE then joins on trial_arm_intervention_id, pulling each intervention’s distinct dose. The deduped_rows CTE (SELECT DISTINCT *) doesn’t collapse these because the dose columns differ.
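The fan-out described above can be reproduced in memory. The following is a minimal Ruby sketch, not the view's actual implementation: the Struct types and method names are hypothetical, the arm ids 1 and 2 for the two dose arms are placeholders (the source only gives arm 25400 for “All Arms”), and `uniq` stands in for SELECT DISTINCT *.

```ruby
# Hypothetical stand-ins for drug_intervention (di) and outcome (aoe) rows.
Intervention = Struct.new(:id, :trial_arm_id, :single_dose, :dose_min, :dose_max, keyword_init: true)
Outcome = Struct.new(:trial_arm_id, :arm_name, keyword_init: true)

# Dose values from Pub 74158; arm ids 1 and 2 are placeholders.
INTERVENTIONS = [
  Intervention.new(id: 33_097, trial_arm_id: 1, dose_min: 4.8, dose_max: 16.0),
  Intervention.new(id: 33_098, trial_arm_id: 2, single_dose: 12.0)
].freeze

# Mirrors the two Source 0 predicates: a direct arm match, or the
# catch-all that fires for any outcome whose arm_name is 'All Arms'.
def drug_join_match?(di, aoe)
  return false if di.trial_arm_id.nil? || aoe.trial_arm_id.nil?

  di.trial_arm_id == aoe.trial_arm_id || aoe.arm_name == 'All Arms'
end

# One output row per matching intervention, carrying that intervention's dose.
def joined_rows(outcome, interventions)
  interventions.select { |di| drug_join_match?(di, outcome) }.map do |di|
    { arm_name: outcome.arm_name, single_dose: di.single_dose,
      dose_min: di.dose_min, dose_max: di.dose_max }
  end
end

all_arms = Outcome.new(trial_arm_id: 25_400, arm_name: 'All Arms')
rows = joined_rows(all_arms, INTERVENTIONS)

# The rows differ in their dose columns, so de-duplication keeps both.
puts rows.uniq.length # prints 2
```

An outcome on one of the dose arms matches only through the direct trial_arm_id predicate and yields a single row, so the duplication is specific to “All Arms” outcomes.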

Concrete examples

Pub 74158 (DS-7300 / Ifinatamab deruxtecan, B7-H3 ADC phase I/II extended follow-up):

  • Two trial_arm_interventions:
    • TAI 33097 on arm “Dose escalation (4.8-16.0 mg/kg)”: dose_min=4.8, dose_max=16.0, dose_context_type=escalation
    • TAI 33098 on arm “Dose expansion (12.0 mg/kg)”: single_dose=12.0, rp2d=12.0, dose_context_type=rp2d
  • sqNSCLC outcomes are on arm “All Arms” (arm 25400), which has NO interventions
  • View produces duplicate rows: some with single_dose=12.0 mg/kg, others with dose_min=4.8 / dose_max=16.0
  • Abstract explicitly states efficacy is pooled across 4.8-16.0 mg/kg cohorts, so the range is correct, but the query may show 12.0 mg/kg (RP2D) depending on sort order
  • Audit issue 8512 flagged single_dose=12.0 mg/kg as incorrect for sqNSCLC

Downstream impact

  • Clinical Evidence report shows a non-deterministic dose for “All Arms” rows: whichever duplicate sorts first wins
  • The RP2D (single_dose) may be shown when the efficacy population spans the full dose range, misrepresenting the study population
  • Row duplication in the view inflates materialized view size and may cause subtle metric selection issues in extract_efficacy_metrics

What the issue is not

  • Not a data extraction issue: the LLM correctly identified both dose arms
  • Not an arm-linking issue (Issue 49): the “All Arms” arm intentionally has no interventions
  • Not the same as Issue 31 (dose bleed onto control arms via COALESCE): this is a JOIN fan-out, not a fallback chain
  • Not the same as Issue 51 (per-arm dose not populated): here both arms have correct dose, but both are matched to outcomes they don’t belong to

Scale

| Metric | Count |
|--------|-------|
| Pubs with “All Arms” outcomes + conflicting dose profiles across different arms (single/rp2d vs range) | 715 |
| Pubs where the view produces duplicate rows with different dose variants for “All Arms” outcomes | 2,776 |
| Total duplicate row groups in the view | 15,025 |

Spot checks

  • Pub 74158: Confirmed. sqNSCLC rows duplicated with RP2D (12.0) and range (4.8-16.0). Abstract pools all cohorts.

Open characterization questions

  1. How many of the 2,776 pubs have the “wrong” dose surfaced by the query (i.e., first_row picks RP2D when range would be more accurate)?
  2. Should “All Arms” outcomes inherit dose from the broadest-range intervention, or should dose be null/aggregated?
  3. Does dose_context_type (escalation vs rp2d) provide enough signal to pick the right intervention?
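Question 1 could be approached by scanning view rows for outcome groups where the first row carries single_dose although a range row exists in the same group. The Ruby sketch below is one hedged way to do that. It assumes rows arrive in the same order the query layer sees them, and the 'publication_id' and 'outcome_id' keys, the method names, and the second sample publication (id 1) are hypothetical.

```ruby
# Hypothetical sample shaped like view rows. Only the dose keys are taken
# from build_single_row; the grouping keys are assumptions.
SAMPLE_VIEW_ROWS = [
  { 'publication_id' => 74_158, 'outcome_id' => 1, 'single_dose' => 12.0, 'dose_min' => nil, 'dose_max' => nil },
  { 'publication_id' => 74_158, 'outcome_id' => 1, 'single_dose' => nil, 'dose_min' => 4.8, 'dose_max' => 16.0 },
  { 'publication_id' => 1, 'outcome_id' => 2, 'single_dose' => nil, 'dose_min' => 1.0, 'dose_max' => 2.0 },
  { 'publication_id' => 1, 'outcome_id' => 2, 'single_dose' => 1.5, 'dose_min' => nil, 'dose_max' => nil }
].freeze

# A row describes a dose range when both bounds are present.
def range_row?(row)
  !row['dose_min'].nil? && !row['dose_max'].nil?
end

# True when the group's first row is a single-dose row while a range row
# exists further down the same group.
def single_dose_wins?(rows)
  first = rows.first
  !first['single_dose'].nil? && !range_row?(first) && rows.any? { |r| range_row?(r) }
end

# Publication ids with at least one outcome group where single_dose wins.
def pubs_surfacing_single_dose(view_rows)
  view_rows
    .group_by { |row| [row['publication_id'], row['outcome_id']] }
    .select { |_key, rows| single_dose_wins?(rows) }
    .keys
    .map(&:first)
    .uniq
end

p pubs_surfacing_single_dose(SAMPLE_VIEW_ROWS) # prints [74158]
```

Publication 1 is not reported because its range row happens to sort first, which is the order dependence the question is trying to quantify.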

Explored solution direction

Option A (view-layer fix): When aoe.arm_name = 'All Arms', prefer the intervention with dose_context_type = 'escalation' (or the one with a dose_min/dose_max range) over single_dose/rp2d. This could use a ROW_NUMBER() window with a priority ordering, keeping only the top-ranked row per outcome.

Option B (query-layer fix): In build_single_row, when multiple view rows exist for the same outcome with different dose profiles, prefer the range (dose_min/dose_max) over single_dose for “All Arms” rows, since pooled data is better described by the range.

Option C (data-model fix): Link “All Arms” outcomes to all relevant arms explicitly (via trial_arm_id or a new junction), then aggregate dose at the query level.

Option A is the most targeted fix with the lowest risk. Option B is simpler but doesn’t address the view duplication. Option C is the cleanest long-term fix but the highest effort.
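As an illustration of Option B's preference rule, the Ruby sketch below picks the row that supplies dose for an outcome group. It is a sketch under assumptions rather than the current build_single_row: the method names are hypothetical, and it assumes each view row exposes an 'arm_name' key alongside the dose keys.

```ruby
# Choose which duplicate row supplies the dose fields. For "All Arms"
# groups, prefer a row with a full dose_min/dose_max range; otherwise
# keep the existing first_row behaviour.
def dose_source_row(rows)
  first_row = rows.first
  return first_row unless first_row['arm_name'] == 'All Arms'

  rows.find { |row| row['dose_min'] && row['dose_max'] } || first_row
end

# Same four dose fields that build_single_row reads from first_row.
def dose_fields(rows)
  row = dose_source_row(rows)
  {
    single_dose: row['single_dose'],
    dose_min: row['dose_min'],
    dose_max: row['dose_max'],
    rp2d: row['rp2d']
  }
end

# Duplicate rows as described for Pub 74158, with the RP2D row sorting first.
rows = [
  { 'arm_name' => 'All Arms', 'single_dose' => 12.0, 'rp2d' => 12.0, 'dose_min' => nil, 'dose_max' => nil },
  { 'arm_name' => 'All Arms', 'single_dose' => nil, 'rp2d' => nil, 'dose_min' => 4.8, 'dose_max' => 16.0 }
]

p dose_fields(rows)
```

With this rule the result no longer depends on which duplicate sorts first, but the duplicate rows still exist in the view, which matches the trade-off noted for Option B.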