Sandhya Indurkar

Math, Applied

Simpson's Paradox: When Every Slice Wins but the Total Loses

Segment wins vs aggregate reversal

The idea

An A/B test can show variant winning on mobile and desktop, yet losing overall. That sounds impossible until you look at who landed in each arm. If control received more high-converting desktop traffic while variant received more low-converting mobile traffic, the aggregate rate can reverse even when variant wins inside every slice.

Simpson's paradox is not a trick formula. It is a reminder that rolled-up rates weight each segment by how many people were in that segment for each arm.

The paradox forces one question: did we compare like with like before we declared a winner?

Example: segment wins vs aggregate loss

New checkout flow A/B test split by device type. Compare each slice first, then the rolled-up rate. A different mix of traffic between arms can flip the overall winner.

Segment     Control   Variant   Lift
Mobile      8.40%     9.60%     +1.20 pts
Desktop     12.00%    13.00%    +1.00 pts
Aggregate   10.20%    11.30%    +1.10 pts

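Segment and aggregate stories align here: both aggregate rates are what an even mobile/desktop split in each arm would produce, so mix cannot flip the result. Still, always compare slices before rolling up.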

The math

How rollups work

aggregate rate = total successes ÷ total trials

Combine mobile and desktop by adding successes and trials across segments. The overall rate is a weighted mix. Different weights per arm change the total even when each slice moves the same direction.
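
The rollup can be sketched in a few lines of Python. The counts below are hypothetical, chosen so that each segment keeps the rates from the example table (8.40% vs 9.60% on mobile, 12.00% vs 13.00% on desktop) while the two arms receive very different device mixes. The helper names `rollup` and `weighted_mix` are illustrative, not part of any experiment tool.

```python
# Hypothetical counts: segment -> (successes, trials), one dict per arm.
# Segment rates match the example table; the device mix per arm does not.
control = {"mobile": (168, 2000), "desktop": (960, 8000)}
variant = {"mobile": (768, 8000), "desktop": (260, 2000)}


def rollup(arm):
    """Aggregate rate = total successes / total trials."""
    successes = sum(s for s, _ in arm.values())
    trials = sum(n for _, n in arm.values())
    return successes / trials


def weighted_mix(arm):
    """The same rate, written as sum of (segment share * segment rate)."""
    total = sum(n for _, n in arm.values())
    return sum((n / total) * (s / n) for s, n in arm.values())


for name, arm in (("control", control), ("variant", variant)):
    print(f"{name}: {rollup(arm):.2%}")
# control: 11.28%
# variant: 10.28%
```

Variant wins both slices, yet control wins the rollup, because 80% of control's trials sit in the high-converting desktop segment while 80% of variant's trials sit in mobile.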

Slice before rollup

compare variant vs control within each segment first

Suppose a readout shows this pattern. Mobile: variant ahead. Desktop: variant ahead. Aggregate: control ahead. That pattern means the traffic mix differed between arms, not that the slice math is wrong.
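
Why mix alone can produce that pattern shows up when the gap between the two aggregate rates is split into two pieces. The notation is introduced here for illustration: $r_s^{A}$ is the conversion rate of arm $A$ (control $C$ or variant $V$) in segment $s$, and $w_s^{A}$ is the share of arm $A$'s trials that fall in segment $s$.

```latex
R^{A} = \sum_s w_s^{A}\, r_s^{A}, \qquad \sum_s w_s^{A} = 1

R^{V} - R^{C}
  = \underbrace{\sum_s w_s^{V}\,\bigl(r_s^{V} - r_s^{C}\bigr)}_{\text{within-segment effect}}
  + \underbrace{\sum_s \bigl(w_s^{V} - w_s^{C}\bigr)\, r_s^{C}}_{\text{mix effect}}
```

When variant wins every slice, the first term is positive. The aggregate can reverse only if the second term is negative and larger in magnitude, which happens when variant carries extra weight in segments where the control rate is low. If both arms have the same mix, the second term is zero and the aggregate must agree with the slices.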

Random assignment usually balances mix over large samples, but uneven attrition, geo launches, or device-specific bugs can skew who saw which version. Always inspect segment tables in experiment tools before you ship on the headline metric alone.
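
A quick balance check along these lines compares each segment's share of trials across arms. This is a minimal sketch: the counts are hypothetical, and the 2-point `THRESHOLD` is an assumed tolerance rather than a standard cutoff.

```python
def segment_shares(trials_by_segment):
    """Share of an arm's trials that landed in each segment."""
    total = sum(trials_by_segment.values())
    return {seg: n / total for seg, n in trials_by_segment.items()}


def mix_gaps(control_trials, variant_trials):
    """Per-segment share gap (variant minus control), as fractions."""
    c = segment_shares(control_trials)
    v = segment_shares(variant_trials)
    return {seg: v[seg] - c[seg] for seg in c}


# Hypothetical trial counts per arm.
control_trials = {"mobile": 5100, "desktop": 4900}
variant_trials = {"mobile": 6400, "desktop": 3600}

THRESHOLD = 0.02  # assumed tolerance: 2 percentage points of share

gaps = mix_gaps(control_trials, variant_trials)
skewed = {seg: gap for seg, gap in gaps.items() if abs(gap) > THRESHOLD}

for seg, gap in skewed.items():
    print(f"{seg}: share differs by {gap * 100:+.1f} pts between arms")
```

A fixed threshold is the simplest possible rule. With small samples a formal test of independence between arm and segment would be a more defensible choice.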

A simple application: segment tables

Experiment reviewers require device, region, and plan-tier cuts on every readout. Analytics teams flag when aggregate lift disagrees with every segment. Product leads delay rollouts until assignment balance is explained or results are analyzed within the segments that matter for the decision.
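
The "aggregate disagrees with every segment" flag is easy to automate. The sketch below assumes each arm is supplied as a mapping from segment name to (successes, trials); the function name `flags_reversal` and all counts are hypothetical.

```python
def rate(successes, trials):
    return successes / trials


def aggregate(arm):
    """Total successes and total trials across all segments of one arm."""
    successes = sum(s for s, _ in arm.values())
    trials = sum(n for _, n in arm.values())
    return successes, trials


def flags_reversal(control, variant):
    """True when every segment points one way and the aggregate the other."""
    seg_lifts = [rate(*variant[seg]) - rate(*control[seg]) for seg in control]
    agg_lift = rate(*aggregate(variant)) - rate(*aggregate(control))
    all_up = all(lift > 0 for lift in seg_lifts)
    all_down = all(lift < 0 for lift in seg_lifts)
    return (all_up and agg_lift < 0) or (all_down and agg_lift > 0)


# Hypothetical readout with a skewed device mix between arms.
skewed_control = {"mobile": (168, 2000), "desktop": (960, 8000)}
skewed_variant = {"mobile": (768, 8000), "desktop": (260, 2000)}

# Hypothetical readout with the same segment rates and an even mix.
even_control = {"mobile": (420, 5000), "desktop": (600, 5000)}
even_variant = {"mobile": (480, 5000), "desktop": (650, 5000)}

print(flags_reversal(skewed_control, skewed_variant))  # True
print(flags_reversal(even_control, even_variant))      # False
```

The check only fires on a full reversal. Readouts where segments disagree with each other are a separate case and would need their own rule.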

Segment tables: aggregate vs slices

Shift mobile share. Aggregate conversion can favor control while every segment favors variant.

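Snapshot at 55% mobile share

Conversion by segment (%)

Segment     Control   Variant
Mobile      6.0%      7.2%
Desktop     12.0%     11.0%
Aggregate   8.7%      8.9%

Aggregate lift: 0.2 pp. Paradox: no. In this snapshot variant leads on mobile (+1.2 pp) and trails on desktop (-1.0 pp), and the aggregate is the mix-weighted average of the two, so it does not contradict slices that all point the same way.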

Optimize (move here)

  • Require device/region/plan cuts on every readout
  • Delay rollout when stories disagree

Hold (do not over-react)

  • Hold off on shipping on the aggregate alone when segments disagree

Escalate if

  • Assignment mix shifted mid-test

Mix is not flipping the story here.

The habit is simple: report slices, then aggregate. If stories disagree, trust the slices you care about operationally and fix the mix before you call a winner.
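
One way to "fix the mix" after the fact is direct standardization: re-weight each arm's segment rates to a single shared mix before comparing. This is a swapped-in technique offered as a sketch, not something the text above prescribes. The counts are hypothetical, and the reference mix (pooled traffic from both arms) is an assumption that should be replaced by the mix of the population you plan to ship to.

```python
def standardized_rate(arm, reference_shares):
    """Segment rates of `arm`, re-weighted to a common reference mix."""
    return sum(reference_shares[seg] * (s / n) for seg, (s, n) in arm.items())


# Hypothetical counts: segment -> (successes, trials), skewed mix per arm.
control = {"mobile": (168, 2000), "desktop": (960, 8000)}
variant = {"mobile": (768, 8000), "desktop": (260, 2000)}

# Assumed reference mix: pooled trials across both arms.
pooled = {seg: control[seg][1] + variant[seg][1] for seg in control}
total = sum(pooled.values())
reference = {seg: n / total for seg, n in pooled.items()}

adj_control = standardized_rate(control, reference)
adj_variant = standardized_rate(variant, reference)

print(f"adjusted control: {adj_control:.2%}")
print(f"adjusted variant: {adj_variant:.2%}")
print(f"adjusted lift: {(adj_variant - adj_control) * 100:+.2f} pts")
# adjusted control: 10.20%
# adjusted variant: 11.30%
# adjusted lift: +1.10 pts
```

With the mix held equal, the raw reversal (11.28% for control vs 10.28% for variant in these counts) disappears and the comparison agrees with the slices.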