The Five A/B Testing Mistakes That Make Your Results Meaningless

Peeking at results early is the most common and most consequential A/B testing mistake. The 5 percent false positive rate of a standard A/B test at p < 0.05 assumes you look at the results once, at the end, after collecting a pre-specified number of observations. If you check daily and stop the first time you see p < 0.05, every look is another chance for random noise to cross the threshold, and with enough looks the actual false positive rate can reach 30 to 50 percent, not 5 percent. A large proportion of the "significant" results from tests stopped this way are therefore noise. The reason teams do this is understandable: the pressure to ship is real, and waiting for the planned sample size feels slow. But the cost of acting on a false positive is shipping a change that does not actually work, and potentially harming the metric while believing you improved it.
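The inflation is easy to see in a simulation. The sketch below is illustrative only: it runs A/A tests (both arms share the same true conversion rate, so every "significant" result is a false positive) and compares stopping at the first daily p < 0.05 against a single look on the final day. The function names, traffic numbers, and the choice of a pooled two-proportion z-test are assumptions made for the example, not something prescribed above.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions) in both arms
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=400, days=20, daily_users=100,
                      rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with daily peeking, rate with one final look).

    All parameter values are hypothetical; both arms use the same true rate.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    single_look_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed = False  # did any daily look show p < alpha?
        for _ in range(days):
            conv_a += sum(rng.random() < rate for _ in range(daily_users))
            conv_b += sum(rng.random() < rate for _ in range(daily_users))
            n += daily_users
            p = two_sided_p(conv_a, n, conv_b, n)
            if p < alpha:
                crossed = True  # a peeking team would have stopped here
        peeking_hits += crossed
        single_look_hits += p < alpha  # p from the final, pre-planned look
    return peeking_hits / n_tests, single_look_hits / n_tests


if __name__ == "__main__":
    peeking, single = simulate_aa_tests()
    print(f"false positive rate, stop at first p < 0.05: {peeking:.1%}")
    print(f"false positive rate, single final look:      {single:.1%}")
```

The single-look rate should land near the nominal 5 percent, while the peeking rate typically comes out several times higher and grows as you add more looks.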

The sample size calculation that most teams skip

Calculating the required sample size before running a test is the step that prevents under-powered tests. An under-powered test is unlikely to detect real effects. It produces "not significant" results that are often interpreted as "the change had no effect" when the true interpretation is "we do not have enough data to know." The usual convention is to size a test for 80 percent power to detect the minimum effect size you care about, given your baseline conversion rate. Free calculators at statsig.com or Evan Miller's site make this calculation take two minutes. Skipping it makes every subsequent null result uninterpretable, because you have no basis for knowing whether it means no effect or insufficient power.
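For readers who would rather see the arithmetic than trust a calculator, here is a minimal sketch of the standard normal-approximation formula for comparing two proportions. The function name and parameters are assumptions for illustration, and online calculators may use slightly different formulas, so expect their numbers to differ somewhat.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion test.

    baseline: control conversion rate, e.g. 0.10
    mde_abs:  minimum detectable effect in absolute terms, e.g. 0.02
              means detecting a move from 10% to 12%
    """
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / mde_abs ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline, want to detect a 2-point lift.
    print(sample_size_per_arm(baseline=0.10, mde_abs=0.02))
    # Halving the detectable effect roughly quadruples the requirement.
    print(sample_size_per_arm(baseline=0.10, mde_abs=0.01))
```

With these hypothetical inputs the first call comes out at roughly 3,800 users per arm. The second call shows why the minimum effect size matters so much: the required sample grows with the inverse square of the effect you want to detect.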

What statistical significance does and does not tell you

A p-value below 0.05 means that, if the change truly had no effect, a difference at least as large as the one you observed would show up less than 5 percent of the time. It does not tell you the probability that the effect is real, and it does not mean the effect is large enough to matter. A test that shows a +0.2% improvement in conversion at p = 0.03 is statistically significant but may not be worth the ongoing maintenance cost of the change. Always pair statistical significance with practical significance: ask what the effect size means in absolute terms for the business before deciding to ship. A result can be real and still not worth acting on. The decision to ship is a business decision that statistical significance informs but does not make.
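One way to keep practical significance in view is to report the absolute lift with a confidence interval and compare it against the smallest lift that would justify the change. The sketch below assumes a simple unpooled (Wald) interval; the function name, the counts, and the `min_worthwhile_lift` threshold are all hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def lift_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Absolute lift (variant minus control) with a Wald confidence interval."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    diff = rate_b - rate_a
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical counts: control 10,000 / 100,000, variant 10,400 / 100,000.
    diff, low, high = lift_with_ci(10_000, 100_000, 10_400, 100_000)

    # Hypothetical business threshold: smallest absolute lift worth maintaining.
    min_worthwhile_lift = 0.005

    print(f"absolute lift: {diff:.4f} (95% CI {low:.4f} to {high:.4f})")
    print("statistically significant:", low > 0)
    print("clears the business threshold:", diff >= min_worthwhile_lift)
```

In this made-up example the interval excludes zero, so the result is statistically significant, yet the observed lift falls short of the threshold. That is exactly the situation where the statistics say "real" and the business case may still say "not worth it."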
