We audited 47 brands' A/B testing histories last year. In 38 of those 47 cases, the tests were statistically invalid. Decisions had been made — budgets reallocated, designs shipped, copy changed — based on data that was essentially random noise. This is one of the most expensive silent killers in digital marketing.
If you run your A/B tests for less than two full business cycles, call them early, or collect fewer than 500 conversions per variant, the results are meaningless. Full stop.
The Four Reasons A/B Tests Fail
1. Stopping Tests Too Early (The Peeking Problem)
This is the most common mistake and the most damaging. You launch a test, check it after three days, see Variant B is winning by 18%, and call it. You've just made a decision based on statistical noise.
Here's what's actually happening: early in any A/B test, random variation causes wild swings in the data. If you flip a coin 10 times, you might get 7 heads and 3 tails — that doesn't mean the coin is biased. You need to flip it 1,000 times to know. A/B tests are the same. You need to run them until you reach statistical significance at 95% confidence AND have collected a minimum sample size calculated before the test begins.
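To see how badly peeking distorts results, here's a minimal simulation sketch (the traffic, conversion rate, and peek schedule are illustrative numbers, not client data). It runs A/A tests where both variants share the identical true conversion rate, checks a two-proportion z-test every day, and counts how often a "winner" gets declared anyway:

```python
# Minimal sketch (illustrative, not our production tooling) of why peeking
# breaks tests. Both variants share the SAME true 5% conversion rate, so any
# "winner" is pure noise -- yet calling the test at the first significant
# daily peek declares one far more often than the 5% error rate a single,
# pre-planned look would give.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
TRUE_RATE = 0.05        # identical for both variants (an A/A test)
VISITORS_PER_DAY = 500  # per variant -- hypothetical traffic
DAYS = 14
SIMULATIONS = 2000

false_calls = 0
for _ in range(SIMULATIONS):
    # cumulative conversions and visitors at each daily "peek"
    conv_a = np.cumsum(rng.binomial(VISITORS_PER_DAY, TRUE_RATE, DAYS))
    conv_b = np.cumsum(rng.binomial(VISITORS_PER_DAY, TRUE_RATE, DAYS))
    n = VISITORS_PER_DAY * np.arange(1, DAYS + 1)

    # two-proportion z-test at every peek
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (conv_a - conv_b) / (n * se)
    p_values = 2 * norm.sf(np.abs(z))

    if np.any(p_values < 0.05):  # we "called it" at the first significant peek
        false_calls += 1

print(f"A/A tests wrongly declared a winner: {false_calls / SIMULATIONS:.0%}")
```

Every one of those declared winners is noise, because there was nothing to find; the gap between that percentage and 5% is the cost of peeking.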
The fix: before the test starts, calculate your required sample size using a statistical significance calculator. Commit to that number. Don't look at results until you hit it.
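If you'd rather see the arithmetic than trust a black-box calculator, the standard two-proportion formula fits in a few lines. A minimal sketch, assuming an illustrative 3% baseline conversion rate, a 15% relative lift worth detecting, 95% confidence, and 80% power (all example inputs, not recommendations):

```python
# Minimal sample-size sketch for a two-variant test (illustrative inputs only).
from scipy.stats import norm

baseline = 0.03            # current conversion rate, e.g. 3%
relative_lift = 0.15       # smallest lift worth detecting, e.g. +15%
alpha, power = 0.05, 0.80  # 95% confidence, 80% power

p1 = baseline
p2 = baseline * (1 + relative_lift)
z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
z_beta = norm.ppf(power)

# Standard formula for comparing two proportions
n_per_variant = ((z_alpha + z_beta) ** 2
                 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2

print(f"Visitors needed per variant: {round(n_per_variant):,}")
```

With those inputs you'd need roughly 24,000 visitors per variant. Change the baseline or the lift you care about and the number changes, which is exactly why it has to be fixed before launch, not negotiated after.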
2. Testing Too Many Variables at Once
Multivariate testing sounds impressive. In practice, unless you have tens of thousands of monthly visitors, you don't have the traffic to make it work. When you test a new headline AND a new hero image AND a new CTA button simultaneously, you can't know which change drove the result.
Our rule: test one hypothesis per test. One change. One question. "Does a CTA that says 'Start Free Trial' outperform 'Get Started'?" That's a testable hypothesis. "Does a completely redesigned homepage convert better?" is not — there are too many variables.
3. Ignoring Seasonality and External Factors
A test that runs from Friday to Sunday will show completely different results than one that runs Monday to Friday. A test during a sale period will show different results than one during a normal week. A test running when you're also running a big social campaign will be contaminated by that traffic.
Always run tests for a minimum of two full business cycles (two weeks minimum) to account for weekly traffic variation. Pause tests during sales events or major campaign pushes. Document external factors that could affect results.
4. Measuring the Wrong Metric
We see this constantly: a team runs a test optimised for click-through rate, declares a winner, ships the change, and then wonders why revenue didn't move. CTR is a proxy metric. Revenue is what matters.
Always test against the metric that actually moves your business: completed purchases, trial activations, qualified form submissions. If your traffic volume doesn't support testing directly against conversions (you'd need months to reach significance), consider testing against micro-conversions that are strongly predictive of revenue — like add-to-cart rate or scroll depth on pricing pages.
Our Testing Framework: The ICE-S Method
Every test we run goes through a prioritisation framework before it's added to the roadmap. We call it ICE-S:
- Impact — How much will this move the needle if it works? (Score 1–10)
- Confidence — How confident are we it will work, based on data and research? (Score 1–10)
- Ease — How easy is it to implement and run? (Score 1–10)
- Sample — Do we have enough traffic to reach significance in a reasonable timeframe? (Yes/No gate)
Any test that doesn't pass the Sample gate doesn't run — full stop. It doesn't matter how high its ICE score is. Running an underpowered test is worse than running no test at all, because you'll make decisions based on bad data.
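Here's a rough sketch of how that gate works in practice. The test ideas, scores, and the additive scoring convention below are hypothetical illustrations, not our internal tooling or client data:

```python
# Hypothetical ICE-S sketch -- the Sample gate is a hard filter applied
# before any ICE scores are compared.
from dataclasses import dataclass

@dataclass
class TestIdea:
    name: str
    impact: int        # 1-10
    confidence: int    # 1-10
    ease: int          # 1-10
    has_sample: bool   # enough traffic to reach significance in reasonable time?

    @property
    def ice_score(self) -> int:
        # Summing the three scores is just one convention; use what you like.
        return self.impact + self.confidence + self.ease

ideas = [
    TestIdea("CTA copy: 'Start My Free Trial' vs 'Get Started'", 6, 7, 9, True),
    TestIdea("Urgency badge on a low-traffic product page", 8, 5, 7, False),
    TestIdea("Move star ratings above the fold", 7, 8, 8, True),
]

# Gate first, then rank whatever survives by ICE score
roadmap = sorted((i for i in ideas if i.has_sample),
                 key=lambda i: i.ice_score, reverse=True)
for idea in roadmap:
    print(f"{idea.ice_score:>2}  {idea.name}")
```

The point is the order of operations: the gate filters first, and only then do ICE scores decide priority.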
What Good A/B Testing Actually Looks Like
For one of our e-commerce clients (a premium skincare brand doing $4M/year), we ran 11 tests over a 6-month period. Here's what that looked like in practice:
- 4 tests produced statistically significant winners
- 3 tests showed no significant difference (valuable — we didn't ship unnecessary changes)
- 2 tests showed the challenger losing (the control was already better than we thought)
- 2 tests were invalidated by external factors and re-run
Those 4 winners compounded to a 41% improvement in overall conversion rate over 6 months. The annualised revenue impact was approximately $1.6M (a 41% lift on a roughly $4M baseline, holding traffic and order value flat), from 11 tests and a disciplined process.
CRO is not about running more tests. It's about running better tests, learning from every one of them, and building a compounding library of insights about your customers.
The Quick-Win Tests We Run on Every New Client
If you want to start immediately, here are five tests that reliably produce significant results across most e-commerce and SaaS sites:
- CTA button copy — Action-oriented, specific copy ("Start My Free Trial") vs generic ("Get Started"). Typical lift: 10–25%.
- Social proof placement — Moving star ratings/review counts above the fold vs below. Typical lift: 8–18%.
- Urgency/scarcity on product pages — "Only 4 left in stock" vs no urgency indicator. Highly variable, can be 20–60%.
- Checkout form fields — Removing optional fields and reducing friction. Typical lift: 10–30% on checkout completion.
- Hero headline specificity — Specific outcome headline ("Lose Your First 10 Pounds in 30 Days") vs generic benefit ("Feel Better, Live Better"). Typical lift: 15–35%.
Want a Free CRO Audit of Your Site?
We'll identify your top 5 conversion killers and prioritise the tests most likely to move your revenue in the next 90 days.
Get Free CRO Audit