
A/B Testing for Data Analysts — Complete Guide with Python Examples (2026)

By Prakhar Shrivastava · April 18, 2026 · 10 min read · 1,400+ words
Quick Answer
A/B testing comes up in interviews at every major product company. The 5-step framework: define hypothesis → calculate sample size → run experiment → analyse results → make decision. The most common mistake is running tests without a sample size calculation, which makes results unreliable. This guide covers everything with Python code.

A/B testing knowledge separates junior analysts from senior ones. At product companies like Swiggy, Flipkart, Razorpay and Google, almost every product decision goes through experimentation — and data analysts own the design, execution and analysis of these tests.

Interview questions about A/B testing appear in 65% of product company data analyst interviews. This guide gives you the complete framework — with Python code for statistical analysis and the exact interview questions you'll face.

💡 What is A/B Testing?
A/B testing is a controlled experiment that compares two versions of something (a webpage, feature, email, or pricing) to determine which version performs better on a defined metric. Version A is the control (current state). Version B is the variant (the change being tested). Traffic is randomly split between them. Statistical analysis determines if the difference is real or due to chance.

The 5-Step A/B Testing Framework

Every A/B testing interview question can be answered with this framework. Interviewers award marks for each step — missing any step costs points even if your statistics are correct.

Step 1: Define Hypothesis

State H0 (null) and H1 (alternative) clearly. H0: The new checkout button has no effect on conversion rate. H1: The new checkout button increases conversion rate. Define: primary metric, secondary metrics, and guardrail metrics. Guardrail metrics are things you must NOT hurt — e.g., page load time, support tickets.
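
For example, the hypothesis and metric set for the checkout-button test above could be written down as a simple pre-registration spec before launch. A minimal sketch; the experiment and metric names here are illustrative, not from any particular tracking schema:

# Hypothetical pre-registration spec for the checkout-button test described above.
# Writing this down before launch forces you to commit to your metrics in advance.
experiment_spec = {
    "name": "checkout_button_v1",
    "h0": "The new checkout button has no effect on conversion rate",
    "h1": "The new checkout button increases conversion rate",
    "primary_metric": "checkout_conversion_rate",
    "secondary_metrics": ["add_to_cart_rate", "revenue_per_visitor"],
    "guardrail_metrics": ["page_load_time_ms", "support_tickets_per_1k_users"],
}
for key, value in experiment_spec.items():
    print(f"{key}: {value}")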

Step 2: Calculate Sample Size

Before running any test, calculate how many users you need. Variables: baseline rate (current conversion, e.g., 5%), minimum detectable effect (the smallest lift worth detecting, e.g., 10% relative = 0.5% absolute), statistical power (80% standard), significance level (5% = alpha 0.05). Underpowered tests produce unreliable results — this is the #1 A/B testing mistake.
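
If you want the formula behind this step rather than a library call, here is a rough sketch using the standard normal-approximation for comparing two proportions (the full statsmodels version appears in the code section below; the numbers are the same 5% baseline and 0.5 percentage-point MDE):

from math import ceil
from scipy.stats import norm

# Normal-approximation sample size for a two-sided test of two proportions.
baseline = 0.05        # current conversion rate
mde_abs = 0.005        # minimum detectable effect: 0.5 percentage points absolute
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for a two-sided 5% test
z_beta = norm.ppf(power)            # ~0.84 for 80% power
p_bar = baseline + mde_abs / 2      # average rate across the two groups

n_per_group = ceil(2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / mde_abs ** 2)
print(f"Approximate sample size per group: {n_per_group:,}")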

Step 3: Run the Experiment

Randomise at the correct unit (usually user_id, not session_id — same user should always see the same variant). Collect data for the pre-calculated duration (minimum 1 full week, ideally 2 business cycles). Do NOT peek at results mid-test and stop early — this dramatically increases false positive rate (p-hacking).
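
A common way to guarantee that the same user always sees the same variant is deterministic hashing on user_id plus the experiment name. A minimal sketch; the experiment name and 50/50 split are illustrative:

import hashlib

def assign_variant(user_id: str, experiment: str, traffic_split: float = 0.5) -> str:
    # Hashing user_id together with the experiment name keeps assignment stable
    # across sessions and independent across different experiments.
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 10_000
    return "control" if bucket < traffic_split * 10_000 else "variant"

print(assign_variant("user_12345", "checkout_button_v1"))
print(assign_variant("user_12345", "checkout_button_v1"))  # same answer every time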

Step 4: Analyse Results

Calculate the test statistic and p-value. Check: primary metric, secondary metrics, guardrail metrics, and segment analysis (does the effect hold across different user groups? if only one segment benefits, that's important context). Check for novelty effects — early engagement often inflates results.
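
A segment cut can be as simple as a grouped conversion summary. A sketch assuming a hypothetical user-level DataFrame with variant, segment and converted columns:

import pandas as pd

# Hypothetical user-level results: one row per user in the experiment.
df = pd.DataFrame({
    "variant":   ["control", "variant", "control", "variant", "control", "variant"],
    "segment":   ["new", "new", "returning", "returning", "new", "returning"],
    "converted": [0, 1, 1, 1, 0, 0],
})

segment_summary = (
    df.groupby(["segment", "variant"])["converted"]
      .agg(users="count", conversion_rate="mean")
      .reset_index()
)
print(segment_summary)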

Step 5: Make Decision and Communicate

If p < 0.05 AND no guardrail metrics harmed AND effect size is practically meaningful → ship the variant. Calculate business impact: lift × daily users × revenue per conversion = annual revenue impact. Communicate findings in plain language with a recommendation — not just statistics.
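
The business-impact arithmetic from this step, sketched with illustrative numbers (plug in your own traffic and economics):

# Illustrative numbers only.
daily_users = 200_000
absolute_lift = 0.0056          # e.g. conversion moving from 4.98% to 5.54%
revenue_per_conversion = 450    # hypothetical value per conversion, in rupees

extra_conversions_per_day = daily_users * absolute_lift
annual_revenue_impact = extra_conversions_per_day * revenue_per_conversion * 365
print(f"Extra conversions per day: {extra_conversions_per_day:,.0f}")
print(f"Estimated annual revenue impact: ₹{annual_revenue_impact:,.0f}")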

Python Code — A/B Test Analysis

Python — Complete A/B Test Analysis
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

# ── Sample Size Calculation ───────────────────────────────
baseline_rate = 0.05   # 5% current conversion
target_rate = 0.055    # 5.5% target (10% relative lift)
power = 0.80
alpha = 0.05

# Pass the larger rate first so the effect size is positive for a one-sided test
effect_size = proportion_effectsize(target_rate, baseline_rate)
analysis = NormalIndPower()
sample_size = analysis.solve_power(
    effect_size=effect_size,
    power=power,
    alpha=alpha,
    alternative='larger'
)
print(f"Sample size needed per group: {int(sample_size):,}")

# ── Statistical Significance Test ─────────────────────────
control_n, control_conv = 10000, 498   # 4.98% conversion
variant_n, variant_conv = 10000, 554   # 5.54% conversion

# proportions_ztest lives in statsmodels (not scipy); with the control group
# listed first, alternative='smaller' tests H1: control rate < variant rate
z_stat, p_value = proportions_ztest(
    [control_conv, variant_conv],
    [control_n, variant_n],
    alternative='smaller'
)

ctrl_rate = control_conv / control_n
var_rate = variant_conv / variant_n
lift_pct = (var_rate - ctrl_rate) / ctrl_rate * 100

print(f"Control: {ctrl_rate:.2%} | Variant: {var_rate:.2%}")
print(f"Relative lift: {lift_pct:.1f}%")
print(f"p-value: {p_value:.4f}")
print(f"Result: {'Significant ✓' if p_value < 0.05 else 'Not significant ✗'}")

5 A/B Testing Mistakes That Kill Experiments

  • Peeking and stopping early. Checking results daily and stopping when p < 0.05 inflates false positives dramatically. Run until pre-calculated sample size is reached.
  • Not calculating sample size before starting. Running for "a week and seeing what happens" produces underpowered results you can't trust.
  • Randomising at session level instead of user level. The same user may see both variants — contaminating results. Always randomise by user_id.
  • Testing too many metrics without correction. Testing 20 metrics at 5% significance means 1 will appear significant by chance. Use Bonferroni correction or pre-register your primary metric (see the sketch after this list).
  • Ignoring novelty effects. New features often get a short-term boost from users trying something new. Run tests for at least 2 business cycles to see through novelty.
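
As referenced in the multiple-metrics point above, a minimal Bonferroni correction sketch using statsmodels (the p-values are made up for illustration):

from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from testing several metrics in the same experiment.
p_values = [0.04, 0.30, 0.01, 0.20, 0.049]

reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for raw, adj, significant in zip(p_values, p_corrected, reject):
    print(f"raw p={raw:.3f}  adjusted p={adj:.3f}  significant after correction: {significant}")
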
⚠️
Interview Trap Question
Interviewer: 'Our A/B test showed p = 0.04 — should we ship it?' Correct answer: NOT automatically. Also check: (1) Is the effect size practically meaningful? A 0.01% lift with p = 0.04 is not worth shipping. (2) Are any guardrail metrics hurt? (3) Does the result hold across key user segments? (4) Have you checked for novelty effects? Statistical significance is necessary but not sufficient to ship.

Real A/B Testing Interview Questions

Question | Company | What They're Testing
Design an A/B test for a new Swiggy search ranking algorithm | Swiggy | Hypothesis, metrics, randomisation unit, duration
DAU increased 8% in the test — should we ship? What else do you check? | Flipkart | Guardrails, segment analysis, novelty effects
How would you detect if our A/B test was corrupted by a bug in variant assignment? | Google | SRM (Sample Ratio Mismatch) detection
Our test ran for 3 days. The results look great. Can we stop early? | Amazon | Understanding of optional stopping problem
How do you test a feature that only affects 1% of users? | Razorpay | Power analysis with rare events
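
The Google question above is about Sample Ratio Mismatch (SRM): if you planned a 50/50 split but the observed group sizes deviate more than chance allows, variant assignment is probably broken. A minimal check with a chi-square goodness-of-fit test (the counts are illustrative):

from scipy.stats import chisquare

# Observed users per arm versus the expected 50/50 split.
observed = [50_420, 49_100]
expected = [sum(observed) / 2, sum(observed) / 2]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible SRM (p = {p_value:.5f}): investigate assignment before trusting results")
else:
    print(f"No SRM detected (p = {p_value:.5f})")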

⭐ Key Takeaways

  • A/B testing framework: hypothesis → sample size → run experiment → analyse → decide and communicate
  • Always calculate sample size BEFORE running — underpowered tests produce unreliable results
  • Randomise at user_id level, not session level — same user must always see the same variant
  • Statistical significance (p < 0.05) is necessary but not sufficient — also check practical significance and guardrails
  • Never peek and stop early — run until pre-calculated sample size or use sequential testing methods
  • Python: statsmodels proportions_ztest for significance, NormalIndPower for sample size calculation
❓ Frequently Asked Questions
What is A/B testing in data analysis?
A/B testing (also called controlled experiment or split test) is a method of comparing two versions of a feature, page or experience to determine which performs better on a defined metric. Version A (control) gets current experience; Version B (variant) gets the new experience. Statistical analysis determines if any difference in the metric is real or due to chance. A/B testing is used by all major product companies to make data-driven decisions.
How do you calculate sample size for an A/B test?
Sample size depends on: baseline conversion rate (e.g., 10%), minimum detectable effect (e.g., 10% relative lift = 1 percentage point), statistical power (typically 80%), and significance level (typically 5%). Formula requires z-scores for both power and significance. In Python: use statsmodels.stats.power.NormalIndPower().solve_power(). A common mistake is running tests without calculating sample size first — this leads to underpowered tests with unreliable results.
What is statistical significance in A/B testing?
Statistical significance means the observed difference between control and variant is unlikely to have occurred by chance. Typically measured by p-value: if p < 0.05, the result is statistically significant at the 95% confidence level. This means there is less than a 5% probability of seeing this large a difference if the two variants actually perform the same. Important: statistical significance does not mean practical significance — a 0.1% conversion lift may be significant but not worth shipping.

Practice A/B testing questions with a mentor

Our data analyst mock sessions include A/B test design questions from Swiggy, Flipkart and Google — with live feedback.

Book Free Mock Session →
Prakhar Shrivastava
Founder · 10+ years in analytics · 800+ candidates mentored
Former analytics lead at top product companies. Helping India's data analysts crack interviews through structured, practical preparation.
