Data Analyst
Cohort Analysis Explained With a Real Example — And Why Every Data Analyst Interview in India Is Testing It Right Now
Swiggy’s data team once discovered that customers acquired during a cashback campaign had a 60-day retention rate nearly 40% lower than organically acquired users — and that single cohort insight shifted millions of rupees in marketing budget almost overnight. Cohort analysis is not a fancy academic technique. It is one of the most practical, high-impact analyses you will run as a data analyst, and in 2026, it is showing up in interviews at Zomato, PhonePe, CRED, Meesho, and pretty much every product-led company that is serious about growth. If you cannot explain cohort analysis clearly and build one from scratch in SQL or Python, you are leaving interview marks on the table.
What Cohort Analysis Actually Is — And Why It Matters More Than Ever in 2026
At its core, cohort analysis is about grouping users by a shared characteristic — almost always the time they first did something — and then tracking how that group behaves over time. The word “cohort” just means a group. A January 2025 cohort on Zepto would be all the users who placed their first order in January 2025. You then watch what percentage of them are still ordering in February, March, April, and so on. This sounds simple. The reason it is powerful is that it separates signal from noise in a way that aggregate metrics simply cannot.
Think about it this way. If Zepto’s total monthly active users are growing every month, that looks great on a dashboard. But what if the growth is entirely driven by acquiring new users, while almost every older user quietly stops ordering? The aggregate MAU number hides that rot completely. Cohort analysis exposes it. You can see that users from six months ago have almost entirely churned, which means the business is running on a leaky bucket — pouring water in from the top while it drains out from the bottom. That is a fundamentally different business problem than a healthy growth story, and it requires a completely different response from the product, marketing, and operations teams.
In 2026, as Indian startups face tighter funding conditions and increasing pressure to demonstrate unit economics and sustainable growth, cohort analysis has become the go-to lens for investors, product managers, and analysts alike. Understanding retention curves, lifetime value by cohort, and how product changes affect user behaviour over time is no longer optional knowledge. It is a baseline expectation for any data analyst role above the entry level.
Most candidates think cohort analysis only applies to consumer apps. That is wrong. HDFC uses cohort analysis to track loan repayment behaviour by disbursement month. Razorpay uses it to track merchant activation and transaction volume by onboarding cohort. The technique is universal — the only thing that changes is what event you define as the “starting point” and what behaviour you are tracking over time.
How Cohort Analysis Skills Are Shaping Hiring and Salaries at Indian Companies Right Now
If you look at data analyst job descriptions posted by Flipkart, Meesho, PhonePe, and CRED on LinkedIn in 2025 and 2026, you will notice a pattern. Terms like “retention analysis,” “LTV modelling,” “cohort-based reporting,” and “user lifecycle analytics” appear again and again. These are not buzzwords. They are direct signals that these companies want analysts who can build and interpret cohort analyses independently, without hand-holding from a senior.
The salary impact is real. A data analyst at a mid-stage startup in Bengaluru who can own end-to-end cohort reporting — building the SQL pipelines, maintaining the dashboards, and presenting actionable insights to the growth team — is typically earning between 12 and 18 lakh per annum at two to four years of experience. An analyst who can only pull numbers from existing dashboards tops out faster and hits a ceiling around 8 to 10 lakh. The difference is not just technical skill. It is the ability to answer the question: “Are we retaining the right users, and is our retention improving or degrading over time?” That is a question that comes up in every product review, every fundraise, and every quarterly business review.
Companies like CRED and Juspay are especially heavy users of cohort analysis because their business models depend on understanding how user behaviour evolves after the first few interactions. CRED wants to know whether users who joined for credit card bill payments eventually engage with CRED’s financial products. Juspay wants to know whether merchants who integrated their SDK in a given quarter are still actively processing payments a year later. These are cohort questions at their heart, and analysts who can answer them fluently are the ones getting hired and promoted.
Interview Questions on Cohort Analysis That Are Appearing at Top Indian Companies
Companies are asking cohort analysis questions at every stage of the interview — in the SQL round, the product sense round, the case study round, and even the stakeholder communication round. The questions are designed to test whether you understand the concept deeply enough to apply it in messy, real-world situations, not just in a textbook scenario. Here are five questions you should be ready to answer confidently.
The right approach here is to define the experiment clearly — users who went through the new onboarding flow versus those who went through the old one — and then build a cohort retention table that tracks what percentage of each group placed orders at Day 7, Day 30, and Day 60. The key insight to demonstrate is that you would not just look at conversion from the onboarding flow itself, but at the downstream retention curve, because a flow that converts more users but retains them poorly is worse for business than a flow that converts fewer but keeps them longer. Always connect the retention curve to LTV, not just to the immediate conversion metric.
This is one of the most common SQL interview questions at product companies in India right now. The approach involves using date truncation to assign each user to their first-month cohort, then joining that back to all their subsequent activity, calculating the difference in months between the cohort month and each activity month, and aggregating to get the percentage of users still active at each period. The trap most candidates fall into is forgetting to handle the denominator correctly — the retention percentage at each period must be calculated as a share of the original cohort size, not the number of users active in a given month overall.
Use a structured diagnostic framework: first confirm the data is not an artifact of how cohort size changed or a tracking issue, then segment the January cohort by acquisition channel, city, and device type to isolate whether the drop is concentrated in a specific sub-segment, and finally look at the product timeline to identify what changed between December and January. Did a promotion end? Did delivery fees change? Was a feature rolled back? The point is to show that you do not jump to conclusions but rather build a hypothesis tree and test each branch systematically before recommending action.
The interviewer is testing whether you can translate analytical output into business language without losing accuracy. A good answer sounds like this: “I would tell the marketing manager that this chart shows us, out of every 100 users we acquired in a given month, how many are still using the product one month later, two months later, and so on. If the line flattens out, that means we have a loyal base. If it keeps falling toward zero, we have a retention problem that no amount of new user acquisition will fix.” Avoid jargon like “cohort size” or “retention rate at T plus N” without explaining what those mean in plain business terms.
This question separates candidates who have actually done cohort analysis from those who have only read about it. Key issues include: users with multiple accounts being counted as new users repeatedly, which inflates the cohort size artificially; events being logged with a delay, which makes recent cohorts look worse than older ones because not all their activity has been captured yet; and definition ambiguity around what counts as “active” — a user who opened the app but did not transact may be counted as active in some definitions and not in others. Always ask about the definition of the event and confirm data completeness before drawing conclusions.
When answering any cohort analysis question in an interview, state your cohort definition out loud before you do anything else. Say: “I am defining the cohort as all users who completed their first transaction in a given month, and I am tracking what percentage of them completed at least one transaction in each subsequent month.” Interviewers at companies like Razorpay and PhonePe specifically appreciate this habit because it shows you understand that the definition of the cohort is a business decision, not just a technical one, and different definitions can produce dramatically different results.
The SQL You Need to Build a Cohort Retention Table From Scratch
Let us say you are working with a transactions table at a company like Meesho. Each row represents a purchase event, with a user ID, a transaction date, and some other columns. Your task is to build a cohort retention table that shows what percentage of users who first purchased in each month were still purchasing in months one, two, three, and beyond. This is exactly the kind of query you will be asked to write in a technical interview, and it needs to be clean, readable, and correct. Here is how to approach it using CTEs and window functions.
-- Cohort Retention Analysis for Meesho-style transactions data
-- Goal: find what % of users who first purchased in a given month
-- are still purchasing in each subsequent month
WITH first_purchase AS (
-- Step 1: Find the first transaction month for each user
SELECT
user_id,
DATE_TRUNC('month', MIN(transaction_date)) AS cohort_month
FROM transactions
WHERE transaction_status = 'completed'
GROUP BY user_id
),
user_activity AS (
-- Step 2: Get all months where each user was active
SELECT
t.user_id,
DATE_TRUNC('month', t.transaction_date) AS activity_month
FROM transactions t
WHERE t.transaction_status = 'completed'
GROUP BY t.user_id, DATE_TRUNC('month', t.transaction_date)
),
cohort_data AS (
-- Step 3: Join to calculate period number (months since first purchase)
SELECT
f.cohort_month,
EXTRACT(YEAR FROM AGE(a.activity_month, f.cohort_month)) * 12 +
EXTRACT(MONTH FROM AGE(a.activity_month, f.cohort_month)) AS period_number,
COUNT(DISTINCT a.user_id) AS active_users
FROM first_purchase f
JOIN user_activity a ON f.user_id = a.user_id
GROUP BY f.cohort_month, period_number
),
cohort_sizes AS (
-- Step 4: Get the original cohort size (period 0 = month of first purchase)
SELECT
cohort_month,
active_users AS cohort_size
FROM cohort_data
WHERE period_number = 0
)
-- Step 5: Calculate retention percentage for each cohort at each period
SELECT
cd.cohort_month,
cd.period_number,
cd.active_users,
cs.cohort_size,
ROUND(100.0 * cd.active_users / cs.cohort_size, 2) AS retention_pct
FROM cohort_data cd
JOIN cohort_sizes cs ON cd.cohort_month = cs.cohort_month
ORDER BY cd.cohort_month, cd.period_number;
The output of this query gives you a table where each row is a cohort-month and period combination, and the retention percentage tells you what fraction of that original cohort was still active at that point in time. When you present this to a product manager or a growth lead, you would describe it as: “Out of every 100 users who bought from us in January, 45 came back in February, 30 in March, and 22 in April.” That framing is intuitive and directly actionable for a non-technical audience.
A very common SQL error in cohort analysis is using the transaction date directly without truncating to the month, which causes each individual date to be treated as a separate cohort and produces hundreds of near-empty cohorts instead of meaningful monthly buckets. Always use DATE_TRUNC or the equivalent function in your SQL dialect (MONTH() in MySQL, to_char in Oracle) to group dates into periods before building the cohort table. A second mistake is joining on user_id without also filtering for completed transactions, which inflates active user counts by including failed or cancelled events.
Python for Cohort Analysis — What Analysts Actually Do on the Job
In practice, once you have pulled the raw data using SQL, you will often do the final cohort table construction and visualisation in Python, especially if you are building a one-time analysis or a presentation for leadership. The standard approach uses pandas to pivot the cohort data into a matrix and seaborn or matplotlib to render a heatmap, which is the most common way to display cohort retention visually. The dataset here represents user transaction events from a hypothetical PhonePe-style payments platform, and the business question is: are newer user cohorts retaining better or worse than older ones, following a product change made in Q3 2024?
# Cohort retention heatmap using pandas and seaborn
# Dataset: user transactions from a UPI payments platform
# Business question: is retention improving across monthly cohorts?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load transaction data
df = pd.read_csv('transactions.csv', parse_dates=['transaction_date'])
df = df[df['transaction_status'] == 'completed']
# Step 1: Assign each user to their first transaction month (cohort)
df['cohort_month'] = df.groupby('user_id')['transaction_date'].transform('min').dt.to_period('M')
# Step 2: Get the activity month for each transaction
df['activity_month'] = df['transaction_date'].dt.to_period('M')
# Step 3: Calculate the period number (0 = cohort month, 1 = one month later, etc.)
df['period_number'] = (df['activity_month'] - df['cohort_month']).apply(lambda x: x.n)
# Step 4: Count unique active users per cohort-period combination
cohort_data = df.groupby(['cohort_month', 'period_number'])['user_id'].nunique().reset_index()
cohort_data.columns = ['cohort_month', 'period_number', 'active_users']
# Step
