Statistics and Probability needed in DS
In this blog, we’ll provide a concise overview of the essential statistics and probability concepts needed for a data science role.
Central Limit Theorem:
The distribution of sample means approaches a Gaussian (normal) distribution as the sample size grows, no matter what the shape of the original distribution is.
Assumptions: the population mean and standard deviation should be finite, and the sample size should be reasonably large (a common rule of thumb is n >= 30).

How to demonstrate the CLT in Python:
import seaborn as sns

# distribution of the means of 10,000 samples of size 30
sample_30 = [df['income'].sample(30).mean() for i in range(10000)]
sns.histplot(sample_30, kde=True)      # approximately Gaussian
sns.histplot(df['income'], kde=True)   # original (possibly skewed) distribution
Hypothesis Testing:
A hypothesis test is a statistical method used to make inferences, predictions, or decisions about a population based on sample data.
- Null Hypothesis (H₀): The default assumption or statement being tested. It usually suggests that there is no effect, no difference, or no relationship between variables.
Example:
- The accused is innocent
- The average height of men in a population is 70 inches.
- Alternative Hypothesis (H₁ or Ha): The statement that contradicts the null hypothesis. It represents the effect, difference, or relationship you expect or want to prove.
Example:
- The accused is guilty
- The average height of men in a population is not 70 inches.
A test statistic directs us to either reject or fail to reject the null hypothesis.
P-value
The probability of observing a test statistic as extreme as, or more extreme than, the observed value T, assuming the null hypothesis is true.
If p-value < significance level; reject the null hypothesis, else fail to reject the null hypothesis.
General framework for hypothesis testing:
1. Define the experiment and a sensible test statistic variable.
2. Define the null hypothesis and alternate hypothesis.
3. Decide a test statistic and a corresponding distribution
4. Determine whether the test should be left-tailed, right-tailed, or two-tailed
5. Determine the p-value
6. Choose a significance level
7. Reject or fail to reject the null hypothesis by comparing the obtained p-value with the chosen significance level.
Always remember: in hypothesis testing, rejecting the null hypothesis is often considered the positive outcome because it supports the presence of an effect, while failing to reject the null hypothesis is considered the negative outcome because it suggests an absence of evidence for an effect.
This is because in hypothesis testing we try to disprove the null hypothesis by providing evidence that there is a difference between the two groups, which is treated as the 'positive' result of the test.
Types of errors in hypothesis testing:
- Type I error: It occurs when the null hypothesis is true, but we reject it.
E.g. We decide that the accused is guilty when he is actually innocent (false positive), i.e. we reject H₀ even though it is actually true.
A helpful way to remember this is to treat rejecting the null hypothesis as the 'positive' outcome (claiming there is a difference between the two groups); a Type I error is then a false positive.
- Type II error: It occurs when the null hypothesis is false, but we fail to reject it (false negative).
E.g. We decide that the accused is innocent when in reality he is guilty.
In short, Type I error (α) - reject a null hypothesis that is true. Type II error (β) - fail to reject a null hypothesis that is false.
Example : Imagine you’re testing a new drug, and you want to know if it lowers blood pressure more than the current standard treatment.
- H₀: The new drug has the same effect on blood pressure as the current treatment.
- H₁: The new drug has a different (likely better) effect on blood pressure than the current treatment.
Type I error: If the drug is actually NOT effective (H₀ is true), a Type I error would mean you incorrectly conclude that the drug is effective. The probability of this error is 5% when the significance level is 5% (i.e. a 95% confidence level).
Type II error: If the drug is actually effective (H₁ is true), a Type II error would mean you fail to detect its effectiveness (i.e. you fail to reject H₀ and conclude the drug has no effect).

Trade-off between Type I and Type II errors:
Reducing the significance level (α) decreases the likelihood of a Type I error but increases the likelihood of a Type II error (β).
Increasing α increases the likelihood of a Type I error but decreases the likelihood of a Type II error.
Types of Hypothesis Tests:



T-Test
1. Independent t-test:
It is used to compare the means of two independent groups and determine whether they differ significantly. It uses the t-distribution (which approaches the standard normal distribution for large samples) as the reference distribution.
Assumptions:
Either the standard deviations of the populations should be known, or
the samples should be large enough (n > 30) to estimate them well.

i) Two-tailed t-test:
H₀ -> μ1 = μ2
H₁ -> μ1 != μ2
# hypothesis test for checking if there is a significant difference between the IQ levels of two schools
from scipy.stats import ttest_ind

alpha = 0.05
iq_1 = df[df['school']=='school_1']['iq']
iq_2 = df[df['school']=='school_2']['iq']
t_stat, p_value = ttest_ind(iq_1, iq_2, alternative='two-sided')  # 'two-sided' is the default

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
ii) Right-tailed t-test:
H₀ -> μ1 = μ2
H₁ -> μ1 > μ2
# hypothesis test for checking if school_1's IQ level is significantly greater than school_2's IQ level
from scipy.stats import ttest_ind

iq_1 = df[df['school']=='school_1']['iq']
iq_2 = df[df['school']=='school_2']['iq']
t_stat, p_value = ttest_ind(iq_1, iq_2, alternative='greater')  # right-tailed test

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
iii) Left-tailed t-test:
H₀ -> μ1 = μ2
H₁ -> μ1 < μ2
# hypothesis test for checking if school_1's IQ level is significantly less than school_2's IQ level
from scipy.stats import ttest_ind

iq_1 = df[df['school']=='school_1']['iq']
iq_2 = df[df['school']=='school_2']['iq']
t_stat, p_value = ttest_ind(iq_1, iq_2, alternative='less')  # left-tailed test

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
Note: if the sample size is n > 30, the t-test and the z-test become essentially the same. That's why scipy does not even have a separate z-test implementation.
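If an explicit z-test is still needed, one is available in statsmodels; a minimal sketch (assuming statsmodels is installed), reusing the iq_1 and iq_2 samples from above:
from statsmodels.stats.weightstats import ztest

# two-sample z-test; two-sided by default
z_stat, p_value = ztest(iq_1, iq_2)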
2. Paired t-test:
A paired t-test, also known as a dependent t-test, is a statistical method used to compare the means of two related groups. It’s typically used when you have two measurements taken from the same subjects (e.g., before and after a treatment) . The goal is to determine whether there is a statistically significant difference between the paired observations.
from scipy.stats import ttest_rel
When to use a paired t-test:
- Pre-test/Post-test scenarios: Measuring the same subjects before and after an intervention to see if there’s an effect.
- Matched pairs: Comparing two related items, such as the right and left hands of the same individuals in a grip strength test.
- Repeated measures: When the same individuals are measured under two different conditions.
For example:
Suppose a group of students takes a math test before and after attending a special training program. You want to determine if the program has a significant effect on their test scores.
- Step 1: For each student, calculate the difference in test scores (after — before).
- Step 2: Compute the mean and standard deviation of these differences.
- Step 3: Calculate the t-statistic.
- Step 4: Compare the t-statistic to the critical value from the t-distribution or use software to determine the p-value.
- Step 5: Based on the p-value, decide whether the training program significantly improved the students’ scores.
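A minimal sketch of these steps with scipy, assuming hypothetical 'score_before' and 'score_after' columns holding each student's scores:
from scipy.stats import ttest_rel

# hypothetical before/after test scores for the same students
scores_before = df['score_before']
scores_after = df['score_after']

t_stat, p_value = ttest_rel(scores_after, scores_before)

if p_value < alpha:
    print("Reject the null hypothesis. The training program had a significant effect on scores.")
else:
    print("Fail to reject the null hypothesis. No significant effect detected.")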
ANOVA test
It is a statistical test used to compare the means of more than two independent groups.
1. One-Way ANOVA test
For example:
We are comparing the relationship between income and different types of product categories.
H₀: Null hypothesis states that there is no significant difference in the means of all groups.
H₁ : Alternative hypothesis states that there is a significant difference in the means of at least one of the groups.
from scipy.stats import f_oneway

income_kp281 = df[df['Product']=="KP281"]['income']
income_kp481 = df[df['Product']=="KP481"]['income']
income_kp781 = df[df['Product']=="KP781"]['income']
f_stat, p_value = f_oneway(income_kp281, income_kp481, income_kp781)

if p_value < alpha:
    print("Reject the null hypothesis. At least one group has a different mean.")
else:
    print("Fail to reject the null hypothesis. The means of all the groups are the same.")
Note: ANOVA cannot tell us which two categories have different means. It only tells us that at least one of the groups has a different mean.
To check which two groups have different means, we have to run ttest_ind for each pair of groups.
A second way to check if there is a significant difference in the means of two groups is ttest_ind.
Since a t-test only caters to categorical features with 2 categories, we would need to perform the t-test multiple times by pairing the income data:
Income of customers who bought product item KP281 v/s item KP481
Income of customers who bought product item KP281 v/s item KP781
Income of customers who bought product item KP781 v/s item KP481
from scipy.stats import ttest_ind

income_kp281 = df[df['Product']=="KP281"]['income']
income_kp481 = df[df['Product']=="KP481"]['income']
income_kp781 = df[df['Product']=="KP781"]['income']
t_stat1, p_value1 = ttest_ind(income_kp281, income_kp481)
t_stat2, p_value2 = ttest_ind(income_kp281, income_kp781)
t_stat3, p_value3 = ttest_ind(income_kp781, income_kp481)

if p_value1 < alpha:
    print("Reject the null hypothesis. The means of groups KP281 and KP481 are different.")
else:
    print("Fail to reject the null hypothesis. The means of both groups are the same.")

if p_value2 < alpha:
    print("Reject the null hypothesis. The means of groups KP281 and KP781 are different.")
else:
    print("Fail to reject the null hypothesis. The means of both groups are the same.")

if p_value3 < alpha:
    print("Reject the null hypothesis. The means of groups KP781 and KP481 are different.")
else:
    print("Fail to reject the null hypothesis. The means of both groups are the same.")
Assumptions of one-way ANOVA:
i) Data should be normally distributed, i.e. it follows a Gaussian distribution.
ii) Equal variance across all categories.
iii) Data should be independent across each record.
2. Two-Way ANOVA test
A Two-Way ANOVA (Analysis of Variance) is a statistical test used to determine the effect of two independent categorical variables on a continuous dependent variable. It extends the one-way ANOVA by examining the interaction between two factors as well as their individual effects.
In the case of a one-way ANOVA test, we can only investigate one categorical variable at a time.
In a two-way ANOVA test we have three pairs of null and alternative hypotheses:
two for the main effects (one per factor), and
one for the interaction effect.
For Example:
Suppose you want to analyze whether sales of a soft drink are influenced by the flavour of the drink and the location where it is sold.
1. a) Null hypotheses for the main effects:
i) Null hypothesis for flavour: There is no significant difference in sales between the 3 flavours (lemon, cola & orange).
ii) Null hypothesis for location: There is no significant difference in sales between the 4 locations (East, West, North and South).
b) Alternative hypotheses for the main effects:
i) Alternative hypothesis for flavour: There is a significant difference in sales between at least 2 flavours.
ii) Alternative hypothesis for location: There is a significant difference in sales between at least 2 locations.
2. a) Null hypothesis for the interaction effect:
There is no interaction between the choice of flavour & the location of sales. In other words, the impact of flavour on sales does not depend on the location & vice versa.
b) Alternative hypothesis for the interaction effect:
There is a significant interaction effect between the choice of flavour & the location.
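scipy does not provide a two-way ANOVA directly; a minimal sketch using statsmodels, assuming a hypothetical dataframe df with 'sales', 'flavour' and 'location' columns:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# linear model with both main effects and their interaction
model = ols('sales ~ C(flavour) + C(location) + C(flavour):C(location)', data=df).fit()

# ANOVA table with an F-statistic and p-value for each main effect and the interaction
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)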
Kruskal-Wallis test
It is a statistical test used to compare the medians of 2 or more independent groups. It is a non-parametric alternative to the ANOVA test, which compares the means of 2 or more groups when the data is normally distributed.
When to use the Kruskal-Wallis test:
1. When the data is not normally distributed.
2. When there are outliers in the data, as the test is robust to outliers.
from scipy.stats import kruskal
income_kp281 = df[df['Product']=="KP281"]['income']
income_kp481 = df[df['Product']=="KP481"]['income']
income_kp781 = df[df['Product']=="KP781"]['income']
kruskal(income_kp281, income_kp481, income_kp781)
Data Normality test

QQ Plot:
A Q-Q plot (Quantile-Quantile plot) is a graphical tool used to compare the distribution of a dataset to a theoretical normal distribution. It helps to assess whether a dataset follows a particular distribution by plotting the quantiles of the data against the quantiles of the theoretical distribution. In other words, it is like comparing the z-scores of a Gaussian distribution with the z-scores of the given data.
#for normally distributed data
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import qqplot
# Generate some data (e.g., normally distributed data)
data = np.random.normal(0, 1, 1000)
# Create Q-Q plot
qqplot(data, line='s')
plt.title("Q-Q Plot")
plt.show()
#for skewed data
# Generate some skewed data (e.g., log-normal distributed data)
data = np.random.lognormal(0, 1, 1000)
# Create Q-Q plot
qqplot(data, line='s')
plt.title("Q-Q Plot")
plt.show()
Shapiro Wilk Test
The Shapiro-Wilk test is a statistical test used to assess whether a given sample comes from a normally distributed population. It is one of the most powerful tests for normality and is widely used to check the assumption of normality in statistical analyses.
- Null Hypothesis (H0): The null hypothesis states that the data is normally distributed.
- Alternative Hypothesis (H1): The alternative hypothesis states that the data is not normally distributed.
from scipy.stats import shapiro

gender_income = df[df['Gender']=="male"]['income']
shapiro_stat, p_value = shapiro(gender_income)

if p_value < alpha:
    print("Reject null hypothesis. The data is not normally distributed.")
else:
    print("Fail to reject null hypothesis. The data is normally distributed.")
KS Test
This test is used to determine whether two sets of data follow the same distribution or not.
With this test, we can also check whether a given dataset is normally distributed, either by comparing it with normally distributed data points generated using np.random.normal(0, 1, 1000), or by comparing it directly against the theoretical normal CDF (see the sketch at the end of this subsection).
The KS test tells us how much the CDFs of X and Y differ from each other. In simple terms, it measures the gap between the two CDFs.
from scipy.stats import kstest

d1 = df[df["Age_group"]=="31-40"]['income']
d2 = df[df["Age_group"]=="41-50"]['income']
ks, p_val = kstest(d1, d2)

if p_val < alpha:
    print("Reject the null hypothesis i.e. the two samples have different distributions.")
else:
    print("Fail to reject the null hypothesis i.e. the two samples have the same distribution.")
Under the null hypothesis, when X & Y follow the same distribution, the KS test statistic tends towards 0. This is because their CDFs overlap with each other.
Under the alternative hypothesis, when X & Y have different distributions, the KS statistic takes larger positive values. This means there is a significant gap between the two CDFs, indicating that the distributions are not the same.
Note: the KS test is a right-tailed test. The KS statistic has no negative values, only 0 or positive values.
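As mentioned above, the KS test can also be used as a normality check; a minimal sketch, assuming the same income column, comparing the standardized data against the theoretical standard normal CDF:
from scipy.stats import kstest

income = df['income']
# standardize the data so it can be compared against the standard normal CDF
z = (income - income.mean()) / income.std()
ks_stat, p_val = kstest(z, 'norm')

if p_val < alpha:
    print("Reject the null hypothesis i.e. the data is not normally distributed.")
else:
    print("Fail to reject the null hypothesis i.e. the data is consistent with a normal distribution.")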
Levene Test
The Levene’s test is a statistical test used to assess the equality of variances across different groups. It is often used as a preliminary test in analyses where the assumption of homogeneity of variances is required, such as in ANOVA (Analysis of Variance) or t-tests.
Null Hypothesis (H0): The null hypothesis states that all group variances are equal.
Alternative Hypothesis (H1): The alternative hypothesis states that at least one group variance is different from the others.
When p_value > 0.05: it implies that the variances are relatively equal across the groups.
When p_value < 0.05: it suggests that the variances differ significantly across the groups.
from scipy.stats import levene

height_men = df[df['Gender']=='Male']["Height"]
height_women = df[df['Gender']=='Female']["Height"]
levene_stat, p_value = levene(height_men, height_women)

if p_value < alpha:
    print("Reject the null hypothesis. The variances of the groups are different.")
else:
    print("Fail to reject the null hypothesis. The variances of the groups are equal.")
Skewness
It is a statistical measure that describes the asymmetry of a distribution of data. In other words, it tells us whether the data is more spread out on one side of the mean compared to the other.
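A minimal sketch of computing skewness with scipy, assuming the same income column used earlier:
from scipy.stats import skew

# positive value -> right (positive) skew, negative value -> left skew, close to 0 -> roughly symmetric
print(f"Skewness: {skew(df['income'])}")

# or, using pandas directly: df['income'].skew()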

Kurtosis
Kurtosis is a statistical measure used to describe the shape of a probability distribution’s tails relative to its overall shape. It provides information about the extremity of the tails and the peak of the distribution.
It measures how peaked or how heavily tailed the data is (a high peak means the data is less spread out and more concentrated around the mean, while heavy tails mean more extreme values).
from scipy.stats import kurtosis

# Generate some data
data = np.random.normal(0, 1, 1000)

# Calculate kurtosis
kurt = kurtosis(data, fisher=True)  # Fisher's definition (subtracts 3)
print(f"Kurtosis: {kurt}")

# or, using pandas: pd.Series(data).kurt()
Kurtosis is like a tool that tells us if our data is pointy (Leptokurtic/+ve), normal(Mesokurtic) and flat(Platykurtic/-ve).

Skewness & kurtosis both describe the shape of a distribution, but they focus on different aspects.
Skewness focuses on the symmetry of the distribution while kurtosis focuses on the concentration of data in the tails.

A/B testing
It is a method used to compare two versions of a variable to determine which one performs better in achieving a specific goal. It is used in marketing, web development & product management to make data-driven decisions.

Example of A/B testing
YouTube 1 ad v/s 2 ads
Group A is our treatment group, to which we introduce the new feature (show 2 ads per ad break).
Group B is our control group, to which we don't introduce the new feature (show 1 ad per ad break).
We will perform a two-sample test on the means (e.g. a z-test or t-test) after collecting data about the mean watch time per day of every user in the two groups.
H₀: The mean watch time of users shown 2 ads per ad break = the mean watch time of users shown 1 ad per ad break.
Hₐ: The mean watch time of users shown 2 ads per ad break != the mean watch time of users shown 1 ad per ad break.
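A minimal sketch of such a test, assuming hypothetical 'group' and 'watch_time' columns (one row per user):
from scipy.stats import ttest_ind

# hypothetical per-user mean watch time (minutes/day) for each group
watch_time_A = df[df['group'] == 'A']['watch_time']   # treatment: 2 ads per ad break
watch_time_B = df[df['group'] == 'B']['watch_time']   # control: 1 ad per ad break

t_stat, p_value = ttest_ind(watch_time_A, watch_time_B)

if p_value < alpha:
    print("Reject the null hypothesis. The extra ad changes the mean watch time.")
else:
    print("Fail to reject the null hypothesis. No significant difference detected.")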
Correlation
Pearson and Spearman correlations are both measures of the strength and direction of the relationship between two numerical variables.
Pearson Correlation
It measures the strength and direction of the linear relationship between two continuous variables. It assumes that the relationship between the variables is linear, meaning that as one variable increases or decreases, the other does so in a consistent manner.

It assumes that the data are normally distributed and that the relationship between the variables is linear. It is sensitive to outliers, which can disproportionately affect the correlation coefficient.
Spearman Correlation
Pearson correlation cannot capture the relationship between two variables when they are related non-linearly (e.g. exponentially).
Spearman correlation measures the strength and direction of the monotonic relationship between two variables. A monotonic relationship is one where the variables consistently move in the same (or consistently opposite) direction, but not necessarily at a constant rate. It computes the correlation on the ranks of the data points. Unlike Pearson, Spearman does not assume a linear relationship.

It does not assume a normal distribution or linear relationship. It is more robust to outliers than Pearson’s correlation because it relies on the ranks of the data rather than the raw data.

Pearson -> Parametric (since it assumes normally distributed data)
Spearman -> Non-parametric (since it makes no assumption about the underlying distribution of the data)
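A minimal sketch of computing both coefficients with scipy, assuming hypothetical numeric 'height' and 'weight' columns:
from scipy.stats import pearsonr, spearmanr

pearson_corr, p_pearson = pearsonr(df['height'], df['weight'])
spearman_corr, p_spearman = spearmanr(df['height'], df['weight'])
print(f"Pearson: {pearson_corr}, Spearman: {spearman_corr}")

# or directly with pandas: df['height'].corr(df['weight'], method='spearman')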
Types of data distributions


The Binomial distribution is a collection of Bernoulli trials for the same event, i.e. it models the number of successes across more than one independent Bernoulli trial of the same scenario.
from scipy.stats import binom

binom.pmf(n=2, p=1/6, k=0)
# n = no. of times the die is thrown (trials)
# p = probability of success on each trial
# k = the number of successes we are interested in (here, the prob. of getting 0 successes out of 2)

E.g. 1. Suppose a hospital experiences an avg. of 2 births/hr. Calculate the prob. of experiencing 0, 1, 2, 3 births in a given hr.
2. Imagine we collect the data of all football matches played. It is found that in a 90-min match the avg. number of goals is 2.5. Calculate the prob. of getting 1 goal in the last 30 mins.
3. A city sees 3 accidents per day on average. Find the prob. that there will be 5 or fewer accidents tomorrow.
4. There are 80 students. Each one of them has a 0.015 prob. of forgetting their lunch on any given day. Calculate the avg. or expected no. of students who forget lunch in the class, and the prob. that exactly 3 of them will forget their lunch today.
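As an illustration, example 3 above can be modelled with a Poisson distribution (events occurring at a known average rate); a minimal sketch:
from scipy.stats import poisson

# example 3: on average 3 accidents per day -> P(5 or fewer accidents tomorrow)
prob = poisson.cdf(5, mu=3)
print(f"P(X <= 5) = {prob}")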
Conditional Probability:
Prob. of an event occurring given that another event has already occurred.
Bayes’ Theorem
It provides a way to update the prob. of an event based on new evidence. It relates the conditional probabilities of 2 events in both directions & incorporates prior knowledge.
Bayes' Theorem is built upon the concept of conditional prob., i.e. it is a specific application of conditional prob. that allows us to update our beliefs in light of new evidence.
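In its standard form: P(A|B) = P(B|A) · P(A) / P(B), where P(A) is the prior probability, P(B|A) is the likelihood of the evidence given A, and P(A|B) is the updated (posterior) probability of A after observing B.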
E.g.
import pandas as pd
pd.crosstab(data['June'], data['Flood'], margins=True, normalize='index')
When normalize is set to "index", it calculates conditional probabilities based on rows, treating each row as a separate condition.
This means that each cell in a row is divided by the row total, making each row's values sum up to 1, representing conditional probabilities.
When set to "columns", it calculates conditional probabilities based on columns, treating each column as the condition we are focusing on.