Chi square test - 15 marks MPT level answer of this question

I'll write a comprehensive 15-mark MPT (Master of Public Health/Preventive & Social Medicine Theory) level answer on the Chi-square test. Let me provide a thorough, exam-quality response.

Chi-Square Test (χ²)

Definition

The Chi-square (χ²) test is a non-parametric statistical test used to determine whether there is a significant association between two categorical variables, or whether an observed frequency distribution differs significantly from a theoretically expected distribution. It was introduced by Karl Pearson in 1900.

Types of Chi-Square Tests

Type                                                Purpose
1. Chi-square test of independence (association)    Tests association between two categorical variables
2. Chi-square test of goodness of fit               Tests whether observed frequencies fit an expected distribution
3. Yates' correction (corrected χ²)                 Applied to 2×2 contingency tables when the sample is small or any expected frequency is low (a common rule of thumb: between 5 and 10)
4. Mantel-Haenszel χ²                               Tests association after stratification for confounders

Assumptions / Pre-requisites

  1. Data must be in frequencies (counts), not percentages or means
  2. Sample size: Total n should be ≥ 40
  3. Expected frequency: Each cell's expected frequency should be ≥ 5 (in a 2×2 table)
  4. In larger tables (r×c), not more than 20% of cells should have expected frequency < 5, and no cell should have expected frequency < 1
  5. Observations must be independent
  6. Random sampling from the population
  7. Data should be nominal or ordinal (categorical)
Choosing the test for a 2×2 table (rules of thumb):
  • If any expected frequency is < 5, or n < 40: use Fisher's Exact Test
  • If every expected frequency is ≥ 5 but at least one is low (commonly taken as < 10): use Yates' corrected χ²
  • If n ≥ 40 and all expected frequencies are comfortably large: use the standard χ² test

Formula

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where:
  • O = Observed frequency
  • E = Expected frequency
  • Σ = Sum over all cells

Expected Frequency Calculation:

$$E = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$$
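The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the table is stored as a list of rows of counts; the function names `expected_frequencies` and `chi_square` are made up for this example and do not come from any library.

```python
def expected_frequencies(observed):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Pearson chi-square: sum of (O - E)^2 / E over all cells."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Illustrative 2x2 counts (rows: exposed / unexposed; columns: diseased / healthy).
table = [[40, 60], [10, 90]]
print(expected_frequencies(table))  # [[25.0, 75.0], [25.0, 75.0]]
print(chi_square(table))            # 24.0
```

The same two functions work unchanged for any r×c table, because the row and column totals are computed from the data rather than hard-coded.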

Yates' Correction for Continuity (2×2 table)

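When the sample is small or any expected frequency is low (a common rule of thumb: between 5 and 10), subtract 0.5 from each absolute difference before squaring: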
$$\chi^2_{corrected} = \sum \frac{(|O - E| - 0.5)^2}{E}$$
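As a rough sketch of the corrected formula, the function below applies the 0.5 adjustment cell by cell for a 2×2 table with cells a, b (first row) and c, d (second row). The name `yates_chi_square` is hypothetical, and clamping the adjusted difference at zero is an added safeguard (an assumption of this sketch, not part of the formula as written) so that the correction can never increase the statistic.

```python
def yates_chi_square(a, b, c, d):
    """Yates-corrected chi-square for a 2x2 table [[a, b], [c, d]]."""
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    n = a + b + c + d
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            # Shrink |O - E| by 0.5, but never below zero (safeguard, see lead-in).
            adjusted = max(0.0, abs(observed[i][j] - expected) - 0.5)
            total += adjusted ** 2 / expected
    return total


# Illustrative counts; the uncorrected chi-square for this table is 24.0.
print(round(yates_chi_square(40, 60, 10, 90), 2))  # 22.43
```

The corrected value is always a little smaller than the uncorrected one, which makes the test slightly more conservative.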

Steps in Calculation

Step 1: State the null hypothesis (H₀): there is no association between the two variables.
Step 2: Prepare the contingency table with observed frequencies.
Step 3: Calculate the expected frequency for each cell: E = (Row total × Column total) / Grand total
Step 4: Apply the formula: χ² = Σ(O-E)²/E
Step 5: Calculate degrees of freedom (df): df = (r-1)(c-1), where r = rows, c = columns
Step 6: Find the tabulated (critical) value of χ² at the chosen significance level (usually p = 0.05) from the χ² distribution table.
Step 7: Compare the calculated χ² with the tabulated χ²:
  • If calculated χ² > tabulated χ² → Reject H₀ → Association is statistically significant
  • If calculated χ² ≤ tabulated χ² → Fail to reject H₀ → Association is NOT statistically significant
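The seven steps can be strung together as one small routine. The sketch below is only an illustration: the function name `chi_square_test` is hypothetical, the 2×3 counts are invented to show a non-significant result, and the critical values at p = 0.05 are hard-coded for df 1 to 4 only.

```python
# Critical values of chi-square at p = 0.05 (df 1 to 4 only).
CRITICAL_005 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49}


def chi_square_test(observed):
    """Return (chi-square, df, reject_h0) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Steps 3 and 4: expected frequencies and the chi-square statistic.
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            stat += (o - e) ** 2 / e

    # Step 5: degrees of freedom.
    df = (len(observed) - 1) * (len(observed[0]) - 1)

    # Steps 6 and 7: compare with the tabulated value (KeyError if df > 4).
    reject_h0 = stat > CRITICAL_005[df]
    return stat, df, reject_h0


# Hypothetical 2x3 table, made up for illustration.
stat, df, reject_h0 = chi_square_test([[20, 30, 50], [30, 30, 40]])
print(round(stat, 2), df, reject_h0)  # 3.11 2 False
```

Here the calculated value (about 3.11) is below the tabulated value for df = 2 (5.99), so H₀ is not rejected.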

Worked Example

Problem: In a study, 200 individuals were assessed for smoking habit and lung cancer:
              Lung Cancer (+)    Lung Cancer (-)    Total
Smoker              40                 60            100
Non-smoker          10                 90            100
Total               50                150            200
Step 1 - H₀: No association between smoking and lung cancer.
Step 2 - Calculate Expected Frequencies:
Cell                      Formula          E
Smoker + Cancer           (100×50)/200     25
Smoker + No Cancer        (100×150)/200    75
Non-smoker + Cancer       (100×50)/200     25
Non-smoker + No Cancer    (100×150)/200    75
All expected frequencies are ≥ 5 and n = 200 (≥ 40), so the standard χ² test applies.
Step 3 - Calculate χ²:
Cell    O     E     (O-E)    (O-E)²    (O-E)²/E
a       40    25     15      225       9.0
b       60    75    -15      225       3.0
c       10    25    -15      225       9.0
d       90    75     15      225       3.0
χ² = 9.0 + 3.0 + 9.0 + 3.0 = 24.0
Step 4 - Degrees of freedom: df = (2-1)(2-1) = 1
Step 5 - Tabulated value: χ² at df=1, p=0.05 = 3.84
Step 6 - Conclusion: Calculated χ² (24.0) > Tabulated χ² (3.84) → Reject H₀. There is a statistically significant association between smoking and lung cancer (p < 0.05).
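Instead of only comparing with the table value, the conclusion can be reported with an actual p-value. For df = 1 a closed form exists, because a χ² variable with 1 degree of freedom is the square of a standard normal variable. The sketch below uses only Python's standard `math` module; the variable names are illustrative.

```python
import math

chi2_value = 24.0  # calculated chi-square from the worked example

# For df = 1 only: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_value / 2))

print(p_value < 0.05)   # True
print(p_value < 0.001)  # True
```

For this example the result could therefore be reported as p < 0.001 rather than just p < 0.05.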

Degrees of Freedom

Table    df
2×2      1
2×3      2
3×3      4
r×c      (r-1)(c-1)

Shortcut Formula for 2×2 Table

For a 2×2 contingency table (a, b, c, d):
$$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$
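A quick check that the shortcut gives the same answer as the cell-by-cell method, using the counts a = 40, b = 60, c = 10, d = 90. The function name `chi_square_2x2` is made up for this sketch.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for a 2x2 table [[a, b], [c, d]] (no Yates' correction)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


print(chi_square_2x2(40, 60, 10, 90))  # 24.0
```

The shortcut avoids computing expected frequencies, but it still assumes those frequencies are large enough for the test to be valid.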

χ² Critical Values Table (commonly used)

df    p = 0.05    p = 0.01
1     3.84        6.63
2     5.99        9.21
3     7.81        11.34
4     9.49        13.28
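The tabulated values can be cross-checked without a printed table. The sketch below computes the upper-tail probability of the χ² distribution for whole-number df using only the standard `math` module: it starts from the closed forms for df = 1 and df = 2 and adds one term per extra 2 degrees of freedom. The function name `chi2_upper_tail` is hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """P(chi-square with `df` degrees of freedom > x), for integer df >= 1."""
    half = x / 2
    if df % 2 == 1:
        p = math.erfc(math.sqrt(half))  # closed form for df = 1
        nu = 1
    else:
        p = math.exp(-half)             # closed form for df = 2
        nu = 2
    # Recurrence: Q(x | nu) = Q(x | nu - 2) + (x/2)^(nu/2 - 1) * e^(-x/2) / Gamma(nu/2)
    while nu < df:
        nu += 2
        p += half ** (nu / 2 - 1) * math.exp(-half) / math.gamma(nu / 2)
    return p


# Each critical value at p = 0.05 should give a tail probability close to 0.05.
for df, critical in [(1, 3.84), (2, 5.99), (3, 7.81), (4, 9.49)]:
    print(df, critical, round(chi2_upper_tail(critical, df), 3))
```

Small differences from exactly 0.05 arise only because the table values are rounded to two decimal places.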

Applications in Community Medicine / Public Health

  1. Case-control studies - Association between exposure and disease
  2. Cross-sectional surveys - Association between variables (e.g., diet and BMI category)
  3. Cohort studies - Comparing proportions of outcomes
  4. Vaccine efficacy studies - Comparing attack rates between vaccinated and unvaccinated
  5. Audit and evaluation - Whether outcome proportions differ between groups
  6. Goodness of fit - Checking if a distribution follows Mendelian ratios (genetics)
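For the goodness-of-fit application, the expected counts come from a hypothesised ratio rather than from row and column totals, and the degrees of freedom are (number of categories - 1) when no parameters are estimated from the data. The sketch below uses illustrative counts for the classic 9:3:3:1 dihybrid ratio; the function name `goodness_of_fit` is an assumption of this example.

```python
def goodness_of_fit(observed, ratios):
    """Chi-square goodness of fit against a hypothesised ratio.

    Returns (chi-square, df) with df = number of categories - 1.
    """
    n = sum(observed)
    ratio_total = sum(ratios)
    expected = [n * r / ratio_total for r in ratios]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Illustrative counts for four phenotype classes, tested against 9:3:3:1.
stat, df = goodness_of_fit([315, 101, 108, 32], [9, 3, 3, 1])
print(round(stat, 2), df)  # 0.47 3
```

With df = 3 the critical value at p = 0.05 is 7.81, so a value of about 0.47 gives no reason to reject the 9:3:3:1 hypothesis.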

Advantages

  • Simple to calculate
  • Applicable to nominal and ordinal data
  • Does not assume normal distribution (non-parametric)
  • Applicable to both small and large samples (with modifications)

Limitations

  • Cannot be used with continuous data directly
  • Sensitive to small expected frequencies
  • Does not indicate strength of association (use Cramer's V or Phi coefficient for that)
  • Does not indicate direction of association
  • Not applicable when expected frequency < 5 in any cell of a 2×2 table
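Because χ² grows with sample size, the strength of association is usually reported separately. A minimal sketch of Cramér's V, which reduces to the Phi coefficient for a 2×2 table, is shown below; the function name `cramers_v` is illustrative, and the inputs are the χ² value, the total sample size, and the table dimensions.

```python
import math


def cramers_v(chi2_value, n, rows, cols):
    """Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2_value / (n * min(rows - 1, cols - 1)))


# For a 2x2 table, V equals the Phi coefficient sqrt(chi2 / n).
print(round(cramers_v(24.0, 200, 2, 2), 3))  # 0.346
```

V ranges from 0 (no association) to 1 (perfect association), so it can be compared across studies of different sizes.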

Alternatives When χ² Cannot Be Applied

Situation                                          Alternative Test
n < 40 OR any E < 5 in 2×2 table                   Fisher's Exact Test
Small sample or low E (but all ≥ 5) in 2×2 table   Yates' corrected χ²
Paired proportions                                 McNemar's test
Ordinal data, trend                                χ² test for trend (Mantel-Haenszel extension)
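To make the first alternative concrete, here is a rough sketch of a two-sided Fisher's Exact Test for a 2×2 table, built from the hypergeometric distribution with the standard library's `math.comb`. The function name is hypothetical, the counts are invented, and the two-sided p-value is taken as the sum of the probabilities of all tables that are as likely as or less likely than the observed one (one common definition, stated here as an assumption).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical small table where every expected frequency is 2 (below 5).
print(round(fisher_exact_two_sided(3, 1, 1, 3), 3))  # 0.486
```

Because the p-value is computed exactly from the margins, no minimum expected frequency is required.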

Summary Points to Remember

  • χ² is used for categorical data
  • Formula: Σ(O-E)²/E
  • df = (r-1)(c-1)
  • Minimum expected frequency ≥ 5 in 2×2 table
  • If calculated χ² > table value → significant association
  • Introduced by Karl Pearson, 1900
  • For 2×2 table with a small sample or low E (all ≥ 5, commonly some < 10): use Yates' correction
  • For 2×2 table with E < 5 or n < 40: use Fisher's Exact Test

Explain me in simple language

Chi-Square Test - Simple Explanation 😊


What is it in ONE line?

"Chi-square test checks if two things are RELATED or just happened by CHANCE."

Real Life Analogy

Imagine you're a doctor and you want to know:
"Do smokers get lung cancer MORE than non-smokers?"
You collect data from 200 people. Now you have two questions:
  • Is the difference you see REAL?
  • Or did it happen just by LUCK/CHANCE?
Chi-square test answers exactly this.

The Core Idea - Super Simple

You compare two things:
        What you ACTUALLY saw    What you EXPECTED to see (if no relation)
Name    Observed (O)             Expected (E)
  • If O and E are very close → No relationship → Just chance
  • If O and E are very different → There IS a relationship!

The Formula (Don't fear it!)

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
In plain English:
  1. For each box in your table: subtract Expected from Observed
  2. Square it (to make negatives positive)
  3. Divide by Expected
  4. Add all of them up
Bigger the number = Bigger the difference = More likely there's a real association

Step-by-Step With a Story

Story: You want to know - "Is smoking linked to lung cancer?"
You study 200 people:
              Got Cancer    No Cancer    TOTAL
Smokers          40            60         100
Non-smokers      10            90         100
TOTAL            50           150         200

Step 1 - Ask: "What SHOULD the numbers be if smoking had NO effect?"

Use this simple formula: $$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
For Smokers + Cancer: $$E = \frac{100 \times 50}{200} = 25$$
So if smoking had NO effect, only 25 smokers should get cancer. But we saw 40! That's a BIG difference.
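Another way to see where the 25 comes from: if smoking had no effect, smokers would get cancer at the same rate as everyone in the study. The few lines below are just that arithmetic, with illustrative variable names.

```python
smokers = 100
cancer_total = 50
grand_total = 200

# Overall cancer rate in the whole study, ignoring smoking status.
overall_rate = cancer_total / grand_total            # 0.25, i.e. 25%

# Apply that same rate to the 100 smokers.
expected_smokers_with_cancer = smokers * overall_rate

print(overall_rate)                  # 0.25
print(expected_smokers_with_cancer)  # 25.0
```

This is the same calculation as (Row Total × Column Total) / Grand Total, only written in a different order.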

Step 2 - Calculate χ²

                          O     E     (O-E)² / E
Smoker + Cancer           40    25    9.0
Smoker + No Cancer        60    75    3.0
Non-smoker + Cancer       10    25    9.0
Non-smoker + No Cancer    90    75    3.0
χ² = 9 + 3 + 9 + 3 = 24

Step 3 - Compare with a "cut-off value"

  • The standard cut-off (at p=0.05, df=1) is 3.84
  • Our value = 24
  • 24 > 3.84 → YES! There is a real association!

The Decision Rule (Simplest Version)

Calculated χ²  >  Table value (3.84)
       ↓
  SIGNIFICANT
  (Real relationship exists)

Calculated χ²  <  Table value (3.84)
       ↓
  NOT SIGNIFICANT
  (Could be just chance)

When Can You Use It?

✅ Data is in counts/numbers (not averages)
✅ You are comparing groups (smoker vs non-smoker, male vs female)
✅ Each person fits in only ONE box
✅ Sample size is at least 40
✅ Every box has at least 5 expected cases

When Can You NOT Use It?

Problem                                             Use This Instead
Sample < 40, or any box has < 5 expected            Fisher's Exact Test
Small sample or low expected counts (but all ≥ 5)   Yates' Corrected χ²
Same person measured twice                          McNemar's Test
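For the "same person measured twice" case, McNemar's test looks only at the people whose answer CHANGED between the two measurements. A minimal sketch is below, assuming the usual uncorrected form (b - c)² / (b + c) with df = 1; the function name and the counts are made up for illustration.

```python
def mcnemar_chi_square(b, c):
    """McNemar chi-square (no continuity correction) from the two discordant counts.

    b = number who changed from yes to no, c = number who changed from no to yes.
    """
    return (b - c) ** 2 / (b + c)


# Hypothetical before/after study: 15 people changed one way, 5 the other way.
stat = mcnemar_chi_square(15, 5)
print(stat)         # 5.0
print(stat > 3.84)  # True
```

People who gave the same answer both times do not enter the formula at all, which is why an ordinary χ² test is not suitable for paired data.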

One Trick to Remember the Formula

Think of it as measuring "surprise":
  • How surprised are you that O ≠ E?
  • The more surprised = The bigger the χ² value
  • Big χ² = Something real is going on!

Summary in 5 Points

  1. Chi-square = comparison of Observed vs Expected frequencies
  2. Formula = Σ(O-E)²/E
  3. df = (rows - 1) × (columns - 1)
  4. If χ² calculated > 3.84 (at df=1) → significant association
  5. Works only with categorical data and sample ≥ 40 with all expected frequencies ≥ 5

That's it! The whole test is just asking: "Is what I saw different enough from what I expected?" If yes, the association is statistically significant (unlikely to be chance alone). If no, it could be just chance.