Chi-Square Test (χ²)
Definition
The Chi-square (χ²) test is a non-parametric statistical test used to determine whether there is a significant association between two categorical variables, or whether an observed frequency distribution differs significantly from a theoretically expected distribution. It was introduced by Karl Pearson in 1900.
Types of Chi-Square Tests
| Type | Purpose |
|---|---|
| 1. Chi-square test of independence (association) | Tests association between two categorical variables |
| 2. Chi-square test of goodness of fit | Tests whether observed frequencies fit an expected distribution |
| 3. Yates' correction (corrected χ²) | Applied to 2×2 contingency tables when any expected frequency is 5-40 |
| 4. Mantel-Haenszel χ² | Tests association after stratification for confounders |
Assumptions / Pre-requisites
- Data must be in frequencies (counts), not percentages or means
- Sample size: Total n should be ≥ 40
- Expected frequency: Each cell's expected frequency should be ≥ 5 (in a 2×2 table)
- In larger tables (r×c), not more than 20% of cells should have expected frequency < 5, and no cell should have expected frequency < 1
- Observations must be independent
- Random sampling from the population
- Data should be nominal or ordinal (categorical)
Choice of test for a 2×2 table (rules of thumb):
- Any expected frequency < 5 → Fisher's Exact Test
- n < 40 → Fisher's Exact Test
- Smallest expected frequency between 5 and 40 → χ² with Yates' correction
- All expected frequencies > 40 → standard χ² test
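The rules of thumb above can be written as a small helper. This is only a sketch: the function name `choose_test_2x2` and its arguments are hypothetical, and the cut-offs (5 and 40) are the ones stated in this section; other texts use different thresholds.

```python
def choose_test_2x2(n, min_expected):
    """Pick a test for a 2x2 table using the thresholds stated in this section.

    n            -- grand total of the table
    min_expected -- smallest expected frequency among the four cells
    """
    if n < 40 or min_expected < 5:
        return "Fisher's exact test"
    if min_expected <= 40:
        return "Chi-square with Yates' correction"
    return "Standard chi-square"


# Illustrative inputs (not taken from any study)
print(choose_test_2x2(n=30, min_expected=8))    # Fisher's exact test
print(choose_test_2x2(n=120, min_expected=12))  # Chi-square with Yates' correction
print(choose_test_2x2(n=400, min_expected=50))  # Standard chi-square
```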
Formula
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where:
- O = Observed frequency
- E = Expected frequency
- Σ = Sum over all cells
Expected Frequency Calculation:
$$E = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$$
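As an illustration of the two formulas above, the following sketch computes the expected frequencies and the χ² statistic for any r×c table of counts. The function names and the example table are hypothetical and chosen only for demonstration.

```python
def expected_frequencies(table):
    """E = (row total x column total) / grand total, for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum of (O - E)^2 / E over all cells of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2x2 table of observed counts
table = [[30, 20],
         [10, 40]]
print(expected_frequencies(table))   # [[20.0, 30.0], [20.0, 30.0]]
print(round(chi_square(table), 2))   # 16.67
```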
Yates' Correction for Continuity (2×2 table)
When expected frequency in any cell is between 5 and 40:
$$\chi^2_{corrected} = \sum \frac{(|O - E| - 0.5)^2}{E}$$
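A minimal sketch of the corrected formula for a 2×2 table follows. The function name and the counts are hypothetical; the code applies the formula exactly as written above, subtracting 0.5 from every |O − E| before squaring.

```python
def yates_chi_square(table):
    """Yates-corrected chi-square for a 2x2 table [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (abs(observed - expected) - 0.5) ** 2 / expected
    return total


# Hypothetical table: expected frequencies are 9, 11, 9, 11
table = [[12, 8],
         [6, 14]]
print(round(yates_chi_square(table), 2))  # 2.53
```

For the same table the uncorrected statistic is about 3.64, so the correction makes the test more conservative.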
Steps in Calculation
Step 1: State the null hypothesis (H₀): There is no association between the two variables.
Step 2: Prepare the contingency table with observed frequencies.
Step 3: Calculate expected frequencies for each cell using: E = (Row total × Column total) / Grand total
Step 4: Apply the formula: χ² = Σ(O-E)²/E
Step 5: Calculate degrees of freedom (df): df = (r-1)(c-1), where r = rows, c = columns
Step 6: Find the tabulated (critical) value of χ² for the calculated df at the chosen significance level (usually 0.05) from the χ² distribution table
Step 7: Compare calculated χ² with tabulated χ²
- If calculated χ² > tabulated χ² → Reject H₀ → Association is statistically significant
- If calculated χ² ≤ tabulated χ² → Fail to reject H₀ → Association is NOT statistically significant
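The steps above can be sketched end to end in a few lines. The function name, the returned dictionary and the example table are hypothetical; the critical values are the p = 0.05 entries of the standard χ² table for df 1 to 4.

```python
# Critical values of chi-square at p = 0.05 for df = 1..4
CRITICAL_005 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49}


def chi_square_test(table, critical_values=CRITICAL_005):
    """Run Steps 3 to 7 on an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for r_total, row in zip(row_totals, table):
        for c_total, observed in zip(col_totals, row):
            expected = r_total * c_total / n           # Step 3
            chi2 += (observed - expected) ** 2 / expected  # Step 4

    df = (len(row_totals) - 1) * (len(col_totals) - 1)  # Step 5
    critical = critical_values[df]                      # Step 6
    return {"chi2": chi2, "df": df, "critical": critical,
            "reject_h0": chi2 > critical}               # Step 7


# Hypothetical 2x3 table
result = chi_square_test([[20, 30, 50],
                          [30, 30, 40]])
print(result["df"], round(result["chi2"], 2), result["reject_h0"])  # 2 3.11 False
```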
Worked Example
Problem: In a study, 200 individuals were assessed for smoking habit and lung cancer:
| Smoking status | Lung Cancer (+) | Lung Cancer (-) | Total |
|---|---|---|---|
| Smoker | 40 | 60 | 100 |
| Non-smoker | 10 | 90 | 100 |
| Total | 50 | 150 | 200 |
Step 1 - H₀: No association between smoking and lung cancer.
Step 2 - Calculate Expected Frequencies:
| Cell | Formula | E |
|---|---|---|
| Smoker + Cancer | (100×50)/200 | 25 |
| Smoker + No Cancer | (100×150)/200 | 75 |
| Non-smoker + Cancer | (100×50)/200 | 25 |
| Non-smoker + No Cancer | (100×150)/200 | 75 |
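All expected frequencies are ≥ 5 and n = 200 (≥ 40), so the χ² test is valid. The uncorrected statistic is calculated below; because two expected frequencies (25) lie between 5 and 40, Yates' correction could also be applied (corrected χ² ≈ 22.43), which leads to the same conclusion.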
Step 3 - Calculate χ²:
| Cell | O | E | (O-E) | (O-E)² | (O-E)²/E |
|---|---|---|---|---|---|
| a | 40 | 25 | 15 | 225 | 9.0 |
| b | 60 | 75 | -15 | 225 | 3.0 |
| c | 10 | 25 | -15 | 225 | 9.0 |
| d | 90 | 75 | 15 | 225 | 3.0 |
χ² = 9.0 + 3.0 + 9.0 + 3.0 = 24.0
Step 4 - Degrees of freedom: df = (2-1)(2-1) = 1
Step 5 - Tabulated value: χ² at df=1, p=0.05 = 3.84
Step 6 - Conclusion: Calculated χ² (24.0) > Tabulated χ² (3.84)
→ Reject H₀. There is a statistically significant association between smoking and lung cancer (p < 0.05).
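Instead of only comparing with the table value, an exact p-value can be obtained for df = 1. The sketch below relies on the fact that a χ² variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability equals erfc(√(χ²/2)). The function name is hypothetical, and the shortcut is valid for df = 1 only.

```python
import math


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# The critical value 3.84 corresponds to p of about 0.05
print(round(p_value_df1(3.84), 3))  # 0.05

# The statistic from the worked example
p = p_value_df1(24.0)
print(p < 0.001)  # True
```

For the worked example the p-value is far below 0.001, so the conclusion holds at the 0.01 level as well.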
Degrees of Freedom
| Table | df |
|---|---|
| 2×2 | 1 |
| 2×3 | 2 |
| 3×3 | 4 |
| r×c | (r-1)(c-1) |
Shortcut Formula for 2×2 Table
For a 2×2 contingency table (a, b, c, d):
$$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$
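As a quick check, the shortcut can be applied to the smoking and lung cancer table from the worked example (a = 40, b = 60, c = 10, d = 90). The function name is hypothetical.

```python
def chi_square_2x2_shortcut(a, b, c, d):
    """N(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)] for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# 200 x (3600 - 600)^2 / (100 x 100 x 50 x 150)
print(chi_square_2x2_shortcut(40, 60, 10, 90))  # 24.0
```

The result matches the value obtained cell by cell with Σ(O-E)²/E.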
χ² Critical Values Table (commonly used)
| df | p = 0.05 | p = 0.01 |
|---|---|---|
| 1 | 3.84 | 6.63 |
| 2 | 5.99 | 9.21 |
| 3 | 7.81 | 11.34 |
| 4 | 9.49 | 13.28 |
Applications in Community Medicine / Public Health
- Case-control studies - Association between exposure and disease
- Cross-sectional surveys - Association between variables (e.g., diet and BMI category)
- Cohort studies - Comparing proportions of outcomes
- Vaccine efficacy studies - Comparing attack rates between vaccinated and unvaccinated
- Audit and evaluation - Whether outcome proportions differ between groups
- Goodness of fit - Checking if a distribution follows Mendelian ratios (genetics)
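The goodness-of-fit use can be sketched as follows for a 9:3:3:1 Mendelian ratio. The counts are illustrative only, the function name is hypothetical, and the degrees of freedom are taken as (number of categories − 1), which assumes no parameters are estimated from the data.

```python
def goodness_of_fit(observed, ratio):
    """Chi-square goodness of fit against a theoretical ratio.

    Returns (chi2, df) with df = number of categories - 1.
    """
    n = sum(observed)
    ratio_total = sum(ratio)
    expected = [n * r / ratio_total for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


# Illustrative counts for four phenotype classes, tested against 9:3:3:1
chi2, df = goodness_of_fit([315, 108, 101, 32], [9, 3, 3, 1])
print(round(chi2, 2), df)  # 0.47 3
```

Since 0.47 is below the critical value of 7.81 (df = 3, p = 0.05), H₀ is not rejected and the counts are consistent with the expected ratio.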
Advantages
- Simple to calculate
- Applicable to nominal and ordinal data
- Does not assume normal distribution (non-parametric)
- Applicable to both small and large samples (with modifications)
Limitations
- Cannot be used with continuous data directly
- Sensitive to small expected frequencies
- Does not indicate strength of association (use Cramér's V or the Phi coefficient for that)
- Does not indicate direction of association
- Not applicable when expected frequency < 5 in any cell of a 2×2 table
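Because χ² itself does not measure strength of association, a short sketch of Cramér's V may help. It uses the usual definition V = √(χ² / (n × min(r−1, c−1))), which reduces to the Phi coefficient for a 2×2 table. The function name is hypothetical.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V; equals the Phi coefficient when the table is 2x2."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Worked example: chi-square = 24.0, n = 200, 2x2 table
print(round(cramers_v(24.0, 200, 2, 2), 2))  # 0.35
```

A value of about 0.35 is usually read as a moderate association, although the cut-offs for "weak", "moderate" and "strong" vary between texts.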
Alternatives When χ² Cannot Be Applied
| Situation | Alternative Test |
|---|---|
| n < 40 OR any E < 5 in 2×2 table | Fisher's Exact Test |
| 5 ≤ E ≤ 40 in 2×2 table | Yates' corrected χ² |
| Paired proportions | McNemar's test |
| Ordinal data, trend | χ² test for trend (Mantel-Haenszel extension) |
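For the first row of the table, Fisher's Exact Test can be sketched with the hypergeometric distribution. The function name and counts are hypothetical, and the two-sided p-value is computed under one common definition: the sum of probabilities of all tables (with the same margins) that are no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denominator = comb(n, col1)

    def prob(x):
        # Probability of x in the top-left cell with all margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so that floating-point ties are included
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical small table where every expected frequency is 2 (< 5)
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
```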
Summary Points to Remember
- χ² is used for categorical data
- Formula: Σ(O-E)²/E
- df = (r-1)(c-1)
- Minimum expected frequency ≥ 5 in 2×2 table
- If calculated χ² > table value → significant association
- Introduced by Karl Pearson, 1900
- For 2×2 table with E between 5-40: use Yates' correction
- For 2×2 table with E < 5 or n < 40: use Fisher's Exact Test