Here is a thorough breakdown of all three statistics, which are core components of logistic regression model evaluation:
Logistic Regression Model Fit Statistics
1. Omnibus Test of Model Coefficients
What it tests: Whether the model with all its predictors is significantly better than a null (intercept-only) model.
How it works:
- It uses the likelihood ratio chi-square (LR χ²) statistic:
  χ² = -2LL(null) - (-2LL(model)) = -2 × [LL(null) - LL(model)]
- Degrees of freedom = number of predictor coefficients added to the null model (a categorical predictor with k levels contributes k - 1).
- A significant p-value (< 0.05) means the predictors, taken together, fit the data significantly better than the intercept-only model.
Interpretation:
| Result | Meaning |
|---|---|
| p < 0.05 | At least one predictor significantly predicts the outcome |
| p ≥ 0.05 | Predictors do not collectively improve the model |
In SPSS output, this appears as three rows - "Step," "Block," and "Model" - which differ only when predictors are entered in several blocks or through a stepwise method. When all predictors are entered at once in a single block, all three are identical.
Think of this as the overall F-test equivalent from linear regression, but using chi-square instead.
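The likelihood ratio calculation described in this section can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not what SPSS runs internally. The log-likelihood values and the helper names `chi2_sf` and `omnibus_test` are hypothetical, chosen only for the example.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1.

    Uses the recurrence Q(df + 2) = Q(df) + (x/2)^(df/2) * exp(-x/2) / Gamma(df/2 + 1),
    started from the closed forms for df = 1 and df = 2.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, nu = math.exp(-half), 2             # df = 2: exp(-x/2)
    else:
        q, nu = math.erfc(math.sqrt(half)), 1  # df = 1: erfc(sqrt(x/2))
    while nu < df:
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q

def omnibus_test(ll_null, ll_model, n_params):
    """LR chi-square and p-value for the fitted model vs. the intercept-only model."""
    chi2 = -2.0 * (ll_null - ll_model)
    return chi2, chi2_sf(chi2, n_params)

# Hypothetical log-likelihoods and predictor count, for illustration only.
chi2, p = omnibus_test(ll_null=-100.0, ll_model=-88.3, n_params=3)
print(f"LR chi2 = {chi2:.2f}, df = 3, p = {p:.2g}")
```

In practice the two log-likelihoods come from fitting the intercept-only model and the full model to the same cases; most statistical packages report them directly.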
2. Nagelkerke R²
What it is: A pseudo R-squared - an analogue of the R² from linear regression, adapted for logistic regression.
The chain of logic:
- Cox & Snell R² is the base version, computed as:
  R²_CS = 1 - [L(null) / L(model)]^(2/n)
  where L is the likelihood (not the log-likelihood) and n is the sample size. Its maximum value is 1 - L(null)^(2/n), which is less than 1 and depends on the data, making it hard to interpret on a 0-1 scale.
- Nagelkerke R² corrects this by dividing Cox & Snell R² by its theoretical maximum:
  R²_N = R²_CS / R²_max, where R²_max = 1 - L(null)^(2/n)
  This rescales the range to 0 to 1, making it more interpretable.
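As a rough sketch of the two formulas above, both statistics can be computed from log-likelihoods rather than raw likelihoods, which avoids numerical underflow for realistic sample sizes. The function name `pseudo_r2` and the input values are assumptions made for illustration.

```python
import math

def pseudo_r2(ll_null, ll_model, n):
    """Return (Cox & Snell R2, Nagelkerke R2) from log-likelihoods and sample size n."""
    # [L(null) / L(model)]^(2/n) == exp(2 * (LL(null) - LL(model)) / n)
    r2_cs = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Theoretical maximum of Cox & Snell: 1 - L(null)^(2/n)
    r2_max = 1.0 - math.exp(2.0 * ll_null / n)
    return r2_cs, r2_cs / r2_max

# Hypothetical values, for illustration only.
r2_cs, r2_n = pseudo_r2(ll_null=-100.0, ll_model=-88.3, n=200)
print(f"Cox & Snell R2 = {r2_cs:.3f}, Nagelkerke R2 = {r2_n:.3f}")
```

Because the divisor is below 1, the Nagelkerke value is always at least as large as the Cox & Snell value for the same model.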
Interpretation:
| Value | Rough interpretation |
|---|---|
| 0.0 - 0.2 | Weak model / small effect |
| 0.2 - 0.4 | Moderate model |
| 0.4 - 0.6 | Substantial model |
| > 0.6 | Strong model |
Critical caveat: Nagelkerke R² does not measure goodness of fit in the strict sense, and it is not a literal proportion of variance explained. It is better read as an effect-size measure: how much the predictors improve on the intercept-only model. A model can be statistically significant (Omnibus p < 0.05) and still have a low Nagelkerke R², as is common in epidemiology.
3. Hosmer-Lemeshow Test
What it tests: Calibration - whether the model's predicted probabilities actually match the observed outcomes across the range of predicted risk.
How it works:
- Cases are ranked by predicted probability and split into 10 groups of roughly equal size (deciles of risk).
- Within each group, observed vs. expected event counts are compared using a chi-square-like statistic.
- df = number of groups - 2 = 8 (typically).
Formula:
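H-L χ² = Σ_k (O_k - E_k)² / [E_k × (1 - E_k/n_k)]
where O_k is the observed number of events in group k, E_k is the expected number (the sum of the predicted probabilities in that group), and n_k is the number of cases in the group.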
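The grouping and summation steps can be sketched as follows. This is a simplified illustration, assuming equal-count groups formed by sorted position; statistical packages may handle tied probabilities and group boundaries differently, so results need not match SPSS exactly. The names `hosmer_lemeshow` and `chi2_sf` and the toy data are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1 (standard recurrence)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, nu = math.exp(-half), 2
    else:
        q, nu = math.erfc(math.sqrt(half)), 1
    while nu < df:
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, df, p-value) for 0/1 outcomes y and predicted probabilities p."""
    n = len(y)
    if n < groups:
        raise ValueError("need at least one case per group")
    # Rank cases by predicted probability only (stable sort keeps tied cases in input order).
    pairs = sorted(zip(p, y), key=lambda pair: pair[0])
    stat = 0.0
    for k in range(groups):
        chunk = pairs[k * n // groups:(k + 1) * n // groups]
        n_k = len(chunk)
        observed = sum(yi for _, yi in chunk)   # O_k
        expected = sum(pi for pi, _ in chunk)   # E_k
        denom = expected * (1.0 - expected / n_k)
        if denom <= 0:
            raise ValueError("a group has expected count 0 or n_k; statistic undefined")
        stat += (observed - expected) ** 2 / denom
    df = groups - 2
    return stat, df, chi2_sf(stat, df)

# Toy data: 10 cases at each predicted risk 0.1 ... 0.9, with exactly the
# expected number of events in each risk band (perfect calibration by construction).
p_demo, y_demo = [], []
for k in range(1, 10):
    p_demo += [k / 10.0] * 10
    y_demo += [1] * k + [0] * (10 - k)

stat, df, pval = hosmer_lemeshow(y_demo, p_demo, groups=9)
print(f"H-L chi2 = {stat:.3f}, df = {df}, p = {pval:.3f}")
```

The toy data uses 9 groups only because it has 9 distinct risk bands; with real data the conventional choice is 10 groups, giving df = 8.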
Interpretation - this is the reverse of most tests:
| p-value | Meaning |
|---|---|
| p > 0.05 | No evidence of poor fit - predicted ≈ observed (do NOT reject the model) |
| p < 0.05 | Poor fit - the model's predictions deviate from the observed outcomes |
Limitations to know:
- Sensitive to sample size: very large samples may produce significant p-values even for trivially small deviations, while small samples may lack the power to detect real miscalibration
- Results can differ based on the number of groups chosen
- It tests calibration only - a well-calibrated model can still discriminate poorly (complement with AUC/ROC)
How They Work Together
These three statistics answer different questions about model quality:
| Statistic | Question answered | Good result |
|---|---|---|
| Omnibus test | Do predictors improve the model at all? | p < 0.05 |
| Nagelkerke R² | How much do the predictors explain (effect size)? | Closer to 1 |
| Hosmer-Lemeshow | Are predicted probabilities accurate? | p > 0.05 |
Practical example (from a clinical study on predicting death):
- Omnibus χ² = 23.4, p < 0.001 (predictors are jointly significant)
- Nagelkerke R² = 0.17 (small effect - the predictors account for little of the variation in outcome)
- H-L χ² = 7.3, df = 8, p = 0.50 (no evidence of miscalibration - predictions match observations)
This pattern - significant omnibus, low R², non-significant H-L - is very common in medicine. It means the predictors are genuinely associated with the outcome, but the model leaves most of the variation in outcome unexplained, which is typical when biological processes have many unmeasured contributors.
Quick Reference
Omnibus test → "Is the model better than nothing?" (overall significance)
Nagelkerke R² → "How much does the model explain?" (effect size)
H-L test → "Are the predicted probabilities accurate?" (calibration)
For a complete logistic regression evaluation, these should be supplemented with the AUC (area under the ROC curve) for discrimination ability, and individual predictor Wald statistics for per-variable significance.