Professor Jayne Mitchell, Director of Infection Prevention at Leeds Teaching Hospitals NHS Trust, faces a persistent challenge: despite multiple interventions, hand hygiene compliance among healthcare workers remains suboptimal at 60%, well below the 80% target. Healthcare-associated infections continue to cause significant patient morbidity and healthcare costs. Her team has developed a comprehensive intervention combining: (1) smart dispensers that provide real-time feedback to individuals about hand hygiene moments, (2) ward-level dashboards displaying team performance, and (3) gamification elements where wards compete for monthly recognition. Preliminary testing in two pilot wards showed compliance improvements to 85%. Now Professor Mitchell must design a definitive trial but faces several challenges: individual randomisation would be impossible since the intervention affects entire wards; patients move between wards; and staff work across multiple areas. Additionally, the Trust's infection control team is concerned about equipoise – is it ethical to withhold a potentially beneficial intervention from some wards? The study must also account for seasonal variations in infection rates and different ward types (ICU, general medical, surgical). The research must provide robust evidence for a potential NHS-wide rollout worth £50 million.
1. I'd like the group to critically evaluate why Professor Mitchell cannot use individual randomisation for this intervention and what this means for her study design. Please discuss the concept of cluster randomisation, including its advantages and the specific statistical challenges it creates. How might the intracluster correlation in this hospital setting affect the sample size requirements?
2. Please analyse the specific design challenges Professor Mitchell faces with this cluster trial. I would like the group to discuss how she should handle the issues of different ward types, staff working across multiple areas, and seasonal variations. What strategies can minimise bias while maintaining practical feasibility?
3. Drawing on your professional experience, please debate whether Professor Mitchell should consider a stepped-wedge cluster design instead of a parallel-group approach. I'd like you to discuss the advantages this might offer for implementation research in the NHS, but also the additional complexity it introduces. Please bring evidence about acceptability and ethical considerations.
4. I would like the group to consider how Professor Mitchell should approach the statistical analysis of her cluster trial. Please discuss the importance of analysing at the appropriate level and what might happen if clustering is ignored. How should she handle the correlation between participants within the same ward?
The following article could help you: Campbell, M.K., Piaggio, G., Elbourne, D.R. and Altman, D.G. for the CONSORT Group (2012) 'Consort 2010 statement: extension to cluster randomised trials', BMJ, 345, e5661. Available at: https://doi.org/10.1136/bmj.e5661


Cluster Randomised Trial Design for Hand Hygiene Compliance: A Critical Analysis


Question 1: Why Individual Randomisation Fails and What Cluster Randomisation Means for Design

Why Individual Randomisation is Impossible Here

Professor Mitchell's intervention operates at the ward level by structural necessity. Smart dispensers, public dashboards, and ward-based gamification all function as environmental and social stimuli that affect every person in that space simultaneously. Three specific problems make individual randomisation untenable:
Contamination. If some staff on a ward receive real-time feedback and others do not, the intervention bleeds across the boundary. Staff talk to each other, observe each other's behaviour, and share the same physical space. A healthcare worker assigned to the "control" condition on a ward with smart dispensers is unavoidably exposed to the intervention's social norms effect. This contamination would bias any estimated treatment effect toward the null, making the trial unable to detect a real benefit even if one exists.
The unit of treatment. The dashboard and gamification elements do not target individuals - they target wards as social units. A ward-level compliance score displayed publicly cannot be delivered to individuals; it is definitionally a cluster-level exposure. As Campbell et al. (2012) note in the CONSORT extension for cluster trials, this situation - where the intervention must be applied to groups rather than individuals - is one of the primary justifications for cluster randomisation.
Administrative and logistical impossibility. Staff rotate between shifts; patients move between areas. Attempting to track which individuals are "allocated" to which arm across a dynamic ward environment is operationally unworkable.

The Concept of Cluster Randomisation

In a cluster randomised trial (CRT), the unit of randomisation is a group - here, hospital wards - rather than individual participants. Entire wards are randomly allocated to intervention or control. Data are still collected on individuals (hand hygiene compliance events, HAI rates), but the ward is the fundamental unit of inference.
Advantages of cluster randomisation in this setting:
  • Eliminates contamination by containing the intervention within defined units
  • Mirrors how the intervention would actually be implemented at NHS scale
  • Preserves the ecological and social dynamics that the intervention depends on
  • Administrative feasibility: ward-level allocation is straightforward to implement and monitor

The Statistical Challenge: Intracluster Correlation (ICC)

This is where cluster trials exact their price. The fundamental assumption of most statistical tests is that observations are independent. Within a hospital ward, they are not. Patients and staff on the same ward share:
  • The same physical environment, equipment, and protocols
  • The same ward culture and leadership attitude toward infection prevention
  • The same baseline compliance norm
  • Exposure to the same local outbreaks or audit events
This shared experience means that outcomes for individuals in the same cluster are more similar to each other than to outcomes for individuals in different clusters. The intracluster correlation coefficient (ICC, denoted rho) quantifies this similarity. It can be interpreted as the proportion of total variance in the outcome attributable to between-cluster differences.
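The variance-proportion interpretation above can be made concrete with a small estimator. The following is a minimal sketch of the one-way ANOVA estimate of the ICC, assuming equal-sized clusters and 0/1 compliance outcomes; the function name `anova_icc` and the toy data are illustrative, not part of any cited method paper.

```python
from statistics import mean


def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC for equal-sized clusters.

    `clusters` is a list of lists; each inner list holds the 0/1 outcomes
    (1 = compliant hand hygiene moment) observed on one ward.
    """
    k = len(clusters)
    m = len(clusters[0]) if clusters else 0
    if k < 2 or m < 2 or any(len(c) != m for c in clusters):
        raise ValueError("need at least 2 equal-sized clusters of size >= 2")

    ward_means = [mean(c) for c in clusters]
    grand_mean = mean(ward_means)  # valid as a grand mean because sizes are equal

    # Between-ward and within-ward mean squares
    msb = m * sum((wm - grand_mean) ** 2 for wm in ward_means) / (k - 1)
    msw = sum(
        (y - wm) ** 2 for c, wm in zip(clusters, ward_means) for y in c
    ) / (k * (m - 1))

    denominator = msb + (m - 1) * msw
    if denominator == 0:
        raise ValueError("no variation in the data; ICC is undefined")
    return (msb - msw) / denominator


# Toy data (hypothetical): two wards whose staff behave identically within
# each ward but differently between wards, i.e. perfect clustering.
print(anova_icc([[1, 1, 1, 1], [0, 0, 0, 0]]))
```

With unequal ward sizes, which are the norm in practice, an adjusted cluster-size term or a mixed model would be needed instead; this sketch is only meant to show what the ICC measures.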
The design effect and sample size inflation. The design effect (DE) - sometimes called DEFF - quantifies how many times more participants are needed compared to an individually randomised trial with the same power:
DE = 1 + (m - 1) × ICC
where m is the average cluster size (number of hand hygiene observations per ward per period).
The implications are substantial. Published ICC values for process-of-care outcomes in secondary care settings are typically 0.01 to 0.10, with values for compliance-type measures sometimes reaching 0.15-0.20 (Smeeth & Ng, 2002; Murray et al., 2005). In a hospital ward with perhaps 200 compliance observations per month:
  • If ICC = 0.05: DE = 1 + (199 × 0.05) = 10.95 - nearly 11 times the individual-trial sample size
  • If ICC = 0.10: DE = 1 + (199 × 0.10) = 20.9 - over 20 times
This is not merely a nuisance correction. It means Professor Mitchell's trial must recruit many more wards or observations than a naive calculation would suggest. The pilot study used only 2 wards, which is far too few to provide a reliable ICC estimate for the main trial - though it can give a rough starting point.
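The arithmetic above is easy to reproduce and extend to other scenarios. This is a minimal sketch, assuming the simple equal-cluster-size design effect formula; the function names and the 20-ward total used in the loop are illustrative assumptions.

```python
def design_effect(mean_cluster_size, icc):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (mean_cluster_size - 1) * icc


def effective_sample_size(total_observations, mean_cluster_size, icc):
    """Number of independent observations the clustered data are 'worth'."""
    return total_observations / design_effect(mean_cluster_size, icc)


# Hypothetical scenario: 20 wards, 200 compliance observations per ward.
wards, obs_per_ward = 20, 200
for icc in (0.01, 0.05, 0.10):
    de = design_effect(obs_per_ward, icc)
    ess = effective_sample_size(wards * obs_per_ward, obs_per_ward, icc)
    print(f"ICC={icc:.2f}  design effect={de:.2f}  effective n={ess:.0f}")
```

The effective sample size column is the more intuitive output for non-statisticians: it shows how few independent observations several thousand clustered observations may be worth.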
A critical additional point, often overlooked: cluster trials face a degrees of freedom penalty separate from the variance inflation. With, say, 20 wards randomised (10 per arm), the primary comparison has only 18 degrees of freedom, not hundreds. This severely limits the precision of the treatment effect estimate and the power of significance tests, and requires small-sample corrections (e.g. a t-distribution with 2K-2 degrees of freedom, where K is the number of wards per arm; the Kenward-Roger approximation for mixed models; or the Mancl-DeRouen correction for GEE) (Hemming et al., 2018 - PMID 30413417; PMC10555937).
Practical implication: Professor Mitchell should seek published ICC estimates from comparable NHS ward-based compliance studies before finalising her sample size calculation, and should consider simulation-based power calculations rather than simple design effect formulas, since her outcome (compliance proportion) is binary and her cluster sizes will vary.
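A simulation-based power calculation of the kind suggested above might look like the following. This is a rough sketch under stated assumptions, not a validated tool: ward-level true compliance is drawn from a beta distribution chosen so that the ICC equals 1 / (a + b + 1), cluster sizes are fixed rather than varying, the analysis is a t-test on ward proportions, and the default 80% intervention compliance and the hard-coded critical value 2.101 (two-sided 5%, 18 degrees of freedom) are illustrative choices.

```python
import random
from statistics import mean, variance


def simulate_power(wards_per_arm=10, obs_per_ward=200, p_control=0.60,
                   p_intervention=0.80, icc=0.05, t_crit=2.101,
                   n_sims=200, seed=1):
    """Estimate power by simulation for a parallel-group cluster trial.

    t_crit must match 2 * wards_per_arm - 2 degrees of freedom; the default
    2.101 corresponds to 10 wards per arm.
    """
    rng = random.Random(seed)

    def ward_proportions(p):
        # Beta(a, b) with mean p and a + b = (1 - icc) / icc
        a = p * (1 - icc) / icc
        b = (1 - p) * (1 - icc) / icc
        props = []
        for _ in range(wards_per_arm):
            ward_p = rng.betavariate(a, b)  # this ward's true compliance
            hits = sum(rng.random() < ward_p for _ in range(obs_per_ward))
            props.append(hits / obs_per_ward)
        return props

    rejections = 0
    for _ in range(n_sims):
        ctrl = ward_proportions(p_control)
        intv = ward_proportions(p_intervention)
        pooled_var = (variance(ctrl) + variance(intv)) / 2  # equal arm sizes
        se = (2 * pooled_var / wards_per_arm) ** 0.5
        if se > 0 and abs(mean(intv) - mean(ctrl)) / se > t_crit:
            rejections += 1
    return rejections / n_sims


print("Estimated power:", simulate_power())
print("Estimated type I error:", simulate_power(p_intervention=0.60))
```

Running the second call with no true effect is a useful sanity check: the rejection rate should sit near the nominal 5% if the ward-level analysis is behaving correctly.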

Question 2: Handling Design Challenges - Ward Types, Staff Mobility, and Seasonality

Stratification by Ward Type

ICU, general medical, and surgical wards differ dramatically in baseline infection risk, patient acuity, hand hygiene opportunity types, staffing ratios, and workflow patterns. These differences mean:
  • Effect modification is plausible. The intervention may work differently in ICU (where compliance is already higher and staff are more trained) versus general medical wards (where patient turnover is high and staff-patient ratios lower).
  • Confounding at allocation. If, by chance, more ICU wards end up in the intervention arm, apparent benefits could reflect the ICU environment rather than the intervention.
Solution: Stratified randomisation. Wards should be stratified by type (ICU, general medical, surgical) before randomisation, with separate randomisation within each stratum. This guarantees balance across ward types between arms. For example, if there are 6 ICU wards, 3 should be allocated to intervention and 3 to control by design, not chance.
This approach also enables subgroup analyses by ward type - which will be expected by NHS commissioners evaluating whether a £50 million rollout should prioritise certain settings.
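Stratified allocation of this kind is simple to script and to audit. The sketch below assumes a two-arm design with a fixed seed for reproducibility; the ward names are hypothetical, and when a stratum has an odd number of wards the extra ward goes to control in this version.

```python
import random


def stratified_allocation(wards_by_type, seed=2024):
    """Randomise wards to arms separately within each ward-type stratum."""
    rng = random.Random(seed)
    allocation = {}
    for ward_type, wards in sorted(wards_by_type.items()):
        shuffled = list(wards)
        rng.shuffle(shuffled)          # random order within the stratum
        half = len(shuffled) // 2      # first half -> intervention
        for i, ward in enumerate(shuffled):
            allocation[ward] = "intervention" if i < half else "control"
    return allocation


# Hypothetical ward lists for illustration only.
example_wards = {
    "ICU": [f"ICU-{i}" for i in range(1, 7)],
    "medical": [f"MED-{i}" for i in range(1, 9)],
    "surgical": [f"SURG-{i}" for i in range(1, 7)],
}
for ward, arm in sorted(stratified_allocation(example_wards).items()):
    print(ward, arm)
```

In a real trial the allocation would be generated and held by someone independent of recruitment, with the seed and script archived alongside the protocol.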

Staff Working Across Multiple Wards

Staff mobility creates two distinct problems:
1. Contamination through behavioural carry-over. A nurse who spends part of their week on an intervention ward may develop improved hand hygiene habits that they carry into their control ward shifts. This is a form of "partial contamination" that again biases estimates toward the null.
2. Misclassification of exposure. If analysis uses the individual as the unit, a staff member spending 60% of shifts on an intervention ward and 40% on a control ward has an ambiguous exposure status.
Mitigation strategies:
  • Analyse at the ward level, not the individual level, treating each ward's aggregate compliance rate as the outcome. This sidesteps individual misclassification entirely.
  • Restrict primary analysis to ward-based observations (e.g., compliance events captured by the ward's own dispensers), not to staff-level tracking.
  • Document and measure staff cross-working as a secondary variable and include it in sensitivity analyses to estimate the magnitude of contamination.
  • Choose wards with minimal cross-staffing as clusters where possible - this is a practical argument for using geographically separated wards (different floors or buildings) within the Trust as clusters.
  • Acknowledge contamination bias direction: since contamination dilutes the treatment effect, an observed benefit, if contamination is present, represents a conservative estimate of the true effect. This actually strengthens any positive finding.

Seasonal Variation

HAI rates, respiratory virus circulation, staffing pressures, and audit intensity all vary seasonally. A trial running from October to March will look different from one running April to September.
Strategies:
  1. Calendar-time stratification: Randomise clusters within time strata, ensuring that intervention and control wards begin in the same calendar periods. This is particularly important for a parallel-group design.
  2. Include calendar time as a covariate in the analysis model (e.g., month or quarter as a fixed effect in mixed models). This adjusts for secular trends that affect all wards equally.
  3. Run the trial across a full calendar year (or multiple years) so that seasonal effects wash out equally across both arms. A 12-18 month trial period is preferable.
  4. Pre-specify seasonal adjustment in the statistical analysis plan to avoid post-hoc manipulation of results.

Maintaining Practical Feasibility

The competing demands of methodological rigour and operational feasibility must be balanced. Leeds Teaching Hospitals NHS Trust is a large organisation, but the number of available wards is finite. With the design effect inflation described above, Professor Mitchell should realistically expect to need 20-40 wards (10-20 per arm), depending on ICC estimates, baseline compliance variability, and target effect size. If the Trust cannot supply sufficient wards alone, a multicentre design - recruiting from multiple NHS Trusts - would be necessary and would strengthen external validity for an NHS-wide rollout evaluation.

Question 3: Should Professor Mitchell Use a Stepped-Wedge Design?

The Case FOR a Stepped-Wedge Design

The stepped-wedge cluster randomised trial (SW-CRT) is a variant in which all clusters start in the control condition and then cross over to the intervention at different, randomly assigned time points, until all clusters receive the intervention by the end of the trial. This design deserves serious consideration here for several interconnected reasons.
1. The equipoise problem. Professor Mitchell's infection control team is right to raise ethical concerns. With pilot data showing compliance rising from 60% to 85%, and given the known link between hand hygiene and HAI prevention, there is a strong prior belief that the intervention does good. In a parallel-group design, control wards are denied the intervention for the full trial duration - potentially 1-2 years. This is difficult to justify ethically when preliminary evidence is this strong.
The SW-CRT addresses this directly: every ward eventually receives the intervention. No ward is permanently in the control condition. Prost et al. (2015 - PMID 26278521) found from case studies that stepped-wedge designs were particularly valued when there was existing evidence or belief that the intervention would do more good than harm, and when permanent denial of the intervention to a control group was felt to be unacceptable to stakeholders.
2. Alignment with phased implementation reality. NHS implementation at scale is never simultaneous. Logistically, it is impossible to install smart dispensers, train staff, and launch dashboards on all wards at once. The stepped-wedge design turns this practical constraint into a scientific asset: the phased rollout is the randomised sequence.
3. Statistical efficiency through within-cluster comparisons. Because each ward serves as both its own control (before crossover) and its own intervention unit (after crossover), the SW-CRT can be more statistically efficient than a parallel-group design with the same number of clusters, particularly when between-cluster variation is large. The within-cluster comparison is more precise because cluster-level confounders cancel out.
4. NHS acceptability. Mdege et al.'s 2011 systematic review (PMID 21411284) found that the SW-CRT design was consistently described as more acceptable to health service managers, clinical staff, and patients than parallel designs, because everyone eventually benefits. For a £50 million rollout decision, garnering political and institutional buy-in is itself a trial requirement.
5. Estimates of effectiveness under real-world rollout conditions. The SW-CRT generates estimates of the treatment effect during actual phased implementation, which is precisely the condition under which NHS rollout would occur. This external validity is a genuine advantage over explanatory parallel trials.
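The stepped-wedge layout described at the start of this section can be sketched as a ward-by-period exposure matrix. This is an illustrative sketch only: the function name, the five-step default used in the example, and the ward labels are assumptions, and wards are split as evenly as possible across steps.

```python
import random


def stepped_wedge_schedule(wards, n_steps, seed=7):
    """Return {ward: [0/1 per period]} for n_steps + 1 measurement periods.

    Period 0 is an all-control baseline; by the final period every ward
    has crossed over to the intervention.
    """
    rng = random.Random(seed)
    order = list(wards)
    rng.shuffle(order)                 # random crossover order
    n_periods = n_steps + 1
    schedule = {}
    for i, ward in enumerate(order):
        # Crossover period in 1..n_steps, spread evenly over the wards
        step = i * n_steps // len(order) + 1
        schedule[ward] = [1 if period >= step else 0
                          for period in range(n_periods)]
    return schedule


# Hypothetical example: 20 wards crossing over in 5 steps.
example_schedule = stepped_wedge_schedule([f"W{i:02d}" for i in range(1, 21)], 5)
for ward, row in sorted(example_schedule.items()):
    print(ward, "".join(str(x) for x in row))
```

Printed row by row, the output forms the characteristic staircase pattern; stratifying the crossover order by ward type would be a natural extension.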

The Case AGAINST (or at Least: The Additional Complexity)

The SW-CRT is not a free lunch. Hemming et al. (2018 - PMID 30413417) outlined several challenges specific to SW-CRTs that Professor Mitchell must take seriously:
1. Temporal confounding. The SW-CRT relies critically on the assumption that time trends are the same across all clusters - that any secular improvement in compliance (due to, e.g., a national hand hygiene campaign, a flu season, a high-profile HAI incident) affects intervention and control clusters equally. If this assumption is violated, estimates of the treatment effect are biased. In a hospital setting where inspection cycles, staffing crises, or policy changes can differentially affect wards, this is a genuine concern.
2. Carry-over and learning effects. If earlier phases of implementation teach lessons that improve later phases (a "learning curve"), then the effect estimated in early-adopter wards will differ from that in later-adopter wards. This makes the aggregate treatment effect estimate heterogeneous over time.
3. Increased analytical complexity. SW-CRTs require mixed-effects models (or GEE) with both cluster random effects and time period fixed effects. The ICC structure is more complex than in parallel designs, requiring estimation of both the within-period ICC and the between-period within-cluster correlation. As noted by Hemming et al. (2018), these analyses need a statistician with specific SW-CRT expertise.
4. Sample size calculations are more demanding. Power calculations for SW-CRTs depend not just on the ICC but on the number of steps, the number of clusters per step, the within-cluster autocorrelation over time, and whether individuals are observed repeatedly (closed cohort) or cross-sectionally (open cohort) within wards. The freely available Shiny CRT calculator (Hemming et al., 2020) can be used, but it requires more inputs and more assumptions than parallel-design calculators.
5. Equipoise still required. Prost et al. (2015) importantly note that the SW-CRT does not eliminate the need for equipoise - it merely reconfigures it. Every ward is in the control condition at some point. The ethical justification for the design rests on genuine uncertainty about the magnitude of effect and its generalisability across ward types, not just the direction of effect. The 2012 Ottawa Statement on the ethical design of cluster trials (Weijer et al.) is directly applicable here.
6. Duration. A well-powered SW-CRT with 20+ wards and 5-6 steps will take 18-24 months to complete. This is longer than a parallel-group design of equivalent power, and may delay the implementation decision.

Recommendation

For Professor Mitchell's specific situation - preliminary evidence of benefit, ethical pressure not to withhold the intervention, phased implementation being logistically necessary anyway, and a need to demonstrate effectiveness to NHS commissioners - a stepped-wedge design is strongly preferable to a parallel-group design. The ethical argument is particularly compelling: asking the Trust's infection control team to accept that control wards will be denied an apparently beneficial intervention for the full 18 months of a parallel-group trial is likely to be politically and ethically untenable. The SW-CRT converts this objection from a blocker into a design feature.
The additional analytical complexity is manageable with appropriate statistical expertise, which any trial seeking to support a £50 million rollout should be resourced to include.

Question 4: Statistical Analysis of a Cluster Trial - Getting the Level Right

The Critical Principle: Analyse at the Level of Randomisation

Campbell et al.'s CONSORT 2010 extension for cluster trials (2012) is unambiguous on this point: analysis must account for the clustering structure. Either the ward, as the unit of randomisation, is the unit of analysis, or an individual-level analysis must explicitly model the correlation between observations within the same ward.
Ignoring clustering: the consequences. If Professor Mitchell were to analyse hand hygiene compliance events as if they were independent observations (e.g., a simple chi-squared test of 10,000 observations in intervention vs. control wards), she would:
  1. Massively underestimate the standard error of the treatment effect, because she would treat 10,000 observations as providing the equivalent of 10,000 independent data points when, in reality, the effective sample size is much closer to the number of clusters (wards).
  2. Produce spuriously narrow confidence intervals and falsely small p-values.
  3. Inflate the type I error rate - in severe cases, the nominal 5% significance level could correspond to a true false-positive rate of 30-50% when clustering is strong.
This is not a theoretical concern. Reviews of cluster trial analyses have repeatedly found that trials which ignore clustering report treatment effects as statistically significant when proper analysis would show them to be non-significant (Rutterford et al., 2015 - PMID 25523375).
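A quick calculation shows how much the naive analysis can mislead. The sketch below compares a naive z statistic with one whose standard error is inflated by the square root of the design effect; the 60% versus 65% compliance figures, the 5,000 observations per arm, the ward size of 200 and the ICC of 0.05 are all illustrative assumptions, and the design-effect adjustment is an approximation rather than a substitute for a proper clustered analysis.

```python
import math


def naive_and_adjusted_z(p_control, p_intervention, n_per_arm,
                         obs_per_ward, icc):
    """z statistics for a difference in proportions, ignoring and then
    approximately allowing for clustering via the design effect."""
    diff = p_intervention - p_control
    naive_se = math.sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_intervention * (1 - p_intervention) / n_per_arm
    )
    deff = 1 + (obs_per_ward - 1) * icc
    adjusted_se = naive_se * math.sqrt(deff)  # variance scales with DE
    return diff / naive_se, diff / adjusted_se


naive_z, adjusted_z = naive_and_adjusted_z(0.60, 0.65, 5000, 200, 0.05)
print(f"naive z = {naive_z:.2f}, cluster-adjusted z = {adjusted_z:.2f}")
```

Under these assumptions the naive statistic is roughly 5.2, far beyond the conventional 1.96 threshold, while the adjusted statistic is roughly 1.6 and would not be declared significant.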

Approaches to Handling Clustering

There are three main analytical approaches, each with specific properties relevant to Professor Mitchell's situation:
1. Mixed-effects models (also called multilevel models or GLMM)
These models explicitly represent the hierarchical structure of the data: observations nested within wards, wards nested within trial arms, and (in a SW-CRT) time periods.
The model includes:
  • A fixed effect for treatment allocation (intervention vs. control)
  • A random effect for ward (capturing between-ward variability in baseline compliance - the ICC)
  • Fixed effects for time period (adjusting for secular trends - particularly important in SW-CRT)
  • Potential fixed effects for ward type (ICU vs. medical vs. surgical - as a covariate for stratum)
For a binary outcome (compliant/non-compliant per hand hygiene moment), a logistic mixed model (GLMM) is appropriate. The treatment effect is then expressed as an odds ratio or, with appropriate specification, a risk ratio.
2. Generalised Estimating Equations (GEE)
GEE takes a "population-average" rather than "cluster-specific" approach. Rather than modelling the random effects explicitly, GEE uses a working correlation structure (e.g., exchangeable, meaning all observations within a cluster are assumed equally correlated) to adjust standard errors. GEE with robust (sandwich) variance estimators is particularly appropriate when the number of clusters is reasonably large (>30).
For Professor Mitchell's trial, where there may be only 20-30 wards total, mixed-effects models with REML estimation are generally preferred over GEE, because GEE sandwich estimators are known to underestimate variance when the number of clusters is small - exactly the situation here (Hemming et al., 2018; PMC10555937).
3. Two-stage analysis
A simpler but valid approach is to compute a ward-level summary statistic (e.g., compliance proportion per ward per month) and then compare these summaries between arms using appropriate tests. This approach is fully valid, is explicitly recommended by the CONSORT extension for cluster trials (Campbell et al., 2012), and has the intuitive transparency needed when presenting results to NHS decision-makers. Its disadvantage is loss of efficiency when adjusting for covariates.
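The two-stage approach can be illustrated in a few lines. This is a minimal sketch assuming an unweighted two-sample t-test on ward compliance proportions; the function name and the ward counts are made up for illustration, and the resulting statistic would be compared against a t-distribution with the returned degrees of freedom.

```python
from statistics import mean, variance


def two_stage_t(control_wards, intervention_wards):
    """Stage 1: one compliance proportion per ward.
    Stage 2: pooled-variance t statistic comparing the two arms.

    Each argument is a list of (compliant_events, opportunities) per ward.
    Returns (difference in mean proportions, t statistic, degrees of freedom).
    """
    ctrl = [c / n for c, n in control_wards]
    intv = [c / n for c, n in intervention_wards]
    k1, k2 = len(ctrl), len(intv)
    df = k1 + k2 - 2  # based on number of wards, not observations
    pooled_var = ((k1 - 1) * variance(ctrl) + (k2 - 1) * variance(intv)) / df
    se = (pooled_var * (1 / k1 + 1 / k2)) ** 0.5
    diff = mean(intv) - mean(ctrl)
    return diff, diff / se, df


# Hypothetical ward-level counts (compliant events, opportunities).
control = [(120, 200), (110, 200), (130, 200)]
intervention = [(160, 200), (170, 200), (150, 200)]
diff, t_stat, df = two_stage_t(control, intervention)
print(f"difference = {diff:.2f}, t = {t_stat:.2f}, df = {df}")
```

Note that the degrees of freedom depend only on the number of wards, which is why 1,200 observations here yield just 4 degrees of freedom.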

Handling the Specific Correlations in this Setting

In a hospital ward, clustering creates at least two layers of correlation:
  • Within-ward, within-period: Different staff members' compliance events on the same ward in the same month are correlated because they share ward culture, leadership, and the intervention itself.
  • Within-ward, between-period (autocorrelation): A ward's compliance in January is correlated with its compliance in February because the underlying ward culture persists over time.
In a stepped-wedge design, this second correlation is particularly important and requires explicit modelling using either random slope models or the "common correlation" model (Hooper et al., 2016). The Shiny CRT calculator (Hemming et al., 2020) allows researchers to specify both the ICC and the cluster autocorrelation coefficient (CAC) when estimating power.

Pre-Specification and Transparency

Given the potential for £50 million of public resource to follow from this trial, the analysis plan must be:
  • Pre-specified in a published protocol and registered in a trial registry (ISRCTN or ClinicalTrials.gov) before any data are unblinded
  • Audited against the CONSORT cluster trial checklist (Campbell et al., 2012)
  • Reported with the ICC estimate as required by CONSORT - this serves the scientific community by providing empirical ICC data for future trial planning
For the SW-CRT specifically, the CONSORT extension by Hemming et al. (2018 - PMID 30413417) should be used as the reporting template.

Summary Table: Key Design Decisions

Issue | Problem | Recommended Solution
Unit of randomisation | Ward-level intervention; contamination if individual | Cluster randomise by ward
Ward type confounding | ICU/medical/surgical differ at baseline | Stratified randomisation by ward type
Staff mobility | Cross-ward contamination; exposure misclassification | Ward-level analysis; document cross-working; geographically separated wards
Seasonal variation | Time trends inflate/deflate outcome rates | Calendar-time stratification; time period fixed effects in model
Ethical concern (equipoise) | Withholding potentially beneficial intervention | Stepped-wedge design: all wards eventually receive intervention
ICC inflation of sample size | Effective sample size far smaller than head count | Design effect calculation; simulation-based power; pilot ICC estimate
Small cluster numbers | Degrees of freedom penalty | REML mixed models; small-sample degrees of freedom correction
Analysis level | Inflation of type I error if clustering ignored | Mixed-effects models (GLMM) or two-stage ward-level analysis

Key References

  • Campbell MK, Piaggio G, Elbourne DR, Altman DG; CONSORT Group (2012). CONSORT 2010 statement: extension to cluster randomised trials. BMJ, 345, e5661. [doi:10.1136/bmj.e5661]
  • Hemming K, Taljaard M, McKenzie JE et al. (2018). Reporting of stepped wedge cluster randomised trials: extension of the CONSORT 2010 statement with explanation and elaboration. BMJ, 363, k1614. [PMID: 30413417]
  • Mdege ND, Man MS, Taylor CA, Torgerson DJ (2011). Systematic review of stepped wedge cluster randomized trials. J Clin Epidemiol, 64(9): 936-48. [PMID: 21411284]
  • Prost A, Binik A, Abubakar I et al. (2015). Logistic, ethical, and political dimensions of stepped wedge trials: critical review and case studies. Trials, 16:353. [PMID: 26278521]
  • Rutterford C, Taljaard M, Dixon S et al. (2015). Reporting and methodological quality of sample size calculations in cluster randomized trials. J Clin Epidemiol, 68(6): 716-23. [PMID: 25523375]
  • Thom KA, Rock C et al. (2023). Direct gloving vs hand hygiene before donning gloves - cluster randomized clinical trial. JAMA Network Open. [PMID: 37883088]
  • Key considerations for designing, conducting and analysing a cluster randomized trial. PMC10555937 - comprehensive methodological review.