Sample Size calculations in
multilevel modelling
William Browne
University of Nottingham
(With thanks to
Mousa Golalizadeh and Lynda Leese)
Summary
• Introduction to sample size calculations.
• A simulation-based approach.
• PINT for balanced 2-level models.
• Effect of balance.
• Other approaches.
• Cross-classified models.
• Future plans.
Background
• Many quantitative social science research questions take the form of a hypothesis: A has a significant effect on B.
• To answer such a question, data are collected that (hopefully) allow the researcher to test whether A has a statistically significant effect on B. (In fact we aim to reject the hypothesis that A has no effect on B.)
• A test is performed: either the researcher is happy and A indeed has a significant effect on B, or they are left wondering why the data collected do not back up their hypothesis. Is the hypothesis false, or were the data not sufficient?
• The sufficiency of the data is the motivation for sample size calculations.
Example
• Suppose I have the research question ‘Are Welshmen
on average taller than 175 cms?’
• I now need to get hold of a random sample of n
Welshmen and measure each of their heights.
• I make some statistical assumption about the distribution
of the heights of Welshmen e.g. that they come from a
Normal distribution.
• I might like to check this assumption by plotting a
histogram of the data.
• I can then form a statistical hypothesis test and test
whether indeed Welshmen are taller than 175cms.
• I need to decide how big to make n, my sample of
Welshmen.
Hypothesis Testing
• Let us assume our null hypothesis is that the average height of Welshmen (μ) is 175 cm.
• So we test H0: μ = 175 vs HA: μ > 175 (or alternatively H0: θ = 0 vs HA: θ > 0, where θ = μ - 175).
• In practice we calculate from our sample its mean (x̄) and variance (s²) and use these along with n to form a test statistic which we can compare with the distribution assumed under H0.
Type I and Type II errors
• No hypothesis test is perfect and there is always the possibility of errors:

                           Truth
                           H0 True          H0 False
  Decision   Reject H0     Type I error     Correct
             Accept H0     Correct          Type II error

• P(Type I error) = α = the significance level or size of the test.
• P(Type II error) = β; 1-β is the power of the test.
• In general we fix α at some value, e.g. 0.05 or 0.01, and then 1-β depends on our sample size.
Example hypothesis test
• Let us assume that in reality our sample mean is 180 cms and the population standard deviation (sd) is 5 cms (known).
• We can then form a test statistic as follows:

  Z = (X̄ - 175)/(5/√n) = √n (X̄ - 175)/5 ~ N(0, 1)

• Note here that for small n and unknown sd we should use a Student t distribution rather than the Normal.
• For a 1-sided Z test with X̄ = 180 we wish Z = √n > 1.645, and so we need our sample to be of size 3 to reject H0; using a Student t distribution increases this to 5. (Here α = 0.05.)
• However if the sample mean had been only 176 cms then we would need n > (1.645 × 5)² ≈ 67.7, i.e. 68 Welshmen, to reject H0.
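As a quick numerical check of this slide's arithmetic, here is a small R sketch (my own, not from the talk):

  z_crit <- qnorm(0.95)          # 1.645, one-sided test at α = 0.05
  ceiling(z_crit^2)              # 3: smallest n with √n > 1.645
  n <- 2                         # search for the t-based answer
  while (sqrt(n) <= qt(0.95, df = n - 1)) n <- n + 1
  n                              # 5
  ceiling((z_crit * 5)^2)        # 68: needed when the sample mean is 176 cms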
Power calculations
• Our last slide is in some sense backwards, as we cannot get from a given sample mean to choosing a sample size!
• What we do instead is use different terminology and play God!
• We will choose an ‘effect size’, γ, which will represent a guess at the increase in the sample mean for Welshmen.
• There then exists an (approximate) formula that links four quantities: size (α), power (1-β), effect size (γ) and sample size (n):

  γ / SE(γ) = z(1-α) + z(1-β)

  Here the RHS is the sum of the cases H0 true and H0 false.
• Note that the standard error (SE) of γ is a function of n and of σ, the population sd, which is assumed known.
• We can now evaluate one of these quantities conditional on the others, e.g. what sample size is required given α, 1-β and γ?
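This formula can be rearranged into a one-line sample size function. A minimal R sketch (the function name required_n is my own, not from the talk):

  # Solve γ/SE(γ) = z(1-α) + z(1-β) for n, assuming SE(γ) = σ/√n with σ known
  required_n <- function(gamma, sigma, alpha = 0.05, power = 0.8) {
    ceiling(((qnorm(1 - alpha) + qnorm(power)) * sigma / gamma)^2)
  }
  required_n(gamma = 1, sigma = 5)   # 155 Welshmen for 80% power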
Welsh height example
Here we have looked at two examples, with effect sizes 5 and 1 respectively. Assume σ takes the value 5 and suppose we take a sample of 25 Welshmen. Then from

  γ / SE(γ) = z(1-α) + z(1-β)

Case 1: 5/(5/√25) = 1.645 + z(1-β), so z(1-β) = 3.355 and power 1-β = 0.9996.
Case 2: 1/(5/√25) = 1.645 + z(1-β), so z(1-β) = -0.645 and power 1-β = 0.2595.
So here a sample of 25 Welshmen from a population with mean 180 cms would almost always result in rejecting H0, but if the population mean is 176 cms then only 26% of such samples would lead to rejection.
We can plot curves of how power increases with sample size, as shown on the next slide.
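These two cases are easy to verify numerically; a quick R check (my own sketch):

  pnorm(5 / (5 / sqrt(25)) - qnorm(0.95))   # case 1: power = 0.9996
  pnorm(1 / (5 / sqrt(25)) - qnorm(0.95))   # case 2: power = 0.2595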
Power curve for Welshmen
example
Here we see the two power curves for the
two scenarios:
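The curves themselves follow from the same formula; a minimal R sketch of the plot (assumed values as above):

  n <- 1:100
  plot(n, pnorm(5 * sqrt(n) / 5 - qnorm(0.95)), type = "l",
       xlab = "Sample size n", ylab = "Power")              # effect size 5
  lines(n, pnorm(1 * sqrt(n) / 5 - qnorm(0.95)), lty = 2)   # effect size 1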
Extending the idea
• The simple formula γ/SE(γ) = z(1-α) + z(1-β) can be used in many situations and hypothesis tests.
• To generalise the idea we assume that γ is an effect size associated with a statistic that we wish to compare with a (null) hypothesized value of 0.
• The complication occurs in finding a formula for the standard error of the statistic and relating this formula to the sample size, n.
• We will next consider an alternative approach before returning to look at how the above approach extends to multilevel models.
The use of simulation
• In reality our (hoped-for) research path will be as follows: construct research question → form null hypothesis that we believe false → collect appropriate data → reject hypothesis, thereby supporting our research question.
• Assuming what we believe in our research question is correct, and hence the null hypothesis is false, we can still be let down by not collecting enough data.
• The idea behind using simulation is to simulate the data-gathering process (assuming we know the right answer) many times and see how often we can reject the null hypothesis. The percentage of rejected null hypotheses (via simulation) then estimates the power.
Simulation in our example
• Consider our Welsh height example, case 2, where we believe Welshmen have a mean height of 176 cms (and sd = 5 cms), we are testing the hypothesis H0: μ = 175 cms, and we consider a sample size of 25.
• We then generate N samples (e.g. 5000) of size 25, with observations drawn from N(176, 5²), and for each sample i form a lower bound for the confidence interval of the form x̄i - 1.645 × S.E.(x̄i). We compare this with the value 175, and the proportion of lower bounds greater than 175 is an estimate of the power of the test.
• We can repeat this exercise for different sample sizes and form a power curve.
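A minimal R version of this simulation (my own sketch of the procedure described above; the talk's own code is not shown):

  set.seed(1)
  nsim <- 5000; n <- 25
  reject <- replicate(nsim, {
    x <- rnorm(n, mean = 176, sd = 5)
    mean(x) - 1.645 * sd(x) / sqrt(n) > 175   # lower confidence bound vs 175
  })
  mean(reject)   # simulated power, close to the theoretical 0.2595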
Power curve comparison
Note that the simulation curve is a good approximation to the theoretical curve, although there are some minor (Monte Carlo) errors even with 5000 simulations per sample size.
Advantages/Disadvantages
• Theoretical approach is quick when the
formula can be derived.
• Approximations for more complex
situations exist which are equally quick.
• Simulation approach generalizes to more
situations but is much slower and we may
need large numbers of simulations per
scenario to get accurate power estimates.
What happens with multilevel data?
We will here mainly consider 2-level models and
take as our application area education, so we
have students nested within schools.
When deciding on a sampling scheme we have
many choices:
• How many schools, N ?
• How many pupils per school, nj ?
• Should we collect the same size sample from
each school ?
Our decision will depend on which parameter we
wish to estimate in the model.
Education Example
• For motivation we considered a two level dataset with exam marks
measured for each student in a collection of schools. In fact this
dataset exists and has 4915 students in 96 schools.
• Our hypothesis of interest is that the exam mark for an average
student is > 20 (null hypothesis = 20) which with such a large
sample results in the null hypothesis being rejected for our particular
data.
• If we fit the following multilevel model to the data we get the
estimates given:
  yij = β0 + uj + eij,
  uj ~ N(0, σ²u), eij ~ N(0, σ²e)
  β0 = 21.685, σ²u = 16.205, σ²e = 139.367
• If we treat these estimates as population values, we are interested in the power for testing our hypothesis that results from various combinations of N and nj.
Design effect formula
• If we assume balance, then with n pupils in each of N schools, for our simple model (and only this simple model) the following formula holds:
• Design effect = 1 + (n-1)ρ, where ρ is the intra-class correlation.
• So if we know the simple random sample size required for a given power, we need to multiply this by the design effect.
• For example our data has ρ = 16.205/(16.205 + 139.367) = 0.104.
• So for schools of size 10 pupils we would need 1 + 9 × 0.104 = 1.94 times as many students (in total) to get the same power.
• For this model (and this model only) we could therefore perform our power calculations assuming simple random sampling from a population with variance 155.572 and scale up the sample required based on the design.
• So, taking α = 0.05 and power 1-β = 0.8, we have 1.685/√(155.572/n) = 1.645 + 0.84, giving n ≈ 338.4.
• And for schools of size 10 we require 1.94 × 338.4 ≈ 657 pupils, which we can round up to 66 schools.
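The arithmetic on this slide can be reproduced in a few lines of R (values as estimated from the data; a sketch only):

  s2u <- 16.205; s2e <- 139.367
  rho   <- s2u / (s2u + s2e)                # intra-class correlation, 0.104
  deff  <- 1 + (10 - 1) * rho               # design effect 1.94 for n = 10
  gamma <- 21.685 - 20                      # effect size 1.685
  n_srs <- ((1.645 + 0.84) * sqrt(s2u + s2e) / gamma)^2   # ~338 (80% power)
  ceiling(deff * n_srs / 10)                # ~66 schools of 10 pupils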
Simulating multilevel designs
• The process here is similar to the earlier
example except that we need to simulate from a
multilevel model and fit the models using MLwiN
(Rasbash, Browne et al. 2000).
• To this end we will write macro code in the
MLwiN macro language to perform the task.
• The MLwiN macro language allows datasets to
be simulated, models to be set up and run using
various algorithms and results collected.
• It has the advantage of performing all the
operations in one package but programming in
the macro language is not for the faint hearted!
Simulation continued
• We will perform simulations for schools of 10 pupils where number
of schools (N) ranges from 5 to 70. For each N, 5000 datasets are
generated.
• For each dataset we need to generate 10*N level 1 residuals with
variance 139.367, N level 2 residuals with variance 16.205 and add
these residuals up correctly with the fixed effect estimate 21.685.
• MLwiN has commands to generate random Normally distributed
observations but also has a SIMU command which given a model is
set up and estimates given will simulate from it directly making life
easier.
• For each simulated dataset we fit the variance components model
using the RIGLS algorithm. For small numbers of level 2 units we
may have estimation difficulties but MLwiN has an ERROR 0
command which simply ignores such problems.
• Note it is also important to ensure the command BATCH 1 is included, else MLwiN may only run RIGLS for 1 iteration for each model! (An R analogue of this whole procedure is sketched below.)
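For readers without MLwiN, a rough lme4 equivalent of the macro described above (a sketch under my own assumptions, not the talk's actual code; REML in lme4 plays the role of RIGLS):

  library(lme4)
  power_vc <- function(N, n = 10, nsim = 1000) {
    school <- rep(1:N, each = n)
    mean(replicate(nsim, {
      y <- 21.685 + rnorm(N, 0, sqrt(16.205))[school] +
           rnorm(N * n, 0, sqrt(139.367))
      fit <- suppressWarnings(lmer(y ~ 1 + (1 | school)))   # cf. ERROR 0
      (fixef(fit)[1] - 20) / sqrt(vcov(fit)[1, 1]) > 1.645  # one-sided test
    }))
  }
  power_vc(N = 30)   # estimated power for 30 schools of 10 pupils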
Comparison of formula/simulations
• The following graph compares the design effect formula to the
simulation approach:
Zero variance estimates from
RIGLS algorithm
• The following graph gives a plot of percentage zero estimates for the
level 2 variance against number of level 2 units:
Other sample size issues
• There are other reasons why we may be interested in sample size
questions in multilevel modelling.
• It is often problematic to fit multilevel models when the number of
higher level units is small as demonstrated in the last graph.
• Also some methods can be biased for small sample sizes.
• Note that although method comparison is done using a similar approach of generating simulated datasets, power calculations are not the main aim there; that said, when performing power calculations the parameter bias of a method should be noted, as it will result in biased predicted power.
• Browne (1998) and Browne and Draper (2006) compare MCMC, RIGLS and IGLS for small sample sizes and continuous responses, and MCMC, MQL and PQL for binary response models.
• Maas and Hox (2004) also look at small sample sizes and how
robust estimation is to the Normal distributional assumption of the
level 2 residuals.
Sampling policy
The design effect formula:
Design effect = 1 + (n-1)ρ
suggests that if we are to sample a fixed (balanced)
number of pupils n*N then our best power results when n
is smallest i.e. sampling one pupil each from 100 schools
is better than sampling 100 pupils from the same school.
The effect of sampling policy is most important in scenarios
where ρ is large e.g. repeated measures designs.
The simulation procedure gives approximately the same
power curve and so in this simple example we have an
easy to use formula.
The reason in practice for sampling several pupils from
each school is purely the additional cost incurred in
visiting additional schools.
More complex examples – random intercepts and random slopes
• We will now look at more complex random effect models with predictor variables.
• We will consider the random intercepts model

  yij = β0 + β1 xij + uj + eij,  uj ~ N(0, σ²u), eij ~ N(0, σ²e)

• and the random slopes model

  yij = β0 + β1 xij + u0j + u1j xij + eij,  (u0j, u1j)′ ~ MVN(0, Ωu), eij ~ N(0, σ²e)

• We will consider how to extend (approximately) the theoretical approach and also the simulation approach.
PINT – (Bosker, Snijders and Guldemond 1996)
• Stands for Power IN Two level designs.
• Will work out (approximate) standard errors for parameters in two
level models.
• Allows arbitrary numbers of fixed parameters and random (at level
2) parameters.
• Assumes balance at level 2 i.e. each of N level 2 units contains n
level 1 units.
• Works out (approximate) standard errors for all fixed parameters in
the model given lots of information relevant to the calculation.
• For each variable, its mean, its variance (both within and between higher-level units) and its covariances (correlations) with the other variables are required in the calculation.
• It differentiates between various types of fixed effect: level 1
variables with and without a random effect, level 2 variables and
cross-level interactions.
• It can also deal with monetary considerations.
Example problem
• We will continue with our educational example but also consider the effect of gender (β1). For a random intercepts model let us assume the true parameter values are β0 = 20.9, β1 = 1.6, σ²u = 15, σ²e = 135.
• We have two hypotheses to test:
• Hypothesis 1: boys get on average more than 20 marks (H0: β0 = 20 vs HA: β0 > 20)
• Hypothesis 2: girls do better on average than boys (H0: β1 = 0 vs HA: β1 > 0)
• We will also consider the effect of random slopes on the dataset, so will have a second model with additionally σ²u0 = 10, σu01 = 2, σ²u1 = 5.
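Before turning to PINT, a toy lme4 version of these two models using the assumed true values (my own sketch; the data are simulated from the random intercepts model, so the random slopes fit may warn of a singular fit):

  library(lme4)
  set.seed(2)
  N <- 40; n <- 30
  school <- rep(1:N, each = n)
  girl <- rbinom(N * n, 1, 0.5)
  y <- 20.9 + 1.6 * girl + rnorm(N, 0, sqrt(15))[school] +
       rnorm(N * n, 0, sqrt(135))
  fit_ri <- lmer(y ~ girl + (1 | school))          # random intercepts
  fit_rs <- lmer(y ~ girl + (1 + girl | school))   # random slopes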
PINT input
• PINT requires us to input σ²u = 15, σ²e = 135 and, for the gender parameter, its mean (which corresponds to the probability of being a girl), which we will assume is 0.5. We will assume a Binomial distribution for gender, making its variance equal to 0.25 (within groups), and assume zero variance between groups.
• An alternative might be to assume the between-groups variance is 0.025 (p(1-p)/n for the 10 pupils per school example) and the within variance 0.225, which increases the parameter SEs slightly and reduces power.
• The simulation approach is far easier to understand, as we simply choose a gender at random from a Binomial distribution for each pupil, as sketched below.
• We can also easily incorporate features such as single-sex schools by giving these a probability of selection and making all students in such a school boys or girls.
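That gender-generation step, and the single-sex variant, in R (my own illustration):

  N <- 50; n <- 10
  girl <- rbinom(N * n, 1, 0.5)                  # mixed schools
  girl_ss <- rep(rbinom(N, 1, 0.5), each = n)    # whole school boys or girls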
Results – Hypothesis 1
Here we see good agreement between the two approaches. It appears that we need a large dataset to have strong power for this hypothesis.
Results – Hypothesis 2
Here the PINT curve appears to give slightly higher power, suggesting that maybe the alternative predictor variances would be more appropriate.
What happens when we include random slopes?
• The following table gives power values for β1 for the random intercepts model.
• Note that pairs of values with the same total N*n have similar powers.

                            Schools (N)
                     20     30     40     50     60
  Pupils (n)   20   0.39   0.49   0.60   0.70   0.75
               30   0.51   0.67   0.76   0.84   0.89
               40   0.61   0.77   0.86   0.92   0.95
               50   0.69   0.84   0.92   0.96   0.98
               60   0.77   0.90   0.95   0.98   0.99
What happens when we include random slopes?
• The following table gives power values for β1 for the random slopes model.
• Note that pairs of values with the same total N*n now do not have similar powers, and larger N is better.

                            Schools (N)
                     20     30     40     50     60
  Pupils (n)   20   0.33   0.45   0.54   0.63   0.69
               30   0.42   0.55   0.67   0.76   0.82
               40   0.49   0.64   0.76   0.83   0.89
               50   0.57   0.69   0.80   0.89   0.93
               60   0.60   0.76   0.85   0.91   0.95
Effect of balance
• Here we look at 3 scenarios: balanced, unbalanced and severely unbalanced.
• We will consider the variance components model and construct power curves by evaluating each scenario at 4, 8, 12, …, 100 schools.
• The balanced case for N schools has 10 pupils per school.
• The equivalent unbalanced case has N/2 schools containing 5 pupils, N/4 schools containing 10 pupils and N/4 schools containing 20 pupils.
• The severely unbalanced case has N-1 schools containing only 1 pupil and 1 school containing 9N+1 pupils (see the sketch below).
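The three designs can be written down as vectors of school sizes; a small R sketch (my own; note the scenarios are evaluated at multiples of 4 schools, so N/4 is whole):

  balanced   <- function(N) rep(10, N)
  unbalanced <- function(N) c(rep(5, N / 2), rep(10, N / 4), rep(20, N / 4))
  severe     <- function(N) c(rep(1, N - 1), 9 * N + 1)
  sapply(list(balanced(8), unbalanced(8), severe(8)), sum)   # all 80 pupils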
Results
• Here we see the power curves for the 3 scenarios. Note the lower power for the unbalanced designs and the strange behaviour under severe unbalance.
Number of zero variances
Extremely unbalanced designs are really estimating the effect of the large
school instead of the global mean and hence the level 2 variance is often
estimated as 0.
Subsampling approach / post-hoc power calculations
• We have chosen a parametric approach where, given effect sizes, we simulate datasets prior to any actual data being collected.
• An alternative post-data-collection non-parametric approach is to subsample from a large existing dataset and estimate power from these subsamples.
• Such an approach has been investigated by Afshartous (1995) and Mok (1995).
• The advantage of this approach is that no distributional assumptions need be made in the dataset generation.
• The disadvantage is that post-data power calculations in some sense miss the boat, in that we really need the power calculations to guide us in our sampling. However such calculations may be useful for similar future studies.
Bayesian approach
• A recent more Bayesian approach is described in Wang and Gelfand (2002).
• Here, rather than fix an effect size for each unknown parameter, the user instead gives a prior distribution (the sampling prior) which is used in the generation of the simulated datasets.
• They then use MCMC to fit models to their simulated datasets and evaluate performance criteria based on the posterior samples.
Cross-classified models
• In our ESRC grant we are intending to focus on these model types for our power calculations as they are outside the remit of PINT.

  yi = β0 + β1 xi + u(2)school(i) + u(3)district(i) + ei,
  u(2)j ~ N(0, σ²s), u(3)j ~ N(0, σ²d), ei ~ N(0, σ²e)

• To date we have produced code in both MLwiN (using both the Rasbash and Goldstein adaptation of the IGLS algorithm and MCMC sampling) and R (lmer); the lmer formulation is sketched below.
• MLwiN appears to be quicker, although both IGLS and lmer have problems as the number of units in the cross-classified models increases, particularly with random slopes.
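For reference, a toy version of the lmer formulation mentioned above (my own sketch; the variance values are illustrative, not the talk's):

  library(lme4)
  set.seed(3)
  n <- 500
  school   <- sample(20, n, replace = TRUE)   # crossed with district
  district <- sample(15, n, replace = TRUE)
  x <- rnorm(n)
  y <- 1 + 0.5 * x + rnorm(20, 0, 1)[school] +
       rnorm(15, 0, 1)[district] + rnorm(n, 0, 3)
  fit <- lmer(y ~ x + (1 | school) + (1 | district))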
Cross-classified Issues
• We have considered 2 higher level classifications and an
educational application where pupils are nested in both districts
(where they live) and schools (where they study). Districts and
schools are crossed.
• This data structure can be considered as a 2-way contingency table.
• Note that in reality this table will be sparse i.e. a school takes mainly
pupils from local districts.
• We can consider sampling where we can choose any number from
any cell of the table or perhaps more realistically we could choose
numbers of pupils from one dimension e.g. school and simply be
given their district from sampling.
• For simulation purposes we can create typical tables based on
probabilities of cell membership.
• We plan to compare approaches to mimic the sampling process.
Conclusions
• In this talk we have shown the flexibility of using
simulation to perform power calculations for multilevel
models.
• Although computationally the approach is slow the
flexibility of the approach means it can be used for
virtually all models given enough assumptions.
• Low powered studies often involve small amounts of
data thus making the power calculations quicker.
• In comparison the PINT program is fantastically quick
and it will be worth also using the simulation approach to
make approximate adjustments to PINT answers for
problems it cannot deal with.
Further work: Designing a Power
simulator software package
• We are interested in using MLwiN (both IGLS
and MCMC estimation), lmer in R and
WinBUGS for fitting models to simulated
datasets.
• We want a stand-alone program that generates macro code to be run in either MLwiN, WinBUGS or R.
• The idea is the program takes as input details of
the model to be investigated and generates
code for the problem that can be used in the
appropriate software package.
Further work
• Find (faster) approximations to simulation results
– potentially create look up tables, power curves
etc.
• Investigate other packages for power calculations, e.g. Optimal Design (Raudenbush et al. 2005) for cluster randomized trials.
• Investigate the Bayesian approach of Wang and
Gelfand (2002) and compare with the standard
approach.
• Investigate efficient MCMC methods.
• Investigate models with other response types.
References
• Afshartous, D. (1995). Determination of sample sizes for multilevel model design. Multilevel Analysis for Education Research.
• Bosker, R.J., Snijders, T.A.B. and Guldemond, H. (1996). PINT (Power IN Two-level designs) User Manual.
• Browne, W.J. (1998). Applying MCMC methods to multilevel models. Unpublished PhD thesis, University of Bath.
• Browne, W.J. and Draper, D. (2006). A comparison of Bayesian and likelihood-based methods for fitting multilevel models (with discussion). Bayesian Analysis 1, 473-550.
• Maas, C.J.M. and Hox, J.J. (2004). Robustness issues in multilevel regression analysis. Statistica Neerlandica 58, 127-137.
• Mok, M. (1995). Sample size requirements for 2-level designs in educational research. Multilevel Modelling Newsletter 7 (2), 11-15.
• Rasbash, J., Browne, W.J., Goldstein, H. et al. (2000). A User's Guide to MLwiN Version 2.1. London: Institute of Education, University of London.
• Wang, F. and Gelfand, A.E. (2002). A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science 17 (2), 193-208.