Introduction to Design and Analysis of Experiments
Professor Daniel Houser
Noteset 3
A. Parametric tests about means
1. We assume we have a random sample of size n from a $N(\mu_X, \sigma_X^2)$ distribution
where the variance is known but the mean is unknown. We are interested in
testing the null hypothesis that $\mu_X = \mu_0$. The alternative hypothesis will typically
be in one of three forms.
(i) $\mu_X < \mu_0$. This would be used if the mean could only have decreased.
(ii) $\mu_X > \mu_0$. This would be used if the mean could only have increased.
(iii) $\mu_X \ne \mu_0$. This would be used to test for any change in the mean.
An appropriate test statistic to test the null against any of these alternatives is
$z = \frac{\bar X - \mu_0}{\sigma_X / \sqrt{n}}.$
Let $z_\alpha$ correspond to the upper $\alpha$% critical value from the standard normal distribution.
For example, $z_{0.05} = 1.65$ because about 5% of the mass of the standard normal
distribution lies to the right of 1.65. Then the critical regions for the z-test are as
follows.

$H_0$               $H_1$                  Critical Region
$\mu_X = \mu_0$     $\mu_X < \mu_0$        $z < -z_\alpha$
$\mu_X = \mu_0$     $\mu_X > \mu_0$        $z > z_\alpha$
$\mu_X = \mu_0$     $\mu_X \ne \mu_0$      $|z| > z_{\alpha/2}$
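As a concrete illustration, here is a minimal sketch of the z-test in Python (using NumPy and SciPy, which are not part of these notes; the function name and data are hypothetical):

```python
import numpy as np
from scipy import stats

def z_test(x, mu0, sigma, alpha=0.05, alternative="two-sided"):
    """One-sample z-test of H0: mu = mu0 when sigma is known."""
    n = len(x)
    z = (np.mean(x) - mu0) / (sigma / np.sqrt(n))
    if alternative == "less":        # H1: mu < mu0
        reject = z < -stats.norm.ppf(1 - alpha)
    elif alternative == "greater":   # H1: mu > mu0
        reject = z > stats.norm.ppf(1 - alpha)
    else:                            # H1: mu != mu0
        reject = abs(z) > stats.norm.ppf(1 - alpha / 2)
    return z, reject
```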
2. The z-test is only appropriate when the variance of the distribution is known.
Assume the variance is unknown. An appropriate statistic in this case to test the
null against the possible alternatives described above is
$T = \frac{\bar X - \mu_0}{S / \sqrt{n}},$
where $S = \sqrt{S^2}$ and $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)^2$.
The critical regions are analogous to the z-test, but with the appropriate t
distribution replacing the standard normal. The two tailed alternative, for
example, will be accepted over the null if
$|T| = \frac{|\bar X - \mu_0|}{S / \sqrt{n}} > t_{\alpha/2}(n-1).$
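A quick sketch of the T statistic in Python (the data are hypothetical; `scipy.stats.ttest_1samp` serves as a cross-check):

```python
import numpy as np
from scipy import stats

x = np.array([9.8, 10.4, 10.1, 9.6, 10.7, 9.9])     # hypothetical sample
mu0, n = 10.0, 6

T = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))  # S uses the 1/(n-1) divisor
reject = abs(T) > stats.t.ppf(0.975, df=n - 1)       # two-tailed, alpha = 0.05

t_stat, p_val = stats.ttest_1samp(x, mu0)            # library equivalent
```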
Example:
Suppose we believe that with probability p a person will play a certain Nash
equilibrium, and that this probability is the same for all people. We do not know
p, but have reason to believe it is about 25%. To test this hypothesis against the
two-tailed alternative we put 100 subjects through an experiment and observe
whether each plays the Nash strategy. Hence, we obtain 100 independent draws
from a Bernoulli distribution with the probability of success p. (Recall that a
Bernoulli random variable takes only two values, zero or one, and has pdf
$f(x) = p^x (1-p)^{1-x}$.) From the CLT it follows that, approximately,
$\hat p = \frac{1}{100} \sum_{i=1}^{100} x_i \sim N(p, \; p(1-p)/100).$
Suppose that $\hat p = 0.20$. This provides an estimate of the mean for the normal, and
we can estimate the variance by 0.20(0.80)/100 = 0.0016 (this is $S^2/n$). Then we
can test the null using a standard z-test (which closely approximates the t-test
because the number of observations is large):
$\frac{|0.20 - 0.25|}{\sqrt{0.0016}} = 1.25 < 1.96 = z_{0.025}.$
So we accept the null at the 5% significance level.
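The arithmetic of this example can be verified in a few lines of Python (a sketch, not part of the original notes):

```python
import numpy as np
from scipy import stats

n, p_hat, p0 = 100, 0.20, 0.25
se = np.sqrt(p_hat * (1 - p_hat) / n)   # sqrt(0.0016) = 0.04
z = abs(p_hat - p0) / se                # 0.05 / 0.04 = 1.25
print(z > stats.norm.ppf(0.975))        # False: 1.25 < 1.96, do not reject
```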
B. Tests about variances and differences in means.
1. A random sample of n observations is taken from a normal distribution with
unknown variance. It is desired to test the null hypothesis that the variance is 100
against the two-sided alternative. Under the null, we know that
$(n-1)S^2/100 \sim \chi^2(n-1)$. Hence, the test is performed by comparing the
realized value of this statistic to the $\chi^2(n-1)$ distribution. In this case, since the
test is two sided, we reject the null if
$S^2 \le \frac{100}{n-1}\,\chi^2_{1-\alpha/2}(n-1)$ or $S^2 \ge \frac{100}{n-1}\,\chi^2_{\alpha/2}(n-1),$
where $\chi^2_{\alpha}(n-1)$ denotes the upper $\alpha$ critical value.
2. Suppose we have n and m observations from two independent normal
distributions, X and Y, and we want to test the hypothesis that their variances are
equal against the two-sided alternative that they are different. We know that
$(n-1)S_X^2/\sigma_X^2 \sim \chi^2(n-1)$ and $(m-1)S_Y^2/\sigma_Y^2 \sim \chi^2(m-1)$. Under the null that
$\sigma_X^2 = \sigma_Y^2$ it follows that $S_X^2/S_Y^2 \sim F(n-1, m-1)$, and from this the null hypothesis
can be easily tested in a way analogous to that described above.
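For completeness, a sketch of the corresponding F-test (the helper is hypothetical):

```python
import numpy as np
from scipy import stats

def f_test(x, y, alpha=0.05):
    """Two-sided F-test of H0: var(X) = var(Y) for independent normal samples."""
    n, m = len(x), len(y)
    F = np.var(x, ddof=1) / np.var(y, ddof=1)           # ~ F(n-1, m-1) under H0
    reject = (F < stats.f.ppf(alpha / 2, n - 1, m - 1) or
              F > stats.f.ppf(1 - alpha / 2, n - 1, m - 1))
    return F, reject
```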
3. Suppose we have n and m observations from two independent normal
distributions, X and Y, and that we know the variances of the distributions are
equal. We are interested in determining whether their means are statistically
significantly different. It is appropriate to use the following statistic:
$T = \frac{\bar X - \bar Y}{\sqrt{\{[(n-1)S_X^2 + (m-1)S_Y^2]/(n+m-2)\}(1/n + 1/m)}},$
which has a t-distribution with n+m-2 degrees of freedom.
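In Python this pooled statistic can be computed directly, or via `scipy.stats.ttest_ind` with `equal_var=True` (a sketch; the function name is hypothetical):

```python
import numpy as np
from scipy import stats

def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances; compare to t(n+m-2)."""
    n, m = len(x), len(y)
    sp2 = ((n - 1) * np.var(x, ddof=1) + (m - 1) * np.var(y, ddof=1)) / (n + m - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(sp2 * (1 / n + 1 / m))

# Library equivalent: stats.ttest_ind(x, y, equal_var=True)
```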
4. Suppose that in (3.) above the variances are unequal. Assessing whether there are
differences in means in this case is hard using standard classical techniques: this
is called the Behrens-Fisher problem. The difficulty is that the variances do not
drop out of the statistic in a natural way, so ad-hoc assumptions must be used to
eliminate them. One statistic that people use is
$T = \frac{\bar X - \bar Y}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}},$
which has, approximately, a t-distribution with (n+m-2) degrees of freedom. This
approximation is better when the sample size is large, in which case the t-test can
be replaced by a z-test. If the variances are known but different, the $S_X^2$ and $S_Y^2$
are replaced by the true values $\sigma_X^2$ and $\sigma_Y^2$.
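SciPy implements this unequal-variance statistic as Welch's test; note that it uses the Welch-Satterthwaite degrees of freedom rather than the cruder (n+m-2) approximation used in these notes (the data below are hypothetical):

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 0.8, 1.5, 1.1, 0.9])        # hypothetical sample from X
y = np.array([0.4, 0.7, 0.2, 0.9, 0.5, 0.6])   # hypothetical sample from Y
t_stat, p_val = stats.ttest_ind(x, y, equal_var=False)  # Welch's t-test
```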
C. Nonparametric methods (Siegel and Castellan – at bookstore.)
1. Advantages of nonparametric statistical tests.
- If the number of observations is small then there is often no alternative to
nonparametrics (except making artificial assumptions about the properties of
the data generating process.)
- Tests based on ordinal ranks may be easier to implement nonparametrically.
- Nonparametric methods can easily test for location differences between
distributions from different families. The Behrens-Fisher problem exemplifies
the difficulty that classical, parametric techniques have with this situation.
- Some nonparametric tests have more intuitive appeal than certain parametric
tests, which can often look rather ad-hoc.
2. Disadvantage of nonparametric statistical tests.
- They are less efficient than parametric tests. If the conditions of the
parametric model are met then parametric tests allow more precise inference.
3. Useful nonparametric tests (for more on these tests and other nonparametric tests
see Siegel and Castellan (1988).)
- Chi-square goodness of fit test.
Used to test whether a sample follows a particular distribution:
$H_0$: the data follows distribution with pdf f.
$H_1$: otherwise.
Is particularly valuable in assessing whether an estimated model "fits" the
data used to estimate it. The idea is to compare cell frequencies between the
two distributions.
The statistic is: $v = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2(k-1)$, where
$O_i$ = observed number of cases in the ith category,
$E_i$ = expected number of cases in the ith category when the null is true,
$k$ = the number of categories.
The null hypothesis is rejected at an $\alpha$ significance level if $v > \chi^2_\alpha(k-1)$.
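A minimal goodness-of-fit sketch in Python (the counts are hypothetical; `scipy.stats.chisquare` is the library equivalent):

```python
import numpy as np
from scipy import stats

O = np.array([18, 22, 30, 30])          # observed counts (hypothetical)
E = np.array([25, 25, 25, 25])          # expected counts under H0
v = ((O - E) ** 2 / E).sum()            # the statistic above
p_val = stats.chi2.sf(v, df=len(O) - 1)

v2, p2 = stats.chisquare(O, E)          # same statistic and p-value
```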
Example:
The data from an experiment includes a series of choices that subjects made in
a multiple-round game. Aggregate choice frequencies can be described as
follows:
Choice \ Period      1      2      3      4
A                   25%    25%    50%     0%
B                   10%    40%    40%    60%
C                   50%    25%     0%    20%
D                   15%    10%    10%    20%
The experimenter has additional information about the subjects, such as their
history of play and other observable, individual-specific data, summarized by
a vector $X_i$ for each subject i. The researcher is interested in whether a
particular parametric model which depends on a finite parameter vector $\theta$,
$G(X_i \mid \theta)$, can adequately explain choices. This can be answered by
(i) estimating the model, giving a point estimate $\hat\theta$ of the parameter vector;
(ii) simulating the model under the estimated parameter vector;
(iii) calculating the 16 cell frequencies for the simulated data;
(iv) calculating the statistic v and comparing it to a $\chi^2(15)$ distribution.
This test is particularly useful when other tests of “fit” are hard to compute.
* The statistic v is only asymptotically $\chi^2$. When the amount of data is small,
say less than five observations per category, one should attempt to
recategorize to increase frequencies within each cell. When there are a small
number of cells and a small number of observations in each cell, the test may
be inaccurate.
- Permutation tests for paired replicates.
* Powerful test for treatment effects when a pair of similar experimental
units are observed in each condition. The null hypothesis is that any
observed differences are not due to treatments.
Example: An experiment on Boys' Shoes
* 10 boys wore shoes made of different materials, A and B, on their left and
right feet. Whether material A was on the right foot or left foot was
randomized for each boy by flipping a coin. Any differences due to individual
boys should be apparent in both shoes.
* Data was taken on the wear of the sole of each shoe, giving the following
data.
boy   material A   material B   B-A
 1    13.2 (L)     14.0          0.8
 2     8.2 (L)      8.8          0.6
 3    10.9 (R)     11.2          0.3
 4    14.3 (L)     14.2         -0.1
 5    10.7 (R)     11.8          1.1
 6     6.6 (L)      6.4         -0.2
 7     9.5 (L)      9.8          0.3
 8    10.8 (L)     11.3          0.5
 9     8.8 (R)      9.3          0.5
10    13.3 (L)     13.6          0.3
Mean difference: 0.41
(L/R indicates whether material A was worn on the left or right foot.)
Under the null hypothesis that there is no treatment effect – that A and B give
equal protection against wear – all that is affected by the outcome of the coin toss
is the sign of the difference, B-A. Hence, under the null there are $2^{10} = 1024$
possible realizations of the mean difference, and this defines the sampling
distribution of the mean difference under the null. The significance of the observed mean is
determined by comparing it to the 1023 other possible outcomes.
In this case, only 3 of the 1023 other means are greater than 0.41. There are four
cases when the mean is identical to 0.41. Conservatively then, the significance
level is 7/1024, or about 0.7%. Hence we would reject the null.
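The full enumeration for this example is easy to reproduce in Python; the sketch below flips the sign of each difference in all $2^{10}$ ways and counts how many assignments give a mean at least as large as the observed 0.41 (the count should match the conservative 7/1024 figure above):

```python
import itertools
import numpy as np

d = np.array([0.8, 0.6, 0.3, -0.1, 1.1, -0.2, 0.3, 0.5, 0.5, 0.3])  # B - A
observed = d.mean()                                                  # 0.41

# Under H0 the coin flip only determines the sign of each difference,
# so enumerate all 2**10 = 1024 equally likely sign assignments.
means = [np.mean(signs * np.abs(d))
         for signs in itertools.product([-1.0, 1.0], repeat=len(d))]
count = sum(m >= observed - 1e-12 for m in means)   # ties counted, float-safe
print(count, count / len(means))                    # conservative p-value
```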
* This test uses all of the information in the sample, and is among the most
powerful of all statistical tests.
* It can be cumbersome to compute if the number of observations is large. An
alternative which is easier to compute, but throws out some information and is
therefore somewhat less powerful, is the Wilcoxon signed ranks test, which is
just the permutation test based on ranks instead of actual values.
- The median test.
* Tests whether two independent samples have the same median.
* Is a “robust” test, in the sense that it does not make strong assumptions
about the relationship between the underlying distributions of the two
samples (they may have different dispersions, for example.)
* It is the natural test to use when data is truncated.
The procedure is to derive the combined sample median and then the
following table from the two distributions.

                                        Data set I   Data set II
No. of scores above combined median         A             B
No. of scores below combined median         C             D
Observations                                m             n

Let N denote the total number of observations, N = m + n.
The approximate sampling distribution of the statistic
$v = \frac{N\,(|AD - BC| - N/2)^2}{(A+B)(C+D)(A+C)(B+D)}$
is $\chi^2(1)$, under the null hypothesis that the medians are the same. The
approximation is better when sample sizes are larger.
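A sketch of the median-test statistic in Python (observations equal to the combined median are simply dropped here; `scipy.stats.median_test` offers more careful tie handling):

```python
import numpy as np
from scipy import stats

def median_test_stat(x, y):
    """Chi-square median test with continuity correction."""
    med = np.median(np.concatenate([x, y]))   # combined-sample median
    A, C = np.sum(x > med), np.sum(x < med)   # data set I above/below
    B, D = np.sum(y > med), np.sum(y < med)   # data set II above/below
    N = A + B + C + D
    v = N * (abs(A * D - B * C) - N / 2) ** 2 / (
        (A + B) * (C + D) * (A + C) * (B + D))
    return v, stats.chi2.sf(v, df=1)
```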
* The Wilcoxon-Mann-Whitney test is more powerful than the median test,
but it requires that the distributions underlying the two populations differ
only in location. In particular, it requires that their variances are the same.
- The Jonckheere test for ordered alternatives.
Suppose one has a sample from each of k independent populations. The
Jonckheere procedure may be used to test
$H_0$: the population distributions are identical
against
$H_1$: the populations have different medians, $\theta_i$,
and the medians are ordered by $\theta_1 \le \theta_2 \le \ldots \le \theta_k$,
where at least one of the inequalities is strict.
Note: The ordering of the variables must be specified before the data are
collected.
To run the test one first orders the data in a table as follows.

Data set 1          Data set 2              ...   Data set k
(lowest median)     (2nd lowest median)           (highest median)
x(1,1)              x(1,2)                        x(1,k)
x(2,1)              x(2,2)                        x(2,k)
...                 ...                           ...
x(n,1)              x(m,2)                        x(n_k,k)

The columns are arranged from the smallest hypothesized median to the largest.
The test statistic, J*, is formed using the following 3-step procedure.
(i) For each entry x(i,j) in each of the first k-1 columns, determine the
number of entries in all of the higher columns that are greater than x(i,j).
Call this number N(i,j). Note that there will be an N(i,j) corresponding to
each entry in the table except those in the last column.
(ii) Define J as equal to the sum of the N(i,j).
(iii) It can be shown that the sampling distribution of J under the null that the
distributions are identical has mean and variance
$\mu_J = \frac{N^2 - \sum_{j=1}^{k} n_j^2}{4},$
$\sigma_J^2 = \frac{1}{72}\left[N^2(2N+3) - \sum_{j=1}^{k} n_j^2(2n_j+3)\right],$
where $n_j$ is the number of observations in column j and N is the total.
In large samples, the statistic $J^* = \frac{J - \mu_J}{\sigma_J}$ can be compared to the standard
normal cdf to compute approximate p-values for a test of the null
hypothesis.
* Rejection of the null implies that at least one median is statistically
greater in magnitude than one that precedes it, but it does not tell us
which one.
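A sketch of the Jonckheere procedure in Python (the function is hypothetical; samples must be NumPy arrays supplied in the pre-specified median order, smallest first):

```python
import numpy as np
from scipy import stats

def jonckheere(samples):
    """Jonckheere test: returns the standardized J* and a one-sided p-value."""
    n = np.array([len(s) for s in samples])
    N = n.sum()
    # J = number of pairs in which an entry in a later column exceeds an
    # entry in an earlier column (the sum of the N(i,j) above).
    J = sum(np.sum(samples[b][None, :] > samples[a][:, None])
            for a in range(len(samples) - 1)
            for b in range(a + 1, len(samples)))
    mu = (N ** 2 - np.sum(n ** 2)) / 4
    var = (N ** 2 * (2 * N + 3) - np.sum(n ** 2 * (2 * n + 3))) / 72
    J_star = (J - mu) / np.sqrt(var)
    return J_star, stats.norm.sf(J_star)

# Hypothetical usage, in hypothesized order of medians:
# jonckheere([np.array([1.0, 2.0]), np.array([2.5, 3.0]), np.array([4.0, 5.0])])
```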
D. Comments on the bootstrap
* The bootstrap is a computer-based method for assigning measures of accuracy
to statistical estimates. There are parametric, nonparametric, classical and
Bayesian versions of the bootstrap.
* The original motivation for the bootstrap was to provide a method to assign
standard errors to estimators for which no closed form solution existed.
* Intuitively, the bootstrap treats the sample as though it were the population,
and “resamples” the original sample repeatedly to generate an approximation
to the sampling distribution of any statistic that one might find useful. The
properties of the estimator are known only in large samples, although evidence
suggests it works well even when the sample size is very small.
Example: Bootstrapping the standard error of the mean.
- Suppose one has a random sample of 25 observations, $x_i, i = 1, \ldots, 25$, from an
unknown population.
- Definition: A bootstrap sample $x^*$ is obtained by randomly sampling 25
times with replacement from $\{x_i\}_{i=1}^{25}$.
- By repeatedly resampling from $\{x_i\}_{i=1}^{25}$ one obtains a large number of
bootstrap samples $x^{*1}, x^{*2}, \ldots, x^{*B}$. Around 200 bootstrap samples is usually
enough.
- Corresponding to each bootstrap sample is its mean: $\bar x^{*b}, b = 1, \ldots, B$.
- An estimate of the standard deviation of the mean's sampling distribution is
the standard deviation of the bootstrapped means:
$\hat\sigma_{boot} = \left[ \sum_{b=1}^{B} (\bar x^{*b} - \bar x^{*})^2 / (B-1) \right]^{1/2},$
where $\bar x^{*} = \frac{1}{B} \sum_{b=1}^{B} \bar x^{*b}$. (A code sketch follows below.)
* Note that one can replace the mean with any statistic, s(x), and follow the
same procedure to generate a measure of accuracy of this statistic's value.
* In the case of the mean of a vector of independent draws from a single
distribution one would not usually want to bootstrap the standard error. The
reason is that the CLT assures normality and the SEs under that distributional
assumption are tighter than will be given by the bootstrap.
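A minimal bootstrap sketch in Python (hypothetical data standing in for the 25 observations):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=25)    # stand-in for the sample

B = 200                                        # ~200 resamples is usually enough
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])
se_boot = boot_means.std(ddof=1)               # bootstrap SE of the mean

se_clt = x.std(ddof=1) / np.sqrt(x.size)       # textbook SE, for comparison
```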
Example 2: Bootstrapping the permutation test.
We observe two independent random samples from possibly different pdf's F and
G:
$F \rightarrow z_1, \ldots, z_n$
$G \rightarrow y_1, \ldots, y_m.$
We are interested in testing the null hypothesis $H_0: F = G$.
* This is the standard two-sample problem we have discussed above,
where we are interested in determining whether there is evidence of a
treatment effect. If we cannot reject the null there is little evidence of
any effect.
* Note that the null is very strong. It requires that there is no difference in the
stochastic behavior of z and y.
We have discussed how to test for different means, assuming that both
distributions are Normal and have the same variance. The bootstrap procedure is
as follows.
(a) If F = G, then the m+n observations came from the same distribution,
and the way they were classified as in “z” or “y” was one of $\frac{(m+n)!}{m!\,n!}$
equally likely outcomes (this result is called the “Permutation
Lemma”).
(b) We can resample from the pooled distribution repeatedly to form
bootstrapped samples of the z and y vectors.
(c) We calculate the difference between the means of each pair of
bootstrap samples, and then order (from high to low) this set of
differences.
(d) We determine where the realized difference in means occurs within
the bootstrapped order. For example, if we have 200 bootstrapped
differences in means, then if the original difference between the mean
of the z and y samples is larger in absolute value than 190 of those 200, we would say
that the difference is significant at the 5% level (assuming the test is
two-sided.)
To summarize the calculation of the two-sample permutation test statistic.
(1) Choose B independent vectors $g^*(1), g^*(2), \ldots, g^*(B)$, each
consisting of n z's and m y's and each being randomly selected
from the set of all $\binom{m+n}{n}$ possible such vectors. Usually,
you will want to set B at least equal to 1000.
(2) Evaluate the desired statistic, $\hat\theta^*(b)$, for every bootstrapped
vector.
(3) The achieved significance level of the original sample's test
statistic, $\hat\theta$, is given by $\frac{1}{B} \sum_{b=1}^{B} 1(\hat\theta^*(b) \ge \hat\theta)$, where 1(s) = 1 if s
is true, and zero otherwise.
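A sketch of this resampling calculation in Python (using the mean difference as the statistic; the function name is hypothetical):

```python
import numpy as np

def permutation_asl(z, y, B=1000, seed=0):
    """Achieved significance level for H0: F = G via random permutations."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([z, y])
    n = len(z)
    theta_hat = z.mean() - y.mean()              # original statistic
    theta_star = np.empty(B)
    for b in range(B):
        g = rng.permutation(pooled)              # one random relabeling
        theta_star[b] = g[:n].mean() - g[n:].mean()
    return np.mean(theta_star >= theta_hat)      # (1/B) sum 1(theta* >= theta)
```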
Bernoulli, Binomial, Poisson and Normal distributions
Bernoulli distribution
Random variable X follows a Bernoulli distribution if it can take only two values, say
zero and one, and:
Pr(X = 1) = p (between zero and one)
Pr(X = 0) = 1 - p (also between zero and one)
Hence, the pdf of a Bernoulli random variable is $f(X) = p^X (1-p)^{1-X}$, $X \in \{0, 1\}$. The
mean and variance of a Bernoulli random variable are p and p(1-p), respectively.
Binomial distribution
Suppose an experimenter conducts a sequence of N Bernoulli trials with probability of a
“1” equal to p>0, and let Y be the random variable that indicates the number of times “1”
occurred over the N trials. Then Y is said to follow a Binomial distribution, with
parameters N and p, and has pdf
$\Pr(y \mid N, p) = \begin{cases} \binom{N}{y} p^{y} (1-p)^{N-y} & \text{if } y = 0, 1, \ldots, N \\ 0 & \text{otherwise.} \end{cases}$
Here, the notation $\binom{N}{y}$ denotes the number of ways y distinct elements can be selected
from a set of N elements, divided by the number of distinct arrangements of y distinct
elements. Hence, $\binom{N}{y} = \frac{N!}{(N-y)!\,y!}.$
The mean and variance of the binomial distribution are Np and Np(1-p), respectively.
The normal distribution with the same mean and variance provides a good approximation
to the Binomial distribution when N > 5 and $\frac{1}{\sqrt{N}}\left(\sqrt{\frac{1-p}{p}} - \sqrt{\frac{p}{1-p}}\right) < 0.3$.
When the normal approximation is valid, one can test hypotheses about the mean of the
binomial distribution by forming the z statistic $(y_0 - Np)/\sqrt{Np(1-p)}$, where $y_0$ is the
actual number of 1s observed, and p is the hypothesized mean. It turns out that the
approximation is somewhat better if $(y_0 - 0.5)$ is used to calculate the z statistic, in place
of $y_0$. (This is called the Yates adjustment.)
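A sketch of the normal-approximation z-test with the Yates adjustment (the numbers are hypothetical):

```python
import numpy as np
from scipy import stats

N, p0, y0 = 50, 0.5, 32                 # trials, hypothesized p, observed 1s
z = (abs(y0 - N * p0) - 0.5) / np.sqrt(N * p0 * (1 - p0))  # Yates adjustment
p_val = 2 * stats.norm.sf(z)            # two-sided p-value
```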
Suppose one has observations from N subjects in two treatment conditions. In each
treatment condition they make a series of yes/no decisions, each of which is either correct
or incorrect. Let $C_i(A)$ denote the number of correct answers provided by subject i in
treatment condition A, and similarly $C_i(B)$. Then it is reasonable to model
$C_i(A) \sim \text{Bi}(N, p)$. The researcher hopes to determine whether the fraction of correct
responses varies with the treatment.
Because we have a “within” design (this means that we have observations on each subject
in both conditions), a natural approach would be to use the permutations test as in the
Boys’ shoes example. But suppose we wanted to use a paired t-test. To do this we would
form a set of N differences (correct in treatment A minus correct in treatment B for each
subject) and test whether the mean of these differences is zero. Recall that to use this
test, we must be able to assume that the data from each subject has about the same
variance. This assumption is likely violated if there are large differences across subjects
in the fraction of correct answers. The reason is that the variance of a Binomial rv is
Np(1-p), so that if different subjects have different values of p, they will also have
different variances. How to circumvent this problem?
Transformation of variables. In many cases it is useful to transform the outcome variable
of interest. For example, the log of a variable whose distribution is skewed right is often
more closely normally distributed than the variable itself. In the present case, it turns out
that the transformation $\hat x_i = \arcsin(\sqrt{\hat p_i})$ is particularly useful, where $\hat p_i$ is the fraction of
correct answers given by subject i. It turns out that $\hat x_i$ (called a “score”) is a rv with
variance that does not depend on p. Hence, by transforming percentages to this score,
one can use the usual paired t-test with greater confidence that the results will be correct.
In particular, the test will be more sensitive to treatment differences after this correction
has been applied.
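A sketch of the score transformation followed by a paired t-test (the per-subject fractions correct are hypothetical):

```python
import numpy as np
from scipy import stats

p_A = np.array([0.60, 0.75, 0.40, 0.85, 0.55])   # fraction correct, treatment A
p_B = np.array([0.50, 0.70, 0.35, 0.80, 0.45])   # fraction correct, treatment B

x_A = np.arcsin(np.sqrt(p_A))   # variance-stabilizing "score"
x_B = np.arcsin(np.sqrt(p_B))
t_stat, p_val = stats.ttest_rel(x_A, x_B)        # paired t-test on the scores
```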
Sometimes one is interested in knowing whether, within a given session and treatment,
success probabilities vary (e.g., testing for learning effects.)
Example: A subject makes a sequence of 20 yes/no decisions over a series of 10 identical
blocks (a total of 200 decisions.) The researcher wants to determine whether there is
evidence of learning. A first test of this might be to ask whether success probabilities
seem to remain fixed over the 10 blocks. The null hypothesis is that there is no learning
(success probabilities are constant across blocks) and the alternative is otherwise. The
data are as follows.
Block:            1   2   3   4   5   6   7   8   9   10
Number correct:   4   3   2   5   7   8   6   7   12  10
Under the null that the success probability is constant, we know that the number correct
in each block follows a Bi(N,p) = Bi(20,p) distribution. By pooling the data, one can
easily estimate $\hat p = 0.32$. Then, if $C_i$ represents the number correct in block i, we have
that
$z_i = \frac{C_i - 20(0.32)}{\sqrt{20(0.32)(1-0.32)}} \sim N(0,1)$
(at least approximately). It follows that
$\sum_{i=1}^{10} z_i^2 \sim \chi^2(9).$
In this example, it turns out that doing the summation gives a result
of 19.85, which is large relative to what one would expect from a random draw from a
chi-square(9) (it is significant at the 2.5% level.) This is evidence against the null
hypothesis that the success rate is constant across blocks, and is therefore evidence in
favor of learning.
Note that the test above looks similar to the chi-square test discussed earlier. In the
above case, the general form of the test is
$\sum_{i=1}^{K} \frac{(C_i - Np)^2}{Np(1-p)} \sim \chi^2(K-1).$
The form of the chi-square test discussed earlier, when applied to the above example, is
$\sum_{i=1}^{K} \frac{(C_i - Np)^2}{Np} \sim \chi^2(K-1).$
Note that these two tests are very similar whenever p is very small (because then (1-p) is
near unity).
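The learning-effect calculation above can be replicated in a few lines (the data are taken from the block table; the rest is a sketch):

```python
import numpy as np
from scipy import stats

C = np.array([4, 3, 2, 5, 7, 8, 6, 7, 12, 10])   # number correct per block
N = 20
p_hat = C.sum() / (N * len(C))                   # pooled estimate: 0.32

z = (C - N * p_hat) / np.sqrt(N * p_hat * (1 - p_hat))
stat = z @ z                                     # sum of squares: 19.85
p_val = stats.chi2.sf(stat, df=len(C) - 1)       # below 0.025, so reject H0
```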
Poisson distribution.
Let $\lambda = Np$. If $p \to 0$ and $N \to \infty$ while $\lambda = Np$ stays constant, then the Binomial
distribution becomes the Poisson distribution. The pdf of the Poisson distribution is:
$\Pr(y) = \frac{e^{-\lambda} \lambda^{y}}{y!}$
where y is the number of successes, or occurrences of an event. Note that the probability
depends only on the expected frequency, and not on the number of events or the
probability of success per event. Note also, by examining the Binomial distribution, that
the mean and variance of a Poisson distribution are the same.
The sum of Poisson random variables is again a Poisson random variable, with mean
equal to the sum of the means of the underlying random variables. That is, if
$y_1, y_2, \ldots, y_K$ are all distributed according to a Poisson distribution, then $\sum_{i=1}^{K} y_i$ is also a
Poisson random variable, with mean equal to $\sum_{i=1}^{K} \lambda_i$.
Example
A researcher is studying the effect of information events on behavior in an asset market
experiment. She wants information events to occur randomly during the experiment, but
also wants to ensure that 90% of two-minute blocks contain at least one information
event. Assuming that an information event takes only one second to occur, how many
information events should occur, on average, during the experiment?
There are 120 seconds in each two-minute block, but the probability of a random
information event occurring during any particular second is small. Hence, the Poisson
approximation is appropriate (large N, small p). In this case, the probability of not
getting an information event during a two-minute block is supposed to be 0.1. Hence,
$1 - e^{-\lambda} = 0.9$, or $\lambda = -\ln(0.1) \approx 2.3$. Thus the researcher should ensure that there are, on average,
2.3 information events per two-minute block. If there are 45 two-minute blocks in the
experiment, this means that the researcher should allocate about 45 × 2.3 ≈ 104 information
events randomly across the timeline of the experiment.
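The calculation, as a sketch:

```python
import numpy as np

lam = -np.log(0.1)       # P(no event) = exp(-lam) = 0.1  =>  lam ~= 2.30
total = 45 * lam         # ~104 events over 45 two-minute blocks
print(lam, total)
```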
Relationship between Poisson and the chi-square test
When $\lambda$ is not too small (say, greater than 5) the Poisson distribution can be
approximated well by a normal distribution with the same mean and variance (this makes
sense, given that the normal well approximates a Binomial distribution).
Hence, for each i,
$\frac{y_i - \lambda}{\sqrt{\lambda}} \sim N(0,1),$
at least approximately. Therefore,
$\sum_{i=1}^{K} \frac{(y_i - \lambda)^2}{\lambda} \sim \chi^2(K),$
and, when $\lambda$ is replaced by an estimated value $\hat\lambda$,
$\sum_{i=1}^{K} \frac{(y_i - \hat\lambda)^2}{\hat\lambda} \sim \chi^2(K-1).$
Because $\lambda$ is the expected frequency, this is a type of Chi-Square test (within the context
of contingency tables) as discussed above, where the expected frequency within each cell
is the same.
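As a closing sketch, the equal-expected-frequency version of this chi-square test (with hypothetical counts):

```python
import numpy as np
from scipy import stats

y = np.array([3, 7, 4, 6, 5])            # hypothetical event counts per cell
lam_hat = y.mean()                       # common expected frequency
stat = np.sum((y - lam_hat) ** 2 / lam_hat)
p_val = stats.chi2.sf(stat, df=len(y) - 1)
```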