BIOSTATISTICS AND COMPUTER APPLICATION II
I B.Sc MICRO BIOLOGY
UNIT I
CORRELATION AND REGRESSION
Contents
13.1 Introduction
13.2 Correlation
13.3 The Scatter Diagram
13.4 The Correlation Coefficient
13.5 Karl Pearson’s Correlation Coefficient
13.6 Relation between Regression Coefficients and Correlation Coefficient
13.7 Coefficient of Determination
13.8 Spearman’s Rank Correlation Coefficient
13.9 Tied Ranks
13.10 Regression
13.11 Linear Regression
13.12 Let us Sum Up
13.13 Lesson – End Activities
13.14 References
13.1 Introduction
There are situations where data appears as pairs of figures relating to two variables. A
correlation problem considers the joint variation of two measurements neither of which is
restricted by the experimenter. The regression problem discussed in this Lesson
considers the frequency distribution of one variable (called the dependent variable) when
another (independent variable) is held fixed at each of several levels.
Examples of correlation problems are found in the study of the relationship between IQ
and aggregate percentage of marks obtained by a person in the SSC examination, blood
pressure and metabolism or the relation between height and weight of individuals. In
these examples both variables are observed as they naturally occur, since neither variable
is fixed at predetermined levels.
Examples of regression problems can be found in the study of the yields of crops grown
with different amount of fertilizer, the length of life of certain animals exposed to
different levels of radiation, and so on. In these problems the variation in one
measurement is studied for particular levels of the other variable selected by the
experimenter.
13.2 Correlation
Correlation measures the degree of linear relation between variables. The existence of
correlation between variables does not necessarily mean that one is the cause of the
change in the other. It should be noted that correlation analysis merely helps in
determining the degree of association between two variables; it does not tell anything
about the cause and effect relationship. While interpreting the correlation
coefficient, it is necessary to see whether there is any cause and effect relationship
between the variables under study. If there is no such relationship, the observed
correlation is meaningless.
In correlation analysis, all variables are assumed to be random variables.
13.3 The Scatter Diagram
The first step in correlation and regression analysis is to visualize the relationship
between the variables. A scatter diagram is obtained by plotting the points (x1, y1),
(x2, y2), …, (xn, yn) on a two-dimensional plane. If the points are scattered around a
straight line, we may infer that there exists a linear relationship between the variables. If
the points are clustered around a straight line with negative slope, then there exists
negative correlation, or the variables are inversely related (i.e., when x increases y
decreases and vice versa). If the points are clustered around a straight line with positive
slope, then there exists positive correlation, or the variables are directly related (i.e., when
x increases y also increases and vice versa).
For example, we may have figures on advertisement expenditure (X) and Sales (Y) of a
firm for the last ten years, as shown in Table 1. When this data is plotted on a graph as in
Figure 1, we obtain a scatter diagram. A scatter diagram gives two very useful types of
information. First, we can observe patterns between the variables that indicate whether
they are related. Secondly, if the variables are related, we can get an idea of the kind of
relationship (linear or non-linear) that would describe it.
Table 1
Year-wise data on Advertisement Expenditure and Sales
Year  Advertisement Expenditure in thousand Rs. (X)  Sales in thousand Rs. (Y)
1988 50 700
1987 50 650
1986 50 600
1985 40 500
1984 30 450
1983 20 400
1982 20 300
1981 15 250
1980 10 210
1979 5 200
Correlation examines the first question of determining whether an association exists
between the two variables, and if it does, to what extent. Regression examines the second
question of establishing an appropriate relation between the variables.
Figure 1: Scatter diagram of Advertisement Expenditure (X, thousand Rs.) against Sales (Y, thousand Rs.) for the Table 1 data. [Plot omitted; the points rise together, suggesting a positive linear association.]
The scatter diagram may exhibit different kinds of patterns. Some typical patterns
indicating different correlations between two variables are shown in Figure 2.
Figure 2: Different Types of Association Between Variables. [Plots omitted]
(a) Positive correlation (r > 0)
(b) Negative correlation (r < 0)
(c) No correlation (r = 0)
(d) Non-linear association
13.4 The Correlation Coefficient
Definition and Interpretation
The correlation coefficient measures the degree of association between two variables X and Y.
Pearson's formula for the correlation coefficient is

r = [(1/n) Σ (X − X̄)(Y − Ȳ)] / (sx sy)    ……(18.1)

where r is the correlation coefficient between X and Y, sx and sy are the standard deviations of X
and Y respectively, and n is the number of pairs of values of X and Y in the given data. The
expression (1/n) Σ (X − X̄)(Y − Ȳ) is known as the covariance between X and Y. Here r is also
called Pearson's product moment correlation coefficient.
You should note that r is a dimensionless number whose numerical value lies between +1 and −1.
Positive values of r indicate positive (or direct) correlation between the two variables X and Y,
i.e. as X increases Y will also increase, or as X decreases Y will also decrease. Negative values of
r indicate negative (or inverse) correlation, meaning that an increase in one variable results in a
decrease in the value of the other variable. A zero correlation means that there is no association
between the two variables. Figure 2 shows a number of scatter plots with the corresponding values
of the correlation coefficient r.
The following form for carrying out computations of the correlation coefficient is perhaps more
convenient:

r = Σ xy / √(Σ x² Σ y²)    ……(18.2)

where
x = X − X̄ = deviation of a particular X value from the mean X̄
y = Y − Ȳ = deviation of a particular Y value from the mean Ȳ
Equation (18.2) can be derived from equation (18.1) by substituting for sx and sy as follows:

sx = √[(1/n) Σ (X − X̄)²] and sy = √[(1/n) Σ (Y − Ȳ)²]    ……(18.3)
13.5 Karl Pearson’s Correlation Coefficient
If (x1, y1), (x2, y2), …, (xn, yn) are n given observations, then Karl Pearson's correlation
coefficient is defined as

r = Sxy / (Sx Sy)

where Sxy is the covariance and Sx, Sy are the standard deviations of x and y respectively.
That is,

r = [(1/n) Σ xi yi − x̄ ȳ] / √{[(1/n) Σ xi² − x̄²] [(1/n) Σ yi² − ȳ²]}

The value of r lies between −1 and 1; that is, −1 ≤ r ≤ 1. When r = 1, there exists a perfect
positive linear relation between x and y; when r = −1, there exists a perfect negative linear
relation between x and y; when r = 0, there is no linear relationship between x and y.
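The covariance form of r and the deviation form of equation (18.2) always agree; a small sketch checking this numerically (the data is illustrative, not from the text):

```python
# Sketch (illustrative data) checking the two equivalent forms of r:
# the covariance form r = Sxy/(Sx*Sy), and the deviation form
# r = sum(xy)/sqrt(sum(x^2)*sum(y^2)) of equation (18.2).
from math import sqrt

def r_from_sd(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    s_x = sqrt(sum(x * x for x in xs) / n - x_bar ** 2)
    s_y = sqrt(sum(y * y for y in ys) / n - y_bar ** 2)
    return s_xy / (s_x * s_y)

def r_from_deviations(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy))

xs = [1, 2, 3, 4, 5]   # made-up data for illustration
ys = [2, 4, 5, 4, 5]
print(round(r_from_sd(xs, ys), 6), round(r_from_deviations(xs, ys), 6))  # equal
```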
13.6 Relation between Regression Coefficients and Correlation Coefficient
The correlation coefficient is the geometric mean of the regression coefficients.
We know that byx = Sxy / Sx² and bxy = Sxy / Sy².
The geometric mean of byx and bxy is

√(byx bxy) = √[Sxy² / (Sx² Sy²)] = Sxy / (Sx Sy) = r,

the correlation coefficient. Also note that the two regression coefficients always have the
same sign, so the sign of the correlation coefficient is the same as the sign of the
regression coefficients.
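This relation is easy to verify numerically; a sketch with illustrative (made-up) data, taking the sign of r from the common sign of the slopes:

```python
# Numerical check that r is the geometric mean of the regression
# coefficients: byx = Sxy/Sx^2, bxy = Sxy/Sy^2, r = ±sqrt(byx*bxy)
# with the sign of Sxy. Illustrative data, not from the text.
from math import sqrt, copysign

xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 5, 4]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
s_x2 = sum(x * x for x in xs) / n - x_bar ** 2
s_y2 = sum(y * y for y in ys) / n - y_bar ** 2

byx, bxy = s_xy / s_x2, s_xy / s_y2
r = s_xy / sqrt(s_x2 * s_y2)
# geometric mean of the two slopes, carrying their common sign
r_from_slopes = copysign(sqrt(byx * bxy), s_xy)
print(round(r, 4), round(r_from_slopes, 4))  # identical values
```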
13.7 Coefficient of Determination
The coefficient of determination is the square of the correlation coefficient, and it gives
the proportion of the variation in y explained by x. That is, the coefficient of
determination is the ratio of the explained variance to the total variance. For example,
r² = 0.879 means that 87.9% of the total variation in y is explained by x. When r² = 1, all
the points on the scatter diagram fall on the regression line and the entire variation is
explained by the straight line. On the other hand, if r² = 0, none of the points on the
scatter diagram falls on the regression line, meaning that there is no linear relationship
between the variables.
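The "explained over total variance" reading of r² can be checked directly: fit the line of y on x, then compare r² with the ratio of the explained sum of squares to the total sum of squares. A sketch with illustrative data:

```python
# Sketch linking r^2 to explained variance: fit y = a + b*x by least
# squares, then check that r^2 equals SSR/SST (explained over total
# sum of squares). Illustrative data, not from the text.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
s_x2 = sum(x * x for x in xs) / n - x_bar ** 2
s_y2 = sum(y * y for y in ys) / n - y_bar ** 2

b = s_xy / s_x2            # slope of the regression line of y on x
a = y_bar - b * x_bar      # intercept
fitted = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)      # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)  # variation explained by the line
r2 = s_xy ** 2 / (s_x2 * s_y2)               # square of the correlation coefficient
print(round(r2, 3), round(ssr / sst, 3))     # the two proportions agree
```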
Example: Consider the following data:
X: 15 16 17 18 19 20
Y: 80 75 60 40 30 20
1. Fit both regression lines
2. Find the correlation coefficient
3. Verify the correlation coefficient is the geometric mean of the regression coefficients
4. Find the value of y when x = 17.5
Solution:
X    Y    XY     X²    Y²
15   80   1200   225   6400
16   75   1200   256   5625
17   60   1020   289   3600
18   40    720   324   1600
19   30    570   361    900
20   20    400   400    400
Total: Σx = 105, Σy = 305, Σxy = 5110, Σx² = 1855, Σy² = 18525

x̄ = Σx/n = 105/6 = 17.5,  ȳ = Σy/n = 305/6 = 50.83

Sxy = (1/n) Σ xi yi − x̄ ȳ = 5110/6 − 17.5 × 50.83 = −37.86

Sx² = (1/n) Σ xi² − x̄² = 1855/6 − 17.5² = 2.92

Sy² = (1/n) Σ yi² − ȳ² = 18525/6 − 50.83² = 503.81

byx = Sxy/Sx² = −37.86/2.92 = −12.96 and bxy = Sxy/Sy² = −37.86/503.81 = −0.075

1. The regression line of y on x is y − ȳ = (Sxy/Sx²)(x − x̄),
i.e., y − 50.83 = −12.96(x − 17.5)
y = −12.96x + 277.63
The regression line of x on y is x − x̄ = (Sxy/Sy²)(y − ȳ),
i.e., x − 17.5 = −0.075(y − 50.83)
x = −0.075y + 21.31

2. Correlation coefficient, r = Sxy/(Sx Sy) = −37.86/(1.71 × 22.45) = −0.986

3. byx × bxy = (−12.96) × (−0.075) = 0.972, and √0.972 = 0.986. Since both regression
coefficients are negative, r = −0.986, which verifies that the correlation coefficient is the
geometric mean of the regression coefficients.

4. To predict the value of y, use the regression line of y on x.
When x = 17.5, y = −12.96 × 17.5 + 277.63 = 50.83
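The worked example can be re-run at full precision. The hand computation rounds ȳ to 50.83, which gives byx = −12.96 and r = −0.986; unrounded arithmetic gives byx = −13.0 exactly and r ≈ −0.989. The small gap is rounding, not an error in the method.

```python
# Re-running the worked example without rounding intermediate values.
from math import sqrt

xs = [15, 16, 17, 18, 19, 20]
ys = [80, 75, 60, 40, 30, 20]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
s_x2 = sum(x * x for x in xs) / n - x_bar ** 2
s_y2 = sum(y * y for y in ys) / n - y_bar ** 2

byx = s_xy / s_x2          # slope of y on x
a = y_bar - byx * x_bar    # intercept of y on x
r = s_xy / sqrt(s_x2 * s_y2)
print(round(byx, 2), round(a, 2), round(r, 3))
# the regression line of y on x passes through (x_bar, y_bar):
print("prediction at x = 17.5:", round(a + byx * 17.5, 2))
```

The prediction at x = 17.5 equals ȳ, as in item 4 of the solution.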
Short-Cut Method: The correlation coefficient is invariant under linear transformations.
Let us take the transformations u = x − 18 and v = (y − 40)/10.

X    Y    u    v     uv    u²   v²
15   80   −3   4    −12    9    16
16   75   −2   3.5   −7    4    12.25
17   60   −1   2     −2    1    4
18   40    0   0      0    0    0
19   30    1   −1    −1    1    1
20   20    2   −2    −4    4    4
Total: Σu = −3, Σv = 6.5, Σuv = −26, Σu² = 19, Σv² = 37.25

ū = Σu/n = −3/6 = −0.5,  v̄ = Σv/n = 6.5/6 = 1.083

Suv = (1/n) Σ ui vi − ū v̄ = −26/6 − (−0.5)(1.083) = −3.79

Su² = (1/n) Σ ui² − ū² = 19/6 − (−0.5)² = 2.92

Sv² = (1/n) Σ vi² − v̄² = 37.25/6 − 1.083² = 5.04

bvu = Suv/Su² = −3.79/2.92 = −1.297 and buv = Suv/Sv² = −3.79/5.04 = −0.75

1. The regression line of v on u is v − v̄ = bvu(u − ū),
i.e., v − 1.083 = −1.297(u + 0.5)
v = −1.297u + 0.4345
Therefore, the regression line of y on x is
(y − 40)/10 = −1.297(x − 18) + 0.4345
i.e., y = −12.97x + 277.8
The regression line of u on v is u − ū = buv(v − v̄),
i.e., u + 0.5 = −0.75(v − 1.083)
u = −0.75v + 0.31225
Therefore, the regression line of x on y is
x − 18 = −0.75(y − 40)/10 + 0.31225
i.e., x = −0.075y + 21.31

2. Correlation coefficient, r = Suv/(Su Sv) = −3.79/(1.71 × 2.24) = −0.99, agreeing with
the direct computation up to rounding of intermediate values.

3. bvu × buv = (−1.297) × (−0.75) = 0.973, and √0.973 = 0.986. Since both regression
coefficients are negative, r = −0.986, as obtained by the direct method.
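The premise of the short-cut method, that r is unchanged by a linear transformation of either variable, is easy to confirm in code (a negative scale factor would only flip the sign):

```python
# Check that Pearson's r is invariant under the linear transformations
# u = x - 18 and v = (y - 40)/10 used in the short-cut method.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    s_x = sqrt(sum(x * x for x in xs) / n - x_bar ** 2)
    s_y = sqrt(sum(y * y for y in ys) / n - y_bar ** 2)
    return s_xy / (s_x * s_y)

xs = [15, 16, 17, 18, 19, 20]
ys = [80, 75, 60, 40, 30, 20]
us = [x - 18 for x in xs]
vs = [(y - 40) / 10 for y in ys]
print(round(pearson_r(xs, ys), 6), round(pearson_r(us, vs), 6))  # same value
```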
13.8 Spearman’s Rank Correlation Coefficient
Sometimes the characteristics whose possible correlation is being investigated cannot be
measured, but individuals can only be ranked on the basis of the characteristic under
study. We then have two sets of ranks available for working out the correlation
coefficient. Sometimes the data on one variable may be in the form of ranks while the
data on the other variable are in the form of measurements which can be converted into
ranks. Thus, when both variables are ordinal, or when the data are available in ordinal
form irrespective of the type of variable, we use the rank correlation coefficient.
Spearman's rank correlation coefficient is defined as

r = 1 − 6 Σ di² / [n(n² − 1)]
Example: Ten competitors in a beauty contest were ranked by two judges in the following
orders:
First judge: 1 6 5 10 3 2 4 9 7 8
Second judge: 3 5 8 4 7 10 2 1 6 9
Find the correlation between the rankings.
Solution:
xi   yi   di = xi − yi   di²
1    3    −2     4
6    5     1     1
5    8    −3     9
10   4     6    36
3    7    −4    16
2   10    −8    64
4    2     2     4
9    1     8    64
7    6     1     1
8    9    −1     1
Σ di² = 200

r = 1 − 6 Σ di² / [n(n² − 1)] = 1 − (6 × 200) / (10 × (10² − 1)) = −0.212

That is, the two judges' opinions of the contestants are opposite to each other.
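The judges' example translates directly into a few lines of Python:

```python
# Spearman's coefficient for the two judges' rankings, straight from
# r = 1 - 6*sum(d_i^2) / (n*(n^2 - 1)).
judge1 = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
judge2 = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]

n = len(judge1)
d2 = sum((a - b) ** 2 for a, b in zip(judge1, judge2))
r = 1 - 6 * d2 / (n * (n * n - 1))
print(d2, round(r, 3))  # 200 -0.212
```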
13.9 Tied Ranks
Sometimes, when there is more than one item with the same value, a common rank is given to
such items. This rank is the average of the ranks which these items would have received had
they differed slightly from each other. When this is done, the coefficient of rank correlation
needs some correction, because the formula above is based on the supposition that the ranks of
the various items are different. If mi is the number of items in the ith group of tied ranks,
then

r = 1 − 6[Σ di² + Σ (mi³ − mi)/12] / [n(n² − 1)]
Example: Calculate the rank correlation coefficient from the sales and expenses of 10 firms
given below:
Sales(X): 50 50 55 60 65 65 65 60 60 50
Expenses(Y): 11 13 14 16 16 15 15 14 13 13
Solution:
x    R1   y    R2    d = R1 − R2   d²
50   9    11   10    −1      1
50   9    13    8     1      1
55   7    14    5.5   1.5    2.25
60   5    16    1.5   3.5   12.25
65   2    16    1.5   0.5    0.25
65   2    15    3.5  −1.5    2.25
65   2    15    3.5  −1.5    2.25
60   5    14    5.5  −0.5    0.25
60   5    13    8    −3      9
50   9    13    8     1      1
Σ d² = 31.5

Here there are 7 groups of tied ranks: m1 = 3, m2 = 3, m3 = 3, m4 = 2, m5 = 2, m6 = 2, m7 = 3.

r = 1 − 6[Σ di² + Σ (mi³ − mi)/12] / [n(n² − 1)]
= 1 − 6[31.5 + ((3³−3) + (3³−3) + (3³−3) + (2³−2) + (2³−2) + (2³−2) + (3³−3))/12] / [10(10² − 1)]
= 1 − 6(31.5 + 9.5)/990
= 0.75
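The tied-rank procedure, assigning each tied group the average of the ranks it occupies and then applying the corrected formula, can be sketched as follows (rank 1 is given to the largest value, as in the example):

```python
# Tie-corrected rank correlation for the sales/expenses example:
# tied values share the average of their ranks, and sum((m^3 - m)/12)
# over tied groups is added to sum(d^2) in the usual formula.
from collections import Counter

def average_ranks(values):
    # rank 1 = largest value; tied values share the average of their ranks
    order = sorted(values, reverse=True)
    first = {}  # value -> first (1-based) position in the sorted order
    for pos, v in enumerate(order, start=1):
        first.setdefault(v, pos)
    counts = Counter(values)
    return [first[v] + (counts[v] - 1) / 2 for v in values]

def tie_correction(values):
    return sum(m ** 3 - m for m in Counter(values).values() if m > 1) / 12

sales    = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]

r1, r2 = average_ranks(sales), average_ranks(expenses)
n = len(sales)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
cf = tie_correction(sales) + tie_correction(expenses)
r = 1 - 6 * (d2 + cf) / (n * (n * n - 1))
print(d2, cf, round(r, 2))  # 31.5 9.5 0.75
```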
Exercises
1. A company selling household appliances wants to determine if there is any
relationship between advertising expenditures and sales. The following data was
compiled for 6 major sales regions. The expenditure is in thousands of rupees and the
sales are in millions of rupees.
Region : 1 2 3 4 5 6
Expenditure(X): 40 45 80 20 15 50
Sales (Y): 25 30 45 20 20 40
a) Compute the line of regression to predict sales
b) Compute the expected sales for a region where Rs.72000 is being spent on
advertising
2. The following data represent the scores of 10 students in the final examination in the
subjects of Economics and Finance.
Economics: 61 78 77 97 65 95 30 74 55
Finance: 84 70 93 93 77 99 43 80 67
a) Compute the correlation coefficient.
3. Calculate the rank correlation coefficient from the sales and expenses of 9 firms given
below:
Sales(X): 42 40 54 62 55 65 65 66 62
Expenses(Y): 10 18 18 17 17 14 13 10 13
13.10 Regression
In industry and business today, large amounts of data are continuously being generated.
This may be data pertaining, for instance, to a company's annual production, annual
sales, capacity utilisation, turnover, profits, manpower levels, absenteeism or some other
variable of direct interest to management. Or there might be technical data regarding a
process, such as temperature or pressure at certain crucial points, concentration of a
certain chemical in the product, or the breaking strength of the samples produced, or one of a
large number of quality attributes.
The accumulated data may be used to gain information about the system (as for instance
what happens to the output of the plant when temperature is reduced by half) or to
visually depict the past pattern of behaviours (as often happens in company’s annual
meetings where records of company progress are projected) or simply used for control
purposes to check if the process or system is operating as designed (as for instance in
quality control). Our interest in regression is primarily for the first purpose, mainly to
extract the main features of the relationships hidden in or implied by the mass of data.
What is Regression?
Suppose we consider the height and weight of adult males for some given population. If
we plot the pairs (X1, X2) = (height, weight), a diagram like Figure 3 will result. Such a
diagram, as discussed earlier in this Lesson, is conventionally called a scatter
diagram.
Note that for any given height there is a range of observed weights and vice-versa. This
variation will be partially due to measurement errors but primarily due to variations between
individuals. Thus no unique relationship between actual height and weight can be expected. But
we can note that average observed weight for a given observed height increases as height
increases. The locus of average observed weight for given observed height (as height varies) is
called the regression curve of weight on height. Let us denote it by X2=f(X1). There also exists
a regression curve of height on weight similarly defined which we can denote by X1=g(X2). Let
us assume that these two “curves” are both straight lines (which in general they may not be). In
general these two curves are not the same as indicated by the two lines in Figure 3.
Figure 3: Height and Weight of Thirty Adult Males (height in cm on the X1-axis, weight in
kg on the X2-axis), with the two regression lines X2 = f(X1) (weight on height) and
X1 = g(X2) (height on weight). [Scatter plot omitted]
A pair of random variables such as (height, weight) follows some sort of bivariate
probability distribution. When we are concerned with the dependence of a random
variable Y on a quantity X which is variable but not random, an equation that
relates Y to X is usually called a regression equation. Similarly, when more than one
independent variable is involved, we may wish to examine the way in which a response Y
depends on variables X1, X2, …, Xk. We determine a regression equation from data which
cover certain areas of the X-space, as Y = f(X1, X2, …, Xk).
13.11 Linear Regression
Regression analysis is a set of statistical techniques for analyzing the relationship
between two numerical variables. One variable is viewed as the dependent variable and
the other as the independent variable. The purpose of regression analysis is to understand
the direction and extent to which values of dependent variable can be predicted by the
corresponding values of the independent variable. The regression gives the nature of
relationship between the variables.
Often the relationship between two variables x and y is not an exact mathematical
relationship; rather, several y values corresponding to a given x value scatter about a
value that depends on the x value. For example, although not all persons of the same
height have exactly the same weight, their weights bear some relation to that height. On
average, people who are 6 feet tall are heavier than those who are 5 feet tall; the mean
weight in the population of 6-footers exceeds the mean weight in the population of 5-footers.
This relationship is modeled statistically as follows: For every value of x there is a
corresponding population of y values. The population mean of y for a particular value of
x is denoted by f(x). As a function of x it is called the regression function. If this
regression function is linear, it may be written as f(x) = a + bx. The quantities a and b are
parameters that define the relationship between x and f(x).
In conducting a regression analysis, we use a sample of data to estimate the values of
these parameters. The population of y values at a particular x value also has a variance;
the usual assumption is that the variance is the same for all values of x.
Principle of Least Squares
Principle of least squares is used to estimate the parameters of a linear regression. The
principle states that the best estimates of the parameters are those values of the
parameters, which minimize the sum of squares of residual errors. The residual error is
the difference between the actual value of the dependent variable and the estimated value
of the dependent variable.
Fitting of Regression Line y = a + bx
By the principle of least squares, the best estimates of a and b are

b = Sxy / Sx²  and  a = ȳ − b x̄

where Sxy is the covariance between x and y, defined as Sxy = (1/n) Σ xi yi − x̄ ȳ, and
Sx² is the variance of x, that is, Sx² = (1/n) Σ xi² − x̄².
Example: Fit a straight line y = a + bx for the following data.
Y 3.5 4.3 5.2 5.8 6.4 7.3 7.2 7.5 7.8 8.3
X 6 8 9 12 10 15 17 20 18 24
Solution:
Y     X    XY      X²
3.5   6     21     36
4.3   8     34.4   64
5.2   9     46.8   81
5.8  12     69.6  144
6.4  10     64    100
7.3  15    109.5  225
7.2  17    122.4  289
7.5  20    150    400
7.8  18    140.4  324
8.3  24    199.2  576
Total: Σy = 63.3, Σx = 139, Σxy = 957.3, Σx² = 2239

x̄ = 139/10 = 13.9,  ȳ = 63.3/10 = 6.33

Sxy = (1/n) Σ xi yi − x̄ ȳ = 957.3/10 − 13.9 × 6.33 = 7.743

Sx² = (1/n) Σ xi² − x̄² = 2239/10 − 13.9² = 30.69

So, b = Sxy/Sx² = 7.743/30.69 = 0.252

and a = ȳ − b x̄ = 6.33 − 0.252 × 13.9 = 2.8272

Therefore, the fitted straight line is y = 2.8272 + 0.252x
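The same least-squares fit in code, at full precision, gives b ≈ 0.2523 and a ≈ 2.823; the text's a = 2.8272 comes from carrying b rounded to 0.252 through the hand computation.

```python
# Least-squares fit for the worked example: b = Sxy/Sx^2 and
# a = y_bar - b*x_bar.
ys = [3.5, 4.3, 5.2, 5.8, 6.4, 7.3, 7.2, 7.5, 7.8, 8.3]
xs = [6, 8, 9, 12, 10, 15, 17, 20, 18, 24]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
s_x2 = sum(x * x for x in xs) / n - x_bar ** 2

b = s_xy / s_x2        # slope
a = y_bar - b * x_bar  # intercept
print(round(b, 4), round(a, 4))
```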
Two Regression Lines
There are two regression lines; regression line of y on x and regression line of x on y. In the
regression line of y on x, y is the dependent variable and x is the independent variable and it is
used to predict the value of y for a given value of x. But in the regression line of x on y, x is the
dependent variable and y is the independent variable and it is used to predict the value of x for a
given value of y.
The regression line of y on x is given by

y − ȳ = (Sxy / Sx²)(x − x̄)

and the regression line of x on y is given by

x − x̄ = (Sxy / Sy²)(y − ȳ)

Regression Coefficients
The quantity Sxy / Sx² is the regression coefficient of y on x and is denoted by byx; it gives
the slope of the line of y on x. That is, byx = Sxy / Sx² is the rate of change in y for a unit
change in x.
The quantity Sxy / Sy² is the regression coefficient of x on y and is denoted by bxy; it gives
the slope of the line of x on y. That is, bxy = Sxy / Sy² is the rate of change in x for a unit
change in y.
13.12 Let us Sum Up
In this Lesson the concepts of correlation and regression were discussed. Correlation is
the association between two variables. A scatter plot of the variables may suggest that
the two variables are related, and the value of Pearson's correlation coefficient r
quantifies this association. The correlation coefficient r may assume values from −1 to
+1. The sign indicates whether the association is direct (positive) or inverse (negative). A
numerical value of 1 indicates perfect association, while a value of zero indicates no
association. Regression is a device for establishing relationships between variables from
the given data. The discovered relationship can be used for predictive purposes. Some
simple examples were shown to illustrate the concepts.
13.13 Lesson – End Activities
1. Define correlation, Regression.
2. Give the purpose of drawing scatter diagram.
13.14 References
1. P.R. Vital – Business Mathematics and Statistics.
2. Gupta S.P. – Statistical Methods.
UNIT II
METHODS OF SAMPLING
Sampling
Sampling is that part of statistical practice concerned with the selection of a subset of individual
observations within a population of individuals intended to yield some knowledge about the
population of concern, especially for the purposes of making predictions based on statistical
inference. Sampling is an important aspect of data collection.
Researchers rarely survey the entire population for two reasons (Adèr, Mellenbergh, & Hand,
2008): the cost is too high, and the population is dynamic in that the individuals making up the
population may change over time. The three main advantages of sampling are that the cost is
lower, data collection is faster, and since the data set is smaller it is possible to ensure
homogeneity and to improve the accuracy and quality of the data.
Each observation measures one or more properties (such as weight, location, color) of
observable bodies distinguished as independent objects or individuals. In survey sampling,
survey weights can be applied to the data to adjust for the sample design. Results from
probability theory and statistical theory are employed to guide practice. In business and medical
research, sampling is widely used for gathering information about a population.[1]
Contents
1 Process
2 Population definition
3 Sampling frame
4 Probability and nonprobability sampling
5 Sampling methods
  5.1 Simple random sampling
  5.2 Systematic sampling
  5.3 Stratified sampling
  5.4 Probability proportional to size sampling
  5.5 Cluster sampling
  5.6 Matched random sampling
  5.7 Quota sampling
  5.8 Convenience sampling or Accidental sampling
  5.9 Line-intercept sampling
  5.10 Panel sampling
  5.11 Event sampling methodology
6 Replacement of selected units
7 Sample size
  7.1 Formulas
  7.2 Steps for using sample size tables
8 Sampling and data collection
9 Errors in sample surveys
  9.1 Sampling errors and biases
  9.2 Non-sampling error
10 Survey weights
11 History
12 See also
13 Notes
14 References
15 External links
Process
The sampling process comprises several stages:
- Defining the population of concern
- Specifying a sampling frame, a set of items or events possible to measure
- Specifying a sampling method for selecting items or events from the frame
- Determining the sample size
- Implementing the sampling plan
- Sampling and data collecting
Population definition
Successful statistical practice is based on focused problem definition. In sampling, this includes
defining the population from which our sample is drawn. A population can be defined as
including all people or items with the characteristic one wishes to understand. Because there is
very rarely enough time or money to gather information from everyone or everything in a
population, the goal becomes finding a representative sample (or subset) of that population.
Sometimes that which defines a population is obvious. For example, a manufacturer needs to
decide whether a batch of material from production is of high enough quality to be released to
the customer, or should be sentenced for scrap or rework due to poor quality. In this case, the
batch is the population.
Although the population of interest often consists of physical objects, sometimes we need to
sample over time, space, or some combination of these dimensions. For instance, an
investigation of supermarket staffing could examine checkout line length at various times, or a
study on endangered penguins might aim to understand their usage of various hunting grounds
over time. For the time dimension, the focus may be on periods or discrete occasions.
In other cases, our 'population' may be even less tangible. For example, Joseph Jagger studied the
behaviour of roulette wheels at a casino in Monte Carlo, and used this to identify a biased wheel.
In this case, the 'population' Jagger wanted to investigate was the overall behaviour of the wheel
(i.e. the probability distribution of its results over infinitely many trials), while his 'sample' was
formed from observed results from that wheel. Similar considerations arise when taking repeated
measurements of some physical characteristic such as the electrical conductivity of copper.
This situation often arises when we seek knowledge about the cause system of which the
observed population is an outcome. In such cases, sampling theory may treat the observed
population as a sample from a larger 'superpopulation'. For example, a researcher might study the
success rate of a new 'quit smoking' program on a test group of 100 patients, in order to predict
the effects of the program if it were made available nationwide. Here the superpopulation is
"everybody in the country, given access to this treatment" - a group which does not yet exist,
since the program isn't yet available to all.
Note also that the population from which the sample is drawn may not be the same as the
population about which we actually want information. Often there is large but not complete
overlap between these two groups due to frame issues etc. (see below). Sometimes they may be
entirely separate - for instance, we might study rats in order to get a better understanding of
human health, or we might study records from people born in 2008 in order to make predictions
about people born in 2009.
Time spent in making the sampled population and population of concern precise is often well
spent, because it raises many issues, ambiguities and questions that would otherwise have been
overlooked at this stage.
Sampling frame
In the most straightforward case, such as the sentencing of a batch of material from production
(acceptance sampling by lots), it is possible to identify and measure every single item in the
population and to include any one of them in our sample. However, in the more general case this
is not possible. There is no way to identify all rats in the set of all rats. Where voting is not
compulsory, there is no way to identify which people will actually vote at a forthcoming election
(in advance of the election).
These imprecise populations are not amenable to sampling in any of the ways below and to
which we could apply statistical theory.
As a remedy, we seek a sampling frame which has the property that we can identify every single
element and include any in our sample.[1] The most straightforward type of frame is a list of
elements of the population (preferably the entire population) with appropriate contact
information. For example, in an opinion poll, possible sampling frames include:
- Electoral register
- Telephone directory
Not all frames explicitly list population elements. For example, a street map can be used as a
frame for a door-to-door survey; although it doesn't show individual houses, we can select streets
from the map and then visit all houses on those streets. (One advantage of such a frame is that it
would include people who have recently moved and are not yet on the list frames discussed
above.)
The sampling frame must be representative of the population and this is a question outside the
scope of statistical theory demanding the judgment of experts in the particular subject matter
being studied. All the above frames omit some people who will vote at the next election and
contain some people who will not; some frames will contain multiple records for the same
person. People not in the frame have no prospect of being sampled. Statistical theory tells us
about the uncertainties in extrapolating from a sample to the frame. In extrapolating from frame
to population, its role is motivational and suggestive.
To the scientist, however, representative sampling is the only justified procedure for choosing
individual objects for use as the basis of generalization, and is therefore usually the only
acceptable basis for ascertaining truth.
—Andrew A. Marino[2]
It is important to understand this difference to steer clear of confusing prescriptions found in
many web pages.
In defining the frame, practical, economic, ethical, and technical issues need to be addressed.
The need to obtain timely results may prevent extending the frame far into the future.
The difficulties can be extreme when the population and frame are disjoint. This is a particular
problem in forecasting where inferences about the future are made from historical data. In fact,
in 1703, when Jacob Bernoulli proposed to Gottfried Leibniz the possibility of using historical
mortality data to predict the probability of early death of a living man, Gottfried Leibniz
recognized the problem in replying:
Nature has established patterns originating in the return of events but only for the most part. New
illnesses flood the human race, so that no matter how many experiments you have done on
corpses, you have not thereby imposed a limit on the nature of events so that in the future they
could not vary.
—Gottfried Leibniz
Kish posited four basic problems of sampling frames:
1. Missing elements: Some members of the population are not included in the frame.
2. Foreign elements: Non-members of the population are included in the frame.
3. Duplicate entries: A member of the population is surveyed more than once.
4. Groups or clusters: The frame lists clusters instead of individuals.
A frame may also provide additional 'auxiliary information' about its elements; when this
information is related to variables or groups of interest, it may be used to improve survey design.
For instance, an electoral register might include name and sex; this information can be used to
ensure that a sample taken from that frame covers all demographic categories of interest.
(Sometimes the auxiliary information is less explicit; for instance, a telephone number may
provide some information about location.)
Having established the frame, there are a number of ways for organizing it to improve efficiency
and effectiveness.
It's at this stage that the researcher should decide whether the sample is in fact to be the whole
population and would therefore be a census.
Probability and nonprobability sampling
A probability sampling scheme is one in which every unit in the population has a chance
(greater than zero) of being selected in the sample, and this probability can be accurately
determined. The combination of these traits makes it possible to produce unbiased estimates of
population totals, by weighting sampled units according to their probability of selection.
Example: We want to estimate the total income of adults living in a given street. We visit each
household in that street, identify all adults living there, and randomly select one adult from each
household. (For example, we can allocate each person a random number, generated from a
uniform distribution between 0 and 1, and select the person with the highest number in each
household). We then interview the selected person and find their income. People living on their
own are certain to be selected, so we simply add their income to our estimate of the total. But a
person living in a household of two adults has only a one-in-two chance of selection. To reflect
this, when we come to such a household, we would count the selected person's income twice
towards the total. (In effect, the person who is selected from that household is taken as
representing the person who isn't selected.)
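The weighting in this example is a Horvitz-Thompson-style estimator: each selected person's income is multiplied by the inverse of their selection probability. A minimal sketch in Python, with invented household incomes:

```python
import random

def estimate_total_income(households, rng):
    """Select one adult at random from each household, then weight that
    person's income by the number of adults in the household
    (the inverse of their 1/len(incomes) selection probability)."""
    total = 0.0
    for incomes in households:
        chosen = rng.choice(incomes)      # each adult has probability 1/len(incomes)
        total += chosen * len(incomes)    # weight = 1 / selection probability
    return total

# Hypothetical street: three households with 1, 2, and 3 adults.
households = [[10_000], [20_000, 30_000], [40_000, 50_000, 60_000]]
rng = random.Random(42)
print(estimate_total_income(households, rng))
```

Averaged over repeated samples, the estimate equals the true total (210,000 here), which is what makes the weighted estimator unbiased.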
In the above example, not everybody has the same probability of selection; what makes it a
probability sample is the fact that each person's probability is known. When every element in the
population does have the same probability of selection, this is known as an 'equal probability of
selection' (EPS) design. Such designs are also referred to as 'self-weighting' because all sampled
units are given the same weight.
Probability sampling includes: Simple Random Sampling, Systematic Sampling, Stratified
Sampling, Probability Proportional to Size Sampling, and Cluster or Multistage Sampling. These
various ways of probability sampling have two things in common:
1. Every element has a known nonzero probability of being sampled, and
2. the procedure involves random selection at some point.
Nonprobability sampling is any sampling method where some elements of the population have
no chance of selection (these are sometimes referred to as 'out of coverage'/'undercovered'), or
where the probability of selection can't be accurately determined. It involves the selection of
elements based on assumptions regarding the population of interest, which forms the criteria for
selection. Hence, because the selection of elements is nonrandom, nonprobability sampling does
not allow the estimation of sampling errors. These conditions give rise to exclusion bias, placing
limits on how much information a sample can provide about the population. Information about
the relationship between sample and population is limited, making it difficult to extrapolate from
the sample to the population.
Example: We visit every household in a given street, and interview the first person to answer the
door. In any household with more than one occupant, this is a nonprobability sample, because
some people are more likely to answer the door (e.g. an unemployed person who spends most of
their time at home is more likely to answer than an employed housemate who might be at work
when the interviewer calls) and it's not practical to calculate these probabilities.
Nonprobability Sampling includes: Accidental Sampling, Quota Sampling and Purposive
Sampling. In addition, nonresponse effects may turn any probability design into a nonprobability
design if the characteristics of nonresponse are not well understood, since nonresponse
effectively modifies each element's probability of being sampled.
Sampling methods
Within any of the types of frame identified above, a variety of sampling methods can be
employed, individually or in combination. Factors commonly influencing the choice between
these designs include:
 Nature and quality of the frame
 Availability of auxiliary information about units on the frame
 Accuracy requirements, and the need to measure accuracy
 Whether detailed analysis of the sample is expected
 Cost/operational concerns
Simple random sampling
In a simple random sample ('SRS') of a given size, all subsets of the frame of that size have an equal probability of selection. Each element of the frame thus has an equal probability of selection: the frame
is not subdivided or partitioned. Furthermore, any given pair of elements has the same chance of
selection as any other such pair (and similarly for triples, and so on). This minimises bias and
simplifies analysis of results. In particular, the variance between individual results within the
sample is a good indicator of variance in the overall population, which makes it relatively easy to
estimate the accuracy of results.
However, SRS can be vulnerable to sampling error because the randomness of the selection may
result in a sample that doesn't reflect the makeup of the population. For instance, a simple
random sample of ten people from a given country will on average produce five men and five
women, but any given trial is likely to overrepresent one sex and underrepresent the other.
Systematic and stratified techniques, discussed below, attempt to overcome this problem by
using information about the population to choose a more representative sample.
SRS may also be cumbersome and tedious when sampling from an unusually large target
population. In some cases, investigators are interested in research questions specific to subgroups
of the population. For example, researchers might be interested in examining whether cognitive
ability as a predictor of job performance is equally applicable across racial groups. SRS cannot
accommodate the needs of researchers in this situation because it does not provide subsamples of
the population. Stratified sampling, which is discussed below, addresses this weakness of SRS.
Simple random sampling is always an EPS design, but not all EPS designs are simple random
sampling.
Systematic sampling
Systematic sampling relies on arranging the target population according to some ordering
scheme and then selecting elements at regular intervals through that ordered list. Systematic
sampling involves a random start and then proceeds with the selection of every kth element from
then onwards. In this case, k=(population size/sample size). It is important that the starting point
is not automatically the first in the list, but is instead randomly chosen from within the first to the
kth element in the list. A simple example would be to select every 10th name from the telephone
directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10').
As long as the starting point is randomized, systematic sampling is a type of probability
sampling. It is easy to implement and the stratification induced can make it efficient, if the
variable by which the list is ordered is correlated with the variable of interest. 'Every 10th'
sampling is especially useful for efficient sampling from databases.
Example: Suppose we wish to sample people from a long street that starts in a poor district
(house #1) and ends in an expensive district (house #1000). A simple random selection of
addresses from this street could easily end up with too many from the high end and too few from
the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th
street number along the street ensures that the sample is spread evenly along the length of the
street, representing all of these districts. (Note that if we always start at house #1 and end at
#991, the sample is slightly biased towards the low end; by randomly selecting the start between
#1 and #10, this bias is eliminated.)
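The random-start-then-skip procedure can be sketched in a few lines, assuming the frame is simply numbered 1 to N:

```python
import random

def systematic_sample(pop_size, sample_size, rng):
    """Systematic sampling: a random start within the first k elements,
    then every k-th element thereafter."""
    k = pop_size // sample_size         # sampling interval ("skip")
    start = rng.randrange(1, k + 1)     # randomized start between 1 and k
    return list(range(start, pop_size + 1, k))

# Street of 1000 houses, sample of 100: every 10th house from a random start.
rng = random.Random(7)
sample = systematic_sample(1000, 100, rng)
print(sample[:5], "...", sample[-1])
```

Because the start is randomized, every house has the same one-in-ten probability of selection, which is exactly the EPS property described below.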
However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is
present and the period is a multiple or factor of the interval used, the sample is especially likely
to be unrepresentative of the overall population, making the scheme less accurate than simple
random sampling.
Example: Consider a street where the odd-numbered houses are all on the north (expensive) side
of the road, and the even-numbered houses are all on the south (cheap) side. Under the sampling
scheme given above, it is impossible to get a representative sample; either the houses sampled
will all be from the odd-numbered, expensive side, or they will all be from the even-numbered,
cheap side.
Another drawback of systematic sampling is that even in scenarios where it is more accurate than
SRS, its theoretical properties make it difficult to quantify that accuracy. (In the two examples of
systematic sampling that are given above, much of the potential sampling error is due to
variation between neighbouring houses - but because this method never selects two neighbouring
houses, the sample will not give us any information on that variation.)
As described above, systematic sampling is an EPS method, because all elements have the same
probability of selection (in the example given, one in ten). It is not 'simple random sampling'
because different subsets of the same size have different selection probabilities - e.g. the set
{4,14,24,...,994} has a one-in-ten probability of selection, but the set {4,13,24,34,...} has zero
probability of selection.
Systematic sampling can also be adapted to a non-EPS approach; for an example, see discussion
of PPS samples below.
Stratified sampling
Where the population embraces a number of distinct categories, the frame can be organized by
these categories into separate "strata." Each stratum is then sampled as an independent
sub-population, out of which individual elements can be randomly selected.[3] There are several
potential benefits to stratified sampling.
First, dividing the population into distinct, independent strata can enable researchers to draw
inferences about specific subgroups that may be lost in a more generalized random sample.
Second, utilizing a stratified sampling method can lead to more efficient statistical estimates
(provided that strata are selected based upon relevance to the criterion in question, instead of
availability of the samples). Even if a stratified sampling approach does not lead to increased
statistical efficiency, such a tactic will not result in less efficiency than would simple random
sampling, provided that each stratum is proportional to the group’s size in the population.
Third, it is sometimes the case that data are more readily available for individual, pre-existing
strata within a population than for the overall population; in such cases, using a stratified
sampling approach may be more convenient than aggregating data across groups (though this
may potentially be at odds with the previously noted importance of utilizing criterion-relevant
strata).
Finally, since each stratum is treated as an independent population, different sampling
approaches can be applied to different strata, potentially enabling researchers to use the approach
best suited (or most cost-effective) for each identified subgroup within the population.
There are, however, some potential drawbacks to using stratified sampling. First, identifying
strata and implementing such an approach can increase the cost and complexity of sample
selection, as well as leading to increased complexity of population estimates. Second, when
examining multiple criteria, stratifying variables may be related to some, but not to others,
further complicating the design, and potentially reducing the utility of the strata. Finally, in some
cases (such as designs with a large number of strata, or those with a specified minimum sample
size per group), stratified sampling can potentially require a larger sample than would other
methods (although in most cases, the required sample size would be no larger than would be
required for simple random sampling).
A stratified sampling approach is most effective when three conditions are met:
1. Variability within strata is minimized.
2. Variability between strata is maximized.
3. The variables upon which the population is stratified are strongly correlated with the
desired dependent variable.
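The mechanics of proportional stratified selection (allocate the sample across strata in proportion to stratum size, then draw an SRS within each stratum) can be sketched as follows; the strata and sizes are hypothetical:

```python
import random

def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of n across strata in proportion to size.
    Largest-remainder rounding keeps the allocations summing to n."""
    total = sum(stratum_sizes)
    raw = [n * s / total for s in stratum_sizes]
    alloc = [int(x) for x in raw]
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_remainder[: n - sum(alloc)]:
        alloc[i] += 1
    return alloc

def stratified_sample(strata, n, rng):
    """Draw an SRS of the allocated size independently within each stratum."""
    sizes = [len(s) for s in strata]
    return [rng.sample(s, k) for s, k in zip(strata, proportional_allocation(sizes, n))]

print(proportional_allocation([100, 300, 600], 100))  # → [10, 30, 60]
```

Because each stratum is sampled independently, a different design could be substituted within any one stratum, as the text notes.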
Advantages over other sampling methods
1. Focuses on important subpopulations and ignores irrelevant ones.
2. Allows use of different sampling techniques for different subpopulations.
3. Improves the accuracy/efficiency of estimation.
4. Permits greater balancing of statistical power of tests of differences between strata by
sampling equal numbers from strata varying widely in size.
Disadvantages
1. Requires selection of relevant stratification variables which can be difficult.
2. Is not useful when there are no homogeneous subgroups.
3. Can be expensive to implement.
Poststratification
Stratification is sometimes introduced after the sampling phase in a process called
"poststratification".[3] This approach is typically implemented due to a lack of prior knowledge of
an appropriate stratifying variable or when the experimenter lacks the necessary information to
create a stratifying variable during the sampling phase. Although the method is susceptible to the
pitfalls of post hoc approaches, it can provide several benefits in the right situation.
Implementation usually follows a simple random sample. In addition to allowing for
stratification on an ancillary variable, poststratification can be used to implement weighting,
which can improve the precision of a sample's estimates.[3]
Oversampling
Choice-based sampling is one of the stratified sampling strategies. In choice-based sampling,[4]
the data are stratified on the target and a sample is taken from each stratum so that the rare target
class will be more represented in the sample. The model is then built on this biased sample. The
effects of the input variables on the target are often estimated with more precision with the
choice-based sample even when a smaller overall sample size is taken, compared to a random
sample. The results usually must be adjusted to correct for the oversampling.
Probability proportional to size sampling
In some cases the sample designer has access to an "auxiliary variable" or "size measure",
believed to be correlated to the variable of interest, for each element in the population. These
data can be used to improve accuracy in sample design. One option is to use the auxiliary
variable as a basis for stratification, as discussed above.
Another option is probability-proportional-to-size ('PPS') sampling, in which the selection
probability for each element is set to be proportional to its size measure, up to a maximum of 1.
In a simple PPS design, these selection probabilities can then be used as the basis for Poisson
sampling. However, this has the drawback of variable sample size, and different portions of the
population may still be over- or under-represented due to chance variation in selections. To
address this problem, PPS may be combined with a systematic approach.
Example: Suppose we have six schools with populations of 150, 180, 200, 220, 260, and 490
students respectively (total 1500 students), and we want to use student population as the basis
for a PPS sample of size three. To do this, we could allocate the first school numbers 1 to 150,
the second school 151 to 330 (= 150 + 180), the third school 331 to 530, and so on to the last
school (1011 to 1500). We then generate a random start between 1 and 500 (equal to 1500/3)
and count through the school populations by multiples of 500. If our random start was 137, we
would select the schools which have been allocated numbers 137, 637, and 1137, i.e. the first,
fourth, and sixth schools.
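This cumulative-total method can be sketched directly; the school sizes and the random start of 137 come from the example above:

```python
def pps_systematic(sizes, n, start):
    """PPS systematic selection: allocate each unit a range of numbers
    proportional to its size, then select the units whose ranges contain
    start, start + interval, start + 2*interval, ..."""
    interval = sum(sizes) // n           # 1500 // 3 = 500 in the example
    bounds, running = [], 0
    for s in sizes:                      # unit i covers (bounds[i-1], bounds[i]]
        running += s
        bounds.append(running)
    selected = []
    for point in range(start, sum(sizes) + 1, interval):
        selected.append(next(i for i, b in enumerate(bounds) if point <= b))
    return selected

# The six schools from the example, start = 137: first, fourth, and sixth schools.
print(pps_systematic([150, 180, 200, 220, 260, 490], 3, 137))  # → [0, 3, 5]
```

Note how the largest school (490 students, numbers 1011 to 1500) is nearly certain to be selected, which is the intended size-proportional behaviour.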
The PPS approach can improve accuracy for a given sample size by concentrating sample on
large elements that have the greatest impact on population estimates. PPS sampling is commonly
used for surveys of businesses, where element size varies greatly and auxiliary information is
often available - for instance, a survey attempting to measure the number of guest-nights spent in
hotels might use each hotel's number of rooms as an auxiliary variable. In some cases, an older
measurement of the variable of interest can be used as an auxiliary variable when attempting to
produce more current estimates.
Cluster sampling
Sometimes it is cheaper to 'cluster' the sample in some way e.g. by selecting respondents from
certain areas only, or certain time-periods only. (Nearly all samples are in some sense 'clustered'
in time - although this is rarely taken into account in the analysis.)
Cluster sampling is an example of 'two-stage sampling' or 'multistage sampling': in the first stage
a sample of areas is chosen; in the second stage a sample of respondents within those areas is
selected.
This can reduce travel and other administrative costs. It also means that one does not need a
sampling frame listing all elements in the target population. Instead, clusters can be chosen from
a cluster-level frame, with an element-level frame created only for the selected clusters. Cluster
sampling generally increases the variability of sample estimates above that of simple random
sampling, depending on how the clusters differ between themselves, as compared with the
within-cluster variation.
However, one disadvantage of cluster sampling is that the precision of sample estimates
depends on the actual clusters chosen. If the chosen clusters are biased in a certain way,
inferences drawn about population parameters from these sample estimates will be far from
accurate.
Multistage sampling
Multistage sampling is a complex form of cluster sampling in which two
or more levels of units are embedded one in the other. The first stage consists of constructing the
clusters that will be used to sample from. In the second stage, a sample of primary units is
randomly selected from each cluster (rather than using all units contained in all selected
clusters). In following stages, in each of those selected clusters, additional samples of units are
selected, and so on. All ultimate units (individuals, for instance) selected at the last step of this
procedure are then surveyed.
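The two-stage case (clusters first, then units within the selected clusters) can be sketched as follows, with hypothetical area clusters of households:

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: SRS of clusters. Stage 2: SRS of units within each chosen
    cluster, rather than taking every unit in the selected clusters."""
    chosen = rng.sample(clusters, n_clusters)
    return [rng.sample(c, min(n_per_cluster, len(c))) for c in chosen]

# Hypothetical frame of 4 area clusters of households.
areas = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"], ["d1"]]
print(two_stage_sample(areas, 2, 2, random.Random(5)))
```

Only the selected clusters need an element-level frame, which is the cost advantage noted above.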
This technique is thus essentially the process of taking random samples of preceding random
samples. It is not as statistically efficient as true random sampling, but it solves many of the
practical problems inherent in random sampling. Moreover, it is an effective strategy because it
banks on multiple randomizations; as such, it is extremely useful.
Multistage sampling is used frequently when a complete list of all members of the population
does not exist or would be inappropriate to use. Moreover, by avoiding the use of all sample
units in all selected clusters, multistage sampling avoids the large, and perhaps unnecessary,
costs associated with traditional cluster sampling.
Matched random sampling
A method of assigning participants to groups in which pairs of participants are first matched on
some characteristic and then individually assigned randomly to groups.[5]
The procedure for matched random sampling can be summarized in the following contexts:
1. Two samples in which the members are clearly paired, or are matched explicitly by the
researcher. For example, IQ measurements or pairs of identical twins.
2. Those samples in which the same attribute, or variable, is measured twice on each
subject, under different circumstances. Commonly called repeated measures. Examples
include the times of a group of athletes for 1500m before and after a week of special
training; the milk yields of cows before and after being fed a particular diet.
Quota sampling
In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as
in stratified sampling. Then judgment is used to select the subjects or units from each segment
based on a specified proportion. For example, an interviewer may be told to sample 200 females
and 300 males between the age of 45 and 60.
It is this second step which makes the technique one of non-probability sampling. In quota
sampling the selection of the sample is non-random. For example, interviewers might be tempted
to interview those who look most helpful. The problem is that these samples may be biased
because not everyone gets a chance of selection. This non-random element is its greatest
weakness, and quota versus probability sampling has been a matter of controversy for many years.
Convenience sampling or Accidental Sampling
Convenience sampling (sometimes known as grab or opportunity sampling) is a type of
nonprobability sampling which involves the sample being drawn from that part of the population
which is close to hand. That is, a sample population selected because it is readily available and
convenient. Subjects may be included because the researcher happens to meet them, or may be
reached through technological means such as the internet or telephone. The researcher using
such a sample cannot scientifically make generalizations about the total population, because
the sample would not be representative enough. For example, if an interviewer were to conduct
such a survey at a shopping center early in the morning on a given day, the people he or she
could interview would be limited to those present at that time, and would not represent the
views of other members of society in the area; capturing those views would require conducting
the survey at different times of day and several times per week.
This type of sampling is most useful for pilot testing. Several important considerations for
researchers using convenience samples include:
1. Are there controls within the research design or experiment which can serve to lessen the
impact of a non-random convenience sample, thereby ensuring the results will be more
representative of the population?
2. Is there good reason to believe that a particular convenience sample would or should
respond or behave differently than a random sample from the same population?
3. Is the question being asked by the research one that can adequately be answered using a
convenience sample?
In social science research, snowball sampling is a similar technique, where existing study
subjects are used to recruit more subjects into the sample.
Line-intercept sampling
Line-intercept sampling is a method of sampling elements in a region whereby an element is
sampled if a chosen line segment, called a “transect”, intersects the element.
Panel sampling
Panel sampling is the method of first selecting a group of participants through a random
sampling method and then asking that group for the same information again several times over a
period of time. Therefore, each participant is given the same survey or interview at two or more
time points; each period of data collection is called a "wave". This sampling methodology is
often chosen for large scale or nation-wide studies in order to gauge changes in the population
with regard to any number of variables from chronic illness to job stress to weekly food
expenditures. Panel sampling can also be used to inform researchers about within-person health
changes due to age or help explain changes in continuous dependent variables such as spousal
interaction. There have been several proposed methods of analyzing panel sample data, including
MANOVA, growth curves, and structural equation modeling with lagged effects. For a more
thorough look at analytical techniques for panel data, see Johnson (1995).
Event sampling methodology
Event sampling methodology (ESM) is a new form of sampling method that allows researchers
to study ongoing experiences and events that vary across and within days in their naturally
occurring environment. Because of the frequent sampling of events inherent in ESM, it enables
researchers to measure the typology of activity and detect the temporal and dynamic fluctuations
of work experiences. The popularity of ESM as a new form of research design has increased in
recent years because it addresses a shortcoming of cross-sectional research: researchers can
now detect intra-individual variance across time. In ESM,
participants are asked to record their experiences and perceptions in a paper or electronic diary.
There are three types of ESM:
1. Signal contingent – random beeping notifies participants to record data. The advantage of
this type of ESM is minimization of recall bias.
2. Event contingent – records data when certain events occur
3. Interval contingent – records data according to the passing of a certain period of time
ESM has several disadvantages. One of the disadvantages of ESM is it can sometimes be
perceived as invasive and intrusive by participants. ESM also leads to possible self-selection
bias. It may be that only certain types of individuals are willing to participate in this type of
study, creating a non-random sample. Another concern relates to participant cooperation:
participants may not actually fill out their diaries at the specified times. Furthermore, ESM
may substantively change the phenomenon being studied. Reactivity or priming effects may
occur, such that repeated measurement may cause changes in the participants' experiences. This
method of sampling data is also highly vulnerable to common method variance.[6]
Further, it is important to think about whether or not an appropriate dependent variable is being
used in an ESM design. For example, it might be logical to use ESM in order to answer research
questions which involve dependent variables with a great deal of variation throughout the day.
Thus, variables such as change in mood, change in stress level, or the immediate impact of
particular events may be best studied using ESM methodology. However, it is not likely that
utilizing ESM will yield meaningful predictions when measuring someone performing a
repetitive task throughout the day or when dependent variables are long-term in nature (coronary
heart problems).
Replacement of selected units
Sampling schemes may be without replacement ('WOR' - no element can be selected more than
once in the same sample) or with replacement ('WR' - an element may appear multiple times in
the one sample). For example, if we catch fish, measure them, and immediately return them to
the water before continuing with the sample, this is a WR design, because we might end up
catching and measuring the same fish more than once. However, if we do not return the fish to
the water (e.g. if we eat the fish), this becomes a WOR design.
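In Python's standard library the WOR/WR distinction maps directly onto two different calls; the species names below are invented:

```python
import random

fish = ["cod", "eel", "carp", "pike", "perch"]
rng = random.Random(0)

# WOR: no fish can be "caught" (selected) twice in the same sample.
wor = rng.sample(fish, 3)

# WR: each draw is independent, so the same fish may appear more than once.
wr = [rng.choice(fish) for _ in range(3)]

print(wor, wr)
```

For small populations the distinction matters: WOR draws are not independent, since each selection removes an element from the pool.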
Sample size
Formulas, tables, and power function charts are well known approaches to determine sample
size.
Formulas
Where the frame and population are identical, statistical theory yields exact recommendations on
sample size.[7] However, where it is not straightforward to define a frame representative of the
population, it is more important to understand the cause system of which the population are
outcomes and to ensure that all sources of variation are embraced in the frame. A large number
of observations is of no value if major sources of variation are neglected in the study. In other
words, a sample group that merely matches the survey category and is easy to survey is not
sufficient.
Bartlett, Kotrlik, and Higgins (2001) published a paper, Organizational Research: Determining
Appropriate Sample Size in Survey Research, in the Information Technology, Learning, and
Performance Journal[8] that provides an explanation of Cochran's (1977) formulas. A
discussion and illustration of sample size formulas, including the formula for adjusting the
sample size for smaller populations, is included. A table is provided that can be used to select the
sample size for a research problem based on three alpha levels and a set error rate.
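Cochran's basic formula for estimating a proportion, together with the small-population adjustment the paper discusses, can be sketched as follows (the 95%/±5% inputs are the conventional choices, not values from the paper):

```python
import math

def cochran_n(z, p, e, population=None):
    """Cochran's (1977) sample-size formula n0 = z^2 * p * (1 - p) / e^2,
    with the optional finite-population correction n = n0 / (1 + (n0 - 1) / N)."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# 95% confidence (z = 1.96), maximum variability (p = 0.5), ±5% error:
print(cochran_n(1.96, 0.5, 0.05))                    # → 385
print(cochran_n(1.96, 0.5, 0.05, population=1000))   # → 278
```

Setting p = 0.5 maximizes p(1 - p) and therefore gives the most conservative (largest) sample size when the true proportion is unknown.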
Steps for using sample size tables
1. Postulate the effect size of interest, α, and β.
2. Check sample size table[9]
1. Select the table corresponding to the selected α
2. Locate the row corresponding to the desired power
3. Locate the column corresponding to the estimated effect size.
4. The intersection of the column and row is the minimum sample size required.
Sampling and data collection
Good data collection involves:
 Following the defined sampling process
 Keeping the data in time order
 Noting comments and other contextual events
 Recording non-responses
Most sampling books and papers written by non-statisticians focus only on the data collection
aspect, which is just a small though important part of the sampling process.
Errors in sample surveys
Survey results are typically subject to some error. Total errors can be classified into sampling
errors and non-sampling errors. The term "error" here includes systematic biases as well as
random errors.
Sampling errors and biases
Sampling errors and biases are induced by the sample design. They include:
1. Selection bias: When the true selection probabilities differ from those assumed in
calculating the results.
2. Random sampling error: Random variation in the results due to the elements in the
sample being selected at random.
Non-sampling error
Non-sampling errors are caused by other problems in data collection and processing. They
include:
1. Overcoverage: Inclusion of data from outside of the population.
2. Undercoverage: Sampling frame does not include elements in the population.
3. Measurement error: E.g. when respondents misunderstand a question, or find it difficult
to answer.
4. Processing error: Mistakes in data coding.
5. Non-response: Failure to obtain complete data from all selected individuals.
After sampling, a review should be held of the exact process followed in sampling, rather than
that intended, in order to study any effects that any divergences might have on subsequent
analysis. A particular problem is that of non-response.
Two major types of nonresponse exist: unit nonresponse (referring to lack of completion of any
part of the survey) and item nonresponse (submission or participation in survey but failing to
complete one or more components/questions of the survey).[10][11] In survey sampling, many of
the individuals identified as part of the sample may be unwilling to participate, not have the time
to participate (opportunity cost),[12] or survey administrators may not have been able to contact
them. In this case, there is a risk of differences, between respondents and nonrespondents,
leading to biased estimates of population parameters. This is often addressed by improving
survey design, offering incentives, and conducting follow-up studies which make a repeated
attempt to contact the unresponsive and to characterize their similarities and differences with the
rest of the frame.[13] The effects can also be mitigated by weighting the data when population
benchmarks are available or by imputing data based on answers to other questions.
Nonresponse is particularly a problem in internet sampling. Reasons for this problem include
improperly designed surveys,[11] over-surveying (or survey fatigue),[14][15] and the fact that
potential participants hold multiple e-mail addresses, which they don't use anymore or don't
check regularly. Web-based surveys also tend to demonstrate nonresponse bias; for example,
studies have shown that females and those from a white/Caucasian background are more likely to
respond than their counterparts.[16]
References
 Kish, Leslie (1995). Survey Sampling. Wiley. ISBN 0-471-10949-5.
 Korn, E. L., and Graubard, B. I. (1999). Analysis of Health Surveys. Wiley. ISBN 0-471-13773-1.
 Lohr, Sharon L. (1999). Sampling: Design and Analysis. Duxbury. ISBN 0-534-35361-4.
 Pedhazur, E., & Schmelkin, L. (1991). Measurement, Design, and Analysis: An Integrated Approach. New York: Psychology Press.
 Särndal, Carl-Erik, Swensson, Bengt, and Wretman, Jan (1992). Model Assisted Survey Sampling. Springer-Verlag. ISBN 0-387-40620-4.
 Stuart, Alan (1962). Basic Ideas of Scientific Sampling. Hafner Publishing Company, New York.
UNIT III
CONCEPT OF SAMPLING DISTRIBUTION
TESTS OF SIGNIFICANCE
Once sample data has been gathered through an observational study or experiment, statistical
inference allows analysts to assess evidence in favor of some claim about the population from
which the sample has been drawn. The methods of inference used to support or reject claims
based on sample data are known as tests of significance.
Every test of significance begins with a null hypothesis H0. H0 represents a theory that has been
put forward, either because it is believed to be true or because it is to be used as a basis for
argument, but has not been proved. For example, in a clinical trial of a new drug, the null
hypothesis might be that the new drug is no better, on average, than the current drug. We would
write H0: there is no difference between the two drugs on average.
The alternative hypothesis, Ha, is a statement of what a statistical hypothesis test is set up to
establish. For example, in a clinical trial of a new drug, the alternative hypothesis might be that
the new drug has a different effect, on average, compared to that of the current drug. We would
write Ha: the two drugs have different effects, on average. The alternative hypothesis might also
be that the new drug is better, on average, than the current drug. In this case we would write Ha:
the new drug is better than the current drug, on average.
The final conclusion once the test has been carried out is always given in terms of the null
hypothesis. We either "reject H0 in favor of Ha" or "do not reject H0"; we never conclude "reject
Ha", or even "accept Ha".
If we conclude "do not reject H0", this does not necessarily mean that the null hypothesis is
true; it only suggests that there is not sufficient evidence against H0 in favor of Ha. Rejecting
the null hypothesis, on the other hand, suggests that the alternative hypothesis may be true.
Hypotheses are always stated in terms of a population parameter, such as the mean μ. An
alternative hypothesis may be one-sided or two-sided. A one-sided hypothesis claims that the
parameter is either larger or smaller than the value given by the null hypothesis. A two-sided
hypothesis claims that the parameter is simply not equal to the value given by the null
hypothesis; the direction does not matter.
Hypotheses for a one-sided test for a population mean take the following form:

H0: μ = k    Ha: μ > k

or

H0: μ = k    Ha: μ < k.

Hypotheses for a two-sided test for a population mean take the following form:

H0: μ = k    Ha: μ ≠ k.
A confidence interval gives an estimated range of values which is likely to include an unknown
population parameter, the estimated range being calculated from a given set of sample data.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)
Example
Suppose a test has been given to all high school students in a certain state. The mean test score
for the entire state is 70, with standard deviation equal to 10. Members of the school board
suspect that female students have a higher mean score on the test than male students, because the
mean score x̄ from a random sample of 64 female students is equal to 73. Does this provide
strong evidence that the overall mean for female students is higher?
The null hypothesis H0 claims that there is no difference between the mean score for female
students and the mean for the entire population, so that μ = 70. The alternative hypothesis
claims that the mean for female students is higher than the entire student population mean, so
that μ > 70.
Significance Tests for Unknown Mean and Known Standard
Deviation
Once null and alternative hypotheses have been formulated for a particular claim, the next step is
to compute a test statistic. For claims about a population mean from a population with a
normal distribution or for any sample with large sample size n (for which the sample mean
will follow a normal distribution by the Central Limit Theorem), if the standard deviation σ
is known, the appropriate significance test is known as the z-test, where the test statistic
is defined as

z = (x̄ - μ0)/(σ/sqrt(n)).
The test statistic follows the standard normal distribution (with mean = 0 and standard deviation
= 1). The test statistic z is used to compute the P-value for the standard normal distribution, the
probability that a value at least as extreme as the test statistic would be observed under the null
hypothesis. Given the null hypothesis that the population mean μ is equal to a given value μ0,
the P-values for testing H0 against each of the possible alternative hypotheses are:

P(Z > z) for Ha: μ > μ0
P(Z < z) for Ha: μ < μ0
2P(Z > |z|) for Ha: μ ≠ μ0.
The probability is doubled for the two-sided test, since the two-sided alternative hypothesis
considers the possibility of observing extreme values on either tail of the normal distribution.
Example
In the test score example above, where the sample mean equals 73 and the population standard
deviation is equal to 10, the test statistic is computed as follows:
z = (73 - 70)/(10/sqrt(64)) = 3/1.25 = 2.4. Since this is a one-sided test, the P-value is equal to
the probability of observing a value greater than 2.4 in the standard normal distribution, or
P(Z > 2.4) = 1 - P(Z < 2.4) = 1 - 0.9918 = 0.0082. The P-value is less than 0.01, indicating that it
is highly unlikely that these results would be observed under the null hypothesis. The school
board can confidently reject H0 given this result, although they cannot conclude any additional
information about the mean of the distribution.
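The z statistic and P-value for this example can be checked with a few lines of Python, using only the standard library's NormalDist:

```python
from statistics import NormalDist

# One-sided z-test from the test-score example: sample mean 73,
# mu0 = 70 under H0, sigma = 10, n = 64, alternative mu > 70.
xbar, mu0, sigma, n = 73, 70, 10, 64
z = (xbar - mu0) / (sigma / n ** 0.5)
p_value = 1 - NormalDist().cdf(z)   # P(Z > z), the right-tail probability
print(round(z, 2))        # 2.4
print(round(p_value, 4))  # 0.0082
```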
Significance Levels
The significance level α for a given hypothesis test is a value for which a P-value less than or
equal to α is considered statistically significant. Typical values for α are 0.1, 0.05, and 0.01.
These values correspond to the probability of observing such an extreme value by chance. In the
test score example above, the P-value is 0.0082, so the probability of observing such a value by
chance is less than 0.01, and the result is significant at the 0.01 level.
In a one-sided test, α corresponds to the critical value z* such that P(Z > z*) = α. For example,
if the desired significance level for a result is 0.05, the corresponding value for z must be greater
than or equal to z* = 1.645 (or less than or equal to -1.645 for a one-sided alternative claiming
that the mean is less than the null hypothesis). For a two-sided test, we are interested in the
probability that 2P(Z > z*) = α, so the critical value z* corresponds to the α/2 significance
level. To achieve a significance level of 0.05 for a two-sided test, the absolute value of the test
statistic (|z|) must be greater than or equal to the critical value 1.96 (which corresponds to the
level 0.025 for a one-sided test).
Another interpretation of the significance level α, based in decision theory, is that α
corresponds to the value for which one chooses to reject or accept the null hypothesis H0. In
the above example, the value 0.0082 would result in rejection of the null hypothesis at the 0.01
level. The probability that this is a mistake -- that, in fact, the null hypothesis is true given the
z-statistic -- is less than 0.01. In decision theory, this is known as a Type I error. The probability of
a Type I error is equal to the significance level α, and the probability of rejecting the null
hypothesis when it is in fact false (a correct decision) is equal to 1 - β, where β is the probability
of a Type II error (discussed below). To minimize the probability of Type I error, the
significance level is generally chosen to be small.
Example
Of all of the individuals who develop a certain rash, suppose the mean recovery time for
individuals who do not use any form of treatment is 30 days with standard deviation equal to 8.
A pharmaceutical company manufacturing a certain cream wishes to determine whether the
cream shortens, extends, or has no effect on the recovery time. The company chooses a random
sample of 100 individuals who have used the cream, and determines that the mean recovery time
for these individuals was 28.5 days. Does the cream have any effect?
Since the pharmaceutical company is interested in any difference from the mean recovery time
for all individuals, the alternative hypothesis Ha is two-sided: μ ≠ 30. The test statistic is
calculated to be z = (28.5 - 30)/(8/sqrt(100)) = -1.5/0.8 = -1.875. The P-value for this statistic is
2P(Z > 1.875) = 2(1 - P(Z < 1.875)) = 2(1 - 0.9693) = 2(0.0307) = 0.0614. This is not significant
at the 0.05 level, although it is significant at the 0.1 level.
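This two-sided calculation can be reproduced in Python; the exact normal probability differs slightly from the value obtained with rounded table entries:

```python
from statistics import NormalDist

# Two-sided z-test from the recovery-time example:
# mu0 = 30, sigma = 8, n = 100, sample mean 28.5.
z = (28.5 - 30) / (8 / 100 ** 0.5)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # mass in both tails
print(round(z, 3))        # -1.875
print(round(p_value, 3))  # 0.061
```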
Decision theory is also concerned with a second error possible in significance testing, known as
Type II error. Contrary to Type I error, Type II error is the error made when the null hypothesis
is incorrectly accepted. The probability of correctly rejecting the null hypothesis when it is false,
the complement of the Type II error, is known as the power of a test. Formally defined, the
power of a test is the probability that a fixed level significance test will reject the null
hypothesis H0 when a particular alternative value of the parameter is true.
Example
In the test score example, for a fixed significance level of 0.10, suppose the school board wishes
to be able to reject the null hypothesis (that the mean μ = 70) if the mean for female students is in
fact 72. To determine the power of the test against this alternative, first note that the critical
value for rejecting the null hypothesis is z* = 1.282. The calculated value for z will be greater
than 1.282 whenever (x̄ - 70)/1.25 > 1.282, or x̄ > 71.6. The probability of rejecting the
null hypothesis (mean = 70) given that the alternative hypothesis (mean = 72) is true is
calculated by:

P(x̄ > 71.6 | μ = 72)
= P((x̄ - 72)/1.25 > (71.6 - 72)/1.25)
= P(Z > -0.32) = 1 - P(Z < -0.32) = 1 - 0.3745 = 0.6255.

The power is about 0.63, indicating that although the test is more likely than not to reject the
null hypothesis for this value, the probability of a Type II error is high.
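The power computation above amounts to a single normal-tail probability, shown here in Python:

```python
from statistics import NormalDist

# Power of the one-sided z-test against the alternative mu = 72
# (alpha = 0.10, sigma = 10, n = 64, so the standard error is 1.25).
se = 10 / 64 ** 0.5                        # 1.25
cutoff = 70 + 1.282 * se                   # reject H0 when xbar exceeds about 71.6
power = 1 - NormalDist(mu=72, sigma=se).cdf(cutoff)
print(round(power, 2))   # 0.62
```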
Significance Tests for Unknown Mean and Unknown
Standard Deviation
In most practical research, the standard deviation for the population of interest is not known. In
this case, the standard deviation σ is replaced by the estimated standard deviation s, also known
as the standard error. Since the standard error is an estimate for the true value of the standard
deviation, the distribution of the sample mean x̄ is no longer normal with mean μ and
standard deviation σ/sqrt(n). Instead, the sample mean follows the t distribution with mean μ
and standard deviation s/sqrt(n). The t distribution is also described by its degrees of freedom.
For a sample of size n, the t distribution will have n-1 degrees of freedom. The notation for a t
distribution with k degrees of freedom is t(k). As the sample size n increases, the t distribution
becomes closer to the normal distribution, since the standard error approaches the true standard
deviation σ for large n.
For claims about a population mean from a population with a normal distribution or for
any sample with large sample size n (for which the sample mean will follow a normal
distribution by the Central Limit Theorem) with unknown standard deviation, the
appropriate significance test is known as the t-test, where the test statistic is defined as

t = (x̄ - μ0)/(s/sqrt(n)).

The test statistic follows the t distribution with n-1 degrees of freedom. The test statistic t is used
to compute the P-value for the t distribution, the probability that a value at least as extreme as
the test statistic would be observed under the null hypothesis.
Example
The dataset "Normal Body Temperature, Gender, and Heart Rate" contains 130 observations of
body temperature, along with the gender of each individual and his or her heart rate. Using the
MINITAB "DESCRIBE" command provides the following information:
Descriptive Statistics

Variable      N     Mean  Median  Tr Mean  StDev  SE Mean
TEMP        130   98.249  98.300   98.253  0.733    0.064

Variable     Min      Max      Q1      Q3
TEMP      96.300  100.800  97.800  98.700
Since the normal body temperature is generally assumed to be 98.6 degrees Fahrenheit, one can
use the data to test the following one-sided hypothesis:
H0: μ = 98.6 vs Ha: μ < 98.6.
The t test statistic is equal to (98.249 - 98.6)/0.064 = -0.351/0.064 = -5.48. P(t< -5.48) = P(t>
5.48). The t distribution with 129 degrees of freedom may be approximated by the t distribution
with 100 degrees of freedom (found in Table E in Moore and McCabe), where P(t> 5.48) is less
than 0.0005. This result is significant at the 0.01 level and beyond, indicating that the null
hypothesis can be rejected with confidence.
To perform this t-test in MINITAB, the "TTEST" command with the "ALTERNATIVE"
subcommand may be applied as follows:
MTB > ttest mu = 98.6 c1;
SUBC > alt= -1.
T-Test of the Mean

Test of mu = 98.6000 vs mu < 98.6000

Variable     N      Mean   StDev  SE Mean      T       P
TEMP       130   98.2492  0.7332   0.0643  -5.45  0.0000

These results represent the exact calculations for the t(129) distribution.
Data source: Data presented in Mackowiak, P.A., Wasserman, S.S., and Levine, M.M. (1992), "A
Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body Temperature, and
Other Legacies of Carl Reinhold August Wunderlich," Journal of the American Medical
Association, 268, 1578-1580. Dataset available through the JSE Dataset Archive.
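The t statistic can be recomputed from the summary statistics alone; the small discrepancy with MINITAB's -5.45 comes from rounding in the reported values:

```python
from math import sqrt

# One-sample t statistic for the body-temperature data, from the
# reported summary statistics: mean 98.2492, SD 0.7332, n = 130.
mean, sd, n, mu0 = 98.2492, 0.7332, 130, 98.6
t = (mean - mu0) / (sd / sqrt(n))
print(round(t, 2))   # -5.46 (MINITAB reports -5.45 from the full data)
```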
Matched Pairs
In many experiments, one wishes to compare measurements from two populations. This is
common in medical studies involving control groups, for example, as well as in studies requiring
before-and-after measurements. Such studies have a matched pairs design, where the difference
between the two measurements in each pair is the parameter of interest.
Analysis of data from a matched pairs experiment compares the two measurements by
subtracting one from the other and basing test hypotheses upon the differences. Usually, the null
hypothesis H0 assumes that the mean of these differences is equal to 0, while the alternative
hypothesis Ha claims that the mean of the differences is not equal to zero (the alternative
hypothesis may be one- or two-sided, depending on the experiment). Using the differences
between the paired measurements as single observations, the standard t procedures with n-1
degrees of freedom are followed as above.
Example
In the "Helium Football" experiment, a punter was given two footballs to kick, one filled with air
and the other filled with helium. The punter was unaware of the difference between the balls, and
was asked to kick each ball 39 times. The balls were alternated for each kick, so each of the 39
trials contains one measurement for the air-filled ball and one measurement for the helium-filled
ball. Given that the conditions (leg fatigue, etc.) were basically the same for each kick within a
trial, a matched pairs analysis of the trials is appropriate. Is there evidence that the helium-filled
ball improved the kicker's performance?
In MINITAB, subtracting the air-filled measurement from the helium-filled measurement for
each trial and applying the "DESCRIBE" command to the resulting differences gives the
following results:
Descriptive Statistics

Variable       N   Mean  Median  Tr Mean  StDev  SE Mean
Hel. - Air    39   0.46    1.00     0.40   6.87     1.10

Variable        Min    Max     Q1     Q3
Hel. - Air   -14.00  17.00  -2.00   4.00
Using MINITAB to perform a t-test of the null hypothesis H0: μ = 0 vs Ha: μ > 0 gives the
following analysis:
T-Test of the Mean

Test of mu = 0.00 vs mu > 0.00

Variable     N   Mean  StDev  SE Mean     T     P
Hel. - A    39   0.46   6.87     1.10  0.42  0.34
The P-Value of 0.34 indicates that this result is not significant at any acceptable level. A 95%
confidence interval for the t-distribution with 38 degrees of freedom for the difference in
measurements is (-1.76, 2.69), computed using the MINITAB "TINTERVAL" command.
Data source: Lafferty, M.B. (1993), "OSU scientists get a kick out of sports controversy," The
Columbus Dispatch (November 21, 1993), B7. Dataset available through the Statlib Data and
Story Library (DASL).
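The matched-pairs t statistic follows directly from the summary statistics of the differences:

```python
from math import sqrt

# Matched-pairs t statistic for the helium-football differences:
# mean difference 0.46, SD 6.87, n = 39 paired trials.
d_mean, d_sd, n = 0.46, 6.87, 39
se = d_sd / sqrt(n)       # standard error of the mean difference
t = d_mean / se
print(round(se, 2))  # 1.1
print(round(t, 2))   # 0.42
```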
The Sign Test
Another method of analysis for matched pairs data is a distribution-free test known as the sign
test. This test does not require any normality assumptions about the data, and simply involves
counting the number of positive differences between the matched pairs and relating these to a
binomial distribution. The concept behind the sign test reasons that if there is no true difference,
then the probability of observing an increase in each pair is equal to the probability of observing
a decrease in each pair: p = 1/2. Assuming each pair is independent, the number of positive
differences under the null hypothesis follows the binomial distribution B(n,1/2), where n is the
number of pairs where some difference is observed.
To perform a sign test on matched pairs data, take the difference between the two
measurements in each pair and count the number of non-zero differences n. Of these, count
the number of positive differences X. Determine the probability of observing X positive
differences for a B(n,1/2) distribution, and use this probability as a P-value for the null
hypothesis.
Example
In the "Helium Football" example above, 2 of the 39 trials recorded no difference between kicks
for the air-filled and helium-filled balls. Of the remaining 37 trials, 20 recorded a positive
difference between the two kicks. Under the null hypothesis, p = 1/2, the differences would
follow the B(37,1/2) distribution. The probability of observing 20 or more positive differences is
P(X ≥ 20) = 1 - P(X ≤ 19) = 1 - 0.6286 = 0.3714. This value indicates that there is not strong
evidence against the null hypothesis, as observed previously with the t-test.
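The binomial tail probability in the sign test can be computed exactly with the standard library's math.comb:

```python
from math import comb

# Sign test for the helium-football data: 37 non-zero differences,
# 20 of them positive; under H0 the count is Binomial(37, 1/2).
n, x = 37, 20
p_value = sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n
print(round(p_value, 3))   # 0.371
```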
The T-Test
The t-test assesses whether the means of two groups are statistically different from each other.
This analysis is appropriate whenever you want to compare the means of two groups, and
especially appropriate as the analysis for the posttest-only two-group randomized experimental
design.
Figure 1. Idealized distributions for treated and comparison group posttest values.
Figure 1 shows the distributions for the treated (blue) and control (green) groups in a study.
Actually, the figure shows the idealized distribution -- the actual distribution would usually be
depicted with a histogram or bar graph. The figure indicates where the control and treatment
group means are located. The question the t-test addresses is whether the means are statistically
different.
What does it mean to say that the averages for two groups are statistically different? Consider the
three situations shown in Figure 2. The first thing to notice about the three situations is that the
difference between the means is the same in all three. But, you should also notice that the three
situations don't look the same -- they tell very different stories. The top example shows a case
with moderate variability of scores within each group. The second situation shows the high
variability case, and the third shows the case with low variability. Clearly, we would conclude
that the two groups appear most different or distinct in the bottom or low-variability case. Why?
Because there is relatively little overlap between the two bell-shaped curves. In the high
variability case, the group difference appears least striking because the two bell-shaped
distributions overlap so much.
Figure 2. Three scenarios for differences between means.
This leads us to a very important conclusion: when we are looking at the differences between
scores for two groups, we have to judge the difference between their means relative to the spread
or variability of their scores. The t-test does just this.
Statistical Analysis of the t-test
The formula for the t-test is a ratio. The top part of the ratio is just the difference between the
two means or averages. The bottom part is a measure of the variability or dispersion of the
scores. This formula is essentially another example of the signal-to-noise metaphor in research:
the difference between the means is the signal that, in this case, we think our program or
treatment introduced into the data; the bottom part of the formula is a measure of variability that
is essentially noise that may make it harder to see the group difference. Figure 3 shows the
formula for the t-test and how the numerator and denominator are related to the distributions.
Figure 3. Formula for the t-test.
The top part of the formula is easy to compute -- just find the difference between the means. The
bottom part is called the standard error of the difference. To compute it, we take the variance
for each group and divide it by the number of people in that group. We add these two values and
then take their square root. The specific formula is given in Figure 4:
Figure 4. Formula for the Standard error of the difference between the means.
Remember that the variance is simply the square of the standard deviation.
The final formula for the t-test is shown in Figure 5:
Figure 5. Formula for the t-test.
The t-value will be positive if the first mean is larger than the second and negative if it is smaller.
Once you compute the t-value you have to look it up in a table of significance to test whether the
ratio is large enough to say that the difference between the groups is not likely to have been a
chance finding. To test the significance, you need to set a risk level (called the alpha level). In
most social research, the "rule of thumb" is to set the alpha level at .05. This means that five
times out of a hundred you would find a statistically significant difference between the means
even if there was none (i.e., by "chance"). You also need to determine the degrees of freedom
(df) for the test. In the t-test, the degrees of freedom is the sum of the persons in both groups
minus 2. Given the alpha level, the df, and the t-value, you can look the t-value up in a standard
table of significance (available as an appendix in the back of most statistics texts) to determine
whether the t-value is large enough to be significant. If it is, you can conclude that the difference
between the means for the two groups is significant (even given the variability). Fortunately,
statistical computer programs routinely print the significance test results and save you the trouble
of looking them up in a table.
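As a sketch of the ratio described above (the group numbers here are invented for illustration), the t-value is the difference in means divided by the standard error of the difference:

```python
from math import sqrt

# Two-sample t statistic: difference between group means over the
# standard error of the difference (sqrt of the summed variance/n terms).
def two_sample_t(mean1, var1, n1, mean2, var2, n2):
    se_diff = sqrt(var1 / n1 + var2 / n2)   # standard error of the difference
    return (mean1 - mean2) / se_diff

# Hypothetical groups: treatment mean 54, control mean 50,
# both with variance 25 and 30 people per group (df = 30 + 30 - 2 = 58).
t = two_sample_t(54, 25, 30, 50, 25, 30)
print(round(t, 2))   # 3.1
```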
The t-test, one-way Analysis of Variance (ANOVA) and a form of regression analysis are
mathematically equivalent (see the statistical analysis of the posttest-only randomized
experimental design) and would yield identical results.
The F Distribution
The F distribution is an asymmetric distribution that has a minimum value
of 0, but no maximum value. The curve reaches a peak not far to the right of
0, and then gradually approaches the horizontal axis the larger the F value
is. The F distribution approaches, but never quite touches, the horizontal
axis.
The F distribution has two degrees of freedom: d1 for the numerator and
d2 for the denominator. For each combination of these degrees of freedom
there is a different F distribution. The F distribution is most spread out
when the degrees of freedom are small. As the degrees of freedom increase,
the F distribution becomes less dispersed.
Figure K.1 shows the shape of the distribution. The F value is on the
horizontal axis, with the probability for each F value being represented by
the vertical axis. The shaded area in the diagram represents the level of
significance α shown in the table.
There is a different F distribution for each combination of the degrees
of freedom of the numerator and denominator. Since there are so many F
distributions, the F tables are organized somewhat differently than the tables
for the other distributions. The three tables which follow are organized by
the level of significance. The first table gives the F values that are associated
with α = 0.10 of the area in the right tail of the distribution. The second
table gives the F values for α = 0.05 of the area in the right tail, and the
third table gives F values for the α = 0.01 level of significance. In each of
these tables, the F values are given for various combinations of degrees of
freedom.
In order to use the F table, first select the significance level to be used,
Figure K.1: The F distribution
and then determine the appropriate combination of degrees of freedom. For
example, if the α = 0.10 level of significance is selected, use the first F table.
If there are 5 degrees of freedom in the numerator, and 7 degrees of freedom
in the denominator, the F value from the table is 2.88. This means that
there is exactly 0.10 of the area under the F curve that lies to the right of
F = 2.88.
When the significance level is α = 0.05, use the second F table. If there
are 20 degrees of freedom in the numerator, and 5 degrees of freedom in the
denominator, then the critical F value is 4.56. This could be written

F(20, 5; 0.05) = 4.56.

That is, for 20 and 5 degrees of freedom, the F value that leaves exactly 0.05
of the area under the F curve in the right tail of the distribution is 4.56.
For the α = 0.01 level of significance, the third F table is used. Suppose
that there is 1 degree of freedom in the numerator and 12 degrees of freedom
in the denominator. Then

F(1, 12; 0.01) = 9.33.

An F value of 9.33 leaves exactly 0.01 of the area under the curve in the right
tail of the distribution when there are 1 and 12 degrees of freedom.
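The tabulated critical values can be checked against scipy's F distribution (assuming scipy is available); stats.f.ppf is the inverse cdf, so passing 1 - α leaves α in the right tail:

```python
from scipy import stats

# Right-tail critical values of the F distribution:
# stats.f.ppf(1 - alpha, d1, d2) leaves alpha in the right tail.
print(round(stats.f.ppf(0.90, 5, 7), 2))    # 2.88
print(round(stats.f.ppf(0.95, 20, 5), 2))   # 4.56
print(round(stats.f.ppf(0.99, 1, 12), 2))   # 9.33
```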
F Values for α = 0.10
d1
d2 1 2 3 4 5 6 7 8 9
1 39.86 49.5 53.59 55.83 57.24 58.2 58.91 59.44 59.86
2 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.38
3 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.24
4 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.94
5 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.32
6 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.96
7 3.59 3.26 3.07 2.96 2.88 2.83 2.78 2.75 2.72
8 3.46 3.11 2.92 2.81 2.73 2.67 2.62 2.59 2.56
9 3.36 3.01 2.81 2.69 2.61 2.55 2.51 2.47 2.44
10 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35
11 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.3 2.27
12 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21
13 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16
14 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12
15 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09
16 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06
17 3.03 2.64 2.44 2.31 2.22 2.15 2.10 2.06 2.03
18 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00
19 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98
20 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96
21 2.96 2.57 2.36 2.23 2.14 2.08 2.02 1.98 1.95
22 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.93
23 2.94 2.55 2.34 2.21 2.11 2.05 1.99 1.95 1.92
24 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.91
25 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89
26 2.91 2.52 2.31 2.17 2.08 2.01 1.96 1.92 1.88
27 2.90 2.51 2.30 2.17 2.07 2.00 1.95 1.91 1.87
28 2.89 2.50 2.29 2.16 2.06 2.00 1.94 1.90 1.87
29 2.89 2.50 2.28 2.15 2.06 1.99 1.93 1.89 1.86
30 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.85
40 2.84 2.44 2.23 2.09 2.00 1.93 1.87 1.83 1.79
60 2.79 2.39 2.18 2.04 1.95 1.87 1.82 1.77 1.74
120 2.75 2.35 2.13 1.99 1.90 1.82 1.77 1.72 1.68
inf 2.71 2.30 2.08 1.94 1.85 1.77 1.72 1.67 1.63
F Values for α = 0.10 (continued)
d1
d2 10 12 15 20 24 30 40 60 120 inf
1 60.19 60.71 61.22 61.74 62 62.26 62.53 62.79 63.06 63.33
2 9.39 9.41 9.42 9.44 9.45 9.46 9.47 9.47 9.48 9.49
3 5.23 5.22 5.20 5.18 5.18 5.17 5.16 5.15 5.14 5.13
4 3.92 3.90 3.87 3.84 3.83 3.82 3.80 3.79 3.78 3.76
5 3.30 3.27 3.24 3.21 3.19 3.17 3.16 3.14 3.12 3.10
6 2.94 2.90 2.87 2.84 2.82 2.80 2.78 2.76 2.74 2.72
7 2.70 2.67 2.63 2.59 2.58 2.56 2.54 2.51 2.49 2.47
8 2.54 2.50 2.46 2.42 2.40 2.38 2.36 2.34 2.32 2.29
9 2.42 2.38 2.34 2.30 2.28 2.25 2.23 2.21 2.18 2.16
10 2.32 2.28 2.24 2.20 2.18 2.16 2.13 2.11 2.08 2.06
11 2.25 2.21 2.17 2.12 2.10 2.08 2.05 2.03 2.00 1.97
12 2.19 2.15 2.10 2.06 2.04 2.01 1.99 1.96 1.93 1.90
13 2.14 2.10 2.05 2.01 1.98 1.96 1.93 1.90 1.88 1.85
14 2.10 2.05 2.01 1.96 1.94 1.91 1.89 1.86 1.83 1.80
15 2.06 2.02 1.97 1.92 1.90 1.87 1.85 1.82 1.79 1.76
16 2.03 1.99 1.94 1.89 1.87 1.84 1.81 1.78 1.75 1.72
17 2.00 1.96 1.91 1.86 1.84 1.81 1.78 1.75 1.72 1.69
18 1.98 1.93 1.89 1.84 1.81 1.78 1.75 1.72 1.69 1.66
19 1.96 1.91 1.86 1.81 1.79 1.76 1.73 1.70 1.67 1.63
20 1.94 1.89 1.84 1.79 1.77 1.74 1.71 1.68 1.64 1.61
21 1.92 1.87 1.83 1.78 1.75 1.72 1.69 1.66 1.62 1.59
22 1.90 1.86 1.81 1.76 1.73 1.70 1.67 1.64 1.60 1.57
23 1.89 1.84 1.80 1.74 1.72 1.69 1.66 1.62 1.59 1.55
24 1.88 1.83 1.78 1.73 1.70 1.67 1.64 1.61 1.57 1.53
25 1.87 1.82 1.77 1.72 1.69 1.66 1.63 1.59 1.56 1.52
26 1.86 1.81 1.76 1.71 1.68 1.65 1.61 1.58 1.54 1.50
27 1.85 1.80 1.75 1.70 1.67 1.64 1.60 1.57 1.53 1.49
28 1.84 1.79 1.74 1.69 1.66 1.63 1.59 1.56 1.52 1.48
29 1.83 1.78 1.73 1.68 1.65 1.62 1.58 1.55 1.51 1.47
30 1.82 1.77 1.72 1.67 1.64 1.61 1.57 1.54 1.50 1.46
40 1.76 1.71 1.66 1.61 1.57 1.54 1.51 1.47 1.42 1.38
60 1.71 1.66 1.60 1.54 1.51 1.48 1.44 1.40 1.35 1.29
120 1.65 1.60 1.55 1.48 1.45 1.41 1.37 1.32 1.26 1.19
inf 1.60 1.55 1.49 1.42 1.38 1.34 1.30 1.24 1.17 1.00
F Values for α = 0.05
d1
d2 1 2 3 4 5 6 7 8 9
1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5
2 18.51 19.00 19.16 19.25 19.3 19.33 19.35 19.37 19.38
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04
120 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96
inf 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88
F Values for α = 0.05 (continued)
d1
d2 10 12 15 20 24 30 40 60 120 inf
1 241.9 243.9 245.9 248.0 249.1 250.1 251.1 252.2 253.3 254.3
2 19.4 19.41 19.43 19.45 19.45 19.46 19.47 19.48 19.49 19.5
3 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53
4 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63
5 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.36
6 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67
7 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23
8 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93
9 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71
10 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54
11 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40
12 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30
13 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21
14 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13
15 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
16 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01
17 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96
18 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92
19 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88
20 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
21 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81
22 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78
23 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76
24 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
25 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71
26 2.22 2.15 2.07 1.99 1.95 1.90 1.85 1.80 1.75 1.69
27 2.20 2.13 2.06 1.97 1.93 1.88 1.84 1.79 1.73 1.67
28 2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.71 1.65
29 2.18 2.10 2.03 1.94 1.90 1.85 1.81 1.75 1.70 1.64
30 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
40 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
60 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
120 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
inf 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00
Chi-Square Distribution

Probability Density Function
The chi-square distribution results when ν independent variables with standard normal
distributions are squared and summed. The formula for the probability density function of
the chi-square distribution is

    f(x) = x^(ν/2 − 1) e^(−x/2) / (2^(ν/2) Γ(ν/2)),   for x ≥ 0

where ν is the shape parameter (the degrees of freedom) and Γ is the gamma function. The
formula for the gamma function is

    Γ(a) = ∫₀^∞ t^(a−1) e^(−t) dt

In a testing context, the chi-square distribution is treated as a
"standardized distribution" (i.e., no location or scale
parameters). However, in a distributional modeling context (as
with other probability distributions), the chi-square
distribution itself can be transformed with a location
parameter and a scale parameter.
[Plot omitted: the chi-square probability density function for 4 different values of the shape parameter ν.]
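The density formula above is easy to check numerically. The following is a minimal Python sketch (not part of the original notes; the function name chi2_pdf is ours); for ν = 2 the density reduces to (1/2)e^(−x/2):

```python
import math

def chi2_pdf(x, nu):
    """Chi-square density: x^(nu/2 - 1) e^(-x/2) / (2^(nu/2) Gamma(nu/2))."""
    return x ** (nu / 2 - 1) * math.exp(-x / 2) / (2 ** (nu / 2) * math.gamma(nu / 2))

# For nu = 2 this equals 0.5 * exp(-x/2); at x = 2 it is 0.5 * exp(-1)
print(chi2_pdf(2.0, 2))
```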
Cumulative Distribution Function

The formula for the cumulative distribution function of the chi-square distribution is

    F(x) = γ(ν/2, x/2) / Γ(ν/2),   for x ≥ 0

where Γ is the gamma function defined above and γ is the incomplete gamma function. The
formula for the incomplete gamma function is

    γ(a, x) = ∫₀^x t^(a−1) e^(−t) dt

[Plot omitted: the chi-square cumulative distribution function for the same values of ν as the pdf plots above.]
Percent Point Function

The formula for the percent point function of the chi-square
distribution does not exist in a simple closed form. It is
computed numerically.
[Plot omitted: the chi-square percent point function for the same values of ν as the pdf plots above.]
Other Probability Functions
Since the chi-square distribution is typically used to develop
hypothesis tests and confidence intervals and rarely for
modeling applications, we omit the formulas and plots for the
hazard, cumulative hazard, survival, and inverse survival
probability functions.
Common Statistics

Mean: ν
Median: approximately ν − 2/3 for large ν
Mode: ν − 2 (for ν ≥ 2)
Range: 0 to positive infinity
Standard Deviation: √(2ν)
Coefficient of Variation: √(2/ν)
Skewness: √(8/ν)
Kurtosis: 3 + 12/ν
Parameter Estimation
Since the chi-square distribution is typically used to develop
hypothesis tests and confidence intervals and rarely for
modeling applications, we omit any discussion of parameter
estimation.
Comments
The chi-square distribution is used in many cases for the
critical regions for hypothesis tests and in determining
confidence intervals. Two common examples are the chi-square test for independence in an RxC contingency table and
the chi-square test to determine if the standard deviation of a
population is equal to a pre-specified value.
Software
Most general purpose statistical software programs, including
Dataplot, support at least some of the probability functions for
the chi-square distribution.
UNIT IV
NON PARAMETRIC TESTS
3.1 Introduction
Nonparametric, or distribution free tests are so-called because the assumptions
underlying their use are “fewer and weaker than those associated with parametric tests” (Siegel
& Castellan, 1988, p. 34). To put it another way, nonparametric tests require few if any
assumptions about the shapes of the underlying population distributions. For this reason, they
are often used in place of parametric tests if/when one feels that the assumptions of the
parametric test have been too grossly violated (e.g., if the distributions are too severely skewed).
Discussion of some of the more common nonparametric tests follows.
3.2 The Sign test (for 2 repeated/correlated measures)
The sign test is one of the simplest nonparametric tests. It is for use with 2 repeated (or
correlated) measures (see the example below), and measurement is assumed to be at least
ordinal. For each subject, subtract the 2nd score from the 1st, and write down the sign of the
difference. (That is write “-” if the difference score is negative, and “+” if it is positive.) The
usual null hypothesis for this test is that there is no difference between the two treatments. If this
is so, then the number of + signs (or - signs, for that matter) should have a binomial
distribution1 with p = .5, and N = the number of subjects. In other words, the sign test is just a
binomial test with + and - in place of Head and Tail (or Success and Failure).
EXAMPLE
A physiologist wants to know if monkeys prefer stimulation of brain area A to stimulation of
brain area B. In the experiment, 14 rhesus monkeys are taught to press two bars. When a light
comes on, presses on Bar 1 always result in stimulation of area A; and presses on Bar 2 always
result in stimulation of area B. After learning to press the bars, the monkeys are tested for 15
minutes, during which time the frequencies for the two bars are recorded. The data are shown in
Table 3.1.
To carry out the sign test, we could let our statistic be the number of + signs, which is 3
in this case. The researcher did not predict a particular outcome in this case, but wanted to know
if the two conditions differed. Therefore, the alternative hypothesis is nondirectional. That is,
the alternative hypothesis would be supported by an extreme number of + signs, be it small or
large. A middling number of + signs would be consistent with the null.
The sampling distribution of the statistic is the binomial distribution with N = 14 and p =
.5. With this distribution, we would find that the probability of 3 or fewer + signs is .0287. But
because the alternative is nondirectional, or two-tailed, we must also take into account the
probability 11 or more + signs, which is also .0287. Adding these together, we find that the
probability of (3 or fewer) or (11 or more) is .0574. Therefore, if our pre-determined alpha was
set at .05, we would not have sufficient evidence to allow rejection of the null hypothesis.
1 The binomial distribution is discussed on pages 37-40 of Norman & Streiner (2nd ed.). It is also
discussed in some detail in my chapter on “Probability and Hypothesis Testing” (in the file prob_hyp.pdf).
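The binomial probabilities quoted above can be reproduced directly; here is a quick Python sketch (not part of the original notes, standard library only):

```python
from math import comb

N, x = 14, 3  # 14 monkeys, 3 "+" signs

# P(X <= 3) under the null binomial(N = 14, p = .5)
p_lower = sum(comb(N, k) for k in range(x + 1)) * 0.5 ** N

# Nondirectional alternative: double the one-tail probability
# (distribution is symmetric when p = .5)
p_two_tailed = 2 * p_lower

print(round(p_lower, 4), round(p_two_tailed, 4))  # 0.0287 0.0574
```

Since .0574 > .05, the null hypothesis is not rejected, matching the text.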
B. Weaver (15-Feb-2002) Nonparametric Tests ... 2
Table 3.1 Number of bar-presses in brain stimulation experiment
Subject  Bar 1  Bar 2  Difference  "Sign" of Difference
   1      20     40      -20         -
   2      18     25       -7         -
   3      24     38      -14         -
   4      14     27      -13         -
   5       5     31      -26         -
   6      26     21       +5         +
   7      15     32      -17         -
   8      29     38       -9         -
   9      15     25      -10         -
  10       9     18       -9         -
  11      25     32       -7         -
  12      31     28       +3         +
  13      35     33       +2         +
  14      12     29      -17         -

Tied scores
If a subject has the same score in each condition, there will be no sign, because the
difference score is zero. In the case of tied scores, some textbook authors recommend dropping
those subjects, and reducing N by the appropriate number. This is not the best way to deal with
ties, however, because reduction of N can result in the loss of too much power.
A better approach would be as follows: If there is only one subject with tied scores, drop
that subject, and reduce N by one. If there are 2 subjects with tied scores, make one a + and one
a -. In general, if there is an even number of subjects with tied scores, make half of them + signs,
and half - signs. For an odd number of subjects (greater than 1), drop one randomly selected
subject, and then proceed as for an even number.
3.3 Wilcoxon Signed-Ranks Test (for 2 repeated/correlated measures)
One obvious problem with the sign test is that it discards a lot of information about the
data. It takes into account the direction of the difference, but not the magnitude of the difference
between each pair of scores. The Wilcoxon signed-ranks test is another nonparametric test that
can be used for 2 repeated (or correlated) measures when measurement is at least ordinal. But
unlike the sign test, it does take into account (to some degree, at least) the magnitude of the
difference. Let us return to the data used to illustrate the sign test. The 14 difference scores
were:
-20, -7, -14, -13, -26, +5, -17, -9, -10, -9, -7, +3, +2, -17
If we sort these on the basis of their absolute values (i.e., disregarding the sign), we get the
results shown in Table 3.2. The statistic T is found by calculating the sum of the positive ranks,
and the sum of the negative ranks. T is the smaller of these two sums. In this case, therefore, T
= 6.
If the null hypothesis is true, the sum of the positive ranks and the sum of the negative
ranks are expected to be roughly equal. But if H0 is false, we expect one of the sums to be quite
small--and therefore T is expected to be quite small. The most extreme outcome favourable to
rejection of H0 is T = 0.
Table 3.2 Difference scores ranked by absolute value
Score Rank
+2 1
+3 2
+5 3
-7 4.5 Sum of positive ranks = 6
-7 4.5
-9 6.5 Sum of negative ranks = 99
-9 6.5
-10 8 T = 6
-13 9
-14 10
-17 11.5
-17 11.5
-20 13
-26 14
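The ranking in Table 3.2 and the value T = 6 can be reproduced in a few lines of Python (a sketch, not part of the original notes; tied absolute differences get the mean of the ranks they would occupy):

```python
diffs = [-20, -7, -14, -13, -26, 5, -17, -9, -10, -9, -7, 3, 2, -17]

abs_sorted = sorted(abs(d) for d in diffs)

def midrank(value):
    # average of the positions a tied |difference| occupies in the sorted list
    positions = [i + 1 for i, a in enumerate(abs_sorted) if a == value]
    return sum(positions) / len(positions)

pos_sum = sum(midrank(abs(d)) for d in diffs if d > 0)
neg_sum = sum(midrank(abs(d)) for d in diffs if d < 0)
T = min(pos_sum, neg_sum)
print(pos_sum, neg_sum, T)  # 6.0 99.0 6.0
```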
If we wished to, we could generate the sampling distribution of T (i.e., the distribution of
T assuming that the null hypothesis is true), and see if the observed value of T is in the rejection
region. This is not necessary, however, because the sampling distribution of T can be found in
tables in most introductory level statistics textbooks. When I consulted such a table, I found that
for N = 14, and α = .05 (2-tailed), the critical value of T = 21. The rule is that if T is equal to or
less than Tcritical, we can reject the null hypothesis. Therefore, in this example, we would reject
the null hypothesis. That is, we would conclude that monkeys prefer stimulation in brain area B
to stimulation in area A.
This decision differs from our failure to reject H0 when we analysed the same data using
the sign test. The reason for this difference is that the Wilcoxon signed-ranks test is more
powerful than the sign test. Why is it more powerful? Because it makes use of more information
than does the sign test. But note that it too does discard some information by using ranks rather
than the scores themselves (like a paired t-test does, for example).
Tied scores
When ranking scores, it is customary to deal with tied ranks in the following manner:
Give the tied scores the mean of the ranks they would have if they were not tied. For example, if
you have 2 tied scores that would occupy positions 3 and 4 if they were not tied, give each one a
rank of 3.5. If you have 3 scores that would occupy positions 7, 8, and 9, give each one a rank of
8. This procedure is preferred to any other for dealing with tied scores, because the sum of the
ranks for a fixed number of scores will be the same regardless of whether or not there are any
tied scores.
If there are tied ranks in data you are analysing with the Wilcoxon signed-ranks test, the
statistic needs to be adjusted to compensate for the decreased variability of the sampling
distribution of T. Siegel and Castellan (1988, p. 94) describe this adjustment, for those who are
interested. Note that if you are able to reject H0 without making the correction, then do not
bother, because the correction will increase your chances of rejecting H0. Note as well that the
problem becomes more severe as the number of tied ranks increases.
3.4 Mann-Whitney U Test (for 2 independent samples)
The most basic independent groups design has two groups. These are often called
Experimental and Control. Subjects are randomly selected from the population and randomly
assigned to two groups. There is no basis for pairing scores. Nor is it necessary to have the
same number of scores in the two groups.
The Mann-Whitney U test is a nonparametric test that can be used to analyse data from a
two-group independent groups design when measurement is at least ordinal. It analyses the
degree of separation (or the amount of overlap) between the Experimental and Control groups.
The null hypothesis assumes that the two sets of scores (E and C) are samples from the
same population; and therefore, because sampling was random, the two sets of scores do not
differ systematically from each other.
The alternative hypothesis, on the other hand, states that the two sets of scores do differ
systematically. If the alternative is directional, or one-tailed, it further specifies the direction of
the difference (i.e., Group E scores are systematically higher or lower than Group C scores).
The statistic that is calculated is either U or U'.
U1 = the number of Es less than Cs
U2 = the number of Cs less than Es
U = the smaller of the two values calculated above
U' = the larger of the two values calculated above
Calculating U directly
When the total number of scores is small, U can be calculated directly by counting the
number of Es less than Cs (or Cs less than Es). Consider the following example:
Table 3.3 Data for two independent groups
Group E: 12 17 9 21
Group C: 8 18 26 15 23
It will be easier to count the number of Es less than Cs (and vice versa) if we rank the
data from lowest to highest, and rewrite it as shown in Table 3.4.
Table 3.4 Illustration of direct calculation of the U statistic
Score  Group  Rank  E<C  C<E
  8      C     1     0
  9      E     2           1
 12      E     3           1
 15      C     4     2
 17      E     5           2
 18      C     6     3
 21      E     7           3
 23      C     8     4
 26      C     9     4
                    13     7

U = 7,  U' = 13
CHECK:
Note that U + U' = n1n2. This will always be true, and can be used to check your
calculations. In this case, U + U' = 7 + 13 = 20; and n1n2 = 4(5) = 20.
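The direct counting method (and the U + U' = n1n2 check) can be sketched in Python as follows (not part of the original notes):

```python
E = [12, 17, 9, 21]
C = [8, 18, 26, 15, 23]

# count Es less than Cs, and Cs less than Es, over all (E, C) pairs
Es_less_than_Cs = sum(1 for e in E for c in C if e < c)
Cs_less_than_Es = sum(1 for e in E for c in C if c < e)

U = min(Es_less_than_Cs, Cs_less_than_Es)
U_prime = max(Es_less_than_Cs, Cs_less_than_Es)

assert U + U_prime == len(E) * len(C)  # check: U + U' = n1 * n2
print(U, U_prime)  # 7 13
```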
Calculating U with formulae
When the total number of scores is a bit larger, or if there are tied scores, it may be more
convenient to calculate U with the following formulae:
    U1 = n1n2 + n1(n1 + 1)/2 − R1     (3.1)

    U2 = n1n2 + n2(n2 + 1)/2 − R2     (3.2)
where n1 = # of scores in group 1
n2 = # of scores in group 2
R1 = sum of ranks for group 1
R2 = sum of ranks for group 2
As before, U = smaller of U1 and U2, and U' = larger of U1 and U2.
For the data shown above, R1 = 2+3+5+7 = 17; and R2 = 1+4+6+8+9 = 28. Substituting
into the formulae, we get:
    U1 = 4(5) + 4(4 + 1)/2 − 17 = 13     (3.3)

    U2 = 4(5) + 5(5 + 1)/2 − 28 = 7     (3.4)
Therefore, U = 7 and U'= 13.
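The same result can be obtained by ranking the pooled scores and applying formulae (3.1) and (3.2); a Python sketch (not part of the original notes; there are no ties in this example, so simple ranks suffice):

```python
E = [12, 17, 9, 21]
C = [8, 18, 26, 15, 23]

# rank the pooled scores from lowest (1) to highest (no ties here)
pooled = sorted(E + C)
rank = {score: i + 1 for i, score in enumerate(pooled)}

n1, n2 = len(E), len(C)
R1 = sum(rank[s] for s in E)  # sum of ranks for group E
R2 = sum(rank[s] for s in C)  # sum of ranks for group C

U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1  # formula (3.1)
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - R2  # formula (3.2)
print(R1, R2, U1, U2)  # 17 28 13.0 7.0
```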
Making a decision
The next step is deciding whether to reject H0 or not. In principle, we could generate a
probability distribution for U that is conditional on the null hypothesis being true--much like we
did when working with the binomial distribution earlier. Fortunately, we do not have to do this,
because there are tables in the back of many statistics textbooks that give you the critical values
of U (or U') for different values of n1 and n2, and for various significance levels.
For the case we've been considering, n1 = 4 and n2 = 5. For a two-tailed test with α = .05,
the critical value of U = 1. In order to reject H0, the observed value of U would have to be equal
to or less than the critical value of U. (Note that maximum separation of E and C scores is
indicated by U = 0. As the E and C scores become more mixed, U becomes larger. Therefore,
small values of U lead to rejection of H0.) Therefore, we would decide that we cannot reject H0
in this case.
Tied scores
According to Siegel and Castellan (1988), any ties that involve observations in the same
group do not affect the values of U and U'. (Note that Siegel and Castellan refer to this test as
the Wilcoxon-Mann-Whitney Test, and that they call the statistic W rather than U.) But if two or
more tied ranks involve observations from both groups, then the values of U and U' are affected,
and a correction should be applied. See Siegel & Castellan (1988, p. 134) should you ever need
more information on this, and note that the problem is particularly severe if you are dealing with
the large-sample version of the test, which we have not yet discussed.
3.5 Kruskal-Wallis H-test (for k independent samples)
The Kruskal-Wallis H-test goes by various names, including Kruskal-Wallis one-way
analysis of variance by ranks (e.g., in Siegel & Castellan, 1988). It is for use with k independent
groups, where k is equal to or greater than 3, and measurement is at least ordinal. (When k = 2,
you would use the Mann-Whitney U-test instead.) Note that because the samples are
independent, they can be of different sizes.
The null hypothesis is that the k samples come from the same population, or from
populations with identical medians. The alternative hypothesis states that not all population
medians are equal. It is assumed that the underlying distributions are continuous; but only
ordinal measurement is required.
The statistic H (sometimes also called KW) can be calculated in one of two ways:
    H = [12 / (N(N + 1))] Σᵢ₌₁ᵏ nᵢ(R̄ᵢ − R̄•)²     (3.5)
or, the more common computational formula,
    H = [12 / (N(N + 1))] Σᵢ₌₁ᵏ (Rᵢ²/nᵢ) − 3(N + 1)     (3.6)
where k = the number of independent samples
ni = the number of cases in the ith sample
N = the total number of cases
Ri = the sum of the ranks in the ith sample
R̄i = the mean of the ranks for the ith sample
R̄• = (N + 1)/2 = the mean of all ranks
Example
A student was interested in comparing the effects of four kinds of reinforcement on children's
performance on a test of reading comprehension. The four reinforcements used were: (a) praise
for correct responses; (b) a jelly bean for each correct response; (c) reproof for incorrect
responses; and (d) silence. Four independent groups of children were tested, and each group
received only one kind of reinforcement. The measure of performance given below is the number
of errors made during the course of testing.
Table 3.5 Data from 4 independent groups
  a    b    c    d
 68   78   94   54
 63   69   82   51
 58   58   73   32
 51   57   67   74
 41   53   66   65
           61   80
The first step in carrying out the Kruskal-Wallis H-test is to rank order all of the scores
from lowest to highest. This can be quite laborious work if you try to do it by hand, but is fairly
easy if you use a spreadsheet program. Enter all scores in a single column, and enter a group
code for each score one column over. For example:
Group Score
a 68
a 63
a 58
a 51
a 41
b 78
etc.
When all the data are entered thus, sort the scores (and their codes) from lowest to highest. Now
you can enter ranks from 1 to N (taking care to deal with tied scores appropriately). After the
scores are ranked, you can sort the data by group code, and then calculate the sum and
mean of the ranks for each group. I did this for the data shown above, and came up with the
following ranks:
Table 3.6 Ranks of data from Table 3.5
   a     b     c     d
  15    19    22     6
  11    16    21     3.5
   8.5   8.5  17     1
   3.5   7    14    18
   2     5    13    12
              10    20
  40.0  55.5  97.0  60.5   Sum of Ranks
   8.0  11.1  16.2  10.1   Mean of Ranks
CHECK: The sum of ranks from 1 to N will always be equal to [N(N+1)]/2. We can use this to
check our work to this point. We have 22 scores in total, so the sum of all ranks should be
[22(23)]/2 = 253. Similarly, when we add the sum of ranks for each group, we get 40 + 55.5 +
97 + 60.5 = 253. Therefore, the mean of ALL ranks = 253/22 = 11.5.
Now plugging into Equation 3.6 shown above, we get H = 4.856. If the null hypothesis
is true, and the k samples are drawn from the same population (or populations with identical
medians), and if k > 3, and all samples have 5 or more scores, then the distribution of H closely
approximates the chi-squared distribution with df = k-1. The critical value of chi-squared with
df = 3 and α = .05 is 7.82. In order to reject H0, the obtained value of H would have to be equal
to or greater than 7.82. Because it is less, we cannot reject the null hypothesis.
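The value H = 4.856 can be verified with a short Python sketch (not part of the original notes). It takes the two row-6 scores, 61 and 80, as belonging to groups c and d, which is consistent with the rank sums in Table 3.6; tied scores get midranks:

```python
groups = {
    "a": [68, 63, 58, 51, 41],
    "b": [78, 69, 58, 57, 53],
    "c": [94, 82, 73, 67, 66, 61],
    "d": [54, 51, 32, 74, 65, 80],
}

all_scores = sorted(s for g in groups.values() for s in g)
N = len(all_scores)  # 22

def midrank(value):
    # average of the positions a tied score occupies in the sorted pooled list
    positions = [i + 1 for i, s in enumerate(all_scores) if s == value]
    return sum(positions) / len(positions)

# computational formula (3.6): H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
term = sum(sum(midrank(s) for s in g) ** 2 / len(g) for g in groups.values())
H = 12 / (N * (N + 1)) * term - 3 * (N + 1)
print(round(H, 3))  # 4.856
```

Since 4.856 < 7.82 (chi-squared critical value, df = 3, α = .05), H0 is not rejected, as in the text.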
Sampling distribution of H
As described above, when H0 is true, if k > 3, and all samples have 5 or more scores, then
the sampling distribution of H is closely approximated by the chi-squared distribution with df =
k-1. If k = 3 and the number of scores in each sample is 5 or fewer, then the chi-squared
distribution should not be used. In this case, one should use a table of critical values of H (e.g.,
Table O in Siegel & Castellan, 1988).
Tied observations
Tied scores are dealt with in the manner described previously (i.e., they are given the
mean of the ranks they would receive if they were not tied). The presence of tied scores does
affect the variance of the sampling distribution of H. Siegel and Castellan (1988) show a
correction that can be applied in the case of tied scores, but go on to observe that its effect is to
increase the value of H. Therefore, if you are able to reject H0 without correcting for ties, there
is no need to do the correction. It should only be contemplated when you have failed to reject
H0.
Multiple comparisons
Rejection of H0 tells you that at least one of the k samples is drawn from a population
with a median different from the others. But it does not tell you which one, or how many are
different. There are procedures for conducting multiple comparisons between treatments, or
comparisons of a control condition to all other conditions in order to answer these kinds of
questions. Should you ever need to use one of them, consult Siegel and Castellan (1988, pp.
213-215).
3.6 The Jonckheere test for ordered alternatives
The Jonckheere test for ordered alternatives is similar to the Kruskal-Wallis test, but has
a more specific alternative hypothesis. The alternative hypothesis for the Kruskal-Wallis test
states that all population medians are not equal. The more precise alternative hypothesis for the
Jonckheere test can be summarised as follows:
H1: θ1 ≤ θ2 ≤ ... ≤ θk
where the θ‘s are the population medians. This alternative is tested against a null hypothesis of
no systematic trend across treatments.
The test can be applied when you have data for k independent samples, when
measurement is at least ordinal, and when it is possible to specify a priori the ordering of the
groups. Because the alternative hypothesis specifies the order of the medians, the test is
one-tailed.
Siegel and Castellan (1988) use J to symbolise the statistic that is calculated. It is
sometimes also called the “Mann-Whitney count”. As this name implies, J is based on the same
kind of counting and summing that we saw when calculating the U statistic via the direct
method. The mechanics of it become somewhat complicated for the Jonckheere test, so we will
not go into it here. (I hope you are not too disappointed!) Should you ever need to perform this
test, see Program 5 in Appendix II of Siegel and Castellan (1988). Siegel and Castellan also
provide a table of critical values of J (for small sample tests).
3.7 Friedman ANOVA
This test is sometimes called the Friedman two-way analysis of variance by ranks. It is
for use with k repeated (or correlated) measures where measurement is at least ordinal. The null
hypothesis states that all k samples are drawn from the same population, or from populations
with equal medians.
Example
The table on the left (below) shows reaction time data from 5 subjects, each of whom was
tested in 3 conditions (A, B, and C). The Friedman ANOVA uses ranks, and so the first thing we
must do is rank order the k scores for each subject. The results of this ranking are shown in the
table on the right, and the sum of the ranks (ΣRi ) for each treatment is shown at the bottom.
Table 3.7 RT data and ranks for 3 levels of a within-subjects variable
Subj A B C Subj A B C
1 386 411 454 1 1 2 3
2 542 563 556 2 1 3 2
3 662 667 665 3 1 3 2
4 453 502 574 4 1 2 3
5 548 546 575 5 2 1 3
ΣRi      6  11  13
It may be useful at this point to consider what kinds of outcomes are expected if H0 is
true. H0 states that all of the samples (columns) are drawn from the same population, or from
populations with the same median. If so, then the sums (or means) of the ranks for each of the
columns should all be roughly equal, because the ranks 1, 2, and 3 would be expected by chance
to appear equally often in each column. In this example, the expected ΣR for each treatment
would be 10 if H0 is true. (In general, the expected sum of ranks for each treatment is
N(k + 1)/2.) The Friedman ANOVA assesses the degree to which the observed ΣR's depart from the
expected ΣR's. If the departure is too extreme (or not likely due to chance), one concludes by
rejecting H0.
The Fr statistic is calculated as follows:
    Fr = [12 / (Nk(k + 1))] Σᵢ₌₁ᵏ Rᵢ² − 3N(k + 1)     (3.7)
where N = the number of subjects
k = the number of treatments
Ri = the sum of the ranks for the ith treatment
Critical values of Fr for various sample sizes and numbers of treatments can be found in
tables (e.g., Table M in Siegel & Castellan, 1988). Note that when the number of treatments or
subjects is large, the sampling distribution of Fr is closely approximated by the chi-squared
distribution with df = k - 1. (Generally, use a table of critical values for Fr if it provides a value
for your particular combination of k and N. If either k or N are too large for the table of critical
values, then use the chi-squared distribution with df = k -1.)
For the example we've been looking at, the critical values of Fr are 6.40 for α = .05, and
8.40 for α = .01. In order to reject H0, the obtained value of Fr must be equal to or greater than
the critical value. Therefore, we would fail to reject H0 in this case.
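The within-subject ranking and the resulting Fr statistic can be sketched in Python as follows (not part of the original notes; there are no within-subject ties in these data):

```python
data = [  # RTs for subjects 1-5 in conditions A, B, C (Table 3.7)
    [386, 411, 454],
    [542, 563, 556],
    [662, 667, 665],
    [453, 502, 574],
    [548, 546, 575],
]
N, k = len(data), len(data[0])

# rank the k scores within each subject, then sum ranks per condition
R = [0] * k
for row in data:
    order = sorted(range(k), key=lambda j: row[j])
    for r, j in enumerate(order, start=1):
        R[j] += r

# formula (3.7): Fr = 12/(N k (k+1)) * sum(R_i^2) - 3 N (k+1)
Fr = 12 / (N * k * (k + 1)) * sum(r ** 2 for r in R) - 3 * N * (k + 1)
print(R, round(Fr, 2))  # [6, 11, 13] 5.2
```

Since 5.2 < 6.40 (the critical value for α = .05), H0 is not rejected, as stated above.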
Tied scores
If there are ties among the ranks, the Fr statistic must be corrected, because the sampling
distribution changes. The formula that corrects for tied ranks is actually a general formula that
also works when there are no ties. However, it is rather complicated, which is why the
simplified version shown above is used when possible. The general formula is not shown here,
but can be found in Siegel and Castellan (1988, p. 179), should you need it.
Multiple comparisons
Siegel and Castellan (1988) also give formulae to be used for conducting multiple
comparisons and/or comparisons of a control condition to each of the other conditions.
3.8 Large sample versions of nonparametric tests
You may have noticed that the tables of critical values for many nonparametric statistics
only go up to sample sizes of about 25-50. If so, perhaps you have wondered what to do when
you have sample sizes larger than that, and want to carry out a nonparametric test. Fortunately,
it turns out that the sampling distributions of many nonparametric statistics converge on the
normal distribution as sample size increases. Because of that, it is possible to carry out a
so-called “large-sample” version of the test (which is really a z-test) if you know the mean and
variance of the sampling distribution for that particular statistic.
Common structure of all z- and t-tests
As I have mentioned before, all z- and t-tests have a common structure. In general terms:
    z (or t) = [statistic − (parameter | H0 is true)] / (standard error of the statistic)     (3.8)
When the sampling distribution of the statistic in the numerator is normal, then if the true
(population) standard error (SE) of the statistic is known, the computed ratio can be evaluated
against the standard normal (z) distribution. If the true standard error of the statistic is not
known, then it must be estimated from the sample data, and the proper sampling distribution is a
t-distribution with some number of degrees of freedom.
Example: Large-sample Mann-Whitney U test
The following facts are known about the sampling distribution of the U statistic used in
the Mann-Whitney U test:
    μU = n1n2 / 2     (3.9)

    σU = √[ n1n2(n1 + n2 + 1) / 12 ]     (3.10)
Furthermore, when both sample sizes are greater than about 20, the sampling distribution of U is
(for practical purposes) normal. Therefore, under these conditions, one can perform a z-test as
follows:
    zU = (U − μU) / σU     (3.11)
The obtained value of zU can be evaluated against a table of the standard normal distribution
(e.g., Table A in Norman & Streiner, 2000). Alternatively, one can use software to calculate the
p-value for a given z-score, e.g., StaTable from Cytel, which is available here:
http://www.cytel.com/statable/index.html
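Formulae (3.9)-(3.11) translate directly into code. The sketch below (not part of the original notes) plugs in the earlier small example with n1 = 4, n2 = 5, U = 7; those sample sizes are far too small for the normal approximation, so this is purely an illustration of the arithmetic:

```python
import math

n1, n2, U = 4, 5, 7  # illustration only; approximation needs n1, n2 > ~20

mu_U = n1 * n2 / 2                                 # (3.9)
sigma_U = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # (3.10)
z_U = (U - mu_U) / sigma_U                         # (3.11)
print(mu_U, round(sigma_U, 3), round(z_U, 3))  # 10.0 4.082 -0.735
```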
Example: Large-sample Wilcoxon signed ranks test
The following are known to be true about the sampling distribution of T, the statistic used
in the Wilcoxon signed ranks test:
    μT = N(N + 1) / 4     (3.12)

    σT = √[ N(N + 1)(2N + 1) / 24 ]     (3.13)
If N > 50, then the sampling distribution of T is for practical purposes normal. And so, a z-ratio
can be computed as follows:
    zT = (T − μT) / σT     (3.14)
The obtained value of zT can be evaluated against a table of the standard normal distribution, or
using software as described above.
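A Python sketch of (3.12)-(3.14), again not part of the original notes, applied to the earlier signed-ranks example (T = 6, N = 14; note N < 50, so the approximation is used here only to show the arithmetic):

```python
import math

N, T = 14, 6  # illustration only; approximation intended for N > 50

mu_T = N * (N + 1) / 4                               # (3.12)
sigma_T = math.sqrt(N * (N + 1) * (2 * N + 1) / 24)  # (3.13)
z_T = (T - mu_T) / sigma_T                           # (3.14)
print(mu_T, round(sigma_T, 3), round(z_T, 3))  # 52.5 15.93 -2.919
```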
Example: Large-sample Jonckheere test for ordered alternatives
The mean and standard deviation of the sampling distribution of J are given by the
following:
    μJ = (N² − Σᵢ₌₁ᵏ nᵢ²) / 4     (3.15)

    σJ = √{ [N²(2N + 3) − Σᵢ₌₁ᵏ nᵢ²(2nᵢ + 3)] / 72 }     (3.16)
where N = total number of observations
ni = the number of observations in the ith group
k = the number of independent groups
As sample sizes increase, the sampling distribution of J converges on the normal, and so
one can perform a z-test as follows:
    zJ = (J − μJ) / σJ     (3.17)
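Formulae (3.15) and (3.16) can be sketched as follows. The text gives no worked J example, so the group sizes here are hypothetical, chosen only to show the computation (not part of the original notes):

```python
import math

n = [5, 5, 5]  # hypothetical group sizes, for illustration only
N = sum(n)

mu_J = (N ** 2 - sum(ni ** 2 for ni in n)) / 4                         # (3.15)
sigma_J = math.sqrt((N ** 2 * (2 * N + 3)
                     - sum(ni ** 2 * (2 * ni + 3) for ni in n)) / 72)  # (3.16)
print(mu_J, round(sigma_J, 3))  # 37.5 9.465
```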
Example: Large sample sign test
The sampling distribution used in carrying out the sign test is a binomial distribution with
p =q = .5. The mean of a binomial distribution is equal to Np, and the variance is equal to Npq.
As N increases, the binomial distribution converges on the normal distribution (especially when
p = q = .5). When N is large enough (i.e., greater than 30 or 50, depending on how conservative
one is), it is possible to carry out a z-test version of the sign test as follows:
    z = (X − Np) / √(Npq)     (3.18)
You may recall that z2 is equal to χ2 with df = 1. Therefore,
    z² = χ²(1) = (X − Np)² / (Npq)     (3.19)
This formula can be expanded with what Howell (1997) calls “some not-so-obvious algebra” to
yield:
    χ²(1) = (X − Np)²/Np + (N − X − Nq)²/Nq     (3.20)

Note that X equals the observed number of p-events, and Np equals the expected number of
p-events under the null hypothesis. Similarly, N − X equals the observed number of q-events, and
Nq = the expected number of q-events under the null hypothesis. Therefore, we can rewrite
equation (3.20) in a more familiar looking format as follows:
    χ²(1) = (O1 − E1)²/E1 + (O2 − E2)²/E2 = Σ (O − E)²/E     (3.21)
Large-sample z-tests with small samples
Many computerised statistics packages automatically compute the large-sample (z-test)
version of nonparametric tests, even when the sample sizes are small. Note however, that the
z-test is just an approximation that can be used when sample sizes are sufficiently large. If the
sample sizes are small enough to allow use of a table of critical values for your particular
nonparametric statistic, you should always use it rather than a z-test.
3.9 Advantages of nonparametric tests
Siegel and Castellan (1988, p. 35) list the following advantages of nonparametric tests:
1. If the sample size is very small, there may be no alternative to using a
nonparametric statistical test unless the nature of the population distribution is
known exactly.
2. Nonparametric tests typically make fewer assumptions about the data and may
be more relevant to a particular situation. In addition, the hypothesis tested by the
nonparametric test may be more appropriate for the research investigation.
3. Nonparametric tests are available to analyze data which are inherently in ranks
as well as data whose seemingly numerical scores have the strength of ranks.
That is, the researcher may only be able to say of his or her subjects that one has
more or less of the characteristic than another, without being able to say how
much more or less. For example, in studying such a variable as anxiety, we may
be able to state that subject A is more anxious than subject B without knowing at
all exactly how much more anxious A is. If data are inherently in ranks, or even
if they can be categorized only as plus or minus (more or less, better or worse),
they can be treated by nonparametric methods, whereas they cannot be treated by
parametric methods unless precarious and, perhaps, unrealistic assumptions are
made about the underlying distributions.
4. Nonparametric methods are available to treat data which are simply
classificatory or categorical, i.e., are measured in a nominal scale. No parametric
technique applies to such data.
5. There are suitable nonparametric statistical tests for treating samples made up
of observations from several different populations. Parametric tests often cannot
handle such data without requiring us to make seemingly unrealistic assumptions
or requiring cumbersome computations.
6. Nonparametric statistical tests are typically much easier to learn and to apply
than are parametric tests. In addition, their interpretation often is more direct than
the interpretation of parametric tests.
Note that the objection concerning “cumbersome computations” in point number 5 has
become less of an issue as computers and statistical software packages become more
sophisticated, and more available.
3.10 Disadvantages of nonparametric tests
In closing, I must point out that nonparametric tests do have at least two major
disadvantages in comparison to parametric tests. First, nonparametric tests are less powerful.
Why? Because parametric tests use more of the information available in a set of numbers.
Parametric tests make use of information consistent with interval scale measurement, whereas
nonparametric tests typically make use of ordinal information only. As Siegel and Castellan (1988)
put it, "nonparametric statistical tests are wasteful."
Second, parametric tests are much more flexible, and allow you to test a greater range of
hypotheses. For example, factorial ANOVA designs allow you to test for interactions between
variables in a way that is not possible with nonparametric alternatives. There are nonparametric
techniques to test for certain kinds of interactions under certain circumstances, but these are
much more limited than the corresponding parametric techniques.
Therefore, when the assumptions for a parametric test are met, it is generally (but not
necessarily always) preferable to use the parametric test rather than a nonparametric test.
---------------------------------------------------------------------
Review Questions
1. Which test is more powerful, the sign test, or the Wilcoxon signed ranks test? Explain why.
2. Which test is more powerful, the Wilcoxon signed ranks test, or the t-test for correlated
samples? Explain why.
For the scenarios described in questions 3-5, identify the nonparametric test that ought
to be used.
3. A single group of subjects is tested at 6 levels of an independent variable. You would like to
do a repeated measures ANOVA, but cannot because you have violated the assumptions for that
analysis. Your data are ordinal.
4. You have 5 independent groups of subjects, with different numbers per group. There is also
substantial departure from homogeneity of variance. The null hypothesis states that there are no
differences between the groups.
5. You have the same situation described in question 4; and in addition, the alternative
hypothesis states that when the mean ranks for the 5 groups are listed from smallest to largest,
they will appear in a particular pre-specified order.
6. Explain the rationale underlying the large-sample z-test version of the Mann-Whitney U-test.
7. Why should you not use the large-sample z-test version of a nonparametric test when you
have samples small enough to allow use of the small-sample version?
8. Give two reasons why parametric tests are generally preferred to nonparametric tests.
9. Describe the circumstances under which you might use the Kruskal-Wallis test. Under what
circumstances would you use the Jonckheere test instead? (HINT: Think about how the
alternative hypotheses for these tests differ.)
---------------------------------------------------------------------
References
Howell, D.C. (1997). Statistical methods for psychology (4th Ed.). Belmont, CA: Duxbury Press.
Siegel, S., & Castellan, N.J. (1988). Nonparametric statistics for the behavioral sciences (2nd
Ed.). New York, NY: McGraw-Hill.
Mann-Whitney U test
Menu location: Analysis_Non-parametric_Mann-Whitney.
This is a method for the comparison of two independent random samples (x and y):
The Mann-Whitney U statistic is defined as:

U = n1·n2 + n1(n1 + 1)/2 − R1

- where samples of size n1 and n2 are pooled, Ri are the ranks of the pooled observations, and R1
is the sum of the ranks belonging to the first sample.
U can be resolved as the number of times observations in one sample precede observations in the
other sample in the ranking.
Wilcoxon rank sum, Kendall's S and the Mann-Whitney U test are exactly equivalent tests. In the
presence of ties the Mann-Whitney test is also equivalent to a chi-square test for trend.
In most circumstances a two sided test is required; here the alternative hypothesis is that x values
tend to be distributed differently to y values. For a lower side test the alternative hypothesis is
that x values tend to be smaller than y values. For an upper side test the alternative hypothesis is
that x values tend to be larger than y values.
Assumptions of the Mann-Whitney test:
- random samples from populations
- independence within samples and mutual independence between samples
- measurement scale is at least ordinal

A confidence interval for the difference between two measures of location is provided with the
sample medians. The assumptions of this method are slightly different from the assumptions of
the Mann-Whitney test:
- random samples from populations
- independence within samples and mutual independence between samples
- two population distribution functions are identical apart from a possible difference in
location parameters
Technical Validation
StatsDirect uses the sampling distribution of U to give exact probabilities. These calculations
may take an appreciable time to complete when many data are tied.
Confidence intervals are constructed for the difference between the means or medians (any
measure of location in fact). The level of confidence used will be as close as is theoretically
possible to the one you specify. StatsDirect approaches the selected confidence level from the
conservative side.
When samples are large (either sample > 80 or both samples >30) a normal approximation is
used for the hypothesis test and for the confidence interval. Note that StatsDirect uses more
accurate P value calculations than some other statistical software, therefore, you may notice a
difference in results (Conover, 1999; Dineen and Blakesley, 1973; Harding, 1983; Neumann,
1988).
Example
From Conover (1999, p. 218).
Test workbook (Nonparametric worksheet: Farm Boys, Town Boys).
The following data represent fitness scores from two groups of boys of the same age, those from
homes in the town and those from farm homes.
Farm Boys
14.8
7.3
5.6
6.3
9.0
4.2
10.6
12.5
12.9
16.1
11.4
2.7
Town Boys
12.7
14.2
12.6
2.1
17.7
11.8
16.9
7.9
16.0
10.6
5.6
5.6
7.6
11.3
8.3
6.7
3.6
1.0
2.4
6.4
9.1
6.7
18.6
3.2
6.2
6.1
15.3
10.6
1.8
5.9
9.9
10.6
14.8
5.0
2.6
4.0
To analyse these data in StatsDirect you must first enter them in two separate workbook
columns. Alternatively, open the test workbook using the file open function of the file menu.
Then select the Mann-Whitney from the Non-parametric section of the analysis menu. Select the
columns marked "Farm Boys" and "Town Boys" when prompted for data.
For this example:
estimated median difference = 0.8
two sided P = 0.529
95.1% confidence interval for difference between population means or medians = -2.3 to 4.4
Here we have assumed that these groups are independent and that they represent at least
hypothetical random samples of the sub-populations they represent. In this analysis, we are
clearly unable to reject the null hypothesis that the two groups do not tend to yield different
fitness scores. This lack of statistical evidence of a difference is reflected in the
confidence interval for the difference between population means, in that the interval spans zero.
Note that the quoted 95.1% confidence interval is as close as you can get to 95% because of the
very nature of the mathematics involved in non-parametric methods like this.
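For readers without StatsDirect, the U statistic itself is straightforward to compute by hand. The following pure-Python sketch (an illustration, not StatsDirect's algorithm) pools the two samples, assigns midranks to tied values, and returns U for each sample:

```python
def midranks(values):
    """Rank the pooled values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j to cover the whole block of values tied with values[order[i]]
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U1, U2), where U1 = R1 - n1(n1+1)/2 and R1 is the rank sum of sample x."""
    ranks = midranks(list(x) + list(y))
    r1 = sum(ranks[:len(x)])
    u1 = r1 - len(x) * (len(x) + 1) / 2
    return u1, len(x) * len(y) - u1

# U1 also counts, over all pairs, how often an x value exceeds a y value
# (with ties counted as one half):
u1, u2 = mann_whitney_u([1, 3, 5], [2, 4, 6])
print(u1, u2)  # 3.0 6.0
```

The exact tail probabilities that StatsDirect reports come from the permutation distribution of U; the sketch above only reproduces the statistic itself.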
Kruskal-Wallis Test.
As a reminder, the assumptions of the one-way ANOVA for independent samples
are
1. that the scale on which the dependent variable is measured has the
properties of an equal interval scale;
2. that the k samples are independently and randomly drawn from the source
population(s);
3. that the source population(s) can be reasonably supposed to have a normal
distribution; and
4. that the k samples have approximately equal variances.
We noted in the main body of Chapter 14 that we need not worry very much about
the first, third, and fourth of these assumptions when the samples are all the same
size. For in that case the analysis of variance is quite robust, by which we mean
relatively unperturbed by the violation of its assumptions. But of course, the other
side of the coin is that when the samples are not all the same size, we do need to
worry. In this case, should one or more of assumptions 1, 3, and 4 fail to be met,
an appropriate non-parametric alternative to the one-way independent-samples
ANOVA can be found in the Kruskal-Wallis Test.
I will illustrate the Kruskal-Wallis test with an example based on rating-scale data,
since this is by far the most common situation in which unequal sample sizes would
call for the use of a non-parametric alternative. In this particular case the number
of groups is k=3. I think it will be fairly obvious how the logic and procedure would
be extended in cases where k is greater than 3.
To assess the effects of expectation on the perception of aesthetic quality,
an investigator randomly sorts 24 amateur wine aficionados into three
groups, A, B, and C, of 8 subjects each. Each subject is scheduled for an
individual interview. Unfortunately, one of the subjects of group B and two of
group C fail to show up for their interviews, so the investigator must
make do with samples of unequal size: na=8, nb=7, and nc=6, for a total
of N=21. The subjects who do show up for their interviews are each asked to
rate the overall quality of each of three wines on a 10-point scale, with "1"
standing at the bottom of the scale and "10" at the top.
Group      A      B      C
           6.4    2.5    1.3
           6.8    3.7    4.1
           7.2    4.9    4.9
           8.3    5.4    5.2
           8.4    5.9    5.5
           9.1    8.1    8.2
           9.4    8.2
           9.7
mean       8.2    5.5    4.9
As it happens, the three wines are the same for all subjects. The only difference is
in the texture of the interview, which is designed to induce a relatively high
expectation of quality in the members of group A; a relatively low expectation in
the members of group C; and a merely neutral state, tending in neither the one
direction nor the other, for the members of group B. At the end of the study, each
subject's ratings are averaged across all three wines, and this average is then taken
as the raw measure for that particular subject. The adjacent table shows these
measures for each subject in each of the three groups.
Mechanics
The preliminaries of the Kruskal-Wallis test are much the same as those of the
Mann-Whitney test described in Subchapter 11a. We begin by assembling the
measures from all k samples into a single set of size N. These assembled measures
are rank-ordered from lowest (rank#1) to highest (rank#N), with tied ranks
included where appropriate; and the resulting ranks are then returned to the
sample, A, B, or C, to which they belong and substituted for the raw measures that
gave rise to them. Thus, the raw measures that appear in the following table on the
left are replaced by their respective ranks, as shown in the table on the right.
Raw Measures                 Ranked Measures
A      B      C              A      B      C
6.4    2.5    1.3            11     2      1
6.8    3.7    4.1            12     3      4
7.2    4.9    4.9            13     5.5    5.5
8.3    5.4    5.2            17     8      7
8.4    5.9    5.5            18     10     9
9.1    8.1    8.2            19     14     15.5
9.4    8.2                   20     15.5
9.7                          21

                             A      B      C      Combined
sum of ranks                 131    58     42     231
average of ranks             16.4   8.3    7.0    11.0
With the Kruskal-Wallis test, however, we take account not only of the sums
of the ranks within each group, but also of the averages. Thus we define the
following items of symbolic notation:
TA = the sum of the na ranks in group A
MA = the mean of the na ranks in group A
TB = the sum of the nb ranks in group B
MB = the mean of the nb ranks in group B
TC = the sum of the nc ranks in group C
MC = the mean of the nc ranks in group C
Tall = the sum of the N ranks in all groups combined
Mall = the mean of the N ranks in all groups combined
Logic and Procedure
The Measure of Aggregate Group Differences
You will sometimes find the Kruskal-Wallis test described as an "analysis of variance
by ranks." Although it is not really an analysis of variance at all, it does bear a
certain resemblance to ANOVA up to a point. In both procedures, the first part of
the task is to find a measure of the aggregate degree to which the group means
differ. With ANOVA that measure is found in the quantity known as SSbg, which is
the between-groups sum of squared deviates. The same is true with the Kruskal-
Wallis test, except that here the group means are based on ranks rather than on
the raw measures. As a reminder that we are now dealing with ranks, we will
symbolize this new version of the between-groups sum of squared deviates
as SSbg(R). The following table summarizes the mean ranks for the present
example. Also included are the sums and the counts (na, nb, nc, and N) on which
these means are based.
           A      B      C      All
counts     8      7      6      21
sums       131    58     42     231
means      16.4   8.3    7.0    11.0
In Chapters 13 and 14 you saw that the squared deviate for any particular
group mean is equal to the squared difference between that group mean and
the mean of the overall array of data, multiplied by the number of
observations on which the group mean is based. Thus, for each of our
current three groups
A: 8(16.4 − 11.0)² = 233.3
B: 7(8.3 − 11.0)²  =  51.0
C: 6(7.0 − 11.0)²  =  96.0
SSbg(R) = 380.3
On analogy with the formulaic structures for SSbg developed in Chapters 13
and 14, we can write the conceptual formula for SSbg(R) as
SSbg(R) = Σ[ng(Mg − Mall)²]

Here as well, the subscript "g" means "any particular group."

and the computational formula as

SSbg(R) = Σ[(Tg)²/ng] − (Tall)²/N
With k=3 samples, this latter structure would be equivalent to

SSbg(R) = (TA)²/na + (TB)²/nb + (TC)²/nc − (Tall)²/N

For k=4 it would be

SSbg(R) = (TA)²/na + (TB)²/nb + (TC)²/nc + (TD)²/nd − (Tall)²/N
And so forth for other values of k.
Here, in any event, is how it would work out for the present example. The
discrepancy between what we get now and what we got a moment ago
(380.3) is due to rounding error in the earlier calculation. As usual, it is the
computational formula that is the less susceptible to rounding error, hence
the more reliable.
SSbg(R) = (131)²/8 + (58)²/7 + (42)²/6 − (231)²/21 = 378.7
The Null-Hypothesis Value of SSbg(R)
The null hypothesis in this or any comparable situation involving several
independent samples of ranked data is that the mean ranks of the k groups
will not substantially differ. On this account, you might suppose that the
null-hypothesis value of SSbg(R), the aggregate measure of group
differences, would be simply zero. A moment's reflection, however, will show
why this cannot be so.
Consider the very simple case where there are 3 groups, each
containing 2 observations. By way of analogy, imagine you had six
small cards representing the ranks "1," "2," "3," "4," "5," and "6."
If you were to sort these cards into every possible combination of two ranks per
group, you would find the total number of possible combinations to be
N! / (na! nb! nc!) = 6! / (2! 2! 2!) = 90
And the values of SSbg(R) produced by these 90 combinations would constitute the
sampling distribution of SSbg(R) for this particular case. Of these 90 possible
combinations, a few (6) would yield values of SSbg(R) equal to exactly zero. All the
rest would produce values greater than zero. (It is mathematically impossible to
have a sum of squared deviates less than zero.) Accordingly, the mean of this
sampling distribution—the value that observed instances of SSbg(R) will tend to
approximate if the null hypothesis is true—is not zero, but something greater than
zero.
In any particular case of this sort, the mean of the sampling distribution of
SSbg(R) is given by the formula
(k − 1) × N(N + 1)/12

which for the simple case just examined works out as

(3 − 1) × 6(6 + 1)/12 = 7.0

For our main example, we therefore know that the observed value of
SSbg(R)=378.7 belongs to a sampling distribution whose mean is equal to

(3 − 1) × 21(21 + 1)/12 = 77.0
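The claims about the miniature sampling distribution can be checked by brute-force enumeration. The sketch below (an illustrative check added here, not part of the original text) generates all 90 ways of sorting the ranks 1 through 6 into three groups of two, computes SSbg(R) for each, and confirms that exactly 6 combinations give zero and that the mean of the distribution is 7.0:

```python
from itertools import combinations

ranks = {1, 2, 3, 4, 5, 6}
m_all = sum(ranks) / len(ranks)  # overall mean rank = 3.5

def ss_bg(groups):
    """Between-groups sum of squared deviates, computed on ranks."""
    return sum(len(g) * (sum(g) / len(g) - m_all) ** 2 for g in groups)

# Enumerate every way of splitting the six ranks into groups A, B, C of two.
values = []
for a in combinations(sorted(ranks), 2):
    rest = ranks - set(a)
    for b in combinations(sorted(rest), 2):
        c = tuple(rest - set(b))
        values.append(ss_bg([a, b, c]))

print(len(values))                       # 90 possible combinations
print(sum(1 for v in values if v == 0))  # 6 of them yield SSbg(R) = 0
print(sum(values) / len(values))         # mean of the distribution: 7.0
```
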
All that now remains is to figure out how to turn this fact into a rigorous
assessment of probability.
The Kruskal-Wallis Statistic: H
In case you have been girding yourself for some heavy slogging of the sort
encountered with the Mann-Whitney test, you can now relax, for the rest of
the journey is quite an easy one. The Kruskal-Wallis procedure concludes by
defining a ratio symbolized by the letter H, whose numerator is the observed
value of SSbg(R) and whose denominator includes a portion of the above
formula for the mean of the sampling distribution of SSbg(R). Note that most
textbooks give a very different-looking formula for the calculation of H—a
rather impenetrable structure to which we will return in a moment. This first
version affords a much clearer sense of the underlying concepts.
H = SSbg(R) / [N(N + 1)/12]
And now for the denouement. When each of the k samples includes at least
5 observations (that is, when na, nb, nc, etc., are all equal to or greater
than 5), the sampling distribution of H is a very close approximation of the
chi-square distribution for df=k—1. It is actually a fairly close approximation
even when one or more of the samples includes as few as 3 observations.
For our present example, we can therefore calculate the value of H as
H = 378.7 / [21(21 + 1)/12] = 378.7 / 38.5 = 9.84
And then, treating this result as though it were a value of chi-square, we can
refer it to the sampling distribution of chi-square with df=3—1=2. The
following graph, borrowed from Chapter 8, will remind you of the outlines of
this particular chi-square distribution. In brief: by the Kruskal-Wallis test,
the observed aggregate difference among the three samples is significant a
bit beyond the .01 level.
Theoretical Sampling Distribution of Chi-Square for df=2
An Alternative Formula for the Calculation of H
I noted a moment ago that textbook accounts of the Kruskal-Wallis test
usually give a different version of the formula for H. If you are a beginning
student calculating H by hand, I would recommend using the version given
above, as it gives you a clearer idea of just what H is measuring. Once you
get the hang of things, however, you might find this alternative
computational formula a bit more convenient.
H = [12 / (N(N + 1))] × Σ[(Tg)²/ng] − 3(N + 1)
In any event, as you can see below, this version yields exactly the same
result as the other.
H = [12 / (21(21 + 1))] × [(131)²/8 + (58)²/7 + (42)²/6] − 3(21 + 1) = 9.84
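Both versions of the formula are easy to mechanize. The sketch below (illustrative Python added here, not from the original text) ranks the wine-rating data with midranks for ties, applies the computational formula for H, and reproduces the value 9.84:

```python
def midranks(values):
    """Rank values from 1..N, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # midrank of the tied block
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H = 12/(N(N+1)) * sum((Tg)^2 / ng) - 3(N+1), with Tg the rank sum of group g."""
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        t_g = sum(ranks[start:start + len(g)])
        total += t_g ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

a = [6.4, 6.8, 7.2, 8.3, 8.4, 9.1, 9.4, 9.7]
b = [2.5, 3.7, 4.9, 5.4, 5.9, 8.1, 8.2]
c = [1.3, 4.1, 4.9, 5.2, 5.5, 8.2]
print(round(kruskal_wallis_h([a, b, c]), 2))  # 9.84
```

Note that, like the hand calculation above, this sketch applies no correction for ties.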
The VassarStats web site has a page that will perform all steps of the Kruskal-Wallis
test, including the rank-ordering of the raw measures.
RUN TESTS
The analysis of runs within a sequence is applied in statistics in many ways (for examples see [7,
Section II.5 (b),], [1]). The term run may in general be explained as a succession of items of the
same class. Many concepts to analyze runs in a series of data have been studied. The main
concepts are based on (i) the analysis of the total number of runs of a given class (see [9, 10])
and (ii) examinations about the appearance of long runs (see [7, Chapter XIII,], [8, 11] ).
For example, consider a sequence of bits. Binary runs are for example arbitrary repetitive
patterns within the sequence. Let X = (x1, x2, ..., xn) be a sequence of pairwise different
numbers, like an output of a pseudorandom number generator. A sequence of bits occurs if one
considers the signs (0 or 1) of the differences xi+1 − xi, for i = 1, ..., n−1, writing 1 for a
positive difference and 0 for a negative one. For
example X = (5,4,1,7,2,3,6) yields S = (0,0,1,0,1,1). This concept is used in [10, 11]. A
subsequence of length l consisting only of 1s is called a "run up" of length l (this indicates an
increasing subsequence of length l+1 within the sequence X). The opposite case, a subsequence
consisting only of 0s, is called a "run down". In the latter papers, several statistics based on this
run definition are treated for both areas (i) and (ii). These considerations form the basis of a
run-test proposed by Knuth [9] in order to test pseudorandom numbers. Knuth's run-test is one of the
most common tests used for examining PRNs. An asymptotically chi-squared distributed test
statistic based on the number of runs with length l = 1,2,3,4,5 and l ≥ 6
is given in Knuth [9, 6].
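The construction of the bit sequence S from X, and the extraction of run lengths on which Knuth's statistic is built, can be sketched as follows (the helper names are hypothetical; the pLab implementation itself is not reproduced here):

```python
def signs(x):
    """Map a sequence of pairwise different numbers to its bit sequence S:
    1 where the sequence increases, 0 where it decreases."""
    return [1 if b > a else 0 for a, b in zip(x, x[1:])]

def run_lengths(bits, kind=1):
    """Lengths of the maximal blocks of `kind` (1 = runs up, 0 = runs down)."""
    lengths, current = [], 0
    for b in bits:
        if b == kind:
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths

x = (5, 4, 1, 7, 2, 3, 6)
s = signs(x)
print(s)                  # [0, 0, 1, 0, 1, 1]
print(run_lengths(s, 1))  # runs up:   [1, 2]
print(run_lengths(s, 0))  # runs down: [2, 1]
```
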
Consider a fixed sample size n. The run-test implemented in pLab calculates Knuth's test
statistic m times. To these "asymptotically chi-squared" values a Kolmogorov-Smirnov test is
applied (for details see [6]).
The graphics below show the typical behavior of this test if "good" and "flawed" linear
congruential generators with moduli
are used. The lines in the graphics indicate the
rejection area of the Kolmogorov-Smirnov statistic with levels of significance 0.05 and 0.01. The
first graphic below shows the results obtained from Super-Duper =
for
sample sizes
and m = 100. The subsequent graphics are obtained from the subsequences
with step sizes 99, 565 and 739 generated from Super-Duper. Note that these subsequences are
produced by similar linear congruential generators with multipliers 1031357269, 3280877789 and
3028669781. Further results for linear and inversive congruential generators are given in [2, 6].
For subsequence behavior of well-known linear congruential generators see [3, 4, 5].
References
[1] A.J. Duncan. Quality Control and Industrial Statistics. Richard D. Irwin, Inc., 4th
edition, 1974.
[2] K. Entacher. Selected random number generators in run tests. Preprint,
Mathematics Institute, University of Salzburg.
UNIT V
ANALYSIS OF VARIANCE
An important technique for analyzing the effect of categorical factors on a response is to perform an
Analysis of Variance. An ANOVA decomposes the variability in the response variable amongst the
different factors. Depending upon the type of analysis, it may be important to determine: (a) which factors
have a significant effect on the response, and/or (b) how much of the variability in the response variable is
attributable to each factor.
STATGRAPHICS Centurion provides several procedures for performing an analysis of variance:
1. One-Way ANOVA - used when there is only a single categorical factor. This is equivalent to comparing
multiple groups of data.
2. Multifactor ANOVA - used when there is more than one categorical factor, arranged in a crossed
pattern. When factors are crossed, the levels of one factor appear at more than one level of the other
factors.
3. Variance Components Analysis - used when there are multiple factors, arranged in a hierarchical
manner. In such a design, each factor is nested in the factor above it.
4. General Linear Models - used whenever there are both crossed and nested factors, when some factors
are fixed and some are random, and when both categorical and quantitative factors are present.
One-Way ANOVA
A one-way analysis of variance is used when the data are divided into groups according to only one
factor. The questions of interest are usually: (a) Is there a significant difference between the groups?, and
(b) If so, which groups are significantly different from which others? Statistical tests are provided to
compare group means, group medians, and group standard deviations. When comparing means, multiple
range tests are used, the most popular of which is Tukey's HSD procedure. For equal size samples,
significant group differences can be determined by examining the means plot and identifying those
intervals that do not overlap.
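The decomposition underlying a one-way ANOVA is simple enough to sketch directly (illustrative pure Python added here, not STATGRAPHICS output):

```python
def one_way_anova(groups):
    """Return (SS_between, SS_within, F) for k independent groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-groups: weighted squared deviations of group means from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: squared deviations of each value from its own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_ratio

# Two small groups, hand-checkable:
ssb, ssw, f = one_way_anova([[1, 2, 3], [2, 3, 4]])
print(ssb, ssw, f)  # 1.5 4.0 1.5
```
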
Multifactor ANOVA
When more than one factor is present and the factors are crossed, a multifactor ANOVA is appropriate.
Both main effects and interactions between the factors may be estimated. The output includes an ANOVA
table and a new graphical ANOVA from the latest edition of Statistics for Experimenters by Box, Hunter
and Hunter (Wiley, 2005). In a graphical ANOVA, the points are scaled so that any levels that differ
by more than the scatter exhibited in the distribution of the residuals are significantly different.
Variance Components Analysis
A Variance Components Analysis is most commonly used to determine the level at which variability is
being introduced into a product. A typical experiment might select several batches, several samples from
each batch, and then run replicate tests on each sample. The goal is to determine the relative
percentages of the overall process variability that is being introduced at each level.
General Linear Model
The General Linear Models procedure is used whenever the above procedures are not appropriate. It can
be used for models with both crossed and nested factors, models in which one or more of the variables is
random rather than fixed, and when quantitative factors are to be combined with categorical ones.
Designs that can be analyzed with the GLM procedure include partially nested designs, repeated
measures experiments, split plots, and many others. For example, pages 536-540 of the book Design and
Analysis of Experiments (sixth edition) by Douglas Montgomery (Wiley, 2005) contains an example of an
experimental design with both crossed and nested factors. For that data, the GLM procedure produces
several important tables, including estimates of the variance components for the random factors.
Analysis of Variance for Assembly Time

Source          Sum of Squares   Df   Mean Square   F-Ratio   P-Value
Model           243.7            23   10.59         4.54      0.0002
Residual        56.0             24   2.333
Total (Corr.)   299.7            47
Type III Sums of Squares

Source                     Sum of Squares   Df   Mean Square   F-Ratio   P-Value
Layout                     4.083            1    4.083         0.34      0.5807
Operator(Layout)           71.92            6    11.99         2.18      0.1174
Fixture                    82.79            2    41.4          7.55      0.0076
Layout*Fixture             19.04            2    9.521         1.74      0.2178
Fixture*Operator(Layout)   65.83            12   5.486         2.35      0.0360
Residual                   56.0             24   2.333
Total (corrected)          299.7            47

Expected Mean Squares

Source                     EMS
Layout                     (6)+2.0(5)+6.0(2)+Q1
Operator(Layout)           (6)+2.0(5)+6.0(2)
Fixture                    (6)+2.0(5)+Q2
Layout*Fixture             (6)+2.0(5)+Q3
Fixture*Operator(Layout)   (6)+2.0(5)
Residual                   (6)

Variance Components

Source                     Estimate
Operator(Layout)           1.083
Fixture*Operator(Layout)   1.576
Residual                   2.333
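Reading the expected mean squares from the bottom up, each variance component is recovered by subtracting the mean square whose expectation contains one fewer term and dividing by the coefficient of the component of interest. The following sketch (added here for illustration; the coefficients 6 and 2 are read from the EMS column above) reproduces the variance component estimates from the sums of squares:

```python
# Mean squares from the Type III table (SS / df)
ms_operator = 71.92 / 6    # Operator(Layout):          EMS = (6) + 2.0(5) + 6.0(2)
ms_fix_op   = 65.83 / 12   # Fixture*Operator(Layout):  EMS = (6) + 2.0(5)
ms_residual = 56.0 / 24    # Residual:                  EMS = (6)

# Solve the EMS equations from the bottom up:
var_residual = ms_residual                      # component (6)
var_fix_op   = (ms_fix_op - ms_residual) / 2.0  # component (5)
var_operator = (ms_operator - ms_fix_op) / 6.0  # component (2)

print(round(var_operator, 3))  # 1.083
print(round(var_fix_op, 3))    # 1.576
print(round(var_residual, 3))  # 2.333
```
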
Principles of experimental design, following Ronald A. Fisher
A methodology for designing experiments was proposed by Ronald A. Fisher, in his innovative
book The Design of Experiments (1935). As an example, he described how to test the hypothesis
that a certain lady could distinguish by flavour alone whether the milk or the tea was first placed
in the cup. While this sounds like a frivolous application, it allowed him to illustrate the most
important ideas of experimental design:
Comparison
In many fields of study it is hard to reproduce measured results exactly. Comparisons
between treatments are much more reproducible and are usually preferable. Often one
compares against a standard, scientific control, or traditional treatment that acts as
baseline.
Randomization
There is an extensive body of mathematical theory that explores the consequences of
making the allocation of units to treatments by means of some random mechanism such
as tables of random numbers, or the use of randomization devices such as playing cards
or dice. Provided the sample size is adequate, the risks associated with random allocation
(such as failing to obtain a representative sample in a survey, or having a serious
imbalance in a key characteristic between a treatment group and a control group) are
calculable and hence can be managed down to an acceptable level. Random does not
mean haphazard, and great care must be taken that appropriate random methods are used.
Replication
Measurements are usually subject to variation and uncertainty. Measurements are
repeated and full experiments are replicated to help identify the sources of variation and
to better estimate the true effects of treatments.
Blocking
Blocking is the arrangement of experimental units into groups (blocks) that are similar to
one another. Blocking reduces known but irrelevant sources of variation between units
and thus allows greater precision in the estimation of the source of variation under study.
Orthogonality
Example of orthogonal factorial design
Orthogonality concerns the forms of comparison (contrasts) that can be legitimately and
efficiently carried out. Contrasts can be represented by vectors and sets of orthogonal
contrasts are uncorrelated and independently distributed if the data are normal. Because
of this independence, each orthogonal treatment provides different information to the
others. If there are T treatments and T – 1 orthogonal contrasts, all the information that
can be captured from the experiment is obtainable from the set of contrasts.
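The orthogonality of a set of contrasts is easy to check numerically. The sketch below uses one illustrative set of T − 1 = 3 Helmert-style contrasts for T = 4 treatments (the particular vectors are my own choice; any full set of orthogonal contrasts carries the same information):

```python
import numpy as np

# Three mutually orthogonal contrasts for T = 4 treatments (an illustrative
# set; each row compares groups of treatment means).
contrasts = np.array([
    [ 1, -1,  0,  0],    # treatment 1 vs treatment 2
    [ 1,  1, -2,  0],    # average of 1,2 vs treatment 3
    [ 1,  1,  1, -3],    # average of 1,2,3 vs treatment 4
])

# Each row sums to zero (a valid contrast), and the rows are pairwise
# orthogonal, so the Gram matrix is diagonal.
print(contrasts.sum(axis=1).tolist())                                  # [0, 0, 0]
print(np.allclose(contrasts @ contrasts.T, np.diag([2, 6, 12])))       # True
```

Because the Gram matrix is diagonal, under normality the three contrast estimates are uncorrelated and each carries separate information about the treatments.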
Factorial experiments
Factorial experiments are used instead of the one-factor-at-a-time method. They are
efficient at evaluating the effects and possible interactions of several factors
(independent variables).
Analysis of the design of experiments was built on the foundation of the analysis of variance, a
collection of models in which the observed variance is partitioned into components due to
different factors which are estimated and/or tested.
Example
This example is attributed to Harold Hotelling.[9] It conveys some of the flavor of those aspects
of the subject that involve combinatorial designs.
The weights of eight objects are to be measured using a pan balance and set of standard weights.
Each weighing measures the weight difference between objects placed in the left pan vs. any
objects placed in the right pan by adding calibrated weights to the lighter pan until the balance is
in equilibrium. Each measurement has a random error. The average error is zero; the standard
deviation of the probability distribution of the errors is the same number σ on different
weighings; and errors on different weighings are independent. Denote the true weights by
θ1, θ2, ..., θ8.
We consider two different experiments:
1. Weigh each object in one pan, with the other pan empty. Let Xi be the measured weight
of the ith object, for i = 1, ..., 8.
2. Do the eight weighings according to a schedule in which each of the first seven
objects sits in the left pan in four weighings and in the right pan in the other four
(the placement pattern forms an orthogonal array of signs), and let Yi be the measured
difference for i = 1, ..., 8.
Then the weight θ1 is estimated by adding the Yi from the weighings in which object 1
was in the left pan, subtracting the Yi from the weighings in which it was in the right
pan, and dividing by 8. Similar estimates can be found for the weights of the other
items, using the pattern of signs appropriate to each object.
The question of design of experiments is: which experiment is better?
The variance of the estimate X1 of θ1 is σ² if we use the first experiment. But if we use the second
experiment, the variance of the estimate given above is σ²/8. Thus the second experiment gives
us 8 times as much precision for the estimate of a single item, and estimates all items
simultaneously, with the same precision. What is achieved with 8 weighings in the second
experiment would require 64 weighings if the items are weighed separately. However, note that the
estimates for the items obtained in the second experiment have errors that are correlated with
each other.
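The precision claim can be verified from the design matrix. The sign matrix below is a hypothetical reconstruction (the original schedule table did not survive extraction); it is one valid orthogonal schedule, with +1 meaning "object in the left pan" and −1 "in the right pan". What matters is only that the columns are orthogonal, which forces X'X = 8I and hence a variance of σ²/8 for each least-squares weight estimate:

```python
import numpy as np

# One valid weighing schedule (an assumption; the original table was lost):
# row = weighing, column = object; +1 = left pan, -1 = right pan.
X = np.array([
    [ 1,  1,  1,  1,  1,  1,  1,  1],
    [ 1,  1,  1, -1, -1, -1, -1,  1],
    [ 1, -1, -1,  1,  1, -1, -1,  1],
    [ 1, -1, -1, -1, -1,  1,  1,  1],
    [-1,  1, -1,  1, -1,  1, -1,  1],
    [-1,  1, -1, -1,  1, -1,  1,  1],
    [-1, -1,  1,  1, -1, -1,  1,  1],
    [-1, -1,  1, -1,  1,  1, -1,  1],
])

# Orthogonal columns give X'X = 8I, so the least-squares estimate
# theta-hat = (X'X)^{-1} X'Y = X'Y / 8 has variance sigma^2/8 per weight,
# versus sigma^2 when each object is weighed alone.
XtX = X.T @ X
print(np.allclose(XtX, 8 * np.eye(8)))   # True
```

The 8-fold variance reduction comes entirely from this orthogonality: every weighing contributes information about every object simultaneously.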
Many problems of the design of experiments involve combinatorial designs, as in this example.
Statistical control
It is best for a process to be in reasonable statistical control prior to conducting designed
experiments. When this is not possible, proper blocking, replication, and randomization allow for
the careful conduct of designed experiments.[12]
Experimental designs after Fisher
Some efficient designs for estimating several main effects simultaneously were found by Raj
Chandra Bose and K. Kishen in 1940 at the Indian Statistical Institute, but remained little known
until the Plackett-Burman designs were published in Biometrika in 1946. About the same time,
C. R. Rao introduced the concept of orthogonal arrays as experimental designs. This
concept played a central role in the development of Taguchi methods by Genichi Taguchi,
which took place during his visit to the Indian Statistical Institute in the early 1950s. His
methods were successfully applied and adopted by Japanese and Indian industries and
subsequently were also embraced by US industry, albeit with some reservations.
In 1950, Gertrude Mary Cox and William Gemmell Cochran published the book Experimental
Designs which became the major reference work on the design of experiments for statisticians
for years afterwards.
Developments of the theory of linear models have encompassed and surpassed the cases that
concerned early writers. Today, the theory rests on advanced topics in linear algebra, algebra and
combinatorics.
As with other branches of statistics, experimental design is pursued using both frequentist and
Bayesian approaches: In evaluating statistical procedures like experimental designs, frequentist
statistics studies the sampling distribution while Bayesian statistics updates a probability
distribution on the parameter space.
Some important contributors to the field of experimental designs are C. S. Peirce, R. A. Fisher,
F. Yates, C. R. Rao, R. C. Bose, J. N. Srivastava, S. S. Shrikhande, D. Raghavarao, W. G.
Cochran, O. Kempthorne, W. T. Federer, A. S. Hedayat, J. A. Nelder, R. A. Bailey, J. Kiefer, W.
J. Studden, F. Pukelsheim, D. R. Cox, H. P. Wynn, A. C. Atkinson, G. E. P. Box and G. Taguchi.
The textbooks of D. Montgomery and R. Myers have reached generations of students and
practitioners.
Completely Randomized Design
Suppose we have 4 different diets which we want to compare. The diets are
labeled Diet A, Diet B, Diet C, and Diet D. We are interested in how the diets affect the
coagulation rates of rabbits. The coagulation rate is the time in seconds that it takes for a
cut to stop bleeding. We have 16 rabbits available for the experiment, so we will use 4
on each diet. How should we use randomization to assign the rabbits to the four
treatment groups? The 16 rabbits arrive and are placed in a large compound until you are
ready to begin the experiment, at which time they will be transferred to cages.
Possible Assignment Plans
Method 1: We assume that rabbits will be caught "at random". Catch four rabbits and
assign them to Diet A. Catch the next four rabbits and assign them to Diet B. Continue
with Diets C and D. Since the rabbits were "caught at random", this would produce a
completely randomized design. Analyze the results as a completely randomized design.
Method 1 is faulty. The first rabbits caught could be the slowest and weakest
rabbits, those least able to escape capture. This would bias the results. If the
experimental results came out to the disadvantage of Diet A, there would be no way to
determine if the results were a consequence of Diet A or the fact that the weakest rabbits
were placed on that diet by our "randomization process".
Method 2: Catch all the rabbits and label them 1-16. Select four numbers 1-16 at random
(without replacement) and put them in a cage to receive Diet A. Then select another four
numbers at random and put them in a cage to receive Diet B. Continue until you have
four cages with four rabbits each. Each cage receives a different diet, and the experiment
is analyzed as a completely randomized experiment.
Method 2 is a completely randomized design, but it has a serious flaw. The
experiment lacks replication. There are 16 rabbits, but the rabbits in each cage are not
independent. If one rabbit eats a lot, the others in that cage have less to eat. The
experimental unit is the smallest unit of experimental matter to which the treatment is
applied at random. In this case, the cages are the experimental units. For a completely
randomized design, each rabbit must live in its own cage.
Method 3: Have a bowl with the letters A, B, C, and D printed on separate slips of paper.
Catch the first rabbit, pick a slip at random from the bowl and assign the rabbit to the diet
letter on the slip. Do not replace the slip. Catch the second rabbit and select another slip
from the remaining three slips. Assign that diet to the second rabbit. Continue until the
first four rabbits are assigned one of the four diets. In this way, all of the slow rabbits
have different diets. Replace the slips and repeat the procedure until all 16 rabbits are
assigned to a diet. Analyze the results as a completely randomized design.
Method 3 is not a completely randomized design. Since you have selected the
rabbits in blocks of 4, one assigned to each of the diets A-D, the analysis should be for a
randomized block design. The treatment is Diet but you have blocked on "catchability".
Method 4: Catch all the rabbits and label them 1-16. Put 16 slips of paper in a bowl, four
each with the letters A, B, C, and D. Put another 16 slips of paper numbered 1-16 in a
second bowl. Pick a slip from each bowl. The rabbit with the selected number is given
the selected diet. To make it easy to remember which rabbit gets which diet, the cages
are arranged as shown below.
Method 4 has some deficiencies. The assignment of rabbits to treatment is a
completely randomized design. However, the arrangement of the cages for convenience
creates a bias in the results. The heat in the room rises, so the rabbits receiving Diet A
will be living in a very different environment than those receiving Diet D. Any observed
difference cannot be attributed to diet, but could just as easily be a result of cage
placement.
Cage placement is not a part of the treatment, but must be taken into account. In a
completely randomized design, every rabbit must have the same chance of receiving any
diet at any location in the matrix of cages.
A Completely Randomized Design
Label the cages 1-16. In a bowl put 16 strips of paper each with one of the
integers 1-16 written on it. In a second bowl put 16 strips of paper, four each labeled A,
B, C, and D. Catch a rabbit. Select a number and a letter from each bowl. Place the
rabbit in the location indicated by the number and feed it the diet assigned by the letter.
Repeat without replacement until all rabbits have been assigned a diet and cage.
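The two-bowl procedure can be sketched in code. This is an illustrative sketch (the variable names and the fixed seed are my own choices, not part of the text): shuffling the 16 cage numbers and the 16 diet slips and pairing them off is equivalent to drawing one slip from each bowl without replacement.

```python
import random

rng = random.Random(7)  # fixed seed so the illustration is reproducible

cages = list(range(1, 17))   # slips numbered 1-16 (cage locations)
diets = list("ABCD") * 4     # four slips for each diet letter
rng.shuffle(cages)
rng.shuffle(diets)

# Rabbit r receives the r-th drawn cage number and diet letter.
assignment = {rabbit: (cage, diet)
              for rabbit, (cage, diet) in enumerate(zip(cages, diets), start=1)}

# Every cage is used exactly once and every diet appears exactly four times.
print(sorted(c for c, _ in assignment.values()) == list(range(1, 17)))  # True
```

Because both cage and diet are drawn at random for each rabbit, every rabbit has the same chance of receiving any diet at any location, which is exactly the requirement stated above.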
If, for example, the first number selected was 7 and the first letter B, then the first rabbit
would be placed in location 7 and fed diet B. An example of the completed cage
selection is shown below.
Notice that the completely randomized design does not account for the difference in
heights of the cages. It is just as the name suggests, a completely random assignment. In
this case, we see that the rabbits with Diet A are primarily on the bottom and those with
Diet D are on the top. A completely randomized design assumes that these locations will
not produce a systematic difference in response (coagulation time). If we do believe the
location is an important part of the process, we should use a randomized block design.
For this example, we will continue to use a completely randomized design.
One-Way ANOVA
To analyze the results of the experiment, we use a one-way analysis of variance.
The measured coagulation times for each diet are given below:
Diet A Diet B Diet C Diet D
62 63 68 56
60 67 66 62
63 71 71 60
59 64 67 61
Mean 61 66.25 68 59.75
The null hypothesis is
H0: μA = μB = μC = μD   (all treatment means the same)
and the alternative is
Ha: at least one mean is different.
The ANOVA Table is given below:
Response: Coagulation Time
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio   Prob>F
Model     3       191.50000       63.8333    9.1737    0.0020
Error    12        83.50000        6.9583
Total    15       275.00000
From the computer output, we see that there is a statistically significant difference in
coagulation time (p = 0.0020). Just what is being measured by these sums of squares
and mean squares? In this section we will consider the theory of ANOVA.
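Before working through the theory, the table above can be reproduced with standard software. For instance, a quick check of the F ratio and p-value with `scipy.stats.f_oneway` (assuming SciPy is available) on the coagulation data:

```python
from scipy import stats

# Coagulation times for the four diets, from the data table above.
diet_a = [62, 60, 63, 59]
diet_b = [63, 67, 71, 64]
diet_c = [68, 66, 71, 67]
diet_d = [56, 62, 60, 61]

# One-way ANOVA: F ratio and p-value.
f, p = stats.f_oneway(diet_a, diet_b, diet_c, diet_d)
print(round(f, 4), round(p, 4))   # 9.1737 0.002
```

The values agree with the Model row of the ANOVA table: F = 9.1737 with Prob>F = 0.0020.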
The Theory of ANOVA
There is a lot of technical notation in Analysis of Variance. The notation we will
use is consistent with the notation of Box, Hunter, and Hunter's classic text, Statistics for
Experimenters.
Some Notation

k = number of treatments.
In our example, there are 4 treatment classes: Diet A, Diet B, Diet C, and Diet D, so k = 4.

n_t = number of observations for treatment t.
Each of the treatments in this experiment has four observations: n1 = n2 = n3 = n4 = 4.

Y_ti = the ith observation in the tth treatment class.
In our example, Y(1,1) = 62, Y(1,3) = 63, Y(3,1) = 68, and Y(4,4) = 61.

N = total number of observations, N = Σ_t n_t. In this case, N = 16.

Y_t• = Σ_i Y_ti = sum of the observations in the tth treatment class.
In our example, Y_1• = 62 + 60 + 63 + 59 = 244, Y_2• = 265, Y_3• = 272, and Y_4• = 239.

Ȳ_t• = mean of the observations in the tth treatment class.
Here, Ȳ_1• = 61, Ȳ_2• = 66.25, Ȳ_3• = 68, and Ȳ_4• = 59.75.

Y•• = total of all observations (overall total), Y•• = Σ_t Σ_i Y_ti.
In our example, Y•• = 62 + 60 + 63 + ... + 60 + 61 = 1020.

Ȳ•• = overall mean, Ȳ•• = Y•• / N, where N = Σ_t n_t.
Here, Ȳ•• = 1020/16 = 63.75.
The ANOVA Model:
Y_ti = μ_t + e_ti, or equivalently Y_ti = μ + τ_t + e_ti, where μ_t = μ + τ_t.

Parameter   Estimate                      Values in Example

μ           μ̂ = Ȳ•• = Y••/N              1020/16 = 63.75

μ_t         μ̂_t = Ȳ_t• = Y_t•/n_t        ȳ_1• = 61, ȳ_2• = 66.25,
                                          ȳ_3• = 68, ȳ_4• = 59.75

τ_t         τ̂_t = Ȳ_t• − Ȳ••             τ̂_1 = −2.75, τ̂_2 = 2.5,
                                          τ̂_3 = 4.25, τ̂_4 = −4.0

e_ti        ê_ti = Y_ti − Ȳ_t•            ê_11 = 1, ê_12 = −1, ê_13 = 2, ê_14 = −2
                                          ê_21 = −3.25, ê_22 = 0.75, ê_23 = 4.75, ê_24 = −2.25
                                          ê_31 = 0, ê_32 = −2, ê_33 = 3, ê_34 = −1
                                          ê_41 = −3.75, ê_42 = 2.25, ê_43 = 0.25, ê_44 = 1.25
ANOVA as a Comparison of Estimates of Variance
Analysis of variance gets its name because it compares two different estimates of
the variance. If the null hypothesis is true, and there is no treatment effect, then the two
estimates of variance should be comparable; that is, their ratio should be one. The farther
the ratio of variances is from one, the more doubt is placed on the null hypothesis.
If the null hypothesis is true and all samples can be considered to come from one
population, we can estimate the variance in three different ways. All assume that the
observations are distributed about a common mean μ with variance σ².
The first estimate considers the observations as a single set of data. Here we
compute the variance using the standard formula. The sum of squared deviations from
the overall mean is

SS(total) = Σ_{t=1}^{k} Σ_{i=1}^{n_t} (Y_ti − Ȳ••)².

If we divide this quantity by

Σ_t n_t − 1 = N − 1,

we have an estimate of variance over all units, ignoring treatments. This is just the
sample variance of the combined observations. In our example, SS(total) = 275 and
s² = 275/15 = 18.333.
The second method of estimating the variance is to infer the value of σ² from s_Ȳ²,
the observed variance of the sample means. We calculate this by considering
the means of the four treatments. If the null hypothesis is true, these means have a variance of
σ²/4, since they are the means of samples of size 4 drawn at random from a population
with variance σ². In general, the treatment means have variance σ²/n_t. Consequently,
their sum of squared deviations from the overall mean, Σ_t (ȳ_t• − ȳ••)², divided by the
degrees of freedom, k − 1, is an estimate of σ²/n. So the product of n_t times

Σ_t (ȳ_t• − ȳ••)² / (k − 1)

is an estimate of σ² if the null hypothesis is true. So, the mean square for
treatment,

MS(trt) = Σ_t n_t (ȳ_t• − ȳ••)² / (k − 1),

estimates σ² when H0 is true.
The numerator, the sum of squared deviations due to treatments (which also reflects
experimental unit differences), is computed using

SS(trt) = Σ_t n_t (ȳ_t• − ȳ••)².

If all n_t are the same, then SS(trt) = n Σ_t (ȳ_t• − ȳ••)². In our example, we have

SS(trt) = 4(61 − 63.75)² + 4(66.25 − 63.75)² + 4(68 − 63.75)² + 4(59.75 − 63.75)²
        = 191.5.

The mean square for treatment is this sum of squares divided by its degrees of freedom.
In our example, MS(trt) = 191.5/3 = 63.833, so 63.833 is another estimate of the population
variance under the null hypothesis. This is known as the estimated variance between
treatments, since it was computed using the differences in treatment means.
The sum of squared deviations about the treatment means is

SS(error) = Σ_t Σ_i (y_ti − ȳ_t•)² = SS(total) − SS(trt).

In our example, this is

SS(error) = (62 − 61)² + (60 − 61)² + ... + (60 − 59.75)² + (61 − 59.75)² = 83.5.

If we divide this sum of squares by its degrees of freedom, N − k, we have the pooled
variance for the four groups of observations: 83.5/(16 − 4) = 6.95833. The variance for Diet A
is 3.3333, the variance for Diet B is 12.9167, the variance for Diet C is 4.6667, and the
variance for Diet D is 6.9167. Since each of these is based on four observations,

s_p² = [3(3.3333) + 3(12.9167) + 3(4.6667) + 3(6.9167)] / 12 = 6.95833.

This is our third estimate of variance and is an estimate of the variance within treatments,
since the pooled variance takes the treatment groups into account. Random variation can
be characterized by this pooled variance, as measured by

MS(residual) = SS(residual) / (N − k).

The standard deviation of a treatment mean, sd(Ȳ_t•) = sqrt(σ²/n_t), is estimated by

se(Ȳ_t•) = sqrt(MS(residual)/n_t).

(The estimated standard deviation is called the standard error.)
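The three sums of squares and the standard error can be computed directly. A minimal numpy sketch (same rabbit data as before, one row per diet):

```python
import numpy as np

y = np.array([[62, 60, 63, 59], [63, 67, 71, 64],
              [68, 66, 71, 67], [56, 62, 60, 61]], dtype=float)
k, n = y.shape                                          # 4 treatments, 4 obs each
N = y.size

ss_total = ((y - y.mean()) ** 2).sum()                  # 275.0
ss_trt = (n * (y.mean(axis=1) - y.mean()) ** 2).sum()   # 191.5
ss_error = ss_total - ss_trt                            # 83.5

ms_error = ss_error / (N - k)         # pooled (within-treatment) variance, 6.9583
se_mean = np.sqrt(ms_error / n)       # standard error of a treatment mean
print(ss_total, ss_trt, ss_error)     # 275.0 191.5 83.5
```

The three values match SS(total), SS(trt), and SS(error) computed by hand above, and se_mean is the standard error of each diet mean.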
The F-Statistic
It can be shown that, in general, whether or not the null hypothesis is true,
MS(error) estimates σ² and MS(trt) estimates

σ² + Σ_t n_t τ_t² / (k − 1)

(see Appendix A). If n_t = n for all t, then MS(trt) estimates

σ² + n Σ_t τ_t² / (k − 1).

If the null hypothesis is true, then τ_t = 0 for all t, so MS(trt) estimates σ² + 0 = σ².
The F-score is the ratio of the mean square treatment to the mean square residual. If the
treatment effects τ_t are all zero, this ratio should be equal to one.

F = MS(trt) / MS(error),  which estimates  [σ² + n Σ_t τ_t² / (k − 1)] / σ².

If H0 is true, F_calc has an F-distribution with k − 1 and N − k degrees of freedom. The
larger the value of the F-score, the greater the estimated treatment effect. A large F-score
corresponds to a small p-value, which casts doubt on the validity of the null hypothesis of
equal means. The null hypothesis of equal means is equivalent to a null hypothesis of all
treatment effects being zero:

H0: all μ_t are equal   is equivalent to   H0: all τ_t = 0.

In our example of the rabbit diets, F(3,12) = 9.1737. This is quite large. An F-score this
large would happen by chance only 2 out of 1,000 times when the null hypothesis is true.
This is strong evidence against the null hypothesis that all τ_t = 0. Thus we rejected the null
hypothesis in favor of the alternative that at least one of the treatment means differs from
another.
The ANOVA Table and Partitioning of Variance
The ANOVA table consolidates most of these computations, giving the essential
sums of squares and degrees of freedom for our estimates of variance. The standard table
is shown below. This is the form of the computer output seen earlier.
Source      df      SS          MS          F                   Prob>F
Total       N − 1   SS(total)   ----
Treatment   k − 1   SS(trt)     MS(trt)     MS(trt)/MS(error)   *
Error       N − k   SS(error)   MS(error)
In our example, we have
Source      df    SS      MS       F        Prob>F
Total       15    275     -----
Treatment    3    191.5   63.833   9.1737   0.002
Residual    12    83.5    6.9583
Notice that Total SS = Treatment SS + Residual SS. The total sum of squares has been
partitioned into two parts, the Treatment Sum of Squares and the Residual, or Error,
Sum of Squares. A proof that this will always be the case is given in Appendix B. The
Treatment Sums of Squares is a measure of the variation among the treatment groups,
which includes the variation of the rabbits. The Residual Sums of Squares is a measure
of the variation among the rabbits within each treatment group. Some texts suggest that
the MS(Treatment) is "explained" variance and MS(Residual) is "unexplained" variance.
The variance estimated by MS(Treatment) is explained by the fact that the observations
may come from different populations while the MS(Residual) cannot be explained by
variance in population parameters and is therefore considered as random or chance
variation (see Wonnacott and Wonnacott). In this terminology, the F-statistic is the ratio
of explained variance to unexplained variance:

F = explained variance / unexplained variance.
We can make this partition even finer by including the individual treatments in
our table.
Source                     df    SS       MS        F        Prob>F
Total                      15    275      ----
Treatment (among Diets)     3    191.5    63.833    9.1737   0.002
Residual (among Rabbits)   12    83.5     6.9583
  Within Diet A             3    10       3.3333
  Within Diet B             3    38.75    12.9167
  Within Diet C             3    14       4.6667
  Within Diet D             3    20.75    6.9167
In this table, notice that SS(Residual) is the sum of the Within Diet sums of squares:

83.5 = 10 + 38.75 + 14 + 20.75.

Also, MS(Residual) is the pooled variance based on the mean squares Within Diets:

[3(3.3333) + 3(12.9167) + 3(4.6667) + 3(6.9167)] / 12 = 6.958.
What Affects Power?
Recall that the power of a statistical test is the probability of rejecting the null hypothesis
when it is false. Also recall that for the 1-way analysis of variance,

F = MS(trt) / MS(error),  which estimates  [σ² + n Σ_t τ_t² / (k − 1)] / σ².

The larger the value of F, the greater the probability of rejecting the null hypothesis.
Consequently,
if σ² decreases, the power increases;
if n increases, the power increases;
if the τ_t increase in magnitude (the size of the treatment effects), the power increases.
This leads to the following design strategy priorities to increase power:
1. Reduce σ² (e.g., by blocking)
2. Increase n_t
3. Settle for reduced power
Assumptions
Like all hypothesis tests, the one-way ANOVA has several criteria that must be
satisfied (at least approximately) for the test to be valid. These criteria are usually
described as assumptions that must be satisfied, since one often cannot verify them
directly. These assumptions are listed below:
1. The population distribution of the response variable Y must be normal within each
class.
2. Independence of observed values within and among groups.
3. The population variances of Y values must be equal for all k classes
(σ1² = σ2² = ... = σk²).
How important are the assumptions?
1. Normality is not critical. Problems tend to arise if the distributions are highly
skewed and the design is unbalanced. The problems are aggravated if the sample sizes
are small.
2. The assumption of independence is critical.
3. The assumption of equal variance is important. However, the design of the
experiment with random assignment helps balance the variance. This is a greater
problem in observational studies.
4. The methods are sensitive to outliers. If there are outliers, we can use a
transformation, exclude the outlier and limit the domain of inference, perform the
analysis with and without the outlier and report all findings, or use non-parametric
techniques. Non-parametric techniques suffer from a lack of power.
Multiple Comparisons
Why don't we just compare treatments by repeatedly performing t-tests? Let's
think about this in terms of confidence intervals. A test of the hypothesis that two
treatment means are equal at the 5% significance level is rejected if and only if a 95%
confidence interval on the difference in the two means does not cover 0. If we have k
treatments, there are

r = C(k, 2) = k(k − 1)/2

possible confidence intervals (or comparisons) between treatment means. Although each
confidence interval would have a 0.95 probability of covering the true difference in
treatment means, the frequency with which all of the intervals would simultaneously
capture their true parameters is smaller than 95%. In fact, it can be no larger than 95%
and no smaller than 100(1 − 0.05r)%.
One consequence of this is that as the number of treatments increases, we are
increasingly likely to declare at least two treatment means different even if no differences
exist! To avoid this, several approaches have been suggested. One is to set

100(1 − 0.05/r)%

confidence intervals on the difference in two treatment means for each of the r
comparisons. Then the probability that all r confidence intervals capture their
parameters is at least 95%. This is a conservative approach.
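The arithmetic behind these statements is easy to check. A small sketch for the k = 4 diets (the 0.2649 figure is simply 1 − 0.95^6, the worst case under independent intervals):

```python
from math import comb

k = 4
r = comb(k, 2)                   # 6 pairwise comparisons among 4 diets
worst_case = 1 - 0.95 ** r       # chance at least one interval misses, if independent
per_interval = 1 - 0.05 / r      # adjusted level 100(1 - 0.05/r)% ~ 99.17%
print(r, round(worst_case, 4))   # 6 0.2649
```

With six unadjusted 95% intervals, the chance of at least one false declaration can be as high as about 26%; raising each interval to the 99.17% level keeps the simultaneous coverage at 95% or better.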
Another approach is to use the F-test in the Analysis of Variance as a guide.
Comparisons are made between treatment means only if the F-test is significant. This is
the approach most widely used in most disciplines.
Another approach is to use the method of Least Significant Difference. Compared
to other methods, the LSD procedure is more likely to call a difference significant and
therefore prone to Type I errors, but it is easy to use and is based on principles that
students already understand.
The LSD Procedure
We know that if two random samples of size n are selected from a normal
distribution with variance σ², then the variance of the difference in the two sample
means is

σ_D² = σ²/n + σ²/n = 2σ²/n.

In the case of ANOVA, we do not know σ², but we estimate it with s² = MSE. So
when two random samples of size n are taken from a population whose variance is
estimated by MSE, the standard error of the difference between the two means is

sqrt(2·MSE/n).

Two means will be considered significantly different at the 0.05 significance level if
they differ by more than t·sqrt(2·MSE/n), where t is the t-value for a 95% confidence
interval with the degrees of freedom associated with MSE. The value

LSD = t·sqrt(2·MSE/n)

is called the Least Significant Difference. If the two samples do not contain the same
number of entries, then

LSD = t·sqrt(MSE(1/n1 + 1/n2)).
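For the rabbit data, the LSD works out to about 4.06 seconds. A sketch using SciPy's t quantile (assuming SciPy is available; MSE and its degrees of freedom are taken from the ANOVA table above):

```python
from math import sqrt
from scipy import stats

mse, df_error, n = 6.9583, 12, 4
t_crit = stats.t.ppf(0.975, df_error)   # two-sided 5% critical t with 12 df (~2.179)
lsd = t_crit * sqrt(2 * mse / n)        # least significant difference
print(round(lsd, 2))                    # 4.06

# Diet means from the example; pairs differing by more than lsd are declared
# significantly different (e.g., C vs D: 68 - 59.75 = 8.25 > lsd).
means = {"A": 61.0, "B": 66.25, "C": 68.0, "D": 59.75}
```

By this criterion B, C differ significantly from A, D, while A vs D (1.25) and B vs C (1.75) do not exceed the LSD.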
Randomized Complete Block Design
In the previous section, we analyzed results from a completely randomized design
with a one-way analysis of variance. This design ignored the physical layout of the cages
and the potential effect of the height of cage in which a rabbit was housed. If we wanted
to acknowledge the potential effect of height of the cage on the coagulation time, we
should organize the experiment using a randomized complete block design. One diet of
each type will be used on each of the 4 shelves. The randomization procedure would
assign a number 1-16 to each of the rabbits. Put four slips of paper, marked 1, 2, 3, and 4,
in a bowl. Select a number 1-16 at random to choose a rabbit for Diet A, and pull a
number out of the bowl to select a position on the top row. Repeat three times without
replacement for Diets B, C, and D to complete the assignment to the top row. Follow the
same procedure to assign the other three rows. An example is shown below.
This is a randomized block design and should not be analyzed with the one-way
ANOVA. For the sake of illustration, the data collected with this method is the same as
with the completely randomized design. Ordering the data so it is easier to read, we have
the following observations.
            Diet A   Diet B   Diet C   Diet D   Mean for Shelf
Shelf 1      62       63       68       56        62.25
Shelf 2      60       67       66       62        63.75
Shelf 3      63       71       71       60        66.25
Shelf 4      59       64       67       61        62.75
Mean for
Diet         61       66.25    68       59.75    Grand Mean 63.75
Two-way ANOVA
The model that includes the blocking variable is

Y_ti = μ + τ_t + b_i + e_ti.

By blocking on the shelf position, we hope to increase the power of the test by removing
variability associated with shelf height. This would allow us to detect smaller differences
between treatments.
Our estimates are

y_ti = ȳ•• + (ȳ_t• − ȳ••) + (ȳ_•i − ȳ••) + ê_ti,

so that ê_ti = y_ti − ȳ_t• − ȳ_•i + ȳ••.
Considering the sums of squares, we have

Σ_t Σ_i y_ti² − N ȳ••² = Σ_t n(ȳ_t• − ȳ••)² + Σ_i k(ȳ_•i − ȳ••)² + Σ_t Σ_i ê_ti²
                          [Trt SS]             [Block SS]            [Error SS]

The expression Σ_t Σ_i y_ti² − N ȳ••² is the Total sum of squares, so we have

Total SS = Trt SS + Block SS + Error SS.

Computing the sums of squares, we have

65300 − 65025 = 275 = 191.5 + 38 + 45.5.
The computer output gives the same computed sums of squares.
Response: Coagulation Time
Effect Test
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob>F
Diet       3      191.50000        63.833     12.6264   0.0014
Row        3       38.00000        12.667      2.5055   0.1250
Error      9       45.50000         5.0556
C Total   15      275.00000

Diet   Mean      Shelf   Mean
A      61.0000   1       62.2500
B      66.2500   2       63.7500
C      68.0000   3       66.2500
D      59.7500   4       62.7500
From the computer output, we see that there is again a statistically significant
difference in coagulation time, with the p-value slightly smaller (p = 0.0014). The mean
square error has been reduced to 5.0556 and the error degrees of freedom are reduced to 9.
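The two-way partition can be verified directly. A minimal numpy sketch (rows = diets A-D, columns = shelves 1-4, matching the table above):

```python
import numpy as np

# Coagulation times: rows = diets A-D, columns = shelves 1-4.
y = np.array([[62, 60, 63, 59], [63, 67, 71, 64],
              [68, 66, 71, 67], [56, 62, 60, 61]], dtype=float)
k, b = y.shape                                         # 4 diets, 4 shelves
grand = y.mean()

ss_total = ((y - grand) ** 2).sum()                    # 275
ss_diet = (b * (y.mean(axis=1) - grand) ** 2).sum()    # 191.5
ss_shelf = (k * (y.mean(axis=0) - grand) ** 2).sum()   # 38
ss_error = ss_total - ss_diet - ss_shelf               # 45.5

ms_error = ss_error / ((k - 1) * (b - 1))              # 5.0556, down from 6.9583
f_diet = (ss_diet / (k - 1)) / ms_error
print(round(f_diet, 4))                                # 12.6264
```

Removing the shelf (block) sum of squares from the error term shrinks MSE from 6.9583 to 5.0556, which is exactly why the F ratio for Diet rises from 9.1737 to 12.6264.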
.**********************THE END***************************