BIOSTATISTICS AND COMPUTER APPLICATION II
I B.Sc MICRO BIOLOGY

UNIT I: CORRELATION AND REGRESSION

Contents

13.1 Aims and Objectives
13.2 Correlation
13.3 The Scatter Diagram
13.4 The Correlation Coefficient
13.5 Karl Pearson's Correlation Coefficient
13.6 Relation between Regression Coefficients and Correlation Coefficient
13.7 Coefficient of Determination
13.8 Spearman's Rank Correlation Coefficient
13.9 Tied Ranks
13.10 Regression
13.11 Linear Regression
13.12 Let us Sum Up
13.13 Lesson - End Activities
13.14 References

13.1 Introduction

There are situations where data appear as pairs of figures relating to two variables. A correlation problem considers the joint variation of two measurements, neither of which is restricted by the experimenter. The regression problem discussed in this Lesson considers the frequency distribution of one variable (called the dependent variable) when another (the independent variable) is held fixed at each of several levels.

Examples of correlation problems are found in the study of the relationship between IQ and the aggregate percentage of marks obtained by a person in the SSC examination, between blood pressure and metabolism, or between the height and weight of individuals. In these examples both variables are observed as they naturally occur, since neither variable is fixed at predetermined levels.

Examples of regression problems can be found in the study of the yields of crops grown with different amounts of fertilizer, the length of life of certain animals exposed to different levels of radiation, and so on. In these problems the variation in one measurement is studied for particular levels of the other variable, selected by the experimenter.

13.2 Correlation

Correlation measures the degree of linear relation between variables. The existence of correlation between variables does not necessarily mean that one is the cause of the change in the other.
It should be noted that correlation analysis merely helps in determining the degree of association between two variables; it does not tell us anything about cause and effect. While interpreting the correlation coefficient, it is necessary to consider whether there is any cause-and-effect relationship between the variables under study; if there is no such relationship, the observed correlation is meaningless. In correlation analysis, all variables are assumed to be random variables.

13.3 The Scatter Diagram

The first step in correlation and regression analysis is to visualize the relationship between the variables. A scatter diagram is obtained by plotting the points (x1, y1), (x2, y2), ..., (xn, yn) on a two-dimensional plane. If the points are scattered around a straight line, we may infer that there exists a linear relationship between the variables. If the points cluster around a straight line with negative slope, there is negative correlation: the variables are inversely related (when x increases, y decreases, and vice versa). If the points cluster around a straight line with positive slope, there is positive correlation: the variables are directly related (when x increases, y also increases, and vice versa).

For example, we may have figures on advertisement expenditure (X) and sales (Y) of a firm for the last ten years, as shown in Table 1. When this data is plotted on a graph as in Figure 1, we obtain a scatter diagram. A scatter diagram gives two very useful types of information. First, we can observe patterns between variables that indicate whether the variables are related. Secondly, if the variables are related, we can get an idea of what kind of relationship (linear or non-linear) would describe it.

Table 1: Year-wise data on Advertisement Expenditure and Sales

Year   Advertisement expenditure   Sales
       (thousand Rs.) (X)          (thousand Rs.) (Y)
1988   50                          700
1987   50                          650
1986   50                          600
1985   40                          500
1984   30                          450
1983   20                          400
1982   20                          300
1981   15                          250
1980   10                          210
1979    5                          200

Correlation examines the first question: determining whether an association exists between the two variables and, if it does, to what extent. Regression examines the second question: establishing an appropriate relation between the variables.

[Figure 1: Scatter diagram of advertisement expenditure (X) against sales (Y); the points rise roughly along a straight line.]

The scatter diagram may exhibit different kinds of patterns. Some typical patterns indicating different correlations between two variables are shown in Figure 2.

[Figure 2: Different types of association between variables: (a) positive correlation (r > 0), (b) negative correlation (r < 0), (c) no correlation (r = 0), (d) non-linear association.]

13.4 The Correlation Coefficient

Definition and Interpretation

The correlation coefficient measures the degree of association between two variables X and Y. Pearson's formula for the correlation coefficient is

    r = [ (1/n) Σ (X − X̄)(Y − Ȳ) ] / (sx sy)        ……(18.1)

where r is the correlation coefficient between X and Y, sx and sy are the standard deviations of X and Y respectively, and n is the number of pairs of values of X and Y in the given data. The expression (1/n) Σ (X − X̄)(Y − Ȳ) is known as the covariance between X and Y. Here r is also called Pearson's product moment correlation coefficient.

You should note that r is a dimensionless number whose numerical value lies between +1 and −1. Positive values of r indicate positive (or direct) correlation between the two variables X and Y, i.e. as X increases Y also increases, and as X decreases Y also decreases. Negative values of r indicate negative (or inverse) correlation: an increase in one variable results in a decrease in the value of the other variable. A zero correlation means that there is no linear association between the two variables.
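As a sketch (not part of the original lesson), the covariance formula above can be applied directly to the Table 1 data in plain Python; every name below is introduced here for illustration only.

```python
# Pearson's r for the Table 1 data, using the definitional formula
# r = cov(X, Y) / (sx * sy); the year column is not needed.
x = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]              # advertisement expenditure
y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]   # sales

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
sx = (sum((xi - mx) ** 2 for xi in x) / n) ** 0.5
sy = (sum((yi - my) ** 2 for yi in y) / n) ** 0.5

r = cov / (sx * sy)
print(round(r, 3))  # strongly positive, close to +1, as the scatter diagram suggests
```

The value is close to +1, matching the direct relationship visible in the scatter diagram.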
Figure 2 shows a number of scatter plots with corresponding values of the correlation coefficient r. The following form is perhaps more convenient for computing the correlation coefficient:

    r = Σxy / √(Σx² · Σy²)        ……(18.2)

where

    x = X − X̄ = deviation of a particular X value from the mean X̄
    y = Y − Ȳ = deviation of a particular Y value from the mean Ȳ

Equation (18.2) can be derived from equation (18.1) by substituting for sx and sy as follows:

    sx = √( (1/n) Σ (X − X̄)² )  and  sy = √( (1/n) Σ (Y − Ȳ)² )        ……(18.3)

13.5 Karl Pearson's Correlation Coefficient

If (x1, y1), (x2, y2), …, (xn, yn) are n given observations, then Karl Pearson's correlation coefficient is defined as

    r = Sxy / (Sx Sy)

where Sxy is the covariance and Sx, Sy are the standard deviations of X and Y respectively. That is,

    r = [ (1/n) Σ xi yi − x̄ ȳ ] / √( [ (1/n) Σ xi² − x̄² ] [ (1/n) Σ yi² − ȳ² ] )

The value of r lies between −1 and 1; that is, −1 ≤ r ≤ 1. When r = 1, there is a perfect positive linear relation between x and y; when r = −1, there is a perfect negative linear relation between x and y; when r = 0, there is no linear relationship between x and y.

13.6 Relation between Regression Coefficients and Correlation Coefficient

The correlation coefficient is the geometric mean of the regression coefficients. We know that

    byx = Sxy / Sx²  and  bxy = Sxy / Sy²

The geometric mean of byx and bxy is

    √(byx · bxy) = √( Sxy² / (Sx² Sy²) ) = Sxy / (Sx Sy) = r, the correlation coefficient.

Also note that the two regression coefficients always have the same sign, and the sign of the correlation coefficient is the same as the sign of the regression coefficients.

13.7 Coefficient of Determination

The coefficient of determination is the square of the correlation coefficient, and it gives the proportion of the variation in y explained by x. That is, the coefficient of determination is the ratio of the explained variance to the total variance. For example, r² = 0.879 means that 87.9% of the total variation in y is explained by x.
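The claim that r² is the explained share of variance can be checked numerically. The sketch below (not from the lesson; the data are illustrative, invented values) fits the least-squares line of y on x and compares r² with the ratio of the variance of the fitted values to the total variance of y.

```python
# Check that r**2 equals the explained share of variance for the
# least-squares line of y on x (illustrative data, roughly y = 2x).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum(a * b for a, b in zip(x, y)) / n - mx * my
sx2 = sum(a * a for a in x) / n - mx * mx
sy2 = sum(b * b for b in y) / n - my * my

r2 = sxy ** 2 / (sx2 * sy2)          # coefficient of determination

b = sxy / sx2                        # slope of the fitted line
a = my - b * mx                      # intercept
fitted = [a + b * xi for xi in x]
explained = sum((f - my) ** 2 for f in fitted) / n   # variance of fitted values

print(round(r2, 4), round(explained / sy2, 4))  # the two numbers agree
```

Algebraically the two quantities coincide, since the explained variance is b²Sx² = Sxy²/Sx², and dividing by Sy² gives exactly r².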
When r² = 1, all the points on the scatter diagram fall on the regression line, and the entire variation is explained by the straight line. On the other hand, if r² = 0, none of the points on the scatter diagram falls on the regression line, meaning that there is no linear relationship between the variables.

Example: Consider the following data:

X: 15 16 17 18 19 20
Y: 80 75 60 40 30 20

1. Fit both regression lines.
2. Find the correlation coefficient.
3. Verify that the correlation coefficient is the geometric mean of the regression coefficients.
4. Find the value of y when x = 17.5.

Solution:

X     Y     XY     X²     Y²
15    80    1200   225    6400
16    75    1200   256    5625
17    60    1020   289    3600
18    40    720    324    1600
19    30    570    361    900
20    20    400    400    400
105   305   5110   1855   18525

x̄ = Σx/n = 105/6 = 17.5,  ȳ = Σy/n = 305/6 = 50.83

Sxy = (1/n) Σ xi yi − x̄ ȳ = 5110/6 − 17.5 × 50.83 = −37.86
Sx² = (1/n) Σ xi² − x̄² = 1855/6 − 17.5² = 2.92
Sy² = (1/n) Σ yi² − ȳ² = 18525/6 − 50.83² = 503.81

byx = Sxy/Sx² = −37.86/2.92 = −12.96  and  bxy = Sxy/Sy² = −37.86/503.81 = −0.075

1. The regression line of y on x is y − ȳ = (Sxy/Sx²)(x − x̄), i.e.
   y − 50.83 = −12.96(x − 17.5), so y = −12.96x + 277.63.
   The regression line of x on y is x − x̄ = (Sxy/Sy²)(y − ȳ), i.e.
   x − 17.5 = −0.075(y − 50.83), so x = −0.075y + 21.31.

2. Correlation coefficient: r = Sxy/(Sx Sy) = −37.86/(1.71 × 22.45) = −0.986.

3. byx × bxy = (−12.96) × (−0.075) = 0.972, and √0.972 = 0.986. Since both regression coefficients are negative, r = −0.986, agreeing with the direct calculation.

4. To predict y, use the regression line of y on x. When x = 17.5, y = −12.96 × 17.5 + 277.63 = 50.83.

Short-Cut Method: The correlation coefficient is invariant under linear transformations.
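This invariance can be verified numerically before working it by hand. The sketch below (not part of the lesson; `corr` is a helper defined here) computes r for the example data and for the transformed variables u = x − 18 and v = (y − 40)/10 used in the short-cut method, and gets the same value both times. At full precision r is about −0.989; the −0.986 above reflects rounded intermediate values.

```python
def corr(x, y):
    """Karl Pearson's r via Sxy / (Sx * Sy), with divisor n throughout."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) / n - mx * my
    sx2 = sum(a * a for a in x) / n - mx * mx
    sy2 = sum(b * b for b in y) / n - my * my
    return sxy / (sx2 * sy2) ** 0.5

x = [15, 16, 17, 18, 19, 20]
y = [80, 75, 60, 40, 30, 20]

u = [xi - 18 for xi in x]           # u = x - 18
v = [(yi - 40) / 10 for yi in y]    # v = (y - 40) / 10

print(round(corr(x, y), 3), round(corr(u, v), 3))  # identical values
```

Shifting and scaling the variables changes neither the sign nor the magnitude of r, which is why the short-cut method below is legitimate.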
Let us take the transformations u = x − 18 and v = (y − 40)/10.

X     Y     u     v      uv     u²    v²
15    80    −3    4      −12    9     16
16    75    −2    3.5    −7     4     12.25
17    60    −1    2      −2     1     4
18    40    0     0      0      0     0
19    30    1     −1     −1     1     1
20    20    2     −2     −4     4     4
            −3    6.5    −26    19    37.25

ū = Σu/n = −3/6 = −0.5,  v̄ = Σv/n = 6.5/6 = 1.083

Suv = (1/n) Σ ui vi − ū v̄ = −26/6 − (−0.5)(1.083) = −3.79
Su² = (1/n) Σ ui² − ū² = 19/6 − (−0.5)² = 2.92
Sv² = (1/n) Σ vi² − v̄² = 37.25/6 − 1.083² = 5.035

bvu = Suv/Su² = −3.79/2.92 = −1.297  and  buv = Suv/Sv² = −3.79/5.035 = −0.75

1. The regression line of v on u is v − v̄ = bvu(u − ū), i.e.
   v − 1.083 = −1.297(u + 0.5), so v = −1.297u + 0.4345.
   Therefore the regression line of y on x is (y − 40)/10 = −1.297(x − 18) + 0.4345,
   i.e. y = −12.97x + 277.8.
   The regression line of u on v is u − ū = buv(v − v̄), i.e.
   u + 0.5 = −0.75(v − 1.083), so u = −0.75v + 0.312.
   Therefore the regression line of x on y is x − 18 = −0.75(y − 40)/10 + 0.312,
   i.e. x = −0.075y + 21.31.

2. Correlation coefficient: r = Suv/(Su Sv) = −3.79/(1.71 × 2.24) = −0.99, agreeing with the direct calculation up to rounding.

3. bvu × buv = (−1.297) × (−0.75) = 0.973, and √0.973 = 0.986; since both regression coefficients are negative, r = −0.986, again agreeing with the direct calculation.

13.8 Spearman's Rank Correlation Coefficient

Sometimes the characteristics whose possible correlation is being investigated cannot be measured, but individuals can be ranked on the basis of the characteristic concerned. We then have two sets of ranks available for working out the correlation coefficient. Sometimes the data on one variable may be in the form of ranks while the data on the other variable are measurements which can be converted into ranks. Thus, when both variables are ordinal, or when the data are available in ordinal form irrespective of the type of variable, we use the rank correlation coefficient. Spearman's rank correlation coefficient is defined as

    r = 1 − 6 Σ di² / (n(n² − 1))

where di is the difference between the two ranks of the i-th individual.

Example: Ten competitors in a beauty contest were ranked by two judges in the following orders:

First judge:  1 6 5 10 3 2 4 9 7 8
Second judge: 3 5 8 4 7 10 2 1 6 9

Find the correlation between the rankings.
Solution:

xi    yi    di = xi − yi    di²
1     3     −2              4
6     5     1               1
5     8     −3              9
10    4     6               36
3     7     −4              16
2     10    −8              64
4     2     2               4
9     1     8               64
7     6     1               1
8     9     −1              1
                            200

Spearman's rank correlation coefficient is

    r = 1 − 6 Σ di² / (n(n² − 1)) = 1 − (6 × 200)/(10 × (10² − 1)) = −0.212

That is, the two judges' opinions regarding the contestants are somewhat opposite to each other.

13.9 Tied Ranks

Sometimes, when there is more than one item with the same value, a common rank is given to such items. This rank is the average of the ranks which these items would have received had they differed slightly from each other. When this is done, the coefficient of rank correlation needs a correction, because the formula above assumes that the ranks of the various items are all different. If mi is the number of items sharing the i-th tied rank, then

    r = 1 − 6[ Σ di² + (1/12) Σ (mi³ − mi) ] / (n(n² − 1))

Example: Calculate the rank correlation coefficient from the sales and expenses of 10 firms given below:

Sales (X):    50 50 55 60 65 65 65 60 60 50
Expenses (Y): 11 13 14 16 16 15 15 14 13 13

Solution:

x     R1    y     R2     d = R1 − R2    d²
50    9     11    10     −1             1
50    9     13    8      1              1
55    7     14    5.5    1.5            2.25
60    5     16    1.5    3.5            12.25
65    2     16    1.5    0.5            0.25
65    2     15    3.5    −1.5           2.25
65    2     15    3.5    −1.5           2.25
60    5     14    5.5    −0.5           0.25
60    5     13    8      −3             9
50    9     13    8      1              1
                                        31.5

Here there are seven groups of tied ranks: m1 = 3, m2 = 3, m3 = 3 (the three 50s, 60s and 65s in X), and m4 = 2, m5 = 2, m6 = 2, m7 = 3 (the pairs 14, 15 and 16, and the three 13s, in Y). Then

    r = 1 − 6[31.5 + (1/12)((3³−3) + (3³−3) + (3³−3) + (2³−2) + (2³−2) + (2³−2) + (3³−3))] / (10 × (10² − 1))
      = 1 − 6(31.5 + 9.5)/990 = 0.75

Exercises

1. A company selling household appliances wants to determine if there is any relationship between advertising expenditures and sales. The following data were compiled for 6 major sales regions. The expenditure is in thousands of rupees and the sales are in millions of rupees.

Region:          1  2  3  4  5  6
Expenditure (X): 40 45 80 20 15 50
Sales (Y):       25 30 45 20 20 40

a) Compute the line of regression to predict sales.
b) Compute the expected sales for a region where Rs. 72,000 is being spent on advertising.

2.
The following data represent the scores in the final examination of 10 students in the subjects of Economics and Finance.

Economics: 61 78 77 97 65 95 30 74 55
Finance:   84 70 93 93 77 99 43 80 67

a) Compute the correlation coefficient.

3. Calculate the rank correlation coefficient from the sales and expenses of the 9 firms given below:

Sales (X):    42 40 54 62 55 65 65 66 62
Expenses (Y): 10 18 18 17 17 14 13 10 13

13.10 Regression

In industry and business today, large amounts of data are continuously being generated. This may be data pertaining, for instance, to a company's annual production, annual sales, capacity utilisation, turnover, profits, manpower levels, absenteeism or some other variable of direct interest to management. Or there might be technical data regarding a process, such as temperature or pressure at certain crucial points, the concentration of a certain chemical in the product, the breaking strength of the samples produced, or one of a large number of quality attributes.

The accumulated data may be used to gain information about the system (as, for instance, what happens to the output of the plant when temperature is reduced by half), to visually depict the past pattern of behaviour (as often happens in a company's annual meetings, where records of company progress are projected), or simply for control purposes, to check whether the process or system is operating as designed (as, for instance, in quality control). Our interest in regression is primarily for the first purpose: to extract the main features of the relationships hidden in or implied by the mass of data.

What is Regression?

Suppose we consider the heights and weights of adult males in some given population. If we plot the pairs (X1, X2) = (height, weight), a diagram like Figure 3 will result. Such a diagram, you would recall from the previous Lesson, is conventionally called a scatter diagram.
Note that for any given height there is a range of observed weights, and vice versa. This variation is partially due to measurement errors, but primarily due to variation between individuals. Thus no unique relationship between actual height and weight can be expected. But we can note that the average observed weight for a given observed height increases as height increases. The locus of the average observed weight for a given observed height (as height varies) is called the regression curve of weight on height; let us denote it by X2 = f(X1). There also exists a similarly defined regression curve of height on weight, which we can denote by X1 = g(X2). Let us assume that these two "curves" are both straight lines (which in general they may not be). In general these two curves are not the same, as indicated by the two lines in Figure 3.

[Figure 3: Height and weight of thirty adult males: scatter diagram of height in cm (X1, horizontal axis, 164 to 188) against weight in kg (X2, vertical axis, 50 to 90), with the two regression lines X2 = f(X1) and X1 = g(X2) drawn through the cloud of points.]

A pair of random variables such as (height, weight) follows some sort of bivariate probability distribution. When we are concerned with the dependence of a random variable Y on a quantity X which is variable but not random, an equation that relates Y to X is usually called a regression equation. Similarly, when more than one independent variable is involved, we may wish to examine the way in which a response Y depends on variables X1, X2, …, Xk. We determine a regression equation from data which cover certain regions of the X-space, as Y = f(X1, X2, …, Xk).

13.11 Linear Regression

Regression analysis is a set of statistical techniques for analyzing the relationship between two numerical variables. One variable is viewed as the dependent variable and the other as the independent variable.
The purpose of regression analysis is to understand the direction and extent to which values of the dependent variable can be predicted from the corresponding values of the independent variable. The regression gives the nature of the relationship between the variables. Often the relationship between two variables x and y is not an exact mathematical relationship; rather, the several y values corresponding to a given x value scatter about a value that depends on the x value. For example, although not all persons of the same height have exactly the same weight, their weights bear some relation to that height. On average, people who are 6 feet tall are heavier than those who are 5 feet tall; the mean weight in the population of 6-footers exceeds the mean weight in the population of 5-footers.

This relationship is modeled statistically as follows. For every value of x there is a corresponding population of y values. The population mean of y for a particular value of x is denoted by f(x); as a function of x it is called the regression function. If this regression function is linear, it may be written as f(x) = a + bx. The quantities a and b are parameters that define the relationship between x and f(x). In conducting a regression analysis, we use a sample of data to estimate the values of these parameters. The population of y values at a particular x value also has a variance; the usual assumption is that this variance is the same for all values of x.

Principle of Least Squares

The principle of least squares is used to estimate the parameters of a linear regression. The principle states that the best estimates of the parameters are those values which minimize the sum of squares of the residual errors. The residual error is the difference between the actual value of the dependent variable and the estimated value of the dependent variable.
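The least-squares estimates worked out in the next section (b = Sxy/Sx², a = ȳ − b x̄, the standard closed-form solution) can be sketched in Python as follows; this block is an illustration added here, not part of the lesson, and `fit_line` is a name introduced for it. The data are those of the fitting example below.

```python
def fit_line(x, y):
    """Least-squares estimates for y = a + b*x (divisor n, as in the lesson)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) / n - mx * my   # covariance Sxy
    sx2 = sum(a * a for a in x) / n - mx * mx              # variance Sx^2
    b = sxy / sx2
    a = my - b * mx
    return a, b

x = [6, 8, 9, 12, 10, 15, 17, 20, 18, 24]
y = [3.5, 4.3, 5.2, 5.8, 6.4, 7.3, 7.2, 7.5, 7.8, 8.3]

a, b = fit_line(x, y)
print(round(a, 2), round(b, 3))  # close to the hand-computed a ≈ 2.83, b ≈ 0.252
```

A usage note: `fit_line([0, 1, 2], [1, 3, 5])` recovers a = 1 and b = 2 exactly, since those points lie on the line y = 1 + 2x.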
Fitting the Regression Line y = a + bx

By the principle of least squares, the best estimates of a and b are

    b = Sxy / Sx²  and  a = ȳ − b x̄

where Sxy is the covariance between x and y, defined as Sxy = (1/n) Σ xi yi − x̄ ȳ, and Sx² is the variance of x, that is, Sx² = (1/n) Σ xi² − x̄².

Example: Fit a straight line y = a + bx for the following data.

Y: 3.5 4.3 5.2 5.8 6.4 7.3 7.2 7.5 7.8 8.3
X: 6 8 9 12 10 15 17 20 18 24

Solution:

Y      X     XY      X²
3.5    6     21      36
4.3    8     34.4    64
5.2    9     46.8    81
5.8    12    69.6    144
6.4    10    64      100
7.3    15    109.5   225
7.2    17    122.4   289
7.5    20    150     400
7.8    18    140.4   324
8.3    24    199.2   576
63.3   139   957.3   2239

x̄ = 139/10 = 13.9,  ȳ = 63.3/10 = 6.33

Sxy = (1/n) Σ xi yi − x̄ ȳ = 957.3/10 − 13.9 × 6.33 = 7.743
Sx² = (1/n) Σ xi² − x̄² = 2239/10 − 13.9² = 30.69

So b = Sxy/Sx² = 7.743/30.69 = 0.252 and a = ȳ − b x̄ = 6.33 − 0.252 × 13.9 = 2.827.

Therefore the fitted straight line is y = 2.827 + 0.252x.

Two Regression Lines

There are two regression lines: the regression line of y on x and the regression line of x on y. In the regression line of y on x, y is the dependent variable and x is the independent variable, and the line is used to predict the value of y for a given value of x. In the regression line of x on y, x is the dependent variable and y is the independent variable, and the line is used to predict the value of x for a given value of y.

The regression line of y on x is given by

    y − ȳ = (Sxy / Sx²)(x − x̄)

and the regression line of x on y is given by

    x − x̄ = (Sxy / Sy²)(y − ȳ)

Regression Coefficients

The quantity Sxy / Sx² is the regression coefficient of y on x, denoted byx; it gives the slope of the line of y on x, that is, the rate of change in y per unit change in x. The quantity Sxy / Sy² is the regression coefficient of x on y, denoted bxy; it gives the slope of the line of x on y, that is, the rate of change in x per unit change in y.

13.12 Let us Sum Up

In this Lesson the concepts of correlation and regression are discussed.
Correlation is the association between two variables. A scatter plot of the variables may suggest that the two variables are related, but it is the value of Pearson's correlation coefficient r that quantifies this association. The correlation coefficient r may assume values from −1 to +1. The sign indicates whether the association is direct (+ve) or inverse (−ve). A numerical value of 1 indicates perfect association, while a value of zero indicates no association. Regression is a device for establishing relationships between variables from the given data. The discovered relationship can be used for predictive purposes. Some simple examples were shown to illustrate the concepts.

13.13 Lesson - End Activities

1. Define correlation and regression.
2. Give the purpose of drawing a scatter diagram.

13.14 References

1. P.R. Vital, Business Mathematics and Statistics.
2. S.P. Gupta, Statistical Methods.

UNIT II: METHODS OF SAMPLING

Sampling

Sampling is that part of statistical practice concerned with the selection of a subset of individual observations within a population of individuals, intended to yield some knowledge about the population of concern, especially for the purposes of making predictions based on statistical inference. Sampling is an important aspect of data collection.

Researchers rarely survey the entire population, for two reasons (Adèr, Mellenbergh, & Hand, 2008): the cost is too high, and the population is dynamic in that the individuals making up the population may change over time. The three main advantages of sampling are that the cost is lower, data collection is faster, and since the data set is smaller it is possible to ensure homogeneity and to improve the accuracy and quality of the data.

Each observation measures one or more properties (such as weight, location, color) of observable bodies distinguished as independent objects or individuals. In survey sampling, survey weights can be applied to the data to adjust for the sample design.
Results from probability theory and statistical theory are employed to guide practice. In business and medical research, sampling is widely used for gathering information about a population.[1]

Process

The sampling process comprises several stages:

- Defining the population of concern
- Specifying a sampling frame, a set of items or events possible to measure
- Specifying a sampling method for selecting items or events from the frame
- Determining the sample size
- Implementing the sampling plan
- Sampling and data collecting

Population definition

Successful statistical practice is based on focused problem definition. In sampling, this includes defining the population from which our sample is drawn. A population can be defined as including all people or items with the characteristic one wishes to understand. Because there is very rarely enough time or money to gather information from everyone or everything in a population, the goal becomes finding a representative sample (or subset) of that population. Sometimes that which defines a population is obvious.
For example, a manufacturer needs to decide whether a batch of material from production is of high enough quality to be released to the customer, or should be sentenced for scrap or rework due to poor quality. In this case, the batch is the population. Although the population of interest often consists of physical objects, sometimes we need to sample over time, space, or some combination of these dimensions. For instance, an investigation of supermarket staffing could examine checkout line length at various times, or a study on endangered penguins might aim to understand their usage of various hunting grounds over time. For the time dimension, the focus may be on periods or discrete occasions. In other cases, our 'population' may be even less tangible. For example, Joseph Jagger studied the behaviour of roulette wheels at a casino in Monte Carlo, and used this to identify a biased wheel. In this case, the 'population' Jagger wanted to investigate was the overall behaviour of the wheel (i.e. the probability distribution of its results over infinitely many trials), while his 'sample' was formed from observed results from that wheel. Similar considerations arise when taking repeated measurements of some physical characteristic such as the electrical conductivity of copper. This situation often arises when we seek knowledge about the cause system of which the observed population is an outcome. In such cases, sampling theory may treat the observed population as a sample from a larger 'superpopulation'. For example, a researcher might study the success rate of a new 'quit smoking' program on a test group of 100 patients, in order to predict the effects of the program if it were made available nationwide. Here the superpopulation is "everybody in the country, given access to this treatment" - a group which does not yet exist, since the program isn't yet available to all. 
Note also that the population from which the sample is drawn may not be the same as the population about which we actually want information. Often there is large but not complete overlap between these two groups due to frame issues etc. (see below). Sometimes they may be entirely separate - for instance, we might study rats in order to get a better understanding of human health, or we might study records from people born in 2008 in order to make predictions about people born in 2009. Time spent in making the sampled population and population of concern precise is often well spent, because it raises many issues, ambiguities and questions that would otherwise have been overlooked at this stage. Sampling frame In the most straightforward case, such as the sentencing of a batch of material from production (acceptance sampling by lots), it is possible to identify and measure every single item in the population and to include any one of them in our sample. However, in the more general case this is not possible. There is no way to identify all rats in the set of all rats. Where voting is not compulsory, there is no way to identify which people will actually vote at a forthcoming election (in advance of the election). These imprecise populations are not amenable to sampling in any of the ways below and to which we could apply statistical theory. As a remedy, we seek a sampling frame which has the property that we can identify every single element and include any in our sample.[1] The most straightforward type of frame is a list of elements of the population (preferably the entire population) with appropriate contact information. For example, in an opinion poll, possible sampling frames include: Electoral register Telephone directory Not all frames explicitly list population elements. 
For example, a street map can be used as a frame for a door-to-door survey; although it doesn't show individual houses, we can select streets from the map and then visit all houses on those streets. (One advantage of such a frame is that it would include people who have recently moved and are not yet on the list frames discussed above.) The sampling frame must be representative of the population and this is a question outside the scope of statistical theory demanding the judgment of experts in the particular subject matter being studied. All the above frames omit some people who will vote at the next election and contain some people who will not; some frames will contain multiple records for the same person. People not in the frame have no prospect of being sampled. Statistical theory tells us about the uncertainties in extrapolating from a sample to the frame. In extrapolating from frame to population, its role is motivational and suggestive. To the scientist, however, representative sampling is the only justified procedure for choosing individual objects for use as the basis of generalization, and is therefore usually the only acceptable basis for ascertaining truth. —Andrew A. Marino[2] It is important to understand this difference to steer clear of confusing prescriptions found in many web pages. In defining the frame, practical, economic, ethical, and technical issues need to be addressed. The need to obtain timely results may prevent extending the frame far into the future. The difficulties can be extreme when the population and frame are disjoint. This is a particular problem in forecasting where inferences about the future are made from historical data. 
In fact, in 1703, when Jacob Bernoulli proposed to Gottfried Leibniz the possibility of using historical mortality data to predict the probability of early death of a living man, Leibniz recognized the problem in replying:

Nature has established patterns originating in the return of events, but only for the most part. New illnesses flood the human race, so that no matter how many experiments you have done on corpses, you have not thereby imposed a limit on the nature of events so that in the future they could not vary.
—Gottfried Leibniz

Kish posited four basic problems of sampling frames:

1. Missing elements: some members of the population are not included in the frame.
2. Foreign elements: non-members of the population are included in the frame.
3. Duplicate entries: a member of the population is surveyed more than once.
4. Groups or clusters: the frame lists clusters instead of individuals.

A frame may also provide additional 'auxiliary information' about its elements; when this information is related to variables or groups of interest, it may be used to improve survey design. For instance, an electoral register might include name and sex; this information can be used to ensure that a sample taken from that frame covers all demographic categories of interest. (Sometimes the auxiliary information is less explicit; for instance, a telephone number may provide some information about location.)

Having established the frame, there are a number of ways of organizing it to improve efficiency and effectiveness. It is at this stage that the researcher should decide whether the sample is in fact to be the whole population and would therefore be a census.

Probability and nonprobability sampling

A probability sampling scheme is one in which every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined.
The combination of these traits makes it possible to produce unbiased estimates of population totals, by weighting sampled units according to their probability of selection. Example: We want to estimate the total income of adults living in a given street. We visit each household in that street, identify all adults living there, and randomly select one adult from each household. (For example, we can allocate each person a random number, generated from a uniform distribution between 0 and 1, and select the person with the highest number in each household.) We then interview the selected person and find their income. People living on their own are certain to be selected, so we simply add their income to our estimate of the total. But a person living in a household of two adults has only a one-in-two chance of selection. To reflect this, when we come to such a household, we would count the selected person's income twice towards the total. (In effect, the person who is selected from that household is taken as representing the person who isn't selected.) In the above example, not everybody has the same probability of selection; what makes it a probability sample is the fact that each person's probability is known. When every element in the population does have the same probability of selection, this is known as an 'equal probability of selection' (EPS) design. Such designs are also referred to as 'self-weighting' because all sampled units are given the same weight. Probability sampling includes: Simple Random Sampling, Systematic Sampling, Stratified Sampling, Probability Proportional to Size Sampling, and Cluster or Multistage Sampling. These various ways of probability sampling have two things in common:
1. Every element has a known nonzero probability of being sampled, and
2. the procedure involves random selection at some point.
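The street-income example above can be sketched in code. This is a minimal illustration with hypothetical household data (the names and incomes are invented); each selected person's income is weighted by the number of adults in their household, i.e. by the inverse of that person's selection probability:

```python
import random

# Hypothetical frame: each household is a list of (adult, income) pairs.
households = [
    [("Ann", 30000)],                                   # lives alone: selected with certainty
    [("Bob", 25000), ("Cara", 40000)],                  # one-in-two chance each
    [("Dev", 20000), ("Eve", 35000), ("Fay", 28000)],   # one-in-three chance each
]

def estimate_total_income(households):
    """Select one adult per household at random; weight by household size."""
    total = 0.0
    for adults in households:
        person, income = random.choice(adults)  # equal chance within the household
        weight = len(adults)                    # 1 / selection probability
        total += weight * income                # selected person stands in for the rest
    return total

print(estimate_total_income(households))
```

Any single run gives an estimate that varies with the random choices, but averaged over many repetitions the estimate centres on the true street total (here 178,000), which is what "unbiased" means in this context.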
Nonprobability sampling is any sampling method where some elements of the population have no chance of selection (these are sometimes referred to as 'out of coverage'/'undercovered'), or where the probability of selection can't be accurately determined. It involves the selection of elements based on assumptions regarding the population of interest, which form the criteria for selection. Hence, because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors. These conditions give rise to exclusion bias, placing limits on how much information a sample can provide about the population. Information about the relationship between sample and population is limited, making it difficult to extrapolate from the sample to the population. Example: We visit every household in a given street, and interview the first person to answer the door. In any household with more than one occupant, this is a nonprobability sample, because some people are more likely to answer the door (e.g. an unemployed person who spends most of their time at home is more likely to answer than an employed housemate who might be at work when the interviewer calls) and it's not practical to calculate these probabilities. Nonprobability sampling includes: Accidental Sampling, Quota Sampling and Purposive Sampling. In addition, nonresponse effects may turn any probability design into a nonprobability design if the characteristics of nonresponse are not well understood, since nonresponse effectively modifies each element's probability of being sampled.

Sampling methods

Within any of the types of frame identified above, a variety of sampling methods can be employed, individually or in combination.
Factors commonly influencing the choice between these designs include:
Nature and quality of the frame
Availability of auxiliary information about units on the frame
Accuracy requirements, and the need to measure accuracy
Whether detailed analysis of the sample is expected
Cost/operational concerns

Simple random sampling

In a simple random sample ('SRS') of a given size, all subsets of the frame of that size are given an equal probability of selection. Each element of the frame thus has an equal probability of selection: the frame is not subdivided or partitioned. Furthermore, any given pair of elements has the same chance of selection as any other such pair (and similarly for triples, and so on). This minimises bias and simplifies analysis of results. In particular, the variance between individual results within the sample is a good indicator of variance in the overall population, which makes it relatively easy to estimate the accuracy of results. However, SRS can be vulnerable to sampling error because the randomness of the selection may result in a sample that doesn't reflect the makeup of the population. For instance, a simple random sample of ten people from a given country will on average produce five men and five women, but any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and stratified techniques, discussed below, attempt to overcome this problem by using information about the population to choose a more representative sample. SRS may also be cumbersome and tedious when sampling from an unusually large target population. In some cases, investigators are interested in research questions specific to subgroups of the population. For example, researchers might be interested in examining whether cognitive ability as a predictor of job performance is equally applicable across racial groups. SRS cannot accommodate the needs of researchers in this situation because it does not provide subsamples of the population.
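Drawing an SRS itself is straightforward; as a minimal sketch with a hypothetical frame of 1,000 numbered units, a single standard-library call suffices:

```python
import random

frame = list(range(1, 1001))   # hypothetical frame: units numbered 1..1000

# random.sample draws without replacement and gives every subset of size 10
# the same probability of selection - the defining property of an SRS.
srs = random.sample(frame, k=10)
print(sorted(srs))
```

Each unit here has selection probability 10/1000 = 1/100, so the design is also EPS.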
Stratified sampling, which is discussed below, addresses this weakness of SRS. Simple random sampling is always an EPS design, but not all EPS designs are simple random sampling. Systematic sampling Systematic sampling relies on arranging the target population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every kth element from then onwards. In this case, k=(population size/sample size). It is important that the starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the kth element in the list. A simple example would be to select every 10th name from the telephone directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10'). As long as the starting point is randomized, systematic sampling is a type of probability sampling. It is easy to implement and the stratification induced can make it efficient, if the variable by which the list is ordered is correlated with the variable of interest. 'Every 10th' sampling is especially useful for efficient sampling from databases. Example: Suppose we wish to sample people from a long street that starts in a poor district (house #1) and ends in an expensive district (house #1000). A simple random selection of addresses from this street could easily end up with too many from the high end and too few from the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th street number along the street ensures that the sample is spread evenly along the length of the street, representing all of these districts. (Note that if we always start at house #1 and end at #991, the sample is slightly biased towards the low end; by randomly selecting the start between #1 and #10, this bias is eliminated.) 
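The systematic procedure described above (random start, then every kth element) can be sketched as follows; the function name is ours, and the frame is the hypothetical street of 1,000 houses from the example:

```python
import random

def systematic_sample(frame, n):
    """Select every k-th element after a random start, where k = len(frame) // n."""
    k = len(frame) // n           # the sampling interval (the 'skip')
    start = random.randrange(k)   # random start within the first k positions
    return [frame[start + i * k] for i in range(n)]

# The street example above: houses #1..#1000, sampled with a skip of 10.
houses = list(range(1, 1001))
sample = systematic_sample(houses, 100)
print(sample[:3])                 # first few selected house numbers
```

Because the start is randomized over the first k positions, every house has the same one-in-ten chance of selection, removing the low-end bias noted above.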
However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple or factor of the interval used, the sample is especially likely to be unrepresentative of the overall population, making the scheme less accurate than simple random sampling. Example: Consider a street where the odd-numbered houses are all on the north (expensive) side of the road, and the even-numbered houses are all on the south (cheap) side. Under the sampling scheme given above, it is impossible to get a representative sample; either the houses sampled will all be from the odd-numbered, expensive side, or they will all be from the even-numbered, cheap side. Another drawback of systematic sampling is that even in scenarios where it is more accurate than SRS, its theoretical properties make it difficult to quantify that accuracy. (In the two examples of systematic sampling that are given above, much of the potential sampling error is due to variation between neighbouring houses - but because this method never selects two neighbouring houses, the sample will not give us any information on that variation.) As described above, systematic sampling is an EPS method, because all elements have the same probability of selection (in the example given, one in ten). It is not 'simple random sampling' because different subsets of the same size have different selection probabilities - e.g. the set {4,14,24,...,994} has a one-in-ten probability of selection, but the set {4,13,24,34,...} has zero probability of selection. Systematic sampling can also be adapted to a non-EPS approach; for an example, see the discussion of PPS samples below.

Stratified sampling

Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata."
Each stratum is then sampled as an independent subpopulation, out of which individual elements can be randomly selected.[3] There are several potential benefits to stratified sampling. First, dividing the population into distinct, independent strata can enable researchers to draw inferences about specific subgroups that may be lost in a more generalized random sample. Second, utilizing a stratified sampling method can lead to more efficient statistical estimates (provided that strata are selected based upon relevance to the criterion in question, instead of availability of the samples). Even if a stratified sampling approach does not lead to increased statistical efficiency, such a tactic will not result in less efficiency than would simple random sampling, provided that each stratum is proportional to the group’s size in the population. Third, it is sometimes the case that data are more readily available for individual, pre-existing strata within a population than for the overall population; in such cases, using a stratified sampling approach may be more convenient than aggregating data across groups (though this may potentially be at odds with the previously noted importance of utilizing criterion-relevant strata). Finally, since each stratum is treated as an independent population, different sampling approaches can be applied to different strata, potentially enabling researchers to use the approach best suited (or most cost-effective) for each identified subgroup within the population. There are, however, some potential drawbacks to using stratified sampling. First, identifying strata and implementing such an approach can increase the cost and complexity of sample selection, as well as leading to increased complexity of population estimates. Second, when examining multiple criteria, stratifying variables may be related to some, but not to others, further complicating the design, and potentially reducing the utility of the strata. 
Finally, in some cases (such as designs with a large number of strata, or those with a specified minimum sample size per group), stratified sampling can potentially require a larger sample than would other methods (although in most cases, the required sample size would be no larger than would be required for simple random sampling).

A stratified sampling approach is most effective when three conditions are met:
1. Variability within strata is minimized.
2. Variability between strata is maximized.
3. The variables upon which the population is stratified are strongly correlated with the desired dependent variable.

Advantages over other sampling methods
1. Focuses on important subpopulations and ignores irrelevant ones.
2. Allows use of different sampling techniques for different subpopulations.
3. Improves the accuracy/efficiency of estimation.
4. Permits greater balancing of statistical power of tests of differences between strata by sampling equal numbers from strata varying widely in size.

Disadvantages
1. Requires selection of relevant stratification variables, which can be difficult.
2. Is not useful when there are no homogeneous subgroups.
3. Can be expensive to implement.

Poststratification

Stratification is sometimes introduced after the sampling phase in a process called "poststratification".[3] This approach is typically implemented due to a lack of prior knowledge of an appropriate stratifying variable or when the experimenter lacks the necessary information to create a stratifying variable during the sampling phase. Although the method is susceptible to the pitfalls of post hoc approaches, it can provide several benefits in the right situation. Implementation usually follows a simple random sample. In addition to allowing for stratification on an ancillary variable, poststratification can be used to implement weighting, which can improve the precision of a sample's estimates.[3]

Oversampling

Choice-based sampling is one of the stratified sampling strategies.
In choice-based sampling,[4] the data are stratified on the target and a sample is taken from each stratum so that the rare target class will be more represented in the sample. The model is then built on this biased sample. The effects of the input variables on the target are often estimated with more precision with the choice-based sample, even when a smaller overall sample size is taken, compared to a random sample. The results usually must be adjusted to correct for the oversampling.

Probability proportional to size sampling

In some cases the sample designer has access to an "auxiliary variable" or "size measure", believed to be correlated to the variable of interest, for each element in the population. These data can be used to improve accuracy in sample design. One option is to use the auxiliary variable as a basis for stratification, as discussed above. Another option is probability-proportional-to-size ('PPS') sampling, in which the selection probability for each element is set to be proportional to its size measure, up to a maximum of 1. In a simple PPS design, these selection probabilities can then be used as the basis for Poisson sampling. However, this has the drawback of variable sample size, and different portions of the population may still be over- or under-represented due to chance variation in selections. To address this problem, PPS may be combined with a systematic approach.

Example: Suppose we have six schools with populations of 150, 180, 200, 220, 260, and 490 students respectively (total 1500 students), and we want to use student population as the basis for a PPS sample of size three. To do this, we could allocate the first school numbers 1 to 150, the second school 151 to 330 (= 150 + 180), the third school 331 to 530, and so on to the last school (1011 to 1500). We then generate a random start between 1 and 500 (equal to 1500/3) and count through the school populations by multiples of 500.
If our random start was 137, we would select the schools which have been allocated numbers 137, 637, and 1137, i.e. the first, fourth, and sixth schools. The PPS approach can improve accuracy for a given sample size by concentrating sample on large elements that have the greatest impact on population estimates. PPS sampling is commonly used for surveys of businesses, where element size varies greatly and auxiliary information is often available - for instance, a survey attempting to measure the number of guest-nights spent in hotels might use each hotel's number of rooms as an auxiliary variable. In some cases, an older measurement of the variable of interest can be used as an auxiliary variable when attempting to produce more current estimates.

Cluster sampling

Sometimes it is cheaper to 'cluster' the sample in some way, e.g. by selecting respondents from certain areas only, or certain time-periods only. (Nearly all samples are in some sense 'clustered' in time, although this is rarely taken into account in the analysis.) Cluster sampling is an example of 'two-stage sampling' or 'multistage sampling': in the first stage a sample of areas is chosen; in the second stage a sample of respondents within those areas is selected. This can reduce travel and other administrative costs. It also means that one does not need a sampling frame listing all elements in the target population. Instead, clusters can be chosen from a cluster-level frame, with an element-level frame created only for the selected clusters. Cluster sampling generally increases the variability of sample estimates above that of simple random sampling, depending on how the clusters differ between themselves, as compared with the within-cluster variation. One disadvantage of cluster sampling, then, is that the precision of sample estimates depends on the actual clusters chosen.
If the clusters chosen are biased in a certain way, inferences drawn about population parameters from these sample estimates will be far from accurate.

Multistage sampling

Multistage sampling is a complex form of cluster sampling in which two or more levels of units are embedded one in the other. The first stage consists of constructing the clusters that will be used to sample from. In the second stage, a sample of primary units is randomly selected from each cluster (rather than using all units contained in all selected clusters). In following stages, in each of those selected clusters, additional samples of units are selected, and so on. All ultimate units (individuals, for instance) selected at the last step of this procedure are then surveyed. This technique is thus essentially the process of taking random samples of preceding random samples. It is not as effective as true random sampling, but it solves many of the practical problems that make true random sampling difficult, and it is an effective strategy because it banks on multiple randomizations. As such, it is widely used. Multistage sampling is used frequently when a complete list of all members of the population does not exist or is inappropriate. Moreover, by avoiding the use of all sample units in all selected clusters, multistage sampling avoids the large, and perhaps unnecessary, costs associated with traditional cluster sampling.

Matched random sampling

A method of assigning participants to groups in which pairs of participants are first matched on some characteristic and then individually assigned randomly to groups.[5] The procedure for matched random sampling can be outlined in the following two contexts:
1. Two samples in which the members are clearly paired, or are matched explicitly by the researcher. For example, IQ measurements or pairs of identical twins.
2. Those samples in which the same attribute, or variable, is measured twice on each subject, under different circumstances.
Commonly called repeated measures. Examples include the times of a group of athletes for 1500 m before and after a week of special training, or the milk yields of cows before and after being fed a particular diet.

Quota sampling

In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. It is this second step which makes the technique one of non-probability sampling. In quota sampling the selection of the sample is non-random. For example, interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be biased because not everyone gets a chance of selection. This nonrandom element is its greatest weakness, and quota versus probability sampling has been a matter of controversy for many years.

Convenience sampling or accidental sampling

Convenience sampling (sometimes known as grab or opportunity sampling) is a type of nonprobability sampling which involves the sample being drawn from that part of the population which is close to hand: that is, a sample population selected because it is readily available and convenient. Respondents may be included because the researcher happens to meet them, or may be found through technological means such as the internet or the telephone. The researcher using such a sample cannot scientifically make generalizations about the total population from this sample because it would not be representative enough.
For example, if the interviewer were to conduct such a survey at a shopping center early in the morning on a given day, the people that he/she could interview would be limited to those present there at that given time, and would not represent the views of other members of society in such an area; for that, the survey would have to be conducted at different times of day and several times per week. This type of sampling is most useful for pilot testing. Several important considerations for researchers using convenience samples include:
1. Are there controls within the research design or experiment which can serve to lessen the impact of a non-random convenience sample, thereby ensuring the results will be more representative of the population?
2. Is there good reason to believe that a particular convenience sample would or should respond or behave differently than a random sample from the same population?
3. Is the question being asked by the research one that can adequately be answered using a convenience sample?

In social science research, snowball sampling is a similar technique, where existing study subjects are used to recruit more subjects into the sample.

Line-intercept sampling

Line-intercept sampling is a method of sampling elements in a region whereby an element is sampled if a chosen line segment, called a "transect", intersects the element.

Panel sampling

Panel sampling is the method of first selecting a group of participants through a random sampling method and then asking that group for the same information again several times over a period of time. Therefore, each participant is given the same survey or interview at two or more time points; each period of data collection is called a "wave". This sampling methodology is often chosen for large-scale or nation-wide studies in order to gauge changes in the population with regard to any number of variables, from chronic illness to job stress to weekly food expenditures.
Panel sampling can also be used to inform researchers about within-person health changes due to age or to help explain changes in continuous dependent variables such as spousal interaction. There have been several proposed methods of analyzing panel sample data, including MANOVA, growth curves, and structural equation modeling with lagged effects. For a more thorough look at analytical techniques for panel data, see Johnson (1995).

Event sampling methodology

Event sampling methodology (ESM) is a new form of sampling method that allows researchers to study ongoing experiences and events that vary across and within days in their naturally occurring environment. Because of the frequent sampling of events inherent in ESM, it enables researchers to measure the typology of activity and detect the temporal and dynamic fluctuations of work experiences. The popularity of ESM as a new form of research design has increased in recent years because it addresses a shortcoming of cross-sectional research: researchers can now detect intra-individual variance across time, where once they were unable to. In ESM, participants are asked to record their experiences and perceptions in a paper or electronic diary. There are three types of ESM:
1. Signal contingent – random beeping notifies participants to record data. The advantage of this type of ESM is minimization of recall bias.
2. Event contingent – records data when certain events occur.
3. Interval contingent – records data according to the passing of a certain period of time.

ESM has several disadvantages. One of the disadvantages of ESM is that it can sometimes be perceived as invasive and intrusive by participants. ESM also leads to possible self-selection bias. It may be that only certain types of individuals are willing to participate in this type of study, creating a non-random sample. Another concern is related to participant cooperation: participants may not actually fill out their diaries at the specified times.
Furthermore, ESM may substantively change the phenomenon being studied. Reactivity or priming effects may occur, such that repeated measurement may cause changes in the participants' experiences. This method of sampling data is also highly vulnerable to common method variance.[6] Further, it is important to think about whether or not an appropriate dependent variable is being used in an ESM design. For example, it might be logical to use ESM in order to answer research questions which involve dependent variables with a great deal of variation throughout the day. Thus, variables such as change in mood, change in stress level, or the immediate impact of particular events may be best studied using ESM methodology. However, it is not likely that utilizing ESM will yield meaningful predictions when measuring someone performing a repetitive task throughout the day or when dependent variables are long-term in nature (e.g. coronary heart problems).

Replacement of selected units

Sampling schemes may be without replacement ('WOR' - no element can be selected more than once in the same sample) or with replacement ('WR' - an element may appear multiple times in one sample). For example, if we catch fish, measure them, and immediately return them to the water before continuing with the sample, this is a WR design, because we might end up catching and measuring the same fish more than once. However, if we do not return the fish to the water (e.g. if we eat the fish), this becomes a WOR design.

Sample size

Formulas, tables, and power function charts are well-known approaches to determining sample size.

Formulas

Where the frame and population are identical, statistical theory yields exact recommendations on sample size.[7] However, where it is not straightforward to define a frame representative of the population, it is more important to understand the cause system of which the population are outcomes and to ensure that all sources of variation are embraced in the frame.
A large number of observations is of no value if major sources of variation are neglected in the study. Bartlett, Kotrlik, and Higgins (2001) published a paper, Organizational Research: Determining Appropriate Sample Size in Survey Research, in the Information Technology, Learning, and Performance Journal[8] that provides an explanation of Cochran's (1977) formulas. A discussion and illustration of sample size formulas, including the formula for adjusting the sample size for smaller populations, is included. A table is provided that can be used to select the sample size for a research problem based on three alpha levels and a set error rate.

Steps for using sample size tables
1. Postulate the effect size of interest, α, and β.
2. Check the sample size table:[9]
   1. Select the table corresponding to the selected α.
   2. Locate the row corresponding to the desired power.
   3. Locate the column corresponding to the estimated effect size.
   4. The intersection of the column and row is the minimum sample size required.

Sampling and data collection

Good data collection involves:
Following the defined sampling process
Keeping the data in time order
Noting comments and other contextual events
Recording non-responses

Most sampling books and papers written by non-statisticians focus only on the data collection aspect, which is just a small though important part of the sampling process.

Errors in sample surveys

Survey results are typically subject to some error. Total errors can be classified into sampling errors and non-sampling errors. The term "error" here includes systematic biases as well as random errors.

Sampling errors and biases

Sampling errors and biases are induced by the sample design. They include:
1. Selection bias: When the true selection probabilities differ from those assumed in calculating the results.
2.
Random sampling error: Random variation in the results due to the elements in the sample being selected at random.

Non-sampling error

Non-sampling errors are caused by other problems in data collection and processing. They include:
1. Overcoverage: Inclusion of data from outside of the population.
2. Undercoverage: The sampling frame does not include elements in the population.
3. Measurement error: E.g. when respondents misunderstand a question, or find it difficult to answer.
4. Processing error: Mistakes in data coding.
5. Non-response: Failure to obtain complete data from all selected individuals.

After sampling, a review should be held of the exact process followed in sampling, rather than that intended, in order to study any effects that any divergences might have on subsequent analysis. A particular problem is that of non-response. Two major types of nonresponse exist: unit nonresponse (lack of completion of any part of the survey) and item nonresponse (participation in the survey but failure to complete one or more components/questions of the survey).[10][11] In survey sampling, many of the individuals identified as part of the sample may be unwilling to participate, may not have the time to participate (opportunity cost),[12] or survey administrators may not have been able to contact them. In this case, there is a risk of differences between respondents and nonrespondents, leading to biased estimates of population parameters. This is often addressed by improving survey design, offering incentives, and conducting follow-up studies which make a repeated attempt to contact the unresponsive and to characterize their similarities and differences with the rest of the frame.[13] The effects can also be mitigated by weighting the data when population benchmarks are available or by imputing data based on answers to other questions. Nonresponse is particularly a problem in internet sampling.
Reasons for this problem include improperly designed surveys,[11] over-surveying (or survey fatigue),[14][15] and the fact that potential participants may hold multiple e-mail addresses, which they don't use anymore or don't check regularly. Web-based surveys also tend to demonstrate nonresponse bias; for example, studies have shown that females and those from a white/Caucasian background are more likely to respond than their counterparts.[16]

References
Kish, Leslie (1995). Survey Sampling. Wiley. ISBN 0-471-10949-5.
Korn, E. L., and Graubard, B. I. (1999). Analysis of Health Surveys. Wiley. ISBN 0-471-13773-1.
Lohr, Sharon L. (1999). Sampling: Design and Analysis. Duxbury. ISBN 0-534-35361-4.
Pedhazur, E., and Schmelkin, L. (1991). Measurement, Design, and Analysis: An Integrated Approach. New York: Psychology Press.
Särndal, Carl-Erik, Swensson, Bengt, and Wretman, Jan (1992). Model Assisted Survey Sampling. Springer-Verlag. ISBN 0-387-40620-4.
Stuart, Alan (1962). Basic Ideas of Scientific Sampling. Hafner Publishing Company, New York.

UNIT III
CONCEPT OF SAMPLING DISTRIBUTION
TESTS OF SIGNIFICANCE

Once sample data has been gathered through an observational study or experiment, statistical inference allows analysts to assess evidence in favor of some claim about the population from which the sample has been drawn. The methods of inference used to support or reject claims based on sample data are known as tests of significance. Every test of significance begins with a null hypothesis H0. H0 represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug. We would write H0: there is no difference between the two drugs on average. The alternative hypothesis, Ha, is a statement of what a statistical hypothesis test is set up to establish.
For example, in a clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug. We would write Ha: the two drugs have different effects, on average. The alternative hypothesis might also be that the new drug is better, on average, than the current drug. In this case we would write Ha: the new drug is better than the current drug, on average.

The final conclusion once the test has been carried out is always given in terms of the null hypothesis. We either "reject H0 in favor of Ha" or "do not reject H0"; we never conclude "reject Ha", or even "accept Ha". If we conclude "do not reject H0", this does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence against H0 in favor of Ha. Rejecting the null hypothesis, then, suggests that the alternative hypothesis may be true.

Hypotheses are always stated in terms of a population parameter, such as the mean μ. An alternative hypothesis may be one-sided or two-sided. A one-sided hypothesis claims that a parameter is either larger or smaller than the value given by the null hypothesis. A two-sided hypothesis claims that a parameter is simply not equal to the value given by the null hypothesis; the direction does not matter.

Hypotheses for a one-sided test for a population mean take the following form:

H0: μ = k, Ha: μ > k, or
H0: μ = k, Ha: μ < k.

Hypotheses for a two-sided test for a population mean take the following form:

H0: μ = k, Ha: μ ≠ k.

A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. (Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

Example

Suppose a test has been given to all high school students in a certain state. The mean test score for the entire state is 70, with standard deviation equal to 10.
Members of the school board suspect that female students have a higher mean score on the test than male students, because the mean score from a random sample of 64 female students is equal to 73. Does this provide strong evidence that the overall mean for female students is higher?

The null hypothesis H0 claims that there is no difference between the mean score for female students and the mean for the entire population, so that μ = 70. The alternative hypothesis claims that the mean for female students is higher than the entire student population mean, so that μ > 70.

Significance Tests for Unknown Mean and Known Standard Deviation

Once null and alternative hypotheses have been formulated for a particular claim, the next step is to compute a test statistic. For claims about a population mean from a population with a normal distribution, or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem), if the standard deviation σ is known, the appropriate significance test is known as the z-test, where the test statistic is defined as

z = (x̄ − μ0)/(σ/√n).

The test statistic follows the standard normal distribution (with mean = 0 and standard deviation = 1). The test statistic z is used to compute the P-value for the standard normal distribution, the probability that a value at least as extreme as the test statistic would be observed under the null hypothesis. Given the null hypothesis that the population mean μ is equal to a given value μ0, the P-values for testing H0 against each of the possible alternative hypotheses are:

P(Z > z) for Ha: μ > μ0
P(Z < z) for Ha: μ < μ0
2P(Z > |z|) for Ha: μ ≠ μ0.

The probability is doubled for the two-sided test, since the two-sided alternative hypothesis considers the possibility of observing extreme values on either tail of the normal distribution.
Example

In the test score example above, where the sample mean equals 73 and the population standard deviation is equal to 10, the test statistic is computed as follows:

z = (73 − 70)/(10/√64) = 3/1.25 = 2.4.

Since this is a one-sided test, the P-value is equal to the probability of observing a value greater than 2.4 in the standard normal distribution, or P(Z > 2.4) = 1 − P(Z < 2.4) = 1 − 0.9918 = 0.0082. The P-value is less than 0.01, indicating that it is highly unlikely that these results would be observed under the null hypothesis. The school board can confidently reject H0 given this result, although they cannot conclude any additional information about the mean of the distribution.

Significance Levels

The significance level α for a given hypothesis test is a value for which a P-value less than or equal to α is considered statistically significant. Typical values for α are 0.1, 0.05, and 0.01. These values correspond to the probability of observing such an extreme value by chance. In the test score example above, the P-value is 0.0082, so the probability of observing such a value by chance is less than 0.01, and the result is significant at the 0.01 level.

In a one-sided test, α corresponds to the critical value z* such that P(Z > z*) = α. For example, if the desired significance level for a result is 0.05, the corresponding value for z must be greater than or equal to z* = 1.645 (or less than or equal to -1.645 for a one-sided alternative claiming that the mean is less than the null hypothesis). For a two-sided test, we are interested in the probability that 2P(Z > z*) = α, so the critical value z* corresponds to the α/2 significance level. To achieve a significance level of 0.05 for a two-sided test, the absolute value of the test statistic (|z|) must be greater than or equal to the critical value 1.96 (which corresponds to the level 0.025 for a one-sided test).
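The z statistic, its one-sided P-value, and the critical values quoted above can all be checked with a short script. This is an illustrative sketch using only the Python standard library; the numbers come from the test-score example in the text.

```python
from statistics import NormalDist
from math import sqrt

norm = NormalDist()  # standard normal: mean 0, standard deviation 1

# Test-score example: state mean 70, sd 10, sample of 64 female students, mean 73
mu0, sigma, n, xbar = 70, 10, 64, 73

z = (xbar - mu0) / (sigma / sqrt(n))   # (73 - 70) / 1.25 = 2.4
p_value = 1 - norm.cdf(z)              # one-sided P(Z > z), about 0.0082

# Critical values at the 0.05 level for one- and two-sided tests
z_star_one_sided = norm.inv_cdf(0.95)    # about 1.645
z_star_two_sided = norm.inv_cdf(0.975)   # about 1.96

print(z, p_value, z_star_one_sided, z_star_two_sided)
```

Swapping the direction of the alternative only changes which tail of the normal distribution is evaluated.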
Another interpretation of the significance level α, based in decision theory, is that α corresponds to the value for which one chooses to reject or accept the null hypothesis H0. In the above example, the value 0.0082 would result in rejection of the null hypothesis at the 0.01 level. The probability that this is a mistake -- that, in fact, the null hypothesis is true given the z-statistic -- is less than 0.01. In decision theory, this is known as a Type I error. The probability of a Type I error is equal to the significance level α, and the probability of rejecting the null hypothesis when it is in fact false (a correct decision) is equal to 1 − β, where β denotes the probability of a Type II error (discussed below). To minimize the probability of Type I error, the significance level is generally chosen to be small.

Example

Of all of the individuals who develop a certain rash, suppose the mean recovery time for individuals who do not use any form of treatment is 30 days with standard deviation equal to 8. A pharmaceutical company manufacturing a certain cream wishes to determine whether the cream shortens, extends, or has no effect on the recovery time. The company chooses a random sample of 100 individuals who have used the cream, and determines that the mean recovery time for these individuals was 28.5 days. Does the cream have any effect?

Since the pharmaceutical company is interested in any difference from the mean recovery time for all individuals, the alternative hypothesis Ha is two-sided: μ ≠ 30. The test statistic is calculated to be z = (28.5 − 30)/(8/√100) = −1.5/0.8 = −1.875. The P-value for this statistic is 2P(Z > 1.875) = 2(1 − P(Z < 1.875)) = 2(1 − 0.9693) = 2(0.0307) = 0.0614. This is not significant at the 0.05 level, although it is significant at the 0.1 level.

Decision theory is also concerned with a second error possible in significance testing, known as Type II error. Contrary to Type I error, Type II error is the error made when the null hypothesis is incorrectly accepted.
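The two-sided calculation for the rash-cream example can be reproduced the same way; this is again a standard-library sketch. (Computed exactly, the P-value is a shade below the 0.0614 quoted in the text, which comes from table rounding.)

```python
from statistics import NormalDist
from math import sqrt

norm = NormalDist()

# Rash-cream example: untreated mean 30 days, sd 8, sample of 100 users, mean 28.5
mu0, sigma, n, xbar = 30, 8, 100, 28.5

z = (xbar - mu0) / (sigma / sqrt(n))   # (28.5 - 30) / 0.8 = -1.875
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided: double the tail area

print(z, p_value)  # about 0.061: significant at the 0.1 level but not at 0.05
```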
The probability of correctly rejecting the null hypothesis when it is false, the complement of the Type II error, is known as the power of a test. Formally defined, the power of a test is the probability that a fixed level significance test will reject the null hypothesis H0 when a particular alternative value of the parameter is true.

Example

In the test score example, for a fixed significance level of 0.10, suppose the school board wishes to be able to reject the null hypothesis (that the mean μ = 70) if the mean for female students is in fact 72. To determine the power of the test against this alternative, first note that the critical value for rejecting the null hypothesis is z* = 1.282. The calculated value for z will be greater than 1.282 whenever (x̄ − 70)/(1.25) > 1.282, or x̄ > 71.6. The probability of rejecting the null hypothesis (mean = 70) given that the alternative hypothesis (mean = 72) is true is calculated by:

P(x̄ > 71.6 | μ = 72) = P((x̄ − 72)/1.25 > (71.6 − 72)/1.25) = P(Z > −0.32) = 1 − P(Z < −0.32) = 1 − 0.3745 = 0.6255.

The power is about 0.60, indicating that although the test is more likely than not to reject the null hypothesis for this value, the probability of a Type II error is high.

Significance Tests for Unknown Mean and Unknown Standard Deviation

In most practical research, the standard deviation for the population of interest is not known. In this case, the standard deviation σ is replaced by the estimated standard deviation s, also known as the standard error. Since the standard error is an estimate for the true value of the standard deviation, the distribution of the sample mean x̄ is no longer normal with mean μ and standard deviation σ/√n. Instead, the sample mean follows the t distribution with mean μ and standard deviation s/√n. The t distribution is also described by its degrees of freedom. For a sample of size n, the t distribution will have n − 1 degrees of freedom. The notation for a t distribution with k degrees of freedom is t(k).
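Before moving on, the power calculation in the example above can be scripted in the same two steps the text describes: find the rejection cutoff for the sample mean under H0, then evaluate the probability of exceeding it under the alternative. A standard-library sketch:

```python
from statistics import NormalDist

norm = NormalDist()

se = 1.25          # sigma / sqrt(n) = 10 / 8
alpha = 0.10
mu0, mu_alt = 70, 72

# Step 1: the test rejects H0 when xbar exceeds this cutoff (about 71.6)
cutoff = mu0 + norm.inv_cdf(1 - alpha) * se

# Step 2: power = P(xbar > cutoff) when the true mean is 72
power = 1 - norm.cdf((cutoff - mu_alt) / se)

print(cutoff, power)  # power about 0.62, so the Type II error rate is high
```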
As the sample size n increases, the t distribution becomes closer to the normal distribution, since the standard error approaches the true standard deviation for large n. For claims about a population mean from a population with a normal distribution, or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem) with unknown standard deviation, the appropriate significance test is known as the t-test, where the test statistic is defined as

t = (x̄ − μ0)/(s/√n).

The test statistic follows the t distribution with n − 1 degrees of freedom. The test statistic t is used to compute the P-value for the t distribution, the probability that a value at least as extreme as the test statistic would be observed under the null hypothesis.

Example

The dataset "Normal Body Temperature, Gender, and Heart Rate" contains 130 observations of body temperature, along with the gender of each individual and his or her heart rate. Using the MINITAB "DESCRIBE" command provides the following information:

Descriptive Statistics

Variable   N    Mean    Median  Tr Mean  StDev  SE Mean
TEMP       130  98.249  98.300  98.253   0.733  0.064

Variable   Min     Max      Q1      Q3
TEMP       96.300  100.800  97.800  98.700

Since the normal body temperature is generally assumed to be 98.6 degrees Fahrenheit, one can use the data to test the following one-sided hypothesis:

H0: μ = 98.6 vs Ha: μ < 98.6.

The t test statistic is equal to (98.249 − 98.6)/0.064 = −0.351/0.064 = −5.48, and P(t < −5.48) = P(t > 5.48). The t distribution with 129 degrees of freedom may be approximated by the t distribution with 100 degrees of freedom (found in Table E in Moore and McCabe), where P(t > 5.48) is less than 0.0005. This result is significant at the 0.01 level and beyond, indicating that the null hypothesis can be rejected with confidence. To perform this t-test in MINITAB, the "TTEST" command with the "ALTERNATIVE" subcommand may be applied as follows:

MTB > ttest mu = 98.6 c1;
SUBC > alt = -1.
T-Test of the Mean

Test of mu = 98.6000 vs mu < 98.6000

Variable   N    Mean     StDev   SE Mean  T      P
TEMP       130  98.2492  0.7332  0.0643   -5.45  0.0000

These results represent the exact calculations for the t(129) distribution.

Data source: Data presented in Mackowiak, P.A., Wasserman, S.S., and Levine, M.M. (1992), "A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich," Journal of the American Medical Association, 268, 1578-1580. Dataset available through the JSE Dataset Archive.

Matched Pairs

In many experiments, one wishes to compare measurements from two populations. This is common in medical studies involving control groups, for example, as well as in studies requiring before-and-after measurements. Such studies have a matched pairs design, where the difference between the two measurements in each pair is the parameter of interest.

Analysis of data from a matched pairs experiment compares the two measurements by subtracting one from the other and basing test hypotheses upon the differences. Usually, the null hypothesis H0 assumes that the mean of these differences is equal to 0, while the alternative hypothesis Ha claims that the mean of the differences is not equal to zero (the alternative hypothesis may be one- or two-sided, depending on the experiment). Using the differences between the paired measurements as single observations, the standard t procedures with n − 1 degrees of freedom are followed as above.

Example

In the "Helium Football" experiment, a punter was given two footballs to kick, one filled with air and the other filled with helium. The punter was unaware of the difference between the balls, and was asked to kick each ball 39 times. The balls were alternated for each kick, so each of the 39 trials contains one measurement for the air-filled ball and one measurement for the helium-filled ball. Given that the conditions (leg fatigue, etc.)
were basically the same for each kick within a trial, a matched pairs analysis of the trials is appropriate. Is there evidence that the helium-filled ball improved the kicker's performance?

In MINITAB, subtracting the air-filled measurement from the helium-filled measurement for each trial and applying the "DESCRIBE" command to the resulting differences gives the following results:

Descriptive Statistics

Variable    N   Mean  Median  Tr Mean  StDev  SE Mean
Hel. - Air  39  0.46  1.00    0.40     6.87   1.10

Variable    Min     Max    Q1     Q3
Hel. - Air  -14.00  17.00  -2.00  4.00

Using MINITAB to perform a t-test of the null hypothesis H0: μ = 0 vs Ha: μ > 0 gives the following analysis:

T-Test of the Mean

Test of mu = 0.00 vs mu > 0.00

Variable   N   Mean  StDev  SE Mean  T     P
Hel. - A   39  0.46  6.87   1.10     0.42  0.34

The P-value of 0.34 indicates that this result is not significant at any acceptable level. A 95% confidence interval for the t-distribution with 38 degrees of freedom for the difference in measurements is (-1.76, 2.69), computed using the MINITAB "TINTERVAL" command.

Data source: Lafferty, M.B. (1993), "OSU scientists get a kick out of sports controversy," The Columbus Dispatch (November 21, 1993), B7. Dataset available through the Statlib Data and Story Library (DASL).

The Sign Test

Another method of analysis for matched pairs data is a distribution-free test known as the sign test. This test does not require any normality assumptions about the data, and simply involves counting the number of positive differences between the matched pairs and relating these to a binomial distribution. The concept behind the sign test reasons that if there is no true difference, then the probability of observing an increase in each pair is equal to the probability of observing a decrease in each pair: p = 1/2. Assuming each pair is independent, the null hypothesis follows the distribution B(n, 1/2), where n is the number of pairs where some difference is observed.
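The binomial calculation that drives the sign test is easy to do exactly with the standard library. As an illustration, take 37 non-zero differences of which 20 are positive (the helium-football figures discussed below); the upper-tail probability under B(37, 1/2) comes out near 0.37.

```python
from math import comb

def binomial_upper_tail(n, x):
    """P(X >= x) for X ~ Binomial(n, 1/2), computed exactly.

    With p = 1/2 every outcome k has probability C(n, k) / 2**n."""
    return sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n

# 37 pairs with a non-zero difference, 20 of them positive
p = binomial_upper_tail(37, 20)
print(p)  # about 0.37: no strong evidence against H0
```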
To perform a sign test on matched pairs data, take the difference between the two measurements in each pair and count the number of non-zero differences n. Of these, count the number of positive differences X. Determine the probability of observing X positive differences for a B(n, 1/2) distribution, and use this probability as a P-value for the null hypothesis.

Example

In the "Helium Football" example above, 2 of the 39 trials recorded no difference between kicks for the air-filled and helium-filled balls. Of the remaining 37 trials, 20 recorded a positive difference between the two kicks. Under the null hypothesis, p = 1/2, the differences would follow the B(37, 1/2) distribution. The probability of observing 20 or more positive differences is P(X ≥ 20) = 1 − P(X ≤ 19) = 1 − 0.6286 = 0.3714. This value indicates that there is not strong evidence against the null hypothesis, as observed previously with the t-test.

The T-Test

The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups, and especially appropriate as the analysis for the posttest-only two-group randomized experimental design.

Figure 1. Idealized distributions for treated and comparison group posttest values.

Figure 1 shows the distributions for the treated (blue) and control (green) groups in a study. Actually, the figure shows the idealized distributions -- the actual distributions would usually be depicted with a histogram or bar graph. The figure indicates where the control and treatment group means are located. The question the t-test addresses is whether the means are statistically different.

What does it mean to say that the averages for two groups are statistically different? Consider the three situations shown in Figure 2. The first thing to notice about the three situations is that the difference between the means is the same in all three.
But you should also notice that the three situations don't look the same -- they tell very different stories. The top example shows a case with moderate variability of scores within each group. The second situation shows the high variability case. The third shows the case with low variability. Clearly, we would conclude that the two groups appear most different or distinct in the bottom or low-variability case. Why? Because there is relatively little overlap between the two bell-shaped curves. In the high variability case, the group difference appears least striking because the two bell-shaped distributions overlap so much.

Figure 2. Three scenarios for differences between means.

This leads us to a very important conclusion: when we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores. The t-test does just this.

Statistical Analysis of the t-test

The formula for the t-test is a ratio. The top part of the ratio is just the difference between the two means or averages. The bottom part is a measure of the variability or dispersion of the scores. This formula is essentially another example of the signal-to-noise metaphor in research: the difference between the means is the signal that, in this case, we think our program or treatment introduced into the data; the bottom part of the formula is a measure of variability that is essentially noise that may make it harder to see the group difference. Figure 3 shows the formula for the t-test and how the numerator and denominator are related to the distributions.

Figure 3. Formula for the t-test.

The top part of the formula is easy to compute -- just find the difference between the means. The bottom part is called the standard error of the difference. To compute it, we take the variance for each group and divide it by the number of people in that group. We add these two values and then take their square root.
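The computation just described (each group's variance divided by its n, summed, square-rooted) can be sketched directly. The group summaries below are hypothetical numbers chosen for illustration, not data from the text.

```python
from math import sqrt

def t_statistic(mean1, var1, n1, mean2, var2, n2):
    """t = (difference between the group means) / (standard error of the difference)."""
    se_diff = sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / se_diff

# Hypothetical treated vs. control posttest summaries
t = t_statistic(mean1=55.0, var1=100.0, n1=30, mean2=50.0, var2=144.0, n2=30)
df = 30 + 30 - 2  # degrees of freedom: persons in both groups minus 2

print(t, df)  # t about 1.75 with 58 df, to be compared against a t table
```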
The specific formula is given in Figure 4:

Figure 4. Formula for the standard error of the difference between the means.

Remember that the variance is simply the square of the standard deviation. The final formula for the t-test is shown in Figure 5:

Figure 5. Formula for the t-test.

The t-value will be positive if the first mean is larger than the second and negative if it is smaller. Once you compute the t-value, you have to look it up in a table of significance to test whether the ratio is large enough to say that the difference between the groups is not likely to have been a chance finding. To test the significance, you need to set a risk level (called the alpha level). In most social research, the "rule of thumb" is to set the alpha level at .05. This means that five times out of a hundred you would find a statistically significant difference between the means even if there was none (i.e., by "chance"). You also need to determine the degrees of freedom (df) for the test. In the t-test, the degrees of freedom is the sum of the persons in both groups minus 2. Given the alpha level, the df, and the t-value, you can look the t-value up in a standard table of significance (available as an appendix in the back of most statistics texts) to determine whether the t-value is large enough to be significant. If it is, you can conclude that the means for the two groups are different (even given the variability). Fortunately, statistical computer programs routinely print the significance test results and save you the trouble of looking them up in a table.

The t-test, one-way Analysis of Variance (ANOVA), and a form of regression analysis are mathematically equivalent (see the statistical analysis of the posttest-only randomized experimental design) and would yield identical results.

The F Distribution

The F distribution is an asymmetric distribution that has a minimum value of 0, but no maximum value.
The curve reaches a peak not far to the right of 0, and then gradually approaches the horizontal axis the larger the F value is. The F distribution approaches, but never quite touches, the horizontal axis.

The F distribution has two degrees of freedom: d1 for the numerator and d2 for the denominator. For each combination of these degrees of freedom there is a different F distribution. The F distribution is most spread out when the degrees of freedom are small. As the degrees of freedom increase, the F distribution becomes less dispersed.

Figure 1.1 shows the shape of the distribution. The F value is on the horizontal axis, with the probability for each F value being represented by the vertical axis. The shaded area in the diagram represents the level of significance α shown in the table.

There is a different F distribution for each combination of the degrees of freedom of the numerator and denominator. Since there are so many F distributions, the F tables are organized somewhat differently than the tables for the other distributions. The tables which follow are organized by the level of significance. The first table gives F values that are associated with α = 0.10 of the area in the right tail of the distribution. The second table gives the F values for α = 0.05 of the area in the right tail, and the third table gives F values for the α = 0.01 level of significance. In each of these tables, the F values are given for various combinations of degrees of freedom.

[Figure 1.1: The F distribution.]

In order to use the F table, first select the significance level to be used, and then determine the appropriate combination of degrees of freedom. For example, if the α = 0.10 level of significance is selected, use the first F table. If there are 5 degrees of freedom in the numerator, and 7 degrees of freedom in the denominator, the F value from the table is 2.88. This means that there is exactly 0.10 of the area under the F curve that lies to the right of F = 2.88.
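The tabulated F values can be spot-checked numerically. The sketch below integrates the F density with Simpson's rule using only the standard library (an illustration, not a production routine); with 5 and 7 degrees of freedom it recovers roughly 0.10 of the area to the right of 2.88.

```python
from math import lgamma, log, exp

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0:
        return 0.0
    log_c = (lgamma((d1 + d2) / 2) - lgamma(d1 / 2) - lgamma(d2 / 2)
             + (d1 / 2) * log(d1 / d2))
    return exp(log_c + (d1 / 2 - 1) * log(x)
               - ((d1 + d2) / 2) * log(1 + d1 * x / d2))

def f_right_tail(f, d1, d2, steps=20_000):
    """P(F > f) via Simpson's rule on [0, f].

    Assumes d1 > 2 so the density is finite at 0 (true for the cases below)."""
    h = f / steps
    total = f_pdf(0, d1, d2) + f_pdf(f, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1 - total * h / 3

print(f_right_tail(2.88, 5, 7))   # about 0.10, matching the alpha = 0.10 table
print(f_right_tail(4.56, 20, 5))  # about 0.05, matching the alpha = 0.05 table
```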
When the significance level is α = 0.05, use the second F table. If there are 20 degrees of freedom in the numerator, and 5 degrees of freedom in the denominator, then the critical F value is 4.56. This could be written F(20, 5; 0.05) = 4.56. That is, for 20 and 5 degrees of freedom, the F value that leaves exactly 0.05 of the area under the F curve in the right tail of the distribution is 4.56.

For the α = 0.01 level of significance, the third F table is used. Suppose that there is 1 degree of freedom in the numerator and 12 degrees of freedom in the denominator. Then F(1, 12; 0.01) = 9.33. An F value of 9.33 leaves exactly 0.01 of area under the curve in the right tail of the distribution when there are 1 and 12 degrees of freedom.

F Values for α = 0.10

d2\d1     1      2      3      4      5      6      7      8      9
  1    39.86  49.50  53.59  55.83  57.24  58.20  58.91  59.44  59.86
  2     8.53   9.00   9.16   9.24   9.29   9.33   9.35   9.37   9.38
  3     5.54   5.46   5.39   5.34   5.31   5.28   5.27   5.25   5.24
  4     4.54   4.32   4.19   4.11   4.05   4.01   3.98   3.95   3.94
  5     4.06   3.78   3.62   3.52   3.45   3.40   3.37   3.34   3.32
  6     3.78   3.46   3.29   3.18   3.11   3.05   3.01   2.98   2.96
  7     3.59   3.26   3.07   2.96   2.88   2.83   2.78   2.75   2.72
  8     3.46   3.11   2.92   2.81   2.73   2.67   2.62   2.59   2.56
  9     3.36   3.01   2.81   2.69   2.61   2.55   2.51   2.47   2.44
 10     3.29   2.92   2.73   2.61   2.52   2.46   2.41   2.38   2.35
 11     3.23   2.86   2.66   2.54   2.45   2.39   2.34   2.30   2.27
 12     3.18   2.81   2.61   2.48   2.39   2.33   2.28   2.24   2.21
 13     3.14   2.76   2.56   2.43   2.35   2.28   2.23   2.20   2.16
 14     3.10   2.73   2.52   2.39   2.31   2.24   2.19   2.15   2.12
 15     3.07   2.70   2.49   2.36   2.27   2.21   2.16   2.12   2.09
 16     3.05   2.67   2.46   2.33   2.24   2.18   2.13   2.09   2.06
 17     3.03   2.64   2.44   2.31   2.22   2.15   2.10   2.06   2.03
 18     3.01   2.62   2.42   2.29   2.20   2.13   2.08   2.04   2.00
 19     2.99   2.61   2.40   2.27   2.18   2.11   2.06   2.02   1.98
 20     2.97   2.59   2.38   2.25   2.16   2.09   2.04   2.00   1.96
 21     2.96   2.57   2.36   2.23   2.14   2.08   2.02   1.98   1.95
 22     2.95   2.56   2.35   2.22   2.13   2.06   2.01   1.97   1.93
 23     2.94   2.55   2.34   2.21   2.11   2.05   1.99   1.95   1.92
 24     2.93   2.54   2.33   2.19   2.10   2.04   1.98   1.94   1.91
 25     2.92   2.53   2.32   2.18   2.09   2.02   1.97   1.93   1.89
 26     2.91   2.52   2.31   2.17   2.08   2.01   1.96   1.92   1.88
 27     2.90   2.51   2.30   2.17   2.07   2.00   1.95   1.91   1.87
 28     2.89   2.50   2.29   2.16   2.06   2.00   1.94   1.90   1.87
 29     2.89   2.50   2.28   2.15   2.06   1.99   1.93   1.89   1.86
 30     2.88   2.49   2.28   2.14   2.05   1.98   1.93   1.88   1.85
 40     2.84   2.44   2.23   2.09   2.00   1.93   1.87   1.83   1.79
 60     2.79   2.39   2.18   2.04   1.95   1.87   1.82   1.77   1.74
120     2.75   2.35   2.13   1.99   1.90   1.82   1.77   1.72   1.68
inf     2.71   2.30   2.08   1.94   1.85   1.77   1.72   1.67   1.63

F Values for α = 0.10 (continued)

d2\d1    10     12     15     20     24     30     40     60    120    inf
  1    60.19  60.71  61.22  61.74  62.00  62.26  62.53  62.79  63.06  63.33
  2     9.39   9.41   9.42   9.44   9.45   9.46   9.47   9.47   9.48   9.49
  3     5.23   5.22   5.20   5.18   5.18   5.17   5.16   5.15   5.14   5.13
  4     3.92   3.90   3.87   3.84   3.83   3.82   3.80   3.79   3.78   3.76
  5     3.30   3.27   3.24   3.21   3.19   3.17   3.16   3.14   3.12   3.10
  6     2.94   2.90   2.87   2.84   2.82   2.80   2.78   2.76   2.74   2.72
  7     2.70   2.67   2.63   2.59   2.58   2.56   2.54   2.51   2.49   2.47
  8     2.54   2.50   2.46   2.42   2.40   2.38   2.36   2.34   2.32   2.29
  9     2.42   2.38   2.34   2.30   2.28   2.25   2.23   2.21   2.18   2.16
 10     2.32   2.28   2.24   2.20   2.18   2.16   2.13   2.11   2.08   2.06
 11     2.25   2.21   2.17   2.12   2.10   2.08   2.05   2.03   2.00   1.97
 12     2.19   2.15   2.10   2.06   2.04   2.01   1.99   1.96   1.93   1.90
 13     2.14   2.10   2.05   2.01   1.98   1.96   1.93   1.90   1.88   1.85
 14     2.10   2.05   2.01   1.96   1.94   1.91   1.89   1.86   1.83   1.80
 15     2.06   2.02   1.97   1.92   1.90   1.87   1.85   1.82   1.79   1.76
 16     2.03   1.99   1.94   1.89   1.87   1.84   1.81   1.78   1.75   1.72
 17     2.00   1.96   1.91   1.86   1.84   1.81   1.78   1.75   1.72   1.69
 18     1.98   1.93   1.89   1.84   1.81   1.78   1.75   1.72   1.69   1.66
 19     1.96   1.91   1.86   1.81   1.79   1.76   1.73   1.70   1.67   1.63
 20     1.94   1.89   1.84   1.79   1.77   1.74   1.71   1.68   1.64   1.61
 21     1.92   1.87   1.83   1.78   1.75   1.72   1.69   1.66   1.62   1.59
 22     1.90   1.86   1.81   1.76   1.73   1.70   1.67   1.64   1.60   1.57
 23     1.89   1.84   1.80   1.74   1.72   1.69   1.66   1.62   1.59   1.55
 24     1.88   1.83   1.78   1.73   1.70   1.67   1.64   1.61   1.57   1.53
 25     1.87   1.82   1.77   1.72   1.69   1.66   1.63   1.59   1.56   1.52
 26     1.86   1.81   1.76   1.71   1.68   1.65   1.61   1.58   1.54   1.50
 27     1.85   1.80   1.75   1.70   1.67   1.64   1.60   1.57   1.53   1.49
 28     1.84   1.79   1.74   1.69   1.66   1.63   1.59   1.56   1.52   1.48
 29     1.83   1.78   1.73   1.68   1.65   1.62   1.58   1.55   1.51   1.47
 30     1.82   1.77   1.72   1.67   1.64   1.61   1.57   1.54   1.50   1.46
 40     1.76   1.71   1.66   1.61   1.57   1.54   1.51   1.47   1.42   1.38
 60     1.71   1.66   1.60   1.54   1.51   1.48   1.44   1.40   1.35   1.29
120     1.65   1.60   1.55   1.48   1.45   1.41   1.37   1.32   1.26   1.19
inf     1.60   1.55   1.49   1.42   1.38   1.34   1.30   1.24   1.17   1.00

F Values for α = 0.05

d2\d1     1      2      3      4      5      6      7      8      9
  1   161.4  199.5  215.7  224.6  230.2  234.0  236.8  238.9  240.5
  2    18.51  19.00  19.16  19.25  19.30  19.33  19.35  19.37  19.38
  3    10.13   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.81
  4     7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00
  5     6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.77
  6     5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10
  7     5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68
  8     5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39
  9     5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18
 10     4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02
 11     4.84   3.98   3.59   3.36   3.20   3.09   3.01   2.95   2.90
 12     4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.80
 13     4.67   3.81   3.41   3.18   3.03   2.92   2.83   2.77   2.71
 14     4.60   3.74   3.34   3.11   2.96   2.85   2.76   2.70   2.65
 15     4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.59
 16     4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.54
 17     4.45   3.59   3.20   2.96   2.81   2.70   2.61   2.55   2.49
 18     4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.46
 19     4.38   3.52   3.13   2.90   2.74   2.63   2.54   2.48   2.42
 20     4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.39
 21     4.32   3.47   3.07   2.84   2.68   2.57   2.49   2.42   2.37
 22     4.30   3.44   3.05   2.82   2.66   2.55   2.46   2.40   2.34
 23     4.28   3.42   3.03   2.80   2.64   2.53   2.44   2.37   2.32
 24     4.26   3.40   3.01   2.78   2.62   2.51   2.42   2.36   2.30
 25     4.24   3.39   2.99   2.76   2.60   2.49   2.40   2.34   2.28
 26     4.23   3.37   2.98   2.74   2.59   2.47   2.39   2.32   2.27
 27     4.21   3.35   2.96   2.73   2.57   2.46   2.37   2.31   2.25
 28     4.20   3.34   2.95   2.71   2.56   2.45   2.36   2.29   2.24
 29     4.18   3.33   2.93   2.70   2.55   2.43   2.35   2.28   2.22
 30     4.17   3.32   2.92   2.69   2.53   2.42   2.33   2.27   2.21
 40     4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.12
 60     4.00   3.15   2.76   2.53   2.37   2.25   2.17   2.10   2.04
120     3.92   3.07   2.68   2.45   2.29   2.17   2.09   2.02   1.96
inf     3.84   3.00   2.60   2.37   2.21   2.10   2.01   1.94   1.88

F Values for α = 0.05 (continued)

d2\d1    10     12     15     20     24     30     40     60    120    inf
  1   241.9  243.9  245.9  248.0  249.1  250.1  251.1  252.2  253.3  254.3
  2    19.40  19.41  19.43  19.45  19.45  19.46  19.47  19.48  19.49  19.50
  3     8.79   8.74   8.70   8.66   8.64   8.62   8.59   8.57   8.55   8.53
  4     5.96   5.91   5.86   5.80   5.77   5.75   5.72   5.69   5.66   5.63
  5     4.74   4.68   4.62   4.56   4.53   4.50   4.46   4.43   4.40   4.36
  6     4.06   4.00   3.94   3.87   3.84   3.81   3.77   3.74   3.70   3.67
  7     3.64   3.57   3.51   3.44   3.41   3.38   3.34   3.30   3.27   3.23
  8     3.35   3.28   3.22   3.15   3.12   3.08   3.04   3.01   2.97   2.93
  9     3.14   3.07   3.01   2.94   2.90   2.86   2.83   2.79   2.75   2.71
 10     2.98   2.91   2.85   2.77   2.74   2.70   2.66   2.62   2.58   2.54
 11     2.85   2.79   2.72   2.65   2.61   2.57   2.53   2.49   2.45   2.40
 12     2.75   2.69   2.62   2.54   2.51   2.47   2.43   2.38   2.34   2.30
 13     2.67   2.60   2.53   2.46   2.42   2.38   2.34   2.30   2.25   2.21
 14     2.60   2.53   2.46   2.39   2.35   2.31   2.27   2.22   2.18   2.13
 15     2.54   2.48   2.40   2.33   2.29   2.25   2.20   2.16   2.11   2.07
 16     2.49   2.42   2.35   2.28   2.24   2.19   2.15   2.11   2.06   2.01
 17     2.45   2.38   2.31   2.23   2.19   2.15   2.10   2.06   2.01   1.96
 18     2.41   2.34   2.27   2.19   2.15   2.11   2.06   2.02   1.97   1.92
 19     2.38   2.31   2.23   2.16   2.11   2.07   2.03   1.98   1.93   1.88
 20     2.35   2.28   2.20   2.12   2.08   2.04   1.99   1.95   1.90   1.84
 21     2.32   2.25   2.18   2.10   2.05   2.01   1.96   1.92   1.87   1.81
 22     2.30   2.23   2.15   2.07   2.03   1.98   1.94   1.89   1.84   1.78
 23     2.27   2.20   2.13   2.05   2.01   1.96   1.91   1.86   1.81   1.76
 24     2.25   2.18   2.11   2.03   1.98   1.94   1.89   1.84   1.79   1.73
 25     2.24   2.16   2.09   2.01   1.96   1.92   1.87   1.82   1.77   1.71
 26     2.22   2.15   2.07   1.99   1.95   1.90   1.85   1.80   1.75   1.69
 27     2.20   2.13   2.06   1.97   1.93   1.88   1.84   1.79   1.73   1.67
 28     2.19   2.12   2.04   1.96   1.91   1.87   1.82   1.77   1.71   1.65
 29     2.18   2.10   2.03   1.94   1.90   1.85   1.81   1.75   1.70   1.64
 30     2.16   2.09   2.01   1.93   1.89   1.84   1.79   1.74   1.68   1.62
 40     2.08   2.00   1.92   1.84   1.79   1.74   1.69   1.64   1.58   1.51
 60     1.99   1.92   1.84   1.75   1.70   1.65   1.59   1.53   1.47   1.39
120     1.91   1.83   1.75   1.66   1.61   1.55   1.50   1.43   1.35   1.25
inf     1.83   1.75   1.67   1.57   1.52   1.46   1.39   1.32   1.22   1.00

Chi-Square Distribution

Probability Density Function

The chi-square distribution results when independent variables with standard normal
distributions are squared and summed. The formula for the probability density function of the chi-square distribution is

f(x) = x^(ν/2 − 1) e^(−x/2) / (2^(ν/2) Γ(ν/2)),  for x ≥ 0,

where ν is the shape parameter (the degrees of freedom) and Γ is the gamma function. The formula for the gamma function is

Γ(a) = ∫₀^∞ t^(a−1) e^(−t) dt.

In a testing context, the chi-square distribution is treated as a "standardized distribution" (i.e., no location or scale parameters). However, in a distributional modeling context (as with other probability distributions), the chi-square distribution itself can be transformed with a location parameter and a scale parameter.

[Plot: the chi-square probability density function for 4 different values of the shape parameter ν.]

Cumulative Distribution Function

The formula for the cumulative distribution function of the chi-square distribution is

F(x) = γ(ν/2, x/2) / Γ(ν/2),

where Γ is the gamma function defined above and γ is the (lower) incomplete gamma function. The formula for the incomplete gamma function is

γ(a, x) = ∫₀^x t^(a−1) e^(−t) dt.

[Plot: the chi-square cumulative distribution function with the same values of ν as the pdf plots above.]

Percent Point Function

The formula for the percent point function of the chi-square distribution does not exist in a simple closed form. It is computed numerically.

[Plot: the chi-square percent point function with the same values of ν as the pdf plots above.]

Other Probability Functions

Since the chi-square distribution is typically used to develop hypothesis tests and confidence intervals and rarely for modeling applications, we omit the formulas and plots for the hazard, cumulative hazard, survival, and inverse survival probability functions.

Common Statistics

Mean                      ν
Median                    approximately ν − 2/3 for large ν
Mode                      ν − 2 (for ν > 2)
Range                     0 to positive infinity
Standard Deviation        √(2ν)
Coefficient of Variation  √(2/ν)
Skewness                  √(8/ν)
Kurtosis                  3 + 12/ν

Parameter Estimation

Since the chi-square distribution is typically used to develop hypothesis tests and confidence intervals and rarely for modeling applications, we omit any discussion of parameter estimation.
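The density formula above can be checked numerically: integrating it should give total probability 1, and the mean should equal the shape parameter ν. This is a standard-library sketch for ν = 4, using Simpson's rule.

```python
from math import lgamma, log, exp

def chisq_pdf(x, nu):
    """Chi-square density: x**(nu/2 - 1) * exp(-x/2) / (2**(nu/2) * Gamma(nu/2))."""
    if x <= 0:
        return 0.0
    return exp((nu / 2 - 1) * log(x) - x / 2
               - (nu / 2) * log(2) - lgamma(nu / 2))

# Simpson's rule on [0, 80]; the remaining right tail is negligible for nu = 4
nu, upper, steps = 4, 80.0, 40_000
h = upper / steps
total = mean = 0.0
for i in range(steps + 1):
    x = i * h
    w = 1 if i in (0, steps) else (4 if i % 2 else 2)
    total += w * chisq_pdf(x, nu)         # integral of f(x): should be 1
    mean += w * x * chisq_pdf(x, nu)      # integral of x f(x): should be nu
total *= h / 3
mean *= h / 3

print(total, mean)  # close to 1 and to nu = 4
```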
Comments
The chi-square distribution is used in many cases for the critical regions for hypothesis tests and in determining confidence intervals. Two common examples are the chi-square test for independence in an RxC contingency table and the chi-square test to determine if the standard deviation of a population is equal to a pre-specified value.
Software
Most general purpose statistical software programs, including Dataplot, support at least some of the probability functions for the chi-square distribution.
UNIT IV NON PARAMETRIC TESTS
3.1 Introduction
Nonparametric, or distribution-free, tests are so called because the assumptions underlying their use are “fewer and weaker than those associated with parametric tests” (Siegel & Castellan, 1988, p. 34). To put it another way, nonparametric tests require few if any assumptions about the shapes of the underlying population distributions. For this reason, they are often used in place of parametric tests when one feels that the assumptions of the parametric test have been too grossly violated (e.g., if the distributions are too severely skewed). Discussion of some of the more common nonparametric tests follows.
3.2 The Sign Test (for 2 repeated/correlated measures)
The sign test is one of the simplest nonparametric tests. It is for use with 2 repeated (or correlated) measures (see the example below), and measurement is assumed to be at least ordinal. For each subject, subtract the 2nd score from the 1st, and write down the sign of the difference. (That is, write “-” if the difference score is negative, and “+” if it is positive.) The usual null hypothesis for this test is that there is no difference between the two treatments. If this is so, then the number of + signs (or - signs, for that matter) should have a binomial distribution1 with p = .5, and N = the number of subjects. In other words, the sign test is just a binomial test with + and - in place of Head and Tail (or Success and Failure).
EXAMPLE
A physiologist wants to know if monkeys prefer stimulation of brain area A to stimulation of brain area B. In the experiment, 14 rhesus monkeys are taught to press two bars. When a light comes on, presses on Bar 1 always result in stimulation of area A; and presses on Bar 2 always result in stimulation of area B. After learning to press the bars, the monkeys are tested for 15 minutes, during which time the frequencies for the two bars are recorded. The data are shown in Table 3.1. To carry out the sign test, we could let our statistic be the number of + signs, which is 3 in this case. The researcher did not predict a particular outcome in this case, but wanted to know if the two conditions differed. Therefore, the alternative hypothesis is nondirectional. That is, the alternative hypothesis would be supported by an extreme number of + signs, be it small or large. A middling number of + signs would be consistent with the null. The sampling distribution of the statistic is the binomial distribution with N = 14 and p = .5. With this distribution, we would find that the probability of 3 or fewer + signs is .0287. But because the alternative is nondirectional, or two-tailed, we must also take into account the probability of 11 or more + signs, which is also .0287. Adding these together, we find that the probability of (3 or fewer) or (11 or more) is .0574. Therefore, if our pre-determined alpha was set at .05, we would not have sufficient evidence to allow rejection of the null hypothesis.
1 The binomial distribution is discussed on pages 37-40 of Norman & Streiner (2nd ed.). It is also discussed in some detail in my chapter on “Probability and Hypothesis Testing” (in the file prob_hyp.pdf).
B. Weaver (15-Feb-2002) Nonparametric Tests ...
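The two-tailed binomial probability quoted above can be reproduced with a few lines of Python (a sketch using only the standard library; the counts come from the example, with p = .5 under the null):

```python
from math import comb

n, plus = 14, 3  # 14 monkeys, 3 "+" signs observed in the example

# P(3 or fewer + signs) and P(11 or more + signs) under Binomial(14, 0.5)
p_low = sum(comb(n, k) for k in range(plus + 1)) / 2 ** n
p_high = sum(comb(n, k) for k in range(n - plus, n + 1)) / 2 ** n

print(round(p_low, 4), round(p_low + p_high, 4))  # 0.0287 0.0574
```

Because the binomial with p = .5 is symmetric, the two tails are equal, which is why the text simply doubles .0287.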
Table 3.1 Number of bar-presses in brain stimulation experiment
Subject   Bar 1   Bar 2   Difference   Sign of Difference
1         20      40      -20          -
2         18      25      -7           -
3         24      38      -14          -
4         14      27      -13          -
5         5       31      -26          -
6         26      21      +5           +
7         15      32      -17          -
8         29      38      -9           -
9         15      25      -10          -
10        9       18      -9           -
11        25      32      -7           -
12        31      28      +3           +
13        35      33      +2           +
14        12      29      -17          -
Tied scores
If a subject has the same score in each condition, there will be no sign, because the difference score is zero. In the case of tied scores, some textbook authors recommend dropping those subjects, and reducing N by the appropriate number. This is not the best way to deal with ties, however, because reduction of N can result in the loss of too much power. A better approach would be as follows: If there is only one subject with tied scores, drop that subject, and reduce N by one. If there are 2 subjects with tied scores, make one a + and one a -. In general, if there is an even number of subjects with tied scores, make half of them + signs, and half - signs. For an odd number of subjects (greater than 1), drop one randomly selected subject, and then proceed as for an even number.
3.3 Wilcoxon Signed-Ranks Test (for 2 repeated/correlated measures)
One obvious problem with the sign test is that it discards a lot of information about the data. It takes into account the direction of the difference, but not the magnitude of the difference between each pair of scores. The Wilcoxon signed-ranks test is another nonparametric test that can be used for 2 repeated (or correlated) measures when measurement is at least ordinal. But unlike the sign test, it does take into account (to some degree, at least) the magnitude of the difference. Let us return to the data used to illustrate the sign test. The 14 difference scores were:
-20, -7, -14, -13, -26, +5, -17, -9, -10, -9, -7, +3, +2, -17
If we sort these on the basis of their absolute values (i.e., disregarding the sign), we get the results shown in Table 3.2.
The statistic T is found by calculating the sum of the positive ranks, and the sum of the negative ranks. T is the smaller of these two sums. In this case, therefore, T = 6. If the null hypothesis is true, the sum of the positive ranks and the sum of the negative ranks are expected to be roughly equal. But if H0 is false, we expect one of the sums to be quite small--and therefore T is expected to be quite small. The most extreme outcome favourable to rejection of H0 is T = 0.
Table 3.2 Difference scores ranked by absolute value
Score   Rank
+2      1
+3      2
+5      3
-7      4.5
-7      4.5
-9      6.5
-9      6.5
-10     8
-13     9
-14     10
-17     11.5
-17     11.5
-20     13
-26     14
Sum of positive ranks = 6
Sum of negative ranks = 99
T = 6
If we wished to, we could generate the sampling distribution of T (i.e., the distribution of T assuming that the null hypothesis is true), and see if the observed value of T is in the rejection region. This is not necessary, however, because the sampling distribution of T can be found in tables in most introductory level statistics textbooks. When I consulted such a table, I found that for N = 14, and α = .05 (2-tailed), the critical value of T = 21. The rule is that if T is equal to or less than Tcritical, we can reject the null hypothesis. Therefore, in this example, we would reject the null hypothesis. That is, we would conclude that monkeys prefer stimulation in brain area B to stimulation in area A. This decision differs from our failure to reject H0 when we analysed the same data using the sign test. The reason for this difference is that the Wilcoxon signed-ranks test is more powerful than the sign test. Why is it more powerful? Because it makes use of more information than does the sign test. But note that it too does discard some information by using ranks rather than the scores themselves (which a paired t-test would use, for example).
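The ranking and the value T = 6 can be reproduced with a short Python sketch (illustrative only; the ranking loop gives tied absolute values the mean of the ranks they would otherwise occupy, as the example table does):

```python
diffs = [-20, -7, -14, -13, -26, +5, -17, -9, -10, -9, -7, +3, +2, -17]

# rank by absolute value, averaging ranks over ties
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0.0] * len(diffs)
i = 0
while i < len(order):
    j = i
    while j < len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
        j += 1
    mean_rank = (i + 1 + j) / 2  # mean of positions i+1 .. j
    for k in range(i, j):
        ranks[order[k]] = mean_rank
    i = j

t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
print(t_plus, t_minus, min(t_plus, t_minus))  # 6.0 99.0 6.0
```

The sum of all 14 ranks is 105 = 14(15)/2, so the two sums 6 and 99 also pass the check described in the text.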
Tied scores
When ranking scores, it is customary to deal with tied ranks in the following manner: Give the tied scores the mean of the ranks they would have if they were not tied. For example, if you have 2 tied scores that would occupy positions 3 and 4 if they were not tied, give each one a rank of 3.5. If you have 3 scores that would occupy positions 7, 8, and 9, give each one a rank of 8. This procedure is preferred to any other for dealing with tied scores, because the sum of the ranks for a fixed number of scores will be the same regardless of whether or not there are any tied scores. If there are tied ranks in data you are analysing with the Wilcoxon signed-ranks test, the statistic needs to be adjusted to compensate for the decreased variability of the sampling distribution of T. Siegel and Castellan (1988, p. 94) describe this adjustment, for those who are interested. Note that if you are able to reject H0 without making the correction, then do not bother, because the correction will increase your chances of rejecting H0. Note as well that the problem becomes more severe as the number of tied ranks increases.
3.4 Mann-Whitney U Test (for 2 independent samples)
The most basic independent groups design has two groups. These are often called Experimental and Control. Subjects are randomly selected from the population and randomly assigned to two groups. There is no basis for pairing scores. Nor is it necessary to have the same number of scores in the two groups. The Mann-Whitney U test is a nonparametric test that can be used to analyse data from a two-group independent groups design when measurement is at least ordinal. It analyses the degree of separation (or the amount of overlap) between the Experimental and Control groups.
The null hypothesis assumes that the two sets of scores (E and C) are samples from the same population; and therefore, because sampling was random, the two sets of scores do not differ systematically from each other. The alternative hypothesis, on the other hand, states that the two sets of scores do differ systematically. If the alternative is directional, or one-tailed, it further specifies the direction of the difference (i.e., Group E scores are systematically higher or lower than Group C scores). The statistic that is calculated is either U or U':
U1 = the number of Es less than Cs
U2 = the number of Cs less than Es
U = the smaller of the two values calculated above
U' = the larger of the two values calculated above
Calculating U directly
When the total number of scores is small, U can be calculated directly by counting the number of Es less than Cs (or Cs less than Es). Consider the following example:
Table 3.3 Data for two independent groups
Group E: 12 17 9 21
Group C: 8 18 26 15 23
It will be easier to count the number of Es less than Cs (and vice versa) if we rank the data from lowest to highest, and rewrite it as shown in Table 3.4.
Table 3.4 Illustration of direct calculation of the U statistic
Score   Group   Rank   E<C   C<E
8       C       1      0
9       E       2            1
12      E       3            1
15      C       4      2
17      E       5            2
18      C       6      3
21      E       7            3
23      C       8      4
26      C       9      4
Totals                 13    7
Thus U = 7 and U' = 13.
CHECK: Note that U + U' = n1n2. This will always be true, and can be used to check your calculations. In this case, U + U' = 7 + 13 = 20; and n1n2 = 4(5) = 20.
Calculating U with formulae
When the total number of scores is a bit larger, or if there are tied scores, it may be more convenient to calculate U with the following formulae:
U1 = n1n2 + n1(n1 + 1)/2 − R1   (3.1)
U2 = n1n2 + n2(n2 + 1)/2 − R2   (3.2)
where
n1 = # of scores in group 1
n2 = # of scores in group 2
R1 = sum of ranks for group 1
R2 = sum of ranks for group 2
As before, U = smaller of U1 and U2, and U' = larger of U1 and U2.
For the data shown above, R1 = 2 + 3 + 5 + 7 = 17; and R2 = 1 + 4 + 6 + 8 + 9 = 28. Substituting into the formulae, we get:
U1 = 4(5) + 4(4 + 1)/2 − 17 = 13   (3.3)
U2 = 4(5) + 5(5 + 1)/2 − 28 = 7   (3.4)
Therefore, U = 7 and U' = 13.
Making a decision
The next step is deciding whether to reject H0 or not. In principle, we could generate a probability distribution for U that is conditional on the null hypothesis being true--much like we did when working with the binomial distribution earlier. Fortunately, we do not have to do this, because there are tables in the back of many statistics textbooks that give you the critical values of U (or U') for different values of n1 and n2, and for various significance levels. For the case we've been considering, n1 = 4 and n2 = 5. For a two-tailed test with α = .05, the critical value of U = 1. In order to reject H0, the observed value of U would have to be equal to or less than the critical value of U. (Note that maximum separation of E and C scores is indicated by U = 0. As the E and C scores become more mixed, U becomes larger. Therefore, small values of U lead to rejection of H0.) Therefore, we would decide that we cannot reject H0 in this case.
Tied scores
According to Siegel and Castellan (1988), any ties that involve observations in the same group do not affect the values of U and U'. (Note that Siegel and Castellan refer to this test as the Wilcoxon-Mann-Whitney Test, and that they call the statistic W rather than U.) But if two or more tied ranks involve observations from both groups, then the values of U and U' are affected, and a correction should be applied. See Siegel & Castellan (1988, p. 134) should you ever need more information on this, and note that the problem is particularly severe if you are dealing with the large-sample version of the test, which we have not yet discussed.
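Equations 3.1 and 3.2 can be checked against the example data with a small script (a sketch; ranks are averaged over ties for generality, though this example has none, and the helper name is my own):

```python
def ranks_of(pooled):
    """Ranks 1..N for the pooled scores, averaging ranks over ties."""
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    r = [0.0] * len(pooled)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and pooled[order[j]] == pooled[order[i]]:
            j += 1
        for k in range(i, j):
            r[order[k]] = (i + 1 + j) / 2  # mean of positions i+1 .. j
        i = j
    return r

E = [12, 17, 9, 21]      # group E (n1 = 4)
C = [8, 18, 26, 15, 23]  # group C (n2 = 5)
n1, n2 = len(E), len(C)
r = ranks_of(E + C)
R1, R2 = sum(r[:n1]), sum(r[n1:])
U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1  # equation 3.1
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - R2  # equation 3.2
print(R1, R2, min(U1, U2), max(U1, U2))  # 17.0 28.0 7.0 13.0
```

The output matches the direct count in Table 3.4 (U = 7, U' = 13) and satisfies the check U + U' = n1n2 = 20.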
3.5 Kruskal-Wallis H-test (for k independent samples)
The Kruskal-Wallis H-test goes by various names, including Kruskal-Wallis one-way analysis of variance by ranks (e.g., in Siegel & Castellan, 1988). It is for use with k independent groups, where k is equal to or greater than 3, and measurement is at least ordinal. (When k = 2, you would use the Mann-Whitney U-test instead.) Note that because the samples are independent, they can be of different sizes. The null hypothesis is that the k samples come from the same population, or from populations with identical medians. The alternative hypothesis states that not all population medians are equal. It is assumed that the underlying distributions are continuous; but only ordinal measurement is required. The statistic H (sometimes also called KW) can be calculated in one of two ways:
H = [12 / (N(N + 1))] Σ ni(R̄i − R̄)²   (3.5)
or, the more common computational formula,
H = [12 / (N(N + 1))] Σ (Ri² / ni) − 3(N + 1)   (3.6)
where
k = the number of independent samples
ni = the number of cases in the ith sample
N = the total number of cases
Ri = the sum of the ranks in the ith sample
R̄i = the mean of the ranks for the ith sample
R̄ = (N + 1)/2 = the mean of all ranks
Example
A student was interested in comparing the effects of four kinds of reinforcement on children's performance on a test of reading comprehension. The four reinforcements used were: (a) praise for correct responses; (b) a jelly bean for each correct response; (c) reproof for incorrect responses; and (d) silence. Four independent groups of children were tested, and each group received only one kind of reinforcement. The measure of performance given below is the number of errors made during the course of testing.
Table 3.5 Data from 4 independent groups
a     b     c     d
68    78    94    54
63    69    82    51
58    58    73    32
51    57    67    74
41    53    66    65
            61    80
The first step in carrying out the Kruskal-Wallis H-test is to rank order all of the scores from lowest to highest. This can be quite laborious work if you try to do it by hand, but is fairly easy if you use a spreadsheet program. Enter all scores in a single column, and enter a group code for each score one column over. For example:
Group   Score
a       68
a       63
a       58
a       51
a       41
b       78
etc.
When all the data are entered thus, sort the scores (and their codes) from lowest to highest. Now you can enter ranks from 1 to N (taking care to deal with tied scores appropriately). After the scores are ranked, you can sort the data by group code, and then calculate the sum and mean of the ranks for each group. I did this for the data shown above, and came up with the following ranks:
Table 3.6 Ranks of data from Table 3.5
a      b      c      d
15     19     22     6
11     16     21     3.5
8.5    8.5    17     1
3.5    7      14     18
2      5      13     12
              10     20
40.0   55.5   97.0   60.5    Sum of ranks
8.0    11.1   16.2   10.1    Mean of ranks
CHECK: The sum of ranks from 1 to N will always be equal to [N(N+1)]/2. We can use this to check our work to this point. We have 22 scores in total, so the sum of all ranks should be [22(23)]/2 = 253. Similarly, when we add the sum of ranks for each group, we get 40 + 55.5 + 97 + 60.5 = 253. Therefore, the mean of ALL ranks = 253/22 = 11.5. Now plugging into Equation 3.6 shown above, we get H = 4.856. If the null hypothesis is true, and the k samples are drawn from the same population (or populations with identical medians), and if k > 3, and all samples have 5 or more scores, then the distribution of H closely approximates the chi-squared distribution with df = k-1. The critical value of chi-squared with df = 3 and α = .05 is 7.82. In order to reject H0, the obtained value of H would have to be equal to or greater than 7.82.
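As a check on the arithmetic, the ranking and the value H = 4.856 can be reproduced with a short script (a sketch of the spreadsheet procedure described above; the helper and variable names are my own):

```python
def avg_ranks(values):
    """Ranks 1..N for all values, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2  # mean of positions i+1 .. j
        i = j
    return ranks

groups = {"a": [68, 63, 58, 51, 41],
          "b": [78, 69, 58, 57, 53],
          "c": [94, 82, 73, 67, 66, 61],
          "d": [54, 51, 32, 74, 65, 80]}
labels = [g for g, vals in groups.items() for _ in vals]
pooled = [v for vals in groups.values() for v in vals]
ranks = avg_ranks(pooled)
N = len(pooled)

rank_sums = {g: sum(r for r, lab in zip(ranks, labels) if lab == g) for g in groups}
# Equation 3.6: H = 12/(N(N+1)) * sum(Ri^2 / ni) - 3(N+1)
H = 12 / (N * (N + 1)) * sum(rank_sums[g] ** 2 / len(groups[g]) for g in groups) - 3 * (N + 1)
print(rank_sums, round(H, 3))  # rank sums 40.0, 55.5, 97.0, 60.5 and H = 4.856
```

The rank sums agree with Table 3.6 and total 253 = [22(23)]/2, as the CHECK paragraph requires.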
Because the obtained H = 4.856 is less than 7.82, we cannot reject the null hypothesis.
Sampling distribution of H
As described above, when H0 is true, if k > 3, and all samples have 5 or more scores, then the sampling distribution of H is closely approximated by the chi-squared distribution with df = k-1. If k = 3 and the number of scores in each sample is 5 or fewer, then the chi-squared distribution should not be used. In this case, one should use a table of critical values of H (e.g., Table O in Siegel & Castellan, 1988).
Tied observations
Tied scores are dealt with in the manner described previously (i.e., they are given the mean of the ranks they would receive if they were not tied). The presence of tied scores does affect the variance of the sampling distribution of H. Siegel and Castellan (1988) show a correction that can be applied in the case of tied scores, but go on to observe that its effect is to increase the value of H. Therefore, if you are able to reject H0 without correcting for ties, there is no need to do the correction. It should only be contemplated when you have failed to reject H0.
Multiple comparisons
Rejection of H0 tells you that at least one of the k samples is drawn from a population with a median different from the others. But it does not tell you which one, or how many are different. There are procedures for conducting multiple comparisons between treatments, or comparisons of a control condition to all other conditions in order to answer these kinds of questions. Should you ever need to use one of them, consult Siegel and Castellan (1988, pp. 213-215).
3.6 The Jonckheere test for ordered alternatives
The Jonckheere test for ordered alternatives is similar to the Kruskal-Wallis test, but has a more specific alternative hypothesis. The alternative hypothesis for the Kruskal-Wallis test states that not all population medians are equal.
The more precise alternative hypothesis for the Jonckheere test can be summarised as follows:
H1: θ1 ≤ θ2 ≤ ... ≤ θk (with at least one strict inequality)
where the θ's are the population medians. This alternative is tested against a null hypothesis of no systematic trend across treatments. The test can be applied when you have data for k independent samples, when measurement is at least ordinal, and when it is possible to specify a priori the ordering of the groups. Because the alternative hypothesis specifies the order of the medians, the test is one-tailed. Siegel and Castellan (1988) use J to symbolise the statistic that is calculated. It is sometimes also called the “Mann-Whitney count”. As this name implies, J is based on the same kind of counting and summing that we saw when calculating the U statistic via the direct method. The mechanics of it become somewhat complicated for the Jonckheere test, so we will not go into it here. (I hope you are not too disappointed!) Should you ever need to perform this test, see Program 5 in Appendix II of Siegel and Castellan (1988). Siegel and Castellan also provide a table of critical values of J (for small sample tests).
3.7 Friedman ANOVA
This test is sometimes called the Friedman two-way analysis of variance by ranks. It is for use with k repeated (or correlated) measures where measurement is at least ordinal. The null hypothesis states that all k samples are drawn from the same population, or from populations with equal medians.
Example
Table 3.7 shows reaction time data from 5 subjects, each of whom was tested in 3 conditions (A, B, and C). The Friedman ANOVA uses ranks, and so the first thing we must do is rank order the k scores for each subject. The results of this ranking are shown alongside the raw data, and the sum of the ranks (ΣRi) for each treatment is shown at the bottom.
Table 3.7 RT data and ranks for 3 levels of a within-subjects variable
        RT data                Ranks
Subj    A     B     C        A    B    C
1      386   411   454       1    2    3
2      542   563   556       1    3    2
3      662   667   665       1    3    2
4      453   502   574       1    2    3
5      548   546   575       2    1    3
                     ΣRi     6   11   13
It may be useful at this point to consider what kinds of outcomes are expected if H0 is true. H0 states that all of the samples (columns) are drawn from the same population, or from populations with the same median. If so, then the sums (or means) of the ranks for each of the columns should all be roughly equal, because the ranks 1, 2, and 3 would be expected by chance to appear equally often in each column. In this example, the expected ΣR for each treatment would be 10 if H0 is true. (In general, the expected sum of ranks for each treatment is N(k + 1)/2.) The Friedman ANOVA assesses the degree to which the observed ΣR's depart from the expected ΣR's. If the departure is too extreme (or not likely due to chance), one concludes by rejecting H0. The Fr statistic is calculated as follows:
Fr = [12 / (Nk(k + 1))] Σ Ri² − 3N(k + 1)   (3.7)
where
N = the number of subjects
k = the number of treatments
Ri = the sum of the ranks for the ith treatment
Critical values of Fr for various sample sizes and numbers of treatments can be found in tables (e.g., Table M in Siegel & Castellan, 1988). Note that when the number of treatments or subjects is large, the sampling distribution of Fr is closely approximated by the chi-squared distribution with df = k - 1. (Generally, use a table of critical values for Fr if it provides a value for your particular combination of k and N. If either k or N are too large for the table of critical values, then use the chi-squared distribution with df = k - 1.) For the example we've been looking at, plugging the rank sums into Equation 3.7 gives Fr = [12/(5 × 3 × 4)](6² + 11² + 13²) − 3(5)(4) = 65.2 − 60 = 5.2, and the critical values of Fr are 6.40 for α = .05, and 8.40 for α = .01. In order to reject H0, the obtained value of Fr must be equal to or greater than the critical value. Therefore, we would fail to reject H0 in this case.
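The within-subject ranking and the Fr statistic can be reproduced with a short sketch (the data are from Table 3.7, which has no ties, so the simple per-subject sort suffices; variable names are my own):

```python
rt = {1: (386, 411, 454), 2: (542, 563, 556), 3: (662, 667, 665),
      4: (453, 502, 574), 5: (548, 546, 575)}  # subject -> (A, B, C)
N, k = len(rt), 3

rank_sums = [0] * k  # running sum of ranks for treatments A, B, C
for scores in rt.values():
    # rank the k scores within each subject, 1 = fastest (no ties in this example)
    order = sorted(range(k), key=lambda j: scores[j])
    for rank, j in enumerate(order, start=1):
        rank_sums[j] += rank

# Equation 3.7: Fr = 12/(N k (k+1)) * sum(Ri^2) - 3 N (k+1)
Fr = 12 / (N * k * (k + 1)) * sum(R ** 2 for R in rank_sums) - 3 * N * (k + 1)
print(rank_sums, round(Fr, 1))  # [6, 11, 13] 5.2
```

Since 5.2 < 6.40, the script agrees with the conclusion in the text that H0 is not rejected at α = .05.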
Tied scores
If there are ties among the ranks, the Fr statistic must be corrected, because the sampling distribution changes. The formula that corrects for tied ranks is actually a general formula that also works when there are no ties. However, it is rather complicated, which is why the simplified version shown above is used when possible. The general formula is not shown here, but can be found in Siegel and Castellan (1988, p. 179), should you need it.
Multiple comparisons
Siegel and Castellan (1988) also give formulae to be used for conducting multiple comparisons and/or comparisons of a control condition to each of the other conditions.
3.8 Large sample versions of nonparametric tests
You may have noticed that the tables of critical values for many nonparametric statistics only go up to sample sizes of about 25-50. If so, perhaps you have wondered what to do when you have sample sizes larger than that, and want to carry out a nonparametric test. Fortunately, it turns out that the sampling distributions of many nonparametric statistics converge on the normal distribution as sample size increases. Because of that, it is possible to carry out a so-called “large-sample” version of the test (which is really a z-test) if you know the mean and variance of the sampling distribution for that particular statistic.
Common structure of all z- and t-tests
As I have mentioned before, all z- and t-tests have a common structure. In general terms:
z (or t) = [statistic − (parameter | H0 is true)] / standard error of the statistic   (3.8)
When the sampling distribution of the statistic in the numerator is normal, then if the true (population) standard error (SE) of the statistic is known, the computed ratio can be evaluated against the standard normal (z) distribution.
If the true standard error of the statistic is not known, then it must be estimated from the sample data, and the proper sampling distribution is a t-distribution with some number of degrees of freedom.
Example: Large-sample Mann-Whitney U test
The following facts are known about the sampling distribution of the U statistic used in the Mann-Whitney U test:
μU = n1n2 / 2   (3.9)
σU = √[n1n2(n1 + n2 + 1) / 12]   (3.10)
Furthermore, when both sample sizes are greater than about 20, the sampling distribution of U is (for practical purposes) normal. Therefore, under these conditions, one can perform a z-test as follows:
zU = (U − μU) / σU   (3.11)
The obtained value of zU can be evaluated against a table of the standard normal distribution (e.g., Table A in Norman & Streiner, 2000). Alternatively, one can use software to calculate the p-value for a given z-score, e.g., StaTable from Cytel, which is available here: http://www.cytel.com/statable/index.html
Example: Large-sample Wilcoxon signed ranks test
The following are known to be true about the sampling distribution of T, the statistic used in the Wilcoxon signed ranks test:
μT = N(N + 1) / 4   (3.12)
σT = √[N(N + 1)(2N + 1) / 24]   (3.13)
If N > 50, then the sampling distribution of T is for practical purposes normal. And so, a z-ratio can be computed as follows:
zT = (T − μT) / σT   (3.14)
The obtained value of zT can be evaluated against a table of the standard normal distribution, or using software as described above.
Example: Large-sample Jonckheere test for ordered alternatives
The mean and standard deviation of the sampling distribution of J are given by the following:
μJ = (N² − Σ ni²) / 4   (3.15)
σJ = √{[N²(2N + 3) − Σ ni²(2ni + 3)] / 72}   (3.16)
where
N = total number of observations
ni = the number of observations in the ith group
k = the number of independent groups
As sample sizes increase, the sampling distribution of J converges on the normal, and so one can perform a z-test as follows:
zJ = (J − μJ) / σJ   (3.17)
Example: Large sample sign test
The sampling distribution used in carrying out the sign test is a binomial distribution with p = q = .5. The mean of a binomial distribution is equal to Np, and the variance is equal to Npq. As N increases, the binomial distribution converges on the normal distribution (especially when p = q = .5). When N is large enough (i.e., greater than 30 or 50, depending on how conservative one is), it is possible to carry out a z-test version of the sign test as follows:
z = (X − Np) / √(Npq)   (3.18)
You may recall that z² is equal to χ² with df = 1. Therefore,
z² = χ²(1) = (X − Np)² / Npq   (3.19)
This formula can be expanded with what Howell (1997) calls “some not-so-obvious algebra” to yield:
χ²(1) = (X − Np)² / Np + (N − X − Nq)² / Nq   (3.20)
Note that X equals the observed number of p-events, and Np equals the expected number of p-events under the null hypothesis. Similarly, N - X equals the observed number of q-events, and Nq = the expected number of q-events under the null hypothesis. Therefore, we can rewrite equation (3.20) in a more familiar looking format as follows:
χ² = (O1 − E1)² / E1 + (O2 − E2)² / E2 = Σ (O − E)² / E   (3.21)
Large-sample z-tests with small samples
Many computerised statistics packages automatically compute the large-sample (z-test) version of nonparametric tests, even when the sample sizes are small. Note however, that the z-test is just an approximation that can be used when sample sizes are sufficiently large. If the sample sizes are small enough to allow use of a table of critical values for your particular nonparametric statistic, you should always use it rather than a z-test.
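Equations 3.9-3.11 can be illustrated with a small sketch (the sample sizes and the observed U here are made-up numbers chosen only to show the mechanics, not data from the text):

```python
import math

# Hypothetical illustration: suppose U = 250 was observed with n1 = n2 = 30
n1 = n2 = 30
U = 250.0

mu_U = n1 * n2 / 2                                 # equation 3.9
sigma_U = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # equation 3.10
z = (U - mu_U) / sigma_U                           # equation 3.11

print(round(mu_U, 1), round(sigma_U, 2), round(z, 2))  # 450.0 67.64 -2.96
```

A z of about -2.96 lies beyond the two-tailed .05 cutoff of ±1.96, so in this made-up case the large-sample test would reject H0; with both n's at 30 the normal approximation is reasonable by the rule of thumb given above.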
3.9 Advantages of nonparametric tests
Siegel and Castellan (1988, p. 35) list the following advantages of nonparametric tests:
1. If the sample size is very small, there may be no alternative to using a nonparametric statistical test unless the nature of the population distribution is known exactly.
2. Nonparametric tests typically make fewer assumptions about the data and may be more relevant to a particular situation. In addition, the hypothesis tested by the nonparametric test may be more appropriate for the research investigation.
3. Nonparametric tests are available to analyze data which are inherently in ranks as well as data whose seemingly numerical scores have the strength of ranks. That is, the researcher may only be able to say of his or her subjects that one has more or less of the characteristic than another, without being able to say how much more or less. For example, in studying such a variable as anxiety, we may be able to state that subject A is more anxious than subject B without knowing at all exactly how much more anxious A is. If data are inherently in ranks, or even if they can be categorized only as plus or minus (more or less, better or worse), they can be treated by nonparametric methods, whereas they cannot be treated by parametric methods unless precarious and, perhaps, unrealistic assumptions are made about the underlying distributions.
4. Nonparametric methods are available to treat data which are simply classificatory or categorical, i.e., are measured in a nominal scale. No parametric technique applies to such data.
5. There are suitable nonparametric statistical tests for treating samples made up of observations from several different populations. Parametric tests often cannot handle such data without requiring us to make seemingly unrealistic assumptions or requiring cumbersome computations.
6. Nonparametric statistical tests are typically much easier to learn and to apply than are parametric tests.
In addition, their interpretation often is more direct than the interpretation of parametric tests. Note that the objection concerning “cumbersome computations” in point number 5 has become less of an issue as computers and statistical software packages become more sophisticated, and more available.
3.10 Disadvantages of nonparametric tests
In closing, I must point out that nonparametric tests do have at least two major disadvantages in comparison to parametric tests. First, nonparametric tests are less powerful. Why? Because parametric tests use more of the information available in a set of numbers. Parametric tests make use of information consistent with interval scale measurement, whereas nonparametric tests typically make use of ordinal information only. As Siegel and Castellan (1988) put it, “nonparametric statistical tests are wasteful.”
Second, parametric tests are much more flexible, and allow you to test a greater range of hypotheses. For example, factorial ANOVA designs allow you to test for interactions between variables in a way that is not possible with nonparametric alternatives. There are nonparametric techniques to test for certain kinds of interactions under certain circumstances, but these are much more limited than the corresponding parametric techniques. Therefore, when the assumptions for a parametric test are met, it is generally (but not necessarily always) preferable to use the parametric test rather than a nonparametric test.
--------------------------------------------------------------------
Review Questions
1. Which test is more powerful, the sign test, or the Wilcoxon signed ranks test? Explain why.
2. Which test is more powerful, the Wilcoxon signed ranks test, or the t-test for correlated samples? Explain why.
For the scenarios described in questions 3-5, identify the nonparametric test that ought to be used.
3.
A single group of subjects is tested at 6 levels of an independent variable. You would like to do a repeated measures ANOVA, but cannot because you have violated the assumptions for that analysis. Your data are ordinal. 4. You have 5 independent groups of subjects, with different numbers per group. There is also substantial departure from homogeneity of variance. The null hypothesis states that there are no differences between the groups. 5. You have the same situation described in question 4; and in addition, the alternative hypothesis states that when the mean ranks for the 5 groups are listed from smallest to largest, they will appear in a particular pre-specified order. 6. Explain the rationale underlying the large-sample z-test version of the Mann-Whitney U-test. 7. Why should you not use the large-sample z-test version of a nonparametric test when you have samples small enough to allow use of the small-sample version? 8. Give two reasons why parametric tests are generally preferred to nonparametric tests. 9. Describe the circumstances under which you might use the Kruskal-Wallis test. Under what circumstances would you use the Jonckheere test instead? (HINT: Think about how the alternative hypotheses for these tests differ.) --------------------------------------------------------------------References Howell, DC. (1997). Psychology for statistics. Duxbury Press. Siegel, S., & Castellan, N.J. (1988). Nonparametric statistics for the behavioral sciences (2nd Ed.). New York, NY: McGraw-Hill. Mann-Whitney U test Menu location: Analysis_Non-parametric_Mann-Whitney. This is a method for the comparison of two independent random samples (x and y): The Mann Whitney U statistic is defined as: - where samples of size n1 and n2 are pooled and Ri are the ranks. U can be resolved as the number of times observations in one sample precede observations in the other sample in the ranking. Wilcoxon rank sum, Kendall's S and the Mann-Whitney U test are exactly equivalent tests. 
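The pair-counting interpretation of U just mentioned can be sketched in a few lines of Python (my own illustration, not StatsDirect code; the data are made-up toy numbers, and this form is equivalent to the rank-sum formula up to the convention U' = n1·n2 - U):

```python
def mann_whitney_u(x, y):
    # U counted as the number of (x, y) pairs in which the x observation
    # exceeds the y observation; ties contribute one half. This equals
    # the rank-sum form R1 - n1*(n1 + 1)/2 for the same sample.
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Toy data (hypothetical, for illustration only)
x = [1.1, 2.3, 3.7, 4.0]
y = [0.9, 2.0, 2.5]
print(mann_whitney_u(x, y))  # 9.0
```

For real analyses one would of course use a statistics package, which also supplies the exact or approximate P value; the sketch only shows what the statistic counts.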
In the presence of ties the Mann-Whitney test is also equivalent to a chi-square test for trend. In most circumstances a two sided test is required; here the alternative hypothesis is that x values tend to be distributed differently from y values. For a lower side test the alternative hypothesis is that x values tend to be smaller than y values. For an upper side test the alternative hypothesis is that x values tend to be larger than y values.

Assumptions of the Mann-Whitney test:
- random samples from populations
- independence within samples and mutual independence between samples
- measurement scale is at least ordinal

A confidence interval for the difference between two measures of location is provided with the sample medians. The assumptions of this method are slightly different from the assumptions of the Mann-Whitney test:
- random samples from populations
- independence within samples and mutual independence between samples
- the two population distribution functions are identical apart from a possible difference in location parameters

Technical Validation

StatsDirect uses the sampling distribution of U to give exact probabilities. These calculations may take an appreciable time to complete when many data are tied. Confidence intervals are constructed for the difference between the means or medians (any measure of location in fact). The level of confidence used will be as close as is theoretically possible to the one you specify; StatsDirect approaches the selected confidence level from the conservative side. When samples are large (either sample > 80 or both samples > 30) a normal approximation is used for the hypothesis test and for the confidence interval. Note that StatsDirect uses more accurate P value calculations than some other statistical software; therefore, you may notice a difference in results (Conover, 1999; Dineen and Blakesley, 1973; Harding, 1983; Neumann, 1988).

Example

From Conover (1999, p. 218).
Test workbook (Nonparametric worksheet: Farm Boys, Town Boys).

The following data represent fitness scores from two groups of boys of the same age, those from homes in the town and those from farm homes.

Farm Boys: 14.8 7.3 5.6 6.3 9.0 4.2 10.6 12.5 12.9 16.1 11.4 2.7

Town Boys: 12.7 14.2 12.6 2.1 17.7 11.8 16.9 7.9 16.0 10.6 5.6 5.6 7.6 11.3 8.3 6.7 3.6 1.0 2.4 6.4 9.1 6.7 18.6 3.2 6.2 6.1 15.3 10.6 1.8 5.9 9.9 10.6 14.8 5.0 2.6 4.0

To analyse these data in StatsDirect you must first enter them in two separate workbook columns. Alternatively, open the test workbook using the file open function of the file menu. Then select Mann-Whitney from the Non-parametric section of the analysis menu. Select the columns marked "Farm Boys" and "Town Boys" when prompted for data.

For this example:

estimated median difference = 0.8
two sided P = 0.529
95.1% confidence interval for difference between population means or medians = -2.3 to 4.4

Here we have assumed that these groups are independent and that they represent at least hypothetical random samples of the sub-populations they represent. In this analysis, we are unable to reject the null hypothesis that the two groups do not tend to yield different fitness scores. This lack of statistical evidence of a difference is reflected in the confidence interval for the difference between population means, in that the interval spans zero. Note that the quoted 95.1% confidence interval is as close as you can get to 95% because of the very nature of the mathematics involved in non-parametric methods like this.

Kruskal-Wallis Test

As a reminder, the assumptions of the one-way ANOVA for independent samples are
1. that the scale on which the dependent variable is measured has the properties of an equal interval scale;
2. that the k samples are independently and randomly drawn from the source population(s);
3. that the source population(s) can be reasonably supposed to have a normal distribution; and
4. that the k samples have approximately equal variances.

We noted in the main body of Chapter 14 that we need not worry very much about the first, third, and fourth of these assumptions when the samples are all the same size, for in that case the analysis of variance is quite robust, by which we mean relatively unperturbed by the violation of its assumptions. But of course, the other side of the coin is that when the samples are not all the same size, we do need to worry. In this case, should one or more of assumptions 1, 3, and 4 fail to be met, an appropriate non-parametric alternative to the one-way independent-samples ANOVA can be found in the Kruskal-Wallis Test.

I will illustrate the Kruskal-Wallis test with an example based on rating-scale data, since this is by far the most common situation in which unequal sample sizes would call for the use of a non-parametric alternative. In this particular case the number of groups is k=3. I think it will be fairly obvious how the logic and procedure would be extended in cases where k is greater than 3.

To assess the effects of expectation on the perception of aesthetic quality, an investigator randomly sorts 24 amateur wine aficionados into three groups, A, B, and C, of 8 subjects each. Each subject is scheduled for an individual interview. Unfortunately, one of the subjects of group B and two of group C fail to show up for their interviews, so the investigator must make do with samples of unequal size: na=8, nb=7, and nc=6, for a total of N=21. The subjects who do show up for their interviews are each asked to rate the overall quality of each of three wines on a 10-point scale, with "1" standing at the bottom of the scale and "10" at the top.

Group
         A      B      C
        6.4    2.5    1.3
        6.8    3.7    4.1
        7.2    4.9    4.9
        8.3    5.4    5.2
        8.4    5.9    5.5
        9.1    8.1    8.2
        9.4    8.2
        9.7
mean    8.2    5.5    4.9

As it happens, the three wines are the same for all subjects.
The only difference is in the texture of the interview, which is designed to induce a relatively high expectation of quality in the members of group A; a relatively low expectation in the members of group C; and a merely neutral state, tending in neither the one direction nor the other, for the members of group B. At the end of the study, each subject's ratings are averaged across all three wines, and this average is then taken as the raw measure for that particular subject. The adjacent table shows these measures for each subject in each of the three groups.

Mechanics

The preliminaries of the Kruskal-Wallis test are much the same as those of the Mann-Whitney test described in Subchapter 11a. We begin by assembling the measures from all k samples into a single set of size N. These assembled measures are rank-ordered from lowest (rank 1) to highest (rank N), with tied ranks included where appropriate; and the resulting ranks are then returned to the sample, A, B, or C, to which they belong and substituted for the raw measures that gave rise to them. Thus, the raw measures that appear in the following table on the left are replaced by their respective ranks, as shown in the table on the right.

        Raw Measures             Ranked Measures
        A      B      C          A      B      C
       6.4    2.5    1.3        11      2      1
       6.8    3.7    4.1        12      3      4
       7.2    4.9    4.9        13     5.5    5.5
       8.3    5.4    5.2        17      8      7
       8.4    5.9    5.5        18     10      9
       9.1    8.1    8.2        19     14    15.5
       9.4    8.2               20    15.5
       9.7                      21

                            A      B      C    Combined
sum of ranks              131     58     42      231
average of ranks         16.4    8.3    7.0       11

With the Kruskal-Wallis test, however, we take account not only of the sums of the ranks within each group, but also of the averages.
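The rank-ordering step, including the mid-rank treatment of ties, is mechanical enough to sketch in Python (my own illustration; the data are the wine ratings above, and the script reproduces the rank sums 131, 58, and 42):

```python
def midranks(values):
    # Assign ranks 1..N to the pooled values, giving tied values the
    # average (mid-rank) of the positions they would jointly occupy.
    order = sorted(values)
    rank_of = {}
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and order[j + 1] == order[i]:
            j += 1
        rank_of[order[i]] = (i + 1 + j + 1) / 2.0  # average of positions i+1..j+1
        i = j + 1
    return [rank_of[v] for v in values]

# Averaged ratings from the three wine-expectation groups (source data)
A = [6.4, 6.8, 7.2, 8.3, 8.4, 9.1, 9.4, 9.7]
B = [2.5, 3.7, 4.9, 5.4, 5.9, 8.1, 8.2]
C = [1.3, 4.1, 4.9, 5.2, 5.5, 8.2]
ranks = midranks(A + B + C)
TA, TB, TC = sum(ranks[:8]), sum(ranks[8:15]), sum(ranks[15:])
print(TA, TB, TC)  # 131.0 58.0 42.0
```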
Thus the following items of symbolic notation:

TA = the sum of the na ranks in group A      MA = the mean of the na ranks in group A
TB = the sum of the nb ranks in group B      MB = the mean of the nb ranks in group B
TC = the sum of the nc ranks in group C      MC = the mean of the nc ranks in group C
Tall = the sum of the N ranks in all groups combined
Mall = the mean of the N ranks in all groups combined

Logic and Procedure

The Measure of Aggregate Group Differences

You will sometimes find the Kruskal-Wallis test described as an "analysis of variance by ranks." Although it is not really an analysis of variance at all, it does bear a certain resemblance to ANOVA up to a point. In both procedures, the first part of the task is to find a measure of the aggregate degree to which the group means differ. With ANOVA that measure is found in the quantity known as SSbg, which is the between-groups sum of squared deviates. The same is true with the Kruskal-Wallis test, except that here the group means are based on ranks rather than on the raw measures. As a reminder that we are now dealing with ranks, we will symbolize this new version of the between-groups sum of squared deviates as SSbg(R). The following table summarizes the mean ranks for the present example. Also included are the sums and the counts (na, nb, nc, and N) on which these means are based.

            A       B       C     All
counts      8       7       6      21
sums      131      58      42     231
means    16.4     8.3     7.0    11.0

In Chapters 13 and 14 you saw that the squared deviate for any particular group mean is equal to the squared difference between that group mean and the mean of the overall array of data, multiplied by the number of observations on which the group mean is based.
Thus, for each of our current three groups:

A: 8(16.4 - 11.0)² = 233.3
B: 7(8.3 - 11.0)²  =  51.0
C: 6(7.0 - 11.0)²  =  96.0
          SSbg(R)  = 380.3

On analogy with the formulaic structures for SSbg developed in Chapters 13 and 14, we can write the conceptual formula for SSbg(R) as

SSbg(R) = Σ[ng(Mg - Mall)²]

Here as well, the subscript "g" means "any particular group." The computational formula is

SSbg(R) = Σ(Tg²/ng) - (Tall)²/N

With k=3 samples, this latter structure would be equivalent to

SSbg(R) = (TA)²/na + (TB)²/nb + (TC)²/nc - (Tall)²/N

For k=4 it would be

SSbg(R) = (TA)²/na + (TB)²/nb + (TC)²/nc + (TD)²/nd - (Tall)²/N

And so forth for other values of k. Here, in any event, is how it would work out for the present example:

SSbg(R) = (131)²/8 + (58)²/7 + (42)²/6 - (231)²/21 = 378.7

The discrepancy between what we get now and what we got a moment ago (380.3) is due to rounding error in the earlier calculation. As usual, it is the computational formula that is the less susceptible to rounding error, hence the more reliable.

The Null-Hypothesis Value of SSbg(R)

The null hypothesis in this or any comparable situation involving several independent samples of ranked data is that the mean ranks of the k groups will not substantially differ. On this account, you might suppose that the null-hypothesis value of SSbg(R), the aggregate measure of group differences, would be simply zero. A moment's reflection, however, will show why this cannot be so. Consider the very simple case where there are 3 groups, A, B, and C, each containing 2 observations. By way of analogy, imagine you had six small cards representing the ranks "1," "2," "3," "4," "5," and "6." If you were to sort these cards into every possible combination of two ranks per group, you would find the total number of possible combinations to be

N!/(na! nb! nc!) = 6!/(2! 2! 2!) = 90
And the values of SSbg(R) produced by these 90 combinations would constitute the sampling distribution of SSbg(R) for this particular case. Of these 90 possible combinations, a few (6) would yield values of SSbg(R) equal to exactly zero. All the rest would produce values greater than zero. (It is mathematically impossible to have a sum of squared deviates less than zero.) Accordingly, the mean of this sampling distribution (the value that observed instances of SSbg(R) will tend to approximate if the null hypothesis is true) is not zero, but something greater than zero. In any particular case of this sort, the mean of the sampling distribution of SSbg(R) is given by the formula

(k - 1) × N(N + 1)/12

which for the simple case just examined works out as

(3 - 1) × 6(6 + 1)/12 = 7.0

For our main example, we therefore know that the observed value of SSbg(R) = 378.7 belongs to a sampling distribution whose mean is equal to

(3 - 1) × 21(21 + 1)/12 = 77.0

All that now remains is to figure out how to turn this fact into a rigorous assessment of probability.

The Kruskal-Wallis Statistic: H

In case you have been girding yourself for some heavy slogging of the sort encountered with the Mann-Whitney test, you can now relax, for the rest of the journey is quite an easy one. The Kruskal-Wallis procedure concludes by defining a ratio symbolized by the letter H, whose numerator is the observed value of SSbg(R) and whose denominator includes a portion of the above formula for the mean of the sampling distribution of SSbg(R). Note that most textbooks give a very different-looking formula for the calculation of H, a rather impenetrable structure to which we will return in a moment. This first version affords a much clearer sense of the underlying concepts:

H = SSbg(R) / [N(N + 1)/12]

And now for the denouement.
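As a quick check on the arithmetic so far, both SSbg(R) formulas and the resulting H can be computed in a few lines of Python (my own sketch, not part of the original text). Note that when the conceptual formula is given the exact rank means rather than the rounded values 16.4, 8.3, and 7.0, the two formulas agree exactly:

```python
# Rank sums, group sizes, and N from the wine-rating example.
T = {"A": 131.0, "B": 58.0, "C": 42.0}
n = {"A": 8, "B": 7, "C": 6}
N = sum(n.values())                 # 21
M_all = (N + 1) / 2.0               # mean of the ranks 1..N, here 11.0

# Conceptual formula, using exact (unrounded) group mean ranks Tg/ng.
ss_conceptual = sum(n[g] * (T[g] / n[g] - M_all) ** 2 for g in T)
# Computational formula: sum(Tg^2/ng) - Tall^2/N.
ss_computational = sum(T[g] ** 2 / n[g] for g in T) - sum(T.values()) ** 2 / N

H = ss_computational / (N * (N + 1) / 12.0)
print(round(ss_computational, 1), round(H, 2))  # 378.7 9.84
```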
When each of the k samples includes at least 5 observations (that is, when na, nb, nc, etc., are all equal to or greater than 5), the sampling distribution of H is a very close approximation of the chi-square distribution for df = k - 1. It is actually a fairly close approximation even when one or more of the samples includes as few as 3 observations. For our present example, we can therefore calculate the value of H as

H = SSbg(R) / [N(N + 1)/12] = 378.7 / [21(21 + 1)/12] = 9.84

And then, treating this result as though it were a value of chi-square, we can refer it to the sampling distribution of chi-square with df = 3 - 1 = 2.

[Figure: theoretical sampling distribution of chi-square for df = 2, borrowed from Chapter 8.]

In brief: by the Kruskal-Wallis test, the observed aggregate difference among the three samples is significant a bit beyond the .01 level.

An Alternative Formula for the Calculation of H

I noted a moment ago that textbook accounts of the Kruskal-Wallis test usually give a different version of the formula for H. If you are a beginning student calculating H by hand, I would recommend using the version given above, as it gives you a clearer idea of just what H is measuring. Once you get the hang of things, however, you might find this alternative computational formula a bit more convenient:

H = [12 / (N(N + 1))] × Σ(Tg²/ng) - 3(N + 1)

In any event, as you can see below, this version yields exactly the same result as the other:

H = [12 / (21(21 + 1))] × [(131)²/8 + (58)²/7 + (42)²/6] - 3(21 + 1) = 9.84

The VassarStats web site has a page that will perform all steps of the Kruskal-Wallis test, including the rank-ordering of the raw measures.

RUN TESTS

The analysis of runs within a sequence is applied in statistics in many ways (for examples see [7, Section II.5(b)], [1]). The term run may in general be explained as a succession of items of the same class. Many concepts to analyze runs in a series of data have been studied. The main concepts are based on (i) the analysis of the total number of runs of a given class (see [9, 10]) and (ii) examinations of the appearance of long runs (see [7, Chapter XIII], [8, 11]). For example, consider a sequence of bits. Binary runs are, for example, arbitrary repetitive patterns within the sequence. Let X = (X1, ..., Xn) be a sequence of pairwise different numbers, such as an output of a pseudorandom number generator. A sequence of bits is obtained if one considers the signs (+ or -) of the differences X(i+1) - X(i), i = 1, ..., n-1. For example X = (5,4,1,7,2,3,6) yields S = (0,0,1,0,1,1). This concept is used in [10, 11]. A subsequence of length l consisting only of 1s is called a "run up" of length l (this indicates an increasing subsequence of length l+1 within the sequence X).
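The sign-sequence construction just described is easy to reproduce; the following is my own illustration (not pLab code), checked against the example X = (5,4,1,7,2,3,6):

```python
def sign_sequence(x):
    # S[i] = 1 if x[i+1] > x[i], else 0, for a sequence of pairwise
    # distinct numbers, as in the run-test construction above.
    return [1 if b > a else 0 for a, b in zip(x, x[1:])]

def run_lengths(s):
    # Lengths of maximal constant blocks: [bit, length] pairs, where
    # blocks of 1s correspond to increasing stretches of x.
    runs = []
    for bit in s:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return runs

X = [5, 4, 1, 7, 2, 3, 6]
print(sign_sequence(X))               # [0, 0, 1, 0, 1, 1]
print(run_lengths(sign_sequence(X)))  # [[0, 2], [1, 1], [0, 1], [1, 2]]
```

A run test such as Knuth's then compares the observed counts of runs of each length against their expected frequencies.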
The opposite case, a subsequence consisting only of 0s, is called a "run down". In the latter papers, several statistics based on this run definition are treated for both areas (i) and (ii). These considerations form the basis of a run test proposed by Knuth [9] in order to test pseudorandom numbers. Knuth's run test is one of the most common tests used for examining PRNs. An asymptotically chi-squared distributed test statistic based on the number of runs of length l = 1, 2, 3, 4, 5 and longer is given in Knuth [9, 6]. Consider a fixed sample size n. The run test implemented in pLab calculates Knuth's test statistic m times. To these "asymptotically chi-squared" values a Kolmogorov-Smirnov test is applied (for details see [6]). The graphics below show the typical behavior of this test when "good" and "flawed" linear congruential generators are used. The lines in the graphics indicate the rejection area of the Kolmogorov-Smirnov statistic with levels of significance 0.05 and 0.01. The first graphic below shows the results obtained from Super-Duper for the sample sizes considered and m = 100. The subsequent graphics are obtained from the subsequences with step sizes 99, 565 and 739 generated from Super-Duper. Note that these subsequences are produced by similar linear congruential generators with multipliers 1031357269, 3280877789 and 3028669781. Further results for linear and inversive congruential generators are given in [2, 6]. For subsequence behavior of well known linear congruential generators see [3, 4, 5].

References

1. A.J. Duncan. Quality Control and Industrial Statistics. Richard D. Irwin, Inc., 4th edition, 1974.
2. K. Entacher. Selected random number generators in run tests. Preprint, Mathematics Institute, University of Salzburg.

UNIT V
ANALYSIS OF VARIANCE

An important technique for analyzing the effect of categorical factors on a response is to perform an Analysis of Variance. An ANOVA decomposes the variability in the response variable amongst the different factors.
Depending upon the type of analysis, it may be important to determine: (a) which factors have a significant effect on the response, and/or (b) how much of the variability in the response variable is attributable to each factor.

STATGRAPHICS Centurion provides several procedures for performing an analysis of variance:

1. One-Way ANOVA - used when there is only a single categorical factor. This is equivalent to comparing multiple groups of data.
2. Multifactor ANOVA - used when there is more than one categorical factor, arranged in a crossed pattern. When factors are crossed, the levels of one factor appear at more than one level of the other factors.
3. Variance Components Analysis - used when there are multiple factors, arranged in a hierarchical manner. In such a design, each factor is nested in the factor above it.
4. General Linear Models - used whenever there are both crossed and nested factors, when some factors are fixed and some are random, and when both categorical and quantitative factors are present.

One-Way ANOVA

A one-way analysis of variance is used when the data are divided into groups according to only one factor. The questions of interest are usually: (a) Is there a significant difference between the groups?, and (b) If so, which groups are significantly different from which others? Statistical tests are provided to compare group means, group medians, and group standard deviations. When comparing means, multiple range tests are used, the most popular of which is Tukey's HSD procedure. For equal size samples, significant group differences can be determined by examining the means plot and identifying those intervals that do not overlap.

Multifactor ANOVA

When more than one factor is present and the factors are crossed, a multifactor ANOVA is appropriate. Both main effects and interactions between the factors may be estimated.
The output includes an ANOVA table and a new graphical ANOVA from the latest edition of Statistics for Experimenters by Box, Hunter and Hunter (Wiley, 2005). In a graphical ANOVA, the points are scaled so that any levels that differ by more than the scatter exhibited in the distribution of the residuals are significantly different.

Variance Components Analysis

A Variance Components Analysis is most commonly used to determine the level at which variability is being introduced into a product. A typical experiment might select several batches, several samples from each batch, and then run replicate tests on each sample. The goal is to determine the relative percentages of the overall process variability that is being introduced at each level.

General Linear Model

The General Linear Models procedure is used whenever the above procedures are not appropriate. It can be used for models with both crossed and nested factors, models in which one or more of the variables is random rather than fixed, and when quantitative factors are to be combined with categorical ones. Designs that can be analyzed with the GLM procedure include partially nested designs, repeated measures experiments, split plots, and many others. For example, pages 536-540 of the book Design and Analysis of Experiments (sixth edition) by Douglas Montgomery (Wiley, 2005) contain an example of an experimental design with both crossed and nested factors. For that data, the GLM procedure produces several important tables, including estimates of the variance components for the random factors.

Analysis of Variance for Assembly Time

Source           Sum of Squares   Df   Mean Square   F-Ratio   P-Value
Model                 243.7       23      10.59        4.54     0.0002
Residual               56.0       24       2.333
Total (Corr.)         299.7       47

Type III Sums of Squares

Source                      Sum of Squares   Df   Mean Square   F-Ratio   P-Value
Layout                           4.083        1      4.083        0.34     0.5807
Operator(Layout)                71.92         6     11.99         2.18     0.1174
Fixture                         82.79         2     41.4          7.55     0.0076
Layout*Fixture                  19.04         2      9.521        1.74     0.2178
Fixture*Operator(Layout)        65.83        12      5.486        2.35     0.0360
Residual                        56.0         24      2.333
Total (corrected)              299.7         47

Expected Mean Squares

Source                      EMS
Layout                      (6)+2.0(5)+6.0(2)+Q1
Operator(Layout)            (6)+2.0(5)+6.0(2)
Fixture                     (6)+2.0(5)+Q2
Layout*Fixture              (6)+2.0(5)+Q3
Fixture*Operator(Layout)    (6)+2.0(5)
Residual                    (6)

Variance Components

Source                      Estimate
Operator(Layout)            1.083
Fixture*Operator(Layout)    1.576
Residual                    2.333

Principles of experimental design, following Ronald A. Fisher

A methodology for designing experiments was proposed by Ronald A. Fisher in his innovative book The Design of Experiments (1935). As an example, he described how to test the hypothesis that a certain lady could distinguish by flavour alone whether the milk or the tea was first placed in the cup. While this sounds like a frivolous application, it allowed him to illustrate the most important ideas of experimental design:

Comparison

In many fields of study it is hard to reproduce measured results exactly. Comparisons between treatments are much more reproducible and are usually preferable. Often one compares against a standard, scientific control, or traditional treatment that acts as a baseline.

Randomization

There is an extensive body of mathematical theory that explores the consequences of making the allocation of units to treatments by means of some random mechanism such as tables of random numbers, or the use of randomization devices such as playing cards or dice.
Provided the sample size is adequate, the risks associated with random allocation (such as failing to obtain a representative sample in a survey, or having a serious imbalance in a key characteristic between a treatment group and a control group) are calculable and hence can be managed down to an acceptable level. Random does not mean haphazard, and great care must be taken that appropriate random methods are used.

Replication

Measurements are usually subject to variation and uncertainty. Measurements are repeated and full experiments are replicated to help identify the sources of variation and to better estimate the true effects of treatments.

Blocking

Blocking is the arrangement of experimental units into groups (blocks) that are similar to one another. Blocking reduces known but irrelevant sources of variation between units and thus allows greater precision in the estimation of the source of variation under study.

Orthogonality

[Figure: example of an orthogonal factorial design.]

Orthogonality concerns the forms of comparison (contrasts) that can be legitimately and efficiently carried out. Contrasts can be represented by vectors, and sets of orthogonal contrasts are uncorrelated and independently distributed if the data are normal. Because of this independence, each orthogonal treatment provides different information from the others. If there are T treatments and T - 1 orthogonal contrasts, all the information that can be captured from the experiment is obtainable from the set of contrasts.

Factorial experiments

Factorial experiments are used instead of the one-factor-at-a-time method. They are efficient at evaluating the effects and possible interactions of several factors (independent variables).

Analysis of the design of experiments was built on the foundation of the analysis of variance, a collection of models in which the observed variance is partitioned into components due to different factors which are estimated and/or tested.
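The orthogonality idea can be made concrete with contrast vectors for T = 3 treatments; the linear and quadratic contrasts below are standard textbook choices, not taken from the source:

```python
# Linear and quadratic contrasts on three treatment means. Each sums
# to zero (making it a valid contrast); their zero dot product makes
# them orthogonal, so with normal data their estimates are uncorrelated.
linear = [-1, 0, 1]
quadratic = [1, -2, 1]

assert sum(linear) == 0 and sum(quadratic) == 0   # valid contrasts
dot = sum(a * b for a, b in zip(linear, quadratic))
print(dot)  # 0: the T - 1 = 2 contrasts are orthogonal
```

With T = 3 treatments, these T - 1 = 2 orthogonal contrasts capture all the between-treatment information, as the text notes.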
Example

This example is attributed to Harold Hotelling.[9] It conveys some of the flavor of those aspects of the subject that involve combinatorial designs. The weights of eight objects are to be measured using a pan balance and a set of standard weights. Each weighing measures the weight difference between objects placed in the left pan and any objects placed in the right pan, by adding calibrated weights to the lighter pan until the balance is in equilibrium. Each measurement has a random error: the average error is zero; the standard deviation of the probability distribution of the errors is the same number σ on different weighings; and errors on different weighings are independent. Denote the true weights by θ1, ..., θ8. We consider two different experiments:

1. Weigh each object in one pan, with the other pan empty. Let Xi be the measured weight of the ith object, for i = 1, ..., 8.
2. Do the eight weighings according to the following schedule and let Yi be the measured difference for i = 1, ..., 8.

The estimated value of the weight θ1 is then obtained as a signed average of the eight measured differences Y1, ..., Y8, and similar estimates can be found for the weights of the other items. The question of design of experiments is: which experiment is better? The variance of the estimate X1 of θ1 is σ² if we use the first experiment, but if we use the second experiment, the variance of the estimate given above is σ²/8. Thus the second experiment gives us 8 times as much precision for the estimate of a single item, and estimates all items simultaneously, with the same precision. What is achieved with 8 weighings in the second experiment would require 64 weighings if items are weighed separately. However, note that the estimates for the items obtained in the second experiment have errors which are correlated with each other. Many problems of the design of experiments involve combinatorial designs, as in this example.

Statistical control

It is best for a process to be in reasonable statistical control prior to conducting designed experiments.
When this is not possible, proper blocking, replication, and randomization allow for the careful conduct of designed experiments.[12]

Experimental designs after Fisher

Some efficient designs for estimating several main effects simultaneously were found by Raj Chandra Bose and K. Kishen in 1940 at the Indian Statistical Institute, but remained little known until the Plackett-Burman designs were published in Biometrika in 1946. About the same time, C. R. Rao introduced the concept of orthogonal arrays as experimental designs. This concept played a central role in the development of Taguchi methods by Genichi Taguchi, which took place during his visit to the Indian Statistical Institute in the early 1950s. His methods were successfully applied and adopted by Japanese and Indian industries and subsequently were also embraced by US industry, albeit with some reservations.

In 1950, Gertrude Mary Cox and William Gemmell Cochran published the book Experimental Designs, which became the major reference work on the design of experiments for statisticians for years afterwards. Developments of the theory of linear models have encompassed and surpassed the cases that concerned early writers. Today, the theory rests on advanced topics in linear algebra, algebra and combinatorics. As with other branches of statistics, experimental design is pursued using both frequentist and Bayesian approaches: in evaluating statistical procedures like experimental designs, frequentist statistics studies the sampling distribution while Bayesian statistics updates a probability distribution on the parameter space. Some important contributors to the field of experimental designs are C. S. Peirce, R. A. Fisher, F. Yates, C. R. Rao, R. C. Bose, J. N. Srivastava, S. S. Shrikhande, D. Raghavarao, W. G. Cochran, O. Kempthorne, W. T. Federer, A. S. Hedayat, J. A. Nelder, R. A. Bailey, J. Kiefer, W. J. Studden, F. Pukelsheim, D. R. Cox, H. P. Wynn, A. C. Atkinson, G. E. P. Box and G. Taguchi.
The textbooks of D. Montgomery and R. Myers have reached generations of students and practitioners.

Completely Randomized Design

Suppose we have 4 different diets which we want to compare. The diets are labeled Diet A, Diet B, Diet C, and Diet D. We are interested in how the diets affect the coagulation rates of rabbits. The coagulation rate is the time in seconds that it takes for a cut to stop bleeding. We have 16 rabbits available for the experiment, so we will use 4 on each diet. How should we use randomization to assign the rabbits to the four treatment groups? The 16 rabbits arrive and are placed in a large compound until you are ready to begin the experiment, at which time they will be transferred to cages.

Possible Assignment Plans

Method 1: We assume that rabbits will be caught "at random". Catch four rabbits and assign them to Diet A. Catch the next four rabbits and assign them to Diet B. Continue with Diets C and D. Since the rabbits were "caught at random", this would produce a completely randomized design. Analyze the results as a completely randomized design.

Method 1 is faulty. The first rabbits caught could be the slowest and weakest rabbits, those least able to escape capture. This would bias the results. If the experimental results came out to the disadvantage of Diet A, there would be no way to determine whether the results were a consequence of Diet A or of the fact that the weakest rabbits were placed on that diet by our "randomization process".

Method 2: Catch all the rabbits and label them 1-16. Select four numbers from 1-16 at random (without replacement) and put those rabbits in a cage to receive Diet A. Then select another four numbers at random and put those rabbits in a cage to receive Diet B. Continue until you have four cages with four rabbits each. Each cage receives a different diet, and the experiment is analyzed as a completely randomized experiment.

Method 2 is a completely randomized design, but it has a serious flaw. The experiment lacks replication.
There are 16 rabbits, but the rabbits in each cage are not independent. If one rabbit eats a lot, the others in that cage have less to eat. The experimental unit is the smallest unit of experimental matter to which the treatment is applied at random. In this case, the cages are the experimental units. For a completely randomized design, each rabbit must live in its own cage.

Method 3: Have a bowl with the letters A, B, C, and D printed on separate slips of paper. Catch the first rabbit, pick a slip at random from the bowl, and assign the rabbit to the diet letter on the slip. Do not replace the slip. Catch the second rabbit and select another slip from the remaining three slips. Assign that diet to the second rabbit. Continue until the first four rabbits have each been assigned one of the four diets. In this way, the first four (slowest) rabbits all receive different diets. Replace the slips and repeat the procedure until all 16 rabbits are assigned to a diet. Analyze the results as a completely randomized design.

Method 3 is not a completely randomized design. Since you have selected the rabbits in blocks of 4, one assigned to each of the diets A-D, the analysis should be for a randomized block design. The treatment is Diet, but you have blocked on "catchability".

Method 4: Catch all the rabbits and label them 1-16. Put 16 slips of paper in a bowl, four each with the letters A, B, C, and D. Put another 16 slips of paper numbered 1-16 in a second bowl. Pick a slip from each bowl. The rabbit with the selected number is given the selected diet. To make it easy to remember which rabbit gets which diet, the cages are arranged in rows by diet.

Method 4 has some deficiencies. The assignment of rabbits to treatments is completely randomized. However, arranging the cages by diet for convenience creates a bias in the results. The heat in the room rises, so the rabbits receiving Diet A will be living in a very different environment than those receiving Diet D.
Any observed difference cannot be attributed to diet, but could just as easily be a result of cage placement. Cage placement is not part of the treatment, but must be taken into account. In a completely randomized design, every rabbit must have the same chance of receiving any diet at any location in the matrix of cages.

A Completely Randomized Design

Label the cages 1-16. In one bowl put 16 slips of paper, each with one of the integers 1-16 written on it. In a second bowl put 16 slips of paper, four each labeled A, B, C, and D. Catch a rabbit. Select a number and a letter, one from each bowl. Place the rabbit in the cage indicated by the number and feed it the diet indicated by the letter. Repeat without replacement until all rabbits have been assigned a diet and a cage. If, for example, the first number selected was 7 and the first letter B, then the first rabbit would be placed in location 7 and fed Diet B.

In one completed cage selection of this kind, the rabbits with Diet A landed primarily on the bottom and those with Diet D on the top. Notice that the completely randomized design does not account for the difference in heights of the cages; it is, just as the name suggests, a completely random assignment. A completely randomized design assumes that these locations will not produce a systematic difference in response (coagulation time). If we do believe the location is an important part of the process, we should use a randomized block design. For this example, we will continue to use a completely randomized design.

One-Way ANOVA

To analyze the results of the experiment, we use a one-way analysis of variance. The measured coagulation times for each diet are given below:

        Diet A   Diet B   Diet C   Diet D
        62       63       68       56
        60       67       66       62
        63       71       71       60
        59       64       67       61
Mean    61       66.25    68       59.75

The null hypothesis is H0: μA = μB = μC = μD (all treatment means are the same) and the alternative is Ha: at least one mean is different.
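As a quick cross-check of the hypothesis test above, the same F-test can be run in a few lines, assuming SciPy is available (scipy.stats.f_oneway is a standard one-way ANOVA routine):

```python
# One-way ANOVA F-test on the coagulation data above (assumes SciPy).
from scipy.stats import f_oneway

diet_a = [62, 60, 63, 59]
diet_b = [63, 67, 71, 64]
diet_c = [68, 66, 71, 67]
diet_d = [56, 62, 60, 61]

f_stat, p_value = f_oneway(diet_a, diet_b, diet_c, diet_d)
print(round(f_stat, 4), round(p_value, 4))  # F = 9.1737, p ≈ 0.002
```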
The ANOVA table is given below:

Response: Coagulation Time
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio   Prob>F
Model     3        191.50000       63.8333    9.1737   0.0020
Error    12         83.50000        6.9583
Total    15        275.00000

From the computer output, we see that there is a statistically significant difference in coagulation time (p = 0.0020). Just what is being measured by these sums of squares and mean squares? In this section we will consider the theory of ANOVA.

The Theory of ANOVA

There is a lot of technical notation in analysis of variance. The notation we will use is consistent with the notation of Box, Hunter, and Hunter's classic text, Statistics for Experimenters.

Some Notation

k = the number of treatments. In our example, there are k = 4 treatment classes: Diet A, Diet B, Diet C, and Diet D.

nt = the number of observations for treatment t. Each treatment in this experiment has four observations: n1 = n2 = n3 = n4 = 4.

Yti = the ith observation in the tth treatment class. In our example, Y11 = 62, Y13 = 63, Y31 = 68, and Y44 = 61.

N = the total number of observations, N = Σt nt. In this case, N = 16.

Yt. = the sum of the observations in the tth treatment class. In our example, Y1. = 62 + 60 + 63 + 59 = 244, Y2. = 265, Y3. = 272, and Y4. = 239.

ȳt. = the mean of the observations in the tth treatment class. Here, ȳ1. = 61, ȳ2. = 66.25, ȳ3. = 68, and ȳ4. = 59.75.

Y.. = the total of all observations (the overall total). In our example, Y.. = 62 + 60 + 63 + ... + 60 + 61 = 1020.

ȳ.. = the overall mean, ȳ.. = Y../N. Here, ȳ.. = 1020/16 = 63.75.

The ANOVA Model

Yti = μ + τt + eti,

where μ is the overall mean, τt = μt − μ is the effect of treatment t, and eti is random error.

Parameter      | Estimate              | Values in the example
μ              | μ̂ = ȳ.. = Y../N       | ȳ.. = 1020/16 = 63.75
μt = μ + τt    | μ̂t = ȳt. = Yt./nt     | ȳ1. = 61, ȳ2. = 66.25, ȳ3. = 68, ȳ4. = 59.75
τt             | τ̂t = ȳt. − ȳ..        | τ̂1 = −2.75, τ̂2 = 2.5, τ̂3 = 4.25, τ̂4 = −4.0
eti            | êti = Yti − ȳt.       | ê11 = 1, ê12 = −1, ê13 = 2, ê14 = −2; ê21 = −3.25, ê22 = 0.75, ê23 = 4.75, ê24 = −2.25; ê31 = 0, ê32 = −2, ê33 = 3, ê34 = −1; ê41 = −3.75, ê42 = 2.25, ê43 = 0.25, ê44 = 1.25

ANOVA as a Comparison of Estimates of Variance

Analysis of variance gets its name because it compares two different estimates of the variance. If the null hypothesis is true, and there is no treatment effect, then the two estimates of variance should be comparable; that is, their ratio should be about one. The farther the ratio of variances is from one, the more doubt is placed on the null hypothesis. If the null hypothesis is true and all samples can be considered to come from one population, we can estimate the variance in three different ways. All assume that the observations are distributed about a common mean μ with variance σ².

The first estimate considers the observations as a single set of data. Here we compute the variance using the standard formula. The sum of squared deviations from the overall mean is

SS(total) = Σt Σi (Yti − ȳ..)².

If we divide this quantity by N − 1, we have an estimate of variance over all units, ignoring treatments. This is just the sample variance of the combined observations. In our example, SS(total) = 275 and s² = 275/15 = 18.333.

The second method of estimating the variance is to infer the value of σ² from the observed variance of the sample means. We calculate this by considering the means of the four treatments. If the null hypothesis is true, these means have a variance of σ²/4, since they are the means of samples of size 4 drawn at random from a population with variance σ². In general, the treatment means have variance σ²/n. Consequently, their sum of squared deviations from the overall mean, Σt (ȳt. − ȳ..)², divided by the degrees of freedom, k − 1, is an estimate of σ²/n, and n times this quantity is an estimate of σ² if the null hypothesis is true. So the mean square for treatment,

MS(trt) = n Σt (ȳt. − ȳ..)² / (k − 1),

estimates σ² when H0 is true.
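These first two variance estimates can be checked directly against the data; a minimal pure-Python sketch (standard library only):

```python
# The first two variance estimates, computed from the coagulation data.
data = {
    "Diet A": [62, 60, 63, 59],
    "Diet B": [63, 67, 71, 64],
    "Diet C": [68, 66, 71, 67],
    "Diet D": [56, 62, 60, 61],
}
all_obs = [y for grp in data.values() for y in grp]
N, k, n = len(all_obs), len(data), 4
grand_mean = sum(all_obs) / N                            # ybar.. = 63.75

# Estimate 1: ignore treatments; sample variance of all N observations.
ss_total = sum((y - grand_mean) ** 2 for y in all_obs)   # SS(total) = 275
est1 = ss_total / (N - 1)                                # 275/15 = 18.333

# Estimate 2: n times the variance of the k treatment means.
# This estimates sigma^2 only if the null hypothesis is true.
means = [sum(grp) / n for grp in data.values()]
est2 = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
print(round(est1, 3), round(est2, 3))                    # 18.333 63.833
```

The large gap between 18.333 and 63.833 already hints that the treatment means vary more than chance alone would explain.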
The numerator sum of squared deviations due to treatments (which also reflects experimental-unit differences) is computed using

SS(trt) = Σt nt (ȳt. − ȳ..)².

If all the nt are the same value n, then SS(trt) = n Σt (ȳt. − ȳ..)². In our example, we have

SS(trt) = 4(61 − 63.75)² + 4(66.25 − 63.75)² + 4(68 − 63.75)² + 4(59.75 − 63.75)² = 191.5.

The mean square for treatment is this sum of squares divided by its degrees of freedom. In our example, MS(trt) = 191.5/3 = 63.833, so 63.833 is another estimate of the population variance under the null hypothesis. This is known as the estimated variance between treatments, since it was computed using the differences in treatment means.

The sum of squared deviations about the treatment means is

SS(error) = Σt Σi (Yti − ȳt.)² = SS(total) − SS(trt).

In our example, this is

SS(error) = (62 − 61)² + (60 − 61)² + ... + (60 − 59.75)² + (61 − 59.75)² = 83.5.

If we divide this sum of squares by its degrees of freedom, N − k, we have the pooled variance for the four groups of observations: 83.5/(16 − 4) = 6.95833. The variance for Diet A is 3.3333, the variance for Diet B is 12.9167, the variance for Diet C is 4.6667, and the variance for Diet D is 6.9167. Since each of these is based on four observations,

sp² = [3(3.3333) + 3(12.9167) + 3(4.6667) + 3(6.9167)] / 12 = 6.95833.

This is our third estimate of variance and is an estimate of the variance within treatments, since the pooled variance takes the treatment groups into account. Random variation can be characterized by this pooled variance, measured by

MS(residual) = SS(residual) / (N − k).

The standard deviation of a treatment mean, sd(ȳt.) = sqrt(σ²/nt), is estimated by se(ȳt.) = sqrt(MS(residual)/nt). (The estimated standard deviation is called the standard error.)

The F-Statistic

It can be shown that, in general, whether or not the null hypothesis is true, MS(error) estimates σ² and MS(trt) estimates σ² + Σt nt τt²/(k − 1) (see Appendix A). If nt = n for all t, then MS(trt) estimates σ² + n Σt τt²/(k − 1).
If the null hypothesis is true, then τt = 0 for all t, so MS(trt) estimates σ² + 0 = σ². The F-score is the ratio of the mean square for treatment to the mean square residual; if the treatment effects τt are all zero, this ratio should be about one:

F = MS(trt) / MS(error), which estimates [σ² + n Σt τt²/(k − 1)] / σ².

If H0 is true, Fcalc has an F-distribution with k − 1 and N − k degrees of freedom. The larger the value of the F-score, the greater the estimated treatment effect. A large F-score corresponds to a small p-value, which casts doubt on the validity of the null hypothesis of equal means. The null hypothesis of equal means is equivalent to a null hypothesis of all treatment effects being zero:

H0: all μt are equal  ⇔  H0: all τt = 0.

In our example of the rabbit diets, F(3,12) = 9.1737. This is quite large. An F-score this large would happen by chance only 2 times out of 1,000 when the null hypothesis is true. This is strong evidence against the null hypothesis that all τt = 0. Thus we reject the null hypothesis in favor of the alternative that at least one of the treatment means differs from another.

The ANOVA Table and Partitioning of Variance

The ANOVA table consolidates most of these computations, giving the essential sums of squares and degrees of freedom for our estimates of variance. The standard table is shown below; this is the form of the computer output seen earlier.

Source      df      SS          MS          F                   Prob>F
Total       N − 1   SS(total)
Treatment   k − 1   SS(trt)     MS(trt)     MS(trt)/MS(error)   *
Error       N − k   SS(error)   MS(error)

In our example, we have

Source      df   SS      MS       F        Prob>F
Total       15   275
Treatment    3   191.5   63.833   9.1737   0.002
Residual    12   83.5    6.9583

Notice that Total SS = Treatment SS + Residual SS. The total sum of squares has been partitioned into two parts, the Treatment Sum of Squares and the Residual, or Error, Sum of Squares. A proof that this will always be the case is given in Appendix B.
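The partition Total SS = Treatment SS + Error SS, and the resulting F-ratio, can be reproduced by hand; a short pure-Python sketch using the same data:

```python
# By-hand one-way ANOVA for the rabbit data.
data = {
    "Diet A": [62, 60, 63, 59],
    "Diet B": [63, 67, 71, 64],
    "Diet C": [68, 66, 71, 67],
    "Diet D": [56, 62, 60, 61],
}
all_obs = [y for grp in data.values() for y in grp]
N, k = len(all_obs), len(data)
grand_mean = sum(all_obs) / N                          # 63.75

ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
ss_trt = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
             for g in data.values())
ss_error = ss_total - ss_trt                           # the partition

ms_trt = ss_trt / (k - 1)        # 191.5 / 3  = 63.833
ms_error = ss_error / (N - k)    # 83.5 / 12  = 6.9583
f_ratio = ms_trt / ms_error
print(ss_total, ss_trt, ss_error, round(f_ratio, 4))
# 275.0 191.5 83.5 9.1737
```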
The Treatment Sum of Squares is a measure of the variation among the treatment groups, which includes the variation of the rabbits. The Residual Sum of Squares is a measure of the variation among the rabbits within each treatment group. Some texts describe MS(Treatment) as "explained" variance and MS(Residual) as "unexplained" variance. The variance estimated by MS(Treatment) is explained by the fact that the observations may come from different populations, while the MS(Residual) cannot be explained by variation in population parameters and is therefore considered random, or chance, variation (see Wonnacott and Wonnacott). In this terminology, the F-statistic is the ratio of explained variance to unexplained variance:

F = explained variance / unexplained variance.

We can make this partition even finer by including the individual treatments in our table.

Source                     df   SS      MS        F        Prob>F
Total                      15   275
Treatment (among Diets)     3   191.5   63.833    9.1737   0.002
Residual (among Rabbits)   12   83.5    6.9583
  Within Diet A             3   10      3.3333
  Within Diet B             3   38.75   12.9167
  Within Diet C             3   14      4.6667
  Within Diet D             3   20.75   6.9167

In this table, notice that SS(Residual) is the sum of the Within Diet sums of squares: 83.5 = 10 + 38.75 + 14 + 20.75. Also, MS(Residual) is the pooled variance based on the mean squares within diets: [3(3.3333) + 3(12.9167) + 3(4.6667) + 3(6.9167)] / 12 = 6.958.

What Affects Power?

Recall that the power of a statistical test is the probability of rejecting the null hypothesis. Also recall that for the one-way analysis of variance,

F estimates [σ² + Σt nt τt²/(k − 1)] / σ².

The larger the value of F, the greater the probability of rejecting the null hypothesis. Consequently,

- if σ² decreases, the power increases;
- if n increases, the power increases;
- if Σt τt² (the size of the treatment effects) increases, the power increases.

This leads to the following design strategy priorities to increase power:

1. Reduce σ² (e.g., by blocking).
2. Increase nt.
3.
Settle for reduced power.

Assumptions

Like all hypothesis tests, the one-way ANOVA has several criteria that must be satisfied (at least approximately) for the test to be valid. These criteria are usually described as assumptions that must be satisfied, since one often cannot verify them directly. These assumptions are listed below:

1. The population distribution of the response variable Y must be normal within each class.
2. The observed values must be independent, within and among groups.
3. The population variances of the Y values must be equal for all k classes (σ1² = σ2² = ... = σk²).

How important are the assumptions?

1. Normality is not critical. Problems tend to arise if the distributions are highly skewed and the design is unbalanced. The problems are aggravated if the sample sizes are small.
2. The assumption of independence is critical.
3. The assumption of equal variance is important. However, the design of the experiment with random assignment helps balance the variance; this is a greater problem in observational studies.
4. The methods are sensitive to outliers. If there are outliers, we can use a transformation; exclude the outlier and limit the domain of inference; perform the analysis with and without the outlier and report all findings; or use non-parametric techniques. Non-parametric techniques suffer from a lack of power.

Multiple Comparisons

Why don't we just compare treatments by repeatedly performing t-tests? Let's think about this in terms of confidence intervals. A test of the hypothesis that two treatment means are equal at the 5% significance level is rejected if and only if a 95% confidence interval on the difference in the two means does not cover 0. If we have k treatments, there are r = C(k, 2) = k(k − 1)/2 possible confidence intervals (or comparisons) between treatment means.
Although each confidence interval would have a 0.95 probability of covering the true difference in treatment means, the frequency with which all of the r intervals would simultaneously capture their true parameters is smaller than 95%. In fact, it can be no larger than 95% and no smaller than 100(1 − 0.05r)%. One consequence of this is that as the number of treatments increases, we become increasingly likely to declare at least two treatment means different even if no differences exist!

To avoid this, several approaches have been suggested. One is to set 100(1 − 0.05/r)% confidence intervals on the difference in two treatment means for each of the r comparisons. Then the probability that all r confidence intervals capture their parameters is at least 95%. This is a conservative approach. Another approach is to use the F-test in the analysis of variance as a guide: comparisons are made between treatment means only if the F-test is significant. This is the approach most widely used across disciplines. A third approach is to use the method of Least Significant Difference. Compared to other methods, the LSD procedure is more likely to call a difference significant, and is therefore prone to Type I errors, but it is easy to use and is based on principles that students already understand.

The LSD Procedure

We know that if two random samples of size n are selected from a normal distribution with variance σ², then the variance of the difference of the two sample means is

σD² = σ²/n + σ²/n = 2σ²/n.

In the case of ANOVA, we do not know σ², but we estimate it with s² = MSE. So when two random samples of size n are taken from a population whose variance is estimated by MSE, the standard error of the difference between the two means is sqrt(2·MSE/n). Two means will be considered significantly different at the 0.05 significance level if they differ by more than t·sqrt(2·MSE/n), where t is the t-value for a 95% confidence interval with the degrees of freedom associated with MSE.
The value

LSD = t·sqrt(2·MSE/n)

is called the Least Significant Difference. If the two samples do not contain the same number of entries, n1 and n2, the corresponding quantity is t·sqrt(MSE(1/n1 + 1/n2)).

Randomized Complete Block Design

In the previous section, we analyzed the results from a completely randomized design with a one-way analysis of variance. This design ignored the physical layout of the cages and the potential effect of the height of the cage in which a rabbit was housed. If we want to acknowledge the potential effect of the height of the cage on the coagulation time, we should organize the experiment using a randomized complete block design. One diet of each type will be used on each of the 4 shelves.

The randomization procedure assigns a number 1-16 to each of the rabbits. Put four slips of paper marked 1, 2, 3, and 4 in a bowl. Select a number at random from 1-16 to choose a rabbit for Diet A, and pull a number out of the bowl to select its position in the top row. Repeat three times without replacement for Diets B, C, and D to complete the assignment of the top row. Follow the same procedure to assign the other three rows. This is a randomized block design and should not be analyzed with the one-way ANOVA.

For the sake of illustration, the data collected with this method are the same as with the completely randomized design. Ordering the data so that it is easier to read, we have the following observations.

                Diet A   Diet B   Diet C   Diet D   Mean for Shelf
Shelf 1         62       63       68       56       62.25
Shelf 2         60       67       66       62       63.75
Shelf 3         63       71       71       60       66.25
Shelf 4         59       64       67       61       62.75
Mean for Diet   61       66.25    68       59.75    Grand Mean 63.75

Two-way ANOVA

The model that includes the blocking variable is

Yti = μ + τt + βi + eti,

where βi is the effect of block (shelf) i. By blocking on the shelf position, we hope to increase the power of the test by removing the variability associated with shelf height. This would allow us to detect smaller differences between treatments. Our estimates are

μ̂ = ȳ.., τ̂t = ȳt. − ȳ.., β̂i = ȳ.i − ȳ.., and êti = yti − ȳt. − ȳ.i + ȳ..
Considering the sums of squares, we have

Σt Σi yti² − N ȳ..² = 4 Σt (ȳt. − ȳ..)² + 4 Σi (ȳ.i − ȳ..)² + Σt Σi êti²,

where the three terms on the right are the Trt SS, the Block SS, and the Error SS. The expression on the left, Σt Σi yti² − N ȳ..², is the Total sum of squares, so we have

Total SS = Trt SS + Block SS + Error SS.

Computing the sums of squares for our data,

Σt Σi yti² − N ȳ..² = 65300 − 65025 = 275 = 191.5 + 38 + 45.5.

The computer output gives the same computed sums of squares.

Response: Coagulation Time
Effect Test
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob>F
Diet       3       191.50000        63.833    12.6264   0.0014
Row        3        38.00000        12.667     2.5055   0.1250
Error      9        45.50000         5.0556
C Total   15       275.00000

Diet   Mean      Shelf   Mean
A      61.0000   1       62.2500
B      66.2500   2       63.7500
C      68.0000   3       66.2500
D      59.7500   4       62.7500

From the computer output, we see that there is again a statistically significant difference in coagulation time, with the p-value slightly smaller (p = 0.0014). The mean square error has been reduced to 5.0556, and its degrees of freedom are reduced to 9.
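The randomized-block partition Total SS = Trt SS + Block SS + Error SS can be checked numerically; a minimal pure-Python sketch using the shelf-by-diet data:

```python
# By-hand sums of squares for the randomized block analysis.
# Rows are shelves, columns are Diets A-D, as in the data table.
table = [
    [62, 63, 68, 56],   # Shelf 1
    [60, 67, 66, 62],   # Shelf 2
    [63, 71, 71, 60],   # Shelf 3
    [59, 64, 67, 61],   # Shelf 4
]
b = len(table)         # number of blocks (shelves)
k = len(table[0])      # number of treatments (diets)
N = b * k
grand = sum(sum(row) for row in table) / N          # 63.75

diet_means = [sum(table[i][j] for i in range(b)) / b for j in range(k)]
shelf_means = [sum(row) / k for row in table]

ss_total = sum((y - grand) ** 2 for row in table for y in row)
ss_trt = b * sum((m - grand) ** 2 for m in diet_means)
ss_block = k * sum((m - grand) ** 2 for m in shelf_means)
ss_error = ss_total - ss_trt - ss_block             # the partition

ms_trt = ss_trt / (k - 1)                  # 191.5 / 3 = 63.833
ms_error = ss_error / ((k - 1) * (b - 1))  # 45.5 / 9  = 5.0556
print(ss_trt, ss_block, ss_error, round(ms_trt / ms_error, 4))
# 191.5 38.0 45.5 12.6264
```

Removing the 38 units of block (shelf) variation from the error term is what shrinks the mean square error from 6.9583 to 5.0556 and sharpens the F-test.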