Chapter 1: Introduction

1.1 What is Statistics?

Statistics involves collecting, analysing, presenting and interpreting data. We frequently see statistical tools (such as bar charts, tables, plots of data, averages and percentages) on TV, in newspapers and in magazines. Methods of this kind, which organise and summarise data so that they are easier to understand, are called descriptive statistics.

Statistics is also used in practice in many different walks of life, going beyond simple data summarisation to answer a wide variety of questions such as:

Medicine: Does a certain new drug prolong life for AIDS sufferers?
Science: Is global warming really happening?
Education: Are GCSE and A level examination standards declining?
Psychology: Is the national lottery making us a nation of compulsive gamblers?
Sociology: Is the gap between rich and poor widening in Britain?
Business: Do Persil adverts really make us want to buy Persil?
Finance: What will interest rates be in 6 months' time?

1.2 Populations and Samples

Suppose that we wanted to investigate whether smoking during pregnancy leads to lower birth weight of babies. We use this example to illustrate the following definitions.

Definitions:

Experimental unit: the object on which measurements are made. In the above example we are measuring birth weights of newborn babies, so a unit is a newborn baby.
Variable: a measurable characteristic of a unit. In the above example the variable is birth weight.
Population: the set of all units about which information is required. In the above example the population is all newborn babies.
Sample: a subset of units of the population for which we can observe the variable of interest. In the above example a sample would be the observed birth weights for a set of newborn babies (which will be a subset of all newborn babies).
Random sample: a sample chosen so that each unit in the population has the same chance of being selected, independently of whether or not any other unit is selected.
To determine whether smoking during pregnancy leads to lower birth weight of babies, we would compare a random sample of weights of newborn babies whose mothers smoke with a random sample of weights of newborn babies of non-smoking mothers. By analysing the sample data, we would hope to be able to draw conclusions about the effects on birth weight of smoking during pregnancy for all babies (i.e. the population). The process of using a random sample to draw conclusions about a population is called statistical inference.

If we do not have a random sample, then sampling bias can invalidate our statistical results. For example, birth weights of twins are generally lower than the weights of babies born alone. So if all the non-smoking mothers in the sample were giving birth to twins, whereas all the smoking mothers were giving birth to single babies, then the conclusions we draw about the effects of smoking in pregnancy would not necessarily be correct, as they would be affected by sampling bias.

Different units of the same population will have different values of the same variable; this is called natural variation. For example, the weights of newborn babies are obviously not all the same. Different samples will therefore contain different data; this is called sampling variability. It is important to bear in mind that slightly different conclusions could be reached from different samples.

1.3 Types of Data

Different types of data require different types of analysis. The type of data set is determined by several factors:

Type of variable:
  quantitative data, i.e. numerical (e.g., heights of students, number of phone calls in an hour);
  qualitative data, i.e. non-numerical (e.g., eye colour, M/F).
Quantitative data can be subdivided further:
  discrete: a discrete variable can take only particular values (e.g., number of phone calls received at an exchange);
  continuous: a continuous variable can take any value in a given range (e.g., heights of students).
Number of variables measured:
  1 variable: univariate data.
  2 variables: bivariate data. E.g., we may have both the heights and weights of a set of individuals. The data set then consists of pairs of observations on each unit, such as (1.7m, 65kg).
  3 or more variables: multivariate data. E.g., we have heights, weights, eye colour and gender for a group of individuals. In this case the data set consists of sets of 4 observations made on each unit, such as (1.7m, 65kg, blue, M).

Number of samples: For example, when investigating the effects of smoking during pregnancy, we would observe two samples:
  a sample of birth weights of babies born to smoking mothers;
  a sample of birth weights of babies born to non-smoking mothers.

Relationship of samples (if more than 1 sample):
  Are the samples independent? E.g., the two birth weight samples should be independent.
  Are the samples dependent?

Example: Suppose that a doctor would like to assess the effectiveness of changing to a low-fat diet in lowering cholesterol for a group of patients. To do this the doctor might measure the cholesterol of the patients before starting on the low-fat diet and then measure the cholesterol of the same patients after they have been on the low-fat diet. We therefore have 2 samples of measured cholesterol:
  a sample before the diet;
  a sample after the diet.
However, the 2 samples are not independent, since the cholesterol measurements for both samples were taken on the same patients. Samples of this type are called matched pair data.

1.4 Recommended Books

You will need to use statistical tables throughout this course. The tables used in the exams are:

Lindley, D.V. and Scott, W.F., New Cambridge Elementary Statistical Tables, C.U.P., 1984.

There are many books which cover the material in this course. Some good books are:

Ross, S.M., Introduction to Probability and Statistics for Engineers and Scientists (with CD-ROM).
Walpole, R.E., Myers, R.H., Myers, S.L. and Ye, K., Probability and Statistics for Engineers and Scientists, Prentice Hall, 7th edition, 2002.
Clarke, G.M. and Cooke, D., A Basic Course in Statistics, Edward Arnold, 4th edition, 1999.
Daly, F., Hand, D.J., Jones, M.C., Lunn, A.D. and McConway, K.J., Elements of Statistics, Open University, 1995. Goes beyond what's required for this course, but is quite clearly written with some real examples.
Devore, J. and Peck, R., Introductory Statistics, West, 1990. Rather simplistic at times, but has lots of real examples. Especially good if you have not done any statistics before.
Spiegel, M.R., Probability and Statistics, Schaum Outline Series, 1988.

In addition, you could browse in the library around QA276 and find a book which suits you. For starters you could try looking at some of the following:

Anderson, D.R., Sweeney, D.J. and Williams, T.A., Introduction to Statistics: Concepts and Applications, West, 2nd edition, 1991.
Bassett, E.E., Bremner, J.M., Jolliffe, I.T., Jones, B., Morgan, B.J.T. and North, P.M., Statistics: Problems and Solutions, Edward Arnold, 1986.
Moore, D.S., The Basic Practice of Statistics, Freeman, 1995.
Moore, D.S., Think and Explain with Statistics, Addison-Wesley, 1986.
Moore, D.S., Statistics: Concepts and Controversies, Freeman, 1991, 1985, 1979.

There are many online books which could be useful. See for example http://www.statsoft.com/textbook/stathome.html

Chapter 2: Graphical and Numerical Statistics

2.1 Histograms

Histograms give a visual representation of continuous data. We consider two separate cases: (i) all the bars in the histogram have the same width; (ii) the intervals are of variable widths.

2.1.1 Histograms with equal class widths

Example: Mercury contamination can be particularly high in certain types of fish.
The mercury content (ppm) in the hair of 40 fishermen in a region thought to be particularly vulnerable is given below (from the paper "Mercury content of commercially imported fish of the Seychelles, and hair mercury levels of a selected part of the population." Environ. Research, (1983), 305-312.)

13.26  32.43  18.10  58.23  64.00  68.20  35.35  33.92  23.94  18.28
22.05  39.14  31.43  18.51  21.03   5.50   6.96   5.19  28.66  26.29
13.89  25.87   9.84  26.88  16.81  38.65  19.23  21.82  31.58  30.13
42.42  16.51  21.16  32.97   9.84  10.64  29.56  40.69  12.86  13.80

The first step is to group the data. A reasonable choice of class intervals is: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70. The frequency table that results from the use of these intervals is:

Interval    0-10  10-20  20-30  30-40  40-50  50-60  60-70
Frequency      5     11     10      9      2      1      2

N.B. By convention, any observation that is at a boundary of a class is put into the higher class. For example, an observation of 10 above would be put into the 10-20 category.

To construct the histogram in this situation (i.e. all class widths equal):
  Mark the boundaries of the class intervals on the horizontal axis.
  Take the height of the bar above each interval to be the frequency for that interval.

[Figure: A histogram showing mercury contamination in hair. Horizontal axis: mercury content (ppm); vertical axis: frequency.]

Instead of using frequencies to give the heights of the rectangles in a histogram, relative frequencies may be used. The relative frequency for an interval is that interval's frequency divided by the total frequency.
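The grouping step and the relative frequencies can be checked with a short Python sketch. The helper name frequency_table and its parameters are illustrative assumptions rather than part of the course material; the only rule it encodes is the convention stated above, that a value on a class boundary goes into the higher class.

```python
# Mercury content (ppm) in the hair of the 40 fishermen, as listed above.
mercury = [
    13.26, 32.43, 18.10, 58.23, 64.00, 68.20, 35.35, 33.92, 23.94, 18.28,
    22.05, 39.14, 31.43, 18.51, 21.03, 5.50, 6.96, 5.19, 28.66, 26.29,
    13.89, 25.87, 9.84, 26.88, 16.81, 38.65, 19.23, 21.82, 31.58, 30.13,
    42.42, 16.51, 21.16, 32.97, 9.84, 10.64, 29.56, 40.69, 12.86, 13.80,
]


def frequency_table(data, lower, width, n_classes):
    """Count the observations in each of n_classes equal-width classes.

    Hypothetical helper: classes are [lower, lower + width), and so on,
    so a value exactly on a boundary is counted in the higher class.
    """
    counts = [0] * n_classes
    for x in data:
        k = int((x - lower) // width)  # index of the class containing x
        counts[k] += 1
    return counts


freq = frequency_table(mercury, lower=0, width=10, n_classes=7)
# Relative frequency = class frequency / total frequency.
rel_freq = [f / len(mercury) for f in freq]

for k, (f, r) in enumerate(zip(freq, rel_freq)):
    print(f"{10 * k}-{10 * (k + 1)}: frequency {f}, relative frequency {r:.3f}")
```

Because the bar heights are all divided by the same total, switching from frequencies to relative frequencies rescales the vertical axis without changing the shape of the histogram.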
For the mercury example the relative frequencies are:

Interval            0-10  10-20  20-30  30-40  40-50  50-60  60-70  Total
Frequency              5     11     10      9      2      1      2     40
Relative frequency  .125   .275   .250   .225   .050   .025   .050      1

The relative frequencies can be expressed as percentages (which is how Minitab produces a relative frequency histogram):

[Figure: A relative frequency histogram for the mercury data. Horizontal axis: mercury content (ppm); vertical axis: relative frequency (%).]

Notice that the shape of the histogram is the same whether frequencies or relative frequencies are used.

2.1.2 Histograms with unequal class widths

There is no hard and fast rule as to how many intervals should be used. Too many classes produce an uneven distribution, but having too few loses information. Usually the number of classes is about 6-20. The more observations we have, the more classes we will usually use.

The widths of the intervals defining the histogram need not all be equal. It is often sensible to choose short intervals where the data are quite dense and longer intervals where the data are more sparse. This ensures that we don't have too many intervals with zero frequency, yet keeps as much information about the distributional shape of the data as possible.

When unequal interval widths are used, the frequency density should be used on the vertical scale of the histogram, where

Frequency density = Frequency / class width.

Example: The lengths (in metres) of 250 vehicles aboard a cross-channel ferry are summarised in the following table:

Vehicle length (m)  Class width  Frequency  Frequency density
3.0-4.0             1            90          90
4.0-4.5             0.5          80         160
4.5-5.0             0.5          40          80
5.0-5.5             0.5          24          48
5.5-7.5             2            16           8

[Figure: A histogram showing the lengths of 250 vehicles. Horizontal axis: vehicle length (m); vertical axis: frequency density.]

Notice that if we had simply defined the heights of the rectangles to be the frequencies, then the histogram would exaggerate, for example, the incidence of cars between 3 and 4 metres in length.
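As a quick check of the frequency density column, the following Python sketch recomputes it from the class boundaries and frequencies in the table. The variable names are illustrative assumptions. It also confirms that, with frequency density as the bar height, the area of each bar equals the class frequency, which is why the bars are then comparable across unequal widths.

```python
# Class boundaries (m) and frequencies for the 250 vehicles, from the table above.
boundaries = [3.0, 4.0, 4.5, 5.0, 5.5, 7.5]
freq = [90, 80, 40, 24, 16]

# Width of each class = upper boundary - lower boundary.
widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]

# Frequency density = frequency / class width (the bar height).
freq_density = [f / w for f, w in zip(freq, widths)]

# Bar area = height * width, which recovers the class frequency.
areas = [d * w for d, w in zip(freq_density, widths)]

for lo, hi, d in zip(boundaries, boundaries[1:], freq_density):
    print(f"{lo}-{hi} m: frequency density {d}")
print("total area:", sum(areas))
```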
An alternative way of producing a histogram in situations where not all class widths are equal is to set the bar height to be the relative frequency density. This is given by:

Relative freq. density = Relative freq. / class width.

If the histogram is produced in this way, then the total area of all the bars is 1.

Example (continued): The relative frequency densities for the vehicle length data are as follows:

Vehicle length (m)  Class width  Frequency  Relative freq.  Rel. freq. density
3.0-4.0             1            90         0.36            0.36
4.0-4.5             0.5          80         0.32            0.64
4.5-5.0             0.5          40         0.16            0.32
5.0-5.5             0.5          24         0.096           0.192
5.5-7.5             2            16         0.064           0.032

The corresponding histogram can then be produced:

[Figure: A histogram showing vehicle lengths. Horizontal axis: length (m); vertical axis: relative frequency density.]

2.1.3 Histogram shapes

Histograms are very useful for giving some idea of the shape of a density, which we obtain by approximating the histogram with a smooth curve. Densities can take many different shapes:

[Figure: sketches of densities illustrating the following shapes.]
  Unimodal, bimodal and multimodal distributions.
  Symmetric, positively skewed and negatively skewed distributions.
  Normal, heavy-tailed and light-tailed distributions.

2.1.4 Histograms for discrete data

Discrete data are usually illustrated using a bar-line chart (or a bar chart), whilst histograms are generally used for continuous data. However, when the number of possible values for the observations is large, a bar diagram becomes uninformative. In this case it is acceptable to group the values into class intervals, much as you would for continuous data.
Example: Suppose we have the following data:

 1  1  2  2  2  3  3  4  4  5  5  5  5  6  6  7  7  7
 8  9  9  9  9 10 10 10 10 10 11 11 11 11 12 12 12 12
13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 16 16 16
17 17 17 18 18 19 19 20 21 21 22 22 23 23 24 26 27 29

As there are a large number of different values here, we can group the data into classes to get a better idea of the shape of the distribution. Let's consider grouping all observations between 1 - 3, 4 - 6 and so on. To draw a histogram we need a continuous scale, so we define our histogram intervals to be 0.5 - 3.5, 3.5 - 6.5, and so on. (Remember: a histogram never has gaps between the bars.) We then get the following frequency distribution:

Interval      Frequency
0.5 - 3.5      7
3.5 - 6.5      8
6.5 - 9.5      8
9.5 - 12.5    13
12.5 - 15.5   15
15.5 - 18.5    8
18.5 - 21.5    5
21.5 - 24.5    5
24.5 - 27.5    2
27.5 - 30.5    1

The histogram can now be drawn in the normal way.

2.2 Stem-and-leaf plots

Stem-and-leaf plots are an effective way of providing a visual display of quantitative data with very little effort. The idea of the plot is to separate each observation into 2 parts, the first part being the stem and the second the leaf.

To construct a stem-and-leaf plot:
  Select one or more leading digits for the stem values. The following digit or digits become the leaves.
  List the possible stem values in a vertical column.
  Record the leaf value for every observation beside the corresponding stem value.
  Indicate the units for stems and leaves.

Example: To investigate the efficiency of new air-conditioning equipment installed on Boeing 720 aircraft, the times (in hours) to first failure of the equipment were obtained from 28 different aircraft:

79 90 10 60 61 49 14 24 56 20 84 44 25 59
46 37 32 76 26 35 29 53 75 25 44 23 27 33

For these data an obvious choice for the stems is the leading digit (tens) and the leaves are then the second digits (units). So, for example, the first observation of 79 has stem 7 and leaf 9.
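The construction steps listed above can be sketched in Python. The function name stem_and_leaf and its parameters are hypothetical, chosen for this illustration, and the sketch assumes non-negative whole-number data with a single leaf digit. Sorting the leaves within each stem is an optional extra step that makes the display easier to read.

```python
from collections import defaultdict

# Times (hours) to first failure for the 28 aircraft, as listed above.
times = [79, 90, 10, 60, 61, 49, 14, 24, 56, 20, 84, 44, 25, 59,
         46, 37, 32, 76, 26, 35, 29, 53, 75, 25, 44, 23, 27, 33]


def stem_and_leaf(data, stem_unit=10, ordered=True):
    """Return a dict mapping each stem to its list of leaves.

    Hypothetical helper for non-negative integers: the stem is the number
    of whole stem units and the leaf is the remainder.
    """
    rows = defaultdict(list)
    for x in data:
        stem, leaf = divmod(x, stem_unit)
        rows[stem].append(leaf)
    plot = {}
    # List every stem between the smallest and largest, even if it is empty.
    for stem in range(min(rows), max(rows) + 1):
        leaves = rows.get(stem, [])
        plot[stem] = sorted(leaves) if ordered else list(leaves)
    return plot


for stem, leaves in stem_and_leaf(times).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Scale: stem = 10s, leaves = units")
```

Calling stem_and_leaf(times, ordered=False) keeps the leaves in the order in which the observations were recorded.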
The data values range from 10 up to 90, so we have the stem values 1-9.

An unordered stem-and-leaf diagram for the Boeing data

Stem | Leaves (these should be in columns)
  1  | 0 4
  2  | 4 0 5 6 9 5 3 7
  3  | 7 2 5 3
  4  | 9 4 6 4
  5  | 6 9 3
  6  | 0 1
  7  | 9 6 5
  8  | 4
  9  | 0
Scale: Stem = 10s, Leaves = units

An ordered stem-and-leaf diagram for the Boeing data

Stem | Leaves (now put in order)
  1  | 0 4
  2  | 0 3 4 5 5 6 7 9
  3  | 2 3 5 7
  4  | 4 4 6 9
  5  | 3 6 9
  6  | 0 1
  7  | 5 6 9
  8  | 4
  9  | 0
Scale: Stem = 10s, Leaves = units

N.B. Rearranging the leaves in ascending order clarifies things and is useful for producing numerical summaries.
N.B.2 One advantage that stem-and-leaf diagrams have over histograms is that they retain the detail of the raw data.

2.2.1 Use of stem-and-leaf plots

Stem-and-leaf plots give a visual display of the rough shape of the distribution of the variable being measured. We can identify whether the density is a) unimodal or multimodal; b) symmetric, negatively or positively skewed; c) normal, heavy- or light-tailed.

Stem-and-leaf plots are useful for informal inference. We can find medians and quartiles easily from the diagrams and obtain estimates of probabilities. For example, in the Boeing data 10 pieces of equipment lasted under 30 hours, so we could estimate the probability of a new piece of equipment failing within the first 30 hours as 10/28.

Stem-and-leaf plots are useful for identifying outliers, which are unusually large or small observations. For example, if there had been an extra observation of 119 in the Boeing data, then this might be an outlier:

  1  | 0 4
  2  | 0 3 4 5 5 6 7 9
  3  | 2 3 5 7
  4  | 4 4 6 9
  5  | 3 6 9
  6  | 0 1
  7  | 5 6 9
  8  | 4
  9  | 0
 10  |
 11  | 9     (this could be considered an outlying value)

2.2.2 Choice of stem unit

Choice of stem unit can be important.

Example: To determine the age of a pre-historic settlement in North Wales, 24 small fragments from a wooden boat found at the settlement were independently radio-carbon dated.
The radio-carbon determinations (in years) of the age of the fragments are:

4969 5163 5052 5144 4965 5152 4967 4934
4895 5078 5019 4908 5009 5046 4912 5012
4889 5034 4914 5117 4931 5081 4984 4881

Possibility 1: We could round each observation to the nearest one hundred years:

5000 5000 5200 5000 5100 4900 5100 5000
5000 4900 5200 5000 5000 4900 4900 5100
4900 4900 5100 5100 5000 5000 4900 4900

Taking the stem unit to be 1000 years gives the following diagram:

4 | 9 9 9 9 9 9 9 9
5 | 0 2 1 1 0 2 0 1 0 0 0 0 0 1 1 0
Scale: Stem = 1000's, Leaves = 100's

Because we have so few stem values here, we lose a lot of information. We can't say anything, for example, about the shape of the distribution.

Possibility 2: Round observations to the nearest 10 years.

4970 5010 5160 5050 5050 4910 5140 5010
4970 4890 5150 5030 4970 4910 4930 5120
4900 4930 5080 5080 5020 4980 4910 4880

Taking the stem unit as 100 years gives:

48 | 9 8
49 | 7 1 7 7 1 3 0 3 8 1
50 | 1 5 5 1 3 8 8 2
51 | 6 4 5 2
Scale: Stem = 100's, Leaves = 10's

This plot is a little more informative, but we could still do with having slightly more stems.

Possibility 3: Split the stems into high and low values. In each low category (L) you put any 0s, 1s, 2s, 3s or 4s; in each high category (H) you put any 5s, 6s, 7s, 8s or 9s.

48L |
48H | 8 9
49L | 0 1 1 1 3 3
49H | 7 7 7 8
50L | 1 1 2 3
50H | 5 5 8 8
51L | 2 4
51H | 5 6
Scale: Stem = 100's, Leaves = 10's

The diagram is now quite informative about the distribution: there is evidence of a positive skew. [Note that if the stem unit were taken to be 10s, the resulting diagram would be poor, because we would have too many stem values and a lot of the rows would have no values in them.]

2.2.3 Back-to-back displays for displaying two independent samples

If there are 2 sets of data which you wish to compare, then both of these can be put on the same stem-and-leaf plot, with the leaves for one dataset going to the right and the leaves of the other dataset going to the left.
Example: Using a technique involving chromium dioxide, the protein assimilation efficiencies (i.e. percentage of protein intake actually absorbed) were measured on field mice and voles fed on their natural diets. The assimilation efficiencies (in percentages) are given below:

A.E.'s of field mice:
61.3 57.8 71.7 62.6 63.6 76.3 67.8 61.9
65.4 70.6 70.5 68.9 62.6 69.7 74.6

A.E.'s of voles:
51.7 72.0 69.8 63.7 77.2 62.6 63.5 66.7 69.2 67.5
70.1 75.2 73.8 59.6 69.9 77.6 74.1 67.3 73.7

Rounding observations to the nearest integer gives us:

An unordered back-to-back stem-and-leaf diagram for the protein data

A.E.s for field mice |    | A.E.s for voles
                     | 5L | 2               (outlier?)
                   8 | 5H |
           3 2 4 3 1 | 6L | 4 3 4 0
               9 8 5 | 6H | 7 9 8 7
             0 1 1 2 | 7L | 2 0 0 4 0 4 4
                 5 6 | 7H | 7 5 8
Scale: Stem = 10's, Leaves = 1's

Then ordering the leaves we get:

An ordered back-to-back stem-and-leaf diagram showing the protein data

A.E.s for field mice |    | A.E.s for voles
                     | 5L | 2
                   8 | 5H |
           4 3 3 2 1 | 6L | 0 3 4 4
               9 8 5 | 6H | 7 7 8 9
             2 1 1 0 | 7L | 0 0 0 2 4 4 4
                 6 5 | 7H | 5 7 8
Scale: Stem = 10's, Leaves = 1's

2.2.4 Stem-and-leaf diagrams for matched-pair data

It is not a good idea to do a back-to-back plot if the 2 variates are not independent. Consider the following example.

Example: Fifteen people participated in a short typing course. Their typing speeds (words/min) before and after the course were recorded:

Subject   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Before   15  18  23  27  36  12   8  19  32  22  17  21  16  15  33
After    26  28  27  26  28  24  26  42  32  36  20  29  21  22  28

These data are an example of matched-pair data (there are two measurements recorded on each participant). Matched-pair data are likely to be dependent (a person with a fast typing speed before the course is also likely to have a fast typing speed after the course). By drawing a back-to-back stem-and-leaf diagram you lose information about how the measurements pair up. You could draw a scatter diagram (this would show the pairings).
Alternatively, you could produce a stem-and-leaf diagram of the differences:

Subject   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Change   11  10   4  -1  -8  12  18  23   0  14   3   8   5   7  -5

A stem-and-leaf diagram showing the change in typing speeds after a short course

-0 | 1 8 5
 0 | 4 0 3 8 5 7
 1 | 1 0 2 8 4
 2 | 3
Scale: Stem = 10's, Leaves = units

A slightly more informative diagram can be obtained by splitting each stem into two parts (one for the lower leaves and the other for the higher leaves):

A stem-and-leaf diagram showing the change in typing speeds after a short course

-0H | 8 5
-0L | 1
 0L | 4 0 3
 0H | 8 5 7
 1L | 1 0 2 4
 1H | 8
 2L | 3
Scale: Stem = 10's, Leaves = units

Each diagram could then be ordered.

2.2.5 Problems

Stem-and-leaf plots cannot be used for displaying qualitative data, and they become impractical for large numbers of observations.

2.3 Cumulative Frequency Plots

A cumulative frequency plot also uses classes and frequencies. The cumulative frequency for a class is the number of observations with values less than the upper boundary for that class.

Example: Consider the mercury example again. The cumulative frequencies are given in the table below:

Interval              0-10  10-20  20-30  30-40  40-50  50-60  60-70
Frequency                5     11     10      9      2      1      2
Cumulative frequency     5     16     26     35     37     38     40

In a cumulative frequency polygon the cumulative frequencies are plotted against the upper class boundaries of the classes. These points are then joined with straight lines.

Example (continued): For the mercury example we want to plot the points (0, 0), (10, 5), (20, 16), …, (70, 40) and then join these points:

[Figure: A cumulative frequency polygon for the mercury data. Horizontal axis: mercury level; vertical axis: cumulative frequency.]

A cumulative frequency plot is useful for giving us some idea of the shape of the distribution function of the variable. It can also be used to obtain estimates of the median and other quantiles for grouped data.

2.4 Scatter Plots
Scatter plots are useful for assessing relationships between 2 variables. To draw a scatter plot we represent one of the variables by the horizontal axis and the other variable by the vertical axis. We then simply plot the pairs of data points on the graph.

Example: Fifteen children were given a visual-discrimination (V) test during the first week at primary school and a reading-achievement (R) test at the end of their first year of schooling. Scores out of 100 were calculated for each test.

Child no.   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
V-score    75  69  70  62  52  45  42  39  37  34  34  66  54  58  63
R-score    95  90  82  69  58  49  38  35  30  20  31  75  61  64  77

To draw a scatter plot we now want to plot the points (75, 95), (69, 90), (70, 82), …, (63, 77).

[Figure: A scatter plot depicting primary school test results. Horizontal axis: V-score; vertical axis: R-score.]

The plot would suggest that there is a positive relationship between the V-score and the R-score.

2.4.1 Positive/negative correlation

The following graphs give illustrations of variables that are (a) positively and (b) negatively correlated with each other. Correlation can also be categorised as strong or weak depending upon how close the points are to lying on a straight line.

[Figure: four scatter plots of y against x, showing weak positive, strong positive, strong negative and weak negative correlation.]

2.4.2 Correlation does not imply causation

It is important to realise that scatter plots point to associations between variables. They do not necessarily show a causal relationship.

Example: Information about two variables (life expectancy and the number of people per television set) is available for 12 countries:

[Figure: Life expectancy plotted against number of people per TV. Horizontal axis: number of people per TV; vertical axis: life expectancy.]

It is clear that the two variables are negatively correlated.
However, it would clearly be wrong to conclude that simply sending more televisions to countries with low life expectancies would cause their inhabitants to live longer.

This example illustrates the very important distinction between causation and association. Two variables may be strongly correlated without a cause-and-effect relationship existing between them. Often the explanation is that both variables are related to a third variable that is not being measured. In the example above, for instance, life expectancy and the number of televisions in the population will both be related to the country's wealth.

There is one further type of graph that we will consider later in the chapter (namely box-and-whisker plots). First, however, we need to look at numerical summary measures for data.

2.5 Numerical summaries of data

In the next few sections we will look at some numerical ways of summarising data.

2.5.1 Some notation

Suppose that we would like to learn about the random variable X. To do this we will observe a random sample of n observations, $X_1, \dots, X_n$, such that each $X_i$ has the same distribution as X. The observed values of $X_1, \dots, X_n$ are then denoted $x_1, \dots, x_n$.

Example: Suppose we are interested in the number of units of alcohol students at UKC consumed last week. To investigate this we could randomly select 50 students to form a random sample $X_1, \dots, X_{50}$, where $X_i$ is the random variable representing the number of units of alcohol consumed by the ith student. The observed value of $X_i$ is denoted $x_i$.

Now suppose that we order the observed sample $x_1, \dots, x_n$. We let:
  $x_{(1)}$ denote the smallest observation;
  $x_{(2)}$ denote the second smallest observation;
  …
  $x_{(i)}$ denote the ith smallest observation;
  …
  $x_{(n)}$ denote the largest observation.
Then $x_{(i)}$ is called the ith order statistic and the following relation holds:

$$x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}.$$

Example: Suppose that we have the observations $x_1 = 5$, $x_2 = 10$, $x_3 = 2$, $x_4 = 7$.
Then $x_{(1)} = x_3 = 2$, $x_{(2)} = x_1 = 5$, $x_{(3)} = x_4 = 7$, $x_{(4)} = x_2 = 10$.

When we have frequency data, we will denote the frequency of the kth class by $f_k$ for k = 1, …, K, where K is the number of classes. Then

$$\sum_{k=1}^{K} f_k = n.$$

Example: Consider the mercury example again. Here we have the frequency table given by:

Interval    0-10  10-20  20-30  30-40  40-50  50-60  60-70
Frequency      5     11     10      9      2      1      2

Here we have 7 classes, so that K = 7. Then $f_1 = 5$, $f_2 = 11$, and so on, such that $\sum_{k=1}^{7} f_k = 40 = n$.

2.5.2 Measures of location

The Sample Mean

Let $X_1, \dots, X_n$ denote the random variables for a sample of size n. The sample mean, denoted $\bar{X}$, is defined by:

$$\bar{X} = \frac{X_1 + \dots + X_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i.$$

The observed value of the sample mean for a particular sample is therefore:

$$\bar{x} = \frac{x_1 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

When the data are grouped by means of a frequency table, the equivalent formula for $\bar{x}$ is given by:

$$\bar{x} = \frac{\sum_{k=1}^{K} x_k f_k}{\sum_{k=1}^{K} f_k},$$

where K is the number of classes or groups, and $x_k$ is the mid-point of class k.

Example: Consider the mercury example again.

Interval          0-10  10-20  20-30  30-40  40-50  50-60  60-70
Mid-point, $x_k$     5     15     25     35     45     55     65
Frequency, $f_k$     5     11     10      9      2      1      2

The sample mean is therefore:

$$\bar{x} = \frac{(5 \times 5) + (15 \times 11) + (25 \times 10) + \dots + (65 \times 2)}{40} = \frac{1030}{40} = 25.75.$$

Note: The mean is probably the most useful measure of location. Its advantages are that it uses all the values in the data and is easy to manipulate mathematically. A disadvantage is that it is not robust: its value can be sensitive to the presence of outlying values. More robust measures of location (such as the median or trimmed mean) are increasing in popularity amongst statisticians.

The Median

To find the median of a set of n data values, we must first rearrange them in order of size. The median is then equal to the middle observation if n is odd, and the average of the middle two observations if n is even. More formally,

$$\text{median} = \begin{cases} X_{(0.5(n+1))} & \text{if } n \text{ is odd,} \\ \frac{1}{2}\left( X_{(0.5n)} + X_{(0.5n+1)} \right) & \text{if } n \text{ is even.} \end{cases}$$

Example 1: The values below are systolic blood pressures of patients admitted to a hospital:

112.1 138.6 115.9 109.5 108.2 110.9 159.6 115.8 122.3 122.4 123.8 117.5

To find the median value for the blood pressure, we must first list them in ascending order:

108.2 109.5 110.9 112.1 115.8 115.9 117.5 122.3 122.4 123.8 138.6 159.6

Here we have an even number of observations (n = 12). So

$$\text{Sample median} = \frac{1}{2}\left( x_{(6)} + x_{(7)} \right) = \frac{1}{2}(115.9 + 117.5) = 116.7.$$

For these data the sample mean is:

$$\text{Sample mean} = \frac{1}{12}(108.2 + 109.5 + 110.9 + \dots + 159.6) = \frac{1456.6}{12} = 121.38,$$

which is somewhat larger than the sample median. The mean is influenced by the outlying value (159.6). The median is more robust than the mean and is not really affected by outliers.

Example 2: A football team has scored the following number of goals in the last 44 matches:

Number of goals   0   1   2   3   4
Frequency         9   8  15   9   3

As n = 44, the median will lie halfway between the 22nd and 23rd observations. Since both $x_{(22)}$ and $x_{(23)}$ are 2, the median value is 2.

For grouped data, the most convenient way to estimate the median is by graphical methods. This is most easily demonstrated via an example.

Example: Consider the mercury example once again. The cumulative frequency plot is given below. We have a total of 40 observations, so the value of mercury read off from the graph at a cumulative frequency of 20 is an estimate of the median. In this case we estimate the median as 23 approximately.

[Figure: A cumulative frequency polygon for the mercury data. Horizontal axis: mercury level; vertical axis: cumulative frequency.]

Note: The median is also often a better measure of location than the mean when data are highly skewed. The following show the relative positions of the mean and median for 3 densities:

[Figure: relative positions of the mean and median for three densities.]

Example: Distributions of incomes are commonly positively skewed, as there are typically a few very large salaries which give the density a long right-hand tail.
Therefore the median is often used to give a typical salary value, rather than the mean.

Disadvantages of the median: There are two main disadvantages of using the median. It ignores the actual values of the data and uses only their ranks (it effectively uses only the "middle" part of the data set). It is also not as easy to use mathematically in the theory of statistics as the arithmetic mean.

The Trimmed Mean

The trimmed mean can be viewed as a compromise between the mean and the median. To calculate a trimmed mean:
  order the data values;
  delete a selected number of values from each end of the ordered list;
  average the remaining values.

The trimmed mean avoids the disadvantage of the mean by excluding extreme observations, and avoids that of the median by taking some account of observations other than the middle one. To calculate the 5% trimmed mean, for example, discard the top 5% and the bottom 5% of observations and average those remaining.

Example: The body temperatures (deg. F) of 10 patients hospitalised with meningitis are as follows:

104.0 100.8 104.8 104.2 101.6 100.2 108.0 103.8 102.4 101.4

The sample mean for these data is:

$$\bar{x} = \frac{1031.2}{10} = 103.12.$$

To find the 10% trimmed mean, as we have 10 observations, we drop the smallest and largest data values:

$$\text{10\% trimmed mean} = \frac{823}{8} = 102.875.$$

In this case the 10% trimmed mean is probably a better representation of the centre of the distribution, as it ignores the (possible) outlier, 108.

The Mode

The mode is a very simple measure of location. For discrete data, it is the value of x with the largest frequency. We cannot calculate a mode for ungrouped continuous data. For data grouped into classes we obtain a modal class.

Example: Consider again the family size data presented in the previous section. The numbers of children in the sampled families are: 2, 6, 3, 2, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4, 1. Here the most commonly occurring value is 2 and so this is the mode.
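The location measures in this section can be checked numerically with a short Python sketch. The function names (mean, median, trimmed_mean, modes) are illustrative assumptions, not a standard library interface, and the trimmed mean takes the percentage to drop from each end as a whole number. The data are the meningitis temperatures and the family sizes from the examples above.

```python
from collections import Counter


def mean(xs):
    """Arithmetic mean: the sum of the values divided by their number."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle value if n is odd, average of the two middle values if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def trimmed_mean(xs, percent):
    """Mean after dropping percent% of the observations from each end."""
    s = sorted(xs)
    k = (percent * len(s)) // 100  # number of values dropped from each end
    return mean(s[k:len(s) - k])


def modes(xs):
    """All values attaining the largest frequency (there may be ties)."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


temperatures = [104.0, 100.8, 104.8, 104.2, 101.6, 100.2, 108.0, 103.8, 102.4, 101.4]
children = [2, 6, 3, 2, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4, 1]

print("mean:", round(mean(temperatures), 3))
print("median:", round(median(temperatures), 3))
print("10% trimmed mean:", round(trimmed_mean(temperatures, 10), 3))
print("mode(s) of family size:", modes(children))
```

Returning a list from modes is a design choice for this sketch: a data set can have more than one value with the largest frequency, in which case a single "mode" is not well defined.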
Quantiles

The median divides the data into two equal parts. In a similar way, quartiles divide the data into four equal parts, deciles divide the data into 10 equal parts and percentiles divide it into 100 equal parts. The upper and lower quartiles can be found in the following way:

sample lower quartile = median of lower half of data
sample upper quartile = median of upper half of data

If n is odd, then the median of the entire sample is included in both halves. Note that deciles and percentiles only tend to be used on very large data sets.

Example: The salinity values for 28 water specimens are as follows:

7.6 7.7 4.3 5.9 5.0 10.5 6.5 8.3 8.2 13.2 12.6 13.6 10.4 10.8
13.1 12.3 10.4 13.0 7.7 14.1 14.1 9.5 13.5 15.1 12.0 11.5 12.6 12.0

To find the quartiles we first need to order the data:

4.3 5.0 5.9 6.5 7.6 7.7 7.7 8.2 8.3 9.5 10.4 10.4 10.5 10.8
11.5 12.0 12.0 12.3 12.6 12.6 13.0 13.1 13.2 13.5 13.6 14.1 14.1 15.1

We have 28 observations and so

median = $\tfrac{1}{2}\left[x_{(14)} + x_{(15)}\right] = \tfrac{1}{2}(10.8 + 11.5) = 11.15$.

To find the lower and upper quartiles we need to find the median of the lower 14 and upper 14 observations respectively:

lower quartile = $\tfrac{1}{2}\left[x_{(7)} + x_{(8)}\right] = \tfrac{1}{2}(7.7 + 8.2) = 7.95$,

upper quartile = $\tfrac{1}{2}\left[x_{(21)} + x_{(22)}\right] = \tfrac{1}{2}(13.0 + 13.1) = 13.05$.

Exercise: Find the median, together with the lower and upper quartiles, for the following examination marks:

68, 72, 31, 60, 90, 96, 45, 57, 54, 45, 16, 22, 82, 63, 52.

Just as with finding the median, we can estimate quantiles graphically.

Example: Consider again the cumulative frequency polygon for the mercury data. As the total number of observations is 40, we can estimate the lower and upper quartiles by reading off the mercury values from the graph for cumulative frequencies of 10 and 30, respectively.

[Figure: A cumulative frequency polygon for the mercury data. Vertical axis: cumulative frequency (0 to 40); horizontal axis: mercury level (0 to 70).]

We see UQ = 34 and LQ = 14 (approximately).
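The "median of each half" rule for quartiles can be sketched in a few lines of Python. The function name `quartiles` is an illustrative choice; the rule implemented is the one stated above (for odd n the overall median is included in both halves), which differs from the conventions used by some software packages.

```python
from statistics import median


def quartiles(values):
    """Return (lower quartile, median, upper quartile) using medians of halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # for odd n the middle value falls in both halves
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return median(lower_half), median(ordered), median(upper_half)


# Salinity values for the 28 water specimens from the example.
salinity = [7.6, 7.7, 4.3, 5.9, 5.0, 10.5, 6.5, 8.3, 8.2, 13.2, 12.6, 13.6, 10.4, 10.8,
            13.1, 12.3, 10.4, 13.0, 7.7, 14.1, 14.1, 9.5, 13.5, 15.1, 12.0, 11.5, 12.6, 12.0]

lq, med, uq = quartiles(salinity)
print(f"LQ = {lq:.2f}, median = {med:.2f}, UQ = {uq:.2f}")  # notes: 7.95, 11.15, 13.05
```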
2.5.3 Measures of dispersion

Specifying the central value of a set of data does not tell the whole story. We also need to consider the variability (or spread, or dispersion) of the data.

The Range

The simplest measure of dispersion is the range, which is the difference between the largest and smallest values in the data set. If we have grouped data then we cannot calculate an exact range, only an upper limit.

Example: For the water salinity data, the largest observation is 15.1 and the smallest is 4.3. Therefore, range = 15.1 - 4.3 = 10.8.

Note: The range is sensitive to the presence of one or two extremely large or small values in the data.

Inter-quartile range

This is a more useful measure of dispersion than the range. It is the difference between the upper and lower quartiles. The inter-quartile range contains the middle half of the data set.

Example: We calculated the upper and lower quartiles for the water salinity data to be 13.05 and 7.95 respectively. Therefore, inter-quartile range = 13.05 - 7.95 = 5.1.

The Mean Deviation

The deviations in a sample are the differences $x_1 - \bar{x},\ x_2 - \bar{x},\ \dots,\ x_n - \bar{x}$. One possible idea for obtaining a summary measure of the dispersion in the sample would be to calculate the mean of these deviations. However, the mean of these deviations is always zero. [Think about why this should be.] Instead we could take the absolute value of each of the deviations and calculate the mean of these. This gives the mean (absolute) deviation:

mean absolute deviation = $\dfrac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$.

For grouped data the equivalent formula is:

mean absolute deviation = $\dfrac{1}{n}\sum_{k=1}^{K} f_k\,|x_k - \bar{x}|$,

where $x_k$ is the midpoint of the kth class and $f_k$ is its frequency.

Example: Twelve students record their weight in kg, creating the following sample:

50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59.

The mean of these 12 observations is:

$\bar{x} = \tfrac{1}{12}(50 + 51 + \dots + 59) = \tfrac{777}{12} = 64.75$ kg.

The deviations of each value from the mean are:

-14.75, -13.75, -3.75, 10.25, -2.75, 8.25, -0.75, 21.25, 0.25, -6.75, 8.25, -5.75.

So the mean deviation is:

Mean deviation = $\tfrac{1}{12}(14.75 + 13.75 + 3.75 + 10.25 + \dots + 5.75) = \tfrac{96.5}{12} = 8.0417$ kg.

The Sample Variance and Sample Standard Deviation

Instead of taking the absolute values of the deviations (so that the positive and negative deviations don't just cancel each other out), we could use the squares of the deviations. The sample variance (usually denoted by $s^2$) can be thought of as an 'average' of the squared deviations. The sample variance is defined by:

$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$.

Note that although we are summing n squared deviations, we divide through by n - 1. This is important! The reason why we use n - 1 and not n in the definition of the sample variance will become apparent later on in the course when we look at unbiased estimators.

The disadvantage of using the sample variance is that it is not measured in the units of measurement used for the data, but in squared units. This problem is overcome by using the standard deviation. The sample standard deviation is the square root of the sample variance, i.e.:

$s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$.

Note: For grouped data, we use the following definition for a sample s.d.:

$s = \sqrt{\dfrac{1}{n-1}\sum_{k=1}^{K} f_k\,(x_k - \bar{x})^2}$.

Example: Consider again the weights of the 12 students given above. The deviations from the mean were:

-14.75, -13.75, -3.75, 10.25, -2.75, 8.25, -0.75, 21.25, 0.25, -6.75, 8.25, -5.75.

So the sample variance is:

$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \tfrac{1}{11}\left[(-14.75)^2 + (-13.75)^2 + (-3.75)^2 + 10.25^2 + \dots + (-5.75)^2\right] = \tfrac{1}{11}(1200.25) = 109.1136$.

This means that the sample standard deviation is $s = \sqrt{109.1136} = 10.446$ kg.

Result: Using the above formula to calculate the sample variance can be complicated. In general it is better to use the expression:

$s^2 = \dfrac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$.

To calculate the variance using this expression we need to know only the sum of the observations and the sum of their squares.

Proof: We need to show that both formulae for the sample variance are equivalent. It suffices to show:

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$.

Now,

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} \left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2$.

But $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$, so

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \dfrac{2}{n}\left(\sum_{i=1}^{n} x_i\right)^2 + n\left(\dfrac{1}{n}\sum_{i=1}^{n} x_i\right)^2 = \sum_{i=1}^{n} x_i^2 - \dfrac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$,

as required.

Note: There is an equivalent expression for grouped data, so that:

$s^2 = \dfrac{1}{n-1}\left[\sum_{k=1}^{K} f_k x_k^2 - \dfrac{\left(\sum_{k=1}^{K} f_k x_k\right)^2}{n}\right]$.

Example 1: Consider again the student weight data:

50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59.

We can check that the new formula for calculating the variance does in fact give us the same result:

$\sum_{i=1}^{12} x_i^2 = 50^2 + 51^2 + 61^2 + \dots + 59^2 = 51511$,

$\sum_{i=1}^{12} x_i = 50 + 51 + 61 + \dots + 59 = 777$.

So,

$s^2 = \dfrac{1}{n-1}\left[\sum_i x_i^2 - \dfrac{\left(\sum_i x_i\right)^2}{n}\right] = \dfrac{1}{11}\left[51511 - \dfrac{777^2}{12}\right] = 109.1136$,

as before.

Example 2: For an example of grouped data, consider the mercury data again:

Interval     Frequency, $f_k$    Mid-point, $x_k$
0 - 10       5                   5
10 - 20      11                  15
20 - 30      10                  25
30 - 40      9                   35
40 - 50      2                   45
50 - 60      1                   55
60 - 70      2                   65

Here we have,

$\sum_{k=1}^{7} f_k x_k^2 = 5 \times 5^2 + 11 \times 15^2 + 10 \times 25^2 + \dots + 2 \times 65^2 = 35400$,

$\sum_{k=1}^{7} f_k x_k = 5 \times 5 + 11 \times 15 + 10 \times 25 + \dots + 2 \times 65 = 1030$.

So,

$s^2 = \dfrac{1}{n-1}\left[\sum_k f_k x_k^2 - \dfrac{\left(\sum_k f_k x_k\right)^2}{n}\right] = \dfrac{1}{39}\left[35400 - \dfrac{1030^2}{40}\right] = 227.6282$.

The sample standard deviation is therefore $\sqrt{227.6282} = 15.09$.

Exercise: A sample of 50 adults were asked how many lottery tickets they purchased last week:

Number of lottery tickets    0    1    2    3    4    5
Frequency                   19   11   10    3    4    3

Find the sample standard deviation.

Note: Find out how to use your calculator's statistical mode to calculate s.d.s.

2.6 Box-and-whisker plots

Box-and-whisker plots aim to highlight a few important features of a data set. They are based on the following location summaries: minimum, lower quartile, median, upper quartile and maximum. These 5 quantities are sometimes referred to as the five-number summary.
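As a rough illustration, the five-number summary can be computed as sketched below. The function name `five_number_summary` is hypothetical, the quartiles use the "median of each half" rule from the Quantiles section, and the data are the twelve student weights from Section 2.5.3.

```python
from statistics import median


def five_number_summary(values):
    """Minimum, quartiles (medians of halves), median and maximum of a sample."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # for odd n the middle value is counted in both halves
    return {
        "minimum": ordered[0],
        "lower quartile": median(ordered[:half]),
        "median": median(ordered),
        "upper quartile": median(ordered[n - half:]),
        "maximum": ordered[-1],
    }


# Weights (kg) of the twelve students from the dispersion examples.
weights = [50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59]
summary = five_number_summary(weights)
for name, value in summary.items():
    print(f"{name}: {value}")
```

These five values are exactly what is needed to draw the box and whiskers by hand.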
Simple Example: The numbers of runs scored by a batsman on 14 occasions are as follows:

40, 22, 17, 50, 24, 48, 5, 0, 28, 19, 30, 25, 16, 37.

Ordering these values we get:

0, 5, 16, 17, 19, 22, 24, 25, 28, 30, 37, 40, 48, 50.

The five-number summary then is:

Minimum value = 0
Lower quartile, Q1 = 17
Median, Q2 = 24.5
Upper quartile, Q3 = 37
Maximum value = 50

The box-and-whisker plot then looks like:

[Figure: A box plot showing a batsman's runs. Horizontal axis: number of runs (0 to 50).]

In the above diagram, the box indicates the interquartile range. The whiskers go from the lower and upper quartiles to the smallest and largest observations respectively. The median is represented by a line within the box.

Note: the position of the median within the box gives an indication of whether the data are skewed:

Symmetry: $Q_2 - Q_1 = Q_3 - Q_2$;
positive skew: $Q_2 - Q_1 < Q_3 - Q_2$;
negative skew: $Q_2 - Q_1 > Q_3 - Q_2$.

Box-and-whisker plots are especially useful for comparing two different data sets, as they give a simple picture of the locations and spreads of different distributions.

Example: The numbers of hysterectomies performed by 15 male doctors and 10 female doctors are given below:

Male doctors:    20 25 25 27 28 31 33 34 36 37 44 50 59 85 86
Female doctors:  5 7 10 14 18 19 25 29 31 33

First of all we need to find the five-number summaries for the two data sets.

Summary statistic    Male doctors    Female doctors
Minimum              20              5
Lower quartile       27.5            10
Median               34              18.5
Upper quartile       47              29
Maximum              86              33

[Figure: Box-and-whisker plot comparing number of procedures by sex (male doctors, female doctors). Horizontal axis: number of hysterectomies performed (0 to 90).]

Exercise: Consider again the protein assimilation efficiency data given in Section 2.2.3.
We then had the following stem-and-leaf diagram:

[Figure: An ordered back-to-back stem-and-leaf diagram showing the protein data. A.E.s for field mice on the left, A.E.s for voles on the right; stems 5L, 5H, 6L, 6H, 7L, 7H. Scale: stem = 10's, leaves = 1's.]

Draw box-and-whisker plots for the field mice and voles and compare the shapes of these.

Note: Minitab calculates the quartiles slightly differently to the method used in this course. Consequently, slightly different values for the quartiles can arise when using Minitab.

Chapter 3: Common Distributions

In this chapter we examine four of the distributions that will be frequently encountered later in the course.

3.1 The Normal Distribution

3.1.1 Recap from MA304

The normal distribution is the most widely used distribution in statistics. Continuous data such as mass, length, etc., can often be modelled using a normal distribution. The normal distribution has two parameters: the mean ($\mu$) and the variance ($\sigma^2$). If a random variable X has a normal distribution then we can write this as $X \sim N[\mu, \sigma^2]$.

A normal distribution with $\mu = 0$ and $\sigma = 1$ is referred to as a standard normal distribution (and a random variable with this distribution is usually denoted Z).

Important result: If X is a random variable distributed as $N[\mu, \sigma^2]$, then

$\dfrac{X - \mu}{\sigma} \sim N[0, 1]$.

The process of subtracting the mean and dividing by the standard deviation is referred to as standardisation:

General Normal: $X \sim N[\mu, \sigma^2]$
Standard Normal: $Z \sim N[0, 1]$, with $z = \dfrac{x - \mu}{\sigma}$.

Example: The fully grown lengths (in mm) of a certain insect can be regarded as having the following normal distribution: $X \sim N[64, 16]$. What is the probability that an insect has length less than 59 mm?

Applying the standardisation formula,

$z = \dfrac{x - \mu}{\sigma} = \dfrac{59 - 64}{4} = -1.25$.

Thus,

$P(X < 59) = P(Z < -1.25) = P(Z > 1.25) = 1 - \Phi(1.25) = 1 - 0.8944 = 0.1056$.

3.1.2 Percentage points

Definition: Consider a random variable X with some distribution. The (upper) $100\alpha\%$ point is the value of x such that: $P(X > x) = \alpha$.
For the standard normal distribution, we will denote the (upper) $100\alpha\%$ point by $z_\alpha$, i.e.: $P(Z > z_\alpha) = \alpha$.

In statistical tables (e.g. Lindley and Scott), there is a separate percentage point table covering the most used values of $\alpha$. In Lindley and Scott, P represents $100\alpha$ and x(P) represents the value of $z_\alpha$.

Extract:

$P = 100\alpha$        10%       5%        2%        1%        0.1%
$\alpha$               0.10      0.05      0.02      0.01      0.001
$x(P) = z_\alpha$      1.2816    1.6449    2.0537    2.3263    3.0902

The 10% point for the standard normal is $z_{0.1} = 1.2816$.

Example 1: Let $X \sim N[50, 16]$. Find the value of x such that P(X > x) = 0.05, i.e. find the (upper) 5% point.

If $X \sim N[50, 16]$, then $\dfrac{X - 50}{4} \sim N[0, 1]$.

The 5% point for the standard normal is $z_{0.05} = 1.6449$.

Thus, the 5% point for a N[50, 16] distribution can be obtained by solving

$\dfrac{x - 50}{4} = 1.6449$.

So, the 5% point is $x = 50 + 1.6449 \times 4 = 56.5796$.

Example 2: Let $Z \sim N[0, 1]$. Find the value of z such that P(Z < z) = 0.01 (i.e. find the lower 1% point).

The upper 1% point for a standard normal is $z_{0.01} = 2.3263$. Therefore, P(Z > 2.3263) = 0.01. By symmetry, we must also have P(Z < -2.3263) = 0.01. So, the lower 1% point is -2.3263.

3.2 The chi-squared distribution

3.2.1 Introduction

The chi-squared ($\chi^2$) distribution has a single parameter called the degrees of freedom, which can be any positive integer. The $\chi^2$ distribution with n degrees of freedom is denoted $\chi^2_n$.

Probability density function: If $X \sim \chi^2_n$, then the p.d.f. of X (for x > 0) is given by:

$f(x) = \dfrac{1}{2^{n/2}\,\Gamma\!\left(\tfrac{n}{2}\right)}\, x^{n/2 - 1} e^{-x/2}$.

For $x \le 0$, $f(x) = 0$.

This density is written in terms of the gamma function. Some of the key properties of this function are:

$\Gamma(x) = (x - 1)\,\Gamma(x - 1)$;
$\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}$;
$\Gamma(x) = (x - 1)!$ if x is a natural number.

The degrees of freedom, n, define the shape of the $\chi^2$ density. For n < 3, the density has a mode at zero. For $n \ge 3$, the mode moves further away from zero as n increases. The shapes of some specific densities are given below.
[Figure: Graph of several chi-squared densities, for n = 1, 2, 4 and 8.]

3.2.2 Finding probabilities

Probabilities associated with the $\chi^2$ distribution can be looked up in probability tables. Lindley and Scott list the d.o.f. (which they denote $\nu$) along the top of each column. Then for each value x listed, the values in the table are the probability that X < x.

Extracts:

$\nu = 3.0$
x       P(X < x)
0.0     0.0000
0.5     0.0811
1.0     0.1987
1.5     0.3177
2.0     0.4276
2.5     0.5247
3.0     0.6084
3.5     0.6792
4.0     0.7385
etc.

$\nu = 7.0$
x       P(X < x)
1.0     0.0052
2.0     0.0402
3.0     0.1150
4.0     0.2202
5.0     0.3400
6.0     0.4603
7.0     0.5711
8.0     0.6674
9.0     0.7473
10.0    0.8114

Example 1: If $X \sim \chi^2_3$, then P(X < 2.5) = 0.5247.

Example 2: Suppose $X \sim \chi^2_7$. Find P(X > 10).

From tables we find P(X < 10) = 0.8114, so P(X > 10) = 1 - 0.8114 = 0.1886.

3.2.3 Percentage points

The $100\alpha\%$ point for the $\chi^2_n$ distribution is denoted $\chi^2_{n,\alpha}$. Therefore, if $X \sim \chi^2_n$, then $P(X > \chi^2_{n,\alpha}) = \alpha$.

The percentage points of the $\chi^2$ distribution are in a separate table in Lindley and Scott.

Extract:

P              99       95       10       5        1
$\nu = 1.0$    0.000    0.004    2.706    3.841    6.635
$\nu = 2.0$    0.020    0.103    4.606    5.991    9.210
$\nu = 3.0$    0.115    0.352    6.251    7.815    11.34
$\nu = 4.0$    0.297    0.711    7.779    9.488    13.28
$\nu = 5.0$    0.554    1.145    9.236    11.07    15.09
$\nu = 6.0$    0.872    1.635    10.64    12.59    16.81
$\nu = 7.0$    1.239    2.167    12.02    14.07    18.48
$\nu = 8.0$    1.646    2.733    13.36    15.51    20.09

For example, $\chi^2_{5,\,0.1} = 9.236$. So if $X \sim \chi^2_5$, then P(X > 9.236) = 0.1.

In this table, the degrees of freedom for the distribution are listed going down the rows and P is $100\alpha$.

The chi-squared distribution is not symmetric (unlike the normal distribution). So if we want a lower percentage point (i.e. a value of x such that $P(X < x) = \alpha$), then we can't simply negate the corresponding upper percentage point. Instead we need to find $\chi^2_{n,\,1-\alpha}$.

Example 1: Let $X \sim \chi^2_8$. Find the lower 1% point (i.e. the value of x such that P(X < x) = 0.01).

The lower 1% point is denoted $\chi^2_{8,\,0.99}$, the value for which is 1.646.

Example 2: Suppose $X \sim \chi^2_{10}$.
Find the value of t for which P(X > t) = 0.1321.

Here, t would be the 13.21% point for the distribution. But 0.1321 is a non-standard value of $\alpha$, so we need to use the distribution function table to find t:

$P(X > t) = 0.1321 \;\Rightarrow\; P(X < t) = 1 - 0.1321 = 0.8679$.

Going through the distribution table we find that t = 15.

3.3 The Student t-distribution

3.3.1 Introduction

Definition: Suppose that we have two independent random variables Y and Z, such that $Y \sim N[0, 1]$ and $Z \sim \chi^2_n$. Then the random variable X defined by

$X = \dfrac{Y}{\sqrt{Z/n}}$

has a t-distribution with n degrees of freedom, denoted $t_n$.

The t-distribution is symmetric about zero and its general shape is like the bell shape of a normal distribution. However, the tails of the t-distribution can approach zero much more slowly than those of the normal distribution, i.e. the t-distribution is more heavy-tailed than the normal. The degrees of freedom define how heavy-tailed the t-distribution is.

Note: The t-distribution with n = 1 is sometimes referred to as the Cauchy distribution. This is so heavy-tailed that its mean and variance do not exist! (This is because the integrals specifying the mean and variance are not absolutely convergent.)

Important note: The density of a t-distribution converges to that of the standard normal as $n \to \infty$. The diagram below shows how the t-distribution varies for different degrees of freedom.

[Figure: Comparing several t distributions ($t_2$, $t_5$, $t_{20}$) with the standard normal. Vertical axis: density; horizontal axis: x (-3 to 3).]

3.3.2 Probabilities

Probabilities associated with the t-distribution can be looked up in tables. In Lindley and Scott, the degrees of freedom are again denoted by $\nu$ and are listed along the top of the columns. Then for each value t listed, the values in the table are the probability that X < t.

Example 1: Let $X \sim t_3$. Then P(X < 2.5) = 0.9561.

Example 2: Let $X \sim t_{12}$. Find P(X > 2.5).

P(X > 2.5) = 1 - P(X < 2.5) = 1 - 0.986 = 0.014.
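The definition of the t-distribution can be checked by simulation. The sketch below is not part of the notes: it builds draws of $X = Y/\sqrt{Z/n}$ directly from the definition, generating Z as a sum of n squared independent N[0, 1] values (a standard fact about the chi-squared distribution), and estimates P(X < 2.5) for n = 3. The function name, seed and number of draws are arbitrary illustrative choices.

```python
import math
import random


def simulate_t(n, rng):
    """One draw from t_n built from its definition, X = Y / sqrt(Z / n)."""
    y = rng.gauss(0.0, 1.0)                               # Y ~ N[0, 1]
    z = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))   # Z ~ chi-squared, n d.o.f.
    return y / math.sqrt(z / n)


rng = random.Random(12345)  # fixed seed so the run is repeatable
draws = [simulate_t(3, rng) for _ in range(100_000)]

# Monte Carlo estimate of P(X < 2.5) for X ~ t_3; the tabulated value is 0.9561.
estimate = sum(d < 2.5 for d in draws) / len(draws)
print(f"estimated P(X < 2.5) = {estimate:.4f}")
```

The estimate carries simulation error of roughly 0.001 at this number of draws, so it should agree with the table only to about two or three decimal places.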
3.3.3 Percentage points

The $100\alpha\%$ point for the $t_n$ distribution is denoted by $t_{n,\alpha}$. If $X \sim t_n$, then: $P(X > t_{n,\alpha}) = \alpha$.

Percentage points for the t-distribution are tabulated separately. The degrees of freedom for the distribution are listed down the rows and $P = 100\alpha$.

Example 1: Find the 5% point for $t_6$.

Directly from tables, this is seen to be $t_{6,\,0.05} = 1.943$. (Thus P(X > 1.943) = 0.05.)

As the t-distribution is symmetric, finding lower percentage points is simple.

Example 2: Let $X \sim t_{10}$. Find the value of t such that P(X < t) = 0.01 (i.e. find the lower 1% point).

The upper 1% point is $t_{10,\,0.01} = 2.764$. But P(X > 2.764) = 0.01 implies P(X < -2.764) = 0.01. So, the lower 1% point, t, is -2.764.

Note: To find non-standard percentage points (such as the 12.5% point, for example), we need to use the t-distribution function table.

3.4 The (Fisher's) F-distribution

3.4.1 Introduction

Definition: Consider two independent random variables Y and Z such that $nY \sim \chi^2_n$ and $mZ \sim \chi^2_m$. The random variable X defined by

$X = \dfrac{Y}{Z}$

is then said to have an F-distribution with n and m degrees of freedom, denoted $F_{n,m}$.

The F-distribution therefore has two parameters, both of which are degrees of freedom. The order of the degrees of freedom is important! The $F_{n,m}$ distribution is not the same as the $F_{m,n}$ distribution.

Note: The density for the F-distribution is only defined for positive values of x. The values of the two degrees of freedom define the shape of the distribution. Plots of the F-distribution for various values of n and m are shown below.

[Figure: Graphs of several F distributions, for (n, m) = (2, 2), (4, 4), (8, 8), (20, 20). Vertical axis: density; horizontal axis: x (0 to 6).]

[Figure: Graphs of several more F distributions, for (n, m) = (2, 4), (4, 2), (5, 10), (10, 20). Vertical axis: density; horizontal axis: x (0 to 5).]

Lindley and Scott do not have tables for looking up probabilities associated with the F-distribution.
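Since no probability tables are available, one rough way to explore F probabilities is by simulation from the definition. The sketch below is an illustration only: it assumes a chi-squared variable can be generated as a sum of squared independent N[0, 1] values, and the function names, seed, threshold x = 3 and number of repetitions are arbitrary choices. It also shows that swapping the degrees of freedom changes the distribution.

```python
import random


def chi2_draw(df, rng):
    """One chi-squared draw with `df` degrees of freedom (sum of squared N[0, 1])."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


def simulate_f(n, m, rng):
    """One draw from F_{n,m}: ratio of chi-squared draws, each divided by its d.o.f."""
    return (chi2_draw(n, rng) / n) / (chi2_draw(m, rng) / m)


def upper_tail(n, m, x, rng, reps=50_000):
    """Monte Carlo estimate of P(X > x) for X ~ F_{n,m}."""
    return sum(simulate_f(n, m, rng) > x for _ in range(reps)) / reps


rng = random.Random(2024)  # fixed seed so the run is repeatable
p_5_10 = upper_tail(5, 10, 3.0, rng)
p_10_5 = upper_tail(10, 5, 3.0, rng)

print(f"P(X > 3) for F(5, 10):  {p_5_10:.3f}")
print(f"P(X > 3) for F(10, 5):  {p_10_5:.3f}")  # noticeably larger: order matters
```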
3.4.2 Percentage points

Separate tables giving 10, 5, 2.5, 1, 0.5 and 0.1 percentage points for F-distributions with different combinations of degrees of freedom can be found in Lindley and Scott. We will denote the (upper) $100\alpha\%$ point for the $F_{n,m}$ distribution by $F_{n,m,\alpha}$. If $X \sim F_{n,m}$, then: $P(X > F_{n,m,\alpha}) = \alpha$.

In the table of the $100\alpha$ percentage points for the F-distribution, the first degrees of freedom is denoted $\nu_1$ and listed along the columns. The second degrees of freedom is denoted by $\nu_2$ and listed down the rows.

Extract: 1% points of the F-distribution

                $\nu_1 = 1$    2        3        4        5
$\nu_2 = 1$     4052           4999     5403     5625     5764
$\nu_2 = 2$     98.50          99.00    99.17    99.25    99.30
$\nu_2 = 3$     34.12          30.82    29.46    28.71    28.24
$\nu_2 = 4$     21.20          18.00    16.69    15.98    15.52
$\nu_2 = 5$     16.26          13.27    12.06    11.39    10.97

The (upper) 1% point for an $F_{5,3}$ distribution is 28.24. We write $F_{5,\,3,\,0.01} = 28.24$.

Example: Find the 5% point for both the $F_{5,10}$ and the $F_{10,5}$ distributions.

From the 5% points table:

$F_{5,\,10,\,0.05} = 3.326$
$F_{10,\,5,\,0.05} = 4.735$

Notice that these are not the same.

The tables in Lindley and Scott give the upper percentage points only, i.e. they give the values of x such that $P(X > x) = \alpha$, for small values of $\alpha$. Since the F-distribution is not symmetric, to find lower percentage points we cannot simply use the negative of the corresponding upper percentage point:

$P(X < -x) \ne P(X > x)$.

The density is in fact not even defined for x < 0.

3.4.3 Finding lower percentage points

Result: Suppose that $X = \dfrac{Y}{Z} \sim F_{n,m}$. Then $\dfrac{1}{X} = \dfrac{Z}{Y} \sim F_{m,n}$.

Proof: $X = \dfrac{Y}{Z} \sim F_{n,m}$ if $nY \sim \chi^2_n$ and $mZ \sim \chi^2_m$. But by definition of the F-distribution, this means that

$\dfrac{1}{X} = \dfrac{Z}{Y} \sim F_{m,n}$

as required.

We can use this result to find lower percentage points for F-distributions:

Important result: The lower $100\alpha$ percentage point for the $F_{n,m}$ distribution is the reciprocal of the upper $100\alpha$ percentage point of the $F_{m,n}$ distribution.

Proof: If $X \sim F_{n,m}$ and x represents the lower $100\alpha$ percentage point for this distribution, then $P(X < x) = \alpha$. But

$P\!\left(\dfrac{1}{X} > \dfrac{1}{x}\right) = P(X < x) = \alpha$.

As $\dfrac{1}{X} \sim F_{m,n}$, then $\dfrac{1}{x}$ is (by definition) the upper $100\alpha$ percentage point of the $F_{m,n}$ distribution. So, $x = \dfrac{1}{F_{m,n,\alpha}}$.

Example 1: Let $X \sim F_{5,10}$. Suppose we wish to find x such that P(X < x) = 0.05, i.e. we want to find the lower 5% point of the $F_{5,10}$ distribution.

The lower 5% point of the $F_{5,10}$ distribution is the reciprocal of the upper 5% point of the $F_{10,5}$ distribution. So,

$x = \dfrac{1}{F_{10,\,5,\,0.05}} = \dfrac{1}{4.735} = 0.2112$.

Example 2: Suppose $X \sim F_{4,7}$. Find the upper and lower 10% points.

The upper 10% point can be found directly from tables: $F_{4,\,7,\,0.1} = 2.961$.

The lower 10% point is the reciprocal of the upper 10% point of the $F_{7,4}$ distribution:

Lower 10% point = $F_{4,\,7,\,0.9} = \dfrac{1}{F_{7,\,4,\,0.1}} = \dfrac{1}{3.979} = 0.2513$.

Exercise: Suppose $X \sim F_{2,4}$. Find the upper and lower 1% points.

3.5 Some additional facts about distributions

1) If $X_1, \dots, X_n$ are independent with $X_i \sim N[\mu_i, \sigma_i^2]$, i = 1, ..., n, then

$a_0 + \sum_{i=1}^{n} a_i X_i \sim N\!\left[a_0 + \sum_{i=1}^{n} a_i \mu_i,\ \sum_{i=1}^{n} a_i^2 \sigma_i^2\right]$;

2) If $X_1, \dots, X_n$ are i.i.d. as N[0, 1], then
(a) $X_i^2 \sim \chi^2_1$, for i = 1, 2, ..., n;
(b) $\sum_{i=1}^{n} X_i^2 \sim \chi^2_n$;

3) If $X_1, \dots, X_n$ are independent with $X_i \sim \chi^2_{k_i}$, i = 1, ..., n, then

$\sum_{i=1}^{n} X_i \sim \chi^2_k$, where $k = k_1 + \dots + k_n$.

4) If $X \sim t_n$, then $X^2 \sim F_{1,n}$.

These results are not proved in this course.

Chapter 4: Sampling Distributions

4.1 Parameters

The purpose of many statistical investigations is to learn about the distribution of some random variable X. Many aspects of X's distribution may be of interest, but attention often focuses on one or two particular population characteristics.

Example 1: A bakery needs to decide how many loaves of fresh bread it should put out on its shelves each day. If it puts out too many, then it will lose money as stale bread will not sell, and if it puts out too few, then it will lose potential sales. Therefore, to help the bakery make its order, interest might focus on the mean number of loaves, $\mu$, usually sold on a particular day.
Example 2: Suppose that a company has the job of packing a certain breakfast cereal into boxes, so that each box contains approximately 500g of cereal. The weight of cereal in each box varies around 500g due to the variability of the cereal product. The company wants to check that the amount going into each box doesn't vary too much about 500g: weights greater than 500g will lose the company money and weights less than 500g could lead to customer dissatisfaction. In this case, attention may focus on the variability of weights in the boxes as described by $\sigma$, the standard deviation of weights.

Example 3: When testing a new drug, a doctor might not be interested so much in the number of people cured by the drug, but rather the proportion, $\pi$, of people who are cured by the drug.

We call $\mu$, $\sigma$, or $\pi$ population parameters. To learn about such parameters, we can observe a random sample of n observations, $x_1, \dots, x_n$, and then use these data to calculate estimates for the parameter(s) of interest. For example, a sample mean could be used to estimate $\mu$.

Definition: Any quantity computed from values in a sample is called a (sample) statistic.

Example: All the numerical summaries introduced in Chapter 2 are statistics, as they are all calculated from values in the random sample. This includes statistics such as the sample mean (which utilises all the observations in its calculation) and the sample median (which only takes account of the middle observations).

It is important to realise that there is a difference between population parameters and sample statistics. The population parameter is a characteristic of the distribution of the random variable; it is typically unknown and cannot be observed. By contrast, a statistic is a characteristic of the sample and can be observed. For example, the population mean $\mu$ has some fixed (but unknown) value. On the other hand, the sample mean, $\bar{X}$, can be observed and therefore can be known for a particular sample.
The observed value of $\bar{X}$, however, can vary from sample to sample (as different samples will give different values of $x_1, \dots, x_n$). The value of a statistic, therefore, is subject to sampling variability.

Definition: As a statistic is a function of the random variables $X_1, \dots, X_n$, it is itself a random variable. The distribution of a statistic is called its sampling distribution.

The sampling distribution of a statistic describes the long-run behaviour of the statistic's values when many different samples, each of size n, are obtained and the value of the statistic is computed for each sample.

4.2 The sampling distribution of the sample mean

To investigate the sampling distribution for $\bar{X}$, we will consider several experiments.

Experiment 1: We generate 500 random samples (each of size n) from N[100, 400]. For each of these 500 samples we calculate $\bar{x}$, so we have a random sample of 500 observations from the sampling distribution of $\bar{X}$. This was repeated for n = 5, 20, 50.

[Figure: Histograms of the sampling distribution for the sample mean, for n = 5, n = 20 and n = 50. Vertical axis: frequency; horizontal axis: sample mean.]

Observations: In each case the distribution seems roughly normal, and each of these histograms is centred roughly at 100 (the mean of the normal distribution from which the samples were generated). We can also see that as the sample size n increases, the variability in the sampling distributions decreases (look carefully at the scales on the horizontal axes).
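Experiment 1 can be imitated with a short simulation. The sketch below is not the code used to produce the notes' histograms; the function name, seed and use of Python's `random` module are assumptions made for illustration, so the numbers it prints will differ slightly from those quoted in the notes.

```python
import random
from statistics import mean, stdev


def sample_means(n, reps, rng, mu=100.0, sigma=20.0):
    """Means of `reps` random samples of size n from N[mu, sigma^2]."""
    return [mean(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(reps)]


rng = random.Random(1)  # fixed seed so the run is repeatable
# N[100, 400] has standard deviation 20.
results = {n: sample_means(n, 500, rng) for n in (5, 20, 50)}
sds = {n: stdev(values) for n, values in results.items()}

for n, values in results.items():
    print(f"n = {n:2d}: mean of sample means = {mean(values):7.2f}, "
          f"s.d. of sample means = {sds[n]:5.2f}")
```

The centres stay near 100 while the spread shrinks as n grows, matching the observations made from the histograms.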
These points can also be seen if we look at some statistics relating to each histogram above:

Sample size    Mean      Standard deviation
n = 5          100.07    8.17
n = 20         99.83     4.40
n = 50         100.05    2.81

We will do a similar set of experiments to see what the sampling distribution for $\bar{X}$ is like when we are not sampling from the normal distribution.

Experiment 2: We generate 500 random samples (each of size n) from a uniform U[0, 1] distribution. Again, for each of these 500 samples we calculate $\bar{x}$, so we have a random sample of 500 observations from the sampling distribution of $\bar{X}$. This was repeated for n = 5, 10, 20, 50.

Note: If $X \sim U[0, 1]$, then E[X] = 0.5 and Var[X] = 1/12 (so s.d. = 0.289).

[Figure: Histograms of the sampling distribution for the sample mean, for n = 5, n = 10, n = 20 and n = 50. Vertical axis: frequency; horizontal axis: sample mean.]

Observations: The shapes of the histograms of the sample means look increasingly like normal distributions as n increases, despite the data being sampled from a uniform distribution. The histograms in each case seem to centre on 0.5 (the mean of the U[0, 1] distribution). Also, the variability of the sampling distributions decreases as the sample size becomes larger. The mean and standard deviation for the data in the four situations above are given below:

Sample size    Mean     Standard deviation
n = 5          0.491    0.133
n = 10         0.504    0.095
n = 20         0.502    0.068
n = 50         0.499    0.042

Important Result: For an independent random sample $X_1, \dots, X_n$ from a distribution with mean $\mu$ and variance $\sigma^2$, the sampling distribution for $\bar{X}$ has the following properties:

1. $E[\bar{X}] = \mu$.
2. $\mathrm{Var}[\bar{X}] = \dfrac{\sigma^2}{n}$.
The standard deviation of $\bar{X}$ (often called the standard error) is therefore $\dfrac{\sigma}{\sqrt{n}}$.
3. If each $X_i \sim N[\mu, \sigma^2]$, then $\bar{X} \sim N\!\left[\mu, \dfrac{\sigma^2}{n}\right]$ regardless of the size of n.
4. If $X_1, \dots, X_n$ are not normally distributed, then when n is large (say at least 30) the distribution of $\bar{X}$ is approximately $N\!\left[\mu, \dfrac{\sigma^2}{n}\right]$.

Proof

1. $E[\bar{X}] = E\!\left[\dfrac{1}{n}\sum_{i=1}^{n} X_i\right] = \dfrac{1}{n}\sum_{i=1}^{n} E[X_i] = \dfrac{1}{n}\, n\mu = \mu$ (as required).

2. Because we are assuming that the random variables are independent, we can also write:

$\mathrm{Var}[\bar{X}] = \mathrm{Var}\!\left[\dfrac{1}{n}\sum_{i=1}^{n} X_i\right] = \dfrac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}[X_i] = \dfrac{1}{n^2}\, n\sigma^2 = \dfrac{\sigma^2}{n}$ (as required).

3. A linear combination of normally distributed random variables also has a normal distribution. The mean and variance are as given above.

4. Not proved here.

Note: Part (4) of the above result is the Central Limit Theorem, an extremely powerful and useful result in Statistics.

Example 1: $X_1, \dots, X_{20}$ are independently and identically distributed N[30, 5]. Find the sampling distribution for $\bar{X}$.

Here n = 20 and so $\bar{X} \sim N[30, 5/20] = N[30, 0.25]$.

Example 2: $X_1, \dots, X_{40}$ are i.i.d. Po(10) random variables. What approximately is the sampling distribution for $\bar{X}$?

The sample size can be considered large enough for the Central Limit Theorem to be applied. The sampling distribution can therefore be considered approximately normal. A Po(10) distribution has mean and variance equal to 10, therefore

$\bar{X} \sim N\!\left[10, \dfrac{10}{40}\right] = N[10, 0.25]$ (roughly).

4.3 Sampling distribution of the sample proportion

In many statistical investigations we are interested in learning about the proportion of individuals, or objects, in a population that possess a specified property. For example, we might be interested in what proportion of patients are alive 5 years after diagnosis of a particular cancer, or we might be interested in the proportion of UK adults who would like a ban on blood-sports.

Denote the true population proportion of interest by $\pi$. Note that $\pi$ is a population parameter.
To learn about $\pi$, we could observe a random sample in which each of the n observations is either a "success" or a "failure". The sample proportion, p, is given by:

p = (number of successes) / n.

The sample proportion is clearly a sample statistic. It makes sense to use p to learn about $\pi$. We are therefore interested in the sampling distribution for p.

To investigate the sampling distribution for p, we will look at 2 experiments in which we generate random samples of observed values of p.

Experiment 1: Suppose that we generate 500 samples of size n where each sampled value is either a success (with probability $\pi$ = 0.25) or a failure (with probability $1 - \pi$ = 0.75). We then calculate the observed proportion of "successes" in each of the 500 samples. We will do this for n = 5, 10, 25 and 50.

[Figure: Histograms of the sampling distribution for the sample proportion; panel titles n = 5, n = 10, n = 20 and n = 50. Vertical axis: frequency; horizontal axis: sample proportion, p.]

Observations: For a sample of size 5, the possible values of p are 0, 0.2, 0.4, 0.6, 0.8 and 1. The sampling distribution for p gives the probability of each of these 6 values. The histogram for the case n = 5 is positively skewed.

As n increases, the histograms become more and more symmetrical, and in fact when n = 50 the histogram clearly resembles a normal curve centred on 0.25. In addition, increasing the sample size decreases the range of observed values for p.

Experiment 2: Once again we will generate 500 samples, but this time we will have the sample sizes n = 10, 25, 50 and 100 and we will take the true proportion of successes, $\pi$, to be 0.07.
So once again each observation in each sample is either a success (S) with probability 0.07, or a failure (F) with probability 0.93.

[Figure: four histograms titled "Sampling distribution for the sample proportion", for n = 10, n = 25, n = 50 and n = 100; x-axis: sample proportion, p; y-axis: frequency.]

Observations:

When $n = 10$, the possible values for $p$ are 0, 0.1, 0.2, …, 1. The histogram for the 500 samples is very positively skewed and no value greater than 0.4 was observed for $p$. [Notice how in the previous experiment, the density for $p$ was not very skewed when $n = 10$.]

As $n$ increases to 25 and 50, the histograms still look positively skewed. However, when the sample size reaches 100, the histogram is beginning to look slightly more normal. Therefore we note that in this experiment we need larger sample sizes than in Experiment 1 before the sampling distribution for $p$ looks approximately normal.

We also note that increasing the sample size again results in a narrowing in the range of observed values for $p$.

Thus to summarise the observations from this experiment:

Densities are roughly centred about $\pi = 0.07$.
The variance of $p$ decreases as $n$ increases.
As the sample size increases, the density for $p$ becomes approximately normal. However, the density tends to normality much more slowly than when we had $\pi = 0.25$.

Therefore, it appears that the rate at which the sampling distribution for $p$ tends to normality depends not only on the sample size $n$, but also on the value of $\pi$.
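The two experiments above can be reproduced, at least roughly, with a short simulation. The following is a minimal sketch rather than the code used to produce the histograms in these notes: the function names, the seed and the moment-based skewness measure are all illustrative choices. It reports the mean, spread and skewness of 500 simulated sample proportions for each sample size.

```python
import random
import statistics


def simulate_sample_proportions(n, pi, n_samples=500, seed=0):
    """Return n_samples observed sample proportions, each based on n
    independent success/failure trials with success probability pi."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(n_samples):
        successes = sum(1 for _ in range(n) if rng.random() < pi)
        proportions.append(successes / n)
    return proportions


def skewness(values):
    """Moment-based skewness; a positive value means a longer right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


if __name__ == "__main__":
    # Experiment 1 uses pi = 0.25, Experiment 2 uses pi = 0.07.
    for pi, sizes in ((0.25, (5, 10, 25, 50)), (0.07, (10, 25, 50, 100))):
        for n in sizes:
            ps = simulate_sample_proportions(n, pi)
            print(f"pi={pi}, n={n}: mean={statistics.fmean(ps):.3f}, "
                  f"sd={statistics.pstdev(ps):.3f}, skew={skewness(ps):.2f}")
```

If the sketch behaves as intended, the means should sit close to the true proportion, the standard deviations should shrink as n grows, and the skewness should fall towards zero more slowly in the second experiment than in the first.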
Important result: If $p$ is the sample proportion of successes in a random sample of size $n$, where $\pi$ is the true proportion of successes, then the following results hold:

The expected value of $p$ is $\pi$.
The standard error (i.e. s.d.) of $p$ is $\sqrt{\pi(1-\pi)/n}$.
When $n$ is sufficiently large, the sampling distribution for $p$ is approximately normal.

Note: The further the value of $\pi$ is from 0.5, the larger the value of $n$ must be in order for the normal approximation of the sampling distribution for $p$ to be accurate.

Rule of thumb: If both $n\pi \ge 5$ and $n(1-\pi) \ge 5$, then we may use the normal approximation for $p$.

Proof: Let $X$ = total number of successes in the sample. Then $X \sim Bi[n, \pi]$ and so:

$$E[X] = n\pi, \qquad V[X] = n\pi(1-\pi), \qquad \mathrm{sd}[X] = \sqrt{n\pi(1-\pi)}.$$

But, by definition, the sample proportion is $p = X/n$, and so

$$E[p] = E\left[\frac{X}{n}\right] = \frac{1}{n}E[X] = \frac{1}{n}\, n\pi = \pi.$$

Also,

$$V[p] = V\left[\frac{X}{n}\right] = \frac{1}{n^2}V[X] = \frac{1}{n^2}\, n\pi(1-\pi) = \frac{\pi(1-\pi)}{n}.$$

Taking square roots, we get the required standard error for $p$.

Proof of the normal approximation is simply an application of the Central Limit Theorem, so that for large $n$

$$p \sim N\left[\pi, \frac{\pi(1-\pi)}{n}\right] \quad \text{approximately}.$$

Example 1: Suppose that the proportion of women who believe that they are underpaid is 0.55.
a) If we had a random sample of size 10, could we assume that the sampling distribution for $p$ is approximately normal?
b) For a random sample of 400, what are the mean value and standard deviation for $p$?
c) In a sample of size 400, what is the probability that we observe the proportion of women who believe they are underpaid to be greater than 0.6?

a) $\pi = 0.55$ and $n = 10$, so $n\pi = 5.5$ and $n(1-\pi) = 4.5$. As these are not both at least 5, we cannot assume that the distribution of $p$ is approximately normal with a sample size of only 10.

b) $n = 400$, so:

$$E[p] = \pi = 0.55, \qquad V[p] = \frac{\pi(1-\pi)}{n} = \frac{0.55 \times 0.45}{400} = 0.000619, \qquad \mathrm{sd}[p] = 0.0249.$$

For $n = 400$, $n\pi = 220$ and $n(1-\pi) = 180$ and so $p$'s distribution can be considered approximately normal. Therefore $p \sim N[0.55, 0.000619]$.

c)

$$P(p > 0.6) = P\left(Z > \frac{0.6 - 0.55}{0.0249}\right) = P(Z > 2.008) = 1 - \Phi(2.008) = 1 - 0.9778 = 0.0222 \quad \text{approximately}.$$

Example 2: Suppose that the true proportion of individuals with a particular disease is 0.02. What minimum sample size would be needed before $p$'s distribution can be assumed to be approximately normal?

For approximate normality we need $n\pi \ge 5$ and $n(1-\pi) \ge 5$. Now

$$n(0.02) \ge 5 \Rightarrow n \ge 250, \qquad n(0.98) \ge 5 \Rightarrow n \ge 5.102.$$

Therefore, to assume approximate normality for $p$, we would need a sample size of at least 250.

Exercise: 90% of the population are right-handed. In a sample of 200 people, what is the probability that the sample proportion who are right-handed is less than 0.86?

4.4 Sampling distribution for sample variance

When we want to learn about the variance, $\sigma^2$, of a population, it is natural to first look towards the sample variance, $S^2$. We are therefore interested in the sampling distribution for $S^2$. In general, the sampling distribution for $S^2$ does not follow any fixed rules and so here we will only look at the case when $X_1, \ldots, X_n$ are i.i.d. $N[\mu, \sigma^2]$.

Important result: If $X_1, \ldots, X_n$ are i.i.d. $N[\mu, \sigma^2]$ where $\mu$ is unknown, then

$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Proof: The proof will not be given in this course.

Experiment:

To demonstrate that this result does in fact hold in practice, 500 samples were generated from $N[100, 400]$ for various sample sizes, $n$, and the value of

$$\frac{(n-1)S^2}{\sigma^2} = \frac{(n-1)S^2}{400}$$

calculated for each of the 500 samples. Histograms of these samples then demonstrate what the sampling distribution for $(n-1)S^2/\sigma^2$ looks like in each case.

[Figure: four histograms of the statistic, titled "Histogram for n = 3", "Histogram for n = 5", "Histogram for n = 10" and "Histogram for n = 20"; x-axis: statistic; y-axis: frequency.]

Observations:

In the case when $n = 3$, the histogram for the sample of 500 observations of $(n-1)S^2/\sigma^2$ is heavily positively skewed and resembles a $\chi^2_2$ distribution.
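The histograms for the other cases, where $n$ = 5, 10 and 20, also resemble chi-squared distributions (the respective degrees of freedom should be 4, 9 and 19).

Chapter 5: Point Estimation

Definition: A (point) estimator, $\hat{\theta}$, is a statistic (some function of the sample $X_1, \ldots, X_n$) used to produce a single value estimate of a parameter $\theta$. An estimate is the value an estimator takes for a particular sample.

Statistics and the parameters they estimate:

Sample mean, median, trimmed mean, …: estimators for the population mean, $\mu$.
Sample variance, $S^2$: estimator for the population variance, $\sigma^2$.
Sample proportion, $p$: estimator for the population proportion, $\pi$.

There will be a range of possible estimators for a population parameter, $\theta$. However, some estimators will be sensible to use and some will not. To help us decide whether $\hat{\theta}$ is good to use, we look at its sampling distribution.

Suppose that the sampling distribution of $\hat{\theta}$ (an estimator for $\theta$) looks like:

[Figure not reproduced: a sampling distribution for $\hat{\theta}$ lying entirely to the left of the true value $\theta$.]

In this case, the true value of $\theta$ is to the right of the sampling distribution for $\hat{\theta}$, so $\hat{\theta}$ is a poor estimator, as it will always underestimate $\theta$. Ideally, the distribution of $\hat{\theta}$ should be concentrated around $\theta$.

Definition: $\hat{\theta}$ is an unbiased estimator of $\theta$ if

$$E[\hat{\theta}] = \theta.$$

So on average the observed value of an unbiased estimator will be the true value of the parameter it is trying to estimate.

Result 1: $\bar{X}$ is an unbiased estimator of $\mu$.

Proof:

$$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\mu = \mu.$$

Therefore as $E[\bar{X}] = \mu$, $\bar{X}$ is an unbiased estimator of $\mu$.

Result 2: $S^2$ is an unbiased estimator of $\sigma^2$.

Proof: Recall that

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$

But,

$$\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} \left[(X_i - \mu) - (\bar{X} - \mu)\right]^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu)\sum_{i=1}^{n} (X_i - \mu) + n(\bar{X} - \mu)^2$$

$$= \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu)(n\bar{X} - n\mu) + n(\bar{X} - \mu)^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2n(\bar{X} - \mu)^2 + n(\bar{X} - \mu)^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2.$$

So

$$E[S^2] = E\left[\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\right] = \frac{1}{n-1}E\left[\sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n} E\left[(X_i - \mu)^2\right] - nE\left[(\bar{X} - \mu)^2\right]\right].$$

But, by definition,

$$E\left[(X_i - \mu)^2\right] = \mathrm{Var}[X_i] = \sigma^2 \quad \text{and} \quad E\left[(\bar{X} - \mu)^2\right] = \mathrm{Var}[\bar{X}] = \frac{\sigma^2}{n}.$$

Therefore, we have

$$E[S^2] = \frac{1}{n-1}\left[n\sigma^2 - n\,\frac{\sigma^2}{n}\right] = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2.$$

Therefore as $E[S^2] = \sigma^2$, $S^2$ is an unbiased estimator of $\sigma^2$.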
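The key algebraic step in the proof of Result 2 is that the sum of squared deviations about the sample mean equals the sum of squared deviations about the population mean minus n times the squared distance between the two means. The sketch below checks that identity numerically on one simulated sample and then averages the sample variance over many samples. The function names, the N[0, 4] population and the number of repetitions are assumptions made purely for illustration.

```python
import math
import random
import statistics


def decomposition_sides(xs, mu):
    """Return both sides of the identity used in the proof:
    sum (x_i - xbar)^2  and  sum (x_i - mu)^2 - n * (xbar - mu)^2."""
    n = len(xs)
    xbar = statistics.fmean(xs)
    lhs = sum((x - xbar) ** 2 for x in xs)
    rhs = sum((x - mu) ** 2 for x in xs) - n * (xbar - mu) ** 2
    return lhs, rhs


def average_sample_variance(mu, sigma, n, n_samples=20000, seed=0):
    """Average of S^2 (divisor n - 1) over many simulated normal samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        total += statistics.variance(xs)  # statistics.variance divides by n - 1
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(42)
    sample = [rng.gauss(0.0, 2.0) for _ in range(5)]
    lhs, rhs = decomposition_sides(sample, mu=0.0)
    print("identity holds:", math.isclose(lhs, rhs))
    print("average S^2:", round(average_sample_variance(0.0, 2.0, 5), 3),
          "(population variance is 4)")
```

A simulation of this kind only illustrates the result; it is the algebra above that proves it.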
Note: This is why we choose $n - 1$ rather than $n$ as the divisor in the definition of the sample variance.

Result 3: Suppose that $X$ is the number of successes in $n$ trials so that $X \sim Bi[n, \pi]$. Then the sample proportion $p = X/n$ is an unbiased estimator of $\pi$.

Proof: As $X$ has a binomial distribution, and therefore mean $n\pi$, we have:

$$E[p] = E\left[\frac{X}{n}\right] = \frac{1}{n}E[X] = \frac{1}{n}\, n\pi = \pi.$$

Therefore $p$ is an unbiased estimator of $\pi$.

Definition: The bias of $\hat{\theta}$ is defined as

$$B(\hat{\theta}) = E[\hat{\theta}] - \theta.$$

Example: Find the bias of the estimator

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$

Now, we know that

$$E\left[\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\right] = \sigma^2.$$

So,

$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2\right] = \frac{n-1}{n}\, E\left[\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\right] = \frac{n-1}{n}\,\sigma^2.$$

Therefore,

$$B(\hat{\sigma}^2) = E[\hat{\sigma}^2] - \sigma^2 = \frac{n-1}{n}\,\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n},$$

indicating that $\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$ will, on average, underestimate $\sigma^2$.

Just because an estimator is unbiased, it doesn't necessarily mean that it is a good estimator (it just means that on average it will produce a value that is the true value of $\theta$).

Illustration: Suppose that we have 2 possible estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ so that:

[Figure not reproduced: the sampling distributions of $\hat{\theta}_1$ and $\hat{\theta}_2$.]

Here $\hat{\theta}_1$ is unbiased whereas $\hat{\theta}_2$ is biased. However, in this case we would prefer $\hat{\theta}_2$ to $\hat{\theta}_1$. This is because the observed value of $\hat{\theta}_2$ is likely to be closer to the true value of $\theta$ than $\hat{\theta}_1$ (it has a smaller standard error). So, by choosing $\hat{\theta}_2$ rather than $\hat{\theta}_1$ we are maximising our chance that the observed value of our estimator will be close to the true value of $\theta$.

Ideally we want an estimator with small bias and small standard error.

Example: Suppose that $X_1, \ldots, X_n$, $n > 1$, is a random sample from $N[\mu, \sigma^2]$. Show that $X_1$, the first observation, is an unbiased estimator of $\mu$. If you were given a choice of using $X_1$ or $\bar{X}$ as your estimator for $\mu$, which would you prefer?

Now, $X_1 \sim N[\mu, \sigma^2]$, so $E[X_1] = \mu$. Therefore $X_1$ is an unbiased estimator for $\mu$.

Both $X_1$ and $\bar{X}$ are unbiased estimators, so we'll choose the one with the smallest standard error:

$$\text{s.e.}[X_1] = \text{s.d.}[X_1] = \sigma, \qquad \text{s.e.}[\bar{X}] = \frac{\sigma}{\sqrt{n}}.$$

So as $n > 1$, $\text{s.e.}[\bar{X}] < \text{s.e.}[X_1]$ and so we would prefer to use $\bar{X}$ as an estimator of $\mu$.
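The two examples that close this chapter can be illustrated by simulation. The sketch below is only a rough check under assumed values (a N[10, 4] population, samples of size 5, and an arbitrary seed). It estimates the bias of the variance estimator with divisor n and compares the spread of the first observation with the spread of the sample mean. The function name and the dictionary keys are hypothetical.

```python
import random
import statistics


def compare_estimators(mu=10.0, sigma=2.0, n=5, n_samples=20000, seed=0):
    """Simulate many normal samples and summarise three estimators:
    the divisor-n variance estimator, the first observation, and the mean."""
    rng = random.Random(seed)
    divisor_n_estimates = []
    first_observations = []
    sample_means = []
    for _ in range(n_samples):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        # statistics.pvariance divides by n, matching the biased estimator.
        divisor_n_estimates.append(statistics.pvariance(xs))
        first_observations.append(xs[0])
        sample_means.append(statistics.fmean(xs))
    return {
        "simulated_bias": statistics.fmean(divisor_n_estimates) - sigma ** 2,
        "theoretical_bias": -sigma ** 2 / n,
        "sd_first_observation": statistics.pstdev(first_observations),
        "sd_sample_mean": statistics.pstdev(sample_means),
    }


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name}: {value:.3f}")
```

With these assumed settings the simulated bias should be close to the theoretical value of minus sigma squared over n, and the sample mean should vary noticeably less than a single observation.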
Chapter 6: Interval Estimation

6.1 Introduction

The heights (cm) of a random sample of 12 primary school children of a certain age were as follows:

114, 137, 132, 140, 125, 116, 110, 118, 136, 131, 122, 128.

We might be interested in learning about the mean height, $\mu$, of all children of that age. We know that the sample mean can be used as a point estimate for $\mu$; here $\bar{x} = 125.75$ cm. However, because of sampling variability, the true value of $\mu$ may be quite different from this estimated value. It would be more useful if we could use the data to identify an interval within which we believe the true mean would lie. We call this a confidence interval.

We can show the above data diagrammatically on a dotplot:

[Figure: a dotplot showing the heights of children, axis from 110 to 140 cm. Annotations: it is unlikely that $\mu$ would be at either extreme of the data (if the sample were random); the true value of $\mu$ is likely to be somewhere in the centre of the data.]

In Statistics, the degree of confidence we have that an interval contains the parameter we are trying to estimate is expressed as a percentage. For example, if a 95% confidence interval were produced then we would be 95% confident that the resulting interval would contain the true value of the parameter. Alternatively, we could produce a 99% confidence interval; this would be wider than a 95% confidence interval.

6.1.1 Definitions

Definition: An interval $[T_1, T_2]$ is a $100(1-\alpha)\%$ confidence interval for a parameter $\theta$ if it contains $\theta$ with probability $(1-\alpha)$.

An alternative way of thinking of this is as follows. If the method for deriving, for example, a 95% confidence interval were to be repeated a large number of times then approximately 95% of the intervals produced would contain the true value of $\theta$.

Note: We have to be very careful when talking about confidence intervals. It is not acceptable, for example, to refer to $\theta$ having a given probability of lying in a confidence interval.
This is because, by attaching a probability to $\theta$ lying within the interval, you are creating the impression that $\theta$ is not a fixed quantity. It is the end-points of the confidence interval that are random quantities varying from sample to sample.

6.1.2 Example (continued)

Returning to the simple introductory example about the heights of primary school children, a 95% confidence interval for the population mean is shown in the diagram below.

[Figure: dotplot of heights of children (with 95% t-confidence interval for the mean), axis from 110 to 140 cm, with the interval marked by brackets around $\bar{X}$.]

Later in the chapter (Section 6.3), you will find out how to calculate this interval for yourselves. You will also discover how to find confidence intervals for a population variance and for a population proportion. We start with the most basic situation, namely finding a confidence interval for a population mean when the population variance is known.

6.2 Confidence Intervals for $\mu$ (Known Population Variance)

6.2.1 Confidence intervals when data follow a normal distribution

Background: Consider a random sample $X_1, \ldots, X_n$ drawn from a $N[\mu, \sigma^2]$ distribution, where we assume that the population variance, $\sigma^2$, is known.

Problem: Suppose that we wish to calculate a $100(1-\alpha)\%$ confidence interval for $\mu$. Then we want to find two statistics $T_1$ and $T_2$ such that:

$$P[T_1 \le \mu \le T_2] = 1 - \alpha.$$

Note 1: $T_1$ and $T_2$ are the random variables, not $\mu$.

Note 2: $(1-\alpha)$ is usually taken to be 0.9, 0.95 or 0.99. The higher the value of $(1-\alpha)$, the more confident we are that the confidence interval does in fact contain $\mu$. However, the higher $(1-\alpha)$ is, the wider the interval becomes and therefore the less informative it is about $\mu$'s location. So there exists a trade-off.

Derivation of confidence interval: We know that if $X_1, \ldots, X_n$ are normally distributed then

$$\bar{X} \sim N\left[\mu, \frac{\sigma^2}{n}\right].$$

Thus, by applying the standardisation formula,

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N[0, 1].$$

Therefore,

$$P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha \;\Rightarrow\; P\left(-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.$$

Rearranging further gives:

$$P\left(-\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le -\mu \le -\bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \;\Rightarrow\; P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.$$

We can therefore see that we have the following result:

Result: When we have a sample $X_1, \ldots, X_n$ from a $N[\mu, \sigma^2]$ distribution with known variance $\sigma^2$ then a $100(1-\alpha)\%$ confidence interval for $\mu$ is given by:

$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$

Example: A biologist selects 15 beetles at random from a colony she is studying. The weights of these beetles (in g) are as follows:

5.7, 4.9, 5.3, 5.0, 5.4, 5.1, 5.2, 5.2, 5.3, 5.4, 5.7, 5.1, 5.6, 5.0, 5.3.

Assuming that the weights follow a normal distribution with known population standard deviation 0.2 g, calculate a 95% confidence interval for the population mean weight.

$$\text{Sample mean} = \frac{5.7 + 4.9 + 5.3 + \ldots + 5.3}{15} = \frac{79.2}{15} = 5.28 \text{ g}.$$

From normal percentage point tables, $z_{0.025} = 1.96$. Thus, the 95% confidence interval is

$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 5.28 \pm 1.96 \times \frac{0.2}{\sqrt{15}} = (5.179, 5.381).$$

Exercise: A new drug to lower blood pressure is given to 20 volunteers and their fall in BP is recorded. From previous work the standard deviation of the change in BP is known to be 8 mmHg and the falls are believed to follow a normal distribution. The mean fall in the sample is 6 mmHg. Find a 99% confidence interval for the mean fall in BP.

6.2.2 Confidence intervals when sample size n is large

The assumption that the data follow a normal distribution can be relaxed if the sample size, $n$, is large (rule of thumb, $n > 30$). This is because in such situations the Central Limit Theorem can be applied, ensuring that the sample mean, $\bar{X}$, will approximately be normally distributed. Thus we have the result:

Result: When we have a sample $X_1, \ldots, X_n$ from any distribution with mean $\mu$ and known variance $\sigma^2$ then, if the sample size $n$ is large, a $100(1-\alpha)\%$ confidence interval for $\mu$ is approximately given by:

$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$

Example: Michael is a keen cyclist and rides his bicycle every day. On a random sample of 44 days he averages 18 miles per day. The standard deviation for all days is known to be 5 miles.
Find a 90% confidence interval for his mean daily mileage.

In this example, the sample size is $n = 44$, i.e. large enough for the Central Limit Theorem to apply. From tables, the 5% point for a normal distribution is $z_{0.05} = 1.65$. Therefore, the 90% confidence interval is:

$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 18 \pm 1.65 \times \frac{5}{\sqrt{44}} = (16.76, 19.24).$$

So we are 90% confident that the mean number of miles travelled per day lies in the interval (16.76, 19.24).

Important: It is the interval which varies from sample to sample, not $\mu$. So for example, if we generated a 95% confidence interval for each of 100 different samples, we would expect about 95 of them to contain $\mu$.

6.3 Confidence Interval for $\mu$ (Unknown Population Variance)

In most situations the population variance is not known. In such situations, some amendment is needed to the formulae presented in Section 6.2. We deal first with the case where the sample size, $n$, is small.

6.3.1 Confidence intervals when $\sigma^2$ is unknown and n is small

The formulae presented in the previous section for a confidence interval for $\mu$ are written in terms of the population variance, $\sigma^2$. When this is unknown, the formulae cannot be applied. A confidence interval is instead derived from the following result:

Important result: Suppose $X_1, \ldots, X_n$ is a random sample from a $N[\mu, \sigma^2]$ distribution, where both parameters are unknown. If $S^2$ denotes the sample variance, then

$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}.$$

Proof: (see also Example Sheet 4)

We know that

$$Y = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N[0, 1] \quad \text{and} \quad Z = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Therefore, by definition,

$$\frac{Y}{\sqrt{Z/(n-1)}} \sim t_{n-1}.$$

But,

$$\frac{Y}{\sqrt{Z/(n-1)}} = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{S^2/\sigma^2}} = \frac{\bar{X} - \mu}{S/\sqrt{n}}.$$

Derivation of confidence interval: From the above result, we know that

$$P\left(-t_{n-1,\alpha/2} \le \frac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{n-1,\alpha/2}\right) = 1 - \alpha.$$

We now need to rearrange the inequality so that $\mu$ is in the centre:

$$P\left(-t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \bar{X} - \mu \le t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}\right) = 1 - \alpha$$

$$P\left(-\bar{X} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le -\mu \le -\bar{X} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}\right) = 1 - \alpha$$

$$P\left(\bar{X} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}\right) = 1 - \alpha$$

Thus the upper and lower end-points for the $100(1-\alpha)\%$ confidence interval are given by:

$$\bar{X} \pm t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}.$$

Result: Suppose that $X_1, \ldots, X_n \sim N[\mu, \sigma^2]$ where $\sigma^2$ is unknown. Then a $100(1-\alpha)\%$ confidence interval for $\mu$ is given by:

$$\bar{X} \pm t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}$$

where $S$ is the sample standard deviation.

Comparing this result to that given in the previous section, where $\sigma$ was assumed known, two changes can be clearly seen:

A percentage point from a t distribution is used in place of a normal percentage point;
The population standard deviation $\sigma$ is replaced by the sample standard deviation, $S$.

Example: The numbers of hours spent by 10 randomly chosen computer science students completing their assessed coursework were as follows:

5.5, 1.5, 3.6, 7.2, 2.4, 3.8, 4.0, 1.9, 5.3, 2.7.

Calculate a 99% confidence interval for the mean time spent on the coursework in the population of all students.

Here, the population variance is unknown. So we must begin by finding the sample mean and variance:

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{37.9}{10} = 3.79 \text{ hours}.$$

$$S^2 = \frac{1}{9}\left[\sum_i x_i^2 - \frac{(\sum_i x_i)^2}{10}\right] = \frac{1}{9}\left[172.49 - \frac{37.9^2}{10}\right] = 3.205 \;\Rightarrow\; S = 1.790 \text{ hours}.$$

The appropriate percentage point here comes from a t distribution with 10 – 1 = 9 degrees of freedom: $t_{9,0.005} = 3.250$.

Thus the 99% confidence interval for the population mean is

$$\bar{x} \pm t_{9,0.005}\frac{S}{\sqrt{n}} = 3.79 \pm 3.250 \times \frac{1.79}{\sqrt{10}} = (1.95, 5.63).$$

Note that in producing this confidence interval we need to assume that the data are normally distributed.

Exercise: A tennis player wishes to examine his service performance in a particular match. The speeds (in mph) of 8 randomly selected serves were as follows:

98, 92, 101, 80, 94, 99, 88, 96.

Calculate a 95% confidence interval for this player's mean service speed in this match.

6.3.2 Confidence intervals for $\mu$ when $\sigma^2$ is unknown and n is large

We noted in Chapter 3 that a t distribution looks very much like the standard normal distribution when the degrees of freedom are large. Therefore, in producing a confidence interval for $\mu$ in situations when

i. the population variance is unknown and
ii. the sample size, $n$, is large (e.g. $n > 30$),

we can approximate the percentage point $t_{n-1,\alpha/2}$ that occurs in the formula by $z_{\alpha/2}$. Moreover, when the sample size is large, the assumption that the data follow a normal distribution is less critical (because the Central Limit Theorem can then be applied). We therefore have the following result:

Given a large ($n > 30$) sample $X_1, \ldots, X_n$, drawn from a distribution with mean $\mu$ and (unknown) variance, a $100(1-\alpha)\%$ confidence interval for $\mu$ is (approximately) given by:

$$\bar{X} \pm z_{\alpha/2}\frac{S}{\sqrt{n}}.$$

Example: A nursery is growing a large number of tomato plants. A sample of 45 plants was taken at random and their heights were found. If the sample mean and standard deviation were 5.2 cm and 1.3 cm respectively, calculate a 90% confidence interval for the mean height of the tomato plants in the nursery.

Here the sample size is $n = 45$, so can be considered large. Consequently, we can take our percentage points from the standard normal distribution rather than from $t_{44}$ (which incidentally does not appear in tables). The population standard deviation is unknown, so we use the sample standard deviation as an estimate. From statistical tables, the appropriate 5% point is $z_{0.05} = 1.645$. Therefore, the 90% confidence interval is:

$$\bar{X} \pm z_{0.05}\frac{S}{\sqrt{n}} = 5.2 \pm 1.645 \times \frac{1.3}{\sqrt{45}} = (4.88, 5.52).$$

Exercise: 75 randomly selected smokers were asked how many cigarettes they had smoked the previous day. The sample mean and variance were 20 and 196 respectively. Calculate a 95% confidence interval for the population mean.

6.4 Confidence Intervals for the Population Variance

Let us assume that we have a random sample, $X_1, \ldots, X_n$, drawn from a normal distribution, $N[\mu, \sigma^2]$, with both parameters unknown. In the previous section, we learnt how to produce a confidence interval for $\mu$. We now look at producing a $100(1-\alpha)\%$ confidence interval for $\sigma^2$.

Idea: We need to find $T_1$ and $T_2$ such that $P(T_1 \le \sigma^2 \le T_2) = 1 - \alpha$. These values can be obtained by making use of the sampling distribution of $S^2$.
Derivation of confidence interval:

We know that

$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Therefore,

$$P\left(\chi^2_{n-1,1-\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{n-1,\alpha/2}\right) = 1 - \alpha \;\Rightarrow\; P\left(\frac{1}{\chi^2_{n-1,\alpha/2}} \le \frac{\sigma^2}{(n-1)S^2} \le \frac{1}{\chi^2_{n-1,1-\alpha/2}}\right) = 1 - \alpha.$$

Multiplying throughout by $(n-1)S^2$ gives:

$$P\left(\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}\right) = 1 - \alpha.$$

Here $\chi^2_{n-1,\alpha/2}$ is the upper percentage point and $\chi^2_{n-1,1-\alpha/2}$ is the lower percentage point.

We therefore have the following result:

Result: Given a random sample $X_1, \ldots, X_n$ from $N[\mu, \sigma^2]$, a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is given by

$$\left(\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}},\; \frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}\right).$$

Example: The blood cholesterol levels in a sample of 11 people are as follows:

270, 256, 330, 324, 291, 279, 329, 344, 308, 297, 310.

Calculate 95% confidence intervals for the population mean and standard deviation.

We first need to calculate the sample mean and variance:

$$\bar{x} = \frac{270 + \ldots + 310}{11} = \frac{3338}{11} = 303.45.$$

$$S^2 = \frac{1}{10}\left[\sum_i x_i^2 - \frac{(\sum_i x_i)^2}{11}\right] = \frac{1}{10}\left[1020584 - \frac{3338^2}{11}\right] = 765.273 \;\Rightarrow\; S = 27.66.$$

Confidence intervals for $\mu$ and $\sigma$ can only be produced if the data are normally distributed. So we need to make this assumption.

A 95% confidence interval for $\mu$, using $t_{10,0.025} = 2.228$, is then

$$\bar{X} \pm t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} = 303.45 \pm 2.228 \times \frac{27.66}{\sqrt{11}} = (284.87, 322.03).$$

The upper and lower 2.5% points from a $\chi^2_{10}$ distribution are 20.48 and 3.247 respectively. The 95% confidence interval for $\sigma^2$ is therefore

$$\left(\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}},\; \frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}\right) = \left(\frac{10 \times 765.273}{20.48},\; \frac{10 \times 765.273}{3.247}\right) = (373.67, 2356.86).$$

Taking square roots of the upper and lower end-points results in the following confidence interval for $\sigma$: (19.33, 48.55).

Exercises:

1. A machine puts rice into 400g packets and the standard deviation over a long period is 2.5g. A new machine is evaluated by means of a random sample of 21 packets whose sample standard deviation is 3.2g. Find a 90% confidence interval for the standard deviation of the new machine.

2. The speeds in mph of 15 randomly selected cars passing a police speed checkpoint were as follows:

27, 31, 34, 30, 32, 38, 26, 30, 32, 34, 31, 29, 41, 35, 33.

Calculate a 99% confidence interval for the population mean and variance.
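The cholesterol example in this section can be checked with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the t and chi-squared percentage points are passed in by hand from tables (2.228, 20.48 and 3.247, as quoted in the example) because the Python standard library does not provide them. Normality of the data is assumed, as in the text.

```python
import math
import statistics


def mean_interval_t(xs, t_point):
    """Confidence interval for the mean: xbar +/- t * S / sqrt(n).
    t_point is the tabulated value t_{n-1, alpha/2}."""
    n = len(xs)
    xbar = statistics.fmean(xs)
    half_width = t_point * statistics.stdev(xs) / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def variance_interval(xs, chi2_upper, chi2_lower):
    """Confidence interval for the variance:
    ((n-1) S^2 / chi2_upper, (n-1) S^2 / chi2_lower),
    where chi2_upper and chi2_lower are the tabulated upper and lower
    alpha/2 points of the chi-squared distribution on n - 1 d.f."""
    n = len(xs)
    s2 = statistics.variance(xs)  # divisor n - 1
    return (n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower


if __name__ == "__main__":
    cholesterol = [270, 256, 330, 324, 291, 279, 329, 344, 308, 297, 310]
    print("mean:", mean_interval_t(cholesterol, t_point=2.228))
    low, high = variance_interval(cholesterol, chi2_upper=20.48, chi2_lower=3.247)
    print("variance:", (low, high))
    print("standard deviation:", (math.sqrt(low), math.sqrt(high)))
```

Because the code works with unrounded intermediate values, its end-points may differ from the hand calculation in the last decimal place.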
6.5 Confidence Intervals for a Population Proportion (with large n)

We often want to make inferences about a proportion. For example, we might want to estimate the proportion of people who currently support the Conservative Party.

Suppose we denote the population proportion by $\pi$. Then following our previous method, to find a confidence interval for $\pi$, we need to find $T_1$ and $T_2$ such that $P(T_1 \le \pi \le T_2) = 1 - \alpha$.

To do this we make use of the sampling distribution of the sample proportion, $p$:

$$p \sim N\left[\pi, \frac{\pi(1-\pi)}{n}\right].$$

Recall that this result was appropriate if the sample size is large ($n\pi \ge 5$ and $n(1-\pi) \ge 5$). Standardising gives the approximate result:

$$\frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} \sim N[0, 1].$$

Derivation of confidence interval: Using the above result, we can write

$$P\left(-z_{\alpha/2} \le \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} \le z_{\alpha/2}\right) = 1 - \alpha \;\Rightarrow\; P\left(-z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} \le p - \pi \le z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) = 1 - \alpha.$$

Rearranging so that $\pi$ is alone in the centre of this inequality gives:

$$P\left(-p - z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} \le -\pi \le -p + z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) = 1 - \alpha$$

$$P\left(p - z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} \le \pi \le p + z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) = 1 - \alpha$$

But the limits of this confidence interval are functions of $\pi$, which is unknown. So to calculate the limits, $\pi$ must be estimated. As long as the sample size is large, the value of $\sqrt{p(1-p)/n}$ should be close to $\sqrt{\pi(1-\pi)/n}$ and can be used in its place. This result thus follows:

Result: A $100(1-\alpha)\%$ confidence interval for $\pi$ when $n$ is large ($np \ge 5$ and $n(1-p) \ge 5$) is given by

$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}.$$

Example: 120 university students were randomly selected. Of these, 11 had taken one or more years off between leaving school and entering university. Calculate a 95% confidence interval for $\pi$, the proportion of all students entering university on this basis.

From the question, the sample proportion is $p = 11/120 = 0.0917$. As $np = 11 \ge 5$ and $n(1-p) = 109 \ge 5$, the confidence interval for $\pi$ can be calculated as:

$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} = 0.0917 \pm 1.96\sqrt{\frac{0.0917 \times 0.9083}{120}} = (0.040, 0.143).$$

Exercise: The paper "Worksite smoking cessation programs: a potential for national impact" (Amer. J. of Public Health, 1983, pp 1395-96) investigated the effectiveness of smoking cessation programs at work.
The program tested involved group meetings and monetary incentives for attending meetings and for not smoking. Of those who participated in the experiment, 91% successfully stopped smoking and were still abstinent 6 months later. Suppose a representative sample of 70 people were involved in the experiment. Let $\pi$ denote the success rate of the program ($\pi$ = population proportion of participants who would still be non-smokers 6 months after completing the program). Find a 99% confidence interval for $\pi$.

6.6 Choosing the Sample Size

All the confidence intervals we've looked at depend on the sample size $n$. For example, the confidence interval for $\mu$ with known $\sigma$ is

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$

As $n$ gets larger, the width of the confidence interval decreases, which means that the interval becomes more informative about the unknown parameter.

Example: There is interest in learning about the mean I.Q. of students at UKC. If the standard deviation of I.Q.s can be assumed to be 20, find the sample size that will ensure that the width of a 99% confidence interval for $\mu$ is less than 4 units.

As the population s.d. is known, the appropriate formula for the confidence interval for $\mu$ is:

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.$$

The width of this is $2 z_{\alpha/2}\,\sigma/\sqrt{n}$. Because the appropriate percentage point is $z_{0.005} = 2.5758$, to find $n$ we need to solve

$$\frac{2 \times 2.5758 \times 20}{\sqrt{n}} < 4 \;\Rightarrow\; \frac{103.032}{\sqrt{n}} < 4 \;\Rightarrow\; \sqrt{n} > 25.758 \;\Rightarrow\; n > 663.5.$$

We would therefore need at least 664 students in the sample.

6.7 One-sided Confidence Intervals

Up until now we have been calculating two-sided confidence intervals for parameters. In other words, we have been setting a lower and upper confidence limit on the parameter in question.

For example, consider again the following data relating to the heights (in cm) of primary school children of a certain age:

114, 137, 132, 140, 125, 116, 110, 118, 136, 131, 122, 128.

Assuming normality, we can construct a 95% confidence interval for the population mean, $\mu$. This is shown in the diagram below.
(Note that because the population variance is unknown, the confidence interval must be calculated using a percentage point from a t distribution.)

[Figure: dotplot of Height (cm) (with 95% t-confidence interval for the mean), axis from 110 to 140 cm, with the interval marked by brackets around $\bar{X}$.]

We can therefore express 95% confidence that the mean lies between the indicated lower and upper limits (i.e. there is a 5% probability that the interval will not contain $\mu$). However, we might only be interested in finding a lower (or upper) limit for $\mu$.

[Figure: a dotplot showing children's heights and a one-sided confidence interval, axis from 110 to 140 cm, with the lower limit marked at 120.65.]

In producing the one-sided confidence interval above, we are putting just a lower limit on $\mu$. Here, a 95% one-sided confidence interval has been calculated as $[120.65, \infty)$. Note that there is a 5% probability that the lower limit will be greater than $\mu$.

Example: Consider again the blood pressure example that was presented in Section 6.2.1. Here there were 20 volunteers in the sample. The population standard deviation was known to be 8 mmHg and the sample mean was 6 mmHg.

Suppose all that matters is the least possible fall in blood pressure. We would then calculate a one-sided confidence interval. For example, for a one-sided 99% confidence interval the lower limit would be:

$$\bar{X} - z_{0.01}\frac{\sigma}{\sqrt{n}} = 6 - 2.3263 \times \frac{8}{\sqrt{20}} = 1.84 \text{ mmHg}.$$

The 99% one-sided confidence interval therefore is $[1.84, \infty)$.

Exercise: Consider again the following cholesterol data taken from 11 volunteers:

270, 256, 330, 324, 291, 279, 329, 344, 308, 297, 310.

Suppose that we are only interested in an upper limit for the population mean. Calculate this one-sided interval with a 95% confidence coefficient.

Chapter 7: Hypothesis testing

7.1 Introduction

Introductory scenario

Proponents of a particular dieting regime claim that people will, on average, lose 14 pounds if the plan is followed for six weeks. A nutritionist wishes to test this claim; she suspects it to be false.
In order to test the plausibility of the claim (or hypothesis) some data are needed. Relevant data here would be the weight losses from a sample of, say, 50 people who followed the diet over 6 weeks. We then need to assess how consistent the hypothesis is with the observed data.

It is important to note that we cannot absolutely prove or disprove a hypothesis, only gather evidence for or against it.

Other examples:

a manufacturer might claim that the mean lifetime of a brand of battery is 110 hours;
a political party might claim that the proportion of voters who will vote for them in the next general election is 45%.

In each case, a sample could be taken and the sample values used to determine whether or not the hypothesised population value is reasonable.

To introduce hypothesis testing we shall use a specific example of testing hypotheses about a mean when the population variance ($\sigma^2$) is known.

7.2 Testing hypotheses for $\mu$ (known population variance)

7.2.1 Terminology

In hypothesis testing, we wish to choose between two competing hypotheses. These are called the null hypothesis (denoted $H_0$) and alternative hypothesis (denoted $H_1$). Generally, the null hypothesis is the one that we suspect could be false and the alternative hypothesis is the one that we usually hope to be true. We illustrate this terminology through two examples.

Example 1: An IQ test is designed so that the average score in the population as a whole is 100 with s.d. 20, and so that the scores follow a normal distribution. A random sample of 25 children at a school under investigation takes the test. The sample mean score is $\bar{x} = 108.3$. Is there any evidence that this school has children with an IQ different from the general population?

Let $\mu$ denote the mean I.Q. for all children at that school. The null and alternative hypotheses would then be as follows:

$H_0: \mu = 100$ (i.e. the school has the same mean IQ as the whole population);
$H_1: \mu \ne 100$ (i.e. the school has a different mean IQ from the general population).

The null hypothesis here is the cautious hypothesis which we initially assume to be true, i.e. without any sample data, given any group of children we would initially assume that their average IQ is the same as the general population.

Note that we here have a two-sided alternative hypothesis as we are testing whether the mean IQ differs from the hypothesised value of 100. If we were looking to test whether the mean IQ was greater (or smaller) than this value, we would need to specify a one-sided alternative hypothesis (see later).

Example 2: Researchers have postulated that, due to differences in diet, Japanese children have a different mean blood cholesterol level compared with British children. Suppose that the mean level for British children is known to be 170. Let $\mu$ represent the mean blood cholesterol level for Japanese children. What hypotheses should the researchers test?

The null hypothesis represents what we initially assume to be true. So without any sample information about Japanese children we'd initially assume that $\mu = 170$ and so $H_0: \mu = 170$.

The alternative hypothesis is that the cholesterol level of Japanese children differs from that of British children and so $H_1: \mu \ne 170$.

7.2.2 General formulation (two-sided test)

In general suppose that we have the hypotheses: $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$.

Background: To test $H_0$ against $H_1$, we initially assume that $H_0$ is true. We then see how plausible data at least as extreme as our observed data would be under this assumption. So if the probability of observing our sample result is small under $H_0$'s distribution, then this means that we are unlikely to have observed what we actually have if $H_0$ were true. In this case it therefore looks like $H_0$ is not true. On the other hand, if the probability of observing our sample result is large, then we could plausibly have observed the sample we in fact got, and therefore $H_0$ could be true.
The sample mean, $\bar{X}$, is a good estimator for $\mu$, so it makes sense to use our observed sample mean, $\bar{x}$, to test hypotheses about $\mu$.

Theory: Consider the case where a sample $X_1, \ldots, X_n$ is obtained from a $N[\mu, \sigma^2]$ distribution (where $\sigma^2$ is assumed to be known). The hypotheses we are interested in testing are $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$. We know from Chapter 4 that
$$\bar{X} \sim N\left[\mu, \frac{\sigma^2}{n}\right].$$
If the null hypothesis is true, then
$$\bar{X} \sim N\left[\mu_0, \frac{\sigma^2}{n}\right] \quad \Rightarrow \quad Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N[0, 1].$$
We will use the distribution of Z to decide whether our sample of data (summarised by the sample mean) could plausibly have been obtained from a normal distribution with mean $\mu_0$. Z is referred to as the test statistic. The observed value of the test statistic is:
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.$$
If $H_0$ is true, a sampled value z of Z will have come from N[0, 1]. In this case we are most likely to observe a value of z which lies in the main body of the distribution (as these values would be the most probable values of Z to observe). Therefore if we observe z in the main body of N[0, 1], this sampled value would support $H_0$. We would then have no evidence to reject $H_0$, or equivalently we could say that we "accept" $H_0$ to be true. Note that this is not the same thing as saying that $H_0$ is true, only that we have no evidence to say that it is false.

Suppose now that the observed value z of Z lies in the tails of the standard normal distribution. Such a value would have been unlikely to occur if $H_0$ were true. So if we observe z outside the main body of N[0, 1], then this sample value would not support $H_0$. We would therefore reject $H_0$. The range of values of the test statistic that would lead us to reject the null hypothesis is called the critical region.

The next problem, then, is to decide how to specify the exact values of our critical region. In carrying out a hypothesis test there are two types of error we can make.
Type 1 error. This is when $H_0$ is rejected when in fact it is true.
Type 2 error.
This is when you fail to reject $H_0$, when in fact it is false.

P(type 1 error) is usually denoted $\alpha$ and we call it the size of the test. We can use this value to find a suitable critical region for the test. A type 1 error is usually thought to be the more serious and we therefore define our test so that we have a suitably low value of $\alpha$. The values of $\alpha$ that are acceptable will vary from situation to situation. The most usual values are 0.1, 0.05, 0.01 or 0.001.

Now $\alpha$ = P(reject $H_0$ when it is true) = P(observe z in the tails of N[0, 1]). So by setting a value for $\alpha$, we can find our critical region. For example, if $\alpha = 0.05$, then we will reject $H_0$ if we observe $z < -z_{0.025}$ or $z > z_{0.025}$, i.e. if $z < -1.96$ or $z > 1.96$. So our critical region is $\{z : z < -1.96 \text{ or } z > 1.96\}$. We then say that we have a test at the 5% significance level.

To test between the hypotheses $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$ when $X_1, \ldots, X_n$ is a random sample from a normal distribution with known population variance:
use the test statistic $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$;
reject $H_0$ at the $100\alpha\%$ level if $|z| > z_{\alpha/2}$.

Example: Consider again the IQ example. Here we had a random sample of 25 children from a particular primary school. The mean IQ in the sample was 108.3. The hypotheses of interest were $H_0: \mu = 100$ versus $H_1: \mu \neq 100$. The population standard deviation is known to be 20 and we can assume that IQs follow a normal distribution. The test statistic in this situation is given by:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}.$$
The observed value of this test statistic is then:
$$z = \frac{108.3 - 100}{20/\sqrt{25}} = \frac{8.3}{20/5} = 2.075.$$
For a 5% test, the critical values for the test statistic are $\pm z_{0.025} = \pm 1.96$. As the observed value of the test statistic lies in the critical region, we can reject $H_0$ at the 5% significance level. For a 1% test, the critical values would be $\pm z_{0.005} = \pm 2.5758$. Since $z = 2.075 < 2.5758$, we would not be able to reject $H_0$ at the 1% significance level.
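The IQ calculation above can be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later and its standard-library `statistics.NormalDist`; the function name `z_test_two_sided` is illustrative and does not come from these notes.

```python
from math import sqrt
from statistics import NormalDist


def z_test_two_sided(xbar, mu0, sigma, n, alpha):
    """Two-sided z-test for a mean when the population s.d. is known.

    Returns the observed statistic z, the critical value z_{alpha/2},
    and whether H0: mu = mu0 is rejected at the 100*alpha% level.
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of N[0, 1]
    return z, z_crit, abs(z) > z_crit


# IQ example: n = 25, sample mean 108.3, mu0 = 100, sigma = 20
for alpha in (0.05, 0.01):
    z, z_crit, reject = z_test_two_sided(108.3, 100, 20, 25, alpha)
    print(f"alpha = {alpha}: z = {z:.3f}, critical value = {z_crit:.4f}, reject H0: {reject}")
```

The critical values are computed from the inverse normal distribution function rather than read from tables, so they may differ from tabulated values in the last decimal place.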
We interpret these test results as follows: the data provide some evidence (but not strong evidence) to suggest that the mean IQ of children in this primary school differs from the general population.

Important notes:
Always state the level of significance you are using when rejecting or accepting $H_0$.
Rejection of $H_0$ means there is definite evidence to reject $H_0$.
"Acceptance" of $H_0$ means that there is insufficient evidence to reject $H_0$, i.e. $H_0$ may still be untrue, but we do not have enough data to reject it. This is regularly misunderstood.
Significance tests are commonly conducted at the following levels: 5%, 1% and 0.1%. These significance levels provide varying degrees of evidence against $H_0$:
5% level: some evidence against $H_0$;
1% level: strong evidence against $H_0$;
0.1% level: very strong evidence against $H_0$.

Exercise: A machine is designed to produce bolts with a mean length of 25 mm. The standard deviation of the length of the bolts is known to be 0.23 mm. After a routine service, a random sample of bolts was measured and the lengths (in mm) were found to be:
25.5 25.3 25.1 25.6 24.9 25.0 25.4 25.3 25.0 24.8 25.2 25.4.
Test to see whether the servicing of the machine has altered the mean length of the bolts it produces. Assume that the standard deviation is unchanged and that the data can be assumed to follow a normal distribution.

Note: If the sample size is large, the assumption that the data are normally distributed is less critical. This is because the central limit theorem ensures that $\bar{X} \sim N\left[\mu, \sigma^2/n\right]$ (approximately) whatever the distribution of $X_1, \ldots, X_n$ when n is large.

Example: The manager of a telesales department claims that the average time that an operator spends talking to a potential client is 70 seconds. The managing director of the company doubts this claim and times a random sample of 40 telephone calls. The sample mean was 62 seconds.
If the population standard deviation is known to be 45 seconds, carry out a hypothesis test at the 5% significance level.

If $\mu$ denotes the population mean call length, the hypotheses are: $H_0: \mu = 70$ versus $H_1: \mu \neq 70$. The sample size here is large (n > 30) and so we do not need to assume that the call times follow a normal distribution (the Central Limit Theorem ensures that the distribution of $\bar{X}$ is roughly normal). The test statistic is
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$
giving an observed value
$$z = \frac{62 - 70}{45/\sqrt{40}} = -1.124.$$
For a 5% test, the critical values would be $\pm z_{0.025} = \pm 1.96$. Since $|z| = 1.124 < 1.96$, there is no evidence to reject $H_0$ at this level, i.e. it is plausible that the mean call length is 70 seconds.

7.1.3 Link between hypothesis tests and confidence intervals

Consider our usual hypotheses: $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$.

We will be able to reject the null hypothesis at the 5% significance level if a 95% confidence interval for $\mu$ excludes $\mu_0$.
[Diagram: a 95% C.I. for $\mu$ drawn on a number line, with $\mu_0$ lying outside the interval.]
Interpretation: Reject the null hypothesis at the 5% level.

If a 95% confidence interval for $\mu$ includes $\mu_0$ then we do not have sufficient evidence at the 5% level to reject the null hypothesis.
[Diagram: a 95% C.I. for $\mu$ drawn on a number line, with $\mu_0$ lying inside the interval.]
Interpretation: "Accept" the null hypothesis at the 5% level ($\mu_0$ is a plausible value for the population mean).

In general, we reject $H_0$ at the $100\alpha\%$ level if and only if a $100(1 - \alpha)\%$ C.I. for $\mu$ excludes $\mu_0$; equivalently, we "accept" $H_0$ if and only if the interval includes $\mu_0$.

7.1.4 p-values

Specifying the size of a test, together with the conclusion about whether the result was statistically significant at that level, is one way in which a hypothesis test can be carried out. A more informative way of giving the strength of evidence against a null hypothesis is to calculate a p-value. The p-value gives the exact observed significance of the data, i.e. it specifies the probability of observing a result at least as extreme as our sample result given that $H_0$ is true. The p-value is often simply denoted by p.

Example (IQ example continued): In the IQ example we observed z = 2.075.
To calculate the p-value we need to calculate the probability of observing a result which is at least as extreme as this:
$$p = P(Z > 2.075 \text{ or } Z < -2.075) = P(Z > 2.075) + P(Z < -2.075) = 2\,(1 - \Phi(2.075)) = 2 \times 0.019 = 0.038.$$
The observed level of significance is therefore 0.038. This value is consistent with our earlier conclusions: we can reject the null hypothesis at the 5% level but not at the 1% level.

7.1.5 One-sided tests

Example (continued): Consider again the earlier example concerning whether Japanese children have a different mean blood cholesterol level than British children. Because a Japanese diet has less saturated fat than a British diet, researchers might postulate that the mean cholesterol level for Japanese children is in fact lower than that for British children (whose mean level is 170). They may then want to test this via a hypothesis test.

Once again the null hypothesis represents what we initially assume to be true and so again we'd set $H_0: \mu = 170$. However, the alternative hypothesis is now that the cholesterol level of Japanese children is less than that of British children and so $H_1: \mu < 170$.

To test these hypotheses, we'll reject $H_0$ in favour of $H_1$ only if we observe small values of z. We wouldn't reject $H_0$ if we observe large values of z this time, because large values of z are now more consistent with $H_0$ than $H_1$. We'll therefore reject $H_0$ only if we observe z in the lower tail of N[0, 1]. As we are only rejecting $H_0$ if z falls in one of the tails of the distribution, we call this a one-tailed test.

Note: A one-tailed test is appropriate only when it is known that deviations from the null hypothesis will be in a particular direction.

Example: The average mark in an A-level examination paper has traditionally been 58%. After a change in the syllabus, it is suspected that the A-level paper will now be easier. The marks of 10 randomly chosen candidates sitting the new syllabus are as follows: 64, 67, 35, 46, 78, 59, 53, 84, 60, 56.
If the population variance is known to be 225, perform a hypothesis test to see whether marks are now significantly higher than before.

The hypotheses we wish to test are as follows: $H_0: \mu = 58$ versus $H_1: \mu > 58$. To carry out this (one-sided) test, we need to assume that the data are normally distributed. The sample mean is:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{602}{10} = 60.2.$$
Therefore, the observed value of the test statistic is
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{60.2 - 58}{\sqrt{225/10}} = 0.464.$$
For a 5% test, we would reject the null hypothesis if $z > z_{0.05} = 1.6449$. The conclusion then must be to "accept" the null hypothesis at this level. The data provide no evidence to support the view that the examination marks are on average higher than before the syllabus change.

Incidentally, the p-value associated with this test can be found as follows:
$$p = P(Z > 0.464) = 1 - \Phi(0.464) \approx 1 - 0.6772 = 0.3228$$
(approximately, using the tabulated value $\Phi(0.46) = 0.6772$). The alternative hypothesis is one-sided, so we find the probability only of larger values than we observed.

Exercise: Suppose that the mean systolic blood pressure for white males aged 35-44 is 127.2. A random sample of 13 diabetic males aged 35-44 was taken and their systolic blood pressure was measured. The results are given below.
119.2, 130.2, 134.4, 120.1, 137.6, 128.0, 136.9, 129.1, 130.6, 127.9, 136.8, 135.4, 142.0.
Suppose that you are told that the standard deviation of systolic blood pressure for white males aged 35-44 is 6.726 and that it can be assumed that the data roughly follow a normal distribution. Investigate whether there is evidence to suggest that the systolic blood pressure is
i. different
ii. higher
for diabetic 35-44 year old males than for the general population. Calculate the p-value in each case.

7.1.6 Calculating the probability of type 1 and 2 errors

Example: A coin is tossed 7 times. Suppose that we want to test the hypotheses: H0: the coin is fair versus H1: the coin is biased in favour of heads. A test is proposed which rejects H0 if 6 or more heads are observed.
a) What is the probability of a type 1 error?
b) What is the probability of a type 2 error if the coin is in fact biased so that P(heads) = 0.6?

Solution:
a) P(type 1 error) = P(reject H0 | H0 true) = P(6 or more heads | coin fair)
$$= \binom{7}{6}\left(\frac{1}{2}\right)^{6}\left(\frac{1}{2}\right) + \left(\frac{1}{2}\right)^{7} = 0.0547 + 0.0078 = 0.0625.$$
b) P(type 2 error) = P(accept H0 | P(heads) = 0.6) = 1 - P(reject H0 | P(heads) = 0.6) = 1 - P(6 or more heads | P(heads) = 0.6)
$$= 1 - \left[\binom{7}{6}\,0.6^{6} \times 0.4 + 0.6^{7}\right] = 1 - [0.131 + 0.028] = 0.841.$$

7.1.7 Power function

Definition: The power function of a test, which we'll denote $\pi(\mu)$, is defined as
$$\pi(\mu) = P(\text{reject } H_0 \mid \mu).$$
So for each value of $\mu$, we will have a different value for the power of the test.

Example: Consider the one-tailed examination marks example again. For a 5% test we reject H0 when z > 1.6449, or equivalently when
$$\frac{\bar{x} - 58}{\sqrt{225/10}} > 1.6449.$$
So, for example, the power for $\mu = 62$ is calculated as follows:
$$\pi(62) = P(Z > 1.6449 \mid \mu = 62) = P\left(\frac{\bar{X} - 58}{\sqrt{225/10}} > 1.6449 \,\Big|\, \mu = 62\right) = P(\bar{X} > 65.80 \mid \mu = 62).$$
Now, if $\mu = 62$, then
$$\bar{X} \sim N\left[62, \frac{225}{10}\right] \quad \Rightarrow \quad \frac{\bar{X} - 62}{\sqrt{22.5}} \sim N[0, 1].$$
Therefore, writing Z for a standard normal variable,
$$P(\bar{X} > 65.80 \mid \mu = 62) = P\left(Z > \frac{65.80 - 62}{\sqrt{22.5}}\right) = P(Z > 0.80) = 1 - 0.7881 = 0.2119$$
using standard normal tables.

Note: We will want a test to have large power for values of $\mu$ in H1 and small power for values of $\mu$ in H0, i.e. we want to maximise the chance of coming to the correct conclusion.

Notice that P(type 1 error) = P(Reject H0 | H0 true) = P(Reject H0 | $\mu = \mu_0$) = $\pi(\mu_0)$, and when constructing our test we have already set this value to be suitably small. Further, if our hypotheses are simply $H_0: \mu = \mu_0$ versus $H_1: \mu = \mu_1$, then
$$\pi(\mu_1) = P(\text{Reject } H_0 \mid \mu = \mu_1) = 1 - P(\text{Accept } H_0 \mid \mu = \mu_1) = 1 - P(\text{Accept } H_0 \mid H_0 \text{ false}) = 1 - P(\text{type 2 error}).$$
So in this case, if we have a large power, then we have a small probability of making a type 2 error.

7.2 Hypothesis tests for $\mu$ (unknown population variance)

Recall that when finding a confidence interval for $\mu$ when $\sigma^2$ is unknown we made use of the result:
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$
where S is the sample standard deviation. We will use this result again in hypothesis testing.
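The error probabilities for the coin-tossing test and the power calculation for the examination marks example (Sections 7.1.6 and 7.1.7) can be reproduced numerically. This is a minimal sketch, assuming Python 3.8 or later with only the standard library; the helper names `prob_at_least` and `power_upper_tail` are illustrative and are not part of these notes.

```python
from math import comb, sqrt
from statistics import NormalDist


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Coin example: reject H0 (fair coin) if 6 or more heads in 7 tosses.
type1_error = prob_at_least(6, 7, 0.5)      # P(reject H0 | coin fair)
type2_error = 1 - prob_at_least(6, 7, 0.6)  # P(accept H0 | P(heads) = 0.6)


def power_upper_tail(mu, mu0, sigma2, n, alpha):
    """Power at true mean mu of the one-sided z-test of H0: mu = mu0 vs H1: mu > mu0."""
    se = sqrt(sigma2 / n)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # reject H0 when xbar > cutoff
    return 1 - NormalDist(mu, se).cdf(cutoff)


# Examination marks example: mu0 = 58, variance 225, n = 10, 5% test.
power_62 = power_upper_tail(62, 58, 225, 10, 0.05)

print(f"P(type 1 error) = {type1_error:.4f}")
print(f"P(type 2 error) = {type2_error:.4f}")
print(f"power at mu = 62: {power_62:.4f}")
```

The power computed this way may differ slightly from the hand calculation in the third decimal place, because the hand calculation rounds the standardised value to 0.80 before using tables. Evaluating `power_upper_tail` at `mu = mu0` returns the size of the test, which matches the observation that the power at $\mu_0$ equals P(type 1 error).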
7.2.1 One sample t-test

Consider the situation where we have a random sample of observations, $X_1, \ldots, X_n$, drawn from a normal distribution with unknown mean and variance. We wish to use these data to compare the following hypotheses about the population mean $\mu$: $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$. The relevant test statistic would now be
$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$
which we know follows a t distribution with n - 1 degrees of freedom if the null hypothesis is true.

As in the previous section, if H0 is true then we would expect to observe values of T in the main body of the $t_{n-1}$ distribution. On the other hand, if H0 is not true, then we might expect to observe t in the tails of this t-distribution. The critical values that we use as cut-off points between accepting and rejecting the null hypothesis are the upper and lower $100(\alpha/2)\%$ points of the $t_{n-1}$ distribution. Hence we reject H0 if we observe $t < -t_{n-1,\alpha/2}$ or $t > t_{n-1,\alpha/2}$.

One-sample t-test: To test the hypotheses $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$ when $X_1, \ldots, X_n$ follow a normal distribution with unknown variance:
use the test statistic $T = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$;
reject H0 at the $100\alpha\%$ level if $|t| > t_{n-1,\alpha/2}$.

This test can easily be adjusted if the alternative hypothesis is one-tailed. For example, if H1 took the form $H_1: \mu > \mu_0$, then we reject the null hypothesis if $t > t_{n-1,\alpha}$.

Note: In performing a one-sample t-test we assume that the data are independently distributed as a normal distribution. This assumption is less critical if the sample size is large (see later).

Example: Ten randomly selected 'pints' pulled from a campus bar are measured accurately. The amount of beer (fl.oz) in these 'pints' was as follows:
19.96, 19.97, 19.94, 20.01, 19.99, 19.97, 19.95, 19.97, 20.00, 19.98.
Test between the hypotheses $H_0: \mu = 20$ versus $H_1: \mu \neq 20$. Find the associated p-value.

Here, we begin by finding the sample mean and variance:
$$\bar{x} = \frac{19.96 + \cdots + 19.98}{10} = \frac{199.74}{10} = 19.974;$$
$$S^2 = \frac{1}{n-1}\left[\sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}\right] = \frac{1}{9}\left[3989.611 - \frac{199.74^2}{10}\right] = 0.000471 \quad \Rightarrow \quad S = 0.0217.$$
For testing between the hypotheses $H_0: \mu = 20$ versus $H_1: \mu \neq 20$, we use the following test statistic:
$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}.$$
This test statistic follows a $t_9$ distribution if the null hypothesis is true. Here, its observed value is
$$t = \frac{19.974 - 20}{0.0217/\sqrt{10}} = -3.79.$$
The relevant critical values for different sizes of test are:
5% test: $t_{9,0.025} = 2.262$
1% test: $t_{9,0.005} = 3.250$
0.1% test: $t_{9,0.0005} = 4.781$.
Conclusion: Since $3.250 < |t| = 3.79 < 4.781$, we can reject the null hypothesis at the 1% level (but not at the 0.1% level). There is strong evidence that the average beer contents are not 20 fl.oz. (i.e. a pint). Note that in performing this test it is necessary to assume that the measurements follow a normal distribution.

The p-value associated with this test is:
$$p = P(T < -3.79 \text{ or } T > 3.79) = P(T < -3.79) + P(T > 3.79).$$
But $P(T > 3.79) = 1 - P(T \le 3.79) = 1 - 0.9979 = 0.0021$. So the p-value is $0.0021 \times 2 = 0.0042$ (or 0.42%).

Example: The lengths (in mm) of a sample of 7 beetles, chosen from a particular island, were measured and found to be: 29, 34, 26, 31, 38, 33, 36. The mean length of the beetles on the island is usually 36 mm, but due to recent adverse weather conditions it is believed that their growth may have been stunted. Perform a hypothesis test to assess whether the data provide any evidence to support this view.

We must again assume that the lengths follow a normal distribution. The hypotheses we wish to test are $H_0: \mu = 36$ versus $H_1: \mu < 36$. It can be shown that the sample mean and standard deviation are 32.4286 mm and 4.1173 mm respectively. Therefore the observed test statistic is:
$$t = \frac{32.4286 - 36}{4.1173/\sqrt{7}} = -2.29.$$
Because the alternative hypothesis is one-sided, the 5% critical value is $t_{6,0.05} = 1.943$ and we reject H0 if $t < -1.943$. Since $t = -2.29 < -1.943$, we can reject the null hypothesis at the 5% level (though not at the 2.5% level, because $t_{6,0.025} = 2.447$). There is some evidence to suggest that the beetles' average length has decreased.

Exercise: The mean weight (in kg) of British children of a certain age is 32 kg. A random sample of American children of this same age gave the following set of weights:
38, 34, 35, 43, 47, 40, 31, 39, 37, 42, 36, 35, 29, 38.
Perform a test (stating the necessary distributional assumptions) to assess whether there appears to be a difference in the mean weights of American and British children at this age.

7.2.2 Hypothesis tests for $\mu$ for large samples (unknown population variance)

When the sample size is large (say n > 30) the distribution of the sample mean should be approximately normal whatever the distribution of the original data. We therefore do not need to make the assumption of normality in the one-sample t-test for large n. Further, when the sample size is large, the distribution of the test statistic,
$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}},$$
will be approximately standard normal. [Recall: the t-distribution becomes approximately N[0, 1] as the degrees of freedom increase.]

Example: A machine putting cereal into boxes should be set so that the average content of each box weighs 510 g. The machine is serviced, after which the weight of cereal in a random sample of 38 boxes is checked. The sample mean was 513.4 g and the sample variance was 67.8 g². Test to see if there has been a change to the average content of the boxes.

The hypotheses here are: $H_0: \mu = 510$ versus $H_1: \mu \neq 510$. The observed value of the test statistic is:
$$\frac{\bar{x} - \mu_0}{S/\sqrt{n}} = \frac{513.4 - 510}{\sqrt{67.8/38}} = 2.545.$$
As the sample size is large, the critical points for this test should be approximately those from a standard normal. Thus, $z_{0.025} = 1.96$ and $z_{0.005} = 2.5758$. We can reject the null hypothesis at the 5% level, but there is not quite enough evidence to reject it at the 1% level.

7.3 Hypothesis tests for the population variance

We here assume that $X_1, \ldots, X_n$ follow a normal distribution, $N[\mu, \sigma^2]$, where $\mu$ is unknown. We are now interested in testing hypotheses about $\sigma^2$. When finding a confidence interval for $\sigma^2$ when $\mu$ was unknown, we used the fact that
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$
We will use this fact to define a hypothesis test for $\sigma^2$. Suppose that the null hypothesis is $H_0: \sigma^2 = \sigma_0^2$.
Our test statistic then is
$$Y = \frac{(n-1)S^2}{\sigma_0^2}$$
which has a chi-squared distribution with n - 1 degrees of freedom under H0. Then, if H0 is true we would expect to observe values of Y in the main body of a $\chi^2_{n-1}$ distribution and if H0 is not true, then we might expect to observe Y in the tails of this distribution. So, if we have $H_1: \sigma^2 \neq \sigma_0^2$, then we'll reject H0 if we observe
$$y > \chi^2_{n-1,\alpha/2} \quad \text{or} \quad y < \chi^2_{n-1,1-\alpha/2}.$$
One-sided alternative hypotheses can be tested by using the critical points $\chi^2_{n-1,\alpha}$ or $\chi^2_{n-1,1-\alpha}$, as appropriate.

Result: To test the hypotheses $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 \neq \sigma_0^2$ when $\mu$ is unknown and $X_1, \ldots, X_n$ follow a normal distribution:
use the test statistic $Y = \dfrac{(n-1)S^2}{\sigma_0^2}$;
reject H0 at the $100\alpha\%$ level if $y > \chi^2_{n-1,\alpha/2}$ or $y < \chi^2_{n-1,1-\alpha/2}$.
Adjust for 1-tailed tests accordingly.

Example: Historically it is known that the journey time between 2 points is normally distributed with a standard deviation of 6 minutes. After roadworks a sample of 10 journey times is found to have a sample standard deviation of 5 mins. Is there evidence of a change in the population variance?

We want to test $H_0: \sigma^2 = 36$ versus $H_1: \sigma^2 \neq 36$. We've observed
$$y = \frac{9 \times 25}{36} = 6.25.$$
If H0 is true, $Y \sim \chi^2_{n-1} = \chi^2_9$. For a 5% test, the appropriate critical points are $\chi^2_{9,0.975} = 2.70$ and $\chi^2_{9,0.025} = 19.02$. Our observed test statistic lies between these critical points. Therefore, there is no evidence at the 5% significance level to reject H0, and so no evidence for a change in the population variance.

7.4 Hypothesis tests for a proportion (with large n)

Let the unknown population proportion of interest be denoted by $\theta$. Recall from Section 4.3 that if p is our sample proportion, then for large sample size, n, we have the approximate result:
$$\frac{p - \theta}{\sqrt{\theta(1-\theta)/n}} \sim N[0, 1].$$
Now suppose that we have $H_0: \theta = \theta_0$. To test this we can use the statistic:
$$W = \frac{p - \theta_0}{\sqrt{\theta_0(1-\theta_0)/n}}.$$
If the null hypothesis holds, then W ~ N[0, 1] (approximately) when the sample size is large.
We therefore have the following result:

Result: To test between the hypotheses $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ for large sample sizes,
use the test statistic $W = \dfrac{p - \theta_0}{\sqrt{\theta_0(1-\theta_0)/n}}$;
reject H0 at the $100\alpha\%$ level if $W < -z_{\alpha/2}$ or $W > z_{\alpha/2}$.
We adjust the test accordingly for one-sided alternative hypotheses. For example, if the alternative hypothesis is $H_1: \theta > \theta_0$, then we would reject the null hypothesis only when $W > z_{\alpha}$.

Example: In a survey of 588 doctors, 365 believed that it was sometimes right to agree to hasten a patient's death. Based on this information, would you conclude that more than 60% of all doctors feel that it is sometimes appropriate to help a seriously ill person die? Carry out a test at the 5% level.

Our hypotheses here would be: $H_0: \theta = 0.6$ versus $H_1: \theta > 0.6$. We've observed p = 365/588 = 0.62 and so
$$w = \frac{0.62 - 0.6}{\sqrt{0.6(1 - 0.6)/588}} \approx 0.99.$$
But $z_{0.05} = 1.6449 > 0.99$ and so we cannot reject H0 using a test at the 5% significance level. There is insufficient evidence to suggest that the proportion of doctors who think mercy killings are sometimes appropriate is greater than 0.6.

Additional example: A coin is thrown 140 times resulting in 85 heads. Test whether the data suggest that the coin is biased.

Chapter 8 Two sample problems

8.1 Introduction

There are many situations where we wish to compare the characteristics of two different populations, on the basis of a sample drawn from each.

8.1.1 Introductory example

Daily protein intake (in grams) is measured on a sample of individuals living below the poverty level and another sample living above the poverty level with the results:
Below poverty level: 51.4, 49.7, 72.0, 76.7, 65.8, 55.0, 73.7, 62.1, 79.7, 66.2, 75.8, 65.4, 65.5, 62.0, 73.3
Above poverty level: 86.0, 69.0, 59.7, 80.2, 68.6, 78.1, 98.6, 69.8, 87.7, 77.2.
Given these data we might be interested in seeing whether we can conclude that poverty influences diet.
The sample mean and standard deviation for the protein intakes for each group are as follows:

                       Sample mean           Sample s.d.       Sample size
Below poverty level    $\bar{x}_1 = 66.29$   $S_1 = 9.17$      $n_1 = 15$
Above poverty level    $\bar{x}_2 = 77.49$   $S_2 = 11.34$     $n_2 = 10$

A box-and-whisker plot showing the protein intakes in the two groups is given below:
[Figure: Boxplots of Below and Above (means are indicated by solid circles); vertical axis from 50 to 100.]
This shows that on average individuals above the poverty level seem to have a higher daily protein intake than those living below the poverty level. What we need though is a formal test to see whether there is a significant difference between the two groups.

8.1.2 Notation and preliminary work

Suppose that in general we have two populations and we select a random sample from each:
Sample from population 1: $X_{11}, X_{12}, \ldots, X_{1n_1}$
Sample from population 2: $X_{21}, X_{22}, \ldots, X_{2n_2}$
We will consider the case in which the populations are normally distributed. In this case:
$$X_{1i} \sim N[\mu_1, \sigma_1^2], \quad i = 1, \ldots, n_1; \qquad X_{2j} \sim N[\mu_2, \sigma_2^2], \quad j = 1, \ldots, n_2.$$
We aim to use the sample data to make inferences about the difference in population means $\mu_1 - \mu_2$. To do this we need to examine the sampling distribution of the estimator of $\mu_1 - \mu_2$. Now,
$$\bar{X}_1 \sim N\left[\mu_1, \frac{\sigma_1^2}{n_1}\right] \quad \text{and} \quad \bar{X}_2 \sim N\left[\mu_2, \frac{\sigma_2^2}{n_2}\right].$$
As the two samples are independent, $\bar{X}_1$ and $\bar{X}_2$ are also independent, so
$$\bar{X}_1 - \bar{X}_2 \sim N\left[\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right].$$
This is the sampling distribution of $\bar{X}_1 - \bar{X}_2$ and we can use it to find confidence intervals or carry out hypothesis tests involving $\mu_1 - \mu_2$. Just as before we can distinguish two different cases: when the population variances are known and when they are not.

8.2 Inferences for $\mu_1 - \mu_2$ (variances known)

When $\sigma_1^2$ and $\sigma_2^2$ are known, we can use the sampling distribution of $\bar{X}_1 - \bar{X}_2$, namely
$$\bar{X}_1 - \bar{X}_2 \sim N\left[\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right],$$
to define a hypothesis test or confidence interval relating to $\mu_1 - \mu_2$.

8.2.1 Hypothesis test

Suppose that we wish to test the null hypothesis: $H_0: \mu_1 - \mu_2 = k$.
Usually, we wish to test whether both populations have the same mean value; in this case, k = 0. Using the same reasoning as for one-sample hypothesis testing, we have the following result:

Result: To test between the hypotheses: $H_0: \mu_1 - \mu_2 = k$ versus $H_1: \mu_1 - \mu_2 \neq k$ when
a) $\sigma_1^2$ and $\sigma_2^2$ are known
b) both samples come from a normal distribution
then:
use the test statistic $Z = \dfrac{\bar{X}_1 - \bar{X}_2 - k}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$;
reject H0 at the $100\alpha\%$ level if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$.
Note: If the alternative hypothesis is one-sided, we adjust the rejection criteria in the usual way.

Example: A consumer magazine is interested in testing the time (in hours) that two types of battery last. The following data were obtained:
Type A: 2116, 2347, 2215, 2098, 2156, 2108, 2073, 2205, 2271.
Type B: 2067, 2102, 2090, 2017, 1996, 2114, 2088, 2053.
It is known that the standard deviations for lives of batteries of type A and type B are 90 hours and 45 hours respectively. Hence test the hypothesis that batteries of type A and B have the same mean life. (You may assume that the observations are normally distributed.)

Our hypotheses here are:
$H_0: \mu_1 - \mu_2 = 0$, i.e. $\mu_1 = \mu_2$;
$H_1: \mu_1 - \mu_2 \neq 0$, i.e. $\mu_1 \neq \mu_2$.
The sample means for the two groups of observations can be shown to be $\bar{x}_1 = 2176.6$ and $\bar{x}_2 = 2065.9$. Substituting the values of the (known) population variances into the formula for the test statistic gives:
$$z = \frac{\bar{x}_1 - \bar{x}_2 - k}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{2176.6 - 2065.9 - 0}{\sqrt{\dfrac{90^2}{9} + \dfrac{45^2}{8}}} = 3.26.$$
The critical points should be obtained from a standard normal distribution:
5%: $z_{0.025} = 1.96$
1%: $z_{0.005} = 2.5758$
0.1%: $z_{0.0005} = 3.2905$
We can therefore reject the null hypothesis at the 1% level (and very nearly at the 0.1% level). There is strong evidence to suggest that the mean lives of the two types of battery are different.

8.2.2 Confidence interval

Now,
$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N[0, 1]$$
which means that
$$P\left(-z_{\alpha/2} < \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} < z_{\alpha/2}\right) = 1 - \alpha.$$
So when the population variances are known, just as before we can rearrange this so that we are left with just $\mu_1 - \mu_2$ in the middle of the inequality. The two outer limits will then be our $100(1 - \alpha)\%$ confidence interval for $\mu_1 - \mu_2$. Rearranging the above we get:
$$P\left(\bar{X}_1 - \bar{X}_2 - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < \bar{X}_1 - \bar{X}_2 + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1 - \alpha.$$

Result: The $100(1 - \alpha)\%$ confidence interval for $\mu_1 - \mu_2$ when
i) $\sigma_1^2$ and $\sigma_2^2$ are known
ii) the data follow normal distributions
is given by:
$$\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$

Example (continued): Consider the battery example again. Suppose we want to find a 95% confidence interval for $\mu_1 - \mu_2$. This interval has limits:
$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = 2176.6 - 2065.9 \pm 1.96\sqrt{\frac{90^2}{9} + \frac{45^2}{8}} = (44.14, 177.26).$$
Notice that this interval does not contain 0. This is to be expected as we know that the null hypothesis of equal means can be rejected at the 5% significance level.

The case in which $\sigma_1^2$ and $\sigma_2^2$ are known is unlikely to occur in practice.

8.3 Inferences for $\mu_1 - \mu_2$ (population variances unknown but sample sizes large)

Recall that when we were making inferences in a single sample when the population variance was unknown, we had two different approaches, depending on whether n was large or not. Here we have similar results.

If $n_1$ and $n_2$ are large (rule of thumb: $n_1, n_2 > 30$) then $S_1^2$ will be a good estimator of $\sigma_1^2$ and $S_2^2$ will be a good estimator of $\sigma_2^2$, and then
$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \sim N[0, 1] \quad \text{(approximately)}.$$
We can use this distribution for the statistic Z and directly extend the results for inferences about $\mu_1 - \mu_2$ when the variances are known.

Result: To test between the hypotheses: $H_0: \mu_1 - \mu_2 = k$ versus $H_1: \mu_1 - \mu_2 \neq k$ when
a) $\sigma_1^2$ and $\sigma_2^2$ are unknown
b) both samples come from a normal distribution
c) $n_1$ and $n_2$ are both large (i.e. > 30)
then:
use the test statistic $Z = \dfrac{\bar{X}_1 - \bar{X}_2 - k}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$;
reject H0 at the $100\alpha\%$ level if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$.
Adjust for 1-tailed tests accordingly.
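The battery example of Section 8.2 (test statistic and 95% confidence interval with known variances) can be reproduced as follows. This is a minimal sketch, assuming Python 3.8 or later with only the standard library; the function name `two_sample_z` and its argument names are illustrative and not taken from these notes.

```python
from math import sqrt
from statistics import NormalDist, mean

# Battery lifetimes (hours) from the example in Section 8.2
type_a = [2116, 2347, 2215, 2098, 2156, 2108, 2073, 2205, 2271]
type_b = [2067, 2102, 2090, 2017, 1996, 2114, 2088, 2053]
sigma_a, sigma_b = 90, 45  # known population standard deviations


def two_sample_z(x1, x2, sigma1, sigma2, k=0.0, alpha=0.05):
    """Two-sample z-test of H0: mu1 - mu2 = k with known variances.

    Returns the observed z and the 100(1 - alpha)% confidence interval
    for mu1 - mu2.
    """
    diff = mean(x1) - mean(x2)
    se = sqrt(sigma1**2 / len(x1) + sigma2**2 / len(x2))
    z = (diff - k) / se
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * se
    return z, (diff - half_width, diff + half_width)


z, ci = two_sample_z(type_a, type_b, sigma_a, sigma_b)
print(f"z = {z:.2f}")
print(f"95% C.I. for mu1 - mu2: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the code uses the unrounded sample means, the interval limits may differ from the hand calculation in the second decimal place.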
Result: The $100(1 - \alpha)\%$ confidence interval for $\mu_1 - \mu_2$ when
a) $\sigma_1^2$ and $\sigma_2^2$ are unknown
b) both samples follow normal distributions
c) $n_1$ and $n_2$ are both large (i.e. > 30)
is given (approximately) by:
$$\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}.$$

Example: A number of studies have focused on the question of whether children born to women smokers differ physiologically from children born to non-smokers. The paper "Placental transfer of lead, mercury, cadmium and carbon monoxide in women" (Environ. Research, 1978, 494 - 503) reported on results from one such investigation. Blood-lead concentration (μg/l) was measured in new-born children of 109 smokers and 333 non-smokers. The results are given below.

Sample                     Sample size    Sample mean    Sample s.d.
Mothers who smoke          109            8.9            3.3
Mothers who don't smoke    333            8.1            3.5

Is there evidence to suggest that the blood-lead concentrations are different for smokers' babies than for non-smokers' babies?

Let $\mu_1$ denote the mean for smokers' babies and $\mu_2$ denote the mean for non-smokers' babies. We want to test: $H_0: \mu_1 - \mu_2 = 0$ versus $H_1: \mu_1 - \mu_2 \neq 0$. We have observed
$$z = \frac{8.9 - 8.1}{\sqrt{\dfrac{3.3^2}{109} + \dfrac{3.5^2}{333}}} = 2.16.$$
If H0 is true, then Z ~ N[0, 1] approximately. Now, $z_{0.025} = 1.96 < 2.16$, so we'll reject H0 at the 5% level and conclude that there is some evidence to suggest that blood-lead concentration is higher for smokers' than non-smokers' babies.

Notice also that p-value = P(Z > 2.16 or Z < -2.16) = 0.015 + 0.015 = 0.03. In addition, the 95% confidence limits for this example are given by:
$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} = 8.9 - 8.1 \pm 1.96\sqrt{\frac{3.3^2}{109} + \frac{3.5^2}{333}} = (0.075, 1.525).$$
As this interval is entirely positive, it suggests that the average lead concentration is higher for babies of smoking mothers than for babies of non-smoking mothers.

Exercise: An agricultural scientist believes that plants of a particular species tend to be taller if grown in a greenhouse rather than outdoors. To test his theory, he performs an experiment.
He grows 45 plants from seed in a greenhouse and 64 plants from seed outside. The heights of these plants were later measured. The results can be summarised as:

                   Greenhouse    Outdoors
Sample mean        18.6 cm       17.3 cm
Sample variance    4.9 cm²       6.2 cm²

Perform a hypothesis test to see whether the data provide evidence to suggest that the plants grown in a greenhouse tend to be taller than those grown outside.

8.4 Two-sample t-test

When the sample sizes are small and the population variances are unknown, there is no simple way of making inferences about $\mu_1 - \mu_2$. There is a solution, however, when it can be assumed that the separate population variances, although unknown, are equal. This assumes that $\sigma_1^2 = \sigma_2^2 = \sigma^2$, say. Note that we don't just casually assume that the variances are equal; we need to check that this is a reasonable assumption. We can assess how reasonable such an assumption is using the F-test (see later).

Given equality of variances,
$$\bar{X}_1 \sim N\left[\mu_1, \frac{\sigma^2}{n_1}\right] \quad \text{and} \quad \bar{X}_2 \sim N\left[\mu_2, \frac{\sigma^2}{n_2}\right].$$
Therefore
$$\bar{X}_1 - \bar{X}_2 \sim N\left[\mu_1 - \mu_2, \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}\right] \quad \Rightarrow \quad \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim N[0, 1].$$
As in the one-sample case, we now have to replace $\sigma$ by a suitable estimator S.

8.4.1 Obtaining the pooled sample variance

For each sample we have the sample variance $S_i^2$, i = 1, 2, with which we can estimate $\sigma^2$. However, if we can combine these two estimators in some way, we should be able to get a better estimate of $\sigma^2$ than if we just used one of the single sample variances. An intuitive estimate to use would be $\frac{1}{2}(S_1^2 + S_2^2)$. However, if $n_1$ is larger than $n_2$, then we'd expect $S_1^2$ to be a better estimator of $\sigma^2$ than $S_2^2$. Therefore, instead of taking a straight average of $S_1^2$ and $S_2^2$, we'll take a weighted average (taking account of the relative magnitudes of the two sample sizes). The pooled estimate $S^2$ of $\sigma^2$ is therefore defined by:
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.$$
So for example, if we have two samples of the same size, then
$$S^2 = \frac{S_1^2 + S_2^2}{2}$$
which is the straight average of the two.
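On the other hand, if for example $n_1 > n_2$, then we'll give more weight to $S_1^2$.

Note: The formula
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
results in an unbiased estimate of $\sigma^2$, whereas $\dfrac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2}$ is a biased estimator for the population variance.

8.4.2 Sampling distributions

Recall that when we have a single sample, $X_1, \ldots, X_n$, drawn from a normal distribution with unknown mean and variance, then
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}.$$
Here we have a similar set up. This time we have
$$\bar{X}_1 - \bar{X}_2 \sim N\left[\mu_1 - \mu_2,\ \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}\right]$$
where $\sigma$ is unknown. An intuitive statistic on which to base our hypothesis tests and confidence intervals will be
$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}.$$
We need to find the distribution of $T$.

We know that
$$\frac{(n_1 - 1)S_1^2}{\sigma^2} \sim \chi^2_{n_1 - 1} \quad \text{and} \quad \frac{(n_2 - 1)S_2^2}{\sigma^2} \sim \chi^2_{n_2 - 1}$$
and therefore
$$\frac{(n_1 - 1)S_1^2}{\sigma^2} + \frac{(n_2 - 1)S_2^2}{\sigma^2} \sim \chi^2_{n_1 + n_2 - 2}$$
(the sum of 2 independent chi-squared random variables). But the pooled sample variance $S^2$ is
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
and so
$$\frac{(n_1 - 1)S_1^2}{\sigma^2} + \frac{(n_2 - 1)S_2^2}{\sigma^2} = \frac{(n_1 + n_2 - 2)S^2}{\sigma^2} \sim \chi^2_{n_1 + n_2 - 2}.$$
We also know that
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim N[0, 1].$$
Recall that if $Y \sim N[0, 1]$ and $Z \sim \chi^2_m$ independently, then
$$X = \frac{Y}{\sqrt{Z/m}} \sim t_m.$$
Therefore, with
$$Z = \frac{(n_1 + n_2 - 2)S^2}{\sigma^2} \quad \text{and} \quad Y = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
we get:
$$\frac{Y}{\sqrt{Z/(n_1 + n_2 - 2)}} = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}.$$
Therefore, we have
$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}.$$
We will use $T$ and its distribution to define hypothesis tests and confidence intervals for $\mu_1 - \mu_2$ when we have small normal samples with unknown variances.

8.4.3 Hypothesis test: 2-sample t-test

Suppose we want to test the hypotheses:
$$H_0: \mu_1 - \mu_2 = k \quad \text{versus} \quad H_1: \mu_1 - \mu_2 \neq k.$$
Then if $H_0$ is true, $\mu_1 - \mu_2 = k$, so
$$T = \frac{\bar{X}_1 - \bar{X}_2 - k}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}.$$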
Therefore (using exactly the same reasoning as before) we'll reject $H_0$ if we observe $T$ in the tail regions of the $t_{n_1 + n_2 - 2}$ distribution.

Two-sample t-test: To test the hypotheses
$$H_0: \mu_1 - \mu_2 = k \quad \text{versus} \quad H_1: \mu_1 - \mu_2 \neq k$$
when
a) $\sigma_1^2$ and $\sigma_2^2$ are unknown but assumed equal,
b) both samples come from normal distributions,
use the test statistic
$$T = \frac{\bar{X}_1 - \bar{X}_2 - k}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
and reject $H_0$ at the $100\alpha\%$ level if $|t| > t_{n_1 + n_2 - 2,\ \alpha/2}$. Adjust for 1-tailed tests appropriately.

Example: Recall the example that we used to introduce this chapter. Here we had daily protein intake measurements recorded on two sets of individuals:

Below poverty level: 51.4, 49.7, 72.0, 76.7, 65.8, 55.0, 73.7, 62.1, 79.7, 66.2, 75.8, 65.4, 65.5, 62.0, 73.3
Above poverty level: 86.0, 69.0, 59.7, 80.2, 68.6, 78.1, 98.6, 69.8, 87.7, 77.2.

Suppose we wish to see whether these two groups differ in their mean protein intake. Our hypotheses would then be
$$H_0: \mu_1 - \mu_2 = 0 \quad \text{versus} \quad H_1: \mu_1 - \mu_2 \neq 0$$
where $\mu_1$ and $\mu_2$ represent the mean protein intake of those below and above the poverty level. We first must find the sample mean and s.d. for each sample. These are:

                      Sample mean           Sample s.d.     Sample size
Below poverty level   $\bar{x}_1 = 66.29$   $S_1 = 9.17$    $n_1 = 15$
Above poverty level   $\bar{x}_2 = 77.49$   $S_2 = 11.34$   $n_2 = 10$

The pooled sample variance is therefore given by:
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} = \frac{14 \times 9.17^2 + 9 \times 11.34^2}{15 + 10 - 2} = 101.50, \quad \text{so } S = 10.07.$$
The observed value of the test statistic therefore is:
$$t = \frac{\bar{x}_1 - \bar{x}_2 - 0}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{66.29 - 77.49}{10.07\sqrt{\dfrac{1}{15} + \dfrac{1}{10}}} = -2.72.$$
The appropriate critical points are found from a t distribution with $15 + 10 - 2 = 23$ degrees of freedom:

5% test: 2.069
1% test: 2.807
0.1% test: 3.768.

Since $|t| = 2.72$ exceeds 2.069 but not 2.807, we can reject the null hypothesis at the 5% level (but not at the 1% level). There is some evidence to suggest that the mean protein intakes in the two groups differ. Note that the p-value for this test is
$$p = P(T > 2.72 \text{ or } T < -2.72) = 2P(T > 2.72) = 2(1 - 0.994) = 0.012.$$
Note: In performing the analysis two assumptions have been made: equal population variances and normality.
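The calculation in the protein intake example can be reproduced from the raw data. The sketch below uses only the Python standard library; the function name `two_sample_t` is an illustrative assumption, and the critical value 2.069 is copied from the t table quoted above rather than computed.

```python
import math
import statistics

below = [51.4, 49.7, 72.0, 76.7, 65.8, 55.0, 73.7, 62.1,
         79.7, 66.2, 75.8, 65.4, 65.5, 62.0, 73.3]
above = [86.0, 69.0, 59.7, 80.2, 68.6, 78.1, 98.6, 69.8, 87.7, 77.2]


def two_sample_t(x1, x2, k=0.0):
    """Pooled-variance t statistic for H0: mu1 - mu2 = k."""
    n1, n2 = len(x1), len(x2)
    # statistics.variance uses the n - 1 divisor, i.e. S_i^2.
    s1_sq = statistics.variance(x1)
    s2_sq = statistics.variance(x2)
    pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    t = (statistics.mean(x1) - statistics.mean(x2) - k) / std_error
    return t, n1 + n2 - 2, pooled


t_stat, dof, pooled_var = two_sample_t(below, above)
critical_5_percent = 2.069  # t_{23, 0.025}, taken from the table above
reject_at_5_percent = abs(t_stat) > critical_5_percent
print(round(t_stat, 2), dof, round(pooled_var, 1), reject_at_5_percent)
```

Working from the unrounded data gives a pooled variance very slightly different from the hand calculation, which used the rounded standard deviations, but the test statistic agrees to two decimal places.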
We will consider techniques that can be used to assess how reasonable these assumptions are in later sections.

8.4.4 Confidence intervals

Now,
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$$
and so
$$P\left(-t_{n_1 + n_2 - 2,\ \alpha/2} \le \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \le t_{n_1 + n_2 - 2,\ \alpha/2}\right) = 1 - \alpha.$$
Rearranging this so that we only have $\mu_1 - \mu_2$ in the middle of the inequality we get
$$P\left(\bar{X}_1 - \bar{X}_2 - t_{n_1 + n_2 - 2,\ \alpha/2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{X}_1 - \bar{X}_2 + t_{n_1 + n_2 - 2,\ \alpha/2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right) = 1 - \alpha.$$

Result: A $100(1-\alpha)\%$ confidence interval for $\mu_1 - \mu_2$ when
a) $\sigma_1^2$ and $\sigma_2^2$ are unknown but assumed equal,
b) both samples come from normal distributions,
is given by
$$\bar{X}_1 - \bar{X}_2 \pm t_{n_1 + n_2 - 2,\ \alpha/2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}.$$

Protein intake example (continued): Suppose that we also require a 95% confidence interval for the difference in mean daily protein intakes between those below and above the poverty level. This would be given by:
$$\bar{X}_1 - \bar{X}_2 \pm t_{n_1 + n_2 - 2,\ \alpha/2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = (66.29 - 77.49) \pm 2.069 \times 10.07\sqrt{\frac{1}{15} + \frac{1}{10}}.$$
Hence, the 95% confidence interval is: (-19.7, -2.7).

Exercise: A car hire firm is trying to decide which kind of tyre to use. It has narrowed the choice down to two types, A and B. Randomly selected samples of tyres of each type were tested to destruction on a machine. The numbers of hours to failure are:

Tyre A: 3.82, 3.11, 4.21, 2.64, 4.16, 3.91, 2.44, 4.52.
Tyre B: 4.16, 3.02, 3.94, 4.22, 4.15, 4.92, 4.11, 5.45, 3.65.

Test to see whether there appears to be a significant difference between the mean time to failure for tyres of type A and B. Find also a 99% confidence interval for the difference in population means.

8.5 Inferences about the ratio of two variances

Recall that before we can carry out the 2-sample t-test, we need to make the assumption that $\sigma_1^2 = \sigma_2^2$. In this section we'll look at inferences about $\sigma_1^2/\sigma_2^2$. In particular we'll be interested in whether $\sigma_1^2/\sigma_2^2 = 1$, i.e. whether $\sigma_1^2 = \sigma_2^2$. Once again we will assume that both samples come from normal distributions.

8.5.1 Testing for equality of variances

Suppose we want to test the hypotheses:
$$H_0: \sigma_1^2 = \sigma_2^2 \quad \text{versus} \quad H_1: \sigma_1^2 \neq \sigma_2^2.$$
Recall from Chapter 3 that if $Y$ and $Z$ are independent random variables with $k_1 Y \sim \chi^2_{k_1}$ and $k_2 Z \sim \chi^2_{k_2}$ then
$$\frac{Y}{Z} \sim F_{k_1, k_2}.$$
We know that
$$\frac{(n_1 - 1)S_1^2}{\sigma_1^2} \sim \chi^2_{n_1 - 1} \quad \text{and} \quad \frac{(n_2 - 1)S_2^2}{\sigma_2^2} \sim \chi^2_{n_2 - 1}.$$
So if we let $Y = S_1^2/\sigma_1^2$ and $Z = S_2^2/\sigma_2^2$ then
$$\frac{Y}{Z} = \frac{S_1^2\,\sigma_2^2}{\sigma_1^2\,S_2^2} \sim F_{n_1 - 1,\ n_2 - 1}.$$
So if $H_0$ is true, then $\sigma_1^2 = \sigma_2^2 = \sigma^2$ say, and so
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{S_1^2}{S_2^2} \sim F_{n_1 - 1,\ n_2 - 1}.$$
We will therefore reject $H_0$ if we observe $S_1^2/S_2^2$ in the tail ends of the $F_{n_1 - 1,\ n_2 - 1}$ distribution, i.e. if
$$\frac{S_1^2}{S_2^2} < F_{n_1 - 1,\ n_2 - 1,\ 1 - \alpha/2} \quad \text{or if} \quad \frac{S_1^2}{S_2^2} > F_{n_1 - 1,\ n_2 - 1,\ \alpha/2}.$$
Recall from Chapter 3 that we cannot find lower percentage points for the F-distribution directly from tables and
$$F_{n_1 - 1,\ n_2 - 1,\ 1 - \alpha/2} = \frac{1}{F_{n_2 - 1,\ n_1 - 1,\ \alpha/2}}.$$

F-test: To test the hypotheses
$$H_0: \sigma_1^2 = \sigma_2^2 \quad \text{versus} \quad H_1: \sigma_1^2 \neq \sigma_2^2$$
given two normally distributed samples:
i) use the test statistic $\dfrac{S_1^2}{S_2^2}$;
ii) reject $H_0$ at the $100\alpha\%$ level if
$$\frac{S_1^2}{S_2^2} > F_{n_1 - 1,\ n_2 - 1,\ \alpha/2} \quad \text{or if} \quad \frac{S_1^2}{S_2^2} < \frac{1}{F_{n_2 - 1,\ n_1 - 1,\ \alpha/2}}.$$
Adjust for 1-tailed tests appropriately.

Example: Suppose we have two samples (assumed normally distributed), the first with 13 observations with $S_1^2 = 16.37$ and the second with 11 observations with $S_2^2 = 12.98$. We want to test
$$H_0: \sigma_1^2 = \sigma_2^2 \quad \text{versus} \quad H_1: \sigma_1^2 \neq \sigma_2^2.$$
The test statistic is:
$$\frac{S_1^2}{S_2^2} = \frac{16.37}{12.98} = 1.26.$$
Suppose that we wish to perform a test at the 10% level of significance. We then would want to compare the test statistic with the 5% upper and lower percentage points from the $F_{12,10}$ distribution.

Looking up in $F_{12,10}$ tables we find $F_{12,10,\ 0.05} = 2.913$. To find the lower percentage point:
$$F_{12,10,\ 0.95} = \frac{1}{F_{10,12,\ 0.05}} = \frac{1}{2.753} = 0.363.$$
Our observed test statistic is 1.26, which is neither smaller than 0.363 nor larger than 2.913. Therefore we do not reject $H_0$ at the 10% level and so we find no evidence to suggest that the population variances differ.

Exercise: Look back to the example at the end of Section 8.4.4 relating to the two types of car tyre. Test to see whether there appears to be any evidence to suggest that the population variances are different.
(Use a 5% level of significance).

8.5.2 Confidence interval

We know that
$$\frac{S_1^2\,\sigma_2^2}{\sigma_1^2\,S_2^2} \sim F_{n_1 - 1,\ n_2 - 1}$$
and therefore
$$P\left(\frac{1}{F_{n_2 - 1,\ n_1 - 1,\ \alpha/2}} \le \frac{S_1^2\,\sigma_2^2}{\sigma_1^2\,S_2^2} \le F_{n_1 - 1,\ n_2 - 1,\ \alpha/2}\right) = 1 - \alpha.$$
We want to find a confidence interval for $\sigma_1^2/\sigma_2^2$ and so we will rearrange this equation until we just have $\sigma_1^2/\sigma_2^2$ in the middle. We get:
$$P\left(\frac{S_2^2}{S_1^2}\,\frac{1}{F_{n_2 - 1,\ n_1 - 1,\ \alpha/2}} \le \frac{\sigma_2^2}{\sigma_1^2} \le \frac{S_2^2}{S_1^2}\,F_{n_1 - 1,\ n_2 - 1,\ \alpha/2}\right) = 1 - \alpha,$$
and, taking reciprocals,
$$P\left(\frac{S_1^2}{S_2^2}\,\frac{1}{F_{n_1 - 1,\ n_2 - 1,\ \alpha/2}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{S_1^2}{S_2^2}\,F_{n_2 - 1,\ n_1 - 1,\ \alpha/2}\right) = 1 - \alpha.$$

Result: A $100(1-\alpha)\%$ confidence interval for $\sigma_1^2/\sigma_2^2$ when both samples come from normal distributions is given by:
$$\left[\frac{1}{F_{n_1 - 1,\ n_2 - 1,\ \alpha/2}}\,\frac{S_1^2}{S_2^2},\ \ F_{n_2 - 1,\ n_1 - 1,\ \alpha/2}\,\frac{S_1^2}{S_2^2}\right].$$

Example: Returning again to the earlier example in which we had two samples with 13 and 11 observations respectively and $S_1^2 = 16.37$, $S_2^2 = 12.98$. Then the 90% confidence interval is:
$$\left[\frac{1}{F_{12,10,\ 0.05}}\,\frac{16.37}{12.98},\ \ F_{10,12,\ 0.05}\,\frac{16.37}{12.98}\right] = \left[\frac{1.26}{2.913},\ 2.753 \times 1.26\right]$$
so that the 90% confidence interval is given by [0.43, 3.47].

8.6 Assessing Normality

The statistical tests that we have developed in this and the previous chapters have often relied upon the assumption that the data follow a normal distribution. In this section we look at some techniques which can be used to assess whether such an assumption appears reasonable.

8.6.1 Graphical methods

Graphical techniques provide a very simple way of gauging whether a set of data look roughly normally distributed. For example, a histogram of normally distributed data should be roughly symmetrical and unimodal. Further, it should also show most of the observations near the mean and steadily fewer as we go further away from the mean.

A probability plot is a slightly more sophisticated plot that is used for assessing normality. Probability plots are also now widely available on statistical software packages (such as Minitab).

To produce a probability plot for a set of data $x_1, \ldots, x_n$ (ordered so that $x_1 \le x_2 \le \ldots \le x_n$), we plot $y_i$ against $x_i$ (for $i = 1, \ldots, n$), where $y_i = \Phi^{-1}(z_i)$, $\Phi$ is the normal cumulative distribution function, and $z_i = (i - 0.5)/n$. The points should roughly lie on a straight line if the assumption of normality is appropriate.

Note: Minitab produces its probability plots by plotting $z_i$ against $x_i$ and using a special scale on the vertical axis. This has the same effect as applying the inverse normal cdf.

Note 2: Other formulas for calculating $z_i$ exist.

Example: Consider the data introduced at the start of this chapter relating to the daily protein intakes of two groups of people. We focus here just on those that are below the poverty level:

Below poverty level: 51.4, 49.7, 72.0, 76.7, 65.8, 55.0, 73.7, 62.1, 79.7, 66.2, 75.8, 65.4, 65.5, 62.0, 73.3

The probability plot produced by Minitab for these data is:

[Figure: "A probability plot showing protein intake". Vertical axis: Percent; horizontal axis: Data. ML estimates shown on the plot: Mean 66.2867, StDev 8.85934.]

This probability plot also contains a 95% confidence interval (as shown by the broken line): we would expect about 95% of the points to fall within these limits if the assumption of normality is valid. For these data, all the points are contained within the confidence band. The probability plot therefore does not cast any doubt on the appropriateness of a normal assumption.

8.6.2 Formal methods for assessing normality

A variety of more formal techniques can be used to assess how well a normal distribution fits a set of data:
Shapiro-Wilk test;
Kolmogorov-Smirnov test;
Anderson-Darling test, etc.
These tests can be performed in Minitab.

8.7 Inferences for the difference between two proportions

Suppose that we have two populations for which the proportion of “successes” for population 1 is $\theta_1$ and the proportion of “successes” for population 2 is $\theta_2$.
Suppose that we observe a sample from each population:

Population          1       2
Sample size         $n_1$   $n_2$
Sample proportion   $p_1$   $p_2$

Based upon these samples, we might be interested in drawing inferences about $\theta_1 - \theta_2$ (for example, to test whether $\theta_1 = \theta_2$). To do this, we need to know the sampling distribution of $p_1 - p_2$.

Now, we know that for large sample size $n_i$ ($n_i\theta_i \ge 5$, $n_i(1 - \theta_i) \ge 5$), $i = 1, 2$,
$$p_i \sim N\left[\theta_i,\ \frac{\theta_i(1 - \theta_i)}{n_i}\right]$$
(approximately). So, for large sample sizes:
$$p_1 - p_2 \sim N\left[\theta_1 - \theta_2,\ \frac{\theta_1(1 - \theta_1)}{n_1} + \frac{\theta_2(1 - \theta_2)}{n_2}\right]$$
and
$$W = \frac{p_1 - p_2 - (\theta_1 - \theta_2)}{\sqrt{\dfrac{\theta_1(1 - \theta_1)}{n_1} + \dfrac{\theta_2(1 - \theta_2)}{n_2}}} \sim N[0, 1].$$

8.7.1 Hypothesis tests

We shall consider here only the simplest hypothesis test concerning $\theta_1 - \theta_2$, namely where we wish to test the following null hypothesis:
$$H_0: \theta_1 - \theta_2 = 0.$$
Now if $H_0$ is true, then $\theta_1 = \theta_2 = \theta$, say, and
$$W = \frac{p_1 - p_2}{\sqrt{\dfrac{\theta(1 - \theta)}{n_1} + \dfrac{\theta(1 - \theta)}{n_2}}}.$$
We can't, however, use this as our test statistic because we cannot compute it: $H_0$ says that we have a common value of $\theta$, but it doesn't specify an actual value. So to find a test statistic we first estimate $\theta$ from the sample data and then use this estimate in $W$.

When $\theta_1 = \theta_2 = \theta$ we get our best estimator for $\theta$ by making use of both sample proportions, $p_1$, $p_2$, and pooling these suitably. Following on from the method we used to find the pooled sample variance, we'll use a weighted average. The combined estimate of the population proportion is therefore
$$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}.$$
Using this pooled estimate in the test statistic gives:
$$W = \frac{p_1 - p_2}{\sqrt{\dfrac{p(1 - p)}{n_1} + \dfrac{p(1 - p)}{n_2}}} \sim N[0, 1].$$
We therefore get the following test:

Result: To test the hypotheses
$$H_0: \theta_1 - \theta_2 = 0 \quad \text{versus} \quad H_1: \theta_1 - \theta_2 \neq 0$$
when $n_i$ is large, i.e. $n_i\theta_i \ge 5$, $n_i(1 - \theta_i) \ge 5$, $i = 1, 2$,
i) use the test statistic
$$W = \frac{p_1 - p_2}{\sqrt{\dfrac{p(1 - p)}{n_1} + \dfrac{p(1 - p)}{n_2}}};$$
ii) reject $H_0$ at the $100\alpha\%$ level if $w < -z_{\alpha/2}$ or if $w > z_{\alpha/2}$.
Adjust for 1-tailed tests accordingly.

Example: Two drugs are used to treat patients with a certain type of cancer. In order to compare their effectiveness, a clinical trial was planned. 75 patients were given drug A whilst 60 patients were assigned to drug B.
The number of patients who survived for one year beyond diagnosis in each group was as follows:

Drug A: 49
Drug B: 34.

Test whether both drugs appear to be equally effective.

Let $\theta_1$ denote the proportion of people with this type of cancer who would survive for one year beyond diagnosis if treated with drug A, and similarly $\theta_2$ for drug B. We then wish to test:
$$H_0: \theta_1 - \theta_2 = 0 \quad \text{versus} \quad H_1: \theta_1 - \theta_2 \neq 0.$$
We have observed:
$$p_1 = \frac{49}{75} = 0.6533, \qquad p_2 = \frac{34}{60} = 0.5667.$$
If the null hypothesis is true, then the pooled estimator of $\theta$, the common population proportion, is:
$$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{49 + 34}{75 + 60} = 0.6148.$$
The test statistic is:
$$w = \frac{p_1 - p_2}{\sqrt{\dfrac{p(1 - p)}{n_1} + \dfrac{p(1 - p)}{n_2}}} = \frac{0.6533 - 0.5667}{\sqrt{\dfrac{0.6148 \times 0.3852}{75} + \dfrac{0.6148 \times 0.3852}{60}}} = 1.027.$$
For a 5% test, we reject $H_0$ when $w < -1.96$ or $w > 1.96$. Our test statistic does not lie in the rejection region. We therefore have no evidence to reject the null hypothesis at this level. It is plausible that both drugs are equally effective at treating this form of cancer.

Exercise: “Predictors of driving while intoxicated among teenagers” (J. of Drug Issues, 1988, 367 - 84) investigated how common it is for teenagers to drive while intoxicated. The following results were obtained:

                                  Boys   Girls
Number surveyed                   100    100
Number driven while intoxicated   28     17

Use a p-value to decide whether there is sufficient evidence to suggest that the proportion of girls who have driven while intoxicated is smaller than the proportion of boys.

8.7.2 Confidence interval

Now,
$$W = \frac{p_1 - p_2 - (\theta_1 - \theta_2)}{\sqrt{\dfrac{\theta_1(1 - \theta_1)}{n_1} + \dfrac{\theta_2(1 - \theta_2)}{n_2}}} \sim N[0, 1]$$
and so
$$P\left(-z_{\alpha/2} \le \frac{p_1 - p_2 - (\theta_1 - \theta_2)}{\sqrt{\dfrac{\theta_1(1 - \theta_1)}{n_1} + \dfrac{\theta_2(1 - \theta_2)}{n_2}}} \le z_{\alpha/2}\right) = 1 - \alpha.$$
We'll find our confidence interval by rearranging this so that we have $\theta_1 - \theta_2$ in the middle:
$$P\left(-z_{\alpha/2}\sqrt{\frac{\theta_1(1 - \theta_1)}{n_1} + \frac{\theta_2(1 - \theta_2)}{n_2}} \le p_1 - p_2 - (\theta_1 - \theta_2) \le z_{\alpha/2}\sqrt{\frac{\theta_1(1 - \theta_1)}{n_1} + \frac{\theta_2(1 - \theta_2)}{n_2}}\right) = 1 - \alpha$$
$$P\left(p_1 - p_2 - z_{\alpha/2}\sqrt{\frac{\theta_1(1 - \theta_1)}{n_1} + \frac{\theta_2(1 - \theta_2)}{n_2}} \le \theta_1 - \theta_2 \le p_1 - p_2 + z_{\alpha/2}\sqrt{\frac{\theta_1(1 - \theta_1)}{n_1} + \frac{\theta_2(1 - \theta_2)}{n_2}}\right) = 1 - \alpha.$$
But we do not know the values of $\theta_1$ and $\theta_2$ which are in our limits.
However, $p_i$ is a good estimator of $\theta_i$, $i = 1, 2$, and so we can find a confidence interval by substituting in $p_i$ for $\theta_i$ in the limits.

Result: A $100(1-\alpha)\%$ confidence interval for $\theta_1 - \theta_2$ when $n_i\theta_i \ge 5$, $n_i(1 - \theta_i) \ge 5$, $i = 1, 2$, is given approximately by:
$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}.$$

Example (continued): Returning to the cancer drug example. A 90% confidence interval for $\theta_1 - \theta_2$ is given by:
$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} = (0.6533 - 0.5667) \pm 1.6449\sqrt{\frac{0.6533 \times 0.3467}{75} + \frac{0.5667 \times 0.4333}{60}}$$
(as $z_{0.05} = 1.6449$). The interval is therefore (-0.052, 0.225).

8.8 Paired data

So far, we have considered the case of two independent samples. Sometimes we have two sets of observations that are made on the same group of individuals. For example, we could have blood pressure measurements that are recorded on one group of women before and after the birth of their child. Such data are called paired. Matched pairs are often used in experiments as the resulting data can yield more accurate inferences (by reducing variability).

Suppose that we have the following paired data:

Sample 1:   $X_{11}$   …   $X_{1i}$   …   $X_{1n}$
Sample 2:   $X_{21}$   …   $X_{2i}$   …   $X_{2n}$

Each column $(X_{1i}, X_{2i})$ is a pair of observations.

To compare the two means of the populations we look at the differences:
$$D_i = X_{1i} - X_{2i} \quad \text{for } i = 1, \ldots, n.$$
Then $D_1, \ldots, D_n$ are a random sample from $N[\mu_1 - \mu_2,\ \sigma_d^2]$, where $\sigma_d^2$ is some variance which is generally unknown.

Our problem has therefore reduced to a one-sample problem and so, by denoting $\mu_1 - \mu_2$ as $\mu_d$ say, we can use a t-test to test the hypothesis $H_0: \mu_d = k$.

Paired t-test: To test the hypotheses
$$H_0: \mu_d = k \quad \text{versus} \quad H_1: \mu_d \neq k$$
when we have a sample of matched pairs we use a one-sample t-test applied to the differences. Adjust for 1-tailed tests accordingly.

Example: Ten athletes ran a 400 m race at sea level and at a later meeting ran another 400 m race at high altitude.
Their times in seconds were as follows:

Athlete         1      2      3      4      5      6      7      8      9      10
Sea level       48.3   47.9   50.2   51.7   46.5   44.9   45.2   47.7   48.4   49.1
High altitude   48.7   49.2   50.1   51.9   48.2   45.8   48.0   47.3   50.2   51.5

Test whether the athletes are performing equally well at sea level and at high altitude.

The data here are clearly paired (two measurements are recorded on each athlete). We let $\mu_d$ denote the mean difference in times ($\mu_1 - \mu_2$). The hypotheses we wish to test are as follows:
$$H_0: \mu_d = 0 \quad \text{versus} \quad H_1: \mu_d \neq 0.$$
The differences in times (sea level minus high altitude) for each athlete are:

-0.4, -1.3, 0.1, -0.2, -1.7, -0.9, -2.8, 0.4, -1.8, -2.4.

The sample mean and variance for these differences then are given by:
$$\bar{x}_d = \frac{1}{10}\left[(-0.4) + (-1.3) + 0.1 + \ldots + (-2.4)\right] = -1.1$$
$$S_d^2 = \frac{1}{9}\left[22.6 - \frac{(-11)^2}{10}\right] = 1.1667.$$
This gives the following value for the test statistic:
$$t = \frac{\bar{x}_d - 0}{S_d/\sqrt{n}} = \frac{-1.1}{\sqrt{1.1667/10}} = -3.22.$$
We compare this with critical points from a t distribution with 9 degrees of freedom. As $t_{9,\ 0.005} = 3.25$ we just fail to reject the null hypothesis at the 1% level, although we do reject it at the 5% level (since $t_{9,\ 0.025} = 2.262$). We have some evidence to suggest that athletes' performance differs at the different altitudes. [Note we need to assume here that the differences follow a normal distribution.]

To find a confidence interval for the mean difference, simply use the corresponding one-sample results.

Exercise: “Effects of alcohol on hypoxia” (J of Amer. Med. Assoc., 1965, 135) examined the relationship between alcohol intake and the time of useful consciousness during high-altitude flight. Ten men were taken to a simulated altitude of 25,000 ft and given several tasks to perform. The time (in seconds) at which useful consciousness was lost, due to lack of oxygen, was recorded. The experiment was repeated 3 days later after the subjects had 0.5 cc of 100-proof whiskey per pound of body weight. The time of useful consciousness was again recorded. Does the alcohol intake reduce the average time of useful consciousness?
Time of useful consciousness (seconds):

Subject   No alcohol   Alcohol   Difference
1         261          185       76
2         565          375       190
3         900          310       590
4         630          240       390
5         280          215       65
6         365          420       -55
7         400          405       -5
8         735          205       530
9         430          255       175
10        900          900       0

Chapter 9: Introduction to Non-Parametric Tests

We use sample data to make inferences about the population from which it was drawn. In the one-sample and two-sample problems covered in the previous chapters, we assumed that the samples come from normal distributions (or at least that the sample size is so large that the central limit theorem applies). We then make inferences about the parameters of the normal distribution, i.e. the means and variances. If it turns out that the distributions are not normal, then our inferences may not be valid.

If the assumption of normality is not a reasonable assumption, we may decide to use tests that do not assume a specific form for the population distribution. These are known as nonparametric (or distribution-free) tests.

9.1 The sign test

This may be used for testing hypotheses about the median of a distribution, i.e. the centre of the distribution. In particular, with matched pairs, we may test that the median of the distribution of differences is zero. In this context, the sign test is a nonparametric equivalent of the paired t-test.

Procedure
Calculate differences $D_i = X_i - Y_i$.
Discard all zero differences.
Count the number of positive differences.

If the median is zero, then we would expect half of our values to be > 0 and half of them to be < 0. So test the null hypothesis that $p$, the population proportion of positive differences, is 0.5.

The test is based on the fact that the number of positive observations in a sample of non-zero differences of size $n$, $S$ say, has a binomial distribution $B[n, p]$. So if $H_0$ is true, then $S \sim B[n, 0.5]$. We can then use this binomial distribution to find critical values for the test or p-values.
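The procedure above can be sketched in a few lines of Python. The function name `sign_test_p_value` and the list of differences are hypothetical, chosen only for illustration, and the two-sided p-value is obtained by doubling the smaller binomial tail, which is one common convention.

```python
from math import comb


def sign_test_p_value(differences):
    """Exact two-sided sign test of H0: median difference = 0."""
    nonzero = [d for d in differences if d != 0]   # discard zero differences
    n = len(nonzero)
    s = sum(1 for d in nonzero if d > 0)           # number of positive differences
    smaller_tail = min(s, n - s)
    # P(S <= smaller_tail) when S ~ B[n, 0.5]
    one_tail = sum(comb(n, i) for i in range(smaller_tail + 1)) / 2 ** n
    return s, n, min(1.0, 2 * one_tail)


# Hypothetical paired differences (not data from the notes).
diffs = [1.2, -0.5, 0.0, 2.1, 0.7, -0.3, 1.9, 0.4, 1.1, 0.8]
s, n, p_value = sign_test_p_value(diffs)
print(s, n, round(p_value, 4))
```

For these made-up differences there are 9 non-zero values, 7 of them positive, so the p-value is $2P(S \le 2)$ with $S \sim B[9, 0.5]$.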
Note that in this case you probably won't be able to find critical values for the test which give a significance level of exactly $\alpha$, because $S$ is discrete. It is therefore usually simpler to just find the p-value here.

Alternatively, if the sample size is large, we may use a normal approximation so that the number of positive observations, $S$, is then distributed:
$$S \sim N\left[\frac{n}{2},\ \frac{n}{4}\right].$$
Our test statistic is then
$$z = \frac{s - \frac{1}{2}n}{\sqrt{n/4}} = \frac{2s - n}{\sqrt{n}} \sim N[0, 1]$$
approximately if $H_0$ is true. If the alternative hypothesis is that the median difference is different to 0, then our decision rule is to reject $H_0$ if we observe:
$$z < -z_{\alpha/2} \quad \text{or} \quad z > z_{\alpha/2}.$$
Note: If the alternative hypothesis is one-sided, then we would adjust this critical region as appropriate.

Example: To determine whether two tests are equally effective in evaluating job applicants for a certain position, the test questions are randomly intermixed and a combined test is given to each of 14 applicants. The answers to the two sets of test questions are then separated and the scores below were obtained. Using the sign test, test the hypothesis that the two tests produce the same score distributions.

Test 1   78   84   65   98   56   28   70   66   55   87   90   61   70   83
Test 2   74   81   73   98   60   13   58   74   59   88   93   66   88   90

We wish to test: $H_0$: the median for the 2 tests is the same vs $H_1$: the medians are different (i.e. a 2-tailed test).

We first need to count how many of the non-zero differences (Test 1 - Test 2) are positive and negative.

Test 1   Test 2   Sign of (Test 1 - Test 2)
78       74       +
84       81       +
65       73       -
98       98       0 (discarded)
56       60       -
28       13       +
70       58       +
66       74       -
55       59       -
87       88       -
90       93       -
61       66       -
70       88       -
83       90       -

Of the 13 non-zero differences, 4 are positive and 9 negative, so $s = 4$. If $H_0$ is true, $S \sim B[13, 0.5]$.

The p-value when $s = 4$ for a test against a two-sided alternative can be found by calculating $2P(S \le 4)$ assuming the null hypothesis to be true (we need to multiply by 2 because we have a two-tailed test). So the p-value is:
$$p = 2\sum_{i=0}^{4}\binom{13}{i}\,0.5^{i}\,0.5^{13-i} = 0.2668.$$
(Note that this can be easily found from binomial distribution tables in Lindley and Scott.)

As the p-value is > 0.1, we would not reject $H_0$ even at the 10% significance level.

Alternatively, we can use a normal approximation and use the corresponding test statistic or a p-value.

The test statistic is
$$z = \frac{8 - 13}{\sqrt{13}} = -1.387.$$
For a 5% test we reject $H_0$ if $z < -1.96$ or $z > 1.96$. We therefore do not reject $H_0$ at the 5% level and conclude that there is no evidence that the two tests produce different score distributions.

To calculate the p-value, a normal approximation gives
$$S \sim N\left[\frac{13}{2},\ \frac{13}{4}\right] = N[6.5,\ 3.25].$$
The p-value is then:
$$p = 2P(S \le 4.5) \quad \text{(making use of a continuity correction)}$$
$$p = 2P\left(Z \le \frac{4.5 - 6.5}{\sqrt{3.25}}\right) = 2P(Z \le -1.11) = 0.267.$$
Notice that, as expected, the approximate p-value is almost exactly the same as the exact one.

9.2 Mann-Whitney (or Wilcoxon Rank Sum) Test

This is the nonparametric equivalent of the 2-sample t-test. It is used to compare two samples of data and doesn't make the assumption of either normally distributed observations or equal population variances.

We explain the procedure for performing this test in relation to the following example:

Example: There is interest in finding out whether stroke patients make a more successful recovery if they receive treatment within 24 hours of the stroke occurring. The data below are the results of a mobility test and are scores on a 0-100 scale. Patients with low scores are unable to do a lot of things for themselves. The test was performed one week after the stroke occurred.

Treated within 24 hours: 63, 39, 77, 80, 59, 41, 55, 71, 84, 75.
Treated after 24 hours: 44, 31, 58, 60, 47, 51, 68, 52, 34, 49, 26, 50.

We are interested in the hypotheses:
$H_0$: No difference in scores between the two groups
$H_1$: The scores from the two groups differ.

Step 1: Combine the two samples of data and rank the observations:
E.g.
Observation   26   31   34   39   41   44   47   49   50   51   52
Rank          1    2    3    4    5    6    7    8    9    10   11
Group         2    2    2    1    1    2    2    2    2    2    2

Observation   55   58   59   60   63   68   71   75   77   80   84
Rank          12   13   14   15   16   17   18   19   20   21   22
Group         1    2    1    2    1    2    1    1    1    1    1

Here, Group 1 represents those that received prompt treatment.

Step 2: Calculate the sum of ranks for each group.
E.g.
Group 1: $T_1$ = sum of ranks = 4 + 5 + 12 + … + 21 + 22 = 151;
Group 2: $T_2$ = sum of ranks = 1 + 2 + 3 + 6 + … + 15 + 17 = 102.
Note that $T_1 + T_2 = 253 = 0.5 \times 22 \times 23$ (i.e. the sum of the numbers 1, 2, 3, …, 22).

Step 3: Calculate the Mann-Whitney U statistic in the following way:
$$U = \min(U_1, U_2)$$
where
$$U_1 = T_1 - 0.5\,n_1(n_1 + 1), \qquad U_2 = T_2 - 0.5\,n_2(n_2 + 1)$$
and $n_1$ and $n_2$ are the number of observations in Group 1 and Group 2 respectively.
E.g.
$$U_1 = 151 - 0.5 \times 10 \times 11 = 96 \quad \text{and} \quad U_2 = 102 - 0.5 \times 12 \times 13 = 24.$$
So $U = 24$.

Step 4: Compare the value of U with statistical tables and draw conclusions.
E.g. Here we have $n_1 = 10$ and $n_2 = 12$ and our test is two-sided. From tables, we can find the critical values for various test sizes:

Size of test     5%   1%
Critical value   29   21

We reject the null hypothesis at a given level if our value of U is smaller than the critical value for that level. Since $U = 24$ is smaller than 29 but not smaller than 21, we can reject the null hypothesis at the 5% level (but not at the 1% level).

Notes:
1) The critical values can be found from statistical tables if the two sample sizes are fairly small. If the two sample sizes are large (rule of thumb: both greater than 10) then the distributions of $T_1$ and $T_2$ can be taken as normal with
$$E[T_1] = \tfrac{1}{2}\,n_1(n_1 + n_2 + 1); \qquad E[T_2] = \tfrac{1}{2}\,n_2(n_1 + n_2 + 1);$$
$$\mathrm{Var}[T_1] = \mathrm{Var}[T_2] = \tfrac{1}{12}\,n_1 n_2 (n_1 + n_2 + 1).$$
2) When ties are involved, it is usual to replace the ties with the average rank of all observations involved in the tie. For example, if the observations are 12, 14, 14, 17, 19 then the corresponding ranks would be 1, 2.5, 2.5, 4, 5.
3) Note that, with $U_1$ as defined in Step 3 and no ties, $U_1$ equals the number of pairs (one observation from each sample) in which the observation from sample 2 comes before (is smaller than) the observation from sample 1; $U_2$ counts the pairs ordered the other way round. In the example, $U_1 + U_2 = 96 + 24 = 120 = n_1 n_2$.
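Steps 1 to 3 can be checked with a short Python sketch using the stroke data from the example. The helper names `average_ranks` and `mann_whitney_u` are illustrative assumptions; ties are given their average rank as described in note 2, and the final line counts pairs directly as a check on note 3.

```python
treated_early = [63, 39, 77, 80, 59, 41, 55, 71, 84, 75]          # Group 1
treated_late = [44, 31, 58, 60, 47, 51, 68, 52, 34, 49, 26, 50]   # Group 2


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1   # mean of ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def mann_whitney_u(x1, x2):
    """Return rank sums T1, T2 and the statistics U1, U2 and U = min(U1, U2)."""
    ranks = average_ranks(x1 + x2)
    n1, n2 = len(x1), len(x2)
    t1, t2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = t1 - 0.5 * n1 * (n1 + 1)
    u2 = t2 - 0.5 * n2 * (n2 + 1)
    return t1, t2, u1, u2, min(u1, u2)


t1, t2, u1, u2, u = mann_whitney_u(treated_early, treated_late)
# Direct count of pairs in which the Group 1 score exceeds the Group 2 score.
pairs_group1_larger = sum(1 for a in treated_early for b in treated_late if a > b)
print(t1, t2, u1, u2, u, pairs_group1_larger)
```

The value of `u` would then be compared with the tabulated critical values in Step 4; the sketch does not compute those.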
9.3 Goodness-of-fit tests

In this section we'll look at how we can check (or test) whether a given distribution is plausible. This is called goodness of fit testing.

Example: The numbers of thunderstorms reported in one summer month by 100 meteorological stations were given as:

Number      0    1    2    3    4   5
Frequency   22   37   20   13   6   2

If thunderstorms occur at random (in a Poisson process), we would expect the number observed in a month to have a Poisson distribution. Therefore the question of interest which we'd like to test would be: “Does the Poisson distribution fit the data?” and so we'd want to test:

$H_0$: data follow a Poisson distribution
versus
$H_1$: data do not follow a Poisson distribution.

One of the problems here is that we are trying to test whether data follow any Poisson distribution, as opposed to testing whether the data follow a specific Poisson distribution, for example one with mean 22, say. Before we look at how we might test the hypotheses above, we'll first look at how we might test a distribution which is fully specified.

9.3.1 Goodness of fit tests for fully specified null hypotheses

Example: Two dice are thrown 180 times and the number of sixes, X, which occur are counted. These are displayed in the table below.

X           0     1    2   total
Frequency   105   70   5   180

Given these data, is there evidence to suggest that the dice are loaded? We want to test:

$H_0$: dice not loaded
versus
$H_1$: dice loaded.

If they are not loaded, then the probability of throwing a six with each die will be 1/6. On the other hand, if the dice are loaded, then the probability of obtaining a six will not be 1/6. We therefore want to test:
$$H_0: X \sim B\left[2, \tfrac{1}{6}\right] \quad \text{versus} \quad H_1: X \text{ is not } B\left[2, \tfrac{1}{6}\right].$$
To do this, we'll calculate how many times we would expect to observe X = 0, 1, 2 and compare these expected values with the number of times we actually did observe these values. If the expected frequencies are close to the observed frequencies, then this would suggest that $H_0$ might be true and so we will not reject $H_0$.
On the other hand, if the observed frequencies are very different to those which would be expected if $H_0$ were true, then this would cast doubt as to whether $H_0$ were true and so we'd reject $H_0$.

Our first task then is to calculate the expected frequencies. We know that under the null hypothesis:
$$P(X = 0) = \left(\frac{5}{6}\right)^2 = \frac{25}{36}$$
$$P(X = 1) = 2 \times \frac{5}{6} \times \frac{1}{6} = \frac{10}{36}$$
$$P(X = 2) = \left(\frac{1}{6}\right)^2 = \frac{1}{36}.$$
Since the dice are thrown 180 times, our expected frequencies are as follows:

x   Expected frequency                     Observed frequency
0   $\frac{25}{36} \times 180 = 125$       105
1   $\frac{10}{36} \times 180 = 50$        70
2   $\frac{1}{36} \times 180 = 5$          5

Does this provide evidence on which we should reject $H_0$?

Some general theory

Let us first consider a general problem in which we have several categories for which we have observed frequencies. Suppose that we have calculated expected frequencies for these categories from our distribution under $H_0$ and we have compiled everything into a table:

Category     1             2             …   i             …
Observed     $O_1$         $O_2$         …   $O_i$         …
Expected     $E_1$         $E_2$         …   $E_i$         …
Difference   $O_1 - E_1$   $O_2 - E_2$   …   $O_i - E_i$   …

The smaller the differences are, the more plausible $H_0$ is. We use a test statistic which measures how large these discrepancies are. Define
$$C = \sum_i \frac{(O_i - E_i)^2}{E_i}.$$
As long as $H_0$ is true, then, regardless of the distribution being fitted, $C \sim \chi^2_\nu$ (approximately).

The following general rule can be used to find the appropriate number of degrees of freedom $\nu$ for this chi-squared distribution:

Degrees of freedom = number of categories - number of restrictions

What are the restrictions? Well, we always ensure that $\sum O_i = \sum E_i$; this is one restriction that always applies.

Note: We shall see later that when a parameter is unknown, we match the distribution to the data by estimating parameters. We then get one restriction per parameter.

So we will reject $H_0$ at the $100\alpha\%$ level if our observed value of the statistic $C$ lies in the upper $100\alpha\%$ tail of this chi-squared distribution, i.e. this would indicate that the discrepancies between the $O_i$ and $E_i$ were larger than expected if $H_0$ were true. Note that we always use a one-tailed test here.
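As a sketch of how the statistic $C$ defined above is evaluated, the following Python fragment recomputes the expected frequencies for the dice data under $B[2, 1/6]$ and then forms $C$. The function names `binomial_pmf` and `chi_squared_statistic` are illustrative assumptions, not notation from the notes.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B[n, p]."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def chi_squared_statistic(observed, expected):
    """C = sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [105, 70, 5]                      # number of sixes: 0, 1, 2
throws = sum(observed)                       # 180 throws of the two dice
expected = [throws * binomial_pmf(x, 2, 1 / 6) for x in range(3)]

c = chi_squared_statistic(observed, expected)
# One restriction only (the totals agree), since H0 is fully specified.
degrees_of_freedom = len(observed) - 1
print([round(e, 1) for e in expected], round(c, 2), degrees_of_freedom)
```

The value of `c` is then compared with the upper percentage points of the chi-squared distribution with `degrees_of_freedom` degrees of freedom, which would be read from tables.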
Further, if $C = 0$, the observed and expected frequencies are identical, and so we then have a perfect fit.

Example (continued): Returning to the dice example we have

x       0     1    2
O       105   70   5
E       125   50   5
O - E   -20   20   0

So
$$C = \frac{(105 - 125)^2}{125} + \frac{(70 - 50)^2}{50} + \frac{(5 - 5)^2}{5} = 11.2.$$
If $H_0$ is true, then this test statistic should have a chi-squared distribution with $3 - 1 = 2$ degrees of freedom. We can therefore reject $H_0$ at the 1% level if we observe $C > \chi^2_{2,\ 0.01} = 9.21$. We therefore reject $H_0$ at the 1% level and conclude that there is strong evidence to suggest that the dice are not fair.

9.3.2 Goodness of fit tests for more general H0

In the dice example, the distribution $B(2, 1/6)$ was specified precisely. In many cases, we just want to test the hypothesis that the data come from a general distribution.

Example: Consider the earlier storms example. Here we wanted to test the hypotheses:

$H_0$: data follow a Poisson distribution
versus
$H_1$: data do not follow a Poisson distribution.

So we're interested in testing whether the probability function
$$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$$
fits the data, for some value of $\lambda$.

We don't have a specific distribution under $H_0$ this time with which to calculate our expected frequencies. We therefore need to identify the Poisson distribution that is likely to fit best and then we'll use this to find our expected frequencies. Now, the Poisson distribution which is likely to fit best will be when the distribution mean is the sample mean $\bar{x}$. So we will calculate the frequencies which we'd expect to observe if the data followed a Poisson distribution with mean $\bar{x}$ and then we will see how well our expected frequencies match our observed frequencies.

For the storm data:
$$\bar{x} = \frac{150}{100} = 1.5$$
and we will use this value to calculate our expected frequencies:
$$P(X = 0) = \frac{e^{-1.5}\,1.5^0}{0!} = 0.2231.$$
Therefore, out of a sample of 100 observations we would expect to observe $X = 0$ on $100 \times 0.2231 = 22.31$ occasions. The other values are found similarly.
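The remaining expected frequencies can be computed in the same way. The sketch below does this in Python for the storm data; the function name `poisson_pmf` is an illustrative assumption, and the final category is treated as “5 or more” (a modelling choice made here so that the probabilities sum to 1).

```python
from math import exp, factorial

# Observed storm counts: number of storms -> number of stations.
counts = {0: 22, 1: 37, 2: 20, 3: 13, 4: 6, 5: 2}
n_stations = sum(counts.values())
sample_mean = sum(x * f for x, f in counts.items()) / n_stations


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)


# Probabilities for x = 0, ..., 4, then the remaining tail as "5 or more".
probabilities = [poisson_pmf(x, sample_mean) for x in range(5)]
probabilities.append(1 - sum(probabilities))

expected = [n_stations * p for p in probabilities]
for label, e in zip(["0", "1", "2", "3", "4", "5 or more"], expected):
    print(label, round(e, 2))
```

Because the last category absorbs the whole upper tail, the expected frequencies add up to the sample size of 100.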
We therefore end up with the following table:

x            O     P(X = x)   E        O - E
0            22    0.2231     22.31    -0.31
1            37    0.3347     33.47    3.53
2            20    0.2510     25.10    -5.10
3            13    0.1255     12.55    0.45
4            6     0.0471     4.71     1.29
5 or more    2     0.0186     1.86     0.14

Then

$$C = \sum \frac{(O - E)^2}{E} = \frac{(-0.31)^2}{22.31} + \frac{3.53^2}{33.47} + \dots + \frac{0.14^2}{1.86} = 1.793.$$

If the null hypothesis is true, we would expect C to have a chi-squared distribution. As before, the degrees of freedom are given by:

Degrees of freedom = number of categories - number of restrictions.

Here, we have two restrictions (the totals of the observed and expected frequencies must agree and we are estimating the mean from the data). The general rule is:

Degrees of freedom = number of categories - 1 - number of parameters estimated.

In our example, there are 6 - 1 - 1 = 4 degrees of freedom. We would therefore be able to reject H0 at the 10% significance level if we observe $C > \chi^2_{4,0.1} = 7.779$. Our observed value of 1.793 is nowhere near the critical region and so we conclude that the Poisson distribution appears to fit the data.

9.3.3 General format of test

A goodness of fit test considers the hypotheses:
H0: data follow some distribution
H1: data do not follow that distribution.

Step 1: Specify a distribution for H0. Substitute in estimated parameters if they aren't already specified.
Step 2: Calculate the expected frequencies for each category assuming the distribution under H0 is true.
Step 3: Use the test statistic $C = \sum_i \frac{(O_i - E_i)^2}{E_i}$.
Step 4: Reject H0 if $C > \chi^2_{\nu,\alpha}$, where the degrees of freedom, $\nu$, are found according to the above rule.

9.3.4 Combining categories

The result that

$$C = \sum_i \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_\nu$$

is an asymptotic result. Mathematically it depends on the expected frequencies E being large.

General rules exist for combining categories:
Old rule of thumb: Ensure all E's are > 5.
Modern rule of thumb: Ensure all E's are > 1 and almost all are > 5.
If rule is contravened: Combine adjacent categories until all E's are acceptable.
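The steps above can be sketched in Python as follows. The function names are hypothetical, and `combine_small_tail` is a deliberately simplified illustration of combining categories: it only merges categories at the upper end of the table and applies the old rule of thumb (all E's at least 5), which is stricter than the rule used in the storms example.

```python
def chi_squared_statistic(observed, expected):
    """C = sum over categories of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(n_categories, n_estimated_parameters=0):
    """Number of categories - 1 - number of parameters estimated."""
    return n_categories - 1 - n_estimated_parameters


def combine_small_tail(observed, expected, minimum=5.0):
    """Merge the last category into its neighbour until the final
    expected frequency reaches `minimum` (simplified, tail only)."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_o, last_e = obs.pop(), exp.pop()
        obs[-1] += last_o
        exp[-1] += last_e
    return obs, exp


# Dice example: distribution fully specified, so no parameters estimated
dice_c = chi_squared_statistic([105, 70, 5], [125, 50, 5])
dice_df = degrees_of_freedom(3)

# Storms example: Poisson mean estimated from the data
storm_obs = [22, 37, 20, 13, 6, 2]
storm_exp = [22.31, 33.47, 25.10, 12.55, 4.71, 1.86]
storm_c = chi_squared_statistic(storm_obs, storm_exp)
storm_df = degrees_of_freedom(6, n_estimated_parameters=1)

# What the table would look like under the old rule of thumb
merged_obs, merged_exp = combine_small_tail(storm_obs, storm_exp)

print(dice_c, dice_df)
print(round(storm_c, 3), storm_df)
print(merged_obs, merged_exp)
```

The computed statistics are then compared with the tabulated percentage points quoted in the text (9.21 for the dice, 7.779 for the storms). Note that combining categories reduces the number of categories and therefore the degrees of freedom.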
9.3.5 Fitting a geometric distribution

Reminder: A geometric distribution can be used to model situations where a count is made of the number of trials performed until a success occurs. The conditions that give rise to the geometric distribution are:
There is a sequence of (Bernoulli) trials;
Only two outcomes, success and failure, are possible at each trial;
The trials are independent;
There is a constant probability p of success at each trial;
The variable is the number of trials taken for the first success to appear.

If X has a geometric distribution, then

$$P(X = x) = p(1 - p)^{x - 1} \quad \text{for } x = 1, 2, 3, \dots$$

Note: The expected value of X is $E[X] = \frac{1}{p}$.

Example: An infertility clinic records the number of treatment sessions (x) required by 100 patients until pregnancy results:

x                     1     2     3     4
Observed frequency    57    24    10    9

a) Test whether a geometric distribution with p = 0.4 provides an adequate fit to these data.
b) Test whether the data can be modelled well by any geometric distribution.

a) The hypotheses to be tested are:
Null: the data follow a Ge(0.4) distribution;
Alternative: the data are not Ge(0.4).

With p = 0.4, the probabilities are:
P(X = 1) = 0.4
P(X = 2) = 0.4 × 0.6 = 0.24
P(X = 3) = 0.4 × 0.6² = 0.144
P(X ≥ 4) = 1 - 0.4 - 0.24 - 0.144 = 0.216.

Expected frequencies are found by multiplying by the total frequency (i.e. 100):

x    Observed frequency    Expected frequency
1    57                    40
2    24                    24
3    10                    14.4
4    9                     21.6

The test statistic is:

$$C = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{(57 - 40)^2}{40} + \frac{(24 - 24)^2}{24} + \frac{(10 - 14.4)^2}{14.4} + \frac{(9 - 21.6)^2}{21.6} = 15.92.$$

From tables, the 1% point of $\chi^2_3$ is 11.34 and the 0.1% point is 16.27. So we can reject the null hypothesis at the 1% level (but not at the 0.1% level). Consequently, there is strong evidence that the data are not Ge(0.4).

b) The hypotheses to be tested now are:
Null: the data follow a geometric distribution;
Alternative: the data are not geometric.

The mean of the data is

$$\bar{x} = \frac{(1 \times 57) + (2 \times 24) + (3 \times 10) + (4 \times 9)}{100} = 1.71.$$

So a good estimate of p would be $\frac{1}{\bar{x}} = \frac{1}{1.71} = 0.585$.

The expected probabilities under the null hypothesis then are:
P(X = 1) = 0.585
P(X = 2) = 0.585 × 0.415 = 0.243
P(X = 3) = 0.585 × 0.415² = 0.101
P(X ≥ 4) = 0.071 (by subtraction).

The table of observed and expected frequencies is:

x    Observed frequency    Expected frequency
1    57                    58.5
2    24                    24.3
3    10                    10.1
4    9                     7.1

Therefore, the test statistic is:

$$C = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{(57 - 58.5)^2}{58.5} + \frac{(24 - 24.3)^2}{24.3} + \frac{(10 - 10.1)^2}{10.1} + \frac{(9 - 7.1)^2}{7.1} = 0.55.$$

If H0 is true, C should be from a chi-squared distribution with 4 - 1 - 1 = 2 degrees of freedom. As the 5% point for this distribution is 5.991, we are unable to reject the null hypothesis at the 5% level. There is therefore no evidence to suggest that a geometric model is unsuitable.

9.4 An additional example

The marital-status distribution of the US adult population is given by:

Marital status    Percentage
Single            21.5
Married           63.9
Widowed           7.7
Divorced          6.9

A random sample of 750 US males aged 25-29 yielded the following frequencies:

Marital status    Frequency
Single            289
Married           408
Widowed           0
Divorced          53

Does it appear that the marital-status distribution of all 25-29 year old US males is different from that of the US adult population as a whole?

Solution: Firstly we need to identify our hypothesised distribution. We want to test whether the marital-status distribution of all 25-29 year old US males is different from that of the US adult population as a whole, i.e. we want to test the null hypothesis that the distribution for 25-29 year olds is:

Marital status    Probability
Single            0.215
Married           0.639
Widowed           0.077
Divorced          0.069

against the alternative that the distribution for 25-29 year olds is different from this.

The next stage is to calculate the expected frequencies assuming that H0 is true. Of the 750 males sampled, we'd expect to observe 750 × 0.215 = 161.25 of them to be single.
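As a quick check of this arithmetic, the expected frequencies for all four categories, and the resulting value of the statistic C, can be computed as in the following sketch. The variable names are illustrative only.

```python
# Hypothesised probabilities (US adult population) and observed sample counts
probabilities = {"Single": 0.215, "Married": 0.639,
                 "Widowed": 0.077, "Divorced": 0.069}
observed = {"Single": 289, "Married": 408, "Widowed": 0, "Divorced": 53}

n = sum(observed.values())  # sample size, 750

# Expected frequency = sample size x hypothesised probability
expected = {status: n * p for status, p in probabilities.items()}

# C = sum of (O - E)^2 / E over the four categories
c = sum((observed[s] - expected[s]) ** 2 / expected[s] for s in observed)

for status in observed:
    print(f"{status:<9} O = {observed[status]:>3}  E = {expected[status]:.2f}")
print(f"C = {c:.2f}")
```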
We calculate the other expected frequencies similarly to get:

Marital status    Observed    Expected
Single            289         161.25
Married           408         479.25
Widowed           0           57.75
Divorced          53          51.75

We can now calculate the test statistic:

$$C = \frac{(289 - 161.25)^2}{161.25} + \frac{(408 - 479.25)^2}{479.25} + \frac{(0 - 57.75)^2}{57.75} + \frac{(53 - 51.75)^2}{51.75} = 101.21 + 10.59 + 57.75 + 0.03 = 169.58.$$

There are 4 categories and no parameters were estimated, so there are 4 - 1 = 3 degrees of freedom. We will reject H0 at the 0.1% level if we observe $C > \chi^2_{3,0.001} = 16.27$. It is clear that we should reject H0 at this level and conclude that the marital-status distribution for US 25-29 year old males is different from that of the US adult population as a whole.

Chapter 10: Association Between Variables

Consider two random variables. These may be related to each other; for example, heights and weights of people are related. This section will look at ways of measuring the strength of relationship between two random variables (i.e. the strength of association or correlation).

10.1 Product-moment Correlation Coefficient

When considering the nature of the relationship between two variables we might be interested in the following questions:
Is there a negative or positive relationship (or some other form of relationship)?
Is the relationship linear?

Example: Consider the following scatterplots showing (hypothetical) data from 20 school children.

[Figure: three scatterplots. Diagram (a): mark in maths exam against mark in mock maths paper. Diagram (b): mark in maths exam against mark in English exam. Diagram (c): mark in maths exam against height (in cm).]

We can see that:
In Diagram (a): the points are not scattered far from a straight line; there is a strong positive relationship between the mark in the maths exam and the mark in the mock paper;
In Diagram (c): the points are very scattered; there appears to be no relationship between height and maths mark;
In Diagram (b): the relationship comes somewhere between (a) and (c), i.e.
there is a weak positive relationship between English and maths marks.

The product-moment correlation coefficient, r, (also known as Pearson's correlation coefficient) gives a summary measure of the strength of (linear) association between two random variables. r can take values in the range [-1, 1]. If r is positive, this indicates a positive relationship between the variables. If r is negative, it indicates a negative relationship. The further r is from 0, the stronger the association between the two random variables.

r = +1: exact straight line relationship with positive slope.
r = -1: exact straight line relationship with negative slope.

Note: The value of r does not imply anything about the slope of the straight line fit; it just says something about the quality of the fit.

Definition

The formula for calculating the product-moment correlation coefficient r from bivariate data $(x_1, y_1), \dots, (x_n, y_n)$ is

$$r = \frac{S_{xy}}{S_x S_y}$$

where
$S_x$ is the sample standard deviation of $x_1, x_2, \dots, x_n$;
$S_y$ is the sample standard deviation of $y_1, y_2, \dots, y_n$;
$S_{xy}$ is the sample covariance between the two variables, calculated using

$$S_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} x_i y_i - \frac{\sum_i x_i \sum_i y_i}{n} \right].$$

Example: Blood pressure was measured (in mm Hg) for 15 patients who had moderately raised blood pressure.

Patient number    Systolic blood pressure    Diastolic blood pressure
1                 210                        130
2                 169                        122
3                 187                        124
4                 160                        104
5                 167                        112
6                 176                        101
7                 185                        121
8                 206                        124
9                 173                        115
10                146                        102
11                174                        98
12                201                        119
13                198                        106
14                148                        107
15                154                        100

Let X denote the systolic and Y the diastolic blood pressure. Then, we have:

$$n = 15, \quad \sum_i x_i = 2654, \quad \sum_i y_i = 1685, \quad \sum_i x_i^2 = 475502, \quad \sum_i y_i^2 = 190817.$$

Also,

$$\sum_i x_i y_i = (210 \times 130) + \dots + (154 \times 100) = 300137.$$

Therefore:

$$S_x^2 = \frac{1}{14} \left[ 475502 - \frac{2654^2}{15} \right] = 422.92$$
$$S_y^2 = \frac{1}{14} \left[ 190817 - \frac{1685^2}{15} \right] = 109.67$$
$$S_{xy} = \frac{1}{14} \left[ 300137 - \frac{2654 \times 1685}{15} \right] = 143.17$$

So, the correlation coefficient is:

$$r = \frac{S_{xy}}{S_x S_y} = \frac{143.17}{\sqrt{422.92 \times 109.67}} = 0.665.$$

As r is positive, the two measures of blood pressure are positively correlated.
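The calculation above can be reproduced with a short Python sketch. The function name `pearson_r` is an illustrative choice, and the code uses the computational form of the sample variances and covariance given in the definition.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation r = S_xy / (S_x * S_y)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Sample variances and covariance, each with divisor n - 1
    var_x = (sum(x * x for x in xs) - sum_x ** 2 / n) / (n - 1)
    var_y = (sum(y * y for y in ys) - sum_y ** 2 / n) / (n - 1)
    cov_xy = (sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n) / (n - 1)
    return cov_xy / math.sqrt(var_x * var_y)


systolic = [210, 169, 187, 160, 167, 176, 185, 206,
            173, 146, 174, 201, 198, 148, 154]
diastolic = [130, 122, 124, 104, 112, 101, 121, 124,
             115, 102, 98, 119, 106, 107, 100]

r = pearson_r(systolic, diastolic)
print(f"r = {r:.3f}")
```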
The value of r is not particularly close to either 0 or 1, implying that the strength of association between the two variables is moderate. A plot of the two variables is given below.

[Figure: scatterplot of diastolic BP (mm Hg) against systolic BP (mm Hg).]

Note: It is important to remember that the product-moment correlation coefficient is a measure of linear association and can give misleading results for data which don't display a linear relationship.

Example: Suppose we have two variables with scatter plot:

[Figure: scatterplot of y against x, including one outlying data point.]

This has r = 0.81, indicating quite a strong relationship. But if you take away the outlying data point, then there is in fact no linear relationship. So when interpreting r, it's always a good idea to have a look at a plot of the data.

The product-moment correlation coefficient is most useful when a plot of the data has an oval pattern:

[Figure: scatterplot of y against x showing an oval pattern.]

It is less appropriate when the data are curvilinear:

[Figure: scatterplot of y against x showing a curved pattern.]

or some data points lie away from most of the data:

[Figure: scatterplot of y against x with points lying away from most of the data.]

10.2 The Spearman rank correlation coefficient

The Spearman rank correlation coefficient, $r_S$, is more general than the product-moment correlation coefficient as it measures the strength of the monotonic (i.e. always moving in a consistent direction) association. For example, suppose that we have:

[Figure: two scatterplots of y against x, panels (a) and (b).]

The product-moment correlation would have problems with (b), but Spearman's correlation can handle both.

To find $r_S$:
1. Find the ranks of the X's and Y's separately. If two or more values are tied, then give each of them the average of the ranks they would otherwise occupy.
2. Calculate the product-moment correlation coefficient using the ranks.
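The two steps can be sketched in Python as follows. The helper names `average_ranks`, `pearson_r` and `spearman_r` are hypothetical, and the data are the ages and body fat percentages of the 14 women in the worked example of this section.

```python
import math


def average_ranks(values):
    """Rank the values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # sorted positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


def pearson_r(xs, ys):
    """Product-moment correlation coefficient."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    var_x = (sum(x * x for x in xs) - sum_x ** 2 / n) / (n - 1)
    var_y = (sum(y * y for y in ys) - sum_y ** 2 / n) / (n - 1)
    cov_xy = (sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n) / (n - 1)
    return cov_xy / math.sqrt(var_x * var_y)


def spearman_r(xs, ys):
    """Spearman's coefficient: product-moment correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


age = [23, 39, 41, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
body_fat = [27.9, 31.4, 25.9, 25.2, 31.1, 34.7, 42.0,
            29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]

age_ranks = average_ranks(age)
r_s = spearman_r(age, body_fat)
print(age_ranks)
print(f"r_S = {r_s:.3f}")
```

The two women aged 53 would occupy ranks 6 and 7, so each receives rank 6.5; the same happens for the two aged 58, who each receive rank 11.5.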
$r_S = +1$: perfect monotonic increasing relationship.

[Figure: two scatterplots of y against x showing perfect monotonic increasing relationships; the panel captions give the corresponding values of $r_S$ and r.]

$r_S = -1$: perfect monotonic decreasing relationship.

[Figure: two scatterplots of y against x showing perfect monotonic decreasing relationships; the panel captions give the corresponding values of $r_S$ and r.]

Example: A study used a new method of measuring body composition, and the age and body fat percentage of 14 women were obtained. A scatter plot of the data is given below:

[Figure: scatterplot of body fat (%) against age (years).]

To calculate Spearman's rank coefficient, we first need to rank the data for each of the variables.

Age (years)    Rank    Body fat (%)    Rank
23             1       27.9            3
39             2       31.4            7
41             3       25.9            2
49             4       25.2            1
50             5       31.1            6
53             6.5     34.7            12
53             6.5     42.0            14
54             8       29.1            4
56             9       32.5            8
57             10      30.3            5
58             11.5    33.0            9
58             11.5    33.8            10
60             13      41.1            13
61             14      34.5            11

Now we need to find the product-moment correlation using the ranks. Writing $x_i$ and $y_i$ for the ranks of age and body fat respectively:

$$n = 14, \quad \sum x_i = 105, \quad \sum y_i = 105, \quad \sum x_i^2 = 1014, \quad \sum y_i^2 = 1015, \quad \sum x_i y_i = 921.5.$$

Then,

$$S_x^2 = \frac{1}{13} \left[ 1014 - \frac{105^2}{14} \right] = 17.423$$
$$S_y^2 = \frac{1}{13} \left[ 1015 - \frac{105^2}{14} \right] = 17.5$$
$$S_{xy} = \frac{1}{13} \left[ 921.5 - \frac{105 \times 105}{14} \right] = 10.308.$$

So,

$$r_S = \frac{10.308}{\sqrt{17.423 \times 17.5}} = 0.590.$$

So the two variables are moderately positively correlated. Note that in this example using the product-moment correlation would not have been a problem and we could have simply used that.

10.3 Testing correlations

The product-moment and Spearman correlation coefficients measure the correlation between two variables for our samples, i.e. they are sample statistics. However, we are often interested in making inferences about the correlations in the population. In particular, we are often interested in testing whether there is really no association between the variables in the population.

Consider testing the hypotheses:
H0: No association between the variables versus
H1: Association between the variables.

It can be shown that when there is no association i.e.
when H0 is true the distribution of the product-moment correlation R is such that:

$$R \sqrt{\frac{n - 2}{1 - R^2}} \sim t_{n-2}.$$

So we will use the test statistic

$$R \sqrt{\frac{n - 2}{1 - R^2}}$$

and reject H0 at the 100α% level if we observe:

$$r \sqrt{\frac{n - 2}{1 - r^2}} < -t_{n-2,\,\alpha/2} \quad \text{or} \quad r \sqrt{\frac{n - 2}{1 - r^2}} > t_{n-2,\,\alpha/2}.$$

Example (continued): For the blood pressure example, we have the test statistic:

$$r \sqrt{\frac{n - 2}{1 - r^2}} = 0.665 \sqrt{\frac{15 - 2}{1 - 0.665^2}} = 3.21.$$

We want to compare this with $t_{13,\,0.025} = 2.16$. Since 3.21 > 2.16, we reject H0 at the 5% level and conclude that there is some evidence of a positive association between the variables.

Exercise: The data below refer to a sample of 12 children suffering from cystic fibrosis. The two variables are a measure of the resistance to breathing, x, and height, y (cm).

x       y
13.8    89
8.2     93
9.0     92
12.5    101
21.1    95
6.8     89
17.0    97
11.0    97
8.2     111
12.7    102
8.5     103
10.0    108

Calculate the (product-moment) correlation coefficient connecting the two variables and test to see whether it is significantly different from 0.

When testing the above hypotheses using Spearman's rank correlation, we use exactly the same idea and use the test statistic:

$$R_S \sqrt{\frac{n - 2}{1 - R_S^2}}.$$

However, the distribution of the test statistic when H0 is true is very complicated and it is best to carry out the test on the computer.

10.4 Contingency Tables

Suppose that we have a random sample and we categorise the sample according to which of two characteristics each sample member has. We will then use these data to investigate whether the two characteristics are associated.

Example: A national survey was conducted in the USA to obtain information about alcohol consumption and marital status. 1772 US adults were selected randomly and the results are displayed in the table below:

                  Drinks per month
Marital status    Abstain    1-60    Over 60    Total
Single            67         213     74         354
Married           411        633     129        1173
Widowed           85         51      7          143
Divorced          27         60      15         102
Total             590        957     225        1772

This is called a two-way contingency table (“two-way” because we are categorising in terms of two variables).
The question of interest here is: “Is there an association between the amount a person drinks and their marital status?”, i.e. are the variables marital status and alcohol consumption statistically independent? We will define a test to answer this question.

10.4.1 Chi-squared test of independence

Consider a two-way contingency table. The chi-squared test of independence tests whether there is any association between the two variables in the table. To introduce the test, we will consider the specific example above.

Example: Consider again the contingency table of marital status versus alcohol consumption. We want to investigate whether the two variables are associated and so we want to test:
H0: Marital status and alcohol consumption are statistically independent versus
H1: Marital status and alcohol consumption are statistically dependent.

Now if H0 is true, then the two variables are independent. This would mean that we would expect to observe the same proportion in each of the alcohol categories across the marital status categories. For example, we'd then expect to observe the same proportion of single people who abstain as married people who abstain, etc.

The total number who abstain is 590 out of a total of 1772 people sampled. So if H0 were true, we'd expect the proportion of people of each marital status who abstained to be 590/1772 = 0.333. Now, a total of 354 of the 1772 people sampled were single, and so we would expect to observe (590/1772) × 354 = 117.9 of the people sampled to fall in the single/abstain category, if H0 is true. Similarly, a total of 1173 people in the sample were married, and so we would expect to observe (590/1772) × 1173 = 390.6 of the people sampled to fall in the married/abstain category under the null hypothesis.

By using these arguments we can build up a table of frequencies which we would expect to observe if H0 were true.
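The same argument, applied to every cell, can be sketched in Python. The variable names are illustrative; each expected frequency is the column proportion multiplied by the row total.

```python
# Observed counts: rows are Single, Married, Widowed, Divorced;
# columns are Abstain, 1-60, Over 60 drinks per month
observed = [
    [67, 213, 74],
    [411, 633, 129],
    [85, 51, 7],
    [27, 60, 15],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under H0: expected count = (column total / n) x row total
expected = [[(c / n) * r for c in col_totals] for r in row_totals]

for row in expected:
    print("  ".join(f"{e:6.1f}" for e in row))
```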
Table of Expected Frequencies

                  Drinks per month
Marital status    Abstain    1-60     Over 60    Total
Single            117.9      191.2    44.9       354
Married           390.6      633.5    148.9      1173
Widowed           47.6       77.2     18.2       143
Divorced          34.0       55.1     13.0       102
Total             590        957      225        1772

We can now compare these expected frequencies with what we did observe. If what we observed is close to what we'd expect, then this would give us no reason to reject H0 (as the data are consistent with H0). On the other hand, if what we observed is very different from what we expected, then this would cast doubt on H0 and we'd reject it.

The test statistic we'll use is exactly the same as we used for the goodness of fit test, namely:

$$C = \sum \frac{(O - E)^2}{E}$$

(adding over all cells in the table). Then if H0 is true, $C \sim \chi^2_\nu$ (just as before).

We now need to define our degrees of freedom:

Degrees of freedom = number of expected values in the table that can be chosen freely.

In this example we have 4 marital status categories and 3 for alcohol consumption and so we have 12 categories altogether. However, when calculating the expected frequencies we kept the totals fixed for each category. That leaves us with (4 - 1) × (3 - 1) = 6 values to find freely and so the degrees of freedom is 6 (see below):

Expected frequencies that can be chosen freely

                  Drinks per month
Marital status    Abstain    1-60     Over 60    Total
Single            117.9      191.2    *          354
Married           390.6      633.5    *          1173
Widowed           47.6       77.2     *          143
Divorced          *          *        *          102
Total             590        957      225        1772

Cells marked * can be deduced as we know the row/column totals.

We can now carry out the test formally. We have observed

$$C = \sum \frac{(O - E)^2}{E} = \frac{(67 - 117.9)^2}{117.9} + \dots + \frac{(15 - 13)^2}{13}$$
$$= 21.952 + 2.489 + 18.776 + 1.07 + 0 + 2.67 + 29.358 + 8.908 + 6.856 + 1.427 + 0.438 + 0.324 = 94.269.$$

We will reject H0 at the 1% level if $C > \chi^2_{6,0.01} = 16.81$. We will therefore reject H0 at the 1% level and conclude that marital status and alcohol consumption are associated.

Test details in general: Suppose now that we have a general contingency table with r rows and c columns.
Denote the frequency in row i and column j by $n_{ij}$. To test whether variables 1 and 2 are independent, we first need to calculate the marginal totals. Let $R_i$ and $C_j$ denote the row total for row i and the column total for column j respectively and let the total of all the observations be n. We then have:

              Variable 2
Variable 1    1       2       …    c       Total
1             n_11    n_12    …    n_1c    R_1
2             n_21    n_22    …    n_2c    R_2
…             …       …       …    …       …
r             n_r1    n_r2    …    n_rc    R_r
Total         C_1     C_2     …    C_c     n

Then the expected frequency in row i and column j is:

$$E_{ij} = \frac{C_j}{n} \times R_i \quad \text{or} \quad E_{ij} = \frac{R_i C_j}{n}.$$

We now have the expected frequencies and the observed frequencies so we can calculate the test statistic

$$C = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ is the observed frequency in the (i, j)th cell and where we sum over all r × c cells.

The degrees of freedom is the number of expected values in the table which can be chosen freely. Again we have fixed each of the marginal totals and so we can choose (r - 1) × (c - 1) values freely and so this is our degrees of freedom:

              Variable 2
Variable 1    1          2          …    c-1          c      Total
1             n_11       n_12       …    n_1,c-1      *      R_1
2             n_21       n_22       …    n_2,c-1      *      R_2
…             …          …          …    …            *      …
r-1           n_r-1,1    n_r-1,2    …    n_r-1,c-1    *      R_r-1
r             *          *          …    *            *      R_r
Total         C_1        C_2        …    C_c-1        C_c    n

Cells marked * can be deduced from the marginal totals.

Summary: A chi-squared test of independence uses an r × c contingency table to test the hypotheses:
H0: the 2 variables in the table are independent
H1: the 2 variables are not independent.

Step 1: Find marginal row and column totals.
Step 2: Calculate the expected frequencies for each cell using $E_{ij} = \frac{R_i C_j}{n}$.
Step 3: Use the test statistic $C = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, summed over the r × c cells.
Step 4: Reject H0 if C exceeds the $\chi^2_{(r-1)(c-1),\,\alpha}$ upper percentage point.

Note: Just as for the chi-squared goodness of fit tests, we must have reasonably large expected frequencies before we can use the chi-squared distribution. We use the same rule of thumb as for the goodness of fit test.
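The four steps in the summary can be collected into one small function. This is a minimal sketch: the name `chi_squared_independence` is hypothetical, and the code returns the statistic and degrees of freedom only, leaving the comparison with tabulated percentage points to the reader.

```python
def chi_squared_independence(table):
    """Return (C, degrees of freedom) for an r x c table of observed counts."""
    # Step 1: marginal row and column totals
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Steps 2 and 3: E_ij = R_i * C_j / n, then sum (O_ij - E_ij)^2 / E_ij
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    # Degrees of freedom: (r - 1) x (c - 1)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Marital status (rows) against drinks per month (columns)
alcohol_table = [
    [67, 213, 74],
    [411, 633, 129],
    [85, 51, 7],
    [27, 60, 15],
]

c_value, dof = chi_squared_independence(alcohol_table)
print(f"C = {c_value:.2f} on {dof} degrees of freedom")

# Step 4: compare with the tabulated 1% point for 6 degrees of freedom
print("Reject H0 at the 1% level:", c_value > 16.81)
```

Working with unrounded expected frequencies gives a value that agrees with the hand calculation to about two decimal places.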
Example: A case-control study was carried out among swimmers to investigate the possible association between exposure to chlorinated swimming pool water and erosion of dental enamel. Among 49 swimmers with enamel erosion (the cases), 32 reported swimming 6 or more hours per week, compared with 118 out of 245 swimmers without enamel erosion (the controls).

Observed frequencies:

                                   Amount of swimming per week
                                   ≥ 6 hours    < 6 hours    Total
Erosion of enamel (cases)          32           17           49
No erosion of enamel (controls)    118          127          245
Total                              150          144          294

Hypotheses:
H0: Amount of swimming per week and the occurrence of dental enamel erosion are independent.
H1: The two variables are associated.

Expected frequencies:

                                   Amount of swimming per week
                                   ≥ 6 hours    < 6 hours    Total
Erosion of enamel (cases)          25           24           49
No erosion of enamel (controls)    125          120          245
Total                              150          144          294

For example, the expected frequency for cases swimming less than 6 hours per week is $E_{12} = \frac{49 \times 144}{294} = 24$.

So the value of the test statistic is:

$$C = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(32 - 25)^2}{25} + \frac{(118 - 125)^2}{125} + \frac{(17 - 24)^2}{24} + \frac{(127 - 120)^2}{120} = 4.802.$$

We have to compare this test statistic with percentage points from a chi-squared distribution with (2 - 1) × (2 - 1) = 1 degree of freedom. As $\chi^2_{1,0.05} = 3.841$ and $\chi^2_{1,0.01} = 6.635$, we can reject the null hypothesis at the 5% level (but not at the 1% level). So there is some evidence of an association between the amount of swimming and erosion of enamel.

Note 1: We have not demonstrated a causal relationship (i.e. that swimming a lot in chlorinated swimming pools increases your chance of eroding tooth enamel). It may be that people who swim more are those who take more care of their body and perhaps spend more time brushing their teeth (perhaps brushing the tooth enamel away). In other words, enamel erosion and swimming time may be associated because they are both related to a third variable (e.g. degree of health consciousness and personal hygiene).

Note 2: It is important that expected frequencies are not too small.
To improve the test statistic's approximation to a chi-squared distribution, a continuity correction is sometimes used (due to Yates). This is done by reducing each difference (observed minus expected) by ½ in absolute value before squaring. The test statistic therefore becomes:

$$C = \sum_{i,j} \frac{\left( |O_{ij} - E_{ij}| - \frac{1}{2} \right)^2}{E_{ij}}.$$

Note 3: This question could also be solved by examining the difference in the two proportions. Amongst the cases, the proportion who swim 6 or more hours per week is $p_1 = \frac{32}{49} = 0.653$. Amongst the controls, this proportion is $p_2 = \frac{118}{245} = 0.482$.

Hypotheses:
$H_0: \pi_1 - \pi_2 = 0$ (i.e. $\pi_1 = \pi_2$)
$H_1: \pi_1 - \pi_2 \neq 0$

The pooled estimate of $\pi$ is

$$p = \frac{n_1}{n_1 + n_2} p_1 + \frac{n_2}{n_1 + n_2} p_2 = \frac{49}{294} \times \frac{32}{49} + \frac{245}{294} \times \frac{118}{245} = 0.510.$$

Therefore, the test statistic is

$$w = \frac{0.653 - 0.482}{\sqrt{\dfrac{0.51 \times 0.49}{49} + \dfrac{0.51 \times 0.49}{245}}} = 2.19.$$

Comparing this with percentage points of N(0, 1), we can again reject the null hypothesis at the 5% level, since 2.19 > 1.96.

Exercise: A random sample of accident reports was taken in a large city. Safety officials know that males are expected to have more accidents than females and they were interested to know whether the types of accidents differ between the sexes. The data obtained are displayed in the following table.

                         Sex
Accident circumstance    Male    Female
While at work            18      4
Home                     26      28
Motor vehicle            4       6
Other                    36      24

Do the data provide sufficient evidence to conclude that in this city, accident circumstance and sex are statistically dependent?