Mathematics 243: Statistics

M. Stob

May 4, 2008

Preface

This is the textbook for the course Mathematics 243 taught at Calvin College. This edition of the book is for the Spring 2008 version of the course.

Not using a "standard textbook" requires an explanation. This book differs from other available books in at least three ways. First, this book is a "modern" treatment of statistics that reflects the most recent wisdom about what belongs, and does not belong, in a first course on statistics. Most existing textbooks must give at least some attention to traditional or old-fashioned approaches since traditional and old-fashioned courses are often taught. Second, this course relies on a particular statistical software package, R. The use of R is expected of students throughout the course. Most traditional textbooks are published so as to be usable with any software package (or with no software package at all). The use of R is part of what makes this text modern. Third, this textbook is written for Mathematics 243 and so includes all and only what is covered in the course. Most traditional textbooks are rather encyclopedic. While this textbook includes all the topics that are covered in the course, it is not meant to be self-contained. In particular, the textbook is for a class that meets 52 times throughout the semester, and what goes on in those sessions is important. Also, the textbook contains numerous problems, and the problems must be done so that the concepts are understood in full detail.

The sections of the textbook are intended to be covered in the order that they appear in the text. An exception concerns the appendix, Using R. The R language will be introduced throughout the text by means of examples that solve the problems at hand. The appendix gives a fuller explanation of language features that are important for developing the proficiency with R needed to proceed. The text will often refer forward to the appropriate section of the appendix for more details.
The text is not a complete introduction to R, however. R has a built-in help facility and there are also several introductions to the R language available on the web. A particularly good one is http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf by John Verzani.

This text will change over the course of the semester. The current version of the text will always be available on the course website http://www.calvin.edu/~stob/courses/m243/S08. The pdf version is designed to be useful for on-screen reading. The references in the text to other parts of the text and to the web are hyperlinked.

This is the first edition of this text. Thus errors, typographical and otherwise, abound. I encourage readers to communicate them to me at [email protected].

This text is a part of a larger effort to improve the teaching of statistics at Calvin College. Earlier versions of some of this material were used for the course Mathematics 232. Some of the material in this book was developed by Randy Pruim and appears in the text for Mathematics 343-344. The assistance of Pruim and Tom Scofield in the development of these courses is gratefully acknowledged.

Contents

Introduction
1. Data
   1.1. Basic Notions
   1.2. A Single Variable - Distributions
   1.3. Measures of the Center of a Distribution
   1.4. Measures of Dispersion
   1.5. The Relationship Between Two Variables
   1.6. Two Quantitative Variables
   1.7. Exercises
2. Data from Random Samples
   2.1. Populations and Samples
   2.2. Simple Random Samples
   2.3. Other Sampling Plans
   2.4. Exercises
3. Probability
   3.1. Random Processes
   3.2. Assigning Probabilities I - Equally Likely Outcomes
   3.3. Probability Axioms
   3.4. Empirical Probabilities
   3.5. Independence
   3.6. Exercises
4. Random Variables
   4.1. Basic Concepts
   4.2. Discrete Random Variables
        4.2.1. The Binomial Distribution
        4.2.2. The Hypergeometric Distribution
   4.3. An Introduction to Inference
   4.4. Continuous Random Variables
        4.4.1. pdfs and cdfs
        4.4.2. Uniform Distributions
        4.4.3. Exponential Distributions
        4.4.4. Weibull Distributions
   4.5. The Mean of a Random Variable
        4.5.1. The Mean of a Discrete Random Variable
        4.5.2. The Mean of a Continuous Random Variable
   4.6. Functions of a Random Variable
        4.6.1. The Variance of a Random Variable
   4.7. The Normal Distribution
   4.8. Exercises
5. Inference - One Variable
   5.1. Statistics and Sampling Distributions
        5.1.1. Samples as random variables
        5.1.2. Big Example
        5.1.3. The Standard Framework
   5.2. The Sampling Distribution of the Mean
   5.3. Estimating Parameters
        5.3.1. Bias
        5.3.2. Variance
        5.3.3. Mean Squared Error
   5.4. Confidence Interval for Sample Mean
        5.4.1. Confidence Intervals for Normal Populations
        5.4.2. The t Distribution
        5.4.3. Interpreting Confidence Intervals
        5.4.4. Variants on Confidence Intervals and Using R
   5.5. Non-Normal Populations
        5.5.1. t Confidence Intervals are Robust
        5.5.2. Why are t Confidence Intervals Robust?
   5.6. Confidence Interval for Proportion
   5.7. The Bootstrap
   5.8. Testing Hypotheses About the Mean
   5.9. Exercises
6. Producing Data - Experiments
   6.1. Observational Studies
   6.2. Randomized Comparative Experiments
   6.3. Blocking
   6.4. Experimental Design
7. Inference - Two Variables
   7.1. Two Categorical Variables
        7.1.1. The Data
        7.1.2. I independent populations
        7.1.3. One population, two factors
        7.1.4. I experimental treatments
   7.2. Difference of Two Means
   7.3. Exercises
8. Regression
   8.1. The Linear Model
   8.2. Inferences
   8.3. More Inferences
   8.4. Diagnostics
        8.4.1. The residuals
        8.4.2. Influential Observations
   8.5. Multiple Regression
   8.6. Evaluating Models
   8.7. Exercises
A. Appendix: Using R
   A.1. Getting Started
   A.2. Vectors and Factors
   A.3. Data frames
   A.4. Getting Data In and Out
   A.5. Functions in R
   A.6. Samples and Simulation
   A.7. Formulas
   A.8. Lattice Graphics
   A.9. Exercises

Introduction

Kellogg's makes Raisin Bran and packages it in boxes that are labeled "Net Weight: 20 ounces". How might we test this claim? It seems obvious that we need to actually weigh some boxes. However, we certainly cannot require that every box that we weigh contains exactly 20 ounces. Surely some variation in weight from box to box is to be expected and should be allowed. So we are faced with several questions: How many boxes should we weigh? How should we choose these boxes? How much deviation in weight from the 20 ounces should we allow? These are the kinds of questions that the discipline of statistics is designed to answer.

Definition 0.0.1 (Statistics). Statistics is the scientific discipline concerned with collecting, analyzing, and making inferences from data.
While we cannot tell the whole Raisin Bran story here, the answers to our questions as prescribed by NIST (the National Institute of Standards and Technology) and developed from statistical theory are something like this. Suppose that we are at a Meijer's warehouse that has just received a shipment of 250 boxes of Raisin Bran. We first select twelve boxes out of the whole shipment at random. By at random we mean that no box should be any more likely to occur in the group of twelve than any other. In other words, we shouldn't simply take the first twelve boxes that we find. Next we weigh the contents of the twelve boxes. If any of the boxes is "too" underweight, we reject the whole shipment; that is, we disbelieve the claim of Kellogg's (and they are in trouble). If that is not the case, then we compute the average weight of the twelve boxes. If that average is not "too" far below 20 ounces, we do not disbelieve the claim.

Of course, the paragraph above glosses over some details. We'll address the issue of how to choose the boxes more carefully in Chapter 2. We'll address the issue of summarizing the data (in this case, using the average weight) in Chapter 1. The question of how much below 20 ounces the average of our sample should be allowed to be will be dealt with in Chapter 5. Underlying our statistical techniques is the theory of probability, which we take up in Chapter 3. The theory of probability is meant to supply a mathematical model for situations in which there is uncertainty. In the context of Raisin Bran, we will use probability to give a model for the variation that exists from box to box. We will also use probability to give a model of the uncertainty introduced because we are only weighing a sample of boxes.

If the whole course were only about Raisin Bran it wouldn't be worth it (except perhaps to Kellogg's). But you are probably sophisticated enough to be able to generalize this example.
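Before moving on, here is how the box-sampling procedure above might be sketched in R. The shipment size of 250 and the sample size of twelve come from the story; the weights and the 19.5-ounce cutoff below are made-up illustrations, not NIST's actual numbers.

> boxes = sample(1:250,12)   # choose twelve of the 250 boxes at random
> weights = c(19.8,20.1,19.9,20.0,19.7,20.2,19.9,20.0,19.8,20.1,19.6,20.0)
> any(weights < 19.5)        # is any box "too" underweight?
[1] FALSE
> mean(weights)              # average weight of the twelve boxes
[1] 19.925

If the average were "too" far below 20 ounces, we would reject Kellogg's claim; Chapter 5 makes "too far" precise.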
Indeed, the above story can be told in every branch of science (biological, physical, and social). Each time we have a hypothesis about a real-world phenomenon that is measurable but variable, we need to test that hypothesis by collecting data. We need to know how to collect that data, how to analyze it, and how to make inferences from it. So without further ado, let's talk about data.

1. Data

Statistics is the science of data. In this chapter, we talk about the kinds of data that we study and how to effectively summarize such data.

1.1. Basic Notions

For our purposes, the sort of data that we will use comes to us in collections or datasets. A dataset consists of a set of objects, variously called individuals, cases, items, instances, units, or subjects, together with a record of the value of a certain variable or variables defined on the objects.

Definition 1.1.1 (variable). A variable is a function defined on the set of objects.

Ideally, each individual has a value for each variable. These values are usually numbers but need not be. Sometimes there are missing values.

Example 1.1.2. Your college maintains a dataset of all currently active students. The individuals in this dataset are the students. Many different variables are defined and recorded in this dataset. For example, every student has a GPA, a GENDER, a CLASS, etc. Not every student has an ACT score; there are missing values for this variable.

In the preceding example, some of the variables are obviously quantitative (e.g., GPA) and others are categorical (e.g., GENDER). A categorical variable is often called a factor and the possible values of a categorical variable are called its levels. Sometimes the levels of a categorical variable are represented by numbers. For example, we might code gender using 1 for female and 0 for male. It will be quite important to us not to treat a categorical variable as quantitative just because numbers are used in this way.
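In R, the remedy is to store such coded values as a factor rather than as a numeric vector. A small made-up illustration, five hypothetical students coded as above:

> gender = factor(c(1,0,1,1,0),levels=c(0,1),labels=c("male","female"))
> table(gender)              # a sensible summary of a categorical variable
gender
  male female
     2      3
> mean(c(1,0,1,1,0))         # the "average" of the numeric codes
[1] 0.6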
(Is the average gender 1/2?)

It is useful to think of the values of a variable as forming a list. In R, the values of a particular quantitative variable defined on a collection of individuals are usually stored in a vector. A categorical variable is stored in an R object called a factor (which behaves much like a vector). You can read more about vectors and factors in Section A.2 of the Appendix.

We will normally think of a dataset as presented in a two-dimensional table. The rows of the table correspond to the individuals. (Thus the individuals need to be ordered in some way.) The columns of the table correspond to the variables. Each of the rows and the columns normally has a name. In R, the canonical way to store such data is in an object called a data.frame. More details on how to operate on data.frames are in Appendix A.3. In the remainder of this section, we give a few examples of datasets that can be accessed in R and look at some of their basic properties. These datasets will be used several times in this book.

Example 1.1.3. The iris dataset is a famous set of measurements taken by Edgar Anderson on 150 iris plants of the Gaspé Peninsula, which is located on the eastern tip of the province of Quebec. The dataset is included in the basic installation of R. The variable iris is a predefined data.frame. There are many such datasets built into R.

> data(iris)    # the dataset called iris is loaded into a data.frame called iris
> dim(iris)     # list dimensions of iris data
[1] 150   5
> iris[1:5,]    # print first 5 rows (individuals), all columns
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

Notice that the data.frame has rows and columns. The individuals (rows) are, by default, numbered (they can also be named) and the variables (columns) are named. The numbers and names are not part of the dataset.
Each column of a data.frame is a vector or a factor. In the iris dataset, there are 150 individuals (plants) and five variables. Notice that four of the variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) are quantitative variables. The fifth variable, Species, is a categorical variable (factor) with three levels. The following example shows how to look at pieces of the dataset.

> iris$Species                       # a boring vector
  [1] setosa     setosa     setosa     setosa     setosa     setosa
      ... (output abridged: 50 values each of setosa, versicolor, and virginica) ...
[145] virginica  virginica  virginica  virginica  virginica  virginica
Levels: setosa versicolor virginica
> iris$Petal.Width[c(1:5,146:150)]   # selecting some individuals
 [1] 0.2 0.2 0.2 0.2 0.2 2.3 1.9 2.0 2.3 1.8

Example 1.1.4. There are 3,077 counties in the United States (including D.C.). The U.S. Census Bureau lists 3,141 units that are counties or county-equivalents. (Some people don't live in a county. For example, most of the land in Alaska is not in any borough, which is what Alaska calls its county-level divisions. The Census Bureau has defined county-equivalents so that all land and every person is in some county or other.) Data from the 2000 census about each county is available in a dataset maintained at the website for this course. These data are available from http://www.census.gov. The short R session below shows how to read the file and computes a few interesting numbers.

> counties=read.csv('http://www.calvin.edu/~stob/data/uscounties.csv')
> dim(counties)
[1] 3141    9
> names(counties)
[1] "County"         "State"          "Population"     "HousingUnits"
[5] "TotalArea"      "WaterArea"      "LandArea"       "DensityPop"
[9] "DensityHousing"
> sum(counties$Population)
[1] 281421906
> sum(counties$LandArea)
[1] 3537438

The population of the 50 states and D.C. was 281,421,906 at the time of the 2000 U.S. Census. There were over 3.5 million square miles of land area. Notice that the variable State is a categorical variable and that County is really just a variable to hold the name of each individual.

Example 1.1.5. R comes with many user-created "packages", many of which contain additional datasets. The faraway package comes with a broccoli dataset. In this dataset, a number of growers supply broccoli to a food processing plant. They are supposed to pack the broccoli in boxes with 18 clusters to a box and with each cluster weighing between 1.3 and 1.5 pounds. Four boxes from each of three growers were selected and three clusters from each box were weighed.
Notice that it appears that numerical values were used to label the cluster, box, and grower, but that these variables are correctly stored as factors and not vectors.

> library(faraway)
> dim(broccoli)
[1] 36  4
> broccoli[1:5,]
   wt grower box cluster
1 352      1   1       1
2 369      1   1       2
3 383      1   1       3
4 339      2   1       1
5 367      2   1       2
> broccoli$grower
 [1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
Levels: 1 2 3

1.2. A Single Variable - Distributions

Now that we can get our hands on some data, we would like to develop some tools to help us understand the distribution of a variable in a data set. By distribution we mean two things: what values does the variable take on, and with what frequency. Simply listing all the values of a variable is not an effective way to describe a distribution unless the data set is quite small. For larger data sets, we require some better methods of summarizing a distribution. In this section, we will look particularly at graphical summaries of a single variable. The type of summary that we generate will vary depending on the type of data that we are summarizing.

A table is useful for summarizing a categorical variable. The following table is a useful description of the distribution of species of iris flowers in the iris dataset.

> table(iris$Species)

    setosa versicolor  virginica
        50         50         50

A more interesting table gives the number of counties per state. Note that it isn't always the largest states that have the most counties.
> table(counties$State)

Alabama               67    Alaska                27    Arizona               15
Arkansas              75    California            58    Colorado              63
Connecticut            8    Delaware               3    District of Columbia   1
Florida               67    Georgia              159    Hawaii                 5
Idaho                 44    Illinois             102    Indiana               92
Iowa                  99    Kansas               105    Kentucky             120
Louisiana             64    Maine                 16    Maryland              24
Massachusetts         14    Michigan              83    Minnesota             87
Mississippi           82    Missouri             115    Montana               56
Nebraska              93    Nevada                17    New Hampshire         10
New Jersey            21    New Mexico            33    New York              62
North Carolina       100    North Dakota          53    Ohio                  88
Oklahoma              77    Oregon                36    Pennsylvania          67
Rhode Island           5    South Carolina        46    South Dakota          66
Tennessee             95    Texas                254    Utah                  29
Vermont               14    Virginia             135    Washington            39
West Virginia         55    Wisconsin             72    Wyoming               23

Tables can be generated for quantitative variables as well.

> table(iris$Sepal.Length)

4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6
  1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6
6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
  6   4   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1

The table() function is more useful in conjunction with the cut() function. The second argument to cut() gives a vector of endpoints of half-open intervals. Note that the default behavior is to use intervals that are open on the left and closed on the right.

> table(cut(iris$Sepal.Length,c(4,5,6,7,8)))

(4,5] (5,6] (6,7] (7,8]
   32    57    49    12

The kind of summary in the above table is graphically presented by means of a histogram. There are two R commands that can be used to build a histogram: hist() and histogram(). hist() is part of the standard distribution of R. histogram() can only be used after first loading the lattice graphics package, which now comes standard with all distributions of R. The R functions are used as in the following excerpt, which generates the two histograms in Figure 1.1.

[Figure 1.1.: Homeruns in the major leagues: hist() and histogram().]
Notice that two forms of the histogram() function are given. The second form (the "formula" form) will be discussed in more detail in Section 1.5. The histograms are of the number of homeruns per team during the 2007 Major League Baseball season.

> library(lattice)
> bball=read.csv('http://www.calvin.edu/~stob/data/baseball2007.csv')
> hist(bball$HR)
> histogram(bball$HR)        # lattice histogram of a vector
> histogram(~HR,data=bball)  # formula form of histogram

Notice that the histograms produced differ in several ways. Besides aesthetic differences, the two histogram algorithms typically choose different break points. Also, the vertical scale of histogram() is in percentages of the total while the vertical scale of hist() contains actual counts. As one might imagine, there are optional arguments to each of these functions that can be used to change such decisions.

[Figure 1.2.: Skewed and symmetric distributions.]

In these notes, we will usually use histogram() and indeed we will assume that the lattice package has been loaded. Graphics functions in the lattice package often have several useful features. We will see some of these in later sections.

A histogram gives a shape to a distribution, and distributions are often described in terms of these shapes. The exact shape depicted by a histogram will depend not only on the data but on various other choices, such as how many bins are used, whether the bins are equally spaced across the range of the variable, and just where the divisions between bins are located. But reasonable choices of these arguments will usually lead to histograms of similar shape, and we use these shapes to describe the underlying distribution as well as the histogram that represents it.
Some distributions are approximately symmetric, with the distribution of the larger values looking like a mirror image of the distribution of the lower values. We will call a distribution positively skewed if the portion of the distribution with larger values (the right of the histogram) is more spread out than the other side. Similarly, a distribution is negatively skewed if the distribution deviates from symmetry in the opposite manner. Later we will learn a way to measure the degree and direction of skewness with a number; for now it is sufficient to describe distributions qualitatively as symmetric or skewed. See Figure 1.2 for some examples of symmetric and skewed distributions.

The county population data gives a natural example of a positively skewed distribution. Indeed, it is so skewed that the histogram of populations by county is almost worthless. The histogram is on the left in Figure 1.3. In the case of positively skewed data where the data includes observations of several orders of magnitude, it is sometimes useful to transform the data. In the case of county populations, a histogram of the natural log of population gives a nice symmetric distribution. The histogram is on the right in Figure 1.3.

> logPopulation=log(counties$Population)
> histogram(logPopulation)

[Figure 1.3.: County populations and natural log of county populations.]

[Figure 1.4.: Old Faithful eruption times (based on the faithful data set).]

Notice that each of these distributions is clustered around a center where most of the values are located. We say that such distributions are unimodal. Shortly we will discuss ways to summarize the location of the "center" of unimodal distributions numerically.
But first we point out that some distributions have other shapes that are not characterized by a strong central tendency. One famous example is the eruption times of the Old Faithful geyser in Yellowstone National Park. The command

> data(faithful)
> histogram(faithful$eruptions,n=20)

produces the histogram in Figure 1.4, which shows a good example of a bimodal distribution. There appear to be two groups or kinds of eruptions, some lasting about 2 minutes and others lasting between 4 and 5 minutes.

While the default histogram has the vertical axis read percent of total, another scale will be useful to us. In Figure 1.5, generated by

> histogram(faithful$eruptions,type="density")

we have a density histogram. The vertical axis gives density per unit of the horizontal axis. With this as a density, the bars of the histogram have total mass 1. The histogram is read as follows. The bar that extends from 4 to 4.4 on the horizontal axis has width 0.4 and density approximately 0.6. This means that about 24% of the data is represented by this bar.

[Figure 1.5.: Density histogram of Old Faithful eruption times.]

One disadvantage of a histogram is that the actual data values are lost. For a large data set, this is probably unavoidable. But for more modestly sized data sets, a stem plot can reveal the shape of a distribution without losing the actual data values. A stem plot divides each value into a stem and a leaf at some place value. The leaf is rounded so that it requires only a single digit.
> stem(faithful$eruptions)

  The decimal point is 1 digit(s) to the left of the |

  16 | 070355555588
  18 | 000022233333335577777777888822335777888
  20 | 00002223378800035778
  22 | 0002335578023578
  24 | 00228
  26 | 23
  28 | 080
  30 | 7
  32 | 2337
  34 | 250077
  36 | 0000823577
  38 | 2333335582225577
  40 | 0000003357788888002233555577778
  42 | 03335555778800233333555577778
  44 | 02222335557780000000023333357778888
  46 | 0000233357700000023578
  48 | 00000022335800333
  50 | 0370

From this output we can readily see that the shortest recorded eruption time was 1.60 minutes. The second 0 in the first row represents 1.70 minutes. Note that the output of stem() can be ambiguous when there are not enough data values in a row.

1.3. Measures of the Center of a Distribution

Qualitative descriptions of the shape of a distribution are important and useful. But we will often desire the precision of numerical summaries as well. Two aspects of unimodal distributions that we will often want to measure are central tendency (what is a typical value? where do the values cluster?) and the amount of variation (are the data tightly clustered around a central value, or more spread out?). Two widely used measures of center are the mean and the median. You are probably already familiar with both. The mean is calculated by adding all the values of a variable and dividing by the number of values. Our usual notation will be to denote the n values as x_1, x_2, ..., x_n, and the mean of these values as x̄. Then the formula for the mean is

    x̄ = (x_1 + x_2 + · · · + x_n)/n = (1/n) Σ_{i=1}^{n} x_i .

The median is a value that splits the data in half – half of the values are smaller than the median and half are larger. By this definition, there could be more than one median (when there is an even number of values). This ambiguity is removed by taking the mean of the “two middle numbers” (after sorting the data). Whereas x̄ denotes the mean of the n numbers x_1, ..., x_n, we use x̃ to denote the median of these numbers.
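Before turning to R's built-in functions, it may help to see that the two definitions translate directly into code. A minimal Python sketch (the data vector is the small made-up sample used later in this chapter):

```python
# Mean: the sum of the values divided by how many there are.
# Median: the middle value of the sorted data, averaging the two
# middle values when n is even.
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

x = [1, 3, 5, 5, 6, 8, 9, 14, 14, 20]
print(mean(x))    # 8.5
print(median(x))  # (6 + 8) / 2 = 7.0
```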
The mean and median are easily computed in R. For example,

> mean(iris$Sepal.Length); median(iris$Sepal.Length);
[1] 5.843333
[1] 5.8

We can also compute the mean and median of the Old Faithful eruption times.

> mean(faithful$eruptions); median(faithful$eruptions);
[1] 3.487783
[1] 4

Notice, however, that in the Old Faithful eruption times histogram (Figure 1.4) there are very few eruptions that last between 3.5 and 4 minutes. So although these numbers are the mean and median, neither is a very good description of the typical eruption time(s) of Old Faithful. It will often be the case that the mean and median are not very good descriptions of a data set that is not unimodal. In the case of our Old Faithful data, there seem to be two predominant peaks, but unlike in the case of the iris data, we do not have another variable in our data that lets us partition the eruption times into two corresponding groups. This observation could, however, lead to some hypotheses about Old Faithful eruption times. Perhaps eruption times are different at night than during the day. Perhaps there are other differences in the eruptions. Subsequent data collection (and statistical analysis of the resulting data) might help us determine whether our hypotheses appear correct.

Comparing mean and median

Why bother with two different measures of central tendency? The short answer is that they measure different things, and sometimes one measure is better than the other. If a distribution is (approximately) symmetric, the mean and median will be (approximately) the same. (See Exercise 1.2.) If the distribution is not symmetric, however, the mean and median may be very different. For example, if we begin with a symmetric distribution and add one additional value that is very much larger than the other values (an outlier), then the median will not change very much (if at all), but the mean will increase substantially.
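The claim about outliers is easy to check numerically. A Python sketch with made-up salary data (in thousands of dollars; the numbers are invented for illustration):

```python
# One very large value drags the mean upward but leaves the
# median essentially untouched.
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

salaries = [32, 35, 38, 40, 41, 45, 48]    # employee salaries, in thousands
with_owner = salaries + [900]              # add one extreme value: the owner

print(mean(salaries), median(salaries))    # mean about 39.9, median 40
print(mean(with_owner), median(with_owner))  # mean jumps past 147; median 40.5
```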
We say that the median is resistant to outliers while the mean is not. A similar thing happens with a skewed, unimodal distribution. If a distribution is positively skewed, the large values in the tail of the distribution increase the mean (as compared to a symmetric distribution) but not the median, so the mean will be larger than the median. Similarly, the mean of a negatively skewed distribution will be smaller than the median. Consider the data on the populations of the 3,141 county equivalents in the United States. From R we see the great difference between the mean county population and the median county population. Note that the largest county, Los Angeles County with over 9 million people, alone contributes over 3,000 people to the mean.

> mean(counties$Population); median(counties$Population)
[1] 89596.28
[1] 24595

Over 80% of the counties in the United States are less populous than the “average” county.

> sum(counties$Population<mean(counties$Population))
[1] 2565

Whether a resistant measure is desirable or not depends on context. If we are looking at the income of employees of a local business, the median may give us a much better indication of what a typical worker earns, since there may be a few large salaries (the business owner’s, for example) that inflate the mean. This is also why the government reports median household income and median housing costs. The median county population perhaps tells us more about what a “typical” county looks like than does the mean. On the other hand, if we compare the median and mean of the value of raffle prizes, the mean is probably more interesting. The median is probably 0, since typically the majority of raffle tickets do not win anything. This is true regardless of the values of any of the prizes. The mean will tell us something about the overall value of the prizes involved. In particular, we might want to compare the mean prize value with the cost of a raffle ticket when we decide whether or not to purchase one.
From the mean population of counties, we can compute the total population of the United States. We would likely underestimate that number if we were only told the median county size.

The trimmed mean compromise

There is another measure of central tendency that is less well known and represents a kind of compromise between the mean and the median. In particular, it is more sensitive to the extreme values of a distribution than the median is, but less sensitive than the mean. The idea of a trimmed mean is very simple. Before calculating the mean, we remove the largest and smallest values from the data. The percentage of the data removed from each end is called the trimming percentage. A 0% trimmed mean is just the mean; a 50% trimmed mean is the median; a 10% trimmed mean is the mean of the middle 80% of the data (after removing the largest and smallest 10%). A trimmed mean is calculated in R by setting the trim argument of mean(), e.g. mean(x,trim=.10). Although a trimmed mean in some sense combines the advantages of both the mean and median, it is less common than either. This is partly due to the mathematical theory that has been developed for working with the median and especially the mean of sample data. The 10% trimmed mean of county populations is 38,234, which is much closer in size to the median than to the mean.

> mean(counties$Population,trim=.1)
[1] 38234.59

In some sports, the trimmed mean is used to compute a competitor's score based on the scores given by individual judges. Both diving and international figure skating work this way.

1.4. Measures of Dispersion

It is often useful to characterize a distribution in terms of its center, but that is not the whole story. Consider the distributions depicted in the histograms below.

[Density histograms of two distributions, A and B.]

In each case the mean and median are approximately 10, but the distributions clearly have very different shapes.
The difference is that distribution B is much more “spread out”. “Almost all” of the data in distribution A are quite close to 10; a much larger proportion of distribution B is “far away” from 10. The intuitive (and not very precise) statement in the preceding sentence can be quantified by means of quantiles. The idea of quantiles is probably familiar to you since percentiles are a special case of quantiles.

Definition 1.4.1 (Quantile). Let p ∈ [0, 1]. A p-quantile of a quantitative distribution is a number q such that the (approximate) proportion of the distribution that is less than q is p.

So for example, the .2-quantile divides a distribution into 20% below and 80% above. This is the same as the 20th percentile. The median is the .5-quantile (and the 50th percentile). The idea of a quantile is quite straightforward. In practice there are a few wrinkles to be ironed out. Suppose your data set has 15 values. What is the .30-quantile? 30% of the data would be (.30)(15) = 4.5 values. Of course, there is no number that has 4.5 values below it and 10.5 values above it. This is the reason for the parenthetical word approximate in Definition 1.4.1. Different schemes have been proposed for giving quantiles a precise value, and R implements several such methods. They are similar in many ways to the decision we had to make when computing the median of a variable with an even number of values. Two important methods can be described by imagining that the sorted data have been placed along a ruler, one value at every unit mark and also at each end. To find the p-quantile, we simply snap the ruler so that proportion p of its length is to the left and 1 − p to the right. If the break point happens to fall precisely where a data value is located (i.e., at one of the unit marks of our ruler), that value is the p-quantile. If the break point is between two data values, then the p-quantile is a weighted mean of those two values.
For example, suppose we have 10 data values: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100. The 0-quantile is 1, the 1-quantile is 100, and the .5-quantile (median) is midway between 25 and 36, that is, 30.5. Since our ruler is 9 units long, the .25-quantile is located 9/4 = 2.25 units from the left edge. That is one quarter of the way from 9 to 16, which is 9 + .25(16 − 9) = 9 + 1.75 = 10.75. (See Figure 1.6.) Other quantiles are found similarly. This is precisely the default method used by quantile().

> quantile((1:10)^2)
    0%    25%    50%    75%   100% 
  1.00  10.75  30.50  60.25 100.00 

[Figure 1.6.: An illustration of a method for determining quantiles from data. Arrows indicate the locations of the .25-quantile and the .5-quantile.]

A second scheme is just like this one except that the data values are placed midway between the unit marks. In particular, this means that the 0-quantile is not the smallest value. This could be useful, for example, if we imagined we were trying to estimate the lowest value in a population from which we have only a sample. Probably the lowest value overall is less than the lowest value in our particular sample. Other methods try to refine this idea, usually based on some assumptions about what the population of interest is like. Fortunately, for large data sets, the differences between the different quantile methods are usually unimportant, so we will just let R compute quantiles for us using the quantile() function. For example, here are the deciles and quartiles of the Old Faithful eruption times.

> quantile(faithful$eruptions,(0:10)/10);
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
1.6000 1.8517 2.0034 2.3051 3.6000 4.0000 4.1670 4.3667 4.5330 4.7000 5.1000 

> quantile(faithful$eruptions,(0:4)/4);
     0%     25%     50%     75%    100% 
1.60000 2.16275 4.00000 4.45425 5.10000 

The latter of these provides what is commonly called the five number summary.
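The "snap the ruler" rule amounts to linear interpolation at position p·(n − 1) in the sorted data. A Python sketch of this default method (it reproduces the quantile() output for the squares example above):

```python
# p-quantile by linear interpolation: place the n sorted values at
# positions 0, 1, ..., n-1 and read off position p*(n-1).
def quantile(xs, p):
    s = sorted(xs)
    pos = p * (len(s) - 1)
    lo = int(pos)               # unit mark at or below the break point
    frac = pos - lo             # how far past that mark the break falls
    if frac == 0:
        return float(s[lo])
    return s[lo] + frac * (s[lo + 1] - s[lo])

squares = [k * k for k in range(1, 11)]   # 1, 4, 9, ..., 100
print([quantile(squares, p) for p in (0, 0.25, 0.5, 0.75, 1)])
# → [1.0, 10.75, 30.5, 60.25, 100.0]
```

The .75-quantile, for instance, falls at position 6.75, three quarters of the way from 49 to 64: 49 + .75(64 − 49) = 60.25, matching the R output.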
The 0-quantile and 1-quantile (at least in the default scheme) are the minimum and maximum of the data set. The .5-quantile gives the median, and the .25- and .75-quantiles (also called the first and third quartiles) isolate the middle 50% of the data. When the quartiles are close together, then most (well, half, to be more precise) of the values are near the median. If those numbers are farther apart, then much (again, half) of the data is far from the center. The difference between the first and third quartiles is called the inter-quartile range, abbreviated IQR. This is our first numerical measure of dispersion. The five number summary is also computed by the R function fivenum(). However, fivenum() uses yet another method of computing the quartiles. The .25- and .75-quantiles computed this way are called the lower hinge and upper hinge. The computation of the lower hinge depends on whether there is an even or odd number of data points. If there is an even number of points, the lower hinge is simply the median of the lower half of the data. If there is an odd number of points, the lower hinge is the median of the lower half of the data with the middle data point included in that lower half. The upper hinge is computed in exactly the same way, with the middle point again considered part of the upper half of the data if there is an odd number of data points. The five-number summary is often presented by means of a boxplot. The standard R function is boxplot() and the lattice function is bwplot(). A boxplot of the Sepal.Width of the iris data is in Figure 1.7 and was generated by

> bwplot(iris$Sepal.Width)

[Figure 1.7.: Boxplot of Sepal.Width of iris data.]

The sides of the box are drawn at the hinges. The median is represented by a dot in the box. In some boxplots, the whiskers extend out to the maximum and minimum values.
However, the boxplot that we are using here attempts to identify outliers. Outliers are values that are unusually large or small; they are indicated by a special symbol beyond the whiskers. The whiskers are then drawn from the box to the largest and smallest non-outliers. One common rule for automating outlier detection for boxplots is the 1.5 IQR rule. This is the default rule in both boxplot functions in R. Under this rule, any value that is more than 1.5 IQR away from the box is marked as an outlier. Indicating outliers in this way is useful since it allows us to see whether a whisker is long only because of one extreme value.

Variance and Standard Deviation

Another important way to measure the dispersion of a distribution is to compare each value with the center of the distribution. If the distribution is spread out, these differences will tend to be large; otherwise these differences will be small. To get a single number, we could simply add up all of the deviations from the mean:

    total deviation from the mean = Σ_{i=1}^{n} (x_i − x̄) .

The trouble with this is that the total deviation from the mean is always 0 (see Exercise 1.5). The problem is that the negative deviations and the positive deviations always exactly cancel out. To fix this problem we might consider taking the absolute value of the deviations from the mean:

    total absolute deviation from the mean = Σ_{i=1}^{n} |x_i − x̄| .

This number will be 0 only if all of the data values are equal to the mean. Even better would be to divide by the number of data values; otherwise large data sets will have large sums even if the values are all close to the mean:

    mean absolute deviation = (1/n) Σ_{i=1}^{n} |x_i − x̄| .

This is a reasonable measure of the dispersion in a distribution, but we will not use it very often. There is another measure that is much more common, namely the variance, which is defined by

    variance = Var(x) = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² .
You will notice two differences from the mean absolute deviation. First, instead of using an absolute value to make things positive, we square the deviations from the mean. The chief advantage of squaring over the absolute value is that it is much easier to do calculus with a polynomial than with functions involving absolute values. The second difference is that we divide by n − 1 instead of by n. There is a good reason for this, even though dividing by n seems more natural. We will get to that reason in the chapter on inference for a single variable. For now, we'll use this heuristic for remembering the n − 1: if you know the mean and all but one of the values of a variable, then you can determine the remaining value, since the sum of all the values must be the product of the number of values and the mean. So once the mean is known, there are only n − 1 independent pieces of information remaining. Because the squaring changes the units of this measure, the square root of the variance, called the standard deviation, is commonly used in place of the variance:

    standard deviation = SD(x) = √Var(x) .

We will sometimes use the notation s_x and s_x² for the standard deviation and variance respectively. All of these quantities are easy to compute in R.

> x=c(1,3,5,5,6,8,9,14,14,20);
> mean(x);
[1] 8.5
> x - mean(x);
 [1] -7.5 -5.5 -3.5 -3.5 -2.5 -0.5  0.5  5.5  5.5 11.5
> sum(x - mean(x));
[1] 0
> abs(x - mean(x));
 [1]  7.5  5.5  3.5  3.5  2.5  0.5  0.5  5.5  5.5 11.5
> sum(abs(x - mean(x)));
[1] 46
> (x - mean(x))^2;
 [1]  56.25  30.25  12.25  12.25   6.25   0.25   0.25  30.25  30.25 132.25
> sum((x - mean(x))^2);
[1] 310.5
> n= length(x);
> 1/(n-1) * sum((x - mean(x))^2);
[1] 34.5
> var(x);
[1] 34.5
> sd(x);
[1] 5.87367
> sd(x)^2;
[1] 34.5

[Figure 1.8.: Box plot for iris sepal length as a function of Species.]
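The arithmetic in the R session above follows directly from the definitions. A Python sketch that reproduces the same numbers:

```python
# Sample variance: squared deviations from the mean, summed,
# divided by n - 1; the standard deviation is its square root.
import math

def var(xs):
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def sd(xs):
    return math.sqrt(var(xs))

x = [1, 3, 5, 5, 6, 8, 9, 14, 14, 20]
print(var(x))            # 34.5, as in the R session
print(round(sd(x), 5))   # 5.87367
```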
1.5. The Relationship Between Two Variables

Many scientific problems are about describing and explaining the relationship between two or more variables. In the next two sections, we begin to look at graphical and numerical ways to summarize such relationships. In this section, we consider the case where one or both of the variables are categorical. We first consider the case when one of the variables is categorical and the other is quantitative. This is the situation with the iris data if we are interested in the question of how, say, Sepal.Length varies by Species. A very common way of beginning to answer this question is to construct side-by-side boxplots.

> bwplot(Sepal.Length~Species,data=iris)

We see from these boxplots (Figure 1.8) that the virginica variety of iris tends to have the longest sepal length, though the sepal lengths of this variety also have the greatest variation.

[Figure 1.9.: Sepal lengths of three species of irises.]

The notation used in the first argument of bwplot() is called formula notation and is extremely important when considering the relationship between two variables. This formula notation is used throughout lattice graphics and in other R functions as well. The simplest form of a formula is

    y ~ x

We will often read this formula as “y modelled by x”. In general, the variable y is the dependent variable and x the independent variable. In this example, it is more natural to think of Species as the independent variable. There is nothing logically incorrect, however, in thinking of sepal length as the independent variable. Usually, for plotting functions, y will be the variable presented on the vertical axis, and x the variable plotted along the horizontal axis. In this case, we are modeling (or describing) sepal length by species. The formula notation can also be used with the lattice function histogram().
For example,

> histogram(~Sepal.Length,data=iris)

will produce a histogram of the variable Sepal.Length. In this case, the dependent variable in the formula is omitted, since the dependent variable – the frequency of each class – is computed by histogram(). Side-by-side histograms can be generated with a more general form of the formula syntax. The same information that is in the boxplots above is contained in the side-by-side histograms of Figure 1.9.

> histogram(~Sepal.Length | Species,data=iris,layout=c(3,1))

In this form of the formula,

    y ~ x | z

the variable z is a conditioning variable, used to break the data into different groups. In the case of histogram(), the different groups are plotted in separate panels. When z is categorical, there is one panel for each level of z. When z is quantitative, the data is divided into a number of sections based on the values of z. The formula notation is used for more than just graphics. In the above example, we would also like to compute summary statistics (such as the mean) for each of the species separately. There are two ways to do this in R. The first uses the aggregate() function. A much easier way uses the summary() function from the Hmisc package. The summary() function allows us to apply virtually any function that has vector input to each level of a categorical variable separately.

> library(Hmisc) # load Hmisc package
Loading required package: Hmisc
...............................
> summary(Sepal.Length~Species,data=iris,fun=mean);
Sepal.Length    N=150

+-------+----------+---+------------+
|       |          |N  |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa    | 50|5.006000    |
|       |versicolor| 50|5.936000    |
|       |virginica | 50|6.588000    |
+-------+----------+---+------------+
|Overall|          |150|5.843333    |
+-------+----------+---+------------+

> summary(Sepal.Length~Species,data=iris,fun=median);
Sepal.Length    N=150

+-------+----------+---+------------+
|       |          |N  |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa    | 50|5.0         |
|       |versicolor| 50|5.9         |
|       |virginica | 50|6.5         |
+-------+----------+---+------------+
|Overall|          |150|5.8         |
+-------+----------+---+------------+

> summary(Sepal.Length~Species,iris);
Sepal.Length    N=150

+-------+----------+---+------------+
|       |          |N  |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa    | 50|5.006000    |
|       |versicolor| 50|5.936000    |
|       |virginica | 50|6.588000    |
+-------+----------+---+------------+
|Overall|          |150|5.843333    |
+-------+----------+---+------------+

Notice that the default function used in summary() computes the mean. From now on we will assume that the lattice and Hmisc packages have been loaded and will not show the loading of these packages in our examples. If you try an example in this book and R reports that it cannot find a function, it is likely that you have failed to load one of these packages. You can set up R to automatically load these two packages every time you launch R if you like. Of course none of these summaries – boxplots, histograms, or numerical summaries – can tell us whether the differences in sepal lengths among species are accidental to these 150 flowers or whether these differences are significant properties of the species. We next turn to the case where both variables are categorical. Example 1.5.1.
In 2004, over 400 incoming first-year students at Calvin College took a survey concerning, among other things, their beliefs and values. In 2007, 221 of these students were asked these same questions again. Their responses to three of the questions are included in the file http://www.calvin.edu/~stob/data/CSBVpolitical.csv. The variable SEX uses codes of 1 for male and 2 for female. The other two variables, POLIVW04 and POLIVW07, refer to the question “How would you characterize your political views?” as answered in 2004 and 2007. The coded responses are

    Far right           1
    Conservative        2
    Middle-of-the-road  3
    Liberal             4
    Far left            5

Each of these questions results in a categorical variable. We might be interested in whether there is a difference between the self-characterization of male students and female students. We might also be interested in the relationship between the views of a student in 2004 and in 2007. The first few entries of this dataset are given in the following output.

> csbv=read.csv('http://www.calvin.edu/~stob/data/CSBVpolitical.csv')
> csbv[1:5,]
  SEX POLIVW04 POLIVW07
1   1        2        2
2   1        3        3
3   2        2        2
4   1        2        2
5   2        2        2

The most useful form of summary of data that arises from two or more categorical variables is a cross tabulation. We first use a cross-tabulation to determine the relationship of the gender of a student to his or her political views as an entering first-year student.

> xtabs(~SEX+POLIVW04,csbv)
   POLIVW04
SEX  1  2  3  4
  1  7 47 28  6
  2  0 67 48 14

While the command syntax is a bit inscrutable, it should be clear how to read the table. Note that no entering students characterized their views as “Far left” and no female characterized her views as “Far right.” Also notice that it appears that males tended to be more conservative than females. The xtabs() function uses the formula syntax. As in histogram(), there is now no dependent variable in the formula, as the frequencies are computed from the data.
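The counting that xtabs() performs is simple to sketch in code. A Python version using made-up records (the data values here are invented; only the idea of tallying level pairs is being illustrated):

```python
# Two-way cross-tabulation: count how often each pair of levels
# occurs in record-level data, in the spirit of xtabs(~SEX+POLIVW04).
from collections import Counter

def crosstab(rows, cols):
    counts = Counter(zip(rows, cols))
    return {pair: counts[pair] for pair in sorted(counts)}

# Made-up records: (sex code, political-view code) for seven students.
sex  = [1, 1, 2, 1, 2, 2, 1]
view = [2, 3, 2, 2, 2, 4, 2]

print(crosstab(sex, view))
# → {(1, 2): 3, (1, 3): 1, (2, 2): 2, (2, 4): 1}
```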
Also, the formula has the form ~ x1 + x2, where the plus sign indicates that there are two independent variables. Another example of xtabs(), with just one independent variable, is

> xtabs(~SEX ,csbv)
SEX
  1   2 
 88 133 

which counts the number of males and females in our dataset. In this first example of xtabs() our dataset contained a record for each observation. It is quite often the case that we are only given summary data.

Example 1.5.2. Data on graduate school admissions to six different departments of the University of California, Berkeley, in 1973 are summarized in the dataset http://www.calvin.edu/~stob/data/Berkeley.csv.

> Admissions=read.csv('http://www.calvin.edu/~stob/data/Berkeley.csv')
> Admissions[c(1,10,19),]
      Admit Gender Dept Freq
1  Admitted   Male    A  512
10 Rejected   Male    C  205
19 Admitted Female    E   94

We see that 512 Males were admitted to Department A, while 205 Males were rejected by Department C. We now use the xtabs() function with a dependent variable:

> xtabs(Freq~Gender+Admit,Admissions)
        Admit
Gender   Admitted Rejected
  Female      557     1278
  Male       1198     1493

There seems to be a relationship between the two variables in this cross-tabulation. Females were rejected at a greater rate than Males. While this might be evidence of gender bias at Berkeley, further analysis tells a more complicated story.

> xtabs(Freq~Gender+Admit+Dept,Admissions)
, , Dept = A

        Admit
Gender   Admitted Rejected
  Female       89       19
  Male        512      313

, , Dept = B

        Admit
Gender   Admitted Rejected
  Female       17        8
  Male        353      207

, , Dept = C

        Admit
Gender   Admitted Rejected
  Female      202      391
  Male        120      205

, , Dept = D

        Admit
Gender   Admitted Rejected
  Female      131      244
  Male        138      279

, , Dept = E

        Admit
Gender   Admitted Rejected
  Female       94      299
  Male         53      138

, , Dept = F
        Admit
Gender   Admitted Rejected
  Female       24      317
  Male         22      351

In all but two departments, females are admitted at a greater rate than males, while in those two departments the admission rates are quite similar. The next example again illustrates the difficulty of trying to explain the relationship between two categorical variables, in this case race and the death penalty.

Example 1.5.3. A 1981 paper investigating racial biases in the application of the death penalty reported on 326 cases in which the defendant was convicted of murder. For each case they noted the race of the defendant, the race of the victim, and whether or not the death penalty was imposed.

> deathpenalty=read.table('http://www.calvin.edu/~stob/data/deathPenalty.txt',header=T)
> deathpenalty[1:5,]
  Penalty Victim Defendant
1     Not  White     White
2     Not  Black     Black
3     Not  White     White
4     Not  Black     Black
5   Death  White     Black

> xtabs(~Penalty+Defendant,data=deathpenalty)
        Defendant
Penalty Black White
  Death    17    19
  Not     149   141

(We have used read.table(), which is suitable for reading files that are not CSV but in which the data is separated by spaces. Unlike read.csv(), read.table() does not assume a header with variable names, which is why we supplied header=T.) From the output, it does not look like there is much of a difference in the rates at which black and white defendants receive the death penalty, although a white defendant is slightly more likely to receive it. However, a different picture emerges if we take into account the race of the victim.

> xtabs(~Penalty+Defendant+Victim,data=deathpenalty)
, , Victim = Black

        Defendant
Penalty Black White
  Death     6     0
  Not      97     9

, , Victim = White

        Defendant
Penalty Black White
  Death    11    19
  Not      52   132

It appears that black defendants are more likely to receive the death penalty when the victim is black and also when the victim is white. In the last example, we met something called Simpson's Paradox.
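The Berkeley reversal seen earlier is the same phenomenon, and it can be verified directly from the printed frequencies. A Python sketch (the counts are exactly those in the Berkeley cross-tabulations above):

```python
# Admission rates overall and by department, from the Berkeley counts.
# Each entry: dept -> {gender: (admitted, rejected)}.
counts = {
    "A": {"Female": (89, 19),   "Male": (512, 313)},
    "B": {"Female": (17, 8),    "Male": (353, 207)},
    "C": {"Female": (202, 391), "Male": (120, 205)},
    "D": {"Female": (131, 244), "Male": (138, 279)},
    "E": {"Female": (94, 299),  "Male": (53, 138)},
    "F": {"Female": (24, 317),  "Male": (22, 351)},
}

def rate(adm, rej):
    return adm / (adm + rej)

# Aggregated over departments, females appear to fare worse...
overall = {}
for g in ("Female", "Male"):
    adm = sum(counts[d][g][0] for d in counts)
    rej = sum(counts[d][g][1] for d in counts)
    overall[g] = rate(adm, rej)
print(overall)

# ...yet department by department, the female admission rate is
# greater than or comparable to the male rate (Simpson's Paradox).
for d, by_gender in counts.items():
    f = rate(*by_gender["Female"])
    m = rate(*by_gender["Male"])
    print(d, round(f, 3), round(m, 3))
```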
Specifically, we found that a relationship between two categorical variables (white defendants receive the death penalty more frequently) is reversed if we divide the analysis by a third categorical variable (black defendants receive the death penalty more often both when the victim is white and when the victim is black). A cross-table is usually the most useful way to present data on the relationship between two categorical variables. A graphical representation that is sometimes used, however, is called a mosaic plot. We illustrate the relationship between gender and political views in 2007 for the Calvin sample of 221 students. The function call is

> mosaicplot(~SEX+POLIVW07,csbv)

and it generates the picture in Figure 1.10. Here area is proportional to frequency. It is easy to see here (if we recall the codes) that the female student population is somewhat less conservative in political orientation than the male population.

[Figure 1.10.: A mosaic plot of the relationship between political views and gender.]

[Figure 1.11.: The corrosion data with a “good” line added on the right.]

1.6. Two Quantitative Variables

A very common problem in science is to describe and explain the relationship between two quantitative variables. Often our scientific theory (or at least our intuition) suggests that two variables have a relatively simple functional relationship, at least approximately. We look at three typical examples.

Example 1.6.1. Thirteen bars of 90-10 Cu/Ni alloy were submerged for sixty days in sea water. The bars varied in iron content. The weight loss due to corrosion for each bar was recorded. The R dataset below gives the percentage content of iron (Fe) and the weight loss in mg per square decimeter (loss).
> library(faraway)
> data(corrosion)
> corrosion[c(1:3,12:13),]
     Fe  loss
1  0.01 127.6
2  0.48 124.0
3  0.71 110.8
12 1.44  91.4
13 1.96  86.2
> xyplot(loss~Fe, data=corrosion)
> xyplot(loss~Fe,data=corrosion,type=c("p","r")) # plot has points, regression line

It is evident from the plot (Figure 1.11) that the greater the percentage of iron, the less the corrosion. The plot suggests that the relationship might be linear. In the second plot, a line is superimposed on the data. The line is meant to summarize, approximately, the linear relationship between iron content and corrosion. (We will explain how to choose the line soon.) Note that to plot the relationship between two quantitative variables, we may use either plot() from the base R package or xyplot() from lattice. The function xyplot() uses the same formula notation as histogram().

    Distance   Time      Record Holder
    100        9.77      Asafa Powell (Jamaica)
    200        19.32     Michael Johnson (US)
    400        43.18     Michael Johnson (US)
    800        1:41.11   Wilson Kipketer (Denmark)
    1000       2:11.96   Noah Ngeny (Kenya)
    1500       3:26.00   Hicham El Guerrouj (Morocco)
    Mile       3:43.13   Hicham El Guerrouj (Morocco)
    2000       4:44.79   Hicham El Guerrouj (Morocco)
    3000       7:20.67   Daniel Komen (Kenya)
    5000       12:37.35  Kenenisa Bekele (Ethiopia)
    10,000     26:17.53  Kenenisa Bekele (Ethiopia)

    Table 1.1.: Men's World Records in Track (IAAF)

What is the role of the line that we superimposed on the plot of the data in this example? Obviously, we do not mean to claim that the relationship between iron content and corrosion loss is completely captured by the line. But as a “model” of the relationship between these variables, the line has at least three important uses. First, it provides a succinct description of the relationship that is difficult to see in the unsummarized data. The line plotted has equation

    loss = 129.79 − 24.02 Fe.

Both the intercept and slope of this line have simple interpretations.
For example, the slope suggests that every increase of 1% in iron content means a decrease in corrosion loss of 24.02 mg per square decimeter. Second, the model might be used for prediction in a situation where we have a yet-untested object. We can easily use this line to predict the material loss of an alloy of 2% iron content. Finally, it might figure in a scientific explanation of the phenomenon of corrosion.

Example 1.6.2. The current world records for men’s track appear in Table 1.1. These data may be found at http://www.calvin.edu/~stob/data/mentrack.csv. The plot of record distances (in meters) against times (in seconds) looks roughly linear. We know of course (for physical reasons) that this relationship cannot be exactly linear. Nevertheless, it appears that a smooth curve might approximate the data very well and that this curve might have a relatively simple formula. Such a formula might help us predict what the world record time in a 4,000 meter race would be (if ever such a race were run by world-class runners).

[Figure: scatter plot of Seconds against Meters for the men’s track records.]

Example 1.6.3. The R dataset trees contains measurements of the volume (in cubic feet), girth (diameter of the tree in inches, measured at 4 ft 6 in above the ground), and height (in ft) of 31 black cherry trees in a certain forest. Since girth is easily measured, we might want to use girth to predict the volume of the tree. A plot shows the relationship.

> data(trees)
> trees[c(1:2,30:31),]
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
30  18.0     80   51.0
31  20.6     87   77.0
> xyplot(Volume~Girth, data=trees)

[Figure: scatter plot of Volume against Girth for the trees data.]

In this example, we probably wouldn’t expect that a linear relationship is the best way to describe the data.
Furthermore, the data indicate that no simple function is going to describe completely the variation in volume as a function of girth. This makes sense because we know that trees of the same girth can have different volumes.

These three examples share the following features. In each, we are given $n$ observations $(x_1, y_1), \dots, (x_n, y_n)$ of quantitative variables $x$ and $y$. In each we would like to express the relationship between $x$ and $y$, at least approximately, using a simple functional form. In each case we would like to find a “model” that explains $y$ in terms of $x$. Specifically, we would like to find a simple functional relationship $y = f(x)$ between these variables. Summarizing, our goal is the following.

Goal: Given $(x_1, y_1), \dots, (x_n, y_n)$, find a “simple” function $f$ such that $y_i$ is approximately equal to $f(x_i)$ for every $i$.

The goal is vague. We need to make precise both the notion of “simple” and the measure of fit we will use in evaluating whether $y_i$ is close to $f(x_i)$. In the rest of this section, we make these two notions precise.

The simplest functions we study are linear functions, such as the function that we used in Example 1.6.1. In other words, in this case our goal is to find $b_0$ and $b_1$ so that $y_i \approx b_0 + b_1 x_i$ for all $i$. (Statisticians use $b_0, b_1$ or $a, b$ for the intercept and slope rather than the $b, m$ that are typical in mathematics texts. We will use $b_0, b_1$.) Of course, in only one of our motivating examples does it seem sensible to use a line to approximate the data. So two important questions that we will need to address are: How do we tell whether a line is an appropriate description of the relationship? And what do we do if a linear function is not the right relationship? We will address both questions later.

How shall we measure the goodness of fit of a proposed function $f$ to the data? For each $x_i$ the function $f$ predicts a certain value $\hat{y}_i = f(x_i)$ for $y_i$.
Then $r_i = y_i - \hat{y}_i$ is the “mistake” that $f$ makes in the prediction of $y_i$. Obviously we want to choose $f$ so that the values $r_i$ are small in absolute value. Introducing some terminology, we call $\hat{y}_i$ the fitted or predicted value of the model and $r_i$ the residual. The following is a succinct statement of the relationship:

observation = fitted + residual.

It will be impossible to choose a line so that all the values of $r_i$ are simultaneously small (unless the data points are collinear). Various values of $b_0, b_1$ might make some values of $r_i$ small while making others large. So we need some measure that aggregates all the residuals. Many choices are possible, and R provides software to find the resulting lines for many of them, but the canonical choice, and the one we investigate here, is the sum of squares of the residuals. Namely, our goal is now refined to the following.

Goal: Given $(x_1, y_1), \dots, (x_n, y_n)$, find $b_0$ and $b_1$ such that if $f(x) = b_0 + b_1 x$ and $r_i = y_i - f(x_i)$, then $\sum_{i=1}^{n} r_i^2$ is minimized.

We call $\sum_{i=1}^{n} r_i^2$ the sum of squares of residuals and denote it by SSResid or SSE (for sum of squares of error). The choice of the squaring function here is quite analogous to the choice of squaring in the definition of variance for measuring variation. Just as in that problem, different ways of combining the $r_i$ are possible.

Before we discuss the solution of this problem, we show how to solve it in R using the data of Example 1.6.1. The R function lm finds the coefficients of the line that minimizes the sum of squares of the residuals. Note that it uses the same syntax for expressing the relationship between variables as does xyplot.
> lm(loss~Fe,data=corrosion)

Call:
lm(formula = loss ~ Fe, data = corrosion)

Coefficients:
(Intercept)           Fe
     129.79       -24.02

While we will always use R to solve our minimization problem, it is worthwhile to solve explicitly for $b_0$ and $b_1$ so that we see how these coefficients are related to the values of the data. Finding $b_0$ and $b_1$ is a minimization problem of the sort addressed in calculus classes. In particular, we want to find $b_0$ and $b_1$ to minimize

$$\mathrm{SSResid} = \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2.$$

It is important to note that SSResid is a function of $b_0$ and $b_1$ thought of as variables (the $x_i$ and $y_i$ that appear in this function are not variables but rather have numerical values), and so the task of finding $b_0$ and $b_1$ is that of minimizing a function of two variables. Since the function is nicely differentiable (one consequence of using squares rather than absolute values), calculus tells us to find the points where the partial derivatives of SSResid with respect to each of $b_0$ and $b_1$ are 0. (Of course we then have to check that we have found a minimum rather than a maximum or a saddle point.) After much algebra, we find that

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}.$$

Therefore the equation of our “least-squares” line is

$$y = \bar{y} + \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\,(x - \bar{x}).$$

The quantities in these expressions are tedious to write, so we introduce some useful abbreviations:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s_x^2 = S_{xx}/(n-1),$$
$$\mathrm{SST} = S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad s_y^2 = S_{yy}/(n-1),$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$

We can now rewrite the expression for $b_1$ as

$$b_1 = \frac{S_{xy}}{S_{xx}}$$

and the equation for the line as

$$y - \bar{y} = \frac{S_{xy}}{S_{xx}}\,(x - \bar{x}).$$

An important fact that we note immediately from the above equation is that the line passes through the point $(\bar{x}, \bar{y})$. This says that, whatever else, we should predict that the value of $y$ is “average” if the value of $x$ is “average”. This seems like a plausible thing to do.
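The “much algebra” mentioned above is routine. Setting the two partial derivatives of SSResid to zero gives the so-called normal equations, sketched here:

```latex
\frac{\partial\,\mathrm{SSResid}}{\partial b_0}
  = -2\sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr) = 0,
\qquad
\frac{\partial\,\mathrm{SSResid}}{\partial b_1}
  = -2\sum_{i=1}^{n} x_i \bigl(y_i - b_0 - b_1 x_i\bigr) = 0.
```

The first equation immediately gives $b_0 = \bar{y} - b_1 \bar{x}$; substituting this into the second and simplifying yields $b_1 = S_{xy}/S_{xx}$.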
The slope $b_1$ of the regression line tells us something about the nature of the linear relationship between $x$ and $y$. A positive slope suggests a positive relationship between the two quantities, for example. However, the slope has units; we would like a dimensionless measure of the linear relationship. The key to finding such a measure is to re-express the variables $x$ and $y$ as unit-free quantities, that is, to “standardize” $x$ and $y$. In Problem 1.13 we introduced the notion of standardization of a variable. If $x$ is a variable, the new variable

$$x' = \frac{x - \bar{x}}{s_x}$$

has mean 0 and standard deviation 1. This new variable is unit-less. It can be shown that the regression equation can be written as

$$\frac{y - \bar{y}}{s_y} = r\,\frac{x - \bar{x}}{s_x}$$

where $r$ is the correlation coefficient between $x$ and $y$, given by

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}.$$

It can be shown that $-1 \le r \le 1$. For the corrosion dataset we find that the correlation coefficient between iron content (Fe) and material loss due to corrosion (loss) is $-.98$.

> cor(corrosion$loss,corrosion$Fe)
[1] -0.9847435

This number can be easily interpreted using a sentence such as “loss decreases approximately .98 standard deviations for each increase of 1 standard deviation of iron content in this dataset.”

In R, the object returned by the lm() function is actually a list that contains more than just the fitted line. There are several functions to access the information contained in that object. In particular, residuals() and fitted() return vectors, of the same length as the data, containing the residuals and fitted values corresponding to each data point.
> l=lm(loss~Fe,corrosion)
> fitted(l)
        1         2         3         4         5         6         7         8
129.54640 118.25705 112.73247 106.96770 101.20293 129.54640 118.25705  95.19795
        9        10        11        12        13
112.73247  82.70761 129.54640  95.19795  82.70761
> residuals(l)
         1          2          3          4          5          6          7
-1.9464003  5.7429496 -1.9324749 -3.0677005  0.2970739  0.5535997  3.7429496
         8          9         10         11         12         13
-2.8979527  0.3675251  0.9923919 -1.5464003 -3.7979527  3.4923919

From this output, we can see that the largest residual corresponds to the second data point. For that point, (0.48, 124), the predicted value is 118.26 and the residual is 5.74. Note that a positive residual means that the prediction underestimates the actual value.

A plot of the residuals is often useful in determining whether a linear relationship is an appropriate description of the relationship between the two variables. We know that the track record data of Example 1.6.2 is not best summarized by a linear relationship. When we try to do so anyway, we get the residual plot of Figure 1.12.

> track=read.csv('http://www.calvin.edu/~stob/data/mentrack.csv')
> l=lm(Seconds~Meters,data=track)
> xyplot(residuals(l)~Meters,data=track)

The residual plot certainly suggests that there is structure in the data that is other than linear. The fitted model consistently underpredicts at short and long distances while overpredicting at intermediate distances.

Figure 1.12.: A residual plot for the male world records in track data.

1.7. Exercises

1.1 Load the built-in R dataset chickwts. (Use data(chickwts).)
a) How many individuals are in this dataset?
b) How many variables are in this dataset?
c) Classify the variables as quantitative or categorical.

1.2 The distribution of a quantitative variable is symmetric about $m$ if whenever there are $k$ data values $m + d$ there are also $k$ values $m - d$.
a) Show that if a distribution is symmetric about $m$ then $m$ is the median.
(You may need to handle separately the cases where the number of values is odd and even.)
b) Show that if a distribution is symmetric about $m$ then $m$ is the mean.
c) Create a small distribution that is not symmetric about $m$ but for which the mean and median are both equal to $m$.

1.3 Describe some situations where the mean or the median is clearly a better measure of central tendency than the other.

1.4 A bowler normally bowls a series of three games. When the author was first learning long division, he learned to compute a bowling average. However, he did not completely understand the concept, since to find the average of three games he took the average of the first two games and then averaged that with the third game. (That is, if $\bar{x}_2$ denotes the mean of the first two games and $x_3$ the score of the third game, the author thought that $\bar{x}_3 = (\bar{x}_2 + x_3)/2$.)
a) Give a counterexample to the author’s method of computing the average of three games.
b) Given $\bar{x}_2$ and $x_3$, how should $\bar{x}_3$ be computed?
c) Generalizing, given the mean $\bar{x}_n$ of $n$ observations and an additional observation $x_{n+1}$, how should the mean $\bar{x}_{n+1}$ of the $n+1$ observations be computed?

1.5 Show that the total deviation from the mean, defined by

$$\text{total deviation from the mean} = \sum_{i=1}^{n} (x_i - \bar{x}),$$

is 0 for any distribution.

1.6 Find a distribution with 10 values between 0 and 10 that has as large a variance as possible.

1.7 Find a distribution with 10 values between 0 and 10 that has as small a variance as possible.

1.8 We could compute the mean absolute deviation from the median instead of from the mean. Show that the mean absolute deviation from the median is always less than or equal to the mean absolute deviation from the mean.

1.9 Let $SS(c) = \sum (x_i - c)^2$. (SS stands for sum of squares.) Show that the smallest value of $SS(c)$ occurs when $c = \bar{x}$. This shows that the mean is a minimizer of SS. (Hint: use calculus.)

1.10 Sketch a boxplot of a distribution that is positively skewed.
1.11 Suppose that $x_1, \dots, x_n$ are the values of some variable and a new variable $y$ is defined by adding a constant $c$ to each $x_i$. In other words, $y_i = x_i + c$ for all $i$.
a) How does $\bar{y}$ compare to $\bar{x}$?
b) How does $\mathrm{Var}(y)$ compare to $\mathrm{Var}(x)$?

1.12 Repeat Problem 1.11 but with $y_i$ defined by multiplying $x_i$ by $c$. In other words, $y_i = c x_i$ for all $i$.

1.13 Suppose that $x_1, \dots, x_n$ are given and we define a new variable $z$ by

$$z_i = \frac{x_i - \bar{x}}{s_x}.$$

What are the mean and the standard deviation of the variable $z$? This transformed variable is called the standardization of $x$. In R, the expression z=scale(x) produces the standardization. The standard value $z_i$ of $x_i$ is also sometimes called the z-score of $x_i$.

1.14 The dataset singer comes with the lattice package. Make sure that you have loaded the lattice package and then load that dataset. The dataset contains the heights of 235 singers in the New York Choral Society.
a) Using a histogram of the heights of the singers, describe the distribution of heights.
b) Using side-by-side boxplots, describe how the heights of singers vary according to the part that they sing.

1.15 The R dataset barley has the yield in bushels/acre of barley for various varieties of barley planted in 1931 and 1932. There are three categorical variables in play: the variety of barley planted, the year of the experiment, and the site at which the experiment was done (the site Grand Rapids is in Minnesota, not Michigan). By examining each of these variables one at a time, make some qualitative statements about the way each variable affected yield. (E.g., did the year in which the experiment was done affect yield?)

1.16 A dataset from the Data and Story Library on the results of three different methods of teaching reading can be found at http://www.calvin.edu/~stob/data/reading.csv. The data include the results of various pre- and post-tests given to each student. There were 22 students taught by each method.
Using the results of POST3, what can you say about the differences in reading ability of the three groups at the end of the course? Would you say that one of the methods is better than the other two? Why or why not?

1.17 The death penalty data illustrated Simpson’s paradox. Construct your own illustration to conform to the following story: Two surgeons each perform the same kind of heart surgery. The result of the surgery can be classified as “successful” or “unsuccessful.” They have each done exactly 200 surgeries. Surgeon A has a greater rate of success than Surgeon B. Now each surgical patient’s case can be classified as either “severe” or “moderate.” It turns out that when operating on severe cases, Surgeon B has a greater rate of success than Surgeon A. And when operating on moderate cases, Surgeon B also has a greater rate of success than Surgeon A. By the way, who would you want to be your surgeon?

1.18 Data on the 2003 American League baseball season is in the file http://www.calvin.edu/~stob/data/al2003.csv.
a) Suppose that we wish to predict the number of runs (R) a team will score on the year given the number of homeruns (HR) the team will hit. Write a linear relationship between these two variables.
b) Use this linear relationship to predict the number of runs a team will score given that it hits 200 homeruns on the year.
c) Are there any teams for which the linear relationship does a poor job of predicting runs from homeruns?

1.19 Continuing to use data from the AL 2003 baseball season, suppose that we wish to predict the number of games a team will win (W) from the number of runs the team scores (R).
a) Write a linear relationship for W in terms of R.
b) How many runs must a team score to win 81 games according to this relationship?

1.20 Suppose that we wish to fit a linear model without a constant: i.e., $y = bx$. Find the value of $b$ that minimizes the sum of squares of residuals, $\sum_{i=1}^{n} (y_i - b x_i)^2$, in this case.
(Hint: there is only one variable here, $b$, so this is a straightforward Mathematics 161 max-min problem.)

1.21 In R, if we wish to fit a line $y = bx$ without the constant term, we use lm(y~x-1). (The -1 in the formula notation in this context tells R to omit the constant term.) Using the same data as Problem 1.19, define new variables for W − L and R − OR. (For example, define wl=s$W-s$L where s is the data frame containing your data.)
a) Write W − L as a linear function of R − OR without a constant term.
b) Why do you think it makes sense (given the nature of the variables) to omit a constant term in this model?

1.22 The R dataset women gives the average weight of American women by height. Do you think that a linear relationship is the best way to describe the relationship between average weight and height?

2. Data from Random Samples

If we are to make decisions based on data, we need to be careful about how the data are collected. In this chapter we consider one common way of generating data: sampling from a population.

2.1. Populations and Samples

To determine whether Kellogg’s is telling the truth about the net weight of its boxes of Raisin Bran, it is simply not feasible to weigh every box of cereal in the warehouse. Instead, the procedure recommended by NIST (the National Institute of Standards and Technology) tells us to select a sample consisting of a relatively small number of boxes and weigh those. For example, in a shipment of 250 boxes, NIST tells us to weigh just 12. The hope is that this smaller sample is representative of the larger collection, the population of all the cereal boxes. We might hope, for example, that the average weight of the boxes in the sample is close to the average weight of the boxes in the population.

Definition 2.1.1 (population). A population is a well-defined collection of individuals.

As with any mathematical set, sometimes we define a population by a census or enumeration of the elements of the population.
The registrar can easily produce an enumeration of the population of all currently registered Calvin students. Other times, we define a population by properties that determine membership in the population. (In mathematics, we define sets like this all the time, since many sets in mathematics are infinite and so do not admit enumeration.) For example, the set of all persons who voted in the last Presidential election is a well-defined population, but it doesn’t admit an easy enumeration.

Definition 2.1.2 (sample). A subset S of a population P is called a sample from P.

Quite typically, we are studying a population P but have only a sample S, together with the values of one or several variables for each element of S. The canonical goal of (inferential) statistics is:

Goal: Given a sample S from population P and values of a variable X on the elements of S, make inferences about the values of X on the elements of P.

Most commonly, we will be making inferences about parameters of the population.

Definition 2.1.3 (parameter). A parameter is a numerical characteristic of the population.

For example, we might want to know the mean value of a certain variable defined on the population. One strategy for estimating the mean of such a variable is to take a random sample and compute the mean of the sample elements. Such an estimate is called a statistic.

Definition 2.1.4 (statistic). A statistic is a numerical characteristic of a sample.

Example 2.1.5. The Current Population Survey (CPS) is a survey sponsored jointly by the Census Bureau and the Bureau of Labor Statistics. Each month 60,000 households are surveyed. The intent is to make inferences about the whole population of the United States. For example, one population parameter is the unemployment rate: the ratio of the number of those unemployed to the size of the total labor force. The sample produces a statistic that is an estimate of the unemployment rate of the whole population.
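As a toy illustration of the distinction between a parameter and a statistic, consider the following sketch. The “population” here is entirely hypothetical, and the sample() function is introduced more fully in Section 2.2:

```r
# A hypothetical "population": 4224 values of some quantitative variable.
pop <- rnorm(4224, mean = 100, sd = 15)

mean(pop)             # a parameter: the mean of the whole population

s <- sample(pop, 60)  # a simple random sample of size 60
mean(s)               # a statistic: the sample mean, an estimate of the parameter
```

Running this a few times shows that the statistic varies from sample to sample while the parameter stays fixed.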
Obviously, our success in using a sample to make inferences about a population will depend to a large extent on how representative S is of the whole population P with respect to the properties measured by X. As one might imagine, if the 60,000 households in the Current Population Survey are to give dependable information about the whole population, they must be chosen very carefully.

Example 2.1.6. The Literary Digest began forecasting elections in 1912. While it forecast the results of the elections accurately through 1932, in 1936 the poll predicted that Alf Landon would receive 55% of the popular vote. Of course, Roosevelt went on to win the election in a landslide with 61% of the popular vote. What went wrong with the poll? There were at least two problems with the survey. First, the Literary Digest sampled from telephone directories and automobile registration lists. Voters with telephones and automobiles in 1936 tended to be more affluent and so were somewhat more likely to favor Landon than the typical voter. Second, although the Digest sent out more than 10 million questionnaires, only 2.3 million of these were returned. So it probably is the case that voters favorable to Landon were more likely to return their questionnaires than those favorable to Roosevelt.

The representativeness of the sample will depend on how the sample is chosen. A convenience sample is a sample chosen simply by locating units that conveniently present themselves. A convenience sample of students at Calvin could be produced by grabbing the first 100 students that come through the doors of Johnny’s. It’s pretty obvious that in this case, and for convenience samples in general, there is no guarantee that the sample is likely to be representative of the whole population. In fact, we can predict some ways in which a “Johnny’s sample” would not be representative of the whole student population.
One might suppose that we could construct a representative sample by carefully choosing the sample according to the important characteristics of the units. For example, to choose a sample of 100 Calvin students, we might ensure that the sample contains 54 females and 46 males. Continuing, we would then ensure a representative proportion of first-year students, dorm residents, etc. There are several problems with this strategy. There are usually so many characteristics that we might consider that we would have to take too large a sample to get enough subjects to represent all the possible combinations of characteristics in the proportions that we desire. It might be expensive to find the individuals with the desired characteristics. We have no assurance that the subjects we choose with the desired combination of characteristics are representative of the group of all the individuals with those characteristics. Finally, even if we list many characteristics, it might be the case that the sample is unrepresentative according to some other characteristic that we didn’t think of, and that characteristic might turn out to be important for the problem at hand.

Instead of trying to construct a representative sample, most survey samples are chosen at “random.” We investigate the simplest sort of random sample in the next section.

2.2. Simple Random Samples

Definition 2.2.1 (simple random sample). A simple random sample (SRS) of size k from a population is a sample that results from a procedure for which every subset of size k has the same chance of being the sample chosen.

For example, to pick a random sample of 100 Calvin students, we might write the names of all Calvin students on index cards and choose 100 of these cards from a well-mixed bag of all the cards. In practice, random samples are often picked by computers that produce “random numbers.” (A computer can’t really produce random numbers, since a computer can only execute a deterministic algorithm.
However, computers can produce numbers that behave as if they are random. We’ll talk about what that might mean later.) In this case, we would number all students from 1 to 4,224 and then choose 100 numbers from 1 to 4,224 in such a way that any set of 100 numbers has the same chance of occurring. The R command sample(1:4224,100,replace=F) will choose such a set of 100 numbers.

It is certainly possible that a random sample is unrepresentative in some significant way. Since all possible samples are equally likely to be chosen, it is by definition possible that we choose a bad sample. For example, a random sample of Calvin students might fail to have any seniors in it. However, the fact that a sample is chosen by simple random sampling enables us to make quantitative statements about the likelihood of certain kinds of nonrepresentativeness. This in turn will enable us to make inferences about the population and to make statements about how likely it is that our inferences are accurate. In Chapter 5 we will see how to place some bounds on the error that using a random sample might produce.

Definition 2.2.2 (sampling error). The sampling error of an estimate of a population parameter is the error that results from using a sample rather than the whole population to estimate the parameter.

Of course, we cannot know the sampling error exactly (that would be equivalent to knowing the population parameter). But we will be able to place some bounds on it. High-quality public opinion polls are usually published with some information about the sampling error. For example, typical political polls are expressed this way: Mitt Romney is favored by 43% of the Iowa voters (with a margin of error of ±3%).
While we will learn to interpret this statement carefully in Section 4.3, it means roughly that we can be reasonably sure that 40%–46% of the population of Iowa voters favors Romney, if the only errors made in this process are those introduced by using a sample rather than the whole population. (Though this survey was reported the day before the Iowa caucuses, Romney actually received only 25.2% of the votes in those caucuses.)

To see how sampling error might work, we return to the data on US counties.

Example 2.2.3. Recall that the dataset http://www.calvin.edu/~stob/data/uscounties.csv contains data on the 3,141 county equivalents in the United States. Suppose that we take a random sample of 10 counties from this population. How representative is it? For example, can we make inferences about the mean population per county from a sample of size 10? (Of course, in this instance we know the actual mean population per county, about 89,596, so we do not need a sample to estimate it!) There are too many possible samples of size 10 to investigate them all, but we can get an idea of what might happen by taking many different samples. In the following example, we collect 10,000 different random samples of size 10. Notice that one of these samples had a mean population as small as 8,392, and another had a mean larger than 1.1 million. Half of the samples had means between 38,219 and 107,426. It looks like using a sample of size 10 would, more often than not, produce a sample with mean considerably less than the population mean. This is to be expected, since the distribution of populations by county is highly skewed. Notice also from the example that samples of size 30 produce a narrower range of estimates than samples of size 10. That is of course not surprising. The distributions of the 10,000 sample means, for samples of size 10 and of size 30, appear in the histograms of Figure 2.1.
> mean(counties$Population)
[1] 89596.28
> fivenum(counties$Population)
[1]      67   11206   24595   61758 9519338
> samples = replicate(10000, mean(sample(counties$Population,10,replace=F)))
> fivenum(samples)
[1]    8391.70   38219.15   62015.35  107425.60 1122651.50
> samples30 = replicate(10000, mean(sample(counties$Population,30,replace=F)))
> fivenum(samples30)
[1]  18066.50  56462.10  78047.07 107471.27 592331.20

Figure 2.1.: Sample means of 10,000 samples of size 10 (left) and size 30 (right) of U.S. counties.

Of course, the description of simple random sampling above is an idealized picture of what happens in the real world. We are assuming that we can produce a dependable list of the entire population, that we can have access to any subset of a particular size from that population, and that we get perfect information about the sample that we choose. The Current Population Survey Technical Manual spends considerable effort identifying and attempting to measure non-sampling error. It lists several basic kinds of such errors [Bur06]:

1. Inability to obtain information about all sample cases (unit non-response).
2. Definitional difficulties.
3. Differences in the interpretation of questions.
4. Respondent inability or unwillingness to provide correct information.
5. Respondent inability to recall information.
6. Errors made in data collection, such as recording and coding data.
7. Errors made in processing the data.
8. Errors made in estimating values for missing data.
9. Failure to represent all units with the sample (i.e., under-coverage).

Most surveys of real populations (of people) fall prey to some or all of these problems.

Example 2.2.4.
The US National Immunization Survey attempts to determine how many young children receive the common vaccines against childhood illnesses. For example, in 2006 this survey estimated that 92.9% of children aged 19–35 months at the time of the survey had received at least three doses of one of the polio vaccines. The sampling error reported for this estimate is 0.6%. The survey itself is a telephone survey of households with young children, covering in all at least 30,000 children. One issue with a telephone survey is that not all children of the appropriate age live in a household with a telephone. Also, it is extremely difficult to choose telephone numbers at random.

Though we would like a list of the entire population from which to choose our sample, as in the previous example we must often choose our sample from another list that does not “cover” the population. The sampling frame is the list of individuals from which we actually choose our sample. The quality of the sampling frame is one of the most important factors in ensuring a representative sample. Political pollsters, for example, would like a list of all and only those persons who will actually vote in the election. Usual sampling frames will omit some of these voters but will also include many persons who will not vote.

Example 2.2.5. In 2004 during Quest, all incoming Calvin students were given a survey, the CIRP Freshman Survey. In other words, the “sample” was actually the whole first-year class. However, only 43% of the first-year students actually filled out the survey and returned it. In the Spring of 2007 (when they were Juniors), the students who had returned the survey had a GPA substantially higher on average than that of the students who had not returned it. So the sample of students studied in this survey was not representative of the first-year students of 2004 in at least one important way.

The response rate in the National Immunization Survey is about 75%.
Considerable effort is expended in determining in what ways non-responders might differ from responders.

2.3. Other Sampling Plans

The concept of random sampling can be extended to produce samples other than simple random samples. There are a number of reasons that we might want to choose a sample that is not a simple random sample. One important reason is to reduce sampling error.

Class Level   Population   Sample
First-year         1,129       27
Sophomore          1,008       24
Junior               897       21
Senior             1,041       24
Other                149        4
Total              4,224      100

Table 2.1.: Population of Calvin Students and Proportionate Sample Sizes

Consider the situation in which the population in question has several subpopulations that differ substantially on the variables in question. For example, suppose that we wish to survey Calvin College students to determine whether they favor abolishing the Interim. It seems likely that the seniors (who have taken three or four interims) might in general have a higher opinion of the Interim than first-year students, who have only taken DCM. Then a simple random sample in which first-year students happen to be overrepresented is likely to underestimate the percentage of students favoring the Interim. A sample in which the classes are represented proportionally is an obvious strategy for overcoming this bias.

Example 2.3.1. Suppose that we wish to have a sample of Calvin students of size 100 in which the classes are represented proportionally. We should then choose a sample according to the breakdown in Table 2.1. Once we have defined the sizes of our subsamples, it seems wise to proceed to choose simple random samples from each subpopulation.

Definition 2.3.2 (stratified random sample). A stratified random sample of size k from a population is a sample that results from a procedure that chooses simple random samples from each of a finite number of groups (strata) that partition the population.
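The proportional sample sizes of Table 2.1 can be computed in R. This is only a sketch: simple rounding need not sum to exactly 100, so the rounded counts may need a small adjustment by hand (as was done in the table).

```r
# Proportional allocation of a sample of 100 across the strata of Table 2.1.
pop = c(FirstYear = 1129, Sophomore = 1008, Junior = 897,
        Senior = 1041, Other = 149)
alloc = round(100 * pop / sum(pop))
alloc        # rounded stratum sample sizes
sum(alloc)   # check the total; adjust by hand if it is not exactly 100
```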
In the example of sampling from the Calvin student body, we chose the random sample so that the number of individuals in the sample from each stratum was proportional to the size of the stratum. While this procedure has much to recommend it, it is not necessary and sometimes not even desirable. For example, only 4 “other” students appear in our sample of size 100 from the whole population. This is fine if we are only interested in making inferences about the whole population, but often we would like to say something about the subgroups as well. For example, we might want to know how much Calvin students work in off-campus jobs, but we might expect and would like to discover differences among the class levels in this variable. For this purpose, we might choose a sample of 20 students from each of the five strata. (Of course we would have to be careful about how to combine our numbers when making inferences about the whole population.) We would say about this sample that we have “oversampled” one of the groups. In public opinion polls, it is often the case that small minority groups are oversampled. The sample that results will still be called a random sample.

Definition 2.3.3 (random sample). A random sample of size k from a population is a sample chosen by a procedure such that each element of the population has a fixed probability of being chosen as part of the sample.

While we need to give a definition of probability in order to make this definition precise, it is clear from the above examples what we mean. This definition differs from that of a simple random sample in two ways. First, it does not require that each object have the same likelihood of being chosen as part of the sample. Second, it does not require that all possible samples of a given size be equally likely. It is obvious that stratified random sampling is a form of random sampling according to this definition.
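The parenthetical caution above (that an oversampled design requires care when combining the strata) can be illustrated with a small sketch. The stratum sizes come from Table 2.1, but the sample proportions below are entirely hypothetical, invented for illustration.

```r
# Combining strata from an oversampled design (20 students per class level).
# Stratum sizes are from Table 2.1; the sample proportions are hypothetical.
pop = c(1129, 1008, 897, 1041, 149)    # First-year, Sophomore, Junior, Senior, Other
p   = c(0.40, 0.55, 0.60, 0.90, 0.85)  # hypothetical proportion favoring the proposal
mean(p)                 # the naive, unweighted average overweights small strata
weighted.mean(p, pop)   # weighting by stratum size estimates the population proportion
```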
Other forms of sampling meet the above definition of random sampling without being simple random sampling. A sampling method that we might employ given a list of Calvin students is to choose one of the first 422 students in the list and then choose every 422nd student thereafter. Obviously some subsets can never occur as the sample, since two students whose names are next to each other in the list can never be in the same sample. Such a sample might indeed be representative, however.

It is very important to note that random sampling, of whatever form, cannot guarantee that our sample is representative of the population along the dimension we are studying. In fact, with random sampling it is guaranteed to be possible that we select a really bad (unrepresentative) sample. What we hope to be able to do (and we will later see how to do it) is to quantify our uncertainty about the representativeness of the sample.

Example 2.3.4. Another kind of modification to random sampling is used in the Current Population Survey. This survey of 60,000 households in the United States is conducted by individuals who live and work near enough to the sample subjects that they can conduct the survey in person. It is easy to imagine that 60,000 households chosen totally at random might be inconveniently distributed geographically. The CPS works as follows. First, the country is divided into about 800 primary sampling units, PSUs, which must not be too large geographically. For example, each large city (actually, each Metropolitan Statistical Area) is a PSU. Other PSUs are whole counties or pairs of contiguous counties. The PSUs are grouped into strata, and then one PSU per stratum is chosen at random (with a probability proportional to its population). The next stage of the sampling procedure is to choose at random certain housing clusters. A housing cluster is a group of four housing units in a PSU.
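The every-422nd-student scheme described above (a systematic sample) can be sketched in R as follows; the list positions stand in for actual student records.

```r
# Systematic sampling from a list of 4,224 students: pick one of the first
# 422 positions at random, then take every 422nd position after that.
start = sample(1:422, 1)
ids   = seq(start, 4224, by = 422)
ids           # the positions of the students chosen
length(ids)   # 10 or 11 students, depending on the starting position
```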
The idea behind sampling housing clusters rather than individual houses is to cut down on interviewer travel time. A larger sample is generated for the same cost. Of course the penalty for using clusters is that clusters tend to have less variability than the whole PSU in which the cluster lies, and so the group of individuals in the cluster will probably not be as representative of the PSU as a sample of similar size chosen from the PSU at random. The CPS illustrates two enhancements to simple random samples: it is multistage (but with random sampling at each stage) and it produces a cluster sample, a sample in which the ultimate sampling units are not the individuals desired but clusters of individuals.

We will not undertake a formal study of all the variants of sampling methods and their resultant sampling errors, but it is good to keep in mind that most large-scale surveys are not simple random samples but some modification thereof. Nevertheless, they all rely on the basic principle that randomness is our best hope for producing representative samples.

2.4. Exercises

2.1 In the parts below, we list some convenience samples of Calvin students. For each of these methods of sampling Calvin students, indicate in what ways the sample is likely not to be representative of the population of all Calvin students.
a) The students in Mathematics 243A.
b) The students in Nursing 329.
c) The first 30 students who walk into the FAC west door after 12:30 PM today.
d) The first 30 students you meet on the sidewalk outside Hiemenga after 12:30 PM today.
e) The first 30 students named in the “Names and Faces” picture directory.
f) The men’s basketball team.

2.2 Suppose that we were attempting to estimate the average height of a Calvin student. For this purpose, which of the convenience samples in the previous problem would you suppose to be most representative of the Calvin population? Which would you suppose to be least representative?
2.3 Consider the set of natural numbers P = {1, 2, . . . , 30} to be a population.
a) How many prime numbers are there in the population?
b) If a sample of size 10 is representative of the population, how many prime numbers would we expect to be in the sample? How many even numbers would we expect to be in the sample?
c) Using R, choose 5 different samples of size 10 from the population P. Record how many prime numbers and how many even numbers are in each sample. Make any comments about the results that strike you as relevant.

2.4 Before easy access to computers, random samples were often chosen by using tables of random digits. The tables looked something like the following table, which was constructed in R.

 [1] 40139 61007 60277 41219 45533 68878 48506 11950 07747 69280
[11] 82348 44867 12854 03179 21145 91154 84831 78503 00159 97920
[21] 09366 05554 86209 36252 33740 92037 21446 63192 87206 58877
[31] 00976 43068 88362 42080 54161 34593 18209 04344 52566 86976
[41] 83264 34861 60488 52180 03796 17289 39816 19080 64575 55492
[51] 54703 28006 03477 66384 55787 42212 55253 82256 61471 73665

Each digit in this table is supposed to occur with equal likelihood, as are all pairs, triples, etc. Suppose that a population has 280 individuals numbered 1–280. Explain whether each of the following methods of using the random number table is an appropriate method of producing a simple random sample of size 5.
a) Divide the table into three-digit groups (i.e., 401, 396, 100, etc.). Choose the first five numbers between 1 and 280 and choose the corresponding individuals. If a number is repeated, do not use it again. (So in this case, the first individual in the sample is the individual numbered 100.)
b) Proceed as in (a), but instead of throwing out the whole three-digit number if the first digit is 3 or larger, throw out only the first digit and use the next three. (On this method, the first element of the sample is the one numbered 013.)
c) As in (a), use a three-digit group. However, divide the three digits by 3 and throw away the remainder. If the result is 1–280, use that individual as the next individual in the sample. (On this method, since 401/3 = 133.7, the first element of the sample is 133.)

2.5 In a very small class, the final exam scores of the six students were 139, 145, 152, 169, 171, and 189.
a) How many different simple random samples of size 3 of students in this class are there?
b) What is the “population” mean of exam scores?
c) Suppose that we use the mean of exam scores of a SRS of size 3 to estimate the population mean. What is the greatest possible error that we could make?

2.6 Donald Knuth, the famous computer scientist, wrote a book entitled “3:16”. This book was a Bible study book that studied the 16th verse of the 3rd chapter of each book of the Bible (that had a 3:16). Knuth’s thesis was that a Bible study of random verses of the Bible might be edifying. The sample was of course not a random sample of Bible verses, and Knuth had ulterior motives in choosing 3:16. Describe a method for choosing a random sample of 60 verses from the Bible. Construct a method that is more complicated than simple random sampling and that seeks to get a sample representative of all parts of the Bible.

2.7 Suppose that we wish to survey the Calvin student body to see whether the student body favors abolishing the Interim (we could only hope!). Suppose that instead of a simple random sample, we select a random sample of size 20 from each of the five groups of Table 2.1. Suppose that of the 20 students in each group, 9 of the first-year students, 10 of the sophomores, 13 of the juniors, 19 of the seniors, and all 20 of the other students favor abolishing the Interim. Produce an estimate of the proportion of the whole student body that favors abolishing the Interim by using these sample results. Be sure to describe and justify your computation that uses these results.
2.8 There are 3,141 county equivalents in the county dataset (http://www.calvin.edu/~stob/data/uscounties.csv). Suppose that we wish to take a random sample of 60 counties. What are two different variables that might be useful to create strata for a stratified random sample?

2.9 Describe a method for choosing a random sample of 200 Calvin students using the “Names and Faces” directory.

2.10 You would like to estimate the percentage of books in the library that have red covers. Describe a method of choosing a random sample of books to help estimate this parameter. Discuss any problems that you see with constructing such a sample.

3. Probability

3.1. Random Processes

Probability theory is the mathematical discipline concerned with modeling situations in which the outcome is uncertain. For example, in choosing a simple random sample, we do not know which individuals from the population we will actually get in our sample. The basic notion is that of a probability.

Definition 3.1.1 (A probability). A probability is a number meant to measure the likelihood of the occurrence of some uncertain event (in the future).

Definition 3.1.2 (probability). Probability (or the theory of probability) is the mathematical discipline that
1. constructs mathematical models for “real-world” situations that enable the computation of probabilities (“applied” probability);
2. develops the theoretical structure that undergirds these models (“theoretical” or “pure” probability).

The setting in which we make probability computations is that of a random process. (What we call a random process is usually called a random experiment in the literature, but we use process here so as not to get the concept confused with that of a randomized experiment, a concept that we introduce later.)

Characteristics of a Random Process:
1. A random process is something that is to happen in the future (not in the past).
We can only make probability statements about things that have not yet happened.
2. The outcome of the process could be any one of a number of outcomes, and which outcome will obtain is uncertain.
3. The process could be repeated indefinitely (under essentially the same circumstances), at least in theory.

Historically, some of the basic random processes that were used to develop the theory of probability were those originating in games of chance. Tossing a coin or dealing a poker hand from a well-shuffled deck are examples of such processes. One of the most important random processes that we study is that of choosing a random sample from a population. It is clear that this process has all three characteristics of a random process.

The first step in understanding a random process is to identify what might happen.

Definition 3.1.3 (sample space, event). Given a random process, the sample space is the set (collection) of all possible outcomes of the process. An event of the random process is any subset of the sample space.

The next example lists several random processes, their sample spaces, and a typical event for each.

Example 3.1.4.
1. A fair die is tossed. The sample space can be described as the set S = {1, 2, 3, 4, 5, 6}. A typical event might be E = {2, 4, 6}; i.e., the event that an even number is rolled.
2. A card is chosen from a well-shuffled standard deck of playing cards. There are 52 outcomes in the sample space. A typical event might be “a heart is chosen,” which is a subset consisting of 13 of the possible outcomes.
3. Twenty-nine students are in a certain statistics class. It is decided to choose a simple random sample of 5 of the students. There are a boatload of possible outcomes. (It can be shown that there are 118,755 different samples of 5 students out of 29.) One event of interest is the collection of all outcomes in which all 5 of the students are male. Suppose that 25 of the students in the class are male.
Then it can be shown that 53,130 of the outcomes comprise this event.

We often have some choice as to what we call outcomes of a random process. For example, in Example 3.1.4(3), we might consider two samples different outcomes if the students in the sample are chosen in a different order, even if the same five students appear in the samples. Or we might call such samples the same outcome. To some extent, what we call an outcome depends on the way in which we are going to use the results of the random process.

Given a random process, our goal is to assign to each event E a number P(E) (called the probability of E) such that P(E) measures in some way the likelihood of E. In order to assign such numbers, however, we need to understand what they are intended to measure. Interpreting probability computations is fraught with all sorts of philosophical issues, but it is not too great a simplification at this stage to distinguish between two different interpretations of probability statements.

The frequentist interpretation. The probability of an event E, P(E), is the limit of the relative frequency with which E occurs in repeated trials of the process as the number of trials approaches infinity. In other words, if the event E occurs e_n many times in the first n trials, then on the frequentist interpretation,

P(E) = lim_{n→∞} e_n / n.

The subjectivist interpretation. The probability of an event E, P(E), is an expression of how confident the assignor is that the event will happen in the next trial of the process. The word “subjective” is usually used in science in a pejorative sense, but that is not the sense of the word here. Subjective here simply means that the assignor needs to make a judgment and that this judgment may differ from assignor to assignor. Nevertheless, this judgment might be based on considerable evidence and experience. That is, it might be expert judgment.
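The counts quoted in Example 3.1.4(3) can be checked with R's choose function (introduced more fully in Section 3.2):

```r
# Samples of 5 students from a class of 29, and samples consisting
# entirely of the 25 male students (Example 3.1.4(3)).
choose(29, 5)   # 118755 samples of size 5
choose(25, 5)   # 53130 samples containing 5 males
choose(25, 5) / choose(29, 5)   # fraction of samples that are all male
```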
Mathematics cannot tell us which of these two interpretations is “true” or even which is “better.” In some sense this is a discussion about how mathematics can be applied to the real world and so is a philosophical, not a mathematical, discussion. In this book (as is customary for introductory texts) we will explain our probability statements using frequentist language.

Notice that the frequentist approach makes an important assumption about a random process. Namely, it assumes that, given an event E, there will be a limiting relative frequency of occurrence of E in repeated trials of the random process and that this limiting relative frequency is always the same given any such infinite sequence of repeated trials. This is not something that can be proved.

Consider the simplest kind of random process, one with two outcomes. The paradigmatic example of such a process is coin tossing. The frequentist approach would say that in repeated tossing of a coin, the fraction of tosses that have produced a head approaches some limit. The next example simulates this situation.

Example 3.1.5. Suppose that we toss a coin “fairly.” That is, we toss a coin so that we expect that heads and tails are equally likely. Let E be the event that the coin turns up heads. It is reasonable to think that in large numbers of tosses the fraction of heads approaches 1/2, so that P(E) = 1/2. (Indeed, there have been many famous coin-tossers throughout the years who have tried this experiment.) Rather than toss physical coins, we illustrate what happens when a coin is tossed 1,000 times using R. In the R code below, we toss the coin 1,000 times and find that after 1,000 tosses, the relative frequency of heads is 0.499. Notice, however, that in the first 100 tosses or so, approximately 60% of the tosses were heads.
> library(lattice)
> coins = sample(c('H','T'), 1000, replace=T)
> noheads = cumsum(coins=='H')
> cumfrequency = noheads/(1:1000)
> xyplot(cumfrequency~(1:1000), type="l")
> cumfrequency[1000]
[1] 0.499

(The plot shows the cumulative relative frequency of heads over the 1,000 tosses, settling down near 0.5.)

Though in the above example the simulated frequency of heads did indeed approach 1/2, there does not seem to be any reason why it wouldn’t be possible to toss 1,000 consecutive heads or, alternatively, to have the relative frequency of heads oscillate wildly from very close to 0 to very close to 1. We will return to this issue when we discuss the Law of Large Numbers in Section 5.2. We should note here that one fact that is clear from the frequentist interpretation of probability is the following. For every event E,

0 ≤ P(E) ≤ 1.

We have already said that the sample space is a set and an event is a subset of the sample space. We will use the language of set theory extensively to talk about events.

Definition 3.1.6 (union, intersection, complement). Suppose that E and F are events in some sample space S.
1. The union of events E and F, denoted E ∪ F, is the set of outcomes that are in either E or F.
2. The intersection of events E and F, denoted E ∩ F, is the set of outcomes that are in both E and F.
3. The complement of an event E, denoted E′, is the set of outcomes that are in S but not in E.

Example 3.1.7. Suppose that a random sample of 5 individuals is chosen from a statistics class of 20 students. Let E be the event that there are at least 3 males in the sample and let F be the event that all five individuals are sophomores. Then we have

Event    Description
E ∪ F    either at least three males or all sophomores (or both)
E ∩ F    all sophomores and at least three of them male
E′       at most two males

So far we have considered random processes that have only finitely many different possible outcomes. Some random processes have infinitely many different outcomes, however. Here are two typical examples.
Example 3.1.8. A six-sided die is tossed until all six different faces have appeared on top at least once. The possible outcomes form an infinite collection, since we could toss arbitrarily many times before seeing the number 1.

Example 3.1.9. Kellogg’s packages Raisin Bran in 11 ounce boxes. We might view the weight of any particular box as the result of a random process. It is difficult to describe exactly what outcomes are possible (is a 22 ounce box of Raisin Bran possible?), but it certainly seems like at least all real numbers between 10.9 and 11.1 ounces are possible. This is already an infinite set of outcomes. An important event is that the weight of the box is at least 11 ounces.

3.2. Assigning Probabilities I – Equally Likely Outcomes

How shall we assign a probability P(E) to an event E? On the frequentist interpretation, we need to examine what happens if we repeat the experiment indefinitely. This, of course, is not usually feasible. In fact, we often want to make probability statements about a process that we will perform only once. For example, we would like to make probability statements about what might happen in the Current Population Survey, but only one random sample is chosen. So what we need to do is make some sort of model of the process and argue that the model allows us to draw conclusions about what might happen if we repeat the experiment many times.

For many random processes, we can make a plausible argument that the possible outcomes of the process are equally likely. That is, we can argue that each of the outcomes will occur about as often as any other outcome in a long series of trials. For example, when we toss a fair coin, we usually assume that in a large number of trials we will have as many heads as tails. That is, we assume that heads and tails are equally likely. That’s why coin tossing is often used as a means of choosing between two alternatives.
Similarly, given the symmetry of a six-sided die, the sides of a die should be equally likely to occur when the die is rolled vigorously. In a more important example, a procedure for random sampling is designed to ensure that all samples are equally likely to occur. In this situation, it is straightforward to assign probabilities to each event.

Definition 3.2.1 (probability in the equally likely case). Suppose that a sample space S has n outcomes that are equally likely. Then the probability of each outcome is 1/n. Also, the probability of an event E is P(E) = k/n where k is the number of outcomes in E.

The following examples illustrate this definition. In each example, the key is to list the outcomes of the process in such a way that it is apparent that they are equally likely.

Example 3.2.2. A six-sided die is rolled. Then one of six possible outcomes occurs. From the symmetry of the die it is reasonable to assume that the six outcomes are equally likely. Therefore, the probability of each outcome is 1/6. If E is the event described by “the die comes up 1 or 2,” then P(E) = 2/6 = 1/3 since the event E contains two of the outcomes. This probability assignment means that in a large number of tosses of the die, approximately one-third of them will be 1s or 2s.

Example 3.2.3. Suppose that four coins are tossed. What is the probability that exactly three heads occur? It is tempting to list the outcomes in this particular experiment as the set S = {0, 1, 2, 3, 4}, since all that we are interested in is the number of heads that occurs. However, it would be difficult to make an argument that these outcomes are equally likely. The key is to note that there are really sixteen possible outcomes if we distinguish the four coins carefully.
To see this, label the four coins (say, penny, nickel, dime, and quarter) and list the possible outcomes as a four-tuple in that order (PNDQ):

HHHH HHHT HHTH HTHH
THHH HHTT HTHT THHT
HTTH THTH TTHH HTTT
THTT TTHT TTTH TTTT

Exactly 4 of these outcomes have three heads, so that P(three heads) = 4/16 = 1/4. In fact, the following table gives the complete probability distribution of the number of heads:

no. of heads    0      1      2      3      4
probability     1/16   4/16   6/16   4/16   1/16

Example 3.2.4. In many games (e.g., Monopoly) two dice are thrown and the sum of the two numbers that occur is used to initiate some action. Rather than use the 11 possible sums as outcomes, it is easy to see that there are 36 equally likely outcomes (list the pairs (i, j) of numbers where i is the number on the first die, j is the number on the second die, and i and j range from 1 to 6). One event related to this process is the event E that the throw results in a sum of 7 on the two dice. It is easy to see that there are 6 outcomes in E, so that P(E) = 6/36 = 1/6.

For simple random processes with a small number of equally likely outcomes, it is easy to compute probabilities using Definition 3.2.1. But when the number of outcomes is so large that it is impractical to list them all, it becomes more difficult. In such a case, we need to be able to count the number of outcomes without listing them. For example, in choosing a random sample of 10 students from a large class, the number of different possible samples is very large and would be impractical to enumerate. The mathematical discipline of counting is known as combinatorics. In this text, we will not spend a great deal of time counting outcomes in complicated cases but rather leave such computations to R. However, a few of the more important principles of counting will be quite useful to us.

The Multiplication Principle

It is no accident that in rolling 2 dice there are 6² possible outcomes and that in flipping 4 coins there are 2⁴ possible outcomes.
These are special cases of what we will call the multiplication principle.

Definition 3.2.5 (cartesian product). If A and B are sets then the Cartesian product of A and B, A × B, is the set of ordered pairs of elements of A and B. That is,

A × B = {(a, b) | a ∈ A and b ∈ B}.

The Multiplication Principle is then given by the following lemma.

Lemma 3.2.6. If A has n elements and B has m elements then A × B has mn elements.

It is easy to prove this lemma (and to remember the multiplication principle) by a diagram. Let a_1, . . ., a_n be the elements of A and b_1, . . ., b_m be the elements of B. Then the elements of A × B are listed in the following two-dimensional array that has n rows and m columns, or nm entries.

(a_1, b_1)  (a_1, b_2)  . . .  (a_1, b_m)
(a_2, b_1)  (a_2, b_2)  . . .  (a_2, b_m)
   . . .
(a_n, b_1)  (a_n, b_2)  . . .  (a_n, b_m)

It is easy to see that counting the outcomes in the experiment of tossing two dice is equivalent to counting D × D where D = {1, 2, 3, 4, 5, 6}. The two sets A and B do not have to be the same, however.

Example 3.2.7. A class has 20 students, 12 male and 8 female. A male and a female are chosen at random from the class. How many possible outcomes of this process are there? It is easy to see that we are simply counting A × B where A, the set of males, has 12 elements and B, the set of females, has 8 elements. Therefore there are 12 · 8 = 96 outcomes.

The multiplication principle can be profitably generalized in two ways. First, we can extend the principle to the case of more than two sets. It is easy to see that if sets A, B, and C have n, m, and p elements respectively, there are nmp triples of elements, one from each of A, B, and C. This is because the set A × B × C can be thought of as (A × B) × C. So, for example, there are 6³ = 216 different outcomes of the process of tossing three fair dice. A second way to generalize this principle is illustrated in the following example.

Example 3.2.8.
In a certain card game, a player is dealt two cards. What is the probability that the player is dealt a pair? (A pair is two cards of the same rank. A deck of playing cards has 4 cards of each of thirteen ranks.) We first need to identify the equally likely outcomes and count them. Consider the cards being dealt in succession. There are 52 choices for the first card that the player receives. For each of these 52 cards there are 51 possible choices for the second card that the player receives. Thus there are (52)(51) = 2,652 possible equally likely outcomes. To see that this is really an application of the multiplication principle above, we could view it as counting a set that has the same size as A × B where A = {1, . . . , 52} and B = {1, . . . , 51}, or we could directly list the possible outcomes in a table as we did in the proof of the multiplication principle. To compute the probability that a pair is dealt, we also need to count the number of outcomes that are a pair. This is (52)(3) = 156, since the first card can be any card but the second card needs to be one of the three cards remaining that have the same rank as the first card. Thus the probability in question is 156/2652 ≈ 0.059.

Notice that in this example, we have treated the two cards of a given hand as being ordered by taking into account the order in which they are dealt. Of course, in a card game it does not usually matter in which order the cards of a given hand are dealt. We will later show how to compute the number of different unordered hands. Generalizing this example, we have the following principle. If two choices must be made, and there are n possibilities for the first choice and, for any first choice, m possibilities for the second choice, then there are nm many ways to make the two choices in succession.
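The pair calculation above can be checked by simulation. In this sketch a deck is represented by its ranks only (4 cards of each of 13 ranks), since a hand is a pair exactly when the two ranks agree; the simulated frequency should be close to 156/2652.

```r
# Simulating the probability of being dealt a pair.
deck  = rep(1:13, each = 4)   # 52 cards, identified only by rank
pairs = replicate(100000, {
  hand = sample(deck, 2, replace = FALSE)   # deal two cards without replacement
  hand[1] == hand[2]                        # TRUE when the hand is a pair
})
mean(pairs)   # simulated probability of a pair
156/2652      # the exact value computed above
```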
Counting Subsets

Many of our counting problems can be reduced to counting the number of subsets of a set that are of a given size.

Example 3.2.9. Suppose that a set A has 10 elements. How many different three-element subsets of A are there? To answer this question, we first count the number of ordered three-element subsets of A using the multiplication principle. It is easy to see that there are 10 · 9 · 8 = 720 of these. However, since this counts the number of ordered subsets, it counts each different (unordered) subset several times. In fact each three-element subset is counted 3 × 2 × 1 = 6 times using the same multiplication principle. (There are 3 choices for the first element, 2 for the second, and 1 for the third.) Thus there must be 720/6 = 120 different three-element subsets of A. Generalizing the example, we have

Theorem 3.2.10. Suppose that A has n elements. There are C(n, k) (read “n choose k”) many subsets of A of size k, where

C(n, k) = n! / (k!(n − k)!).

Proof. We first count the number of k-element ordered subsets of A. By the multiplication principle this is

n(n − 1)(n − 2) · · · (n − k + 1) = n! / (n − k)!.

This follows from the multiplication principle since there are n choices for the first element of the subset, n − 1 choices for the second element, and so forth down to n − k + 1 choices for the kth element. Now for any subset of size k, there are k(k − 1) · · · 1 = k! many different orderings of the elements of that subset. Thus each subset is counted k! many times in our count of the ordered subsets. So there are actually only

(n! / (n − k)!) / k! = C(n, k)

many subsets of size k of A.

The number C(n, k) is obviously an important one, and it can be computed using R. The R function choose(n,k) computes C(n, k).

Example 3.2.11. A random sample of 5 students is chosen from a class of 20 students, 12 of whom are female. What is the probability that the sample consists of 5 females? We first need to count the number of equally likely outcomes.
Since there are 20 students and an outcome is a subset of size 5 of those 20, the number of different random samples that we could have chosen is C(20, 5) = 15,504. Since the event that we are interested in is the collection of samples that have five females, we need to count how many of these 15,504 outcomes contain five females. But that is simply C(12, 5) = 792 since each sample of five females is a subset of the 12 females in the class. So the probability in question is 792/15504 = 0.051.
> choose(20,5)
[1] 15504
> choose(12,5)
[1] 792
> 792/15504
[1] 0.05108359
3.3. Probability Axioms In the last section, we considered one way of assigning probabilities to events. But we can’t always identify equally likely outcomes. Example 3.3.1. A basketball player is going to shoot two free throws. What is the probability that she makes both of them? It is easy to write the possible outcomes. Using X for a made free throw and O for a miss, the four outcomes are XX, XO, OX, and OO. In this respect, the process looks just like that of tossing a coin twice in succession. But we have no reason to think that these four outcomes are equally likely. In fact, it is almost always the case that a shooter is more likely to make a free throw than miss it, so that it is probably the case that XX is more likely to occur than OO. As we have said before, mathematics cannot tell us how to assign probabilities in situations such as Example 3.3.1. However, not just any assignment of probabilities makes sense. For example, we cannot assign a probability of 1/2 to each of the four outcomes. It is not reasonable to think that the limiting relative frequency of all four outcomes will be 1/2 if the experiment is repeated many times. In fact it seems clear that we should be looking for four numbers that sum to 1.
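For instance, if we were willing to assume, purely for illustration, that the shooter makes each free throw with probability 0.7 and that the two throws do not influence each other (an assumption nothing in the example yet justifies; it anticipates the discussion of independence in Section 3.5), one reasonable-looking assignment of four numbers summing to 1 would be:

```r
# Hypothetical assignment: P(make) = 0.7 on each throw, throws assumed independent
p <- 0.7
probs <- c(XX = p * p, XO = p * (1 - p), OX = (1 - p) * p, OO = (1 - p) * (1 - p))
probs          # 0.49 0.21 0.21 0.09
sum(probs)     # the four probabilities sum to 1, as any sensible assignment must
```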
In 1933, Andrei Kolmogorov published the first rigorous treatment of probability in which he gave axioms for a probability assignment in the same way that Euclid gave axioms for geometry. Axiom 1. For all events E, P(E) ≥ 0. Axiom 2. P(S) = 1. Axiom 3. If E and F are disjoint events (i.e., have no outcomes in common) then P(E ∪ F) = P(E) + P(F). More generally, if E1, E2, . . . is a sequence of pairwise disjoint events, then P(E1 ∪ E2 ∪ · · · ) = P(E1) + P(E2) + · · · . Axioms in mathematics are supposed to be propositions that are “intuitively obvious” and that we agree to accept as true without proof. Each of the three Kolmogorov axioms can easily be interpreted as a statement about limiting relative frequency that is obviously true. For example, the second axiom is obviously true because by our definition of a random process, one of the outcomes in the sample space must occur. Notice that the method of equally likely outcomes can be seen to rely heavily on Axiom 2 and Axiom 3. While the axioms do not directly help us assign probabilities in a case like Example 3.3.1, they do constrain our assignments. Also, they are useful in helping to compute some probabilities in terms of others. Namely, we can prove some theorems using these axioms. Proposition 3.3.2. For every event E, P(E′) = 1 − P(E). Proof. The events E and E′ are disjoint and E ∪ E′ = S. Thus P(E) + P(E′) = P(E ∪ E′) = P(S) = 1. The first equality is Axiom 3 and the last is Axiom 2. The proposition follows immediately. A curious event is ∅. Since we assume that something happens each time the random process is performed, it should be the case that P(∅) = 0. It is easy to see that this follows from the proposition and Axiom 2 since S = ∅′. Proposition 3.3.3. For any events E and F, P(E ∪ F) = P(E) + P(F) − P(E ∩ F). Proof.
We first use Axiom 3 and find that P(E) = P(E ∩ F′) + P(E ∩ F) and P(F) = P(F ∩ E) + P(F ∩ E′). Next we use Axiom 3 again to see that P(E ∪ F) = P(E ∩ F′) + P(E ∩ F) + P(E′ ∩ F). Combining, we have that P(E ∪ F) = (P(E) − P(E ∩ F)) + P(E ∩ F) + (P(F) − P(E ∩ F)), which after simplifying gives the desired result. The propositions above help us simplify probability computations, even in the case of equally likely outcomes. Example 3.3.4. From experience, an insurance company estimates that a customer that has both a homeowner’s policy and an auto policy has a probability of .83 of having no claim on either policy in a given year. These policy holders also have a probability of .15 of having an automobile claim and .05 of having a homeowner’s claim. What is the probability that such a policy holder has both a homeowner’s and an automobile claim? If E is the event of a homeowner’s claim and F the event of an auto claim, then we have P(E ∪ F) = 1 − .83 = .17. Also P(E) = .05 and P(F) = .15. Thus the event that we are looking for, E ∩ F, has probability P(E) + P(F) − P(E ∪ F) = .03. 3.4. Empirical Probabilities In Section 3.2, we saw how to assign probabilities consistent with the Kolmogorov Axioms in the case that we could identify a priori equally likely outcomes. However in many applications, the outcomes are not equally likely and there is usually no similar theoretical principle that enables us to assign probabilities with confidence. In such cases, we need some data from the real world to assign probabilities. While much of Chapter 4 will be devoted to this problem, in this section we look at a very simple method of assigning probabilities based on data. Since the probability of an event E is supposed to be the limiting relative frequency of the occurrence of E as the number of trials increases indefinitely, a very simple estimate of the probability of E is the relative frequency with which it has occurred in the past.
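To get a feel for how such relative-frequency estimates behave, we can simulate a process whose true probability we already know. The following is only a sketch (the seed is arbitrary, chosen just for reproducibility); it illustrates that the estimate from a finite number of trials is usually close to, but rarely exactly equal to, the true value.

```r
set.seed(1)                                          # arbitrary seed, for reproducibility
tosses <- sample(c("H", "T"), 100, replace = TRUE)   # 100 tosses of a fair coin
mean(tosses == "H")                                  # relative-frequency estimate of P(H) = 1/2
```

Running this repeatedly (with different seeds) gives estimates that scatter around 1/2.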
Example 3.4.1. What is the probability that number 20 of the Calvin Knights will make a free-throw when he has to shoot one in a game? As of the writing of this example, number 20 had attempted 25 free-throws and had made 22 of them. Thus the relative frequency of a made free-throw is 88%. Thus we say that number 20 has an 88% probability of making a free-throw. There are all sorts of objections that might be raised to the computation in Example 3.4.1. The first that comes to mind is that 25 is a relatively small number of trials on which to base the argument. Another serious objection might be to the whole idea that there is a fixed probability that number 20 makes a free-throw. Nevertheless, as a model of what number 20 might do on his next and subsequent free-throws, this number might have some value and allow us to make some useful predictions. We have seen of course that this method of assigning probabilities can lead us to incorrect (and sometimes really bad) probability values. Even in 100 tosses of a coin, it is quite possible that we would find 60 heads and so think that the probability of a head was 0.6 rather than 1/2. (In Section 4.3 we will actually examine closely the question of just how close to the “true” value we are likely to be given a certain number n of coin tosses.) But in situations where we have a lot of past data and very little of a theoretical model to help us compute otherwise, this might be a reasonable strategy. This way of assigning probabilities is an important tool in the insurance industry. Example 3.4.2. Suppose that an insurance company wants to sell a 5-year term life insurance policy in the amount of $100,000 to a 55-year-old male. Such a policy pays $100,000 to the beneficiary of the policy holder only if he dies within five years. Obviously, the insurance company would like to know the probability that the insured dies within five years.
The key tool in computing such a probability is a mortality table such as the one in Figure 3.1. (The full table is available at http://www.cdc.gov/nchs/data/nvsr/nvsr54/nvsr54_14.pdf.) Using data from a variety of sources (including the US Census Bureau and the Center for Medicare and Medicaid), the Division of Vital Services makes a very accurate count of the number of people that die in the United States each year. For our problem, we note that the table indicates that of every 88,846 men alive at the age of 55, only 84,725 of them are alive at the age of 60. This means that our insurance company has a probability of (88846 − 84725)/88846 = 0.046 of paying out on this policy. If the company writes many such policies, it appears that it would average about $4,600 per policy in payouts. This is the most important number in trying to decide how much the company should charge for such a policy. Figure 3.1.: Portion of life table prepared by Division of Vital Services of U.S. Department of Health and Human Services. For the purpose of investigating how random processes work, it is very useful to use R to perform simulations. We have already seen how to simulate a random process in which the outcomes are equally likely. The next example simulates a process in which the probabilities are determined empirically. Example 3.4.3. In the 2007 baseball season, Manny Ramirez came to the plate 569 times. Of those 569 times, he had 89 singles, 33 doubles, 1 triple, 20 homeruns, 78 walks (and hit by pitch), and 348 outs. We can use the frequency of these events to estimate the probabilities of each sort of event that might happen when Ramirez comes to the plate. For example, we might estimate the probability that Ramirez will hit a homerun in his next plate appearance to be 20/569 = .035. In the following R session we simulate one, and then five, of Manny Ramirez’s plate appearances.
> outcomes=c('Out','Single','Double','Triple','Homerun','Walk')
> ramirez=c(348,89,33,1,20,78)/569
> sum(ramirez)
[1] 1
> ramirez
[1] 0.611599297 0.156414763 0.057996485 0.001757469 0.035149385 0.137082601
> sample(outcomes,1,prob=ramirez)
[1] "Double"
> sample(outcomes,5,prob=ramirez,replace=T)
[1] "Out" "Double" "Out" "Out" "Walk"
3.5. Independence It is often the case that two events associated with a random process are related in some way, so that if we knew that one of them was going to happen we would change our estimate of the likelihood that the other would happen. The following example illustrates this. Example 3.5.1. At the end of each semester, students in many college courses are given the opportunity to rate the course. At Calvin, the first two questions that students are asked are: The course as a whole was: (Excellent, Very Good, Good, Fair, Poor, Very Poor) The course content was: (Excellent, Very Good, Good, Fair, Poor, Very Poor) Empirical evidence suggests that the probability that a student answers Excellent on the first question is 0.24 and the probability that a student answers Excellent to the second question is 0.22. (What is the random process here? Am I suggesting that students answer these questions at random?) Suppose that we happen to see that a student has answered Excellent to the first question. We would certainly not continue to suppose that the probability that this student has answered Excellent to the second question is just 0.22. We would guess that the student’s answers are not independent one of another. In fact, 75% of the students who answer Excellent to question 1 also answer Excellent to question 2. Definition 3.5.2 (conditional probability). Given two events E and F such that P(F) ≠ 0, the conditional probability of E given F, written P(E | F), is given by P(E | F) = P(E ∩ F)/P(F). It is easiest to interpret the formula for P(E | F) using the relative frequency interpretation.
The denominator in the fraction in the definition is the proportion of times that the event F happens in a large number of trials of the random process. The numerator, P(E ∩ F), is the proportion of times that both events happen. So the fraction is the proportion of times E happens among those times that F happens, which is precisely what we want conditional probability to measure. In the definition of conditional probability, it is best to think of F as being a fixed event and that E is allowed to be any event in the sample space. Thus P(E | F) is a function of E. As a function of E, we can see that the new probabilities satisfy the axioms of probability theory. Proposition 3.5.3. Suppose that F is a fixed event of some process with sample space S and such that P(F) > 0. Then 1. for every E, P(E | F) ≥ 0, 2. P(S | F) = 1, 3. for disjoint events E1 and E2, P(E1 ∪ E2 | F) = P(E1 | F) + P(E2 | F). In applications, it is often the case that we know P(F) and P(E | F). Using the definition of conditional probability, we can then compute P(E ∩ F) using the Multiplication Law of Probability: If E and F are events with P(F) ≠ 0, then P(E ∩ F) = P(F) P(E | F). Example 3.5.4. Suppose that we choose two students from a class of 20 without replacement. If there are 12 female students in the class, the probability of the first chosen student being female is 12/20 = .6. Having chosen a female, the probability that the second chosen student is also female is 11/19 since there are 11 remaining females among the 19 remaining students. So the probability of choosing two females in succession is (12/20)(11/19) = .347. We can extend the analysis in Example 3.5.4 to compute the probabilities of all possible combinations of E and F occurring or not. It is useful to view this situation as a tree.
The tree branches first on E (with probability P(E)) versus E′ (with probability P(E′)), and then, from each of these, on F versus F′ with the appropriate conditional probabilities. Multiplying the probabilities along a path gives the probability of the intersection at its end:
E then F, with probability P(E) P(F | E): the event E ∩ F
E then F′, with probability P(E) P(F′ | E): the event E ∩ F′
E′ then F, with probability P(E′) P(F | E′): the event E′ ∩ F
E′ then F′, with probability P(E′) P(F′ | E′): the event E′ ∩ F′
It is clear when one thinks about it that, in general, P(E | F) ≠ P(F | E). Indeed, simply knowing P(E | F) does not necessarily give us any information about P(F | E). As a simple example, note that the probability that a primary voter votes for Hillary Clinton given that she votes in the Democratic primary is certainly not equal to the probability that she votes in the Democratic primary given that she votes for Hillary Clinton (the latter probability is 1!). In the next example, we look at an important situation in which we desire to know P(F | E) but we only know conditional probabilities of the form P(E | F). Example 3.5.5. Most laboratory tests for diseases aren’t infallible. The important question from the point of view of the patient is what inference to make about the disease status given the outcome of the test. Namely, if the test is positive, how likely is it that the patient has the disease? The sensitivity of a test is the probability that it will give a positive result given that the patient has the disease. The specificity of a test is the probability that it will give a negative result given that the patient does not have the disease. A widely used rapid test for the HIV virus has sensitivity 99.9% and specificity 99.8%. Since the test appears to be very accurate and it is now quite inexpensive, one might suppose that doctors should give this test as a routine matter to allow for early detection of the virus. In this situation, we are interested in four possible events: D+ the patient has the disease; D− the patient does not have the disease; T+ the test is positive; T− the test is negative. The sensitivity and specificity then give P(T+ | D+) = .999 and P(T− | D−) = .998. (Note that this means that P(T+ | D−) = 0.002 and P(T− | D+) = 0.001.) Suppose now that a patient tests positive.
What is the probability that this patient has the disease? It is clear that this is the question of computing P(D+ | T+). We have P(D+ | T+) = P(D+ ∩ T+)/P(T+). Using the Multiplication Law, we have P(D+ ∩ T+) = P(T+ | D+) P(D+), and also we have P(T+) = P(T+ ∩ D+) + P(T+ ∩ D−). One more piece of information is needed to compute P(D+ | T+) and that is P(D+), the prevalence of the disease in the tested population. Of course this depends on the population that is tested. It is estimated that about 0.01% of all persons in the U.S. have the disease. So if we adopt a policy of testing everyone without regard to other factors, we might estimate P(D+) = 0.0001. We can now compute P(D+ | T+). The probability tree is as follows: the first branch is D+ (probability 0.0001) versus D− (probability 0.9999), and the second branch is T+ versus T−, with conditional probabilities given by the sensitivity and specificity. Multiplying along each path:
D+ then T+: (0.0001)(0.999) = 9.99 × 10^−5
D+ then T−: (0.0001)(0.001) = 10^−7
D− then T+: (0.9999)(0.002) = 0.0020
D− then T−: (0.9999)(0.998) = 0.9979
Using the probabilities computed from the tree, we have P(D+ | T+) = P(D+ ∩ T+)/P(T+) = P(T+ | D+) P(D+) / (P(T+ ∩ D+) + P(T+ ∩ D−)) = (9.99 × 10^−5) / (9.99 × 10^−5 + 0.0020) = 0.047. Thus, even though the test is very accurate, 95% of the time the positive result will be for someone who does not have the disease! This is one reason that universal testing for rare diseases often does not make economic sense. This method of “reversing” the conditional probabilities is so important that it has a name: Bayes’ Theorem. Independence If P(E | F) = P(E), knowing that the event F occurs does not give us any more information as to whether E will occur. Such events E and F are called independent. The multiplication law simplifies in this case and leads to the following definition. Definition 3.5.6 (independent). Events E and F are independent if P(E ∩ F) = P(E) P(F). Notice that we do not assume that P(F) ≠ 0 in this definition.
It is easy to see that if P(F) = 0, the equality in the definition is always true, so that we would consider E and F to be independent in this special case. Example 3.5.7. Suppose that a free-throw shooter makes 70% of her free-throws. What is the probability that she makes both of her free-throws when she is fouled in the act of shooting? It might be reasonable to suppose that the results of the two free-throws are independent of each other. Then the probability of making two successive free-throws is (0.7)(0.7) = 0.49. Similarly, the probability that she misses both free throws is only (0.3)(0.3) = 0.09. 3.6. Exercises 3.1 For each of the following random processes, write a complete list of all outcomes in the sample space. a) A nickel and a dime are tossed and the resulting faces observed. b) Five cards numbered 1–5 are put in a hat and two different cards are drawn. (For some reason, lots of probability problems are about cards in hats.) c) A voter in the Michigan 2008 Primary elections is chosen at random and asked for whom she voted. (See problem A.2.) 3.2 Two six-sided dice are tossed. a) List all the outcomes in the sample space (you should find 36) using some appropriate notation. b) Let F be the event that the sum of the dice is 7. List the elements of F. c) Let E be the event that the sum of the dice is odd. List the elements of the event E. 3.3 If a Calvin College student is chosen at random and his/her height is recorded, what is a reasonable listing of the possible outcomes? Explain the choices that you have to make in determining what the outcomes are. 3.4 Weathermen in Grand Rapids are fond of saying things like “The probability of snow tomorrow is 70%.” What do you think this statement really means? Can you give a frequentist interpretation of this statement? A subjectivist interpretation? 3.5 In Example 3.2.3 we considered the random experiment of tossing four coins.
In this problem, we consider the problem of tossing five coins. a) How many equally likely outcomes are there? b) For each x = 0, 1, 2, 3, 4, 5, compute the probability that exactly x many heads occur in the toss of five coins. 3.6 A 20-sided die (with sides numbered 1–20) is used in some games. Obviously the die is constructed in such a way that the sides are intended to be equally likely to occur when the die is rolled. (The die is in fact an icosahedron.) Using R, simulate 1,000 rolls of such a die. How many of each number did you expect to see? Include a table of the actual number of times each of the 20 numbers occurred. Is there anything that surprises you in the result? 3.7 A poker hand consists of 5 cards. What is the probability of being dealt a poker hand of 5 hearts? (Remember that there are 13 hearts in the deck of 52 cards.) 3.8 In Example 3.2.11 we considered choosing a random sample of 5 students from a class of 20 students of whom 12 were female. a) What is the probability that such a random sample will contain 5 males? b) What is the probability that such a random sample will contain 3 females and 2 males? 3.9 Many games use spinners rather than dice to initiate action. A classic board game published by Cadaco-Ellis is “All-American Baseball.” The game contains discs for each of several baseball players. The disc for Nellie Fox (the great Chicago White Sox second baseman) is pictured below. The disc is placed over a peg with a spinner mounted in the center of the circle. The spinner is spun and comes to rest pointing to one of the numbered areas. Each number corresponds to a possible result of Nellie Fox batting. (For example, 1 is a homerun and 14 is a flyout.) a) Why is it unreasonable to believe that all the numbered outcomes are equally likely?
b) Explain how one could use the idea of equal likelihood to predict the probability that the spinner will land on the sector numbered 14 and then make an estimate of this probability. (Spinners with regions of unequal size are used heavily in the K–8 textbook series Everyday Mathematics to introduce probability to younger children.) 3.10 The traditional dartboard is pictured below. A dart that sticks in the board is scored as follows. There are 20 numbered sectors, each of which has a small outer ring, a small inner ring, and two larger areas. A dart landing in the larger areas scores the number of the sector, in the outer ring scores double the number of the sector, and in the inner ring scores triple the number of the sector. The two circles near the center score 25 points (the outer one) and 50 points (the inner one). Unlike the last problem, it does not seem that an equal likelihood model could be used to compute the probability of a “triple 20.” Explain why not. 3.11 Suppose that E and F are events and that P(E), P(F), and P(E ∩ F) are given. Find formulas (in terms of these known probabilities) for the probabilities of the following events: a) exactly one of E or F happens, b) neither E nor F happens, c) at least one of E or F happens, d) E happens but F does not. 3.12 Suppose that E, F, and G are events. Show that P(E ∪ F ∪ G) = P(E) + P(F) + P(G) − P(E ∩ F) − P(E ∩ G) − P(F ∩ G) + P(E ∩ F ∩ G). 3.13 Use the axioms to prove that for all events E and F, if E ⊆ F then P(E) ≤ P(F). 3.14 Show that for all events E and F that P(E ∩ F) ≤ min{P(E), P(F)}. 3.15 In 2006, there were 42,642 deaths in vehicular accidents in the United States. 17,602 of the victims had a positive blood alcohol content (BAC). In 15,121 of these, the BAC of the victim was greater than 0.08 (which is the legal limit for DUI in many states). What is a good estimate for the probability that a victim of a vehicular accident had BAC exceeding 0.08?
The statistics in this problem can be found at the Fatality Analysis Reporting System, http://www-fars.nhtsa.dot.gov/Main/index.aspx. (A probability that we would really like to know is the probability that a driver with a BAC of greater than 0.08 becomes a fatality in an accident. Unfortunately, that’s a much harder number to obtain.) 3.16 We have used tossing coins as our favorite example of a process with two equally likely outcomes. Consider instead the process where the coin is stood on end on a hard surface and spun. a) If a dime is used, do you think a head and a tail are equally likely to occur? b) Do the experiment 10 times and record the results. c) On the basis of your data, is it possible that heads and tails are equally likely? d) Using the data alone, estimate the probability that a spun dime comes up heads. 3.17 In Example 3.4.2, we determined that the probability that a 55-year-old male dies before his 60th birthday is 0.046. a) If the company sells this 5-year, $100,000 policy to 100 different men, how many of these policies would you expect they would have to pay the death benefit on? b) Simulate this situation. Namely, use this empirical probability to simulate the 100 policies. How many policies did the company have to pay off on in your simulation? Are you surprised by this result? 3.18 Show that if E ⊆ F then P(F | E) = 1. 3.19 Construct an example to show that it is not necessarily true that P(E | F) = 1 − P(E | F′). 3.20 Show that if E and F are independent, then so are E′ and F′. 3.21 Suppose that two different bags of blue and red marbles are presented to you and you are told that one bag (bag A) has 75% blue marbles and the other bag (bag B) has 90% red marbles. Suppose that you choose a bag at random. Now suppose that you choose a single marble from the bag at random and it is red. What is the probability that you have in fact chosen bag A?
3.22 Over the course of a season, a certain basketball player shot two free-throws on 36 occasions. On 18 of those occasions, she made both of the free-throws and on 9 of the occasions she missed both (and so on 9 occasions she made one and missed one). Does this data appear to be consistent with the hypothesis that she has a constant probability of making a free-throw and that the result of the second throw of a pair is independent of the first? 3.23 In Example 3.5.5 we studied the effectiveness of universal HIV testing and determined that 95% of the time positive test results are wrong even though the test itself has a very high sensitivity. Now suppose that HIV testing is restricted to a high-risk population, one in which the prevalence of the disease is 25%. What is the probability that a positive test result is wrong in this case? 4. Random Variables 4.1. Basic Concepts If the outcomes of a random process are numbers, we will call the random process a random variable. Since non-numerical outcomes can always be coded with numbers, restricting our attention to random variables results in no loss of generality. We will use upper-case letters to name random variables (X, Y, etc.) and the corresponding lower-case letters (x, y, etc.) to denote the possible values of the random variable. Then we can describe events by equalities and inequalities so that we can write such things as P(X = 3), P(Y = y) and P(Z ≤ z). Some examples of random variables include 1. Choose a random sample of size 12 from 250 boxes of Raisin Bran. Let X be the random variable that counts the number of underweight boxes and let Y be the random variable that is the average weight of the 12 boxes. 2. Choose a Calvin senior at random. Let Z be the GPA of that student and let U be the composite ACT score of that student. 3. Assign 12 chicks at random to two groups of six and feed each group a different feed.
Let D be the difference in average weight between the two groups. 4. Throw a fair die until all six numbers have appeared. Let T be the number of throws necessary. We will consider two types of random variables, discrete and continuous. Definition 4.1.1 (discrete random variable). A random variable X is discrete if its possible values can be listed x1, x2, x3, . . . . In the examples above, the random variables X, U, and T are discrete random variables. Note that the possible values for X are 0, 1, . . . , 12 but that T has infinitely many possible values 1, 2, 3, . . . . The random variables Y, Z, and D above are not discrete. The random variable Z (GPA) for example can take on all values between 0.00 and 4.00. (We should make the following caveat here however. All variables are discrete in the sense that there are only finitely many different measurements possible to us. Each measurement device that we use has divisions only down to a certain tolerance. Nevertheless it is usually more helpful to view these measurements as on a continuous scale rather than a discrete one. We learned that in calculus.) The following definition is not quite right — it omits some technicalities. But it is close enough for our purposes. Definition 4.1.2 (continuous random variable). A random variable X is continuous if its possible values are all x in some interval of real numbers. We will turn our attention first to discrete random variables. 4.2. Discrete Random Variables If X is a discrete random variable, we will be able to compute the probability of any event defined in terms of X if we know all the possible values of X and the probability P(X = x) for each such value x. Definition 4.2.1 (probability mass function). The probability mass function (pmf) of a random variable X is the function f such that for all x, f(x) = P(X = x).
We will sometimes write fX to denote the probability mass function of X when we want to make it clear which random variable is in question. The word mass is not arbitrary. It is convenient to think of probability as a unit mass that is divided into point masses at each possible outcome. The mass of each point is its probability. Note that mass obeys the Kolmogorov axioms. Example 4.2.2. Two dice are thrown and the sum X of the numbers appearing on their faces is recorded. X is a random variable with possible values 2, 3, . . . , 12. By using the method of equally likely outcomes, we can see that the pmf f of X is given by the following table:
x:    2     3     4     5     6     7     8     9     10    11    12
f(x): 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
We can now compute such probabilities as P(X ≤ 5) = 5/18 by adding the appropriate values of f. Example 4.2.3. We can think of a categorical variable as a discrete random variable by coding. Suppose that a student is chosen at random from the Calvin student body. We will code the class of the student by 1, 2, 3, 4 for the four standard classes and 5 for other. The coded class is a random variable X. Referring to Table 2.1, we see that the probability mass function of X is given by f(1) = 0.27, f(2) = 0.24, f(3) = 0.21, f(4) = 0.25, f(5) = 0.03, and f(x) = 0 otherwise. Figure 4.1.: The probability histogram for the Calvin class random variable. One useful way of picturing a probability mass function is by a probability histogram. For the mass function in Example 4.2.3, we have the corresponding histogram in Figure 4.1. On the frequentist interpretation of probability, if we repeat the random process many times, the histogram of the results of those trials should approximate the probability histogram. The probability histogram is not a histogram of data from many trials, however. It is a representation of what might happen in the next trial.
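The pmf of Example 4.2.2 can be computed in R directly from the 36 equally likely outcomes, and a quick simulation illustrates the claim just made: the histogram of many trials approximates the probability histogram.

```r
# pmf of the sum of two fair dice, from the 36 equally likely outcomes
sums <- outer(1:6, 1:6, "+")     # all (first die, second die) sums
table(sums) / 36                 # 1/36, 2/36, ..., 6/36, ..., 1/36

# Relative frequencies from many simulated throws approximate this pmf
rolls <- sample(1:6, 10000, replace = TRUE) + sample(1:6, 10000, replace = TRUE)
table(rolls) / 10000
```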
We will often use this idea to work in reverse. In other words, given a histogram of data obtained from successive trials of a random process, we will choose the pmf to fit the data. Of course we might not ask for a perfect fit; instead we will choose the pmf f to fit the data approximately but so that f has some simple form. Several families of discrete random variables are particularly important to us and provide models for many real-world situations. We examine two such families here. Each arises from a common kind of random process that will be important for statistical inference. The second of these arises from the very important case of simple random sampling from a population. We will first study a somewhat different case (which, among other uses, can be used to study sampling with replacement). 4.2.1. The Binomial Distribution A binomial process is a random process characterized by the following conditions: 1. The process consists of a sequence of finitely many (n) trials of some simpler random process. 2. Each trial results in one of two possible outcomes, usually called success (S) and failure (F). 3. The probability of success on each trial is a constant denoted by π. 4. The trials are independent one from another. Thus a binomial process is characterized by two parameters, n and π. Given a binomial process, the natural random variable to observe is the number of successes. Definition 4.2.4 (binomial random variable). Given a binomial process, the binomial random variable X associated with this process is defined by: X is the number of successes in the n trials of the process. If X is a binomial random variable with parameters n and π, we write X ∼ Binom(n, π). The symbol ∼ can be read as “has the distribution” or something to that effect. The use of the word distribution here is not inconsistent with our earlier use.
Here to specify a distribution is to specify the possible values of the random variable and the probability that the random variable attains any particular value.

Example 4.2.5. The following are all natural examples of binomial random variables.

1. A fair coin is tossed n = 10 times with the probability of a HEAD (success) being π = .5. X is the number of heads.
2. A basketball player shoots n = 25 free-throws with the probability of making each free-throw being π = .70. Y is the number of made free-throws.
3. A quality control inspector tests the next n = 12 widgets off the assembly line, each of which has a probability of 0.10 of being defective. Z is the number of defective widgets.
4. Ten Calvin students are randomly sampled with replacement. W is the number of males in the sample.

The fact that the trials are independent of one another makes it possible to compute the pmf of any binomial random variable easily using the multiplication principle. We first give a simple example.

Example 4.2.6. An unaccountably popular dice game is known as Bunko. Three dice are rolled and the number of sixes rolled is the important value. Let X be the random variable that counts the number of sixes in three dice. Then X ∼ Binom(3, 1/6). We can now compute the probability mass function of X (which can take the values 0, 1, 2, 3). We simply need to keep track of all possible sequences of three successes and failures and find the probability of each such sequence.

f(3) = P(X = 3) = (1/6)(1/6)(1/6) = 1/216
f(2) = P(X = 2) = (1/6)(1/6)(5/6) + (1/6)(5/6)(1/6) + (5/6)(1/6)(1/6) = 15/216
f(1) = P(X = 1) = (1/6)(5/6)(5/6) + (5/6)(1/6)(5/6) + (5/6)(5/6)(1/6) = 75/216
f(0) = P(X = 0) = (5/6)(5/6)(5/6) = 125/216

The computation for f(2), for example, has three terms, one for each of SSF, SFS, FSS. The important probability fact for Bunko players is P(X ≥ 1) = 91/216.

We can easily generalize the previous example to any n and π to get the following theorem.
Theorem 4.2.7 (The Binomial Distribution). Suppose that X is a binomial random variable with parameters n and π. The pmf of X is given by

fX(x; n, π) = (n choose x) π^x (1 − π)^(n−x) = [n!/(x!(n − x)!)] π^x (1 − π)^(n−x),   x = 0, 1, 2, . . . , n .

Proof. Suppose that n and π are given and that 0 ≤ x ≤ n. Consider all sequences of n trials that have exactly x successes and n − x failures. There are (n choose x) of these, since all we have to decide is how to “choose” the x places in the sequence for the successes. Now consider any one such sequence, say the sequence S. . . SF. . . F, the sequence in which x successes are followed by n − x failures. The probability of this sequence (and of any sequence with x successes) is π^x (1 − π)^(n−x) by the multiplication principle, relying on the independence of the trials. The result follows.

Note the use of the semicolon in the definition of fX in the theorem. We will use a semicolon to separate the possible values of the random variable (x) from the parameters (n, π). For any particular binomial experiment, n and π are fixed. If n and π are understood, we might write fX(x) for fX(x; n, π).

For all but very small n, computing f by hand is tedious. We will use R to do this. Besides computing the mass function, R can be used to compute the cumulative distribution function FX, the useful function defined in the next definition.

Definition 4.2.8 (cumulative distribution function). If X is any random variable, the cumulative distribution function of X (cdf) is the function FX given by

FX(x) = P(X ≤ x) = Σ_{y ≤ x} fX(y) .

We will usually use the convention that the pmf of X is named by a lower-case letter (usually fX) and the cdf by the corresponding upper-case letter (usually FX). The R functions to compute the cdf and pmf, and also to simulate binomial processes, are as follows if X ∼ Binom(n, π).
function (& parameters)   explanation
rbinom(n,size,prob)       makes n random draws of the random variable X and returns them in a vector.
dbinom(x,size,prob)       returns P(X = x) (the pmf).
pbinom(q,size,prob)       returns P(X ≤ q) (the cdf).

Example 4.2.9. Suppose that a manufacturing process produces defective parts with probability π = .1. If we take a random sample of size 10 and count the number of defectives X, we might assume that X ∼ Binom(10, 0.1). Some examples of R related to this situation are as follows.

> defectives=rbinom(n=30, size=10,prob=0.1)
> defectives
 [1] 2 0 2 0 0 0 0 2 0 1 1 1 0 0 2 2 3 1 1 2 1 1 0 2 0 1 1 0 1 1
> table(defectives)
defectives
 0  1  2  3
11 11  7  1
> dbinom(c(0:4),size=10,prob=0.1)
[1] 0.34867844 0.38742049 0.19371024 0.05739563 0.01116026
> dbinom(c(0:4),size=10,prob=0.1)*30    # pretty close to table
[1] 10.4603532 11.6226147  5.8113073  1.7218688  0.3348078
> pbinom(c(0:5),size=10,prob=0.1)       # same as cumsum(dbinom(...))
[1] 0.3486784 0.7360989 0.9298092 0.9872048 0.9983651 0.9998531

It is important to note that

• R uses size for the number of trials (what we have called n) and n for the number of random draws.
• pbinom() gives the cdf, not the pmf. Reasons for this naming convention will become clearer later.
• There are similar functions in R for many of the distributions we will encounter, and they all follow a similar naming scheme. We simply replace binom with the R-name for a different distribution.

4.2.2. The Hypergeometric Distribution

The hypergeometric distribution arises from considering the situation of random sampling from a population in which there are just two types of individuals. (That is, there is a categorical variable defined on the population with just two levels.) It is traditional to describe the distribution in terms of the urn model. Suppose that we have an urn with two different colors of balls. There are m white balls and n black balls.
Suppose we choose k balls from the urn in such a way that every set of k balls is equally likely to be chosen (i.e., a random sample of balls) and count the number X of white balls. We say that X has the hypergeometric distribution with parameters m, n, and k and write X ∼ Hyper(m, n, k). A simple example shows how we can compute probabilities in this case.

Example 4.2.10. Suppose the urn has 2 white and 3 black balls and that we choose 2 balls at random without replacement. If X is the number of white balls, we have X ∼ Hyper(2, 3, 2). Notice that in this case there are 10 different possible choices of two balls. If we label the balls W1, W2, B1, B2, B3, we have the following:

2 whites: (W1,W2)
1 white:  (W1,B1), (W1,B2), (W1,B3), (W2,B1), (W2,B2), (W2,B3)
0 whites: (B1,B2), (B1,B3), (B2,B3)

Since the 10 different pairs are equally likely, we have P(X = 0) = 3/10, P(X = 1) = 6/10, and P(X = 2) = 1/10.

The systematic counting of the example can easily be extended to compute the pmf of any hypergeometric random variable.

Theorem 4.2.11. Suppose that X ∼ Hyper(m, n, k). Then the pmf f of X is given by

f(x; m, n, k) = (m choose x)(n choose k−x) / (m+n choose k),   max(0, k − n) ≤ x ≤ min(k, m) .

Proof. The denominator counts the number of samples of size k from m + n balls. The two terms in the numerator count the number of ways of choosing x white balls from m and k − x black balls from n. Multiplying the two terms together counts the number of ways of choosing x white balls and k − x black balls.

R knows the hypergeometric distribution, and the syntax is exactly the same as for the binomial distribution (except that the names of the parameters have changed).

function (& parameters)   explanation
rhyper(nn,m,n,k)          makes nn random draws of the random variable X and returns them in a vector.
dhyper(x,m,n,k)           returns P(X = x) (the pmf).
phyper(q,m,n,k)           returns P(X ≤ q) (the cdf).

Example 4.2.12. Suppose that a statistics class has 29 students, 25 of whom are male.
Let’s call the females the white balls and the males the black balls. Suppose that we choose 5 of these students at random and without replacement, i.e., a random sample of size 5. Let X be the number of females in our sample. Then X ∼ Hyper(4, 25, 5). Some interesting questions related to this random variable are answered by the R output below.

> dhyper(x=c(0:5),m=4,n=25,k=5)
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000
> dhyper(x=c(0:5),k=5,m=4,n=25)    # order of named arguments does not matter
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000
> phyper(q=c(0:5),m=4,n=25,k=5)
[1] 0.4473917 0.8734790 0.9896846 0.9997895 1.0000000 1.0000000
> rhyper(nn=30,m=4,n=25,k=5)       # note nn for number of random outcomes
 [1] 2 1 1 1 1 2 2 2 1 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 1 1 2 0 0 0
> dhyper(0:5,4,25,5)               # default order of unnamed arguments
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000

4.3. An Introduction to Inference

There are many situations in which the binomial distribution seems to be the right model for a process but for which π is unknown. The next example gives several quite natural cases of this.

Example 4.3.1.

1. Microprocessor chips are being produced by an assembly line. There is a possibility that any particular chip produced is defective. It might be reasonable under some circumstances to assume that the probability that any particular chip is defective is a constant π. Then in a sample of 10 chips, it might be plausible to assume that the number of defective chips X behaves like a binomial random variable with n = 10 and π fixed but unknown.

Figure 4.2.: Zener cards for ESP testing.

2. Perhaps it is reasonable to assume that a free-throw shooter in a basketball game has a constant probability π of making a free-throw and that successive attempts are independent of one another.
Then in a series of n free-throws, the number of successful free-throws might behave as a binomial random variable with n known and π unknown.

3. In a standard test for ESP, a card with one of five printed symbols is selected without the person claiming to have ESP being able to see it. As the experimenter “concentrates” on the symbol printed on the card, the subject is supposed to announce which symbol is on the card. (These cards are called Zener cards and are pictured in Figure 4.2.) While we think that the probability that a subject can identify any card is 1/5, the person with ESP might claim that the probability is higher. If we allow n trials of this experiment, it is plausible to assume that the number of successful trials X is a binomial random variable with π unknown.

In situations like those in the example, we often want to test a hypothesis about π. For example, in the case of the person supposed to have ESP, we would like to test our hypothesis that π = .2.

Let us look more closely at the ESP situation. What would it take for us to believe that the subject in fact has a probability greater than 0.2 of correctly identifying the hidden card? Clearly, we would want to have several trials and a rate of success that we think would not be likely by luck (or “chance”) alone. A standard test is to use 25 trials. (In a standard deck, there are 25 cards with five each of the five symbols. Rather than going through the deck once, however, we will think of the experiment as shuffling the deck after each trial. Then it is clear that each of the five types of cards is equally likely to occur as the top card.) The following R output is relevant to our test.

> x=c(5:15)
> pbinom(x,25,.2)
 [1] 0.6166894 0.7800353 0.8908772 0.9532258 0.9826681 0.9944451 0.9984599
 [8] 0.9996310 0.9999237 0.9999864 0.9999979

Even if our subject is just guessing, he will get more than five cards right about 40% of the time.
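The “about 40%” figure can be computed directly from the cdf (a quick check using pbinom() as above):

```r
# P(X > 5) when the subject is guessing, i.e. X ~ Binom(25, 0.2).
1 - pbinom(5, size = 25, prob = 0.2)   # about 0.38
```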
It is certainly possible that he is just guessing if he gets just 6 out of 25 correct. On the other hand, it is virtually certain that he will not get more than 12 right if he is just guessing. While we would not have to believe the ESP explanation for 13 out of 25 successes, it would be difficult to continue asserting that π = 0.2 in this case. Of course there is a grey area. Suppose that our subject gets 10 cards right. The probability that our subject will get at least 10 cards correct by guessing alone is less than 2%. Is this sufficiently surprising to rule out guessing as an explanation? We might not rule out guessing, but we would very likely test this subject further.

The procedure described above for testing the ESP hypothesis is a special case of a general (class of) procedures known as hypothesis tests. Any hypothesis test follows the same outline.

Step 1: Identify the hypotheses

A statistical hypothesis test starts, oddly enough, with a hypothesis. A hypothesis is a statement proposing a possible state of affairs with respect to a probability distribution governing an experiment that we are about to perform. There are a variety of kinds of hypotheses that we might want to test.

1. A hypothesis stating a fixed value of a parameter: π = .5.
2. A hypothesis stating a range of values of a parameter: π ≤ .3.
3. A hypothesis about the nature of the distribution itself: X has a binomial distribution.

In the ESP example, the hypothesis that we wished to test was π = .2. Notice that we did not propose to test the hypothesis that a binomial distribution was the correct explanation of the data. We assumed that the binomial distribution is a plausible model of our data collection procedure. It will often be the case that we make distributional hypotheses without thinking about testing them. (Sometimes that will be a big mistake.)
In the standard way of describing hypothesis tests, there are actually two hypotheses that we view as being pitted against each other. For example, the two hypotheses in the ESP case were π = 0.2 (the subject does not have ESP) and π > 0.2 (the subject does have ESP or some other mechanism of doing better than guessing). The two hypotheses have standard names.

1. Null Hypothesis. The null hypothesis, usually denoted H0, is generally a hypothesis that the data analysis is intended to investigate. It is usually thought of as the “default” or “status quo” hypothesis that we will accept unless the data give us substantial evidence against it. The null hypothesis is often a hypothesis that we want to “prove” false.

2. Alternate Hypothesis. The alternate hypothesis, usually denoted H1 or Ha, is the hypothesis that we want to put forward as true if we have sufficient evidence against the null hypothesis.

Thus we present our hypotheses in the ESP experiment as

H0: π = 0.2
Ha: π > 0.2 .

For the ESP experiment, π = 0.2 is the null hypothesis since it is clearly our starting point and it is the hypothesis that we wish to retain unless we have convincing evidence otherwise.

Step 2: Collect data and compute a test statistic

Earlier, we defined a statistic as a number computed from the data. In the ESP example, our statistic is simply the result of the binomial random variable. Since we are using the statistic to test a hypothesis, we often call it a test statistic. In our previous definition, a statistic is a number, i.e., an actual value computed from the data. We are now going to introduce some ambiguity and refer also to the random variable as a statistic. So in this case, the random variable X that counts the number of correct cards in 25 trials is our test statistic, but we will also refer to the value x of that random variable as a test statistic.
Of course the difference is whether we are referring to the experiment before or after we collect the data. Note that we can only make probability statements about X. The value of the statistic x is just a number which we have computed from the result of the random process. Amidst this confusion, the central point is that if we think of the test statistic as a random variable, it has a distribution. This distribution is unknown (since we do not know π).

Step 3: Compute the p-value

Next we need to evaluate the evidence that our test statistic provides. To do this requires that we think about our statistic as a random variable. In the ESP testing example, X ∼ Binom(25, π). The distribution of the test statistic is called its sampling distribution since we think of it as arising from producing a sample of the process. Since our test statistic is a random variable, we can ask probability questions about it. The key question that we want to ask is this:

How unusual would the value of the test statistic that I obtained be if the null hypothesis were true?

We show how to answer the question if the result of the ESP experiment were 9 out of 25 correct cards (36%). Notice that if the null hypothesis is true, P(X ≥ 9) = 1 − pbinom(8,25,0.2) = 0.047. Therefore, if the null hypothesis is true, the probability that we would see a result at least as extreme (in the direction of the alternate hypothesis) as 9 is 0.047. This probability is called the p-value of the test statistic.

Definition 4.3.2 (p-value). The p-value of a test statistic t is the probability that a result at least as extreme as t (in the direction of the alternate hypothesis) would occur if the null hypothesis is true.

Notice that the p-value is a number that is associated with a particular outcome of the process. The p-value of 9 successes in our example is 0.047.
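As a sketch (not in the original text), the same computation can be repeated for every result in the grey area, showing how quickly the p-value shrinks as the count of correct cards grows:

```r
# One-sided p-values P(X >= x) under H0: pi = 0.2 for the 25-trial ESP test.
x <- 6:13
pval <- 1 - pbinom(x - 1, size = 25, prob = 0.2)
round(data.frame(x, pval), 4)   # the row x = 9 gives the 0.047 from the text
```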
Since the p-value is computed after the random process is performed, it is not a probability associated with this particular outcome of the random process. Rather, it is a probability that describes what might happen if the experiment is repeated indefinitely. Namely, if the null hypothesis is true, then about 5% of the time the subject would get 9 or more successes in his 25 trials.

Countless journal articles in the social, biological, and medical sciences report on the results of hypothesis tests. While there are many kinds of hypothesis tests and it might not always be clear what kind of test an article is reporting on, it is almost universally the case that the result of such a hypothesis test is reported using a p-value. It is quite common to see statements such as p < 0.001. This means that either the null hypothesis being tested is false or something exceedingly surprising happened.

Step 4: Draw a conclusion

Drawing a conclusion from a p-value is a judgment call, and it is a scientific rather than a mathematical decision. The p-value of a test statistic of 9 in the ESP experiment is 0.047. This means that if we were to test many people for ESP and they were all just guessing, about 5% of them would have a result at least as extreme as this. A test statistic of 9 provides some evidence that our subject is more successful than we would expect by chance alone, but certainly not definitive evidence. If we were really interested in this question, we would probably subject this person to more tests.

Sometimes the results of hypothesis tests are expressed in terms of decisions rather than p-values. This is often the case when we must take some action based on our data. We illustrate with a common example.

Example 4.3.3. Suppose that a company claims that the defective rate of their manufacturing process is 1%. A customer tests 100 parts in a large shipment and finds 4 of these parts defective.
Is the customer justified in rejecting the shipment? It is easy to think of this situation as a hypothesis test. The test statistic 4 is the result of a random variable X ∼ Binom(100, π). The null and alternate hypotheses are given by

H0: π = 0.01
Ha: π > 0.01 .

The p-value of this value of the test statistic is 1 − pbinom(3,100,.01) = 0.018. Therefore, if the manufacturer’s claim is correct, we should see 4 or more defectives only 1.8% of the time when we test 100 parts. The customer might be justified in rejecting the shipment and therefore rejecting the null hypothesis.

We will describe our possible decisions as to reject the null hypothesis or not to reject the null hypothesis. There are of course two different kinds of errors that we could make.

Definition 4.3.4 (Type I and Type II errors). A Type I error is the error of rejecting H0 even though it is true. A Type II error is the error of not rejecting H0 even though it is false.

Of course, if we reject the null hypothesis, we cannot know whether we have made a Type I error. Similarly, if we do not reject the null hypothesis, we cannot know whether we have made a Type II error. Whether we have committed such an error depends on the true value of π, which we cannot ever know simply from data.

To determine the likelihood of making such errors, we need to specify our decision rule. For example, in Example 4.3.3 we might decide that our rule is to reject the null hypothesis if we see 4 or more defective parts in our sample of 100. Then we know that if the null hypothesis is true, the probability that we will make a Type I error is 0.018. Of course we cannot know the probability of a Type II error without knowing the true value of π.

In the next example, we further consider the probabilities of the two kinds of errors in a particular situation. This example also illustrates another variation on the hypothesis test, as it considers a two-sided alternate hypothesis.

Example 4.3.5.
A coin-toss is going to be used to make a very important decision. Since the coin is a commemorative coin of non-standard design (like those used in the Super Bowl), it is very important to know whether it is fair. We decide to do a test by tossing the coin 100 times and observing the number of heads. Our test statistic x is the result of a random variable X ∼ Binom(100, π), and our competing hypotheses should be

H0: π = 0.5
Ha: π ≠ 0.5 .

The two-sided alternate hypothesis indicates that we want to reject the fairness hypothesis if the coin favors heads as well as if the coin favors tails. It is reasonable to reject the null hypothesis whenever the number of heads is too far from 50 in either direction. Let’s suppose that our decision rule is to reject the null hypothesis if the number of heads is at most 40 or at least 60 (i.e., X ≤ 40 or X ≥ 60). Then the probability of a Type I error is the probability of getting 40 or fewer or 60 or more heads when the true probability of heads is 0.5. This is given by

> pbinom(40,100,.5)+(1-pbinom(59,100,0.5))
[1] 0.05688793

The probability of a Type I error is 0.057. The probability of a Type II error for this decision rule can only be computed if we know the true value of π. Suppose that π = 0.48. Then the probability of not rejecting the null hypothesis is given by

> pbinom(59,100,.48)-pbinom(39,100,0.48)
[1] 0.9454557

In other words, we will almost always make a Type II error with this decision rule if π = 0.48. On the other hand, if π = 0.4, then the probability of a Type II error is 0.54, as given by

> pbinom(59,100,.4)-pbinom(39,100,0.4)
[1] 0.5378822

The above computation illustrates the central dilemma of hypothesis testing. If we want to make it very unlikely that we commit a Type I error, as we did with our decision rule here, it will be very difficult to detect that the null hypothesis is false.
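The error calculations above can be bundled into small helper functions so that several values of π can be explored at once. This is our own sketch (the names typeI and typeII are not from the text), following the computations shown in the R session above:

```r
# Error probabilities for the rule "reject H0 if X <= 40 or X >= 60",
# where X ~ Binom(100, pi), following the text's computations.
typeI  <- function(pi) pbinom(40, 100, pi) + (1 - pbinom(59, 100, pi))
typeII <- function(pi) pbinom(59, 100, pi) - pbinom(39, 100, pi)
typeI(0.5)             # 0.057: chance of falsely rejecting a fair coin
typeII(c(0.48, 0.4))   # 0.945 and 0.538: small biases are hard to detect
```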
For this decision rule, even 100 tosses of the coin do not give us much chance of discovering that the coin has a 10% bias! There is a trade-off between Type I and Type II errors. If we choose a decision rule that is less likely to make a Type I error when the null hypothesis is true, then it is more likely to make a Type II error when the null hypothesis is false.

What one should also notice in our treatment of hypothesis testing is the asymmetry between the two hypotheses. We are generally not willing to tolerate a large probability of a Type I error. This seems to lead to a rather large probability of a Type II error in the case that the null hypothesis is false. The asymmetry is intentional, however, as the null hypothesis usually has a preferred status as the “innocent until proven guilty” hypothesis.

4.4. Continuous Random Variables

Recall that a continuous random variable X is one that can take on all values in an interval of real numbers. For example, the height of a randomly chosen Calvin student in inches could be any real number between, say, 36 and 80. Of course all continuous random variables are idealizations. If we measure heights to the nearest quarter inch, there are only finitely many possibilities for this random variable and we could, in principle, treat it as discrete. We know from calculus, however, that treating measurements as continuous quantities often simplifies rather than complicates our techniques. In order to understand what kinds of probability statements we would like to make about continuous random variables, it is helpful to keep in mind this idea of the finite precision of our measurements.

Figure 4.3.: Discretized pmf for T.
For example, a statement that a randomly chosen individual is 72 inches tall is not a claim that the individual is exactly 72 inches tall but rather a claim that the height of the individual is in some small interval (maybe 71 3/4 to 72 1/4 if we are measuring to the nearest half inch). So probabilities of the form P(X = x) are not especially meaningful. Rather, the appropriate probability statements will be of the form P(a ≤ X ≤ b).

4.4.1. pdfs and cdfs

Recall the analogy of probability and mass. In the case of discrete random variables, we represented the probability P(X = x) by a point of mass P(X = x) at the point x, and the total mass was 1. In the continuous case, mass is spread out continuously, and the appropriate description of mass is a density function. The following example shows how this works.

Example 4.4.1. A Geiger counter emits a beep when a radioactive particle is detected. The rate of beeping determines how radioactive the source is. Suppose that we record the time T to the next beep. It turns out that T behaves like a random variable. Suppose that we measured T with increasing precision. We might get histograms that look like those in Figure 4.3 for the pmf of T. It’s pretty obvious that we want to replace these histograms by a smooth curve. In fact the pictures should remind us of the pictures drawn for the Riemann sums that define the integral.

The analogue of a probability mass function for a continuous variable is a probability density function.

Definition 4.4.2 (probability density function, continuous random variable). A probability density function (pdf) is a function f such that

• f(x) ≥ 0 for all real numbers x, and
• ∫_{−∞}^{∞} f(x) dx = 1.

The continuous random variable X defined by the pdf f satisfies

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

for any real numbers a ≤ b.

The following simple lemma demonstrates one way in which continuous random variables are very different from discrete random variables.

Lemma 4.4.3.
Let X be a continuous random variable with pdf f. Then for any a ∈ R,

1. P(X = a) = 0,
2. P(X < a) = P(X ≤ a), and
3. P(X > a) = P(X ≥ a).

Proof. P(X = a) = ∫_a^a f(x) dx = 0. And P(X ≤ a) = P(X < a) + P(X = a) = P(X < a).

Example 4.4.4.

Q. Consider the function f(x) = 3x² for x ∈ [0, 1] and f(x) = 0 otherwise. Show that f is a pdf and calculate P(X ≤ 1/2).

A. Let’s begin by looking at a plot of the pdf. The rectangular region of the plot has an area of 3, so it is plausible that the area under the graph of the pdf is 1. We can verify this by integration:

∫_{−∞}^{∞} f(x) dx = ∫_0^1 3x² dx = [x³]_0^1 = 1 ,

so f is a pdf, and P(X ≤ 1/2) = ∫_0^{1/2} 3x² dx = [x³]_0^{1/2} = 1/8.

The cdf of a continuous random variable is defined the same way as it was for a discrete random variable, but we use an integral rather than a sum to get the cdf from the pdf in this case.

Definition 4.4.5 (cumulative distribution function). Let X be a continuous random variable with pdf f. Then the cumulative distribution function (cdf) for X is

F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt .

Example 4.4.6.

Q. Determine the cdf of the random variable from Example 4.4.4.

A. For any x ∈ [0, 1],

FX(x) = P(X ≤ x) = ∫_0^x 3t² dt = [t³]_0^x = x³ .

So FX(x) = 0 for x ∈ (−∞, 0), FX(x) = x³ for x ∈ [0, 1], and FX(x) = 1 for x ∈ (1, ∞).

Notice that the cdf FX is an antiderivative of the pdf fX. This follows immediately from the Fundamental Theorem of Calculus. Notice also that P(a ≤ X ≤ b) = F(b) − F(a).

Lemma 4.4.7. Let FX be the cdf of a continuous random variable X. Then the pdf fX satisfies

fX(x) = (d/dx) FX(x) .

Just as the binomial and hypergeometric distributions were important families of discrete random variables, there are several important families of continuous random variables that are often used as models of real-world situations. We investigate a few of these in the next three subsections.

4.4.2.
Uniform Distributions

The continuous uniform distribution has a pdf that is constant on some interval.

Definition 4.4.8 (uniform random variable). A continuous uniform random variable on the interval [a, b] is the random variable with pdf given by

f(x; a, b) = 1/(b − a) for x ∈ [a, b], and f(x; a, b) = 0 otherwise.

It is easy to confirm that this function is indeed a pdf. We could integrate, or we could simply use geometry: the region under the graph of the uniform pdf is a rectangle with width b − a and height 1/(b − a), so the area is 1.

Example 4.4.9.

Q. Let X be uniform on [0, 10]. What is P(X > 7)? What is P(3 ≤ X < 7)?

A. Again we argue geometrically. P(X > 7) is represented by a rectangle with base from 7 to 10 along the x-axis and a height of 0.1, so P(X > 7) = 3 · 0.1 = 0.3. Similarly, P(3 ≤ X < 7) = 0.4. In fact, for any interval of width w contained in [0, 10], the probability that X falls in that particular interval is w/10. We could also compute these results by integrating, but this would be silly.

Example 4.4.10.

Q. Let X be uniform on the interval [0, 1] (which we denote X ∼ Unif(0, 1)). What is the cdf for X?

A. For x ∈ [0, 1], FX(x) = ∫_0^x 1 dt = x, so FX(x) = 0 for x ∈ (−∞, 0), FX(x) = x for x ∈ [0, 1], and FX(x) = 1 for x ∈ (1, ∞).

(Plots: the pdf and cdf for Unif(0,1).)

Although it has a very simple pdf and cdf, this random variable actually has several important uses. One such use is related to random number generation. Computers are not able to generate truly random numbers. Algorithms that attempt to simulate randomness are called pseudo-random number generators. X ∼ Unif(0, 1) is a model for an idealized random number generator. Computer scientists compare the behavior of a pseudo-random number generator with the behavior that would be expected for X to test the quality of the pseudo-random number generator.
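As an aside not in the original text, integrals like the ones in Examples 4.4.4 and 4.4.6 can also be checked numerically with R's integrate() function:

```r
# Numerically check Example 4.4.4: f(x) = 3x^2 on [0, 1] is a pdf,
# and P(X <= 1/2) = 1/8.
f <- function(x) ifelse(x >= 0 & x <= 1, 3 * x^2, 0)
integrate(f, lower = 0, upper = 1)$value     # total probability: 1
integrate(f, lower = 0, upper = 0.5)$value   # P(X <= 1/2): 0.125
```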
There are R functions for computing the pdf and cdf of a uniform random variable, as well as a function to return random numbers. An additional function computes the quantiles of the uniform distribution. If X ∼ Unif(min, max), the following functions can be used.

function (& parameters)   explanation
runif(n,min,max)          makes n random draws of the random variable X and returns them in a vector.
dunif(x,min,max)          returns fX(x) (the pdf).
punif(q,min,max)          returns P(X ≤ q) (the cdf).
qunif(p,min,max)          returns x such that P(X ≤ x) = p.

Here are examples of computations for X ∼ Unif(0, 10).

> runif(6,0,10)   # 6 random values on [0,10]
[1] 5.449745 4.124461 3.029500 5.384229 7.771744 8.571396
> dunif(5,0,10)   # pdf is 1/10
[1] 0.1
> punif(5,0,10)   # half the distribution is below 5
[1] 0.5
> qunif(.25,0,10) # 1/4 of the distribution is below 2.5
[1] 2.5

4.4.3. Exponential Distributions

In Example 4.4.1 we considered a “waiting time” random variable, namely the waiting time until the next radioactive event. Waiting times are important random variables in reliability studies. For example, a common characteristic of a manufactured object is its MTTF, or mean time to failure. The model often used for the Geiger counter random variable is the exponential distribution. Note that a waiting time can be any x in the range 0 ≤ x < ∞.

Definition 4.4.11 (The exponential distribution). The random variable X has the exponential distribution with parameter λ > 0 (X ∼ Exp(λ)) if X has the pdf

fX(x) = λe^(−λx) for x ≥ 0, and fX(x) = 0 for x < 0.

It is easy to see that the function fX of the previous definition is a pdf for any positive value of λ. R refers to the value of λ as the rate, so the appropriate functions in R are rexp(n,rate), dexp(x,rate), pexp(q,rate), and qexp(p,rate). We will see later that rate is an apt name for λ, as λ will be the rate per unit time if X is a waiting-time random variable.

Example 4.4.12.
Suppose that a random variable T measures the time until the next radioactive event is recorded at a Geiger counter (time measured since the last event). For a particular radioactive material, a plausible model for T is T ∼ Exp(0.1) where time is measured in seconds. Then the following R session computes some important values related to T.

> pexp(q=0.1,rate=.1)   # probability waiting time less than .1
[1] 0.009950166
> pexp(q=1,rate=.1)     # probability waiting time less than 1
[1] 0.09516258
> pexp(q=10,rate=.1)
[1] 0.6321206
> pexp(q=20,rate=.1)
[1] 0.8646647
> pexp(100,rate=.1)
[1] 0.9999546
> pexp(30,rate=.1)-pexp(5,rate=.1)  # probability waiting time between 5 and 30
[1] 0.5567436
> qexp(p=.5,rate=.1)    # probability is .5 that T is less than 6.93
[1] 6.931472

The graphs in Figure 4.4 are graphs of the pdf and cdf of this random variable. All exponential distributions look the same except for the scale. The rate of 0.1 here means that we can expect that in the long run this process will average 0.1 counts per second.

Notice that when given a random variable such as the waiting time to a Geiger counter event, we are not handed its pdf as well. The pdf is a model of the situation. In the case of an example such as this, we really are faced with two decisions.

1. Which family (e.g., uniform, exponential, etc.) of distributions best models the situation?
2. What particular values of the parameters should we use for the pdf?

[Figure 4.4.: The pdf and cdf of the random variable T ∼ Exp(0.1).]

Sometimes we can begin to answer question 1 even before we collect data. Each of the distributions that we have met has certain properties which we check against our process. For example, it is often apparent whether the properties of a binomial process should apply to a certain process we are examining.
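The pexp values above can also be checked by hand, since the exponential cdf has the closed form P(T ≤ t) = 1 − e^{−λt} for t ≥ 0. A small Python sketch (the function name pexp below just mimics R's and is not a library call):

```python
import math

def pexp(t, rate):
    # cdf of the exponential distribution: P(T <= t) = 1 - exp(-rate*t)
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0

print(pexp(10, 0.1))                  # about 0.632, matching pexp(q=10,rate=.1)
print(pexp(30, 0.1) - pexp(5, 0.1))   # about 0.557, P(5 < T < 30)
print(math.log(2) / 0.1)              # about 6.93, the median waiting time
```

Solving 1 − e^{−λm} = 1/2 gives the median m = ln(2)/λ, which is where the qexp(p=.5,rate=.1) value of 6.93 comes from.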
Of course it is always useful to check our answer to question 1 by collecting data and verifying that the shape of the distribution of the data collected is consistent with the distribution we are using. The only reasonable way to answer the second question, however, is to collect data. In the last example, for instance, we saw that if X ∼ Exp(0.1) then P(X ≤ 6.93) = .5. Therefore if about half of our data are less than 6.93, we would say that the data are consistent with the hypothesis that X ∼ Exp(0.1), but if almost all the data are less than 5, we would probably doubt that X has this distribution. The problem of choosing the appropriate distribution and the appropriate values of the parameters is an important one that we will address in various ways in Chapter 5.

4.4.4. Weibull Distributions

A very important generalization of the exponential distribution is the Weibull distribution. A Weibull distribution is often used by engineers to model phenomena such as failure, manufacturing, or delivery times. Weibull distributions have also been used for applications as diverse as fading in wireless communications channels and wind velocity. The Weibull is a two-parameter family of distributions. The two parameters are a shape parameter α and a scale parameter β.

Definition 4.4.13 (The Weibull distributions). The random variable X has a Weibull distribution with shape parameter α > 0 and scale parameter β > 0 (X ∼ Weibull(α, β)) if the pdf of X is

    f_X(x; α, β) = (α/β^α) x^{α−1} e^{−(x/β)^α} for x ≥ 0, and 0 for x < 0.

[Figure 4.5.: Left: fixed β. Right: fixed α.]

Notice that if X ∼ Weibull(1, β) then X ∼ Exp(1/β). Varying α in the Weibull distribution changes the shape of the distribution while changing β changes the scale.
The effect of fixing β (β = 5) and changing α (α = 1, 2, 3) is illustrated by the first graph in Figure 4.5, while the second graph shows the effect of changing β (β = 1, 3, 5) with α fixed at α = 2. The appropriate R functions to compute with the Weibull distribution are dweibull(x,shape,scale), pweibull(q,shape,scale), etc.

Example 4.4.14. The Weibull distribution is sometimes used to model the maximum wind velocity measured during a 24 hour period at a specific location. The dataset http://www.calvin.edu/~stob/data/wind.csv gives the maximum wind velocity at the San Diego airport on each of 6,209 consecutive days. It is claimed that the maximum wind velocity measured on a day behaves like a random variable W that has a Weibull distribution with α = 3.46 and β = 16.90. The R code below investigates that model using this past data. (In fact, this model is not a very good one although the output below suggests that it might be plausible.)

> w$Wind
  [1] 14 11 10 13 11 11 26 21 14 13 10 10 13 10 13 13 12 12 13 17 11 11 13 25 15
 [26] 18 13 17 12 14 15 10 16 17 17 13 18 14 12 20 11 14 20 16 12 14 18 17 13 16
 [51] 13 16 11 13 11 15 13 15 16 18 14 15 15 14 14 16 15 18 14 16 14 10 17 14 12
.............
> cutpts=c(0,5,10,15,20,25,30)
> table(cut(w$Wind,cutpts))
  (0,5]  (5,10] (10,15] (15,20] (20,25] (25,30]
      2     434    3303    1910     409      95
> length(w$Wind[w$Wind<12.5])/6209   # 27.3% of days with max windspeed less than 12.5
[1] 0.2728298
> pweibull(12.5,3.46,16.9)           # 29.7% predicted by Weibull model
[1] 0.2968784
> length(w$Wind[w$Wind<22.5])/6209
[1] 0.951361
> pweibull(22.5,3.46,16.9)
[1] 0.9322498
> simulation=rweibull(100000,3.46,16.9)  # 100,000 simulated days
> mean(simulation)    # simulated days have mean about the same as actual
[1] 15.18883
> mean(w$Wind)
[1] 15.32405
> sd(simulation)      # simulated days have greater variation
[1] 4.85144
> sd(w$Wind)
[1] 4.239603
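For x ≥ 0 the Weibull cdf has the closed form P(X ≤ x) = 1 − e^{−(x/β)^α}, so the pweibull values in the session above can be reproduced without R. A Python sketch (the function name pweibull here simply imitates R's):

```python
import math

def pweibull(x, shape, scale):
    # cdf of the Weibull distribution: P(X <= x) = 1 - exp(-(x/scale)**shape)
    return 1.0 - math.exp(-((x / scale) ** shape)) if x >= 0 else 0.0

# Check against the wind model W ~ Weibull(3.46, 16.90):
print(round(pweibull(12.5, 3.46, 16.9), 4))   # about 0.2969
print(round(pweibull(22.5, 3.46, 16.9), 4))   # about 0.9322
```

Setting shape = 1 reduces this to the exponential cdf with rate 1/scale, matching the remark that a Weibull with shape parameter 1 is an exponential distribution.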
4.5. The Mean of a Random Variable

Just as numerical summaries of a data set can help us understand our data, numerical summaries of the distribution of a random variable can help us understand the behavior of that random variable. In this section we introduce the notion of the mean of a random variable. The name of this summary, mean, is no accident. The mean of a random variable is supposed to measure the "center" of a distribution in the same way that the mean of data measures the center of that data. We will use our experience with data to help us develop a definition.

4.5.1. The Mean of a Discrete Random Variable

Example 4.5.1.
Q. Let's begin with a motivating example. Suppose a student has taken 10 courses and received 5 A's, 4 B's and 1 C. Using the traditional numerical scale where an A is worth 4, a B is worth 3 and a C is worth 2, what is this student's GPA (grade point average)?
A. The first thing to notice is that (4 + 3 + 2)/3 = 3 is not correct. We cannot simply add up the values and divide by the number of values. Clearly this student should have a GPA that is higher than 3.0, since there were more A's than C's. Consider now a correct way to do this calculation and some algebraic reformulations of it.

    GPA = (4+4+4+4+4+3+3+3+3+2)/10
        = (5·4 + 4·3 + 1·2)/10
        = 4·(5/10) + 3·(4/10) + 2·(1/10)
        = 3.4

Our definition of the mean of a random variable follows the example above. Notice that we can think of the GPA as a sum of terms of the form

    (grade)(proportion of students getting that grade).

Since the limiting proportion of outcomes that have a particular value is the probability of that value, we are led to the following definition.

Definition 4.5.2 (mean). Let X be a discrete random variable with pmf f. The mean (also called expected value) of X is denoted µ_X or E(X) and defined by

    µ_X = E(X) = Σ_x x · f(x),

where the sum is taken over all possible values of X.

Example 4.5.3.
Q.
If we flip four fair coins and let X count the number of heads, what is E(X)?
A. If we flip four fair coins and let X count the number of heads, then the distribution of X is described by the following table. (Note that X ∼ Binom(4, .5).)

    value of X:    0     1     2     3     4
    probability: 1/16  4/16  6/16  4/16  1/16

So the expected value is

    0·(1/16) + 1·(4/16) + 2·(6/16) + 3·(4/16) + 4·(1/16) = 2.

On average we get 2 heads in 4 tosses. This is certainly in keeping with our informal understanding of the word average.

More generally, the mean of a binomial random variable is found by the following theorem.

Theorem 4.5.4. Let X ∼ Binom(n, π). Then E(X) = nπ.

Similarly, the mean of a hypergeometric random variable is just what we think it should be.

Theorem 4.5.5. Let X ∼ Hyper(m, n, k). Then E(X) = km/(m + n).

The following example illustrates the computation of the mean for a hypergeometric random variable.

> x=c(0:5)
> p=dhyper(x,m=4,n=25,k=5)
> sum(x*p)
[1] 0.6896552
> 4/29 * 5
[1] 0.6896552

4.5.2. The Mean of a Continuous Random Variable

If we think of probability as mass, then the expected value for a discrete random variable X is the center of mass of a system of point masses where a mass f_X(x) is placed at each possible value of X. The expected value of a continuous random variable should also be the center of mass where the pdf is now interpreted as density.

Definition 4.5.6 (mean). Let X be a continuous random variable with pdf f. The mean of X is defined by

    µ_X = E(X) = ∫_{−∞}^{∞} x f(x) dx.

Example 4.5.7. Recall the pdf in Example 4.4.4: f(x) = 3x² for x ∈ [0, 1], and 0 otherwise. Then

    E(X) = ∫_0^1 x · 3x² dx = 3/4.

The value 3/4 seems plausible from the graph of f.

We compute the mean of two of our favorite continuous random variables in the next theorem.

Theorem 4.5.8.
1. If X ∼ Unif(a, b) then E(X) = (a + b)/2.
2. If X ∼ Exp(λ) then E(X) = 1/λ.

Proof. The proof of each of these is a simple integral.
These are left to the reader.

Our intuition tells us that in a large sequence of trials of the random process described by X, the sample mean of the observations should usually be close to the mean of X. This is in fact true and is known as the Law of Large Numbers. We will not state that law precisely here but we will illustrate it using several simulations in R.

> r=rexp(100000,rate=1)
> mean(r)   # should be 1
[1] 0.9959467
> r=runif(100000,min=0,max=10)
> mean(r)   # should be 5
[1] 5.003549
> r=rbinom(100000,size=100,p=.1)
> mean(r)   # should be 10
[1] 9.99755
> r=rhyper(100000,m=10,n=20,k=6)
> mean(r)   # should be 2
[1] 1.99868

4.6. Functions of a Random Variable

After collecting data, we often transform it. That is, we apply some function to all the data. For example, we saw the value of using a logarithmic transformation (on the U.S. Counties data) to make a distribution more symmetric. Now consider the notion of transforming a random variable.

Definition 4.6.1 (transformation). Suppose that t is a function defined on all the possible values of the random variable X. Then the random variable t(X) is the random variable that has outcome t(x) whenever x is the outcome of X.

If the random variable Y is defined by Y = t(X), then Y itself has an expected value. To find the expected value of Y, we would need to find the pmf or pdf of Y, f_Y(y), and then use the definition of E(Y) to compute E(Y). Occasionally, this is easy to do, particularly in the case of a discrete random variable X.

Example 4.6.2. Suppose that X is the random variable that results when a single die is rolled and the number on its face recorded. The pmf of X is f(x) = 1/6, x = 1, 2, 3, 4, 5, 6, and E(X) = 3.5. Now suppose that for a certain game, the value Y = X² is interesting. Then the pmf of Y is easily seen to be f(y) = 1/6, y = 1, 4, 9, 16, 25, 36, and E(Y) = 91/6 ≈ 15.17.
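The two expected values in Example 4.6.2 can be verified with exact arithmetic. A short Python check (hypothetical, not part of the text's R sessions), using fractions to avoid rounding:

```python
from fractions import Fraction

# pmf of X, the value showing on a fair die: f(x) = 1/6 for x = 1, ..., 6.
pmf_x = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum over x of x * f(x).
ex = sum(x * p for x, p in pmf_x.items())
print(ex)   # 7/2, i.e. 3.5

# Y = X^2 takes the values 1, 4, 9, 16, 25, 36, each with probability 1/6.
ey = sum(x**2 * p for x, p in pmf_x.items())
print(ey)   # 91/6, about 15.17
```

Note that ey is not ex squared, confirming that E(X²) and [E(X)]² are different quantities here.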
Note that to find E(Y) we first found the pmf of Y and then found E(Y) using the usual method. Note that E(Y) ≠ [E(X)]²!

It turns out that there is a way to compute E(t(X)) that does not require us to first find f_Y. This is especially useful in the case that X is continuous.

Lemma 4.6.3. If X is a random variable (discrete or continuous) and t a function defined on the values of X, then if Y = t(X) and X has pdf (pmf) f_X,

    E(Y) = Σ_x t(x) f_X(x)               if X is discrete,
    E(Y) = ∫_{−∞}^{∞} t(x) f_X(x) dx     if X is continuous.

We will not give the proof but it is easy to see that this lemma should be so (at least for the discrete case) by looking at an example.

Example 4.6.4. Let X be the result of tossing a fair die. X has possible outcomes 1, 2, 3, 4, 5, 6. Let Y be the random variable |X − 2|. Then the lemma gives

    E(Y) = Σ_{x=1}^{6} |x − 2| · (1/6) = 1·(1/6) + 0·(1/6) + 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) = 11/6.

But we can also compute E(Y) directly from the definition. Noting that the possible values of Y are 0, 1, 2, 3, 4, we have

    E(Y) = Σ_{y=0}^{4} y f_Y(y) = 0·(1/6) + 1·(2/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) = 11/6.

The sum that computes E(Y) directly is clearly the same sum as the one given by the lemma, but in a "different order" and with some terms combined, since more than one x may produce a given value of Y.

Example 4.6.5. Suppose that X ∼ Unif(0, 1) and that Y = X². Then

    E(Y) = ∫_0^1 x² · 1 dx = 1/3.

This is consistent with the following simulation.

> x=runif(1000,0,1)
> y=x^2
> mean(y)
[1] 0.326449

While it is not necessarily the case that E(t(X)) = t(E(X)) (see problem 4.21), the next lemma shows that the expectation function is a "linear operator."

Lemma 4.6.6. If a and b are real numbers, then E(aX + b) = a E(X) + b.

4.6.1. The Variance of a Random Variable

We are now in a position to define the variance of a random variable. Recall that the variance of a set of n data points x_1, . . . , x_n is almost the average of the squared deviations from the sample mean.
    Var(x) = Σ_i (x_i − x̄)² / (n − 1)

The natural analogue for random variables is the following.

Definition 4.6.7 (variance, standard deviation of a random variable). Let X be a random variable. The variance of X is defined by

    σ_X² = Var(X) = E((X − µ_X)²).

The standard deviation is the square root of the variance and is denoted σ_X.

It is obvious from the definition that σ_X ≥ 0 and that σ_X > 0 unless X = µ_X with probability 1.

Example 4.6.8. Suppose that X is a uniform random variable, X ∼ Unif(0, 1). Then E(X) = 1/2. To compute the variance of X we need to compute

    ∫_0^1 (x − 1/2)² dx.

It is easy to see that the value of this integral is 1/12.

The following lemma records the variance of several of our favorite random variables.

Lemma 4.6.9.
1. If X ∼ Unif(a, b) then Var(X) = (b − a)²/12.
2. If X ∼ Exp(λ) then Var(X) = 1/λ².
3. If X ∼ Binom(n, π) then Var(X) = nπ(1 − π).
4. If X ∼ Hyper(m, n, k) then Var(X) = k · (m/(m+n)) · (n/(m+n)) · ((m+n−k)/(m+n−1)).

It is instructive to compare the variances of the binomial and the hypergeometric distribution. We do that in the next example.

Example 4.6.10. Suppose that a population has 10,000 voters and that 4,000 of them plan to vote for a certain candidate. We select 100 voters at random and ask them if they favor this candidate. Obviously, the number of voters X that favor this candidate has the distribution Hyper(4000, 6000, 100). This distribution has mean 40 and variance 100(.4)(.6)(.99). On the other hand, were we to treat this situation as sampling with replacement so that X ∼ Binom(100, .4), X would have mean 40 and variance 100(.4)(.6). The only difference in the two expressions for the variance is the term

    (m + n − k)/(m + n − 1),

which is sometimes called the finite population correction factor. It should really be called the sampling without replacement correction factor.

The following lemma sometimes helps us to compute the variance of X. It also is useful in understanding the properties of the variance.

Lemma 4.6.11. Suppose that the random variable X is either discrete or continuous with mean µ_X. Then

    σ_X² = E(X²) − µ_X².

Proof.
We have

    σ_X² = E((X − µ_X)²) = E(X² − 2µ_X X + µ_X²) = E(X²) − 2µ_X E(X) + µ_X² = E(X²) − 2µ_X² + µ_X² = E(X²) − µ_X².

Note that we have used the linearity of E, the fact that E(X) = µ_X, and also that E(c) = c if c is a constant.

4.7. The Normal Distribution

The most important distribution in statistics is called the normal distribution.

Definition 4.7.1 (normal distribution). A random variable X has the normal distribution with parameters µ and σ if X has pdf

    f(x; µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)},  −∞ < x < ∞.

We write X ∼ Norm(µ, σ) in this case. The mean and variance of a normal distribution are µ and σ², so that the parameters are aptly, rather than confusingly, named. R functions dnorm(x,mean,sd), pnorm(q,mean,sd), rnorm(n,mean,sd), and qnorm(p,mean,sd) compute the relevant values.

[Figure 4.6.: The pdf of a standard normal random variable.]

If µ = 0 and σ = 1 we say that X has a standard normal distribution. Figure 4.6 provides a graph of the density of the standard normal distribution. Notice the following important characteristics of this distribution: it is unimodal, symmetric, and can take on all possible real values, both positive and negative. The curve in Figure 4.6 suffices to understand all of the normal distributions due to the following lemma.

Lemma 4.7.2. If X ∼ Norm(µ, σ) then the random variable Z = (X − µ)/σ has the standard normal distribution.

Proof. To see this, we show that P(a ≤ Z ≤ b) is computed by the integral of the standard normal density function.

    P(a ≤ Z ≤ b) = P(a ≤ (X − µ)/σ ≤ b) = P(µ + aσ ≤ X ≤ µ + bσ) = ∫_{µ+aσ}^{µ+bσ} (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx.

Now in the integral, make the substitution u = (x − µ)/σ. We have then that

    ∫_{µ+aσ}^{µ+bσ} (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx = ∫_a^b (1/√(2π)) e^{−u²/2} du.

But the latter integral is precisely the integral that computes P(a ≤ U ≤ b) if U is a standard normal random variable.
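Lemma 4.7.2 means that every normal probability reduces to a standard normal one, and the standard normal cdf has a closed form in terms of the error function: Φ(z) = (1 + erf(z/√2))/2. A Python sketch of the within-k-standard-deviations probabilities (math.erf is in the standard library):

```python
import math

def phi(z):
    # Standard normal cdf written via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# P(-k <= Z <= k) for k = 1, 2, 3 standard deviations:
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))
```

The three probabilities come out near 68%, 95%, and 99.7%, which is exactly the benchmark behavior of the normal distribution.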
The normal distribution is used so often that it is helpful to commit to memory certain important probability benchmarks associated with it.

The 68–95–99.7 Rule
If Z has a standard normal distribution, then
1. P(−1 ≤ Z ≤ 1) ≈ 68%
2. P(−2 ≤ Z ≤ 2) ≈ 95%
3. P(−3 ≤ Z ≤ 3) ≈ 99.7%.

If the distribution of X is normal (but not necessarily standard normal), then these approximations have natural interpretations using Lemma 4.7.2. For example, we can say that the probability that X is within one standard deviation of the mean is about 68%.

Example 4.7.3. In 2000, the average height of a 19-year-old United States male was 69.6 inches. The standard deviation of the population of males was 5.8 inches. The distribution of heights of this population is well-modeled by a normal distribution. Then the percentage of males within 5.8 inches of 69.6 inches was approximately 68%. In R,

> pnorm(69.6+5.8,69.6,5.8)-pnorm(69.6-5.8,69.6,5.8)
[1] 0.6826895

It turns out that the normal distribution is a good model for many variables. Whenever a variable has a unimodal, symmetric distribution in some population, we tend to think of the normal distribution as a possible model for that variable. For example, suppose that we take repeated measures of a difficult-to-measure quantity such as the charge of an electron. It might be reasonable to assume that our measurements center on the true value of the quantity but have some spread around that true value. And it might also be reasonable to assume that the spread is symmetric around the true value with measurements closer to the true value being more likely to occur than measurements that are further away from the true value. Then a normal random variable is a candidate (and often used) model for this situation.

4.8. Exercises

4.1 Suppose that you roll 5 standard dice. Determine the probability that all the dice are the same. (Hint: first compute the probability that all five dice are sixes.)
4.2 Suppose that you deal 5 cards from a standard deck of cards. Determine the probability that all the cards are of the same color. (A standard deck of cards has 52 cards in two colors. There are 26 red and 26 black cards. You should be able to do this computation using R and the appropriate discrete distribution.)

4.3 Acceptance sampling is a procedure that tests some of the items in a lot and decides to accept or reject the entire lot based on the results of testing the sample. Suppose that the test determines whether an item is "acceptable" or "defective". Suppose that in a lot of 100 items, 4 are tested and that the lot is rejected if one or more of those four are found to be defective.
a) If 10% of the lot of 100 are defective, what is the probability that the purchaser will reject the shipment?
b) If 20% of the lot of 100 are defective, what is the probability that the purchaser will reject the shipment?

4.4 Suppose that there are 10,000 voters in a certain community. A random sample of 100 of the voters is chosen and asked whether they are for or against a new bond proposal. Suppose that in fact only 4,500 of the voters are in favor of the bond proposal.
a) What is the probability that fewer than half of the sampled voters (i.e., 49 or fewer) are in favor of the bond proposal?
b) Suppose instead that the sample consists of 2,000 voters. Answer the same question as in the previous part.

4.5 If the population is very large relative to the size of the sample, sampling with replacement should yield very similar results to that of sampling without replacement. Suppose that an urn contains 10,000 balls, 3,000 of which are white.
a) If 100 of these balls are chosen at random with replacement, what is the probability that at most 25 of these are white?
b) If 100 of these balls are chosen at random without replacement, what is the probability that at most 25 of these are white?
4.6 In the days before calculators, it was customary for textbooks to include tables of the cdf of the binomial distribution for small values of n. Of course not all values of π could be included; often only the values π = .1, .2, . . . , .8, .9 were included. Let's suppose that one of these tables includes the value of the cdf of the binomial distribution for all n ≤ 25, all x ≤ n and all these values of π.
a) To save space, the values of π = .6, .7, .8, .9 could be omitted. Give a clear reason why F(x; n, π) could be computed for these values of π from the other values in the table.
b) On the other hand, we could instead omit the values of x ≥ n/2. Show how the value of F(x; n, π) could be computed from the other values in the table for such omitted values of x. (Hint: one person's success is another person's failure.)

4.7 The number of trials in the ESP experiment, 25, was arbitrary and perhaps too small. Suppose that instead we use 100 trials.
a) Suppose that the subject gets 30 right. What is the p-value of this test statistic?
b) Suppose that the subject actually has a probability of .30 of guessing the card correctly. What is the probability that the subject will get at least 30 correct?

4.8 A basketball player claims to be a 90% free-throw shooter. Namely, she claims to be able to make 90% of her free-throws. Should we doubt her claim if she makes 14 out of 20 in a session at practice? Set this problem up as a hypothesis testing problem and answer the following questions.
a) What are the null and alternate hypotheses?
b) What is the p-value of the result 14?
c) If the decision rule is to reject her claim if she makes 15 or fewer free-throws, what is the probability of a Type I error?

4.9 Nationally, 79% of students report that they have cheated on an exam at some point in their college career. You can't believe that the number is this high at your own institution.
Suppose that you take a random sample of size 50 from your student body. Since 50 is so small compared to the size of the student body, you can treat this sampling situation as sampling with replacement for the purposes of doing a statistical analysis.
a) Write an appropriate set of hypotheses to test the claim that 79% of students cheat.
b) Construct a decision rule so that the probability of a Type I error is less than 5%.

4.10 A random variable X has the triangular distribution if it has pdf

    f_X(x) = 2x for x ∈ [0, 1], and 0 otherwise.

a) Show that f_X is indeed a pdf.
b) Compute P(0 ≤ X ≤ 1/2).
c) Find the number m such that P(0 ≤ X ≤ m) = 1/2. (It is natural to call m the median of the distribution.)

4.11 Let f(x) = k(x − 2)(x + 2) for −2 ≤ x ≤ 2, and 0 otherwise.
a) Determine the value of k that makes f a pdf. Let X be the corresponding random variable.
b) Calculate P(X ≥ 0).
c) Calculate P(X ≥ 1).
d) Calculate P(−1 ≤ X ≤ 1).

4.12 Describe a random variable that is neither continuous nor discrete. Does your random variable have a pmf? a pdf? a cdf?

4.13 Show that if f and g are pdfs and α ∈ [0, 1], then αf + (1 − α)g is also a pdf.

4.14 Suppose that a number of measurements that are made to 3 decimal digits of accuracy are each rounded to the nearest whole number. A good model for the "rounding error" introduced by this process is that X ∼ Unif(−.5, .5) where X is the difference between the true value of the measurement and the rounded value.
a) Explain why this uniform distribution might be a good model for X.
b) What is the probability that the rounding error has absolute value smaller than .1?

4.15 If X ∼ Exp(λ), find the median of X. That is, find the number m (in terms of λ) such that P(X ≤ m) = 1/2.

4.16 A part in the shuttle has a lifetime that can be modeled by the exponential distribution with parameter λ = 0.0001, where the units are hours. The shuttle mission is scheduled for 200 hours.
a) What is the probability that the part fails on the mission?
b) The event that is described in part (a) is BAD. So the shuttle actually runs three of these systems in parallel. What is the probability that the mission ends without all three failing if they are functioning independently?
c) Is the assumption of independence in the previous part a realistic one?

4.17 The lifetime of a certain brand of water heaters in years can be modeled by a Weibull distribution with α = 2 and β = 25.
a) What is the probability that the water heater fails within its warranty period of 10 years?
b) What is the probability that the water heater lasts longer than 30 years?
c) Using a simulation, estimate the average life of one of these water heaters.

4.18 Prove Theorem 4.5.8.

4.19 Suppose that you have an urn containing 100 balls, some unknown number of which are red and the rest are black. You choose 10 balls without replacement and find that 4 of them are red.
a) How many red balls do you think are in the urn? Give an argument using the idea of expected value.
b) Suppose that there were only 20 red balls in the urn. How likely is it that a sample of 10 balls would have at least 4 red balls?

4.20 The file http://www.calvin.edu/~stob/data/scores.csv contains a dataset that records the time in seconds between scores in a basketball game played between Kalamazoo College and Calvin College on February 7, 2003.
a) This waiting time data might be modeled by an exponential distribution. Make some sort of graphical representation of the data and use it to explain why the exponential distribution might be a good candidate for this data.
b) If we use the exponential distribution to model this data, which λ should we use? (A good choice would be to make the sample mean equal to the expected value of the random variable.)
c) Your model of part (b) makes a prediction about the proportion of times that the next score will be within 10, 20, 30 and 40 seconds of the previous score. Test that prediction against what actually happened in this game.

4.21 Show that it is not necessarily the case that E(t(X)) = t(E(X)).

4.22 Prove Lemma 4.6.6 in the case that X is continuous.

4.23 Let X be the random variable that results from tossing a fair six-sided die and reading the result (1–6). Since E(X) = 3.5, the following game seems fair. I will pay you 3.5² and then we will roll the die and you will pay me the square of the result. Is the game fair? Why or why not?

4.24 Not every distribution has a mean! Define

    f(x) = (1/π) · 1/(1 + x²),  −∞ < x < ∞.

a) Show that f is a density function. (The resulting distribution is called the Cauchy distribution.)
b) Show that this distribution does not have a mean. (You will need to recall the notion of an improper integral.)

4.25 In this problem we compare sampling with replacement to sampling without replacement. You will recall that the former is modeled by the binomial distribution and the latter by the hypergeometric distribution. Consider the following setting. There are 4,224 students at Calvin and we would like to know what they think about abolishing the interim. We take a random sample of size 100 and ask the 100 students whether or not they favor abolishing the interim. Suppose that 1,000 students favor abolishing the interim and the other 3,224 misguidedly want to keep it.
a) Suppose that we sample these 100 students with replacement. What is the mean and the variance of the random variable that counts the number of students in the sample that favor abolishing the interim?
b) Now suppose that we sample these 100 students without replacement. What is the mean and the variance of the random variable that counts the number of students in the sample that favor abolishing the interim?
c) Comment on the similarities and differences between the two. Give an intuitive reason for any difference.

4.26 Scores on IQ tests are scaled so that they have a normal distribution with mean 100 and standard deviation 15 (at least on the Stanford-Binet IQ Test).
a) MENSA, a society supposedly for persons of high intellect, requires a score of 130 on the Stanford-Binet IQ test for membership. What percentage of the population qualifies for MENSA?
b) One psychology text labels those with IQs of between 80 and 115 as having "normal intelligence." What percentage of the population does this range contain?
c) The top 25% of scores on an IQ test are in what range?

5. Inference - One Variable

In Chapter 2 we introduced random sampling as a way of making inferences about populations. Recall the framework. We first identified a population and some parameters of that population about which we wanted to make inferences. We then chose a sample, most often by simple random sampling, and computed statistics from that sample to allow us to make statements about the parameters. Alas, these statements were subject to sampling error. Armed now with the technology of the last two chapters, we develop this framework further with a particular emphasis on understanding sampling error. We will focus especially on the problem of making inferences about the mean of a population from that of a sample.

5.1. Statistics and Sampling Distributions

5.1.1. Samples as random variables

Suppose that we have a large population and a variable x defined on that population, and we would like to estimate the mean of x on that population. We choose a simple random sample x_1, . . . , x_n and compute x̄. How is this sample mean related to the population mean? In other words, what is likely to be the sampling error? Consider the first value of the sample, x_1.
This value is the result of a random variable, namely the random variable that results from choosing an individual from the population at random and measuring or recording the value of the variable x. We call that random variable X_1. Similarly, X_2 is the random variable that results from choosing the second element of the sample. And so forth. The result is a sequence of random variables X_1, . . . , X_n. Since we are now thinking of the data x_1, . . . , x_n as the result of the random variables X_1, . . . , X_n, the sample mean x̄ is the result of a random variable as well, namely

    X̄ = (X_1 + · · · + X_n)/n.

Then X̄ is a random variable and so it also has a distribution. We'll call the distribution of X̄ the sampling distribution of the mean since it is a distribution that results from sampling. The same kind of analysis can be done for any statistic. For example, we will write S_X² for the random variable that is the result of computing the sample variance from X_1, . . . , X_n. This is indeed a random variable: different possible samples may have different values of S_X². As another example, the sample median X̃ is a statistic and so it has a distribution as well.

We would like to know the distribution of the random variables X̄ and S_X² (as well as the distribution of any other statistics that we might want to compute). Obviously, these distributions depend on the distributions of X_1, . . . , X_n, which in turn depend on the underlying population. Before investigating this problem analytically, let's investigate it via simulation.

5.1.2. Big Example

In general, we do not know the distribution of the variable in the population. In order to illustrate what can happen in simple random sampling, we will do some simulation in a situation where we actually have the entire population. The dataset we will use is a dataset that contains information on every baseball game played in Major League Baseball during the 2003 season.
This population consists of 2430 games. The dataset is available at http://www.calvin.edu/~stob/data/baseballgames-2003.csv. For our variable of interest, we will consider the number of runs scored by the visitors in each game. In the population, the distribution of this variable is unimodal and positively skewed, as illustrated in Figure 5.1.

[Figure 5.1.: Runs scored by visitors in 2003 baseball games.]

Some numerical characteristics of this population are as follows.

> games=read.csv('http://www.calvin.edu/~stob/data/baseballgames-2003.csv')
> vs=games$visscore
> summary(vs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   2.000   4.000   4.656   7.000  19.000

Suppose that we take samples of size 2 from this population. It is in fact possible to generate all possible samples of size 2 and compute the mean of each such sample using the function combn().

> vs2mean=combn(vs,2,mean)   # applies mean to all combinations of 2 elements of vs
> summary(vs2mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.500   4.656   6.000  18.500

Note that the mean of the distribution of sample means of size 2 is the same as the mean of the population. This should be expected. The histogram of the sample means is in Figure 5.2.

[Figure 5.2.: All means of samples of size 2.]

We note the following two features of the distribution of sample means of samples of size 2: its spread is less than the spread of the population variable, and its shape, while still positively skewed, is less so. It is not realistic to generate the actual sampling distribution of X̄ for samples larger than size 2. For example, there are about 10^14 samples of size 5. However, simulation allows us to get a fairly good idea of what the distribution of X̄ looks like for larger sample sizes. Consider first samples of size 5.
> vs5mean=replicate(10000, mean(sample(vs,5,replace=F)))
> summary(vs5mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.600   3.600   4.600   4.669   5.600  10.800

[Figure 5.3.: Means of 10,000 samples of size 5.]

Comparing Figure 5.3 to Figure 5.2, we see that the distribution of the sample mean in samples of size 5 appears to have less spread and to be more symmetric than the distribution of sample means in samples of size 2. Now let's consider samples of size 30. Again, simulating this situation by choosing 10,000 such samples, we have the following results.

> vs30mean=replicate(10000,mean(sample(vs,30,replace=F)))
> summary(vs30mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.433   4.267   4.633   4.658   5.033   7.300

[Figure 5.4.: Means of 10,000 samples of size 30.]

With samples of size 30, we note that the spread of the distribution is now dramatically decreased. For example, the IQR is 0.76 (as compared to 2.0 for samples of size 5). This says that if we use the sample mean of a sample of size 30 to estimate the population mean (of 4.656), over 50% of the time we will be within 0.4 of the true value. Notice too from Figure 5.4 that the distribution of X̄30 appears to be unimodal and quite symmetric.

5.1.3. The Standard Framework

We are now conceiving of the simple random sample x1, . . . , xn from a population as the result of n random variables X1, . . . , Xn. What can we say about the distributions of these random variables? The first property is the Identically Distributed Property.

Identically Distributed Property. In simple random sampling, the random variables X1, . . . , Xn all have the same distribution. In fact, the distribution of Xi is the same as the distribution of the variable x in the population.
It is easy to see that this property is true in the case of simple random sampling. Each Xi is equally likely to be any one of the individuals in the population. Therefore the distribution of possible values of Xi is exactly the same as the distribution of actual values of x in the population. For example, if the values of x are normally distributed in the population, then Xi will have that same normal distribution.

One important fact to note, however, is that the random variables Xi are not independent of one another. In simple random sampling (which, among other properties, is sampling without replacement) the outcome of X2 is dependent on that of X1. This will usually be an annoyance to us in trying to analyze the distribution of certain statistics — independent random variables are easier to deal with. Therefore we will simplify and often assume that the Xi are independent. In fact, if we sample with replacement, this will be exactly true. And if the population is large, this will be "almost" true — sampling without replacement behaves almost like sampling with replacement. One general rule of thumb is that if the sample is of size less than 10% of the population, then it does not do much harm to treat sampling without replacement in the same way as sampling with replacement. Therefore we will usually assume that our sample random variables are independent.

The i.i.d. assumption. Random variables X1, . . . , Xn are called i.i.d. if they are independent and identically distributed. We will usually assume that the random variables X1, . . . , Xn that arise from a simple random sample are i.i.d. (For this reason, we will call i.i.d. random variables X1, . . . , Xn a random sample from X.) Given i.i.d. random variables X1, . . . , Xn, we will refer to their (common) distribution as the population distribution.

With all this background, we expand the meaning of our four important concepts.
Population: any random variable X
Parameter: a numerical property of X (e.g., µ_X)
Sample: i.i.d. random variables X1, . . . , Xn with the same distribution as X
Statistic: any function T = f(X1, . . . , Xn) of the sample

While we have motivated this terminology by the very important problem of sampling from a finite population, it is also useful for describing other situations. Suppose that we have a random variable X which (since it is a random process) is repeatable under essentially identical conditions. Suppose that the process is repeated n times. Then the results of those n trials X1, . . . , Xn are i.i.d. random variables and so fit the framework above.

5.2. The Sampling Distribution of the Mean

In this section we consider the problem of determining the sampling distribution of the mean. Namely, we assume that X1, . . . , Xn are i.i.d. random variables with population random variable X, and we want to explore the relationship between the distribution of X̄ and that of X. The fundamental tool in studying this problem is the following theorem.

Theorem 5.2.1. Suppose that Y and Z are random variables. Then
1. if c is a constant, E(cY) = c E(Y) and Var(cY) = c² Var(Y),
2. E(Y + Z) = E(Y) + E(Z), and
3. if Y and Z are independent, then Var(Y + Z) = Var(Y) + Var(Z).

We will not prove this theorem. Part (1) is easy to prove (it's a simple fact about integrals or sums). Part (2) certainly fits our intuition. Part (3) is not obvious. While there certainly should be some relationship between the variance of Y + Z and those of Y and Z, the fact that variances are additive seems almost accidental. Notice that this rule looks like a "Pythagorean Theorem" as it involves squares on both sides.

From this theorem, we now have one of the most important tools of inferential statistics.

Theorem 5.2.2 (The distribution of the sample mean). Suppose that X1, . . . , Xn are i.i.d.
random variables with population random variable X. Then
1. E(X̄) = E(X), and
2. Var(X̄) = Var(X)/n.

Proof. By Theorem 5.2.1, we have that E(X1 + · · · + Xn) = Σ E(Xi) = n E(X). Then

E(X̄) = E((X1 + · · · + Xn)/n) = (1/n) E(X1 + · · · + Xn) = (1/n) · n E(X) = E(X) .

Similarly,

Var(X̄) = Var((X1 + · · · + Xn)/n) = (1/n²) Var(X1 + · · · + Xn) = (1/n²)(n Var X) = Var(X)/n .

Example 5.2.3. We know that a random variable X such that X ∼ Unif(0, 1) has mean 1/2 and variance 1/12. Suppose that we have a random sample X1, . . . , X10 with population random variable X. Then X̄10 has mean 1/2 and variance 1/120. This is not inconsistent with the simulation below.

> means=replicate(10000,mean(runif(10,0,1)))
> mean(means)
[1] 0.4991267
> var(means)
[1] 0.008315763
> 1/120
[1] 0.008333333

Theorem 5.2.2 gives us two crucial pieces of information concerning the distribution of X̄. However, it does not tell us the shape of the distribution. In the example of Section 5.1.2, we noted that as the size of the sample increased, the empirical distribution of X̄ approached a more symmetrical distribution. This was not a property peculiar to that example. The next theorem is so important that we might call it the Fundamental Theorem of Statistics.

Theorem 5.2.4 (The Central Limit Theorem). Suppose that X is a random variable with mean µ and variance σ². For every n, let X̄n denote the sample mean of i.i.d. random variables X1, . . . , Xn which have the same distribution as X. Then as n gets large, the shape of the distribution of X̄n approaches that of a normal distribution. In particular, for every a, b,

lim_{n→∞} P(a ≤ (X̄n − µ)/(σ/√n) ≤ b) = P(a ≤ Z ≤ b)

where Z is a standard normal random variable.

The Central Limit Theorem (CLT) is a limit theorem. As such, it only provides an approximation. In using it, we will always be faced with the question of how large n needs to be so that the approximation is "close enough" for our purposes.
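The limit statement in the CLT can be checked numerically. The following is an illustrative sketch (in Python rather than the course's R): we draw many sample means from a skewed exponential population, standardize them as in the theorem, and compare an empirical tail probability with the standard normal value.

```python
import random
from statistics import NormalDist

random.seed(1)

mu, sigma, n = 1.0, 1.0, 30   # the Exp(1) population has mean 1 and sd 1
reps = 20000

# Proportion of standardized sample means that are <= 1
count = 0
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z = (xbar - mu) / (sigma / n ** 0.5)
    if z <= 1:
        count += 1

print(count / reps)         # close to the normal value below
print(NormalDist().cdf(1))  # P(Z <= 1) = 0.8413...
```

Even at n = 30, with a quite asymmetric population, the empirical proportion lands within a couple of percentage points of the normal probability.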
Nevertheless, it will be a crucial tool in making inferences about µ.

Example 5.2.5. Continuing Example 5.2.3, suppose again that X1, . . . , X10 is a random sample from a population X ∼ Unif(0, 1). By the Central Limit Theorem, X̄ is approximately normal with mean 1/2 and variance 1/120. Therefore we have the approximate probability statement

P(1/2 − √(1/120) ≤ X̄ ≤ 1/2 + √(1/120)) ≈ .68 .

Again, we can compare this with the results of a simulation.

> means=replicate(10000,mean(runif(10,0,1)))
> sum( (1/2-sqrt(1/120))<means & means<(1/2+sqrt(1/120)) )
[1] 6783
> pnorm(1)-pnorm(-1)
[1] 0.6826895

We know even more in the special case that the population random variable X is normally distributed.

Theorem 5.2.6. Suppose that X is normally distributed with mean µ and variance σ². Let X1, . . . , Xn be i.i.d. random variables with population random variable X. Then X̄n has a normal distribution with mean µ and variance σ²/n.

Example 5.2.7. The distribution of heights of 20 year old females in the United States in 2005 was very close to being normal with mean 163.3 cm and standard deviation 6.5 cm. If a random sample of 20 such females had been chosen, what is the probability that the mean of the sample was greater than 165 cm? Since the distribution of the sample mean of a sample of size 20 has mean 163.3 and standard deviation 6.5/√20 = 1.45, a sample mean of 165 has a z-score of (165 − 163.3)/1.45 = 1.17. Since 1-pnorm(1.17)=.12, this probability is 12%.

5.3. Estimating Parameters

The results of the last section taken together tell us that x̄ provides a good estimate of µ_X. In this section, we look at the problem of parameter estimation in general and identify properties to look for in good estimators. Suppose that X is a random variable and that θ is a parameter associated with X.
Examples of such parameters include µ_X and σ_X². Let X1, . . . , Xn be a random sample with population random variable X. With that setting, we have the following definition.

Definition 5.3.1 (estimator, estimate). An estimator of the parameter θ is any statistic θ̂ = f(X1, . . . , Xn) used to estimate θ. The value of θ̂ for a particular outcome of X1, . . . , Xn is called the estimate of θ.

Using the notation of the definition, X̄ should be written µ̂ and S_X² is σ̂_X².

5.3.1. Bias

Consider the following simple situation. We have one observation x from a random variable X ∼ Binom(n, π) and we wish to estimate π. An absolutely natural choice is to use x/n. In other words, π̂ = X/n. One way of justifying this choice is that E(X/n) = π, so "on average" this estimator gets it right. Consider another estimator, proposed by Laplace. He suggested using π̂L = (X + 1)/(n + 2). Notice that if π > 0.5, this estimator tends to underestimate π a bit by, on average, shading its estimate towards 0.5. Likewise, if π < 0.5, the estimate tends to be a little larger than π. In other words, Laplace's estimate has a bias.

Definition 5.3.2 (unbiased, bias). An estimator θ̂ of θ is unbiased if E(θ̂) = θ. The bias of an estimator θ̂ is E(θ̂) − θ.

It is important to note that θ is unknown and E(θ̂) depends on θ, so that in general we do not know the bias of an estimator. In the first example below, we look at examples where we can determine that an estimator is unbiased. In the second example, we look more carefully at the bias of Laplace's estimator. In the third example, we look at another biased estimator via a simulation.

Example 5.3.3. 1. Since E(X̄n) = µ_X for all random variables X no matter what the sample size n, we have that X̄n is an unbiased estimator of µ.
2. It can be shown that E(S²) = σ². Thus S² is an unbiased estimator of σ². This is the real reason for using n − 1 in the definition of S² rather than n.
(It is important to note that it does not follow that S is an unbiased estimator of σ. Indeed, this is not true.)
3. X/n is an unbiased estimator of π if X ∼ Binom(n, π).

Example 5.3.4. Consider Laplace's estimator π̂L = (X + 1)/(n + 2). We have

E(π̂L) = E((X + 1)/(n + 2)) = (1/(n + 2)) E(X + 1) = (n/(n + 2)) π + 1/(n + 2) .

Thus the bias of π̂L is

E(π̂L) − π = (n/(n + 2)) π + 1/(n + 2) − π = 1/(n + 2) − (2/(n + 2)) π = (1 − 2π)/(n + 2) .

If π = 0.5 then this estimator is unbiased, but the bias is negative if π > 0.5 and positive if π < 0.5.

Example 5.3.5. Suppose that we have a random sample from a population X ∼ Exp(λ). Since µ_X = 1/λ, we have that E(X̄) = 1/λ. Therefore a reasonable choice for an estimator of λ is λ̂ = 1/X̄. Notice that this estimator is not necessarily unbiased. We investigate with a simulation. We first consider random samples of size 5 and then random samples of size 20. We use λ = 10 in our simulation.

> hatlambda5 = replicate(10000,1/mean(rexp(5,10)))
> mean(hatlambda5)
[1] 12.47850
> hatlambda20 = replicate(10000,1/mean(rexp(20,10)))
> mean(hatlambda20)
[1] 10.51414

Note that in both cases, our estimator appears to be biased and produces an overestimate on average. The last example illustrates an important point: even if θ̂ is an unbiased estimator of θ, this does not mean that f(θ̂) is an unbiased estimator of f(θ).

5.3.2. Variance

An estimator is a random variable. In considering its bias, we are considering its mean. But its variance is also important — an estimator with large variance is not likely to produce an estimate close to the parameter it is trying to estimate.

Definition 5.3.6 (standard error). If θ̂ is an estimator for θ, the standard error of θ̂ is σ_θ̂ = √Var(θ̂). If we can estimate σ_θ̂, we write s_θ̂ for the estimate of σ_θ̂.

Example 5.3.7. Regardless of the population random variable X, we know that Var(X̄) = σ_X²/n. Thus σ_X̄ = σ_X/√n. To estimate this, it is natural to use s_X̄ = s_X/√n.

Example 5.3.8.
If X ∼ Binom(n, π), we have that π̂ = X/n has variance Var(π̂) = π(1 − π)/n. Thus

σ_π̂ = √(π(1 − π)/n) .

A good estimator for σ_π̂ can be found by using π̂ to estimate π. Thus

s_π̂ = √(π̂(1 − π̂)/n) .

An unbiased estimator with small variance is obviously the kind of estimator that we seek. We note that the sample mean is always an unbiased estimator of the population mean, and the variance of the sample mean goes to 0 as the sample size gets large.

5.3.3. Mean Squared Error

Bias is bad and so is high variance. We put these two measures together into one in this section.

Definition 5.3.9 (mean squared error). The mean squared error of an estimator θ̂ is MSE(θ̂) = E[(θ̂ − θ)²].

The mean squared error measures how far away θ̂ is from θ on average, where the measure of distance is our now familiar one of squaring.

Proposition 5.3.10. For any estimator θ̂ of θ, MSE(θ̂) = Var(θ̂) + Bias(θ̂)².

The proof of Proposition 5.3.10 is a messy computation and we will omit it. We illustrate the use of the MSE to compare the two estimators we have for the parameter π of the binomial distribution. Again, π̂ denotes the usual unbiased estimator and π̂L = (X + 1)/(n + 2) denotes the Laplace estimator. We have

Estimator   Bias               Variance
π̂           0                  π(1 − π)/n
π̂L          (1 − 2π)/(n + 2)   π(1 − π)/(n + 4 + 4/n)

It is obvious that π̂L has a smaller variance than π̂ (and it is clear why this should be so). It is not immediately obvious from the expressions above which has the smaller MSE. In fact, this depends on both π and n. In Figure 5.5, we plot the MSE of both estimators for samples of size 10 and size 30 respectively. Note that the Laplace estimator has smaller MSE for intermediate values of π, while the unbiased estimator has smaller MSE for extreme values of π. As we might expect, there is a greater difference between the two estimators for smaller samples than for large samples.
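The bias and variance expressions for the two estimators can be combined via Proposition 5.3.10 to compare their MSEs numerically. A small sketch (in Python, for illustration; the function names are ours):

```python
def mse_usual(pi, n):
    """MSE of pihat = X/n: unbiased, so MSE = variance = pi(1-pi)/n."""
    return pi * (1 - pi) / n

def mse_laplace(pi, n):
    """MSE of the Laplace estimator (X+1)/(n+2): variance + bias^2."""
    var = n * pi * (1 - pi) / (n + 2) ** 2
    bias = (1 - 2 * pi) / (n + 2)
    return var + bias ** 2

# For n = 10 the Laplace estimator wins near pi = 0.5 but loses near the ends.
print(mse_laplace(0.5, 10) < mse_usual(0.5, 10))    # True
print(mse_laplace(0.05, 10) < mse_usual(0.05, 10))  # False
```

This reproduces the qualitative picture of Figure 5.5: the biased estimator can have the smaller MSE for intermediate π.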
[Figure 5.5.: MSE of two estimators of π, sample sizes n = 10 and n = 30.]

5.4. Confidence Interval for Sample Mean

In this section, we introduce an important method for quantifying sampling error, the confidence interval. First, we'll look at a very special but important case.

5.4.1. Confidence Intervals for Normal Populations

Suppose that X1, . . . , Xn is a random sample with population random variable X with unknown mean µ and variance σ². Suppose too that the population random variable X has a normal distribution. Using Theorem 5.2.6 and one of our favorite facts about the standard normal distribution, we have

P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96) = .95 .

We now do some algebra to get

P(X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n) = .95 .

The interval

(X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n)

is a random interval. Now suppose that we know σ (an unlikely happenstance, we admit). For any particular set of data x1, . . . , xn the interval is simply a numerical interval. The key fact is that we are fairly confident that this interval contains µ.

Definition 5.4.1 (confidence interval). Suppose that X1, . . . , Xn is a random sample from a normal distribution with known variance σ². Suppose that x1, . . . , xn is the observed sample. The interval

(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n)

is called a 95% confidence interval for µ.

Example 5.4.2. A machine creates rods that are to have a diameter of 23 millimeters. It is known that the distribution of the diameters of the parts is normal and that the standard deviation of the actual diameters of parts created over time is 0.1 mm. A random sample of 40 parts is measured precisely to determine if the machine is still producing rods of diameter 23 mm.
The data and 95% confidence interval are given by

> x
 [1] 22.958 23.179 23.049 22.863 23.098 23.011 22.958 23.186 23.015 22.995
[11] 23.166 22.883 22.926 23.051 23.146 23.080 22.957 23.054 23.019 23.059
[21] 23.040 23.057 22.985 22.827 23.172 23.039 23.029 22.889 23.089 22.894
[31] 22.837 23.045 22.957 23.212 23.092 22.886 23.018 23.031 23.073 23.117
> mean(x)
[1] 23.024
> c(mean(x)-(1.96)*.1/sqrt(40),mean(x)+(1.96)*.1/sqrt(40))
[1] 22.993 23.055

It appears that the process could still be producing rods of diameter 23 mm. Of course, the example illustrates a problem with using this notion of confidence interval, namely that we need to know the standard deviation of the population. It is unlikely that we would be in a situation where the mean of the population is unknown but the standard deviation is known. One approach to solving this problem is to use an estimate for σ, namely s_X, the sample standard deviation. If the sample size is quite large, we hope that s_X is close to σ so that our confidence interval statement is approximately correct. In the case of a normal population random variable X, however, we know more.

5.4.2. The t Distribution

Definition 5.4.3 (t distribution). A random variable T has a t distribution (with parameter ν ≥ 1, called the degrees of freedom of the distribution) if it has pdf

f(x) = (1/√(πν)) · (Γ((ν + 1)/2)/Γ(ν/2)) · 1/(1 + x²/ν)^((ν+1)/2),   −∞ < x < ∞ .

Here Γ is the gamma function from mathematics, but all we need to know about the constant out front is that it exists to make the integral of the density function equal to 1. Some properties of the t distribution include
1. f is symmetric about x = 0 and unimodal. In fact f looks bell-shaped.
2. If ν > 1 then the mean of T is 0.
3. If ν > 2 then the variance of T is ν/(ν − 2).
4. For large ν, T is approximately standard normal.
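As a sanity check on the density formula and the variance property, the pdf can be integrated numerically. A sketch (in Python, for illustration; the helper name t_pdf is ours):

```python
import math

def t_pdf(x, nu):
    """pdf of the t distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(math.pi * nu) * math.gamma(nu / 2))
    return c / (1 + x * x / nu) ** ((nu + 1) / 2)

# Simple Riemann sums over a wide grid; for nu = 5 the tails beyond
# +/- 150 contribute a negligible amount.
nu, h = 5, 0.01
xs = [i * h for i in range(-15000, 15001)]
total = h * sum(t_pdf(x, nu) for x in xs)           # should be close to 1
second = h * sum(x * x * t_pdf(x, nu) for x in xs)  # should be close to nu/(nu-2)

print(total, second)  # approximately 1 and 5/3
```

The second moment comes out near ν/(ν − 2) = 5/3, matching property 3 above.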
In summary, the t distributions look very similar to the normal distribution except that they have slightly more spread, especially for small values of ν. R knows the t distribution of course, and the appropriate functions are dt(x,df), pt(), qt(), and rt(). The graphs of the normal distribution and two t distributions are shown below.

> x=seq(-3,3,.01)
> y=dt(x,3)
> z=dt(x,10)
> w=dnorm(x,0,1)
> plot(w~x,type="l",ylab="density")
> lines(y~x)
> lines(z~x)

[Figure: the standard normal density together with the t densities for ν = 3 and ν = 10.]

The important fact that relates the t distribution to the normal distribution is the following theorem, which is one of the most heavily used in statistics.

Theorem 5.4.4. If X1, . . . , Xn is a random sample from a normal distribution with mean µ and variance σ², then the random variable

(X̄ − µ)/(S/√n)

has a t distribution with n − 1 degrees of freedom.

To generate confidence intervals using this theorem, first define t_{β,ν} to be the unique number such that P(T > t_{β,ν}) = β, where T is a random variable that has a t distribution with ν degrees of freedom. We have the following:

Confidence Interval for µ. If x1, . . . , xn are the observed values of a random sample from a normal distribution with unknown mean µ and t* = t_{α/2,n−1}, the interval

(x̄ − t* s/√n, x̄ + t* s/√n)

is a 100(1 − α)% confidence interval for µ.

Example 5.4.5. It is plausible to think that the logs of populations of U.S. counties have a normal distribution. (We'll talk about how to test that claim at a later point.) In the following example, we look at a sample of 10 such counties and produce a 95% confidence interval for the mean of the log-population. To produce a 95% confidence interval, we need t.025, which is the 97.5% quantile of the t distribution. Notice that the true mean of our population random variable is 10.22, so in this case the confidence interval does capture the mean.
> counties=read.csv('http://www.calvin.edu/~stob/data/uscounties.csv')
> logpop=log(counties$Population)
> smallsample=sample(logpop,10,replace=F)   # our sample of size 10
> tstar = qt(.975,9)                        # 9 degrees of freedom
> xbar= mean(smallsample)
> s= sd(smallsample)
> c( xbar-tstar* s/sqrt(10), xbar+tstar * s/sqrt(10))
[1] 10.14891 12.01605

5.4.3. Interpreting Confidence Intervals

It is important to be very careful in making statements about what a confidence interval means. In Example 5.4.5, we can say something like "we are 95% confident that the true mean of the logs of population is in the interval (10.15, 12.02)." (This, at least, is what many AP Statistics students are taught to say.) But beware: this is not a probability statement! That is, we do not say that the probability that the true mean is in the interval (10.15, 12.02) is 95%. There is no probability after the experiment is done, only before. The correct probability statement is one that we make before the experiment. If we are to generate a 95% confidence interval for the mean of the population from a sample of size 10 from this population, then the probability is 95% that the resulting confidence interval will contain the mean. Another way of saying this, using the relative frequency interpretation of probability, is:

If we generate many 95% confidence intervals by this procedure, approximately 95% of them will contain the mean of the population.

After the experiment, a good way of saying what confidence means is this:

Either the population mean is in (10.15, 12.02) or something very surprising happened.

5.4.4. Variants on Confidence Intervals and Using R

Nothing is sacred about 95%. We could generate 90% confidence intervals or confidence intervals of any other level. There might also be a reason for generating one-sided confidence intervals, which could be done by eliminating one of the two tails of the t distribution in our computation.
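For the known-σ intervals of Section 5.4.1, changing the confidence level just changes the critical value: a 100·level% interval uses the (1 + level)/2 quantile of the standard normal in place of 1.96. A sketch (in Python, for illustration; the function name z_interval is ours):

```python
from statistics import NormalDist

def z_interval(xbar, sigma, n, level=0.95):
    """Confidence interval for the mean of a normal population with known sigma."""
    z = NormalDist().inv_cdf((1 + level) / 2)   # e.g. about 1.96 for level = 0.95
    half = z * sigma / n ** 0.5
    return (xbar - half, xbar + half)

# The rod example of Section 5.4.1: xbar = 23.024, sigma = 0.1, n = 40
print(z_interval(23.024, 0.1, 40))              # roughly (22.993, 23.055)
print(z_interval(23.024, 0.1, 40, level=0.90))  # a narrower 90% interval
```

The 95% call reproduces the interval computed by hand in Example 5.4.2, and lowering the level shrinks the interval, as expected.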
R will actually do all the computations for us. We illustrate.

Example 5.4.6. The file http://www.calvin.edu/~stob/data/March9bball.csv contains the results of all basketball games played in NCAA Division I on March 9, 2008. It might be a reasonable assumption that the visitors' scores in Division I games have a normal distribution and that the games of March 9 approximate a random sample. Proceeding on that assumption, we compute a variety of different confidence intervals. Notice that the output of t.test() gives a variety of information beyond simply the confidence interval.

> games=read.csv('http://www.calvin.edu/~stob/data/March9bball.csv')
> names(games)
[1] "Visitor" "Vscore"  "Home"    "Hscore"
> t.test(games$Vscore)

        One Sample t-test

data:  games$Vscore
t = 35.7926, df = 38, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 59.38840 66.50903
sample estimates:
mean of x
 62.94872

> t.test(games$Vscore,conf.level=.9)   # 90% confidence interval

        One Sample t-test

data:  games$Vscore
t = 35.7926, df = 38, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 59.98362 65.91382
sample estimates:
mean of x
 62.94872

> t.test(games$Vscore,conf.level=.9,alternative='greater')   # 90% one-sided interval

        One Sample t-test

data:  games$Vscore
t = 35.7926, df = 38, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
 60.65496      Inf
sample estimates:
mean of x
 62.94872

5.5. Non-Normal Populations

In this section we consider the problem of generating confidence intervals for the mean in the case that our population random variable does not have a normal distribution. Of course, it is not hard to find examples where this would be useful. Indeed, it is really not often that we know our population is normal.
Our advice in this section amounts to the following: we can often use the same confidence intervals that we used when the population is normal.

5.5.1. t Confidence Intervals are Robust

A statistical procedure is robust if it performs as advertised (at least approximately) even if the underlying distributional assumptions are not satisfied. The important fact about confidence intervals generated by the method of the last section is that they are robust against violations of the normality assumption if the sample size is not small and if the data do not have extreme outliers. To measure whether the t procedure works, we have the following definition.

Definition 5.5.1. Suppose that I is a random interval used as a confidence interval for θ. The coverage probability of I is P(θ ∈ I). (In other words, the coverage probability is the true confidence level of the confidence intervals produced by I.)

We would like 95% confidence intervals generated from the t distribution to have a 95% coverage probability even in the case that the normality assumption is not satisfied. We first look at some examples.

Example 5.5.2. We will use as our population the maximum wind velocity at the San Diego airport on 6,209 consecutive days. The true mean of this population is 15.32. We generate 10,000 samples of each of size 10, 30 and 50.
> w=read.csv('http://www.calvin.edu/~stob/data/wind.csv')
> m=mean(w$Wind)
# samples of size 10
> intervals= replicate(10000,t.test(sample(w$Wind,10,replace=F))$conf.int)
> sum(intervals[1,]<m & intervals[2,]>m)
[1] 9346
# samples of size 30
> intervals= replicate(10000, t.test(sample(w$Wind,30,replace=F))$conf.int)
> sum(intervals[1,]<m & intervals[2,]>m)
[1] 9427
# samples of size 50
> intervals= replicate(10000, t.test(sample(w$Wind,50,replace=F))$conf.int)
> sum(intervals[1,]<m & intervals[2,]>m)
[1] 9441

We find that we do not quite achieve our desired goal of 95% confidence intervals, though it appears that for samples of size 50 we have approximately 94.4% confidence intervals.

Example 5.5.3. Suppose that X ∼ Exp(0.2) so that µ_X = 5. We generate 10,000 different random samples of size 10 from this distribution and compute the 95% confidence interval given by the t distribution in each case. We note that we do not have exceptional success: only 89.2% of the 95% confidence intervals contain the mean.

> # samples of size 10 from an exponential distribution with mean 5
> # t.test()$conf.int recovers just the confidence interval
> intervals = replicate(10000, t.test(rexp(10,.2))$conf.int)
> # now count the intervals that capture the mean
> sum (intervals[1,]<5 & intervals[2,]>5)
[1] 8918

With random samples of size 30, we do better, and with samples of size 50 better yet. However, in no case do we achieve the 95% coverage probability that we desire. The exponential distribution is quite asymmetric.

# samples of size 30
> intervals = replicate(10000, t.test(rexp(30,.2))$conf.int)
> sum (intervals[1,]<5 & intervals[2,]>5)
[1] 9297
# samples of size 50
> intervals = replicate(10000, t.test(rexp(50,.2))$conf.int)
> sum (intervals[1,]<5 & intervals[2,]>5)
[1] 9348

In neither of the last two examples did we achieve our objective of 95% confidence intervals containing the mean 95% of the time.
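The same kind of coverage experiment can be sketched outside R. The toy version below (in Python, for illustration) uses the large-sample interval x̄ ± 1.96 s/√n in place of the exact t interval, since Python's standard library has no t quantiles; for samples of size 10 from an exponential population, the coverage again falls well short of 95%.

```python
import random
from statistics import mean, stdev

random.seed(2)

n, reps, mu = 10, 10000, 5.0   # the Exp(0.2) population has mean 5
hits = 0
for _ in range(reps):
    x = [random.expovariate(0.2) for _ in range(n)]
    xbar, s = mean(x), stdev(x)
    half = 1.96 * s / n ** 0.5
    if xbar - half < mu < xbar + half:
        hits += 1

print(hits / reps)   # noticeably below 0.95
```

Because this interval uses 1.96 rather than the wider t critical value, its coverage is even a bit lower than the roughly 89% seen in the R simulation above; the qualitative conclusion is the same.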
The next example uses the Weibull distribution with parameters that make it fairly symmetric.

Example 5.5.4. The Weibull distribution with parameters α = 5 and β = 10 has mean 9.181687. We generate samples of size 10, 30 and 50. Note that we achieve almost exactly 95% confidence intervals regardless of the sample size.

> m=9.181687   # mean of Weibull distribution with parameters 5, 10
> intervals = replicate(10000, t.test(rweibull(10,5,10))$conf.int)
> sum (intervals[1,]<m & intervals[2,]>m)
[1] 9502
> intervals = replicate(10000, t.test(rweibull(30,5,10))$conf.int)
> sum (intervals[1,]<m & intervals[2,]>m)
[1] 9499
> intervals = replicate(10000, t.test(rweibull(50,5,10))$conf.int)
> sum (intervals[1,]<m & intervals[2,]>m)
[1] 9496

5.5.2. Why are t Confidence Intervals Robust?

Let's consider generating a 95% confidence interval from 30 data points x1, . . . , x30. The t confidence interval in this case is

(x̄ − 2.05 s/√30, x̄ + 2.05 s/√30) .   (5.1)

The magic number 2.05, of course, is just t.025,29. Let's approach the problem of generating a confidence interval from a different direction, namely via the Central Limit Theorem. The CLT says that the random variable

(X̄ − µ)/(σ/√n)

has a distribution that is approximately standard normal (if we believe that n = 30 is large). We therefore have the following approximate probability statement:

P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96) ≈ .95 .

This leads to the approximate 95% confidence interval

(x̄ − 1.96 σ/√30, x̄ + 1.96 σ/√30) .   (5.2)

The problem with this interval (besides the fact that it is only approximate) is that σ is not known. Now for a reasonably large sample size, we might expect that the value s of the sample standard deviation is close to σ. If we replace σ in 5.2 by s, we have the interval

(x̄ − 1.96 s/√30, x̄ + 1.96 s/√30) .
30 30 Now we see that the only difference between this interval (which involves two approximations) and the interval of Equation 5.1 that results from the t-distribution is the difference between the numbers 1.96 and 2.04. It is easy to give an argument for using a larger number than 1.96 — using 2.04 helps compensate for the fact that we are making several approximations in constructing the interval by expanding the width of the interval slightly. Of course we should note that the t intervals do not perform equally well regardless of the population. The performance of this method depends on the shape of the distribution (symmetric, unimodal is best) and the sample size (the larger the better). 5.6. Confidence Interval for Proportion To estimate the proportion of individuals in a population with a certain property, we often choose a random sample and use as an estimate the proportion of individuals in the sample with that property. This is the methodology of political polls, for example. While this random process is best modeled by the hypergeometric distribution, we normally use the binomial distribution instead if the size of the population is large relative to the size of the sample. So then, assume that we have a binomial random variable X ∼ Binom(n, π) where as usual n is known but π is not. Then of course the obvious estimator for π is π̂ = X n and it is an unbiased estimator of π. Of course we would also like to write a confidence interval for π so that we know the precision of our estimate. Because X is discrete, there is no good way to write exact confidence intervals for π, but the Central Limit Theorem allows us to write an approximate confidence interval that is really quite good. The key is to understanding the relationship between the binomial distribution and the Central Limit Theorem. 520 5.6. Confidence Interval for Proportion Theorem 5.6.1. Suppose that X ∼ Binom(n, π). 
Then if n is large, the random variable

    (X/n − π) / √( π(1−π)/n )

has a distribution that is approximately standard normal.

Proof. Let the individual trials of the random process X be denoted X1, …, Xn. This sequence is i.i.d. In fact Xi ∼ Binom(1, π). Obviously µXi = π and σ²Xi = π(1 − π) for each i, and X = ΣXi. We apply the CLT to the sequence X1, …, Xn. The random variable X/n is the sample mean for this i.i.d. sequence and so has mean π and variance π(1−π)/n. The result follows.

The Theorem suggests how to find an (approximate) confidence interval. For a fixed β, let zβ be the number such that P(Z > zβ) = β where Z is the standard normal random variable. Then we have the following approximate equality from the CLT.

    P( −zα/2 < (π̂ − π)/√( π(1−π)/n ) < zα/2 ) ≈ 1 − α    (5.3)

Equation 5.3 is the starting point for several different approximate confidence intervals. As we did for confidence intervals for µ, we should attempt to use Equation 5.3 to isolate π in the "middle" of the inequalities. The first two steps are

    P( −zα/2 √( π(1−π)/n ) < π̂ − π < zα/2 √( π(1−π)/n ) ) ≈ 1 − α,

and thus

    P( π̂ − zα/2 √( π(1−π)/n ) < π < π̂ + zα/2 √( π(1−π)/n ) ) ≈ 1 − α .    (5.4)

The problem with 5.4 is that the unknown π appears not only in the middle of the inequalities but also in the bounds. Thus we do not yet have a true confidence interval since the endpoints are not statistics that we can compute from the data.

The Wald interval. The Wald interval results from replacing π by π̂ in the endpoints of the interval of 5.4.

    ( π̂ − zα/2 √( π̂(1−π̂)/n ) , π̂ + zα/2 √( π̂(1−π̂)/n ) )

Until recently, this was the standard confidence interval suggested in most elementary statistics textbooks if the sample size is large enough. (In fact this interval still receives credit on the AP Statistics Test.) Books varied as to what "large enough" meant. A typical piece of advice is to use this interval only if nπ̂(1 − π̂) ≥ 10.
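The Wald recipe is easy to compute directly, and its actual coverage can be checked by simulation. The following is a sketch: `wald.ci` is our own helper name, not a base R function, and the choices π = 0.2 and n = 50 in the simulation are illustrative.

```r
# A sketch: the Wald interval as a function, plus a Monte Carlo check of
# its coverage. wald.ci is our own helper, not part of base R.
wald.ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf)/2)
  pihat <- x/n
  se <- sqrt(pihat * (1 - pihat)/n)
  c(pihat - z*se, pihat + z*se)
}
wald.ci(190, 354)   # compare with prop.test(190, 354, correct=F)

# estimated coverage of the nominal 95% Wald interval when pi = .2, n = 50
set.seed(1)
x <- rbinom(10000, 50, 0.2)
cis <- sapply(x, wald.ci, n = 50)
mean(cis[1,] < 0.2 & 0.2 < cis[2,])
```

The estimated coverage typically lands noticeably below the nominal 95%, which motivates the warning that follows.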
However, you should never use this interval. The coverage probability of the (approximately) 95% Wald confidence interval is almost always less than 95% and can be quite a bit less depending on π and the sample size. For example, if π = .2, it takes a sample size of 118 to guarantee that the coverage probability of the Wald confidence interval is at least 93%. For very small probabilities, it takes thousands of observations to ensure that the coverage probability of the Wald interval approaches 95%.

The Wilson Interval. At least since 1927, a much better interval than the Wald interval has been known, although it wasn't always appreciated how much better the Wilson interval is. The Wilson interval is derived by solving the inequality in 5.3 so that π is isolated in the middle. After some algebra and the quadratic formula, we get the following (impressive looking) approximate confidence interval statement:

    P( [ π̂ + z²α/2/(2n) − zα/2 √( π̂(1−π̂)/n + z²α/2/(4n²) ) ] / [ 1 + z²α/2/n ]
         < π <
       [ π̂ + z²α/2/(2n) + zα/2 √( π̂(1−π̂)/n + z²α/2/(4n²) ) ] / [ 1 + z²α/2/n ] ) ≈ 1 − α

The Wilson interval performs much better than the Wald interval. If nπ̂(1 − π̂) ≥ 10, you can be reasonably certain that the coverage probability of the 95% Wilson interval is at least 93%. The Wilson interval is computed by R in the function prop.test(). The option correct=F needs to be used, however. (The option correct=T makes a "continuity" correction that comes from the fact that binomial data is discrete; it is not recommended for the Wilson interval.)

Example 5.6.2. In a poll taken in Mississippi on March 7, 2008, of 354 voters who were decided between Obama and Clinton, 190 said that they would vote for Obama in the Mississippi primary. We can estimate the proportion of voters in the population that will vote for Obama (of those who were decided on one of these two candidates) using the Wilson method.
> prop.test(190,354,correct=F)

        1-sample proportions test without continuity correction

data:  190 out of 354, null probability 0.5
X-squared = 1.9096, df = 1, p-value = 0.167
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4846622 0.5879957
sample estimates:
        p
0.5367232

We see that π̂ = .537 and that a 95% confidence interval for π is (.485, .588). This is often reported by the media as 53.7% ± 5.1% with no mention of the fact that a 95% confidence interval is being used. (Note that the center of the interval is not π̂, though in this case it agrees with π̂ to three decimal digits.)

Notice that the center of the Wilson interval is not π̂. It is

    [ π̂ + z²α/2/(2n) ] / [ 1 + z²α/2/n ]  =  ( x + z²α/2/2 ) / ( n + z²α/2 ) .

A way to think about this is that the center of the interval comes from adding z²α/2 trials and z²α/2/2 successes to the observed data. (For a 95% confidence interval, this is very close to adding two successes and four trials.) This is the basis for the next interval.

The Agresti-Coull Interval. Agresti and Coull (1998) suggest combining the biased estimator of π that is used in the Wilson interval together with the simpler estimate for the standard error that comes from the Wald interval. In particular, if we are looking for a 100(1 − α)% confidence interval and x is the number of successes observed in n trials, define

    x̃ = x + z²α/2/2
    ñ = n + z²α/2
    π̃ = x̃/ñ

Then the Agresti-Coull interval is

    ( π̃ − zα/2 √( π̃(1−π̃)/ñ ) , π̃ + zα/2 √( π̃(1−π̃)/ñ ) )

In practice, this estimator is even better than the Wilson estimator and is now widely recommended, even in basic statistics textbooks. For the particular example of x = 7 and n = 10, the Wilson and Agresti-Coull intervals are compared below. Note that the Agresti-Coull interval is somewhat wider than the Wilson interval. Of course wider intervals are more likely to capture the parameter.
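The Agresti-Coull recipe is easy to package as a function. The following is a sketch: `ac.ci` is our own name, not part of base R, and it uses the exact z²α/2 rather than the rounded "plus 4" values, so its endpoints differ slightly from the step-by-step transcript below.

```r
# A sketch: the Agresti-Coull interval as a function. ac.ci is our own
# helper; it adds z^2 trials and z^2/2 successes exactly (not the
# rounded "plus 4" values).
ac.ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf)/2)
  ntilde <- n + z^2
  pitilde <- (x + z^2/2)/ntilde
  se <- sqrt(pitilde * (1 - pitilde)/ntilde)
  c(pitilde - z*se, pitilde + z*se)
}
ac.ci(7, 10)
```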
> # The Wilson interval
> prop.test(7,10,correct=F)

        1-sample proportions test without continuity correction

data:  7 out of 10, null probability 0.5
X-squared = 1.6, df = 1, p-value = 0.2059
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3967781 0.8922087
sample estimates:
  p
0.7

> # The Agresti-Coull interval
> xtilde=9
> ntilde=14
> z=qnorm(.975)
> pitilde=xtilde/ntilde
> se= sqrt( pitilde * (1-pitilde)/ntilde )
> c( pitilde - z*se, pitilde + z*se )
[1] 0.3918637 0.8938505

In elementary statistics books, the Agresti-Coull interval is often presented as the "Plus 4" interval, and the instructions for computing it are simply to add four trials and two successes and then to compute the Wald interval.

5.7. The Bootstrap

Throughout this chapter we have been developing methods for making inferences about the unknown value of the parameter θ associated with a "population" random variable. In general, to estimate θ we need good answers to two questions:

1. What estimator θ̂ of θ should we use?
2. How accurate is the estimator θ̂?

For the case that θ is the population mean, we have a rich theory that answers these questions. We answered the questions by knowing two things:

1. the distribution of the population random variable (e.g., normal, binomial), and
2. how the sampling distribution of the estimator depends on the distribution of the population.

If we know the distribution of the population random variable but not how the sampling distribution of the estimator depends on it, we can often use simulation to get an idea of the sampling distribution of the estimator. Indeed, that is what we did in Section 5.1. In this section we look at the bootstrap, a "computer-intensive" technique for addressing these questions when we know neither of these two facts. We will illustrate the bootstrap with the following example dataset.
(This dataset is found in the package boot, which you may need to install from the internet.) The data are the times to failure for the air-conditioning unit of a certain Boeing 720 aircraft.

> aircondit$hours
 [1]   3   5   7  18  43  85  91  98 100 130 230 487
> mean(aircondit$hours)
[1] 108.0833

Suppose that we want to estimate the MTF, the mean time to failure of such air-conditioning units. Our estimate is 108 hours, but we would like an estimate of the precision of this estimate, e.g., a confidence interval. While the simple advice of Section 5.5 is to use the t-distribution, this is not really a good strategy since the dataset is quite small and the distribution of the data is quite skewed. Furthermore, the small size of the dataset does not suggest to us a particular distribution for the population (although engineers might naturally turn to some Weibull distribution).

The idea of the bootstrap is to generate lots of different samples from the population (as we did in Section 5.1). However, without any assumptions about the shape of the distribution of the population, the bootstrap uses the data itself to approximate that shape. In this case, 1/12 of our sample has the value 3, 1/12 of the sample has the value 5, etc. Therefore, we will model the population by assuming that 1/12 of the population has the value 3, 1/12 of the population has the value 5, etc.! Now to take a random sample of size 12 from such a population, we need only take a sample of size 12 from our original data with replacement. The idea of the bootstrap is to take many such samples, compute the value of the estimator for each sample, and thereby get an approximation to the sampling distribution of the estimator. Here are the steps to computing a bootstrap confidence interval for the mean of our air-conditioning failure time population.
The following R command chooses 1,000 different random samples of size 12 from our original random sample, with replacement, and computes the mean of each sample.

> means = replicate(1000, mean(sample(aircondit$hours,12,replace=T)))

These 1,000 means are our approximation of what would happen if we took 1,000 samples from the population of air-conditioning failure times. A histogram of these 1,000 means is in Figure 5.6. We now convert these 1,000 means to a confidence interval by using the quantile() function.

> quantile(means,c(0.025,0.975))
     2.5%     97.5%
 45.16042 190.33750

[Figure 5.6: 1,000 sample means of bootstrapped samples of air-conditioning failure times.]

It is reasonable to announce that the 95% confidence interval for µ is (45.16, 190.34) hours.

The bootstrap method illustrated above (called the bootstrap percentile confidence interval) is quite general. There was nothing special about the fact that we were constructing a confidence interval for the mean. Indeed, we could use the very same method to construct a confidence interval for any parameter, as long as we have a reasonable estimator for the parameter. (For parameters other than the mean, there are more sophisticated bootstrap methods that account for the fact that many estimators are biased.) We illustrate with one more example.

Example 5.7.1. The dataset city in the boot package consists of a random sample of 10 of the 196 largest cities of 1930. The variables are u, the population (in 1,000s) in 1920, and x, the population in 1930. The population is the 196 cities and we would like to know the value of θ = Σx/Σu, the ratio of increase of population in these cities from 1920 to 1930. The obvious estimator is θ̂ = Σx/Σu for the sample. We construct our bootstrap confidence interval for θ.
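As a sketch, the percentile method used for the mean can be wrapped in a small helper for any statistic of a single vector. Here `boot.pct` is our own name, not the boot package's boot.ci(), and the failure times are typed in directly so the block is self-contained.

```r
# A sketch: generic bootstrap percentile interval for a statistic of one
# vector. boot.pct is our own helper, not part of the boot package.
boot.pct <- function(x, stat = mean, B = 1000, conf = 0.95) {
  stats <- replicate(B, stat(sample(x, length(x), replace = TRUE)))
  quantile(stats, c((1 - conf)/2, 1 - (1 - conf)/2))
}

hours <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)  # aircondit$hours
set.seed(1)
boot.pct(hours)           # percentile interval for the mean
boot.pct(hours, median)   # the same method works for the median
```

For the city ratio, the statistic involves two columns at once, so the transcript below resamples row indices rather than the values themselves.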
> library(boot)
> city
     u   x
1  138 143
2   93 104
3   61  69
4  179 260
5   48  75
6   37  63
7   29  50
8   23  48
9   30 111
10   2  50
> thetahat=sum(city$x)/sum(city$u)   # estimate from sample
> thetahat
[1] 1.520312
> thetahats = replicate( 1000, { i=sample((1:10),10,replace=T) ;
+    us=city[i,]$u ; xs=city[i,]$x ;
+    sum(xs)/sum(us) } )
> quantile(thetahats, c(0.025,0.975))   # bootstrap confidence interval
    2.5%    97.5%
1.250343 2.127813

Notice that the confidence interval is very wide. This is only to be expected from such a small sample.

5.8. Testing Hypotheses About the Mean

In this section, we review the logic of hypothesis testing in the context of testing hypotheses about the mean. While the language of hypothesis testing is still quite common in the literature, it is fair to say that confidence intervals are a superior way to quantify inferences about the mean. The language of hypothesis testing is perhaps most useful when one needs to make a decision about the parameter in question. We first look at an example of a situation in which a decision rule is necessary.

Example 5.8.1. Kellogg's makes Raisin Bran and fills boxes that are labelled 11 oz. NIST mandates testing protocols to ensure that this claim is accurate. Suppose that a shipment of 250 boxes, called the inspection lot, is to be tested. The mandated procedure is to take a random sample of 12 boxes from this shipment. If any box is more than 1/2 ounce underweight, then the lot is declared defective. Otherwise, the sample mean x̄ and the sample standard deviation s are computed. The shipment is rejected if (x̄ − 11)/s ≤ −0.635.

We can view Example 5.8.1 as implementing a hypothesis test. Recall the technology. There are four steps as described in Section 4.3.

1. Identify the hypotheses.
2. Collect data and compute a test statistic.
3. Compute a p-value.
4. Draw a conclusion.
We go through these four steps in the case that our hypotheses are about the population mean µ, using the Kellogg's example as an illustration. We will suppose that X1, …, Xn is a random sample from a normal distribution with unknown mean µ and that we wish to make inferences about µ.

Identify the Hypotheses

We start with a null hypothesis, H0, the default or "status quo" hypothesis. We want to use the data to determine whether there is substantial evidence against it. The alternate hypothesis, Ha, is the hypothesis that we want to put forward as true if we have sufficient evidence in its favor. So in the Raisin Bran example, our pair of hypotheses is

    H0: µ = 11
    Ha: µ < 11 .

In general, our hypotheses for a test of means are one of the following three pairs:

    H0: µ = µ0      H0: µ = µ0      H0: µ = µ0
    Ha: µ < µ0      Ha: µ > µ0      Ha: µ ≠ µ0

where µ0 is some fixed number.

Collect data and compute a test statistic

We will use the following test statistic:

    T = (X̄ − µ0) / (S/√n) .

The important fact about this statistic is that if H0 is true then the distribution of T is known. (It is a t distribution with n − 1 degrees of freedom.) This is the key property that we need whenever we do a hypothesis test: we must have a test statistic whose distribution we know if H0 is true.

Compute a p-value

Recall that the p-value of the test statistic t is the probability that we would see a value at least as extreme as t (in the direction of the alternate hypothesis) if the null hypothesis were true. The R function t.test() computes the p-value if the argument alternative is appropriately set. Let's look at some possible Raisin Bran data.

> raisinbran
 [1] 11.01 10.91 10.94 11.01 10.97 11.01 10.95 10.93 10.92 10.83 11.02 10.84
> t.test(raisinbran,alternative="less",mu=11)

        One Sample t-test
data:  raisinbran
t = -2.9689, df = 11, p-value = 0.006385
alternative hypothesis: true mean is less than 11
95 percent confidence interval:
     -Inf 10.97827
sample estimates:
mean of x
   10.945

In this example, the p-value is 0.006. This means that if the null hypothesis (of µ = 11) were true, we would expect to get a value of the test statistic at least as extreme as the value we computed from the data (-2.9689) 0.6% of the time. This would be an extremely rare occurrence, so this is strong evidence against the null hypothesis.

Draw a conclusion

It is often enough to present the result of a hypothesis test by stating the p-value. What to do with that evidence is not really a statistical problem. It is sometimes necessary to go further, however, and to announce a decision. That is the case in the Raisin Bran example, where it is necessary to decide whether to reject the shipment as being underweight. In this case, we set up the hypothesis test in terms of a decision rule. The possible decisions are either to reject the null hypothesis (and accept the alternate hypothesis) or not to reject the null hypothesis. The decision rule is expressed in terms of the test statistic. In order to determine what the decision rule should be, we need to examine the errors involved in making an incorrect decision. Recall the kinds of errors that we might make:

1. A Type I error is the error of rejecting H0 even though it is true. The probability of a Type I error is denoted by α.
2. A Type II error is the error of not rejecting H0 even though it is false. The probability of a Type II error is denoted by β.

To construct a decision rule, we choose α, the probability of a Type I error. This number α is often called the significance level of the test. In this case, testing H0: µ = µ0 versus Ha: µ < µ0, our decision rule should be:

    Reject H0 if and only if t < −tα,n−1 .

This decision rule rejects H0, when H0 is in fact true, with probability α.
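As a sketch, the decision rule can be carried out by hand on the Raisin Bran data (the values are retyped from the transcript above so the block is self-contained), taking α = 0.05.

```r
# A sketch: apply the decision rule "reject H0 when t < -t_{alpha,n-1}"
# by hand for the Raisin Bran data, alpha = 0.05
raisinbran <- c(11.01, 10.91, 10.94, 11.01, 10.97, 11.01,
                10.95, 10.93, 10.92, 10.83, 11.02, 10.84)
n <- length(raisinbran)
t <- (mean(raisinbran) - 11)/(sd(raisinbran)/sqrt(n))
t                        # -2.9689, matching the t.test() output
crit <- qt(0.05, n - 1)  # -t_{.05,11}, about -1.7959
t < crit                 # TRUE: reject H0 at the 5% level
```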
While the t.test() example above does not explicitly make a decision, the p-value of the test statistic gives us enough information to determine what the decision should be. Namely, if the p-value is less than α, we reject the null hypothesis; otherwise we do not. In the Kellogg's example above, we obviously reject the null hypothesis.

We can now understand the test that NIST prescribes in Example 5.8.1. The NIST manual says that "this method gives acceptable lots a 97.5% chance of passing." In other words, NIST is prescribing that α = 0.025. For such an α, our test should be to reject H0 if

    (x̄ − 11) / (s/√12) < −t.025,11

or, equivalently, if

    (x̄ − 11)/s < −t.025,11/√12 = −0.635 ,

which is exactly the requirement of the NIST test. Of course, this NIST method implicitly relies on the assumption that the distribution of the lot is normal. We really should be cautious about using the t-distribution for a non-normal population with a sample size of 12, although the t-test is robust.

Type II Errors

The four-step procedure above focuses on α, the probability of a Type I error. Usually, the consequences of a Type I error are much more severe than those of a Type II error, and it is for this reason that we set α to be a small number. But if our procedures were only about minimizing Type I errors, we would never reject H0, since this would make the probability of a Type I error 0! Of course, the probability of a Type II error depends on the distribution of

    T = (X̄ − 11) / (S/√12)

if µ ≠ 11. This distribution depends on the true mean µ, the standard deviation σ (neither of which we know), and the sample size. R will compute this probability for us if we specify these values. The probability of a Type II error is denoted by β, and the number 1 − β is called the power of the hypothesis test. (Higher power is better.)
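As a sketch, power can also be estimated by simulation rather than looked up: draw many samples from the alternative, run the test on each, and count rejections. Here we take a true mean of 10.9 and σ = 0.1 (one standard deviation underweight), with n = 12 and α = 0.025.

```r
# A sketch: simulate the power of the one-sided t-test of H0: mu = 11
# when the truth is mu = 10.9, sigma = 0.1, n = 12, alpha = 0.025
set.seed(1)
reject <- replicate(10000, {
  x <- rnorm(12, mean = 10.9, sd = 0.1)
  t <- (mean(x) - 11)/(sd(x)/sqrt(12))
  t < qt(0.025, 11)      # the decision rule
})
mean(reject)             # close to the 88.3% power reported by power.t.test()
```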
The R function power.t.test computes the power given the following arguments:

    delta        the deviation of the true mean from the null hypothesis mean
    sd           the true standard deviation
    n            the sample size
    sig.level    α
    type         this t-test is called a one.sample test
    alternative  we tested a one.sided alternative

In the Raisin Bran example, if the true value of the mean is 10.9 and the standard deviation is 0.1, then the power of the test is 88.3%. In other words, using this test we will reject a shipment that on average is one standard deviation underweight 88.3% of the time.

> power.t.test(delta=.1,sd=.1,n=12,sig.level=.025,type='one.sample',
+ alternative='one.sided')

     One-sample t test power calculation

              n = 12
          delta = 0.1
             sd = 0.1
      sig.level = 0.025
          power = 0.8828915
    alternative = one.sided

Obviously, the test that we use should have greater power if the true mean is further from 11.

> diff=seq(0,.1,.01)
> power.t.test(delta=diff,sd=.1,n=12,sig.level=.025,type='one.sample',
+ alternative='one.sided')

     One-sample t test power calculation

              n = 12
          delta = 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10
             sd = 0.1
      sig.level = 0.025
          power = 0.02500000, 0.05024502, 0.09249152, 0.15643493, 0.24401839,
                  0.35263574, 0.47466264, 0.59891866, 0.71365697, 0.80978484,
                  0.88289152
    alternative = one.sided

Many users of hypothesis testing technology do not think very carefully about Type II errors before doing the experiment and so often construct tests that are not very powerful. For example, if we think that it is important to reject shipments that average more than half a standard deviation underweight, we find that the sample size of 12 given above has power of only 35%. We really should increase the sample size in this case (and we know this even before we collect data).

5.9. Exercises

5.1 In this problem and the next, we investigate the use of the sample mean to estimate the mean population of a U.S. county. We use the dataset at http://www.calvin.
edu/~stob/data/uscounties.csv.

a) What is the average population of a U.S. county? (Answer: 89596.)

b) Generate 10,000 samples of size 5 and compute the mean of the population of each sample. In how many of these 10,000 samples was the sample mean greater than the population mean? Why so many?

c) Repeat part (b) but this time use samples of size 30. Compare the result to that of part (b).

d) For the 10,000 samples of size 30 in part (c), what is the IQR of the sample means?

e) Explain why using the sample mean for a sample of size 30 is likely to give a fairly poor estimate of the average population of a county.

5.2 Recall from Chapter 1 that reexpressing the population of counties by taking logarithms produced a symmetric unimodal distribution. (See Figure 1.3.) Let's now repeat the work of the last problem using this transformed data.

a) What is the mean of the log of population for all counties? (Answer: 10.22)

b) Generate 10,000 samples of size 5 and compute the mean of the log-population for each of the samples. In how many of these samples was the sample mean greater than the population mean?

c) Repeat part (b) but this time use samples of size 30.

d) For the 10,000 samples in part (c), what is the IQR of the sample means?

e) How useful is a sample of size 30 for estimating the mean log-population?

5.3 Suppose that X ∼ Binom(n, π) and that Y ∼ Binom(m, π). Also suppose that X and Y are independent.

a) Give a convincing reason why Z = X + Y should have a binomial distribution (with parameters n + m and π).

b) Show that the mean and variance of Z as computed by Theorem 5.2.1 from those of X and Y are the same as computed directly from the fact that Z is binomial with parameters n + m and π.

5.4 In this problem, you are to investigate the accuracy of the approximation of the Central Limit Theorem for the exponential distribution.
Suppose that X ∼ Exp(0.1) and that a random sample of size 20 is chosen from this population.

a) What are the mean and variance of X?

b) What are the mean and variance of X̄?

c) Using the Central Limit Theorem approximation, compute the probability that X̄ is within 1, 2, 3, 4, and 5 of µX.

d) Now choose 1,000 random samples of size 20 from this distribution. Count the number of samples in which x̄ is within 1, 2, 3, 4, and 5 of µX and compare to part (c). Comment.

5.5 Scores on the SAT test were redefined (recentered) in 1990 and were set to have mean 500 and standard deviation 110 on each of the Mathematics and Verbal Tests. The scores were constructed so that the population had a normal distribution (or at least very close to normal). In a random sample of size 100 from this population,

a) What is the probability that the sample mean will be between 490 and 510?

b) What is the probability that the sample mean will exceed 500? 510? 520?

5.6 Continuing Problem 5.5, the total SAT score for each student is formed by adding their verbal score V and their math score M.

a) If the two scores for an individual are independent of each other, what are the mean and standard deviation of V + M?

b) It is not likely that the verbal and mathematics scores of individuals in the population behave like independent random variables. Do you expect that the standard deviation of V + M is more or less than you computed in part (a)? Why?

5.7 Which is wider, a 90% confidence interval or a 95% confidence interval generated from the same random sample from a normal population?

5.8 Suppose that the standard deviation σ of a normal population is known. How large a random sample must be chosen so that a 95% confidence interval will be of the form x̄ ± .1σ?

5.9 The dataset found at http://www.calvin.edu/~stob/data/normaltemp.csv contains the body temperature and heart rate of 130 adults. ("What's Normal?
– Temperature, Gender, and Heart Rate" in the Journal of Statistics Education (Shoemaker 1996).)

a) Assuming that the body temperatures of adults in the population are approximately normal and that the 130 adults sampled behave like a simple random sample, write a 95% confidence interval for the mean body temperature of an adult.

b) Comment on the result in (a).

c) Is there anything in the data that would lead you to believe that the normality assumption is incorrect?

5.10 The R dataset morley contains the speed of light measurements for 100 different experimental runs. The vector Speed contains the measurements (in some obscure units).

a) If we think of these 100 measurements as repeated independent trials of a random variable X, what is a good description of the population of which these measurements are a sample?

b) Write a 95% confidence interval for the mean of this population.

c) What is the value tβ,n−1 for the confidence interval generated in the previous part?

d) Is there anything in the histogram of the data values that suggests that the t procedure might not be a good one for generating a confidence interval in this case?

5.11 Write 95% confidence intervals for the mean of the sepal length of each of the three species of irises in the R dataset iris. Would you say that these confidence intervals give strong evidence that the means of the sepal lengths of these species are different?

5.12 The dataset http://www.calvin.edu/~stob/data/uselessdata.csv contains data collected about each student in our class. Our class is not a random sample of Calvin students, but suppose that we consider it so.

a) Write a 90% confidence interval for the mean number of hours of sleep that a Calvin student got the night before the first day of classes. (The variable named Sleep records that for the sample.)
b) From the data, is there anything about the data on hours slept that concerns you in using the t-distribution to generate the confidence interval in (a)?

c) Write a 90% confidence interval for the average amount of cash that students carried on that first day of class.

d) Is there anything in the data that concerns you about using the t-distribution to generate the interval in (c)?

5.13 Suppose that 4 circuit boards out of 100 tested are defective. Generate 95% confidence intervals for the proportion of the population of boards that is defective. Give each of the Wald, Wilson and Agresti-Coull intervals.

5.14 The Chicago Cubs (a major league baseball team) won 11 games and lost 5 games in their season series against the St. Louis Cardinals last year. Write a 90% confidence interval for the proportion of the games that the Cubs would win if they played many games against the Cardinals. Comment on the assumptions you are making about the process of playing baseball games.

5.15 In a taste test, 30 Calvin students prefer Andrea's Pizza and 19 prefer Papa John's. If the sample of students could reasonably be considered a random sample of Calvin students, write a 95% confidence interval for the proportion of students who prefer Andrea's Pizza.

5.16 It is common to use a sample size of 1,000 when doing a political poll. It is also common to use the Wald interval to report the results of such polls. What is the widest that a 95% confidence interval for a proportion could be with this sample size?

6. Producing Data – Experiments

In many datasets we have more than one variable and we wish to describe and explain the relationships between them. Often, we would like to establish a cause-and-effect relationship.

6.1. Observational Studies

The American Music Conference is an organization that promotes music education at all levels.
On their website http://www.amc-music.com/research_briefs.htm they promote music education as having all sorts of benefits. For example, they quote a study performed at the University of Sarasota in which "middle school and high school students who participated in instrumental music scored significantly higher than their non-band peers in standardized tests". Does this mean that if the availability of and participation in instrumental programs in a school were increased, standardized test scores would generally increase? The American Music Conference is at least suggesting that this is true. They are attempting to "explain" the variation in test scores by the variation in music participation.

The problem with that conclusion is that there might be other factors that cause the higher test scores of the band students. For example, students who play in bands are more likely to come from schools with more financial resources. They are also more likely to be in families that are actively involved in their education. It might be that music participation and higher test scores are both a result of these variables. Such variables are often called lurking variables. A lurking variable is any variable that is not measured or accounted for but that has a significant effect on the relationship of the variables in the study.

The Sarasota study described above is an observational study. In such a study, the researcher simply observes the values of the relevant variables on the individuals studied. But as we saw above, an observational study can never definitively establish a causal relationship between two variables. This problem typically bedevils the analysis of data concerning health and medical treatment. The long process of establishing the relationship between smoking and lung cancer is a classic example. In 1957, the Joint Report of the Study Group on Smoking and Health concluded (in Science, vol.
125, pages 1129–1133) that smoking is an important health hazard because it causes an increased risk for lung cancer. However for many years after that the tobacco industry denied this claim. One of their principal arguments was that the data indicating this relationship came from observational studies. (Indeed, the data in the Joint Report came from 16 independent observational studies.) For example, the report documented that one out of every ten males who smoked at least two packs a day died of lung cancer, but only one out of every 275 males who did not smoke died of lung cancer. Data such as this falls short of establishing a cause-and-effect relationship however, as there might be other variables that increase both one’s disposition to smoke and susceptibility to lung cancer.

Observational studies are useful for identifying possible relationships and also simply for describing relationships that exist. But they can never establish that there is a causal relationship between variables. Using observational studies in this way is analogous to using convenience samples to make inferences about a population.

There are some observational studies that are better than others however. The music study described above is a retrospective study. That is, the researchers identified the subjects and then recorded information about past music behavior and grades. A prospective study is one in which the researcher identifies the subjects and then records variables over a period of time. A prospective study usually has a greater chance of identifying relevant possible “lurking” variables so as to rule them out as explanations for a possible relationship.

One of the most ambitious and scientifically important prospective observational studies has been the Framingham Heart Study. In 1948, researchers identified a sample of 5,209 adults in the town of Framingham, Massachusetts (a town about 25 miles west of Boston).
The researchers tracked the lifestyle choices and medical records of these individuals for the rest of their lives. In fact the study continues to this day with the 1,110 individuals who are still living. The researchers have also added to the study 5,100 children of original study participants. There is no question that the Framingham Heart Study has led to a much greater understanding of what causes heart disease although it is “only” an observational study. For example, it is this study that gave researchers the first convincing data that smoking can cause high blood pressure. The website of the study http://www.nhlbi.nih.gov/about/framingham/ gives a wealth of information about the study and about cardiovascular health.

6.2. Randomized Comparative Experiments

If an observational study falls short of establishing a causal relationship and even an expensive well-designed prospective observational study cannot identify all possible lurking variables, can we ever prove such a relationship? The “gold standard” for establishing a cause-and-effect relationship between two variables is the randomized comparative experiment.

In an experiment, we want to study the relationship between two or more variables. At least one variable is an explanatory variable and the value of the variable can be controlled or manipulated. At least one variable is a response variable. The experimenter has access to a certain set of experimental units (subjects, individuals, cases), sets various values of the explanatory variables to create a treatment, and records the values of the response variables.

It is important first of all that an experiment be comparative. If we are attempting to establish that music participation increases grades, we cannot simply look at participators. We need to compare the achievement level of participators to those who do not participate. Many educational studies fall short of this standard.
A school might introduce a new curriculum in mathematics and measure the test scores of the students at the end of the year. However the school cannot make the case that the test scores are a result of the new curriculum — the students might have achieved the same level with any curriculum.

In a randomized experiment we assign the individuals to the various treatments at random. For example, if we took 100 fifth graders and randomly chose 50 of them to be in the band and 50 of them not to receive any music instruction, we could begin to believe that differences in their test scores could be explained by the different treatments.

Example 6.2.1. Patients undergoing certain kinds of eye surgery are likely to experience serious post-operative pain. Researchers were interested in the question of whether giving acetaminophen to the patients before they experienced any pain would substantially reduce the subsequent pain and the further need for analgesics. One group received acetaminophen before the surgery but no pain medicine after the surgery. A second group received no pain medicine before the surgery and acetaminophen after the surgery. And the third group received no acetaminophen either before or after the surgery. Sixty subjects were used and 20 subjects were assigned at random to each group. (Soltani, Hashemi, and Babaei, Journal of Research in Medical Sciences, March and April 2007; vol. 12, No. 2.)

In Example 6.2.1, the goal of random assignment is to construct groups that are likely to be representative of the whole pool of subjects. If the assignment were left to the surgeons, for example, it might be the case that surgeons would give more pain medication to certain types of patients and therefore we wouldn’t be able to attribute the different results to the different treatments.

Example 6.2.2. The R dataset chickwts gives the weights of chicks who were fed six different diets over a period of time.
The experimenter was attempting to determine which chicken feed caused the greatest weight gain. Feed is the explanatory variable and there were six treatments (six different feeds). Weight is the response variable. The first step in designing such an experiment is to assign baby chicks at random to the six different feed groups. If we allow the experimenter to choose which chicks receive which feed, she might unconsciously (or consciously) construct treatment groups that are unequal to start.

Student (W.S. Gosset) was one of the researchers in the early part of the twentieth century who realized the importance of randomization. One of his influential papers analyzed a large-scale study that was to compare the nutritional effects of pasteurized and unpasteurized milk. In the Spring of 1930, 20,000 school children participated in the study. Of these, 5,000 received pasteurized milk each day, 5,000 received unpasteurized milk, and 10,000 did not receive milk at all. The weight and height of each student were recorded both before and after the trial.

Student analyzed the way in which students were assigned to the three experimental treatments. There were 67 schools involved and in each school about half the students were in the control group and half received milk. However each school received only one kind of milk, pasteurized or unpasteurized. This was the first sort of bias that Student found — he was not convinced that the schools that received pasteurized milk were comparable to those that received unpasteurized milk. A more important difficulty was the way in which students were assigned either to the control or milk group within a school. The students were assigned at random initially, but teachers were given freedom to adjust the assignments if it seemed to them that the two groups were not comparable to each other in weight and height.
In fact Student showed that this freedom on the part of teachers to assign subjects to groups resulted in a systematic difference between the groups in initial weight and height. The control groups were taller and heavier on average than those in the milk groups. Student conjectured that teachers unconsciously favored giving milk to the more undernourished students.

Of course assigning subjects to treatments at random does not ensure that the experimental groups are alike in all relevant ways. Just as we were subjected to sampling error when choosing a random sample from a population, we can have variation in the groups due to the chance mechanism alone. But assigning subjects at random will allow us to make probabilistic statements about the likelihood of such error just as we were able to make confidence intervals for parameters based on our analysis of sampling error that might arise in random sampling.

Randomized assignment and random samples

We assign subjects to treatments at random so that the various treatment groups will be similar with respect to the variables that we do not control. That is, we would like the experimental groups to be representative of the whole group of subjects. In surveys (Chapter 2), we choose a random sample from a population for a similar reason. We hope that the random sample is representative of a larger population.

Ideally, we would like both kinds of randomness in our experiments. Not only do we ensure that the subjects are assigned at random to treatments, but we would like the subjects to be chosen at random from a larger population. If this is true, we could more easily justify generalizing our experimental results to a larger population than the immediate subject pool. However that is almost never the case. In the pain study of Example 6.2.1, the subjects were simply all those persons who were operated on at a given clinic in a given period of time.
This issue is particularly important if we try to generalize the conclusions of an experiment to a larger population.

Example 6.2.3. The author of this text participated in a study to investigate how people make probabilistic judgments in situations for which they do not have much data. (Default Probabilities, Osherson, Smith, Stob, and Wilkie, Cognitive Science, (15), 1991, 251–270.) Subjects were placed in various experimental groups at random. However the subjects were not chosen at random from any particular population. Indeed every subject was an undergraduate in an introductory psychology course at the University of Michigan or Massachusetts Institute of Technology. It is difficult to make an argument that the results of the paper would generalize to the population of all undergraduates in the United States let alone to the population of all adults. The MIT students in particular seemed to have a different set of strategies for dealing with probabilistic arguments.

Other features of a good experiment

In our analysis of simple random sampling from a population, we saw again and again the importance of large samples in getting precise estimates of our parameters. Analogously, if we are to measure precisely the effect of a treatment, we would like many individuals in each treatment group. This principle is known as replication. With a small number of individuals, it might be difficult to determine whether the differences in response are due only to the treatments or whether they reflect the natural variation in individuals. The chickwts data illustrate the issue. Figure 6.1 plots the weights of the six different treatment groups of chicks. (Figure 6.1.: Weights of six different treatment groups of a total of 71 chicks.) While there is definitely some variation
between the groups, there is also considerable variation within each group. Chicks fed meatmeal, for example, have weights spanning most of the range of the entire experimental group. It is probably the case that the small difference between the linseed and soybean groups is due to the particular chicks in the groups rather than due to the feed. More chickens in each group would help us resolve this issue however.

In most good experiments one of the treatments is a control. A control generally means a treatment that is a baseline or status quo treatment. In an educational experiment, the control group might receive the standard curriculum while another group receives the supposedly improved curriculum. In a medical experiment, the control group might receive the generally accepted treatment (or no treatment at all if ethical) while another group receives a new drug. In Example 6.2.1, the group that received no pre-pain medication is referred to as the control group. The goal of a control group is to establish a baseline to which to compare the new or changed treatment.

Often the control is a placebo. A placebo is a “treatment” that is really no treatment at all but looks like a treatment from the point of view of the subject. In Example 6.2.1, all subjects received pills both before and after surgery. But some of these pills contained no acetaminophen and were inert. Placebos are given so that the placebo effect is present in every group and cannot masquerade as a treatment effect. The placebo effect is the tendency for experimental subjects to be affected by the treatment even if it has no content. The need for control groups and placebos is highlighted by the following famous example.

Example 6.2.4. During the period 1927–1932, researchers conducted a large-scale study of industrial efficiency at the Hawthorne Plant of the Western Electric Company in Cicero, IL.
The researchers were interested in how physical and environmental features (e.g., lighting) affected worker productivity and satisfaction. Researchers found that no matter what the experimental conditions were, productivity tended to improve. Workers participating in the experiment tended to work harder and better to satisfy those persons who were experimenting on them. This feature of human experimentation — that the experimentation itself changes behavior whatever the treatment — is now called the Hawthorne Effect. (It is now generally accepted that the extent of the Hawthorne Effect in the original experiments has been significantly overstated by the gazillions of undergraduate psychology textbooks that refer to it. But the name remains and it makes a nice story as well as a plausible cautionary tale!)

Another feature which helps to ensure that the differences in treatments are due to the treatments themselves is blinding. An experiment is blind if the subjects do not know which treatment group they are in. In Example 6.2.1, no subject knew whether they were receiving acetaminophen or a placebo. It is plausible that a subject knowing they receive a placebo would have a different (subjective) estimate of pain than one who thought that they might be receiving acetaminophen. An experiment is double-blind if the person administering the treatment also does not know which treatment is being administered. This prevents the researcher from treating the groups differently. It is not always possible or ethical to make an experiment blind or double-blind. But when possible, blinding helps to ensure that the differences between treatments are due to the treatments which is always the goal in experimentation.

(Figure 6.2.: Two experimental designs for three fertilizers.)

6.3. Blocking

If the experimental subjects are identical, it does not matter which is assigned to which treatment.
The differences in the response variable are likely to be the result of the differences in treatment. The subjects are not usually identical however, or at least cannot be treated identically. So we would like to know that the differences in the response variable are due to the differences in the explanatory variable and not any systematic differences in subjects. Randomization is one tool that we use to distribute such differences equally across the treatments. In some cases however, our experimental units are not identical or our experiment itself introduces a systematic difference in the units that is due to something other than the treatment variable. This leads to the notion of blocking which we illustrate with a classic example.

R.A. Fisher was one of the key early figures in developing the principles of good experimental design. He did much of this while working at Rothamsted Experimental Station on agricultural experiments. He studied closely data from experiments that were attempting to establish such things as the effects of fertilizer on yield. Suppose that we have three unimaginatively named fertilizers A, B, C. We could divide the plot of land that we are using as in the first diagram of Figure 6.2. But it might be the case that the further north in the plot, the better the soil conditions. In that case, the variation in yield might be better explained (or at least partially explained) by the location of the plot rather than by fertilizer. In this example, we would say that the effects of northernness and fertilizer are confounded, meaning simply that we cannot separate them given the data of the experiment at hand.

To separate out the effect of northernness from that of fertilizer, we could instead divide the patch using the second diagram in Figure 6.2. Of course there still might be variations in the soil conditions across the three fertilizers. But we would at least be able to measure the effect of northernness separately from that of fertilizer.
In this example, “northernness” is a blocking variable and our goal is to isolate the variability attributable to northernness so that we can see the differences between the fertilizers more clearly. In a medical experiment it is often the case that gender or age are used as blocking variables. Obviously, we cannot assign individuals to the various levels of these variables at random but it is plausible that in certain circumstances gender or age can have a significant effect on the response. If so, it would be useful to design an experiment that allows us to separate out the effects of, say, gender and the treatment.

When using a blocking variable, it is important to continue to honor the principle of randomization. Suppose for example that we use gender as a blocking variable in a medical experiment comparing two treatments. The ideal experimental design would be to take a group of females and assign them at random to the two treatments and similarly for the group of males. That is, we should randomize the treatments within the blocks. The resulting experiment is usually called a randomized block design. It is not completely randomized because subjects in one block cannot be assigned to another but within a block it is randomized.

It is instructive to compare the randomized block design to stratified random sampling. In each case, we divide subjects into groups and randomize within these groups. The goal is to isolate and measure the variability that is due to the groups so that we can measure the variability that remains.

A special case of blocking is known as a matched pair design. In such an experiment, there are just two observations in each block (one for each of two treatments). In his 1908 paper, Student analyzed earlier published data from such an experiment. That data is in the R dataframe sleep. The two different treatments were two different soporifics (sleeping drugs).
There was no control treatment. The response variable was the number of extra hours of sleep gained by the subject over his “normal” sleep. There were just 10 subjects and each subject took both drugs (on different nights). Thus each subject was a block and there was one observation on each treatment in each block. Student then compared the difference in the two drugs on each patient. Using the individuals as blocks helped Student to decide what part of the variation in the response could be explained by the normal variation between individuals and what could be attributed to the drugs themselves.

In educational experiments, matched pairs are often constructed by finding two students who are very similar in baseline academic performance. Then it is hoped that the differences between these students at the end of the experiment are the result of the different treatments.

It is important to remember that block designs are not an alternative to randomization. Indeed, it is very important that we randomize the assignment to treatments within every block for the same reasons that randomization is important when we have no blocking variable. Identifying blocking variables is simply acknowledging that there are variables on which the treatments may systematically differ.

6.4. Experimental Design

In the above sections, we have introduced the three key features of a good experimental design — randomization, replication, and blocking. We’ve illustrated these principles in the case that we have just one explanatory variable with just a few levels. These principles can be extended to situations with more than one explanatory variable however. In this book, we will not investigate the problem of inference for such situations or discuss in detail the issues of experimental design in these cases. In this section, we look at one example of extending these principles to experiments involving more than one explanatory variable.

Example 6.4.1.
The R dataframe ToothGrowth contains the results of an experiment performed on guinea pigs to determine the effect of Vitamin C on tooth growth. There were two treatment variables, the dose of Vitamin C, and the delivery method of the Vitamin C. The dose variable had three levels (.5, 1, and 2 mg) and the delivery method was by either orange juice or ascorbic acid. There were 10 guinea pigs given each of the six treatments. The plot below (using coplot()) shows the differences between the two delivery methods and the various dose levels.

(ToothGrowth data: length vs dose, given type of supplement)

It appears that both the delivery method and the dose have some effect on tooth growth. Both the principles of randomization and replication extend to experiments with more than one explanatory variable. In Example 6.4.1 for example, it is apparent that the 60 guinea pigs should have been assigned at random to the six different treatments. And it also is clear that there should have been enough guinea pigs in each treatment so that the natural variation from pig to pig can be accounted for.

No blocking variables are described in the tooth growth study but it is often the case that natural blocking variables can be identified. For example, in the tooth growth study, it might not have been possible for the same technician to have recorded all the measurements. In that case, it would not be a good idea for one technician to make all the measurements for the orange juice treatment while another technician makes all the measurements for the ascorbic acid treatment. The blocking variable would be the technician and we would attempt to randomize assignment within each treatment.
Since there were 10 guinea pigs in each of the 6 treatments, two technicians could each measure 5 guinea pigs in each treatment.

7. Inference – Two Variables

Is there a relationship between two variables? If there is a relationship, is it causal?

7.1. Two Categorical Variables

7.1.1. The Data

Suppose that the data consist of a number of observations on which we observe two categorical variables. We normally present such data in a (two-way) contingency table.

Example 7.1.1. In 1973, the rate of acceptance to graduate school at the University of California at Berkeley was lower for females than males. (See the R dataset UCBAdmissions.) Here 4,526 individuals are classified according to these two variables in the following contingency table.

> xtabs(Freq~Gender+Admit,data=UCBAdmissions)
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278

We introduce some notation to aid in our discussion and analysis of such situations.

I         the number of rows
J         the number of columns
nij       the integer entry in the ith row and jth column
ni.       the sum of the entries in the ith row
n.j       the sum of the entries in the jth column
n = n..   the total of all entries in the table

We’ll also usually call the row variable R and the column variable C. Dots in subscripts are often used in statistics to denote the operation of summing over the possible values of that subscript. Hence ni. sums over the possible values of the second subscript. This notation can be extended to more dimensions with k, l etc. denoting the generic subscripts in the next places.

Our research question and the nature of the two categorical variables determine how we collect and analyze the data. There are three different data collection schemes that we distinguish among.

1. I independent populations. On this model, R, the categorical variable that determines the rows, defines I many populations.
The data are collected by choosing a simple random sample of each population and categorizing each sample according to the column categorical variable. An example of such a data collection exercise might be to choose a random sample of students of each class level and ask each subject a YES-NO question. On this model of sampling, we need to be able to identify each of the populations in advance.

2. One population, two factors. On this model, we choose n individuals at random from one population and classify the individuals according to the two different categorical variables.

3. I experimental treatments. On this model, the I rows are the I different treatments to which we might assign a number of individuals. We assign ni. individuals to each treatment (we hope by randomization) and then observe the value of the column categorical variable in each individual.

Sometimes it is difficult to see immediately which of the three data collection schemes is the best description of our data and sometimes it is clear that the data did not arise in any one of these ways. For example, in most observational studies, randomness does not play a role. It is often the case that such studies correspond to the description of the second data collection scheme above but without the random sampling. How we make inferences from such data from observational studies (and whether we can make any inferences at all) is usually a difficult question. Of course the data collection scheme should match the research question and we would like to phrase our research questions as questions about parameters.

7.1.2. I independent populations

Suppose that random samples are chosen from each of the I independent populations determined by the rows. This situation is really that of stratified random sampling with the rows determining the strata. In this case, the variable C divides each population into J many groups.
A natural question to ask is whether the proportion of individuals in a particular group is the same across populations.

Example 7.1.2. In [AM], Chase and Dummer report on a survey of 478 children in Ingham and Clinton Counties in Michigan. (The data are available at the Data and Story Library and at http://www.calvin.edu/~stob/data/popularkids.csv.) The children were chosen from grades 4, 5, and 6. Among the questions asked was which goal was most important to them: making good grades, being popular, or being good in sports. The results are

> pk=read.csv('http://www.calvin.edu/~stob/data/popularkids.csv')
> names(pk)
 [1] "Gender"      "Grade"       "Age"         "Race"        "Urban.Rural"
 [6] "School"      "Goals"       "Grades"      "Sports"      "Looks"
[11] "Money"
> xtabs(~Grade+Goals,data=pk)
     Goals
Grade Grades Popular Sports
    4     63      31     25
    5     88      55     33
    6     96      55     32

Here the three populations are students in the three grades and the research question is whether students at the three grade levels are the same in their choice of their most important goal. We define parameters as follows:

πi,j = proportion of population i at level j of the second variable.

Note that with πi,j defined in this way, πi,1 + πi,2 + · · · + πi,J = 1 for every i. A natural first hypothesis to test is

H0: for every j, π1,j = π2,j = · · · = πI,j.

If H0 is true, we say that the populations are homogeneous (with respect to variable C). In order to test this hypothesis, it is necessary to construct a test statistic T such that two things are true:

1. We know the distribution of T when H0 is true, and
2. The values of T tend to be small if H0 is true and large if H0 is false (or the other way around).

It is easy to construct test statistics that have the second of these two properties. However, since the distribution of such a statistic is discrete, it is usually computationally impossible to determine the distribution of the statistic we construct even under the assumption that the null hypothesis is true.
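Although exact computation is out of reach, software can approximate the null distribution by simulation. As a sketch (this uses the chisq.test() function, which is introduced properly later in this section, with the counts transcribed from the table in Example 7.1.2), R can generate many random tables with the same row and column totals and see how often they produce a statistic at least as extreme as the observed one:

```r
# Monte Carlo approximation of a p-value for the popular-kids table
goals <- matrix(c(63, 31, 25,
                  88, 55, 33,
                  96, 55, 32),
                nrow = 3, byrow = TRUE,
                dimnames = list(Grade = c("4", "5", "6"),
                                Goals = c("Grades", "Popular", "Sports")))
# simulate.p.value = TRUE estimates the p-value from B random tables
# with the same margins, instead of using a distributional approximation
chisq.test(goals, simulate.p.value = TRUE, B = 10000)
```

The simulated p-value varies a little from run to run but will be close to the value produced by the classical approximation that the text develops next.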
The classical test in this situation is to use a test statistic for which we have a good approximation to its distribution. The statistic is called the chi-square (χ²) statistic and its lineage is really the same as that of the normal approximation to the binomial distribution.

To form the chi-square statistic, we investigate what we expect would happen if the null hypothesis were true. In this case, for every j, we have π1,j = π2,j = · · · = πI,j. We let π.j denote the common value. (Here we use the dot in a slightly different but analogous manner.) How would we estimate π.j, the probability of an individual falling in the jth column? Since there are n.j individuals in this column, a natural estimate would be π̂.j = n.j/n. With this estimate of π.j, we can estimate the number of individuals that should fall in each cell. Since there are ni. individuals in row i, we should estimate that there are ni. π̂.j = ni. n.j/n individuals in the i, jth cell. This quantity is important: we give it a name and notation.

Definition 7.1.3 (Expected Count). Under the null hypothesis H0, the expected count in cell i, j is

n̂i,j = ni. n.j / n.

We now introduce the statistic that we use to test this hypothesis. (We use X² rather than χ² so that the statistic is an upper-case Roman letter!)

X² = Σ (observed − expected)²/expected = Σi Σj (nij − n̂ij)²/n̂ij.

It is not hard to see that this statistic is always nonnegative and tends to be larger if the null hypothesis is false and smaller if it is true. However the distribution of this statistic cannot be computed exactly for all but the smallest n. We digress and introduce a new and important distribution.

Definition 7.1.4 (chi-square distribution). The chi-square distribution is a one-parameter family of distributions with parameter a natural number ν and pdf

f(x; ν) = (1 / (2^(ν/2) Γ(ν/2))) x^(ν/2−1) e^(−x/2),   x ≥ 0.

The chi-square distribution has mean ν and variance 2ν.
The parameter ν is called the degrees of freedom. The plot of the density function for the chi-square distribution with ν = 4 is in Figure 7.1. The importance of the chi-square distribution stems from the following fact.

Proposition 7.1.5. Suppose that X1, . . . , Xν are independent random variables each of which has a standard normal distribution. Then X1² + · · · + Xν² has a chi-square distribution with ν degrees of freedom.

For our purposes, we have the following fact.

Proposition 7.1.6. If the null hypothesis H0 is true, then the statistic X² has a distribution that is approximately chi-square with (I − 1)(J − 1) degrees of freedom.

(Figure 7.1.: The density of the chi-square distribution with ν = 4.)

We now use the proposition to make a hypothesis test.

Chi-square test of homogeneity of populations. Suppose that the value of X² is c. The p-value of the hypothesis test of H0 is p = P(X² ≥ c) where we assume that X² has a chi-square distribution with ν = (I − 1)(J − 1) degrees of freedom.

Example 7.1.7. Continuing the popular kids example, Example 7.1.2, we compute the chi-square value using R. While R does the computations, we illustrate the computation by considering the first cell. There are 478 subjects total (n.. = 478) of which 119 are in grade 4 (n1. = 119). Of the 478 subjects, 247 have getting good grades as their most important goal. Thus 247/478 = 51.7% of the sampled children have this as their goal. The expected count in the first cell is therefore n̂1,1 = (247/478)119 = 61.49. Since the actual count is 63, this contributes (63 − 61.49)²/61.49 = .037 to the chi-squared value. Continuing over all nine cells, we have a chi-square value of 1.3121 according to R.

> popkidstable=xtabs(~Grade+Goals,data=pk)
> chisq.test(popkidstable)

        Pearson's Chi-squared test

data:  popkidstable
X-squared = 1.3121, df = 4, p-value = 0.8593

The value of X² is 1.31.
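The first-cell computation extends to the whole table. As a sketch (with the counts transcribed from the table in Example 7.1.2), the expected counts and the X² statistic can be computed directly from the definitions:

```r
# Observed counts from the popular-kids table (grades 4-6 by goal)
obs <- matrix(c(63, 31, 25,
                88, 55, 33,
                96, 55, 32),
              nrow = 3, byrow = TRUE)
n <- sum(obs)                               # 478 subjects in all
# Expected count in cell (i,j) is (row i total)(column j total)/n
expected <- outer(rowSums(obs), colSums(obs)) / n
round(expected[1, 1], 2)                    # 61.49, as computed above
X2 <- sum((obs - expected)^2 / expected)    # 1.3121, matching chisq.test
```

The outer() call builds the whole matrix of expected counts at once, so the sum in the definition of X² becomes a single vectorized expression.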
The p-value indicates that if H₀ is true, we would expect to see a value of X² at least as large as 1.31 over 85% of the time. So if H₀ is true, this value of the chi-square statistic is not at all surprising. We have no reason to doubt the null hypothesis that students of these three grades do not differ in their most important goals.

The use of the chi-square distribution is only an approximation. The approximation is better if the populations are large and the individual cell sizes are not too small. The conventional wisdom is not to use this test if any cell has a count of 0 or if more than 20% of the cells have expected count less than 5. R will give a warning message if any cell has expected count less than 5.

7.1.3. One population, two factors

We now look at the case in which the contingency table results from sampling from a single population and classifying the sampled elements according to two different categorical variables. The natural research question is whether the two variables are "independent" of each other. We start with an example.

Example 7.1.8. During the Spring semester of 2007, 280 statistics students were given a survey. Among other things, they were asked their gender and whether they were smokers. The results are tabulated below. (Note that the file was created using a blank field to denote a missing value. An argument to read.csv() addresses that.)

> survey=read.csv('http://www.calvin.edu/~stob/data/survey.csv',na.strings=c('NA',''))
> t=xtabs(~gender+smoker,data=survey)
> t
      smoker
gender Non Smoke
     F 133     5
     M 125    13

Now these 280 students were not a random sample from any particular population. However, we might think that this group could be representative of the population of all students with respect to the relationship of smoking to gender. We note that in this (convenience) sample, a male is more likely to smoke than a female.
Does this difference indicate a true difference between the genders, or is this simply a result of sampling variability?

To formulate the research question as a question about parameters, we define π_{i,j} as the proportion of the population that has the value i for variable R and j for variable C. We also define π_{i.} and π_{.j} to denote the proportion of the population with the relevant value of each individual categorical variable. Then the hypothesis of independence that we wish to test is

    H₀ : for every i, j: π_{i,j} = π_{i.} π_{.j} .

This hypothesis is an independence hypothesis, as it states that the events of an object being classified as i on variable R and j on variable C are independent. Just as in the case of independent populations, it is plausible to estimate π_{.j} by n_{.j}/n. It is also reasonable to estimate π_{i.} by n_{i.}/n. Then, if the null hypothesis is true, we should use π̂_{i,j} = n_{i.} n_{.j}/n² as our estimate of π_{i,j}. Notice that with this estimate of π_{i,j}, we expect that we would have nπ̂_{i,j} = n_{i.} n_{.j}/n individuals in cell (i, j). This is exactly the same expected cell value as in the case of the test for homogeneity. This suggests that exactly the same statistic, X², should be used to test H₀. Indeed, we have

Proposition 7.1.9. If H₀ is true, then the statistic

    X² = Σ (observed − expected)²/expected = Σ_i Σ_j (n_{ij} − n̂_{ij})² / n̂_{ij}

has a distribution that is approximately chi-square with (I − 1)(J − 1) degrees of freedom.

The proposition means that we can use exactly the same R test in this case. It also means that in cases where it is not so clear whether we are testing for homogeneity or independence, it doesn't really matter! In the smoking and gender example, Example 7.1.8, we have

> chisq.test(t)

        Pearson's Chi-squared test with Yates' continuity correction

data:  t
X-squared = 2.9121, df = 1, p-value = 0.08791

A p-value of 0.088 suggests that there is not sufficient evidence to reject the hypothesis that smoking and gender are independent.
7.1.4. I experimental treatments

The third way that a two-way contingency table might arise is in the case that the rows correspond to the different treatments in an experiment. Here we are thinking that the n individuals are assigned at random to the I treatments, with n_{i.} individuals assigned to treatment i. (We hope as well that the n individuals are a random sample from some larger population to which we want to generalize the results of the experiment. This hope will hardly ever be realized.) We want to know whether the experimental treatments have an effect on the column variable C.

Example 7.1.10. In [LP01], a study was done to see if delayed prescribing of antibiotics was as effective as immediate prescribing of antibiotics for treatment of ear infections. 164 children were assigned to the treatment group that received a prescription for antibiotics but was instructed not to take the antibiotics for three days (the "delay" group). 151 children received a prescription for antibiotics to be taken immediately (the "immediate" group). The assignment was by randomization. One of the side effects of antibiotics in children is diarrhea. Of the delay group, 15 children had diarrhea, and of the immediate group, 29 had diarrhea. The question is whether the rate of diarrhea differs for those receiving antibiotics immediately as opposed to those who waited. We do not have the raw data, so we construct the table ourselves using the summary data above.

> m=matrix(c(15,149,29,122),nrow=2,ncol=2,byrow=T)
> m
     [,1] [,2]
[1,]   15  149
[2,]   29  122
> colnames(m)=c('Diarrhea','None')
> rownames(m)=c('Delay','Immediate')
> m
          Diarrhea None
Delay           15  149
Immediate       29  122

Obviously, the rate of diarrhea in the immediate group is bigger, but we would like to know if this difference could be attributable to chance.
The null hypothesis in this case is that there is no difference between the treatments (i.e., the rows) as far as the column variable C is concerned. This is essentially a homogeneity hypothesis, and we will analyze the data in precisely the same manner as the case of I independent populations. In this case, we could think of the treatment levels as defining theoretical populations, namely the population of individuals that might have received each treatment. The "random sample" from the i-th population is then the collection of subjects randomly assigned to treatment i. We write the null hypothesis in terms of parameters π_{ij} just as in the null hypothesis for homogeneity. In this case, π_{ij} denotes the probability that a subject assigned to treatment i will have the value j on the categorical variable C. The null hypothesis is

    H₀ : for every j, π_{1,j} = π_{2,j} = · · · = π_{I,j} ,

and we test this null hypothesis exactly the same way as in the case of homogeneity.

Example 7.1.11. Continuing Example 7.1.10, we have the following test of the hypothesis that there is no difference in the rates of diarrhea for the two treatment conditions.

> chisq.test(m,correct=F)

        Pearson's Chi-squared test

data:  m
X-squared = 6.6193, df = 1, p-value = 0.01009

With a p-value of .01, it appears that the difference in the rate of diarrhea in the two groups is greater than we would expect to see if the null hypothesis were true. We would reject the null hypothesis at the significance level α = 0.05, for example. In the above test, we have chosen not to use something called the "continuity correction" (correct=F). If we use the correction, we find

> chisq.test(m)

        Pearson's Chi-squared test with Yates' continuity correction

data:  m
X-squared = 5.8088, df = 1, p-value = 0.01595

In the correction, which is used only for the two-by-two case, the value 0.5 is subtracted from the absolute value of each of the terms observed − expected.
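Both values of the statistic can be reproduced by hand from the table in Example 7.1.10; a minimal sketch (the variable names here are ours, chosen for illustration):

```r
# X^2 with and without the Yates continuity correction for the
# ear-infection table (Delay/Immediate by Diarrhea/None)
m <- matrix(c(15, 149, 29, 122), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(m), colSums(m)) / sum(m)
X2  <- sum(abs(m - expected)^2 / expected)           # uncorrected: 6.6193
X2c <- sum((abs(m - expected) - 0.5)^2 / expected)   # corrected:   5.8088
c(uncorrected = X2, corrected = X2c)
```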
It turns out that this makes the chi-square approximation somewhat closer.

7.2. Difference of Two Means

This section addresses the problem of determining the relationship between a categorical variable (with two levels) and a quantitative variable. Just as in the case of two categorical variables, data like this can arise from independent samples from two different populations, from a randomized comparative experiment with two treatment groups, or from cross-classifying a random sample from a single population according to the two variables. We look at the two-population case here (and suggest that the two-treatment-group case should be analyzed the same way, as in Section 7.1).

Assumptions for two independent samples:

1. X₁, . . . , X_m is a random sample from a population with mean µ_X and variance σ²_X.
2. Y₁, . . . , Y_n is a random sample from a population with mean µ_Y and variance σ²_Y.
3. The two samples are independent of one another.
4. The samples come from normal distributions.

Of course, the fourth assumption above is an assumption of convenience to make the mathematics work out. In most cases, our populations are not known or are known not to be normal, and we hope that the inference procedures we develop below are reasonably robust.

We first write a confidence interval for the difference in the two means µ_X − µ_Y. Just as did our confidence intervals for one mean µ, our confidence interval will have the form

    (estimate) ± (critical value) · (estimate of standard error) .

The natural choice for an estimator of µ_X − µ_Y is X̄ − Ȳ. To write the other two pieces of the confidence interval, we need to know the distribution of X̄ − Ȳ. The necessary fact is this:

    (X̄ − Ȳ − (µ_X − µ_Y)) / √(σ²_X/m + σ²_Y/n) ∼ Norm(0, 1) .

Analogously to confidence intervals for a single mean, it seems like the right way to proceed is to estimate σ_X by s_X, σ_Y by s_Y, and to investigate the random variable

    (X̄ − Ȳ − (µ_X − µ_Y)) / √(S²_X/m + S²_Y/n) .    (7.1)

The problem with this approach is that the distribution of this quantity is not known even if we assume that the populations are normal (unlike the case of the single mean, where the analogous quantity has a t-distribution). We need to be content with an approximation.

Lemma 7.2.1 (Welch). The quantity in Equation 7.1 has a distribution that is approximately a t-distribution with degrees of freedom ν, where ν is given by

    ν = (S²_X/m + S²_Y/n)² / [ (S²_X/m)²/(m−1) + (S²_Y/n)²/(n−1) ] .    (7.2)

(It isn't at all obvious from the formula, but it is good to know that min(m − 1, n − 1) ≤ ν ≤ n + m − 2.)

We are now in a position to write a confidence interval for µ_X − µ_Y.

An approximate 100(1 − α)% confidence interval for µ_X − µ_Y is

    x̄ − ȳ ± t* √(s²_X/m + s²_Y/n)    (7.3)

where t* is the appropriate critical value t_{α/2,ν} from the t-distribution with ν degrees of freedom given by (7.2). We note that ν is not necessarily an integer, and we leave it to R to compute both the value of ν and the critical value t*.

Example 7.2.2. The barley dataset of the lattice package has the yield in bushels per acre of various experiments done in Minnesota in 1931 and 1932. If we think of the experiments done in 1931 and in 1932 as samples from two populations, we have

> t.test(yield~year,barley)

        Welch Two Sample t-test

data:  yield by year
t = -2.9031, df = 116.214, p-value = 0.004422
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.940071 -1.688820
sample estimates:
mean in group 1932 mean in group 1931
          31.76333           37.07778

There is a significant difference in the mean yield for the two years.

We should remark at this point that older books (and even some newer books that don't reflect current practice) suggest an alternate approach to the problem of writing confidence intervals for µ_X − µ_Y. These books suggest that we assume that the two standard deviations σ_X and σ_Y are equal.
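Equation 7.2 is easy to compute directly. A minimal sketch on made-up samples (the data here are illustrative, not from the text), checking the hand computation against the df that t.test() reports:

```r
# Welch degrees of freedom from Equation 7.2, checked against t.test()
set.seed(1)
x <- rnorm(12, mean = 10, sd = 2)   # made-up sample 1 (m = 12)
y <- rnorm(20, mean = 11, sd = 4)   # made-up sample 2 (n = 20)
m <- length(x); n <- length(y)
vx <- var(x) / m; vy <- var(y) / n
nu <- (vx + vy)^2 / (vx^2 / (m - 1) + vy^2 / (n - 1))
nu                                  # lies between min(m-1, n-1) and m+n-2
unname(t.test(x, y)$parameter)      # t.test() reports the same value
```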
In this case the exact distribution of our quantity is known (it is t with n + m − 2 degrees of freedom). The difficulty with this approach is that there is usually no reason to suppose that σ_X and σ_Y are equal, and if they are not equal, the proposed confidence interval procedure is not as robust as the one we are using. Current best practice is to always prefer the Welch procedure to that of assuming that the two standard deviations are equal.

Robustness

Confidence intervals generated by Equation 7.3 are probably the most common confidence intervals in the statistical literature. But those who generate such intervals are not always sensitive to the hypotheses that are necessary to be confident about the confidence intervals generated. It should first be noted that the confidence intervals constructed are based on the hypothesis that the two populations are normally distributed. It is often apparent from even a cursory examination of the data that this hypothesis is unlikely to be true. However, if the sample sizes are large enough, the intervals generated are fairly robust. (This is related to the Central Limit Theorem and the fact that we are making inferences about means.) There are a number of different rules of thumb as to what "large enough" means, but n, m > 15 for distributions that are relatively symmetric and n, m > 40 for most distributions are common rules of thumb. A second principle is that we can be surer of confidence intervals in which the quotients s²_X/m and s²_Y/n are not too different in size than of those in which they are quite different.

Turning Confidence Intervals into Hypothesis Tests

It is often the case that researchers content themselves with testing hypotheses about µ_X − µ_Y rather than computing a confidence interval for that quantity.
For example, the null hypothesis µ_X − µ_Y = 0 in the context of an experiment is a claim that there is no difference in the two treatments represented by X and Y. This would be the typical null hypothesis in comparing a medical treatment to a control or a placebo. Hypothesis testing of this sort has fallen into disfavor in many circles, since the knowledge that µ_X − µ_Y ≠ 0 is of rather limited interest unless the size of this quantity is known. (After all, nobody should really believe that two populations would have exactly the same mean on any variable.) A confidence interval gives information about the size of the difference. Nevertheless, since the literature is still littered with such hypothesis tests, we give an example here.

Example 7.2.3. Returning to our favorite chicks, we might want to know if we should believe that the effect of a diet of horsebean seed is really different than a diet of linseed. Suppose that x₁, . . . , x_m are the weights of the m chickens fed horsebean seed and y₁, . . . , y_n are the weights of the n chickens fed linseed. The hypothesis that we really want to test is H₀ : µ_X − µ_Y = 0. We note that if the null hypothesis is true, then

    T = (X̄ − Ȳ) / √(S²_X/m + S²_Y/n)

has a distribution that is approximately a t-distribution with the Welch formula giving the degrees of freedom. Thus the obvious strategy is to reject the null hypothesis if the value of T is too large in absolute value. Fortunately, R does all the appropriate computations. Notice that the mean weight of the two groups of chickens differs by 58.5 but that a 95% confidence interval for the true difference in means is (−99.1, −18.0). On this basis we expect to conclude that the linseed diet is superior, i.e., that there is a difference in the mean weights of the two populations. This is verified by the hypothesis test of H₀ : µ_X − µ_Y = 0, which results in a p-value of 0.007.
That is, this great a difference in mean weight would have been quite unlikely to occur if there were no real difference in the mean weights of the populations.

> hb=chickwts$weight[chickwts$feed=="horsebean"]
> ls=chickwts$weight[chickwts$feed=="linseed"]
> t.test(hb,ls)

        Welch Two Sample t-test

data:  hb and ls
t = -3.0172, df = 19.769, p-value = 0.006869
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -99.05970 -18.04030
sample estimates:
mean of x mean of y
   160.20    218.75

Variations

One-sided confidence intervals and one-sided tests are possible, as are intervals of different confidence levels. All that is needed is an adjustment of the critical numbers (for confidence intervals) or p-values (for tests).

Example 7.2.4. A random dot stereogram is shown to two groups of subjects, and the time it takes for the subject to see the image is recorded. Subjects in one group (VV) are told what they are looking for, but subjects in the other group (NV) are not. The quantity of interest is the difference in average times. If µ_X is the theoretical average of the population of the NV group and µ_Y is the average of the VV group, then we might want to test the hypothesis

    H₀ : µ_X − µ_Y = 0
    Hₐ : µ_X > µ_Y

> rds=read.csv('http://www.calvin.edu/~stob/data/randomdot.csv')
> rds
       Time Treatment
1  47.20001        NV
2  21.99998        NV
3  20.39999        NV
......................
77  1.10000        VV
78  1.00000        VV
> t.test(Time~Treatment,data=rds,conf.level=.9,alternative="greater")

        Welch Two Sample t-test

data:  Time by Treatment
t = 2.0384, df = 70.039, p-value = 0.02264
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
 1.099229      Inf
sample estimates:
mean in group NV mean in group VV
        8.560465         5.551429

From this we see that a lower bound on the difference µ_X − µ_Y is 1.10 at the 90% level of confidence.
And we see that the p-value for the result of this hypothesis test is 0.023. We would probably conclude that those getting no information take longer, on average, than those who do.

Just as in the case of the t-test for one mean, it is important to consider the power of the two-sample t-test before conducting an experiment. The R function power.t.test() with argument type='two.sample' does the appropriate computations.

7.3. Exercises

7.1 In Berkson, JASA, 33, pp. 526-536, there is data on the result of an experiment evaluating a treatment designed to prevent the common cold. There were 300 subjects; 143 received the treatment and 157 the placebo. Of the treatment group, 121 eventually got a cold, and of the placebo group, 145 got a cold. Was the treatment effective? Write a contingency table and formulate this problem as a chi-square hypothesis test (as indeed Berkson did).

7.2 The DAAG package has a dataset rareplants classifying various plant species in South Australia and Tasmania. Each species was classified according to whether it was rare or common in each of those two locations (giving the possibilities CC, CR, RC, RR) and whether its habitat was wet, dry, or both (W, D, WD). The dataset contains the summary table, which is also reproduced here.

> rareplants
     D   W  WD
CC  37 190  94
CR  23  59  23
RC  10 141  28
RR  15  58  16

a) What hypothesis exactly is begging to be tested with the aid of this contingency table? (e.g., homogeneity or independence?)
b) Test this hypothesis.

7.3 21 rubber bands were divided into two groups. One group was placed in hot water for 4 minutes while the other was left at room temperature. They were each then stretched by a 1.35 kg weight, and the amount of stretch in mm was recorded. (The dataset comes from the DAAG library, where it is called two65.) You can get the dataset in a dataframe format from http://www.calvin.edu/~stob/data/rubberband.csv.
Write a 95% confidence interval for the difference in average stretch for this kind of rubber band for the two conditions.

7.4 The dataset http://www.calvin.edu/~stob/data/reading.csv contains the results of an experiment done to test the effectiveness of three different methods of reading instruction. We are interested here in comparing the two methods DRTA and Strat. Let's suppose, for the moment, that students were assigned randomly to these two different treatments.

a) Use the scores on the third posttest (POST3) to investigate the difference between these two teaching methods by constructing a 95% confidence interval for the difference in the means of posttest scores.
b) Your confidence interval in part (a) relies on certain assumptions. Do you have any concerns about these assumptions being satisfied in this case?
c) Using your result in (a), can you make a conclusion about which method of reading instruction is better?

7.5 Surveying a choir, you might expect that there would not be a significant height difference between sopranos and altos but that there would be between sopranos and basses. The dataset singer from the lattice package contains the heights of the members of the New York Choral Society together with their singing parts.

a) Decide whether these differences do or do not exist by computing relevant confidence intervals.
b) These singers aren't random samples from any particular population. Explain what your conclusion in (a) might be about.

7.6 The package alr3 has a dataframe ais containing various statistics on 202 elite Australian athletes. (The package must be loaded, and then the dataset must be loaded as well using data(ais).)

a) Is there a difference between the hemoglobin levels of males and females? (Well, of course there is a difference. But is it statistically significant?)
b) What assumptions are you making about the data in (a) to make it a problem in statistical inference?
c) To what populations do you think you could generalize the result of (a)?

8. Regression

In Section 1.6 we introduced the least-squares method for finding a linear function that best describes the relationship between a pair of quantitative variables. In this chapter we enhance that method by grounding it in a statistical model.

8.1. The Linear Model

Suppose that we have n individuals on which we measure two variables. The data then consist of n pairs (x₁, y₁), . . . , (x_n, y_n). We will develop a model for the situation in which we consider the variable x as an explanatory variable and y as the response variable. Our model will assume that for each fixed data value x, the corresponding value y is the result of a random variable, Y. The linearity of the model comes from the fact that we will assume that the expected value of Y is a linear function of x. The model is given by

The standard linear model

The standard linear model is given by the equation

    Y = β₀ + β₁x + ε    (8.1)

where

1. ε is a random variable with mean 0 and variance σ²,
2. β₀, β₁, σ² are (unknown) parameters,
3. and ε has a normal distribution.

We will assume that the data (x₁, y₁), . . . , (x_n, y_n) result from n independent trials governed by the process above. That is, we assume that ε₁, . . . , ε_n is an iid sequence of random variables with mean 0 and variance σ². Then each y_i is the result of a random variable Y_i given by

    Y_i = β₀ + β₁x_i + ε_i .

Notice that in our description of the data collection process, y_i is treated as the result of a random variable but x_i is not. Also note that for any fixed i, the mean of Y_i is β₀ + β₁x_i and the variance of Y_i is σ². There are three unknown parameters (β₀, β₁, and σ²) in the linear model. Usually, the most interesting of these from a scientific point of view is β₁, since it is an expression of the way in which the response variable Y depends on the value of the explanatory variable x.
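A minimal simulation sketch of this data-generating process (all parameter values here are made up for illustration):

```r
# Simulate n observations from the standard linear model Y = b0 + b1*x + eps
set.seed(2)
n  <- 50
b0 <- 2; b1 <- 0.5; sigma <- 1        # hypothetical parameter values
x  <- runif(n, 0, 10)                 # the x values are treated as fixed
eps <- rnorm(n, mean = 0, sd = sigma) # iid errors with mean 0, variance sigma^2
y  <- b0 + b1 * x + eps
# Each Y_i has mean b0 + b1*x_i and variance sigma^2
```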
We would like to estimate these parameters and make inferences about them. It turns out that we have already done much of the work in Section 1.6. The least-squares line is the "right" line to use in estimating β₀ and β₁. We review the construction of that line. Let β̂₀ and β̂₁ denote the estimators of β₀ and β₁ respectively.

A note on notation. It would be nice to use uppercase to denote the estimator and lowercase to denote the estimate. That would mean that we should use b₁ to denote the estimate of β₁ and B₁ to denote the estimator of β₁. However, this is typically not done, and instead β̂₁ is used for both. So β̂₁ might be a number (an estimate) or a random variable (the corresponding estimator) depending on the context. Be careful!

Now define ŷ_i = β̂₀ + β̂₁x_i. Of course ŷ_i is not defined until we specify how to choose β̂₀ and β̂₁. Given β̂₀ and β̂₁, we define

    SSResid = Σ (y_i − ŷ_i)² = Σ (y_i − (β̂₀ + β̂₁x_i))² .

We proceed exactly as in Section 1.6. Namely, we choose β̂₀ and β̂₁ to minimize SSResid. (In fact, in that section we called these two numbers b₀ and b₁.) We have the following expressions for β̂₀ and β̂₁:

    β̂₁ = Σ_{i=1}^{n} (x_i − x̄)y_i / Σ_{i=1}^{n} (x_i − x̄)² ,    β̂₀ = ȳ − β̂₁x̄ .

The corresponding estimators result from these expressions by replacing y_i by Y_i. The desirable properties of these estimators (besides minimizing SSResid) are summarized in the next three results.

Proposition 8.1.1. Assume only that E(ε_i) = 0 for all i in the model given by (8.1). Then β̂₀ and β̂₁ are unbiased estimates of β₀ and β₁ respectively. Therefore, ŷ_i = β̂₀ + β̂₁x_i is an unbiased estimate of β₀ + β₁x_i (which is the expected value of Y_i for the value x = x_i).

Notice that in Proposition 8.1.1 we do not need to assume that the errors have constant variance or that they are independent! This proposition therefore gives us a very good reason to use the least-squares slope and intercept for our estimates.
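The closed-form expressions above can be checked against lm() directly; a minimal sketch on made-up data:

```r
# Least-squares estimates from the closed-form expressions, compared to lm()
set.seed(3)
x <- runif(30, 0, 5)
y <- 1 + 2 * x + rnorm(30)          # made-up data for illustration
b1hat <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)
b0hat <- mean(y) - b1hat * mean(x)
fit <- lm(y ~ x)
c(b0hat, b1hat)                     # same values as coef(fit)
```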
Figure 8.1.: The corrosion data with the least-squares line added.

Example 8.1.2. In Example 1.6.1 we looked at the loss due to corrosion of 13 Cu/Ni alloy bars submerged in the ocean for sixty days. Here the iron content Fe is the explanatory variable, and it is reasonable to treat it as controlled and known by the experimenter (rather than as a random variable). The data plot suggests that the linear model might be a reasonable approximation to the true relationship between iron content and material loss. We reproduce the analysis here. Using R we find that β̂₀ = 129.79 and β̂₁ = −24.02.

> library(faraway)
> data(corrosion)
> corrosion[c(1:3,12:13),]
     Fe  loss
1  0.01 127.6
2  0.48 124.0
3  0.71 110.8
12 1.44  91.4
13 1.96  86.2
> lm(loss~Fe,data=corrosion)

Call:
lm(formula = loss ~ Fe, data = corrosion)

Coefficients:
(Intercept)           Fe
     129.79       -24.02

If we add the assumption of independence of the ε_i and also the assumption of constant variance, we know considerably more about our estimates, as evidenced by the next two propositions. (Recall that S_xx = Σ (x_i − x̄)².)

Proposition 8.1.3. Suppose that Y_i = β₀ + β₁x_i + ε_i, where the random variables ε_i are independent and satisfy E(ε_i) = 0 and Var(ε_i) = σ². Then

1. Var(β̂₁) = σ² / S_xx ,
2. Var(β̂₀) = σ² Σ x_i² / (n S_xx) = σ² (1/n + x̄²/S_xx) .

It is not important to remember the formulas of this proposition. But they are worth examining for what they say about the variance of our estimators. We can decrease the variance of the estimator of slope, for example, by collecting a large amount of data with x values that are widely spread. This seems intuitively correct.

In general we like unbiased estimators with small variance. The next theorem assures us that the least-squares estimators are good estimators in this respect.

Theorem 8.1.4 (Gauss-Markov Theorem).
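The two forms of Var(β̂₀) in Proposition 8.1.3 are algebraically equal (since Σx_i² = S_xx + nx̄²); a quick numerical sanity check on made-up x values:

```r
# Check that sigma^2 * sum(x^2)/(n*Sxx) equals sigma^2 * (1/n + xbar^2/Sxx)
set.seed(4)
x <- runif(25, 0, 10)               # made-up x values
n <- length(x)
Sxx <- sum((x - mean(x))^2)
sigma2 <- 1.7                       # any positive value works
form1 <- sigma2 * sum(x^2) / (n * Sxx)
form2 <- sigma2 * (1 / n + mean(x)^2 / Sxx)
all.equal(form1, form2)
```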
Assume that E(ε_i) = 0, Var(ε_i) = σ², and the random variables ε_i are independent. Then the estimators β̂₀ and β̂₁ are the unbiased estimators of minimum variance among all unbiased estimators that are linear in the random variables Y_i. (We say that these estimators are best linear unbiased estimators, BLUE.)

While there might be non-linear estimators that improve on β̂₀ and β̂₁, the Gauss-Markov Theorem gives us a powerful reason for using these estimators. Notice, however, that the Theorem has hypotheses. Both the homoscedasticity (equal variance) and independence hypotheses are important. Our final proposition of the section gives us additional information if we add the normality assumption.

Theorem 8.1.5. Assume that E(ε_i) = 0, Var(ε_i) = σ², and the random variables ε_i are independent and normally distributed. Then β̂₀ and β̂₁ are normally distributed.

We exploit this theorem in the next section to make inferences about the parameters β₀, β₁.

8.2. Inferences

We first consider the problem of making inferences about β₁. In particular, we would like to construct confidence intervals for β₁ with the aid of our estimate β̂₁. In order to do this, we must clearly make some distributional assumptions about the Y_i. So for this entire section, we will assume all the hypotheses of the standard linear model, namely that E(ε_i) = 0, Var(ε_i) = σ², and the random variables ε_i are normally distributed and independent of one another. From the results of the last section, we then have

    β̂₁ ∼ N(β₁, σ²/S_xx) .

We've been in this situation before. Namely, we have an estimator that has a normal distribution centered at the true value of the parameter but with a standard deviation that depends on an unknown parameter σ. Clearly the way to proceed is to estimate the unknown standard deviation. To do this, we need to estimate σ.

Proposition 8.2.1. Under the assumptions of the linear model,

    MSResid = SSResid / (n − 2)

is an unbiased estimate of σ².
While we will not prove the proposition, let's see that it is plausible. The numerator in this computation is a sum of terms of the form (y_i − ŷ_i)². Since ŷ_i is the best estimate of E(Y_i) that we have, y_i − ŷ_i is a measure of the deviation of y_i from its mean. Thus (y_i − ŷ_i)² functions exactly the same way that (x_i − x̄)² functions in the computation of the sample variance. However, in this case we have a denominator of n − 2 rather than n − 1. This accounts for the fact that we are minimizing SSResid by choosing two parameters. The n − 2 is the key to making this estimator unbiased; a more straightforward choice would have been to use n in the denominator. Since MSResid is an estimate for σ², we will use s to denote √MSResid.

With the estimate s = √MSResid for σ in hand, we can estimate the standard deviation of β̂₁. Since Var(β̂₁) = σ²/S_xx, we will use s/√S_xx to estimate the standard deviation of β̂₁. We can similarly estimate Var(β̂₀). We record these estimates in a definition.

Definition 8.2.2 (standard errors of β̂₀ and β̂₁). The estimates of the standard deviation of the estimators β̂₀ and β̂₁, called the standard errors of the estimates, are given by

1. s_{β̂₁} = s / √S_xx , and
2. s_{β̂₀} = s √(1/n + x̄²/S_xx) .

We illustrate all the estimates computed so far with another example.

Example 8.2.3. A number of paper helicopters were dropped from a balcony, and the time in air was recorded by two different timers. Various dimensions of each helicopter were measured, including L, the "wing" length. A plot shows that there is a positive relationship between L and the time of the second timer (Time.2). To describe the relationship, we suppose that a linear model might be a good description. A plot of the data with a regression line added is in Figure 8.2.

Figure 8.2.: Flight time for helicopters with various wing lengths.
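The estimates s, s_{β̂₁}, and s_{β̂₀} of Definition 8.2.2 can be computed from the formulas and checked against the output of summary(lm()); a minimal sketch on made-up data:

```r
# s, se(b1-hat), and se(b0-hat) from the definitions, checked against lm()
set.seed(5)
x <- runif(20, 0, 5)
y <- 3 + 0.5 * x + rnorm(20)              # made-up data for illustration
fit <- lm(y ~ x)
n <- length(x); Sxx <- sum((x - mean(x))^2)
s <- sqrt(sum(resid(fit)^2) / (n - 2))    # the residual standard error
se_b1 <- s / sqrt(Sxx)
se_b0 <- s * sqrt(1 / n + mean(x)^2 / Sxx)
c(se_b0, se_b1)     # same as the Std. Error column of summary(fit)
```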
> h=read.csv('http://www.calvin.edu/~stob/data/helicopter.csv')
> h[1,]
  Number W L H   B Time.1 Time.2
1      1 3 6 2 1.5   6.89   6.82
> l=lm(Time.2~L,data=h)
> summary(l)

Call:
lm(formula = Time.2 ~ L, data = h)

Residuals:
     Min       1Q   Median       3Q      Max
-1.42875 -0.53381  0.04489  0.49348  1.59941

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.9816     0.7987   3.733  0.00200 **
L             0.5773     0.1753   3.293  0.00493 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8872 on 15 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-Squared: 0.4196, Adjusted R-squared: 0.381
F-statistic: 10.85 on 1 and 15 DF,  p-value: 0.004925

We have the following estimates: s = 0.8872 (labeled residual standard error in the output of lm), β̂₁ = 0.5773, s_{β̂₁} = 0.1753, β̂₀ = 2.9816, and s_{β̂₀} = 0.7987.

To construct confidence intervals for β₁, we need one more piece of information. The following result should not seem surprising given our work on the t-distribution.

Proposition 8.2.4. With all the assumptions of the linear model, the random variable

    T = (β̂₁ − β₁) / S_{β̂₁} = (β̂₁ − β₁) / (S / √S_xx)

has a t-distribution with n − 2 degrees of freedom.

The proposition is another example for us of the use of the t-distribution to generate a confidence interval in the presence of a normality assumption. We generalize this into a principle (which is too imprecise to call a theorem or to prove).

Suppose that θ̂ is an unbiased estimator of a parameter θ and s_{θ̂} is the standard error of θ̂ (that is, an estimate of the standard deviation of θ̂). Suppose also that s_{θ̂} has ν degrees of freedom. Then, in the presence of sufficient normality assumptions, the random variable T = (θ̂ − θ)/s_{θ̂} has a t-distribution with ν degrees of freedom.

We now use Proposition 8.2.4 to write confidence intervals for β₁.

Confidence Intervals for β₁

A 100(1 − α)% confidence interval for β₁ is given by

    β̂₁ ± t_{α/2,n−2} · s_{β̂₁} .
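The interval β̂₁ ± t_{α/2,n−2} · s_{β̂₁} can be computed by hand with qt() and checked against confint(); a minimal sketch on made-up data:

```r
# 95% confidence interval for the slope by hand, checked against confint()
set.seed(6)
x <- runif(20, 0, 5)
y <- 3 + 0.5 * x + rnorm(20)               # made-up data for illustration
fit <- lm(y ~ x)
b1  <- unname(coef(fit)[2])
se1 <- summary(fit)$coefficients[2, 2]
tstar <- qt(0.975, df = df.residual(fit))  # n - 2 degrees of freedom
c(b1 - tstar * se1, b1 + tstar * se1)      # same as confint(fit)["x", ]
```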
We don’t even have to use qt() or do the multiplication since R will compute the confidence intervals for us. Both 95% and 90% confidence intervals for the slope and the intercept of the regression line in Example 8.2.3 are given by

> confint(l)
                2.5 %    97.5 %
(Intercept) 1.2791991 4.6839421
L           0.2036587 0.9508533
> confint(l,level=.9)
                  5 %      95 %
(Intercept) 1.5814235 4.3817177
L           0.2699840 0.8845281

The 95% confidence interval for β1 of (0.204, 0.951) gives us a very good idea of the large uncertainty in the estimate of the linear relationship between L and flight time. Nevertheless, it does tell us that L does have some use in predicting flight time.

8.3. More Inferences

We usually want to use the results of a regression to make inferences about the possible values of y for given values of x. In this section, we look at two different kinds of inferences of this sort. We begin with an example.

Example 8.3.1. In the R library DAAG is a dataset ironslag that has observations of measurements of the iron content of 53 samples of slag by two different methods. One method, the chemical method, is more time-consuming and expensive than the other, the magnetic method, but presumably more accurate.

[Figure 8.3.: Iron content measured by two different methods.]

> library(DAAG)
> l=lm(chemical~magnetic,data=ironslag)
> summary(l)

Call:
lm(formula = chemical ~ magnetic, data = ironslag)

Residuals:
    Min      1Q  Median      3Q     Max
-6.5828 -2.6893 -0.3825  2.7240  6.6572

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.95650    1.65235   5.420 1.63e-06 ***
magnetic     0.58664    0.07624   7.695 4.38e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.464 on 51 degrees of freedom
Multiple R-Squared: 0.5372, Adjusted R-squared: 0.5282
F-statistic: 59.21 on 1 and 51 DF,  p-value: 4.375e-10

Given a particular value x = x∗ (here x∗ might be one of the values xi or some other possible value of x), define Y = β0 + β1x∗ and Ŷ = β̂0 + β̂1x∗. Since β̂0 and β̂1 are unbiased estimators of β0 and β1, we have that E(Ŷ) = β0 + β1x∗ = E(Y). It can also be shown that

Var(Ŷ) = σ²·(1/n + (x∗ − x̄)²/Sxx).

If we make the normality assumptions of the standard linear model, Ŷ is also normally distributed and we have the following confidence interval.

Confidence intervals for β0 + β1x∗. A 100(1 − α)% confidence interval for β0 + β1x∗ is given by

β̂0 + β̂1x∗ ± tα/2,n−2 · s·√(1/n + (x∗ − x̄)²/Sxx).

Notice that the confidence interval is smallest when x∗ = x̄ and at that point the standard error is simply s/√n. This error should remind us of the standard error in the construction of simple confidence intervals for the mean of a normal population. The confidence interval is wider the greater the distance of x∗ from x̄. This is not surprising, as small errors in the position of a line magnify the errors at its extremes. Of course the computations of these intervals are to be left to R. We illustrate with the ironslag data.

Example 8.3.2. (continuing Example 8.3.1) In the ironslag data, the values of the explanatory variable magnetic range from 10 to 40. We use R to write confidence intervals for β0 + β1x∗ for four different values of x∗ in this range.

> x=data.frame(magnetic=c(10,20,30,40))
> predict(l,x,interval='confidence')
       fit      lwr      upr
1 14.82291 12.91976 16.72607
2 20.68933 19.72724 21.65142
3 26.55574 24.84847 28.26301
4 32.42215 29.32547 35.51884
Notice that for a value of x∗ = 20, the confidence interval for the mean of Y is (19.7, 21.7), which is considerably narrower than the confidence interval at the extremes of the data. As is usual, R defaults to a 95% confidence interval.

It is important to realize that the confidence intervals produced by this method are confidence intervals for the mean of Y. The confidence interval of (19.7, 21.7) for x∗ = 20 means that we are confident that the true line has a value somewhere in this interval at x = 20. Obviously, we often want to use the regression line to make predictions about future observations of Y. Suppose for example in Example 8.3.1 that we produce another sample with a measurement of 30 on the variable magnetic. The fitted line predicts a measurement of 26.56 on the variable chemical. We also have a confidence interval of (24.85, 28.26) for the mean of the possible observations at x = 30. But what we would like is an estimate of how close our measured value is likely to be to our predicted value of 26.56. We take up this question next.

Given a value of x = x∗, we define Y = β0 + β1x∗ and Ŷ = β̂0 + β̂1x∗ as before. Since Y is going to be based on a future observation of y, we know that the random variable Y is independent of the random variable Ŷ (which is based on the sample observations). Consider the random variable Y − Ŷ. (This is simply the error made in using Ŷ to predict Y.) This random variable has mean 0 and variance given by

Var(Y − Ŷ) = Var(Y) + Var(Ŷ) = σ² + σ²·(1/n + (x∗ − x̄)²/Sxx).

This leads to the following prediction interval for Y.

Prediction intervals for a new Y given x = x∗. A 100(1 − α)% prediction interval for a future value of Y given x = x∗ is
β̂0 + β̂1x∗ ± tα/2,n−2 · s·√(1 + 1/n + (x∗ − x̄)²/Sxx).

For the ironslag data, with x∗ = 30 we have

> predict(l,data.frame(magnetic=30),interval="predict")
       fit      lwr      upr
[1,] 26.55574 19.39577 33.71571

Obviously, this is a very wide interval compared to the confidence intervals we generated for the mean. This is because we are asking that the interval capture 95% of the values of future measurements rather than just the true mean of such measurements.

The problem of multiple confidence intervals

When constructing many confidence intervals, we need to be careful in how we phrase our conclusions. Consider the problem of constructing 95% confidence intervals for the two parameters β0 and β1. By the definition of confidence intervals, there is a 95% probability that the confidence interval that we will construct for β0 will in fact contain β0, and similarly for β1. But what is the probability that both confidence intervals will be correct? Formally, let Iβ0 denote the (random) interval for β0 and Iβ1 denote the interval for β1. We have P(β0 ∈ Iβ0) = .95 and P(β1 ∈ Iβ1) = .95. Then we know that

.90 ≤ P(β0 ∈ Iβ0 and β1 ∈ Iβ1) ≤ .95    (8.2)

but we cannot say more than this unless we know the joint distribution of β̂0 and β̂1. In fact, given the full assumptions of the normality model, we can find a joint confidence region in the plane for the pair (β0, β1). We need the ellipse package of R.

> e=ellipse(l)
> e[1:5,]
     (Intercept)  magnetic
[1,]    9.562752 0.6146146
[2,]    9.300103 0.6266209
[3,]    9.036070 0.6384663
[4,]    8.771717 0.6501030
[5,]    8.508108 0.6614841
> plot(e,type='l')

We note that an ellipse is simply a set of points (by default 100 points are used) and we can plot the points. (It is easier to use standard graphics to do this.) The resulting ellipse is in Figure 8.4. The ellipse is chosen to have minimum area (just as our confidence intervals are chosen to have minimum length).
Thus more values of the slope and intercept are allowed, but the ellipse itself is small in area (compared to the rectangle that is implied by using both individual confidence intervals). The problem of multiple confidence intervals arises in several other places. For example, if we generate many 95% confidence intervals for the mean of Y given x from the same data, we are not 95% confident in the entire collection.

[Figure 8.4.: The 95% confidence ellipse for the parameters in the ironslag example.]

8.4. Diagnostics

We can construct the regression line and compute confidence and prediction intervals for any set of pairs (x1, y1), . . . , (xn, yn). But unless the hypotheses of the linear model are satisfied and the data are “clean,” we will be producing mostly nonsense. Anscombe constructed the examples of Figure 8.5 to illustrate this fact in a dramatic way. Each of the datasets has the same regression line: y = 3 + .5x. Indeed, the means and standard deviations of all the x’s are exactly the same in each case, and similarly for the y’s. These data are available in the dataset anscombe. The first example looks like a textbook example for the application of regression. The relationship in the second example is clearly non-linear. In the third example, one point is disguising what seems to be the “real” relationship between x and y. And in the fourth example, it is clear that some other method of analysis is more appropriate (is the outlier good data or not?). In each of these four examples, a simple plot of the data suffices to convince us not to use linear regression (at least with the data as given). But departures from the assumptions that are more subtle are not always easily detectable by a plot. (That will be true particularly in the case of several predictors, which we take up in the next section.)
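Since Anscombe’s datasets ship with R (in the data frame anscombe, with columns x1 through x4 and y1 through y4), the claim that all four least squares fits agree is easy to check; a quick sketch:

```r
# All four of Anscombe's datasets give (essentially) the same fitted line.
data(anscombe)
for (i in 1:4) {
  f <- as.formula(paste("y", i, " ~ x", i, sep = ""))
  print(round(coef(lm(f, data = anscombe)), 2))   # each about (3.00, 0.50)
}
```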
[Figure 8.5.: Four datasets with regression line y = 3 + .5x.]

In this section we look at some of the things that can be done to determine whether the linear model is the appropriate one.

8.4.1. The residuals

A careful look at the residuals often gives useful information about the appropriateness of the linear model. We will use ei for the ith residual rather than ri to emphasize that the residual is an estimate of εi, the error random variable of the model. Thus ei = yi − ŷi. If the linear model is true, the random variables εi are a random sample from a population that has mean 0, variance σ², and, in the case of the normality assumption, are normally distributed. The residuals are estimates of the εi in this random sample, so it behooves us to take a closer look at the distribution of the residuals.

The first step in an analysis of the model using residuals is to construct a residual plot. While we could plot the residuals ei against either of xi or yi, the plot that is usually constructed is that of the residuals against the fitted values ŷi. In other words, we plot the n points (ŷ1, e1), . . . , (ŷn, en). In this plot we are looking for violations of the linearity assumption, heteroscedasticity (unequal variances), and perhaps non-normality.

Example 8.4.1. A famous dataset on cats used in a certain experiment has measurements of the body weight (in kg) and heart weight (in g) of 144 cats of both sexes. A linear regression suggests a strong relationship.
> library(MASS)
> cats.m=subset(cats,Sex=='M')
> l.cats.m=lm(Hwt~Bwt,data=cats.m)
> xyplot(residuals(l.cats.m)~fitted(l.cats.m))
> summary(l.cats.m)

Call:
lm(formula = Hwt ~ Bwt, data = cats.m)

Residuals:
    Min      1Q  Median      3Q     Max
-3.7728 -1.0478 -0.2976  0.9835  4.8646

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.1841     0.9983  -1.186    0.239
Bwt           4.3127     0.3399  12.688   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.557 on 95 degrees of freedom
Multiple R-Squared: 0.6289, Adjusted R-squared: 0.625
F-statistic: 161 on 1 and 95 DF,  p-value: < 2.2e-16

Note that R has functions to return both a vector of fitted values and a vector of the residual values of the fit. This makes it easy to construct the residual plot.

[Plot: the residuals of the cats regression against the fitted values.]

This plot gives no obvious evidence of the failure of any of our assumptions. The residuals do look as if they are random noise. We next take a more careful look at the size of the residuals.

The residual ei is the result of a random variable Ei where Ei = Yi − Ŷi. (It is useful at this point to stop and think about what a complicated random variable Ei is. We’ve come a long way from tossing coins.) The important facts about the distribution of the residual random variable Ei are

E(Ei) = 0
Var(Ei) = σ²·(1 − 1/n − (xi − x̄)²/Sxx).

The first equation here is easy to prove and expected. It follows from the fact that β̂0 and β̂1 are unbiased estimators of β0 and β1. Since Yi = β0 + β1xi + εi and Ŷi = β̂0 + β̂1xi, we have that Ei = εi + (β0 − β̂0) + (β1 − β̂1)xi. The variance computation above is a bit surprising at first glance.
For ease of notation, define hi = 1/n + (xi − x̄)²/Sxx. Then the second equality above says that Var(Ei) = σ²(1 − hi). It can be shown that 1/n ≤ hi for every i. Therefore we have that Var(Ei) ≤ ((n − 1)/n)·σ². This means that the variances of our estimates of the εi are smaller than the variances of the εi by a factor that depends only on the x values.

Notice that if hi is large, the variance of Ei is small. This means that for such a point, the line is forced to be close to the point. Since hi is large when xi is far from x̄, this means that points with extreme values of x pull the line close to them. The number hi is appropriately called the leverage of the point (xi, yi). This suggests that we should pay careful attention to points of high leverage.

With the variance of the residual in hand, we can normalize ei by dividing by the estimate of its standard deviation. We are not surprised when the resulting random variable has a t distribution. The resulting proposition should have a familiar look.

Proposition 8.4.2. With the normality assumption, Ei∗ = Ei/(s·√(1 − hi)) has a t distribution with n − 2 degrees of freedom.

The proposition implies that if the normality assumption is true we should not expect to see many standardized residuals outside of the range −2 ≤ ei∗ ≤ 2. It is useful to plot the standardized residuals against the fitted values. In the cats example, the plot of the standardized residuals is produced by

> xyplot(rstandard(l.cats.m)~fitted(l.cats.m))

From this plot we see that there are one or two large residuals, both for relatively large fitted values (corresponding to large cats).
[Plot: the standardized residuals of the cats regression against the fitted values.]

8.4.2. Influential Observations

An influential observation is one that has a large effect on the fit. We have already seen that a point with large leverage has the potential to have a large effect on the fit, as the fitted line tends to be closer to such a point than to other points. However, that point might still have a relatively small effect on the regression as it might be entirely consistent with the rest of the data. To measure the influence of a particular observation on the fit, we consider what would change if we left that point out of the fit. Let a subscript of (i) on any computed value denote the value we get from a fit that omits the point (xi, yi). Thus β̂0(i) denotes the value of β̂0 when the point (xi, yi) is removed. Also, ŷj(i) denotes the predicted yj when the point (xi, yi) is removed. We might measure the influence of a point on the regression by measuring

1. changes in the coefficients, β̂ − β̂(i), and

2. changes in the fit, ŷj(i) − ŷj.

The R function dfbeta() computes the changes in the coefficients. In the case of the cats data, we have

> dfbeta(l.cats.m)
      (Intercept)           Bwt
48  -0.1333235404  4.245539e-02
49  -0.1333235404  4.245539e-02
50   0.2807378812 -8.855077e-02
...............................
143  0.1677492306 -6.250644e-02
144 -0.6605688365  2.461400e-01

Note that the last observation has a considerably greater influence on the regression than the four other points listed (and indeed it is the point of greatest influence in this sense). In particular, its inclusion changes the coefficient of Bwt by 0.25 (from 4.07 to 4.31).
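The dfbeta() values can be checked directly by refitting without a point; this sketch (assuming the cats fit of Example 8.4.1) recomputes the entry for the last male cat.

```r
# dfbeta for a point is (coefficients with the point) minus
# (coefficients without it); verify this for the last male cat.
library(MASS)                           # for the cats data
cats.m <- subset(cats, Sex == "M")
l.cats.m <- lm(Hwt ~ Bwt, data = cats.m)
l.drop <- lm(Hwt ~ Bwt, data = cats.m[-nrow(cats.m), ])
coef(l.cats.m) - coef(l.drop)           # should match the last row of dfbeta(l.cats.m)
```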
Changes in the fit depend on the scale of the observations, so it is customary to normalize by a measure of scale. One such popular measure is known as Cook’s distance. The Cook’s distance Di of a point (xi, yi) is a measure of how this point affects the other fitted values and is defined by

Di = Σj (ŷj − ŷj(i))² / (2s²).

It can be shown that

Di = (ei∗²/2) · (hi/(1 − hi)).

Thus the point (xi, yi) has a large influence on the regression if it has a large residual or a large leverage, and especially if it has both. A general rule of thumb is that a point with Cook’s distance greater than 0.7 is considered influential. In the cats data the last point is by far the most influential but is not considered overly influential by this criterion. This point corresponds to the biggest male cat.

> cd=cooks.distance(l.cats.m)
> summary(cd)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
1.117e-06 1.563e-03 4.482e-03 1.331e-02 1.155e-02 3.189e-01
> cd[cd>0.1]
      140       144
0.1302626 0.3189215

8.5. Multiple Regression

In this section, we extend the linear model to the case of several quantitative explanatory variables. There are many issues involved in this problem and this section serves only as an introduction. We start with an example.

Example 8.5.1. The dataset fat in the faraway package contains several body measurements of 252 adult males. Included in this dataset are two measures of the percentage of body fat, the Brozek and Siri indices. Each of these indices computes the percentage of body fat from the density (in g/cm³), which in turn is approximated by an underwater weighing technique. This is a time-consuming procedure and it might be useful to be able to estimate the percentage of body fat from easily obtainable measurements. For example, it might be nice to have a relationship of the following form: density = f(x1, . . . , xk) for k easily measured variables x1, . . . , xk.
We will first investigate the problem of approximating body fat by a function of only weight and abdomen circumference. The data on the first two individuals is given for illustration.

> fat[1:2,]
  brozek siri density age weight height adipos  free neck chest abdom  hip
1   12.6 12.3  1.0708  23 154.25  67.75   23.7 134.9 36.2  93.1  85.2 94.5
2    6.9  6.1  1.0853  22 173.25  72.25   23.4 161.3 38.5  93.6  83.0 98.7
  thigh knee ankle biceps forearm wrist
1  59.0 37.3  21.9   32.0    27.4  17.1
2  58.7 37.3  23.4   30.5    28.9  18.2

The notation gets a bit messy. We will continue to use y for the response variable and we will use x1, . . . , xk for the k explanatory variables. We will again assume that there are n individuals and use the subscript i to range over individuals. Therefore, the ith data point is (xi1, xi2, . . . , xik, yi). The standard linear model now becomes the following.

The standard linear model. The standard linear model is given by the equation

Y = β0 + β1x1 + · · · + βkxk + ε    (8.3)

where

1. ε is a random variable with mean 0 and variance σ²,

2. β0, β1, . . . , βk, σ² are (unknown) parameters, and

3. ε has a normal distribution.

We again assume that the n data points are the result of independent ε1, . . . , εn. To find good estimates of β0, . . . , βk we proceed exactly as in the case of one predictor and find the least squares estimates. Specifically, let β̂i be an estimate of βi and define ŷi = β̂0 + β̂1xi1 + β̂2xi2 + · · · + β̂kxik. We choose these estimates so that we minimize SSResid where

SSResid = Σi (yi − ŷi)².

It is routine to find the values of the β̂’s that minimize SSResid. R computes them with dispatch. Suppose that we use weight and abdomen circumference to try to predict the Brozek measure of body fat.
> l=lm(brozek~weight+abdom,data=fat)
> l

Call:
lm(formula = brozek ~ weight + abdom, data = fat)

Coefficients:
(Intercept)       weight        abdom
   -41.3481      -0.1365       0.9151

In the case of multiple predictors, we need to be very careful in how we interpret the various coefficients of the model. For example, β̂1 = −0.14 in this model seems to indicate that body fat is decreasing as a function of weight. This is counter to our intuition and our experience, which says that the heaviest men tend to have more body fat than average. On the other hand, the coefficient β̂2 = 0.9151 seems to be consistent with the relationship between stomach girth and body fat that we know. The key here is that the coefficient β̂1 measures the effect of weight on body fat for a fixed abdomen circumference. This makes more sense. Among individuals with a fixed abdomen circumference, the heavier individuals tend to be taller and so have perhaps less body fat. Even this interpretation needs to be expressed carefully, however. It is misleading to say that “body fat decreases as weight increases with abdomen circumference held fixed” since increasing weight tends to increase abdomen circumference. We will come back to this relationship in a moment, but first we investigate the problem of inference in this linear model.

The short story of inference is that all of the results for the one-predictor case have the obvious extensions to more than one variable. For example, we have

Theorem 8.5.2 (Gauss-Markov Theorem). The least squares estimator β̂j of βj is the minimum variance estimator of βj among all linear unbiased estimators of βj.

To estimate σ², we again use MSResid except that we define MSResid by

MSResid = SSResid / (n − (k + 1)).

The denominator in MSResid is simply n − p where p is the number of parameters estimated in computing SSResid. Using the estimate MSResid of σ², we can again produce an estimate sβ̂j of the standard deviation of β̂j and produce confidence intervals for βj.
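As a check on the general principle, a coefficient confidence interval can be computed by hand from the summary statistics. The numbers in this sketch are those reported by summary(l) for the two-predictor body fat model, with n − (k + 1) = 252 − 3 = 249 degrees of freedom.

```r
# 95% confidence interval for the weight coefficient, by hand:
# beta.hat = -0.13645 with standard error 0.01928 on 249 df.
b <- -0.13645
se <- 0.01928
b + c(-1, 1) * qt(0.975, df = 249) * se   # about (-0.174, -0.098)
```

This agrees with what confint() reports for this model.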
For the body fat data we have

> summary(l)

Call:
lm(formula = brozek ~ weight + abdom, data = fat)

Residuals:
      Min        1Q    Median        3Q       Max
-10.83074  -2.97730   0.02372   2.93970   9.76794

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.34812    2.41299 -17.136  < 2e-16 ***
weight       -0.13645    0.01928  -7.079 1.47e-11 ***
abdom         0.91514    0.05254  17.419  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.127 on 249 degrees of freedom
Multiple R-Squared: 0.7187, Adjusted R-squared: 0.7165
F-statistic: 318.1 on 2 and 249 DF,  p-value: < 2.2e-16

> confint(l)
                  2.5 %       97.5 %
(Intercept) -46.1005887 -36.59566057
weight       -0.1744175  -0.09848946
abdom         0.8116675   1.01860856

From the output we observe the following. Our estimate for σ is the residual standard error, 4.127, which is √MSResid. We note that 249 degrees of freedom are used, which is 252 − 3 since there are three parameters. We can compute the confidence interval for β1 from the summary table (β̂1 = −0.14 and sβ̂1 = 0.019) using the t distribution with 249 degrees of freedom, or from the R function confint.

We can compute confidence intervals for the expected value of body fat and prediction intervals for an individual observation as well. Investigating what happens for a male weighing 180 pounds with an abdomen measure of 82 cm gives the following prediction and confidence intervals:

> d=data.frame(weight=180, abdom=82)
> predict(l,d,interval='confidence')
      fit      lwr      upr
[1,] 9.13157 7.892198 10.37094
> predict(l,d,interval='prediction')
      fit       lwr      upr
[1,] 9.13157 0.9090354 17.35410

The average body fat of such individuals is likely to be between 7.9% and 10.4%. An individual male not part of the dataset is likely to have body fat between 0.91% and 17.4%.

We now return to the issue of interpreting the coefficients in the linear model. In the case of the body fat example, let’s fit a model with weight as the only predictor.
> lm(brozek~weight,data=fat)

Call:
lm(formula = brozek ~ weight, data = fat)

Coefficients:
(Intercept)       weight
    -9.9952       0.1617

Notice that the sign of the relationship between weight and body fat has changed! Using weight alone, we predict an increase of 0.16 in percentage of body fat for each pound increase in weight. What has happened? Let’s first restate the two fitted linear relationships:

brozek = −41.3 − 0.14 weight + 0.92 abdom    (8.4)
brozek = −10.0 + 0.16 weight    (8.5)

In order to understand the relationships above, it is important to understand that there is a linear relationship between weight and the abdomen measurement. One more regression is useful.

> lm(abdom~weight,data=fat)

Call:
lm(formula = abdom ~ weight, data = fat)

Coefficients:
(Intercept)       weight
    34.2604       0.3258

Now suppose that we change weight by 10 pounds. The last analysis says that we would predict that the abdomen measure increases by 3.3 cm. Using (8.4), we see that an increase of 10 pounds in weight together with an increase of 3.3 cm in abdomen circumference produces an increase of 10·(−0.14) + 0.92·(3.3) ≈ 1.6% in the Brozek index. But this is precisely what an increase of 10 pounds in weight should produce according to (8.5). The fact that our predictors are linearly related in the set of data (and so presumably in the population that we are modeling) is known as multicollinearity. The presence of multicollinearity makes it difficult to give simple interpretations of the coefficients in a multiple regression.

Interaction terms

Consider our linear relationship, brozek = −41.3 − 0.14 weight + 0.92 abdom. This model implies that for any fixed value of abdom, the slope of the line relating brozek to weight is always −0.14. An alternative (and more complicated) model would be that the slope of this line also changes as the value of abdom changes. One strategy for incorporating such behavior into our model is to add an additional term, an interaction term.
The equation for the linear model with an interaction term in the case that there are only two predictor variables is

Y = β0 + β1x1 + β2x2 + β1,2x1x2 + ε.

While this is not the only way that two variables could interact, it seems to be the simplest possible way. R allows us to add an interaction term using a colon.

> lm(brozek~weight+abdom+weight:abdom,data=fat)

Call:
lm(formula = brozek ~ weight + abdom + weight:abdom, data = fat)

Coefficients:
 (Intercept)        weight         abdom  weight:abdom
  -65.866013      0.003406      1.155338     -0.001350

While the coefficient for the interaction term (−0.0014) seems small, one should realize that the values of the product of these two variables are large, so that this term contributes significantly to the sum. On the other hand, in the presence of this interaction term, the contribution of the term for weight is now very small.

With all the possible variables that we might include in our model and with all the possible interaction terms, it is important to have some tools for evaluating different choices. We take up this issue in the next section.

8.6. Evaluating Models

In the previous section, we considered several different linear models for predicting the Brozek body fat index from easily determined physical measurements. Other models could be considered by using other physical measurements that were available in the dataset. How should we evaluate one of these models and how should we choose among them? One of the principal tools used to evaluate such models is known as the analysis of variance.

Given a linear model (any model, really), we choose the parameters to minimize SSResid. Recall

SSResid = Σi (yi − ŷi)².

Therefore it seems reasonable to suppose that a model with smaller SSResid is better than one with large SSResid. Such a model seems to “explain” or account for more of the variation in the yi.
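For lm fits, R makes SSResid available directly, which is convenient for the comparisons that follow; a sketch (this assumes the fat data from the faraway package, as in the previous section):

```r
# SSResid is the sum of squared residuals; for lm fits,
# deviance() returns the same quantity.
library(faraway)                     # for the fat data
la <- lm(brozek ~ abdom, data = fat)
sum(residuals(la)^2)                 # about 5094.9
deviance(la)                         # the same number
```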
Consider the two models for body fat, one using only abdomen circumference and the other only weight.

> la=lm(brozek~abdom,data=fat)
> anova(la)
Analysis of Variance Table

Response: brozek
           Df Sum Sq Mean Sq F value    Pr(>F)
abdom       1 9984.1  9984.1   489.9 < 2.2e-16 ***
Residuals 250 5094.9    20.4
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> lw=lm(brozek~weight,data=fat)
> anova(lw)
Analysis of Variance Table

Response: brozek
           Df Sum Sq Mean Sq F value    Pr(>F)
weight      1 5669.1  5669.1  150.62 < 2.2e-16 ***
Residuals 250 9409.9    37.6
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Among other things, the function anova() tells us that SSResid = 5,095 for the linear model using abdomen circumference and SSResid = 9,410 for the model using only weight. While this comparison seems clearly to indicate that abdomen circumference predicts the Brozek index better on average than does weight, using SSResid as an absolute measure of goodness of fit has two shortcomings. First, the units of SSResid are the squares of the y units, which means that SSResid will tend to be large or small according as the observations are large or small. Second, we will obviously reduce SSResid by including more variables in the model, so that comparing SSResid does not give us a good way of comparing, say, the model with abdomen circumference and weight to the model with abdomen circumference alone.

We address the first issue first. We would like to transform SSResid into a dimension-free measurement. The key to doing this is to compare SSResid to the maximum possible SSResid. To do this, define

SSTotal = Σi (yi − ȳ)².

The quantity SSTotal could be viewed as SSResid for the model with only a constant term. We have already seen (Problem 1.9) that ȳ is the unique constant c that minimizes Σi (yi − c)². The quantity SSTotal can be computed from the output of the function anova() by summing the column labeled Sum Sq.
For the body fat data, that number is SSTotal = 15,079.0. We first note that 0 ≤ SSResid ≤ SSTotal. This is because choosing β0 = ȳ and β1 = 0 would already achieve SSResid = SSTotal, but SSResid is the minimum among all choices of β0, β1. Using this fact, we have a first measure of the fit of a linear model. Define

R² = 1 − SSResid/SSTotal.

We have that 0 ≤ R² ≤ 1, and R² is close to 1 if the linear part of the model fits the data well. The number R² is sometimes called the coefficient of determination of the model and is often read as a percentage. In the model for the Brozek index which uses only abdomen circumference, we can compute R² from the statistics in the analysis of variance table or else we can read it from the summary of the regression, where it is labeled Multiple R-Squared. We read the result below as “abdomen circumference explains 66.2% of the variation in Brozek index.”

> summary(la)

Call:
lm(formula = brozek ~ abdom, data = fat)

Residuals:
      Min        1Q    Median        3Q       Max
-17.62568  -3.46724   0.01113   3.14145  11.97539

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -35.19661    2.46229  -14.29   <2e-16 ***
abdom         0.58489    0.02643   22.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.514 on 250 degrees of freedom
Multiple R-Squared: 0.6621, Adjusted R-squared: 0.6608
F-statistic: 489.9 on 1 and 250 DF,  p-value: < 2.2e-16

The value of R² for the model using only weight is 37.6%. The R² values for two different models with the same number of parameters give us a reasonable way to compare their usefulness. However, R² is a misleading tool for comparing models with differing numbers of parameters. After all, if we allow ourselves n different parameters (i.e., we have n different explanatory variables), we will be able to fit the data exactly and so achieve R² = 100%. We consider just one way of comparing two models with a different number of parameters. Given a model with parameters β0, . . .
, βk, we define a quantity AIC, called the Akaike Information Criterion, by

    AIC = n ln(SSResid/n) + 2(k + 1).

While we cannot give the theoretical basis for choosing this measure, we can notice the following two properties:

1. AIC is larger if SSResid is larger, and
2. AIC is larger if the number of parameters is larger.

These two properties should lead us to choose models with small AIC. Indeed, AIC captures one good way of measuring the trade-off in reducing SSResid (good) by increasing the number of terms in the model (bad). We can compute AIC for a given model with the function extractAIC() in R.

> law=lm(brozek ~ abdom + weight,data=fat)
> extractAIC(law)
[1]   3.0000 717.4471

The 3-parameter model with linear terms for abdomen circumference and weight has AIC = 717.4. This value of AIC does not mean much alone, but it is used for comparing models with differing numbers of parameters. (We should remark here that there are different definitions of AIC that vary in the choice of some constants in the formula. The R function AIC() computes one other version of AIC. It does not usually matter which version one uses to compare two models.) We illustrate the use of AIC in developing a model by applying it to the Brozek data. We first consider a model that contains all 12 easily measured explanatory variables in the dataset fat.

> lbig=lm(brozek ~ weight + height + neck + chest +
+     abdom + hip + thigh + knee + ankle + biceps + forearm + wrist,data=fat)
> extractAIC(lbig)
[1]  13.0000 712.5451

At least by the AIC criterion, the 13-parameter model is better (by a small margin) than the 3-parameter model that we first considered. We really do not want the 13-parameter model above, however. First, it is too complicated to suit the purpose of easily approximating body fat from body measurements. Second, we really cannot believe that all these explanatory variables are necessary.
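The formula can be checked against extractAIC() with numbers already at hand: the 13-parameter model lbig has SSResid = 3842.3 (the residual sum of squares that step() reports for it below), and n = 252, since the anova tables show 250 residual degrees of freedom for a 2-parameter model. A sketch:

```r
n <- 252
k <- 12                            # 12 explanatory variables, so k + 1 = 13 parameters
ssresid <- 3842.3                  # SSResid for the model lbig

n * log(ssresid/n) + 2*(k + 1)     # about 712.55, agreeing with extractAIC(lbig)
```

Note that log() in R is the natural logarithm, which is what the formula requires.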
In order to decide which model to use, we might simply evaluate AIC for all possible subsets of the 12 explanatory variables in the big model. While R packages exist that do this, we use an alternate approach in which we consider one variable at a time. The R function that does this is step(). At each stage, step() performs a regression for each variable, determining how AIC would change if that variable were left out of (or added to) the model. The output is lengthy; the piece below illustrates the first step:

> step(lbig,direction='both')
Start:  AIC=712.55
brozek ~ weight + height + neck + chest + abdom + hip + thigh +
    knee + ankle + biceps + forearm + wrist

          Df Sum of Sq    RSS   AIC
- chest    1       0.6 3842.8 710.6
- knee     1       3.0 3845.3 710.7
- ankle    1       5.8 3848.0 710.9
- height   1      15.5 3857.8 711.6
- biceps   1      19.4 3861.6 711.8
- thigh    1      19.9 3862.1 711.8
<none>                 3842.3 712.5
- hip      1      42.0 3884.2 713.3
- neck     1      55.4 3897.7 714.2
- forearm  1      67.0 3909.2 714.9
- weight   1      67.3 3909.6 714.9
- wrist    1      98.1 3940.4 716.9
- abdom    1    2831.4 6673.6 849.7

Step:  AIC=710.58
brozek ~ weight + height + neck + abdom + hip + thigh + knee +
    ankle + biceps + forearm + wrist

For each variable that is in the big model, AIC is computed for a regression leaving that variable out. For example, leaving out the variable chest reduces AIC to 710.6, an improvement over the value 712.5 of the full model. Removing chest gives the greatest reduction of AIC. The second step starts with this model and determines that it is also useful to remove the knee measurement from the model.
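The numbers in the first step show concretely how AIC trades fit against complexity: dropping chest barely increases SSResid (3842.3 to 3842.8), while the penalty term 2(k + 1) drops by 2, so AIC falls. A sketch of the two computations, using the AIC formula above:

```r
n <- 252
n * log(3842.3/n) + 2*13   # 13-parameter model: about 712.55
n * log(3842.8/n) + 2*12   # chest removed: about 710.58, so AIC improves
```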
brozek ~ weight + height + neck + abdom + hip + thigh + knee +
    ankle + biceps + forearm + wrist

          Df Sum of Sq    RSS   AIC
- knee     1       3.3 3846.1 708.8
- ankle    1       5.9 3848.7 709.0
- height   1      14.9 3857.8 709.6
- biceps   1      19.0 3861.8 709.8
- thigh    1      21.9 3864.7 710.0
<none>                 3842.8 710.6
- hip      1      41.6 3884.4 711.3
- neck     1      55.9 3898.7 712.2
+ chest    1       0.6 3842.3 712.5
- forearm  1      66.4 3909.2 712.9
- weight   1      87.3 3930.1 714.2
- wrist    1      98.0 3940.9 714.9
- abdom    1    3953.3 7796.1 886.9

Step:  AIC=708.8
brozek ~ weight + height + neck + abdom + hip + thigh + ankle +
    biceps + forearm + wrist

Notice that at this second step, all variables in the model were considered for exclusion and all variables currently not in the model (only chest) were considered for inclusion. After several more steps, the final step determines that no single variable should be added or removed:

          Df Sum of Sq    RSS   AIC
<none>                 3887.9 705.5
- hip      1      38.8 3926.6 706.0
+ biceps   1      19.6 3868.2 706.2
+ height   1      17.5 3870.3 706.4
- thigh    1      53.3 3941.1 706.9
+ ankle    1       6.8 3881.1 707.1
- neck     1      57.7 3945.5 707.2
+ knee     1       2.4 3885.4 707.4
+ chest    1       0.1 3887.7 707.5
- wrist    1      89.8 3977.7 709.3
- forearm  1     102.6 3990.5 710.1
- weight   1     134.1 4021.9 712.1
- abdom    1    4965.6 8853.4 910.9

Call:
lm(formula = brozek ~ weight + neck + abdom + hip + thigh + forearm +
    wrist, data = fat)

Coefficients:
(Intercept)       weight         neck        abdom          hip        thigh
   -21.7410      -0.1042      -0.3971       0.9584      -0.2010       0.2090
    forearm        wrist
     0.4372      -1.0514

The final model has AIC = 705.5 and appears to be the best model, at least by the AIC criterion.

8.7. Exercises

8.1 Sometimes the experimenter has control over the choice of the points x1, . . . , xn in an experiment.
Consider the following two sets of choices:

Set A: x1 = 1, x2 = 2, x3 = 3, x4 = 4, x5 = 5, x6 = 6, x7 = 7, x8 = 8, x9 = 9, x10 = 10
Set B: x1 = 1, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x6 = 10, x7 = 10, x8 = 10, x9 = 10, x10 = 10

a) Explain how Proposition 8.1.3 can be used to argue for Set B.
b) Despite the argument in part (a), why might Set A be a better choice?

8.2 A simple random sample was chosen from the population of all the students with senior status as of February, 2003, who had taken the ACT test. The ACT score and GPA of each student is in the file http://www.calvin.edu/~stob/data/sr80.csv.

a) Write the equation of the regression line that could be used to predict the GPA of a student from their ACT score.
b) Write a 95% confidence interval for the slope of the line.
c) For each of the ACT scores 20, 25, 30, use the line to predict the GPA of a student with that score.
d) Write 95% confidence intervals for the mean GPA of all students with ACT scores 20, 25, and 30.
e) Write a 95% prediction interval for the GPA of another student with ACT score 20.
f) Plot the residuals from this regression and say whether the residuals indicate any concerns about whether the assumptions of the standard linear model are met.

8.3 A famous dataset (Pierce, 1948) contains data on the relationship between cricket chirps and temperature. The dataset is reproduced at http://www.calvin.edu/~stob/data/crickets.csv. The variables are Temperature, in degrees Fahrenheit, and Chirps, giving the number of chirps per second of crickets at that temperature.

a) Write the equation of the regression line that could be used to predict the temperature from the number of cricket chirps per second.
b) Write a 95% confidence interval for the slope of the line.
c) Write a 95% confidence interval for the mean temperature for each of the values 12, 14, 16, and 18 of cricket chirps per second.
d) You hear a cricket chirping 15 times per second.
What is an interval that is likely to capture the value of the temperature? Explain what "likely" means here.
e) Plot the residuals from this regression and say whether the residuals indicate any concerns about whether the assumptions of the standard linear model are met.

8.4 Prove Equation 8.2.

8.5 The faraway package contains a dataset cpd which has the projected and actual sales of 20 different products of a company. (The data were actually transformed to disguise the company.)

a) Write a regression line that describes a linear relationship between projected and actual sales.
b) Identify one data point that has particularly large influence on the regression. Give a couple of quantitative measures that summarize its influence.
c) Refit the regression line after removing the data point that you identified in part (b). How does the equation of the line change?

A. Appendix: Using R

A.1. Getting Started

Download R from the R project website http://www.r-project.org/ (which requires a few clicks) or directly from http://cran.stat.ucla.edu/. There are Windows, Mac, and Unix versions. These notes are for the Windows version; there will be minor differences for the other versions.

A.2. Vectors and Factors

A vector has a length (a non-negative integer) and a mode (numeric, character, complex, or logical). All elements of the vector must be of the same mode. Typically, we use a vector to store the values of a quantitative variable. Usually vectors will be constructed by reading data from an R dataset or a file, but short vectors can be constructed by entering the elements directly.

> x=c(1,3,5,7,9)
> x
[1] 1 3 5 7 9

Note that the [1] that precedes the elements of the vector is not one of the elements but rather an indication that the first element of the vector follows. There are a couple of shortcuts that help construct vectors that are regular.
> y=1:5
> z=seq(0,10,.5)
> y;z
[1] 1 2 3 4 5
 [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0
[16]  7.5  8.0  8.5  9.0  9.5 10.0

To refer to individual elements of a vector we use square brackets. Note that a variety of expressions, including other vectors, can go within the brackets.

> x[3]        # 3rd element of x
[1] 5
> x[c(1,3,5)] # 1st, 3rd, 5th elements of x
[1] 1 5 9
> x[-4]       # all but 4th element of x
[1] 1 3 5 9
> x[-c(2,3)]  # all but 2nd and 3rd elements of x
[1] 1 7 9

If a vector t is a logical vector of the same length as x, then x[t] selects only those elements of x for which t is true. Such logical vectors t are often constructed from logical operations on x itself.

> x>5         # compares x elementwise to 5
[1] FALSE FALSE FALSE  TRUE  TRUE
> x[x>5]      # those elements of x where condition is true
[1] 7 9
> x[x==1|x>5] # == for equality and | for logical or
[1] 1 7 9

Arithmetic on vectors works element by element, as do many functions.

> x
[1] 1 3 5 7 9
> y
[1] 1 2 3 4 5
> x*y               # componentwise multiplication
[1]  1  6 15 28 45
> x^2               # exponentiation of each element by a constant
[1]  1  9 25 49 81
> c(1,2,3,4)*c(2,4) # if the vectors are not of the same length, the shorter is
[1]  2  8  6 16     # recycled if the lengths are compatible
> log(x)            # the log function operates componentwise
[1] 0.000000 1.098612 1.609438 1.945910 2.197225

A.3. Data frames

Datasets are typically stored in data frames. A data frame in R is a data structure that can be considered a two-dimensional array with rows and columns. Each column is a vector or a factor. The rows usually correspond to the individuals of our dataset. Usually data frames are constructed by reading data from a file or loading a built-in R dataset (see the next section). A data frame can also be constructed from individual vectors and factors. The following R session uses the built-in iris dataset to illustrate some of the basic operations on data frames.
> dim(iris)  # 150 rows or observations, 5 columns or variables
[1] 150   5
> iris[1,]   # the first observation (row)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
> iris[,1]   # the first column (variable), output is a vector
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
> iris[1]    # alternative means of referring to first column, output is a data frame
    Sepal.Length
1            5.1
2            4.9
3            4.7
4            4.6
5            5.0
................
# many observations omitted
145          6.7
146          6.7
147          6.3
148          6.5
149          6.2
150          5.9
> iris[1:5,3]  # the first five observations, the third variable
[1] 1.4 1.4 1.3 1.5 1.4
> iris$Sepal.Length  # the vector in the data frame named Sepal.Length
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
> iris$Sepal.Length[10]  # iris$Sepal.Length is a vector and can be used as such
[1] 4.9

We next demonstrate how to construct a data frame from vectors and factors.

> x=1:3
> y=factor(c("a","b","c"))  # makes a factor of the character vector
> d=data.frame(numbers=x, letters=y)
> d
  numbers letters
1       1       a
2       2       b
3       3       c
> d[,2]
[1] a b c
Levels: a b c
> d$numbers
[1] 1 2 3

A.4. Getting Data In and Out

Accessing datasets in R

There are a large number of datasets that are included with the standard distribution of R. Many of these are historically important datasets or datasets that are often used in statistics courses. A complete list of such datasets is available via data(). A built-in dataset named junk usually contains a data frame named junk, and the command data(junk) defines that data frame. In fact, many datasets are preloaded; for example, the iris dataset is available without using data(iris). For a built-in dataset junk, ?junk usually gives a description of the dataset. Many users of R have made other datasets available by creating a package.
A package is a collection of R datasets and/or functions that a user can load. Some of these packages come with the standard distribution of R; others are available from CRAN. To load a package, use library(package.name) or require(package.name). For example, the faraway package contains several datasets. One such dataset records various health statistics on 768 adult Pima Indians for a medical study of diabetes.

> library(faraway)
> data(pima)
> dim(pima)
[1] 768   9
> pima[1:5,]
  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1

If a package is not included in the distribution of R installed on your machine, it can be installed from a remote site. This can be done easily in both the Windows and Mac implementations of R using menus.

Finally, datasets can be loaded from a file that is located on one's local computer or on the internet. Two things need to be known: the format of the data file and the location of the data file. The most common format of a data file is CSV (comma-separated values). In this format, each individual is a line in the file and the values of the variables are separated by commas. The first line of such a file contains the variable names. There are no individual names. The R function read.csv reads such a file. Other formats are possible, and the function read.table can be used with various options to read them. The following example shows how a file is read from the internet. The file contains the offensive statistics of all major league baseball teams for the complete 2007 season.
> bball=read.csv('http://www.calvin.edu/~stob/data/baseball2007.csv')
> bball[1:4,]
         CLUB LEAGUE    BA   SLG   OBP   G   AB   R    H   TB X2B X3B  HR RBI
1    New York      A 0.290 0.463 0.366 162 5717 968 1656 2649 326  32 201 929
2     Detroit      A 0.287 0.458 0.345 162 5757 887 1652 2635 352  50 177 857
3     Seattle      A 0.287 0.425 0.337 162 5684 794 1629 2416 284  22 153 754
4 Los Angeles      A 0.284 0.417 0.345 162 5554 822 1578 2317 324  23 123 776
  SH SF HBP  BB IBB   SO  SB CS GDP  LOB SHO   E  DP TP
1 41 54  78 637  32  991 123 40 138 1249   8  88 174  0
2 31 45  56 474  45 1054 103 30 128 1148   3  99 148  0
3 33 40  62 389  32  861  81 30 154 1128   7  90 167  0
4 32 65  40 507  55  883 139 55 146 1100   8 101 154  0

Creating datasets in R

Probably the best way to create a new dataset for use in R is to use an external program to create it. Excel, for example, can save a spreadsheet in CSV format, and the editing features of Excel make it very easy to create such a dataset. Small datasets can be entered into R by hand. Usually this is done by creating the vectors of the data frame individually. Vectors can be created using the c() or scan() functions.

> x=c(1,2,3,4,5:10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> y=c('a','b','c')
> y
[1] "a" "b" "c"
> z=scan()
1: 2 3 4
4: 11 12 19
7: 4
8:
Read 7 items
> z
[1]  2  3  4 11 12 19  4

The scan() function prompts the user with the number of the next item to enter. Items are entered delimited by spaces or commas. We can use as many lines as we like, and the input is terminated by a blank line. There is also a data editor available in the graphical user interfaces, but it is quite primitive.

A.5. Functions in R

Almost all the capabilities of R are implemented as functions. A function in R is much like a mathematical function; namely, a function has inputs and outputs. In mathematics, f(x, y) is functional notation: the name of the function is f, there are two inputs x and y, and the expression f(x, y) is the name of the output of the function.
The notation in R is quite similar. For example, mean(x) denotes the result of applying the function mean to the input x. There are some important differences between the conventions that we typically use in mathematics and those used in R. A first difference is that functions in R often have optional arguments. For example, the function that computes the mean has an optional argument that allows us to compute a trimmed mean: mean(x,trim=.1) computes a 10%-trimmed mean of x. A second difference is that in R inputs have names. In mathematics, we rely only on position to identify which input is which in functions that have several inputs. Because we have optional arguments in R, we need some way to indicate which arguments we are including. Hence, in the example of the mean function above, the argument trim is named. If we use a function in R without naming arguments, then R assumes that the arguments are included in a certain order (that can be determined from the documentation). For example, the mean function has the specification

mean(x, trim = 0, na.rm = FALSE, ...)

This means that the first three arguments are called x, trim, and na.rm. The latter two of these arguments have default values that are used if they are missing. If unnamed, the arguments must appear in this order; if named, they can appear in any order. The following short session of R shows the variety of possibilities. Just remember that R first matches up the named arguments and then uses the unnamed arguments to fill the remaining arguments in the order that it expects them. Notice that mean allows other arguments ... that it does not use.

> y=1:10
> mean(y)
[1] 5.5
> mean(x=y)              # all these are legal
> mean(y,trim=.1)
> mean(trim=.1,y)
> mean(trim=.1,x=y)
> mean(y,.1,na.rm=F)
> mean(y,na.rm=F,.1)
> mean(y,na.rm=F,trim=.1)
> mean(y,.1,F)
> mean(y,trim=.1,F)
> mean(y,F,trim=.1)
> mean(y,F,.1)
> mean(.1,y)             # these are not legal
> mean(z=y,.1)

A third difference between R and our usual mathematical conventions is that many functions are "vectorized." For example, the natural log function operates on vectors one component at a time:

> x=c(1:10)
> log(x)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 [8] 2.0794415 2.1972246 2.3025851

A.6. Samples and Simulation

The sample() function allows us to choose probability samples of any size from a fixed population. The syntax is

sample(x,size,replace=F,prob=NULL)

where

x        a vector representing the population
size     the size of the sample
replace  true or false according to whether the sampling is with replacement or not
prob     if present, a vector of the same length as x of probabilities of choosing the corresponding individual

The following R session gives examples of some typical uses of the sample command.

> x=1:6
> sample(x)                    # a random permutation
[1] 5 6 2 1 4 3
> sample(x,size=10,replace=T)  # throwing 10 dice
[1] 5 3 2 5 5 5 5 3 1 2
> sample(x,size=10,replace=T,prob=c(1/2,1/10,1/10,1/10,1/10,1/10))  # weighted dice
[1] 1 1 3 5 1 6 1 5 1 1
> sample(x,size=10,replace=T,prob=c(5,3,2,1,1,1))  # weights need not sum to 1 (used proportionally)
[1] 3 1 1 1 2 1 4 2 1 5
> sample(x,size=4,replace=F)   # sampling without replacement
[1] 2 3 1 6

Simulation is an important tool for understanding what might happen in a random sampling situation. Many simulations can be performed using the replicate function. The simplest form of the replicate function is replicate(n,expr), where expr is an R expression that has a value (e.g., a function call) and n is the number of times that we wish to replicate expr. The result of replicate is a list, but if all replications of expr have scalar values of the same mode, the result is a vector. Continuing the dice-tossing motif, the following R session gives the result of computing the mean of 10 dice rolls for 20 different trials.
> replicate(20, mean(sample(1:6,10,replace=T)))
 [1] 3.5 4.3 3.5 3.7 2.1 3.5 3.6 3.4 3.9 3.3 2.6 3.1 3.8 3.2 2.9 3.1 3.1 3.6 4.0
[20] 2.7

If expr returns something other than a scalar, then the object created by replicate might be a list or a matrix. For example, we generate 10 different permutations of the numbers from 1 to 5.

> r=replicate(10,sample(c(1:5)))
> r
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    3    3    5    4    2    2    4    1    3     1
[2,]    5    1    4    1    4    3    1    2    1     3
[3,]    2    5    1    5    1    5    3    5    5     2
[4,]    4    2    3    3    3    1    5    3    4     4
[5,]    1    4    2    2    5    4    2    4    2     5
> r[,1]
[1] 3 5 2 4 1

Notice that the results of replicate are placed in the columns of the returned object. In fact, the result of replicate can have quite a complicated structure. In the following code, we simulate 1,000 different tosses of 1,000 dice and for each of the trials we construct a histogram. Note that the internal structure of a histogram is a list with various components.

> h=replicate(1000,hist(sample(1:6,1000,replace=T)))
> h[,1]
$breaks
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0

$counts
 [1] 169 176   0 177   0 149   0 170   0 159

$intensities
[1] 0.3379999 0.3520000 0.0000000 0.3540000 0.0000000 0.2980000 0.0000000
[8] 0.3400000 0.0000000 0.3180000

$density
[1] 0.3379999 0.3520000 0.0000000 0.3540000 0.0000000 0.2980000 0.0000000
[8] 0.3400000 0.0000000 0.3180000

$mids
 [1] 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25 5.75

$xname
[1] "sample(1:6, 1000, replace = T)"

$equidist
[1] TRUE

A.7. Formulas

Formulas are used extensively in R when analyzing multivariate data. Formulas can take many forms, and their meaning varies by R context, but in general they are used to describe models in which we have a dependent or response variable that depends on some independent or predictor variables. There may also be conditioning variables that limit the scope of the model. Suppose that x, y, z, w are variables (which are usually vectors or factors).
Then the following are legal formulas, together with a way to read them.

x~y          x modeled by y
x~y|z        x modeled by y conditioned on z
x~y+w        x modeled by y and w
x~y*w        x modeled by y, w and y*w
x~y+I(y^2)   x modeled by y and y^2

Notice that in the last example we are essentially defining a new variable, y^2, as one of the predictor variables. In this case we need I to indicate that this is the interpretation. Most arithmetic expressions can occur within the scope of I. For example,

> histogram(~I(x^2+x))

produces a histogram of the transformed variable x^2 + x. (Leaving out the I in this case gives a completely different result.) Most graphics commands that use formulas will use the vertical axis for the response variable, the horizontal axis for the predictor variable, and will draw a separate plot for each value of the conditioning variable (which is usually a categorical variable).

A.8. Lattice Graphics

The lattice graphics package (accessed by library(lattice)) is the R implementation of Trellis graphics, a graphics system developed at Bell Laboratories. The lattice graphics package is completely self-contained and unrelated to the base graphics package of R. Lattice graphics functions in general produce objects that are of class "trellis." These objects can be manipulated and printed. Printing a lattice object is generally what makes a graph appear in its own window on the display. The standard high-level graphics functions automatically print the object they create. The most important lattice graphic functions are as follows.
xyplot()       scatter plot
bwplot()       box and whiskers plot
histogram()    histograms
dotplot()      dot plots
densityplot()  kernel density plots
qq()           quantile-quantile plot for comparing two distributions
qqmath()       quantile plots against certain mathematical distributions
stripplot()    one-dimensional scatter plots
contourplot()  contour plot of trivariate data
levelplot()    level plot of trivariate data
splom()        scatter plot matrix of several variables
rfs()          residuals and fitted values plot

The syntax of these plotting commands differs according to the nature of the plot and the data, and most of these high-level plotting commands allow various options. A typical syntax is found in xyplot(), which we illustrate here using the iris data.

> xyplot(Sepal.Length~Sepal.Width | Species, data=iris, subset=c(1:149),
+     type=c("p","r"),layout=c(3,1))

Here we are using the data frame iris, and we are using only the first 149 observations of this data frame. We are making three x-y plots, one for each Species (the conditioning variable in the formula). The plots have Sepal.Width on the horizontal axis and Sepal.Length on the vertical axis. The plots contain points and also a fitted regression line. The three plots are displayed in a three-column by one-row layout. All kinds of options besides type and layout are available to control the size, shape, labeling, colors, etc. of the plot.

A.9. Exercises

A.1 Choose 4 integers in the range 1–10 and 4 in the range 11–20. Enter these 8 integers in non-decreasing order into a vector x. For each of the following R commands, write down a guess as to what the output of R would be and then write down (using R, of course) what the output actually is.

a) x
b) x+1
c) sum(x)
d) x>10
e) x[x>10]
f) sum(x>10) Explain what R is computing here.
g) sum(x[x>10]) Explain what R is computing here.
h) x[-(1:4)]
i) x^2

A.2 The following table gives the total votes cast for each of the candidates in the 2008 Presidential Primaries in the State of Michigan.

Democratic                     Republican
Clinton        328,151         Romney         337,847
Uncommitted    236,723         McCain         257,521
Kucinich        21,708         Huckabee       139,699
Dodd             3,853         Paul            54,434
Gravel           2,363         Thompson        32,135
                               Giuliani        24,706
                               Uncommitted     17,971
                               Hunter           2,823

a) Create a data frame in R, named Michigan, that has three variables: candidate, party, votes. Be careful to make variables factors or vectors as appropriate.
b) Write an R expression to list all the candidates.
c) Write an R expression to list all the Democratic candidates.
d) Write an R expression that computes the total number of votes cast in the Democratic primary.

A.3 The function mad computes the median of the absolute deviations from the median of a vector of numbers. That is, if m is the median of x1, . . . , xn, then the median absolute deviation from the median is

    median{|x1 − m|, . . . , |xn − m|}.

Actually, the function in R is considerably more versatile. For example, instead of m, the function allows as an option that the mean x̄ be used instead. Also, there are several choices for which median is computed of the set of numbers. Finally, the R function multiplies the result by a constant (the default is 1.4826, for technical reasons). Using ?mad, we find that the usage for the function is

mad(x, center = median(x), constant = 1.4826, na.rm = FALSE, low = FALSE, high = FALSE)

Enter the vector x=c(1,2,4,6,8,10).

a) R computes mad(x) to be 4.4478. (Try it!) Using the help document and the default values of the function, explain how the number 4.4478 is computed.
b) Compute mad(x,mean(x),constant=1,FALSE,TRUE,FALSE). Explain the result.
c) The three logical values in the expression in part (b) might be mysterious to a reader. Write an R function call that is somewhat more self-explanatory.
A.4 In R, define a vector x with 100 values of your own choosing. Compare the behavior of

> histogram(~x^2+x)
> histogram(~I(x^2+x))

and state precisely what each of the two expressions does with the data x.