1 Overview of Statistics/Data Classification
Statistics: the science of collecting, organizing, analyzing, and interpreting data in order to make decisions
Two branches of statistics
Descriptive statistics: organizing, summarizing, and displaying data
Inferential statistics: use of a sample to draw conclusions about a population
To make an inference means to draw a general conclusion about a population from a sample.
Probability plays a central role in inferential statistics but no role in descriptive statistics.
Population: the entire set of individuals of interest
The population is determined by the problem.
Variable: a characteristic of an individual, to be measured or observed
E.g., if the individuals are people, variables might be height, eye color, ...
Sample: a subset of a population
i.e., members of the population which you actually know something about
E.g. What’s the population and what’s the sample?
(a) A 2010 survey of 8000 U.S. adults showed that 42% of respondents considered themselves conservative (Gallup).
(b) A 2010 poll showed that 43% of Texans believed that the country was on the right track.
Can’t generalize beyond TX: number was 31% for the country as a whole (Texas Tribune)
Data: information from counting, measuring, or observing
i.e., what you write down about members of your sample or population
Parameter: a value computed from population data
Statistic: a value computed from sample data
A statistic is an estimate of a parameter.
In real life, you hardly ever know a parameter. But we’re interested in parameters (on average, how much does a baby weigh at birth?), and we use statistics to estimate them, so we need words to distinguish the two.
E.g. Jake measured the diameters of 100 ball bearings chosen from a shipment of 10,000. The average diameter
was 1.1mm.
Population:
Sample:
Variable:
Data:
Statistic:
A statistic changes when the sample changes.
Corwin, STAT 200
©2011-2014 Stephen Corwin — Do not distribute
E.g. Parameter or statistic?
A 2009 survey of 218 law firms with at least 50 lawyers found that 69% of firms had cut personnel in the
previous year. (Altman-Weil)
Data Classification
Just words to describe types of data so we can talk about it sensibly.
Two basic types of data:
quantitative (or numerical): numbers that are the results of measuring or counting
It makes sense to do arithmetic on quantitative data. It usually makes sense to average it.
qualitative (or categorical): everything else
It does not make sense to do arithmetic on qualitative data.
E.g. Qual or quant?
(a) diameters of Eastern White Pines
(b) eye color
(c) numbers on jerseys of starting team
Data is univariate if it is made up of a list of individual values.
→All the data mentioned above is univariate.
Paired or bivariate data is made up of ordered pairs of numbers (often presented in a table).
E.g. John measured the boiling point of water at various altitudes.
Altitude (ft)   Boiling point (°F)
0               212
1000            210.2
2000            208.4
5280            202.45
2 Experiments and Sampling
An experiment is any activity with measurable outcomes, e.g., rolling a die or drawing a card.
Sampling
census: use the entire population (rarely possible)
We want to be able to use information about a sample to infer something about a population characteristic, so we want a sample
that is representative of the population.
A sampling method is biased if it tends to produce samples that are not representative of the population. Sometimes we
refer to such samples as “biased samples.”
What does it mean for a sample to be “not representative”? It means that if you compute statistics based on many samples
chosen by the method, then on average they won’t correctly estimate the parameters they’re supposed to estimate.
sampling error: the difference between a calculation made from population data and the corresponding one made from sample data
A simple random sample is one in which every possible sample of the same size has the same chance of being selected.
This is not quite the same thing as saying that every individual has the same chance of being selected. E.g., a coin is flipped.
If it comes up heads, Alice and Bob are both chosen; if it comes up tails, Carol is chosen. Then each individual has a 50/50
chance of being chosen, but Bob and Carol cannot both be chosen.
You get a simple random sample by assigning a number to every member of the population and then using
a random number generator to choose.
A simple random sample is almost always best, but sometimes you cannot afford a large enough one.
We will assume that all samples are simple random samples unless told otherwise.
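The numbering-and-random-draw procedure above can be sketched in Python. This is an illustrative sketch: the population size (10,000, echoing the ball-bearing example) and the seed are assumptions, not part of the notes.

```python
import random

# Assign a number to every member of a (hypothetical) population of 10,000,
# then let a random number generator pick: every possible sample of size 100
# is equally likely, which is the definition of a simple random sample.
population = list(range(1, 10_001))
random.seed(42)  # fixed seed only so the sketch is reproducible
sample = random.sample(population, k=100)

print(len(sample))       # 100
print(len(set(sample)))  # 100 -- no member is chosen twice
```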
3 Frequency Distributions and Histograms
Frequency distribution tables
Divide a set of data into classes or intervals. A frequency distribution is a table that shows the number of data points in
each class or interval.
→for univariate data
→although these are very useful for qualitative data, we only make them for quantitative data
E.g. Attendance in an Intro Stat section on each day of one semester:
45, 47, 43, 40, 38, 36, 23, 35, 44, 26, 32, 35, 40, 38,
38, 39, 36, 37, 45, 35, 36, 37, 38, 36, 33, 35, 36, 40, 45
Make a frequency distribution using the classes 20–30, 30–40,
40–50.
Class    # days
20–30    2
30–40    18
40–50    9
The frequency of a class is the number of data values in it. Notation: f.
For the class a–b, a is the lower class limit and b is the upper class limit. b − a is the class width.
E.g. For the class 30–40 in the example above, the lower class limit is 30, the upper class limit is 40, and the
class width is 40 − 30 = 10.
If a data value falls on the boundary between two classes, it is always put in the higher class.
How to construct a frequency distribution table
1. Find the range of the data, i.e., (max value) − (min value).
2. The number of classes to use should be approximately the square root of the number of data points; round up to the nearest integer, but if this number is less than 5, use 5, and if it’s greater than 20, use 20.
3. Approximate class width = range / (number of classes); round up to the number of decimal places the data has.
4. Choose a starting value equal to or a little less than the smallest data value.
5. Count the number of data values in each class and make the table.
E.g. Construct a frequency distribution table for the attendance data.
(1) range = 47 − 23 = 24
(2) √29 ≈ 5.4; round up to 6
(3) approx. class width = 24/6 = 4; no need to round
(4) starting value = 21
Class    # days
21–25    1
25–29    1
29–33    1
33–37    10
37–41    10
41–45    2
45–49    4
         Σf = 29

Recall that the midpoint of the interval [a, b] is (a + b)/2.
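The five construction steps above can be sketched in Python and checked against the attendance data. This is a sketch under one assumption: the half-open classes [lo, hi) encode the put-boundary-values-in-the-higher-class rule.

```python
import math

# Attendance data from the example above.
data = [45, 47, 43, 40, 38, 36, 23, 35, 44, 26, 32, 35, 40, 38,
        38, 39, 36, 37, 45, 35, 36, 37, 38, 36, 33, 35, 36, 40, 45]

data_range = max(data) - min(data)                    # step 1: 47 - 23 = 24
k = min(max(math.ceil(math.sqrt(len(data))), 5), 20)  # step 2: sqrt(29) -> 6, clamped to [5, 20]
width = math.ceil(data_range / k)                     # step 3: 24/6 = 4
start = 21                                            # step 4: a little less than min(data) = 23

freq = {}                                             # step 5: count values per class
lo = start
while lo <= max(data):
    hi = lo + width
    # half-open class [lo, hi): a value on the boundary goes in the higher class
    freq[(lo, hi)] = sum(lo <= x < hi for x in data)
    lo = hi

print(list(freq.values()))  # [1, 1, 1, 10, 10, 2, 4]
```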
Frequency Histograms
A frequency histogram is a bar graph of a frequency distribution table.
• horizontal axis: classes
  - class boundaries must coincide: if, in the table, consecutive classes are separated by a gap x, extend each boundary by x/2 (this means that the bars must touch)
  - label the axis with the class midpoints
• vertical axis: frequencies
E.g. Construct a frequency histogram for the table above.
4 Relative Frequency; Cumulative Frequency; Distribution Shapes
Relative frequency distributions and histograms
The relative frequency of a class is (# data points in the class) / (# data points in the whole set).
A relative frequency distribution table shows percentages of the whole for each class instead of the number in each class.
A relative frequency histogram is a histogram for a relative frequency distribution.
Note that the percentages must add up to 100 (up to rounding error).
E.g. Compute the relative frequencies and make a relative frequency histogram for the following data:
Class    Frequency    Relative frequency (%)
21–25    1            3.4
25–29    1            3.4
29–33    2            6.9
33–37    9            31.0
37–41    10           34.5
41–45    2            6.9
45–49    4            13.8
         Σf = 29
Σ f means “sum of all frequencies”
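The relative-frequency column can be reproduced in a couple of lines. One note confirmed by the sketch: the rounded percentages sum to 99.9, not 100, purely because of rounding.

```python
freqs = [1, 1, 2, 9, 10, 2, 4]   # frequency column from the table above
n = sum(freqs)                   # 29

# relative frequency of each class, as a percentage rounded to one place
rel = [round(100 * f / n, 1) for f in freqs]

print(rel)       # [3.4, 3.4, 6.9, 31.0, 34.5, 6.9, 13.8]
print(sum(rel))  # approximately 99.9 -- rounding, not a mistake
```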
Cumulative frequency
The cumulative frequency at any row of a frequency table is the sum of all frequencies up to and including that row.
E.g. Add a column for cumulative frequency to the table just constructed.
Class    Frequency    Cumulative frequency
21–25    1            1
25–29    1            2
29–33    2            4
33–37    9            13
37–41    10           23
41–45    2            25
45–49    4            29
A cumulative frequency graph is a line graph joining the points
(upper class boundary, cumulative frequency of class)
E.g. Make a cumulative frequency graph for the table just constructed.
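Cumulative frequencies are just running totals, so `itertools.accumulate` reproduces the new column directly:

```python
from itertools import accumulate

freqs = [1, 1, 2, 9, 10, 2, 4]   # frequency column from the table
cum = list(accumulate(freqs))    # running totals, row by row

print(cum)  # [1, 2, 4, 13, 23, 25, 29]
```

The last cumulative frequency always equals Σf, the total number of data points.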
Common distribution shapes
Note that (i) real data distributions need not match these shapes exactly, and (ii) many distributions are symmetrical, but only the one that is symmetrical with a single central hump is called symmetric in statistics.
Memorize:
5 Graphs and Displays
These not only present data to others, but help the statistician to understand the data so that he can choose the best ways to
analyze it.
Dot plots
→For univariate, numerical data
Use one dot for each data value. Make your dots all the same size.
E.g. Make a dot plot for the data set shown.
[Dot plot figure: one dot per data value, stacked above a number line running from 1 to 15.]
Bar charts
Pretty simple. The bars should be separated.
Pie charts
Pie charts are widely used for categorical data.
→For univariate data
Pie charts are best for showing the relationship of the size of each category to the whole.
Time series charts
A line graph of a quantity or quantities at regularly spaced times.
Scatter plots
→For bivariate, numerical data
Just graph the ordered pairs.
Very useful for seeing whether the data seem to be correlated.
E.g. Using the pie chart below, answer the following questions:
• Which major has the most students?
• Which major has the fewest students?
• Which major represents more than one third of all majors studied?
E.g. The pie charts below give some information about U.S. government finances in 1999. Use them to answer
the following questions:
(a) About how many dollars were paid in income tax for each dollar paid in corporate tax?
(b) Can you determine whether the U.S. government was taking in enough revenue from Social Security
payments to cover its outlays for that program? If so, what is the answer?
(c) If the government’s outlay was exactly equal to its income, did corporate and excise taxes together generate enough income to cover the cost of defense?
6 Mean, Median, Mode
Suppose you were forced to describe a quantitative data set using just one number. Which number would you use?
Measures of central tendency
A measure of central tendency is an average: a single value intended to be typical of the data set.
→These are for univariate data only
Notation: Number of data points in a population: N
Number of data points in a sample: n
Data: the number of points scored in RU Women’s Basketball games in the 2007–8 season:
52, 52, 54, 57, 58, 63, 63, 65, 67, 67, 70, 71, 72, 73, 75, 76, 77, 93
Note that this is population data.
Mean
→For quantitative data only
If the data are x1, x2, . . . , xk, then the mean is (x1 + · · · + xk)/k (for population or sample).
Notation: population mean: µ
sample mean: x̄
The mean has the same units as the data.
E.g. Find the mean of the basketball data.
Think of x̄ as an estimate of µ. For any one sample, it might happen that x̄ is far from µ, but it can be proved that if many
samples are taken from the same population, then we can expect the average value of x̄ to be very close to µ. This means
that x̄ is an unbiased estimator for µ.
Median
→for quantitative data only
The median Q2 is the number halfway up a sorted list of data.
To find the median:
1. Sort the data in ascending order.
2. For k data points, the middle position in the list is position (k + 1)/2.
3. If k is odd, then the median is the middle data value.
If k is even, then the median is the mean of the two middle values.
E.g. Find the median of the dataset 2, 7, 4, 3, 8.
E.g. Find the median of the basketball data.
Either the mean or the median may be called an average.
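A quick check of both averages for the basketball data, using Python's `statistics` module (a sketch; the module computes exactly the definitions above):

```python
import statistics

# Points scored in each game (population data, per the note above).
scores = [52, 52, 54, 57, 58, 63, 63, 65, 67, 67,
          70, 71, 72, 73, 75, 76, 77, 93]

mu = statistics.mean(scores)     # sum of the 18 values over 18
med = statistics.median(scores)  # even n, so the mean of the two middle values

print(round(mu, 2))  # 66.94
print(med)           # 67.0
```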
Mode
Sometimes you must give a typical value for a qualitative data set.
E.g. A car dealership sold 60 cars in the past week of which 42 were red, 12 were green, and 8 were blue. If
forced to describe the “average” color of car sold using one or two values or the phrase “no typical value,”
what response would you give?
What if 25 were red, 25 were green, and 10 were blue?
What if 20 were red, 20 were green, and 20 were blue?
If there is a data value that occurs most frequently, it is called the mode of the data. If there are two values that occur most
frequently, the data is bimodal and we report both values. If there are more than two values that occur most frequently,
we report no mode.
Note that the mode can be used for qualitative data.
E.g. Find the mode(s) of each dataset.
(a) a, a, b, c, c, c, d, d, e, f, g
(b) a, b, c, d, d, e, e, f, g
(c) a, b, c, c, d, d, e, f, f, g, g, h
(d) a, b, c, d, e
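The reporting rules above (one mode, bimodal, otherwise "no mode") take only a few lines with `statistics.multimode`, which returns every most-frequent value; datasets (a), (b), and (d) are reused as checks.

```python
from statistics import multimode

def describe_mode(data):
    """Report modes using the rules above: 1 mode, 2 modes (bimodal), else none."""
    modes = multimode(data)
    if len(modes) > 2:
        return "no mode"
    return modes

print(describe_mode(list("aabcccddefg")))  # ['c']        -- dataset (a)
print(describe_mode(list("abcddeefg")))    # ['d', 'e']   -- dataset (b), bimodal
print(describe_mode(list("abcde")))        # no mode      -- dataset (d), all tie
```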
Which measure of central tendency?
E.g. Annual compensation for ten RU employees with faculty rank (in $): 47,561, 49,687, 52,375, 53,626,
60,573, 63,716, 73,832, 96,666, 105,719, 508,299. Mean = $111,205.40, median = $62,144.50.
Which average is most representative of the center of this data? (If you were recruiting a prospective
faculty member, which would you feel most honest reporting?)
A data value is an outlier if it is extremely high or low compared to the rest of the data.
We’ll get a proper definition later.
Which measure of central tendency should you use?
• If the data set contains qualitative data, use the mode.
• If there is an outlier (or two) in a set of data, use the median.
• Use the mean in all other situations.
Sigma notation
Σxi means “sum all the numbers xi ”
E.g. ∑ xi for i = 1 to 3 means x1 + x2 + x3.

E.g. If x1 = 4, x2 = 3, and x3 = 1, then ∑ xi for i = 1 to 3 is 4 + 3 + 1 = 8.
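In Python, a sigma-notation sum is just `sum` over a list, which makes the example easy to check:

```python
xs = [4, 3, 1]   # x1, x2, x3 from the example above
print(sum(xs))   # 8 = 4 + 3 + 1
```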
Go to the course website and follow the “Stats on the TI84+” link. Print the whole document and keep it with your notes.
(The PDF is best for printing.)
7 Range, Variance, Standard Deviation
dispersion = spread = variability = how spread out the data is
→for univariate, quantitative data only
E.g. The example datasets below all have the same mean and median, namely, 7. It’s the “spread” or “variability” that’s different: the range of the data points and how far they tend to be from their center.
Data Set A
3, 5, 5, 7, 7, 7, 9, 9, 11
Data Set B
1, 3, 5, 7, 7, 9, 11, 13
Data Set C
−17, −17, −17, −10, −10, 5.0, 5.3, 5.6, 5.7, 5.8, 5.9, 6.1, 6.3, 6.4, 6.5, 6.7, 6.8, 6.9, 7.0, 7.0,
7.1, 7.2, 7.3, 7.5, 7.6, 7.7, 7.9, 8.1, 8.2, 8.3, 8.4, 8.7, 9.0, 24, 24, 31, 31, 31
Range
range = (max data value) – (min data value)
computed the same way for population & sample data
E.g. range of A : 11 − 3 = 8
range of B : 13 − 1 = 12
range of C : 31 − (−17) = 48
Variance
The variance will measure how far the data tends to be from its mean. We need to do a little work before we can define it. We
will define the sample variance s2 and then the population variance σ2 .
If x is a data point in a sample, then the deviation of x (from the mean) is x − x̄.
E.g. For Dataset A, x̄ = 7. Thus the deviation of the data point 3 from the mean is 3 − 7 = −4.
The deviation measures how far x is from the mean. You might think that we could use the average deviation to measure how
far the data tends to be from its mean, but there’s a problem.
Problem: the sum of all the deviations is always 0.
Solution: use the squares of the deviations to measure how far data points are from x̄.
The sample variance is

    s² = (sum of squared deviations)/(number of data points − 1) = ∑(x − x̄)²/(n − 1)
The denominator is n − 1 because it can be shown that if n is used, then for samples that are small compared to the size of the
population, the result is a biased estimator for σ2 , while using n − 1 makes it unbiased. Again, “unbiased” means that if many
samples are taken and s2 is computed for each one, then we can expect the average value of s2 to be very close to the population
variance.
E.g. Find the variance of the sample 1.5, 1.7, 1.9, 2.1, 2.3.

x      x − x̄    (x − x̄)²
1.5    −0.4      0.16
1.7    −0.2      0.04
1.9     0        0
2.1     0.2      0.04
2.3     0.4      0.16

x̄ = 9.5/5 = 1.9 and ∑(x − x̄)² = 0.4, so s² = 0.4/(5 − 1) = 0.1.
The variances of Datasets A, B, and C are 6, 16, and 125.44, respectively.
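`statistics.variance` computes exactly this n − 1 sample variance, so it can confirm the worked example and Datasets A and B:

```python
import statistics

sample = [1.5, 1.7, 1.9, 2.1, 2.3]
print(statistics.variance(sample))  # ~0.1, as in the worked example

A = [3, 5, 5, 7, 7, 7, 9, 9, 11]
B = [1, 3, 5, 7, 7, 9, 11, 13]
print(statistics.variance(A))  # 6.0
print(statistics.variance(B))  # 16.0
```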
The population variance is σ² = ∑(x − µ)²/N.
The units of the variance are the units of the data squared.
Larger variance corresponds beautifully to greater variability, but the units are wrong. E.g., if the data points represent the
number of shoes in a man’s closet, then the units of the variance are “shoes squared.”
We can fix this.
Standard deviation
The standard deviation is the positive square root of the variance:
For population data: σ = √σ²
For sample data: s = √s²
E.g. The standard deviations of Datasets A, B, and C are (approximately) 2.45, 4, and 11.2, respectively.
The units of the std dev are the units of the data.
Interpret the std dev as giving, roughly, the average distance of a data point from the mean.
E.g. The table below shows the numbers of pairs of shoes in four men’s closets. Find the mean, median, range,
and standard deviation. Interpret the standard deviation.
Person           A     B     C     D
Pairs of shoes   12    4     8     7

x = # pairs of shoes

mean:
median:
range:

x     x − x̄    (x − x̄)²
12
4
8
7
               ∑(x − x̄)² =
How to find the mean, median, range, pop. and sample std dev on the TI: all online.
Chebyshev’s Theorem
The standard deviation is computed from the data. If the data is bunched together, the standard deviation will be small; if the
data is spread out, the standard deviation will be large. But even though the standard deviation changes with the data, because
of the way in which it is computed, a certain fraction of the data must be within two standard deviations of the mean, a different
fraction must be within three standard deviations, etc.
Chebyshev’s Theorem: No matter how the data are distributed, the portion of the data lying within k standard deviations of the mean (k > 1) is at least 1 − 1/k².
k    1 − 1/k²
2    1 − 1/2² = 3/4 = 75%     → at least 75% of the data lies within two std devs of the mean
3    1 − 1/3² ≈ 88.9%         → at least 88.9% lies within three std devs of the mean
4    1 − 1/4² = 93.75%
5    1 − 1/5² = 96%
...
→Chebyshev’s Theorem bounds how much of the data can lie far from the mean, with “far” measured in standard deviations.
Illustrations of Chebyshev’s Theorem for a few data sets
Chebyshev’s Theorem is actually very conservative. When we know more about a distribution, we can usually get much better
estimates.
z-score

The z-score of a data point x measures how far x is from the mean in units of the standard deviation:

    z = (x − µ)/σ

To understand this, note that the signed distance from x to µ on the number line is x − µ, and the number of lengths σ in a length x − µ is (x − µ)/σ.
E.g. A data set has mean 17.5 and standard deviation 6. What is the z-score of the data value 21.2?

    z = (x − µ)/σ = (21.2 − 17.5)/6 ≈ 0.6167
z-scores are useful in comparing populations which have similar probability distributions but different means and standard
deviations.
E.g. For a long time, the mean SATV score was 500 with a standard deviation of 100, and the mean score on
the verbal portion of the ACT was 18 with a standard deviation of 6. The distributions are the same for
the two tests. Which is better, a 630 on the SAT or a 25 on the ACT?
z-score of 630 = (630 − 500)/100 = 1.3
z-score of 25 = (25 − 18)/6 ≈ 1.17
The 630 is better because it is farther above average.
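The comparison as a tiny function; the means and standard deviations are the ones quoted in the example above:

```python
def z_score(x, mu, sigma):
    """Signed distance from the mean, in units of the standard deviation."""
    return (x - mu) / sigma

sat = z_score(630, 500, 100)  # 1.3
act = z_score(25, 18, 6)      # ~1.17

print(sat > act)  # True: the SAT score is farther above its average
```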
8 Quantiles, IQR, Five-number Summary, Box Plots
Quantiles are numbers that split an ordered list of numbers into parts each with approximately the same number of data
points.
The simplest is the median, which splits an ordered list into two parts.
Quartiles
About 1/4 of the data points are less than Q1.
About 1/2 of the data points are less than Q2 = the median.
About 3/4 of the data points are less than Q3.
To find these:
• List the data in ascending order.
• Find Q2.
• Q1 is the median of the lower half of the data set.
• Q3 is the median of the upper half of the data set.
If n is odd, don’t include the middle data point in either sublist.
E.g. Find Q1 , Q2 , and Q3 for the data set
2 3 3 4 7 8 9 9 11 11 11 11 11 12 13 14
E.g. If a data point x is chosen at random from any data set, then
Probability(x < Q1 ) =
Probability(Q1 < x < Q3 ) =
Five-number summary of a data set
(min, Q1 , Q2 , Q3 , max)
E.g. Construct the 5-number summary for the data set
2 3 3 4 7 8 9 9 11 11 11 11 11 12 13 14
Box plots (boxplots, box-and-whisker plots)
Just a graph of a 5-number summary:
min    Q1    Q2    Q3    max
E.g. Construct the box plot for the data in the previous example.
Interquartile range
IQR = Q3 − Q1
The values Q1 − 1.5(IQR) and Q3 + 1.5(IQR) are called inner fences.
A data value x is a suspected outlier if it is outside the inner fences, i.e., if it is either less than Q1 − 1.5(IQR) or greater
than Q3 + 1.5(IQR).
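A sketch of the whole pipeline (five-number summary, IQR, and inner fences), using the median-of-halves method described above, with the middle point excluded from both halves when n is odd:

```python
def median(sorted_xs):
    n = len(sorted_xs)
    mid = n // 2
    return sorted_xs[mid] if n % 2 else (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

def five_number_summary(xs):
    xs = sorted(xs)
    n = len(xs)
    # lower/upper halves; for odd n the middle point is in neither half
    lower, upper = xs[:n // 2], xs[(n + 1) // 2:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

data = [2, 3, 3, 4, 7, 8, 9, 9, 11, 11, 11, 11, 11, 12, 13, 14]
mn, q1, q2, q3, mx = five_number_summary(data)
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print((mn, q1, q2, q3, mx))  # (2, 5.5, 10.0, 11.0, 14)
print(fences)                # (-2.75, 19.25) -- no suspected outliers here
```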
Percentiles
We won’t compute these—it’s rather complicated—but we’ll learn to interpret them.
Percentiles apply to a list of numerical data arranged in ascending order. They split the list into 100 approximately equal
parts.
Notation: P1 , P2 , . . . , P99
About n percent of the data is to the left of Pn .
E.g. Consider the cumulative frequency graph of SAT scores at a particular school.
(a) What test score corresponds to the 70th percentile?
(b) About what percentage of all test-takers got a score higher than 1200?
(c) If a test-taker is chosen at random, what’s the probability that his score is less than 1200?
For each total SAT score x, the corresponding y is the percent of students receiving that score
or less.
9 Probability
When an experiment is performed, an outcome is observed. The sample space for the experiment is the set of all possible
outcomes. A set of outcomes is an event .
An experiment must have an outcome. Nevertheless we regard the empty set as an event—the impossible event.
E.g. Roll a die once, observe the number of pips on the top face. Possible outcomes: {1, 2, 3, 4, 5, 6}
Some possible events: {}, {1}, {5}, {2, 4, 6}, {1, 2, 3, 4, 5, 6}
An event occurs if, when the experiment is performed, the outcome is in that event.
E.g. Roll one die. Let E = {get an even number}, F = {get a number less than 4}.
• If you get a 1, then only F has occurred.
• If you get a 4, then only E has occurred.
• If you get a 2, both E and F have occurred.
• If you get a 5, neither E nor F has occurred.
In order to model what happens in the physical world, we accept the following rules about probability:
• P({}) = 0
• P(sample space) = 1
• If x and y are different outcomes, then P(x or y) = P(x) + P(y)

These rules imply that if x1, x2, . . . , xn are all the possible outcomes, then P(x1) + P(x2) + · · · + P(xn) = 1.
Now roll one fair die. There are six outcomes; the sample space = {1, 2, 3, 4, 5, 6}, and we must have P(1) + P(2) + · · · + P(6) = 1. All the outcomes are equally likely, so if P(1) = p, then P(2) = p, etc. Thus we have 6p = 1, or p = 1/6. That is, each of the outcomes has probability 1/6.

In general, if a sample space consists of n equally likely outcomes, then the probability of any one outcome is 1/n.
E.g. Roll two dice, one red and one green. Then the sample space consists of 36 equally likely outcomes:

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

Thus each outcome has probability 1/36.
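Enumerating this sample space confirms the count, and theoretical probabilities then reduce to counting outcomes. The event "sum of the dice is 7" is an illustrative addition, not from the notes:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # (red, green) pairs
print(len(outcomes))  # 36

# P(sum of the dice is 7) = (# outcomes in the event) / (total # outcomes)
event = [o for o in outcomes if sum(o) == 7]
print(Fraction(len(event), len(outcomes)))  # 1/6
```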
Theoretical probability
When there are only finitely many outcomes and all are equally likely, we define the theoretical probability of an event
E by:
P(E) = (# outcomes in the event)/(total # possible outcomes)
— Can only be used in ideal/mathematical situations
E.g. A bag holds ten identical marbles numbered 1, 2, 3, . . . , 10. One is drawn out. What’s P(get marble #5)?
• all outcomes equally likely
• one outcome in event: “get #5” = {5}
• ten outcomes total
• P(get #5) = 1⁄10
E.g. A fair die is rolled once. What’s P(get a 5)?
• all outcomes equally likely
• one outcome in event “get a 5”
• six outcomes total
• P(get a 5) = 1⁄6
E.g. A fair die is rolled once. What’s P(get an odd number)?
• all outcomes equally likely
• three outcomes in event: {1, 3, 5}
• six outcomes total
• P(get an odd number) = 3⁄6 = 0.5
E.g. The probability of drawing an ace from a standard deck is 4⁄52.
In the real world, “the probability of rolling a 1 with a fair die is 1⁄6” does not mean that exactly one in every six rolls will be a 1. It means that if we roll enough times, very close to 1⁄6 of the rolls will be 1s.
Empirical probability
Suppose that an experiment is performed repeatedly under very similar conditions. Then we define the empirical probability of an event E by:
P(E) = (number of times E occurs)/(total # of observations)
I.e., an empirical probability is a relative frequency.
E.g. The table below shows the times at which Jill has gotten up on ten different school mornings. What’s the
probability that she’ll get up before 7:05 on any given school morning?
Day     1      2      3      4      5      6      7      8      9      10
Time    7:01   7:03   7:01   7:00   7:10   7:07   7:03   7:01   7:15   7:06
event: Jill gets up before 7:05
# outcomes in which event occurred: 6
total # observations = 10
prob = 6⁄10 = 0.6
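The same relative-frequency computation in code, with the times converted to minutes past 7:00 so they compare numerically:

```python
# Wake-up times from the table, as minutes past 7:00.
times = [1, 3, 1, 0, 10, 7, 3, 1, 15, 6]

before_705 = sum(t < 5 for t in times)  # mornings Jill was up before 7:05
print(before_705, "/", len(times))      # 6 / 10
print(before_705 / len(times))          # 0.6
```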
The Law of Large Numbers says that as an experiment is repeated more and more times, the empirical probability of an event approaches its theoretical probability.
Subjective probability
Subjective probability is a probability judgment.
E.g. The doctor says there’s a 60% chance John will survive the surgery.
Subjective probability is used when an experiment cannot be repeated or in a situation too complicated to be directly compared
to other cases.
Properties of probability
All probabilities are between 0 and 1.
E impossible: P(E) = 0
E just as likely to happen as not: P(E) = 1/2
E certain: P(E) = 1 (or 100%)

If the probability that an event E will occur is p, then the probability that E will not occur is 1 − p.
Mutually exclusive events
Events E and F are mutually exclusive if they cannot both happen at the same time.
E.g. Roll a single die. Let E = “get an even number,” F = “get an odd number.” Then E and F are mutually
exclusive.
E.g. Roll one red die and one green die. Let E = “get an even number on the red,” F = “sum of dice is an odd
number.” Then E and F are not mutually exclusive, because they can both happen at the same time (e.g.,
if you rolled a 2 on the red and a 1 on the green).
If E and F are mutually exclusive events, then
P(E occurs or F occurs) = P(E) + P(F)
“Or” here is the “inclusive or”: it means “one or the other or both.”
Note that mutual exclusivity is not defined in terms of probability. It is a set property: E and F are mutually exclusive if there
is no outcome that is in both.
Independent events
Two events are statistically independent (or just independent) if the occurrence or non-occurrence of either of them does
not affect the probability that the other will occur.
E.g. Dependent or independent?
Roll one red die, one green.
(a) E = get a 1 on red, F = get a 6 on green.
(b) E = get a total of 12, F = get a 6 on green.
True fact: events E, F are independent if and only if
P(both E and F occur) = P(E)P(F)
E.g. Roll one die and flip one fair coin. What’s the probability of getting heads and a 4?

Solution 1. All possible outcomes:
{(1, H), (2, H), (3, H), (4, H), (5, H), (6, H), (1, T), (2, T), (3, T), (4, T), (5, T), (6, T)}
There are 12 possible outcomes, all equally likely. The event “get heads and a 4” contains the single outcome (4, H), so its probability is 1/12.

Solution 2. P(H) = 1/2, P(4) = 1/6, and the two events are independent, so P(E and F) = (1/2) · (1/6) = 1/12.
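Both solutions agree, which can be verified exactly with fractions:

```python
from fractions import Fraction
from itertools import product

# Solution 1: count outcomes in the enumerated sample space.
space = list(product(range(1, 7), "HT"))
p_count = Fraction(sum(o == (4, "H") for o in space), len(space))

# Solution 2: multiply the probabilities of the independent events.
p_multiply = Fraction(1, 2) * Fraction(1, 6)

print(p_count, p_multiply)  # 1/12 1/12
```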
In general, probabilities should be rounded to five decimal places. (We’ll often break this rule.)
Note that independence is defined in terms of probability.
10 Random Variables; Discrete Probability Distributions
A random variable is a variable that takes on numerical values associated with the outcome of an experiment.
E.g. Choose an apple at random and let x = weight of the apple.
Choose a box of blueberries at the supermarket, let x = the number of (whole) blueberries in the box.
Choose a sample of ten apples at random, and let x = the mean weight for that sample.
Continuous vs. Discrete
A r.v. is discrete if all of its (possibly infinitely many) possible values can be listed, at least in principle.
A r.v. is continuous if it can take on any value in some interval of real numbers.
Recall: an interval is the set of all real numbers between two real numbers.
To figure out whether a r.v. is discrete or continuous, think of plotting all its possible values on a number line. If there
would be gaps between values, it’s discrete; otherwise, it’s continuous.
E.g. Discrete or continuous? What are the possible values?
(a) x = number of (whole, unbroken) eggs in one carton
(b) y = weight of one carton of eggs, in ounces
(c) time since the last customer arrived (at some counter), in seconds
(d) number of stocks in the DJIA whose share prices closed higher than they opened yesterday
Probability distributions
If x is a discrete r.v., its probability distribution is a table or function that gives, for each possible value of x, the probability
that x takes on that value.
A discrete probability distribution is usually a table of relative frequencies.
If x is a continuous r.v., its probability distribution is a function from which it is possible to compute the probability that
the value of x is in any given interval. (We’ll come back to these.)
E.g. A psychological test was administered to 150 people. Possible final scores were 1, 2, 3, 4, and 5. Results:
x (score)    1     2     3     4     5
Frequency    24    33    42    30    21
→Note that the r.v. x takes on a numerical value for each person who took the test.
(a) Construct a discrete probability distribution and relative frequency histogram.
(b) What is the probability that a randomly chosen participant’s score was (i) 3? (ii) 4 or 5?
x (score)    1               2       3       4       5
P(x)         24/150 = 0.16   0.22    0.28    0.20    0.14
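The distribution and part (b) of the example can be checked with exact fractions:

```python
from fractions import Fraction

freqs = {1: 24, 2: 33, 3: 42, 4: 30, 5: 21}  # scores -> frequencies
n = sum(freqs.values())                       # 150 people
probs = {x: Fraction(f, n) for x, f in freqs.items()}

print(float(probs[3]))             # (b)(i)  0.28
print(float(probs[4] + probs[5]))  # (b)(ii) 0.34
```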
Mean of a discrete r.v. or probability distribution
“The mean of a discrete probability distribution” and “the mean of a r.v. with that distribution” are the same thing.
If the possible values of a r.v. x are x1 , x2 , . . . , and if the probabilities of these are p1 , p2 , . . . respectively, then the mean
of x is
E[x] = p1 x1 + p2 x2 + · · ·
E[x] is often called the expected value of x.
E.g. Find the mean of the following frequency distribution.
Value    1    2    3    4
Freq    12   10    8    6
Notice that Σ f = 36. Next, get the relative frequencies = the probabilities:
Value                    1       2       3      4
Rel freq = Probability   12/36   10/36   8/36   6/36
Extend the table with a column for (Value × Prob) and sum these:
Value          1       2       3       4
Prob           12/36   10/36   8/36    6/36
Value × Prob   12/36   20/36   24/36   24/36

Σ(Value × Prob) = 80/36
The mean of the probability distribution is 80/36 ≈ 2.22.
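The computation above is easy to check by machine. The sketch below (mine, not part of the notes) builds the probabilities from the frequency table and applies the E[x] formula, using exact fractions so the answer comes out as 80/36.

```python
# A sketch (not from the notes): the mean of a discrete probability
# distribution, computed from the frequency table above.
from fractions import Fraction

values = [1, 2, 3, 4]
freqs = [12, 10, 8, 6]

total = sum(freqs)                            # Σf = 36
probs = [Fraction(f, total) for f in freqs]   # relative frequencies = probabilities

# E[x] = p1*x1 + p2*x2 + ...
mean = sum(p * x for p, x in zip(probs, values))
print(mean)   # → 20/9, i.e. 80/36
```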
Independent random variables
Two r.v.s are independent if neither variable’s probability distribution depends on the value of the other.
The r.v. equivalent of “E occurs” is “x takes on a value”.
Our method of determining independence is once again “educated guess.”
E.g. Choose a person at random.
x = number of atoms in the person’s left hand
y = the person’s SAT score
E.g. Choose a day from the year 2014.
x = temperature on that day
y = snowfall on that day
11 The Binomial Distribution
A binomial experiment is made up of smaller experiments called trials.
An experiment is binomial if
• It consists of a fixed number of independent trials (called Bernoulli trials).
• The only possible outcomes for each trial are success (S) and failure (F).
• The probability p of success is the same in each trial.
When working with binomial experiments, we let n be the number of trials, set q = 1 − p = P(F), and let x = number of
successes.
E.g. An experiment consists of flipping one fair coin once. Success is defined as getting heads. What are all
the possible outcomes? What are n, p, and q?
There’s only one trial, so n = 1, and the possible outcomes of the experiment are just H and T. The coin is fair, so p = 1/2, whence q = 1 − p = 1/2.
E.g. An experiment consists of flipping one fair coin twice. Success is defined as getting heads. What are all
the possible outcomes? What are n, p, and q?
The possible outcomes on any one trial are just H and T, but this is not what is meant. The possible outcomes of the whole experiment are {HH, HT, TH, TT}. There are two trials, so n = 2. The coin is fair, so p = 1/2, whence q = 1 − p = 1/2. (Remember that success is defined per trial, not for the whole experiment, so only the outcomes of one trial are important when computing p.) Finally, the possible values of x are 0, 1, and 2.
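The two-flip experiment is small enough to check by brute-force enumeration. This sketch (mine, not from the notes) lists every outcome of the whole experiment and tallies the number of successes (heads) in each.

```python
# A sketch (not from the notes): enumerate the outcomes of the two-flip
# experiment and compute P(x = k) for each possible number of heads k.
from itertools import product

outcomes = list(product("HT", repeat=2))   # all outcomes of the whole experiment
print(outcomes)                            # ('H','H'), ('H','T'), ('T','H'), ('T','T')

# x = number of successes (heads) in an outcome
for k in range(3):
    matching = [o for o in outcomes if o.count("H") == k]
    print(k, len(matching) / len(outcomes))   # P(x = k) for a fair coin
```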
E.g. Each trial of an experiment consists of flipping a coin and throwing a die. The experiment itself consists
of two trials. List all possible outcomes of one trial, and all possible outcomes of the whole experiment.
Possible outcomes of one trial:
H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6
Possible outcomes of the experiment: every ordered pair of one-trial outcomes, 144 in all:
(H1, H1), (H1, H2), (H1, H3), …, (T6, T5), (T6, T6)
Examples of binomial distributions
Don’t worry about where the probabilities in these examples come from just yet.
E.g. A fair coin is flipped five times. S = get heads.
Here n = 5, p = 1/2, q = 1/2. The possible values of x are 0, 1, 2, 3, 4, and 5. Probability distribution:

x      0         1         2        3        4         5
P(x)   0.03125   0.15625   0.3125   0.3125   0.15625   0.03125

(Experiment: flip coin 5 times. Trial: one flip. p = 1/2, q = 1/2, n = 5. x is a random variable; the value of x = number of successes.)
E.g. Same for 10, 20, 50 flips:
Ain’t that purty.
E.g. A fair, six-sided die is thrown ten times.
Success = get a 1 or 3
Here n = 10, p = 2/6, q = 4/6. Distribution (to four decimal places) and histogram:

x      0        1        2        3        4        5        6        7        8        9        10
P(x)   0.0173   0.0867   0.1951   0.2601   0.2276   0.1366   0.0569   0.0163   0.0030   0.0003   0.0000
Same, for 50 throws:
Now let’s see where those probabilities came from.
E.g. A binomial experiment with n = 3 trials is performed.
Possible outcomes: FFF, FFS, FSF, SFF, FSS, SFS, SSF, SSS
Because the trials are independent, we have, for example,
P(FFS) = P(F)P(F)P(S) = (1 − p)(1 − p)p = (1 − p)²p
In fact:
Outcome   Probability
FFF       (1 − p)³
FFS       p(1 − p)²
FSF       p(1 − p)²
SFF       p(1 − p)²
FSS       p²(1 − p)
SFS       p²(1 − p)
SSF       p²(1 − p)
SSS       p³
Remember that x = the number of successes in this experiment. The probability distribution of x is:
x      0          1            2            3
Prob   (1 − p)³   3p(1 − p)²   3p²(1 − p)   p³
x is said to be binomially distributed, sometimes written x ∼ B(n, p).
What’s E[x]? It’s
E[x] = 0 · (1 − p)³ + 1 · 3p(1 − p)² + 2 · 3p²(1 − p) + 3 · p³ = 3p
It’s possible to show that for a binomial experiment with n trials, E[x] = np. It’s also possible to show that the standard deviation is σ = √(npq).
This explains the locations of the peaks in the histograms.
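Both facts are easy to confirm numerically. This sketch (mine, not from the notes; the value p = 0.4 is an arbitrary choice) builds the n = 3 distribution from the binomial formula and checks that its mean is np and its variance is npq.

```python
# A sketch (not from the notes): check E[x] = np and σ² = npq for a small
# binomial distribution, here n = 3 with an arbitrarily chosen p = 0.4.
import math

n, p = 3, 0.4
q = 1 - p

# P(x = k) = C(n, k) p^k q^(n-k)
probs = [math.comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(probs))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))

print(mean, math.sqrt(var))   # should equal n*p and sqrt(n*p*q)
```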
E.g. A certain surgery is successful in 80% of cases. The surgery is performed four times. What’s the probability of exactly two successes?
Remember that x = 2 when exactly two successes occur. This can happen in any of the following ways:
SSFF, FSSF, FFSS, SFSF, SFFS, FSFS
Each of these outcomes has probability p²q² = (0.8)²(0.2)² = 0.0256. They’re mutually exclusive, so the probability that one of them occurs is
0.0256 + 0.0256 + · · · + 0.0256 = 6 · 0.0256 = 0.1536
Now, what’s the probability of at least two successes?
Using the same technique, we can compute that the probability of exactly three successes is 0.4096 and the probability
of exactly four successes is 0.4096, so the probability of at least two successes is 0.1536 + 0.4096 + 0.4096 = 0.9728.
But what if 40 surgeries were performed and we wanted the probability of at least 35 successes?
Using the TI, we find that the probability of at least 35 successes out of 40 surgeries is about 0.1613.
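The calculator result can be cross-checked by summing the binomial formula directly. This sketch (mine, not part of the notes) adds up P(x = k) for k = 35 through 40 with n = 40 and p = 0.8.

```python
# A sketch (not from the notes): the exact binomial tail probability
# P(x >= 35) for n = 40 surgeries, each with success probability p = 0.8.
import math

n, p = 40, 0.8
q = 1 - p

# P(x >= 35) = sum of C(n, k) p^k q^(n-k) for k = 35..40
tail = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(35, n + 1))
print(round(tail, 4))   # → 0.1613
```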
12 The Normal Distribution, Part I
The image below shows a relative frequency histogram for 2781 numbers associated with the outcome of some experiment. The mean is 1600.
The height of a bar is h/2781 when there are exactly h numbers in the interval spanned by the bar. Suppose the total area of the bars is A.
Divide the bars up into equal-sized boxes, one for each number:
Notice that the area of each box must be A/2781.
Now choose a number x at random. If 101 boxes are to the left of 1300, what’s P(x ≤ 1300)?

P(x ≤ 1300) = (number of numbers no greater than 1300)/(total number of numbers) = 101/2781
Because each little box has the same area, we could also find P(x ≤ 1300) by considering areas:
P(x ≤ 1300) = (area of boxes at or to the left of 1300)/(total area of boxes) = (101 · A/2781)/A = 101/2781

Really, we don’t need the boxes or the rectangles, because the probability that a randomly chosen value of x is less than 1300 is just the fraction of the total area which is to the left of 1300 under the curve made up of the tops of the rectangles.
The curve along the tops of the rectangles serves as a probability distribution function (pdf): it lets us compute the
probability that a randomly chosen value of x is in any given interval. We’ll insist, though, that things have been fixed
up so that the total area is 1, so that we don’t have to multiply and divide by it.
This allows us to answer questions like, “Where are there lots of values and where are there not very many?” and “If I
choose a value at random, what’s the probability that it’s between 13 and 14?”
If, instead of the tops of bars, a pdf is a smooth curve, we treat it in the same way: the total area under the curve is
1, and when we want a probability, we compute an area. Actually finding areas in the continuous case is technically
complicated, but we won’t learn that part; we’ll use the calculator. And we’ll work only with the two most important
families of pdfs.
Don’t forget:
Probabilities are areas
Areas are probabilities
We’ll be most interested in analyzing data that is heaped up around its mean—data that is approximately “normally distributed.”
The normal distribution:

y = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))
Properties of the normal distribution
• mean = median = x-coordinate of highest point
• curve changes shape at inflection points, i.e., above x = µ − σ and x = µ + σ
• total area under curve is 1
• data values/values of random variable on x-axis
• height above any particular point unimportant
• curve is always above x-axis but gets closer and closer as x → ±∞
• almost all the area is above [µ − 3.5σ, µ + 3.5σ]
• std dev controls width/height of central hump
We’ll write x ∼ N(µ, σ) to mean that the r.v. x is normally distributed with mean µ and std dev σ. (Not everyone uses this
in the same way.)
Even though there are infinitely many different normal curves, one for each possible µ and σ, we just talk about “the normal
curve” because they’re all the same in the ways that interest us. Similarly, people say “the normal distribution.”
Influence of std dev on shape of normal curve:
True fact: If a r.v. x is normally distributed, then P(x < c) = the area under the normal curve to the left of c
Area = normalcdf(−10 000, c, µ, σ)   (TI: DISTR menu)
It’s really the area from −∞ to c, but calculators can’t manage −∞. We’ll use −10 000 for −∞ and 10 000 for +∞.
Also, P(c < x < d) = area under curve between x = c and x = d:

Area = normalcdf(c, d, µ, σ)   (TI: DISTR menu)

You could get P(c < x < d) by subtracting areas:

P(c < x < d) = area under curve between x = c and x = d
             = (area left of d) − (area left of c)
             = P(x < d) − P(x < c)
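For readers without the calculator, the same computation is available in Python’s standard library. This sketch (the function name mimics the TI command but is my own) implements normalcdf as a difference of cdf values.

```python
# A sketch (not from the notes): a Python stand-in for the TI's
# normalcdf(a, b, µ, σ), using the standard library's normal distribution.
from statistics import NormalDist

def normalcdf(a, b, mu=0.0, sigma=1.0):
    """Area under N(mu, sigma) between a and b, i.e. P(a < x < b)."""
    d = NormalDist(mu, sigma)
    return d.cdf(b) - d.cdf(a)

print(round(normalcdf(-10_000, 0), 4))   # → 0.5
print(round(normalcdf(0, 1), 4))         # → 0.3413
```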
E.g. Suppose that x is normally distributed with mean 12 and standard deviation 0.3, and that an x-value is
chosen at random. Then
(a) P(x ≤ 12) =
(b) P(x < 12.2) =
(c) P(11.9 < x < 12.1) =
E.g. John has measured the actual amount of soda in many 12-oz bottles and found that it is normally distributed with mean µ = 12 oz and std dev σ = 0.3 oz.
If a bottle is chosen at random, then
(a) P(it contains no more than 12 oz) =
(b) P(it contains less than 12.2 oz)=
(c) P(it contains between 11.9 and 12.1 oz)=
What is the 75th percentile?
Finally, P75 is the point such that, if a bottle is chosen at random, the probability that the bottle contains less than P75
ounces of soda is 0.75.
When you know an area or probability and need x, use the invNorm function:
P75 ≈ invNorm(0.75, 12, 0.3) ≈ 12.2   (TI: DISTR menu)
normalcdf and invNorm
When you know µ and σ and you need a probability:
P(a < x < b) = normalcdf(a, b, µ, σ)
When you know a probability (or area) p and you need an x-value c such that P(x < c) = p:
c = invNorm(p, µ, σ)
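The inverse direction is also in the standard library. This sketch (function name is mine, mirroring the TI command) reproduces the 75th-percentile bottle example with µ = 12 and σ = 0.3.

```python
# A sketch (not from the notes): the TI's invNorm as a Python function,
# checked on the soda-bottle example (µ = 12, σ = 0.3, p = 0.75).
from statistics import NormalDist

def invnorm(p, mu=0.0, sigma=1.0):
    """The x-value c such that P(x < c) = p under N(mu, sigma)."""
    return NormalDist(mu, sigma).inv_cdf(p)

print(round(invnorm(0.75, 12, 0.3), 2))   # → 12.2
```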
The normal can approximate the binomial
When a healthy adult is given the cholera vaccine, the probability that he will contract cholera if exposed is known to be
0.15. Five hundred vaccinated tourists, all healthy adults, were exposed while on a cruise. What is the probability that
more than 25 will contract the disease?
This is clearly a binomial experiment with p = 0.15, q = 0.85, and n = 500. We certainly don’t want to compute P(x > 25) directly. Fortunately, any time both np and nq are greater than 5, the binomial distribution is very close to a normal distribution with µ = np and σ = √(npq).
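This sketch (mine, not from the notes) sets up the approximating normal distribution for the cholera example: n = 500 and p = 0.15 give µ = np = 75 and σ = √(npq) ≈ 7.98, and "more than 25" becomes an area to the right of 25.5 (the 0.5 shift is a continuity correction).

```python
# A sketch (not from the notes): the normal approximation to the binomial
# for the cholera example, n = 500 and p = 0.15.
import math
from statistics import NormalDist

n, p = 500, 0.15
q = 1 - p
mu, sigma = n * p, math.sqrt(n * p * q)   # µ = np, σ = √(npq)

# P(more than 25 contract the disease) ≈ area to the right of 25.5
prob = 1 - NormalDist(mu, sigma).cdf(25.5)
print(mu, round(sigma, 2), prob)
```

Since the mean is 75, a count of more than 25 is nearly certain; the approximation returns a probability essentially equal to 1.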
13 The Normal Distribution, Part II
True fact: the area under any normal curve between µ and µ + σ is the same as under any other.
normalcdf(0, 1, 0, 1) ≈ 0.3413
normalcdf(5, 5.3, 5, 0.3) ≈ 0.3413
This is a mathematical result the proof of which is beyond us.
This is true for any number of σs. E.g., the area between µ and µ − 1.25σ under one normal curve is the same as under
another:
normalcdf(µ1 − 1.25σ1 , µ1 , µ1 , σ1 ) ≈ 0.3944
normalcdf(µ2 − 1.25σ2 , µ2 , µ2 , σ2 ) ≈ 0.3944
This means that if we measure the distance of a point from the mean in units of the standard deviation, we can use any
normal curve to find probabilities.
How do we find the distance from µ to c in units of σ? It’s the z-score of c:
z = (c − µ)/σ
Since we can use any normal curve to find the probability that a point x is in an interval under any other normal curve,
why not use the one with µ = 0 and σ = 1?
Standard Normal Distribution N(0, 1)
Horizontal axis is usually labeled z
Sometimes called the z-distribution
Because σ = 1, the distance from µ = 0 to any point z in std devs is z
Suppose c is any number on the x-axis under the normal curve with mean µ and standard deviation σ.
Then there’s a point w on the z-axis such that P(x < c) = P(z < w).
The point w is w = (c − µ)/σ, the z-score of c.
In fact, w is the z-score of c if and only if P(x < c) = P(z < w):
Another way to say this: if x ∼ N(µ, σ), then the variable (x − µ)/σ is N(0, 1).
Suppose x ∼ N(100, 16) and we want to know P(x < 124). One way to proceed:
1. Measure how many std devs 124 is from the mean: (124 − 100)/16 = 1.5
2. “Transform to z”: on the z-axis, the point that is 1.5 standard deviations from the mean is just 1.5 :
3. Then the area under N(100, 16) between −∞ and 124 is equal to the area under N(0, 1) between −∞ and 1.5 :
P(x < 124) = P(z < 1.5) = normalcdf(−10 000, 1.5, 0, 1) ≈ 0.933
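The equality of the two areas can be confirmed numerically. This sketch (mine, not from the notes) computes both sides of P(x < 124) = P(z < 1.5) for x ∼ N(100, 16).

```python
# A sketch (not from the notes): transforming to z preserves probability.
# P(x < 124) under N(100, 16) equals P(z < 1.5) under N(0, 1).
from statistics import NormalDist

p_x = NormalDist(100, 16).cdf(124)
p_z = NormalDist(0, 1).cdf((124 - 100) / 16)   # the z-score is 1.5

print(round(p_x, 3), round(p_z, 3))   # → 0.933 0.933
```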
E.g. Suppose that x ∼ N(10, σ) and that the z-score of the x-value 8.7 is −0.95. What is P(x > 8.7)?
The Empirical Rule
For normally distributed data,
About 68% of the data will lie within 1 standard deviation of the mean
About 95% of the data will lie within 2 standard deviations of the mean
About 99.7% of the data will lie within 3 standard deviations of the mean
E.g. In a recent study, the heights of American women 20–29 years old were found to be normally distributed
with mean x̄ = 64 inches and std dev s = 2.71 inches. The Empirical Rule says that
About 68% of the women will be between 61.29 and 66.71 inches tall
About 95% of the women will be between 58.58 and 69.42 inches tall
About 99.7% of the women will be between 55.87 and 72.13 inches tall
A rule of thumb
An interval centered at the mean and four standard deviations wide will, in practice, contain nearly all the data (if it’s
approximately normally distributed). Because of this, a common quick calculation takes the range to be four standard deviations, which says that the standard deviation of a data set will be approximately range/4.
Using the Empirical Rule to test whether data is normally distributed
Compute x̄ and s for your data
See whether all three of the following are true:
approximately 68% falls in x̄ ± s
approximately 95% falls in x̄ ± 2s
approximately 99.7% falls in x̄ ± 3s
If not, it’s probably not from a normally distributed population.
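The three Empirical Rule percentages come straight from the normal cdf, and can be reproduced in a few lines. This sketch (mine, not part of the notes) computes the probability of landing within k standard deviations of the mean for k = 1, 2, 3.

```python
# A sketch (not from the notes): where the Empirical Rule numbers come from.
# P(within k std devs of the mean) = cdf(k) - cdf(-k) for the standard normal.
from statistics import NormalDist

z = NormalDist()   # standard normal, µ = 0 and σ = 1
for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)
    print(k, round(within, 4))   # 0.6827, 0.9545, 0.9973
```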
15 The Normal Distribution, Part III
Quantiles for continuous distributions
For a discrete r.v. x, the quartile Q1 was defined as a number such that about 25% of the values of x were less than Q1 .
For a continuous r.v. x, the quartile Q1 is defined to be a number such that P(x < Q1 ) = 0.25.
Similarly for Q2 , Q3 , and percentiles.
E.g. A r.v. x is normally distributed with mean 38 and std dev 10. What is the third quartile Q3 ?
Q3 = invNorm(0.75, 38, 10) ≈ 44.74
E.g. The number of beet seeds in a one-ounce box is normally distributed with mean 1600 and std dev 114.
What is the 33rd percentile P33 ?
P33 = invNorm(0.33, 1600, 114) ≈ 1549.85
E.g. Scores on a civil service exam are normally distributed with mean µ = 75 and std dev σ = 6.5. What is
the lowest score you can earn and still be in the top 5%?
If we let the r.v. x represent scores, we are being told that x ∼ N(75, 6.5). We need the score P95 :
P95 = invNorm(0.95, 75, 6.5) ≈ 85.69
Cumulative Probability Distribution
We think of the cumulative distribution function in terms of its graph.
For a numerical data set, we can graph the cumulative relative frequency, i.e., what fraction of the data is to the left of
each x:
In general, suppose u is a r.v. For each point x on the x-axis, the y-value on the graph of the cumulative distribution
function of u is P(u < x).
That is, for a given x, the y-coordinate on the graph of the cdf is the fraction of the total area which is to the left of x
under the pdf for u.
The graph of a cdf is called an ogive.
E.g. Let z ∼ N(0, 1). Tabulate a few values of the cdf of z. Sketch its graph.
x = point on z-axis   Fraction of area to the left of x
−3.49                 P(z < −3.49) ≈ 0.00024
−2.0                  P(z < −2.0)  ≈ 0.023
−1.0                  P(z < −1.0)  ≈ 0.159
0.0                   P(z < 0.0)   ≈ 0.5
1.0                   P(z < 1.0)   ≈ 0.84
2.0                   P(z < 2.0)   ≈ 0.977
3.49                  P(z < 3.49)  ≈ 0.99976
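The tabulated cdf values can be regenerated directly. This sketch (mine, not from the notes) evaluates the standard normal cdf at the same points as the table above.

```python
# A sketch (not from the notes): reproducing the table of cdf values
# for z ~ N(0, 1) with the standard library.
from statistics import NormalDist

z = NormalDist()
for x in (-3.49, -2.0, -1.0, 0.0, 1.0, 2.0, 3.49):
    print(x, round(z.cdf(x), 5))   # fraction of area to the left of x
```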
Ogive and Quantiles
E.g. Given the ogive below:
(a) Estimate Q2 .
(b) Estimate P33 .
The Normal Probability Plot
We often need a way to tell whether data is approximately normally distributed. A reasonable way to tell is the normal
probability plot.
After following this procedure, if the data seems not to lie approximately on a straight line, then it is probably not normal.
E.g. Some data sampled from a normal distribution and the TI’s normal probability plot for it:
90.4    98.5   100.9   78.5    55.7   108.3   80.2    78.0    81.6    90.8
76.8    92.4    72.2   75.8   108.6    91.6   96.9    96.7   112.5   109.0
94.2   122.2    81.5  113.1   100.7    92.7  105.1    85.8    94.1    88.4
16 Sampling Distributions and the Central Limit Theorem
We’re going to be very concerned with using the mean of a sample to estimate the mean of a population. In particular, we want
some way of estimating how far a sample mean is likely to be from the population mean.
Suppose we want to estimate the average height µ of a twenty-year-old American. The natural thing to do is to take a
reasonably large random sample, compute its mean x̄, and use x̄ as our estimate of µ. But if we took a different sample,
we’d get a different x̄. Which one should we use? Or should we use neither, but take the average of the two instead?
We’re in the position of the man with two watches who doesn’t know what time it is.
What we need to know is what happens if we take a bunch of different samples and compute their means. Do the sample
means mostly fall near the population mean µ, or are they spread out? Is their distribution predictable?
Here’s how we get at this. First, think of taking many samples, all of the same size. Next, instead of thinking of
each sample having its own x̄, think of defining a r.v. named x̄. Each time a new sample is chosen, the value of the
r.v. x̄ becomes the mean of that sample. If we call our samples Sample 1, Sample 2, and so on, then we can label the
corresponding values of x̄ as x̄1 , x̄2 , etc. If we plotted these on an x̄ axis, we’d expect them to cluster around µ:
In fact, not only do they cluster around µ, they’re normally distributed with mean µ, and we can give the standard
deviation exactly. The theorem that does it is called the Central Limit Theorem.
We want a little vocabulary before stating the theorem.
• The mean of the sample means is denoted µx̄
• The std dev of the sample means is denoted σx̄
• σx̄ is called the standard error of the mean
The distribution of sample means is called the sampling distribution of the sample mean.
C ENTRAL L IMIT T HEOREM. Suppose that samples of size n ≥ 30 are repeatedly drawn from a population with mean µ
and standard deviation σ. Then
1. The sampling distribution of the sample mean is approximately normal, and the approximation gets better as the sample size increases.
2. µx̄ = µ and σx̄ = σ/√n
3. If the population is itself normally distributed, then (1) and (2) are true even if n < 30.
You’re going to have to answer this question quite a lot:
Question: When can you apply the CLT?
Answer: When n ≥ 30 OR the population is approximately normal (or both).
We’ll call a sample large if n ≥ 30, small if n < 30.
The CLT rephrased: when n ≥ 30 or the population is approximately normal, then µx̄ = µ and σx̄ = σ/√n.
E.g. For residents of Snowburg, the average phone bill is $64. Bill amounts are normally distributed, and the
std dev is $9.
(a) What is the probability that a randomly chosen bill is less than $58?
If x represents the phone bill amount, then x ∼ N(64, 9), so the probability that a randomly
chosen bill is less than $58 is
P(x < 58) = normalcdf(−10 000, 58, 64, 9) ≈ 0.2525
(b) What is the probability that the mean of a randomly chosen sample of 36 phone bills is less than $58?
If x̄ represents the mean of a randomly chosen sample of size 36, then x̄ ∼ N(64, 9/√36) = N(64, 1.5), so the probability that the mean of a randomly chosen sample is less than $58 is
P(x̄ < 58) = normalcdf(−10 000, 58, 64, 1.5) ≈ 0.00003
E.g. A certain quantity is normally distributed with mean 42 and std dev 12. You take a sample of size n = 9.
What is the probability that the mean of your sample is between 38 and 46?
The sample is small, but because the population is known to be normally distributed, the sample means are normally distributed with mean 42 and std dev 12/√9 = 4. So the probability is normalcdf(38, 46, 42, 4) ≈ 0.68
E.g. To meet a customer’s requirements, the diameters of the ball bearings made by a certain machine must
be normally distributed with an average of 2 mm and a standard deviation of no more than 0.1 mm. An
inspector measures 35 bearings and finds a mean diameter of 1.95 mm. Should he consider the machine
to be operating within specifications?
If the machine is operating as it ought, the sample means of samples of size 35 should be normally distributed around 2 mm with a standard error of 0.1/√35 mm. Thus if the machine is working correctly, the probability of observing a sample mean of 1.95 mm or less is

normalcdf(−10 000, 1.95, 2, 0.1/√35) ≈ 0.0015
Thus, it is very unlikely that the machine is within spec.
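This sketch (mine, not part of the notes) redoes the ball-bearing calculation: under the in-spec assumption, the sample mean for n = 35 is normal with mean 2 mm and standard error 0.1/√35.

```python
# A sketch (not from the notes): the ball-bearing example. If the machine is
# in spec, how likely is a sample mean of 1.95 mm or less for n = 35?
import math
from statistics import NormalDist

mu, sigma, n = 2.0, 0.1, 35
stderr = sigma / math.sqrt(n)   # standard error of the mean, σ/√n

p = NormalDist(mu, stderr).cdf(1.95)
print(round(p, 4))   # → 0.0015
```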
Study the last example carefully. It is a standard way of approaching such problems: assume that the distribution is what
it’s claimed to be and compute the probability of seeing a sample mean at least as extreme as the one actually observed.
We’ll need this next time:
By the CLT, x̄ is normally distributed with mean µ and std dev σ/√n, so
the variable (x̄ − µ)/(σ/√n) is normally distributed with mean 0 and std dev 1.
17 Confidence Intervals, Part I
Interval notation
Recall the various kinds of intervals, e.g., (1, 3), [1, 3], (1, 3], [1, 3), (1, ∞), (−∞, 3).
Any of the finite ones of these may be written 2 ± 1 (center ± radius).
Introduction to confidence intervals
Our estimate for µ is always x̄, the mean of some sample, but we’d like to be able to add something like, “We’re 90% confident
that µ is in the interval (x̄ − E, x̄ + E).”
A point estimate of a population parameter is a one-number estimate (like x̄ for µ).
An interval estimate of a population parameter is an interval supposed to contain the parameter with a given probability
called the level of confidence.
We’ll be concerned primarily with interval estimates for the population mean.
You won’t be tested on the following explanation.
Think of choosing samples of size 50 from a much larger population with standard deviation σ = 7.3. Each sample will
have its own sample mean x̄.
Suppose we want a number E such that, for 90% of the samples, the interval (x̄ − E, x̄ + E) contains the population mean
µ.
Clearly E will depend on the spread of the population, which is to say, on σ. If σ is large, E will have to be large; if σ is
small, E can be small.
Suppose we knew E. As the diagrams above show, µ will be in the interval (x̄ − E, x̄ + E) if and only if x̄ is in the interval
(µ − E, µ + E), so if µ is in (x̄ − E, x̄ + E) 90% of the time, then x̄ must be in (µ − E, µ + E) 90% of the time. Thus, E is
the number such that P(µ − E < x̄ < µ + E) = 0.9.
We can find E by transforming to the standard normal: the z-score of µ + E must be the number z0.9 such that 90% of
the area under the standard normal curve is between −z0.9 and z0.9 :
If 90% of the area is between −z0.9 and z0.9
then 95% of the area is to the left of z0.9 ,
so z0.9 = invNorm(0.95, 0, 1) ≈ 1.6449.
So the z-score of µ + E is z0.9 ≈ 1.6449. We have
z0.9 = ((µ + E) − µ)/σx̄ = E/(σ/√n)

⇒ E = z0.9 · σ/√n ≈ 1.6449 · 7.3/√50 ≈ 1.6982
So, for 90% of samples of size 50, µ will lie in the interval (x̄ − 1.6982, x̄ + 1.6982).
Note that E = 1.6449 · 7.3/√50 says that we need about 1.6449 × the standard error of the mean on either side of x̄ to get a 90% confidence interval.
Generalizing, if we want confidence level c, our interval will be

(x̄ − zc·σ/√n, x̄ + zc·σ/√n)

We call zc a critical value and E = zc·σ/√n a margin of error.
Critical values
The critical values for a level of confidence c are the points −zc and zc such that c is the area under the standard normal
curve between −zc and zc .
From the picture, the total area to the left of zc is 1 − (1 − c)/2 = (1 + c)/2, so zc = invNorm((1 + c)/2).

E.g. z0.9 = invNorm((1 + 0.9)/2) ≈ 1.6449
zc is the number of standard errors needed for confidence level c.
Margin of error
Margin of error E = zc·σ/√n. E is the accuracy needed on the x-axis for confidence level c.
To find a z-confidence interval for the mean with confidence level c
σ must be known
You must be able to apply the CLT
1. Find σ, n, x̄, and c (you may need to compute x̄)
2. Compute the critical value: zc = invNorm((1 + c)/2)
3. Using the critical value, compute the margin of error: E = zc·σ/√n
4. The interval is x̄ ± E
E.g. A sample of size n = 36 from a population with σ = 3.1 has x̄ = 18. Find a 95% confidence interval for
the population mean.
1. From the problem, σ = 3.1, n = 36, x̄ = 18, and c = 0.95
2. zc = z0.95 = invNorm((1 + 0.95)/2) = invNorm(0.975) ≈ 1.96
3. E = 1.96 · 3.1/√36 ≈ 1.01
4. The interval is 18 ± 1.01, or (16.99, 19.01)
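The four-step recipe translates directly into code. This sketch (function name is mine, not from the notes) computes the same interval for n = 36, σ = 3.1, x̄ = 18, c = 0.95.

```python
# A sketch (not from the notes): the four-step z-interval recipe as code,
# checked on the example above (n = 36, σ = 3.1, x̄ = 18, c = 0.95).
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, c):
    zc = NormalDist().inv_cdf((1 + c) / 2)   # step 2: critical value
    E = zc * sigma / math.sqrt(n)            # step 3: margin of error
    return xbar - E, xbar + E                # step 4: x̄ ± E

lo, hi = z_interval(18, 3.1, 36, 0.95)
print(round(lo, 2), round(hi, 2))   # → 16.99 19.01
```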
E.g. Fifty measurements of visibility through the water at one particular location averaged 25 feet. Find a 90%
confidence interval for the mean visibility, assuming that the population standard deviation is 5 ft.
1. From the problem, σ = 5, n = 50, x̄ = 25, and c = 0.9
2. zc = z0.9 = invNorm((1 + 0.90)/2) ≈ 1.64
3. E = 1.64 · 5/√50 ≈ 1.16
4. The interval is 25 ± 1.16, or (23.84, 26.16)
18 Confidence Intervals, Part II
z confidence intervals on the TI (ZInterval)
The TI will let you compute a z-interval directly from sample data. Don’t do it.
Only use ZInterval if
• you know the population standard deviation σ, AND
• you can apply the CLT
Sample size
Suppose x is approximately normally distributed with std dev σ, and we must estimate µ to within a margin of error E
with confidence level c. How large a sample must we use?
Recall: margin of error E = zc·σ/√n

⇒ √n = zc·σ/E
⇒ n = (zc·σ/E)²   ←− formula for sample size
Always round the result UP to the nearest integer.
N.B.: In order to compute n using the formula, you must be able to read c, σ, and E out of the problem.
E.g. The lengths of movies in Jesse’s collection have standard deviation σ = 14.7 minutes. How large a sample
is needed to be 95% confident that the population mean will be within five minutes of the sample mean x̄?
We need to be 95% confident that µ is within 5 of x̄. I.e., E = 5.
z0.95 ≈ 1.96 ⇒ n ≈ (1.96 · 14.7/5)² ≈ 33.2, so the smallest sample has n = 34.
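The sample-size formula, with the round-up step, can be checked on the movie example. This sketch (function name is mine) uses σ = 14.7, E = 5, and c = 0.95.

```python
# A sketch (not from the notes): sample size n = ceil((zc·σ/E)²),
# checked on the movie-length example (σ = 14.7, E = 5, c = 0.95).
import math
from statistics import NormalDist

def sample_size(sigma, E, c):
    zc = NormalDist().inv_cdf((1 + c) / 2)   # critical value for level c
    return math.ceil((zc * sigma / E) ** 2)  # always round UP

print(sample_size(14.7, 5, 0.95))   # → 34
```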
The difficulty in sample size problems is identifying E. Remember that E is how close you have to come to the population
mean.
Confidence intervals: normal population, σ unknown (TInterval)
Suppose that either we have n ≥ 30 or a reason to think that a population is approximately normally distributed (because similar
populations are), but we don’t know σ. What do we do?
Approximate σ by s, of course; but this changes things a bit.
Recall that what makes the z-confidence interval work is that x̄ is approximately normally distributed with mean µx̄ = µ and standard deviation σx̄ ≈ σ/√n. When this is true, the variable z = (x̄ − µ)/(σ/√n) has the standard normal distribution. When you must approximate σ by s, you have to make an adjustment to this. What you get instead is:
If n ≥ 30 or the population is approximately normally distributed, then the statistic
t = (x̄ − µ)/(s/√n)
has what is called a t-distribution. The t-distribution is a lot like the standard normal distribution and can be used in
much the same way.
Properties of the Student t-distribution
• bell-shaped, like the normal curve
• total area under curve is 1
• exact curve depends on the number of degrees of freedom (d.f.)
• for us, d.f. is always n − 1 (where n is the sample size)
• central hump is lower than the normal curve, tails are thicker
• when n ≥ 30, the t and the standard normal are very close
You won’t be tested on properties of the t-distribution.
When to use z and when to use t
If you can use the CLT, then
• if the std dev is from the population, use z;
• if the std dev is from the sample, use t.
If you can’t use the CLT, don’t construct a confidence interval.
E.g. In a random sample of fifteen CD players brought in for repair, the average repair cost was $80 and the
standard deviation was $14. Assuming that repair costs are approximately normally distributed, use your
calculator to construct a 90% confidence interval for µ.
You don’t know σ, so you should use a t interval.
The interval is (73.633, 86.367).
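Python’s standard library has no t-distribution, so this sketch (mine, not from the notes) hard-codes the critical value t* ≈ 1.761 for 14 degrees of freedom at 90% confidence, taken from a t-table; everything else follows the same pattern as the z-interval.

```python
# A sketch (not from the notes): the CD-player t-interval by hand.
# The critical value t* for d.f. = 14, c = 0.90 is hard-coded from a t-table.
import math

xbar, s, n = 80, 14, 15
t_star = 1.761   # t critical value, d.f. = n - 1 = 14, c = 0.90

E = t_star * s / math.sqrt(n)   # margin of error, t* · s/√n
print(round(xbar - E, 2), round(xbar + E, 2))   # ≈ (73.63, 86.37)
```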
E.g. Construct 90% and 95% confidence intervals for the mean of the population from which the following
sample data was taken.
90.4    98.5   100.9  108.3    80.2    78.0   76.8    92.4    72.2    91.6
96.9    96.7    94.2  122.2    81.5    92.7  105.1    85.8    78.5   113.1
108.6   81.6    94.1  109.0    75.8    55.7  100.7   112.5    90.8    88.4

Now use STAT, TESTS, 8:TInterval (with Data, not Stats):

The 90% interval: (87.97, 96.91)
The 95% interval: (87.059, 97.821)
19 Hypothesis Testing, Part I
Suppose that someone claims that the average hourly rate charged by lawyers in your area is $200. To test this claim,
you survey 30 firms. If you find an average of $205 per hour, would you say that you had enough evidence to reject the
claim?
Probably not. But what if you found an average of $300 per hour for your sample? Then, you would almost certainly
conclude that the claim of $200 per hour was wrong. To understand the idea of hypothesis testing, you have to realize
why you would be so sure that the claim was wrong: it’s because your intuition tells you that if the $200 per hour claim
were right, then it would be extremely unlikely ever to find a largish sample with a mean as high as $300.
To make this precise, suppose that hourly rates really have mean $200 and standard deviation $50. Then sample means
for samples of size 30 will be normally distributed with mean $200 and standard deviation 50/√30 ≈ 9.13. That makes a
sample mean of $300 more than ten standard deviations away from where we expect to find it! If we go ahead and calculate
the probability of finding a sample with a mean of $300 or more, we find that it is normalcdf(300, 10 000, 200, 50/√30) ≈
3.4 × 10⁻²⁸ —a probability so small that we normally just think of it as zero.
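That tail probability can be reproduced without the TI. A sketch in Python, expressing the upper tail of a normal distribution through math.erfc (the function name upper_tail is my own; erfc is used because subtracting two CDF values this far in the tail would lose all precision):

```python
import math

def upper_tail(x, mu, sigma):
    """P(X > x) for X ~ Normal(mu, sigma), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

sigma_xbar = 50 / math.sqrt(30)      # std dev of the sample mean, about 9.13
p = upper_tail(300, 200, sigma_xbar)
print(p)                             # on the order of 10**-28, effectively zero
```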
We just did a hypothesis test of a claim about a population mean µ. Specifically, we
1. temporarily accepted the claim about µ and σ (i.e., as a hypothesis);
2. computed the mean of a specific sample from the population; and
3. used the claim to compute the probability of observing a sample mean at least as extreme as the one observed.
The probability of observing a sample mean at least as extreme as the one observed was very small, so we rejected the
claim.
We’ll refine and formalize this process, but we need some other stuff first.
Statistical hypotheses
In statistics, a null hypothesis H0 is a statement about the equality or inequality of two quantities. A null hypothesis
always contains one of the symbols ≤, =, ≥.
E.g. H0 : µ ≤ 1,  H0 : µ = 50,  H0 : σ² ≥ 42
We’ll only be concerned with hypotheses about a single mean.
An alternative hypothesis H1 is the negation of a null hypothesis.
These are the only types of null and alternative hypotheses we will see:

H0 is      if and only if H1 is
µ ≤ µ0     µ > µ0
µ = µ0     µ ≠ µ0
µ ≥ µ0     µ < µ0
Choosing hypotheses
A professional statistician would choose hypotheses based on what types of errors could occur and the seriousness of each
type. We’ll just use a few simple rules.
1. A claim is being made about the relation of the population mean to some number µ0 . Find this claim and use it to
identify µ0 .
2. The alternative hypothesis must be one of µ < µ0 , µ ≠ µ0 , or µ > µ0 . Choose the one that you want to support (or
that the person in the problem wants to support).
3. The null hypothesis is the negation of the alternative hypothesis.
Note that the word “claim” in a problem sometimes corresponds to H0 and sometimes to H1 .
E.g. A certain chain of movie theaters claims that the average price for a movie ticket in its theaters is no more
than $7.25. By querying people on the internet, John has put together a sample of thirty ticket prices with
a mean of $7.50.
(a) What is the population here?
(b) What claim is being made about the population mean µ?
(c) What is µ0 ?
(d) What are the possibilities for the alternative hypothesis?
(e) What is John’s alternative hypothesis?
(f) What is John’s null hypothesis?
E.g. A rendering company claims that on average, each ounce of shmoo oil contains no more than one gram
of saturated fat. Jane does not believe this claim, and wishes to do an experiment to prove her point.
(a) What is the population here?
(b) What claim is being made about the population mean µ?
(c) What is µ0 ?
(d) What are the possibilities for the alternative hypothesis?
(e) What is Jane’s alternative hypothesis?
(f) What is Jane’s null hypothesis?
More specific idea of the tests
Assume that σ is known, that we can apply the CLT, that H0 : µ ≤ 3, and that we have x̄ for a specific sample.
• If H0 is true, then µ is less than or equal to 3.
• x̄ should be near µ, so x̄ should be less than or equal to 3, or at least not much greater than 3
• We'll reject H0 if x̄ is too much greater than 3
  i.e., if x̄ − 3 is too much greater than 0
  i.e., if zx̄ = (x̄ − 3)/(σ/√n) is too much greater than 0
Because zx̄ has the standard normal distribution, we can compute the probability of finding a value of zx̄ that is larger
than some cutoff value. In practice, we won’t ever find the cutoff value explicitly. Instead, we’ll find the area to the right
of our actual zx̄ , and if that is too small, we’ll reject H0 .
For the null hypothesis H0 : µ ≥ 3, we want to reject H0 if zx̄ is too much less than 0:
For H0 : µ = 3, we want to reject H0 if zx̄ is too far from 0 in either direction:
20 Hypothesis Testing, Part II
Procedure for testing H0 : µ ≤ µ0 when σ is known
To use this, you must know σ and be able to apply the CLT.
1. Decide on a level of significance α (in problems, this is given)
• typical values are 0.1, 0.05, 0.01
2. Get x̄ for your sample
• x̄ is called the test statistic
3. Determine the z-score of this x̄
• Formula: zx̄ = (x̄ − µ0)/(σ/√n)
• zx̄ is the standardized test statistic
4. Determine the p-value for this x̄
• Formula: p = normalcdf(zx̄ , 10 000)
• p is the probability of observing a sample mean at least as extreme as this x̄ if the population mean and std dev are really µ0 and σ
5. If p ≤ α, reject H0 . Otherwise, don’t.
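The five steps can be collected into a small function. This is an illustrative Python sketch, not TI syntax; the names z_test_upper and normalcdf (an erfc-based stand-in for the calculator command) are my own:

```python
import math

def normalcdf(lo, hi, mu=0.0, sigma=1.0):
    """P(lo < X < hi) for X ~ Normal(mu, sigma), like the TI command."""
    cdf = lambda x: 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2)))
    return cdf(hi) - cdf(lo)

def z_test_upper(xbar, mu0, sigma, n, alpha):
    """Test H0: mu <= mu0 against H1: mu > mu0 when sigma is known."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))   # standardized test statistic
    p = normalcdf(z, 10_000)                    # right-tail p-value
    return z, p, p <= alpha                     # True means "reject H0"

# The movie-ticket example below: xbar = 7.50, mu0 = 7.25, sigma = 1.30
z, p, reject = z_test_upper(xbar=7.50, mu0=7.25, sigma=1.30, n=30, alpha=0.10)
print(round(z, 2), round(p, 3), reject)
```

Note that with the unrounded z this gives p ≈ 0.146; the notes' value of ≈ 0.147 comes from rounding z to 1.05 before applying normalcdf. Either way, p > α and H0 is not rejected.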
E.g. A certain chain of movie theaters claims that the average price for a movie ticket in its theaters is no more
than $7.25. By querying people on the internet, John has put together a sample of thirty tickets with a
mean price of $7.50. Does he have enough evidence to reject the chain’s claim at the 0.10 significance
level? Assume that the population is normally distributed with standard deviation $1.30.
Clearly, John wants to show that µ > 7.25, so H1 : µ > 7.25, whence H0 : µ ≤ 7.25. Also, n = 30 and
σ = 1.30. Steps:
1. α = 0.10
2. x̄ = 7.50
3. zx̄ = (7.50 − 7.25)/(1.30/√30) ≈ 1.05
4. p = normalcdf(1.05, 10 000) ≈ 0.147
5. It's not true that p ≤ α, so don't reject H0
Note that when we reject H0 , we don’t say that we have proved H1 . Rather, we have found H0 unlikely, and accept H1 as more
likely.
Procedure for testing H0 : µ ≥ µ0 when σ is known
This is the same as the previous case except for:
4. Determine the p-value for this x̄
• p = normalcdf(−10 000, zx̄ )
Procedure for testing H0 : µ = µ0 when σ is known
This is the same as the previous case except for:
4. Determine the p-value for this x̄
• If x̄ < µ0 , p = 2 · normalcdf(−10 000, zx̄ )
• If x̄ > µ0 , p = 2 · normalcdf(zx̄ , 10 000)
p is always computed in such a way that we can reject H0 if p ≤ α.
When p is low, reject H0 .
E.g. A company claims that the amount of cereal in its 24-ounce boxes is normally distributed with mean 24
oz and standard deviation 1 oz. You have checked 30 boxes and found a mean of 24.5 oz. Do you have
enough evidence to reject the company’s claim at the 0.01 significance level?
You want to reject the claim that µ = 24, so your alternative hypothesis should be µ ≠ 24. Thus, H0 : µ =
24.
1. α = 0.01 is given
2. x̄ = 24.5
3. zx̄ = (24.5 − 24)/(1/√30) ≈ 2.7386
4. p = 2 · normalcdf(2.7386, 10 000) ≈ 0.00617
5. p ≤ α, so reject H0 —i.e., reject the company's claim
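The two-tailed p-value here can be checked with a short sketch; upper_tail is my own erfc-based stand-in for the right-tail normalcdf:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

xbar, mu0, sigma, n, alpha = 24.5, 24.0, 1.0, 30, 0.01
z = (xbar - mu0) / (sigma / math.sqrt(n))
p = 2 * upper_tail(abs(z))        # two-tailed, since H1 is mu != mu0
print(round(z, 4), round(p, 5), p <= alpha)   # 2.7386 0.00617 True (reject H0)
```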
21 Hypothesis Testing, Part III
Type I and Type II errors
In the American justice system, a defendant is innocent until proven guilty. Possible circumstances and outcomes of a
trial:
                          Man is really innocent   Man is really guilty
Jury finds man guilty     Error                    Justice
Jury finds man innocent   Justice                  Error
Note the two very different kinds of possible error. Note also that either error might be made by a completely honest and
scrupulous jury: the conclusion they reach depends on the evidence they are given, so if the evidence does not correctly
represent the situation, they will be led into error even with perfect attention and judgement.
If we let H0 be “The man is innocent” and H1 be “The man is guilty,” then we can translate the table to:
                H0 is actually true   H0 is actually false
We reject H0    Type I Error          Correct
We accept H0    Correct               Type II Error
For a trial, we regard a Type I error as the more serious type, so we require a number of people to agree that guilt is proved
“beyond a reasonable doubt”.
Usually, the harder we try to prevent a Type I error, the more likely we are to make a Type II error.
The names we used for the two different kinds of errors are standard. Roughly:
Type I error: accept H1 even though it is false (false positive)
Type II error: accept H0 even though it is false (false negative)
The trial situation is perfectly general. Even with the best techniques, an unusual sample can be chosen, yielding a
sample mean that is not near the population mean (the equivalent of bad evidence).
To figure out what a Type I error would mean in the context of a particular problem, ask yourself:
(i) What would it mean to accept H1 ?
(ii) What would it mean for H1 to be false?
Use similar questions for Type II errors.
Answers to “Describe the practical consequences of making a Type X error in the context of the problem” questions
should have the form
Type I : Accept H1 when in fact H1 is false.
Type II: Accept H0 when in fact H0 is false.
E.g. You are in charge of drug testing for your company. Write (in words) an H0 and an H1 suitable for showing
the presence of a drug, and describe the practical consequences of making Type I and Type II errors.
H0 : no drug present
H1 : drug present
Type I : Accept that a drug is present when in fact there is no drug
Type II: Accept that no drug is present when in fact a drug is present
Note that “false positive” and “false negative” are perfect descriptions in this case.
E.g. A regulation requires an average bacteria count µ of 70 as the maximum acceptable for fishing waters. If
the average is above 70, the site is considered unsafe and is closed.
(a) Suppose that you work for the agency which monitors these waters, and you are interested in showing
that they are unsafe. Write appropriate null and alternative hypotheses for testing the waters.
H0 : (                    )
H1 : (                    )
(b) Describe the practical consequences of making Type I and Type II errors in this situation.
Type I:
Type II:
It is better and more satisfactory to acquit a thousand guilty persons than to put a single innocent
man to death once in a way.
—Maimonides, The Commandments (Negative Commandment 290)
22 Hypothesis Testing, Part IV
Z-Test on the TI
Only use Z-Test if
• you know the population standard deviation σ AND
• you can apply the CLT
E.g. A company claims that the amount of cereal in its 24-ounce boxes is normally distributed with mean 24
oz and standard deviation 1 oz. You have checked 30 boxes and found a mean of 24.5 oz. Use Z-Test on
the TI calculator to test the company’s claim at the 0.05 significance level.
Note. We found earlier that µ0 = 24, σ = 1, x̄ = 24.5, n = 30, H0 : µ = 24, and H1 : µ ≠ 24.
Note that it is the alternative hypothesis H1 that you indicate on the penultimate line.
When σ is unknown
If you are able to apply the CLT, use T-Test on the TI (see below for instructions).
E.g. The XYZ Corporation has told its staff that the average salary of a secretary is normally distributed with
a mean of $22,000. Janet believes that the average is really less than $22,000 and has found that the ten
secretaries in her area have the salaries shown below. Write her null and alternative hypotheses and test
her claim at the 0.05 significance level.
21178  21569  22424  21806  22814
20834  20509  22259  20727  21555
We don’t know σ, so we can’t use a z-test, but because we expect the salaries to be normally distributed,
we can use a t-test.
We have H0 : µ ≥ 22 000 and H1 : µ < 22 000.
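For illustration, here is the t statistic computed by hand in Python. The one-tailed critical value −1.833 for df = 9, α = 0.05 is an assumption taken from a t table (the TI's T-Test reports a p-value instead):

```python
import math
import statistics

salaries = [21178, 21569, 22424, 21806, 22814,
            20834, 20509, 22259, 20727, 21555]

mu0 = 22000
n = len(salaries)
xbar = statistics.mean(salaries)
s = statistics.stdev(salaries)          # sample standard deviation

t = (xbar - mu0) / (s / math.sqrt(n))   # standardized test statistic
print(round(t, 3))                      # about -1.776

# Left-tail critical value for df = 9, alpha = 0.05 (from a t table)
t_crit = -1.833
print("reject H0" if t < t_crit else "don't reject H0")
```

Since t ≈ −1.776 does not fall below −1.833, the p-value is just above 0.05 and Janet cannot reject H0 at this level.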
The p-value for your x̄ can be read from the T-Test results.
Which test?
Just like confidence intervals:
• If σ is known and you can apply the CLT, use a z-test.
• If σ is unknown and you can apply the CLT, use a t-test.
• Otherwise, don't use either—call a professional.
In problems, ask yourself: where does the standard deviation come from—from the sample or from the population?
23 Linear Regression
Linear models
Hooke’s Law tells us that if we stretch a spring a distance x, then the force with which the spring pulls against the stretch
is kx for some number k (which depends on the spring, of course). But suppose we take a real spring, stretch it various
distances x, and measure the force y. Will we see an exact relationship y = kx? Of course not; there are bound to be
errors. Instead, we would see something like this:
Our measurements should have given us points (xi , yi ) where each yi = kxi , but they didn’t quite do so. Instead, we have
yi = kxi + εi , where εi is a random error. A relationship like this is a type of linear (statistical) model.
Notice that if we subtract the linear part (the kxi ) from the data, we are left with just the error. In many practical cases,
this error is approximately normally distributed with mean 0. The correlation and regression techniques we will study
depend on this, so we will assume it.
When the data points don’t lie very close to a line, it can still make sense to use a linear model, as the following example
shows.
Instead of thinking of two variables x and y as simply being related, we often think of using a model to explain y by x
or to predict y from x. For example, if we wanted to explain patients’ responses to medication by the dosages they take,
we would let x be the dosage and y be the response. Then, if responses really do depend only on dosages, and if the
dependence is linear, the data points (xi , yi ) from a study should lie very close to a straight line.
In many situations, however, more than one variable must be used to explain what is observed. If we give patients various
doses of a drug, for example, their responses may well depend on both the dosage and the patients’ ages, so we might
have yi = k1 xi + k2 zi + εi where xi is the dosage amount and zi is the patient’s age.
Imagine that you have been given just the dosages and responses from such an experiment, as in the following picture.
If you try to explain the response y using only the dosages, you will find that your formula yi = k1 xi isn’t very accurate.
As the scatter plot shows, there’s a linear trend, but with quite a lot of variation. If you add the line y = k1 x to the plot, it
will track the trend, but many data points will be fairly far from it.
This makes sense, because the dosage only partially explains the response. It also makes sense to try to model the
response using a linear function of only the dosage, because the linear dependence is really there. It’s just not the only
dependence.
Statisticians have developed methods for developing multi-variable models, but we won’t study them in this course; we
will be content with finding one line. Ultimately, we will need to answer three questions:
• Supposing that a linear model can be used, which line is best?
• When is it reasonable to use a linear model?
• How good is the model?
The answer to the first question is the regression line.
The regression line
If there is a linear relationship between x and y, the line that best
fits the (x, y) points is called the regression line or trend line.
How can we find this line? If, truly, yi = axi +b+εi , then for each
xi , the difference between each yi and axi + b is just a random
error εi . (See the picture). The total error is the sum ∑ |yi −
(axi + b)|, and the best line should be the one that minimizes the
total error.
Graphically, this means picking the line for which the sum of the
vertical distances from each data point to the line is as small as
possible.
In practice, we minimize ∑[yi − (axi + b)]², so the method is
called least squares. This seems reasonable, as x² is always positive and varies like |x| (i.e., is small when |x| is small
and large when |x| is large).
In fact, it can be proved that the method of least squares gives the best linear fit in most practical situations.
Specifically: if, for each x, the y-values are approximately normally distributed and, for each y, the x-values are approximately
normally distributed, then least squares will give the best fit.
E.g. The table shows the ages and salaries of ten secretaries at the XYZ corporation. Give the equation of the
trend line.
Age     20     21     26     32     25     21     21     22     20     24
Salary  21178  21569  22424  21806  22814  20834  20509  22259  20727  21555
The calculator gives a ≈ 109.68 and b ≈ 19 022.89 (see the LinReg output below), so the equation of the
line is y = 109.68x + 19 022.89.
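The calculator's LinReg output comes from the closed-form least-squares formulas a = Sxy/Sxx and b = ȳ − a·x̄. A sketch reproducing it:

```python
import statistics

ages = [20, 21, 26, 32, 25, 21, 21, 22, 20, 24]
salaries = [21178, 21569, 22424, 21806, 22814,
            20834, 20509, 22259, 20727, 21555]

xbar = statistics.mean(ages)
ybar = statistics.mean(salaries)

# Least squares: slope = S_xy / S_xx, intercept = ybar - slope * xbar
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(ages, salaries))
s_xx = sum((x - xbar) ** 2 for x in ages)
a = s_xy / s_xx
b = ybar - a * xbar
print(round(a, 2), round(b, 2))   # 109.68 19022.89
```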
E.g. Make a scatter plot of the data in the previous example.
Using Linear Regression
We use the regression line to predict y-values: simply plug a number in for x and compute ŷ = ax + b. (The notation ŷ is
traditional for predicted values of y.)
Strictly speaking, we should use the regression line only when the correlation between an independent variable x and a dependent variable y is significant (see the section on correlation). We will simply assume that this is true unless told otherwise. (If
the correlation is insignificant, then the best linear predictor for y is simply y = ȳ.)
Only use the regression equation with values of x that are within, or very near, the range of x values used to compute the
trend line coefficients.
Finding ŷ for an x within the range of x-values used to compute the regression line coefficients is called interpolation;
finding ŷ for an x outside that range is called extrapolation.
Interpolation Example
The table below shows the density of pure liquid water at various temperatures.
Temp (°C)        0       4      20      40      60      80
Density (g/cm³)  0.9999  1.000  0.9982  0.9922  0.9832  0.9718
Use linear interpolation to estimate the density at 30◦ C.
To do this problem, we use the TI, putting the temperature values in the L1 list and the corresponding density values in
the L2 list. We get a ≈ −0.000345, b ≈ 1.00261, so the regression line is D = −0.000345T + 1.00261, where T is the
temperature and D is the density. Thus the density at T = 30◦ C is estimated to be −0.000345(30) + 1.00261 = 0.99226.
The TI’s scatter plot and regression line with the interpolated value
added.
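The same closed-form least-squares formulas reproduce this interpolation; the coefficient names a and b match the TI's LinReg output:

```python
import statistics

temps = [0, 4, 20, 40, 60, 80]                                # deg C
densities = [0.9999, 1.000, 0.9982, 0.9922, 0.9832, 0.9718]   # g/cm^3

tbar = statistics.mean(temps)
dbar = statistics.mean(densities)
a = sum((t - tbar) * (d - dbar) for t, d in zip(temps, densities)) \
    / sum((t - tbar) ** 2 for t in temps)
b = dbar - a * tbar

# Interpolate: 30 deg C lies inside the 0-80 range of the data
print(round(a * 30 + b, 5))   # 0.99226
```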
Extrapolation Example
The table below shows the global temperature anomaly (difference between annual average global temperature and the
mean of that quantity for the period 1901–2000) for the years 1950–1969.
Year        1950    1951    1952    1953    1954    1955    1956
Difference  −0.207  −0.196  −0.186  −0.184  −0.186  −0.186  −0.176

Year        1957    1958    1959    1960    1961    1962    1963
Difference  −0.155  −0.131  −0.111  −0.100  −0.101  −0.113  −0.131

Year        1964    1965    1966    1967    1968    1969
Difference  −0.148  −0.157  −0.153  −0.138  −0.120  −0.103
Find the coefficients for the trend line. Use the trend line to predict the anomaly for 2009.
Let x = 0 correspond to 1950, so that List L1 in the TI contains the numbers 0, 1, 2, . . . 19. With this we find that
a ≈ 0.00427 and b ≈ −0.1897, so the regression equation is y = 0.00427x − 0.1897. Plugging in 59 for x yields
ŷ2009 = 0.06223.
However, the real y2009 was 0.415 — a huge difference.
Moral: Beware of extrapolation!
24 Correlation
Remember the questions from the previous section:
• If a linear model can be used, which line is best?
• When is it reasonable to use a linear model?
• How good is the model?
We want to answer the second and third of these.
Given some (x, y) data, we would like to try to find out whether there is a linear model that fits the data reasonably well.
That is, we’d like to detect when there are numbers a and b such that, to a large extent, y = ax + b.
Now, we can draw a line through any bunch of points, but that doesn’t make the line a reasonable model. When is there
a reasonable linear model? The answer is given by defining a quantity called correlation which measures how close the
(x, y) points are to being on a straight line. The closer they are, the more reasonable it is to assume that a linear model
can be used.
For a population, it’s possible to define a quantity called the Pearson correlation coefficient ρ which varies between −1
and 1 and to prove that an exact linear relationship y = ax + b exists if and only if ρ is −1 or 1. This is very important
theoretically, but in practice we almost never have a population, only samples. For sample data, we compute r, an
estimate for ρ.
Properties of r
• r is always between −1 and 1.
• When r is near −1, there is a strong negative linear correlation between x and y. That is, the (x, y) points lie near a straight line with negative slope.
• When r is near 0, there is no linear correlation between x and y; when r is near zero but not equal to zero, we say that the correlation is weak.
• When r is near 1, there is a strong positive linear correlation between x and y. That is, the (x, y) points lie near a straight line with positive slope.
r is the answer to the second question. We’ll compute it with the calculator:
Note that r near zero does not mean that x and y are not correlated; it means that there is no linear relationship.
In the social sciences, small, medium, and large ‘effect sizes’ are about 0.1, 0.3, 0.5.
r²
Think of the regression line as predicting the value of y, given the value of x. What we’d like to know is: for a given case,
how good is this prediction? If the data is very nearly on a line, then the prediction should be very good: the predicted
values of y shouldn’t be too different from the actual values. If the data doesn’t lie very near a line, then the prediction
won’t be very good. We know that the correlation coefficient r measures how close the data comes to being on a line, so
if r is near ±1, the prediction will be good, and if r is near zero, it won’t be.
We don’t use r directly for this, however. To understand the language that is used for this, consider the set of (x, y) data
pictured below, in which there’s an obvious positive linear correlation.
If we think of moving from left to right along the x-axis and watching what happens to the y values, we see that the
changes in y are really made up of two parts: an overall linear trend and something else that makes the y value jump up
and down from the trend line. If we subtract out the linear part, all that is left is the “something else”—i.e., the rest of
the variation in y:
If this remaining variation is small, then the trend line explains most of the variation in y, and so is a good fit. If it is
large, then there’s a large amount of variation in y which is not explained by the trend line, i.e., the line is not a good fit.
It turns out that the value of r² is exactly the fraction of the variation in y which is explained by the trend line.
r², called the coefficient of determination, is the answer to our third question. When r² is small (near 0), the linear model
does not explain much of the variation, and so is not very good; when it is large (near 1), the model explains most of the
variation and so is a good model.
Note that r² is always between 0 and 1.
E.g. A study of the correlation between high school GPA and first-year college GPA for 63,482 students found
that r = 0.39 at the 0.05 significance level. How much of the variation in first-year college GPA is
explained by high school GPA?
We have r = 0.39, so r² ≈ 0.15. We say that about 15% of the variation in first-year college GPA is
explained by high school GPA.
E.g. The table shows the ages and salaries of ten secretaries at the XYZ corporation. Determine the correlation
coefficient. About what percentage of the variation in salary is explained by age?
Age     20     21     26     32     25     21     21     22     20     24
Salary  21178  21569  22424  21806  22814  20834  20509  22259  20727  21555
As the TI screen shows, r ≈ 0.53 and r² ≈ 0.28. We say that about 28% of the variation in salary is
explained by age.
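r can also be computed directly from the formula r = Sxy/√(Sxx·Syy). A sketch using the same data:

```python
import math
import statistics

ages = [20, 21, 26, 32, 25, 21, 21, 22, 20, 24]
salaries = [21178, 21569, 22424, 21806, 22814,
            20834, 20509, 22259, 20727, 21555]

xbar, ybar = statistics.mean(ages), statistics.mean(salaries)
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(ages, salaries))
s_xx = sum((x - xbar) ** 2 for x in ages)
s_yy = sum((y - ybar) ** 2 for y in salaries)

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 2), round(r * r, 2))   # 0.53 0.28
```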
A bad mistake
Correlation does not imply causation!
For example, the number of metropolitan readers of the weekday New York Times between 1993 and 2007 and the
percentage of American adults who smoked cigarettes regularly in those years are very highly correlated (r = 0.96, my
calculation). Did either of these cause the other?
This stuff is used everywhere...
Figure 6: a) Scatter graph of E2F4 and RBL2 expression levels. The linear correlation coefficient is −0.36. Clearly, there
is little relationship between the two sets of expression data.
b) Scatter graph of the predicted E2F-4 and p130 TF activities.
The linear correlation coefficient is found to be
−0.80. The training sets of E2F-4 and p130 included 12 and 43
interactions, respectively. Only three of the genes were coregulated by both TFs.
Taken from Transcriptional regulatory networks via gene ontology and expression data, In Silico Biology 7 (2006)
Significance of a correlation
If a correlation coefficient r is determined from a small sample, then even if |r| is large, the correlation detected may
not be significant. (If n = 2, for example, then r = ±1 no matter what.) If r is determined from a large sample, the
correlation may be significant even if it is fairly weak. Significance is determined by a hypothesis test. We won’t do the
test, just use the results.
Let n be the number of ordered pairs used to determine r. To determine significance, we use a table of what are called
the critical values for r. Note that the critical value depends on both n and α.
n    α = 0.05  α = 0.01       n    α = 0.05  α = 0.01
4    0.950     0.990          13   0.553     0.684
5    0.878     0.959          14   0.532     0.661
6    0.811     0.917          15   0.514     0.641
7    0.754     0.875          16   0.497     0.623
8    0.707     0.834          17   0.482     0.606
9    0.666     0.798          18   0.468     0.590
10   0.632     0.765          19   0.456     0.575
11   0.602     0.735          20   0.444     0.561
12   0.576     0.708          21   0.433     0.549
Testing for significance
1) Using the table, determine the critical value c for your α and n.
2) If |r| > c, the correlation is significant; otherwise, it isn’t.
On a test, you will have to be given a critical value or a table from which to determine it.
E.g. A correlation coefficient of r = −0.5 was computed using a sample of size n = 20. Test the significance
of the correlation at the α = 0.05 level.
1) From the table, the critical value for n = 20, α = 0.05 is 0.444.
2) | − 0.5| = 0.5 and 0.5 > 0.444, so the correlation is significant.
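The two-step check is easy to mechanize. A sketch with a fragment of the α = 0.05 column of the critical-value table hard-coded (the names CRIT_05 and is_significant are my own):

```python
# Critical values for r at alpha = 0.05, keyed by sample size n
# (a fragment copied from the table above)
CRIT_05 = {10: 0.632, 15: 0.514, 20: 0.444, 21: 0.433}

def is_significant(r, n, crit_table=CRIT_05):
    """Return True if |r| exceeds the critical value for this n."""
    c = crit_table[n]
    return abs(r) > c

print(is_significant(-0.5, 20))   # True, since |-0.5| > 0.444
```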