Survey

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Transcript

Chapter 2 ~ Descriptive Analysis & Presentation of Single-Variable Data Black Bears Mean: 60.07 inches Median: 62.50 inches Range: 42 inches 20 Variance: 117.681 Standard deviation: 10.85 inches Minimum: 36 inches Frequency Maximum: 78 inches 10 First quartile: 51.63 inches Third quartile: 67.38 inches Count: 58 bears Sum: 3438.1 inches 0 30 40 50 60 70 80 Length in Inches 1 Chapter Goals • Learn how to present and describe sets of data • Learn measures of central tendency, measures of dispersion (spread), measures of position, and types of distributions • Learn how to interpret findings so that we know what the data is telling us about the sampled population 2 Measures of Central Tendency • Numerical values used to locate the middle of a set of data, or where the data is clustered • The term average is often associated with all measures of central tendency 3 Mean Mean: The type of average with which you are probably most familiar. The mean is the sum of all the values divided by the total number of values, n: 1 1 x = xi = ( x1 + x2 + . . . + xn ) n n Notes: The population mean, , (lowercase mu, Greek alphabet), is the mean of all x values for the entire population We usually cannot measure but would like to estimate its value A physical representation: the mean is the value that balances the weights on the number line 4 Example Example: The following data represents the number of accidents in each of the last 6 years at a dangerous intersection. Find the mean number of accidents: 8, 9, 3, 5, 2, 6, 4, 5: Solution: 1 x = (8 + 9 + 3 + 5 + 2 + 6 + 4 + 5) = 5.25 8 In the data above, change 6 to 26: Solution: 1 x = (8 + 9 + 3 + 5 + 2 + 26 + 4 + 5) = 7.75 8 Note: The mean can be greatly influenced by outliers 5 Median Median: The value of the data that occupies the middle position when the data are ranked in order according to size Notes: ~ Denoted by “x tilde”: x The population median, (uppercase mu, Greek alphabet), is the data value in the middle position of the entire population To find the median: 1. Rank the data x ) = n +1 2. Determine the depth of the median: d ( ~ 2 3. Determine the value of the median 6 Example Example: Find the median for the set of data: {4, 8, 3, 8, 2, 9, 2, 11, 3} Solution: 1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11 x ) = (9 +1)/ 2 = 5 2. Find the depth: d (~ 3. The median is the fifth number from either end in the ranked x =4 data: ~ Suppose the data set is {4, 8, 3, 8, 2, 9, 2, 11, 3, 15}: 1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11, 15 2. Find the depth: d ( ~x ) = (10 + 1) / 2 = 5.5 3. The median is halfway between the fifth and sixth observations: ~ x = (4 +8)/ 2 = 6 7 Mode & Midrange Mode: The mode is the value of x that occurs most frequently Note: If two or more values in a sample are tied for the highest frequency (number of occurrences), there is no mode Midrange: The number exactly midway between a lowest value data L and a highest value data H. It is found by averaging the low and the high values: midrange= L+ H 2 8 Example Example: Consider the data set {12.7, 27.1, 35.6, 44.2, 18.0} + H 12.7 + 44.2 L = = 2845 Midrange = . 2 2 Notes: When rounding off an answer, a common rule-of-thumb is to keep one more decimal place in the answer than was present in the original data To avoid round-off buildup, round off only the final answer, not intermediate steps 9 Measures of Dispersion • Measures of central tendency alone cannot completely characterize a set of data. Two very different data sets may have similar measures of central tendency. • Measures of dispersion are used to describe the spread, or variability, of a distribution • Common measures of dispersion: range, variance, and standard deviation 10 Range Range: The difference in value between the highest-valued (H) and the lowest-valued (L) pieces of data: range= H L • Other measures of dispersion are based on the following quantity Deviation from the Mean: A deviation from the mean, x x , is the difference between the value of x and the mean x 11 Example Example: Consider the sample {12, 23, 17, 15, 18}. Find 1) the range and 2) each deviation from the mean. Solutions: 1) range= H L = 2312 =11 2) x = 1(12 + 23+17 +15+18) =17 5 Data Deviation from Mean x x x _________________________ 12 -5 23 6 17 0 15 -2 18 1 12 Sample Variance & Standard Deviation Sample Variance: The sample variance, s2, is the mean of the squared deviations, calculated using n 1 as the divisor: s2 = 1 ( x x) 2 n 1 where n is the sample size Note: The numerator for the sample variance is called the sum of squares for x, denoted SS(x): s2 = SS( x) n 1 where SS( x ) = ( x x ) 2 = x 2 1 n ( x ) 2 Standard Deviation: The standard deviation of a sample, s, is the positive square root of the variance: s = s2 13 Example Example: Find the 1) variance and 2) standard deviation for the data {5, 7, 1, 3, 8}: Solutions: First: x = 1(5+ 7 +1+ 3+ 8) = 48 . 5 Sum: 1) s 2 = x x x ( x x)2 5 7 1 3 8 24 0.2 2.2 -3.8 -1.8 3.2 0 0.04 4.84 14.44 3.24 10.24 32.08 1 ( 32 . 8 ) = 8 . 2 4 2) s = 8 . 2 = 2 . 86 14 Notes The shortcut formula for the sample variance: ( x ) x n 2 2 s2 = n 1 The unit of measure for the standard deviation is the same as the unit of measure for the data 15 Measures of Position • Measures of position are used to describe the relative location of an observation • Quartiles and percentiles are two of the most popular measures of position • An additional measure of central tendency, the midquartile, is defined using quartiles • Quartiles are part of the 5-number summary 16 Quartiles Quartiles: Values of the variable that divide the ranked data into quarters; each set of data has three quartiles 1. The first quartile, Q1, is a number such that at most 25% of the data are smaller in value than Q1 and at most 75% are larger 2. The second quartile, Q2, is the median 3. The third quartile, Q3, is a number such that at most 75% of the data are smaller in value than Q3 and at most 25% are larger Ranked data, increasing order 25% L 25% Q1 25% Q2 25% Q3 H 17 Percentiles Percentiles: Values of the variable that divide a set of ranked data into 100 equal subsets; each set of data has 99 percentiles. The kth percentile, Pk, is a value such that at most k% of the data is smaller in value than Pk and at most (100 k)% of the data is larger. at most k % L at most (100 - k )% Pk H Notes: The 1st quartile and the 25th percentile are the same: Q1 = P25 The median, the 2nd quartile, and the 50th percentile are x = Q2 = P50 all the same: ~ 18 Finding Pk (and Quartiles) • Procedure for finding Pk (and quartiles): 1. Rank the n observations, lowest to highest 2. Compute A = (nk)/100 3. If A is an integer: – d(Pk) = A.5 (depth) – Pk is halfway between the value of the data in the Ath position and the value of the next data If A is a fraction: – d(Pk) = B, the next larger integer – Pk is the value of the data in the Bth position 19 Example Example: The following data represents the pH levels of a random sample of swimming pools in a California town. Find: 1) the first quartile, 2) the third quartile, and 3) the 37th percentile: 5.6 6.0 6.7 7.0 5.6 6.1 6.8 7.3 5.8 6.2 6.8 7.4 5.9 6.3 6.8 7.4 6.0 6.4 6.9 7.5 Solutions: 1) k = 25: (20) (25) / 100 = 5, depth = 5.5, Q1 = 6 2) k = 75: (20) (75) / 100 = 15, depth = 15.5, Q3 = 6.95 3) k = 37: (20) (37) / 100 = 7.4, depth = 8, P37 = 6.2 20 5-Number Summary 5-Number Summary: The 5-number summary is composed of: 1. 2. 3. 4. 5. L, the smallest value in the data set Q1, the first quartile (also P25) ~ x , the median (also P50 and 2nd quartile) Q3, the third quartile (also P75) H, the largest value in the data set Notes: The 5-number summary indicates how much the data is spread out in each quarter The interquartile range is the difference between the first and third quartiles. It is the range of the middle 50% of the data 21 Box-and-Whisker Display Box-and-Whisker Display: A graphic representation of the 5-number summary: • The five numerical values (smallest, first quartile, median, third quartile, and largest) are located on a scale, either vertical or horizontal • The box is used to depict the middle half of the data that lies between the two quartiles • The whiskers are line segments used to depict the other half of the data • One line segment represents the quarter of the data that is smaller in value than the first quartile • The second line segment represents the quarter of the data that is larger in value that the third quartile 22 Example Example: A random sample of students in a sixth grade class was selected. Their weights are given in the table below. Find the 5-number summary for this data and construct a boxplot: Solution: 63 85 92 99 112 63 L 64 86 93 99 85 Q1 76 88 93 99 92 ~ x 76 89 93 101 99 Q3 81 90 94 108 83 91 97 109 112 H 23 Boxplot for Weight Data Weights from Sixth Grade Class 60 70 80 90 100 110 Weight L Q1 ~ x Q3 H 24 z-Score z-Score: The position a particular value of x has relative to the mean, measured in standard deviations. The z-score is found by the formula: value mean x x z= = st.dev. s Notes: Typically, the calculated value of z is rounded to the nearest hundredth The z-score measures the number of standard deviations above/below, or away from, the mean z-scores typically range from -3.00 to +3.00 z-scores may be used to make comparisons of raw scores 25 Example Example: A certain data set has mean 35.6 and standard deviation 7.1. Find the z-scores for 46 and 33: Solutions: x x 46 35.6 z= = = 176 . s 7.1 46 is 1.46 standard deviations above the mean x x = 33 35.6 = = z 0.37 s 7.1 33 is -0.37 below standard deviations below the mean. 26 Interpreting & Understanding Standard Deviation • Standard deviation is a measure of variability, or spread • Two rules for describing data rely on the standard deviation: – Empirical rule: applies to a variable that is normally distributed – Chebyshev’s theorem: applies to any distribution 27 Empirical Rule Empirical Rule: If a variable is normally distributed, then: 1. Approximately 68% of the observations lie within 1 standard deviation of the mean 2. Approximately 95% of the observations lie within 2 standard deviations of the mean 3. Approximately 99.7% of the observations lie within 3 standard deviations of the mean Notes: The empirical rule is more informative than Chebyshev’s theorem since we know more about the distribution (normally distributed) Also applies to populations Can be used to determine if a distribution is normally distributed 28 Illustration of the Empirical Rule 99.7% 95% 68% x 3s x 2s xs x x+s x +2s x +3s 29 Example Example: A random sample of plum tomatoes was selected from a local grocery store and their weights recorded. The mean weight was 6.5 ounces with a standard deviation of 0.4 ounces. If the weights are normally distributed: 1) What percentage of weights fall between 5.7 and 7.3? 2) What percentage of weights fall above 7.7? Solutions: 1) ( x 2s, x + 2s) = (65 . 2(0.4), 65 . + 2(0.4)) = (57 . , 7.3) Approximately 95% of the weights fall between 5.7 and 7.3 2) ( x 3s, x + 3s) = (65 . 3(0.4), 65 . + 3(0.4)) = (53 . , 7.7) Approximately 99.7% of the weights fall between 5.3 and 7.7 Approximately 0.3% of the weights fall outside (5.3, 7.7) Approximately (0.3/2)=0.15% of the weights fall above 7.7 30 A Note about the Empirical Rule Note: The empirical rule may be used to determine whether or not a set of data is approximately normally distributed 1. Find the mean and standard deviation for the data 2. Compute the actual proportion of data within 1, 2, and 3 standard deviations from the mean 3. Compare these actual proportions with those given by the empirical rule 4. If the proportions found are reasonably close to those of the empirical rule, then the data is approximately normally distributed 31 Chebyshev’s Theorem Chebyshev’s Theorem: The proportion of any distribution that lies within k standard deviations of the mean is at least 1 (1/k2), where k is any positive number larger than 1. This theorem applies to all distributions of data. Illustration: at least 1 12 k x ks x x + ks 32 Important Reminders! Chebyshev’s theorem is very conservative and holds for any distribution of data Chebyshev’s theorem also applies to any population The two most common values used to describe a distribution of data are k = 2, 3 The table below lists some values for k and 1 - (1/k2): k 1(1/ k 2) 1.7 0.65 2 0.75 2.5 0.84 3 0.89 33 Example Example: At the close of trading, a random sample of 35 technology stocks was selected. The mean selling price was 67.75 and the standard deviation was 12.3. Use Chebyshev’s theorem (with k = 2, 3) to describe the distribution. Solutions: Using k=2: At least 75% of the observations lie within 2 standard deviations of the mean: ( x 2s, x + 2s ) = (67.75 2(12.3), 67.75 + 2(12.3) = (43.15, 92.35) Using k=3: At least 89% of the observations lie within 3 standard deviations of the mean: ( x 3s, x + 3s ) = (67.75 3(12.3), 67.75 + 3(12.3) = (30.85, 104.65) 34