Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Chapter 3: Numerically Summarizing Data Section 3.1: Measure of Central Tendency Objectives: Students will be able to: Determine the arithmetic mean of a variable from raw data Determine the median of a variable from raw data Determine the mode of a variable from raw data Use the mean and the median to help identify the shape of a distribution Vocabulary: Parameter – a descriptive measure of a population Statistic – a descriptive measure of a sample Arithmetic Mean – sum of all values of a variable in a data set divided by the number of observations Population Arithmetic Mean – (μ) summation ( ∑ ) of all values of a variable from the population divided by the total number in the population (N) Sample Arithmetic Mean – (x‾) summation of all values of a variable from a sample divided by the total number of observation from the sample (n) Median – (M) the value of the variable that lies in the middle of the data when arranged in ascending order (if there is a even number of observations, then the median is the average of observations either side of the middle (½) value Mode – the most frequently observed value of the variable Resistant – extreme values do not affect the statistic Key Concepts: Three characteristics used to describe distributions (from histograms or similar charts) 1. Shape 2. Center 3. Spread Distributions Parameters Skewed Left: Mean substantially smaller than median Mean < Median < Mode Symmetric: Mean roughly equal to median Mean ≈ Median ≈ Mode Skewed Right: Mean substantially greater than median Mean > Median > Mode Measure of Central Tendency Computation Interpretation μ = (∑xi ) / N Mean Median Mode Center of gravity x‾ = (∑xi) / n Arrange data in ascending order and divide the data set into half Tally data to determine most frequent observation Divides into bottom 50% and top 50% Most frequent observation When to use Data are quantitative and frequency distribution is roughly symmetric Data are quantitative and frequency distribution is skewed Data are qualitative or the most frequent observation is the desired measure of central tendency Chapter 3: Numerically Summarizing Data Example 1: Which of the following are resistant measures of central tendency: Mean, Median or Mode? Example 2: Given the following set of data: 70, 56, 48, 48, 53, 52, 66, 48, 36, 49, 28, 35, 58, 62, 45, 60, 38, 73, 45, 51, 56, 51, 46, 39, 56, 32, 44, 60, 51, 44, 63, 50, 46, 69, 53, 70, 33, 54, 55, 52 What is the mean? What is the median? What is the mode? What is the shape of the distribution? Example 3: Given the following types of data and sample sizes, list the measure of central tendency you would use and explain why? Sample of 50 Sample of 200 Hair color Height Weight Parent’s Income Number of Siblings Age Does sample size affect your decision? Homework: pg : 130-7; 9, 21, 23, 27, 33, 34, 44 Chapter 3: Numerically Summarizing Data Section 3.2: Measures of Dispersion (Spread) Objectives: Students will be able to: Compute the range of a variable from raw data Compute the variance of a variable from raw data Computer the standard deviation of a variable from raw data Use the Empirical Rule to describe data that are bell shaped Use Chebyshev’s inequality to describe any set of data Vocabulary: Range – difference between the smallest and largest data values Variance – based on the deviation about the mean (how spread out the data is) Population Variance – ( σ2 ) computed using (∑(xi – μ)2)/N Sample Variance – ( s2 ) computed using (∑(xi – x‾)2)/((n – 1) Biased – a statistic that consistently under-estimates or over-estimates a population parameter Degrees of Freedom – number of observations minus the number of parameters estimated in the computation Population Standard Deviation – square root of the population variance Sample Standard Deviation – square root of the sample variance Key Concepts: Sample variance is found by dividing by (n – 1) to keep it an unbiased (since we estimate the population mean, μ, by using the sample mean, x‾) estimator of population variance The larger the standard deviation, the more dispersion the distribution has Empirical Rule and Chebyshev’s Inequality Empirical Rule Chebyshev’s Inequality μ ± 3σ at least 88.9% μ ± 2σ at least 75% μ±σ Nothing 99.7% At least (1 – 1/k2)*100%, k>1 within k standard deviations of the mean 95% 68% 34% 0.15% 34% 13.5% 13.5% 2.35% μ - 3σ μ - 2σ 2.35% μ-σ μ μ+σ μ + 2σ μ + 3σ 0.15% μ - 3σ μ - 2σ μ-σ μ μ+σ μ + 2σ μ + 3σ Chapter 3: Numerically Summarizing Data Example 1: Which of the following measures of spread are resistant? 1. Range 2. Variance 3. Standard Deviation Example 2: Given the following set of data: 70, 56, 48, 48, 53, 52, 66, 48, 36, 49, 28, 35, 58, 62, 45, 60, 38, 73, 45, 51, 56, 51, 46, 39, 56, 32, 44, 60, 51, 44, 63, 50, 46, 69, 53, 70, 33, 54, 55, 52 1. What is the range? 2. What is the variance? 3. What is the standard deviation? Example 3: Compare the Empirical Rule and Chebyshev’s Inequality Empirical Rule Chebyshev μ±σ μ ± 2σ μ ± 3σ Homework: pg 148-155: 11, 14, 22, 23, 35, 39, 40, 43, 45, 51 Chapter 3: Numerically Summarizing Data Section 3.3: Measures of Central Tendency and Dispersion from Grouped Data Objectives: Students will be able to: Approximate the mean of a variable from grouped data Compute the weighted mean Approximate the variance and standard deviation of a variable from grouped data Vocabulary: Weighted Mean – mean of a variable value times its weighted value Key Concepts: Use raw data whenever possible If grouped (summarized data) is the only data available, estimates for mean and standard deviation can still be obtained Descriptive Stats from Summary Stats Class Midpt xi Freq fi xi fi xi - x (xi – x)²fi 0 – 1.99 1 2 2 -5.1 52.02 2 – 3.99 3 5 15 -3.1 48.05 4 – 5.99 5 6 30 -1.1 7.26 6 – 7.99 7 8 56 0.9 6.48 8 – 9.99 9 9 81 2.9 75.69 Total Σfi = 30 Σxifi = 184 Σxifi = 184 x = --------------- = 6.1 Σfi = 30 Σ(xi – x)²fi = 189.5 Σ(xi – x)² fi = 189.5 s² = -------------------------- = 6.53 Σfi – 1 = 29 30/2 - 13 n/2 – CF Median = M = L + ------------ (ci ) = 6 + ------------- (2) = 6 + (2/8) 2 = 6.25 8 f L is the lower class limit of the class containing the median n is the total number of data values in the frequency distribution CF is the cum freq of the class preceding the median class f is the frequency of the median class ci is the class width of the median class Homework: pg 161 - 165: 3, 4, 5, 21, 25 Chapter 3: Numerically Summarizing Data Section 3.4: Measures of Position Objectives: Students will be able to: Determine and interpret z-scores Determine and interpret percentiles Determine and interpret quartiles Check a set of data for outliers Vocabulary: Z-Score – the distance that a data value is from the mean in terms of the number of standard deviations K Percentile – (Pk) divides the lower kth percentile of a set of data from the rest Quartiles – (Qi) divides the whole data into four (25%) sets of data Outliers – extreme observations IQR (Interquartile range) – difference between third and first quartiles (IQR = Q3 – Q1) Lower fence – Q1 – 1.5(IQR) Upper fence – Q3 – 1.5(IQR) Key Concepts: Data sets should be checked for outliers as the mean and standard deviation are not resistant statistics and any conclusions drawn from a set of data that contains outliers can be flawed Fences serve as cutoff points for determining outliers (data values less than lower or greater than upper fence are considered outliers) Z-Scores Population z-Score Sample z-Score x–μ z = -----------σ z= x–x -----------s Allows comparisons of different distributions. Mean of z is 0 and the standard deviation of z is 1. Quartiles Smallest Data Value 25% of the data Median Q1 Q2 25% of the data Example 1: Which player had a better year in 1967? Carl Yastrzemski AL Batting Champ 0.326 Roberto Clemente NL Batting Champ 0.357 AL average 0.236 AL stdev 0.01072 NL average 0.249 NL stdev 0.01257 Q3 25% of the data Largest Data Value 25% of the data Chapter 3: Numerically Summarizing Data Example 2: Given the following set of data: 70, 56, 48, 48, 53, 52, 66, 48, 36, 49, 28, 35, 58, 62, 45, 60, 38, 73, 45, 51, 56, 51, 46, 39, 56, 32, 44, 60, 51, 44, 63, 50, 46, 69, 53, 70, 33, 54, 55, 52 What is the median? What is the Q1? What is the Q3? What is the IQR? What is the upper fence? What is the lower fence? Are there any outliers? Homework: pg 172 - 174: 9-12, 14, 19 Chapter 3: Numerically Summarizing Data Section 3.5: The Five-Number Summary and Boxplots Objectives: Students will be able to: Compute the five-number summary Draw and interpret boxplots Vocabulary: Five-number Summary – the minimum data value, Q1, median, Q3 and the maximum data value Key Concepts: Distribution Shape Based on Boxplots: a. If the median is near the center of the box and each horizontal line is of approximately equal length, then the distribution is roughly symmetric b. If the median is to the left of the center of the box or the right line is substantially longer than the left line, then the distribution is skewed right c. If the median is to the right of the center of the box or the left line is substantially longer than the right line, then the distribution is skewed left Remember identifying a distribution from boxplots or histograms is subjective! Why Use a Boxplot? A boxplot provides an alternative to a histogram, a dotplot, and a stem-and-leaf plot. Among the advantages of a boxplot over a histogram are ease of construction and convenient handling of outliers. In addition, the construction of a boxplot does not involve subjective judgements, as does a histogram. That is, two individuals will construct the same boxplot for a given set of data - which is not necessarily true of a histogram, because the number of classes and the class endpoints must be chosen. On the other hand, the boxplot lacks the details the histogram provides. Dotplots and stemplots retain the identity of the individual observations; a boxplot does not. Many sets of data are more suitable for display as boxplots than as a stemplot. A boxplot as well as a stemplot are useful for making side-by-side comparisons. Five-number summary Min Q1 smallest value Boxplot M Q3 Max First, Second and Third Quartiles (Second Quartile is the Median, M) largest value Lower Fence Upper Fence [ ] Smallest Data Value > Lower Fence (Min unless min is an outlier) * Largest Data Value < Upper Fence (Max unless max is an outlier) Outlier Chapter 3: Numerically Summarizing Data Ex. #1 Consumer Reports did a study of ice cream bars (sigh, only vanilla flavored) in their August 1989 issue. Twentyseven bars having a taste-test rating of at least “fair” were listed, and calories per bar was included. Calories vary quite a bit partly because bars are not of uniform size. Just how many calories should an ice cream bar contain? 342 377 319 353 295 234 294 286 377 182 310 439 111 201 182 197 209 147 190 151 131 151 Construct a boxplot for the data above. Ex. #2 The weights of 20 randomly selected juniors at MSHS are recorded below: 121 126 130 132 134 137 141 144 125 128 131 133 135 139 141 147 148 153 205 213 a) Construct a boxplot of the data b) Determine if there are any mild or extreme outliers. Ex. #3 The following are the scores of 12 members of a woman’s golf team in tournament play: 89 90 87 95 86 81 111 108 83 88 91 a) 79 Construct a boxplot of the data. b) Are there any mild or extreme outliers? c) Find the mean and standard deviation. d) Based on the mean and median describe the distribution? Ex. #4 Comparative Boxplots: The scores of 18 first year college women on the Survey of Study Habits and Attitudes (this psychological test measures motivation, study habits and attitudes toward school) are given below: 154 109 137 115 152 140 154 178 101 103 126 126 137 165 165 129 200 148 The college also administered the test to 20 first-year college men. There scores are also given: 108 140 114 91 180 115 126 92 169 146 109 132 75 88 113 151 70 115 187 104 Compare the two distributions by constructing boxplots. Are there any outliers in either group? Are there any noticeable differences or similarities between the two groups? Homework: pg 181-183: 5-7, 15 Chapter 3: Numerically Summarizing Data Chapter 3: Review Objectives: Students will be able to: Summarize the chapter Define the vocabulary used Complete all objectives Successfully answer any of the review exercises Use the technology to display graphs and plots of data Vocabulary: None new Distribution Spread Distributions Parameters Standard Deviation and Variance High Lots of Data on extremes Data very spread out Skewed Left: Mean substantially smaller than median Mean < Median < Mode Symmetric: Mean roughly equal to median Standard Deviation and Variance Lower Some data on extremes Data more clustered than first, but still not tightly packed Mean ≈ Median ≈ Mode Skewed Right: Mean substantially greater than median Standard Deviation and Variance Lowest Mean > Median > Mode Fewest data on extremes Data clustered more near then mode (tightly packed) Homework: pg 186 - 191: 7, 9, 11, 12, 13, 19