Chapter Three: Numerically Summarizing Data
Learning Objectives
1. Understand the difference between a parameter and a statistic
2. Describe and compute measures of central tendency
3. Describe and compute measures of dispersion
4. Compute measures of location
5. Learn to read box plots and check for outliers

Measures of Central Tendency (Mean, Median and Mode)

A parameter is a descriptive measure of a population. In most real-world cases, the population parameter is not known. For example, the average gas price in the whole nation.

A statistic is a descriptive measure of a sample. We use a statistic to estimate the corresponding parameter. For example, the average gas price of the nation is not known. However, we can take a random sample of 100 stations, compute the sample average gas price, and use that sample average to estimate the unknown population average.

The population mean, \(\mu = \frac{\sum x_i}{N}\), is computed using all the individuals in a population, where N is the total number of individuals. The population mean is a parameter.

The sample mean, \(\bar{x} = \frac{\sum x_i}{n}\), is computed using the n values of the sample data. The sample mean is a statistic that is an unbiased estimator of the population mean.

NOTE: In real-world applications, the population mean \(\mu\) is usually not known, and is estimated by the sample mean \(\bar{x}\).

Median

The median of a variable is the value that lies in the middle of the data when arranged in ascending order. That is, half the data are below the median and half the data are above the median. We use M to represent the median.

Steps in Computing the Median of a Data Set
1. Arrange the data in ascending order.
2. Determine the number of observations (n).
3. Determine the observation in the middle of the data set. The position is (n+1)/2.
(a) If (n+1)/2 is an integer, locate the data value at the (n+1)/2 position. This is the median. (NOTE: in this situation the number of data values, n, is odd.)
(b) If (n+1)/2 is NOT an integer, the median is the average of the two data values on either side of the (n+1)/2 position. (NOTE: in this situation, n is even.)

EXAMPLE Computing the Median of Data

Find the mean and median of the following pulse rates from a sample of 8 individuals (NOTE: n = 8 in this case):
80, 76, 65, 68, 72, 73, 65, 80
Arrange them in ascending order: 65, 65, 68, 72, 73, 76, 80, 80
Find the position: (n+1)/2 = (8+1)/2 = 4.5
The position is not an integer, so the median is the average of the 4th and 5th values: Median = (72+73)/2 = 72.5

Adding one additional pulse rate of 100, now find the median of the data (NOTE: n = 9 in this case):
80, 76, 65, 68, 72, 73, 65, 80, 100
Ascending order: 65, 65, 68, 72, 73, 76, 80, 80, 100
Position: (9+1)/2 = 5
The median is 73 (the value in the 5th position).

Mode

The mode of a variable is the most frequent observation of the variable that occurs in the data set. If two values tie for the highest frequency, we say the data are bimodal.

Exercise: Find the mode of the following pulse rate data:
80, 76, 65, 68, 72, 73, 65, 80, 100, 80, 74, 65, 66, 70, 74, 65, 80, 98
The modes are 65 and 80 (each occurs four times).

Comparing Mean and Median: How does an extreme observation affect the mean and median? [similar exam questions]

Example: The following are the quiz scores of 10 students in class A:
5, 5, 5, 5, 5, 7, 7, 7, 7, 7
Find mean = ______, find median = ______
The following are the quiz scores of 10 students in class B:
5, 5, 5, 5, 5, 7, 7, 7, 7, 30
Find mean = ______, find median = ______

Fact: The mean is sensitive to extreme data values. The median is robust to extreme data values.

How do unusual cases affect the average, the median, and the shape of the histogram?
Compare histograms with/without the 'outlier' case, 5000 miles

[Figure: Histogram of Miles, with the case of 5000 miles (x-axis: Miles; y-axis: Frequency). Shape is _________]
[Figure: Histogram of Miles, without the case of 5000 miles (x-axis: Distance from Home; y-axis: Frequency). Shape is _________]

Descriptive Statistics: Miles for 148 cases (with the case of 5000 miles)
Variable    N    Mean    SE Mean   TrMean   StDev    Min   Q1   Median   Q3    Max
Miles      148   151.5   33.6      111.8    409.4    1.0   75   120      150   5000

Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)
Variable    N    Mean     SE Mean   TrMean   StDev    Min   Q1   Median   Q3    Max
Miles      147   ______   6.71      111.02   81.37    1.0   75   ______   150   600

Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles), with the blanks filled in
Variable    N    Mean     Min   Median   Max
Miles      147   118.52   1     120      600

NOTE: The median remains unchanged. Why? The median uses only the middle one (or two) data values, whereas the average uses every data value. So one very large unusual value makes the average larger, but not the median.

When data sets have unusually large or small values relative to the entire set of data, or when the distribution of the data is skewed, the median is the preferred measure of central tendency over the arithmetic mean because it is more representative of the typical observation.

Comparison of Mean, Median, and Mode for different shapes of distributions [Similar exam question]
Left-skewed: Mean < Median. From left to right: Mean, Median, Mode.
Symmetric: Mean ≈ Median. Mean = Median = Mode.
Right-skewed: Mean > Median. From left to right: Mode, Median, Mean.

NOTE: In real-world applications, the distribution of sample data can never be perfectly symmetric. The shape can only be approximately symmetric. If the mean is close to the median (not necessarily exactly mean = median), we would say the distribution is approximately symmetric.
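The median steps and the mean-versus-median comparison can be checked with a short Python sketch. It reuses the pulse-rate and quiz-score data from the examples above; the function name `median_by_position` is made up for illustration, and the code assumes a non-empty list of numbers.

```python
from statistics import mean, multimode

def median_by_position(values):
    """Median via the (n+1)/2 position rule (hypothetical helper name)."""
    data = sorted(values)        # step 1: ascending order
    n = len(data)                # step 2: number of observations
    position = (n + 1) / 2       # step 3: 1-based middle position
    if position.is_integer():    # n is odd: take the middle value
        return data[int(position) - 1]
    # n is even: average the two values on either side of the position
    below = data[int(position) - 1]
    above = data[int(position)]
    return (below + above) / 2

pulse_8 = [80, 76, 65, 68, 72, 73, 65, 80]
pulse_9 = pulse_8 + [100]
print(median_by_position(pulse_8))   # 72.5
print(median_by_position(pulse_9))   # 73

# Mode exercise: multimode returns every value tied for the top frequency.
pulse_18 = [80, 76, 65, 68, 72, 73, 65, 80, 100,
            80, 74, 65, 66, 70, 74, 65, 80, 98]
print(sorted(multimode(pulse_18)))   # [65, 80]

# Quiz scores: class B differs from class A only in its last value (30).
class_a = [5, 5, 5, 5, 5, 7, 7, 7, 7, 7]
class_b = [5, 5, 5, 5, 5, 7, 7, 7, 7, 30]
for name, scores in (("A", class_a), ("B", class_b)):
    print(name, "mean:", mean(scores), "median:", median_by_position(scores))
```

Running it shows the mean moving when the single extreme score is introduced while the median stays where it was, which is the behaviour described in the Fact above.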
Exercise: A sample of 50 gas prices is recorded and summarized. The average price is $3.15 and the median price is $3.13. Is the shape of the price distribution more likely to be skewed to the left, approximately symmetric, or skewed to the right?
ANS:

Measures of Dispersion

Four different measures of dispersion: Range, Variance, Standard Deviation, Interquartile Range (IQR)

Measures of dispersion measure the degree to which the data values spread out. The more the data values spread out, the larger the variation of the data values.

Example:
Scores of 5 students in class A: 60, 60, 70, 80, 80
Scores of 5 students in class B: 40, 60, 70, 80, 100
Scores of 5 students in class C: 70, 70, 70, 70, 70
Q: Scores in Class ____ have the largest variation. Scores in Class ____ have zero variation.

Visualizing Variability using Histogram

[Figure: three histograms labeled A, C and B]
Which one shows the largest variation:
Which one shows the smallest variation:

How to measure the variation?
• Range = R = Largest Data Value - Smallest Data Value
• The sample variance is: \(s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}\)
• The sample standard deviation is: \(s = \sqrt{s^2}\)
NOTE: the divisor (n-1) is called the degrees of freedom.

The population variance is symbolically represented by lowercase Greek sigma squared: \(\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}\). The population standard deviation is: \(\sigma = \sqrt{\sigma^2}\).

NOTE: As mentioned before, for real-world problems the population mean, population variance and population standard deviation are NOT KNOWN. Like the sample mean, the sample variance and sample standard deviation are obtained from sample data. They are used to estimate the unknown population variance and population standard deviation. This is the major part of inferential statistics, which will be dealt with later. In this chapter, we are learning how to compute and interpret these sample descriptive summaries to understand the sample data.
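As a rough illustration of the range, sample variance and sample standard deviation, the sketch below applies the (n-1) formula to the class A, B and C scores from the example above. The helper names `data_range` and `sample_variance` are hypothetical; Python's standard `statistics.variance` uses the same n-1 divisor and is shown only as a cross-check.

```python
from math import sqrt
from statistics import variance

def data_range(values):
    """Range = largest data value - smallest data value."""
    return max(values) - min(values)

def sample_variance(values):
    """Sample variance: squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

classes = {
    "A": [60, 60, 70, 80, 80],
    "B": [40, 60, 70, 80, 100],
    "C": [70, 70, 70, 70, 70],
}

for name, scores in classes.items():
    s2 = sample_variance(scores)
    s = sqrt(s2)  # the standard deviation has the same unit as the scores
    print(f"Class {name}: range={data_range(scores)}, s^2={s2:.1f}, s={s:.2f}")
    # Cross-check against the standard library (also divides by n - 1).
    assert abs(s2 - variance(scores)) < 1e-9
```

All three classes have the same mean of 70, so the differences in the output come entirely from how far the scores spread around it.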
Notation:
\(s^2\): sample variance
\(s = \sqrt{s^2}\): sample standard deviation
NOTE: If the original measurement unit is (ft), the variance \(s^2\) has measurement unit \((\text{ft})^2\): if x has unit (ft), then \((x - \bar{x})^2\) has the unit (ft)(ft), which is \((\text{ft})^2\). The measurement unit of \(s^2\) is \((\text{ft})^2\). The measurement unit of s is (ft).
\(\sigma^2\): population variance
\(\sigma = \sqrt{\sigma^2}\): population standard deviation

Some Important Tips

NOTE: Sample statistics, such as the sample mean \(\bar{x}\), the sample median, s and \(s^2\), will be different for different samples. Population parameters, such as the population mean \(\mu\), the population variance \(\sigma^2\) and the population s.d. \(\sigma\), are fixed constants for a given population. They do not change for different samples.

Exercise Comparing Variation: Quiz Scores of 40 students [similar exam questions]

[Figure: histograms of quiz scores for Class A, Class B and Class C]
Variation: Which one has the smallest s.d.? Which has the largest s.d.?

Answer
Class B has the smallest standard deviation.
Class A has the largest standard deviation.

Points to remember about variance and standard deviation and the relationship with the histogram:
- The values of s and \(s^2\) are always greater than or equal to zero.
- The larger the value of \(s^2\) or s, the greater the variability of the data set.
- If \(s^2\) or s is equal to zero, all measurements must have the same value.
- The standard deviation s is computed in order to have a measure of variability having the same unit as the observations.
- The larger the s.d., the more spread out the data and the flatter the histogram.
- The smaller the s.d., the more clustered the data around the mean and the taller the peak of the histogram.

Exercise (Similar exam questions)
1. The gas price is a concern for people. A random sample of 40 stations gives the following data summary:
Sample mean = $2.15
Median = $2.12
s = $0.15
Q: Is the distribution of the gas prices more likely to be (a) symmetric, (b) skewed to the right, or (c) skewed to the left? And WHY?
2.
The following two data sets are prices of milk from 6 stores, one from January 2004 and one from a year later.

Store:                   A      B      C      D      E      F
Price in January 2004    1.85   1.95   1.85   2.00   1.78   1.97
Price in January 2005    2.05   2.15   2.05   2.20   1.98   2.17

True or False for each of the following statements:
(a) The average price remains the same between the two years.
(b) The price range remains the same between the two years.
(c) The median remains the same between the two years.
(d) The standard deviation (s) remains the same between the two years.

Descriptive Summary for the 56 distances

Descriptive Statistics: distance
Variable    N    Mean    Median   TrMean   StDev   SE Mean
distance    56   142.0   140.0    128.3    112.2   15.0

Variable    Minimum   Maximum   Q1     Q3
distance    5.0       800.0     92.5   160.0

How to read the output:
- Mean: the sample mean \(\bar{x}\), which estimates \(\mu\).
- TrMean: the trimmed mean, that is, the mean after excluding the lowest 5% and the highest 5% of the data.
- StDev: s, the sample standard deviation. The sample variance is \(s^2 = (112.2)^2\).
- Minimum and Maximum: the smallest and largest values.
- Q1: 25% of the distances are lower than Q1, the first quartile, or 25th percentile.
- Q3: 75% of the distances are lower than Q3, the third quartile, or 75th percentile.

If we add a new maximum, 6000, to the data, so that we have 57 cases, what is the effect of the 6000 on the following summary statistics: increase, decrease, or stay the same?
(a) the average distance:
(b) the median distance:
(c) the standard deviation:
(d) the range:

Answer
Adding 6000 miles to the data:
• The average distance is increased.
• The median distance for this example is the same (in general, it will be almost the same).
• The standard deviation is increased.
• The range is increased.

Empirical Rule and Applications

What is the meaning of variation and how is it used in solving real-world problems?

For symmetric, mound-shaped (bell-shaped) data, approximately:
68% of the data lie within ± 1 standard deviation of the mean
95% of the data lie within ± 2 standard deviations of the mean
nearly 100% (about 99.7%) of the data lie within ± 3 standard deviations of the mean

The important application of the Empirical Rule is to identify rare (unusual, extreme) observations. If an observation falls outside the two s.d.
range, it has only about a 5% chance of occurring. Therefore, it is considered to be a rare (or unusual) case.

Percentage of data in each interval of a mound-shaped distribution:
below \(\mu - 2\sigma\): 2.5%
\(\mu - 2\sigma\) to \(\mu - \sigma\): 13.5%
\(\mu - \sigma\) to \(\mu\): 34%
\(\mu\) to \(\mu + \sigma\): 34%
\(\mu + \sigma\) to \(\mu + 2\sigma\): 13.5%
above \(\mu + 2\sigma\): 2.5%

NOTE: If you add the percentages on each side of the center line \(\mu\), they add to 50%. A mound-shaped distribution is symmetric about the mean.

Applying the Empirical Rule to identify Rare Events

A simple and powerful tool for identifying outliers, extremes, or unusual or rare events. We will use this rule very often throughout the entire semester. (Similar questions in the test)

Consider the 2010 ACT test: the average was 21 and the standard deviation was 4. The distribution of the ACT scores is mound-shaped.
Q1: A student received a score of 25. Is this an unusually high score?
Q2: If CMU admits students with a minimum ACT score of one standard deviation below the mean, what is the minimum ACT score for CMU admission?
Q3: A student received an ACT score of 30. Is this an unusually high score?

ANSWER:
Q1: 25 = 21 + 4, that is, one s.d. above the mean. It is inside two s.d. from the mean, so it is NOT an unusually high score.
Q2: The score at one s.d. below the mean = 21 - 4 = 17.
Q3: The score 30 > 21 + 2(4) = 29, so 30 is outside two s.d. from the mean. Only 2.5% of scores are higher than 29. Hence, 30 is an unusually high score.

Exercise: Estimating the average and standard deviation, and applying the Empirical Rule when the distribution is mound-shaped

We collect a sample of weekly spending from 40 students. Suppose the spending has a mound-shaped distribution. We only know the min = $20 and max = $80. As you see, the weekly spending varies from student to student.
(a) Give a good estimate of the average and the standard deviation of the weekly spending based on the data from the 40 students.
(b) Approximately what percentage of students would spend $35 or more per week?

ANS: Since the distribution is mound-shaped, we can use (20+80)/2 = $50 to estimate the average spending.
Since this is a sample, we use s = range/4 to estimate the s.d., which would be (80 - 20)/4 = $15.0.
ANS: We can then use this estimated average spending and s to answer question (b): $35 is about one s.d. below the mean. Hence, the percentage spending $35 or more = 34% + 50% = 84%. Approximately 84% of individuals spend $35 or more per week.

Five Number Summary; Box plots

The Five-Number Summary
MINIMUM   Q1   Median   Q3   MAXIMUM
IQR (Interquartile Range) = Q3 - Q1

Steps for Drawing a Box plot
Step 1: Determine the lower and upper fences:
Lower fence = Q1 - 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)
Step 2: Draw vertical lines at Q1, M and Q3. Enclose these vertical lines in a box.
Step 3: Label the lower and upper fences.
Step 4: Draw a line from Q1 to the smallest data value that is larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence.
Step 5: Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk (*).

EXAMPLE Drawing a Boxplot

Draw a boxplot for the serum HDL.
Min   Q1   M    Q3   Max
28    38   48   56   73
IQR = Q3 - Q1 = 56 - 38 = 18
Compute the lower and upper fences and draw a boxplot.

[Figure: Boxplot of HDL, with Q1, the median, Q3 and the mean marked (x-axis: HDL)]

Relationship between Distribution Shape and Boxplot (Similar questions in the test)
1. If the median is near the center of the box and the horizontal lines are of approximately equal length, then the distribution is roughly symmetric.
2. If the median is left of the center of the box and/or the right line is substantially longer than the left line, the distribution is right skewed.
3.
If the median is right of the center of the box and/or the left line is substantially longer than the right line, the distribution is left skewed.

[Figures: example boxplots for Symmetric, Skewed Right and Skewed Left distributions]

Distance data: 100 distance data

[Figure: Boxplot of Miles and Histogram of Miles (x-axis: Miles; y-axis: Frequency)]
[Figure: Boxplot of Miles and Histogram of Miles by gender (panels: female, male; panel variable: Gender)]

EXAMPLE Comparing Two Data Sets Using Boxplots

The following boxplots represent the birth rate for women 15 - 44 years of age in 1990 and 1997 for each state. What conclusion can you make?
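The fence rule from the box plot steps (Step 1 and Step 5) can be sketched in Python. The quartiles are taken from the serum HDL example and from the 56-distance summary earlier in these notes; the function names `fences` and `find_outliers` are made up for illustration, and only the reported minimum and maximum are checked because the raw data are not given.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Serum HDL: Min 28, Q1 38, M 48, Q3 56, Max 73
hdl_lower, hdl_upper = fences(38, 56)
print(hdl_lower, hdl_upper)                      # 11.0 83.0
print(find_outliers([28, 73], 38, 56))           # [] (min and max are inside the fences)

# Distance summary: Min 5.0, Q1 92.5, Q3 160.0, Max 800.0
dist_lower, dist_upper = fences(92.5, 160.0)
print(dist_lower, dist_upper)                    # -8.75 261.25
print(find_outliers([5.0, 800.0], 92.5, 160.0))  # [800.0]
```

Under this rule the HDL extremes both fall inside the fences, while the 800-mile maximum in the distance data lies above the upper fence and would be marked with an asterisk.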