Measures of Center

SUPPOSE THAT AN INSTRUCTOR IS TEACHING TWO SECTIONS OF A COURSE AND THAT SHE CALCULATES THE MEAN EXAM SCORE TO BE 60 FOR SECTION 1 AND 90 FOR SECTION 2.
A. Do you have enough information to determine the mean exam score for the two sections combined? Explain.
B. What can you say with certainty about the value of the overall mean for the two sections combined?
C. Without seeing all of the individual students’ exam scores, what information would you need to be able to calculate the overall mean?
D. Suppose that section 1 contains 20 students and section 2 contains 30 students. Calculate the overall mean exam score. Is the overall mean closer to 60 or 90?
E. Give an example of sample sizes for the two sections for which the overall mean turns out to be less than 65.
F. If you do not know the number of students in the sections but do know that there is the same number of students in the 2 sections, can you determine the overall mean?
G. Explain how it could happen that a student could transfer from section 1 to section 2 and cause the mean score for each section to decrease.

Measures of Central Tendency
• Median - the middle of the data; the 50th percentile
  – Observations must be in numerical order
  – Find the middle single value if n is odd
  – Take the average of the middle two values if n is even
NOTE: n denotes the sample size

Finding the Center: The Median
• The median is the value with exactly half the data values below it and half above it.
  – It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
  – It has the same units as the data.

Measures of Central Tendency
• Mean - the arithmetic average
  – Use $\mu$ to represent a population mean (a parameter)
  – Use $\bar{x}$ to represent a sample mean (a statistic)
  $\bar{x} = \frac{\sum x}{n}$

Mean
• Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance; the median is the equal area point.
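The two-section question at the start of this section comes down to weighting each section mean by its number of students. The sketch below is only an illustration: the function name `combined_mean` and the section sizes used for parts E and F are assumptions chosen for the example, not values from the slides.

```python
def combined_mean(means, sizes):
    """Overall mean of several groups: each group mean is weighted by its size."""
    total_points = sum(m * n for m, n in zip(means, sizes))
    return total_points / sum(sizes)

# Part D: 20 students averaging 60 and 30 students averaging 90.
print(combined_mean([60, 90], [20, 30]))  # 78.0, closer to 90

# Part E (assumed sizes): a lopsided split pulls the overall mean toward 60.
print(combined_mean([60, 90], [19, 1]))   # 61.5

# Part F (assumed sizes): equal section sizes give the midpoint of the two means.
print(combined_mean([60, 90], [25, 25]))  # 75.0
```

The overall mean always lies between 60 and 90, and it sits closer to the mean of the larger section.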
Measures of Central Tendency
• Mode – the observation that occurs the most often
  – Can be more than one mode
  – If all values occur only once, there is no mode
  – Not used as often as mean & median

Another Measure of Center
• As a measure of center, the midrange may also be used (the average of the minimum and maximum values). However, it is very sensitive to skewed distributions and outliers.
• The median is a more reasonable choice for center than the midrange in skewed distributions.

Using the calculator . . .
Enter the data in a list, then either:
• Go to LIST Menu, highlight MATH, and find your function
OR
• Go to Stat Menu, highlight Calc, and run 1-Var Stats on your list

Describing Quantitative Data
Measuring Center
Example, page 53
Use the data below to calculate the mean and median of the commuting times (in minutes) of 20 randomly selected New York workers.

10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45

$\bar{x} = \frac{10 + 30 + 5 + 25 + \dots + 40 + 45}{20} = 31.25$ minutes

Stemplot of the travel times:
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
Key: 4|5 represents a New York worker who reported a 45-minute travel time to work.

$M = \frac{20 + 25}{2} = 22.5$ minutes

Suppose we are interested in the number of lollipops that are bought at a certain store. A sample of 5 customers buys the following number of lollipops. Find the median.
2 3 4 8 12
The numbers are in order & n is odd – so find the middle observation.
The median is 4 lollipops!

What would happen to the median & mean if the 12 lollipops were 20?
2 3 4 6 8 20
The median is 5. The mean is 43/6 ≈ 7.17.
What happened?

What would happen to the median & mean if the 20 lollipops were 50?
2 3 4 6 8 50
The median is 5. The mean is 73/6 ≈ 12.17.
What happened?

Resistant
• Statistics that are not affected by extreme values (outliers)
• Is the median resistant? YES
• Is the mean resistant? NO

Look at the following data set. Find the mean & median.
21 27 23 23 24 25 25 27 27 28 30 30 26 26 26 27 30 31 32 32
Mean = 27   Median = 27
Create a histogram with the data (use an x-scale of 2). Then find the mean and median.
Look at the placement of the mean and median in this symmetrical distribution.

Look at the following data set. Find the mean & median.
22 23 29 28 22 24 24 23 26 36 25 28 21 38 62 23 25
Mean = 28.176   Median = 25
Look at the placement of the mean and median in this right skewed distribution.

Look at the following data set. Find the mean & median.
21 46 54 47 53 60 55 55 56 63 64 58 58 58 58 62 60
Mean = 54.588   Median = 58
Create a histogram with the data. Then find the mean and median.
Look at the placement of the mean and median in this skewed left distribution.

Comparing the mean and the median
In a symmetrical distribution, the mean and the median are the same. In a skewed distribution, the median remains at the center point; the mean, however, is pulled in the direction of the skew.
[Figure: mean and median for a symmetric distribution, a left-skewed distribution, and a right-skewed distribution]

WHICH MEASURE OF CENTER?
► Given that the MEAN is a NON-RESISTANT measure, it makes sense to use the MEDIAN in a skewed distribution as the “more typical” value
► Ex. Consider the following test scores:
► 96 98 92 90 95 100 91 55
► Find the mean & the median
► Which one is more “typical”?

Trimmed mean
To calculate a trimmed mean:
• Multiply the % to trim by n
• Truncate that many observations from BOTH ends of the distribution (when listed in order)
• Calculate the mean with the shortened data set

First find the mean of the data, then find a 10% trimmed mean with the following data.
12 14 19 20 22 24 25 26 26
10%(10) = 1
So remove one observation from each side!
14 19 20 22 24 25 26 26
10% trimmed mean = $\frac{176}{8} = 22$

WEIGHTED MEAN
Find your semester average if the Midterm is weighted 25%, the Paper 25% & the Final 50%.
• Midterm --- 92
• Paper ---- 80
• Final --- 88
.25(92) + .25(80) + .5(88) = 87

WEIGHTED MEAN
Weighted Mean is an average computed by giving different weights to some of the individual values. If all the weights are equal, then the weighted mean is the same as the arithmetic mean.
$\bar{x} = \frac{\sum w x}{\sum w}$
x is each data value
w is the number of occurrences of x (weight)
$\bar{x}$ is the weighted mean

CONSIDER THE FOLLOWING 3 SAMPLE DATA SETS:
I    20 40 50 30 60 70
II   47 43 44 46 20 70
III  44 43 40 50 46 47
COMPUTE THE RANGE, MEDIAN & MEAN FOR EACH DATA SET
WHAT DO YOU NOTICE???
NOW TAKE A LOOK AT COMPARING THE DOT PLOTS

Why is the study of variability important?
• Allows us to distinguish between usual & unusual values
• In some situations, want more/less variability
• When describing data, never rely on center alone
• Like Measures of Center, you must choose the most appropriate measure of spread.

Measures of Variability
• range (max - min)
• interquartile range (Q3 - Q1)
• deviations $x - \bar{x}$
• variance $\sigma^2$
• standard deviation $\sigma$ (lower case Greek letter sigma)

A measure of center alone can be misleading. A useful numerical description of a distribution requires both a measure of center and a measure of spread.

How to Calculate the Quartiles and the Interquartile Range
To calculate the quartiles:
1) Arrange the observations in increasing order and locate the median M.
2) The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
3) The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.
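The three quartile steps above can be sketched in a few lines of Python. This is a minimal illustration of the median-of-halves method described in the steps, using the New York travel times from the earlier example; the function name `quartiles` is an assumption, and other software may use slightly different quartile conventions.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves method."""
    data = sorted(values)                # step 1: arrange in increasing order
    n = len(data)
    lower = data[: n // 2]               # observations to the left of the median
    upper = data[(n + 1) // 2 :]         # observations to the right of the median
    return median(lower), median(data), median(upper)

# Travel times (minutes) for the 20 New York workers.
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, m, q3 = quartiles(times)
print(q1, m, q3)  # 15.0 22.5 42.5
```

When n is odd, the middle observation is left out of both halves, which matches steps 2 and 3. The sketch assumes at least two observations.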
The interquartile range (IQR) is defined as: IQR = Q3 – Q1

Measuring Spread: The Interquartile Range (IQR)
Find and Interpret the IQR
Travel times to work for 20 randomly selected New Yorkers
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
In increasing order:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Q1 = 15   M = 22.5   Q3 = 42.5
IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes
Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.

Identifying Outliers
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
Definition: The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.

Example, page 57
In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.
For these data, 1.5 x IQR = 1.5(27.5) = 41.25
Q1 - 1.5 x IQR = 15 – 41.25 = -26.25
Q3 + 1.5 x IQR = 42.5 + 41.25 = 83.75
Any travel time shorter than -26.25 minutes or longer than 83.75 minutes is considered an outlier.
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5

The Five-Number Summary
The minimum and maximum values alone tell us little about the distribution as a whole. Likewise, the median and quartiles tell us little about the tails of a distribution. To get a quick summary of both center and spread, combine all five numbers.
Definition: The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.
Minimum   Q1   M   Q3   Maximum

The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot.
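As a rough check of the numbers above, the sketch below computes the five-number summary and applies the 1.5 x IQR rule to the New York travel times. The function names `five_number_summary` and `iqr_outliers` are illustrative assumptions, and the quartiles use the median-of-halves method from these notes.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])          # median of the lower half
    q3 = median(data[(n + 1) // 2 :])    # median of the upper half
    return data[0], q1, median(data), q3, data[-1]

def iqr_outliers(values):
    """Return the observations flagged by the 1.5 x IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

print(five_number_summary(times))  # (5, 15.0, 22.5, 42.5, 85)
print(iqr_outliers(times))         # [85]
```

Only the 85-minute travel time falls beyond the upper fence of 83.75 minutes; no travel time can fall below the lower fence of -26.25 minutes.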
Boxplots (Box-and-Whisker Plots)
How to Make a Boxplot
• Draw and label a number line that includes the range of the distribution.
• Draw a central box from Q1 to Q3.
• Note the median M inside the box.
• Extend lines (whiskers) from the box out to the minimum and maximum values that are not outliers.

Construct a Boxplot
Consider our NY travel times data. Construct a boxplot.
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
In increasing order:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Min = 5   Q1 = 15   M = 22.5   Q3 = 42.5   Max = 85 (recall, this is an outlier by the 1.5 x IQR rule)

Example
When we use the mean instead of the median as a measure of center, we need another way to measure spread.
Suppose that we have these data values:
24 34 26 30 28 21 35 29 37 16
First find the mean: $\bar{x}$
Then find the deviations: $x - \bar{x}$
What is the sum of the deviations from the mean?
Square the deviations: $(x - \bar{x})^2$
Find the average of the squared deviations: $\frac{\sum (x - \bar{x})^2}{n}$
The average of the deviations squared is called the variance.
Population parameter: $\sigma^2$
Sample statistic: $s^2$

Calculation of variance of a sample
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, where n - 1 is the degrees of freedom (df)

Degrees of Freedom (df)
• n deviations contain (n - 1) independent pieces of information about variability

Measuring Spread: The Standard Deviation
Definition:
variance = $s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$
standard deviation = $s_x = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$

The standard deviation $s_x$ measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root.

Using a Calculator:
• Enter data in L1
• Run 1-Var Stats on L1 or use the List menu option

Which measure(s) of variability is/are resistant?
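The deviation, variance, and standard deviation steps from the example with the ten data values (24 34 26 30 28 21 35 29 37 16) can be traced with a short script. This is a sketch only: the function name `spread_from_mean` and the dictionary keys are assumptions made for the illustration, and the variance uses the sample formula with n - 1 in the denominator.

```python
import math

def spread_from_mean(data):
    """Walk through the steps: mean, deviations, squared deviations, variance, SD."""
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]      # x - x-bar for each observation
    squared = [d ** 2 for d in deviations]     # squared deviations
    sample_variance = sum(squared) / (n - 1)   # divide by the degrees of freedom
    return {
        "mean": mean,
        "sum_of_deviations": sum(deviations),
        "sum_of_squares": sum(squared),
        "sample_variance": sample_variance,
        "sample_sd": math.sqrt(sample_variance),
    }

values = [24, 34, 26, 30, 28, 21, 35, 29, 37, 16]
result = spread_from_mean(values)
print(result["mean"])                        # 28.0
print(result["sum_of_deviations"])           # 0.0
print(result["sum_of_squares"])              # 384.0
print(round(result["sample_variance"], 3))   # 42.667
print(round(result["sample_sd"], 3))         # 6.532
```

The deviations sum to zero, which is why they are squared before averaging. Dividing the sum of squares by n instead of n - 1 would give 38.4, the average of the squared deviations.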
Choosing Measures of Center and Spread
• We now have a choice between two descriptions for center and spread
  – Mean and Standard Deviation
  – Median and Interquartile Range
• The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
• Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
• NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!

COEFFICIENT OF VARIATION: a measurement of the relative variability (or consistency) of data
$CV = \frac{s}{\bar{x}} \cdot 100$ or $CV = \frac{\sigma}{\mu} \cdot 100$
CV is used to compare variability or consistency.

A sample of newborn infants had a mean weight of 6.2 pounds with a standard deviation of 1 pound. A sample of three-month-old children had a mean weight of 10.5 pounds with a standard deviation of 1.5 pounds. Which (newborns or 3-month-olds) are more variable in weight?
To compare variability, compare Coefficient of Variation
For newborns: CV = 16% (higher CV: more variable)
For 3-month-olds: CV = 14% (lower CV: more consistent)

Use Coefficient of Variation
To compare two groups of data, to answer: Which is more consistent? Which is more variable?

Linear Transformations
Variables can be measured in different units (feet vs meters, pounds vs kilograms, etc.). When converting units, the measures of center and spread will change.

Linear transformation rule
• When a random variable is multiplied by a constant and a constant is added, the mean changes by both the multiplication and the addition.
• The standard deviation changes only by the multiplication; adding a constant does not affect it.
• Formulas: $\mu_{ax+b} = a\mu_x + b$ and $\sigma_{ax+b} = |a|\sigma_x$

An appliance repair shop charges a $30 service call to go to a home for a repair. It also charges $25 per hour for labor. From past history, the average length of repairs is 1 hour 15 minutes (1.25 hours) with standard deviation of 20 minutes (1/3 hour).
Including the charge for the service call, what are the mean and standard deviation of the charges for a service call?
$\mu = 30 + 25(1.25) = \$61.25$
$\sigma = 25\left(\frac{1}{3}\right) = \$8.33$

Chapter 1 Summary
Data Analysis is the art of describing data in context using graphs and numerical summaries. The purpose is to describe the most important features of a dataset.
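For readers who want to check the last two calculations in these notes, here is a small sketch of the linear transformation rule and the coefficient of variation. The helper names `transform_mean_sd` and `coefficient_of_variation` are assumptions made for this illustration; the inputs are the repair-shop and infant-weight figures given in the notes.

```python
def transform_mean_sd(mean, sd, a, b):
    """Mean and standard deviation of a*x + b, given the mean and SD of x."""
    return a * mean + b, abs(a) * sd   # adding b shifts the mean but not the SD

def coefficient_of_variation(sd, mean):
    """Relative variability as a percentage: 100 * sd / mean."""
    return 100 * sd / mean

# Repair shop: charge = 25 * hours + 30, with mean 1.25 hours and SD 1/3 hour.
mean_charge, sd_charge = transform_mean_sd(1.25, 1 / 3, a=25, b=30)
print(round(mean_charge, 2), round(sd_charge, 2))  # 61.25 8.33

# Infant weights: newborns (mean 6.2, SD 1) vs. three-month-olds (mean 10.5, SD 1.5).
print(round(coefficient_of_variation(1, 6.2)))     # 16
print(round(coefficient_of_variation(1.5, 10.5)))  # 14
```

The $30 service charge moves the mean but leaves the standard deviation at 25 times the standard deviation of the repair time.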