What to look for in a distribution

1) Look for the overall pattern and for deviations from the pattern
• See if the distribution has a shape we can describe in a few words
• Describe the center and spread of the distribution
2) One common deviation from the overall pattern in any graph of data is an outlier, i.e., an observation that falls outside the overall pattern of the graph.
3) A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
4) A distribution is skewed to the right if the right side of the histogram extends much further out than the left side.
5) A distribution is skewed to the left if the left side of the histogram extends much further out than the right side.

Describing the Center of a Distribution

How to find the mean (average):
1) Add the values together
2) Divide the total by the number of observations
• Example: Test Scores: 56, 65, 54, 55, 57, 54, 61, 62, 60, 55, 57, 56, 57, 61, 62, 60, 49, 66, 59, 80
Step 1: 56 + 65 + 54 + … + 59 + 80 = 1186
Step 2: 1186 / 20 = 59.3 (the Mean)

Fancy Schmancy Notation: To find the mean $\bar{x}$ of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, x_3, \dots, x_n$, their mean is:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

Or, in more compact notation:

$$\bar{x} = \frac{1}{n} \sum x_i$$

How to find the median M:
1) Arrange the observations in order from smallest to largest.
2) If the number of observations is odd, then the median is located at the center of the list. So, if there are n observations, then the median is located in spot (n + 1) / 2.
3) If the number of observations is even, then the median is the average of the two terms in the middle spots.
These are located in spots n / 2 and (n / 2) + 1.

Example of finding a Median (odd number of observations):
List: 2, 4, 6, 3, 5, 2, 6, 8, 10, 11, 1
Step 1: Order the list: 1, 2, 2, 3, 4, 5, 6, 6, 8, 10, 11
Step 2: Find the middle term: (n + 1) / 2 = (11 + 1) / 2 = 6, so the median is the sixth term in the ordered list.
Median = 5

Example of finding a Median (even number of observations):
List: 2, 4, 6, 3, 5, 2, 6, 8, 10, 11, 1, 12
Step 1: Order the list: 1, 2, 2, 3, 4, 5, 6, 6, 8, 10, 11, 12
Step 2: Find the two middle terms: n / 2 = 12 / 2 = 6 and (n / 2) + 1 = (12 / 2) + 1 = 7
Step 3: Average the sixth and seventh terms: Median = (5 + 6) / 2 = 5.5

In The Presence Of Outliers

Q: Do outliers affect the Mean and Median?
Consider the list of numbers from 1 through 9: 1, 2, 3, 4, 5, 6, 7, 8, 9
The Mean is: 5
The Median is: 5
What if we put the number 100 at the end of the list: 1, 2, 3, 4, 5, 6, 7, 8, 9, 100
The Mean is: 14.5
The Median is: 5.5
A: Outliers affect the Mean much more than the Median!

Describing Spread

Consider the following pay distributions:
[Figure: pay distributions, labeled Low, High, and Center]

Measuring Spread

The simplest useful numerical description of a distribution consists of both a measure of center and a measure of spread. We can begin to describe the spread of a distribution by talking about percentiles.
Definition: The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
Example: The Median is the 50th percentile.
Q: Why isn't the Mean the 50th percentile?
Consider again the list: 1, 2, 3, 4, 5, 6, 7, 8, 9, 100
The Mean is: 14.5
The Median is: 5.5
Nine of the ten observations (90%) fall at or below the Mean, so here the Mean is not the 50th percentile.

Describing Spread

The Five Number Summary:
1) The Median
2) First Quartile: 25% of the observations lie below the first quartile
3) Third Quartile: 75% of the observations lie below the third quartile
4) Lowest Individual Observation (Minimum)
5) Highest Individual Observation (Maximum)

Calculating the Quartiles:
1) Arrange the observations in increasing order and locate the Median M in the ordered list o' observations.
2) The First Quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
3) The Third Quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.

Example of calculating First Quartile:
List o' quiz scores: 10, 8, 9, 4, 6, 6, 8, 9, 2, 7
1) Order the list: 2, 4, 6, 6, 7, 8, 8, 9, 9, 10
Find the median: (7 + 8) / 2 = 7.5
2) Find all the observations whose position in the list is to the left of the median: 2, 4, 6, 6, 7
Find the median of these values: Q1 = 6

Example of calculating Third Quartile:
List o' quiz scores: 10, 8, 9, 4, 6, 6, 8, 9, 2, 7, 11
1) Order the list: 2, 4, 6, 6, 7, 8, 8, 9, 9, 10, 11
Find the median: 8
2) Find all the observations whose position in the list is to the right of the median: 8, 9, 9, 10, 11
Find the median of these values: Q3 = 9

Interquartile Range

The interquartile range, IQR, is the distance between the first quartile and the third quartile: IQR = Q3 - Q1.

Determining Outliers

Call an observation a suspected outlier if it falls more than 1.5 * IQR above the third quartile or below the first quartile.
Example: Imagine we have a bunch of test scores with Q1 = 50 and Q3 = 80.
The IQR = 80 - 50 = 30
So, 1.5 * IQR = 1.5 * 30 = 45
This means that if there are any scores above Q3 + 45 = 125 or any scores below Q1 - 45 = 5, then these scores are suspected outliers.

Boxplot

A boxplot is a graph of the five number summary. A central box spans the quartiles, with a line marking the median. Whiskers extend out from the box to the extremes.
• Example: Low = 47, High = 98, Median = 77, Q1 = 65, Q3 = 85
[Figure: boxplot marking Lowest Observation (47), Q1 (65), Median (77), Q3 (85), and Highest Observation (98)]

Describing Spread

The Standard Deviation
• Variance: The variance of a set of observations is an "average" of the squared deviations of the observations from the mean.
• Note: You divide by (n - 1) instead of n.
• Standard Deviation: The SD is the square root of the variance.

Example: Test Scores: 65, 77, 83, 80, 95
1) Find the average: 80
2) Find the deviations from the mean, and their squares

Obs    Deviation from Mean    Deviation Squared
65     -15                    225
77     -3                     9
83     3                      9
80     0                      0
95     15                     225

3) Determine the "mean" of the squares, dividing by (n - 1): (225 + 9 + 9 + 0 + 225) / (5 - 1) = 117 (the Variance)
4) Determine the Standard Deviation: $\sqrt{117} = 10.8$

More Fancy Schmancy Notation

The variance $s^2$ of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations $x_1, x_2, \dots, x_n$ is:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$

or, in more compact notation:

$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

The standard deviation s is the square root of the variance $s^2$:

$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$

Another Example of Standard Deviation

Consider the following years in our past: 1792, 1666, 1362, 1614, 1460, 1867, 1439
Find the standard deviation of these years.
The Mean = 1600

$x_i$     $x_i - \bar{x}$    $(x_i - \bar{x})^2$
1792    192      36864
1666    66       4356
1362    -238     56644
1614    14       196
1460    -140     19600
1867    267      71289
1439    -161     25921

$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2 = \frac{1}{6}(214870) = 35811.667$$

s = 189.2

Why Do We Square The Deviations?
1) The sum of the squared deviations of any set of observations from their mean is the smallest that the sum of squared deviations from any number can possibly be.

Why use the Standard Deviation and not the Variance?
1) The standard deviation is the measure of spread for an important class of symmetric unimodal distributions called the normal distributions.
2) The standard deviation is used by the normal distribution.
3) The variance uses squared deviations, so it is measured in squared units rather than in the units of the original data.

Why use n - 1?
1) The sum of the deviations is *always* zero. So, if we know n - 1 of the deviations, then the last deviation can be calculated. So, only n - 1 of the deviations can vary freely. These are called degrees of freedom.

Properties of Standard Deviations
1) The standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of center.
2) s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0. As the observations get more spread out from the mean, s gets larger.
3) s, like the mean, is not resistant. A few outliers can make s very large.

Which Measure To Use?
Q: When is the mean better than the median? When is the five number summary better than the standard deviation?

Rules O' Thumb
A1: If outliers appear, or if your distribution is skewed, then the mean could be affected, so use the median and the five number summary.
A2: If the distribution is reasonably symmetric and is free of outliers, then the mean and standard deviation should be used.
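The hand calculations on these slides (mean, median, quartiles, suspected outliers, and standard deviation) can be sketched in Python. This is a minimal sketch, not a definitive implementation: the function names are made up for illustration, the quartiles follow the "median of each half" rule described on the slides (statistics packages often use other conventions and may give slightly different quartiles), and at least two observations are assumed.

```python
import math

def mean(values):
    # Add the values together, then divide by the number of observations.
    return sum(values) / len(values)

def median(values):
    # Order the list, then take the middle term (odd n)
    # or the average of the two middle terms (even n).
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    # Q1 is the median of the observations to the left of the overall
    # median's position; Q3 is the median of those to the right.
    # Assumes at least two observations.
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)

def five_number_summary(values):
    q1, q3 = quartiles(values)
    return min(values), q1, median(values), q3, max(values)

def suspected_outliers(values):
    # Flag anything more than 1.5 * IQR above Q3 or below Q1.
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return [x for x in values if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

def sample_variance(values):
    # Sum of squared deviations from the mean, divided by n - 1.
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

def sample_sd(values):
    return math.sqrt(sample_variance(values))

if __name__ == "__main__":
    quiz = [10, 8, 9, 4, 6, 6, 8, 9, 2, 7]
    print(five_number_summary(quiz))                 # (2, 6, 7.5, 9, 10)
    print(sample_variance([65, 77, 83, 80, 95]))     # 117.0
    print(suspected_outliers(list(range(1, 10)) + [100]))  # [100]
```

Running the functions on the slide data reproduces the worked answers, which makes the sketch a quick way to check a hand calculation.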
Changing Units

Consider the following values: 30, 40, 50, 60, 70
The mean is 50 and the standard deviation is 15.8
What happens to these if we take every score, multiply it by 2 and add 10?
We get these values: 70, 90, 110, 130, 150
The mean is 110 and the standard deviation is 31.6
[Figure: three vertical scales, each marked from 30 to 150]

Linear Transformations

A linear transformation changes the original variable x into the new variable $x_{new}$ given by an equation of the form:

$$x_{new} = bx + a$$

Note: The constant a shifts all values of x either up or down by the value a. The constant b changes the size of the unit of the distribution.

Effects of Linear Transformations
1) To get the new spread, multiply the old spread by b (if b is negative, by its absolute value |b|).
2) To get the new mean, multiply the old mean by b and add the constant a.
In the Changing Units example, b = 2 and a = 10: the new mean is 2 * 50 + 10 = 110 and the new standard deviation is 2 * 15.8 = 31.6.

Homework: 43, 45, 49, 50, 54, 59, 63, 64, 65, 72, 73, 75
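The Changing Units example can be checked numerically. The sketch below uses Python's standard `statistics` module (`statistics.mean`, and `statistics.stdev`, which divides by n - 1); the helper name `linear_transform` is an assumption made for illustration.

```python
import statistics

def linear_transform(values, b, a):
    # Apply x_new = b * x + a to every observation.
    return [b * x + a for x in values]

old = [30, 40, 50, 60, 70]
new = linear_transform(old, b=2, a=10)

print(new)  # [70, 90, 110, 130, 150]
print(f"old: mean = {statistics.mean(old):.1f}, s = {statistics.stdev(old):.1f}")
print(f"new: mean = {statistics.mean(new):.1f}, s = {statistics.stdev(new):.1f}")
# old: mean = 50.0, s = 15.8
# new: mean = 110.0, s = 31.6
```

The added constant a moves the mean but leaves the spread alone, while the multiplier b scales both.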