Chapter 4 - Averages and Standard Deviation
PART II: DESCRIPTIVE STATISTICS
Dr. Joseph Brennan, Math 148, BU

Description of Distributions

As in Chapter 3, we will only handle variables that are quantitative in nature. To describe the distribution of a quantitative variable we should specify:

The overall shape of the distribution: number of modes, types of symmetry, types of skew.
Numerical descriptions of the distribution: measures of center and spread.

Measures of Center and Spread

We will consider 3 measures of the center of a distribution: the mode, the average (mean), and the median.
We will also discuss 2 measures of the spread of a distribution: the standard deviation and the interquartile range.

Notation

Suppose we have a data set which consists of n observations. Denote the observations by $x_1, x_2, \dots, x_n$, where $x_1$ is the first observation and $x_n$ is the n-th (last) observation.
The subscripts on the observations $x_i$ are just a way of keeping the n observations distinct. They do not necessarily indicate order or any other special facts about the data.

The Mode

The mode is the number that occurs most frequently in a given data set. A mode can be determined visually from a histogram, as it coincides with a peak. There may be several modes!

The Mean (Average)

The mean is the numerical center of the data. It is the common average by which you have been graded since childhood.
Typically, the mean of a set of data will be denoted by x̄. The mean of a population (found through a census) is denoted µ.
The mean x̄ for a set of observations is determined by adding all values together and dividing by the number of observations. Typically, the number of observations will be denoted by n.

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i .$$

The Σ (capital Greek sigma) in the above formula is short for "add them all up".

Example 1 (Student Height)

The heights (in inches) of 10 students are given below:

71, 70, 68, 69, 68, 65, 72, 69, 71, 62.

What is a mode height? There happen to be three: 68, 69, and 71 (each occurs twice).
What is the mean height?

$$\bar{x} = \frac{71 + 70 + 68 + 69 + 68 + 65 + 72 + 69 + 71 + 62}{10} = 68.5 \text{ (inches)}$$

Example 2 (Temperature)

A biological experiment takes place in an orchard. The outside temperature (in degrees Fahrenheit) is taken every hour. The first 5 successive measurements, 63, 66, 69, 70, and 75, were taken at 8, 9, 10, and 11 a.m. and 12 noon, respectively.

What is the mode temperature? There isn't one, since no value repeats!
What is the mean temperature?

$$\bar{x} = \frac{63 + 66 + 69 + 70 + 75}{5} = 68.6 \text{ (degrees Fahrenheit)}$$

Example 2 (Temperature), continued

Suppose that we recorded the last number wrong and accidentally recorded a temperature of 105 instead of 75. How will the average change?

$$\bar{x}_{\text{error}} = \frac{63 + 66 + 69 + 70 + 105}{5} = 74.6$$

We can track the change between the true mean and the mean found in error:

$$\bar{x}_{\text{error}} = \frac{63 + 66 + 69 + 70 + 75}{5} + \frac{30}{5} = \bar{x}_{\text{true}} + 6$$

A single wrong temperature (an outlier!) has shifted the mean temperature up by 6 degrees!

Weakness of the Average

Example 2, with its misrecorded temperature, illustrates an important weakness of the average. The average is sensitive to the influence of extreme observations.
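The sensitivity of the average to a single extreme value can be checked numerically. The following is a minimal Python sketch using the data from Examples 1 and 2; the helper name `mean` and the variable names are illustrative choices, not part of the lecture.

```python
def mean(values):
    """Arithmetic mean: add all values and divide by the number of observations."""
    return sum(values) / len(values)

heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]   # Example 1
true_temps = [63, 66, 69, 70, 75]                    # Example 2
error_temps = [63, 66, 69, 70, 105]                  # last reading misrecorded as 105

mean_heights = mean(heights)
mean_true = mean(true_temps)
mean_error = mean(error_temps)

# The error of 30 degrees in one of 5 observations moves the mean by 30 / 5 = 6.
shift = mean_error - mean_true

print(mean_heights)                                            # 68.5
print(round(mean_true, 1), round(mean_error, 1), round(shift, 1))  # 68.6 74.6 6.0
```

In general, an error of size e in one of n observations moves the mean by e / n, so the effect of a single outlier shrinks as the data set grows but never disappears.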
Extreme observations may be outliers, but a skewed distribution that has no outliers will also shift the mean towards the long tail. This will be discussed in detail later.
In statistical language we say that the average is not a robust measure of the center.

Properties of the Average

1. The average is always between the smallest and the biggest number in the data set.
2. The average is the center of gravity of the histogram.
3. The average is not resistant (robust) to outliers or to skewness of a distribution. The average shifts towards the long tail of the distribution.
4. The average x̄ estimates the (unknown) population mean µ.
5. The average value x̄ is the best predictor for a future value of a variable.

The Median

The median is the midpoint of a distribution. For a data set, the median is the number such that half of the observations are smaller and the other half are larger. Typically, the median will be denoted x̃.

To find the median of a distribution:

1. Arrange all observations in order of size, from smallest to largest. Be sure to list all observations, even if the same values are repeated several times.
2. If the number of observations n is odd, the median x̃ is the center observation in the ordered list. The location of the median can be found by counting (n + 1)/2 observations up from the bottom of the list. If the number of observations n is even, the median x̃ is any number between the two center observations in the ordered list.
3. When n is even, we will usually take the median x̃ to be the average of the two center observations in the ordered list.

Example 1 (Height) and 2 (Temperature) Revisited

Example 1. The sample size for the height of students is n = 10, which is even.
The median (69) is the average of the two middle observations in the list:

62 65 68 68 69 69 70 71 71 72

Example 2. The sample size for the temperature is n = 5, which is odd. The median is the 3rd observation in this list:

63 66 69 70 75

When the last temperature was recorded wrong, the median is again 69:

63 66 69 70 105

As we can see, the outlier does not change the median! The median is resistant (robust) to the influence of extreme observations.

Properties of the Median

1. The median is always between the smallest and the biggest number in the data set.
2. The median is the value which divides the area of the histogram in half. The area of the histogram to the left of the median is equal to the area of the histogram to the right of the median.
3. The median is resistant (robust) to the influence of extreme observations and to the skewness of a distribution.

What Measure of Center is Applicable?

Different measures of center are appropriate in different situations. Consider several examples:

1. A shoe store is interested in which shoe size is in greatest demand. This question is about the mode of the distribution of shoe sizes.
2. An economist studying household incomes is interested in the middle income value and wants to de-emphasize the impact of the few very high incomes that are typically present in such a data set. The economist is interested in the median income.
3. An instructor is asked what the average score on the test was. The instructor is interested in the mean grade.

What Measure of Center is Applicable? (continued)

It is important to note that there is no unique notion of center! A proper measure of center should be chosen based upon the study question and a look at the available data.
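To compare the measures of center on the data from Examples 1 and 2, here is a small Python sketch. The function names `modes` and `median` are illustrative; returning an empty list when no value repeats is an assumption made here to match the "There isn't one!" answer for the temperature data.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value, sorted; [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # assumed convention: no mode when every value occurs once
    return sorted(v for v, c in counts.items() if c == top)

def median(values):
    """Center observation if n is odd, average of the two center ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]
temps = [63, 66, 69, 70, 75]
temps_error = [63, 66, 69, 70, 105]

print(modes(heights), median(heights))      # [68, 69, 71] 69.0
print(modes(temps), median(temps))          # [] 69
print(median(temps_error))                  # 69
```

The median of the temperature data is 69 with or without the misrecorded value, while the mean moves from 68.6 to 74.6.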
Data whose distribution is roughly symmetric has a median, x̃, and mean, x̄, close together. Note that for a perfectly symmetric distribution the mean equals the median. Real data sets, however, never have perfectly symmetric distributions.
Data whose distribution is highly skewed (either left or right) has a noticeably separate median and mean. Note that for a highly skewed distribution the median is the preferred center because, unlike the mean, it is robust and isn't affected by outliers.

Center Measurements on a Histogram

[Figure 5. Measures of center.]

Measures of Spread

Example 3. Consider two sets of data:

Set A: 48, 49, 51, 53, 45, 47, 55, 50, 51, 51
Set B: 10, 65, 17, 89, 100, 40, 21, 99, 34, 25

[Figure 6. Histograms for Sets A and B.]

Both sets have the same mean of 50, but the spread of the distribution in set B is much greater than it is in set A.

The Sample Standard Deviation

The sample standard deviation measures the spread by looking at how far the observations are from their average x̄. Typically, the standard deviation will be denoted by s. The formula for the standard deviation is:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

NOTE: There is an alternative way to compute s, which is more efficient in some cases:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \sqrt{\overline{x^2} - \bar{x}^2},$$

where $\overline{x^2}$ is the average of the squared data values.

Interpreting the Standard Deviation

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

1. Begin with $x_i - \bar{x}$: First compute the sample mean x̄. Second, for each data point $x_i$, record the difference between $x_i$ and x̄. The difference $x_i - \bar{x}$ measures how far a data point deviates from the mean. Deviations may be positive or negative.
However, the sum of the deviations is zero.

2. Proceed to $(x_i - \bar{x})^2$: Squaring the deviations makes them non-negative.
3. Proceed to $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$: We now find the mean of the squared deviations. Do not confuse this mean with the sample mean x̄.
4. Finish by taking a square root, effectively undoing the square.

Properties of the Sample Standard Deviation

1. s measures spread about the mean. Among the measures of center, the standard deviation is connected only to the mean.
2. s = 0 only when there is no spread. This happens only when all the observations are identical. Otherwise s is positive. As observations become more spread out about their mean, s gets larger.
3. s, like the average x̄, is not robust. Distributions with outliers and strongly skewed distributions have large standard deviations.
4. s has the same unit of measurement as the original observations.
5. For bell-shaped distributions, s can be interpreted as the deviation of a typical observation from the mean.

Example 2 (Temperature)

Let us compute the standard deviation for the original data set: 63, 66, 69, 70, 75.

Step 1. Compute the average x̄: x̄ = 68.6 (completed earlier).
Step 2. Compute the deviations: see column 2 below.
Step 3. Square the deviations: see column 3 below.

Observation    Deviation              Squared deviation
63             63 - 68.6 = -5.6       31.36
66             66 - 68.6 = -2.6        6.76
69             69 - 68.6 =  0.4        0.16
70             70 - 68.6 =  1.4        1.96
75             75 - 68.6 =  6.4       40.96
Sum                          0        81.2

Step 4. Average the squared deviations:

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{5}(31.36 + 6.76 + 0.16 + 1.96 + 40.96) = 16.24$$

Step 5. Take the square root of the averaged squared deviations:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{16.24} \approx 4.03$$

We have computed that the mean temperature is 68.6 degrees Fahrenheit, while the standard deviation among the data points is 4.03 degrees Fahrenheit.

Example 2 (Temperature), continued

Now let us recompute the standard deviation for the case when the last observation was recorded wrong (105 instead of 75). We have already calculated the mean for this case, $\bar{x}_{\text{error}} = 74.6$. Computing:

Observation    Deviation                Squared deviation
63             63 - 74.6 = -11.6        134.56
66             66 - 74.6 =  -8.6         73.96
69             69 - 74.6 =  -5.6         31.36
70             70 - 74.6 =  -4.6         21.16
105            105 - 74.6 = 30.4        924.16
Sum                           0        1185.2

In this case, when the last observation is recorded wrong,

$$s_{\text{error}} = \sqrt{\frac{1185.2}{5}} \approx 15.40,$$

which is almost 4 times greater than the standard deviation for the original data set. Remember: the standard deviation, like the mean, is NOT robust.

Example 5. What is the standard deviation for the data 5, 5, 5, 5, 5, 5, 5, 5, 5, 5?

Visual Intuition

There are physically intuitive interpretations of the mean, median and mode. Consider the sketch of any density histogram.

The peaks are the modes and are the easiest to spot.
The median is the first place where a vertical line splits the area under the density histogram equally.
The mean is the center of gravity of the histogram, thought of as a physical mass.
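Returning to the standard deviation computations in Example 2 and Example 5, the five steps can be sketched in Python as follows. The function names are illustrative. The sketch uses the 1/n divisor exactly as in the formula on these slides; note that many textbooks and software defaults divide by n - 1 instead, which gives slightly larger values.

```python
import math

def sd_divide_by_n(values):
    """Steps 1-5: mean, deviations, squares, average of squares, square root."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / n)

def sd_shortcut(values):
    """Alternative formula: sqrt(average of squares minus square of average)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(x * x for x in values) / n
    # Clamp at zero to guard against a tiny negative value from rounding error.
    return math.sqrt(max(mean_of_squares - mean ** 2, 0.0))

temps = [63, 66, 69, 70, 75]
temps_error = [63, 66, 69, 70, 105]
fives = [5] * 10

print(round(sd_divide_by_n(temps), 2))        # 4.03
print(round(sd_divide_by_n(temps_error), 2))  # 15.4
print(sd_divide_by_n(fives))                  # 0.0
```

In Python's standard library, `statistics.pstdev` uses the same 1/n divisor, while `statistics.stdev` divides by n - 1.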
Visual Intuition (continued)

To illustrate that the mean plays the role of the center of gravity, and why it can differ from the median, consider two density histograms which have similar shapes but are still different:

[Figure: two density histograms, each with its mean marked.]

Although the median stays the same, the mean moves to the right as the second peak moves to the right!

Percentiles

The p-th percentile of the data is a value such that p percent of the observations fall at or below it.

The median is the 50th percentile.
The most commonly used percentiles other than the median are the quartiles. The first quartile, Q1, is the 25th percentile, and the third quartile, Q3, is the 75th percentile. The second quartile is the median itself.

NOTE: 50% of the observations are located between the quartiles Q1 and Q3.

Percentiles, and quartiles in particular, are useful numerical characteristics of the data distribution. The quartiles are used to compute the interquartile range.

The Interquartile Range

The interquartile range, IQR, is the distance between the first and third quartiles:

IQR = Q3 − Q1.

To calculate the quartiles:

Arrange the observations in increasing order.
If the number of observations n is even, split the ordered data set into 2 halves. The first quartile Q1 is the median of the first half of the data set. Similarly, the third quartile Q3 is the median of the second half.
If the number of observations n is odd, split the data set into two halves by excluding the central value (the median). Then find Q1 and Q3 as the medians of the corresponding halves.
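The quartile procedure above can be sketched in Python as follows, applied to the height and temperature data from Examples 1 and 2. The function names are illustrative. This follows the split-into-halves rule stated here; statistical software often uses other quartile conventions and may report slightly different values.

```python
def median(values):
    """Center observation if n is odd, average of the two center ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the ordered data.

    When n is odd, the central value (the median) is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1

heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]  # n = 10, even
temps = [63, 66, 69, 70, 75]                        # n = 5, odd

print(quartiles(heights), iqr(heights))  # (68, 71) 3
print(quartiles(temps), iqr(temps))      # (64.5, 72.5) 8.0
```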
Example 1 (Heights)

The heights (in inches) of 10 students are given below:

71, 70, 68, 69, 68, 65, 72, 69, 71, 62.

Notice that we are dealing with an even number of observations. Following the procedure for even n, the ordered data split into two halves of five observations each:

62 65 68 68 69 | 69 70 71 71 72

so Q1 = 68, Q3 = 71, and IQR = 71 − 68 = 3 inches.

Example 2 (Temperature)

A biological experiment takes place in an orchard. The outside temperature (in degrees Fahrenheit) is taken every hour for 5 hours: 63, 66, 69, 70, and 75.

Notice that we are dealing with an odd number of observations. Following the procedure for odd n, the central value 69 is excluded and the halves are

63 66 | 70 75

so Q1 = 64.5, Q3 = 72.5, and IQR = 72.5 − 64.5 = 8 degrees Fahrenheit.

Summary of Center and Spread

For the distribution of a quantitative variable we now have multiple ways to describe the center and spread:

Center: Mean, Median, Mode.
Spread: Standard Deviation, Interquartile Range.

Linear Transformations

Suppose we have a set of numbers which we want to transform into another set of numbers with different units of measurement.
We will consider only linear transformations of variables, which have the following form:

$$x_{\text{new}} = a + bx, \qquad (1)$$

where $x_{\text{new}}$ is the variable in new units, x is the old variable, and a and b are numbers.

The key word is linear. Linear transformations graph as a straight line with y-intercept a and slope b.

Examples of Linear Transformations

The distance in kilometers is translated into distance in miles using the formula

$$M = 0.62K, \qquad (2)$$

where M is the distance in miles, and K is the distance in kilometers. For instance, a 10-kilometer race covers 6.2 miles.
The temperature in degrees Fahrenheit is translated into temperature in degrees Celsius as

$$C = \frac{5}{9}(F - 32) = -\frac{160}{9} + \frac{5}{9}F, \qquad (3)$$

where C is the temperature in degrees Celsius, and F is the temperature in degrees Fahrenheit. For instance, 95°F translates into 35°C, while −40°F translates into −40°C.

The Effect of a Linear Transformation

How do the numerical measures of center and spread change after a linear transformation?
We will separately consider 2 special cases of linear transformation:

Data Shifts: a special case of the transformation (1) when b = 1:
$$x_{\text{new}} = x + a,$$
which corresponds to adding a constant a to every observation.

Scale Changes: a special case of the transformation (1) when a = 0, b ≠ 0:
$$x_{\text{new}} = bx,$$
which corresponds to multiplying each observation by a constant b (positive or negative). Transformation (2) is an example of a scale transformation.

Effects of a Shift Transformation

If we act upon a data set by a shift transformation $x_{\text{new}} = x + a$, the measures of center and spread change as follows:

Mean:                  $\bar{x}_{\text{new}} = \bar{x} + a$
Median:                $\tilde{x}_{\text{new}} = \tilde{x} + a$
1st Quartile:          $Q_{1,\text{new}} = Q_1 + a$
3rd Quartile:          $Q_{3,\text{new}} = Q_3 + a$
Standard Deviation:    $s_{\text{new}} = s$
Interquartile Range:   $\text{IQR}_{\text{new}} = \text{IQR}$

Example (Football team)

Wrong Scale: Every player of a high school football team was weighed using the same scale. It was discovered later that the scale read 10 lb under, so we need to add 10 lb to every weight. What would happen to each of the following measurements?

Characteristic        Original    After Adjustment
Average               230 lb      240 lb
Median                240 lb      250 lb
Q1                    200 lb      210 lb
Q3                    280 lb      290 lb
Standard Deviation    50 lb       50 lb
IQR                   80 lb       80 lb

Effects of a Scale Transformation

If we act upon a data set by a scale transformation $x_{\text{new}} = bx$, the measures of center and spread change as follows:

Mean:                  $\bar{x}_{\text{new}} = b\bar{x}$
Median:                $\tilde{x}_{\text{new}} = b\tilde{x}$
1st Quartile:          $Q_{1,\text{new}} = bQ_1$
3rd Quartile:          $Q_{3,\text{new}} = bQ_3$
Standard Deviation:    $s_{\text{new}} = |b| \cdot s$
Interquartile Range:   $\text{IQR}_{\text{new}} = |b| \cdot \text{IQR}$

where | · | denotes absolute value.

Example 7 (Football team)

Now suppose we found out that we are supposed to report weights and all the summary measures in kilograms, not in pounds! Recall that 1 lb ≈ 0.453 kilograms. What would happen to each of the following measurements?

Characteristic        Original    After Adjustment
Mean                  230 lb      104 kg
Median                240 lb      109 kg
Q1                    200 lb      91 kg
Q3                    280 lb      127 kg
Standard Deviation    50 lb       23 kg
IQR                   80 lb       36 kg

Effects of a General Linear Transformation

If we act upon a data set by any linear transformation $x_{\text{new}} = bx + a$, the measures of center and spread change as follows:

Mean:                  $\bar{x}_{\text{new}} = b\bar{x} + a$
Median:                $\tilde{x}_{\text{new}} = b\tilde{x} + a$
1st Quartile:          $Q_{1,\text{new}} = bQ_1 + a$
3rd Quartile:          $Q_{3,\text{new}} = bQ_3 + a$
Standard Deviation:    $s_{\text{new}} = |b| \cdot s$
Interquartile Range:   $\text{IQR}_{\text{new}} = |b| \cdot \text{IQR}$

where | · | denotes absolute value.

The Empirical Rule

As a general rule, if the data distribution is unimodal, roughly symmetric, and bell-shaped, then:

Approximately 68% of the observations fall within one standard deviation of the average, i.e., approximately 68% of data values are between x̄ − s and x̄ + s.
Approximately 95% of the observations fall within 2 standard deviations of the average, i.e., approximately 95% of data values are between x̄ − 2s and x̄ + 2s.
Approximately 99.7% of the observations fall within 3 standard deviations of the average, i.e., approximately 99.7% of data values (almost all the observations) are between x̄ − 3s and x̄ + 3s.

[Figure 1. Empirical Rule Illustration.]

WARNING: The Empirical Rule Strikes Back

Don't throw caution (and, perhaps, lightsabers) to the wind while using the empirical rule! Keep in mind:

The Empirical Rule does NOT give the exact percentages of observations within one, two, or three standard deviations, just approximate percentages.
The Empirical Rule works well only for symmetric bell-shaped histograms.
The Empirical Rule will not be far off for lightly skewed distributions, but it will be very wrong for moderately or heavily skewed distributions.

Example (HANES5 study)

HANES5 (The Health and Nutrition Examination Survey) was a study done in 2003-2004 recording the height (in inches) of women (see p. 58). The mean height was x̄ = 63.5 inches and the standard deviation was s = 3 inches.
The histogram is approximately symmetric and bell-shaped, so the Empirical Rule should hold.

[Figure: histogram of women's heights with the region within 1 standard deviation of the average shaded.]

The shaded region corresponds to the women with height within 1 standard deviation of the average:

x̄ ± s = 63.5 ± 3, i.e., the interval (60.5, 66.5).

The true area of the shaded region is 72%, which is fairly close to 68%.

[Figure: histogram of women's heights with the region within 2 standard deviations of the average shaded.]

The shaded region corresponds to the women with height within 2 standard deviations of the average:

x̄ ± 2s = 63.5 ± 2 · 3, i.e., the interval (57.5, 69.5).

The true area of the shaded region is 97%, which is fairly close to 95%.
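The Empirical Rule and its warning can be explored with a short simulation. The Python sketch below is only illustrative: the data are simulated with a fixed seed and are NOT the HANES5 measurements, and the center 63.5 and spread 3 are borrowed from the example simply to give the bell-shaped sample a familiar scale. The function name `fraction_within` is an assumption of this sketch.

```python
import math
import random

def fraction_within(values, k):
    """Fraction of observations within k standard deviations (1/n divisor) of the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    inside = sum(1 for x in values if abs(x - mean) <= k * sd)
    return inside / n

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated bell-shaped data (not real survey data).
bell = [rng.gauss(63.5, 3) for _ in range(10_000)]

# Simulated heavily right-skewed data, for contrast.
skewed = [rng.expovariate(1.0) for _ in range(10_000)]

for k in (1, 2, 3):
    print(k, round(fraction_within(bell, k), 3), round(fraction_within(skewed, k), 3))
```

For the bell-shaped sample the three fractions should land near 68%, 95% and 99.7%. For the skewed sample the fraction within one standard deviation is noticeably higher than 68%, which is the kind of failure the warning slide describes.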