STAT 3660 – Introduction to Statistics
Chapter 6 – Measures of Central Tendency

Objectives
• By the end of this material, you will be able to:
  – Explain the purposes of measures of central tendency and interpret the information they provide
  – Calculate, explain and compare modes, medians and means
  – Understand other measures of central tendency
  – Select an appropriate measure of central tendency according to the level of measurement and the characteristics of the distribution

Measures of Central Tendency
• Measures of central tendency are values calculated from the scores of a distribution to represent its location or center. They are important descriptive statistics.
• There are three (3) common measures of central tendency:
  – Mean ($\mu$ for a population, $\bar{x}$ for a sample): the arithmetic center of a distribution and the most commonly used measure. It reports the average score of a distribution.
  – Median ($\tilde{x}$): the center score of the distribution. Half of the values are below the median and half are above it.
  – Mode ($\hat{x}$): the most frequently occurring score of the distribution. A distribution may have more than one mode, or none.

Measures of Central Tendency
• Mean ($\bar{x}$, average)
  – The mean is calculated by totaling all of the scores or values and dividing by the number of scores or values that were summed:
      $\bar{x} = \frac{\sum x_i}{N}$
    where $\sum x_i$ = the sum of all values and $N$ = the number of values in the summation

Measures of Central Tendency
• Three characteristics of the mean
  1. The mean balances all of the scores because it acts like a fulcrum. It is the point around which all of the scores cancel out.
  2. The mean is the point in a distribution around which the variation is minimized (least squares).
  3. The mean can be misleading if the distribution is skewed or contains unusual (outlier) values. Distributions with significant skewness will have means pulled toward the long tail. This is discussed further with the median.

Measures of Central Tendency
• Example – Given the following values:

    Sample   Value
    1        2
    2        6
    3        4
    4        4
    5        4
    Total    20

  $\sum x_i = 2 + 6 + 4 + 4 + 4 = 20$
  $\bar{x} = \frac{\sum x_i}{N} = \frac{20}{5} = 4$

Measures of Central Tendency
• The mean balances all of the scores because it acts like a fulcrum. It is the point around which all of the scores cancel out.

    Sample   Data Value   Distance from Mean
    1        3            3 - 4 = -1
    2        4            4 - 4 = 0
    3        7            7 - 4 = 3
    4        5            5 - 4 = 1
    5        1            1 - 4 = -3
    Total    20           0

  $\bar{x} = \frac{\sum x_i}{N} = \frac{20}{5} = 4$

Measures of Central Tendency
• The mean is the point in a distribution around which the variation is minimized (least squares)

    Sample   x    x - x-bar     (x - x-bar)^2   x - 3        (x - 3)^2   x - 5        (x - 5)^2
    1        3    3 - 4 = -1    1               3 - 3 = 0    0           3 - 5 = -2   4
    2        4    4 - 4 = 0     0               4 - 3 = 1    1           4 - 5 = -1   1
    3        7    7 - 4 = 3     9               7 - 3 = 4    16          7 - 5 = 2    4
    4        5    5 - 4 = 1     1               5 - 3 = 2    4           5 - 5 = 0    0
    5        1    1 - 4 = -3    9               1 - 3 = -2   4           1 - 5 = -4   16
    Total    20   0             20              5            25          -5           25

  Notice that the sum of squared differences increases as the reference point moves away from the mean (20 around the mean of 4, but 25 around 3 or 5).

Measures of Central Tendency
• To illustrate the mean we use the following dotplot
  [Figure: dotplot of the data]
  The arithmetic center of the distribution, or balance point, is 4, as calculated on the previous slide.

Measures of Central Tendency
• Strengths and weaknesses of the mean
  – Strengths
    • All data from a variable are used to compute the mean
  – Weaknesses
    • Every score affects the mean
    • A single very high or very low score will skew the distribution and thereby give a misleading interpretation of the mean

Mean of a distribution with outliers
  [Figure: distribution of percent of people dying; $\bar{x}$ = 3.4 without the outliers, $\bar{x}$ = 4.2 with the outliers]
  The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).

Measures of Central Tendency
• Median (center value)
  – The median is calculated by locating the center value
  – If there is an odd number of data values, order the values and remove them in pairs from the beginning and end until the center value remains, e.g.:
      2 4 4 4 6
    The median is 4
  – If there is an even number of data values, find the two center values and average them, e.g.:
      2 4 4 4 5 6
    Median = $\frac{4 + 4}{2} = \frac{8}{2} = 4$

Measures of Central Tendency
• Characteristics of the median
  – The median can be calculated for ordinal as well as numeric data, but cannot be calculated for nominal data because nominal data has no order
  – The median is always the exact center of the distribution
  – One-half of the values within the distribution are below the median and one-half are above it
  – Notice that the mean and median are not always the same value:
    • In 2004, household income had a mean of $60,528 and a median of $43,384. The mean is almost 40% higher. Why do you think this is the case? [1]
  [1] Source: US Census Bureau

Mean and median of a distribution with outliers
  [Figure: distribution of percent of people dying; $\bar{x}$ = 3.4 without the outliers, $\bar{x}$ = 4.2 with the outliers]
  The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2). The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
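The three characteristics of the mean, and its contrast with the median, can be checked numerically. The following is a minimal Python sketch using only the standard library. It reuses the slide data (3, 4, 7, 5, 1); the extra score of 100 is a hypothetical outlier added for illustration and does not come from the slides.

```python
from statistics import mean, median

scores = [3, 4, 7, 5, 1]   # data from the fulcrum and least-squares slides
x_bar = mean(scores)       # 20 / 5 = 4

# Characteristic 1: deviations from the mean cancel out (sum to zero).
total_deviation = sum(x - x_bar for x in scores)


def sum_of_squares(center, data):
    """Sum of squared distances of the data from `center`."""
    return sum((x - center) ** 2 for x in data)


# Characteristic 2: the sum of squares is smallest at the mean.
squares_by_center = {c: sum_of_squares(c, scores) for c in (3, 4, 5)}

# Characteristic 3: one extreme score moves the mean far more than the median.
# The value 100 is a hypothetical outlier, not part of the slide data.
with_outlier = scores + [100]

print(x_bar, total_deviation)                      # 4 0
print(squares_by_center)                           # {3: 25, 4: 20, 5: 25}
print(mean(scores), "->", mean(with_outlier))      # 4 -> 20
print(median(scores), "->", median(with_outlier))  # 4 -> 4.5
print(median([2, 4, 4, 4, 5, 6]))                  # 4.0 (even-count example)
```

With the hypothetical outlier the mean jumps from 4 to 20 while the median only moves from 4 to 4.5, which is the same pattern the outlier figure shows.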
Measures of Central Tendency
• Mode (most frequent)
  – The mode is found by finding the most frequent value (or values)

    Value   Frequency (f)
    2       1
    4       3
    6       1

  The most frequent value is 4, since it occurs 3 times, so the mode is 4

Measures of Central Tendency
• Characteristics of the mode:
  – It is the most common score
  – All three levels of measurement can use the mode
  – It is commonly used with categorical-nominal variables
  – Some distributions have no mode
  – Some distributions have multiple modes
  – For categorical-ordinal and numerical-continuous variables, the mode may not be central to the distribution

Measures of Central Tendency
• Notice that in our example the mean, median and mode all have the same value (4), which is not always the case
• You can tell a lot about the “shape” of a distribution by comparing these three values
  [Figure: a symmetric distribution and a skewed distribution, each with the mode, median and mean marked]
  – In the first distribution the mean, median and mode are the same value
  – In the second distribution the mean, median and mode are different values
• Notice that the mean will always be pulled in the direction of the “skew” of the distribution

Distinguishing the Median and Mean of a Density Curve
• The median of a distribution is the equal-areas point: the point that divides the area under the curve in half.
• The mean is the balance point, at which the curve would balance if made of solid material.
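The frequency-table approach to the mode, and the way skew separates the three measures, can be sketched in Python with the standard library. The values 2, 4, 4, 4, 6 come from the mode slide; the color list and the right-skewed scores are hypothetical data made up for illustration.

```python
from collections import Counter
from statistics import mean, median, multimode

values = [2, 4, 4, 4, 6]            # data from the mode slide
frequency = Counter(values)         # maps each value to its count
modes = multimode(values)           # every value tied for the highest count

# Nominal data has a mode even though it has no mean or median.
# These colors are hypothetical, chosen to produce two modes.
colors = ["red", "blue", "blue", "green", "red"]
color_modes = multimode(colors)

# Hypothetical right-skewed scores: the long tail pulls the mean
# above the median, while the mode stays at the peak.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

print(dict(frequency))   # {2: 1, 4: 3, 6: 1}
print(modes)             # [4]
print(color_modes)       # ['red', 'blue']
print(multimode(right_skewed), median(right_skewed), mean(right_skewed))
# [3] 3.0 4.75
```

One design note: when every value occurs exactly once, `multimode` returns all of the values, which corresponds to what the slides describe as a distribution with no mode.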
Measures of Central Tendency
• When a dataset has an extreme value, the mean will be pulled in the direction of the extreme scores
  – For a positive skew (skewed to the right), the mean will be greater than the median
  – For a negative skew (skewed to the left), the mean will be less than the median
• When a numerical variable has a pronounced skew, the median may be the more trustworthy measure of central tendency

Question
• Given the values 3, 5, 2, 6, 8, 6:
  – Calculate the mean:     a) 5.5   b) 5   c) 6   d) 4
  – Calculate the median:   a) 5.5   b) 5   c) 6   d) 4
  – Calculate the mode:     a) 5.5   b) 6   c) 5   d) 4

Measures of Central Tendency
• The relationship between level of measurement and measures of central tendency:

    Measure of            Level of Measurement
    Central Tendency      Nominal   Ordinal   Interval-Ratio
    Mode                  Yes       Yes       Yes
    Median                No        Yes       Yes
    Mean                  No        Yes (?)   Yes

Measures of Central Tendency
• Choosing a measure of central tendency
  Use the Mode when:
  1. The variable is measured at the nominal level
  2. You want a quick and easy measure for ordinal and numerical variables
  3. You want to report the most frequent score
  Use the Median when:
  1. The variable is measured at the ordinal level
  2. A variable measured at the numerical level has a highly skewed distribution (or outliers are present)
  3. You want to report the central score. The median always lies at the exact center of a distribution
  Use the Mean when:
  1. The variable is measured at the numerical level (except when the variable is highly skewed)
  2. You want to report the typical score. The mean is “the fulcrum that exactly balances all of the scores”
  3. You anticipate additional statistical analysis

Question
• A friend of yours used SPSS to report the mean political affiliation, where 1 = democrat, 2 = independent, 3 = republican, and 4 = other. You kindly state that:
  a) SPSS is not useful for calculating a mode.
  b) the variable political affiliation is nominal and therefore you should calculate a mode(s) for this variable.
  c) the variable political affiliation is ordinal and therefore you should calculate a median for this variable.
  d) to correctly calculate the mean, you needed to omit the “other” category.
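The issue behind the question above can be explored with a small Python sketch. The coding scheme (1 = democrat through 4 = other) is taken from the question; the list of responses, the helper `modal_labels`, and the reversed coding are hypothetical and exist only for illustration. The sketch shows that software will happily average category codes, but the result changes when the arbitrary codes are reassigned, whereas the most frequent category does not.

```python
from collections import Counter
from statistics import mean

labels = {1: "democrat", 2: "independent", 3: "republican", 4: "other"}

# Hypothetical survey responses, stored as the numeric codes above.
responses = [1, 3, 3, 2, 4, 3, 1, 2, 3, 1]


def modal_labels(codes, code_to_label):
    """Return the label(s) of the most frequent code(s)."""
    counts = Counter(codes)
    top = max(counts.values())
    return [code_to_label[c] for c, n in counts.items() if n == top]


# Reverse the arbitrary coding (1 <-> 4, 2 <-> 3); the people are unchanged.
recode = {1: 4, 2: 3, 3: 2, 4: 1}
recoded = [recode[c] for c in responses]
recoded_labels = {recode[c]: name for c, name in labels.items()}

print(mean(responses), mean(recoded))          # 2.3 2.7
print(modal_labels(responses, labels))         # ['republican']
print(modal_labels(recoded, recoded_labels))   # ['republican']
```

Under these assumptions the "mean affiliation" moves from 2.3 to 2.7 purely because the labels were renumbered, while the modal category stays the same.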
Other Measures of Location
• Percentiles: the point below which a specific percentage of cases fall
• Quartiles: divide the distribution into quarters (25, 50, 75)
• E.g., the median falls at the 50th percentile (or the 2nd quartile)

Other Measures of Location
• To calculate a percentile:
  – Sort the scores in order from low to high
  – Multiply the number of cases (N) by the proportional value of the percentile (for example, the 80th percentile is 0.8)
  – The resulting value indicates the position in the array of cases
• Example
  – In a sample of 70 test grades, we want to find the 3rd quartile (or the 75th percentile)
  – 70 x 0.75 = 52.5, rounding to 53, so the 53rd case is the 75th percentile

Chapter 6 – Measures of Dispersion

Objectives
• By the end of this material, you will be able to:
  – Explain the purpose of measures of dispersion and the information they convey
  – Compute and explain:
    • The range (R)
    • The inner quartile range (IQR), also called the interquartile range (note: the book uses the symbol Q for this statistic, but the more common symbol is IQR)
    • The variance ($\sigma^2$ or $s^2$)
    • The standard deviation ($\sigma$ or $s$)
  – Select the appropriate measure of dispersion
  – We will not discuss Average Absolute Deviation (AAD) or Median Absolute Deviation (MAD)

Measures of Dispersion
• Notice that all of the measures of dispersion that we will discuss are for numerical data
• The measures of dispersion for categorical data will NOT be discussed in this material

Measures of Dispersion
• In the prior material you learned how to describe a variable using graphical techniques and measures of center (or representative values)
• In this material we will discuss the spread (or width) of the variable
• The measures of dispersion that we will discuss in this material are:
  – The range (R): the maximum minus the minimum, which is the total spread of the data
  – The inner quartile range (IQR): Q3 (3rd quartile) - Q1 (1st quartile), which is commonly referred to as the inner spread and contains the middle 50% of the data
  – The variance ($\sigma^2$ or $s^2$) and the standard deviation ($\sigma$ or $s$), which are discussed in detail later
• The Index of Qualitative Variation (IQV) will NOT be discussed

Measures of Dispersion
• Range (R)
  – The range is the easiest measure of dispersion to calculate and is simply the total spread of the data
      Range = R = Highest score (Max) - Lowest score (Min)

Measures of Dispersion
• Characteristics of the Range (R)
  – Although the range is simple to calculate and interpret, there are significant limitations to its value
  – It is highly affected by extreme values (outliers), so it can easily suggest more variation (spread) in the data than is actually there
  – It only includes information from two of the data values, so it has limited power for interpretation and for more advanced statistical techniques

Measures of Dispersion
• Example – Given the following values:

    Sample   Value
    1        2
    2        6
    3        4
    4        4
    5        4
    Total    20

  The lowest score, Minimum (Min) = 2
  The highest score, Maximum (Max) = 6
  R = Max - Min = 6 - 2 = 4

Question
• Given the data 1, 2, 3, 4, 5, what is the range?
  a) 3   b) 5   c) 4   d) 6

Measures of Dispersion
• The Inner Quartile Range (IQR) avoids some of the problems associated with the range by considering the middle 50% of the distribution
      Inner Quartile Range = IQR = Q3 - Q1
  – To find the inner quartile range, arrange the scores from highest to lowest and then divide the distribution into four (4) quarters
  – Find the scores below which 25% of the values fall (first quartile, Q1) and below which 75% of the values fall (third quartile, Q3)
  – The difference between Q3 and Q1 is the inner quartile range (IQR)

Measures of Dispersion
• Characteristics of the IQR
  – The IQR extracts the middle 50% of the data
  – The IQR avoids some of the problems of the range being exaggerated by extreme or unusual values (outliers)
  – The IQR is also calculated using only two (2) data values, so it carries limited information and has limited further use in statistical techniques, but it is easily understood and interpreted

Measures of Dispersion
• Example – Given the data from the text in Table 4.3, Percent of Population Aged 25 and Older with a College Degree, 2007 [sample of 20 states], on page 94
  – Maximum = Max = 34.7 (Connecticut)
  – Q3 = 27.0 (20 * 0.75 = 15, Montana)
  – Q2 (median) = 25.45
  – Q1 = 22.1 (20 * 0.25 = 5, Indiana)
  – Minimum = Min = 17.3 (West Virginia)
  Note: The IQR = 27.0 - 22.1 = 4.9
  [Figure: MiniTab boxplot of the data]
  Outliers would be beyond:
    Lower Bound = Q1 - 1.5 * IQR = 22.1 - 1.5 * 4.9 = 14.75
    Upper Bound = Q3 + 1.5 * IQR = 27.0 + 1.5 * 4.9 = 34.35

Example: Consider our New York travel times data. Construct a boxplot.
  Raw data:
    10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
  Sorted data:
    5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
  Min = 5, Q1 = 15, M = 22.5, Q3 = 42.5, Max = 85
  [Figure: boxplot of TravelTime, axis from 0 to 90; the value 85 is an outlier by the 1.5 x IQR rule]

Measures of Dispersion
• The boxplot provides a very useful and informative chart for describing a distribution
  – The boxplot uses what is called the 5-number summary:
    • Minimum
    • Q1 (1st quartile)
    • Q2 (2nd quartile or median)
    • Q3 (3rd quartile)
    • Maximum
  – The boxplot can illustrate the shape of the distribution by the length of the whiskers (longer whiskers on one side indicate a skewed distribution)
  – Outliers are identified as values above Q3 or below Q1 by more than 1.5 * IQR

Question
• Given Q3 = 20 and Q1 = 10, an outlier would be any value that is:
  a) Below 10   b) Above 30   c) Below 5   d) None of the above

Measures of Dispersion
• Variance ($\sigma^2$ or $s^2$) and Standard Deviation ($\sigma$ or $s$)
  – A good measure of dispersion needs to have the following characteristics:
    • Use all the scores in the distribution: the statistic should use all the information available
    • Describe the “average” or typical deviation of the scores: the statistic should give us an idea of how far scores are from each other or from the center of the distribution
    • Increase in value as the scores become more diverse and decrease in value as the scores become less diverse, which allows comparison between different distributions

Measures of Dispersion
• Example of Variance ($\sigma^2$ or $s^2$) – Given the following values:

    Sample   Value   x - x-bar              (x - x-bar)^2
    1        66      66 - 72.2 = -6.2       (-6.2)^2 = 38.44
    2        75      75 - 72.2 = +2.8       (2.8)^2 = 7.84
    3        69      69 - 72.2 = -3.2       (-3.2)^2 = 10.24
    4        72      72 - 72.2 = -0.2       (-0.2)^2 = 0.04
    5        84      84 - 72.2 = +11.8      (11.8)^2 = 139.24
    6        90      90 - 72.2 = +17.8      (17.8)^2 = 316.84
    7        96      96 - 72.2 = +23.8      (23.8)^2 = 566.44
    8        70      70 - 72.2 = -2.2       (-2.2)^2 = 4.84
    9        55      55 - 72.2 = -17.2      (-17.2)^2 = 295.84
    10       45      45 - 72.2 = -27.2      (-27.2)^2 = 739.84
    Totals   722     0                      2,119.60

  Average = $\bar{x} = \frac{\sum x_i}{N} = \frac{722}{10} = 72.2$
  Range = R = Max - Min = 96 - 45 = 51
• Recall from the discussion of the mean that the mean acts as the balance point of the distribution, so the deviations from the mean add to zero (0)
• To keep the total from being zero, we remove the signs of the differences by squaring each of them

Measures of Dispersion
• The formulas for calculating the variance are as follows:
    Population Variance = $\sigma^2 = \frac{\sum (x - \bar{x})^2}{N}$
    Sample Variance = $s^2 = \frac{\sum (x - \bar{x})^2}{N - 1}$

Measures of Dispersion
• Example of Variance ($\sigma^2$ or $s^2$) (continued) – From the previous slide:
    Population Variance = $\sigma^2 = \frac{\sum (x - \bar{x})^2}{N} = \frac{2{,}119.6}{10} = 211.96$
    Sample Variance = $s^2 = \frac{\sum (x - \bar{x})^2}{N - 1} = \frac{2{,}119.6}{10 - 1} = \frac{2{,}119.6}{9} = 235.51$
  Note: The variance is difficult to interpret in relation to the data scores, but it is a very powerful statistic with many uses in statistical analyses, as will be discussed later

Measures of Dispersion
• Another example of Variance ($\sigma^2$ or $s^2$) – From the example in the text, page 99, comparing ages from two campuses

  Residential Campus
    Sample   Ages   Deviations (x - x-bar)   Deviations Squared (x - x-bar)^2
    1        18     18 - 19 = -1             (-1)^2 = 1
    2        19     19 - 19 = 0              (0)^2 = 0
    3        20     20 - 19 = 1              (1)^2 = 1
    4        18     18 - 19 = -1             (-1)^2 = 1
    5        20     20 - 19 = 1              (1)^2 = 1
    Totals   95     0                        4

  $\bar{x} = \frac{\sum x_i}{N} = \frac{95}{5} = 19$
  Sample Variance = $s^2 = \frac{\sum (x - \bar{x})^2}{N - 1} = \frac{4}{5 - 1} = \frac{4}{4} = 1$

Measures of Dispersion
• Example of Variance ($\sigma^2$ or $s^2$) – Continuing the example

  Urban Campus
    Sample   Ages   Deviations (x - x-bar)   Deviations Squared (x - x-bar)^2
    1        20     20 - 23 = -3             (-3)^2 = 9
    2        22     22 - 23 = -1             (-1)^2 = 1
    3        18     18 - 23 = -5             (-5)^2 = 25
    4        25     25 - 23 = +2             (2)^2 = 4
    5        30     30 - 23 = +7             (7)^2 = 49
    Totals   115    0                        88

  $\bar{x} = \frac{\sum x_i}{N} = \frac{115}{5} = 23$
  Sample Variance = $s^2 = \frac{\sum (x - \bar{x})^2}{N - 1} = \frac{88}{5 - 1} = \frac{88}{4} = 22$

Measures of Dispersion
• Example of Variance ($\sigma^2$ or $s^2$) – Continuing the example
  The ages of residential campus students clearly show less variation than those of urban campus students
    Residential: Mean = 19, Variance = 1
    Urban: Mean = 23, Variance = 22

Measures of Dispersion
• A good measure of dispersion needs to have the following characteristics:
  – Use all the scores in the distribution
    • Clearly the variance uses all the scores within the distribution
  – Describe the “average” or typical deviation of the scores
    • The variance does provide a type of average, although the result is difficult to interpret
  – Increase in value as the scores become more diverse and decrease in value as the scores become less diverse
    • The diversity of scores for the urban campus is greater than for the residential campus, and the variances increase with the amount of diversity present in the data
• Variance is a good measure of dispersion

Measures of Dispersion
• A better measure of dispersion is the standard deviation:
    Population Standard Deviation = $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$
    Sample Standard Deviation = $s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}$

Measures of Dispersion
• Looking back over the previous example
    Sample Standard Deviation (residential) = $s = \sqrt{s^2} = \sqrt{1} = 1$
    Sample Standard Deviation (urban) = $s = \sqrt{s^2} = \sqrt{22} = 4.69$
  – Since the squaring of the deviations from the mean is “undone” by taking the square root, the standard deviation can be thought of as an “average” deviation from the mean
  – It can be loosely stated that on average each score deviates from the mean by the standard deviation
  – This would mean that there is more than 4 times the variation in the urban campus than in the residential campus

Example of calculating standard deviation
Consider the following data on the number of pets owned by a group of nine children.
  1. Calculate the mean.
  2. Calculate each deviation: deviation = observation - mean
  [Figure: dotplot of Number of Pets, axis from 0 to 8, with $\bar{x}$ = 5 marked; example deviations: 1 - 5 = -4 and 8 - 5 = 3]

Example of calculating standard deviation
  3. Square each deviation.
  4. Find the “average” squared deviation. Calculate the sum of the squared deviations divided by (n - 1). This is called the variance.
  5. Calculate the square root of the variance. This is the standard deviation.

    x_i    (x_i - mean)    (x_i - mean)^2
    1      1 - 5 = -4      (-4)^2 = 16
    3      3 - 5 = -2      (-2)^2 = 4
    4      4 - 5 = -1      (-1)^2 = 1
    4      4 - 5 = -1      (-1)^2 = 1
    4      4 - 5 = -1      (-1)^2 = 1
    5      5 - 5 = 0       (0)^2 = 0
    7      7 - 5 = 2       (2)^2 = 4
    8      8 - 5 = 3       (3)^2 = 9
    9      9 - 5 = 4       (4)^2 = 16
           Sum = ?         Sum = ?

  “Average” squared deviation = 52/(9 - 1) = 6.5. This is the variance.
  Standard deviation = square root of variance = $\sqrt{6.5} = 2.55$

Question
• Given the data 1, 2, 3, 4, 5, what is the sample standard deviation?
  a) 6   b) 1.581   c) 4   d) 2.500

Measures of Dispersion
• Summary of measures of dispersion
  Range
  1. The range is easy to calculate and interpret
  2. The range is significantly affected by unusual or outlying scores
  3. The range is not very useful in later statistical analyses because it does not use all of the information available in the data (it uses only two scores, the max and min)
  Inner Quartile Range
  1. The inner quartile range is relatively easy to calculate and interpret
  2. The inner quartile range is not affected by unusual or outlying scores
  3. The inner quartile range is useful in identifying unusual scores (outliers) and uses the relative position of all scores
  Variance or Standard Deviation
  1. Both the variance and standard deviation are more calculation intensive
  2. Both the variance and standard deviation are affected by unusual or outlying scores
  3. The variance and standard deviation are very useful in later statistical analyses since they contain the information from all data (scores)
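The dispersion measures summarized above can be reproduced with a short Python sketch using the standard library. The three datasets are the ones from the slides (New York travel times, the ten-score variance example, and the number of pets). The helper names `five_number_summary` and `outliers_by_iqr` are hypothetical, and the quartile rule used here (Q1 and Q3 as medians of the lower and upper halves) is an assumption: it is one common convention that happens to match the slide values, and other software may use a slightly different rule.

```python
from statistics import median, pvariance, stdev, variance


def five_number_summary(data):
    """Min, Q1, median, Q3, max, taking Q1 and Q3 as medians of the halves."""
    ordered = sorted(data)
    n = len(ordered)
    lower = ordered[: n // 2]         # values below the median position
    upper = ordered[(n + 1) // 2:]    # values above the median position
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def outliers_by_iqr(data):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]


travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]
grades = [66, 75, 69, 72, 84, 90, 96, 70, 55, 45]
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

print(five_number_summary(travel_times))       # (5, 15.0, 22.5, 42.5, 85)
print(outliers_by_iqr(travel_times))           # [85]
print(max(grades) - min(grades))               # 51 (range)
print(round(pvariance(grades), 2))             # 211.96 (divides by N)
print(round(variance(grades), 2))              # 235.51 (divides by N - 1)
print(variance(pets), round(stdev(pets), 2))   # 6.5 2.55
```

`pvariance` divides by N and `variance` divides by N - 1, which mirrors the population and sample formulas on the slides.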