Chapter 3 – Descriptive Statistics: Numerical Measures

Chapter Outline
Measures of Central Location
• Mean
• Median
• Mode
• Percentile (Quartile, Quintile, etc.)
Measures of Variability
• Range
• Variance (Standard Deviation, Coefficient of Variation)

Recall
• A sample is a subset of a population.
• Numerical measures calculated for sample data are called sample statistics.
• Numerical measures calculated for population data are called population parameters.
• A sample statistic is referred to as the point estimator of the corresponding population parameter.

Mean
As a measure of central location, the mean is simply the arithmetic average of all the data values. The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.

Sample Mean
$\bar{x} = \frac{\sum x_i}{n}$
• The symbol $\sum$ (called sigma) means 'sum up'.
• $x_i$ is the value of the ith observation in the sample.
• $n$ is the number of observations in the sample.

Population Mean
$\mu = \frac{\sum x_i}{N}$
• The symbol $\sum$ (called sigma) means 'sum up'.
• $x_i$ is the value of the ith observation in the population.
• $N$ is the number of observations in the population.
• $\mu$ is pronounced 'mu'.

Sample Mean Example: Sales of Starbucks Stores
Fifty Starbucks stores are randomly chosen in New York City. The table below shows the sales of those stores in December 2012.

 95  108   67   99   93   77   86   93  119  118
 77  120   97  103   97   88   89  105  104  105
 97   78   97  106   95  100   97   87   93   82
 99   79   79   82   61  109   93   82  104   73
 89   88   93   93  109   90   88   98  101  101

$\bar{x} = \frac{\sum x_i}{n} \approx \frac{4{,}685}{50} \approx 93.69$

The values shown in the table sum to 4,683, so the totals above presumably come from unrounded sales figures.

Median
The median of a data set is the value in the middle when the data items are arranged in ascending order. Whenever a data set has extreme values, the median is the preferred measure of central location. The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean, since the calculation of the mean uses all the data items.

Median: Odd Number of Observations
For an odd number of observations, the median is the middle value.
Seven observations: 26 18 27 12 14 27 19
In ascending order: 12 14 18 19 26 27 27
Median = 19

Median: Even Number of Observations
For an even number of observations, the median is the average of the middle two values.
Eight observations: 26 18 27 12 14 27 19 30
In ascending order: 12 14 18 19 26 27 27 30
Median = (19 + 26)/2 = 22.5

Mean vs. Median
As noted, extreme values can change the mean remarkably, while the median may be affected very little. In that regard, the median is a better representative of central location.
12 14 18 19 26 27 27 30 280
For the previous example, the median is 22.5 and the mean is 21.6. If we add one large number (280) to the data, the median becomes 26 (the value in the middle), but the mean becomes 50.3. In this case we prefer the median to the mean as a measure of central location.

Mode
The mode of a data set is the value that occurs most frequently. The greatest frequency can occur at two or more different values.
• If the data have exactly two modes, the data are bimodal.
• If the data have more than two modes, the data are multimodal.
Caution: If the data are bimodal or multimodal, Excel's MODE function will incorrectly identify a single mode.

Mode Example
12 14 18 19 26 27 27 30
In the example above, 27 shows up twice while all the other data values show up once. So, the mode is 27.

Percentiles
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. Admission test scores for colleges and universities are frequently reported in terms of percentiles. The pth percentile of a data set is a value such that at least p percent of the items are less than or equal to this value and at least (100 - p) percent of the items are greater than or equal to this value.
The 50th percentile is simply the median.

Computing Percentiles
• Arrange the data in ascending order.
• Compute the index i, the position of the pth percentile: $i = (p/100)\,n$
• If i is not an integer, round up. The pth percentile is the value in the ith position.
• If i is an integer, the pth percentile is the average of the values in positions i and i + 1.

Percentiles Example: 75th Percentile
Find the 75th percentile of the following data:
12 14 18 19 26 27 29 30
Note: The data are already in ascending order.
i = (p/100)n = (75/100)8 = 6
Since i is an integer, average the 6th and 7th data values: 75th percentile = (27 + 29)/2 = 28

Percentiles Example: 20th Percentile
Find the 20th percentile of the following data:
12 14 18 19 26 27 29 30
Note: The data are already in ascending order.
i = (p/100)n = (20/100)8 = 1.6, which is rounded up to 2.
So, the 20th percentile is simply the 2nd data value, i.e. 14.

Quartiles
Quartiles are specific percentiles.
• First Quartile = 25th percentile
• Second Quartile = 50th percentile = Median
• Third Quartile = 75th percentile

Measures of Variability
It is often desirable to consider measures of variability (dispersion) as well as measures of central location.
For example, suppose two stocks provide the same average return of 5% a year, but stock A's return is very stable (close to 5%) while stock B's return is volatile (it could be as low as –10%). Are you indifferent as to which stock to invest in?
As another example, in choosing between supplier A and supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.

Measures of Variability
• Range
• Interquartile Range
• Variance/Standard Deviation
• Coefficient of Variation

Range
The range of a data set is the difference between the largest and smallest data values. It is the simplest measure of variability. It is very sensitive to the smallest and largest data values.
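The percentile procedure described earlier (sort the data, compute i = (p/100)n, round up or average two neighbours) can be sketched in a few lines of Python. This is only an illustration of that index rule; the function name `percentile` and the input checks are assumptions made here, and statistical packages often use other interpolation rules that give slightly different answers.

```python
import math


def percentile(values, p):
    """Return the pth percentile (0 < p < 100) using the index rule i = (p/100) * n."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    data = sorted(values)          # arrange the data in ascending order
    n = len(data)
    i = p * n / 100                # index of the pth percentile
    if i.is_integer():
        # i is an integer: average the values in positions i and i + 1 (1-based)
        k = int(i)
        return (data[k - 1] + data[k]) / 2
    # i is not an integer: round up and take the value in that position (1-based)
    return data[math.ceil(i) - 1]


data = [12, 14, 18, 19, 26, 27, 29, 30]
p20 = percentile(data, 20)
q1 = percentile(data, 25)          # first quartile
q3 = percentile(data, 75)          # third quartile
median = percentile([12, 14, 18, 19, 26, 27, 27, 30], 50)

print(p20)      # 14
print(q1, q3)   # 16.0 28.0
print(median)   # 22.5
```

The 20th and 75th percentiles match the worked examples (14 and 28), and the 50th percentile of the earlier eight-value data set reproduces the median of 22.5.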
Range Example
12 14 18 19 26 27 29 30
Range = largest value - smallest value = 30 - 12 = 18

Interquartile Range
The interquartile range of a data set is the difference between the 3rd quartile and the 1st quartile. It is the range of the middle 50% of the data. It overcomes the sensitivity to extreme data values.

Interquartile Range Example
12 14 18 19 26 27 29 30
3rd Quartile (Q3) = 75th percentile = 28
1st Quartile (Q1) = 25th percentile = 16
Interquartile Range = Q3 - Q1 = 28 - 16 = 12

Variance
The variance is a measure of variability that utilizes all the data. It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population). The variance is useful in comparing the variability of two or more variables.

The variance is the average of the squared differences between each data value and the mean. It is calculated as follows:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ for a sample
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ for a population

Standard Deviation
The standard deviation of a data set is the positive square root of the variance. It is measured in the same units as the data, making it easier to interpret than the variance.

The standard deviation is computed as follows:
$s = \sqrt{s^2}$ for a sample
$\sigma = \sqrt{\sigma^2}$ for a population

Variance and Standard Deviation Example
12 14 18 19 26 27 29 30
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 48.98$
Standard Deviation: $s = \sqrt{s^2} = \sqrt{48.98} \approx 7$

Coefficient of Variation
The coefficient of variation indicates how large the standard deviation is in relation to the mean. When comparing two data sets that have different units, or the same units but a significant difference in magnitude, the coefficient of variation should be used instead of the variance.
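As a quick check of the range, variance and standard deviation examples above, the following minimal Python sketch applies the sample (n - 1) and population (N) formulas directly. The function names are chosen here for illustration only; the standard library's `statistics.variance` is used as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


data = [12, 14, 18, 19, 26, 27, 29, 30]

data_range = max(data) - min(data)   # largest value minus smallest value
s2 = sample_variance(data)
s = math.sqrt(s2)                    # standard deviation = positive square root

print(data_range)     # 18
print(round(s2, 2))   # 48.98
print(round(s, 2))    # 7.0

# The hand-written sample formula agrees with the standard library.
assert math.isclose(s2, statistics.variance(data))
```

Dividing by N instead of n - 1 gives a smaller value (about 42.86 for these data), which is why it matters whether the data are treated as a sample or as a whole population.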
The coefficient of variation is computed as follows:
$\left(\frac{s}{\bar{x}} \times 100\right)\%$ for a sample
$\left(\frac{\sigma}{\mu} \times 100\right)\%$ for a population

Coefficient of Variation Example
12 14 18 19 26 27 29 30
$\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{7}{21.875} \times 100\right)\% \approx 32\%$

Coefficient of Variation Example: Height vs. Weight
In a class of 30 students, the average height is 5'5'' with a standard deviation of 3'', and the average weight is 120 lbs with a standard deviation of 20 lbs. Question: in which measure (height or weight) do the students differ more?
Since height and weight don't have the same unit, we have to use the coefficient of variation to remove the units before comparing the variation in height and weight. As shown below, the students' weight varies more than their height.
$\frac{s_{height}}{\bar{x}_{height}} \times 100\% = \frac{3''}{5'5''} \times 100\% = \frac{3}{65} \times 100\% \approx 4.6\%$
$\frac{s_{weight}}{\bar{x}_{weight}} \times 100\% = \frac{20}{120} \times 100\% \approx 16.7\%$

Measures of Distribution Shape, Relative Location, and Detecting Outliers
• Distribution Shape
• z-Scores
• Chebyshev's Theorem
• Empirical Rule
• Detecting Outliers

Distribution Shape: Skewness
An important measure of the shape of a distribution is called skewness. The formula for the skewness of sample data is
$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$
Skewness can be easily computed using statistical software.

Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
[Figure: relative frequency histogram, Skewness = 0]

Skewed to the left
• Skewness is negative.
• Mean is usually less than the median.
[Figure: relative frequency histogram, Skewness = -.33]

Skewed to the right
• Skewness is positive.
• Mean is usually more than the median.
[Figure: relative frequency histogram, Skewness = .31]

Z-Scores
The z-score is often called the standardized value. It denotes the number of standard deviations a data value $x_i$ is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
Excel's STANDARDIZE function can be used to compute the z-score.
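The coefficient of variation, the sample skewness formula and the z-score definition above can all be written as short Python functions. This is a minimal sketch for the eight-value data set used throughout; the helper names are assumptions made here, and the exact skewness formula used by a given statistical package may differ from the one shown.

```python
import math


def sample_mean(values):
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = sample_mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return sample_std(values) / sample_mean(values) * 100


def skewness(values):
    """n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations."""
    n = len(values)
    m, s = sample_mean(values), sample_std(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    m, s = sample_mean(values), sample_std(values)
    return [(x - m) / s for x in values]


data = [12, 14, 18, 19, 26, 27, 29, 30]
z = z_scores(data)

print(round(coefficient_of_variation(data)))   # 32
print(round(z[0], 2), round(z[-1], 2))         # -1.41 1.16
print(skewness(data) < 0)                      # True
```

For these data the skewness comes out slightly negative, and a perfectly symmetric data set such as 1, 2, 3, 4, 5 gives a skewness of zero up to floating-point error.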
An observation's z-score is a measure of the relative location of the observation in a data set.
• A data value less than the sample mean has a negative z-score.
• A data value greater than the sample mean has a positive z-score.
• A data value equal to the sample mean has a z-score of zero.

Z-Scores Example
12 14 18 19 26 27 29 30
$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{12 - 21.875}{7} = -1.41$
$z_8 = \frac{x_8 - \bar{x}}{s} = \frac{30 - 21.875}{7} = 1.16$

Chebyshev's Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, i.e. between $(\bar{x} - zs)$ and $(\bar{x} + zs)$, where z is any value greater than 1.
Chebyshev's theorem requires z > 1, but z need not be an integer.
• At least 55.6% of the data values must be within z = 1.5 standard deviations of the mean.
• At least 89% of the data values must be within z = 3 standard deviations of the mean.
• At least 94% of the data values must be within z = 4 standard deviations of the mean.

Chebyshev's Theorem Example
Given that $\bar{x}$ = 10 and s = 2, at least what percentage of all the data values falls within 2 standard deviations of the mean?
At least $(1 - 1/2^2) = 1 - 1/4 = 75\%$ of all the data values must be between 6 and 14.
$\bar{x} - zs = 10 - 2(2) = 6$
$\bar{x} + zs = 10 + 2(2) = 14$

Empirical Rule
When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the approximate percentage of data values that fall within a specified number of standard deviations of the mean. The empirical rule is based on the normal distribution, which is covered in Chapter 6.
For data having a bell-shaped distribution:
• About 68% of values of a normal random variable are between $\mu - \sigma$ and $\mu + \sigma$.
• About 95% of values of a normal random variable are between $\mu - 2\sigma$ and $\mu + 2\sigma$.
• About 99.7% of values of a normal random variable are between $\mu - 3\sigma$ and $\mu + 3\sigma$.
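Chebyshev's bound stated earlier is simple enough to evaluate directly. The sketch below, with function names chosen here for illustration, computes the guaranteed minimum share of data within z standard deviations and the matching interval, and compares the bound with the share actually observed in the small example data set.

```python
import math


def chebyshev_lower_bound(z):
    """Minimum share of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


def chebyshev_interval(mean, std, z):
    """Interval from mean - z*std to mean + z*std."""
    return mean - z * std, mean + z * std


def share_within(values, z):
    """Observed share of values within z sample standard deviations of the mean."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum(1 for x in values if abs(x - mean) <= z * std) / n


for z in (1.5, 2, 3, 4):
    print(z, round(100 * chebyshev_lower_bound(z), 1))

print(chebyshev_interval(10, 2, 2))   # (6, 14)

data = [12, 14, 18, 19, 26, 27, 29, 30]
print(share_within(data, 1.5) >= chebyshev_lower_bound(1.5))   # True
```

Chebyshev's theorem holds for any data set, so the bound is conservative; for bell-shaped data the empirical rule gives much tighter approximate percentages.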
[Figure: Empirical Rule. Bell-shaped curve over x showing about 68% of values within $\mu \pm 1\sigma$, about 95% within $\mu \pm 2\sigma$, and about 99.7% within $\mu \pm 3\sigma$]

Detecting Outliers
An outlier is an unusually small or unusually large value in a data set. A data value with a z-score less than –3 or greater than +3 might be considered an outlier. It might be:
• An incorrectly recorded data value.
• A data value that was incorrectly included in the data set.
• A correctly recorded data value that belongs in the data set.

Measures of Association Between Two Variables
So far, we have examined numerical methods used to summarize the data for one variable at a time. Often a manager or decision maker is interested in the relationship between two variables. Two numerical measures of the relationship between two variables are the covariance and the correlation coefficient.

Covariance
The covariance is a measure of the linear association between two variables.
• Positive values indicate a positive relationship.
• Negative values indicate a negative relationship.

The covariance is computed as follows:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ for samples
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$ for populations

Correlation Coefficient
Correlation is a measure of linear association and not necessarily causation. Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

The correlation coefficient is computed as follows:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$ for samples
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ for populations

The correlation can take on values between –1 and +1.
• Values near –1 indicate a strong negative linear relationship.
• Values near +1 indicate a strong positive linear relationship.
• The closer the correlation is to zero, the weaker the relationship.

Covariance and Correlation Coefficient Example: Stock Returns
The table below presents the monthly returns (in percent) of the market index S&P 500 (SPY) and Apple stock (AAPL) from December 2012 to May 2013.
Date      SPY     AAPL
Dec-12    2.17   -6.76
Jan-13    0.55   -1.11
Feb-13    1.62    0.12
Mar-13    0.83    0.01
Apr-13    1.01    0.96
May-13    0.24    0.10

Covariance and Correlation Coefficient Example: Stock Returns
Here x is the SPY return and y is the AAPL return.

Date        x       y      x_i - x̄   y_i - ȳ   (x_i - x̄)(y_i - ȳ)
Dec-12      2.17   -6.76     1.10     -5.65     -6.21
Jan-13      0.55   -1.11    -0.52      0.00      0.00
Feb-13      1.62    0.12     0.55      1.24      0.68
Mar-13      0.83    0.01    -0.24      1.12     -0.27
Apr-13      1.01    0.96    -0.06      2.08     -0.12
May-13      0.24    0.10    -0.83      1.21     -1.00
Average     1.07   -1.11
Std. Dev.   0.71    2.84
Total                                           -6.92

• Sample Covariance
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-6.92}{6 - 1} = -1.38$
• Sample Correlation Coefficient
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-1.38}{0.71 \times 2.84} = -0.68$
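The stock-return calculation above can be reproduced with a short Python sketch that applies the sample covariance and correlation formulas to the six monthly returns. The function names are chosen here for illustration. Because the code starts from the two-decimal returns in the table, its results may differ from the hand calculation in the last digit.

```python
import math


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Monthly returns in percent, Dec-12 to May-13, as listed in the table.
spy = [2.17, 0.55, 1.62, 0.83, 1.01, 0.24]
aapl = [-6.76, -1.11, 0.12, 0.01, 0.96, 0.10]

cov = sample_covariance(spy, aapl)
corr = sample_correlation(spy, aapl)

print(round(sample_std(spy), 2), round(sample_std(aapl), 2))   # 0.71 2.84
print(cov, corr)
```

Both results are negative, which says that over these six months AAPL tended to move against the index. With only six observations this is a description of the sample rather than evidence of a lasting relationship.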