Chapter III
Descriptive Statistics: Numerical Methods

Key Learning Objectives and Topics in this Chapter
• Measures of location: mean, median, mode, percentiles, quartiles
• Measures of dispersion (spread, variability): range, variance, standard deviation, coefficient of variation
• Measures of distribution shape, and of association between two variables

Important Note
In all cases: know the formulas, learn the computation procedures (i.e., apply the formulas), and understand the meaning (interpretation) of the measures computed. Use Excel. Practice! Practice! and Practice!

3.1. Introduction
When describing data, we usually focus our attention on two types of measures:
• Central location (e.g., average or mean)
• Variability or spread (e.g., variance, standard deviation)
Both measures can be computed for:
• a population
• a sample

3.2 Measures of Central Location
A center is a reference point. Thus a good measure of central location is expected to reflect the locations of all the actual points in the data. How?
• With one data point, clearly the central location is at the point itself.
• With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
• If a third data point appears on the left-hand side of the center, it should "pull" the central location to the left.

Measures of Location
• Mean
• Median
• Mode
• Percentiles
• Quartiles
If the measures are computed for data from a sample, they are called sample statistics. If the measures are computed for data from a population, they are called population parameters. A sample statistic is referred to as the point estimator of the corresponding population parameter.

i) The Arithmetic Mean (µ)
The mean is the most popular and useful measure of central location.

Mean = (Sum of the observations) / (Number of observations)

Sample mean:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
where the numerator is the sum of the values of the observations in the data and $n$ is the number of observations in the sample (sample size).

Population mean:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
where $N$ is the number of observations in the population (population size).

Example 1
The times (in hours) spent by 10 students on the Internet are as follows: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22. Based on this data, compute the mean (average) amount of time spent (per day) on the Internet.

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{0 + 7 + 12 + 5 + 33 + 14 + 8 + 0 + 9 + 22}{10} = \frac{110}{10} = 11$ hours

Based on this data, the average amount of time spent on the Internet by a typical student is 11 hours.

ii) The Median
The median of a set of observations is the value that falls in the middle when the data are arranged in order (ascending or descending). It is the value that divides the observations into two equal halves.

To find the median:
• Put the data in an array (in increasing or decreasing order) and count the total number of observations.
• If the total is an ODD number, the median is the middle value.
• If the total is an EVEN number, the median is the AVERAGE of the middle two values.

Example 2a
Find the median for the following observations.
0, 7, 12, 5, 14, 8, 0, 9, 22
Step 1: Arrange the data in increasing (or decreasing) order: 0, 0, 5, 7, 8, 9, 12, 14, 22
Step 2: Count the total number of observations in the data: 9, an odd number, so the median is the middle (5th) value.
Median = 8

Example 2b
Find the median for the following observations.
0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Step 1: Arrange the data in increasing (or decreasing) order: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33
Step 2: Count the total number of observations in the data: 10, an even number, so the median is the average of the middle two (5th and 6th) values.
Median = (8 + 9)/2 = 8.5

Note: The median of a data set with an odd number of observations (8 in Example 2a) is a member of the data values. However, the median of a data set with an even number of observations (8.5 in Example 2b) is not necessarily a member of the set of values.

What is special about the median? Unlike the mean, the median is not sensitive to extreme values: it depends on the middle position(s) of the ordered data, not on the magnitude of every observation. For instance, if the largest value in Example 2b were 330 instead of 33, the median would still be 8.5, while the mean would rise from 11 to 40.7.

iii) The Mode
The mode is the value that occurs most frequently in the data, that is, the value with the highest frequency. In any data set there is only one value for the mean and one for the median. However, a data set may have more than one mode.

[Figure: histograms of an income distribution, one with a single modal class and one with two modal classes.]

Example 3
What is the mode for the following data?
0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution: All observations except "0" occur once. There are two "0" values. Thus, the mode is zero. But is this value a good indicator of the central location of this data? The value "0" does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).

Comparing Measures of Central Tendency: Mean, Median, Mode
• If mean = median = mode, the shape of the distribution is symmetric.
• If mode < median < mean, the distribution trails to the right and is positively skewed.
• If mode > median > mean, the distribution trails to the left and is negatively skewed.
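
The three measures of central location and the comparison rule above can be checked with a short Python sketch. It uses the hours data from Examples 1, 2b, and 3 and the standard-library `statistics` module; the helper `describe_shape` is a hypothetical name introduced here only to encode the rule as stated in the slides, which is a rough rule of thumb for a sample this small rather than a formal test of skewness.

```python
import statistics

# Hours spent on the Internet by 10 students (data from Examples 1, 2b, and 3).
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

mean = statistics.mean(hours)        # sum of the observations / number of observations
median = statistics.median(hours)    # average of the two middle values when n is even
modes = statistics.multimode(hours)  # list of every value with the highest frequency


def describe_shape(mean, median, mode):
    """Hypothetical helper: apply the mean/median/mode comparison rule."""
    if mean == median == mode:
        return "symmetric"
    if mode < median < mean:
        return "positively skewed"
    if mode > median > mean:
        return "negatively skewed"
    return "no clear pattern"


# The data has a single mode, so the first entry of the list is used.
shape = describe_shape(mean, median, modes[0])
print(mean, median, modes, shape)
```

For this data the sketch reproduces the values worked out by hand (mean 11, median 8.5, mode 0), and since 0 < 8.5 < 11 the rule classifies the distribution as positively skewed.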

[Figure: two distributions, each marking the mode, median, and mean: a positively skewed distribution ("skewed to the right") and a negatively skewed distribution ("skewed to the left").]

Percentiles
A percentile provides information about the relative location of a value and about how the data are spread between the smallest and the largest value. A percentile tells us the proportion of observations that lie below or above a certain value in the data.
Example: Admission test scores for colleges and universities are frequently reported in terms of percentiles.

The $p$th percentile of a data set is a value such that at least $p$ percent of the items take on this value or less and at least $(100 - p)$ percent of the items take on this value or more.

Computing Percentiles
Step 1: Arrange the data in ascending order.
Step 2: Compute the position $i$ of the $p$th percentile:
$i = \frac{p}{100} \times n$
• If $i$ is not an integer, round up. The $p$th percentile is the value in the $i$th position.
• If $i$ is an integer, the $p$th percentile is the average of the values in positions $i$ and $i + 1$.

Example: Compute the 75th percentile of the following data ($n = 70$ values, in ascending order reading down each column).

425  440  450  465  480  510  575
430  440  450  470  485  515  575
430  440  450  470  490  525  580
435  445  450  472  490  525  590
435  445  450  475  490  525  600
435  445  460  475  500  535  600
435  445  460  475  500  549  600
435  445  460  480  500  550  600
440  450  465  480  500  570  615
440  450  465  480  510  570  615

$i = \frac{p}{100} \times n = \frac{75}{100} \times 70 = 52.5$
Since $i$ is not an integer, round up to 53. The 53rd data value is 525.
The 75th percentile = 525

Example: Compute the 50th percentile of the same data.

$i = \frac{p}{100} \times n = \frac{50}{100} \times 70 = 35$
Since $i$ is an integer, average the 35th and 36th data values.
The 50th percentile = (475 + 475)/2 = 475

Quartiles
Quartiles are specific percentiles.
• First Quartile = 25th Percentile
• Second Quartile = 50th Percentile = the Median
• Third Quartile = 75th Percentile

Quartiles divide a data set into four equal parts:
$Q_1 = \frac{N + 1}{4}; \quad Q_2 = \frac{2(N + 1)}{4}; \quad Q_3 = \frac{3(N + 1)}{4}$
where $Q_i$ is the location of the $i$th quartile.

3.3 Measures of Variability
Measures of central location fail to tell the whole story about the distribution. A question that remains unanswered even after obtaining measures of central location is: how spread out are the observations around the central (say, mean) value?
• Variability is important in business decisions, as it indicates the level of risk.
• For example, in choosing between two suppliers A and B, we might consider not only the average delivery time for each, but also the variability in delivery time for each.

Measures of Variability
• Range
• Inter-Quartile Range
• Variance
• Standard Deviation
• Coefficient of Variation

i) The Range
The range of a set of observations is the difference between the largest and the smallest observations.
• Range = largest value – smallest value
• Its major advantage is the ease with which it can be computed.
• Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
• It is also very sensitive to the smallest and largest data values.

ii) Inter-Quartile Range
• This is a measure of the spread of the middle 50% of the observations.
• Inter-quartile range = Q3 – Q1
• A large value indicates a large spread of the observations.
• It is not sensitive to extreme data values.

iii) The Variance
• The variance is the average of the squared differences between each data value and the measure of central location (the mean).
• The variance is a measure of variability that utilizes all the data.
• It is calculated differently for a population and for a sample.

Variance of a population:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Variance of a sample:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Example: Computing the Variance Based on Sample Data
Find the variance of the following sample observations: 9, 11, 8, 12

Step 1: Find the mean.
$\bar{x} = \frac{9 + 11 + 8 + 12}{4} = \frac{40}{4} = 10$
Step 2: Compute the deviations from the mean.
9 − 10 = −1;  11 − 10 = +1;  8 − 10 = −2;  12 − 10 = +2
Step 3: Square the deviations, add them together, and divide the sum of the squared deviations by $n - 1$.
$s^2 = \frac{\sum_{i=1}^{4} (x_i - \bar{x})^2}{n - 1} = \frac{(-1)^2 + 1^2 + (-2)^2 + 2^2}{4 - 1} = \frac{10}{3} \approx 3.33$

Why square the differences? Because the sum of the deviations from the mean is always zero.
Why divide by $n - 1$ instead of $n$? Because it gives a better approximation of the population variance.

iv) Standard Deviation
The standard deviation of a set of observations is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Why Standard Deviation?
• The standard deviation is reported in the actual unit of measure in which the data is recorded. Thus it can be used to compare the variability of several distributions that are measured in the same units.
• It can also be used to make a statement about the general shape of a distribution (kurtosis).

Computing the Standard Deviation
For the sample observations 9, 11, 8, 12:
Step 1: Find the mean: $\bar{x} = \frac{40}{4} = 10$
Step 2: Compute the deviations from the mean: −1, +1, −2, +2
Step 3: Square the deviations, add them together, and divide the sum of the squared deviations by $n - 1$: $s^2 = \frac{10}{3} \approx 3.33$
Step 4: Take the square root of the variance: $s = \sqrt{s^2} = \sqrt{10/3} \approx 1.83$

v) Coefficient of Variation
The coefficient of variation is a measure of how large the standard deviation is relative to the mean.
The coefficient of variation is computed as follows:
$CV = \frac{s}{\bar{x}} \times 100\%$ for a sample
$CV = \frac{\sigma}{\mu} \times 100\%$ for a population

Why Coefficient of Variation?
Example: Is a standard deviation of 10 large? A standard deviation of 10 may be perceived as large when the mean value is 100, but it is only moderately large if the mean value is 500.
The coefficient of variation can be used to compare variability in data sets that are measured in different units.

Variance, Standard Deviation, and Coefficient of Variation
(for the sample rent data on 70 efficiency apartments given below)
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$
Standard deviation: $s = \sqrt{s^2} = \sqrt{2996.16} \approx 54.74$
Coefficient of variation: $\frac{s}{\bar{x}} \times 100\% = \frac{54.74}{490.80} \times 100\% \approx 11.15\%$
The standard deviation is about 11% of the mean.

Exercise
Compute the mean, median, mode, range, variance, standard deviation, and coefficient of variation for the income data (in $1000) from the following cities.

City               Income
Akron, OH            74.1
Atlanta, GA          82.4
Birmingham, AL       71.2
Cleveland, OH        62.3
Columbia, SC         79.9
Danbury, CT          66.8
Denver, CO          132.3
Detroit, MI          83.4
Lancaster, PA       100.0
Madison, WI          77.0
Minneapolis, MN      67.8

Exercise
Compute every measure of central location and variability you have learned in this chapter for the following sample rent data on 70 efficiency apartments (in ascending order reading down each column).

425  440  450  465  480  510  575
430  440  450  470  485  515  575
430  440  450  470  490  525  580
435  445  450  472  490  525  590
435  445  450  475  490  525  600
435  445  460  475  500  535  600
435  445  460  475  500  549  600
435  445  460  480  500  550  600
440  450  465  480  500  570  615
440  450  465  480  510  570  615
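
One way to check hand computations for the rent data is the following minimal Python sketch. It assumes the percentile rule stated earlier in the chapter ($i = (p/100) \times n$, round up if $i$ is not an integer, otherwise average positions $i$ and $i + 1$); the function name `percentile` and the variable names are hypothetical, chosen only for this illustration. The standard-library `statistics.quantiles` function is deliberately not used, because it interpolates between values and may not match the rule in these slides.

```python
import statistics

# Sample rent data on the 70 efficiency apartments (values copied from the exercise).
rents = [
    425, 440, 450, 465, 480, 510, 575,
    430, 440, 450, 470, 485, 515, 575,
    430, 440, 450, 470, 490, 525, 580,
    435, 445, 450, 472, 490, 525, 590,
    435, 445, 450, 475, 490, 525, 600,
    435, 445, 460, 475, 500, 535, 600,
    435, 445, 460, 475, 500, 549, 600,
    435, 445, 460, 480, 500, 550, 600,
    440, 450, 465, 480, 500, 570, 615,
    440, 450, 465, 480, 510, 570, 615,
]


def percentile(data, p):
    """p-th percentile by the chapter's rule, for an integer p with 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    values = sorted(data)
    # i = (p / 100) * n, split into its whole part q and remainder r.
    q, r = divmod(p * len(values), 100)
    if r:
        # i is not an integer: round up to position q + 1 (list index q).
        return values[q]
    # i is an integer: average the values in positions i and i + 1.
    return (values[q - 1] + values[q]) / 2


mean = statistics.mean(rents)
median = percentile(rents, 50)             # second quartile
mode = statistics.mode(rents)
q1, q3 = percentile(rents, 25), percentile(rents, 75)
data_range = max(rents) - min(rents)       # largest value - smallest value
iqr = q3 - q1                              # spread of the middle 50%
variance = statistics.variance(rents)      # sample variance, divides by n - 1
std_dev = statistics.stdev(rents)          # square root of the sample variance
cv = std_dev / mean * 100                  # coefficient of variation, in percent

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"Q1={q1} Q3={q3} range={data_range} IQR={iqr}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f} CV={cv:.2f}%")
```

Swapping the `rents` list for the city income values gives the corresponding measures for the first exercise. For a population rather than a sample, `statistics.pvariance` and `statistics.pstdev` divide by $N$ instead of $n - 1$.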