NUMERICAL DESCRIPTIVE MEASURES

Content
• Measures of Central Location
  – Mean, median, mode
• Relative Standing
  – Percentiles, box plots
• Measures of Variability
  – Range, variance, standard deviation
• Measures of Association
  – Covariance, coefficient of correlation

MEASURES OF CENTRAL LOCATION: MEAN
• The mean is defined as follows:
  $\text{Mean} = \dfrac{\text{Sum of the measurements}}{\text{Number of measurements}}$
• In the following, the sample mean and the population mean are discussed separately.
• Note the difference in notation: the sample mean is denoted by $\bar{x}$ and the population mean is denoted by $\mu$. The number of values in a sample is denoted by n and the number of values in the population is denoted by N.
• Mean of a data set:
  – If the data set is a sample: sample mean
  – If the data set is a population: population mean

MEASURES OF CENTRAL LOCATION: SAMPLE MEAN
• The sample mean is the sum of all the sample values divided by the number of sample values:
  $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
  where
  – $\bar{x}$ stands for the sample mean
  – n is the total number of values in the sample
  – $x_i$ is the value of the i-th observation
  – $\sum$ represents a summation
• Statistic: a measurable characteristic of a sample.
• A sample of five executives received the following bonuses last year: $14,000, $15,000, $17,000, $16,000, and $15,000. Find the average bonus for these five executives.
• Since these values represent a sample of size 5, the sample mean is (14,000 + 15,000 + 17,000 + 16,000 + 15,000)/5 = $15,400.

MEASURES OF CENTRAL LOCATION: POPULATION MEAN
• The population mean is the sum of all the population values divided by the number of population values:
  $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
  where
  – $\mu$ stands for the population mean
  – N is the total number of values in the population
  – $x_i$ is the value of the i-th observation
  – $\sum$ represents a summation
• Parameter: a measurable characteristic of a population.
• The Keller family owns four cars.
  The following are the mileages attained by each car: 56,000, 23,000, 42,000, and 73,000. Find the average mileage per car.
• The mean is (56,000 + 23,000 + 42,000 + 73,000)/4 = 48,500.

MEASURES OF CENTRAL LOCATION: PROPERTIES OF MEAN
• Data measured on an interval scale or a ratio scale usually have a mean.
• All the values are included in computing the mean.
• A set of data has a unique mean.
• The mean is affected by unusually large or small data values.
• The arithmetic mean is the only measure of central tendency where the sum of the deviations of each value from the mean is zero.
• Consider the set of values 3, 8, and 4. The mean is 5. Illustrating the fifth property, (3 - 5) + (8 - 5) + (4 - 5) = -2 + 3 - 1 = 0. In other words,
  $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$

MEASURES OF CENTRAL LOCATION: MEDIAN
• Median: the midpoint of the values after they have been ordered from the smallest to the largest, or from the largest to the smallest. There are as many values above the median as below it in the data array.
• For an even number of values, the median is the arithmetic average of the two middle values.
• The median is the most appropriate measure of central location when the data under consideration are ranked data rather than quantitative data. For example, if 13 universities are ranked according to reputation, the university ranked 7th has the median reputation.
• Compute the median for the following data.
• The ages of a sample of five college students are 21, 25, 19, 20, and 22.
• Arranging the data in ascending order gives 19, 20, 21, 22, 25. Thus the median is 21.
• The heights of four basketball players, in inches, are 76, 73, 80, and 75.
• Arranging the data in ascending order gives 73, 75, 76, 80. Thus the median is (75 + 76)/2 = 75.5.

MEASURES OF CENTRAL LOCATION: MODE
• The mode is the value of the observation that appears most frequently.
• The mode is most useful when an important aspect of describing the data involves determining the number of times each value occurs. If the data are qualitative (e.g., the number of graduates in accounting, finance, etc.), the mode is useful (e.g., the modal class is accounting).
• Example: The exam scores for ten students are 81, 93, 84, 75, 68, 87, 81, 75, 81, 87. Since the score of 81 occurs most often (three times), the modal score is 81.

MEASURES OF CENTRAL LOCATION: MEAN, MEDIAN, MODE
• Mean: affected by unusually large or small data values; may be used if the data are quantitative (ratio or interval scale).
• Median: most appropriate if the data are ranked (ordinal scale).
• Mode: most appropriate if the data are qualitative (nominal scale).
• Appropriate measures if the data are
  – quantitative: mean, median, mode
  – ranked: median, mode
  – qualitative: mode

MEASURES OF CENTRAL LOCATION: RELATIVE VALUES OF MEAN, MEDIAN, MODE
• Mode < Median < Mean if the distribution is positively skewed.
• Mode = Median = Mean if the distribution is symmetric.
• Mean < Median < Mode if the distribution is negatively skewed.

RELATIVE STANDING: PERCENTILES
• Percentiles divide the distribution into 100 groups.
• The p-th percentile is defined to be the numerical value such that, in an ordered data set, at most p% of the values are smaller than that value and at most (100 - p)% are larger than that value.
• For example, if the 78th percentile of GMAT scores is 600, then at most 78% of the scores are below 600 and at most 22% of the scores are above 600 (it is also true that at least 22% of the scores are 600 or above).
• Two questions:
  – Find the percentile of a given value
  – Find the value of a given percentile

RELATIVE STANDING: PERCENTILES: FIND PERCENTILE OF A GIVEN VALUE
• The percentile corresponding to a given value (X) is computed by using the formula:
  $\text{Percentile} = \dfrac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100\%$
• A teacher gives a 20-point test to 10 students.
• Scores are as follows: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10.
• Find the percentile rank of the score of 12.
• Ordered set of scores: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
• There are 6 values below 12: 2, 3, 5, 6, 8, 10.
• Percentile = [(6 + 0.5)/10](100%) = 65th percentile. The student did better than 65% of the class.

RELATIVE STANDING: PERCENTILES: FIND VALUE OF A GIVEN PERCENTILE
• Procedure: Let p be the percentile and n the sample size.
• Step 1: Arrange the data in ascending order.
• Step 2: Compute c = (n × p)/100.
• Step 3: If c is not a whole number, round up to the next whole number. If c is a whole number, use the value halfway between the c-th and (c+1)-th values.
• Step 4: The c-th value (or, when c was already a whole number, the halfway value from Step 3) is the required percentile.
• Example: Consider the data set 2, 3, 5, 6, 8, 10, 12, 15, 18, 20. Note: the data set is already ordered.
• Find the value of the 25th percentile.
• n = 10, p = 25, so c = (10 × 25)/100 = 2.5. Hence round up to c = 3. Thus, the value of the 25th percentile is the 3rd value, X = 5.
• Find the value of the 80th percentile.
• n = 10, p = 80, so c = (10 × 80)/100 = 8. Since c is a whole number, the value of the 80th percentile is the average of the 8th and 9th values. Thus, the 80th percentile for the data set is (15 + 18)/2 = 16.5.

RELATIVE STANDING: PERCENTILES: DECILES AND QUARTILES
• Deciles divide the data set into 10 groups.
• Deciles are denoted by D1, D2, …, D9, with the corresponding percentiles being P10, P20, …, P90.
• Quartiles divide the data set into 4 groups.
• Quartiles are denoted by Q1, Q2, and Q3, with the corresponding percentiles being P25, P50, and P75.
• The median is the same as P50 or Q2.

RELATIVE STANDING: PERCENTILES: INTERQUARTILE RANGE AND OUTLIERS
• An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
• The interquartile range is IQR = Q3 - Q1.
• To determine whether a data value can be considered an outlier:
• Step 1: Compute Q1 and Q3.
• Step 2: Find the IQR = Q3 - Q1.
• Step 3: Compute (1.5)(IQR).
• Step 4: Compute Q1 - (1.5)(IQR) and Q3 + (1.5)(IQR).
• Step 5: Compare the data value (say X) with Q1 - (1.5)(IQR) and Q3 + (1.5)(IQR).
• If X < Q1 - (1.5)(IQR) or if X > Q3 + (1.5)(IQR), then X is considered an outlier.
• Given the data set 5, 6, 12, 13, 15, 18, 22, 50, can the value of 50 be considered an outlier?
• Q1 = 9, Q3 = 20, IQR = 11. (Verify: with n = 8, Q1 is the average of the 2nd and 3rd values and Q3 is the average of the 6th and 7th values.)
• (1.5)(IQR) = (1.5)(11) = 16.5.
• 9 - 16.5 = -7.5 and 20 + 16.5 = 36.5.
• The value of 50 is outside the range -7.5 to 36.5, hence 50 is an outlier.

RELATIVE STANDING: BOX PLOTS
• When the data set contains a small number of values, a box plot is used to graphically represent the data set.
  These plots involve five values:
  – the minimum value (S)
  – the lower quartile (Q1)
  – the median (Q2)
  – the upper quartile (Q3)
  – the maximum value (L)

RELATIVE STANDING: BOX PLOTS: EXAMPLE
• Example: Construct a box plot with the following data, which show the assets of the 15 largest North American banks, rounded off to the nearest hundred million dollars: 111, 135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98.

RELATIVE STANDING: BOX PLOTS: RANKING AND SUMMARIZING

  Data   Rank
  217     1
  135     2
  111     3
  108     4
   98     5
   98     6
   93     7
   85     8
   75     9
   75    10
   65    11
   64    12
   57    13
   56    14
   51    15

• Smallest = 51
• Q1 = 64
• Median = 85
• Q3 = 108
• Largest = 217
• IQR = 44
• Outliers: 217

[Figure: box plot of the bank data; horizontal axis: Assets (in 100 million dollars), 0 to 250]

RELATIVE STANDING: BOX PLOTS: INTERPRETATION
• If the median is near the center of the box, the distribution is approximately symmetric.
• If the median falls to the left of the center of the box, the distribution is positively skewed.
• If the median falls to the right of the center of the box, the distribution is negatively skewed.
• If the lines are about the same length, the distribution is approximately symmetric.
• If the line segment to the right of the box is longer than the one to the left, the distribution is positively skewed.
• If the line segment to the left of the box is longer than the one to the right, the distribution is negatively skewed.

[Figure: symmetric box plot; horizontal axis: Number of units sold, 0 to 300]
[Figure: positively skewed box plot; horizontal axis: Number of units sold, 0 to 300]
[Figure: negatively skewed box plot; horizontal axis: Number of units sold, 0 to 300]
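
The percentile procedure (c = np/100), the five-value summary used for box plots, and the 1.5 × IQR outlier rule from the slides above can be checked with a short Python sketch. This is only an illustration under stated assumptions: the function names `percentile_value`, `five_number_summary`, and `outliers` are hypothetical, p is assumed to be a whole number between 1 and 99, and statistical libraries often use different interpolation rules, so their quartiles may not match the values computed here.

```python
def percentile_value(sorted_data, p):
    """Value of the p-th percentile via the slide procedure c = n*p/100.

    Assumes sorted_data is in ascending order and p is an integer in 1..99.
    """
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    n = len(sorted_data)
    # Integer arithmetic avoids floating-point issues when testing
    # whether c = n*p/100 is a whole number.
    c, remainder = divmod(n * p, 100)
    if remainder != 0:
        # c is not whole: round up and take that value (1-based position c+1).
        return sorted_data[c]
    # c is whole: average the c-th and (c+1)-th values (1-based positions).
    return (sorted_data[c - 1] + sorted_data[c]) / 2


def five_number_summary(data):
    """Minimum, Q1, median, Q3, and maximum, as used for a box plot."""
    s = sorted(data)
    return {
        "min": s[0],
        "q1": percentile_value(s, 25),
        "median": percentile_value(s, 50),
        "q3": percentile_value(s, 75),
        "max": s[-1],
    }


def outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    s = sorted(data)
    q1 = percentile_value(s, 25)
    q3 = percentile_value(s, 75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in s if x < low or x > high]


# Bank assets from the box plot example.
banks = [111, 135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98]
print(five_number_summary(banks))
# {'min': 51, 'q1': 64, 'median': 85, 'q3': 108, 'max': 217}
print(outliers(banks))  # [217]

# Data set from the outlier example.
print(outliers([5, 6, 12, 13, 15, 18, 22, 50]))  # [50]
```

Both data sets reproduce the values on the slides: Q1 = 64, median = 85, Q3 = 108 with 217 flagged for the bank data, and 50 flagged for the eight-value data set.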