Measures of Central Tendency and Spread
Chapter 1, Section 2

The Motivation
• Measures of central tendency are used to describe the typical member of a population.
• Depending on the type of data, “typical” could have a variety of “best” meanings.
• We will discuss three of these possible choices.

3 Measures of Central Tendency
• Mean: the arithmetic average. This is used for continuous data.
• Median: a value that splits the data into two halves, that is, one half of the data is smaller than that number and the other half is larger. May be used for continuous or ordinal data.
• Mode: the category that has the most data. As the description implies, it is used for categorical data.

Mean
• To find the mean, add all of the values, then divide by the number of values.
• The lowercase Greek letter mu is used for the population mean: $\mu = \frac{\sum x}{N}$
• An “x” with a bar over it, read “x-bar”, is used for the sample mean: $\bar{x} = \frac{\sum x}{n}$

Mean Example
Listing:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
X:        14  17  31  28  42  43  51  51  66  70  67  70  78  62  47
n = 15, total = 737
x-bar = 737 / 15 ≈ 49.13

Median
• The median is a number chosen so that half of the values in the data set are smaller than that number, and the other half are larger.
• To find the median:
  – List the numbers in ascending order.
  – If there is a number in the middle (odd number of values), that is the median.
  – If there is not a middle number (even number of values), take the two in the middle; their average is the median.

Median Example
Listing:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
X:        14  17  28  31  42  43  47  51  51  62  66  67  70  70  78
With 15 values, the middle (8th) value, 51, is the median.

Listing:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
X:        14  17  28  31  42  43  47  51  53  57  62  66  67  70  70  78
With 16 values, the median is the average of the 8th and 9th values: (51 + 53) / 2 = 52.

Mode
• The mode is simply the category or value which occurs the most in a data set.
• If a category has radically more than the others, it is a mode.
• Generally speaking, we do not consider more than two modes in a data set.
• No clear guideline exists for deciding how many more entries a category must have than the others to constitute a mode.

Obvious Example
[Figure: bar chart “Beach Ball Production” (thousands) for blue, red, and yellow]
• There is obviously more yellow than red or blue.
• Yellow is the mode.
• The mode is the class, not the frequency.

Bimodal
[Figure: bar chart “Geometry Scores For TASP” with categories very bad, bad, neutral, good, very good]

No Mode
Category   Frequency
1          51
2          51
3          66
4          62
5          65
6          57
7          47
8          43
9          64
[Figure: bar chart of frequency by category, 1 through 9]
• Although the third category is the largest, it is not sufficiently different to be called the mode.

Quartiles, Percentiles and Other Fractiles
• We will only consider the quartile, but the same concept is often extended to percentages or other fractions.
• The median is a good starting point for finding the quartiles.
• Recall that to find the median, we wanted to locate a point so that half of the data was smaller, and the other half larger, than that point.

Quartile
• For quartiles, we want to divide our data into 4 equal pieces.
Suppose we had the following data set (already in order):
2 3 7 8 8 8 9 13 17 20 21 21
Choosing the numbers 7.5, 8.5, and 18.5 as markers would divide the data into 4 groups, each with three elements. These numbers would be the three quartiles for this data set.

Quartiles Continued
• Conceptually, this is easy: simply find the median, then treat the left-hand side as if it were a data set and find its median; then do the same to the right-hand side.
• This is not always simple. Consider the following data set:
  3 3 3 3 3 5 6 8 8 8 8 8 9
• The first difficulty is that the data set does not divide nicely.
• Using the rules for finding a median, we would get quartiles of 3, 6 and 8.
• The second difficulty is how many of the 3’s are in the first quartile, and how many in the second?

Quartiles Continued
• For this course, let’s pretend that this is not an issue.
• I will give you the quartiles.
• I will not ask how many are in a quartile.

Stem and Leaf Plot
• Take a data set.
• Put the data in order from least to greatest.
• Make a list of the stems, inclusive of all from least to greatest.
• Fill in the leaves.
• Make a key.

Stem and Leaf
• Data: 11, 13, 16, 16, 16, 19, 20, 41, 43
• Stems: 1, 2, 3, and 4, even though there is no data in the 30s.
• Compare data: 13, 15, 22, 23, 25, 27, 31, 33, 34, 46

Compare data | Stem | Data
         5 3 |  1   | 1 3 6 6 6 9
     7 5 3 2 |  2   | 0
       4 3 1 |  3   |
           6 |  4   | 1 3
Key: 1|3 = 13

You try:
Data set 1: 3, 5, 18, 19, 19, 20, 30, 56, 57, 58, 58
Data set 2: 17, 17, 18, 18, 18, 19, 29, 29, 29, 59
Write two things you notice from this stem and leaf plot.

Answer
 Data set 2 | Stem | Data set 1
            |  0   | 3 5
9 8 8 8 7 7 |  1   | 8 9 9
      9 9 9 |  2   | 0
            |  3   | 0
            |  4   |
          9 |  5   | 6 7 8 8
Key: 0|3 = 3

Note about stems and leaves
• Given data: 465, 466, 470, 489, …
  46 | 5 6
  47 | 0
  48 | 9
  Key: 46|5 = 465
• Given data: 0.95, 0.99, 0.89, 1.03, 1.09
  08 | 9
  09 | 5 9
  10 | 3 9
  Key: 10|3 = 1.03

Box and Whisker Plot
• Put the data in order.
• Make a number line containing the minimum and maximum points.
• Mark the min and max with dots.
• Mark the median with a line.
• Mark the quartiles with a line.
• Make a box, and connect the whiskers.

Box and Whisker Plot
Data set 1: 3, 5, 18, 19, 19, 20, 30, 56, 57, 58, 58
Data set 2: 17, 17, 18, 18, 18, 19, 29, 29, 29, 59
[Figure: box and whisker plots of the two data sets on a number line from 0 to 70]

Different Distributions
• Consider the range of the data (the minimum point to the maximum point).
• If there is no mode, then the distribution is relatively uniform.
• If the mean, median, and mode are about equal, then the distribution is roughly “normal”.
• If the mean and median are not roughly equal, then the distribution is “skewed”.

Standard Deviation
The standard deviation is a number that measures how far away each number in a set of data is from their mean.
If the standard deviation is large, it means the numbers are spread out from their mean.
If the standard deviation is small, it means the numbers are close to their mean.

Two classes took a recent quiz.
There were 10 students in each class, and each class had an average score of 81.5.
Since the averages are the same, can we assume that the students in both classes all did pretty much the same on the exam?
The answer is… No. The average (mean) does not tell us anything about the distribution or variation in the grades.

Here are dot plots of the grades in each class:
[Figure: dot plots of the grades in each class, with the mean marked]

So, we need to come up with some way of measuring not just the average, but also the spread of the distribution of our data.

Why not just give an average and the range of data (the highest and lowest values) to describe the distribution of the data?
Well, for example, let's say that for a set of data, the average is 17.95 and the range is 23. But what if the data looked like this:
[Figure: dot plot with the average and the range marked]
Most of the numbers are in one area, and are not evenly distributed throughout the range.

Here are the scores on the math quiz for Team A:
72 76 80 80 81 83 84 85 85 89
Average: 81.5

The standard deviation measures how far away each number in a set of data is from their mean.
For example, start with the lowest score, 72. How far away is 72 from the mean of 81.5?
72 - 81.5 = -9.5
Or, start with the highest score, 89. How far away is 89 from the mean of 81.5?
89 - 81.5 = 7.5
So, the first step to finding the standard deviation is to find all the distances from the mean.
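This first step can be sketched in a few lines of Python. It is only an illustration: it uses the Team A scores listed above and the standard library's `statistics.mean`, and the variable names are our own.

```python
from statistics import mean

# Team A quiz scores, copied from the example above.
team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

avg = mean(team_a)

# Signed distance of each score from the mean (negative = below the mean).
deviations = [score - avg for score in team_a]

print("mean:", avg)
print("distances from the mean:", deviations)

# The negative and positive distances cancel each other out.
print("sum of distances:", sum(deviations))
```

The signed distances always add up to zero, which is one reason a plain average of the distances is not a useful measure of spread.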
Distance from Mean (Team A)
Score:               72     76     80     80     81     83     84     85     85     89
Distance from mean:  -9.5   -5.5   -1.5   -1.5   -0.5   1.5    2.5    3.5    3.5    7.5

Next, you need to square each of the distances to turn them all into positive numbers.
Distance squared:    90.25  30.25  2.25   2.25   0.25   2.25   6.25   12.25  12.25  56.25

Add up all of the squared distances.
Sum: 214.5

Divide by (n - 1), where n represents the amount of numbers you have.
214.5 / (10 - 1) ≈ 23.8

Finally, take the square root of that result.
√23.8 ≈ 4.88

This is the standard deviation.

Now find the standard deviation for the other class's grades (Team B).
Score:               57      65      83     94      95      96      98      93      71      63
Distance from mean:  -24.5   -16.5   1.5    12.5    13.5    14.5    16.5    11.5    -10.5   -18.5
Distance squared:    600.25  272.25  2.25   156.25  182.25  210.25  272.25  132.25  110.25  342.25
Sum: 2280.5
2280.5 / (10 - 1) ≈ 253.4
√253.4 ≈ 15.92

Now, let's compare the two classes again.
                      Team A   Team B
Average on the quiz   81.5     81.5
Standard deviation    4.88     15.92

5 Number Summary
• The five number summary is the minimum value, the three quartiles and the maximum value.
• This may be represented graphically with a box and whisker plot.

Outliers
• Outliers are values in the data set which are either suspiciously large or small.
• Such values may be the result of an error: the researcher measured incorrectly, or maybe the results were typed incorrectly.
• Outliers may be good data. There is always the chance that you have one basketball player in a set of ordinary people.
• The seven foot height is not an error, but it is still unusually large.

Interquartile Range
• One method for identifying these outliers involves the use of quartiles.
• The interquartile range (IQR) is Q3 - Q1.
• All numbers less than Q1 - 1.5(IQR) are probably too small.
• All numbers greater than Q3 + 1.5(IQR) are probably too large.

Using IQR to Find Outliers
[Figure: box plot with red lines marking a distance of 1.5 times the IQR beyond each quartile]
The red lines are 1.5 times the IQR. Starting from Q1 and going left 1.5(IQR), and starting from Q3 and going right 1.5(IQR), we establish limits. All numbers smaller than the left limit or larger than the right limit are outliers.

Linear Transformations
• When changing units, e.g., feet to meters or degrees F to degrees C, we employ a linear transformation.
  – New = a + b(Old)
• Measures of both center and spread will be multiplied by “b” (for spread, by |b| when b is negative).
• Only measures of location are affected by “a”.
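The outlier limits and the effect of a linear transformation can be checked numerically. The sketch below is an illustration under stated assumptions, not a prescribed method: it reuses the Team A quiz scores, finds quartiles with the median-of-halves rule described in the quartile slides, and applies a made-up transformation with a = 10 and b = 0.5. The helper names `quartiles` and `summarize` are our own.

```python
from statistics import mean, median, stdev


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    With an odd count the middle value is left out of both halves.
    Assumes at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data), median(data[len(data) - half:])


def summarize(values):
    """Collect measures of center and spread, plus the 1.5(IQR) outlier limits."""
    q1, q2, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": mean(values),
        "median": q2,
        "stdev": stdev(values),  # sample standard deviation, divides by n - 1
        "iqr": iqr,
        "limits": (low, high),
        "outliers": [x for x in values if x < low or x > high],
    }


# Team A quiz scores from the standard deviation example.
old = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

# Hypothetical change of units, New = a + b * Old; a and b are made up.
a, b = 10, 0.5
new = [a + b * x for x in old]

before, after = summarize(old), summarize(new)
for name in before:
    print(f"{name:8} old={before[name]}  new={after[name]}")
```

In this run the mean and median are both shifted by a and scaled by b, while the standard deviation and IQR are only scaled by b. With this quartile rule the Team A limits come out to 72.5 and 92.5, so the score of 72 falls just below the lower limit. Other quartile conventions can give slightly different limits.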