Central Tendency and Dispersion

Univariate Data: data involving one variable.

Mean

$\bar{x} = \frac{\sum x}{n}$

where $\bar{x}$ is the mean, $\sum x$ is the sum of all the scores and $n$ is the number of scores.

The mean gives the average or ‘typical score’ for a set of values. Add up all the values ($\sum x$) and divide by the number of values that you have ($n$).

If the values are written in a frequency table, e.g.

Value      5  6  7  8
Frequency  2  4  3  1

then this means the same as 5, 5, 6, 6, 6, 6, 7, 7, 7, 8 (i.e. 10 numbers), so

$\bar{x} = \frac{(2 \times 5) + (4 \times 6) + (3 \times 7) + (1 \times 8)}{10} = \frac{63}{10} = 6.3$

An OUTLIER can greatly affect your results. E.g. if you add a score of 20 to the above example then the new mean becomes

$\bar{x} = \frac{(2 \times 5) + (4 \times 6) + (3 \times 7) + (1 \times 8) + (1 \times 20)}{11} = \frac{83}{11} = 7.5$ (1 d.p.)

This is a very different answer and is very misleading. Outliers are therefore often identified and removed first.

Median

First place your scores in order from smallest to biggest. The median is the middle score. If two middle scores exist (i.e. when n is an even number) then find the average of these two.

If n is a large number, the easiest way to find the position of the median is to add 1 to the number of scores (n) and then divide by 2. If you get a whole number, the score in that position is the median. If you get a decimal, e.g. 14.5, then the median is the average of the 14th and 15th scores.

Mode

The most frequent value. In the example above the mode is 6. It is possible to have a bimodal situation (i.e. two values both occurring the most often) but if there are more than two we usually say that there is no distinct mode.

Measures of Central Tendency

The mean, median and mode are called measures of central tendency. They describe the average, central or typical members of a group.
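The three measures above can be checked with a short script. This is only a minimal sketch using Python's standard `statistics` module; the helper name `expand` and the variable names are illustrative assumptions, not part of the notes. It expands the frequency table (5, 6, 7, 8 with frequencies 2, 4, 3, 1) into the ten individual scores and then shows how adding the outlier 20 shifts the mean.

```python
from statistics import mean, median, multimode

# Frequency table from the notes: value -> frequency
freq = {5: 2, 6: 4, 7: 3, 8: 1}


def expand(table):
    """Expand a frequency table into the full sorted list of scores (hypothetical helper)."""
    return sorted(value for value, count in table.items() for _ in range(count))


scores = expand(freq)  # 5, 5, 6, 6, 6, 6, 7, 7, 7, 8

summary = {
    "mean": mean(scores),        # sum of scores divided by n
    "median": median(scores),    # average of the two middle scores when n is even
    "modes": multimode(scores),  # list of the most frequent value(s)
}

# Add the outlier from the notes and recompute the mean to 1 d.p.
with_outlier = scores + [20]
mean_with_outlier = round(mean(with_outlier), 1)
median_with_outlier = median(with_outlier)

print(summary)
print(mean_with_outlier, median_with_outlier)
```

Note that `multimode` returns a list, so a bimodal set of scores would give two values. In this data the median stays at 6 when the outlier is added, while the mean moves from 6.3 to 7.5.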
Stem and Leaf Diagrams

139, 168, 142, 133, 152, 155, 131, 140, 163, 142 can be rewritten as:

Stem  Leaves
13    9 3 1
14    2 0 2
15    2 5
16    8 3

You can rewrite them in order if you want to; it makes it easier to see the median:

Stem  Leaves
13    1 3 9
14    0 2 2
15    2 5
16    3 8

Confirm these results: Mean = 146.5, Mode = 142, Median = 142

Dot Frequency Graphs

[Dot frequency graph: the x-axis shows the scores 1 to 6 and the y-axis shows the frequency. The frequencies are 1, 2, 1, 3, 2 and 1 respectively.]

$\text{Mean} = \frac{(1 \times 1) + (2 \times 2) + (1 \times 3) + (3 \times 4) + (2 \times 5) + (1 \times 6)}{10} = \frac{36}{10} = 3.6$

Median = 4
Mode = 4

Also see Sadler 3A page 127 and Exercise 6A. Check that you know how to use your Classpad calculators here.

Range

Range = Largest score - Smallest score

In the dot frequency graph above the range is 6 – 1 = 5.

Quartiles

First put the scores into order, identifying the median score. Then take the scores below the median and split them in half by finding their median. Do the same for the scores above the original median. This will result in the scores being split into quarters. E.g.

4 6 6 8 9 10 12 12 14 16 19

Lower quartile (Q1) = median of the bottom half of the scores (4 6 6 8 9) = 6
Median (Q2) = 10
Upper quartile (Q3) = median of the upper half of the scores (12 12 14 16 19) = 14

Interquartile Range = Upper Quartile - Lower Quartile = Q3 - Q1

In the example above IQR = 14 – 6 = 8.

Box and Whisker Plot

These show the quartiles and range clearly. A rectangular box with end points denoting the upper and lower quartiles has a line through it at the median. Tails on either side show the range. For the example in the quartiles section above, the box and whisker plot looks like:

[Box and whisker plot on a number line from 1 to 20: the box runs from 6 to 14 with a line at the median, 10, and the tails extend to 4 and 19.]

Also see Sadler 3A Exercise 6E Question 5

Mean Deviation

"Mean" brings an average to mind. "Deviation" refers to a difference between two values, more specifically how far a score is from a set point. The mean deviation is the average distance of the scores from the mean, so it shows how concentrated the scores are around the mean. E.g.
Scores: 25 21 24 22 25 29 29 25
Mean = 25

Score  Deviation from the mean
25     25 – 25 = 0
21     25 – 21 = 4
24     25 – 24 = 1
22     25 – 22 = 3
25     25 – 25 = 0
29     29 – 25 = 4
29     29 – 25 = 4
25     25 – 25 = 0

Notice that it doesn’t matter which side of the mean the score is on. We just look at how far away it is from the mean (i.e. no negative values).

$\text{Mean deviation} = \frac{0 + 4 + 1 + 3 + 0 + 4 + 4 + 0}{8} = \frac{16}{8} = 2$

Variance

Square the deviations first and then find the average:

$\text{Variance} = \sigma^2 = \frac{0^2 + 4^2 + 1^2 + 3^2 + 0^2 + 4^2 + 4^2 + 0^2}{8} = \frac{58}{8} = 7.25$

Note: Σ is capital sigma; σ is lower case sigma.

Standard Deviation

Take the square root of the variance.

$\text{SD} = \sigma = \sqrt{7.25} = 2.7$ (1 d.p.)

Measures of Dispersion

The range, interquartile range and the standard deviation are all known as measures of dispersion. The standard deviation is the most useful measure of dispersion in data analysis. This measure of spread is significant because most (probably all) of the scores would lie within three standard deviations either side of the mean, i.e.

$(\bar{x} - 3\sigma) \le \text{all scores} \le (\bar{x} + 3\sigma)$

There are rules that you can follow if every score has been increased or decreased by a constant amount, or multiplied by a constant:

Original mean  Original SD  Alteration to scores                       New mean     New SD
5              1.5          Multiply all scores by 3                   15           4.5
g              h            Multiply all scores by a                   ag           |a|h
10             2            Add 8 to each score                        18           2
g              h            Add b to each score                        g + b        h
15             3            Double each and add 1 to each score        31           6
g              h            Multiply each score by a and then add b    ag + b       |a|h
12             2.5          Add 2 to each and then multiply each by 3  14×3 = 42    2.5×3 = 7.5
g              h            Add b to each and then multiply each by a  a(g + b)     |a|h

Also see Sadler 3A pages 131 – 138, Exercise 6B and “Statistics Worksheet”

Grouped Data

Sometimes data is grouped into sections, ‘class intervals’ or ‘bins’. E.g.

Score      20-29  30-39  40-49  50-59  60-69  70-79  80-89
Frequency  3      11     18     28     24     10     6

This means, for example, that there are 10 scores between 70 and 79, but the problem is that we don’t know the exact value of each of the ten scores.
The advantage is that the overall distribution is clear.

If asked to find the mean or standard deviation of grouped data we need to assume that all of the scores in an interval are at the midpoint of the interval. E.g. the interval 70-79 has a midpoint of 74.5.

Hence the mean of the above is

$\bar{x} = \frac{24.5 \times 3 + 34.5 \times 11 + 44.5 \times 18 + 54.5 \times 28 + 64.5 \times 24 + 74.5 \times 10 + 84.5 \times 6}{3 + 11 + 18 + 28 + 24 + 10 + 6} = \frac{5580}{100} = 55.8$

Also, the SD = 14.3 (1 d.p.)

The mode for this type of example is referred to as the modal class, i.e. the class interval with the most scores, which here is 50-59.

The range can only be approximated: 89 – 20 = 69. We could argue that it should be 89.5 – 19.5 = 70 but, as it is only an estimate, either 69 or 70 would be reasonable.

See also Sadler 3A page 139

Weighted Means

Instead of each of the data points contributing equally to the final average, some data points contribute more than others. E.g. two school classes, one with 20 students and one with 30 students, both take a test.

Morning Class: 62, 67, 71, 74, 76, 77, 78, 79, 79, 80, 80, 81, 81, 82, 83, 84, 86, 89, 93, 98
Mean = 80

Afternoon Class: 81, 82, 83, 84, 85, 86, 87, 87, 88, 88, 89, 89, 89, 90, 90, 90, 90, 91, 91, 91, 92, 92, 93, 93, 94, 95, 96, 97, 98, 99
Mean = 90

The straight average of 80 and 90 is (80 + 90)/2 = 85. However, this does not account for the different numbers of students in each class, so the value of 85 does not reflect the average student grade (independent of class).

The weighted mean of the two classes is

$\frac{20 \times 80 + 30 \times 90}{20 + 30} = \frac{4300}{50} = 86$

Notice that the weighted mean of 86 is closer to the afternoon class mean of 90 than to the morning class mean of 80. The fact that there are more students in the afternoon class “weights” the overall mean towards the afternoon class more than the morning class.

See also Sadler 3A page 140

Cropped Data

Sometimes outliers affect the mean and make it look unrealistic, so the data can be “cropped” to eliminate them. E.g.
In some sporting events the highest and lowest scores are ignored and the rest averaged.

See also Sadler 3A page 141

Standard Scores

$\text{Standardised Score} = \frac{\text{Raw Score} - \text{Mean}}{\text{Standard Deviation}}$

Standardisation is a method of converting various types or units of measurement onto a common scale in order to make comparisons.

See also Sadler 3A page 142 and Exercise 6D
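As a rough illustration of the standardised score formula, the sketch below reuses the eight scores from the Mean Deviation example (25, 21, 24, 22, 25, 29, 29, 25), whose mean is 25 and whose variance is 7.25. The function name `standard_score` and the variable names are assumptions made for this example only. The standard deviation divides by n, matching the variance calculation in these notes.

```python
from math import sqrt


def standard_score(raw, mean, sd):
    """Standardised score = (raw score - mean) / standard deviation."""
    return (raw - mean) / sd


# Scores from the Mean Deviation example in these notes
scores = [25, 21, 24, 22, 25, 29, 29, 25]
n = len(scores)

mean = sum(scores) / n                               # 25
variance = sum((x - mean) ** 2 for x in scores) / n  # average squared deviation
sd = sqrt(variance)                                  # square root of the variance

# One standardised score per raw score
z_scores = [standard_score(x, mean, sd) for x in scores]

for raw, z in zip(scores, z_scores):
    print(f"raw score {raw}: standardised score {z:+.2f}")
```

A score equal to the mean standardises to 0, a score above the mean gives a positive value and a score below the mean gives a negative value, which is what allows scores from differently scaled tests to be compared.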