Advanced Math 45
Statistics #2: Ungrouped Data

3. Calculations Using Ungrouped or Raw Data

1. Measures of Central Tendency
2. Measures of Central Tendency from a Frequency Table
3. Which is the best measure of Central Tendency?
4. Measures of Spread
5. Measures of Spread from a Frequency Table
6. Percentiles

1. Measures of Central Tendency

Measures of central tendency tell us about the “centre” of a set of data. The three that are used most frequently are the mean, median and mode.

The Mode – this is the piece of data that occurs most frequently.

Data: 3, 2, 6, 3, 7, 12, 3, 14, 2
Mode: 3

Some sets of data can have more than one mode or no mode at all.

Data: 2, 3, 2, 3, 5, 7, 2, 5, 3
Modes: 2 and 3

Data: 5, 5, 5, 3, 3, 3, 7, 7, 7
No mode.

The Median – the middle piece of the ordered data.

If there is an odd number of pieces of data, then the middle piece of data is the $\frac{n+1}{2}$th piece of data, where $n$ is the number of pieces of data.

Data: 3, 2, 2, 6, 3, 7, 12, 3, 14
Ordered Data: 2, 2, 3, 3, 3, 6, 7, 12, 14

In this case we have 9 pieces of data, so $(9+1)/2 = 5$ means the 5th piece of data is the middle piece of data. So Median: 3

If there is an even number of pieces of data, then you take the average of the middle two pieces of data, i.e. add the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th pieces of data and divide by 2.

Data: 3, 2, 6, 3, 7, 12, 3, 14
Ordered Data: 2, 3, 3, 3, 6, 7, 12, 14

In this case we have 8 pieces of data, so take the 4th (8/2) and 5th (8/2+1) pieces of data and divide their sum by 2.

$$\text{Median} = \frac{3 + 6}{2} = 4.5$$

The Mean ($\bar{x}$) – divide the total of the data by the number of pieces of data.

Data: 3, 2, 6, 3, 7, 12, 3, 14

$$\bar{x} = \frac{3 + 2 + 6 + 3 + 7 + 12 + 3 + 14}{8} = \frac{50}{8} = 6.25$$

Summation (sigma) notation

$\sum x$ means the sum of all the x’s, so in our example above $\sum x = 50$ and

$$\bar{x} = \frac{\sum x}{n} = \frac{50}{8} = 6.25$$

Note that different symbols are used for the mean depending on whether the mean is of a sample or the population.
Mean of a sample: $\bar{x}$
Mean of the population: $\mu$

Assignment #2 Qu 1-3

2. Measures of Central Tendency From a Frequency Table

With larger quantities of data it is often useful to put the data into a frequency table.

Example: Find the mean, median and mode from the following frequency table.

x    f
0    5
1    9
2    4
3    2
4    3
5    1

Mode: mode = 1 (the x value with the highest frequency)

Median:

x    f    Cumulative Freq
0    5    5
1    9    5 + 9 = 14
2    4    14 + 4 = 18
3    2    18 + 2 = 20
4    3    20 + 3 = 23
5    1    23 + 1 = 24

$\sum f = 24$

By totalling the frequency column we can see that there are 24 pieces of data, so we need to find the 12th and 13th pieces of data, both of which have the value 1 in this case.

$$\text{Median} = \frac{1 + 1}{2} = 1$$

Mean: The easiest way to find the total of ALL the data is to add a frequency × data value column (fx).

x    f    fx
0    5    0 × 5 = 0
1    9    1 × 9 = 9
2    4    2 × 4 = 8
3    2    3 × 2 = 6
4    3    4 × 3 = 12
5    1    5 × 1 = 5

$\sum f = 24$, $\sum fx = 40$

$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{40}{24} \approx 1.67$$

Assignment #2 Qu 4

3. Which is the best measure of central tendency?

Or, put another way: with three averages to choose from (mean, median and mode), which should we use?

The answer is: it depends on the situation! Each kind of measure is appropriate for certain types of data. The following table shows the advantages and disadvantages of these different averages.

Mean
  Advantages: All the data is used to find the answer.
  Disadvantages: Very large or very small numbers can distort the answer.

Median
  Advantages: Very big and very small values don't affect it.
  Disadvantages: Takes a long time to calculate for a very large set of data.

Mode or modal class
  Advantages: The only average we can use when the data is not numerical.
  Disadvantages:
    1. There may be more than one mode.
    2. There may be no mode at all if none of the data is the same.
    3. It may not accurately represent the data.

Example

This table shows the annual salary of people who work at a garden centre.
Annual salary (£)      Number of people
0 - 9,999              10
10,000 - 19,999        9
20,000 - 29,999        9
30,000 - 39,999        1
40,000 - 49,999        1

The modal class is £0 - £9,999.

Question: What is the disadvantage of using the modal class?

Answer: Even though the range £0 - £9,999 contains the largest number of people, the next two ranges contain comparable numbers, so the modal class is not representative of the data.

Question: What is the disadvantage of using the mean?

Answer: Almost everyone earns under £30,000. The mean would be distorted by the fact that two people earn much more than this.

Assignment #2 Qu 5 & 6

4. Measures of Spread

Range: The range of a set of data is the difference between the largest value and the smallest value.

Given the data set: 12, 5, 17, 9, 16, 3

Range = 17 - 3 = 14.

Because the range of a set of data is the difference between its extreme values, it can be unduly influenced by outliers. A better way of looking at the spread of data is to use the standard deviation.

Standard Deviation

The Standard Deviation is a measure of how spread out numbers are. Its symbol is σ (the Greek letter sigma).

The formula is easy: it is the square root of the Variance. So now you ask, "What is the Variance?"

Variance

The Variance is defined as: the average of the squared differences from the Mean.

To calculate the variance follow these steps:

1. Work out the mean (the simple average of the numbers).
2. Then for each number: subtract the Mean and square the result (the squared difference).
3. Then work out the average of those squared differences. (Why square?)

Example

You and your friends have just measured the heights of your dogs (in millimetres).

The heights (at the shoulders) are: 600 mm, 470 mm, 170 mm, 430 mm and 300 mm.

Find out the Mean, the Variance, and the Standard Deviation.
Your first step is to find the Mean:

Answer:

$$\text{Mean} = \frac{600 + 470 + 170 + 430 + 300}{5} = \frac{1970}{5} = 394$$

so the mean (average) height is 394 mm.

[Chart: the dogs' heights with the mean of 394 mm marked]

Now, we calculate each dog's difference from the Mean:

206, 76, -224, 36, -94

To calculate the Variance, take each difference, square it, and then average the result:

$$\sigma^2 = \frac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5} = \frac{108{,}520}{5} = 21{,}704$$

So, the Variance is 21,704.

And the Standard Deviation is just the square root of the Variance, so:

Standard Deviation: $\sigma = \sqrt{21{,}704} = 147.32\ldots = 147$ (to the nearest mm)

And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147 mm) of the Mean, i.e. between 247 mm and 541 mm:

[Chart: the dogs' heights with the band of one Standard Deviation either side of the Mean marked]

The heights 470 mm, 430 mm and 300 mm fall inside this band; 600 mm is above it and 170 mm is below it.

So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.

Rottweilers are tall dogs. And Dachshunds are a bit short ... but don't tell them!

But ... there is a small change with Sample Data

Our example was for a Population (the 5 dogs were the only dogs we were interested in).

But if the data is a Sample (a selection taken from a bigger Population), then the calculation changes!

When you have "n" data values that are:

The Population: divide by n when calculating the Variance (like we did)
A Sample: divide by n - 1 when calculating the Variance

All other calculations stay the same, including how we calculated the mean.

Example: if our 5 dogs were just a sample of a bigger population of dogs, we would divide by 4 instead of 5 like this:

Sample Variance = 108,520 / 4 = 27,130
Sample Standard Deviation = $\sqrt{27{,}130} = 164.71\ldots = 165$ (to the nearest mm)

Think of it as a "correction" when your data is only a sample.

Formulas:

The "Population Standard Deviation":

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

The "Sample Standard Deviation":

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$

Looks complicated, but the important change is to divide by N - 1 (instead of N) when calculating a Sample Variance.

A way to make calculating the standard deviation easier is to create a table.
x      $x - \bar{x}$           $(x - \bar{x})^2$
600    600 - 394 = 206     206² = 42436
470    470 - 394 = 76      76² = 5776
170    170 - 394 = -224    (-224)² = 50176
430    430 - 394 = 36      36² = 1296
300    300 - 394 = -94     (-94)² = 8836

$\sum x = 1970$, $\sum (x - \bar{x})^2 = 108520$

Mean:

$$\bar{x} = \frac{\sum x}{n} = \frac{1970}{5} = 394$$

Standard Deviation:

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{108520}{5}} = 147.32\ldots$$

Why do we square the differences? Check out what happens if you add up the $(x - \bar{x})$ column: 206 + 76 - 224 + 36 - 94 = 0. Not very useful!

There is a second version of the standard deviation formula, i.e.

$$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$

Let's see how we get this:

$$
\begin{aligned}
\sigma &= \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{\sum (x^2 - 2\bar{x}x + \bar{x}^2)}{n}} \\
&= \sqrt{\frac{\sum x^2 - \sum 2\bar{x}x + \sum \bar{x}^2}{n}} \\
&= \sqrt{\frac{\sum x^2 - 2\bar{x}\sum x + \bar{x}^2 \sum 1}{n}} \\
&= \sqrt{\frac{\sum x^2}{n} - 2\bar{x}\,\bar{x} + \frac{n\bar{x}^2}{n}} \\
&= \sqrt{\frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2} \\
&= \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}
\end{aligned}
$$

Here we used $\frac{\sum x}{n} = \bar{x}$ and $\sum 1 = n$. This is often the easier formula to work with.

x      $x^2$
600    360000
470    220900
170    28900
430    184900
300    90000

$\sum x = 1970$, $\sum x^2 = 884700$

$$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\frac{884700}{5} - 394^2} = 147.322\ldots$$

as before.

For a sample, divide the sum of squared differences by n - 1 instead of n. In the second version of the formula this becomes

$$s = \sqrt{\frac{\sum x^2 - n\bar{x}^2}{n - 1}}$$

Many calculators will work out the values of the summations, given the input data.

Assignment #3 Qu 1-6

5. Measures of Spread from a Frequency Table

Finding the range from a frequency table doesn’t differ from before, i.e.

Range = largest value - smallest value

x    f
0    5
1    9
2    4
3    2
4    3
5    1

From the table we can see that the range = 5 - 0 = 5.

To calculate the standard deviation we have to modify the formula slightly.
$$\sigma = \sqrt{\frac{\sum f(x - \bar{x})^2}{n}} \quad \text{or} \quad \sigma = \sqrt{\frac{\sum fx^2}{n} - \bar{x}^2}$$

where $n = \sum f$ is the total number of pieces of data.

Let’s see how this works, using the example from above, where the mean is $1\tfrac{2}{3}$.

x    f    $x - \bar{x}$    $(x - \bar{x})^2$    $f(x - \bar{x})^2$
0    5    -1 2/3     2 7/9        13 8/9
1    9    -2/3       4/9          4
2    4    1/3        1/9          4/9
3    2    1 1/3      1 7/9        3 5/9
4    3    2 1/3      5 4/9        16 1/3
5    1    3 1/3      11 1/9       11 1/9

$\sum f(x - \bar{x})^2 = 49\tfrac{1}{3}$

$$\sigma = \sqrt{\frac{\sum f(x - \bar{x})^2}{n}} = \sqrt{\frac{49\tfrac{1}{3}}{24}} = 1.43\ldots$$

or, using the alternate formula

x    f    fx            $x^2$    $fx^2$
0    5    0 × 5 = 0     0      0
1    9    1 × 9 = 9     1      9
2    4    2 × 4 = 8     4      16
3    2    3 × 2 = 6     9      18
4    3    4 × 3 = 12    16     48
5    1    5 × 1 = 5     25     25

$\sum f = 24$, $\sum fx = 40$, $\sum fx^2 = 116$

$$\sigma = \sqrt{\frac{\sum fx^2}{n} - \bar{x}^2} = \sqrt{\frac{116}{24} - \left(1\tfrac{2}{3}\right)^2} = 1.43\ldots$$

Assignment #3 Qu 7

6. Percentiles

Percentiles divide a set of data, arranged in order of magnitude, into 100 equal parts; hence the median is the 50th percentile.

They are used in health care to compare individual scores with what is considered to be “normal” (e.g. the weight of babies), in education to determine grades (the top 5% get an A, the next 10% a B, etc.), or in industry to find the best candidates for a job.

A percentile rank can be calculated using the formula:

$$PR = \frac{B + \frac{1}{2}E}{n} \times 100\%$$

where
B = number of scores below the given score
E = number of scores equal to the given score (including the score itself)
n = total number of scores

Example: Students in a test scored as follows.

54  65  71  71  71  71  74  75
76  76  77  80  80  80  92  98

Jane scored 71. What was her percentile rank?

$$PR = \frac{2 + \frac{1}{2}(4)}{16} \times 100\% = 25\%$$

i.e. Jane is at the 25th percentile: 2 students scored below her, and the 4 scores of 71 count as half each.

Jane's friend John scored 77 on this test. He will get an interview for a job if he is in the top 30%. Will he get an interview?

$$PR = \frac{10 + \frac{1}{2}(1)}{16} \times 100\% = 65.625\%$$

Percentile ranks are always rounded up to the next whole number, so John's percentile rank would be 66, i.e. about 66% of the class is ranked at or below him, so about 34% were better than him. So no, he won’t get the interview!

Assignment #4
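The percentile rank calculations for Jane and John can be checked with a short Python sketch. The function name `percentile_rank` and the variable names are illustrative choices, not part of the handout. The sketch assumes the formula $PR = (B + \frac{1}{2}E)/n \times 100\%$ given above, with rounding up to the next whole number.

```python
from math import ceil


def percentile_rank(scores, score):
    """Percentile rank of `score` using PR = (B + E/2) / n * 100."""
    below = sum(1 for s in scores if s < score)   # B: scores below
    equal = sum(1 for s in scores if s == score)  # E: scores equal (incl. itself)
    return (below + 0.5 * equal) / len(scores) * 100


# The 16 test scores from the example
scores = [54, 65, 71, 71, 71, 71, 74, 75,
          76, 76, 77, 80, 80, 80, 92, 98]

jane = percentile_rank(scores, 71)
john = percentile_rank(scores, 77)

print(jane)        # 25.0
print(john)        # 65.625
print(ceil(john))  # 66, rounded up to the next whole number
```

Counting B and E directly from the list avoids having to sort the data first, which is convenient when the scores are entered in the order they were collected.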