Describing Data - VCC Library

HOSP 1207 (Business Stats)
Learning Centre

Describing Data

This worksheet focuses on describing data by measuring its central tendency and variability. These measurements give us an idea of what our data set looks like.

CENTRAL TENDENCY

There are three measurements of central tendency: mean, median, and mode.

Mean: Mean is another word for average; it represents the typical value of a data set. To find the mean, take the sum of all numbers in the data set and divide by how many data points there are. The symbol for taking the sum of a set of numbers is the capital Greek letter sigma, Σ, so “Σx” tells you to “take the sum of all values of x”. n is the number of observations in the data set.

You will see the notation for mean represented two ways: x̄ (pronounced “x bar”) is used for the mean of a sample of data, and μ (pronounced “mew”, the Greek letter mu) represents the mean of a population of data. The population is all the members of the group of interest (e.g. all ducks) while a sample is a smaller group, or subset, of the population (e.g. 250 ducks at Trout Lake).

Example 1: Find the mean of the following sample: {3, 5, 4, 9, 8, 5, 7, 8, 9, 12}

Solution: We first take the sum of all numbers in the sample:

$$\sum x = 3 + 5 + 4 + 9 + 8 + 5 + 7 + 8 + 9 + 12 = 70$$

and then divide by the number of values in the data set, n, which equals 10:

$$\bar{x} = \frac{\sum x}{n} = \frac{70}{10} = 7$$

The mean of our data set is 7.

Median: The median is the middle value of an ordered data set. This is a more useful measure of central tendency if the data is significantly skewed, because extreme values pull the mean toward them but barely affect the median. “Skewness” means the data favours high numbers over low numbers, or vice versa. In graph form, a skewed curve appears asymmetrical, rather than as a symmetrical bell shape, with a longer tail leading off to one side. We find the position of the central observation using the formula:

$$\text{position number} = \frac{n + 1}{2}$$

Example 2: Find the median of the data set in Example 1.
Solution: The first step is to put the data set in order from smallest to largest: {3, 4, 5, 5, 7, 8, 8, 9, 9, 12}

Since we have an even number of observations (n = 10), the median is going to be the average of two values. We use the formula for the position of the central observation:

$$\text{position number} = \frac{n + 1}{2} = \frac{10 + 1}{2} = 5.5$$

which means the median is halfway between the 5th and 6th values of the ordered set. We take the average of 7 and 8 and get 7.5, so the median of our data set is 7.5.

© 2013 Vancouver Community College Learning Centre. Student review only. May not be reproduced for classes. Authored by Emily Simpson

Mode: The mode is the data value that occurs most frequently. It is possible to have more than one mode in a data set. In the data set {3, 4, 5, 5, 7}, the number 5 occurs twice so it is the mode. In the data set {2, 4, 2, 6, 7, 7, 7, 8, 2}, the numbers 2 and 7 each occur three times. This is a bimodal data set.

Example 3: Identify the mode(s) in the data set from Example 1, if any exist.

Solution: There are three modes in this data set: 5, 8, and 9 (each value occurs twice). This is called a multimodal data set.

VARIABILITY

Range: The range is the difference between the highest and lowest value in the data set. It is not the most useful measure of variability, because it depends only on the two most extreme values.

Standard deviation: This is the most commonly used measure of variability. It reflects the deviations (or differences) of all values in the data set from the mean. A larger standard deviation indicates greater variability for a data set.

If you calculated the mean mark on a class midterm to be 65, that only tells you the average mark. Did the marks in the class look like {66, 64, 67, 66, 62, 70…} or like {48, 97, 83, 57, 62, 81, …}? The first set of marks has a low standard deviation: most of the marks are quite close to the mean. The second set has a higher standard deviation, as the values are spread more widely around the mean.
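As a quick check of the central tendency measures and the range described above, here is a minimal Python sketch that uses the standard library's `statistics` module on the sample from Example 1. The variable names are illustrative only and are not part of the worksheet.

```python
import statistics

# Sample from Example 1 of this worksheet
sample = [3, 5, 4, 9, 8, 5, 7, 8, 9, 12]

mean = statistics.mean(sample)                # sum of the values divided by n
median = statistics.median(sample)            # averages the two middle values when n is even
modes = sorted(statistics.multimode(sample))  # every value tied for most frequent
data_range = max(sample) - min(sample)        # highest value minus lowest value

print(mean, median, modes, data_range)
```

The printed values should agree with Examples 1 to 3: a mean of 7, a median of 7.5, and the modes 5, 8, and 9. The range of this sample is 12 – 3 = 9.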
The notation for standard deviation of a population is σ (“sigma”, the lower case Greek letter). The notation for standard deviation of a sample is s. To calculate standard deviation, use the following formulas:

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

Sample standard deviation:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$

Here N is the number of values in the population and n is the number of values in the sample. The rightmost formula for sample standard deviation is the easiest one to use for calculating s by hand.

Variance is another related measure of variability that is simply the square of the standard deviation (σ² or s²). If the variance is calculated first (or given), take the square root of the variance to get the standard deviation.

Example 4: Calculate the standard deviation of the data set from Example 1.

Solution: We know n = 10 from Example 1. We also know Σx = 70 from Example 1. The only term left to figure out is Σx², the sum of the squares of all data values:

$$\sum x^2 = 3^2 + 5^2 + 4^2 + 9^2 + 8^2 + 5^2 + 7^2 + 8^2 + 9^2 + 12^2 = 558$$

Now we plug into the formula:

$$s = \sqrt{\frac{558 - \frac{70^2}{10}}{10 - 1}} = \sqrt{7.55556} \approx 2.749$$

Quartiles: One other way to measure variability is by using quartiles and the interquartile range. This is a more accurate description of the data than standard deviation if a data set has strong outliers (values that lie FAR away from the rest of the data) or is strongly skewed.

The first quartile (Q1) is the data point that lies above ¼ (25%) of all the points of the data set and the third quartile (Q3) is the point that lies above ¾ (75%) of all the data points. The second quartile lies above ½ (50%) of all data points (it’s the median).

The idea of quartiles, which cut a data set into quarters, can be extended to percentiles, which cut a data set into hundredths. The pth percentile of a data set is the data point above p% of all the data points in the set. For example, the 90th percentile is the value above 90% of all the data points.
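Returning to Example 4, the hand calculation of the sample standard deviation can be reproduced with a short Python sketch. It applies the by-hand formula, s² = (Σx² – (Σx)²/n) / (n – 1), and cross-checks the result against the standard library's `statistics.variance` and `statistics.stdev`. The variable names are assumptions made for illustration.

```python
import math
import statistics

# Sample from Example 1 of this worksheet
sample = [3, 5, 4, 9, 8, 5, 7, 8, 9, 12]

n = len(sample)                        # number of observations
sum_x = sum(sample)                    # Σx
sum_x_sq = sum(x * x for x in sample)  # Σx²

# By-hand formula: s² = (Σx² - (Σx)²/n) / (n - 1)
variance = (sum_x_sq - sum_x ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

# Cross-check against the standard library's sample variance and standard deviation
assert math.isclose(variance, statistics.variance(sample))
assert math.isclose(std_dev, statistics.stdev(sample))

print(n, sum_x, sum_x_sq, round(variance, 5), round(std_dev, 3))
```

The intermediate values n, Σx, and Σx² are printed so they can be compared with the numbers used in the worked solution.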
To calculate a percentile (or quartile):

(1) Find the position of the percentile. Take the percentile number (e.g. for Q1, 25), divide by 100, and multiply by the number of observations (n) to get the position in the ordered set.
    (a) If you get a whole number for the position, add 0.5.
    (b) If you get a decimal number for the position, round UP to the next whole number.
(2) Find the data point at that adjusted position in the ordered data set. If the adjusted position is a whole number, use the value at that position in the data set as the answer. If the adjusted position ends in .5, use the average of the two values spanning that position in the data set.

The interquartile range (IQR) is the difference between the 3rd quartile and 1st quartile: Q3 – Q1. This range will include the middle 50% of the values of the data set.

Example 5: For the data set in Example 1, determine the 1st and 3rd quartile.

Solution:
1st quartile = 25th percentile: position = 25/100 × 10 = 2.5, which rounds up to the 3rd position
3rd quartile = 75th percentile: position = 75/100 × 10 = 7.5, which rounds up to the 8th position

Take the ordered data set and find the values in the 3rd and 8th positions:

{3, 4, 5, 5, 7, 8, 8, 9, 9, 12}

Q1 = 5, Q3 = 9. For the case above, IQR = Q3 – Q1 = 9 – 5 = 4.

EXERCISES

For the following sets of data, calculate (a) sample mean, (b) median, (c) mode, (d) range, (e) variance, (f) standard deviation, (g) 1st quartile, (h) 3rd quartile, (i) interquartile range, (j) 10th percentile, and (k) 90th percentile.

1. {8, 24, 9, 6, 10, 18, 7, 14, 16, 21, 13, 24}
2. {3, 6, 5, 4, 6, 5, 9, 10, 11, 7, 9}
3. {41, 39, 38, 42, 43, 39, 40, 43, 26, 42, 42, 41, 41, 42, 27, 55, 60}
4. Name the data set (1, 2, or 3, according to the numbered exercises above) with the greatest variability based on (i) standard deviation, (ii) range, and (iii) IQR.
5. Explain why the answers to 4(i) and 4(ii) are different from 4(iii).

SOLUTIONS

1.
(a) 14.17
(b) 13.5
(c) 24
(d) 18
(e) 41.7879
(f) 6.4644
(g) Q1 = 8.5 (position = 3.5, take the average of the 3rd and 4th values in the ordered set)
(h) Q3 = 19.5 (position = 9.5, take the average of the 9th and 10th values in the ordered set)
(i) IQR = 19.5 – 8.5 = 11
(j) 7 (2nd position)
(k) 24 (11th position)

2.
(a) 6.82
(b) 6
(c) 5, 6, 9
(d) 8
(e) 6.7636
(f) 2.6007
(g) Q1 = 5 (position = 3, take the 3rd value in the ordered set)
(h) Q3 = 9 (position = 9, take the 9th value in the ordered set)
(i) IQR = 9 – 5 = 4
(j) 4 (2nd position)
(k) 10 (10th position)

3.
(a) 41.24
(b) 41
(c) 42
(d) 34
(e) 62.9412
(f) 7.9335
(g) Q1 = 39 (5th position)
(h) Q3 = 42 (13th position)
(i) IQR = 42 – 39 = 3
(j) 27 (2nd position)
(k) 55 (16th position)

4.
(i) Data set 3 has the greatest standard deviation
(ii) Data set 3 has the greatest range
(iii) Data set 1 has the largest IQR

5. Data set 3 has strong outliers above and below the central data points. Because of this, Data set 3 has a high standard deviation and range. However, the IQR is less sensitive to outliers and should be used as the measure of variability for data sets with high skewness or strong outliers. For this reason, the IQR of Data set 3 is much lower than the IQR of Data set 1.
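The percentile rule from this worksheet (whole-number position: add 0.5 and average the two neighbouring values; decimal position: round up) can be sketched in Python to check the quartile and percentile answers above. This is only an illustration under stated assumptions: the function name `percentile` and the dictionary `exercise_sets` are hypothetical, and p is assumed to be a whole number strictly between 0 and 100. Statistical software often uses other interpolation conventions, so library results may differ slightly from this rule.

```python
def percentile(data, p):
    """Return the p-th percentile using the position rule from this worksheet.

    Assumes p is a whole number strictly between 0 and 100.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)
    # position = p/100 * n, kept as a whole part and a remainder to avoid rounding error
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # Whole-number position: add 0.5, i.e. average the values at
        # 1-based positions `whole` and `whole + 1`
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Decimal position: round up to 1-based position `whole + 1` (index `whole`)
    return ordered[whole]


# Data sets from Exercises 1 to 3
exercise_sets = {
    1: [8, 24, 9, 6, 10, 18, 7, 14, 16, 21, 13, 24],
    2: [3, 6, 5, 4, 6, 5, 9, 10, 11, 7, 9],
    3: [41, 39, 38, 42, 43, 39, 40, 43, 26, 42, 42, 41, 41, 42, 27, 55, 60],
}

for number, data in exercise_sets.items():
    q1 = percentile(data, 25)
    q3 = percentile(data, 75)
    # Columns: data set, Q1, Q3, IQR, 10th percentile, 90th percentile
    print(number, q1, q3, q3 - q1, percentile(data, 10), percentile(data, 90))
```

Each printed row can be compared with parts (g) to (k) of the corresponding solution; the row with the largest IQR corresponds to the answer to 4(iii).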
