STA 2023 Chapter 2 – Methods for Describing Sets of Data

Describing Qualitative Data (2.1)
o Class – a category into which qualitative data can be sorted
    G, PG, PG-13, R, and NC-17 are classes for film ratings
o Class Frequency – number of observations in a class
o Class Relative Frequency – class frequency divided by the total number of observations

Table 2.1
Rating    Frequency    Relative Frequency
G              4             .08
PG            11             .22
PG-13         19             .38
R             15             .30
NC-17          1             .02
TOTAL         50            1.00

[Pie chart "2002 Film Ratings": G 8%, PG 22%, PG-13 38%, R 30%, NC-17 2%]
Figure 2.1: A pie chart representing the data in Table 2.1. (Excel)

[Bar chart "2002 Film Ratings": Number of Films by Ratings; G 4, PG 11, PG-13 19, R 15, NC-17 1]
Figure 2.2: A frequency histogram representing the data in Table 2.1. (Excel)

[Bar chart "2002 Film Ratings": Proportion of Films by Ratings; G 0.08, PG 0.22, PG-13 0.38, R 0.30, NC-17 0.02]
Figure 2.3: A relative frequency histogram representing the data in Table 2.1. (Excel)

Note that the frequency histogram and the relative frequency histogram have the same shape. Frequency histograms and relative frequency histograms will always have the same shape for the same set of data.

Describing Quantitative Data (2.2)
o Graphical Methods
    Dot Plot
    Histogram
    Stem-and-Leaf Display

Table 2.2 - Number of kills per match for a volleyball team (sorted)
17 18 19 20 21 21 24 25 25 26 26 27 27 27 28 28 32 32 33 33 33 35 35

[Dot plot: horizontal axis "Number of Kills", 17 to 35]
Figure 2.4: A dot plot representing the data in Table 2.2. (Excel)

[Histogram "Number of Kills per Match": vertical axis "Number of Matches", horizontal axis "Number of Kills" with classes Less than 21, 21-24, 25-28, 29-32, More than 32]
Figure 2.5: A frequency histogram representing the data in Table 2.2. (Excel)

Stem = Tens
1 | 789
2 | 0114556677788
3 | 2233355
Figure 2.6: A stem-and-leaf display representing the data in Table 2.2.
(Word)

Summation Notation (2.3)
o Example – Exercise 2.34 using the dataset {3, 8, 4, 5, 3, 4, 6}
    a. $\sum x = 3 + 8 + 4 + 5 + 3 + 4 + 6 = 33$
    b. $\sum x^2 = 3^2 + 8^2 + 4^2 + 5^2 + 3^2 + 4^2 + 6^2 = 9 + 64 + 16 + 25 + 9 + 16 + 36 = 175$
    c. $\sum (x-5)^2 = (-2)^2 + (3)^2 + (-1)^2 + (0)^2 + (-2)^2 + (-1)^2 + (1)^2 = 4 + 9 + 1 + 0 + 4 + 1 + 1 = 20$
    d. $\sum (x-2)^2 = \sum (x^2 - 4x + 4) = \sum x^2 - \sum 4x + \sum 4 = \sum x^2 - 4\sum x + 4\sum 1 = 175 - 4(33) + 4(7) = 175 - 132 + 28 = 71$
    e. $(\sum x)^2 = (33)^2 = 1{,}089$

Numerical Measures of Central Tendency (2.4)
o Mean – average value of a dataset (also called average or expected value)
    Formula to calculate sample mean: $\bar{x} = \frac{\sum x}{n}$
    $\bar{x}$ (pronounced “x bar”) represents the sample mean
    $\mu$ (pronounced “myoo”, spelled mu) represents the population mean
o Sample Size – number of observations in the sample
    n represents sample size
o Population Size – number of observations in the entire population
    N represents population size
o Median – the value of a dataset which splits the data into halves (middle value)
    Usually M represents the median of a sample or population
    To find the median of a dataset, do the following:
    1. Order all data in ascending order
    2. Calculate the number of observations (assume n)
        a. If n is odd, then M is the kth value of the dataset, where $k = \frac{n+1}{2}$
        b.
           If n is even, then M is the average of the kth and (k+1)th values of the dataset, where $k = \frac{n}{2}$
o Mode – most frequently occurring value in the dataset
    No symbol usually associated with mode
o Modal Class – class with the greatest frequency
    No symbol usually associated with modal class
o Shape
    When the mean and median of a distribution are equal, the distribution is called symmetric
    When the mean is less than the median, the distribution is skewed left
    When the mean is greater than the median, the distribution is skewed right
    NOTE: You should be comfortable identifying the shape of a distribution both by comparing the mean and median and by visual identification
    [Figure: three distribution sketches labeled Symmetric, Skewed Left, and Skewed Right]
o Example – Exercise 2.42 calculating the mean, median, and mode
    a. {7, -2, 3, 3, 0, 4}
        Order the data: {-2, 0, 3, 3, 4, 7}
        Mean: $\bar{x} = \frac{\sum x}{n} = \frac{-2 + 0 + 3 + 3 + 4 + 7}{6} = \frac{15}{6} = 2.5$
        Median: n = 6, so n is even; then $k = \frac{6}{2} = 3$ and k+1 = 4, so we take the average of the 3rd and 4th values of the dataset, which are 3 and 3, respectively; their average is 3
        Mode: the value in the dataset that occurs most frequently is 3
    b. {2, 3, 5, 3, 2, 3, 4, 3, 5, 1, 2, 3, 4}
        Order the data: {1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5}
        Mean: $\bar{x} = \frac{\sum x}{n} = \frac{1 + 2 + 2 + 2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 5 + 5}{13} = \frac{40}{13} = 3.08$
        Median: n = 13, so n is odd; then $k = \frac{14}{2} = 7$, so the median is the 7th value of the dataset, which is 3
        Mode: the value in the dataset that occurs most frequently is 3
    c.
       {51, 50, 47, 50, 48, 41, 59, 68, 45, 37}
        Order the data: {37, 41, 45, 47, 48, 50, 50, 51, 59, 68}
        Mean: $\bar{x} = \frac{\sum x}{n} = \frac{37 + 41 + 45 + 47 + 48 + 50 + 50 + 51 + 59 + 68}{10} = \frac{496}{10} = 49.6$
        Median: n = 10, so n is even; then $k = \frac{10}{2} = 5$ and k+1 = 6, so we take the average of the 5th and 6th values of the dataset, which are 48 and 50, respectively; their average is 49
        Mode: the value in the dataset that occurs most frequently is 50

Numerical Measures of Variability (2.5)
o Range – maximum minus minimum
    Usually R represents the range in a sample or population
o Sample Variance – a common measurement for the variability in a dataset
    $s^2$ represents the sample variance
    $\sigma^2$ (sigma-squared) represents the population variance
    Formula to calculate sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
    Alternate (easier) formula: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}$
o Sample Standard Deviation – spread of the data
    Standard deviation is preferable to variance, since the units of standard deviation are the original units of the variable (the units of variance are not)
    s represents the sample standard deviation
    $\sigma$ (sigma) represents the population standard deviation
    Formula to calculate sample standard deviation: $s = \sqrt{s^2}$
    If an estimate for s cannot be obtained, dividing the range by 4 for small samples and by 6 for large samples will provide a conservative estimate
o Example – Exercise 2.62 using Sample 1: {10, 0, 1, 9, 10, 0, 8, 1, 1, 9} and Sample 2: {0, 5, 10, 5, 5, 5, 6, 5, 6, 5}
    a. Examine both samples and identify the one that you believe has the greater variability.
        Sample 1 appears to have the greater variability.
    b. Calculate the range for each sample. Does the result agree with your answer to part a? Explain.
        Sample 1 Range = Maximum – Minimum = 10 – 0 = 10.
        Sample 2 Range = Maximum – Minimum = 10 – 0 = 10.
        These results do not agree with our answer in part a: the two ranges are equal, so the range does not show that Sample 1 is more variable.
    c. Calculate the standard deviation for each sample. Does the result agree with your answer to part a?
       Explain.
        Sample 1 Variance: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} = \frac{(10^2 + 0^2 + 1^2 + 9^2 + 10^2 + 0^2 + 8^2 + 1^2 + 1^2 + 9^2) - \frac{(10 + 0 + 1 + 9 + 10 + 0 + 8 + 1 + 1 + 9)^2}{10}}{9} = \frac{429 - \frac{49^2}{10}}{9} = 20.99$, so Sample 1 Standard Deviation: $s = \sqrt{s^2} = \sqrt{20.99} = 4.58$.
        Sample 2 Variance: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} = \frac{(0^2 + 5^2 + 10^2 + 5^2 + 5^2 + 5^2 + 6^2 + 5^2 + 6^2 + 5^2) - \frac{(0 + 5 + 10 + 5 + 5 + 5 + 6 + 5 + 6 + 5)^2}{10}}{9} = \frac{322 - \frac{52^2}{10}}{9} = 5.73$, so Sample 2 Standard Deviation: $s = \sqrt{s^2} = \sqrt{5.73} = 2.39$.
        These results agree with our answer to part a.
    d. Which of the two, the range or the standard deviation, provides a better measure of variability?
        The standard deviation provides a better measure of variability.

Interpreting the Standard Deviation (2.6)
o Chebyshev’s Rule – gives a bound for the proportion of data that falls within a specified number of standard deviations of the mean, and also gives a bound for the proportion of data that falls outside of a specified number of standard deviations of the mean
    Proportion of data within k standard deviations of the mean: at least $1 - \frac{1}{k^2}$
    Proportion of data outside k standard deviations of the mean: less than $\frac{1}{k^2}$
    Chebyshev’s Rule applies to ANY distribution, regardless of its shape

Table 2.3 – Table of common values for Chebyshev’s Rule
k    Proportion within k       Proportion outside k
1    at least 0 (useless)      < 1 (useless)
2    at least .75              < .25
3    at least .888             < .111

o Example – Assume we have a distribution with $\mu = 9$ and $\sigma = 2$
    a. What proportion of the data lies between 5 and 13?
        Here, k=2, so referring to Table 2.3, we know that at least 75% of the distribution lies between 5 and 13.
    b. What proportion of the data lies outside of (2, 16)?
        Here, k=3.5, so we use the formula $\frac{1}{k^2}$ with k=3.5, which tells us that less than .0816, or less than 8.16%, of the distribution lies outside of (2, 16).
    c. What proportion of the data lies below 4?
        Here, k=2.5, so we use the formula $\frac{1}{k^2}$ with k=2.5, which tells us that less than .16, or less than 16%, of the data lies outside of (4, 14).
        However, it would be incorrect to split the .16 in half just because we are looking at only one side of the distribution: we know nothing about the shape of the distribution, so we cannot assume that it is symmetric. The most we can say is that less than 16% of the data lies below 4.
o Empirical Rule – gives an approximation for the proportion of data that falls within or outside of one, two, or three standard deviations of the mean
    Approximately 68% of the distribution will fall within 1 standard deviation of the mean
    Approximately 95% of the distribution will fall within 2 standard deviations of the mean
    Approximately 99.7% of the distribution will fall within 3 standard deviations of the mean
    Empirical Rule applies only to distributions that are both SYMMETRIC and MOUND-SHAPED
    [Figure: mound-shaped curve; moving outward from the mean on each side, the regions contain 34%, 13.5%, 2.35%, and 0.15% of the data]
o Example – Assume we have a symmetric, mound-shaped distribution with $\mu = 9$ and $\sigma = 2$
    a. What proportion of the data lies within (7, 11)? 68%
    b. What proportion of the data lies within (5, 15)? 97.35% (47.5% below the mean plus 49.85% above it)
    c. What proportion of the data lies outside of (5, 13)? 5% (100% minus 95%)
    d. What proportion of the data lies above 11? 16% (50% minus 34%)

Numerical Measures of Relative Standing (2.7)
o Percentile – an observation that is the pth percentile will have p% of the distribution below it and (100-p)% of the distribution above it
o Z-score – measure of relative standing associated with the standard normal distribution (Chapter 5)
    Formula for sample z-score: $z = \frac{x - \bar{x}}{s}$
    Formula for population z-score: $z = \frac{x - \mu}{\sigma}$
o Example – Exercise 2.82, computing z-scores
    a. x = 40, s = 5, $\bar{x}$ = 30: $z = \frac{x - \bar{x}}{s} = \frac{40 - 30}{5} = 2$
    b. x = 90, $\mu$ = 89, $\sigma$ = 2: $z = \frac{x - \mu}{\sigma} = \frac{90 - 89}{2} = .5$
    c. $\mu$ = 50, $\sigma$ = 5, x = 50: $z = \frac{x - \mu}{\sigma} = \frac{50 - 50}{5} = 0$
    d. s = 4, x = 20, $\bar{x}$ = 30: $z = \frac{x - \bar{x}}{s} = \frac{20 - 30}{4} = -2.5$
    e. a. sample
       b. population
       c. population
       d. sample
    f. a. above the mean by 2 standard deviations
       b. above the mean by .5 standard deviations
       c. at the mean
       d.
          below the mean by 2.5 standard deviations

Methods for Detecting Outliers (2.8) – SKIP

Graphing Bivariate Relationships (2.9) – SKIP

Distorting the Truth with Descriptive Techniques (2.10)
o Statistics can lie
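
The arithmetic in the worked exercises of Sections 2.3 through 2.7 can be checked by machine. What follows is only a minimal sketch in Python using the standard library; the helper names sample_variance, chebyshev_outside, and z_score are assumptions made for this illustration and do not come from the course or the textbook.

```python
import math
import statistics

# Section 2.3: summation notation on the Exercise 2.34 dataset.
data = [3, 8, 4, 5, 3, 4, 6]
sum_x = sum(data)                             # sum of x
sum_x_sq = sum(x ** 2 for x in data)          # sum of x^2
sum_dev_sq = sum((x - 5) ** 2 for x in data)  # sum of (x - 5)^2
square_of_sum = sum_x ** 2                    # (sum of x)^2


def sample_variance(values):
    """Shortcut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    return (sum(v ** 2 for v in values) - sum(values) ** 2 / n) / (n - 1)


def chebyshev_outside(k):
    """Chebyshev bound on the proportion more than k std devs from the mean."""
    return 1 / k ** 2


def z_score(x, center, spread):
    """Standard deviations that x lies above (+) or below (-) the center."""
    return (x - center) / spread


# Section 2.4: Exercise 2.42 part c.
sample_c = [51, 50, 47, 50, 48, 41, 59, 68, 45, 37]
mean_c = statistics.mean(sample_c)
median_c = statistics.median(sample_c)
mode_c = statistics.mode(sample_c)

# Section 2.5: Exercise 2.62.
sample_1 = [10, 0, 1, 9, 10, 0, 8, 1, 1, 9]
sample_2 = [0, 5, 10, 5, 5, 5, 6, 5, 6, 5]
s_1 = math.sqrt(sample_variance(sample_1))
s_2 = math.sqrt(sample_variance(sample_2))

print(sum_x, sum_x_sq, sum_dev_sq, square_of_sum)
print(mean_c, median_c, mode_c)
print(round(s_1, 2), round(s_2, 2))
print(round(chebyshev_outside(3.5), 4))  # Section 2.6 example, part b
print(z_score(20, 30, 4))                # Section 2.7, Exercise 2.82 part d
```

The shortcut variance formula is written out by hand so that it mirrors the notes; statistics.variance from the standard library computes the same sample variance (n - 1 in the denominator) and can serve as a cross-check.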