Chapter 2: Descriptive Statistics

2-1 Frequency Distributions
2-2 Displaying Data
2-3 Measures of the Center
2-4 Measures of Dispersion
2-5 Measures of Position
2-6 Bivariate Data

2-1 Frequency Distributions

Frequency Distribution: lists classes (or categories) of values, along with frequencies (or counts) of the number of values that fall into each class.

Qwerty Keyboard Word Ratings

2  2  5  1  2  6  3  3  4  2  4  0  5
7  7  5  6  6  8  10 7  2  2  10 5  8
2  5  4  2  6  2  6  1  7  2  7  2  3
8  1  5  2  5  2  14 2  2  6  3  1  7

Frequency Table of Qwerty Word Ratings

Rating   Frequency
0-2      20
3-5      14
6-8      15
9-11     2
12-14    1

Lower Class Limits: the smallest numbers that can actually belong to different classes.
Upper Class Limits: the largest numbers that can actually belong to different classes.
Class Midpoints: the midpoints of the classes.
Class Width: the difference between two consecutive lower class limits or two consecutive class boundaries.

Rating   Lower Class Limit   Upper Class Limit   Class Midpoint   Class Width   Frequency
0-2      0                   2                   1                3             20
3-5      3                   5                   4                3             14
6-8      6                   8                   7                3             15
9-11     9                   11                  10               3             2
12-14    12                  14                  13               3             1

Guidelines For Frequency Distributions
1. Be sure that the classes are mutually exclusive.
2. Include all classes, even if the frequency is zero.
3. Try to use the same width for all classes.
4. Select convenient numbers for class limits.
5. Use between 5 and 20 classes.
6. The sum of the class frequencies must equal the number of original data values.

Constructing A Frequency Distribution
1. Decide on the number of classes.
2. Determine the class width by dividing the range by the number of classes (range = highest score - lowest score) and round up:
   $\text{class width} = \text{round up of } \dfrac{\text{range}}{\text{number of classes}}$
3. Select for the first lower limit either the lowest score or a convenient value slightly less than the lowest score.
4.
Use calculator procedures to construct a histogram.
5. List the class limits and frequencies.

Relative Frequency Distribution

$\text{relative frequency} = \dfrac{\text{class frequency}}{\text{sum of all frequencies}}$

Rating   Frequency   Relative Frequency
0-2      20          20/52 = 38.5%
3-5      14          14/52 = 26.9%
6-8      15          28.8%
9-11     2           3.8%
12-14    1           1.9%

Total frequency = 52

Cumulative Frequency Distribution

Rating   Frequency   Rating         Cumulative Frequency
0-2      20          Less than 3    20
3-5      14          Less than 6    34
6-8      15          Less than 9    49
9-11     2           Less than 12   51
12-14    1           Less than 15   52

Frequency Distributions (summary)

Rating   Frequency   Relative Frequency   Rating         Cumulative Frequency
0-2      20          38.5%                Less than 3    20
3-5      14          26.9%                Less than 6    34
6-8      15          28.8%                Less than 9    49
9-11     2           3.8%                 Less than 12   51
12-14    1           1.9%                 Less than 15   52

2-2 Visualizing Data

Histogram: a bar graph in which the horizontal scale represents classes and the vertical scale represents frequencies.

[Figure: histogram of Qwerty word ratings, drawn from the frequency table (0-2: 20, 3-5: 14, 6-8: 15, 9-11: 2, 12-14: 1)]

TI-83 Calculator: Constructing a Frequency Table and Histogram

STEP 1
1. Choose STAT PLOT (2nd, Y=)
2. Press Enter
3. The plot should be set to On
4. Cursor to Type and choose the last plot type in the first row
5. Press Enter

STEP 2: Enter Data
1. Press Stat
2. Press "1" Edit
3. Enter data in L1

STEP 3: Set window and graph
1. Press Window
2. Set Xmin = first lower class limit
3. Set Xscl = class width
4. Set Xmax = largest upper class limit (maybe larger)
5. Set Ymin = negative 5
6. Set Ymax = a little higher than the largest frequency
7.
Press Trace

Relative Frequency Histogram of Qwerty Word Ratings

Rating   Relative Frequency
0-2      38.5%
3-5      26.9%
6-8      28.8%
9-11     3.8%
12-14    1.9%

[Figure: histogram and relative frequency histogram shown together]

Frequency Polygon
Uses the class midpoints (points plotted at the middle top of each bar in the histogram).

Cumulative Frequency Histogram of Qwerty Word Ratings

Rating   Cumulative Frequency
0-2      20
0-5      34
0-8      49
0-11     51
0-14     52

Ogive
Uses the upper class limits (points are plotted at the upper right corner of each bar in the histogram).

Stem-and-Leaf Plot

Raw Data (Test Grades): 67 72 89 85 88 90 75 89 99 100

Stem | Leaves
6    | 7
7    | 2 5
8    | 5 8 9 9
9    | 0 9
10   | 0

Used to observe the distribution of data.
Back-to-back stem-and-leaf: HW #7 (also a test problem).

Dot Plot
See HW problem #5b (will be a test problem).

Pareto Chart
Used for qualitative data. See HW problem #4a.

[Figure: Pareto chart "Accidental Deaths by Type"; vertical axis Frequency from 0 to 45,000; bars for Motor Vehicle, Falls, Poison, Drowning, Fire, Ingestion of food or object, and Firearms]

Pie Chart
Pie charts and Pareto charts can illustrate the same data. See HW problem #4d.

Firearms (1,400; 1.9%)
Ingestion of food or object (2,900; 3.9%)
Fire (4,200; 5.6%)
Motor vehicle (43,500; 57.8%)
Drowning (4,600; 6.1%)
Poison (6,400; 8.5%)
Falls (12,200;
16.2%)

[Figure: pie chart titled "Accidental Deaths by Type"]

[Figure: "Deaths in British Military Hospitals During the Crimean War", with categories other causes, preventable diseases, and wounds]

Other Graphs
Boxplots
Pictographs
Time-Series Graphs (forecasting)

2-3 Measures of Center

Measure of center: a value at the center or middle of a data set.

Definitions
Mean (Arithmetic Mean, "average"): the number obtained by adding the values and dividing the total by the number of values.

Notation
$\sum$ denotes the addition of a set of values
$x$ is the variable usually used to represent the individual data values
$n$ represents the number of data values in a sample
$N$ represents the number of data values in a population

$\bar{x}$ is pronounced 'x-bar' and denotes the mean of a set of sample values:
$\bar{x} = \dfrac{\sum x}{n}$

$\mu$ is pronounced 'mu' and denotes the mean of all values in a population:
$\mu = \dfrac{\sum x}{N}$

Calculators can calculate the mean of data.

TI-83 Calculator: Calculate Mean
1. Press Stat
2. Press "1" Edit
3. Enter data in L1
4. Press Stat
5. Cursor over to CALC
6. Choose the 1-Var Stats option
7. Enter 1-Var Stats L1 (press 2nd, then 1)

TI-83 Calculator: Clearing Data in Column
1. Press Stat
2. Press "4" ClrList
3. Enter ClrList L1,L2,etc

Definitions
Median: the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. It is not affected by an extreme value.

Example (even number of values):
6.72 3.46 3.60 6.44
3.46 3.60 6.44 6.72 (sorted)
No exact middle; it is shared by two numbers.
MEDIAN is $\dfrac{3.60 + 6.44}{2} = 5.02$

Example (odd number of values):
6.72 3.46 3.60 6.44 26.70
3.46 3.60 6.44 6.72 26.70 (sorted)
Exact middle: MEDIAN is 6.44

Definitions
Mode: the score that occurs most frequently. A data set can be bimodal, multimodal, or have no mode.

Examples
a. 5 5 5 3 1 5 1 4 3 5: mode is 5
b. 1 2 2 2 3 4 5 6 6 6 7 9: bimodal
c.
1 2 3 6 7 8 9 10: no mode
(In example b, the two modes are 2 and 6.)

Definitions
Midrange: the value midway between the highest and lowest values in the original data set.

$\text{Midrange} = \dfrac{\text{highest score} + \text{lowest score}}{2}$

Round-off Rule for Measures of Center
Carry one more decimal place than is present in the original set of values.
Let's try #2 on HW. Go to Excel or calculator.

Mean from a Frequency Table
Use the class midpoints for the variable x:

$\bar{x} = \dfrac{\sum (f \cdot x)}{\sum f}$

where x = class midpoint, f = frequency, and $\sum f = n$.
Let's try #7 on HW. Go to Excel or calculator.

TI-83 Calculator: Calculate Mean from a Frequency Distribution
1. Press Stat
2. Press "1" Edit
3. Enter midpoints in L1
4. Enter frequencies in L2
5. Press Stat
6. Cursor over to CALC
7. Choose the 1-Var Stats option
8. Enter 1-Var Stats L1,L2

Weighted Mean

$\bar{x} = \dfrac{\sum (w \cdot x)}{\sum w}$

I use this to calculate your grade in the class, the weights being 20% for HW, 60% for exams, and 20% for the final, and x being your total points for each area.

Best Measure of Center: Advantages and Disadvantages

Measure    How often used?   Takes every value into account?   Affected by extreme values?
Mean       Most familiar     Yes                               Yes
Median     Commonly          No                                No
Mode       Sometimes         No                                No
Midrange   Rarely            No                                Yes

Definitions
Symmetric: data is symmetric if the left half of its histogram is roughly a mirror of its right half.
Skewed: data is skewed if it is not symmetric and if it extends more to one side than the other.
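The frequency-table mean and weighted mean formulas earlier in this section can be sketched in Python. This is only an illustration: the function names are made up for this example, the frequency table is the Qwerty word ratings table from section 2-1, and the three category scores (90, 80, 70) are hypothetical values chosen to show the 20% / 60% / 20% weighting.

```python
def mean_from_frequency_table(midpoints, frequencies):
    """Approximate mean: sum(f * x) / sum(f), with x = class midpoint."""
    total = sum(f * x for x, f in zip(midpoints, frequencies))
    return total / sum(frequencies)


def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Qwerty word ratings: classes 0-2, 3-5, 6-8, 9-11, 12-14
midpoints = [1, 4, 7, 10, 13]
frequencies = [20, 14, 15, 2, 1]
qwerty_mean = mean_from_frequency_table(midpoints, frequencies)

# Hypothetical scores for homework, exams, and final (not from the slides)
scores = [90, 80, 70]
weights = [0.20, 0.60, 0.20]
course_grade = weighted_mean(scores, weights)

# Round only at the end, carrying one more decimal place than the data.
print(round(qwerty_mean, 1))
print(round(course_grade, 1))
```

Because the class midpoints stand in for the original values, the frequency-table result is an approximation of the mean of the raw ratings, not an exact match.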
Skewness

SYMMETRIC: Mode = Mean = Median
Sample data: 2 3 3 4 4 4 5 5 6 (frequencies are 1 2 3 2 1)
Median = 4, Mode = 4, Mean = 4

[Figure: three distribution shapes with the positions of the mean, median, and mode marked: symmetric (mode = mean = median), skewed left (negatively), and skewed right (positively)]

Waiting Times of Bank Customers at Different Banks (in minutes)

Green Valley Bank:  6.5 6.6 6.7 6.8 7.1 7.3 7.4 7.7 7.7 7.7
Big Spenders Bank:  4.2 5.4 5.8 6.2 6.7 7.7 7.7 8.5 9.3 10.0

           Green Valley Bank   Big Spenders Bank
Mean       7.15                7.15
Median     7.20                7.20
Mode       7.7                 7.7
Midrange   7.10                7.10

Same data used in section 2.3 & 2.4 #3.

[Figure: dotplots of waiting times for Green Valley Bank and Big Spenders Bank]

So how do we differentiate these data sets? We need measures that describe how the data are dispersed.

2-4 Measures of Variation

Range = highest value - lowest value

Measures of Variation
Standard Deviation: a measure of variation of the scores about the mean (average deviation from the mean).

Sample Standard Deviation Formula

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Calculators can compute the sample standard deviation of data.

Population Standard Deviation

$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

Calculators can compute the population standard deviation of data.

Symbols for Standard Deviation

                                Sample    Population
Most textbooks                  s         σ
Some graphics calculators       Sx        σx
Some non-graphics calculators   xσn-1     xσn

Articles in professional journals and reports often use SD for standard deviation and VAR for variance.

Measures of Variation
Variance: the standard deviation squared.
Notation: $s^2$ (sample) and $\sigma^2$ (population); use the square key on the calculator.

Variance: which is the parameter and which is the statistic?

Sample Variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

Population Variance: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$

TI-83 Calculator
Calculate Standard Deviation
1. Same procedure as Mean
Calculate Standard Deviation from a frequency distribution
1. Same procedure as Mean from a frequency distribution

Round-off Rule for Measures of Variation
Carry one more decimal place than is present in the original set of values.
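As a rough check of the sample and population standard deviation formulas in this section, the sketch below applies them to the two sets of bank waiting times. The helper names are invented for this illustration; Python's standard `statistics` module provides `stdev` for the sample calculation and is used here only as a cross-check.

```python
import math
import statistics


def sample_std_dev(values):
    """s = sqrt( sum((x - x_bar)^2) / (n - 1) )"""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def population_std_dev(values):
    """sigma = sqrt( sum((x - mu)^2) / N )"""
    big_n = len(values)
    mu = sum(values) / big_n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / big_n)


green_valley = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
big_spenders = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

for name, times in [("Green Valley", green_valley), ("Big Spenders", big_spenders)]:
    s = sample_std_dev(times)
    # Cross-check the hand-written formula against the standard library.
    assert math.isclose(s, statistics.stdev(times))
    value_range = max(times) - min(times)
    print(name, "range:", round(value_range, 1), "s:", round(s, 2))
```

Both banks share the same mean, median, mode, and midrange, yet the range and sample standard deviation come out much larger for Big Spenders Bank, which is the difference the measures of center could not show.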
Round only the final answer, never in the middle of a calculation.
Now try an example: #2 (go to Excel or calculator).

Calculating Standard Deviations from the Mean
Example: Let $\bar{x} = 40$ and $s = 3$.
1. Calculate 1, 2, and 3 standard deviations from the mean.
2. Find the maximum and minimum unusual values (test question).

Standard Deviation from a Frequency Distribution

$s = \sqrt{\dfrac{n\left[\sum (f \cdot x^2)\right] - \left[\sum (f \cdot x)\right]^2}{n(n - 1)}}$

Use the class midpoints as the x values.
Calculators can compute the standard deviation for a frequency table.
Let's try #7 (go to Excel or calculator).

The Empirical Rule (applies to bell-shaped distributions)
68% of data are within 1 standard deviation of the mean
95% within 2 standard deviations
99.7% within 3 standard deviations

[Figure: bell curve with the horizontal axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$; areas labeled 0.1%, 2.4%, 13.5%, 34%, 34%, 13.5%, 2.4%, 0.1%]

Chebyshev's Theorem
Applies to distributions of any shape, so its guarantees are less telling than those of the empirical rule.
The proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K is any positive number greater than 1.
At least 3/4 (75%) of all values lie within 2 standard deviations of the mean.
At least 8/9 (89%) of all values lie within 3 standard deviations of the mean.

Chebyshev's Theorem & Empirical Rule Summary

                      Applies To                  Within 1 SD           Within 2 SDs           Within 3 SDs
Chebyshev's Theorem   Any distribution            At least 0% of data   At least 75% of data   At least 89% of data
Empirical Rule        Bell-shaped distributions   Approx 68% of data    Approx 95% of data     Approx 99.7% of data

Important table for your notes!!

Empirical & Chebyshev Examples
There will be one of each on the test.
Example: A batch of bolts has a mean of 4 inches and a standard deviation of 0.007 inch. What can you conclude about the percentage of bolts between various intervals?
1. Apply Empirical Rule (assume bell-shaped data)
2.
Apply Chebyshev's Theorem

Unusual Scores
For typical data sets, it is unusual for a score to differ from the mean by more than 2 or 3 standard deviations.
Note: I ask this on several questions on the test.

Measures of Variation (Dispersion) and Measures of the Center

Center     Variation / Dispersion
Mean       Range
Median     Standard Deviation
Mode       Variance
Midrange

2-5 Measures of Relative Standing OR Measures of Position

When is the relative position of data important?
Example: Student A gets 80 out of 84 on a test. Student B gets 50 out of 62 on another test. Which one did relatively better?
What if the average of the test Student A took was 82 and the average of the test Student B took was 30? Now which one did better?

Measures of Position
z Score (or standard score): the number of standard deviations that a given value x is above or below the mean.

Sample: $z = \dfrac{x - \bar{x}}{s}$

Population: $z = \dfrac{x - \mu}{\sigma}$

Round to 2 decimal places.

Mean and Standard Deviation of z Scores
If all data values in a data set have been converted to z-scores, then the mean is 0 and the standard deviation is 1.
Note: This is a test question.

Interpreting Z Scores

[Figure: z-score number line from -3 to 3, with ordinary values in the middle and unusual values in both tails]

Let's try some examples: see Excel z-score examples #2, #4, and #9.
Note: #9 is very similar to the one on the test.

Measures of Position: Quartiles and Percentiles

Quartiles
Q1, Q2, Q3 divide ranked scores into four equal parts:

(minimum) 25% Q1 25% Q2 (median) 25% Q3 25% (maximum)

Quartiles: Q1 = P25, Q2 = P50, Q3 = P75
Percentiles: 99 percentiles

Finding the Percentile of a Given Score

$\text{Percentile of score } x = \dfrac{\text{number of scores less than } x}{\text{total number of scores}} \cdot 100$

Use normal rounding rules.
Use the first value if there are duplicates.
Let's try #10 together; there will be one like this on the test. Go to Excel for the data.

Finding the Score Given a Percentile

$L = \dfrac{k}{100} \cdot n$

n = total number of values in the data set
k = percentile being used
L = locator that gives the position of a value
$P_k$ = kth
percentile

Let's try a few: #14-20, even.

Finding the Value of the kth Percentile
1. Sort the data. (Arrange the data in order of lowest to highest.)
2. Compute $L = \left(\dfrac{k}{100}\right) n$, where n = number of values and k = percentile in question.
3. Is L a whole number?
   Yes: The value of the kth percentile is midway between the Lth value and the next value in the sorted set of data. Find $P_k$ by adding the Lth value and the next value and dividing the total by 2.
   No: Change L by rounding it up to the next larger whole number. The value of $P_k$ is the Lth value, counting from the lowest.

Interquartile Range (or IQR): $Q_3 - Q_1$
Semi-interquartile Range: $\dfrac{Q_3 - Q_1}{2}$
Midquartile: $\dfrac{Q_1 + Q_3}{2}$
10 - 90 Percentile Range: $P_{90} - P_{10}$
Won't ask you anything about these.

2-6 Bivariate Data

Information about two variables (e.g., Weight and Height)
• Often presented visually as a scatter plot on an xy plane
• Weight is the x axis and Height is the y axis

Scatter Diagram

[Figure: scatter diagram of TAR (vertical axis, 0 to 20) against NICOTINE (horizontal axis, 0.0 to 1.5)]

Go to Excel Example (Weights of Cars and Gas Mileage).

[Figure: scatter diagram of paired data]

Plot a scatter diagram
Data from the Garbage Project

x Plastic (lb)   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household      2      3      3      6      4      2      1      5

Definition
Correlation exists between two variables when one of them is related to the other in some way.

[Figure: scatter plots showing positive linear correlation: (a) positive, (b) strong positive, (c) perfect positive]

[Figure: scatter plots showing negative linear correlation: (d) negative, (e) strong negative, (f) perfect negative]

[Figure: scatter plots showing no linear correlation: (g) no correlation, (h) nonlinear correlation]

Definition
Linear Correlation Coefficient r: measures the strength of the linear relationship between paired x and y values in a sample.

$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$

Rarely
used as it's much easier to use a computer or spreadsheet.

Notes on correlation
r represents the linear correlation coefficient for a sample
$\rho$ (rho) represents the linear correlation coefficient for a population
$-1 \le r \le 1$
r measures the strength of a linear relationship.
r = -1: perfect negative correlation
r = 1: perfect positive correlation
r = 0: no linear correlation

Plot a Scatter Diagram and Calculate Correlation
Data from the Garbage Project

x Plastic (lb)   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household      2      3      3      6      4      2      1      5

Go to Excel or calculator.

TI-83 Calculator: Calculate Correlation
1. Turn on DiagnosticOn mode
2. Press 2nd, then 0
3. Arrow down to DiagnosticOn
4. Press Enter
Note: the value for "r" will not appear on the screen if the calculator is in DiagnosticOff mode.

TI-83 Calculator: Calculate Correlation and Slope/Intercept
1. Press Stat
2. Press Edit
3. Enter x values in L1 and y values in L2
4. Press Stat
5. Cursor over to CALC
6. Choose the 8: LinReg(a+bx) option
7. Enter LinReg(a+bx) L1,L2
8. Press Enter

Plot a Scatter Diagram

[Figure: scatter diagram of the Garbage Project data; x axis from 0 to 3.5, y axis from 0 to 7]

Calculate Correlation
For the Garbage Project data, r = 0.84236.

Definition
Line of Best Fit: algebraically describes the relationship between the two variables.

Line of Best Fit Equation

$\hat{y} = b_0 + b_1 x$ (compare with $y = mx + b$)

x is the predictor variable
$\hat{y}$ is the response variable
$b_0$ = y-intercept
$b_1$ = slope

Line of Best Fit Plotted on Scatter Plot
The line of best fit goes through $(\bar{x}, \bar{y})$.

Formula for $b_0$ and $b_1$

$b_0 = \dfrac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$ (y-intercept)

$b_1 = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$ (slope)

Best to use calculators or computers.
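For readers working on a computer rather than a calculator, the formulas for r, b0, and b1 can be applied directly to the Garbage Project data. The following is a minimal sketch in plain Python; the variable names are chosen for this illustration and are not part of the original slides.

```python
import math

# Garbage Project data: plastic discarded (lb) and household size
x = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]
y = [2, 3, 3, 6, 4, 2, 1, 5]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Linear correlation coefficient r
r = (n * sum_xy - sum_x * sum_y) / (
    math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
)

# Slope b1 and y-intercept b0 of the line of best fit
denominator = n * sum_x2 - sum_x ** 2
b1 = (n * sum_xy - sum_x * sum_y) / denominator
b0 = (sum_y * sum_x2 - sum_x * sum_xy) / denominator

print(f"r = {r:.5f}, b0 = {b0:.3f}, b1 = {b1:.2f}")
```

The correlation computed this way should agree with the r = .84236 reported for this data set, and the slope and intercept share one denominator, which is why they are computed together.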
Find the Line of Best Fit
Data from the Garbage Project

x Plastic (lb)   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household      2      3      3      6      4      2      1      5

Go to Excel or calculator for the following example.

Using a calculator:
$b_0 = 0.549$
$b_1 = 1.48$
$\hat{y} = 0.549 + 1.48x$

Plot Line of Best Fit

[Figure: scatter diagram of the Garbage Project data with the line of best fit; x axis from 0 to 3.5, y axis from 0 to 7]

What is the best predicted size of a household that discards 0.50 lb of plastic?

Using a calculator:
$b_0 = 0.549$
$b_1 = 1.48$
$\hat{y} = 0.549 + 1.48(0.50)$
$\hat{y} = 1.3$

A household that discards 0.50 lb of plastic has approximately one person.
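The prediction step is a single substitution into the line of best fit, sketched below in Python. The function name `predict_household_size` is hypothetical, and the coefficients are the rounded calculator values quoted above rather than freshly fitted ones.

```python
B0 = 0.549  # y-intercept quoted in the notes
B1 = 1.48   # slope quoted in the notes


def predict_household_size(plastic_lb):
    """Evaluate y-hat = b0 + b1 * x for a given weight of plastic (lb)."""
    return B0 + B1 * plastic_lb


predicted = predict_household_size(0.50)

# Round only the final answer, to one decimal place as in the notes.
print(round(predicted, 1))
```

Household size is a whole number of people, so the rounded prediction is read as approximately one person.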