CHAPTER 4 NUMERICAL METHODS FOR DESCRIBING DATA
What trends can be determined from individual data sets?

4.1 Describing the Center of a Data Set
• What is the center of a data set and how can it be found?

Center and Spread
• Two of the most critical descriptors of a data set
• Graphical methods such as those in the last chapter give a general impression of both
• Numerical methods give precise values that can be compared in detail

The three M's
• Mean: also known as the average
• Median: also called the middle
• Mode: the most frequent value

Mean
• Formula for the sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• x = each piece of data
• n = number of pieces of data in the data set
• $x_i$: the subscript i indicates the position of the value within the original data set
• Always report the mean with more accuracy (more decimals) than any one piece of data has
• µ is used for the population mean. Greek letters are always used for population values.

Median
• The middle value in a list of ordered values
  – The median has no symbol but is often abbreviated Med
  – Middle number if n is odd
  – Mean of the two middle numbers if n is even

Compare and Contrast: Mean and Median
• The median divides the data into two equal parts: 50% of the data is on either side of the median
• The mean is where the fulcrum would cause the "data scale" to balance if the values had weight
• The mean is very sensitive to outliers
[Figure: balancing the "data scale"; positions of the mean and median on a normal (bell) curve, a left-skewed curve, and a right-skewed curve]

Dichotomy
• For data with only two possible outcomes, the sample proportion of successes is $\hat{p} = \frac{\text{number of successes}}{n}$

Trimmed Mean
• Makes the mean less susceptible to outliers
• Order the data
• Remove the same number of pieces of data from each end
• Recalculate the mean
• Number of pieces to be removed from EACH end = trim % × n
• A small to moderate trim is 5% to 25%

Trimmed Mean Example
• Find the 15% trimmed mean of: 3, 6, 8, 2, 9, 10, 7, 15, 4, 12, 20, 36, 15, 5, 3, 7, 10, 16, 17, 12
• Order the numbers: 2, 3, 3, 4, 5, 6, 7, 7, 8, 9, 10, 10, 12, 12, 15, 15, 16, 17, 20, 36
• 20 items × 0.15 = 3, so remove 3 values from each end
• Remaining values: 4, 5, 6, 7, 7, 8, 9, 10, 10, 12, 12, 15, 15, 16
• Trimmed mean = $\frac{136}{14} \approx 9.71$

Weighted Mean
• Similar to an arithmetic mean (the most common type of average), except that instead of each data point contributing equally to the final average, some data points contribute more than others

Weighted Mean Example
                # of students   Class average
  1st period         20              75
  2nd period         35              79
• Weighted average = $\frac{20(75) + 35(79)}{55} = \frac{4265}{55} \approx 77.55$

Summarization
1. If Q1 = 20 and Q3 = 30, which of the following must be true?
   I. The median is 25
   II. The mean is between 20 and 30
   III. The standard deviation is at most 10
   a) I only  b) II only  c) III only  d) All are true  e) None are true
2. Suppose the average score on a national test is 500 with a standard deviation of 100. If each score is increased by 25%, what are the new mean and standard deviation?
   a) 500, 100  b) 500, 125  c) 625, 100  d) 625, 105  e) 625, 125
Answers: E, E

4.1 Homework
• Page 160 to 163: 2, 5, 9, 12, 13, 14, 15, 16

4.2 Describing Variability in a Data Set
• What is data variability and how is it used to determine standard deviation?

Measures of Variability
• Range = high − low
• Deviation from the mean = $x_i - \bar{x}$
  – if positive, then $x_i$ is larger than the mean
  – if negative, then $x_i$ is smaller than the mean
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Why divide by n − 1
• The data are a sample (not all the possible data on a subject), and we know that $\sum (x_i - \bar{x}) = 0$. If all but one of the deviations $x_i - \bar{x}$ are known, the remaining one can be found, so only n − 1 of them are free to vary. This is why we divide by n − 1 (it is tied to the degrees of freedom concept, to be discussed later).

• Sample standard deviation: $s = \sqrt{s^2}$
  – The "average distance" the items vary from the mean
  – A small s or $s^2$ indicates low variability
  – A large s or $s^2$ indicates high variability
• Population variance (knowing all the data): $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
• Population standard deviation: $\sigma = \sqrt{\sigma^2}$; compute to the same accuracy as the population

IQR (Interquartile Range)
• IQR = upper quartile (Q3) − lower quartile (Q1)
• Lower quartile (Q1): the median of the lower half
• Upper quartile (Q3): the median of the upper half
• If n is odd, the exact median is excluded from both halves when finding the quartiles
• Used because it is resistant to outliers
• There is no special name for the population IQR

Uses of the IQR
• For roughly normal data, the standard deviation can be approximated by SD ≈ IQR/1.35
• If SD > IQR/1.35, it suggests heavier or longer tails than the normal curve

Easy method of calculating the SD
• $S_{xx}$ is the sum of the squared deviations and can be found, with some manipulation, by the following formula: $S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
• $s^2 = \frac{S_{xx}}{n - 1}$
• To avoid rounding errors, use 4 or 5 decimals past the accuracy of the data

Example
• Data: 20, 15, 12, 18, 17, 15, 17, 16, 18, 25
• $\bar{x} = \frac{173}{10} = 17.3$
• Reordered: 12, 15, 15, 16, 17, 17, 18, 18, 20, 25
• Q1 = 15, Median = 17, Q3 = 18
• Range = 25 − 12 = 13
• IQR = 18 − 15 = 3

Example, continued: find the standard deviation
    i      x_i    x_i − x̄    (x_i − x̄)²    x_i²
    1      12     −5.3       28.09         144
    2      15     −2.3        5.29         225
    3      15     −2.3        5.29         225
    4      16     −1.3        1.69         256
    5      17     −0.3        0.09         289
    6      17     −0.3        0.09         289
    7      18      0.7        0.49         324
    8      18      0.7        0.49         324
    9      20      2.7        7.29         400
   10      25      7.7       59.29         625
  totals  173      0.0      108.10        3101
• By hand: $s^2 = \frac{108.10}{9} \approx 12.0111$, so $s \approx 3.4657$
• By simplified rule: $S_{xx} = 3101 - \frac{173^2}{10} = 108.1$, which gives the same $s^2$
• By IQR: $s \approx \frac{3}{1.35} \approx 2.22$
• By calculator: 12, 15, 15, 16, 17, 17, 18, 18, 20, 25 gives Q1 = 15, Median = 17, Q3 = 18

Minitab
• Given: 154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122
• The Minitab output would be:

  Descriptive Statistics
  Variable    N     Mean     Median   TrMean   StDev   SE Mean
  Motion      13    130.38   133.00   130.27   11.47   3.18

  Variable    Minimum   Maximum   Q1       Q3
  Motion      108.00    154.00    122.00   136.00

Summarization
1. If a set of data has a standard deviation of 0, you can conclude:
   a) that there is no relationship between the observations
   b) that the average value is 0
   c) that all observations are the same value
   d) that a mistake in arithmetic has been made
   e) none of the above
2. During the Great Depression, the weekly average hours worked in manufacturing jobs by eleven people were 45, 43, 31, 39, 39, 35, 37, 40, 39, 36, 37. What is the variance?
   a) 8.1  b) 2.99  c) 0  d) There is none  e) Not enough information is given
Answers: C, A

4.2 Homework
• Page 169-171: 17 (by hand), 20, 22, 23, 24, 26, 27, 28, 31

4.3 Summarizing a Data Set: Boxplots
• How can single variable data be summarized in graphical format?
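A minimal Python sketch of the arithmetic behind a modified boxplot, assuming the median-of-halves quartile rule from 4.2 and cutoffs of 1.5·IQR (outlier) and 3·IQR (extreme outlier) measured from the ends of the box. The function names are illustrative only, and the two data sets are the ones used in this section's examples.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumption: when n is odd, the middle value is left out of both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def classify_outliers(data):
    """Return (outliers, extreme_outliers) based on distance from the box."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    outliers, extremes = [], []
    for x in sorted(data):
        # Distance beyond the nearer end of the box (negative if inside the box).
        dist = max(q1 - x, x - q3)
        if dist > 3 * iqr:
            extremes.append(x)
        elif dist > 1.5 * iqr:
            outliers.append(x)
    return outliers, extremes


hours = [55, 43, 31, 39, 39, 35, 37, 40, 39, 36, 37]
print(five_number_summary(hours))   # (31, 36, 39, 40, 55)
print(classify_outliers(hours))     # ([], [55])

second = [244, 191, 160, 187, 180, 176, 174, 205, 211, 183, 211, 180, 194, 200]
print(five_number_summary(second))
print(classify_outliers(second))
```

In a modified boxplot the whiskers would then stop at the smallest and largest values that do not appear in either returned list.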
Boxplots
• Can be used for many types of summarizations
• Each of the four parts of the plot (the two whiskers and the two halves of the box) holds 25% of the data
• IQR = Q3 − Q1
• Outlier = data more than 1.5·IQR from the ends of the box
• Extreme outlier = data more than 3·IQR from the ends of the box

Modified Boxplots
• Whiskers go to the last piece of data that is not an outlier
• An outlier is drawn as a closed circle; an extreme outlier as an open circle

Example
• During the Great Depression, the weekly average hours worked in manufacturing jobs by eleven people were 55, 43, 31, 39, 39, 35, 37, 40, 39, 36, 37. Create a box and whisker plot.
• Five-number summary: 31, 36, 39, 40, 55
• IQR = 40 − 36 = 4
• Outliers are 4·1.5 = 6 units from the ends of the box: 40 + 6 = 46 and 36 − 6 = 30
• Extreme outliers are 4·3 = 12 units from the ends of the box: 40 + 12 = 52, so 55 is an extreme outlier
[Figure: boxplot drawn on an axis from 30 to 55]

Example
• Given the following data: 244, 191, 160, 187, 180, 176, 174, 205, 211, 183, 211, 180, 194, 200
• Create a modified box and whisker plot.
• Five-number summary: 160, 180, 189, 205, 244
[Figure: modified boxplot drawn on an axis from 160 to 245]

Boxplot Video #3, 18:00 to 20:40

Summarization
1. To which of the previous boxplots is the histogram below likely to belong?
   [Figure: histogram on an axis from 0 to 60]
   a) A  b) B  c) C  d) D  e) E
2. Suppose the average score on a national test is 500 with a standard deviation of 100. If each score is increased by 25%, what are the new mean and standard deviation?
   a) 500, 100  b) 500, 125  c) 625, 100  d) 625, 105  e) 625, 125
Answers: C (remember: 25% in each part, so the whiskers will not be that long and part is in the box), E

4.3 Homework
• Page 176-178: 32, 33, 36, 37

4.4 Interpreting Center and Variability: Chebyshev's Rule, Empirical Rule, z-scores
• What determinations can be made about the center of the data set?
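A minimal Python sketch of the calculations this section covers: the Chebyshev lower bound, z-scores (and converting a z-score back to a data value), and the percentile convention that takes the leftmost position for repeated values. The function names are illustrative assumptions, and the numbers come from this section's examples and summarization questions.

```python
def chebyshev_at_least(k):
    """Minimum percent of ANY data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def from_z(z, mean, sd):
    """Invert the standardization formula to recover a data value."""
    return mean + z * sd


def percentile_leftmost(values, x):
    """Percentile of x using its leftmost 1-based position in the ordered data.

    Assumption: x occurs in values; repeats use the position farthest left.
    """
    xs = sorted(values)
    position = xs.index(x) + 1
    return 100 * position / len(xs)


print(chebyshev_at_least(2))   # 75.0, i.e. at least 75% within 2 SDs

# Dam heights: mean 206 m, SD 35 m; two heights are given only as z-scores.
heights = {
    "Hoover": 221,
    "Grand Coulee": 168,
    "Nurek": from_z(2.69, 206, 35),
    "Charvak": from_z(-1.13, 206, 35),
}
print(sorted(heights, key=heights.get))
# ['Charvak', 'Grand Coulee', 'Hoover', 'Nurek']

scores = [0, 5, 7, 7, 8, 8, 8, 9, 9, 10]
print(percentile_leftmost(scores, 9))   # 80.0
```

The empirical rule is not coded here because its 68%, 95%, and 99.7% figures are fixed approximations for normal data rather than something computed from k.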
Chebyshev's Rule
• One way of determining the percent of data within k standard deviations of the mean (remember that this includes both above and below the mean): at least $100\left(1 - \frac{1}{k^2}\right)\%$
• Use "at least" terminology
• Tends to underestimate the percentage
• Applicable to any data set

Empirical Rule
• Uses "approximately" for its terminology
• Approximately 68% of the data is within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3
• Approximately 13.5% lies between 1 and 2 standard deviations on each side of the mean, and 2.35% between 2 and 3 on each side
• Since the empirical rule refers to normal data sets, the percentages can be divided in half for the parts above or below the mean
[Figure: normal curve marked from −3 to 3 standard deviations with the percentages above]

Z-Scores
• Measures the number of standard deviations a particular piece of data is from the mean
• Often called the standardization formula: $z = \frac{x_i - \bar{x}}{s}$

Compare and Contrast: Percent vs Percentile
• Percent = $\frac{x}{\text{total possible}} \times 100$
• Percentile: the percent of the data that fall at or below the given value, $\frac{\text{position within the ordered data set}}{\text{total pieces of data}} \times 100$
• Use the position of the value farthest to the left for repeats

Example
• Find the percent and percentile for each
  – Sue scored 9 out of 10. There were 10 people in the class. Eight people scored 10, Sue scored 9, and one scored 0.
    • Percent: 9/10 · 100 = 90%
    • Percentile: 2/10 · 100 = 20th percentile
  – If the scores were 0, 5, 7, 7, 8, 8, 8, 9, 9, 10 and Sue scored a 9
    • Percent: 9/10 · 100 = 90%
    • Percentile: 8/10 · 100 = 80th percentile

Example
• Given the following data, estimate the median and the 60th percentile:

  Interval    Freq   Rel freq   Cum rel freq
  5 to 10       8    0.25       0.25
  10 to 15      5    0.15625    0.40625
  15 to 20      9    0.28125    0.6875
  20 to 25     10    0.3125     1.00

Example
• A sample of a certain type of concrete specimens is selected, and the compressive strength of each one is determined. The mean and standard deviation are 3000 and 500, respectively. The sample boxplot appears relatively normal.
• Approximately what percent of the sample falls below 2500?
• Approximately what percent falls between 2500 and 4300?
• Approximately what percent falls above 4500?
• What compressive strength falls at the 85th percentile?
• Chart is on pg 810-811

Summarization
1. The 70 highest dams in the world have an average height of 206 meters with a standard deviation of 35 meters. The Hoover and Grand Coulee dams have heights of 221 and 168 meters, respectively. The Russian dams Nurek and Charvak have heights with z-scores of 2.69 and −1.13, respectively. List the dams in order of ascending size.
   a) Charvak, Grand Coulee, Hoover, Nurek
   b) Charvak, Grand Coulee, Nurek, Hoover
   c) Grand Coulee, Charvak, Hoover, Nurek
   d) Grand Coulee, Charvak, Nurek, Hoover
   e) Grand Coulee, Hoover, Charvak, Nurek
2. The following eleven scores were recorded on a test worth 45 points: 45, 43, 31, 39, 39, 35, 37, 40, 39, 36, 37. What is the difference between the percentage and the percentile for a score of 39?
   a) 86.6  b) 54.5  c) 32.1  d) 6  e) None of the above
Answers: A, C

4.4 Homework
• Page 184-186: 39, 41, 43, 48, 51 (1st interval 0 ≤ x < 2)

4.5 Interpreting the Results of Statistical Analysis
• Read section 4.5, pages 135 to 137

Review
• Pages 190 to 195: 53, 54, 58, 60, 61, 64, 66, 70, 73
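As a review aid, the worked examples from 4.1 and 4.2 can be re-checked with a short Python sketch. The helper names below are illustrative, not from the textbook, and rounding trim % × n to the nearest whole number is an assumption for cases where the product is not an integer.

```python
from statistics import variance


def trimmed_mean(data, percent):
    """Mean after removing percent% of the values from EACH end."""
    xs = sorted(data)
    k = round(len(xs) * percent / 100)  # assumption: round to a whole count
    if 2 * k >= len(xs):
        raise ValueError("trim removes every value")
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


def weighted_mean(values, weights):
    """Average in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def sxx(data):
    """Sum of squared deviations via the shortcut: sum(x^2) - (sum x)^2 / n."""
    return sum(x * x for x in data) - sum(data) ** 2 / len(data)


def sample_variance(data):
    """s^2 = S_xx / (n - 1)."""
    return sxx(data) / (len(data) - 1)


# 4.1 trimmed mean example (15% trim of 20 values removes 3 from each end)
values = [3, 6, 8, 2, 9, 10, 7, 15, 4, 12, 20, 36, 15, 5, 3, 7, 10, 16, 17, 12]
print(round(trimmed_mean(values, 15), 2))           # 9.71

# 4.1 weighted mean example (class averages weighted by class size)
print(round(weighted_mean([75, 79], [20, 35]), 2))  # 77.55

# 4.2 standard deviation example
sample = [20, 15, 12, 18, 17, 15, 17, 16, 18, 25]
s2 = sample_variance(sample)
print(round(s2, 4), round(s2 ** 0.5, 4))
# Cross-check the shortcut against the library's definition-based variance.
print(abs(s2 - variance(sample)) < 1e-9)            # True
```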