Chapter 4
Displaying and Summarizing Quantitative Data
Copyright © 2009 Pearson Education, Inc.

NOTE on slides / What we can and cannot do
The following notice accompanies these slides, which have been downloaded from the publisher’s Web site:
“This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work (including on the World Wide Web) will destroy the integrity of the work and is not permitted. The work and materials from this site should never be made available to students except by instructors using the accompanying text in their classes. All recipients of this work are expected to abide by these restrictions and to honor the intended pedagogical purposes and the needs of other instructors who rely on these materials.”
We can use these slides because we are using the text for this course. Please help us stay legal. Do not distribute these slides any further.
The original slides are done in orange / brown and black. My additions are in red and blue. Topics in green are optional.

Overview – Organization of the chapter
Pictorial display: histogram, stem-and-leaf plot, dotplot
Numerical summary: shape of data, center, spread

Division of Mathematics, HCC
Course Objectives for Chapter 4
After studying this chapter, the student will be able to:
7. Appropriately display quantitative data using a frequency distribution, histogram, relative frequency histogram, stem-and-leaf display, dotplot.
8. Describe the general shape of a distribution in terms of shape, center and spread.
9. Describe any anomalies or extraordinary features revealed by the display of a variable.
10. Compute and apply the concepts of mean and median to a set of data.
11. Compute and apply the concept of the standard deviation and IQR to a set of data.
12.
Select a suitable measure of center/spread for a variable based on information about its distribution.
13. Create a five-number summary of a variable.

Dealing With a Lot of Numbers…
Summarizing the data will help us when we look at large sets of quantitative data.
Without summaries of the data, it’s hard to grasp what the data tell us.
The best thing to do is to make a picture…
We can’t use bar charts or pie charts for quantitative data, since those displays are for categorical variables.

Histograms: Earthquake Magnitudes
The chapter example discusses earthquake magnitudes.
First, slice up the entire span of values covered by the quantitative variable into equal-width piles called bins.
The bins will be the horizontal axis of the plot.
The counts (i.e., the number of data points that go into each bin, or frequency) will be the vertical axis.
The bins and the counts in each bin give the distribution of the quantitative variable.

Histograms: Earthquake Magnitudes (cont.)
A histogram plots the bin counts as the heights of bars (like a bar chart).
This concept was also invented by William Playfair.
Here is a histogram of earthquake magnitudes:

Histograms: Earthquake Magnitudes (cont.)
A relative frequency histogram displays the percentage of cases in each bin instead of the count.
In this way, relative frequency histograms are faithful to the area principle.
Here is a relative frequency histogram of earthquake magnitudes:

Histograms
Both histograms “look” the same. The only difference is the vertical axis.
Did we see this in Chapter 3?
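The binning step described above can be sketched in a few lines of Python. This is only an illustration, not the textbook's or any calculator's procedure: the function name `bin_counts` is made up for this example, the nine data values are the ones this deck uses for its TI example, and the choice to treat each bin as closed on the left and open on the right is an assumption.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each equal-width bin [left, left + width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)  # index of the bin that contains v
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

# Nine values from the deck's TI example; bins start at 60 with width 3.
data = [62, 63, 65, 66, 68, 70, 71, 73, 75]
counts = bin_counts(data, start=60, width=3, n_bins=6)

# Relative frequency: same bins, but each count is divided by the number of cases.
rel_freq = [c / len(data) for c in counts]

for i, (c, r) in enumerate(zip(counts, rel_freq)):
    left = 60 + 3 * i
    print(f"[{left}, {left + 3}): count={c}  percent={100 * r:.1f}%  " + "#" * c)
```

The counts and the percentages give bars of the same relative heights, which is why the two histograms look alike apart from the vertical axis.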
Histograms
They can be displayed horizontally as well as vertically.
I rotated this one 90 degrees clockwise.
To publish this, I would put the “% of Earthquakes” axis across the bottom instead of the top. I’d also retype the labels so they can be more easily read!

Histogram with the TI
Example data: 62, 63, 65, 66, 68, 70, 71, 73, 75
Use [STAT][EDIT] to put the dataset in L1.
The first few data points are shown.
NOTE: You will do this a lot in this course!

Histogram with the TI
First, press [Y=] and turn off any functions from Algebra class!
Press [2nd][Y=] and go to one of the three plots. Turn it on.
Select the histogram.
Make sure that L1 (or wherever you put the data) is in Xlist.
Make sure that 1 is in Freq.

Histogram with the TI (default)
You can get a default window by selecting Zoom and then 9.
Below is the window. It shows a bin width of 3.25. It includes all of the values.
Because we have integers, I’d rather have 3 as a bin width.

Histogram with the TI
Choose as window X: [60, 78]; Y: [-1, 3]. You may have to play with this.
For X, I picked a little lower than the min and a little higher than the max.
For Y, I picked a little bigger than the largest bin frequency that I expected.
Xscl is the width of the bin. In this case, choosing 3 makes cut points at 60, 63, 66, 69, 72, 75, and 78.

Usefulness of the Trace function
Use the horizontal arrows to navigate the bins.

Histogram with StatCrunch
Select Graphics.
Select Histogram.
Select the column you want graphed.
Select Next. (Do not select “Create Graph” unless you do not want to have control over the bin size.)
For the same bins as with the TI, “Start Bins” at 60 and set Bin Width equal to 3.
Then select “Create Graph”.

Results
[Figures: histogram with default bin size; histogram with better bin size]

How many bins?
No “hard and fast” rule. There is even some disagreement among professionals.
Recommendations from slides from two Johns Hopkins graduate Biostatistics courses. Both depend on the number (n) of data points.
Biostatistics 612: √n
Biostatistics 651: 2√n
I personally would use √n, but would try different numbers to see what looks best.

Publisher Instructions: Histogram
Histogram: Displays the frequency, relative frequency or density for numerical data combined into classes.
Select the column(s) to be displayed in the plot(s). A separate plot will be generated for each column selected.
Enter an optional Where clause to specify the data rows to be included in the computation.
Select an optional Group by column to construct a histogram for each distinct value of this column.
Click the Next button to select either the Frequency, Relative Frequency or Density histogram. In addition, optional values for the starting point of the bins and the bin width may be specified. These parameters will apply to all of the histograms to be constructed.
Click the Next button again to specify graph layout options.
Click the Create Graph! button to create the plot(s).

Histograms with EXCEL
There is a good YouTube tutorial on this – better than anything I can provide.
See http://www.youtube.com/watch?v=RyxPp22x9PU

Thoughts on Histograms
Histograms are useful and easy to apply to almost all types of quantitative data. This is especially true for larger data sets.
They can use a lot of ink and space!
Color is more useful than black-and-white or grayscale.
It can be difficult to display several related datasets at the same time in order to compare them.
When you get a default, accept it if you can live with it! If not, at least save (or remember) what you did.

Stem-and-Leaf Displays
Stem-and-leaf displays show the distribution of a quantitative variable, like histograms do, while preserving the individual values.
Stem-and-leaf displays contain all the information found in a histogram and, when carefully drawn, satisfy the area principle and show the distribution.

Stem-and-Leaf Displays
They can show a complete dataset in very little space.
It is easy to put them back-to-back to compare groups.
Invented in 1972 by John Tukey (1915 – 2000), Bell Labs, NJ
“Exploratory Data Analysis”, 1977

Stem-and-Leaf Example
Compare the histogram and stem-and-leaf display for the pulse rates of 24 women at a health clinic. Which graphical display do you prefer?

Constructing a Stem-and-Leaf Display
First, draw a vertical line.
Next, cut each data value into leading digits (“stems”), written to the left of the line, and trailing digits (“leaves”), written to the right of the line.
Use the stems to label the bins.
Use only one digit for each leaf: either round or truncate the data values to one decimal place after the stem.

Stem-and-Leaf Displays
Stem-and-leaf plots give all of the data in pictorial form.
Stem-and-leaf plots are useful for smaller datasets.
It is not possible to do a stem-and-leaf plot with the TI, or with EXCEL either.
But if the data set is ordered, they are easy to do by hand.

Stem and Leaf with StatCrunch
Select Graphics.
Select “Stem and Leaf”.
Select the variable you want graphed.
All you can do is “Create Graph.” You are not free to select bin sizes.
The TI does not do stem and leaf plots. Nor does EXCEL.

Variable: var1
6 : 23568
7 : 0145

Publisher’s Instructions
Stem and Leaf: Displays a character-based plot of a column that is similar to a histogram turned on its side. The actual (or approximate) data values are represented in the plot.
Select the column(s) to be displayed in the plot(s). A separate plot will be generated for each column selected.
Enter an optional Where clause to specify the data rows to be included in the computation.
Select an optional Group by column to construct a separate stem and leaf plot for each distinct value of this column.
Click the Create Graph! button to create the plot(s).

Dotplots
A dotplot is a simple display. It just places a dot along an axis for each case in the data.
The dotplot to the right shows Kentucky Derby winning times, plotting each race as its own dot.
You might see a dotplot displayed horizontally (such as this one) or vertically.
It looks “sorta” like a histogram.

Dotplots with StatCrunch
Can’t do with the TI or EXCEL.
With StatCrunch, again select “Graphics”, then “DotPlot” (as with the Histogram and the Stem and Leaf).
In the next panel, you can input axis labels and draw grid lines if you wish.
In the following one, you can pick a color scheme.
But you have no control over the bin size (see next slide for an example of a dotplot that is not very useful).
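Since the slides note that neither the TI nor EXCEL draws stem-and-leaf displays or dotplots, here is a rough text-only sketch of both in Python. It is an illustration under stated assumptions: the helper names `stem_and_leaf` and `text_dotplot` are invented for this example, the stems are taken to be tens digits of non-negative whole numbers, and the short list used for the dotplot is made-up data chosen only so that some dots stack.

```python
from collections import Counter, defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to a string of its leaves (ones digits), in order.

    Assumes non-negative whole numbers, which keeps the stem/leaf split simple.
    """
    rows = defaultdict(str)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)
        rows[stem] += str(leaf)
    return dict(rows)

def text_dotplot(values):
    """Return one text line per distinct value, with one dot per case."""
    counts = Counter(values)
    return [f"{v:>4} | " + "." * counts[v] for v in sorted(counts)]

# The nine values used in this deck's TI histogram example.
data = [62, 63, 65, 66, 68, 70, 71, 73, 75]
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} : {leaves}")

# Made-up values, used only to show how repeated cases stack in a dotplot.
for line in text_dotplot([2, 3, 3, 5, 5, 5]):
    print(line)
```

Because every leaf is a single character, the length of each row plays the role of a histogram bar while the individual values stay readable.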
Publisher’s Instructions
Dotplot: Displays a graphical representation of numerical values as points on a number line. Points with the same pixel representation are stacked on top of each other. If the number of points in a stack exceeds the height of the graphic, each point on the plot may represent more than one observation. If this occurs, the number of observations per point will be shown in the title of the graphic.
Select the column(s) to be displayed in the plot(s). If multiple columns are selected, the plots will be stacked in the reverse order of selection in the same graphic.
Enter an optional Where clause to specify the data rows to be included in the computation.
Select an optional Group by column to construct dotplots for each distinct value of this column. If a Group by column is specified, select either to stack the plots of each group for each column or to stack plots of each column for each group.
Click the Next button to specify graph layout options.
Click the Create Graph! button to create the plot(s).

Think Before You Draw, Again
Remember the “Make a picture” rule?
Now that we have options for data displays, you need to Think carefully about which type of display to make.
Before making a stem-and-leaf display, a histogram, or a dotplot, check the Quantitative Data Condition: The data are values of a quantitative variable whose units are known.

Constructing Effective Graphs
Source: Agresti & Franklin
Label both axes and provide proper headings.
To better compare relative size, the vertical axis should start at 0.
Be cautious in using anything other than bars, lines, or points. Don’t use birds, dollar signs, ships, etc!
It can be difficult to portray more than one group on a single graph when the variable values differ greatly.

Now on over to . . .
Shape, Center, and Spread
When describing a distribution, make sure to always tell about three things: shape, center, and spread…

What is the Shape of the Distribution?
1. Does the histogram have a single, central hump or several separated humps?
2. Is the histogram symmetric?
3. Do any unusual features stick out?

Humps
1. Does the histogram have a single, central hump or several separated bumps?
Humps in a histogram are called modes.
A histogram with one main peak is dubbed unimodal; histograms with two peaks are bimodal; histograms with three or more peaks are called multimodal.

Humps (cont.)
A bimodal histogram has two apparent peaks:
[Figure: Diastolic Blood Pressure]

Humps (cont.)
A histogram that doesn’t appear to have any mode and in which all the bars are approximately the same height is called uniform:
[Figure: Proportion of Wins]

Symmetry
2. Is the histogram symmetric?
If you can fold the histogram along a vertical line through the middle and have the edges match pretty closely, the histogram is symmetric.

Symmetry (cont.)
The (usually) thinner ends of a distribution are called the tails. If one tail stretches out farther than the other, the histogram is said to be skewed to the side of the longer tail.
In the figure below, the histogram on the left is said to be skewed left, while the histogram on the right is said to be skewed right.

Symmetry (cont.)
The skewness is in the direction of the tail, not the hump!
Think of a playground “sliding board” – when you go down the slide, in which direction are you going? That’s the direction of the skewness.
There is a numerical measure of skewness that I will show you later.

Examples of Skewness
Source: Agresti & Franklin, “Statistics: The Art and Science of Learning from Data”; Pearson, 2007

Anything Unusual?
3. Do any unusual features stick out?
Sometimes it’s the unusual features that tell us something interesting or exciting about the data.
You should always mention any stragglers, or suspected outliers, that stand off away from the body of the distribution.
Are there any gaps in the distribution? If so, we might have data from more than one group.

Anything Unusual? (cont.)
The following histogram has suspected outliers: there are three cities in the leftmost bar.

Center of a Distribution – Median
The median is the value with exactly half the data values below it and half above it.
It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
It has the same units as the data.

Finding the median
First, make sure that the data are arranged smallest to largest (or largest to smallest).
Count the number, say N, of data points.
If N is odd, take the middle one. For example, if N = 21, the 11th point is the median.
If N is even, there is no middle “one”! So we average the middle two!
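The steps above can be written out as a small Python function. This is a minimal sketch of the by-hand rule only; the function name `median` is chosen for this example, and the two datasets are the ones the deck works through by hand.

```python
def median(values):
    """Middle value of the ordered data; the average of the middle two if N is even."""
    ordered = sorted(values)          # step 1: arrange smallest to largest
    n = len(ordered)                  # step 2: count the data points
    mid = n // 2
    if n % 2 == 1:                    # N odd: take the middle one
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # N even: average the middle two

odd_example = [62, 63, 65, 66, 68, 70, 71, 73, 75]        # N = 9
even_example = [197, 195, 193, 192, 187, 185, 182, 179]   # N = 8

print(median(odd_example))    # the 5th ordered value
print(median(even_example))   # average of 187 and 192
```

Sorting inside the function guards against the common slip of taking the middle of an unsorted list.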
Examples
Data: 62, 63, 65, 66, 68, 70, 71, 73, 75
N = 9; it’s odd.
The 5th point is the middle one.
68 is the median.
Data: 197, 195, 193, 192, 187, 185, 182, 179
N = 8; it’s even.
Average the two middle points, 192 and 187.
The median is 189.5.

Notice that …
If N is even, the median does not have to be one of the data points.
The median can be affected by outliers (but maybe not that much).
Example: 77, 72, 70, 69, 68, 67. Let’s for now label 77 as an outlier.
With the outlier (77), the median is 69.5.
Without the 77, we have 72, 70, 69, 68, 67 and the median is 69.

Comparing the Mean and Median (Source: Agresti & Franklin)
In a skewed distribution, the mean is farther out in the long tail than is the median.
For skewed distributions the median is preferred because it better represents a typical observation.

Spread: Home on the Range
Always report a measure of spread along with a measure of center when describing a distribution numerically.
The range of the data is the difference between the maximum and minimum values: Range = max – min
A disadvantage of the range is that a single extreme value can make it very large and, thus, not representative of the data overall.
Example: 77, 72, 70, 69, 68, 67. The range is 10. But if we take out the “outlier”, the range drops to 5.
CAUTION: In the above example (with the 77), the range is not “67 to 77”; it is 10!

Spread: The Interquartile Range
The interquartile range (IQR) lets us ignore extreme data values and concentrate on the middle of the data.
To find the IQR, we first need to know what quartiles are…

Spread: The Interquartile Range (cont.)
Quartiles divide the data into four equal sections.
One quarter of the data lies below the lower quartile, Q1.
One quarter of the data lies above the upper quartile, Q3.
The difference between the quartiles is the interquartile range (IQR), so IQR = upper quartile – lower quartile

Spread: The Interquartile Range (cont.)
The lower and upper quartiles are the 25th and 75th percentiles of the data, so…
The IQR contains the middle 50% of the values of the distribution, as shown in the figure:

5-Number Summary
The 5-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum).
The 5-number summary for the recent tsunami earthquake magnitudes looks like this:

A little clarification!
What do we mean by “half of the data below the median” and “half of the data above the median”?
Data: 197, 195, 193, 192, 187, 185, 182, 179
N = 8; it’s even – the median was 189.5.
For the first quartile, we take the median of the last four numbers, i.e. 187, 185, 182, 179. This is 183.5.
Similarly, the third quartile is 194, the median of 197, 195, 193, 192.

Example – Text data, page 58
The ordered values from the first batch: -17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8
Let’s verify the text results with our technology.
Odd number of points.
Min = -17.5, Max = 45.8, Med = 13.9

Example – Text data, page 58
How about Q1 and Q3?
Book’s method:
For Q1, take the median of the first four points (i.e. including the median). That is, take the median of -17.5, 2.8, 3.2, 13.9, which is 3.0.
For Q3, take the median of the last four points (i.e. including the median). That is, take the median of 13.9, 14.1, 25.3, 45.8, which is 19.7.
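The split-the-list approach to quartiles can be sketched in Python as follows. This is an illustrative sketch, not the code behind any particular calculator or package: the function `quartiles` and its `include_median` flag are names made up here, and the flag simply switches between counting the median in both halves (the book's method above) and leaving it out of both halves when N is odd.

```python
from statistics import median

def quartiles(values, include_median):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the data.

    When N is odd, include_median decides whether the overall median is
    counted in both halves or in neither. When N is even the halves are
    the same either way.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(upper)

batch = [-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]
for flag in (True, False):
    q1, q3 = quartiles(batch, include_median=flag)
    print(f"include_median={flag}: Q1={q1:.1f}  Q3={q3:.1f}  IQR={q3 - q1:.1f}")
```

With only seven data points the two conventions give visibly different quartiles, and therefore different IQRs, which is worth keeping in mind when comparing answers from different tools.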
5-number summary – TI (except newer 84’s)
Select [STAT].
Select [CALC].
Select #1, “1-Var Stats”.

5-number summary – TI (newer 84’s)

Hmmmmm!
For Q1, the text got 3.0 and the TI got 2.8.
For Q3, the text got 19.7 and the TI got 25.3.
Difference in methodology. The text included the median in each half of the dataset; the TI did not.
Let’s go on to StatCrunch.

5-Number Summary - StatCrunch
Select Stat, then Summary Statistics, then Columns.
Then select the column you want summarized.
You will see a list of summary statistics. Deselect all except those you want; i.e. Max, Min, Q1, Q3 and Median.

The Result with StatCrunch
Summary statistics:
Column   Median   Min     Max    Q1    Q3
var1     13.9     -17.5   45.8   2.8   25.3

Publisher Instructions for Summary Statistics
Columns: Provides the following descriptive statistics in tabular format for the column(s) selected: sample size (n), mean, variance, standard deviation (Std. Dev.), Standard Error (Std. Err.), median, range, minimum, maximum, first quartile (Q1) and third quartile (Q3).
Select the columns for which summary statistics will be computed.
Enter an optional Where clause to specify the data rows to be included in the computation.
Select an optional Group By column to group results. If a Group By column is selected, choose whether to display the output in separate tables for each column selected or in separate tables for each group.
Click the Next button to select the summary statistics (by default, all are selected) to be computed.
The statistics will be displayed in the order in which they are selected (from right to left). Additional percentiles may also be entered as a space or comma delimited list.
Check the Store output in data table option if the output is to be placed in the data table.
Click the Calculate button to view the results.

Other technologies
SAS, StatDisk and MINITAB all agree with the TI and StatCrunch.
EXCEL: PERCENTILE(Array,.25)=3, PERCENTILE(Array,.75)=19.7!
Data Desk, an add-on to EXCEL, gives Q1 = 2.9 and Q3 = 22.5!
There are different ways of computing Q1 (same for Q3):
Split list into two halves; include median in each (text).
Split list into two halves; don’t include median (TI, SC).
I think that Data Desk used cut points of 0, (1/6), (2/6), (3/6), (4/6), (5/6) and 1, and interpolated.

Boxes in blue – pp. 59 and 68 of text
There are several ways to compute a quartile (we’ve seen 3; the authors have seen 9).
For large datasets, it makes very little difference.
For smaller datasets (where it might make a difference), you do as well to just give the whole dataset rather than the summary statistics!
You will be using the TI on the assessments.
Even StatCrunch and the TI do not agree on some datasets! Therefore, on homework, say which technology you used.
The IQR can also be different!

Summarizing Symmetric Distributions – The Mean
When we have symmetric data, there is an alternative to the median.
If we want to calculate a number, we can average the data.
We use the Greek letter sigma to mean “sum” and write:
$\bar{y} = \frac{Total}{n} = \frac{\sum y}{n}$
The formula says that to find the mean, we add up the numbers and divide by n.
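As a quick check of the formula, here is the same add-up-and-divide calculation in Python. It is a small sketch, with the function name `mean` chosen for this example; the nine values are the ones used in this deck's TI examples.

```python
def mean(values):
    """Add up the values and divide by how many there are."""
    return sum(values) / len(values)

data = [62, 63, 65, 66, 68, 70, 71, 73, 75]
total = sum(data)   # the "Total" (sum of y) in the formula
n = len(data)
print(f"Total = {total}, n = {n}, mean = {mean(data):.4f}")
```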
Summarizing Symmetric Distributions – The Mean (cont.)
The mean feels like the center because it is the point where the histogram balances:

Summarizing Symmetric Distributions – The Mean (cont.)
Because the median considers only the order of values, it is resistant to values that are extraordinarily large or small; it simply notes that they are one of the “big ones” or “small ones” and ignores their distance from center.
To choose between the mean and median, start by looking at the data. If the histogram is symmetric and there are no outliers, use the mean.
However, if the histogram is skewed or has outliers, you are better off with the median.

Source: http://www.causeweb.org/resources/fun/

What’s wrong with these quotes?
“We look forward to the day when everyone will receive more than the average wage.” Australian Minister of Labor, 1973
“Lake Wobegon, Minnesota: Where all the women are strong, all the men are good-looking, and all the children are above average.” Garrison Keillor (made in jest on the show “A Prairie Home Companion”)

*Weighted Arithmetic Mean
The weighted arithmetic mean is computed by using the following formula:
$\bar{x}_w = \frac{\sum w x}{\sum w}$
Where:
$\bar{x}_w$ stands for the weighted arithmetic mean,
x stands for the values of the items, and
w stands for the weight of each item.
Source: http://www.emathzone.com/tutorials/basicstatistics/weighted-arithmetic-mean.html

*Example: Weighted Mean - GPA
A freshman receives the following grades. Assume 4 points for an A, 3 for a B. What is his grade point average?
Course                  Credits
Intro to Literature     3
Russian I               3
Physics I               4
Calculus I              4
Chemistry I             4
Physical Education I    1
Grade (same course order): B, A, A, A, B, A
Points (same course order): 3, 4, 4, 4, 3, 4

*Example: Weighted Mean - GPA
Use ∑(Credits × Points) / ∑Credits
∑(Credits × Points) = 69
∑Credits = 19
69/19 = 3.63
Credits   Grade   Points   Credits × Points
3         B       3        9
3         A       4        12
4         A       4        16
4         A       4        16
4         B       3        12
1         A       4        4

*Example: Weighted Mean – Customer Ratings
Amazon.com is reviewing the ratings on a line of products.
Customers rate 1 to 5, 1 = Worst, 5 = Best.
Ratings (and number giving each rating) are on the right.
What is the average rating for the product?
Rating   Number of customers
5        57
4        73
3        36
2        7
1        10

*Example: Weighted Mean – Customer Ratings
Use ∑(Ratings × Customers) / ∑Customers
∑(Ratings × Customers) = 709
∑Customers = 183
709/183 = 3.874
This is what you will use in Project 1.
Rating   Customers   Total
5        57          285
4        73          292
3        36          108
2        7           14
1        10          10

With the TI
Put the ratings (5 to 1) in L1 and the number in L2.
Do 1-varStats L1,L2 (that is, L1 comma L2).
But there is a setting that you need if you have the new operating system.

With the TI (adjust the TI with the new operating system)
Make sure that StatWizard is off.
To do this, press [MODE]; StatWizard is just above the clock.
If it is on, then 1-varStats L1,L2 will not work.

What About Spread? The Standard Deviation
A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean.
A deviation is the distance that a data value is from the mean.
Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the deviations.

Standard Deviation by hand (Don’t do this yourself!)
A student goes shopping for an external hard drive for her computer.
She finds the same hard drive in seven places.
The prices are $98, $96, $92, $87, $85, $83, $82.
The mean is easy to compute: (∑x/n) = 623/7 = $89.
Let’s do the standard deviation.

Attempt at a measure of spread (not a very good one!)
X     (X – Xbar)
98     9
96     7
92     3
87    -2
85    -4
83    -6
82    -7
• However, all of these deviations add to zero.
• This is not a very good measure of spread!

What About Spread? The Standard Deviation (cont.)
The variance, notated by $s^2$, is found by summing the squared deviations and (almost) averaging them:
$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in squared units!

Variance by hand (Technology is easier!)
X     (X – Xbar)   (X – Xbar)²
98     9           81
96     7           49
92     3           9
87    -2           4
85    -4           16
83    -6           36
82    -7           49
• By squaring, we get rid of the negatives.
• The deviations squared (last col.) add to 244.
• Then 244/6 = 40.67 square dollars.
• This is the variance. (What’s a square dollar?)

What About Spread? The Standard Deviation (cont.)
The standard deviation, s, is just the square root of the variance and is measured in the same units as the original data.
$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$

Standard Deviation by hand (Technology is easier!)
Using the same table of X, (X – Xbar), and (X – Xbar)² as in the variance calculation:
• 244 / 6 = 40.66667.
• The square root of 40.66667 is $6.38.
• We have obtained the standard deviation.
• Units are the same as in the original data (dollars).

Questions about Variance
$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
Why n – 1 instead of n?
It has to do with a concept called degrees of freedom. We will see this in later chapters (Chapter 23).
Essentially, it is the number of entities that can be freely changed if the sum (or the mean) remains constant.

Source: http://www.causeweb.org/resources/fun/

Mean and Standard Deviation - TI
For the numbers 62, 63, 65, 66, 68, 70, 71, 73, 75:
Press [STAT], [CALC], 1-Var Stats.
The mean is 68.1111.
The st. dev. is 4.4845.
Use the sx instead of the σx (will explain later in the course).

Mean and Standard Deviation - StatCrunch
Numbers in Var1.
Select Stat, then Summary Stats, then Column as before.
Give Var1 as your input column.
Under Statistics, make sure that Mean and Standard Deviation are checked. (You can check others.)
Click Create.
Summary statistics:
Column   n   Mean        Std. Dev.
var1     9   68.111115   4.4845414

**EXCEL summary statistics
Data in cells A1:A7: 2.8, 3.2, 13.9, 14.1, 25.3, 45.8, -17.5
Summary Statistic     EXCEL function        Answer
Mean                  =average(a1:a7)       12.514
Standard Deviation    =stdev(a1:a7)         19.824
Median                =median(a1:a7)        13.9
1st quartile          =quartile(A1:A7,1)    3
3rd quartile          =quartile(A1:A7,3)    19.7
Minimum               =min(a1:a7)           -17.5
Maximum               =max(a1:a7)           45.8
Skewness              =skew(a1:a7)          0.3059
Kurtosis              =KURT(A1:A7)          0.8924

*Other summary measures: Skewness
For data points Y1, Y2, …, YN, the skewness is defined as
[formula image not recovered]
Note that it involves “cubes”, the third power.
The data are positively or negatively skewed depending on whether this quantity is greater than or less than 0.
The magnitude of this quantity is a measure of how skewed the data are.
Source: Wikipedia

*Other summary measures: Kurtosis
Kurtosis is a measure of how peaked or flat your data are.
Mathematically, kurtosis is defined as:
$\frac{\frac{1}{n}\sum (x - \bar{x})^4}{\left(\frac{1}{n}\sum (x - \bar{x})^2\right)^2} - 3$
Note that this involves the fourth power.
A value of 0 indicates a perfect bell shape.
Greater than 0: More peaked
Less than 0: Flatter

*Other summary measures: Coefficient of Variation

You may see this in upper level textbooks. The “coefficient of variation” is the standard deviation divided by the mean. For the most recent example (the nine numbers with mean 68.1111 and standard deviation 4.4845), CV = 4.4845/68.1111 = 0.06584. This is normally expressed as a percent, i.e. CV = 6.584%. Notice that the CV is “unitless”. This is an advantage since it allows us to compare different populations. We will see this a lot in the course.

Thinking About Variation

Since Statistics is about variation, spread is an important fundamental concept of Statistics. Measures of spread help us talk about what we don’t know. When the data values are tightly clustered around the center of the distribution, the IQR and standard deviation will be small. When the data values are scattered far from the center, the IQR and standard deviation will be large.

Tell - Draw a Picture

When telling about quantitative variables, start by making a histogram or stem-and-leaf display and discuss the shape of the distribution.

Tell - Shape, Center, and Spread

Next, always report the shape of its distribution, along with a center and a spread.
• If the shape is skewed, report the median and IQR. Note: Skewed to the right: Mean is larger than median. Skewed to the left: Median is larger.
• If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.

Tell - What About Unusual Features?

If there are multiple modes, try to understand why. If you identify a reason for the separate modes, it may be good to split the data into two groups.
If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.

What Can Go Wrong?

• Don’t make a histogram of a categorical variable; bar charts or pie charts should be used for categorical data.
• Don’t look for shape, center, and spread of a bar chart.

What Can Go Wrong? (cont.)

• Don’t use bars in every display; save them for histograms and bar charts.
• Below is a badly drawn plot and the proper histogram for the number of juvenile bald eagles sighted in a collection of weeks: [figures not reproduced]

What Can Go Wrong? (cont.)

• Choose a bin width appropriate to the data.
• Changing the bin width changes the appearance of the histogram: [figures not reproduced]

What Can Go Wrong? (cont.)

• Don’t forget to do a reality check; don’t let the calculator do the thinking for you.
• Don’t forget to sort the values before finding the median or percentiles.
• Don’t worry about small differences when using different methods.
• Don’t compute numerical summaries of a categorical variable.

Source: http://www.causeweb.org/resources/fun/

What Can Go Wrong? (cont.)

• Don’t report too many decimal places.
• Don’t round in the middle of a calculation.
• Watch out for multiple modes.
• Beware of outliers.
• Make a picture!!!
• Check for typos. (An example follows.)

An example that we’ll use in Chapter 6

Here are the prices per gallon for regular gas as reported by 12 gas stations in and around HCC’s Zip code 21044 on the morning of August 4, 2012.
Source: http://www.marylandgasprices.com
3.459  3.539  3.539  3.539  3.559  3.629  3.649  3.699  3.699  3.699  3.699  3.699

Summary statistics and histogram

Mean: $3.6173 / gallon
Standard Deviation: $0.086, or about 8.6 cents a gallon

Let’s pretend we’re seeing this for the first time

Here are the prices per gallon for regular-grade gasoline as reported by twelve Columbia area gas stations on the morning of August 4, 2012.
Source: http://www.marylandgasprices.com

3459  3.539  3.539  3.539  3.559  3.629  3.649  3.699  3.699  3.699  3.699  3.699

OOPS!

Mean: $291.58 / gallon!
Standard Deviation: $997.48 / gallon

Results of our mistake

• Mean and standard deviation – HUGE effect!
• Histogram – totally obscured the real data; not just the typo but everything else!
• Five-number summary – moderate effect on everything but the maximum (huge effect there).
• The effect on the five-number summary depends on where you made your mistake.

Two courses of action
• Best action: Correct your mistake if you know the correct entry.
• If you don’t know what the entry should be, remove the 3459 as an outlier and document what happened.

What have we learned?

• We’ve learned how to make a picture for quantitative data to help us see the story the data have to Tell.
• We can display the distribution of quantitative data with a histogram, stem-and-leaf display, or dotplot.
• We’ve learned how to summarize distributions of quantitative variables numerically.
• Measures of center for a distribution include the median and mean.
• Measures of spread include the range, IQR, and standard deviation.
• Use the median and IQR when the distribution is skewed. Use the mean and standard deviation if the distribution is symmetric.

What have we learned? (cont.)
We’ve learned to Think about the type of variable we are summarizing.
• All methods of this chapter assume the data are quantitative.
• The Quantitative Data Condition serves as a check that the data are, in fact, quantitative.

Overview – Organization of the chapter

Pictorial Display
   Histogram
   Stem-and-Leaf Plot
   Dotplot
Numerical summary
   Shape of data
   Center
   Spread
First set of measures: Median, Range, Quartiles, IQR
Second set: Mean, Variance, Standard Deviation

Division of Mathematics, HCC
Course Objectives for Chapter 4

After studying this chapter, the student will be able to:
7. Appropriately display quantitative data using a frequency distribution, histogram, relative frequency histogram, stem-and-leaf display, dotplot.
8. Describe the general shape of a distribution in terms of shape, center and spread.
9. Describe any anomalies or extraordinary features revealed by the display of a variable.
10. Compute and apply the concepts of mean and median to a set of data.
11. Compute and apply the concept of the standard deviation and IQR to a set of data.
12. Select a suitable measure of center/spread for a variable based on information about its distribution.
13. Create a five-number summary of a variable.
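As a recap tied to objectives 10, 11, and 13, the gas-price typo example from earlier in the chapter can be reproduced with Python's standard `statistics` module. This is a sketch, not part of the original slides: the helper name `summarize` is our own, and the quartiles use the module's "inclusive" method, which is only one of several conventions, so a calculator or StatCrunch may give slightly different quartiles.

```python
import statistics

# Twelve gas prices (dollars per gallon) from the slides
correct = [3.459, 3.539, 3.539, 3.539, 3.559, 3.629,
           3.649, 3.699, 3.699, 3.699, 3.699, 3.699]

# Same data with the decimal point dropped from the first entry
typo = [3459.0 if x == 3.459 else x for x in correct]

def summarize(data):
    """Return mean, sample standard deviation, and a five-number summary."""
    # Quartile conventions differ between tools; "inclusive" is one choice.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),   # sample version, divides by n - 1
        "five_number": (min(data), q1, median, q3, max(data)),
    }

for label, data in (("correct entry", correct), ("with typo", typo)):
    s = summarize(data)
    print(label)
    print("  mean =", round(s["mean"], 4), " sd =", round(s["sd"], 4))
    print("  five-number summary =",
          tuple(round(v, 3) for v in s["five_number"]))
```

Comparing the two printouts shows the pattern described in "Results of our mistake": the mean and standard deviation change enormously, while the median and quartiles move only a little and the maximum absorbs the typo.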