
Excel® Technology Manual to Accompany Mind on Statistics

© Cengage Learning. All rights reserved. No distribution allowed without express authorization.

FIFTH EDITION

Jessica M. Utts
University of California, Irvine
Irvine, CA

Robert F. Heckard
Pennsylvania State University
State College, PA

Prepared by
Melissa M. Sovak
California University of Pennsylvania, California, PA

Australia • Brazil • Mexico • Singapore • United Kingdom • United States

ISBN-13: 978-1-285-83862-5
ISBN-10: 1-285-83862-9

© 2015 Cengage Learning

ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored, or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the publisher except as may be permitted by the license terms below.

Cengage Learning
200 First Stamford Place, 4th Floor
Stamford, CT 06902
USA

Cengage Learning is a leading provider of customized learning solutions with office locations around the globe, including Singapore, the United Kingdom, Australia, Mexico, Brazil, and Japan. Locate your local office at: www.cengage.com/global.

Cengage Learning products are represented in Canada by Nelson Education, Ltd.

For product information and technology assistance, contact us at Cengage Learning Customer & Sales Support, 1-800-354-9706.

To learn more about Cengage Learning Solutions, visit www.cengage.com.

For permission to use material from this text or product, submit all requests online at www.cengage.com/permissions. Further permissions questions can be emailed to [email protected].

Purchase any of our products at your local college store or at our preferred online store www.cengagebrain.com.
NOTE: UNDER NO CIRCUMSTANCES MAY THIS MATERIAL OR ANY PORTION THEREOF BE SOLD, LICENSED, AUCTIONED, OR OTHERWISE REDISTRIBUTED EXCEPT AS MAY BE PERMITTED BY THE LICENSE TERMS HEREIN.

READ IMPORTANT LICENSE INFORMATION

Dear Professor or Other Supplement Recipient:

Cengage Learning has provided you with this product (the “Supplement”) for your review and, to the extent that you adopt the associated textbook for use in connection with your course (the “Course”), you and your students who purchase the textbook may use the Supplement as described below. Cengage Learning has established these use limitations in response to concerns raised by authors, professors, and other users regarding the pedagogical problems stemming from unlimited distribution of Supplements.

Cengage Learning hereby grants you a nontransferable license to use the Supplement in connection with the Course, subject to the following conditions. The Supplement is for your personal, noncommercial use only and may not be reproduced, or distributed, except that portions of the Supplement may be provided to your students in connection with your instruction of the Course, so long as such students are advised that they may not copy or distribute any portion of the Supplement to any third party. Test banks, and other testing materials may be made available in the classroom and collected at the end of each class session, or posted electronically as described herein. Any material posted electronically must be through a password-protected site, with all copy and download functionality disabled, and accessible solely by your students who have purchased the associated textbook for the Course. You may not sell, license, auction, or otherwise redistribute the Supplement in any form. We ask that you take reasonable steps to protect the Supplement from unauthorized use, reproduction, or distribution. Your use of the Supplement indicates your acceptance of the conditions set forth in this Agreement.
If you do not accept these conditions, you must return the Supplement unused within 30 days of receipt. All rights (including without limitation, copyrights, patents, and trade secrets) in the Supplement are and will remain the sole and exclusive property of Cengage Learning and/or its licensors. The Supplement is furnished by Cengage Learning on an “as is” basis without any warranties, express or implied. This Agreement will be governed by and construed pursuant to the laws of the State of New York, without regard to such State’s conflict of law rules.

Thank you for your assistance in helping to safeguard the integrity of the content contained in this Supplement. We trust you find the Supplement a useful teaching tool.

Excel® is a trademark of the Microsoft group of companies. Excel Technology Manual for Mind on Statistics 5e is an independent publication and is not affiliated with, nor has it been authorized, sponsored, or otherwise approved by Microsoft Corporation.

Printed in the United States of America
1 2 3 4 5 6 7 17 16 15 14 13

Contents

Chapter 1: Introduction
Chapter 2: Turning Data into Information
Chapter 3: Relationships between Quantitative Variables
Chapter 4: Relationships between Categorical Variables
Chapter 5: Sampling: Surveys and How to Ask Questions
Chapter 6: Gathering Useful Data for Examining Relationships
Chapter 7: Probability
Chapter 8: Random Variables
Chapter 9: Understanding Sampling Distributions: Statistics as Random Variables
Chapter 10: Estimating Proportions with Confidence
Chapter 11: Estimating Means with Confidence
Chapter 12: Testing Hypotheses about Proportions
Chapter 13: Testing Hypotheses about Means
Chapter 14: More about Regression
Chapter 15: More about Categorical Variables
Chapter 16: Analysis of Variance

Chapter 1: Introduction

Organization of this manual

This manual's goal is to help you learn to perform the computational parts of statistical analysis using Microsoft Excel. Each chapter is a companion to the corresponding chapter in your Mind on Statistics textbook. I have used the same chapter titles to avoid confusion. The manual uses examples from the text so that, after you have analyzed the data using Excel, you can always check your results against those in the text. When a method discussed in the text is not included in this manual, that method is not a feature of standard Excel and cannot be implemented using formulas presented in the text.

Excel

This manual is not a comprehensive guide to Excel. It focuses specifically on statistical analysis.
Furthermore, it does not explain how to use a personal computer or how to work with the Microsoft Windows operating system, as this manual was written assuming that the user has had experience with this operating system.

The first step, of course, is to make sure that the computer you plan to use has Microsoft Excel installed. Excel is part of the Microsoft Office suite of programs. If you are not sure whether Excel is on your computer, the fastest way to find out is to click the Start button, scroll up to Programs, and look for Microsoft Excel in the list of programs that appear on the screen. If you find it, click on the title, and Excel will open. As it does, you will see a small window appear on the screen temporarily that indicates what version of Excel is on the computer. In this manual I have written all instructions based on Excel 2003. However, if you have an earlier version of Excel, you will find that most of the instructions I give will work for you as they are presented here. For Excel 2010 the first screen should look like this.

Before we explore Excel, I want to call your attention to a convention I just used because I will use it throughout the manual.

1. I will use red type for references to the textbook such as case study 1.1.
2. I will use green type to refer to variables such as HrsSleep and data files such as pennstate1.
3. I will use blue type when an action is called for, such as click or scroll or when I am referring to an Excel menu item such as File or an Excel element such as the function Average.
4. I will use bold type for a reference to a cell such as A3, text you are to type in a cell such as =4*A3, and the contents of a cell after you have carried out an instruction.

If you are not familiar with all of the terms in these four statements, don't worry; we'll take care of that right now. You should now have Excel open on your computer screen.
Across the top of the screen you will see a list of menu names: File, Home, Insert, Page Layout, etc. Most of these you are already used to seeing if you use Microsoft Word. Under these you should see several options associated with the Home menu, such as text options, alignment options, copy and paste options, and more. The rest of the screen is either dark gray or is white with a grid of vertical and horizontal lines as shown above. If the screen is dark gray, use the mouse to move the cursor to the File menu and click on New, and a new workbook opens.

Near the bottom of the screen you should see tabs labeled Sheet1, Sheet2, and Sheet3. When you open a new workbook, it contains three worksheets. You can add or delete worksheets as needed, but more about that later. You should also see that grid of lines I mentioned before. There should be headings across the top of the grid: A, B, C, etc. These are the column headings. Down the left side of the screen there should be numbers: 1, 2, 3, etc. These are the row headings. The place where a column and a row intersect is called a cell, and the cell is referred to by its column and row designators. Thus D3 is the cell reference for the cell in the fourth column and the third row. Notice that cell D3 is outlined with a dark line in the figure below.

The cell is the basic work unit within an Excel worksheet. Let’s have our first look at what you can do with a few cells in a worksheet.

1. In cell A1 type Temperature Conversion. Notice that not all of the text fits in cell A1, but Excel allowed it to spill over into cells B1 and C1.
2. In cells A3 and B3 type Temp F and Temp C, respectively.
3. In cell A4 type 68, and in cell B4 type =5/9*(A4-32). After you press the Enter key you should see the number 20 in cell B4. You have just converted a temperature in Fahrenheit, 68 degrees, into a temperature in Celsius, 20 degrees.
4. Now click on cell B4. Notice the dark box around the cell. This tells you that cell is currently active.
Just below the formatting toolbar, you should see the Formula Bar containing what you typed in cell B4. If you needed to edit what you typed in the cell, you would edit the contents of the Formula Bar.
5. With cell B4 active, move the cursor to the Home toolbar and click on the icon B. The 20 in cell B4 should now be in boldface type. Click on the B again, and the boldface goes back to regular type.
6. To the right of the icons B, I, and U are four icons for aligning text. Click the icon to Center the contents of B4. Also click to Center the contents of cell A4.
7. Finally, change the title in cell A1 to boldface type.

This is what your Excel worksheet should now look like.

Let’s review what you just did. You typed three kinds of contents into worksheet cells: text in cells A1, A3, and B3, a numerical value in cell A4, and a formula in cell B4. I want to show you one more feature of Excel before we leave our temperature example.

8. Type the number 70 in cell A5. Center the number in this cell.
9. Click on cell B4. On the Home menu, click the Copy icon to copy the formula.
10. Click on cell B5 to make it active, and then click on Paste. What do you see in cell B5? It should be the number 21.11111.
11. Let’s round this off to 21.1. Click on cell B5 to make it active. Move the cursor to the Decrease Decimal icon on the formatting menu. The icon looks like this: .00 >.0
12. Click on this icon. The number in cell B5 should now be 21.1111. Click the Decrease Decimal icon three more times, and cell B5 should now contain 21.1. However, it is important to realize that only the display of the number in the cell has been rounded. The number stored in the memory of the computer has not been changed.

Your worksheet should now look like this.

Let’s review. You now know how to:

1. Open a new workbook
2. Make a worksheet cell active
3. Enter content into a cell
4. Change the format of the content, and
5. Copy the content of a cell and paste it into another cell.
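For readers who want to check the temperature example outside Excel, here is a minimal Python sketch of the same arithmetic. It is only an illustration, not part of the Excel procedure: the function name fahrenheit_to_celsius and the list names are made up for this example. It mirrors the worksheet formula =5/9*(A4-32) and shows that formatting a number to one decimal place changes what is displayed, not what is stored, just as the Decrease Decimal icon does.

```python
def fahrenheit_to_celsius(temp_f):
    """Same arithmetic as the worksheet formula =5/9*(A4-32)."""
    return 5 / 9 * (temp_f - 32)


# The two values typed into cells A4 and A5 in the example.
temps_f = [68, 70]
temps_c = [fahrenheit_to_celsius(t) for t in temps_f]

for temp_f, temp_c in zip(temps_f, temps_c):
    # The :.1f format rounds only the displayed text, like Decrease Decimal;
    # the value held in temp_c keeps all of its digits.
    print(f"{temp_f} F -> {temp_c:.1f} C (stored value: {temp_c!r})")
```

Copying the formula from B4 to B5 in Excel corresponds to calling the same function on the next input value: the calculation is unchanged and only the cell it reads from moves.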
In later chapters you will discover how much time that copy/paste feature can save you. These are the basics. In the chapters that follow we will explore many additional capabilities of Excel, but you will use these basic features every time you work with Excel. Now let’s proceed to chapter 2 and do some statistics.

A Note about Add-Ins

The capabilities of basic Excel for statistical analysis are somewhat limited. If you are using Excel for this purpose, you should be aware that there are several statistics add-ins available that extend Excel’s statistical capabilities significantly. Some are sold commercially, and some are bundled with statistics textbooks.

Chapter 2: Turning Data into Information

In chapter 2 of the textbook, you learned a variety of ways, both graphical and numerical, to summarize a set of numbers. In this chapter of the guide we will see how you can use Excel to help you with these tasks.

Types of Data

The first thing we need to discuss is types of data. This is very important since not all tools for summarizing data are appropriate for all types of data. Unfortunately, the computer cannot, in most situations, tell you which tools to use and which not to use. It will obediently use the wrong tool and give you results that are nonsense. So, you need to first identify what kind of data you have. Here is a rule of thumb that can help. If the data are represented by text, such as "left," "male," "yes," or "strongly agree," then they are almost certainly categorical data. If numbers, such as 3 or 6.514, represent the data, then they are likely to be measurement data.

Try this rule on the list of eight questions and the resulting data in Section 2.1 of the textbook. Don't be misled by the text that gives units of measure such as hours, inches, and mph. Those terms are not part of the data. How many of the questions did you find that result in categorical data? If you said questions 1 (sex: m or f) and 3 (letter: S or Q), you are correct.
The other six all result in measurement data.

You do need to watch out for categorical data that have been coded numerically. For example, if you have data representing responses to question 1, the sex of the respondent, in Section 2.1 of the textbook, but the data are all 0 and 1, don't be misled. Instead of "m" and "f" the data could also be "coded" using 0 for male and 1 for female. Sex is still a categorical variable because the responses all fit in one of two categories and do not represent quantitative information. That is, we would not count or measure to determine which category a person belongs to.

With this rule of thumb and caution in mind, let's get busy summarizing. I will point out, as we go along, a few ways that Excel tries to help you match the tool to the data.

Summarizing Categorical Variables with the Pivot Table

The first Excel tool that we will meet for summarizing data is also one of the most useful. It is called the pivot table and is found in Excel's Insert menu. I will introduce you to the pivot table by showing you how to use it to create a summary of a categorical variable similar to Tables 2.1 and 2.2 in the text.

First, you need to open Excel and then open the Excel data file YouthRisk03 (from the companion website) using the procedures described in chapter 1. While you do that, I'm going to get a cup of coffee.

You should now see an Excel worksheet with data. You should see names in the top row: Sex, Grades, etc. There are five columns and 3042 rows of data (not counting the row with the variable names). Your screen should look like the picture below. Note that only the row of variable names and the first 16 rows of data are shown. You need to scroll down to see the rest of the data.
Using the terms from Section 2.1 of the textbook, observe that there are five variables, one in each of the first five columns of the worksheet; 3042 observational units, in rows 2 through 3043; and that one observation, seatbelt use for observational unit 11, is the active cell, E12. The complete data set resides in the block of cells A2 to E3043. It is important to remember that, since we always use row 1 for the variable names, the row number of the last row of data will be the number of observational units plus one. For this data set that is 3042 + 1 = 3043.

Here are the steps of the procedure for creating a pivot table:

1. Click on a cell anywhere in the data set.
2. Now move the cursor to the top of the screen and click on the word Insert. That will cause a new menu ribbon to appear.
3. Click on Pivot Table. This opens the Create Pivot Table window.
4. Check that the entire data set, including the labels in the first row, is in the selected range of cells. If it is not, you can change the row and column references in the Table/Range box to include the entire data set.
5. Under Choose where you want the Pivot Table report to be placed, select New worksheet. Click OK. You should now see the Pivot Table and Pivot Chart Wizard – Layout window shown below.
6. Find the Seatbelt button on the right and drag it to the Row section of the Layout template. Also drag the same Seatbelt button to the Data section of the Layout template.

You should now have a pivot table that looks like this.

Notice that this table does not match the one in Section 2.3 of the textbook. The category names are in a different order and they have numbers in front of them. However, we can edit the pivot table to look like the one in the textbook. Move the cursor over cell A5 and click to make that cell active as shown above. The contents of cell A5 should now be visible in the Formula Bar.
Click on the Formula Bar and position the cursor to the left of the word “Never.” Press the backspace key twice (to delete “1_”). Repeat this process for cells A5 through A8. Now the category names should be the same as those in the textbook table, but they are still in a different order. Right-click, that is, click the right button on your mouse, on cell A9 and, as that cell becomes active, a menu should appear. Move the cursor over the word Order and a sub-menu of options should appear. Click on Move to Beginning. Repeat the use of the Order options until your pivot table looks like this:

If you want to convert the counts to percentages:

1. Move the cursor anywhere over the pivot table and right-click.
2. From the menu that drops down, click Value Field Settings.
3. Select the Show Values As tab.
4. Select % of column total from the drop down menu.
5. Click OK.

Your table should now look like this:

This table summarizes one variable, the frequency of use of seatbelts. So it is a "one-way" pivot table. Now let's create a "two-way" pivot table to summarize the frequency of seatbelt use for men and women separately.

1. Again, click any cell in the data and then click on Insert. As before, click Pivot Table.
2. Check to see that all of the data cells, including variable names, are selected, and click OK.
3. This time you should drag the Sex button into the Row section of the template, the Seatbelt button into the Column section, and the Seatbelt button into the Data section. You should now see a button in the Data section that says Count of Seatbelt. (Actually, you could drag either the Seatbelt button or the Sex button into the Data section and get the same result.)

Does your table look like this?

This table contains the same information as is shown in Table 2.2 in the text. Based on the numbers in the table, would you say that the order in which the letters are presented influenced the choices made by the students who answered?
It might be easier to answer the question after looking at percents instead of counts. Repeat the procedure used above to change Field Settings. This time under Show values as, select % of row. Does your table look like this?

Visual Summaries for Categorical Variables: Pie Charts and Bar Graphs

Tables of numbers like the tables you just created often come in handy in interpreting data. However, a well-chosen graph can be not only easier to understand but also visually more powerful. Let's look at a couple of graphs that Excel can create.

Look at Figure 2.2 in MOS, your textbook. We are going to create a pie chart very similar to that, but in order to do so we must first create another pivot table. Go back to the data in pennstate1 and use the procedure described above for one-way pivot tables to create a pivot table for the column headed RandNumb. When you drag the RandNumb button into the Data section of the template you will probably see Sum of RandNumb. To change this from sum to count, double click on the Sum of RandNumb button. The Pivot Table Field window should open. Under Summarize by, click Count, click OK, and then proceed as before. Your table should look like the one below.

Next you will instruct Excel to use this table to create a pie chart.

1. Move the cursor over the table and click on any cell.
2. Now move the cursor to the top of the screen and click Insert.
3. Click Pie and select the first option.

Check your pie chart against the one shown below. Note that it is not exactly like Figure 2.2 in MOS, but the difference is not significant.

Repeat steps 2 and 3 above, but, this time, select Column under Chart type instead of Pie and select the first option. Note that it is essentially the same as the bar chart in Figure 2.2 in MOS. Does your bar chart look like the one below?

Before we finish our exploration of how to summarize categorical data, we will create one more graph: a bar chart for two categorical variables.
You might want to reread Example 2.2 in MOS and look at Figure 2.4.

1. First, in a blank Excel worksheet, create a table like Table 2.3. Since you do not have the original data from the survey of 479 children, you should type in category names and percentages to create a table like the one in the text.
2. Do not include the percent sign (%) or counts shown in Table 2.3. Instead enter each percent as a decimal fraction. For example, the cell in the row labeled Darkness and the column labeled No Myopia contains 90%. Type this into your table as 0.9.
3. Now press Enter and then click on the cell into which you just entered the 0.9.
4. In the tool bars at the top of the screen, find the % icon. Hold the cursor over it for a few seconds without clicking. The phrase Percent Styles should appear. Click on the icon. The contents of the cell should have changed from 0.9 to 90%.
5. Type the rest of the percents into the table as decimal values (0.09, 0.01, 0.66, etc.), highlight all of the numbers by clicking and dragging over them, and again click on the % icon. You do not need to include the Total column or row, as you will not include them in your graph.
6. Now, highlight all of the percentages in your table along with the column labels (No Myopia, etc.), but not the row labels. Then click Insert.
7. Select Column and select the first option under 3-D Column.

Your bar chart should look like the one below. Compare it with Figure 2.3 in MOS. Note that the colors in the chart below are the same as the ones in the text. This is not important except to demonstrate that you can change the formatting of a chart created by Excel.

Finding Information in Quantitative Data: The Five-Number Summary

We now move on to summarizing measurement or quantitative data. We will begin by creating the five-number summaries for Example 2.5 in MOS. To create these summaries you will use several of Excel's built-in statistics functions. But first a little data sorting is needed.
It is sometimes the case that the way the data are arranged is not the way you need to have them arranged for the analysis you want to perform. So knowing how to rearrange the data can come in handy. In the pennstate1 workbook, the right hand span data are in what is called a "stacked" format. That is because right hand spans of women and those of men are stacked up in the same column. The only way we can tell which is which is to have another column, in this case the variable called Sex, to tell us which numbers are hand spans of women and which are those of men. What you need to do now is “unstack” the RtSpan column so that you have hand spans for men and hand spans for women in two different columns.

1. Find a blank worksheet in the Excel workbook containing the hand span data. If there isn’t one, click on the tab without a name to create a new worksheet.
2. Next click on the tab of the worksheet containing the data and click the letter at the top of the column containing the variable Sex. This is probably column A. The entire column should now be highlighted.
3. Click the Home menu and select Copy.
4. Now click the tab of the blank worksheet, click on cell A1, click on the Home menu, and select Paste. You should now have the Sex variable in the leftmost column (column A) of the new worksheet.
5. Go back to the data worksheet, select the RtSpan column, and then copy/paste it into column B of the new worksheet. I’ll wait.
6. Now we’re ready to sort data. In the new worksheet, click on any cell in the data, then click on the Data menu and select Sort. The Sort window should now be open.
7. If the variable Sex is not selected in the Sort by box, click on the down arrow and select it.
8. Click Add Level.
9. In the Then by box, select RtSpan. Your Sort window should look like this.
10. Click OK. The data are now sorted.

Next you will rearrange the hand span data into two columns.
1. Scroll down to the last row containing the word Female in column A. This should be row 104.
2. Click on the cell containing the first male right hand span. Is this cell B105? It is in my worksheet.
3. Now scroll to the last row that contains data, row 191, hold down the shift key, and click on cell B191.
4. Go to the Home menu and click Cut.
5. Finally, scroll back to the top of the worksheet, click cell C2, and click Home and then Paste.

Whew! You’ve done it. The data are unstacked, and you are ready to compute five-number summaries. This may seem like a lot of work just to prepare the data for analysis, but once you have done it a few times, you will do it very quickly.

There is one more small detail to see to. You need to define column headings (variable names) that reflect the new arrangement of the data. For example, I typed F Span in cell B1 and M Span in cell C1.

1. Click on a blank cell to the right of the data. I chose cell E7. Type the word Median.
2. Now move two cells to the right and type =MEDIAN(.
3. Click on the first male hand span (cell C2), then hold down the shift key and click (that's called a shift-click) on the last male hand span (cell C88).
4. Type a right parenthesis, ), and press the Enter key.

You should now see 22.5, the median of the male hand spans, in cell G7. You have just used one of Excel’s many built-in functions. Go back and highlight the cell so that you can review what you typed. Start with an equal sign, then type the name of the function, in this case median, and then parentheses containing any “arguments” required by the function. For the median function, the argument required is the range of cells that contain the data, given by its first and last cells, as in C2:C88. By the way, I always type the function name in all capital letters to remind me that I am using an Excel function, but that is not required.

There is an alternative way of entering a function into a cell.

1. Click the cell immediately under the cell in which you typed Median.
Type the word Quartiles.
2. Now click the cell two cells to the right, G8.
3. Go to the top tool bar and click the Paste Function icon. It looks like this: fx.
4. The Paste Function window should now be open on your screen. Scroll down until you find Quartile, click on it, and then click OK.
5. In the Quartile window, click on the small red arrow on the right side of the Array box. The Quartile window collapses into a single box.
6. Now click the top male hand span, scroll down, and then shift-click the last male hand span. All of the male hand spans should now be selected. In the box still on the screen you should see C2:C88.
7. Click the small red arrow on the right end of the box, and the Quartile window should re-appear.
8. In the box labeled Quart type 1 to indicate that you want the first quartile, and then click OK.
9. The cell you first selected should now contain 21.75, which is the first quartile of the male hand spans.

I’ll bet you are ready to compute the rest of the numbers for the male hand span five-number summary and then compute the summary for the female hand spans. Here are a few hints. To find the third quartile, enter 3 into the Quart box; to find the largest number in a data set, use the MAX function; and to find the smallest number, use the MIN function. When you have finished, here is what you should have.

Histograms, Stem-and-leaf plots, and Dotplots

Unfortunately, Excel’s histogram is not one of its best features. The procedure for creating a histogram in Excel is cumbersome, and the result does not look quite the way a histogram is supposed to look. Nevertheless, let’s create the histogram first; we can identify its strengths and weaknesses later.

I want you to use Excel to create a histogram, like the one in Figure 2.7 in MOS, of women’s right hand spans. The process is made easier by the fact that you have already computed the five-number summary for this data. For a histogram we need to determine a set of categories into which the data will be grouped.
We need to tell Excel what the boundaries of those categories are, what Excel refers to as the “bins.” We know, because we have the five-number summary, that these hand spans range from a low of 12.5 centimeters to a high of 23.25 centimeters. There are many ways we could define categories to cover this range, but let’s copy the categories used in Figure 2.5. The categories are 10 to 11, 11 to 12, 12 to 13, and so on up to 23 to 24. The bins corresponding to these categories are, for Excel, 11, 12, 13, and so on up to 24.

1. In the worksheet where you sorted the hand span data, select a column to the right of the data, type the word bins in the first cell and the numbers 11 to 24 in the cells under that, one number per cell.
2. Next go to the Data menu and look for Data Analysis.
3. If Data Analysis is in the list of tools, skip to step 4. If it is not in the list, you will need to add it.
   a. To do this, go to the File menu and select Options, then click Add-ins.
   b. In the Add-ins window that appears you should see Analysis ToolPak. Click the box to the left of that, and then click OK. It may take Excel a few seconds to load the ToolPak.
4. Now go back to the Data menu, and you should find Data Analysis listed. When the Data Analysis window opens, scroll down to Histogram and then click OK.
5. In the Input Range box, click the red arrow and select the range of female hand span data, including the variable name in the first row.
6. In the Bin Range box repeat the procedure to select the list of bins, again including the name bins in the first row.
7. Now click the Labels box, then click Chart Output, and finally click OK.
8. The histogram may be quite squashed down. If so, click on it anywhere and then move the cursor over the small black square in the center of the bottom of the histogram box. The cursor should change to a double arrow. Click and hold the mouse button down while you drag down until the histogram is large enough to be easy to understand.
It should look like the one below.

Notice that the shape formed by the vertical bars in the histogram is very similar to that in Figure 2.7 in the text. However, there is a difference between the two histograms. The main feature that distinguishes a histogram from a bar chart is that, in a histogram, there are no gaps between the vertical bars. Note that this is the case with Figure 2.7. It is not true, however, of the histogram you have just created using Excel. The gaps between bars in a bar chart are there to emphasize that the bars represent distinct categories. By the same reasoning, a histogram should not have gaps because its bars represent categories that make up one continuous and uninterrupted range of numbers. It is a flaw in Excel's histogram that it is depicted as a bar chart. However, here is a histogram of the same data created with an Excel add-in.

[Figure: Histogram of women's right handspans. Vertical axis: Frequency (0 to 30). Horizontal axis: Right handspan (cm), in categories from <=11 through 22-23 to >23.]

Standard Excel does not include stem-and-leaf plots, dotplots, or boxplots among its data analysis tools. Thus we cannot use Excel to create plots similar to Figures 2.8 and 2.9 in MOS. As mentioned in chapter 1 of this manual, there are add-ins that extend the statistical capabilities of Excel. Several of these add-ins include menu options for creating one or more of these plots.

Summary Measures

Section 2.5 in MOS presents several summary measures for quantitative variables. Let's see how to use Excel to compute those measures. Specifically, we will use Excel to compute the mean, range, and interquartile range, in addition to the five-number summary encountered earlier. We will use the Songs on Student iPods data shown in Example 2.11 in MOS. Before you can compute the summaries, you will need to enter the data into Excel. Go ahead; I'll wait.
In cells E5 to E12 type the following eight labels in a column: mean, minimum, 1st quartile, median, 3rd quartile, maximum, range, and interquartile range.

To compute the mean, use Excel's Average function. In the cell to the right of the word mean, type =AVERAGE(A2:A25).

You already know how to find each of the numbers in the five-number summary. So go ahead and do that. The range is simply the largest value minus the smallest value. Thus in the cell to the right of the word range, you can type =F10-F6 since F10 should be the cell where you determined the maximum (using the MAX function) and F6 should be the cell where you determined the minimum (using the MIN function). Finally, the interquartile range is the difference between the third quartile and the first quartile. So, you can compute this using the quartiles you have already determined in the same way you just computed the range. When you are finished, your results should look like this.

If you want to check your formulas in cells F5 through F12, here is what they should be.

Finally, you will compute a variance and a standard deviation. Again, let's use the Songs data set listed in Example 2.11 of MOS.

1. Type these numbers in cells A2 through A25 in an empty Excel worksheet. Remember that, if you don't have any empty worksheets, you now know how to add one (reminder: look in the Insert menu).
2. Type a label in cell A1 (I used Songs) just to get in the habit of always using a label for a column of data.
3. Next in a blank cell type the word Variance. In the cell to the right of this type =VAR(A2:A25).
4. To obtain the standard deviation, use the Excel function STDEV in the same way you just used VAR. That is, type =STDEV(A2:A25). The results are:

Let's review what we have accomplished in this chapter.
You can create one-way and two-way frequency tables (using Pivot Table and Pivot Chart Report from the Data menu). You now know how to use Excel to create a pie chart or a bar chart (using the Chart Wizard) and a histogram (using the Histogram command under Data Analysis in the Data menu). You also know how to use Excel to compute a five-number summary as well as a mean, range, interquartile range, variance, and standard deviation (all using Excel's built-in functions). Along the way you learned how to unstack and sort data (using the Sort command under Data). When you have a set of data to analyze, it is a good idea to start by creating graphs and computing summary measures to "get a feel for" the data. It may not seem like we have covered a lot when it's listed in one short paragraph, but you now have a useful array of tools for summarizing a set of data.

Chapter 3: Relationships Between Quantitative Variables

Scatter plot

Summaries of single variables such as pie charts, histograms, means, and five-number summaries are useful. But the real power of statistics comes from its methods for analyzing relationships between variables. As we did with single variables in chapter 2, we start with graphical summaries and then look at numerical summaries.

Look at Figure 3.1 in MOS. It is a scatter plot of hand span and height. Note which variable is plotted on the horizontal axis and which is on the vertical axis. When Excel is used to plot these data, here is what the result looks like:

[Figure: scatter plot titled "HandSpans and height." Horizontal axis: Height (in.), 50 to 80. Vertical axis: Handspan (cm.), 15.0 to 26.0.]

Now let's see how to create this scatter plot using Excel.

1. Start by opening the workbook handheight. You should find three variables, sex, height, and handspan, with 167 values in each column not counting the variable names in the first row.
2. Select all values in the height and handspan columns. That is, cells B1 through C168.
3.
Click Insert, click on the Scatter option and select the first option.

The points on the scatter plot should be clustered tightly in the upper right corner of the graph. Let's change that. Notice that the points range from approximately 55 to 80 on the horizontal axis and approximately 16 to 26 on the vertical axis. We will make use of that information to adjust the scales on the two axes.

1. Double-click on one of the numbers that label the horizontal axis. This should open the Format Axis window.
2. Click the Axis Options tab if it is not already in front.
3. To the right of the word Minimum, select Fixed and replace the number by highlighting the 0 and typing 55; to the right of Maximum, select Fixed and replace 100 with 80; and to the right of Major unit, select Fixed and replace 20 with 5. Then click Close.
4. Double-click on one of the numbers that label the vertical axis. Again, click the Axis Options tab if it is not already in front.
5. Next to Minimum, click Fixed and replace 0 with 14; next to Maximum, click Fixed and replace 30 with 26; and next to Major unit, click Fixed and replace 5 with 2. Then click Close.

Your scatter plot should now look like the one shown above. You will probably note a few differences in format such as the color of the background and the absence of grid lines on the scatter plot above. I encourage you to experiment with right-clicking at various places on your graph and exploring the menu options that appear.

Standard Excel does not offer the option to indicate two groups on a scatter plot using a third variable as in Figure 3.4. However, there are statistical add-ins for Excel that do offer this option.

Trend Line, Regression, and Residuals

Now follow the same procedure to create a scatter plot of driver age and maximum legibility distance of highway signs as shown in Figure 3.7 in MOS (refer back to Example 3.2). The data are in the workbook signdist.
As before, adjust the limits and major units of the scales so that they are roughly the same as shown in Figure 3.7. I used 15 to 85, with a major unit of 10, on the horizontal axis and 250 to 600, with a major unit of 50, on the vertical axis. Now it's time to add a trend line to the scatter plot.

1. Move the cursor over one of the points on your scatter plot and right-click.
2. When the menu pops up, select Add Trendline.
3. When the Add Trendline window opens, select the Linear Trend/Regression type.
4. Click the box to the left of Display equation on chart.
5. Click Close.

You should now see a straight line superimposed on your scatter plot similar to the one below. You should also see the equation y = -3.0068x + 576.68, which is the regression equation corresponding to the trend line.

The regression equation can be used to predict maximum sign legibility distance based on a driver's age. However, sometimes we want to go beyond the regression equation and obtain additional information about the relationship between the two variables. For example, as described in Section 3.2, we may want to analyze the residuals. In order to obtain residuals we use Excel's Regression command, one of the options in Data Analysis (in the Data menu).

1. Click on the tab for the worksheet containing the age and distance data in signdist.
2. Select Data > Data Analysis > Regression. In the Regression window several boxes must be filled in.
   a. In the box to the right of Input Y Range type the range of cells containing the distance data (for example, B1:B31).
   b. In the box to the right of Input X Range type the cells containing the age data.
   c. Click the box next to Labels.
   d. In the box to the right of New Worksheet Ply type a name for the new worksheet Excel will create. (I wasn't very creative; I just named it regression.)
   e. Click the box next to Residuals and the box next to Residual Plots.
   f. Finally, click OK, and you should get a new worksheet showing regression results.
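If you are curious about what the trend line and the Residuals option compute, the following is a minimal Python sketch of a least-squares line and its residuals. It is not part of the Excel procedure, and the ages and distances in it are made-up numbers chosen only for illustration; they are not the signdist data. The function names fit_line and residuals are also my own, hypothetical, choices.

```python
# Least-squares line and residuals computed by hand.
# NOTE: the data are hypothetical, chosen only to illustrate the arithmetic.

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Residual = observed y minus predicted y, one value per data point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


ages = [20, 30, 40, 50, 60]             # hypothetical driver ages
distances = [520, 480, 470, 420, 400]   # hypothetical legibility distances

slope, intercept = fit_line(ages, distances)
res = residuals(ages, distances, slope, intercept)

print(f"y = {slope:.4f}x + {intercept:.2f}")
for age, r in zip(ages, res):
    print(age, round(r, 2))
```

Each residual is simply an observed distance minus the distance predicted by the line, which is what Excel lists under RESIDUAL OUTPUT.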
On the new worksheet, scroll down until you see RESIDUAL OUTPUT. Below that heading you will see the predicted distance and corresponding residual for each of the 30 points on the scatter plot. A partial listing is included below. On the upper right part of the worksheet you should see a scatter plot, like the one shown below, with age on the horizontal axis and residuals on the vertical axis. You can ignore the rest of the regression output for now.

Correlation

It is quite simple to determine a correlation coefficient with Excel. Let's do that for the age and distance data, from Example 3.2, you have just been working with.

1. Click on the tab of the worksheet containing the age and distance data, and click on an empty cell where you want Excel to put the correlation coefficient. (I clicked cell E3.)
2. Click the Paste Function icon (fx).
3. Under Function category select Statistical, and then, under Function name, select CORREL.
4. Click OK.
5. In the Array 1 box type the range of data for either variable, age or distance. Then in the Array 2 box type the range of data for the other variable. Notice that, for correlation, it does not matter which variable you specify as array 1. Click OK.
6. You should now see the correlation between driver age and maximum legibility distance of highway signs as shown below. I entered the word correlation in the cell above the correlation coefficient since Excel does not automatically add a label.

Correlation
-0.801244651

Regression Output

Finally, we will look at some of the elements of the regression output generated by Excel. Open the workbook pennstate1. Letting RtSpan be the explanatory variable and LftSpan be the response variable, use Excel's regression command (Data > Data Analysis) to perform a regression analysis. For each of the values below highlighted in yellow, confirm that you got the same result and compare it with the value shown on page 88 of the text– continued in MOS.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.950
R Square             0.902
Adjusted R Square    0.902
Standard Error       0.639
Observations         190

ANOVA
              df     SS        MS        F          Significance F
Regression    1      708.15    708.15    1736.38    0.000
Residual      188    76.67     0.41
Total         189    784.82

              Coefficients    Standard Error    t Stat    P-value
Intercept     1.4635          0.4792            3.05      0.003
RtSpan        0.9383          0.0225            41.67     0.000

Chapter 4: Relationships between Categorical Variables

In chapter 3 we saw how Excel can help us analyze relationships between quantitative variables. In this chapter we address the same questions: (1) are the variables related and, if so, (2) what is the nature of the relationship? But this time we will consider categorical variables. We will use the pennstate1.xls data file. I recommend you also review the section headed Summarizing Categorical Variables with the Pivot Table in chapter 2 of this manual.

Please use chapter 2 of this manual to create a pivot table using Form as the row variable and SQPick as the column variable. While you are doing that, I'm going to get another cup of coffee.

In case you are having trouble finding the pivot table, here is what it looks like.

You can easily convert these observed counts into row percents.

1. Move the cursor anywhere over the pivot table and right-click.
2. From the menu that drops down, select Show Values As.
3. Select % of row total.

The new table should look like this:

You can obtain a table of column percentages by repeating the procedure above. However, this time, next to Show Values As, select % of column. The new table looks like this.

Finally, to get back to the original table of counts, repeat the above procedure and, next to Show Values As, select No Calculation.

Expected Counts

If you examine Figure 6.4 in MOS, you will notice that it includes expected counts. Recall that these are the counts we would expect if there were no relationship between the two variables: order of letters on the questionnaire and letter picked by the student.
Let's see how we can use Excel to compute these expected counts.

1. If your table contains percentages, restore them to cell counts. Repeat the three-step process above and, in the final step, under Show Values As, select No Calculation.
2. We want to leave the original table where it is, but create a copy and place it to the right of the original. Starting in the lower right cell of the table, click and drag to the upper left cell. The entire table should now be highlighted.
3. Use Copy/Paste Special, pasting values only, to place a copy of the table with its upper left cell in cell F2. (If your original table does not have its upper left cell in cell A2, use Copy/Paste to move it there.) Note that the following cell references will not work for you unless you have the upper left corners of your tables in cells A2 and F2, respectively.
4. In cell G5 type =$D5*B$7/$D$7. Press Enter.
5. Select cell G5, select Copy, select cells G5 through H6 (the four cells that contain the observed counts), and select Paste.

Voila! Your copy of the original table should now contain the expected counts. What about that formula that you entered into cell G5? Where did that come from? It multiplies the row total (in column D) by the column total (in row 7) and divides by the grand total (in cell D7). Recall that a $ to the left of a row or column reference makes that reference "fixed" instead of "relative." For example, the reference $D5 has a fixed column reference but a relative row reference. When the formula is copied into cell G6, the row reference will change to 6, but the column reference will remain constant as D. I recommend you study the formula until you are sure you understand what happens when you copy it to other cells. The ability to combine fixed and relative references is one of the features that make Excel so versatile.

I also labeled the two tables, in cells A1 and F1, for use in the next section.

The Chi-square Test

We now have the two main ingredients for a chi-square test to determine whether the two variables are related in the population from which the sample was drawn.
Those ingredients are a table of observed counts and a table of expected counts. We will use the Excel function CHITEST to find the p-value for the test.

1. Click on the empty cell where you want the result of the chi-square test to be placed.
2. Click the Paste Function icon. Under Function category select Statistical, and then under Function name, select CHITEST.
3. Click OK.
4. In the Actual range box specify the range of cells that contain the observed counts, not including the totals. That is, type B5:C6.
5. In the Expected range box type the range of cells containing the expected counts, G5:H6.
6. Click OK. You should now see the p-value:

p-value = 0.004689

Notice that I have added a title so that, if I look at my Excel worksheet in the future, I will remember what the number represents.

Finding Chi-Square

Excel's CHITEST function does not provide the chi-square value, only the p-value. However, if you want to know the value of chi-square, it is easy to find.

1. Click on the empty cell where you want the chi-square value placed.
2. Click the Paste Function icon. Under Function category select Statistical, then under Function name, select CHIINV, which stands for chi-square inverse. That is, CHIINV takes a p-value and "works back" to the chi-square value based on the chi-square probability distribution.
3. Click OK.
4. In the Probability box enter the reference to the cell that contains the p-value (0.004689). Alternatively, you can enter the p-value directly instead of the cell reference. CHIINV expects the area to the right of the chi-square value, which is exactly what the p-value is, so no subtraction is needed. (Excel's newer CHISQ.INV function works with the area to the left instead; if you use that function, enter 1- followed by the p-value.)
5. In the Deg freedom box enter the number 1.
6. Click OK. You should now see the chi-square value below, to which I added a label.

Chi-square = 7.995561

Finding The p-value

Finally, if you already have a chi-square value and want to find the corresponding p-value, here's how to do it with Excel. Let's use a chi-square value of 7.995.

1. Click on the empty cell where you want the p-value to be placed.
2. Click the Paste Function icon.
Under Function category select Statistical, and then under Function name, select CHIDIST, which stands for chi-square distribution. That is, CHIDIST takes a chi-square value and finds the corresponding p-value based on the chi-square probability distribution.
3. In the X box type either 7.995 or a reference to a cell where you have already entered the chi-square value.
4. In the Deg freedom box type the number 1.
5. Click OK. You should now see the p-value, about 0.00469. CHIDIST returns the area to the right of the chi-square value, which is the p-value. (Excel's newer CHISQ.DIST function, with its Cumulative argument set to TRUE, returns the area to the left instead, about 0.9953; if you use that function, subtract the result from 1 to find the p-value.)

Chapter 5: Sampling: Surveys and How to Ask Questions

The topic in chapter 5 of the textbook that Excel can help you with is selecting a random sample from a population. To see how this works, we will now select a simple random sample of ten students' responses from the responses of 173 students in the file UCDavis1 from the companion website for the text. So, for our purposes, the 173 students represented in the file make up the population. Our task is to select a simple random sample of ten students and the amount of TV they watch.

We will use a procedure introduced in chapter 2 of this manual. There we copied columns from one worksheet to another and then sorted the contents of the copied columns. If you want to review this, refer to the section titled Finding Information in Quantitative Data: The Five-Number Summary.

1. Open the file UCDavis1 and, if necessary, insert a blank worksheet.
2. Copy the column containing the data for variable TV into column B of the blank worksheet.
3. Next we will create a column with an ID number for each student. In cell A1 type ID.
4. Click on cell A2 to make it the active cell. Type the number 1 in cell A2, press Enter, and click on cell A2 again.
5. Click the Home menu, select the Fill icon, and click on Series.
6. Click the radio button next to Columns under Series in. To the right of Stop value enter 173. The series window should look like this:
7. Finally, click OK.
If you scroll down the worksheet, you should see the numbers 1 through 173 in cells A2 through A174.

Before we continue, I want to introduce you to the time-saving device of splitting an Excel worksheet. Notice the circle on the right in the screen shot below. In that circle, just above the up-arrow for the scroll bar, is a short horizontal bar. When you move the cursor over that bar, the cursor changes to a double, up-down, arrow. Click and drag the bar halfway down the worksheet. You should now see that the scroll bar has split into two bars. With the bottom bar, scroll down until you can see row 175 of the worksheet. That row should be blank.

You can now see both the top and the bottom rows of the data set in columns A and B. Your worksheet should look like the one below.

Now we are ready to make use of the Excel function RAND(). This function serves essentially the same purpose as Table 5.2 in MOS. That is, it will provide us with random numbers that we will use to select our random sample from the population. However, each time RAND() generates a random number it will be a number between 0 and 1. Here is the procedure that will yield the desired random sample.

1. Type Random in cell C1.
2. In cell C2 type =RAND(). Press Enter. That is not a misprint; the function name, RAND, is followed by empty parentheses. Even though the parentheses contain nothing, they must be included.
3. Click C2 to make it the active cell. With the cursor over the cell, right-click and select Copy from the menu that pops up. (This is just a shortcut to avoid moving up to the Home menu.)
4. Click cell C3. While holding down the shift key, click cell C174. This is called a shift-click. Cells C3 through C174 should now be highlighted. Move the cursor over cell C3, right-click, and select Paste.

Your worksheet should now look like this. However, the numbers in column C on your worksheet will probably be different from those below. This is the nature of a random number generator.
Each time you use it you will get a different sequence of numbers.

Since the numbers in column C are random, we can select the rows containing the smallest ten numbers (or the largest ten numbers, for that matter) and we will have a simple random sample. This is where the sorting I mentioned at the beginning comes in. Our final step in the selection of our sample is to use the sort feature in the Data menu. But, before we do that, we need to deal with one feature of the RAND() function that can be a nuisance. Every cell in the worksheet that contains the RAND() function will re-compute, and generate a new random number, every time you change any of the worksheet's contents. Here's how we prevent that:

1. Click cell C2 and then shift-click cell C174. The column of random numbers should now be highlighted.
2. Right-click on cell C2 and select Copy.
3. Click cell C2 to make it active, right-click, and select Paste Special. The Paste Special box should now appear.
4. Under Paste, click the radio button next to Values, and then click OK.

If you click on one of the cells in column C, you will see a number instead of the RAND() function. In all of the cells you have replaced the function with the random number it generated. These numbers are now fixed and will not change. Now we are ready to sort.

1. Click the Data menu and select Sort.
2. Under Sort by select Random.
3. Click OK.

In the screen shot above, I have highlighted the first ten rows to indicate that they define our random sample. The numbers in column A are the student IDs. My sample includes students 36, 98, 114, and so on. Your sample very likely includes different student IDs since RAND() very likely generated different random numbers for you. The numbers in column B are the times spent watching TV by the ten students in the sample. The numbers in column C are no longer of use.
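The same idea (attach a random number to every ID, sort by the random numbers, keep the first ten rows) can be sketched in a few lines of Python. This is only an illustration of the logic, not a replacement for the Excel steps; the function name simple_random_sample and the seed value are my own assumptions, and the seed is there only so that a rerun gives the same sample.

```python
import random


def simple_random_sample(ids, sample_size, seed=None):
    """Pick a simple random sample by sorting IDs on random keys.

    Mirrors the worksheet procedure: a random number per row (column C),
    a sort on that column, then the first sample_size rows.
    """
    rng = random.Random(seed)
    # Pair each ID with a random number in [0, 1), like =RAND() in column C
    keyed = [(rng.random(), student_id) for student_id in ids]
    # Sort by the random number, like Data > Sort by Random
    keyed.sort()
    # Keep the IDs from the first sample_size rows
    return [student_id for _, student_id in keyed[:sample_size]]


population_ids = list(range(1, 174))   # student IDs 1 through 173
sample = simple_random_sample(population_ids, 10, seed=2015)  # seed is arbitrary
print(sample)
```

Leaving out the seed (or changing it) gives a different sample each time, just as RAND() gives you different numbers from mine.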
Chapter 6: Gathering Useful Data for Examining Relationships

Most of chapter 6 in MOS is about designing procedures for collecting data rather than analyzing data. However, Case Study 6.2 requires randomly assigning 43 children to three treatment groups. We can use the procedure developed in chapter 5 of this manual, with a couple of steps added, to accomplish this random assignment. I will give an abbreviated description of the procedure required. You can check your understanding of chapter 5 by filling in the details I omit. It would probably be a good idea to review chapter 5 of this manual before you proceed.

1. Open a new Excel workbook.
2. In cell A1 type ID.
3. In cell A2 type 1, and use Fill (Home menu) to put the numbers 2 through 43 in cells A3 through A44.
4. Type Random in cell B1.
5. Type =RAND() in cell B2, and Copy/Paste it into cells B3 through B44.
6. Use Copy/Paste Special to replace RAND() with the random number generated in cells B2 through B44.
7. Sort the random numbers in column B into ascending order (Sort is in the Data menu).
8. Since the group of 43 child IDs is now in random order, those in cells A2 through A16 can be assigned to treatment group 1, those in cells A17 through A32 to treatment group 2, and those in cells A33 through A44 to treatment group 3 as shown in the spreadsheet below.

Notice that I have used step 8 above to identify the groups and then have used Copy/Paste to move them into separate columns. At this point it is only the child IDs that we are interested in. This is because the IDs tell us that, for example, children 29, 42, 31, 13, and so on are to be assigned to group 1.

Chapter 7: Probability

In this chapter we will look at an example of how Excel can be used to simulate random outcomes. Take a few minutes to reread Example 7.30 in MOS. You will simulate the process described there and then summarize the results.

1. First, open a new workbook in Excel.
2. In cell A1 type Participant.
You will use column A to number the simulated participants from 1 to 1000. First, type 1 in cell A2.
3. Use Home > Fill > Series to place the numbers 2 through 1000 in cells A3 through A1001. Review the use of Fill in chapters 5 and 6 if you don't remember how it works.
4. Next split the screen and scroll down in the bottom part until you can see row 1001. We used this device in chapter 5.
5. In cells B1 and C1 type T1 and T2, respectively.
6. Now select both cells B1 and C1. There is now a small black square in the lower right corner of the area you have highlighted.
7. Click in the square, hold down the mouse button and drag the square to cell I1.
8. When you release the mouse button, you should see that T3, T4, etc. have been entered in cells D1, E1, and so on. These headings denote simulated trials 1 through 8 for each simulated participant.
9. In cell J1 type the word Sum.
10. In cell B2 enter =IF(RAND()<0.7,0,1) and press Enter. You should see either a zero or a one in cell B2. If you press the F9 function key, sometimes the number in cell B2 will change and sometimes it won't.

The formula in this cell simulates one trial for simulated participant number 1. Here's how it works. RAND() generates a random number between zero and one with all possible numbers in that range equally likely. The IF function tests the number returned by RAND(). If it is less than 0.7 (the tested condition is true), a zero is entered into cell B2 representing an incorrect answer by the participant. If the RAND() number is not less than 0.7 (the condition tested is false), then a 1 is entered into cell B2 representing a correct answer by the participant.

14. Now click on cell B2 and select Copy.
15. Again click cell B2 and shift-click cell I1001.
16. Finally, select Paste. You have just simulated 8 trials for all 1000 participants.
17. Click on cell J2 and then click the AutoSum icon on the tool bar (it is a Greek letter sigma, Σ). The Sum function appears in cell J2.
18.
If the range specified is A2 to I2, replace the A with a B since you want Excel to sum the results in columns B through I. Next Copy the contents of cell J2 and Paste them into cells J3 to J1001.

The simulation is complete and you have summed the results for each participant. Now it is time to tabulate the sums and create a bar chart. You should now use your skill at creating pivot tables to summarize the data in column J of your table. Here's what I got, but your numbers will not be exactly the same because of the use of random numbers in the simulation.

Move the cursor anywhere over the pivot table, and, in the menu that pops up, select Insert. You should now see a chart sheet with a bar chart of the counts in your pivot table. You can right-click in the chart area and select Chart Options to add titles and a label for the vertical axis.

Notice that the counts in the pivot table above are not the same as the counts shown in Section 7.6. I'm guessing that neither set of counts is exactly the same as the counts you obtained. That is because the results are based on random numbers and Excel generates different random numbers every time RAND() is used. Excel's computational versatility together with its ability to generate random numbers make it a powerful tool for simulating financial risk, product demand, and other types of processes involving uncertainty about the future.

Chapter 8: Random Variables

Binomial Random Variables

In chapter 8 of MOS you learned that, even when a single event is completely unpredictable, such as whether one toss of a coin results in a head or a tail, a pattern emerges when the event is repeated many times. One pattern that emerges for many discrete random events is the binomial distribution. Before we move to our first example, you might want to review Example 8.16 in MOS. I don't mind waiting while you do. In this example, the number of repetitions (births), n, is 10 and the probability, p, that any birth is a girl is 0.488.
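For readers who want to see the arithmetic behind these probabilities, here is a minimal Python sketch of the binomial formula using the n = 10 and p = 0.488 just mentioned. It is offered only as a cross-check on the Excel results in this section; the function names binom_pmf and binom_cdf are my own, hypothetical, choices.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k): cumulative probability of k or fewer successes."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))


n, p = 10, 0.488   # ten births, probability 0.488 that a birth is a girl

print("X  Prob   Cum Prob")
for k in range(n + 1):
    print(k, round(binom_pmf(k, n, p), 3), round(binom_cdf(k, n, p), 3))

# The three probabilities discussed for seven girls in ten births
print("P(X >= 7) =", round(1 - binom_cdf(6, n, p), 3))
print("P(X = 7)  =", round(binom_pmf(7, n, p), 3))
print("P(X <= 7) =", round(binom_cdf(7, n, p), 3))
```

The first function plays the role of the exact (False) form of Excel's binomial function and the second plays the role of the cumulative (True) form.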
The Excel function BINOMDIST (named BINOM.DIST in current versions of Excel) is used there to compute the probability of exactly seven girls in ten births and the cumulative probability of seven or fewer girls in ten births. You will now use this Excel function to find the probabilities of all possible numbers of girls in ten births. Begin by opening a new Excel workbook.

1. In cell A1 type Finding Binomial Probabilities. I will continue to remind you that it is good practice to give your work a title for future reference.
2. In cells A3 through D3 type what is shown in the four cells below:

   n=    10    p=    0.488

3. In cells A5 to C5 type X, Prob, and Cum Prob, respectively.
4. In cell A6 type 0 (that is a zero). Then use Home/Fill to enter the numbers 1 through 10 in cells A7 through A16.
5. Click on cell B6. Next, click the Insert Function icon. Select the Statistical category, and then the BINOM.DIST function.
6. Note that BINOM.DIST has four function arguments. The first, Number_s, refers to the number of events or "successes" out of n trials. This is denoted by k in Example 8.16. The second argument, Trials, refers to n, the number of trials. Argument three, Probability_s, refers to the probability of success on each trial, denoted by p in Example 8.16. If you type False for the final argument, as shown below, Excel will compute P(X = k), the exact binomial probability. If you type True, Excel will compute P(X ≤ k), the cumulative binomial probability. Finally, click OK.
7. Click on cell C6 and again insert the BINOM.DIST function. Enter the first three arguments exactly as before, but this time type True for the last argument. Click OK.
8. Before you can copy the function statements in cells B6 and C6, you need to change the references to n and p in both formulas from relative, B3 and D3, to absolute, $B$3 and $D$3. Otherwise, those references would shift down a row each time the formula is pasted into a new row.
9. Copy the contents of cells B6 and C6 and Paste them into cells B7 through C16.

Your table of probabilities should look like this:

Notice that this table of numbers is not very easy to interpret.
Let's create a bar graph to show the probabilities in the middle column.

1. Select cells B5 through B16.
2. Click Insert and select Column and select the first option.

You should see a bar chart that looks like this (except that your labels probably aren't exactly the same as mine):

Let's make use of this bar chart to see what happens to a binomial distribution as we change the value of p, the probability that a birth is a girl.

1. Change the value in cell D3 from 0.488 to 0.4. Notice that the distribution shifted slightly to the left, but its shape did not change noticeably.
2. Change the value in cell D3 from 0.4 to 0.25. Now the distribution is becoming noticeably skewed to the right.
3. Finally, change the value in cell D3 from 0.25 to 0.1. The distribution is now quite skewed and shows that there is very little chance of more than four girls in ten births.

You might want to experiment with other values for p. The only restriction is that p must be between zero and one.

Before we leave the binomial distribution, let's find three more probabilities. First, be sure to reset the value in cell D3 to 0.488. Here is what the results should look like when you are finished:

X      Prob
≥ 7    0.153
= 7    0.106
≤ 7    0.953

Of course, the entries under the heading of the left column are labels I typed into three cells. The entries in the right column, under the heading Prob, were computed using probabilities already computed in the table you created. So, let's see how.

1. Notice that cell C12 contains the probability of six or fewer girl births in ten births. If we subtract this from one, we get the probability of seven or more (≥ 7) girl births. So the formula that resulted in the 0.153 above is =1-C12.
2. The probability of exactly seven (= 7) girl births in ten births is in cell B13. So the cell to the right of the label = 7 contains =B13.
3. Finally, the probability of seven or fewer (≤ 7) girl births in ten births is in cell C13.
By now you have already guessed that the final cell contains =C13.

By the way, if you tried to enter the label = 7 in a cell, it probably didn’t work. As soon as you pressed the enter key the = 7 changed to 7. If you want to enter a string of text that begins with an equal sign into a worksheet cell, the equal sign must be preceded by an apostrophe, ', sometimes referred to as a single quote. Otherwise, Excel interprets the equal sign as the beginning of a formula. Also, if you want to insert ≥ 7 instead of the clumsier >= 7, click on the cell where you want to put the text, click the Insert menu and select Symbol. Find the symbol you want (this may take a bit of scrolling and searching) and click on it. Finally, click Insert and then click Close.

Uniform Random Variables

As an example of a continuous random variable, MOS describes and illustrates the uniform probability density function. Briefly review Example 8.19. The continuation of Example 8.19 requires finding “the probability that the waiting time X was in the interval from 5 to 7 minutes.” You can use Excel to find this probability. Compared with the work involved in some of the procedures we have used, this will be a piece of cake.

Waiting times are uniformly distributed over the interval 0 to 10 minutes. Figure 8.6 helps us visualize the probability we are looking for. It is defined by the ratio of the area of the shaded rectangle to the area of the larger rectangle. However, in the special case of the uniform distribution, the probability can also be defined as the ratio of the length of the specified interval (in this case 5 to 7) to the range of values over which the distribution is defined (in this case 0 to 10).

1. Click the tab of a blank worksheet in the Excel workbook you currently have open.
2. You know the routine. Start by typing in a title. In cell A1 type Finding Uniform Probabilities.
3. In cells A3 to B6 type the following:

   Distribution Limits
   0     10
   Interval Limits
   5     7
4. Type the label Finding Uniform Probabilities in cell A8.
5. Finally, in cell D8 type =(B6-A6)/(B4-A4).

That’s it. You should see 0.2 in cell D8. You might want to try some other intervals in addition to the 5 to 7 interval specified in the example. The only restriction is that both ends of the interval are between 0 and 10.

Normal Random Variables

Without a doubt the most important and most frequently encountered probability distribution is the normal distribution or, more precisely, the normal family of distributions. Review Example 8.24 in MOS. First we will use Excel to find the probability that Z is greater than 1.31 and the probability that Z is less than 1.31. In other words, we are finding probabilities for the standard normal distribution.

1. I’m not even going to mention that you should put a title for the worksheet in cell A1. But I will say that I typed Finding Normal Probabilities.
2. Type the entries below in cells A3 to D4.
3. Type =B3 in cell E3 and in cell E4. Then type ) = in cell F3 and in cell F4.
4. Finally, type =NORM.S.DIST(B3, TRUE) in cell G3 and =1-G3 in cell G4. The S in the function name tells us that the function returns probabilities or proportions for the standard normal distribution (mean = 0 and standard deviation = 1). Specifically, it returns the area under the standard normal curve and to the left of the Z value specified in the argument.
5. Notice that the effect of the entries in column E is to create labels that include the value of Z, 1.31. But, if you enter a new value for Z in cell B3, not only will the probability change, but the value of Z shown as part of the label (in cells E3 and E4) will also change. Try it. For example, type 1.96 in cell B3. Now you should see:
6. Try some other values for Z, by typing different values in cell B3, to see how the probabilities change.

In this example we used a different method of entering an Excel function.
For previous uses we have clicked on the Insert Function icon and selected the needed function from a list. This time we typed the function name and argument values directly into the cell. Either method works. Typing the function saves time but requires that you know the function name and arguments exactly. Using Insert Function takes a little longer, but gives a list to choose from and then describes each required argument.

We can generalize our computation to work for any normal distribution. For example, look at Example 8.24. Let’s use Excel to find the proportion of college women who are taller than 68 inches.

1. Type the entries below into cells A3 to D5.
2. Next type =B5 in cell E5, ) = in cell F5, and =1-NORM.DIST(B5,B3,B4,true) in cell G5. Notice that the function NORM.DIST has four arguments. They specify, in order, the value of X (in this case 68 inches), the mean of the normal distribution of interest, the standard deviation, and the word “True” to cause the function to return a cumulative probability. You should now see:

Notice that since the function returns the probability that the height is 68 inches or less, we subtract this probability from 1 in order to get the probability the height is greater than 68 inches. Finally notice that you did not need to go through the intermediate step of finding the value of Z that corresponds to a height of 68 inches since NORM.DIST does that for you. Try some other values for X, by typing different values in cell B5, to see how the probability changes.

You have used the Excel function NORM.DIST to find the proportion of women who are taller than 68 inches. You can also use this function to find the proportion of women whose height falls within some specified interval. For example, review the continuation of Example 8.24. Now use Excel to find the proportion of women between 62 and 68 inches tall. Here is what your results should look like in cells A7 to J8.

   x1 =    62
   x2 =    68     Prob (    62    <= X <=    68    ) =    0.733

1. See if you can figure out on your own what to type in each cell.
2. If you have trouble, here is some help. In cell E8 type =B7. In cell H8 type =B8. In cell J8 type =NORM.DIST(B8,B3,B4,1)-NORM.DIST(B7,B3,B4,1). The 1 in the last argument means the same thing as True.

We will conclude this section by using Excel to find percentiles for a normal distribution. Review Example 8.26 in MOS. Note that in the two previous examples you have specified a value for the variable (either Z or X) and used an Excel function to return the corresponding probability or area under the normal curve. Now you will reverse the process by specifying a value for the probability and using an Excel function to find the corresponding value of X.

1. Type the entries below in cells A3 to D5.
2. Next type =B5 in cell E5, '= in cell F5, and =NORM.INV(B5/100,B3,B4) in cell G5. The “INV” in NORM.INV indicates an inverse function. It has three arguments: the percentile specified as a probability or proportion, the mean of the normal distribution, and the standard deviation. The function returns the corresponding value of X, in this case a systolic blood pressure of 127 (after clicking the Decrease Decimal icon several times, that is).

You can now use Excel to find probabilities or proportions for binomial, uniform or normal random variables. There are many other types of random variables, but these are the most frequently encountered.

Chapter 9: Understanding Sampling Distributions: Statistics as Random Variables

Sample Proportions as Random Variables

In chapter 9 of MOS you learned about sampling distributions for sample proportions. Before we continue you should probably go back and review Examples 9.4 and 9.12. I don’t mind; I have some other things to do while you’re busy. In Example 9.4 the authors mention the process of selecting a sample of 2400 voters from a population in which 40% of all voters favor Candidate X (whoever she may be).
Then they ask the question, “What proportion of the sample would be expected to favor Candidate X?” The Normal Curve Approximation Rule for Sample Proportions indicates that the proportion is a random variable having a normal distribution with a mean of 0.40 (40%) and a standard deviation of 0.01 (1%). Finally, the authors mention simulating 400 repetitions of this process of selecting 2400 voters and computing the sample proportion. Let’s use Excel to carry out that simulation.

1. Open a new Excel workbook and type a title for the simulation in cell A1. You could use something like Proportion: Simulating the Sampling Distribution.
2. Type the following entries in cells A3 to B5. Remember to type 0.40 in cell B4 and then use the Percent Style icon (%) to convert the decimal value to a percentage. In cell B5 you should enter the formula for s.d.(p-hat) from Example 9.9. That is, type =SQRT(B4*(1-B4)/B3) where SQRT is the Excel function that finds the square root of the argument value.
3. Next type the following headings in cells A7 to C7.

   Sample    Proportion    bins

4. Type 1 in cell A8, press Enter, and then select cell A8. Use Fill (in the Home menu) to enter 2 through 400 in cells A9 through A407. This allows us to identify each of the 400 simulations with a number from column A.
5. Remember the way we split the screen before? You should do that now so that you can see cells B9 and B407 at the same time.
6. In cell B8 type =NORM.INV(RAND(),$B$4,$B$5). We used the NORM.INV function in chapter 8 of this manual. Also, you may recall that we encountered the Excel function Rand() in chapters 5, 6, and 7. I recommend you pause here and go back and review the description of this function in chapter 5. I thought so. You were going to ignore my recommendation and not review Rand(), but I really think you should take time for the review. I don’t mind waiting.
7. Copy/Paste the contents of cell B8 to cells B9 to B407.
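For readers who want to see the same idea outside Excel, here is a minimal Python sketch of what the column of 400 formulas does. It is an illustration only, not part of MOS: the seed value and the helper name draw_proportion are arbitrary choices made for this example. Feeding a uniform random number into the inverse of the normal cumulative distribution is the same trick as =NORM.INV(RAND(),$B$4,$B$5).

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

n = 2400                          # voters in each simulated sample (cell B3)
p = 0.40                          # population proportion (cell B4)
sd_phat = sqrt(p * (1 - p) / n)   # s.d.(p-hat), the formula in cell B5

random.seed(1)                    # arbitrary seed (an assumption) so reruns repeat
sampling_dist = NormalDist(mu=p, sigma=sd_phat)


def draw_proportion():
    """One simulated sample proportion: inverse normal CDF of a uniform draw."""
    u = random.random()
    while u == 0.0:               # inv_cdf needs a value strictly between 0 and 1
        u = random.random()
    return sampling_dist.inv_cdf(u)


proportions = [draw_proportion() for _ in range(400)]

print(f"s.d.(p-hat)           = {sd_phat:.4f}")
print(f"mean of 400 draws     = {mean(proportions):.4f}")
print(f"std. dev. of 400 draws = {stdev(proportions):.4f}")
```

Because the draws are random, the mean and standard deviation of the 400 values will be close to, but not exactly, 0.40 and 0.01.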
If you want these simulated proportions to remain constant while you finish the exercise, now is the time to use Copy/Paste Special (and choose Values).

The final step is to create a histogram of the simulated proportions in cells B8 to B407. You might want to review the procedure for creating a histogram in chapter 2 of this manual (Data > Data Analysis > Histogram). Also, here’s a hint: If you want your histogram to have the same scale on the horizontal axis as the one in Figure 9.4, your bin numbers (to be entered in column C) must run from 0.37 to 0.43 with an interval between bin numbers of 0.0025. That is, the bins should be 0.37, 0.3725, 0.375, 0.3775, 0.38, etc. up to 0.43. Your result should be similar to the histogram below, but will probably not be exactly like it because of the use of random numbers in generating the 400 sample proportions.

Sample Means as Random Variables

Next, we will essentially repeat the process just described, but this time we will simulate selecting samples and computing sample means instead of sample proportions. Take a few minutes to review Example 9.7.

1. Start by typing the following entries as shown below. Note that the formula entered into cell E5 is the same as the formula for s.d.(x-bar) in Example 9.7.
2. Use Home/Fill to enter the numbers 1 through 400 in cells A8 through A407. Remember to type a 1 in cell A8 first. Split the screen so you can see row 407.
3. In cell B8 type =NORM.INV(RAND(),$B$5,$E$5). Copy/Paste this into cells B9 through B407.
4. Create a histogram of the sample means in cells B8 through B407.

The result should be similar to, though not exactly like, the histogram below, that is, if you use the appropriate bin values.

The t-Distribution

Finally, in this section, we will compute probabilities and t-values for the Student t-distribution. First review Section 9.9 of MOS. In that section the MOS authors briefly discuss the use of Excel functions TDIST and TINV.
We will now use the current versions of these functions, T.DIST and T.INV, to find the desired probabilities and t-values. First we will find the shaded area under the t-distribution curve in Figure 9.13. This is the area under the curve and to the right of t = 0.34 with 24 degrees of freedom.

1. In an empty Excel worksheet type a suitable title in cell A1. Then type the following entries in cells as shown below.
2. In cell A7 type Area =. Then in cell B7 type =1-T.DIST(B3,B5,1). After you press Enter, you should see 0.368407 in cell B7.

Notice that T.DIST, with 1 (True) as its last argument, works the same way as NORM.S.DIST. For a specified value of Z the function NORM.S.DIST returns the area under the standard normal curve to the left of Z, and for a specified value of t the function T.DIST returns the area under the t-distribution curve to the left of t. That is why we subtract the result from 1 to get the area to the right of t. The older function TDIST behaves differently: =TDIST(0.34,24,1) returns the area to the right of t directly. This is an inconsistency that we have to remember if we are to find the correct areas under these curves.

Now we reverse the process and find a t-value corresponding to a specified area under the curve.

1. In the same worksheet you used above, type the following entries into cells D3 to E5.

   Area =    0.975
   n=        25
   d.f. =    24

2. In cell D7 type t-value =. Then in cell E7 type =T.INV(E3,E5). After you press Enter, you should see what is shown below.

The functions T.INV and NORM.S.INV work the same way. For a specified area, each returns the value such that the area under the curve to the left of that value is the specified area. Again, the older function TINV is inconsistent with this. For a specified probability, TINV returns the t-value such that the combined area in the two tails, below -t and above +t, is the specified probability. So =T.INV(0.975,24) and =TINV(0.05,24) return the same t-value. We just have to remember this difference if we are to find the correct z-values and t-values with Excel.

Chapter 10: Estimating Proportions with Confidence

As you learned in Section 10.2 of MOS, when computing a confidence interval, you need a multiplier. The multiplier determines how many standard errors to include in the interval. The confidence level determines its value.
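To make the link between confidence level and multiplier concrete, here is a small Python sketch. It is an illustration only and not part of MOS; the function name z_multiplier is made up for this example. It assumes the usual definition: for a confidence level c, the z multiplier is the value with area c + (1 - c)/2 to its left under the standard normal curve, so that the central area between -z and +z is c.

```python
from statistics import NormalDist


def z_multiplier(conf_level):
    """z* such that the central area under the standard normal curve is conf_level."""
    left_area = conf_level + (1 - conf_level) / 2   # e.g. 0.90 -> 0.95
    return NormalDist().inv_cdf(left_area)


# Confidence levels commonly used for intervals
for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}  {z_multiplier(level):.3f}")
```

The inverse normal function used here plays the same role as Excel's inverse standard normal function.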
Let's see how Excel can help us find the correct multiplier. Excel has several built-in functions that relate to the normal distribution. The one we want is the function that takes an area under the standard normal curve and returns the z-value that corresponds to that area. This is the NORM.S.INV function (NORMSINV in older versions of Excel), and we discussed it in chapter 9 of this manual. You will now use that function to generate the first two columns in Table 10.1 of MOS.

1. Start with an empty Excel worksheet.
2. In cell B1 type the phrase Confidence Level, and in cell C1 type the word Multiplier.
3. In cells B2 through B5 type 0.90, 0.95, 0.98, and 0.99, respectively. Select those four cells and click the Percent Style icon (%) on the tool bar near the top of the screen.
4. Next type =NORM.S.INV(B2+(1-B2)/2) in cell C2. When you press the Enter key, you should see 1.644854 in cell C2.
5. Click on the cell and then on the Decrease Decimal icon. If you click on this icon three times, the contents of cell C2 should change to 1.645, the same as the value shown in Table 10.1 for a 90% confidence level.
6. Click on cell C2. Then click on the small black square in the lower right corner of that cell and drag down to the bottom of cell C5. Notice that both the formula and the format (three places to the right of the decimal point) are copied.

Your table of confidence levels and multipliers should look like this.

Now let's go to Example 10.3 and compute 90%, 95%, and 98% confidence intervals for the proportion of all Americans who believe there is intelligent life on other planets.

1. In cell B7 type the phrase Standard Error, and in cell C7 type =SQRT(0.6*(1-0.6)/935).
2. Round this to three places to the right of the decimal, and you should have 0.016, the same as the value shown in MOS.
3. Next Copy the contents of cells B1 through B4 into cells B9 through B12.
4. In cell C9 type M of E, an abbreviation for margin of error.
5. Then in cell C10 type =C2*$C$7 and press the Enter key.
6. Round this to three places to the right of the decimal, and you should have 0.026 as shown in MOS.
7. Copy the contents of cell C10 and Paste it into cells C11 and C12.

Your table of results should look like this.

A reminder: why did we need the dollar signs in the formula in cell C10? When you copy/pasted the contents of cell C10 into cell C11 and then C12, you wanted the reference to cell C2 to change to C3 and C4, respectively. However, you wanted the reference to cell C7 to remain the same. That is what the dollar signs do. They change a cell reference from "relative" to "fixed."

Our final task is to compute the confidence intervals for Example 10.3 using the sample proportion, 60%, and the margins of error just computed.

1. Type the contents shown below into cells D7 through E9.
2. Type =$E$7-C10 in cell D10 and =$E$7+C10 in cell E10.
3. Copy/Paste the contents of D10 and E10 into D11 through E12.
4. Use Percent Style to convert these six decimal values to percentages.

Your results should look like this.

Confidence Intervals for the Difference between Two Population Proportions

The final computation we will look at in this chapter is that of a confidence interval for the difference between two population proportions. Briefly review Example 10.9 in MOS. In this example the authors examine the difference between the proportions of young people who use the internet as their primary source of news and older people who use the internet as their primary source of news. Notice that the data have already been summarized for us. We are told that 92 people out of 262 in sample 1 use the internet as their primary source of news and 59 people out of 632 in sample 2 use the internet as their primary source of news. If we were given the original data instead of a summary, it would likely be in the form of two columns of data.
One column would contain the sample number, 1 or 2, and the other column would contain an indicator of whether or not the person uses the internet as their primary source of news, perhaps a Yes or a No. In that situation we would use Excel’s Pivot Table, which should be familiar by now, to summarize the data.

Let’s now compute the confidence interval for the difference between the proportions of people in the two populations who use the internet as their primary source of news. Recall that the two populations, from which the two samples were selected, are all young people and all older people.

1. In a blank worksheet, to which you have added a suitable title in cell A1, type Sample 1 and Sample 2 in cells B3 and C3, respectively.
2. Next, type the contents shown below in cells A4 through A10. Note that the headings in cells A9 and A10 correspond to quantities computed in Example 10.9 in MOS.
3. In cells B4 and C4 type the numbers of people in the two samples who use the internet as their primary source of news, 92 and 59, respectively.
4. In cells B5 and C5 type the sample sizes, 262 and 632, respectively.
5. In cell B6 type =B4/B5 to determine the sample proportion for sample 1. In cell C6 type =C4/C5.
6. Type =B6*(1-B6)/B5 in cell B7 and type =C6*(1-C6)/C5 in cell C7 as steps to computing the standard error of the difference in cell B10.
7. Type =B6-C6 in cell B9 to compute the difference between the two sample proportions.
8. In cell B10 type =SQRT(B7+C7), the standard error of the difference.

Your spreadsheet should now look like this:

You now have all of the pieces needed to compute the desired confidence interval. However, instead of a single confidence interval, you will now compute three intervals shown in MOS. Notice that this allows us to see how much the confidence interval changes when the confidence level is changed.

1. Type the following labels in cells A13 to C14.

   Confidence interval for difference
   Conf level    Lower    Upper

2. Type 0.90, 0.95, and 0.99 in cells A15 through A17.
Then select those cells and click the Percent Style icon [%] to change the fractions to percents.
3. In cell B15 type =$B$9-NORMSINV($A15+(1-$A15)/2)*$B$10, the formula for the lower limit of the 90% confidence interval. We use the function for the inverse of the standard normal distribution to find the z-multiplier.
4. Similarly, type =$B$9+NORMSINV($A15+(1-$A15)/2)*$B$10 in cell C15 to compute the upper limit of the 90% confidence interval.
5. Copy cells B15 and C15 and Paste them into cells B16 to C17.

You should now have these results:

Note that these results match the ones in Example 10.9 in MOS.

You now know how to use Excel to compute a confidence interval for a single population proportion and a confidence interval for the difference between two population proportions with independent samples.

Chapter 11: Estimating Means with Confidence

In chapter 10 we computed confidence intervals for single population proportions and for the difference between two population proportions. In this chapter we will compute confidence intervals for population means in a variety of situations. The first of these situations is computing a confidence interval for a single population mean based on a random sample of measurement data.

Determining the t-multiplier for a confidence interval

In order to use the T.INV function to find the t* multiplier we need two pieces of information, the desired confidence level, and the degrees of freedom, which is the sample size minus 1. Let’s use Example 11.4 to find the t* value for 24 degrees of freedom and 95% or 99% confidence. As the textbook states, we need a t-multiplier, t*, such that the area between -t* and +t* is 0.95 for a 95% confidence interval. The Excel function T.INV lets us find it. This function has two arguments. The first is the area under the curve to the left of t*, which is the confidence level plus half of the area that remains. For example, if we want a 95% confidence interval, then the first argument would be 0.95 + (1 - 0.95)/2 = 0.975. The second argument is the degrees of freedom. (The older function TINV, like the newer T.INV.2T, instead takes 1 - the confidence level as its first argument, 1 - 0.95 in this example.)

1. In an empty worksheet, type an appropriate title, perhaps something like Finding a t* multiplier.
2. Type Confidence Level in cell A3 and t* Multiplier in cell B3.
3. Next type Degrees of freedom = and 24 in cells D3 and E3, respectively.
4. In cells A4 and A5 type 0.95 and 0.99, respectively. Change these decimal values to percents in the usual way.
5. Type =T.INV(A4+(1-A4)/2,$E$3) in cell B4 to find the t* multiplier for 95% confidence and degrees of freedom = 24. Before continuing be sure that you recall why the dollar signs are included in the reference to cell E3.
6. Finally, Copy/Paste the contents of cell B4 to cell B5.

Your results should look like this:

Constructing a Confidence Interval for a Single Population Mean

Take a moment to reread Example 11.5 in Section 11.2 of MOS. We will compute a confidence interval for the mean forearm length of all men based on the sample of 9 forearm lengths in Example 11.5.

1. In a blank Excel worksheet type a title for the worksheet in cell A1, the word Forearm in cell A3, and the 9 forearm lengths in cells A4 through A12.
2. In cell A13 type Excel’s AVERAGE function to compute the sample mean of the forearm measurements. Specifically, type =AVERAGE(A4:A12).
3. In cell A14 type the Excel function STDEV to compute the sample standard deviation of the forearm measurements. That is, type =STDEV(A4:A12).
4. In cell A15 type 9 and in cell A16 type =A14/SQRT(A15) to compute the standard error of the mean.

When I followed these steps and added labels in column B, my results looked like this:

   25.50    = mean
   1.52     = std. dev.
   9.0      = n
   0.507    = std. err.

At this point we have all we need to compute a confidence interval except the t* multiplier.

5. Type the labels Confidence level = and t* multiplier = in cells C3 and C4, respectively.
6. In cell D3 type 0.95 representing a 95% confidence level. Remember to change the format of the number in cell D3 using the Percent Style icon [%].
7. Next type =T.INV(D3+(1-D3)/2,A15-1) in cell D4 to find the t* multiplier for 95% confidence and degrees of freedom = 8.
8. Finally, you are ready to determine the limits of the confidence interval. Type Confidence interval, Lower, and Upper in cells C5 through C7.
9. Type =A13-D4*A16 in cell D6 and =A13+D4*A16 in cell D7.

Your results should look like this.

So the 95% confidence interval for the mean forearm length for all men in the sampled population is the interval from 24.33 cm to 26.67 cm.

Checking Conditions

In Section 11.2 of MOS there is a description of “two situations for which a t confidence interval for one mean is valid.” The way to determine whether situation 1 holds is to look for skewness or outliers in the sample data. We next turn to Example 11.7 in MOS. You can compute the desired confidence intervals for mean number of hours slept using only the data summary in the example. However, you cannot check for skewness or outliers unless you go back to the original data. The data described are in the file UCDavis1.

1. Before you can analyze the data you must unstack the data in variable Sleep using the variable class. Recall that you learned how to unstack data back in chapter 2 of this manual. You might want to review that procedure now.
2. After you have unstacked the data into two columns you should create a histogram of the data for the Statistics 10 class (liberal arts majors). For the bin values I used the integers 2 through 12.
3. Next, create a histogram of the data for the Statistics 13 class (technical majors). For bin values I used integers 2 through 12.

My histograms look like this:

I used the same bins for both histograms so that they would be plotted on the same horizontal scale. This makes it easier to compare them. The histogram for Statistics 13 is quite symmetric and shows no outliers.
Besides, the sample is large enough, 148 data values, for situation 2 to hold. The histogram for Statistics 10 shows no outliers but does indicate some skewness. However, for a sample this small, only 25 data values, the skewness is not significant. Thus, we can reasonably assume situation 1 holds.

Now repeat the procedure we used in the previous section, this time to compute a confidence interval for the mean hours slept for each of the two statistics classes. I’ll bet you can do it without my help, but just as a check, I will show you what my results look like.

A Confidence Interval for the Population Mean Difference for Paired Variables

Next we will use essentially the same computational method for a different situation. The situation is one in which data are paired, two measurements for each individual in the sample. Examples of this are discussed in MOS.

Take a few minutes to reread Example 11.9 in MOS because that is our next challenge. Once again the data are in the workbook UCDavis1. This time the relevant variables are TV and computer, and you should find them in columns B and C. However, we are interested only in the 25 students in the Statistics 10 class. You can identify these students by the LibArts designator in the class variable (column L). So once again you need to unstack the variables to get at the data you need, TV watching times and computer usage times for the 25 students in Statistics 10. Then you need to set up a table like Table 11.2 in MOS except that your data should all be in one set of four columns in rows 2 through 26 with variable names in row 1.

Now you treat the difference column like a single variable for which you want to compute a confidence interval for the population mean. First you need to check for skewness or outliers by creating a histogram of the difference data. Here's what I got for my histogram:

While the distribution is not perfectly symmetric, the irregularities are not significant for such a small sample.
So we can proceed to compute the confidence interval for the difference data. You can do this by entering formulas as demonstrated in earlier sections of this chapter. Your computations should give the following results:

The multiplier t* is found using the formula =T.INV(0.95,G4-1) where G4 is the cell containing the sample size, 25. Note that the first argument of the function is 0.95 instead of 0.975. This is because Example 11.9 instructs us to find a 90% confidence interval, and 0.90 + (1 - 0.90)/2 = 0.95.

A Confidence Interval for the Difference between Two Population Means

Instead of the difference between population means for paired data, we will now consider the difference between population means when the two samples are independent. The formulas are starting to get rather complicated. Fortunately, Excel's Data Analysis comes to our rescue and does most of the computation for us. Unfortunately, it does not do quite all of it. But, before we can compute a confidence interval, we must check to see if the required conditions hold. These conditions are described in Section 11.4 of MOS.

The example we will use to illustrate the check on conditions and the computation of a confidence interval is Example 11.14. For the data we turn again to the workbook UCDavis1. As you will see in Example 11.14, you need to identify the sleep times for men and women in the Statistics 13 class. This requires unstacking the data using the variable sex. However, before you can do that you need to eliminate the rows representing the students in Statistics 10. These students are indicated by the designator LibArts in the class variable column. Here’s how to get at the data you need.

1. Open the workbook UCDavis1.
2. Insert a new Worksheet.
3. Copy/Paste the columns containing the variables sex, sleep, and class from the data worksheet into columns A through C of the new worksheet.
4. In the new worksheet, scroll down to the first row containing LibArts in the class column. This should be row 93.
If you continue scrolling down, you should find the last occurrence of LibArts in row 117.
5. Select all cells in columns A through C in those rows with LibArts in the class column. In the Home menu click Delete.
6. You can now delete the class column since you no longer need it.
7. Unstack the sleep column according to sex using the procedure we went through in chapter 2 with sleep times for males and females in adjacent columns. (I labeled my columns F Sleep and M Sleep.) When you finish you should have 83 sleep times for females and 65 sleep times for males.

Now you are ready to check the data for the situations described in MOS. As in the first example in this chapter, you need to plot two histograms. This time you will plot one for female sleep times and one for male sleep times. My histograms look like this:

Notice that both distributions are roughly symmetric. Although the 2-hour and 12-hour male sleep times might be considered outliers, they are not extreme. Since both samples contain more than 30 observations, situation 2 holds.

You are now ready to compute a 95% confidence interval for the difference in mean sleep times between males and females.

1. With the worksheet containing your unstacked sleep times on the screen, select Data/Data Analysis, click t-Test: Two-Sample Assuming Equal Variances, and click OK.
2. In the Variable 1 Range box specify the cells containing female sleep times. (I have the female sleep times in column D. So I entered D1:D84 in the Variable 1 Range box.)
3. In the Variable 2 Range box specify the cells containing male sleep times. Actually, it does not matter which variable you specify in which box. (I entered E1:E66.)
4. Click the box next to Labels.
5. For the Hypothesized Mean Difference, type 0.
6. Set Alpha at 0.05 if it is not already set at that value. Recall that Alpha = 0.05 corresponds to a 95% confidence level.
7. Under Output options select Output Range and type G1 in the box.
8. Click OK.
Your results should look like the highlighted table in the worksheet shown below. Notice that I have rounded values to three places to the right of the decimal point.

Unfortunately, you still have some work to do to get the desired confidence interval. The Data Analysis procedure you have just used is intended to test a hypothesis, but it gives the values needed to easily compute a confidence interval.

We will now compute the confidence interval shown under the table in the worksheet above.

9. In the worksheet you have just been working in, type the label Confidence interval for difference in cell G18 under the table of results.
10. In cells H17 and I17 type Lower and Upper, respectively.
11. In cell H18 type =(H4-I4)-H14*SQRT(H7*(1/H6+1/I6)). This formula computes the lower limit of the confidence interval assuming that your table of results from step 6 is in cells H1 through I14.
12. Finally, in cell I18 type =(H4-I4)+H14*SQRT(H7*(1/H6+1/I6)). This formula computes the upper limit of the confidence interval.

Note that the resulting confidence interval, -0.103 to 1.025, is the same as the one shown in Example 11.14 in MOS.

As the authors observe, “it is reasonable to assume that the population variances are similar.” Nevertheless, you should also compute the confidence interval without the assumption of equal variances to see if the result is significantly different. Be sure to save the worksheet containing the above table results. You will refer to it again near the end of chapter 13 of this manual.

1. With a worksheet containing your unstacked sleep times on the screen, select Data/Data Analysis, click t-Test: Two-Sample Assuming Unequal Variances, and click OK.
2. Follow steps 2 through 6 above exactly as you did for the equal variance analysis.
3. Under Output options select New Worksheet Ply and type G1 in the box. Your results should look like the table shown below.
4. Repeat steps 9 and 10 above.
5. In cell H18 type =(H4-I4)-H13*SQRT(H5/H6+I5/I6).
This formula computes the lower limit of the confidence interval assuming that your table of results from step 6 is in cells G1 through I13.
6. Finally, in cell I18 type =(H4-I4)+H13*SQRT(H5/H6+I5/I6). This formula computes the upper limit of the confidence interval.

You now know how to use Excel to compute a confidence interval for a single population mean and for the difference between two population means with either paired samples or independent samples.

Chapter 12: Testing Hypotheses about Proportions

Using The Original Data

By now you know the first thing I am going to say: you should review material in the text. At least reread Example 12.110 and its continuation. The question to be addressed is whether respondents tend to simply pick the first letter regardless of whether it is an S or a Q. The answer to this question is more important than you might think because it goes to the heart of an issue that has major importance for people who conduct survey research and those who administer local, state, and national elections. That is, do people tend to select the first option offered (or the first candidate) if they have no strongly held opinion on the issue (or the candidates for an office)?

We start with the data in workbook pennstate1. Recall that the variable SQpick is a record of which letter each respondent selected, and the variable Form indicates which letter was listed first on each respondent’s form. The first step is to summarize this data in a two-way pivot table. However, we have already done that in chapter 2. If you want to review the process, look in the section headed Summarizing Categorical Variables with the Pivot Table. Here is the pivot table we created in that section.

Count of SQpick    SQpick
Form               Q Picked    S Picked    Grand Total
Q Listed First     53          45          98
S Listed First     31          61          92
Grand Total        84          106         190

What we need to determine is the number of respondents who picked the letter listed first on their form.
As described in MOS, we see that (53 + 61)/190 = 60% of respondents selected the first letter on their form. Is this strong evidence of a preference for the first letter listed? In order to answer that question we need to test the null hypothesis H0: p ≤ 0.5 (50%).

1. In order to follow the same procedure I used, start by creating or copying the pivot table above into cells D3 to G7.
2. Type the following labels in cells A3 to A7.
   p0 =
   p-hat =
   Null s.e. =
   z =
   p-value =
3. Type 0.5 in cell B3. If you have the pivot table in the cells indicated in step 1, then type =(E5+F6)/G7 in cell B4 to compute p-hat, the sample proportion.
4. Type =SQRT(B3*(1-B3)/G7) in cell B5 to compute the null standard error.
5. Type =(B4-B3)/B5 in cell B6 to compute the z-statistic for the significance test. Your worksheet should now look like this:

The final step is to compute the p-value described in MOS.

6. Type =1-NORM.S.DIST(B6, TRUE) in cell B7, and you’re finished. You should see a result of 0.0029183, the same as the value shown in Figure 12.2 in MOS.

Notice that we subtract the value returned by NORM.S.DIST from 1. This is because the function returns the area under the curve to the left of the z-value, but we want the area to the right of the z-value, the right tail of the distribution. This tells us that the chances are only 3 in 1000 that 114 or more respondents out of 190 would select the first letter if there were no preference, that is, if the probability of any one respondent selecting the first letter were 50%.

Using a Summary of The Original Data

Next we will answer a question that is similar to the one just considered. However, there are two important differences. First, instead of having the original or “raw” data, we will have only a summary of the data. Second, we will use a two-sided test (H0: p = 0.5) instead of a one-sided test (H0: p ≤ 0.5).
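Before moving on, the one-sided test from the previous section can be cross-checked outside Excel. The following is a minimal sketch in Python, not part of the manual's procedure; the function name proportion_z_test and its two_sided switch are assumptions made for illustration. It repeats the worksheet arithmetic: sample proportion, null standard error, z, and the tail area.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0, two_sided=False):
    """Return (z, p_value) for a one-sample z-test of a proportion.

    Hypothetical helper: the one-sided p-value is the area to the right
    of z, the same quantity as =1-NORM.S.DIST(z, TRUE) in the worksheet.
    """
    p_hat = successes / n                   # sample proportion (cell B4)
    null_se = sqrt(p0 * (1 - p0) / n)       # null standard error (cell B5)
    z = (p_hat - p0) / null_se              # z-statistic (cell B6)
    right_tail = 1 - NormalDist().cdf(abs(z) if two_sided else z)
    p_value = 2 * right_tail if two_sided else right_tail
    return z, p_value


# 53 + 61 = 114 of the 190 respondents picked the letter listed first.
z, p_value = proportion_z_test(53 + 61, 190, 0.5)
print(round(z, 3), round(p_value, 7))
```

Passing two_sided=True doubles the tail area beyond |z|, which is the adjustment used for the two-sided test described next.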
In the previous situation we assumed that either there was no preference at all or there was a preference for the first letter. That is, we assumed that either p = 0.5 (50%) or p > 0.5 where p represents the proportion of respondents picking the first letter. In Example 12.13 (yes, you should now reread it) we make no assumptions about which foot is longer if there is a difference between right and left foot lengths. In other words, we state hypotheses that, for those people having different right and left foot lengths, the proportion for whom the right foot is longer is 0.5 (50%), the null hypothesis, or is something other than 0.5, the alternative hypothesis. If the proportion is not 0.5, then it may be either larger than 0.5 or smaller than 0.5. We have no basis for expecting it to be one rather than the other. This is represented in the alternative hypothesis by p ≠ 0.5.

Let’s test the null hypothesis using the data summary in Example 12.13 of MOS. A total of 112 respondents had unequal foot lengths. Of these, 63 had their right foot longer than their left foot.

1. In an empty Excel worksheet, type the following labels in cells A3 to A9.
   p0 =
   X =
   n =
   p-hat =
   Null s.e. =
   z =
   p-value =
2. Type 0.5, 63, and 112 in cells B3, B4, and B5, respectively. These values represent p from the null hypothesis, the number of respondents with a longer right foot, and the number of respondents with unequal foot lengths.
3. Type =B4/B5 in cell B6 to compute the sample proportion for the data.
4. Type =SQRT(B3*(1-B3)/B5) in cell B7 to compute the null standard error.
5. Type =(B6-B3)/B7 in cell B8 to compute the z-value.
6. Finally, type =2*(1-NORM.S.DIST(B8, TRUE)) in cell B9 to compute the p-value for the significance test.

Notice that this time we subtract the value returned by NORM.S.DIST from 1 and then multiply the result by 2.
This is because we want the area in the right tail of the distribution, to the right of +z, and the area in the left tail, to the left of –z, added together. Since, by the symmetry of the distribution, these areas are equal, we can compute one and double it. When you finish, your results should look like this:

The p-value here is the same as the one shown in Figure 12.45 in MOS.

Finding Exact Binomial P-values

Review Example 12.14 in MOS. In order to find exact binomial p-values, we will use the BINOM.DIST function that we first met in chapter 8.

1. Begin by typing the following information in an empty worksheet.
2. Type =BINOM.DIST(A9,$D$3,$D$4,TRUE) in cell B9. Copy/Paste this formula into cells B10 through B19.
3. Type =1-B9 in cell C9. Copy/Paste this formula into cells C10 through C19. Your results should look like this:

To summarize, you have just computed cumulative binomial probabilities (in column B) for any number of possible scores on a ten question true/false test, assuming the test-taker was guessing (p = 0.5 for each question). Then, in column C, you computed the p-value by subtracting the cumulative binomial probability from 1. This assumes a one-sided hypothesis test, with alternative either p < p0 or p > p0. As the In Summary table at the end of Example 12.14 shows, if we are testing a two-sided hypothesis (null hypothesis p = p0, alternative p ≠ p0), then we multiply the p-value in column C by 2.

Sample Size and the Power of the Hypothesis Test

Section 12.43 in MOS discusses the relationship among (1) the power of a hypothesis test, that is, its ability to detect when the alternative hypothesis is true, (2) the sample size, and (3) the true value of the parameter specified in the hypotheses. To illustrate this relationship, we will consider Example 12.2116 in MOS. This example involves a hypothesis test about a proportion. The null value, from the null hypothesis, is 0.50 or 50%. The level of significance, alpha, has been set at 0.05 or 5%.
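For readers who want a cross-check outside Excel, the power calculation that this section builds in a worksheet can be sketched in a few lines of Python. This is a sketch under the same normal approximation, not the manual's procedure; the function name power_one_sided is an assumption made for illustration. The true proportions (0.52, 0.60, 0.65) and sample sizes (50, 100, 400) are the ones used for Table 12.2.

```python
from math import sqrt
from statistics import NormalDist

NULL_P = 0.5   # null value from the hypotheses
ALPHA = 0.05   # level of significance


def power_one_sided(true_p, n, null_p=NULL_P, alpha=ALPHA):
    """Approximate power of the one-sided z-test (alternative p > null_p).

    Hypothetical helper mirroring the worksheet columns: null standard
    error, critical sample proportion, then the area above that critical
    value when the true proportion is true_p.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)        # about 1.645
    null_se = sqrt(null_p * (1 - null_p) / n)       # s.e.(p) null
    p_crit = null_p + z_crit * null_se              # reject H0 above this
    true_se = sqrt(true_p * (1 - true_p) / n)       # s.e. under true_p
    return 1 - NormalDist(mu=true_p, sigma=true_se).cdf(p_crit)


for n in (50, 100, 400):
    row = [round(power_one_sided(p, n), 3) for p in (0.52, 0.60, 0.65)]
    print(n, row)
```

Each printed row corresponds to one sample size, with the three true proportions across the columns, which is the same layout as the worksheet table.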
The procedure for computing the entries in Table 12.2 is fairly complex. So, I think some preliminary description of the steps involved will help.

1. If the null hypothesis is true, then the population proportion is 0.5 (50%), and the null standard error is, in Excel notation, SQRT(0.5*(1-0.5)/50) for a sample size of 50. This computes to 0.071.
2. The z-value that corresponds to an alpha of 0.05 (5%) is 1.645.
3. This means that for any sample proportion greater than 0.5 + 1.645 × 0.071 = 0.616, we would reject the null hypothesis.
4. The power of the hypothesis test is the probability of rejecting the null hypothesis when the alternative hypothesis is true. So, for this situation, the power of the test is the probability that a population with the specified true proportion would yield a sample of size 50 with a proportion greater than 0.616. If the sample size is increased, this value will change.

If you followed the above steps, the rest is all computational detail. Let’s get to it.

1. In a blank worksheet type the outline of Table 12.2. That is, enter the title, all headings, and all labels. Here is what my worksheet looked like after I set up my table.
2. Type s.e.(p) null in cell F5 and p-crit in cell G5. In columns F and G, you will compute the null standard error and the critical value of the sample proportion, that is, the value above which the null hypothesis is rejected. For a sample size of 50, this is shown in the four steps above. However, these values must also be computed for sample sizes of 100 and 400.
3. Type z-crit in cell F1 and =NORMSINV(0.95) in cell F2. You will need the z-value (1.645) in computing the critical sample proportion, p-crit.
4. Type =SQRT(0.5*0.5/B6) in cell F6. Copy/Paste this formula to cells F7 and F8.
5. Type =0.5+$F$2*F6 in cell G6 and Copy/Paste it to cells G7 and G8.
6. Now you need to compute the standard error of the sample proportion for each combination of a true population proportion (0.52, 0.6, and 0.65) and a sample size (50, 100, and 400). So, in cell H6 type =SQRT(C$5*(1-C$5)/$B6), and Copy/Paste it to cells H6 through J8.
7. Finally, you are ready to compute the power of the test for each of the nine empty cells above. In cell C6 type =1-NORMDIST($G6,C$5,H6,TRUE) and Copy/Paste it to cells C6 through E8.

If all went according to plan, your results should look like this:

Note that the values in cells C6 through E8 are the same power values as those shown in Table 12.2 except for cell E8. However, 0.999998, as shown in cell E8, is "nearly 1."

A Hypothesis Test for the Difference between Two Population Proportions

Our final hypothesis test in this chapter is for comparing two population proportions. We will use Example 12.17 in MOS to illustrate the computations in Excel. In this example the relevant proportion is the proportion of children getting ear infections. The experimental group took xylitol, and the control group took placebo syrup. The null hypothesis is H0: p1 – p2 = 0. In other words, the null hypothesis states that the two proportions are equal, that is, that xylitol does not reduce the frequency of ear infections. The alternative hypothesis, Ha: p1 – p2 > 0, states that the proportion of ear infections for the control group, p1, is larger than the proportion of ear infections for the experimental group, p2.

Notice that, in step 2, the authors verify that the necessary data conditions are met. This means it is appropriate to use the hypothesis test in Section 12.4.

1. In a blank worksheet, type a title in cell A1, and then type the following labels in cells A4 through A12.
2. Type Control and Exper in cells B3 and C3, respectively.
3. Type the data for the two samples in cells B4 through C5, and type the formulas for the sample proportions in cells B6 and C6. For example, in cell B6 type =B4/B5.
4. In cell B7 type =(B4+C4)/(B5+C5) to compute the combined proportion.
5. Type =B6-C6 in cell B9 to compute the sample statistic, the difference between the two sample proportions.
6. Type =SQRT(B7*(1-B7)*(1/B5+1/C5)) in cell B10 to compute the null standard error.
7. Type =B9/B10 in cell B11 to compute z. Note that, since the null value is zero, we can drop that term from the computation of z.
8. Finally, in cell B12 type =1-NORM.S.DIST(B11, TRUE) to compute the p-value. Remember that we subtract the value returned by the NORM.S.DIST function from 1 since we want the area to the right of z under the standard normal distribution. Your results should look like this:

Compare the values shown above with those in MOS.

Chapter 13: Testing Hypotheses about Means

In this chapter we will revisit the topic of chapter 11, namely, inference for means. We will look at each situation for which we computed a confidence interval in chapter 11, but this time we will test a hypothesis instead.

Testing a Hypothesis About A Single Population Mean

We begin with Example 13.1 in MOS that involves testing a hypothesis about normal (mean) human body temperature based on a set of 18 temperature measurements. The null hypothesis is that the mean human body temperature, for the whole population, is 98.6 degrees Fahrenheit. The alternative hypothesis is that the population mean is something less than 98.6 degrees. The sample temperature data are listed in Example 13.1.

Checking Conditions

As in chapter 12, before we apply the method, we must determine whether necessary data conditions are met. These conditions are described in MOS. Clearly, a sample of only 18 data values cannot be considered large. Thus, the hypothesis testing method described in the text is appropriate only if situation 1 holds. A histogram based on a sample this small is not reliable for determining whether the population of measurements (body temperatures, in this case) is normal or even symmetric.
The histogram below gives us no reason to think the sample data are not from a symmetric, perhaps normal population distribution. Furthermore, it shows no outliers. (I urge you to create this histogram for yourself to gain additional practice.)

An additional check is to compute both the sample mean and the sample median and compare them. If you use Excel functions AVERAGE and MEDIAN, you will find that the sample mean is 98.217 degrees, and the sample median is 98.2 degrees. The fact that these two statistics are very close to each other is an indication that there is little skewness and probably no outliers in the sample. It seems reasonable to assume that situation 1 holds. Thus, you should proceed with testing the hypothesis stated earlier.

1. If you have not already done so, type the body temperature data in cells A4 through A19 of a blank worksheet. I typed B Temp in cell A3 as a name for the temperature variable.
2. In cell A20 type =AVERAGE(A4:A19) to find the sample mean, and in cell A21 type =STDEV(A4:A19) to find the sample standard deviation. I tossed in these reminders of things you have already learned so that you wouldn’t have to look them up. Since I saved you that time, you have time to review Step 2 in MOS that describes how to compute the test statistic, t. Note the computation requires both the sample mean and the sample standard deviation.
3. In cell A22 compute the standard error of the mean by typing =A21/SQRT(16).
4. In cell A23 compute the test statistic by typing =(A20-98.6)/A22.
5. Finally, we will use the Excel function T.DIST to find the p-value corresponding to t = -3.22. Type =T.DIST(A23,15,TRUE) in cell A24.

To summarize, the results of your computations should look like this:

These values are the same as the ones shown in MOS.

A Hypothesis Test for The Population Mean Difference for Paired Variables

Next we revisit the situation in which we have paired data as described in MOS. Review Example 13.2 in Section 13.3.
The with-alcohol and without-alcohol data are listed, as are the differences between performances for the pilots tested.

1. Enter this data in cells A3 through D13 with the variable names in row 3. Note that you can use Home/Fill to enter the pilot numbers in column A and can compute the differences in column D (type =B4-C4 in cell D4 and Copy/Paste it to cells D5 through D13).
2. Repeat steps 2 through 5 above to compute the sample mean, sample standard deviation, standard error of the mean, test statistic, and p-value. I computed these statistics in cells D14 through D18 and added labels in cells C14 through C18. Here are my results.

These values are the same as the ones shown in MOS. It is necessary to verify the same data conditions as in the first example. Note that the boxplot in Figure 11.8 in MOS indicates skewness. Comparing the sample mean and the sample median supports this. The sample median is 125.5, which is much smaller than the sample mean of 195.6. However, the sample contains only ten pairs of observations. So, the skewness could result from taking such a small sample rather than from significant skewness in the distribution of population measurements.

A Hypothesis Test for the Difference between Two Population Means

We move now to the testing of the difference between two means when the samples are independent. Review Example 13.4 and the data that are in Example 11.11.

1. Type this data in columns A and B starting with variable names in row 3. Although it is not essential, you will find it easier to compare your results with those in the text if you put the No Stare data in column A and the Stare data in column B. Don’t forget to put a title for the worksheet in cell A1. Note that the sample sizes are different, 14 and 13.
2. Select Data/Data Analysis, click t-Test: Two-Sample Assuming Unequal Variances, and click OK.
Following the analysis in Example 13.4 – continued in MOS, we will not assume these samples come from populations with the same variances. That is, we will not make the equal variance assumption. If you compute the two sample standard deviations, you will find that one is 66% larger than the other (1.36/0.82 ≈ 1.66).
3. In the Variable 1 Range box specify the cells containing the No Stare data by typing A3:A17. In the Variable 2 Range box type B3:B16.
4. For the Hypothesized Mean Difference type 0.
5. Click the Labels box, and type 0.05 for Alpha if that is not already its value.
6. Click the radio button next to Output Range, and select cell D1. Click OK.

Your results should look like this:

The test statistic is 2.412, the same as the value shown in MOS. Since the appropriate test is a one-tail test, the p-value is 0.012, slightly different from the 0.013 in MOS.

Assuming Equal Variances (Pooled Two-Sample t-Test)

Next we turn to a two-sample t-test in which an equal variance assumption is reasonable. Refer to Example 13.6 in Section 13.4 in MOS. This example takes us back to the sleep time data for men and women in the Statistics 13 class. Notice that the sample standard deviations are 1.75 for females and 1.68 for males. With sample standard deviations that close, it is reasonable to assume the samples are drawn from populations having the same variance.

We used Data/Data Analysis/t-Test: Two-Sample Assuming Equal Variances to analyze this data in chapter 11 to compute a confidence interval. You might have saved the worksheet. Here is what the results looked like.

The test statistic, 1.62, and the p-value, 0.11 (two-tail test), are the same as the values shown in the table in Example 13.6 in MOS. You should go back and review how we got from the unstacked sleep time data to this table of results.

Chapter 14: More about Regression

Review

In this chapter we return to the topic of chapter 3, regression analysis.
Since it has probably been quite a while since you looked at that chapter of the Excel guide, I recommend you go back and review it, especially the final section on regression output. Let's use Example 14.5 in MOS to help us review that material. For each of the following review items, I will identify the elements of Example 14.5 (which is a repeat of Example 3.2). You should make sure you remember:

1. The distinction between the explanatory variable (age) and the response variable (distance)
2. How to use the regression option under Data/Data Analysis
3. How to find each of the following in the Excel regression results:
   a. The regression equation (distance = 577 – 3.01×age)
   b. The slope (–3.0068)
   c. The intercept (576.68)
   d. r² (0.642)
   e. The sum of squares total, SSTO, (193666.667)
   f. The sum of squares error, SSE, (69334.024)
   g. The standard deviation for regression or standard error, s, (49.762)

The relevant parts of the regression results from Excel are shown below.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.801244651
R Square             0.641992991
Adjusted R Square    0.629207026
Standard Error       49.76158305
Observations         30

ANOVA
              df    SS             MS             F             Significance F
Regression    1     124332.6425    124332.6425    50.2107592    1.041E-07
Residual      28    69334.02414    2476.215148
Total         29    193666.6667

              Coefficients     Standard Error    t Stat          P-value
Intercept     576.6819372      23.47088077       24.57010211     1.72511E-20
Age           -3.006835369     0.424337346       -7.085955066    1.041E-07

Inference About The Population Slope

Testing A Hypothesis About The Population Slope

If you have not already done so, perform a regression analysis on the driver age and highway sign reading distance data in workbook signdist. In the table immediately above, notice that one of the columns has the heading t Stat. In the row for age the t Stat value is –7.086, which is essentially the same as the value shown (–7.09) for the test statistic in Figure 14.5 in MOS. The p-value is shown as 1.04E-07.
If you are not familiar with scientific notation, this is shorthand for 0.000000104, which is 104 over 1 billion. That's a very small number. A probability that small means essentially that the event won't happen. Anyway, Excel's regression results give us all the information needed to test the hypothesis that the population slope is zero.

Computing A Confidence Interval for The Population Slope

To find a confidence interval for the population slope, we need to find a t* multiplier and perform one multiplication. Look at Example 14.5 – continued. We need to find a 95% confidence interval for the population slope (coefficient of age).

1. Type =T.INV(0.975,28) in any empty worksheet cell, and you will find t* is 2.0484 (rounded to 2.05 in MOS). (The older function TINV takes the two-tailed probability instead, so the equivalent entry is =TINV(0.05,28).) I have this in cell E5 in my worksheet.
2. The standard error for the age variable, s.e.(b1), is 0.424 as shown in the table above. In my worksheet this is cell F5.
3. Type =E5*F5 in an empty cell, changing the cell references as necessary to reflect where you have t* and s.e.(b1). The resulting value is 0.8692 (rounded to 0.87 in MOS).

Finally, subtract 0.87 from, and add it to, the slope –3.01 to find the confidence interval –3.88 to –2.14.

Testing Hypotheses about the Population Correlation Coefficient

Excel does not include this procedure in its Data Analysis add-in. However, we can use the formula in the tech note at the end of Section 14.3 of MOS.

1. In the worksheet containing the data for Example 14.5, use the CORREL function to find the correlation between age and distance. Type the function statement in cell E2.
2. In cell E5 type =SQRT(30-2)*E2/SQRT(1-E2^2). Finally, in cell E8, type =T.DIST.2T(ABS(E5),28). Your results should look like this:

Cell E5 contains the formula from the tech note. Cell E8 uses function T.DIST.2T, which has two arguments: the absolute value of the t-value computed in cell E5 and the degrees of freedom (the sample size minus 2, here 30 – 2 = 28).
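The two hand computations above, the confidence interval for the slope and the t-statistic for the correlation, can be cross-checked with a short Python sketch. The function names are assumptions made for illustration. The inputs are copied from the regression output shown earlier; the correlation is entered with a negative sign because the slope is negative (Excel's Multiple R is reported without a sign), and t* is hard-coded as 2.0484, the value quoted in the text, rather than computed.

```python
from math import sqrt

# Values copied from the Excel regression output for Example 14.5.
SLOPE = -3.006835369       # coefficient of Age
SE_SLOPE = 0.424337346     # standard error of the slope, s.e.(b1)
N = 30                     # number of observations
R = -0.801244651           # Multiple R, given the sign of the slope (assumption)
T_STAR = 2.0484            # t* for 95% confidence, 28 df, as quoted in the text


def slope_confidence_interval(slope, se, t_star):
    """Return (lower, upper) = slope -/+ t* x s.e.(b1)."""
    margin = t_star * se
    return slope - margin, slope + margin


def correlation_t_statistic(r, n):
    """t = sqrt(n - 2) * r / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return sqrt(n - 2) * r / sqrt(1 - r ** 2)


lower, upper = slope_confidence_interval(SLOPE, SE_SLOPE, T_STAR)
t_stat = correlation_t_statistic(R, N)
print(round(lower, 2), round(upper, 2), round(t_stat, 3))
```

The t-statistic computed from the correlation agrees with the t Stat reported for Age in the regression output, which is expected in simple regression: testing that the correlation is zero and testing that the slope is zero are the same test.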
Finding a Prediction Interval or a Confidence Interval for y, the response variable

Excel’s regression tool does not compute prediction or confidence intervals for y, the response variable. Since the textbook does not provide the formulas needed for these computations, we will not compute these intervals using Excel.

Checking The Conditions for Linear Regression with Plots

We will use Example 14.3 – continued in MOS to examine how we can make use of scatterplots and histograms to check the conditions for linear regression.

1. Open the workbook wtheightM.
2. Create a scatterplot of the data. Make sure height is on the horizontal or X axis and weight is on the vertical or Y axis. Below is the result, which is the same as that shown in MOS.
3. Go to Data/Data Analysis and select Regression. The Regression window will open. You can refer back to chapter 5 of this manual to refresh your memory of the information required in this window.
4. In the Input Y Range box, specify the cells containing weight data.
5. In the Input X Range box, specify the cells containing height data. Click the box next to Labels.
6. Under Output options type in a name for the New Worksheet Ply.
7. Click the boxes next to Residual Plots and Normal Probability Plots.

The residual plot and normal probability plot should look like this:

The residual plot, on the left, is essentially the same as the one shown in Figure 14.10 in MOS. The normal probability plot, on the right, is mentioned but not shown in MOS. This plot is constructed in such a way that, if the residuals are normally distributed, the points will form a straight line. I have added a trend line to show that this probability plot is approximately a straight line, which indicates the residuals are approximately normally distributed.
Chapter 15: More about Categorical Variables

The Chi-Square Test Revisited

In this chapter we turn our attention again to relationships between categorical variables, a subject we first looked at in chapter 4. As usual I am going to suggest that you go back and review that chapter in this Excel guide. In addition, as a way of reviewing what we did there, let’s look at Example 15.1 in MOS.

1. In a blank worksheet, with a suitable title added in cell A1, type the labels and cell counts from Table 15.1. Use the AutoSum [Σ icon] to compute row and column totals. We don’t need the row percentages. So, you can ignore them.
2. Compute a table of expected counts. Follow the procedure we used in chapter 6. Here is what the resulting tables should look like:
3. Use Excel’s function CHISQ.TEST to find the p-value. Again review the way we did this in chapter 4, and follow the same procedure. You should get a p-value of 0.035.
4. If you want to find Chi-square, use the CHISQ.INV.RT function as we did in chapter 6. With the p-value in cell B19, here is what the function window should look like:

Here are the tables and the results all together:

Chi-Square Goodness of Fit Test

Chi-square is also used to test a hypothesis about a single categorical variable. This application of chi-square is called the goodness of fit test. We will look at the computations involved in this test using Example 15.13 in MOS. Note the null hypothesis stated in the example.

1. In a blank worksheet, with a suitable title added in cell A1, type labels Number, Observed, Expected, and Chi-term in cells A5 to D5. I put a title for the table in cells A3 and A4.
2. Type the digits 0 through 9 in cells A6 through A15. These represent the ten possible numbers drawn for the first digit of the Daily Number.
3. Type the observed frequencies in cells B6 through B15.
4. Type the expected frequency, 50, in cells C6 through C15. It may seem like a waste of time to enter the same number into all ten cells.
However, in general, the expected frequencies for a goodness of fit test will not all be the same. So, I am demonstrating the general form for the computations.
5. In cell D6 type =(B6-C6)^2/C6. Then Copy/Paste this to cells D7 through D15. The symbol ^ means raise to the power. ^2 indicates that Excel should compute (B6-C6) and square the result.
6. In cell D16 sum the values in cells D6 through D15. You can use the AutoSum icon [Σ] for this. This step computes the chi-square value, 6.04.
7. Finally, type =1-CHISQ.DIST(D16,9, TRUE) in cell D17 in order to find the p-value corresponding to a chi-square value of 6.04 with degrees of freedom 10 – 1 = 9. The result should look like those shown below.

The large p-value, 0.736, does not mean we have shown the null hypothesis (p = 1/10 for each of the ten possible digits in the first container) to be true. Rather it means we have not found evidence to reject it. This is a very important distinction in hypothesis testing.

Chapter 16: Analysis of Variance

We will use Example 16.1 in MOS to illustrate the computations required to perform a one-way analysis of variance, or ANOVA. However, before we can get to the computations, you need to do some data manipulation to retrieve the needed data and get it into a form that you can use to perform an ANOVA using Excel. In case you are thinking this data manipulation is a distraction from the work of computation, don’t! It is often the case that data are not in the form we need to perform the desired analysis. Knowing how to combine and rearrange data is just as important a part of statistical analysis as knowing which method to use.

1. Open the workbooks containing the data for Example 16.1 (UCDavis1 and UCDavis2).
2. Select cell B2. Actually, any cell in the data will work. Select Data/Sort. In the Sort window, under Sort by, you should see GPA. To the right of that click Largest to Smallest.
3. Scroll to the bottom of the data.
You should find several rows in which there is a seat preference designator but no GPA. Missing data! These data are from students who indicated where they prefer to sit but did not give their grade point average. There is nothing we can do with their partial data. So, delete the Seat designators for these students.

You are almost finished rearranging the data, but not quite. You now need to Sort by Seat. Unstack the GPA data into three columns, according to seating preference (B, M or F). I showed you how to do this in chapter 2. So I will leave you to review that and do the unstacking. So that we have our data in the same cells, put your unstacked GPA data in columns A, B, and C in a new worksheet. Don’t forget to put variable names in row 1. I put front-seaters in column A, middle-seaters in column B, and back-seaters in column C and used variable names GPA F, GPA M, and GPA B. When you are finished, you should find 90 students who prefer front seats, 222 who prefer the middle, and 79 who prefer to sit in the back for a total of 391 students.

Now you are ready to do some ANOVA. The first thing to consider is whether the conditions specified in MOS are satisfied. The three sets of GPA data satisfy the conditions for doing an F-test as discussed in Example 16.1 – continued.

1. Select Data/Data Analysis/ANOVA: Single Factor. Click OK.
2. In the Input Range box, specify all cells in columns A through C that contain data. For me this was cells A1 through C223. Note that this means including several empty cells in the selected range, but that is not a problem.
3. Click Labels in First Row. Set Alpha at 0.05 if that is not already its value.
4. Type a name in the box to the right of New Worksheet Ply. Click OK.

Your results should look like this.

Note that I have highlighted the p-value in the ANOVA table. Unfortunately, Excel does not include the Tukey or Fisher’s paired comparisons or Kruskal-Wallis test or Mood’s median test.
For these you will need to turn to software that is designed specifically for statistical analysis.
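To see what the ANOVA: Single Factor tool computes from the three unstacked columns, here is a minimal sketch in Python. The function name one_way_anova_f is an assumption made for illustration, and the three short lists are made-up values used only to show the call; they are not the UCDavis data from Example 16.1. The sketch returns the F statistic and its degrees of freedom and, like Excel's tool, accepts groups of unequal size.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Hypothetical helper: groups is a list of lists, one list per group,
    which may have different lengths.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up GPA-like values for illustration only (NOT the Example 16.1 data).
front = [3.4, 3.1, 3.6, 3.3]
middle = [3.0, 3.2, 2.9, 3.1, 3.3]
back = [2.8, 3.0, 2.7]

f_stat, df_between, df_within = one_way_anova_f([front, middle, back])
print(round(f_stat, 3), df_between, df_within)
```

The p-value requires the F distribution, which the Python standard library does not provide, so this sketch stops at the F statistic.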
