JMP® Technology Manual to Accompany
Statistics: Learning from Data
Roxy Peck
California Polytechnic State University, San Luis Obispo, CA

Prepared by
Alexander Kolesnik
Ventura College, Ventura, CA

© 2014 Cengage Learning. All Rights Reserved. This content is not yet final and Cengage Learning does not guarantee this page will contain current material or match the published product.

Australia • Brazil • Mexico • Singapore • United Kingdom • United States

Contents*
Chapter 1 Introduction
Chapter 2 Graphical Methods for Describing Data Distributions
Chapter 3 Numerical Methods for Describing Data Distributions
Chapter 4 Describing Bivariate Numerical Data
Chapter 6 Random Variables and Probability Distributions
Chapter 9 Estimating a Population Proportion
Chapter 10 Asking and Answering Questions about a Population Proportion
Chapter 11 Asking and Answering Questions about the Difference between Two Population Proportions
Chapter 12 Asking and Answering Questions about a Population Mean
Chapter 13 Asking and Answering Questions about the Difference between Two Population Means
Chapter 15 Learning from Categorical Data

*Chapters 5, 7, 8, and 14 have been omitted from this guide since they contain no material relevant to JMP.

Chapter 1 Introduction

This manual accompanies Statistics: Learning from Data by Roxy Peck. It is intended to be used in conjunction with the text, so each chapter of this book corresponds to a chapter in the main text. You’ll find examples from each chapter worked out here, intended to show you how to use JMP for all the problems in the text. This book is not intended to be a complete user’s guide to JMP. If you have questions about specific capabilities of JMP, refer to the online help.

About JMP

JMP (pronounced "jump") is a computer program for statistics developed by the JMP business unit of SAS Institute. It was created in the 1980s to take advantage of the graphical user interface introduced by the Macintosh. It has since been improved and made available for other operating systems.

Statistical Analyses in JMP

This book describes the step-by-step commands for all of the required statistical computations using the software. The data, or a summary of the data, must be in a JMP data table before the statistical analysis can be run. The results of each analysis are then shown.

Chapter 2 Graphical Methods for Describing Data Distributions

This chapter is designed to make the data collected in a statistical study easier to “see” by summarizing the data graphically and numerically, as opposed to just a list of observations. We will look at examples to see how JMP can be used to create these summaries.

Example 2.4 How Far Is Far Enough?
We are told: “Each year, The Princeton Review conducts surveys of high school students who are applying to college and of parents of college applicants. The report ‘2009 College Hopes & Worries Survey findings’ (www.princetonreview/college-hopes-worries-2009) included a summary of how 12,715 high school students responded to the question ‘Ideally how far from home would you like the college you attend to be?’ Students responded by choosing one of four possible distance categories. Also included was a summary of how 3,007 parents of students applying to college responded to the question ‘How far from home would you like the college your child attends to be?’ The accompanying relative frequency table summarizes the student and parent responses.”

We would like to make a comparative bar chart for this example. The relative frequency table is in the JMP data table shown below.

We start the graphing with the following JMP commands:

Select Graph->Graph Builder
Select Ideal Distance (Miles) and drag to the X variable area. This gives us the appropriate label along the bottom of the graph (the horizontal axis).
Select the Bar graph from the pictures of the graphs. This changes the dots in the graph to bars whose heights represent the relative frequencies, with the scale on the vertical axis.
Select Relative Frequency (Students) and drag into the graph area. This now gives us the bar graph for just the students.
Select Relative Frequency (Parents) and also drag into the graph area. This will give us the bar graph for parents, next to the ones already displayed for the students.
Select Done

The completed comparative bar graph is shown below. It allows us to visually compare the differences between students and parents.

Example 2.6 Graduation Rates Revisited

We are told: “The article referenced in Example 2.5 also gave graduation rates for all student athletes at the 63 schools in the 2009 Division I basketball playoffs.
The data are listed below. Also listed are the differences between the graduation rate for all student athletes and the graduation rate for basketball players.”

We would like to make a comparative dotplot for this example. The given data is in the JMP data table shown below.

Click Graph and then select Chart
Click and drag the Basketball column into Categories, X, Levels
Select Point Chart, as we want the data to be dots on our graph
Click OK
Repeat for Athletes, and then All-BB

We can compare the graphs for the basketball players and all athletes, and also look at the difference graph. This last graph has both positive and negative values for the differences. The positive differences indicate that those schools had higher graduation rates for all athletes than they did for their basketball players.

Example 2.13 Enrollments at Public Universities

We are told: “States differ widely in the percentage of college students who are enrolled in public institutions. The National Center for Education Statistics provided the accompanying data on this percentage for the 50 U.S. states for fall 2007.”

We would like to make a histogram for this example. The given data is in the JMP data table shown below.

Select Analyze and then Distribution
Click and drag the selected column into Y, Columns
Click Histograms Only
Click OK
Click the red arrow next to Percent of Students. Under Histogram Options, deselect Vertical, and select Show Counts and Show Percents

We now see the histogram for this data set, along with the frequency counts and relative frequencies (written as percentages) above the bars corresponding to each class interval.

Example 2.17 Worth the Price You Pay?

We are told: “The accompanying table gives the cost and an overall quality rating for 10 different brands of men’s athletic shoes (www.consumerreports.org).”

We would like to make a scatterplot for this example. The given data is in the JMP data table shown below.
Click Analyze and then select Fit Y by X
Click and drag the Cost column into X, Factor window
Click and drag the Rating column into Y, Response window
Click OK

The scatterplot for this data set is shown above.

Chapter 3 Numerical Methods for Describing Data Distributions

In Chapter 2, graphical displays were used to summarize data. By creating a visual display of the data distribution, it is easier to see and describe its important characteristics, such as shape, center, and spread. In this chapter, you will see how numerical measures are used to describe important characteristics of a data distribution. We will again be using JMP to generate statistical output with the desired numerical measures.

Example 3.6 Thirsty Bats

We are told: “The short article ‘How to Confuse Thirsty Bats’ (nature.com) summarized a study that was published in the journal Nature Communications (‘Innate Recognition of Water Bodies in echolocating Bats,’ November 2, 2010). The article states ‘Echolocating bats have a legendary ability to find prey in the dark—so you’d think they would be able to tell the difference between water and a sheet of metal. Not so, report Greif and Siemers in Nature Communications. They have found that bats identify any extended, echoacoustically smooth surface as water, and will try to drink from it.’ This conclusion was based on a study where bats were placed in a room that had two large plates on the floor. One plate was made of wood and had an irregular surface. The other plate was made of metal and had a smooth surface. The researchers found that the bats never attempted to drink from the irregular surface, but that they made repeated attempts to drink from the smooth, metal surface.
The number of attempts to drink from the smooth metal surface for 11 bats are shown here:

66 144 13 26 94 163 8 125 1 64 56

These data will be used to select, compute, and interpret appropriate summary measures of center and spread.”

We will use JMP to compute these summary measures. The given data is in the JMP data table shown below.

Click Analyze and then select Distribution
Click and drag selected column to Y, Columns
Click OK
If you want other statistics, click on the red arrow next to the name of the column, Number of Drinking Attempts
Select Display Options, and then Customize Summary Statistics
Select desired statistics (such as the ones shown clicked below)
Click OK
Close Quantiles

The summary statistics selected are shown above, indicating the mean is 69.090909 and the standard deviation is 56.351494. These measures provide us with a good glimpse at the data.

Example 3.11 Higher Education

We are told: “The Chronicle of Higher Education (Almanac Issue, 2009–2010) published the accompanying data on the percentage of the population with a bachelor’s degree or graduate degree in 2007 for each of the 50 U.S. states and the District of Columbia. The 51 data values are:

21 27 26 19 30 35 35 26 47 26 27 30 24 29 22 24 29 20 20 27 35 38 25 31 19 24 27 27 23 34 34 25 32 26 26 24 22 28 26 30 23 25 22 25 29 33 34 30 17 25 23

These data will be used to select, compute, and interpret appropriate summary measures of center and spread.”

We will use JMP to compute these summary measures. The given data is in the JMP data table shown below.

Click Analyze and then select Distribution
Click and drag the selected column to Y, Columns
Click OK
Hide the histogram and summary statistics

The maximum of 47, minimum of 17, median of 26, and the 1st and 3rd quartile values, 24 and 30 respectively, are shown in the output above. We use the five-number summary here to get a better representation of the center and spread.
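
As a cross-check outside JMP, the summary measures in Examples 3.6 and 3.11 can be reproduced with a few lines of Python. This is only a sketch and not part of the JMP workflow: the variable names are made up for illustration, and the quartiles use the standard library's default "exclusive" method, which is assumed here (not confirmed) to follow the same rule as JMP. For these 51 values it gives the same quartiles as the output above.

```python
import statistics

# Example 3.6: number of drinking attempts for the 11 bats
bat_attempts = [66, 144, 13, 26, 94, 163, 8, 125, 1, 64, 56]
bat_mean = statistics.mean(bat_attempts)
bat_sd = statistics.stdev(bat_attempts)  # sample standard deviation (divides by n - 1)

# Example 3.11: percentage with a bachelor's or graduate degree (51 values)
pct_degree = [
    21, 27, 26, 19, 30, 35, 35, 26, 47, 26, 27, 30, 24, 29, 22, 24, 29,
    20, 20, 27, 35, 38, 25, 31, 19, 24, 27, 27, 23, 34, 34, 25, 32, 26,
    26, 24, 22, 28, 26, 30, 23, 25, 22, 25, 29, 33, 34, 30, 17, 25, 23,
]

# quantiles(..., n=4) returns the three quartile cut points;
# the default method is "exclusive" (an assumption about matching JMP)
q1, median, q3 = statistics.quantiles(pct_degree, n=4)
five_number = (min(pct_degree), q1, median, q3, max(pct_degree))

print(f"Bats: mean = {bat_mean:.6f}, sd = {bat_sd:.6f}")
print("Five-number summary (min, Q1, median, Q3, max):", five_number)
```

Other quartile conventions (for example `method="inclusive"`) can give slightly different values for Q1 and Q3, so small disagreements between software packages are not necessarily errors.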
Example 3.13 Video Game Practice Strategies

We are told (from Example 3.12): “The authors of the paper ‘Striatal Volume Predicts level of Video game Skill Acquisition’ (Cerebral Cortex [2010]: 2522–2530) studied a number of factors that affect performance in a complex video game. One factor was practice strategy. Forty college students who all reported playing video games less than 3 hours per week over the past two years and who had never played the game Space Fortress were assigned at random to one of two groups. Each person completed 20 two-hour practice sessions. Those in the fixed priority group were told to work on improving their total score at each practice session. Those in the variable priority group were told to focus on a different aspect of the game, such as improving speed score, in each practice session. The investigators were interested in whether practice strategy makes a difference. They measured the improvement in total score from the first practice session to the last. Improvement scores (approximated from a graph in the paper) for the 20 people in each practice strategy group are given below.”

We will use JMP to construct a boxplot. The given data is in the JMP data table shown below.

Click Analyze and then select Distribution
Click and drag selected column to Y, Columns
Click OK
Close the Histogram, Quantiles, and Summary Statistics

The boxplot is shown above.

Chapter 4 Describing Bivariate Numerical Data

What can you learn from bivariate numerical data? A good place to start is with a scatterplot of the data. If it appears that the two variables that define the data set are related, it may be possible to describe the relationship in a way that allows you to predict the value of one variable based on the value of the other. For example, if there is a relationship between a blood test measure and age and you could describe that relationship mathematically, it might be possible to predict the age of a crime victim.
If you can describe the relationship between fuel efficiency and the weight of a car, you could predict the fuel efficiency of a car based on its weight. In this chapter, you will see how this can be accomplished. We will use JMP to create scatterplots for given data sets, to find the correlation between variables, and to create and interpret regression equations.

Example 4.3 Does It Pay to Pay More for a Bike Helmet?

We are told: “Are more expensive bike helmets safer than less expensive ones? The accompanying data on x = price and y = quality rating for 11 different brands of bike helmets is from the Consumer Reports web site (www.consumerreports.org/health). Quality rating was a number from 0 (the worst possible rating) to 100 and was determined using factors that included how well the helmet absorbed the force of an impact, the strength of the helmet, ventilation, and ease of use.”

The data set for this example is in the JMP data table below. We will use JMP to create a scatterplot comparing price and quality rating, and then to find the correlation between the two variables.

For the Scatterplot:

Click Analyze and then select Fit Y by X
Click and drag Price to X, Factor
Click and drag Quality Rating to Y, Response
Click OK

The scatterplot appears above, with the price of the helmets on the x-axis and the quality rating on the y-axis.

For the Correlation:

Click Analyze and then select Multivariate Methods, and then Multivariate
Click both selected columns and drag each to Y, Columns
Click OK

The correlation between price and quality rating is given as 0.3034 in the output above.

Example 4.6 It May Be a Pile of Debris to You, but It Is Home to a Mouse

We are told: “The accompanying data is a subset of data from a scatterplot that appeared in the paper ‘Small Mammal Responses to fine Woody Debris and Forest Fuel Reduction in Southwest Oregon’ (Journal of Wildlife Management [2005]: 625–632).
The authors of the paper were interested in how the distance a deer mouse will travel for food is related to the distance from the food to the nearest pile of fine woody debris. Distances were measured in meters.”

The data set for this example is in the JMP data table below. We will use the software to graph the regression line (on top of the scatterplot), and to come up with the regression equation.

Click Analyze and then select Fit Y by X
Click and drag Distance from Debris column to X, Factor
Click and drag Distance Traveled column to Y, Response
Click OK
Click on red arrow next to Bivariate Fit of … and select Fit Line

The output above shows the regression line (in red) on the scatterplot, and the regression equation (along with the related measures).

For residuals:

Click on red arrow next to Linear Fit
Select Save Residuals

The residuals are now shown in the data table above.

Chapter 6 Random Variables and Probability Distributions

One way to learn from data is to use information from a sample to learn about a population distribution. In this situation, you are usually interested in the distribution of one or more variables. For example, an environmental scientist who obtains an air sample from a specified location might be interested in the concentration of ozone (a major constituent of smog). Before selection of the air sample, the value of the ozone concentration is uncertain. Because the value of a variable quantity such as ozone concentration is subject to uncertainty, such variables are called random variables. In this chapter, you will learn how probability models are used to describe the behavior of random variables.
Example 6.21 Newborn Birth Weights

We are told: “Data from the paper ‘Fetal growth Parameters and Birth Weight: their relationship to neonatal Body Composition’ (Ultrasound in Obstetrics and Gynecology [2009]: 441–446) suggest that a normal distribution with a mean of 3,500 grams and standard deviation of 600 grams is a reasonable model for the probability distribution of birth weight of a randomly selected full-term baby. What proportion of birth weights are between 2,900 and 4,700 grams?”

For this example, we will use the formula editor function in JMP.

Click Rows and select Add Rows
Type in 1 and click OK
Double-click on column 1
Click Column Properties and select Formula
Click Edit Formula
Click Probability and select Normal Distribution
Click on the red box around x and type in -1.00. We use -1 since 2900 is exactly 1 standard deviation (600) below the mean of 3500.

The probability to the left of 2900 is given in column 1. This tells us that the proportion of babies that weigh below 2900 grams is 0.1586552539.

Now repeat to find the probability to the left of 4700, entered in column 2. This time we type in 2 in the red box, since 4700 is 2 standard deviations (1200) above 3500.

This tells us that the probability to the left of 4700 is 0.9772498681, which is the proportion of babies that weigh below 4700 grams. Now subtract column 1 from column 2 to find the probability of being born between 2900 and 4700 grams: 0.9772498681 - 0.1586552539 = 0.8185946142.

To find the probability of a birth weight greater than 4500 grams, repeat the process with 1.67 entered for x in the Normal Distribution. We put in 1.67 since 4500 is 1.67 standard deviations above the mean of 3500. In other words, the difference of 1000 divided by 600 gives us 1.67.

So the probability of being below 4500 is 0.9525403182. Now subtract from 1 to find the probability to the right of 4500. So the proportion of babies that weigh more than 4500 grams is 0.0474596818.
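
The same probabilities can be checked outside JMP. The sketch below uses Python's standard-library `NormalDist` as a stand-in for the Normal Distribution function in the JMP formula editor; the variable names are illustrative only, and the z value 1.67 is rounded exactly as in the steps above.

```python
from statistics import NormalDist

MEAN, SD = 3500, 600          # birth weight model from Example 6.21 (grams)
std_normal = NormalDist()     # standard normal: mean 0, standard deviation 1

# Convert each weight to a z-score, then take the area to its left
p_below_2900 = std_normal.cdf((2900 - MEAN) / SD)   # z = -1
p_below_4700 = std_normal.cdf((4700 - MEAN) / SD)   # z = 2
p_between = p_below_4700 - p_below_2900

# Right-tail area beyond 4500 grams, using the rounded z = 1.67
p_above_4500 = 1 - std_normal.cdf(1.67)

print(f"P(X < 2900)        = {p_below_2900:.10f}")
print(f"P(X < 4700)        = {p_below_4700:.10f}")
print(f"P(2900 < X < 4700) = {p_between:.10f}")
print(f"P(X > 4500)        = {p_above_4500:.10f}")
```

Using the unrounded z-score (1000/600) in place of 1.67 changes the last answer slightly, to about 0.0478, so the rounding of z is worth noting when comparing results across tools.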
Chapter 9 Estimating a Population Proportion

When a sample is selected from a population, it is usually because you hope it will provide information about the population. For example, you might want to use sample data to learn about the value of a population characteristic such as the proportion of students enrolled at a college who purchase textbooks online or the mean number of hours that students at the college spend studying each week. This chapter considers how sample data can be used to estimate the value of a population proportion.

Example 9.5 Dangerous Driving

We are told: “The article ‘Nine out of Ten Drivers Admit in Survey to Having Done Something Dangerous’ (Knight Ridder Newspapers, July 8, 2005) reported on a survey of 1,100 drivers. Of those surveyed, 990 admitted to careless or aggressive driving during the previous 6 months. Assuming that it is reasonable to regard this sample of 1,100 as representative of the population of drivers, you can use this information to construct an estimate of p, the proportion of all drivers who have engaged in careless or aggressive driving in the last 6 months.”

A summary of the results from this study is in the JMP data table below.

Click Analyze and select Distribution
Click and drag Column 1 to Y, Columns
Click and drag Column 2 to Freq
Click OK
Click on the red arrow next to Column 1, select Confidence Interval with level of 0.90

The 90% confidence interval is shown above, with the lower value being 0.884126 and the upper value being 0.913911.

Chapter 10 Asking and Answering Questions about a Population Proportion

Two types of inference problems are considered in this text. In estimation problems, sample data are used to learn about the value of a population characteristic. In hypothesis testing problems, sample data are used to decide if some claim about the value of a population characteristic is plausible. In Chapter 9, you saw how to use sample data to estimate a population proportion.
In this chapter, you will see how sample data can also be used to decide whether a claim, called a hypothesis, about a population proportion is believable.

Example 10.11 Unfit Teens

We are told: “The article ‘7 Million U.S. Teens would Flunk Treadmill Tests’ (Associated Press, December 11, 2005) summarized a study in which 2,205 adolescents ages 12 to 19 took a cardiovascular treadmill test. The researchers conducting the study believed that the sample was representative of adolescents nationwide. Of the 2,205 adolescents tested, 750 had a poor level of cardiovascular fitness. Does this sample provide support for the claim that more than thirty percent of adolescents have a poor level of cardiovascular fitness?”

A summary of the results from this study is in the JMP data table below.

Click on Analyze then select Distribution
Click on Column 1 and drag to Y, Columns
Click on Column 2 and drag to Freq
Click OK
Click the red arrow next to Column 1 and select Test Probabilities
Enter 0.3 into Hypoth Prob space for Poor level, since we would like to test the claim that 30% are in this category
Select “probability greater than hypothesized value” option, since we are doing a one-sided alternative (the JMP software will do an exact one-sided test based on the binomial distribution)
Click Done

The software output above tells us that the p-value is <0.0001, leading us to reject the null hypothesis that the proportion is 0.3 and to conclude that more than 30% of adolescents have a poor level of cardiovascular fitness.

Chapter 11 Asking and Answering Questions about the Difference between Two Population Proportions

Many statistical investigations involve comparing two populations. In Chapters 9 and 10, you saw how sample data could be used to estimate a population proportion and to test hypotheses about the value of a single population proportion. In this chapter, you will see how sample data can be used to learn about the difference between two population proportions.
Example 11.1 Cell Phones in Bed

We are told: “Let’s return to the example at the beginning of this section to answer the question, ‘How much greater is the proportion who use a cell phone to stay connected in bed for cell phone users ages 20 to 39 than for those 40 to 49?’ The study described earlier found that 168 of the 258 people in the sample of 20- to 39-year-olds and 61 of the 129 people in the sample of 40- to 49-year-olds said that they sleep with their cell phones. Based on these sample data, what can you learn about the actual difference in proportions for these two populations?”

A summary of the results from this study is in the JMP data table below.

Click Analyze and select Fit Y by X
Click Column 1 and drag to X, Factor
Click Column 2 and drag to Y, Response
Click Column 3 and drag to Freq
Click OK
Click the red arrow next to Contingency Analysis of … and change Set α level to 0.10, then select Two Sample Test for Proportions (minimize the Mosaic Plot, Contingency Table, and Tests)
Select the Use cell phone in bed option

The 90% confidence interval for the difference is given above, with the lower value of 0.090024 and the upper value of 0.263412.

Example 11.5 Cell Phone Fundraising Part 2

We are told: “The Preview Example for this chapter described a study that looked at ways people donated to the 2010 Haiti earthquake relief effort. Two independently selected random samples—one of Gen Y cell phone users and one of Gen X cell phone users— resulted in the following information:

Gen Y (those born between 1980 and 1988): 17% had made a donation via cell phone
Gen X (those born between 1968 and 1979): 14% had made a donation via cell phone

The question posed in the preview example was: Is there convincing evidence that the proportion who donated via cell phone is higher for the Gen Y population than for the Gen X population?
The report referenced in the preview example does not say how large the sample sizes were, but the description of the survey methodology indicates that the samples can be regarded as independent random samples. For purposes of this example, let’s suppose that both sample sizes were 1,200. Now you can use the given information to answer the questions posed. Considering the four key questions (QSTN), this situation can be described as hypothesis testing, sample data, one categorical variable (did or did not donate by cell phone), and two samples. This combination suggests a large-sample hypothesis test for a difference in population proportions.”

A summary of the results from this study is in the JMP data table below.

Click Analyze and then select Fit Y by X
Select Column 1 and drag to X, Factor
Select Column 2 and drag to Y, Response
Click on Column 3 and drag to Freq
Click OK
Click the red arrow next to Contingency Analysis of … and then select Two Sample Test for Proportions (minimize the Mosaic Plot, Contingency Table, and Tests)

We can use the probability stated in the first row of the Adjusted Wald test for the hypothesis test. The p-value is 0.0210.

Chapter 12 Asking and Answering Questions about a Population Mean

One of the key questions used to determine an appropriate data analysis method is whether the data are categorical or numerical. In the previous chapters, the focus has been on how categorical data can be used to learn about the value of a population proportion. Now you will use numerical data from a sample to learn about the value of a population mean, such as the mean number of hours that students enrolled at your college spend studying each week or the mean weight gain of students at the college during their freshman year.

Example 12.9 Selfish Chimps?

We are told: “The article ‘Chimps Aren’t Charitable’ (Newsday, November 2, 2005) summarized a research study published in the journal Nature.
In this study, chimpanzees learned to use an apparatus that dispensed food when either of two ropes was pulled. When one of the ropes was pulled, only the chimp controlling the apparatus received food. When the other rope was pulled, food was dispensed both to the chimp controlling the apparatus and also to a chimp in the adjoining cage. The accompanying data (approximated from a graph in the paper) represent the number of times out of 36 trials that each of seven chimps chose the option that would provide food to both chimps (the ‘charitable’ response).

23 22 21 24 19 20 20

You can use these data to estimate the mean number of times out of 36 that chimps choose the charitable response. For purposes of this example, let’s suppose it is reasonable to regard this sample of seven chimps as representative of the population of all chimpanzees. This is an estimation problem, and you have sample data, one numerical variable (the number of times out of 36 that the charitable response is chosen), and one sample. These are the answers to the four key questions that lead you to consider a one-sample t confidence interval for a population mean as a potential method. The five-step process for estimation problems (EMC3) can be used to construct a 99% confidence interval.”

A summary of the results from this study is in the JMP data table below.

Click Analyze and then select Distribution
Click on Chose charitable column and drag to Y, Columns
Click OK
Click on red arrow next to Chose charitable and select Confidence interval and 0.99 confidence level, and minimize Quantiles and Summary Statistics

The 99% confidence interval for the mean is given as 18.76416 to 23.80727.

Example 12.12 Time Stands Still (or So it Seems)

We are told: “A study conducted by researchers at Pennsylvania State University investigated whether time perception, an indication of a person’s ability to concentrate, is impaired during nicotine withdrawal.
The study results were summarized in the paper ‘Smoking Abstinence Impairs Time Estimation Accuracy in Cigarette Smokers’ (Psychopharmacology Bulletin [2003]: 90–95). After a 24-hour smoking abstinence, 20 smokers were asked to estimate how much time had passed during a 45-second period. Suppose the resulting data on perceived elapsed time (in seconds) were as follows (these data are artificial but are consistent with summary quantities given in the paper):

69 65 72 73 59 55 39 52 67 57 56 50 70 47 56 45 70 64 67 53

These data were used to compute the sample mean and standard deviation:

n = 20    x̄ = 59.30    s = 9.84

The authors of the paper believed that it was reasonable to consider this sample as representative of smokers in general. The researchers wanted to determine whether smoking abstinence tends to lead to elapsed time being overestimated.”

The data for this study is in the JMP data table below.

Click Analyze and select Distribution
Click Elapsed time and drag to Y, Columns
Click OK
Click the red arrow next to Elapsed Time and select Test Mean
Enter 45 in the Specify Hypothesized Mean box
Click OK (and minimize the unused information)

The t test statistic is 6.5018 and the P-value is <.0001, leading us to reject the null hypothesis that the mean perceived elapsed time is 45 seconds.

Chapter 13 Asking and Answering Questions about the Difference between Two Population Means

In Chapter 12, you saw how sample data could be used to estimate a population mean and to test hypotheses about the value of a single population mean. In this chapter you will see how sample data can be used to learn about the difference between two population means.

Example 13.2 Salary and Gender

We are told: “Are women still paid less than men for comparable work?
The authors of the paper ‘Sex and Salary: A Survey of Purchasing and Supply Professionals’ (Journal of Purchasing and Supply Management [2008]: 112–124) carried out a study in which salary data were collected from a random sample of men and from a random sample of women who worked as purchasing managers and who were subscribers to Purchasing magazine. Salary data consistent with summary quantities given in the paper appear below (the actual sample sizes for the study were much larger):

Annual Salary (in thousands of dollars)
Men    81 69 81 76 76 74 69 76 79 65
Women  78 60 67 61 62 73 71 58 68 48

Even though the samples were selected from subscribers to a particular magazine, the authors of the paper believed the samples to be representative of the two populations of interest—male purchasing managers and female purchasing managers. Let’s use the sample data to determine if there is convincing evidence that the mean annual salary for male purchasing managers is greater than the mean annual salary for female purchasing managers.”

The data for this study is in the JMP data table below.

Click Analyze and then select Fit Y by X
Click Annual Salary and drag to Y, Response
Click Gender and drag to X, Factor
Click OK
Click the red arrow next to Oneway Analysis of… and select t test

The t value is 3.109518 with 15.12279 degrees of freedom in the output, with the P-value given as 0.0036, leading us to reject the null hypothesis that the two population mean salaries are equal.

Example 13.4 Benefits of Ultrasound Revisited

We are told: “You can use the range of motion data of Example 13.3 to test the claim that ultrasound increases mean range of motion. Because the samples are paired, the first thing to do is compute the sample differences. These are the before – after range of motion differences for the seven physical therapy patients in the sample. A negative difference means that the after measurement was larger, so range of motion increased after the ultrasound therapy.”

The sample data are shown in the JMP data table below.
Click Analyze and select Matched Pairs
Click each column and drag to Y, Paired Response
Click OK

JMP reports the After Ultrasound minus Before Ultrasound difference, so we need to change the signs of the mean difference and the t statistic to match the before – after differences used in the example. The mean difference is then -3.42857, with a t value of -2.587987. The P-value is 0.0207.

Example 13.7 Freshman Year Weight Gain

We are told: “The paper ‘Predicting the “Freshman 15”: Environmental and Psychological Predictors of Weight Gain in First-Year University Students’ (Health Education Journal [2010]: 321–332) described a study conducted by researchers at Carleton University in Canada. The researchers studied a random sample of first-year students who lived on campus and a random sample of first-year students who lived off campus. Data on weight gain (in kg) during the first year, consistent with summary quantities given in the paper, are given below. A negative weight gain represents a weight loss. The researchers believed that the mean weight gain of students living on campus was higher than the mean weight gain for students living off campus and were interested in estimating the difference in means for these two groups.”

The data for both groups are in the JMP data set below.

Click Analyze and then select Fit Y by X
Click Weight Gain and drag to Y, Response
Click Sample and drag to X, Factor
Click OK
Click the red arrow next to Oneway Analysis of … and select t Test

The 95% confidence interval is (-0.9754, 4.1954).

Example 13.8 Benefits of Ultrasound One More Time

These are the same data as in Example 13.4, but this time we are asked for a confidence interval. JMP reports the interval for the After minus Before difference, so we switch the signs to get the interval for the Before minus After difference: (-6.67025, -0.1869).

Chapter 15 Learning from Categorical Data

This chapter introduces three additional methods for learning from categorical data.
Sometimes a categorical data set consists of observations on a single variable of interest (univariate data). When the categorical variable has only two possible categories, the methods introduced in Chapters 9, 10, and 11 can be used to learn about the proportion of “successes.” For example, suppose calls made to the 9-1-1 emergency number are classified according to whether they are for true emergencies or not. You can estimate the proportion of calls that are for true emergencies or you can use data from two different cities to determine if there is evidence of a difference in the proportions of true emergency calls. But the methods of Chapters 9, 10, and 11 are only appropriate when the categorical variable of interest has two possible categories. In this chapter, you will see how to analyze data on a categorical variable with more than two possible categories. You will also see how to compare two or more populations on the basis of a categorical variable.

Example 15.3 Tasty Dog Food?

We are told: “The article ‘Can People Distinguish Pâté from Dog Food?’ (American Association of Wine Economists, April 2009, www.wine-economics.org) describes a study that investigated whether people can tell the difference between dog food, pâté (a spread made of finely chopped liver, meat, or fish), and processed meats (such as Spam and liverwurst). Researchers used a food processor to make spreads that had the same texture and consistency as pâté from Newman’s Own brand dog food and from the processed meats. Each participant in the study tasted five spreads (duck liver pâté, Spam, dog food, pork liver pâté, and liverwurst). After tasting all five spreads, each participant was asked to choose the one that they thought was the dog food.
The first few observations were

Liverwurst; pork liver pâté; liverwurst; dog food

You can use the dog food taste data to test the hypothesis that the five different spreads (duck liver pâté, Spam, dog food, pork liver pâté, and liverwurst) are chosen equally often when people who have tasted all five spreads are asked to identify the one they think is the dog food.”

A summary of the responses from this study is in the JMP data table below.

Click Analyze and then select Distribution
Click on Spread Chosen and drag to Y, Columns
Click on Frequency and drag to Freq
Click OK
Click on the red arrow next to Spread Chosen and select Test Probabilities (minimize Frequencies)
Type in 0.2 for Hypoth Prob (hypothesized probabilities) in all 5 boxes, since if the five probabilities are all equal, each must equal 0.2
Click Done

The Chi-Square value is given in the output as 21.4 (Pearson) with a P-value of 0.0003, leading us to reject the null hypothesis that the five spreads are chosen equally often.

Example 15.6 Risky Soccer?

We are told: “The paper ‘No Evidence of Impaired Neurocognitive Performance in Collegiate Soccer Players’ (American Journal of Sports Medicine [2002]: 157–162) compared collegiate soccer players, athletes in sports other than soccer, and a group of students who were not involved in collegiate sports on the basis of their history of head injuries. Table 15.3, a 3 by 4 two-way frequency table, is the result of classifying each student in independently selected random samples of 91 soccer players, 96 non-soccer athletes, and 53 non-athletes into one of four head injury categories.”

A summary of the results from this study is in the JMP data table below.
Click Analyze and then select Fit Y by X
Click Head Injury and drag into Y, Response
Click Athlete type and drag into X, Factor
Click Frequency column and drag into Freq
Click OK
Minimize the mosaic plot
Click the red arrow next to Contingency Table, unselect the other options, and select only Expected

The observed and expected counts appear in the table. The Chi-Square value and P-value also appear in the output, under Tests.

Example 15.10 Stroke Mortality and Education

We are told: “Table 15.8 was constructed using data from the article ‘Influence of Socioeconomic Status on Mortality after Stroke’ (Stroke [2005]: 310–314). One of the questions of interest was whether there was an association between survival after a stroke and level of education. Medical records for a random sample of 2,333 residents of Vienna, Austria, who had suffered a stroke were used to classify each individual according to two variables: survival (survived, died) and level of education (no basic education, secondary school graduation, technical training/apprenticed, higher secondary school degree, university graduate). Expected cell counts (computed under the assumption of no association between survival and level of education) appear in parentheses in the table.”

A summary of the results of this study is in the JMP data table below.

Click on Analyze and select Fit Y by X
Click on Education and drag to Y, Response
Click on Survival and drag to X, Factor
Click on Frequency and drag to Freq
Click OK

The Chi-Square value is 12.219 with a P-value of 0.0158, which is above the significance level of 0.01, so we do not reject the null hypothesis of no association between survival and level of education.
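
As a rough cross-check on the JMP output for Example 15.10, the P-value can be recomputed from the reported Chi-Square statistic. The Python sketch below is not part of the JMP workflow. It assumes the test has (2 - 1)(5 - 1) = 4 degrees of freedom (two survival categories by five education levels) and uses the closed-form upper-tail area of a chi-square distribution with 4 degrees of freedom, P(X > x) = exp(-x/2)(1 + x/2). The function name is made up for illustration, and the formula does not apply to other degrees of freedom.

```python
import math


def chi_square_upper_tail_df4(x):
    """Upper-tail area P(X > x) for a chi-square variable with 4 degrees of freedom.

    Illustrative helper (not a JMP function). The closed form
    exp(-x/2) * (1 + x/2) holds only for df = 4.
    """
    if x < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.exp(-x / 2) * (1 + x / 2)


# Example 15.10: statistic reported by JMP, df assumed to be (2 - 1) * (5 - 1) = 4
p_value = chi_square_upper_tail_df4(12.219)
alpha = 0.01  # significance level used in the example

print(round(p_value, 4))   # 0.0158
print(p_value < alpha)     # False, so the null hypothesis is not rejected
```

The same helper applied to the Example 15.3 statistic (21.4, with 5 - 1 = 4 degrees of freedom for five spreads) gives a value that rounds to 0.0003, in line with the JMP output.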