JMP® Technology Manual
to Accompany
Statistics
Learning from Data
Roxy Peck
California Polytechnic State University,
San Luis Obispo, CA
Prepared by
Alexander Kolesnik
Ventura College, Ventura, CA
Australia • Brazil • Mexico • Singapore • United Kingdom • United States
© 2014 Cengage Learning. All Rights Reserved. This content is not yet final and Cengage Learning
does not guarantee this page will contain current material or match the published product.
Contents*
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 6
Chapter 9
Chapter 10
Chapter 11
Chapter 12
Chapter 13
Chapter 15
*Chapters 5, 7, 8, and 14 have been omitted from this guide since they contain no material relevant to
JMP.
Chapter 1
Introduction
This manual accompanies Statistics: Learning from Data by Roxy Peck. It is intended to be
used in conjunction with the text, so each chapter of this book corresponds to a chapter in
the main text. You’ll find examples from each chapter worked out here, intended to show
you how to use JMP for all the problems in the text. This book is not intended to be a
complete user’s guide to JMP. If you have questions about specific capabilities of JMP,
refer to the online help.
About JMP
JMP (pronounced "jump") is a computer program for statistics developed by the JMP
business unit of SAS Institute. It was created in the 1980s to take advantage of the
graphical user interface introduced by the Macintosh. It has since been improved and
made available for other operating systems.
Statistical Analyses in JMP
This book will describe the step-by-step commands to do all of the required statistical
computations using the software. The data, or a summary of the data, will need to be in a
JMP data table to do the statistical analysis. The results will be shown.
Chapter 2
Graphical Methods for Describing Data Distributions
This chapter is designed to make the data collected in a statistical study easier to “see” by
summarizing the data graphically and numerically, as opposed to just a list of observations.
We will look at examples to see how JMP can be used to create these summaries.
Example 2.4
How Far Is Far Enough?
We are told: “Each year, The Princeton Review conducts surveys of high school students
who are applying to college and of parents of college applicants. The report ‘2009 College
Hopes & Worries Survey findings’ (www.princetonreview/college-hopes-worries-2009)
included a summary of how 12,715 high school students responded to the question ‘Ideally
how far from home would you like the college you attend to be?’ Students responded by
choosing one of four possible distance categories. Also included was a summary of how
3,007 parents of students applying to college responded to the question ‘How far from
home would you like the college your child attends to be?’ The accompanying relative
frequency table summarizes the student and parent responses.”
We would like to make a comparative bar chart for this example. The relative frequency
table is in the JMP data set table shown below.
We start the graphing with the following JMP commands:
Select Graph->Graph Builder
Select Ideal Distance (Miles) and drag to the X variable area. This gives us the appropriate
label along the bottom of the graph (the horizontal axis).
Select the Bar graph from the pictures of the graphs. This changes the dots in the graph to
bars whose heights represent the relative frequencies, with the scale on the vertical axis.
Select Relative Frequency (Students) and drag into the graph area. This now gives us the
bar graph for just the students.
Select Relative Frequency (Parents) and also drag into the graph area. This will give us the
bar graph for parents, next to the ones already displayed for the students.
Select Done
The completed comparative bar graph is shown below. It allows us to visually compare the
responses of students and parents.
Example 2.6
Graduation Rates Revisited
We are told: “The article referenced in Example 2.5 also gave graduation rates for all
student athletes at the 63 schools in the 2009 Division I basketball playoffs. The data are
listed below. Also listed are the differences between the graduation rate for all student
athletes and the graduation rate for basketball players.”
We would like to make a comparative dotplot for this example. The given data is in the
JMP data set table shown below.
Click Graph and then select Chart
Click and drag the Basketball column into Categories, X, Levels
Select Point Chart, as we want the data to be dots on our graph
Click OK
Repeat for Athletes, and then All-BB
We can compare the graphs for Basketball and for all Athletes, and also look at the difference
graph. This last graph has both positive and negative values for the differences. The
positive differences indicate that those schools had higher graduation rates for all athletes
than they did for their basketball players.
Example 2.13
Enrollments at Public Universities
We are told: “States differ widely in the percentage of college students who are enrolled in
public institutions. The National Center for Education Statistics provided the
accompanying data on this percentage for the 50 U.S. states for fall 2007.”
We would like to make a histogram for this example. The given data is in the JMP data set
table shown below.
Select Analyze and then Distribution
Click and drag the selected column into Y, Columns
Click Histograms Only
Click OK
Click the red arrow next to Percent of Students
Under Histogram options, deselect Vertical, and select Show Counts and Show Percents
We now see the histogram for this data set, along with the frequency counts and relative
frequencies (written as percentages) above the bars corresponding to each class interval.
Example 2.17
Worth the Price You Pay?
We are told: “The accompanying table gives the cost and an overall quality rating for 10
different brands of men’s athletic shoes (www.consumerreports.org).”
We would like to make a scatterplot for this example. The given data is in the JMP data set
table shown below.
Click Analyze and then select Fit Y by X
Click and drag the Cost column into X, Factor window
Click and drag the Rating column into Y, Response window
Click OK
The scatterplot for this data set is shown above.
Chapter 3
Numerical Methods for Describing Data Distributions
In Chapter 2, graphical displays were used to summarize data. By creating a visual display
of the data distribution, it is easier to see and describe its important characteristics, such as
shape, center, and spread. In this chapter, you will see how numerical measures are used to
describe important characteristics of a data distribution. We will again be using JMP to
generate statistical output with the desired numerical measures.
Example 3.6
Thirsty Bats
We are told: “The short article ‘How to Confuse Thirsty Bats’ (nature.com) summarized a
study that was published in the journal Nature Communications (‘Innate Recognition of
Water Bodies in echolocating Bats,’ November 2, 2010). The article states ‘Echolocating
bats have a legendary ability to find prey in the dark—so you’d think they would be able to
tell the difference between water and a sheet of metal. Not so, report Greif and Siemers in
Nature Communications. They have found that bats identify any extended, echoacoustically smooth surface as water, and will try to drink from it.’ This conclusion was
based on a study where bats were placed in a room that had two large plates on the floor.
One plate was made of wood and had an irregular surface. The other plate was made of
metal and had a smooth surface. The researchers found that the bats never attempted to
drink from the irregular surface, but that they made repeated attempts to drink from the
smooth, metal surface. The number of attempts to drink from the smooth metal surface for
11 bats are shown here:
66 144 13 26 94 163 8 125 1 64 56
These data will be used to select, compute, and interpret appropriate summary measures of
center and spread.”
We will use JMP to compute these summary measures. The given data is in the JMP data
set table shown below.
Click Analyze and then select Distribution
Click and drag selected column to Y, Columns
Click OK
If you want other statistics, click on the red arrow next to the name of the column, Number
of Drinking Attempts
Select Display Options, and then Customize Summary Statistics
Select desired statistics (such as the ones shown checked below)
Click OK
Close Quantiles
The summary statistics selected are shown above, indicating the mean is 69.090909 and the
standard deviation is 56.351494. These measures provide us with a good glimpse at the
data.
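As a cross-check outside JMP, the same two measures can be reproduced in a few lines of Python. This is only a sketch using Python's standard statistics module, which is not part of this manual; the variable names are our own.

```python
import statistics

# Number of drinking attempts for the 11 bats (data from Example 3.6)
attempts = [66, 144, 13, 26, 94, 163, 8, 125, 1, 64, 56]

mean = statistics.mean(attempts)   # sample mean
sd = statistics.stdev(attempts)    # sample standard deviation (n - 1 divisor)

print(f"mean = {mean:.6f}, sd = {sd:.6f}")
```

The printed values should agree with the JMP output to the digits shown there.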
Example 3.11
Higher Education
We are told: “The Chronicle of Higher Education (Almanac Issue, 2009–2010) published
the accompanying data on the percentage of the population with a bachelor’s degree or
graduate degree in 2007 for each of the 50 U.S. states and the District of Columbia. The 51
data values are:
21 27 26 19 30 35 35 26 47 26 27 30 24 29 22 24 29 20 20 27 35 38 25 31 19 24 27 27 23 34 34
25 32 26 26 24 22 28 26 30 23 25 22 25 29 33 34 30 17 25 23
These data will be used to select, compute, and interpret appropriate summary measures of
center and spread.”
We will use JMP to compute these summary measures. The given data is in the JMP data
set table shown below.
Click Analyze and then select Distribution
Click and drag the selected column to Y, Columns
Click OK
Hide the histogram and summary statistics
The maximum of 47, minimum of 17, median of 26, and the 1st and 3rd quartile values, 24
and 30 respectively, are shown in the output above. We use the five number summary here
to get a better representation of the center and spread.
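For readers who want to verify the five-number summary without JMP, here is a minimal Python sketch. It assumes the "inclusive" quartile rule from Python's standard statistics module; quartile conventions differ between packages, so other data sets may not agree exactly with JMP's quantiles.

```python
import statistics

# Percentage with a bachelor's or graduate degree in 2007,
# 50 states plus the District of Columbia (data from Example 3.11)
percents = [
    21, 27, 26, 19, 30, 35, 35, 26, 47, 26, 27, 30, 24, 29, 22, 24, 29,
    20, 20, 27, 35, 38, 25, 31, 19, 24, 27, 27, 23, 34, 34, 25, 32, 26,
    26, 24, 22, 28, 26, 30, 23, 25, 22, 25, 29, 33, 34, 30, 17, 25, 23,
]

# Quartiles under the "inclusive" interpolation rule (an assumption; see above)
q1, median, q3 = statistics.quantiles(percents, n=4, method="inclusive")

five_number = (min(percents), q1, median, q3, max(percents))
print(five_number)
```

For this data set the sketch gives the same minimum, quartiles, median, and maximum as the JMP output.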
Example 3.13
Video Game Practice Strategies
We are told (from Example 3.12): “The authors of the paper ‘Striatal Volume Predicts
level of Video game Skill Acquisition’ (Cerebral Cortex[2010]: 2522–2530) studied a
number of factors that affect performance in a complex video game. One factor was
practice strategy. Forty college students who all reported playing video games less than 3
hours per week over the past two years and who had never played the game Space Fortress
were assigned at random to one of two groups. Each person completed 20 two-hour
practice sessions. Those in the fixed priority group were told to work on improving their
total score at each practice session. Those in the variable priority group were told to focus
on a different aspect of the game, such as improving speed score, in each practice session.
The investigators were interested in whether practice strategy makes a difference. They
measured the improvement in total score from the first practice session to the last.
Improvement scores (approximated from a graph in the paper) for the 20 people in each
practice strategy group are given below.”
We will use JMP to construct a boxplot. The given data is in the JMP data set table shown
below.
Click Analyze and then select Distribution
Click and drag selected column to Y, Columns
Click OK
Close the Histogram, Quantiles, and Summary Statistics
The boxplot is shown above.
Chapter 4
Describing Bivariate Numerical Data
What can you learn from bivariate numerical data? A good place to start is with a
scatterplot of the data. If it appears that the two variables that define the data set are
related, it may be possible to describe the relationship in a way that allows you to predict
the value of one variable based on the value of the other. For example, if there is a
relationship between a blood test measure and age and you could describe that relationship
mathematically, it might be possible to predict the age of a crime victim. If you can
describe the relationship between fuel efficiency and the weight of a car, you could predict
the fuel efficiency of a car based on its weight. In this chapter, you will see how this can be
accomplished. We will use JMP to create scatterplots for given data sets, to find the
correlation between variables, and to create and interpret regression equations.
Example 4.3
Does It Pay to Pay More for a Bike Helmet?
We are told: “Are more expensive bike helmets safer than less expensive ones? The
accompanying data on x = price and y = quality rating for 11 different brands of bike
helmets is from the Consumer Reports web site (www.consumerreports.org/health).
Quality rating was a number from 0 (the worst possible rating) to 100 and was determined
using factors that included how well the helmet absorbed the force of an impact, the
strength of the helmet, ventilation, and ease of use.”
The data set for this example is in the JMP data table below. We will use JMP to create a
scatterplot comparing price and quality rating, and then to find the correlation between the
two variables.
For the Scatterplot:
Click Analyze and then select Fit Y by X
Click and drag Price to X, Factor
Click and drag Quality Rating to Y, Response
Click OK
The scatterplot appears above, with the price of the helmets on the x-axis and the quality
rating on the y-axis.
For the Correlation:
Click Analyze and then select Multivariate Methods, and then Multivariate
Click both selected columns and drag each to Y, Columns
Click OK
The correlation between price and quality rating is given as 0.3034 in the output above.
Example 4.6
It May Be a Pile of Debris to You, but It Is Home to a Mouse
We are told: “The accompanying data is a subset of data from a scatterplot that appeared
in the paper ‘Small Mammal Responses to fine Woody Debris and Forest Fuel Reduction in
Southwest Oregon’ (Journal of Wildlife Management[2005]: 625–632). The authors of the
paper were interested in how the distance a deer mouse will travel for food is related to the
distance from the food to the nearest pile of fine woody debris. Distances were measured in
meters.”
The data set for this example is in the JMP data table below. We will use the software to
graph the regression line (on top of the scatterplot), and to come up with the regression
equation.
Click Analyze and then select Fit Y by X
Click and drag Distance from Debris column to X, Factor
Click and drag Distance Traveled column to Y, Response
Click OK
Click on red arrow next to Bivariate Fit of … and select Fit Line
The output above shows the regression line (in red) on the scatterplot, and the regression
equation (along with the related measures).
For residuals:
Click on red arrow next to Linear Fit
Select Save Residuals
The residuals are now shown in the data table above.
Chapter 6
Random Variables and Probability Distributions
One way to learn from data is to use information from a sample to learn about a
population distribution. In this situation, you are usually interested in the distribution of
one or more variables. For example, an environmental scientist who obtains an air sample
from a specified location might be interested in the concentration of ozone (a major
constituent of smog). Before selection of the air sample, the value of the ozone
concentration is uncertain. Because the value of a variable quantity such as ozone
concentration is subject to uncertainty, such variables are called random variables. In this
chapter, you will learn how probability models are used to describe the behavior of
random variables.
Example 6.21
Newborn Birth Weights
We are told: “Data from the paper ‘Fetal growth Parameters and Birth Weight: their
relationship to neonatal Body Composition’ (Ultrasound in Obstetrics and
Gynecology[2009]: 441–446) suggest that a normal distribution with a mean of 3,500 grams
and standard deviation of 600 grams is a reasonable model for the probability distribution
of birth weight of a randomly selected full-term baby. What proportion of birth weights
are between 2,900 and 4,700 grams?”
For this example, we will use the formula editor function in JMP.
Click Rows and select Add Rows
Type in 1 and click OK
Double-click on column 1
Click Column Properties and select Formula
Click Edit Formula
Click Probability and select Normal Distribution
Click on the red box around x and type in -1.00. We use -1 since 2900 is exactly 1 standard
deviation (600) below the mean of 3500.
The probability to the left of 2900 is given in column 1. This tells us that the proportion of
babies that weigh below 2900 grams is 0.1586552539.
Now repeat to find the probability to the left of 4700, entered in column 2.
This time we type in 2 in the red box, since 4700 is 2 standard deviations (1200) above 3500.
This tells us that the probability to the left of 4700 is 0.9772498681, which is the proportion
of babies that weigh below 4700 grams. Now subtract column 1 from column 2 to find the
probability of a birth weight between 2900 and 4700 grams:
0.9772498681 - 0.1586552539 = 0.8185946142.
To find the probability that a birth weight is greater than 4500 grams, repeat the process
with 1.67 entered for x in the Normal Distribution function.
We put in 1.67 since 4500 is about 1.67 standard deviations above the mean of 3500. In other
words, the difference of 1000 divided by 600 gives us approximately 1.67.
So the probability of being below 4500 is 0.9525403182. Now subtract from 1 to find the
probability to the right of 4500. So the proportion of babies that weigh more than 4500
grams is 0.0474596818.
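The normal probabilities in this example can also be checked with a short Python sketch. It relies on NormalDist from Python's standard library rather than the JMP formula editor, and it deliberately uses the rounded z value 1.67 for the last calculation so that the result is comparable with the steps above.

```python
from statistics import NormalDist

# Normal model for birth weight from Example 6.21 (grams)
birth_weight = NormalDist(mu=3500, sigma=600)

# P(2900 < X < 4700): area left of 4700 minus area left of 2900
p_below_2900 = birth_weight.cdf(2900)
p_below_4700 = birth_weight.cdf(4700)
p_between = p_below_4700 - p_below_2900

# P(X > 4500), using the rounded z value 1.67 as in the JMP steps
p_above_4500 = 1 - NormalDist().cdf(1.67)

print(f"P(2900 < X < 4700) = {p_between:.10f}")
print(f"P(X > 4500)        = {p_above_4500:.10f}")
```

Working directly on the gram scale, as in the first calculation, avoids rounding the z value by hand.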
Chapter 9
Estimating a Population Proportion
When a sample is selected from a population, it is usually because you hope it will provide
information about the population. For example, you might want to use sample data to learn
about the value of a population characteristic such as the proportion of students enrolled at
a college who purchase textbooks online or the mean number of hours that students at the
college spend studying each week. This chapter considers how sample data can be used to
estimate the value of a population proportion.
Example 9.5
Dangerous Driving
We are told: “The article ‘Nine out of Ten Drivers Admit in Survey to Having Done
Something Dangerous’ (Knight Ridder Newspapers, July 8, 2005) reported on a survey of
1,100 drivers. Of those surveyed, 990 admitted to careless or aggressive driving during the
previous 6 months. Assuming that it is reasonable to regard this sample of 1,100 as
representative of the population of drivers, you can use this information to construct an
estimate of p, the proportion of all drivers who have engaged in careless or aggressive
driving in the last 6 months.”
A summary of the results from this study is in the JMP data table below.
Click Analyze and select Distribution
Click and drag Column 1 to Y, Columns
Click and drag Column 2 to Freq
Click OK
Click on the red arrow next to Column 1, select Confidence Interval with level of 0.90
The 90% confidence interval is shown above, with the lower value being 0.884126 and the
upper value being 0.913911.
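The limits reported by JMP differ slightly from the large-sample interval, p-hat plus or minus z times the standard error. The Python sketch below computes both that interval and a Wilson score interval. The score interval is our assumption about what the JMP limits correspond to, inferred only from the fact that it reproduces the numbers above; this manual does not name the method.

```python
from math import sqrt
from statistics import NormalDist

successes, n = 990, 1100   # counts from Example 9.5
conf_level = 0.90

p_hat = successes / n
z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # about 1.645 for 90%

# Large-sample interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
margin = z * sqrt(p_hat * (1 - p_hat) / n)
large_sample = (p_hat - margin, p_hat + margin)

# Wilson score interval (assumed method; see the note above)
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
score = (center - half_width, center + half_width)

print("large-sample:", large_sample)
print("score:       ", score)
```

With a sample this large the two intervals differ only in the third decimal place.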
Chapter 10
Asking and Answering Questions about a Population Proportion
Two types of inference problems are considered in this text. In estimation problems,
sample data are used to learn about the value of a population characteristic. In hypothesis
testing problems, sample data are used to decide if some claim about the value of a
population characteristic is plausible. In Chapter 9, you saw how to use sample data to
estimate a population proportion. In this chapter, you will see how sample data can also be
used to decide whether a claim, called a hypothesis, about a population proportion is
believable.
Example 10.11
Unfit Teens
We are told: “The article ‘7 Million U.S. Teens would Flunk Treadmill Tests’ (Associated
Press, December 11, 2005) summarized a study in which 2,205 adolescents ages 12 to 19
took a cardiovascular treadmill test. The researchers conducting the study believed that the
sample was representative of adolescents nationwide. Of the 2,205 adolescents tested, 750
had a poor level of cardiovascular fitness. Does this sample provide support for the claim
that more than thirty percent of adolescents have a poor level of cardiovascular fitness?”
A summary of the results from this study is in the JMP data table below.
Click on Analyze then select Distribution
Click on Column 1 and drag to Y, Columns
Click on Column 2 and drag to Freq
Click OK
Click the red arrow next to Column 1 and select Test Probabilities
Enter 0.3 into the Hypoth Prob space for Poor level, since we would like to test the claim that
30% are in this category.
Select the “probability greater than hypothesized value” option, since we are testing a one-sided
alternative (the JMP software will do an exact one-sided test based on the binomial
distribution).
Click Done
The software output above tells us that the P-value is <0.0001, leading us to reject the null
hypothesis that the proportion is 0.3 and to conclude that more than 30% of adolescents have a
poor level of cardiovascular fitness.
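To see where a P-value this small comes from, the following Python sketch computes both the large-sample z test and an exact binomial upper-tail probability. It is only an illustration with the standard library; the helper log_binom_pmf is our own name, and the exact calculation is carried out on the log scale because the individual binomial terms are too small to compute directly.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

x, n = 750, 2205   # adolescents with poor fitness, sample size (Example 10.11)
p0 = 0.3           # hypothesized proportion

# Large-sample z test of H0: p = 0.3 versus Ha: p > 0.3
p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value_z = 1 - NormalDist().cdf(z)


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of exactly k successes in n trials."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1 - p)


# Exact one-sided P-value: P(X >= 750) when X is binomial(2205, 0.3)
p_value_exact = sum(exp(log_binom_pmf(k, n, p0)) for k in range(x, n + 1))

print(f"z = {z:.3f}, large-sample P-value = {p_value_z:.2e}")
print(f"exact binomial P-value = {p_value_exact:.2e}")
```

Both P-values fall well below 0.0001, in line with the JMP output.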
Chapter 11
Asking and Answering Questions about the Difference between Two
Population Proportions
Many statistical investigations involve comparing two populations. In Chapters 9 and 10,
you saw how sample data could be used to estimate a population proportion and to test
hypotheses about the value of a single population proportion. In this chapter, you will see
how sample data can be used to learn about the difference between two population
proportions.
Example 11.1
Cell Phones in Bed
We are told: “Let’s return to the example at the beginning of this section to answer the
question, ‘How much greater is the proportion who use a cell phone to stay connected in
bed for cell phone users ages 20 to 39 than for those 40 to 49?’ The study described earlier
found that 168 of the 258 people in the sample of 20- to 39-year-olds and 61 of the 129
people in the sample of 40- to 49-year-olds said that they sleep with their cell phones.
Based on these sample data, what can you learn about the actual difference in proportions
for these two populations?”
A summary of the results from this study is in the JMP data table below.
Click Analyze and select Fit Y by X
Click Column 1 and drag to X, Factor
Click Column 2 and drag to Y, Response
Click Column 3 and drag to Freq
Click OK
Click the red arrow next to Contingency Analysis of … and change Set α level to 0.10,
then select Two Sample Test for Proportions (minimize the Mosaic Plot, Contingency
Table, and Tests)
Select the Use cell phone in bed option
The 90% confidence interval for the difference is given above, with the lower value of
0.090024 and the upper value of 0.263412.
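The interval above can be compared with a hand calculation. The Python sketch below computes the usual large-sample interval for a difference in proportions and an adjusted version that adds one success and one failure to each sample before applying the same formula. That adjustment is our assumption about the JMP output, inferred from the fact that it reproduces the limits reported above; the function name diff_interval is our own.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 168, 258   # ages 20 to 39: sleep with cell phone, sample size
x2, n2 = 61, 129    # ages 40 to 49: sleep with cell phone, sample size
z = NormalDist().inv_cdf(0.95)   # critical value for 90% confidence


def diff_interval(x1, n1, x2, n2, z):
    """Wald-type interval for p1 - p2 computed from the given counts."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2 - z * se, p1 - p2 + z * se)


# Large-sample interval from the raw counts
large_sample = diff_interval(x1, n1, x2, n2, z)

# Adjusted interval: one extra success and one extra failure per sample (assumed method)
adjusted = diff_interval(x1 + 1, n1 + 2, x2 + 1, n2 + 2, z)

print("large-sample:", large_sample)
print("adjusted:    ", adjusted)
```

The two intervals are close, and both lie entirely above zero.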
Example 11.5
Cell Phone Fundraising Part 2
We are told: “The Preview Example for this chapter described a study that looked at ways
people donated to the 2010 Haiti earthquake relief effort. Two independently selected
random samples—one of Gen Y cell phone users and one of Gen X cell phone users—
resulted in the following information:
Gen Y (those born between 1980 and 1988): 17% had made a donation via cell phone
Gen X (those born between 1968 and 1979): 14% had made a donation via cell phone
The question posed in the preview example was: Is there convincing evidence that the
proportion who donated via cell phone is higher for the Gen Y population than for the Gen
X population? The report referenced in the preview example does not say how large the
sample sizes were, but the description of the survey methodology indicates that the samples
can be regarded as independent random samples. For purposes of this example, let’s
suppose that both sample sizes were 1,200.
Now you can use the given information to answer the questions posed. Considering the four
key questions (QSTN), this situation can be described as hypothesis testing, sample data,
one categorical variable (did or did not donate by cell phone), and two samples. This
combination suggests a large-sample hypothesis test for a difference in population
proportions.”
A summary of the results from this study is in the JMP data table below.
Click Analyze and then select Fit Y by X
Select Column 1 and drag to X, Factor
Select Column 2 and drag to Y, Response
Click on Column 3 and drag to Freq
Click OK
Click the red arrow next to Contingency Analysis of … and then select Two Sample Test
for Proportions (minimize the Mosaic Plot, Contingency Table, and Tests)
We can use the probability stated in the first row of the Adjusted Wald test for the
hypothesis test. The p-value is 0.0210.
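For comparison, here is a minimal Python sketch of the large-sample z test for a difference in proportions, using a pooled estimate of the common proportion. This is a hand calculation rather than JMP's Adjusted Wald procedure, so its P-value can differ slightly from the one reported above.

```python
from math import sqrt
from statistics import NormalDist

n1 = n2 = 1200          # sample sizes supposed in Example 11.5
x1 = round(0.17 * n1)   # Gen Y donors via cell phone
x2 = round(0.14 * n2)   # Gen X donors via cell phone

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)   # combined estimate under H0: p1 = p2

# Test statistic for H0: p1 - p2 = 0 versus Ha: p1 - p2 > 0
z = (p1 - p2) / sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
p_value = 1 - NormalDist().cdf(z)   # upper-tail area

print(f"z = {z:.3f}, P-value = {p_value:.4f}")
```

The pooled test gives a P-value of about 0.021, which leads to the same conclusion as the JMP output.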
Chapter 12
Asking and Answering Questions about a Population Mean
One of the key questions used to determine an appropriate data analysis method is whether
the data are categorical or numerical. In the previous chapters, the focus has been on how
categorical data can be used to learn about the value of a population proportion. Now you
will use numerical data from a sample to learn about the value of a population mean, such
as the mean number of hours that students enrolled at your college spend studying each
week or the mean weight gain of students at the college during their freshman year.
Example 12.9
Selfish Chimps?
We are told: “The article ‘Chimps Aren’t Charitable’ (Newsday, November 2, 2005)
summarized a research study published in the journal Nature. In this study, chimpanzees
learned to use an apparatus that dispensed food when either of two ropes was pulled. When
one of the ropes was pulled, only the chimp controlling the apparatus received food. When
the other rope was pulled, food was dispensed both to the chimp controlling the apparatus
and also to a chimp in the adjoining cage. The accompanying data (approximated from a
graph in the paper) represent the number of times out of 36 trials that each of seven
chimps chose the option that would provide food to both chimps (the ‘charitable’
response).
23 22 21 24 19 20 20
You can use these data to estimate the mean number of times out of 36 that chimps choose
the charitable response. For purposes of this example, let’s suppose it is reasonable to
regard this sample of seven chimps as representative of the population of all chimpanzees.
This is an estimation problem, and you have sample data, one numerical variable (the
number of times out of 36 that the charitable response is chosen), and one sample. These
are the answers to the four key questions that lead you to consider a one-sample t
confidence interval for a population mean as a potential method. The five-step process for
estimation problems (EMC3) can be used to construct a 99% confidence interval.”
A summary of the results from this study is in the JMP data table below.
Click Analyze and then select Distribution
Click on Chose charitable column and drag to Y, Columns
Click OK
Click on red arrow next to Chose charitable and select Confidence interval and 0.99
confidence level, and minimize Quantiles and Summary Statistics
The 99% confidence interval for the mean is given as 18.76416 to 23.80727.
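The interval can be verified by hand with the one-sample t formula, mean plus or minus t critical value times s divided by the square root of n. The Python sketch below does this with the standard library. Because the standard library has no t distribution, the critical value 3.707 (99% confidence, 6 degrees of freedom) is hard-coded from a t table; that constant is our assumption, not a JMP output.

```python
from math import sqrt
import statistics

# Number of charitable responses out of 36 trials for seven chimps (Example 12.9)
charitable = [23, 22, 21, 24, 19, 20, 20]

n = len(charitable)
mean = statistics.mean(charitable)
se = statistics.stdev(charitable) / sqrt(n)   # estimated standard error of the mean

# t critical value for 99% confidence with n - 1 = 6 degrees of freedom,
# taken from a t table (hard-coded assumption)
T_CRIT = 3.707

ci = (mean - T_CRIT * se, mean + T_CRIT * se)
print(f"99% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The result agrees with the JMP interval to about two decimal places; the small remaining difference comes from rounding the critical value.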
Example 12.12
Time Stands Still (or So it Seems)
We are told: “A study conducted by researchers at Pennsylvania State University
investigated whether time perception, an indication of a person’s ability to concentrate, is
impaired during nicotine withdrawal. The study results were summarized in the paper
‘Smoking Abstinence Impairs Time Estimation Accuracy in Cigarette Smokers’
(Psychopharmacology Bulletin [2003]: 90–95). After a 24-hour smoking abstinence, 20
smokers were asked to estimate how much time had passed during a 45-second period.
Suppose the resulting data on perceived elapsed time (in seconds) were as follows (these
data are artificial but are consistent with summary quantities given in the paper):
69 65 72 73 59 55 39 52 67 57 56 50 70 47 56 45 70 64 67 53
These data were used to compute the sample mean and standard deviation:
n = 20, x̄ = 59.30, s = 9.84
The authors of the paper believed that it was reasonable to consider this sample as
representative of smokers in general. The researchers wanted to determine whether
smoking abstinence tends to lead to elapsed time being overestimated.”
The data for this study is in the JMP data table below.
Click Analyze and select Distribution
Click Elapsed time and drag to Y, Columns
Click OK
Click the red arrow next to Elapsed Time and select Test Mean
Enter 45 in the Specify Hypothesized Mean box
Click OK (and minimize the unused information)
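The t test statistic is 6.5018 and the P-value is <.0001, leading us to reject the null
hypothesis that the mean perceived elapsed time is 45 seconds. There is convincing evidence
that smoking abstinence tends to lead to elapsed time being overestimated.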
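The t statistic can be reproduced from the raw data with the formula t = (sample mean - hypothesized mean) / (s / sqrt(n)). The following Python sketch uses only the standard library and therefore stops at the test statistic; it does not compute the P-value, since that requires a t distribution.

```python
from math import sqrt
import statistics

# Perceived elapsed time in seconds for 20 smokers (data from Example 12.12)
times = [69, 65, 72, 73, 59, 55, 39, 52, 67, 57,
         56, 50, 70, 47, 56, 45, 70, 64, 67, 53]
mu0 = 45   # hypothesized mean: the actual length of the period

n = len(times)
mean = statistics.mean(times)
sd = statistics.stdev(times)

# One-sample t statistic with n - 1 = 19 degrees of freedom
t = (mean - mu0) / (sd / sqrt(n))

print(f"n = {n}, mean = {mean:.2f}, s = {sd:.2f}, t = {t:.4f}")
```

Using the unrounded standard deviation gives the same t value as JMP; substituting the rounded summary values 59.30 and 9.84 gives a slightly different result.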
Chapter 13
Asking and Answering Questions about the Difference between Two Population Means
In Chapter 12, you saw how sample data could be used to estimate a population mean and
to test hypotheses about the value of a single population mean. In this chapter you will see
how sample data can be used to learn about the difference between two population means.
Example 13.2
Salary and Gender
We are told: “Are women still paid less than men for comparable work? The authors of the
paper ‘Sex and Salary: A Survey of Purchasing and Supply Professionals’ (Journal of
Purchasing and Supply Management [2008]: 112–124) carried out a study in which salary
data were collected from a random sample of men and from a random sample of women
who worked as purchasing managers and who were subscribers to Purchasing magazine.
Salary data consistent with summary quantities given in the paper appear below (the
actual sample sizes for the study were much larger):
Annual Salary (in thousands of dollars)
Men
81 69 81 76 76 74 69 76 79 65
Women
78 60 67 61 62 73 71 58 68 48
Even though the samples were selected from subscribers to a particular magazine, the
authors of the paper believed the samples to be representative of the two populations of
interest—male purchasing managers and female purchasing managers. Let’s use the
sample data to determine if there is convincing evidence that the mean annual salary for
male purchasing managers is greater than the mean annual salary for female purchasing
managers.”
The data for this study is in the JMP data table below.
Click Analyze and then select Fit Y by X
Click Annual Salary and drag to Y, Response
Click Gender and drag to X, Factor
Click OK
Click the red arrow next to Oneway Analysis of… and select t test
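The output gives a t value of 3.109518 with 15.12279 degrees of freedom and a P-value of
0.0036, leading us to reject the null hypothesis and conclude that the mean annual salary
for male purchasing managers is greater than the mean for female purchasing managers.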
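The t value and degrees of freedom can be checked outside JMP. The sketch below assumes that the t Test option uses the unpooled two-sample procedure, which is consistent with the fractional degrees of freedom in the output. The function name welch_t is ours, not part of JMP.

```python
from math import sqrt
from statistics import mean, variance

# Salary data (in thousands of dollars) from Example 13.2
men = [81, 69, 81, 76, 76, 74, 69, 76, 79, 65]
women = [78, 60, 67, 61, 62, 73, 71, 58, 68, 48]

def welch_t(sample1, sample2):
    """Return (t statistic, approximate df) for an unpooled two-sample t test."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1  # s1^2 / n1, using the sample variance
    v2 = variance(sample2) / n2  # s2^2 / n2
    t = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    # Approximate degrees of freedom for the unpooled procedure
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

t_stat, df = welch_t(men, women)
print(t_stat, df)
```

The printed values should agree with the t value and degrees of freedom in the JMP output. The P-value itself requires the t distribution, which is not in the Python standard library.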
Example 13.4
Benefits of Ultrasound Revisited
We are told: “You can use the range of motion data of Example 13.3 to test the claim that
ultrasound increases mean range of motion. Because the samples are paired, the first thing
to do is compute the sample differences. These are the before – after range of motion
differences for the seven physical therapy patients in the sample. A negative difference
means that the after measurement was larger, so range of motion increased after the
ultrasound therapy.”
The sample data are shown in the JMP data table below.
Click Analyze and select Matched Pairs
Click each column and drag to Y, Paired Response
Click OK
JMP reports the difference as After Ultrasound minus Before Ultrasound, so we need to change
the signs of the mean difference and the t statistic to match the before minus after
differences used in the text. The mean difference is then -3.42857, with a t value of
-2.587987. The P-value is 0.0207.
Example 13.7
Freshman Year Weight Gain
We are told: “The paper ‘Predicting the “Freshman 15”: Environmental and Psychological
Predictors of Weight Gain in First-Year University Students’ (Health Education Journal
[2010]: 321–332) described a study conducted by researchers at Carleton University in
Canada. The researchers studied a random sample of first-year students who lived on
campus and a random sample of first-year students who lived off campus. Data on weight
gain (in kg) during the first year, consistent with summary quantities given in the paper,
are given below. A negative weight gain represents a weight loss. The researchers believed
that the mean weight gain of students living on campus was higher than the mean weight
gain for students living off campus and were interested in estimating the difference in
means for these two groups.”
The data for both groups is in the JMP data set below.
Click Analyze and then select Fit Y by X
Click Weight Gain and drag to Y, Response
Click Sample and drag to X, Factor
Click OK
Click the red arrow next to Oneway Analysis of … and select t Test
The 95% confidence interval is (-0.9754, 4.1954).
Example 13.8
Benefits of Ultrasound One More Time
These are the same data as in Example 13.4. We are asked for a confidence interval this time.
Here are the results we got:
JMP reports the interval for After minus Before, so we need to switch the signs to get the
Before minus After difference used in Example 13.4. This gives (-6.67025, -0.1869) from the
JMP results.
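As a rough consistency check, the interval can be rebuilt from the sign-switched mean difference and t statistic reported for Example 13.4. This sketch assumes a 95% confidence level and takes the t critical value for 6 degrees of freedom (2.447) from a t table; neither is stated in the JMP steps above.

```python
# Sign-switched summary values reported for Example 13.4
mean_diff = -3.42857
t_stat = -2.587987
n = 7                      # seven physical therapy patients
df = n - 1

# Assumption: 95% confidence, so the t critical value for 6 df is about 2.447
T_CRIT_95_DF6 = 2.447

std_error = mean_diff / t_stat        # because t = mean_diff / std_error
margin = T_CRIT_95_DF6 * std_error
lower, upper = mean_diff - margin, mean_diff + margin
print(lower, upper)
```

The endpoints should match the interval above to about three decimal places. The small discrepancy comes from rounding the critical value.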
Chapter 15
Learning from Categorical Data
This chapter introduces three additional methods for learning from categorical data.
Sometimes a categorical data set consists of observations on a single variable of interest
(univariate data). When the categorical variable has only two possible categories, the
methods introduced in Chapters 9, 10, and 11 can be used to learn about the proportion of
“successes.” For example, suppose calls made to the 9-1-1 emergency number are classified
according to whether they are for true emergencies or not. You can estimate the proportion
of calls that are for true emergencies or you can use data from two different cities to
determine if there is evidence of a difference in the proportions of true emergency calls. But
the methods of Chapters 9, 10, and 11 are only appropriate when the categorical variable
of interest has two possible categories. In this chapter, you will see how to analyze data on a
categorical variable with more than two possible categories. You will also see how to
compare two or more populations on the basis of a categorical variable.
Example 15.3
Tasty Dog Food?
We are told: “The article ‘Can People Distinguish Pâté from Dog Food?’ (American
Association of Wine Economists, April 2009, www.wine-economics.org) describes a study
that investigated whether people can tell the difference between dog food, pâté (a spread
made of finely chopped liver, meat, or fish), and processed meats (such as Spam and
liverwurst). Researchers used a food processor to make spreads that had the same texture
and consistency as pâté from Newman’s Own brand dog food and from the processed
meats. Each participant in the study tasted five spreads (duck liver pâté, Spam, dog food,
pork liver pâté, and liverwurst). After tasting all five spreads, each participant was asked
to choose the one that they thought was the dog food. The first few observations were
Liverwurst; pork liver pâté; liverwurst; dog food
You can use the dog food taste data to test the hypothesis that the five different spreads
(duck liver pâté, Spam, dog food, pork liver pâté, and liverwurst) are chosen equally often
when people who have tasted all five spreads are asked to identify the one they think is the
dog food.”
A summary of the responses from this study is in the JMP data table below.
Click Analyze and then select Distribution
Click on Spread Chosen and drag to Y, Columns
Click on Frequency and drag to Freq
Click OK
Click on the red arrow next to Spread Chosen and select Test Probabilities (minimize
Frequencies)
Type in 0.2 for Hypoth Prob (hypothesized probabilities) for all 5 boxes, since if we assume
that the five probabilities are all equal, they must each equal 0.2
Click Done
The Chi-Square value is given in the output as 21.4 (Pearson) with a P-value of 0.0003,
leading us to reject the null hypothesis that the five spreads are chosen equally often.
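The reported P-value can be checked by hand. With five categories the test has 5 - 1 = 4 degrees of freedom, and for 4 degrees of freedom the upper-tail chi-square probability has a simple closed form. The sketch below uses only the reported statistic of 21.4, and the function name is ours.

```python
from math import exp

def chi_square_upper_tail_df4(x):
    """P(X > x) for a chi-square variable with 4 df: exp(-x/2) * (1 + x/2)."""
    return exp(-x / 2) * (1 + x / 2)

categories = 5
df = categories - 1                      # goodness-of-fit df = categories - 1
p_value = chi_square_upper_tail_df4(21.4)
print(df, p_value)
```

The result rounds to 0.0003 at four decimal places, in agreement with the JMP output. The closed form holds only for 4 degrees of freedom; other values need a chi-square table or statistical software.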
Example 15.6
Risky Soccer?
We are told: “The paper ‘No Evidence of Impaired Neurocognitive Performance in
Collegiate Soccer Players’ (American Journal of Sports Medicine [2002]:157–162)
compared collegiate soccer players, athletes in sports other than soccer, and a group of
students who were not involved in collegiate sports on the basis of their history of head
injuries. Table 15.3, a 3 by 4 two-way frequency table, is the result of classifying each
student in independently selected random samples of 91 soccer players, 96 non-soccer
athletes, and 53 non-athletes into one of four head injury categories.”
A summary of the results from this study is in the JMP data table below.
Click Analyze and then select Fit Y by X
Click Head Injury and drag into Y, Response
Click Athlete type and drag into X, Factor
Click Frequency column and drag into Freq
Click OK
Minimize the mosaic plot
Click the red arrow next to Contingency Table, deselect the other options, and select only
Expected
The observed and expected counts appear in the table. The Chi-Square value and P-value
are also there, under the Tests.
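Each expected count in the table is the row total times the column total divided by the grand total, and a 3 by 4 table has (3 - 1)(4 - 1) = 6 degrees of freedom. The sketch below shows the calculation. The function names are ours, and the 2 by 2 counts are made up for illustration only; they are not the head injury data.

```python
def expected_counts(observed):
    """Expected cell counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Hypothetical 2-by-2 counts for illustration only; NOT the study's data
toy_table = [[10, 20], [30, 40]]
toy_expected = expected_counts(toy_table)
toy_chi_square = chi_square_statistic(toy_table)
print(toy_expected, toy_chi_square)
```

Replacing toy_table with the observed counts from the JMP data table should reproduce the expected counts shown in the Contingency Table.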
Example 15.10
Stroke Mortality and Education
We are told: “Table 15.8 was constructed using data from the article ‘Influence of
Socioeconomic Status on Mortality after Stroke’ (Stroke [2005]: 310–314). One of the
questions of interest was whether there was an association between survival after a stroke
and level of education. Medical records for a random sample of 2,333 residents of Vienna,
Austria, who had suffered a stroke were used to classify each individual according to two
variables—survival (survived, died) and level of education (no basic education, secondary
school graduation, technical training/apprenticed, higher secondary school degree,
university graduate). Expected cell counts (computed under the assumption of no
association between survival and level of education) appear in parentheses in the table.”
A summary of the results of this study is in the JMP data table below.
Click on Analyze and select Fit Y by X
Click on Education and drag to Y, Response
Click on Survival and drag to X, Factor
Click on Frequency and drag to Freq
Click OK
The Chi-Square value is 12.219 with a P-value of 0.0158. Because the P-value is greater than
the significance level of 0.01, we do not reject the null hypothesis of no association
between survival and level of education.
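The P-value can be verified from the table dimensions and the reported statistic. With two survival categories and five education levels there are 4 degrees of freedom, and for 4 degrees of freedom the upper-tail chi-square probability has a closed form. The worked calculation below assumes only the reported value of 12.219:

```latex
\[
\text{df} = (2-1)(5-1) = 4, \qquad
P(\chi^2_4 > x) = e^{-x/2}\left(1 + \frac{x}{2}\right)
\]
\[
P(\chi^2_4 > 12.219) = e^{-6.1095}\,(7.1095) \approx 0.0158 > 0.01
\]
```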