Chi-square Test for Goodness of Fit (GOF)
What would it have been like to be a crew member on the Titanic?
The crew of the Titanic made up 40% of the people on board, but 45% of the deaths were from the crew (685 of the 1,517 deaths were crew members).
Did the crew pay a heavier cost?
Pop: deaths on the Titanic
Parameter of interest: the proportion of the deaths who were crew
H0: p = 0.40
Ha: p > 0.40
1-Proportion z-Test
- Independence: pop > 10n?
- Normality: np > 10 yes; nq > 10 yes.
z = (0.45 − 0.40) / √(0.40 × 0.60 / 1517) ≈ 4.10
P(z ≥ 4.10) ≈ 0.0000208
Because the P-value is so low, we reject H0. We believe that the proportion of deaths among crew members was higher than expected.
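As a quick check on the arithmetic, here is a minimal sketch of the same one-proportion z-test in Python (scipy is an assumption on my part; the slides don't name any software):

from math import sqrt
from scipy.stats import norm

# 685 of the 1,517 deaths were crew; H0 says the crew share is 0.40.
n = 1517
p_hat = 685 / n                      # ~0.4516
p0 = 0.40

se = sqrt(p0 * (1 - p0) / n)         # standard error under H0
z = (p_hat - p0) / se                # ~4.10
p_value = norm.sf(z)                 # one-sided P(Z >= z), ~2.1e-5
print(f"z = {z:.2f}, P-value = {p_value:.7f}")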
Class          On board   Percent on board   Observed lost   Expected lost      O−E     (O−E)²/E
First class       329           15%               130            224.51      −94.51      39.79
Second class      285           13%               166            194.49      −28.49       4.17
Third class       710           32%               536            484.51       51.49       5.47
Crew              899           40%               685            613.49       71.51       8.34
Total            2223          100%              1517           1517.00        0.00      57.77
χ² Goodness-of-Fit Test
Pop: Titanic deaths
Model: 15% 1st, 13% 2nd, 32% 3rd, 40% crew
H0: The model is a good fit.
Ha: The model is not a good fit.
- Bias: Not an SRS; PWC (proceed with caution).
- Independence: pop > 10n?
- No expected counts less than 1.
- No more than 20% of expected counts less than 5.
χ² = 57.77, df = 3
P(χ² ≥ 57.77) is essentially 0.
Because the P-value is so low, we reject H0. The deaths on the Titanic were not distributed according to the model.
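A minimal sketch reproducing the table's chi-square computation (again assuming Python with numpy/scipy, which the slides never mention); the expected counts come from each class's exact on-board share of the 2,223 people:

import numpy as np
from scipy.stats import chisquare

# Deaths by class, and the number on board in each class (from the table).
observed = np.array([130, 166, 536, 685])      # 1st, 2nd, 3rd, crew
on_board = np.array([329, 285, 710, 899])

# Model: deaths distributed in proportion to who was on board.
expected = observed.sum() * on_board / on_board.sum()
print(expected.round(2))                       # [224.51 194.49 484.51 613.49]

stat, p_value = chisquare(observed, f_exp=expected)   # df = 4 - 1 = 3
print(f"chi2 = {stat:.2f}, P-value = {p_value:.1e}")  # chi2 ~ 57.77, P ~ 0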
What's new for the chi-square GOF test
- No parameter of interest; rather, a model describing the distribution amongst several categories.
- H0: the model is good.
- Ha: the model is not good.
- No Normality check; rather, no expected counts lower than 1, and no more than 20% of the expected counts lower than 5.
- df = number of categories − 1
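A small, hypothetical helper for the expected-count checks in the list above (the function name and structure are mine, not from the slides):

def gof_conditions_ok(expected):
    # No expected count below 1, and no more than 20% of them below 5.
    below_1 = sum(e < 1 for e in expected)
    below_5 = sum(e < 5 for e in expected)
    return below_1 == 0 and below_5 <= 0.20 * len(expected)

print(gof_conditions_ok([224.51, 194.49, 484.51, 613.49]))   # True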
The statistic for goodness of fit
- The chi-square test statistic measures how closely the observed data match what is expected given a particular model:
  χ² = Σ (O − E)² / E,  with df = number of categories − 1
Step 3: Test Statistic
χ² = (some value)
P(χ² ≥ some value) = …
Color     Observed Count   Expected Percentage   Expected Count   (O−E)²/E
Blue                              24%
Brown                             13%
Green                             16%
Orange                            20%
Red                               13%
Yellow                            14%
(The observed count, expected count, and (O−E)²/E columns are left blank to be filled in.)
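A sketch of how the blank columns would be filled in for a hypothetical bag of M&M's; only the percentages come from the slide, and the observed counts below are invented for illustration:

# Model percentages from the slide; observed counts are hypothetical.
colors   = ["Blue", "Brown", "Green", "Orange", "Red", "Yellow"]
model    = [0.24, 0.13, 0.16, 0.20, 0.13, 0.14]
observed = [12, 7, 10, 9, 6, 6]                # invented bag of 50 candies

n = sum(observed)
for color, pct, obs in zip(colors, model, observed):
    exp = n * pct                              # expected count = n x percentage
    print(f"{color:7s} obs={obs:2d}  exp={exp:5.1f}  (O-E)^2/E={(obs - exp) ** 2 / exp:.3f}")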
Chapter 7
Scatterplots, Association, and Correlation
Looking at Scatterplots
- Scatterplots may be the most common and most effective display for data. In a scatterplot, you can see patterns, trends, relationships, and even the occasional extraordinary value sitting apart from the others.
- Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables.
Looking at Scatterplots (cont.)
- When looking at scatterplots, we will look for direction, form, strength, and unusual features.
- Direction:
  - A pattern that runs from the upper left to the lower right is said to have a negative direction.
  - A trend running the other way has a positive direction.
Looking at Scatterplots (cont.)
- The figure shows a negative direction between the years since 1970 and the prediction errors made by NOAA. As the years have passed, the predictions have improved (errors have decreased).
Looking at Scatterplots (cont.)
- The example in the text shows a negative association between central pressure and maximum wind speed. As the central pressure increases, the maximum wind speed decreases.
Looking at Scatterplots (cont.)
- Form: If there is a straight-line (linear) relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form.
Looking at Scatterplots (cont.)
- Form: If the relationship curves sharply, the methods of this book cannot really help us.
Looking at Scatterplots (cont.)
- Strength: At one extreme, the points appear to follow a single stream (whether straight, curved, or bending all over the place).
Looking at Scatterplots (cont.)
- Strength: At the other extreme, the points appear as a vague cloud with no discernible trend or pattern.
- Note: we will quantify the amount of scatter soon.
Looking at Scatterplots (cont.)
- Unusual features: Look for the unexpected.
  - Often the most interesting thing to see in a scatterplot is the thing you never thought to look for.
  - One example of such a surprise is an outlier standing away from the overall pattern of the scatterplot.
  - Clusters or subgroups should also raise questions.
Roles for Variables
- It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis.
- This determination is made based on the roles played by the variables.
- When the roles are clear, the explanatory or predictor variable goes on the x-axis, and the response variable goes on the y-axis, as sketched below.
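A minimal matplotlib sketch of this convention, with invented protein/fat data in the spirit of the Burger King example that appears later in the deck:

import matplotlib.pyplot as plt

# Invented data: protein is the explanatory variable, fat the response.
protein = [10, 18, 25, 30, 41]
fat = [9, 20, 26, 36, 40]

plt.scatter(protein, fat)
plt.xlabel("Protein (g)")    # explanatory/predictor variable on the x-axis
plt.ylabel("Fat (g)")        # response variable on the y-axis
plt.show()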
Roles for Variables (cont.)
- The roles that we choose for variables are more about how we think about them than about the variables themselves.
- Just placing a variable on the x-axis doesn't necessarily mean that it explains or predicts anything. And the variable on the y-axis may not respond to it in any way.
Correlation
- Data collected from students in Statistics classes included their heights (in inches) and weights (in pounds).
- Here we see a positive association and a fairly straight form, although there seems to be a high outlier.
Correlation (cont.)
- How strong is the association between weight and height of Statistics students?
- If we had to put a number on the strength, we would not want it to depend on the units we used.
- A scatterplot of heights (in centimeters) and weights (in kilograms) doesn't change the shape of the pattern.
Correlation (cont.)
- The correlation coefficient (r) gives us a numerical measurement of the strength of the linear relationship between the explanatory and response variables:
  r = Σ (zx · zy) / (n − 1)
Correlation (cont.)
- For the students' heights and weights, the correlation is 0.644.
- What does this mean in terms of strength? We'll address this shortly.
Correlation Conditions
- Correlation measures the strength of the linear association between two quantitative variables.
- Before you use correlation, you must check several conditions:
  - Quantitative Variables Condition
  - Straight Enough Condition
  - Outlier Condition
Correlation Conditions (cont.)
- Quantitative Variables Condition:
  - Correlation applies only to quantitative variables.
  - Don't apply correlation to categorical data masquerading as quantitative.
  - Check that you know the variables' units and what they measure.
Correlation Conditions (cont.)
- Straight Enough Condition:
  - You can calculate a correlation coefficient for any pair of variables.
  - But correlation measures the strength only of the linear association, and will be misleading if the relationship is not linear.
Correlation Conditions (cont.)
- Outlier Condition:
  - Outliers can distort the correlation dramatically.
  - An outlier can make an otherwise small correlation look big, or hide a large correlation.
  - It can even give an otherwise positive association a negative correlation coefficient (and vice versa).
  - When you see an outlier, it's often a good idea to report the correlations with and without that point, as in the sketch below.
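A tiny demonstration of this advice with invented data; one added point turns a near-zero correlation into a strong one:

import numpy as np

x = np.array([1, 2, 3, 4, 5], float)
y = np.array([2.0, 1.8, 2.2, 2.1, 1.9])        # essentially no association
r_without = np.corrcoef(x, y)[0, 1]

# Add a single point far from the rest of the data.
x_out = np.append(x, 20.0)
y_out = np.append(y, 15.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without outlier: {r_without:.2f}")   # near 0
print(f"r with outlier:    {r_with:.2f}")      # near 1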
Correlation Properties
- The sign of a correlation coefficient gives the direction of the association.
- Correlation is always between −1 and +1.
  - Correlation can be exactly equal to −1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line.
  - A correlation near zero corresponds to a weak linear association.
Correlation Properties (cont.)
- Correlation treats x and y symmetrically: the correlation of x with y is the same as the correlation of y with x.
- Correlation has no units.
- Correlation is not affected by changes in the center or scale of either variable: correlation depends only on the z-scores, and they are unaffected by changes in center or scale (see the check below).
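A quick numerical check of this property, assuming the same invented height/weight numbers as before; converting inches to centimeters and pounds to kilograms leaves r unchanged:

import numpy as np

heights_in = np.array([61, 64, 67, 70, 73], float)
weights_lb = np.array([120, 136, 145, 178, 190], float)

r1 = np.corrcoef(heights_in, weights_lb)[0, 1]
# Shift/rescale both variables (inches -> cm, pounds -> kg):
r2 = np.corrcoef(heights_in * 2.54, weights_lb / 2.205)[0, 1]
print(r1, r2)    # identical: r depends only on the z-scores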
Correlation Properties (cont.)
- Correlation measures the strength of the linear association between the two variables. Variables can have a strong association but still have a small correlation if the association isn't linear.
- Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.
Correlation ≠ Causation
- Whenever we have a strong correlation, it is tempting to explain it by imagining that the predictor variable has caused the response variable to change.
- Scatterplots and correlation coefficients never prove causation.
- A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable.
What Can Go Wrong?
- Don't confuse "correlation" with "causation."
  - Scatterplots and correlations never demonstrate causation.
  - These statistical tools can only demonstrate an association between variables.
What Can Go Wrong? (cont.)
- Don't correlate categorical variables. Be sure to check the Quantitative Variables Condition.
- Be sure the association is linear. There may be a strong association between two variables that have a nonlinear association.
What Can Go Wrong? (cont.)
- Don't assume the relationship is linear just because the correlation coefficient is high. Here the correlation is 0.979, but the relationship is actually bent.
What Can Go Wrong? (cont.)
- Beware of outliers. Even a single outlier can dominate the correlation value. Make sure to check the Outlier Condition.
What have we learned?
- We examine scatterplots for direction, form, strength, and unusual features.
- Although not every relationship is linear, when the scatterplot is straight enough, the correlation coefficient is a useful numerical summary.
  - The sign of the correlation tells us the direction of the association.
  - The magnitude of the correlation tells us the strength of a linear association.
  - Correlation has no units, so shifting or scaling the data, standardizing, or swapping the variables has no effect on the numerical value.
What have we learned? (cont.)
- Doing Statistics right means that we have to Think about whether our choice of methods is appropriate.
  - Before finding or talking about a correlation, check the Straight Enough Condition.
  - Watch out for outliers!
- Don't assume that a high correlation or strong association is evidence of a cause-and-effect relationship; beware of lurking variables!
Chapter 8
Linear Regression
Fat Versus Protein: An Example
- The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu.
Residuals
- The model won't be perfect, regardless of the line we draw.
- Some points will be above the line and some will be below.
- The estimate made from a model is the predicted value (denoted as ŷ).
Residuals (cont.)
- The difference between the observed value and its associated predicted value is called the residual.
- To find the residuals, we always subtract the predicted value from the observed one:
  residual = observed − predicted = y − ŷ
Residuals (cont.)
- A negative residual means the predicted value's too big (an overestimate).
- A positive residual means the predicted value's too small (an underestimate).
“Best Fit” Means Least Squares
- Some residuals are positive, others are negative, and, on average, they cancel each other out.
- So, we can't assess how well the line fits by adding up all the residuals.
- Similar to what we did with deviations, we square the residuals and add the squares.
- The smaller the sum, the better the fit.
- The line of best fit is the line for which the sum of the squared residuals is smallest, as the sketch below illustrates.
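A sketch of this idea with invented data: the least-squares line attains the smallest sum of squared residuals, and perturbing it can only increase that sum.

import numpy as np

x = np.array([1, 2, 3, 4, 5], float)           # invented data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_sq_residuals(b0, b1):
    residuals = y - (b0 + b1 * x)
    return (residuals ** 2).sum()

b1, b0 = np.polyfit(x, y, 1)                   # least-squares slope, intercept
print(sum_sq_residuals(b0, b1))                # the smallest attainable sum
print(sum_sq_residuals(b0 + 0.5, b1))          # any other line does worse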
The Linear Model
- Remember from Algebra that a straight line can be written as y = mx + b.
- In Statistics we use a slightly different notation: ŷ = b0 + b1x.
- We write ŷ to emphasize that the points that satisfy this equation are just our predicted values, not the actual data values.
The Linear Model (cont.)
- We write b1 and b0 for the slope and intercept of the line. The b's are called the coefficients of the linear model.
- The coefficient b1 is the slope, which tells us how rapidly ŷ changes with respect to x. The coefficient b0 is the intercept, which tells where the line hits (intercepts) the y-axis.
The Least Squares Line
- In our model, we have a slope (b1):
  - The slope is built from the correlation and the standard deviations: b1 = r (sy / sx)
  - Our slope is always in units of y per unit of x.
The Least Squares Line (cont.)
- In our model, we also have an intercept (b0):
  - The intercept is built from the means and the slope: b0 = ȳ − b1 x̄
  - Our intercept is always in units of y. (A worked check of both formulas follows below.)
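A sketch computing the slope and intercept from these two formulas, with invented data, and checking against numpy's direct least-squares fit:

import numpy as np

x = np.array([1, 2, 3, 4, 5], float)           # invented data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)         # slope: b1 = r * sy / sx
b0 = y.mean() - b1 * x.mean()                  # intercept: b0 = ybar - b1 * xbar

print(b1, b0)
print(np.polyfit(x, y, 1))                     # same slope and intercept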
Fat Versus Protein: An Example
- The regression line for the Burger King data fits the data well.
  - The equation is: predicted fat = 6.8 + 0.97 (protein).
  - The predicted fat content for a BK Broiler chicken sandwich, with 30 grams of protein, is 6.8 + 0.97(30) = 35.9 grams of fat (computed again in the snippet below).
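The same plug-in prediction as a few lines of Python, assuming the fitted Burger King equation above:

b0, b1 = 6.8, 0.97          # intercept and slope from the fitted equation
protein = 30                # grams of protein in the sandwich
fat_hat = b0 + b1 * protein
print(fat_hat)              # 35.9 grams of fat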
The Least Squares Line (cont.)
- Since regression and correlation are closely related, we need to check the same conditions for regressions as we did for correlations:
  - Quantitative Variables Condition
  - Straight Enough Condition
  - Outlier Condition
Correlation and the Line
- Moving one standard deviation away from the mean in x moves us r standard deviations away from the mean in y.
- This relationship is shown in a scatterplot of z-scores for fat and protein.
Correlation and the Line (cont.)
- Put generally, moving any number of standard deviations away from the mean in x moves us r times that number of standard deviations away from the mean in y.
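A tiny sketch of this rule in z-scores; the correlation value here is made up for illustration:

r = 0.83    # made-up correlation for illustration
for zx in [-2, -1, 0, 1, 2]:
    # In z-scores the regression prediction is simply zy_hat = r * zx.
    print(f"zx = {zx:+d}  ->  predicted zy = {r * zx:+.2f}")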
How Big Can Predicted Values Get?
- r cannot be bigger than 1 (in absolute value), so each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was.
- This property of the linear model is called regression to the mean; the line is called the regression line.
Residuals Revisited
- The linear model assumes that the relationship between the two variables is a perfect straight line. The residuals are the part of the data that hasn't been modeled.
  Data = Model + Residual
  or (equivalently)
  Residual = Data − Model
  Or, in symbols: e = y − ŷ
Residuals Revisited (cont.)
- Residuals help us to see whether the model makes sense.
- When a regression model is appropriate, nothing interesting should be left behind.
- After we fit a regression model, we usually plot the residuals in the hope of finding... nothing.
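A sketch of such a residual plot with simulated data (matplotlib assumed); a structureless band around zero is the "nothing" we hope to find:

import numpy as np
import matplotlib.pyplot as plt

# Simulated data with a genuinely linear relationship.
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 40)
y = 2 + 0.9 * x + rng.normal(0, 0.5, x.size)

b1, b0 = np.polyfit(x, y, 1)
predicted = b0 + b1 * x
residuals = y - predicted

plt.scatter(predicted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()    # a structureless band around 0 is what we hope to see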
Residuals Revisited (cont.)
- The residuals for the BK menu regression look appropriately boring.
Regression Assumptions and Conditions
- Quantitative Variables Condition: regression can only be done on two quantitative variables, so make sure to check this condition.
- Straight Enough Condition: the linear model assumes that the relationship between the variables is linear. A scatterplot will let you check that the assumption is reasonable.
Regression Assumptions and Conditions (cont.)
- It's a good idea to check linearity again after computing the regression, when we can examine the residuals.
- You should also check for outliers, which could change the regression.
- If the data seem to clump or cluster in the scatterplot, that could be a sign of trouble worth looking into further.
Regression Assumptions and Conditions (cont.)
- If the scatterplot is not straight enough, stop here.
  - You can't use a linear model for any two variables, even if they are related.
  - They must have a linear association or the model won't mean a thing.
- Some nonlinear relationships can be saved by re-expressing the data to make the scatterplot more linear.
Regression Assumptions and Conditions (cont.)
- Outlier Condition:
  - Watch out for outliers.
  - Outlying points can dramatically change a regression model.
  - Outliers can even change the sign of the slope, misleading us about the underlying relationship between the variables.
Reality Check: Is the Regression Reasonable?
- Statistics don't come out of nowhere. They are based on data.
  - The results of a statistical analysis should reinforce your common sense, not fly in its face.
  - If the results are surprising, then either you've learned something new about the world or your analysis is wrong.
- When you perform a regression, think about the coefficients and ask yourself whether they make sense.
What Can Go Wrong?
- Don't fit a straight line to a nonlinear relationship.
- Beware of extraordinary points (y-values that stand off from the linear pattern or extreme x-values).
- Don't invert the regression. To swap the predictor-response roles of the variables, we must fit a new regression equation.
- Don't extrapolate beyond the data; the linear model may no longer hold outside of the range of the data.
- Don't infer that x causes y just because there is a good linear model for their relationship; association is not causation.
- Don't choose a model based on R² alone.
What have we learned?
- When the relationship between two quantitative variables is fairly straight, a linear model can help summarize that relationship.
  - The regression line doesn't pass through all the points, but it is the best compromise in the sense that it has the smallest sum of squared residuals.
What have we learned? (cont.)
- The correlation tells us several things about the regression:
  - The slope of the line is based on the correlation, adjusted for the units of x and y.
  - For each SD in x that we are away from the x mean, we expect to be r SDs in y away from the y mean.
  - Since r is always between −1 and +1, each predicted y is fewer SDs away from its mean than the corresponding x was (regression to the mean).
What have we learned? (cont.)
- The residuals also reveal how well the model works.
  - If a plot of the residuals against predicted values shows a pattern, we should re-examine the data to see why.
  - The standard deviation of the residuals quantifies the amount of scatter around the line.
What have we learned? (cont.)
- The linear model makes no sense unless the Linear Relationship Assumption is satisfied.
- Also, we need to check the Straight Enough Condition and Outlier Condition with a scatterplot.
- For the standard deviation of the residuals, we must make the Equal Variance Assumption. We check it by looking at both the original scatterplot and the residual plot for the "Does the Plot Thicken?" Condition.