Chapter 12: Analyzing
Association Between
Quantitative Variables:
Regression Analysis
Section 12.1: How Can We Model How Two
Variables Are Related?
1
Learning Objectives
1. Regression Analysis
2. The Scatterplot
3. The Regression Line Equation
4. Outliers
5. Influential Points
6. Residuals are Prediction Errors
7. Regression Model: A Line Describes How the Mean of y Depends on x
8. The Population Regression Equation
9. Variability about the Line
10. A Statistical Model
2
Learning Objective 1:
Regression Analysis
 The first step of a regression analysis is to identify the response and explanatory variables
 We use y to denote the response variable
 We use x to denote the explanatory variable
3
Learning Objective 2:
The Scatterplot
 The first step in answering the question
of association is to look at the data
 A scatterplot is a graphical display of the
relationship between the response
variable (y-axis) and the explanatory
variable (x-axis)
4
Learning Objective 2:
Example: What Do We Learn from a Scatterplot
in the Strength Study?
 An experiment was designed to measure
the strength of female athletes
 The goal of the experiment was to find the
maximum number of pounds that each
individual athlete could bench press
5
Learning Objective 2:
Example: What Do We Learn from a Scatterplot
in the Strength Study?
 57 high school female athletes
participated in the study
 The data consisted of the following variables:
 x: the number of 60-pound bench presses an athlete could do
 y: maximum bench press
6
Learning Objective 2:
Example: What Do We Learn from a Scatterplot
in the Strength Study?
 For the 57 girls in this study, these variables are summarized by:
 x: mean = 11.0, st. dev. = 7.1
 y: mean = 79.9 lbs, st. dev. = 13.3 lbs
7
Learning Objective 2:
Example: What Do We Learn from a Scatterplot
in the Strength Study?
8
Learning Objective 3:
The Regression Line Equation
 When the scatterplot shows a linear trend, a
straight line can be fitted through the data points
to describe that trend
 The regression line is: ŷ = a + bx
 ŷ is the predicted value of the response variable y
 a is the y-intercept and b is the slope
9
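As an illustration of how a and b are computed, here is a minimal Python sketch (assuming NumPy is available; the data are made up in the spirit of the strength study, not the actual 57 observations):

import numpy as np

# Hypothetical data: number of 60-lb bench presses (x) and maximum bench press in lbs (y)
x = np.array([2, 5, 8, 10, 11, 12, 15, 20, 25], dtype=float)
y = np.array([65, 70, 75, 78, 80, 82, 85, 92, 100], dtype=float)

x_bar, y_bar = x.mean(), y.mean()

# Least-squares formulas: b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), a = y_bar - b*x_bar
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

print(f"y-hat = {a:.2f} + {b:.2f} x")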
Learning Objective 3:
Example: What Do We Learn from a Scatterplot
in the Strength Study?
10
Learning Objective 3:
Example: What Do We Learn from a Scatterplot
in the Strength Study?
 The MINITAB output shows the following
regression equation:
 BP = 63.5 + 1.49 (BP_60)
 The y-intercept is 63.5 and the slope is 1.49
 The slope of 1.49 tells us that predicted maximum
bench press increases by about 1.5 pounds for
every additional 60-pound bench press an athlete
can do
11
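As a quick worked check of this interpretation: an athlete who can do x = 11 sixty-pound bench presses (the sample mean of x) has predicted maximum bench press ŷ = 63.5 + 1.49(11) ≈ 79.9 pounds, which matches the sample mean of y; the least squares line always passes through the point of means (x̄, ȳ).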
Learning Objective 4:
Outliers
 Check for outliers by plotting the data
 The regression line can be pulled toward
an outlier and away from the general trend
of points
12
Learning Objective 5:
Influential Points
 An observation can be influential in affecting the regression line when two things happen:
 Its x value is low or high compared to the rest of the data
 It does not fall in the straight-line pattern that the rest of the data have
13
Learning Objective 6:
Residuals are Prediction Errors
 The regression equation is often called a
prediction equation
 The difference y − ŷ between an observed outcome and its predicted value is the prediction error, called a residual
14
Learning Objective 6:
Residuals
 Each observation has a residual
 A residual is the vertical distance between
the data point and the regression line
15
Learning Objective 6:
Residuals
 We can summarize how near the regression line the data points fall by the sum of squared residuals: Σ(residual)² = Σ(y − ŷ)²
 The regression line has the smallest sum of
squared residuals and is called the least
squares line
16
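The least squares property can be checked numerically. A minimal sketch (hypothetical data, assuming NumPy) compares the sum of squared residuals of the fitted line with that of a slightly perturbed line:

import numpy as np

x = np.array([2, 5, 8, 10, 11, 12, 15, 20, 25], dtype=float)
y = np.array([65, 70, 75, 78, 80, 82, 85, 92, 100], dtype=float)

b, a = np.polyfit(x, y, 1)   # least-squares slope and intercept

def sse(intercept, slope):
    # Sum of squared residuals for the line yhat = intercept + slope * x
    return np.sum((y - (intercept + slope * x)) ** 2)

print("SSE of least squares line:", sse(a, b))
print("SSE of a perturbed line:  ", sse(a + 1.0, b + 0.1))   # never smaller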
Learning Objective 7:
Regression Model: A Line Describes How the
Mean of y Depends on x
 At a given value of x, the equation ŷ = a + bx predicts a single value of the response variable
 But… we should not expect all subjects at that value of x to have the same value of y
 Variability occurs in the y values
17
Learning Objective 7:
The Regression Line
 The regression line connects the estimated
means of y at the various x values
 In summary, ŷ = a + bx describes the relationship between x and the estimated means of y at the various values of x
18
Learning Objective 8:
The Population Regression Equation
 The population regression equation describes the
relationship in the population between x and the
means of y
 The equation is: µy = α + βx
19
Learning Objective 8:
The Population Regression Equation
 In the population regression equation, α is
a population y-intercept and β is a
population slope
 These are parameters
 In practice we estimate the population
regression equation using the prediction
equation for the sample data
20
Learning Objective 8:
The Population Regression Equation
 The population regression equation
merely approximates the actual
relationship between x and the population
means of y
 It is a model
 A model is a simple approximation for how
variables relate in the population
21
Learning Objective 8:
The Regression Model
22
Learning Objective 8:
The Regression Model
 If the true relationship is far from a straight line,
this regression model may be a poor one
23
Learning Objective 9:
Variability about the Line
 At each fixed value of x, variability occurs in the y
values around their mean, µy
 The probability distribution of y values at a fixed
value of x is a conditional distribution
 At each value of x, there is a conditional
distribution of y values
 An additional parameter σ describes the standard
deviation of each conditional distribution
24
Learning Objective 10:
A Statistical Model
 A statistical model never holds exactly in
practice.
 It is merely an approximation for reality
 Even though it does not describe reality
exactly, a model is useful if the true relationship
is close to what the model predicts
25
Chapter 12: Analyzing
Association Between
Quantitative Variables:
Regression Analysis
Section 12.2: How Can We Describe
Strength of Association?
26
Learning Objectives
1. Correlation and Slope
2. Example: What’s the Correlation for
Predicting Strength?
3. The Squared Correlation
27
Learning Objective 1:
Correlation
 The correlation, denoted by r, describes
linear association
 The correlation ‘r’ has the same sign as
the slope ‘b’
 The correlation ‘r’ always falls between -1
and +1
 The larger the absolute value of r, the
stronger the linear association
28
Learning Objective 1:
Correlation and Slope
 We can’t use the slope to describe the strength of
the association between two variables because
the slope’s numerical value depends on the units
of measurement
 The correlation is a standardized version of the
slope
 The correlation does not depend on units of
measurement.
29
Learning Objective 1:
Correlation and Slope
 The correlation and the slope are related in the
following way:
r = b (sx / sy)
30
Learning Objective 2:
Example: What’s the Correlation for Predicting
Strength?
 For the female athlete strength study:
 x: number of 60-pound bench presses
 y: maximum bench press
 x: mean = 11.0, st.dev.=7.1
 y: mean= 79.9 lbs., st.dev. = 13.3 lbs.
 Regression equation: ŷ = 63.5 + 1.49x
31
Learning Objective 2:
Example: What’s the Correlation for Predicting
Strength?
r = b (sx / sy) = 1.49 (7.1 / 13.3) = 0.80
 The variables have a strong, positive
association
32
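The relation r = b(sx/sy) is easy to verify numerically. A minimal sketch (hypothetical data, assuming NumPy; sample standard deviations use ddof=1):

import numpy as np

x = np.array([2, 5, 8, 10, 11, 12, 15, 20, 25], dtype=float)
y = np.array([65, 70, 75, 78, 80, 82, 85, 92, 100], dtype=float)

b, a = np.polyfit(x, y, 1)                        # least-squares slope and intercept
r_from_slope = b * x.std(ddof=1) / y.std(ddof=1)  # r = b (sx / sy)
r_direct = np.corrcoef(x, y)[0, 1]                # correlation computed directly

print(round(r_from_slope, 4), round(r_direct, 4))  # the two values agree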
Learning Objective 3:
The Squared Correlation
 Another way to describe the strength of
association refers to how close predictions for y
tend to be to observed y values
 The variables are strongly associated if you can predict y much better by substituting x values into the prediction equation than by merely using the sample mean ȳ and ignoring x
33
Learning Objective 3:
The Squared Correlation
 Consider the prediction error: the difference between the observed and predicted values of y
 Using the regression line to make a prediction, each error is: y − ŷ
 Using only the sample mean ȳ to make a prediction, each error is: y − ȳ
34
Learning Objective 3:
The Squared Correlation
 When we predict y using ȳ (that is, ignoring x), the error summary equals: Σ(y − ȳ)²
 This is called the total sum of squares
35
Learning Objective 3:
The Squared Correlation
 When we predict y using x with the regression equation, the error summary is: Σ(y − ŷ)²
 This is called the residual sum of squares
36
Learning Objective 3:
The Squared Correlation
 When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using ȳ
 We measure the proportional reduction in error and call it r²
37
Learning Objective 3:
The Squared Correlation
r² = [Σ(y − ȳ)² − Σ(y − ŷ)²] / Σ(y − ȳ)²
 We use the notation r² for this measure because it equals the square of the correlation r
38
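A minimal sketch (hypothetical data, assuming NumPy) that computes r² both as the proportional reduction in error and as the square of the correlation:

import numpy as np

x = np.array([2, 5, 8, 10, 11, 12, 15, 20, 25], dtype=float)
y = np.array([65, 70, 75, 78, 80, 82, 85, 92, 100], dtype=float)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares (errors when predicting with y-bar)
sse = np.sum((y - y_hat) ** 2)      # residual sum of squares (errors when predicting with the line)

r_squared = (tss - sse) / tss
print(round(r_squared, 4), round(np.corrcoef(x, y)[0, 1] ** 2, 4))   # the same number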
Learning Objective 3:
The Squared Correlation Example: What Does r2
Tell Us in the Strength Study?
 For the female athlete strength study:
 x: number of 60-pound bench presses
 y: maximum bench press
 The correlation value was found to be r = 0.80
 We can calculate r² from r: (0.80)² = 0.64
 For predicting maximum bench press, the regression equation has 64% less error than ȳ has
39
Learning Objective 3:
The Squared Correlation
 Properties:
 r² falls between 0 and 1
 r² = 1 when Σ(y − ŷ)² = 0, which happens only when all the data points fall exactly on the regression line
 r² = 0 when Σ(y − ŷ)² = Σ(y − ȳ)², which happens when the slope b = 0, in which case each ŷ = ȳ
 The closer r² is to 1, the stronger the linear association: the more effective the regression equation is compared to ȳ in predicting y
40
Learning Objective 3:
Correlation r and Its Square r2
 Both r and r² describe the strength of association
 ‘r’ falls between −1 and +1
 It represents the slope of the regression line when x and y have been standardized
 ‘r²’ falls between 0 and 1
 It summarizes the reduction in sum of squared errors in predicting y using the regression line instead of using ȳ
41
Chapter 12: Analyzing
Association Between
Quantitative Variables:
Regression Analysis
Section 12.3: How Can We Make Inferences
About the Association?
42
Learning Objectives
1. Descriptive and Inferential Parts of
Regression
2. Assumptions for Regression Analysis
3. Testing Independence between Quantitative
Variables
4. A Confidence Interval for β
43
Learning Objective 1:
Descriptive and Inferential Parts of Regression
 The sample regression equation, r, and r2 are
descriptive parts of a regression analysis
 The inferential parts of regression use the tools
of confidence intervals and significance tests to
provide inference about the regression equation,
the correlation and r-squared in the population of
interest
44
Learning Objective 2:
Assumptions for Regression Analysis
 Basic assumption for using regression line for
description:
 The population means of y at different values of x have a straight-line relationship with x, that is: µy = α + βx
 This assumption states that a straight-line regression model is valid
 This can be verified with a scatterplot
45
Learning Objective 2:
Assumptions for Regression Analysis
 Extra assumptions for using regression to
make statistical inference:
 The data were gathered using
randomization
 The population values of y at each value
of x follow a normal distribution, with
the same standard deviation at each x
value
46
Learning Objective 2:
Assumptions for Regression Analysis
 Models, such as the regression model,
merely approximate the true relationship
between the variables
 A relationship will not be exactly linear,
with exactly normal distributions for y at
each x and with exactly the same standard
deviation of y values at each x value
47
Learning Objective 3:
Testing Independence between Quantitative
Variables
 Suppose that the slope β of the regression line equals 0. Then…
 The mean of y is identical at each x value
 The two variables, x and y, are statistically independent:
 The outcome for y does not depend on the value of x
 It does not help us to know the value of x if we want to predict the value of y
48
Learning Objective 3:
Testing Independence between Quantitative
Variables
49
Learning Objective 3:
Testing Independence between Quantitative
Variables
 Steps of Two-Sided Significance Test about a
Population Slope β:
1. Assumptions:
 The population satisfies the regression line: µy = α + βx
 Randomization
 The population values of y at each value of x follow a normal distribution, with the same standard deviation at each x value
50
Learning Objective 3:
Testing Independence between Quantitative
Variables
 Steps of Two-Sided Significance Test about a
Population Slope β:
2. Hypotheses:
H0: β = 0, Ha: β ≠ 0
3. Test statistic: t = (b − 0) / se
 Software supplies the sample slope b and its se
51
Learning Objective 3:
Testing Independence between Quantitative
Variables
 Steps of Two-Sided Significance Test
about a Population Slope β:
4. P-value: Two-tail probability of a t test statistic value more extreme than observed; use the t distribution with df = n − 2
5. Conclusions: Interpret the P-value in context
 If a decision is needed, reject H0 if P-value ≤ significance level
52
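In practice, software carries out steps 2 through 5. A minimal sketch (hypothetical data; assuming SciPy is available) using scipy.stats.linregress, which reports the sample slope, its standard error, and the two-sided P-value for H0: β = 0:

import numpy as np
from scipy import stats

x = np.array([2, 5, 8, 10, 11, 12, 15, 20, 25], dtype=float)
y = np.array([65, 70, 75, 78, 80, 82, 85, 92, 100], dtype=float)

res = stats.linregress(x, y)            # fits y-hat = intercept + slope * x
t_stat = (res.slope - 0) / res.stderr   # the t statistic for H0: beta = 0
print(res.slope, res.stderr, t_stat, res.pvalue)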
Learning Objective 3:
Example: Is Strength Associated with 60-Pound
Bench Press?
53
Learning Objective 3:
Example: Is Strength Associated with 60-Pound
Bench Press?
 Conduct a two-sided significance test of the null
hypothesis of independence
 Assumptions:
 A scatterplot of the data revealed a linear trend, so the straight-line regression model seems appropriate
 The scatter of points has a similar spread at different x values
 The sample was a convenience sample, not a random sample, so this is a concern
54
Learning Objective 3:
Example: Is Strength Associated with 60-Pound
Bench Press?
 Hypotheses: H0: β = 0, Ha: β ≠ 0
 Test statistic: t = (b − 0) / se = (1.49 − 0) / 0.150 ≈ 9.96
 P-value: 0.000
 Conclusion: An association exists between the number of
60-pound bench presses and maximum bench press
55
Learning Objective 4:
A Confidence Interval for β
 A small P-value in the significance test of H0: β =
0 suggests that the population regression line
has a nonzero slope
 To learn how far the slope β falls from 0, we
construct a confidence interval:
b ± t.025(se), with df = n − 2
56
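A sketch of the interval calculation itself, using the strength study values reported in the output (b = 1.49, se = 0.150, n = 57) and assuming SciPy for the t critical value:

from scipy import stats

b, se, n = 1.49, 0.150, 57
t_crit = stats.t.ppf(0.975, df=n - 2)      # about 2.00 for df = 55
ci = (b - t_crit * se, b + t_crit * se)
print(ci)                                  # roughly (1.2, 1.8)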
Learning Objective 4:
Example: Estimating the Slope for Predicting
Maximum Bench Press
 Construct a 95% confidence interval for β
1.49 ± 2.00(0.150), which is 1.49 ± 0.30, or (1.2, 1.8)
 Based on a 95% CI, we can conclude, on average,
the maximum bench press increases by between
1.2 and 1.8 pounds for each additional 60-pound
bench press that an athlete can do
57
Learning Objective 4:
Example: Estimating the Slope for Predicting
Maximum Bench Press
 Let’s estimate the effect of a 10-unit increase in x:
 Since the 95% CI for β is (1.2, 1.8), the 95% CI for 10β is (12, 18)
 On the average, we infer that the maximum bench press increases by at least 12 pounds and at most 18 pounds, for an increase of 10 in the number of 60-pound bench presses
58
Chapter 12: Analyzing
Association Between
Quantitative Variables:
Regression Analysis
Section 12.4: What Do We Learn from How
the Data Vary Around the Regression Line?
59
Learning Objectives
1. Residuals and Standardized Residuals
2. Analyzing Large Standardized Residuals
3. The Residual Standard Deviation
4. Confidence Interval for µy
5. Prediction Interval for y
6. Prediction Interval for y vs Confidence
Interval for µy
60
Learning Objective 1:
Residuals and Standardized Residuals
 A residual is a prediction error – the
difference between an observed outcome
and its predicted value
 The magnitude of these residuals depends on the units of measurement for y
 A standardized version of the residual
does not depend on the units
61
Learning Objective 1:
Standardized Residuals
 Standardized residual: (y − ŷ) / se(y − ŷ)
 The se formula is complex, so we rely on software to find it
 A standardized residual indicates how many standard errors a residual falls from 0
 If the relationship is truly linear and the standardized residuals have approximately a bell-shaped distribution, observations with standardized residuals larger than 3 in absolute value often represent outliers
62
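The exact se of a residual involves the leverage of each observation. As a sketch of one common version of the standardized residual, e / (s √(1 − h)), with hypothetical data and NumPy assumed (software such as MINITAB may use a closely related but slightly different formula):

import numpy as np

x = np.array([2, 5, 8, 10, 11, 12, 15, 20, 25], dtype=float)
y = np.array([65, 70, 75, 78, 80, 82, 85, 92, 100], dtype=float)
n = len(x)

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)                                        # ordinary residuals

s = np.sqrt(np.sum(resid ** 2) / (n - 2))                      # residual standard deviation
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverages
std_resid = resid / (s * np.sqrt(1 - h))                       # standardized residuals

print(np.round(std_resid, 2))   # values beyond about +/- 3 would flag outliers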
Learning Objective 1:
Example: Detecting an Underachieving College
Student
 Data were collected on a sample of 59 students at the University of Georgia
 Two of the variables were:
 CGPA: College Grade Point Average
 HSGPA: High School Grade Point Average
63
Learning Objective 1:
Example: Detecting an Underachieving College
Student
 A regression equation was created from the data:
 x: HSGPA
 y: CGPA
 Equation: ŷ = 1.19 + 0.64x
64
Learning Objective 1:
Example: Detecting an Underachieving College
Student
MINITAB highlights observations that have
standardized residuals with absolute value
larger than 2:
65
Learning Objective 1:
Example: Detecting an Underachieving College
Student
 Consider the reported standardized residual of −3.14
 This indicates that the residual is 3.14 standard errors below 0
 This student’s actual college GPA is quite far below what the regression line predicts
66
Learning Objective 2:
Analyzing Large Standardized Residuals
 Does it fall well away from the linear trend
that the other points follow?
 Does it have too much influence on the
results?
 Note: Some large standardized residuals may occur just because of ordinary random variability; even if the model is perfect, we’d expect about 5% of the standardized residuals to have absolute values > 2 by chance.
67
Learning Objective 2:
Histogram of Residuals
 A histogram of residuals or standardized
residuals is a good way of detecting
unusual observations
 A histogram is also a good way of
checking the assumption that the
conditional distribution of y at each x
value is normal
 Look for a bell-shaped histogram
68
Learning Objective 2:
Histogram of Residuals
 Suppose the histogram is not bell-shaped: the distribution of the residuals is not normal
However…
 Two-sided inferences about the slope parameter still work quite well
 The t inferences are robust
69
Learning Objective 3:
The Residual Standard Deviation
 For statistical inference, the regression
model assumes that the conditional
distribution of y at a fixed value of x is
normal, with the same standard deviation
at each x
 This standard deviation, denoted by σ,
refers to the variability of y values for all
subjects with the same x value
70
Learning Objective 3:
The Residual Standard Deviation
 The estimate of σ, obtained from the data, is:
s = √[ Σ(y − ŷ)² / (n − 2) ]
71
Learning Objective 3:
Example: How Variable are the Athletes’
Strengths?
 From MINITAB output, we obtain s, the
residual standard deviation of y:
s = √(3522.8 / 55) = 8.0
 For any given x value, we estimate the mean y
value using the regression equation and we
estimate the standard deviation using s: s =
8.0
72
Learning Objective 4:
Confidence Interval for µy
 We estimate µy, the population mean of y at a given value of x, by ŷ = a + bx
 We can construct a 95% confidence interval for µy using ŷ ± t.025(se of ŷ), where the t-score has df = n − 2
73
Learning Objective 5:
Prediction Interval for y
 The estimate ŷ = a + bx for the mean of y at a fixed value of x is also a prediction for an individual outcome y at that fixed value of x
 Most regression software will form this interval within which an outcome y is likely to fall: ŷ ± 2s, where s is the residual standard deviation
74
Learning Objective 6:
Prediction Interval for y vs Confidence Interval for µy
 The prediction interval for y is an
inference about where individual
observations fall
 Use a prediction interval for y if you want to predict where a single observation on y will fall for a particular x value
75
Learning Objective 6:
Prediction Interval for y vs Confidence Interval for µy
 The confidence interval for µy is an inference about where a population mean falls
 Use a confidence interval for µy if you want to estimate the mean of y for all individuals having a particular x value
 ŷ ± 2s/√n, where s is the residual standard deviation
76
Learning Objective 6:
Prediction Interval for y vs Confidence Interval for µy
 Note that the prediction interval is wider than the confidence interval: you can estimate a population mean more precisely than you can predict a single observation
 Caution: in order for these intervals to be
valid, the true relationship must be close
to linear with about the same variability of
y-values at each fixed x-value
77
Learning Objective 6:
Example: Predicting Maximum Bench Press and
Estimating its Mean
78
Learning Objective 6:
Example: Predicting Maximum Bench Press and
Estimating its Mean
 Use the MINITAB output to find and interpret a 95%
CI for the population mean of the maximum bench
press values for all female high school athletes who
can do x = 11 sixty-pound bench presses
 For all female high school athletes who can do 11
sixty-pound bench presses, we estimate the mean
of their maximum bench press values falls between
78 and 82 pounds
79
Learning Objective 6:
Example: Predicting Maximum Bench Press and
Estimating its Mean
 Use the MINITAB output to find and interpret a 95% Prediction Interval for a single new observation on the maximum bench press for a randomly chosen female high school athlete who can do x = 11 sixty-pound bench presses
 For all female high school athletes who can do 11
sixty-pound bench presses, we predict that 95% of
them have maximum bench press values between
64 and 96 pounds
80
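Both intervals can be roughly reproduced from the summary statistics on earlier slides (n = 57, x̄ = 11.0, sx = 7.1, s = 8.0, ŷ = 63.5 + 1.49x). A sketch, assuming NumPy and the usual formulas se(ŷ) = s√(1/n + (x0 − x̄)²/Σ(x − x̄)²) for the mean and s√(1 + 1/n + (x0 − x̄)²/Σ(x − x̄)²) for a new observation, with the t multiplier approximated by 2:

import numpy as np

n, x_bar, s_x, s = 57, 11.0, 7.1, 8.0
x0 = 11.0
y_hat = 63.5 + 1.49 * x0                 # about 79.9
sxx = (n - 1) * s_x ** 2                 # sum of (x - x_bar)^2

se_mean = s * np.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
se_new = s * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)

print("CI for the mean:", (y_hat - 2 * se_mean, y_hat + 2 * se_mean))   # about (78, 82)
print("PI for y:       ", (y_hat - 2 * se_new, y_hat + 2 * se_new))     # about (64, 96)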
Chapter 12: Analyzing
Association Between
Quantitative Variables:
Regression Analysis
Section 12.5: Exponential Regression: A
Model for Nonlinearity
81
Learning Objectives
1. Nonlinear Regression Models
2. Exponential Regression Model
3. Interpreting Exponential Regression Models
82
Learning Objective 1:
Nonlinear Regression Models
 If a scatterplot indicates substantial curvature in a
relationship, then equations that provide curvature
are needed
 Occasionally a scatterplot has a parabolic appearance: as x increases, y increases and then goes back down
 More often, y tends to continually increase or continually decrease, but the trend shows curvature
83
Learning Objective 1:
Example: Exponential Growth in Population
Size
 Since 2000, the population of the U.S. has been growing at a rate of 2% a year
 The population size in 2000 was 280 million
 The population size in 2001 was 280 × 1.02
 The population size in 2002 was 280 × (1.02)²
 …
 The population size in 2010 is estimated to be 280 × (1.02)¹⁰
 This is called exponential growth
84
Learning Objective 2:
Exponential Regression Model
 An exponential regression model has the formula µy = α(β)^x for the mean µy of y at a given value of x, where α and β are parameters
85
Learning Objective 2:
Exponential Regression Model
 In the exponential regression equation,
the explanatory variable x appears as the
exponent of a parameter
 The mean µy and the parameter β can take
only positive values
86
Learning Objective 2:
Exponential Regression Model
As x increases, the mean µy continually increases when β > 1 and continually decreases when 0 < β < 1
87
Learning Objective 2:
Exponential Regression Model
 For exponential regression, the logarithm
of the mean is a linear function of x
 When the exponential regression model
holds, a plot of the log of the y values
versus x should show an approximate
straight-line relation with x
88
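A minimal sketch of that idea (assuming NumPy; the data are simply the hypothetical 2% growth values from the population example above, not the Internet data): fit a straight line to log(y) versus x, then convert back to α and β:

import numpy as np

# Hypothetical exponential growth: y = 280 * 1.02^x (population example, in millions)
x = np.arange(0, 11, dtype=float)      # years since 2000
y = 280 * 1.02 ** x

slope, intercept = np.polyfit(x, np.log(y), 1)   # log(mu_y) = log(alpha) + x * log(beta)
alpha, beta = np.exp(intercept), np.exp(slope)

print(round(alpha, 1), round(beta, 3))           # recovers 280.0 and 1.02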
Learning Objective 2:
Example: Explosion in Number of People Using
the Internet
89
Learning Objective 2:
Example: Explosion in Number of People Using
the Internet
Plot of Number of People Using Internet between
1995 and 2001
90
Learning Objective 2:
Example: Explosion in Number of People Using
the Internet
Plot of Log Number of People Using Internet
between 1995 and 2001
91
Learning Objective 2:
Example: Explosion in Number of People Using
the Internet
 Using regression software, we can create the
exponential regression equation:
 x: the number of years since 1995. Start
with x = 0 for 1995, then x=1 for 1996, etc
 y: number of internet users
 Equation: ŷ = 20.38(1.7708)^x
92
Learning Objective 3:
Interpreting Exponential Regression Models
 In the exponential regression model µy = α(β)^x:
 the parameter α represents the mean value of y when x = 0
 the parameter β represents the multiplicative effect on the mean of y for a one-unit increase in x
93
Learning Objective 3:
Example: Explosion in Number of People Using
the Internet
 In this model: ŷ = 20.38(1.7708)^x
 The predicted number of Internet users in 1995 (for which x = 0) is 20.38 million
 The predicted number of Internet users in 1996 is 20.38 × 1.7708
94
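As a worked check of the multiplicative interpretation: the prediction for 1996 is 20.38 × 1.7708 ≈ 36.1 million, and multiplying by 1.7708 again gives about 63.9 million for 1997, so each additional year multiplies the predicted count by the growth factor β = 1.7708.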