Happiness comes not from material wealth but less desire.
1
Applied Statistics Using SAS
and SPSS
Topic: Simple linear regression
By Prof Kelly Fan, Cal State Univ, East Bay
2
Example: Computer Repair
A company markets and repairs small
computers. How fast (Time) an
electronic component (Computer Unit)
can be repaired is very important to the
efficiency of the company. The
variables in this example are
Time (repair time, in minutes) and Units (number of units repaired).
3
Hmm… How long will it take me to repair this unit?
Goal: to predict the length of repair
Time for a given number of computer
Units
4
Computer Repair Data
Units   Minutes   Units   Minutes
1       23        6       97
2       29        7       109
3       49        8       119
4       64        9       149
4       74        9       145
5       87        10      154
6       96        10      166
5
Graphical Summary of Two Quantitative
Variables
Scatterplot of the response variable against the explanatory
variable
• What is the overall (average) pattern?
• What is the direction of the pattern?
• How much do the data points vary from the overall (average)
pattern?
• Are there any potential outliers?
6
Summary for Computer Repair Data
Scatterplot (Time vs Units)
Some Simple Conclusions
• Time is linearly related
to computer Units.
• (The length of) Time
increases as (the number
of) Units increases.
• The data points are close to
the line.
• There are no potential outliers.
7
Numerical Summary of Two Quantitative
Variables
Regression equation
Correlation
8
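Review: Math Equation for a Line
• Y: the response variable
• X: the explanatory variable
$Y = b_0 + b_1 X$
[Figure: a line plotted with Y against X; b0 is the intercept and b1 is the change in Y for a 1-unit increase in X.]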
9
Regression Equation
The regression line models the
relationship between X and Y on average.
The math equation of a regression line is
called the regression equation.
10
The Usage of Regression Equation
Predict the value of Y for a given X value.
E.g., how long will it take to repair 3
computer units?
11
General Notation
$\hat{Y} = b_0 + b_1 X$
• $\hat{Y}$ is called "predicted Y,"
pronounced "y hat"; it estimates
the average Y value for a specified X
value.
E.g.,
$\hat{Y} = 4.16 + 15.51 X$
• The predicted repair time (in minutes) for a given number of
units
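As a rough check on the example equation, the least-squares line can be refit from the Computer Repair Data. The sketch below uses only the Python standard library; the helper name `fit_line` and the variable names are illustrative assumptions, not part of the original slides.

```python
from statistics import mean

# Computer Repair Data from the slides (Units, Minutes).
units = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]


def fit_line(x, y):
    """Return (b0, b1) of the least-squares line y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1


b0, b1 = fit_line(units, minutes)
predicted_3_units = b0 + b1 * 3

print(f"y-hat = {b0:.2f} + {b1:.2f} * X")
print(f"Predicted repair time for 3 units: {predicted_3_units:.1f} minutes")
```

With these data the fit reproduces the slide's coefficients (about 4.16 and 15.51), and the prediction for 3 units is roughly 50.7 minutes.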
12
The Limitation of the Regression Equation
• The regression equation cannot be used to
predict Y for X values that are (far)
beyond the range in which the data were observed.
E.g., the predicted weight (WT, in pounds) for a given height (HT, in inches):
$\hat{Y} = -205 + 5X$
Given an HT of 40", the regression equation
gives a WT of -205 + 5 × 40 = -5 pounds!
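The extrapolation problem can be made concrete in a few lines. This is a minimal sketch: `predict_weight` encodes the slide's example equation, while `predict_in_range` and the observed height range of 60 to 76 inches are hypothetical, since the slides do not state the range of the height data.

```python
def predict_weight(height_in):
    """Weight in pounds from the example equation y-hat = -205 + 5 * X."""
    return -205 + 5 * height_in


def predict_in_range(height_in, observed_min, observed_max):
    """Predict only when X lies inside the observed range; otherwise raise."""
    if not observed_min <= height_in <= observed_max:
        raise ValueError(
            f"height {height_in} is outside the observed range "
            f"[{observed_min}, {observed_max}]; prediction would be extrapolation"
        )
    return predict_weight(height_in)


# Hypothetical observed range of heights (not given in the slides).
OBSERVED_MIN, OBSERVED_MAX = 60, 76

print(predict_weight(40))  # -5, an impossible weight
try:
    predict_in_range(40, OBSERVED_MIN, OBSERVED_MAX)
except ValueError as err:
    print(err)
```

Refusing to predict outside the observed range is one possible design choice; a warning instead of an exception would serve the same purpose.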
13
The Unpredicted Part
The value $Y - \hat{Y}$ is the part the
regression equation (model) cannot
predict, and it is called the "residual."
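To illustrate, the residuals for the Computer Repair Data can be computed directly as observed minus predicted values. The sketch below is an assumption-labeled example using the standard library only; the variable names are illustrative.

```python
from statistics import mean

# Computer Repair Data from the slides (Units, Minutes).
units = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

# Least-squares slope and intercept.
x_bar, y_bar = mean(units), mean(minutes)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(units, minutes))
sxx = sum((x - x_bar) ** 2 for x in units)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual = observed Y minus predicted Y (y-hat).
predicted = [b0 + b1 * x for x in units]
residuals = [y - y_hat for y, y_hat in zip(minutes, predicted)]

# Sum of squared residuals (the "Residual" sum of squares in an ANOVA table).
sse = sum(e ** 2 for e in residuals)

for x, y, e in zip(units, minutes, residuals):
    print(f"units={x:2d}  minutes={y:3d}  residual={e:6.2f}")
print(f"sum of residuals = {sum(residuals):.6f}, SSE = {sse:.3f}")
```

For a least-squares line with an intercept the residuals sum to zero (up to rounding), which is a quick sanity check on the fit.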
14
[Figure: the residual marked on a plot of the regression line]
15
Correlation between X and Y
X and Y might be related to each other in
many ways: linear or curved.
16
Examples of Different Levels of Correlation
[Figure: two scatterplots of y against x, both with x from 0.0 to 1.0]
r = .98: Strong Linearity
r = .71: Medium Linearity
17
Examples of Different Levels of Correlation
[Figure: two scatterplots of y against x, both with x from 0.0 to 1.0]
r = -.09: Nearly Uncorrelated
r = .00: Nearly Curved
18
(Pearson) Correlation Coefficient of X and Y
• A measurement of the strength of the
"LINEAR" association between X and Y
• $s_x$: the standard deviation of the data
values in X; $s_y$: the standard deviation of
the data values in Y;
the correlation coefficient of X and Y is:
$$r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{(n-1)\, s_y s_x}$$
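The formula can be evaluated term by term on the Computer Repair Data. The sketch below is a minimal illustration using the standard library's sample standard deviation (`statistics.stdev`, which divides by n - 1); the function name `pearson_r` is an assumption for this example.

```python
from statistics import mean, stdev

# Computer Repair Data from the slides (Units, Minutes).
units = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]


def pearson_r(x, y):
    """r = sum((y_i - y_bar)(x_i - x_bar)) / ((n - 1) * s_y * s_x)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * stdev(y) * stdev(x))


r = pearson_r(units, minutes)
print(round(r, 3))  # 0.994
```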
19
Correlation Coefficient of X and Y
-1 ≤ r ≤ 1
The magnitude of r measures the strength
of the linear association of X and Y.
The sign of r indicates the direction of the
association: "-" → negative association;
"+" → positive association.
20
Goodness of Fit
R^2 is the proportion of the variance in Y
explained (accounted for) by the model we use
to fit the data.
When there is only one X (simple linear
regression), R^2 = r^2.
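The identity R^2 = r^2 can be checked numerically on the Computer Repair Data. In this sketch R^2 is computed as 1 - SSE/SST and compared with the squared correlation; the variable names are illustrative assumptions.

```python
from math import sqrt
from statistics import mean

# Computer Repair Data from the slides (Units, Minutes).
units = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

x_bar, y_bar = mean(units), mean(minutes)
sxx = sum((x - x_bar) ** 2 for x in units)
syy = sum((y - y_bar) ** 2 for y in minutes)   # total sum of squares (SST)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(units, minutes))

# Fit the line and compute the residual sum of squares (SSE).
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(units, minutes))

r_squared = 1 - sse / syy        # proportion of Y variance explained
r = sxy / sqrt(sxx * syy)        # Pearson correlation coefficient

print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```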
21
SPSS Output
Analyze >> Regression >> Linear
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .994 (a)   .987       .986                5.39172
a. Predictors: (Constant), units
b. Dependent Variable: time

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   27419.509        1    27419.509     943.201   .000 (a)
Residual     348.848          12   29.071
Total        27768.357        13
a. Predictors: (Constant), units
b. Dependent Variable: time
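The entries of the ANOVA table follow from three sums of squares. The sketch below recomputes them from the Computer Repair Data with the standard library only; it is an illustration of the arithmetic, not a reproduction of how SPSS computes its output, and the variable names are assumptions.

```python
from math import sqrt
from statistics import mean

# Computer Repair Data from the slides (Units, Minutes).
units = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

n = len(units)
x_bar, y_bar = mean(units), mean(minutes)
sxx = sum((x - x_bar) ** 2 for x in units)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(units, minutes))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in units]

sst = sum((y - y_bar) ** 2 for y in minutes)                  # Total
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)        # Regression
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(minutes, predicted))  # Residual

df_regression, df_residual = 1, n - 2
msr = ssr / df_regression
mse = sse / df_residual
f_stat = msr / mse
std_error_of_estimate = sqrt(mse)

print(f"Regression SS = {ssr:.3f}, Residual SS = {sse:.3f}, Total SS = {sst:.3f}")
print(f"F = {f_stat:.3f}, Std. Error of the Estimate = {std_error_of_estimate:.5f}")
```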
22
Confidence Intervals
Coefficients (a)
Model 1      B        Std. Error   Beta   t        Sig.   Lower Bound   Upper Bound
(Constant)   4.162    3.355               1.240    .239   -3.148        11.472
units        15.509   .505         .994   30.712   .000   14.409        16.609
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Lower Bound and Upper Bound give the 95% Confidence Interval for B.
a. Dependent Variable: time
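The confidence bounds in the table have the form estimate ± t × (standard error). The following sketch recomputes them under two stated assumptions: the standard textbook formulas for the standard errors in simple linear regression, and a hard-coded critical value t(0.975, 12 df) ≈ 2.1788 taken from a t table rather than computed.

```python
from math import sqrt
from statistics import mean

# Computer Repair Data from the slides (Units, Minutes).
units = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

n = len(units)
x_bar, y_bar = mean(units), mean(minutes)
sxx = sum((x - x_bar) ** 2 for x in units)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(units, minutes))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate: s = sqrt(SSE / (n - 2)).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(units, minutes))
s = sqrt(sse / (n - 2))

# Standard errors of the slope and the intercept.
se_b1 = s / sqrt(sxx)
se_b0 = s * sqrt(1 / n + x_bar ** 2 / sxx)

# Assumed critical value: t(0.975) with n - 2 = 12 degrees of freedom.
T_CRIT = 2.1788

ci_b0 = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)
ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

print(f"(Constant): {b0:.3f}, 95% CI ({ci_b0[0]:.3f}, {ci_b0[1]:.3f})")
print(f"units:      {b1:.3f}, 95% CI ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
```

The interval for the constant contains 0, which agrees with its Sig. value of .239, while the interval for the slope does not.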
23
Check for Normality
24
Check for Equal Variances
Scatterplot of standardized residuals (zresid) against standardized predicted values (zpred)
25
The Influence of Outliers
• The slope
becomes smaller
(toward outliers)
• The r value
becomes smaller
(less linear)
[Figure: scatterplot of Y3 against X3]
26
The Influence of Outliers
• The slope
becomes clear
(toward outliers)
• The |r| value
becomes larger
(more linear:
0.159 → 0.935)
[Figure: scatterplot of y vs x]
27
Identify Outliers Using Residual Plots
Use "standardized" residuals!
Cases with standardized residuals of
magnitude 3 or more are outliers.
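The rule can be sketched on the Computer Repair Data. This example assumes that a standardized residual is the raw residual divided by the standard error of the estimate, sqrt(SSE / (n - 2)); other definitions (for example studentized residuals) exist and would give slightly different values. The names `z_residuals` and `outliers` are illustrative.

```python
from math import sqrt
from statistics import mean

# Computer Repair Data from the slides (Units, Minutes).
units = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

n = len(units)
x_bar, y_bar = mean(units), mean(minutes)
sxx = sum((x - x_bar) ** 2 for x in units)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(units, minutes))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(units, minutes)]
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Assumed definition: standardized residual = residual / s.
z_residuals = [e / s for e in residuals]

# Flag cases whose standardized residual has magnitude 3 or more.
outliers = [
    (x, y, round(z, 2))
    for x, y, z in zip(units, minutes, z_residuals)
    if abs(z) >= 3
]

print("largest |standardized residual|:", round(max(abs(z) for z in z_residuals), 2))
print("flagged outliers:", outliers)
```

No case is flagged for these data, which is consistent with the earlier reading of the scatterplot that there are no potential outliers.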
28