Lecture 17
• Interaction Plots
• Simple Linear Regression (Chapters 18.1-18.2)
• Homework 4 due Friday. JMP instructions
for question 15.41 are actually for question
15.35.
18.1 Introduction
• In Chapters 18 to 20 we examine the relationship
between interval variables via a mathematical
equation.
• The motivation for using the technique:
– Forecast the value of a dependent variable (y) from the
value of independent variables (x1, x2, …, xk).
– Analyze the specific relationships between the
independent variables and the dependent variable.
Uses of Regression Analysis
• A building maintenance company plans to submit a bid on a
contract to clean 40 corporate offices scattered throughout
an office complex. The costs incurred by the company are
proportional to the number of cleaning crews needed for
this task. How many crews will be enough?
• The product manager in charge of a brand of children’s
cereal would like to predict demand during the next year.
She has available the following “predictor” variables: price
of the product, number of children in target market, price
of competitors’ products, effectiveness of advertising,
annual sales this year and the previous year.
Uses of Regression Analysis
• A community in the Philadelphia area is interested
in how crime rates affect property values. If low
crime rates increase property values, the
community might be able to cover the cost of
increased police protection by gains in tax
revenues from higher property values.
• A real estate agent wants to more accurately
predict the selling price of houses. She believes
the following variables affect the price of a house:
Size of house (sq. feet), number of bedrooms,
frontage of lot, condition and location.
18.2 The Model
The model has a deterministic and a probabilistic component.
[Figure: house cost vs. house size, with a straight line showing the deterministic component; most lots sell for $25,000.]
18.2 The Model
However, house costs vary even among same-size houses!
Since costs behave unpredictably, we add a random component.
[Figure: house cost vs. house size, with points scattered around the line; most lots sell for $25,000.]
18.2 The Model
• The first order linear model
y = b0 + b1x + e
y = dependent variable
x = independent variable
b0 = y-intercept
b1 = slope of the line
e = error variable
b0 and b1 are unknown population parameters and are therefore estimated from the data.
[Figure: the line y = b0 + b1x on (x, y) axes; the intercept is b0, and the slope is b1 = Rise/Run.]
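To make the model concrete, here is a minimal simulation sketch in Python. It is not part of the original slides, and the parameter values (b0, b1, sigma) are illustrative assumptions in the spirit of the house-price picture:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters (assumptions, not from the slides)
    b0, b1, sigma = 25_000.0, 75.0, 10_000.0   # intercept, slope, error std. dev.

    x = rng.uniform(1_000, 4_000, size=100)    # house sizes (sq. feet)
    e = rng.normal(0.0, sigma, size=100)       # random error component
    y = b0 + b1 * x + e                        # first order linear model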
Interpreting the Coefficients
• E(Y | X) = b0 + b1X
• Roomsclean = 1.78 + 3.70 * (Number of Crews)
• b0 is called the y-intercept and b1 is called the slope.
• Interpretation of slope: “For every additional
cleaning crew, we are able to clean an additional
3.70 rooms on average.”
• Interpretation of intercept: Technically, it is how many
rooms on average can be cleaned with zero cleaning crews,
but it doesn't make sense here
because it involves extrapolation.
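As a quick illustration of using these coefficients, here is a small worked sketch (the crew count of 4 is hypothetical):

    b0, b1 = 1.78, 3.70           # estimated intercept and slope from the slide
    crews = 4                     # hypothetical number of cleaning crews
    rooms = b0 + b1 * crews       # predicted rooms cleaned on average
    print(rooms)                  # 16.58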
Simple Regression Model
• The data (x1, y1), …, (xn, yn) are assumed to be a realization of
yi = b0 + b1xi + ei,  i = 1, …, n,  where e1, …, en are iid ~ N(0, σ_e²)
• b0 + b1xi is the “signal” and ei is “noise” (error).
• b0, b1, σ_e² are the unknown parameters of the
model. Objective of regression is to estimate
them.
• What is the interpretation of σ_e²?
18.3 Estimating the Coefficients
• The estimates are determined by
– drawing a sample from the population of interest,
– calculating sample statistics,
– producing a straight line that cuts into the data.
[Figure: scatterplot of data points on (x, y) axes.]
Question: What should be considered a good line?
The Least Squares (Regression) Line
A good line is one that minimizes
the sum of squared differences between the
points and the line.
The Least Squares (Regression) Line
Let us compare two lines for the data points (1,2), (2,4), (3,1.5), and (4,3.2). The first line passes through (1,1) and (4,4); the second line is horizontal at 2.5.
Sum of squared differences (first line) = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89
Sum of squared differences (second line) = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99
[Figure: the four data points plotted with the two candidate lines.]
The smaller the sum of squared differences,
the better the fit of the line to the data.
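A quick check of these sums in Python (a sketch; the first line is y = x, as described above):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.0, 1.5, 3.2])

    sse_line1 = np.sum((y - x) ** 2)      # fitted values of y = x are just x
    sse_line2 = np.sum((y - 2.5) ** 2)    # horizontal line at 2.5
    print(sse_line1, sse_line2)           # 7.89 3.99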
The Estimated Coefficients
To calculate the estimates of the line
coefficients that minimize the differences
between the data points and the line, use
the formulas:
b1 = cov(X, Y) / s_x²
b0 = ȳ - b1x̄
The regression equation that estimates
the equation of the first order linear model
is:
ŷ = b0 + b1x
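These formulas translate directly into code. A minimal sketch, assuming numpy arrays x and y (for example, the simulated house data above):

    import numpy as np

    def least_squares(x, y):
        # Estimate b0 and b1 by the least squares formulas from the slide.
        cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance of X and Y
        s2_x = np.var(x, ddof=1)              # sample variance s_x^2
        b1 = cov_xy / s2_x                    # slope
        b0 = y.mean() - b1 * x.mean()         # intercept
        return b0, b1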
Typical Regression Analysis
• Observe pairs of data (x1, y1), …, (xn, yn).
• Plot the data! See if a simple linear regression
model seems reasonable. If necessary, transform
the data.
• Suspect (or hope) SRM assumptions are justified.
• Estimate the true regression line E(y | x) = b0 + b1x
by the LS regression line ŷ = b0 + b1x.
• Check the model and make inferences.
The Simple Linear Regression Line
• Example 18.2 (Xm18-02)
– A car dealer wants to find
the relationship between
the odometer reading and
the selling price of used cars.
– A random sample of 100 cars
is selected, and the data
recorded.
– Find the regression line.
Car   Odometer   Price
1     37,388     14,636
2     44,758     14,122
3     45,833     14,016
4     30,862     15,590
5     31,705     15,568
6     34,010     14,718
...   ...        ...
(Odometer is the independent variable x; Price is the dependent variable y.)
The Simple Linear Regression Line
• Solution
– Solving by hand: Calculate a number of statistics
x̄ = 36,009.45;  s_x² = Σ(xi - x̄)² / (n - 1) = 43,528,690
ȳ = 14,822.82;  cov(X, Y) = Σ(xi - x̄)(yi - ȳ) / (n - 1) = -2,712,511
where n = 100.
b1 = cov(X, Y) / s_x² = -2,712,511 / 43,528,690 = -.06232
b0 = ȳ - b1x̄ = 14,822.82 - (-.06232)(36,009.45) = 17,067
ŷ = b0 + b1x = 17,067 - .0623x
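The same arithmetic as a quick check (a sketch using the summary statistics from the slide rather than the raw Xm18-02 data):

    cov_xy = -2_712_511              # cov(X, Y) from the slide
    s2_x = 43_528_690                # sample variance of odometer readings
    x_bar, y_bar = 36_009.45, 14_822.82

    b1 = cov_xy / s2_x               # about -.06232
    b0 = y_bar - b1 * x_bar          # about 17,067
    print(b1, b0)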
Interpreting the Linear Regression Equation
ŷ = 17,067 - .0623x
[Figure: Odometer Line Fit Plot; Price on the y-axis against Odometer on the x-axis, with the fitted line. There are no data near the intercept value of 17,067.]
The intercept is b0 = $17,067. Do not interpret the intercept as the
“price of cars that have not been driven,” because no car in the
sample had an odometer reading near zero.
The slope is b1 = -.0623. For each additional mile on the odometer,
the price decreases by an average of $0.0623.
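For example, to predict the price of a car with 40,000 miles on the odometer (an illustrative value, not from the slides):

    b0, b1 = 17_067, -0.0623
    odometer = 40_000                # hypothetical odometer reading
    price = b0 + b1 * odometer       # 17,067 - 2,492 = 14,575
    print(price)                     # 14575.0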
Fitted Values and Residuals
• The least squares line decomposes the data
into two parts, yi = ŷi + ei, where ŷi = b0 + b1xi and ei = yi - ŷi.
• ŷ1, …, ŷn are called the fitted or predicted
values.
• e1, …, en are called the residuals.
• The residuals e1, …, en are estimates of the
errors ε1, …, εn.
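In code, the decomposition is immediate (a sketch reusing the least_squares helper defined earlier):

    b0, b1 = least_squares(x, y)     # estimates from the earlier sketch
    y_hat = b0 + b1 * x              # fitted (predicted) values
    resid = y - y_hat                # residuals: estimates of the errors
    # With an intercept in the model, the residuals sum to (numerically) zero.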
18.4 Error Variable: Required
Conditions
• The error e is a critical part of the regression model.
• Four requirements involving the distribution of e must
be satisfied.
– The probability distribution of e is normal.
– The mean of e is zero: E(e) = 0.
– The standard deviation of e is σ_e for all values of x.
– The set of errors associated with different values of y are all
independent.
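These conditions are usually checked with residual plots. A minimal sketch (the use of matplotlib here is an assumption, not something the slides prescribe):

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.scatter(x, resid)            # even spread around 0 supports E(e) = 0 and constant σ_e
    ax1.axhline(0.0, color="gray")
    ax2.hist(resid, bins=20)         # a roughly bell-shaped histogram supports normality
    plt.show()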
The Normality of e
[Figure: normal curves centered at m1 = b0 + b1x1, m2 = b0 + b1x2, m3 = b0 + b1x3 above x1, x2, x3. The standard deviation remains constant, but the mean value changes with x.]
From the first three assumptions we have:
y is normally distributed with mean
E(y) = b0 + b1x, and a constant standard
deviation σ_e.
Estimating σ_e
s_e = √[ Σ (yi - ŷi)² / (n - 2) ]
• The standard error of estimate s_e (root mean
squared error) is an estimate of σ_e.
• The standard error of estimate is basically the
standard deviation of the residuals.
• If the simple regression model holds, then
approximately
– 68% of the data will lie within one s_e of the LS line.
– 95% of the data will lie within two s_e of the LS line.
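A direct translation of this formula (a sketch, reusing y_hat from the fitted-values code above):

    import numpy as np

    n = len(y)
    s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))   # standard error of estimate
    # Dividing by n - 2 accounts for the two estimated coefficients, b0 and b1.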
Cleaning Crew Example
• Roomsclean = 1.78 + 3.70 * (Number of Crews)
• The building maintenance company is
planning to submit a bid on a contract to
clean 40 corporate offices scattered
throughout an office complex. Currently,
the company has only 11 cleaning crews.
Will 11 crews be enough?
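The point prediction is simple to compute (a sketch; whether 42.5 rooms is "enough" for 40 offices also depends on the day-to-day variation measured by s_e):

    b0, b1 = 1.78, 3.70
    crews = 11
    rooms = b0 + b1 * crews      # 1.78 + 40.7 = 42.48 rooms on average
    print(rooms >= 40)           # True on average; individual days vary by about s_e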
Practice Problems
• 18.4,18.10,18.12