Simple Linear Regression
(Session 02)
SADC Course in Statistics
Learning Objectives
At the end of this session, you will be able to
• understand the meaning of a simple linear
regression model, its aims and terminology
• determine the best fitting line describing
the relationship between a quantitative
response (y) and a quantitative
explanatory variable (x)
• interpret the unknown parameters of the
regression line
To put your footer here go to View > Header and Footer
An illustrative example
The data on the next slide show the average
number of cigarettes smoked per adult in
1930 and the death rate per million in 1952
for sixteen countries.
The question of interest is whether there is a
relationship between the death rate (y) and
level of smoking (x). Here both y and x are
quantitative measurements.
The Data
Country              Cig. smoked (x)   Death rate (y)
England and Wales         1378              461
Finland                   1662              433
Austria                    960              380
Netherlands                632              276
Belgium                   1066              254
Switzerland                706              236
New Zealand                478              216
U.S.A.                    1296              202
Denmark                    465              179
Australia                  504              177
Canada                     760              176
France                     585              140
Italy                      455              110
Sweden                     388               89
Norway                     359               77
Japan                      723               40
Start by plotting: the plot shows the pattern
[Figure: scatter plot of death rate (y) against cigarettes smoked (x)]
A straight-line relationship seems plausible here.
Recall reasons for modelling
• To determine which of (often) several
factors explain variability in the key
response of interest;
• To summarise the relationship(s);
• For predictive purposes, e.g. predicting y
for given x’s, or identifying x’s that
optimise y in some way;
Note: Presence of an association between
variables does not necessarily imply
causation.
Describing the Regression Model
Describe the variation in the response (here
death rate) in terms of its relationship with
the explanatory variable (here the number of
cigarettes smoked).
Model : data = pattern + residual
– can describe pattern as: a + bx , if
straight line relationship seems reasonable
– residual is unexplained variation assumed to be random.
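As a rough illustration of data = pattern + residual, the sketch below splits one observation (England and Wales, as read from the data slide) into the part described by a straight line and the part left over. The values of a and b are the rounded estimates reported later in this session; the variable names are only illustrative.

```python
# Decompose one observed value into pattern + residual.
# a and b are the rounded estimates from the fitted model in this session.
a, b = 28.31, 0.241      # intercept and slope

x, y = 1378, 461         # England and Wales: cigarettes smoked, death rate

pattern = a + b * x      # part of y described by the straight line
residual = y - pattern   # part of y the line leaves unexplained

print(f"pattern  = {pattern:.1f}")
print(f"residual = {residual:.1f}")
```

By construction, pattern and residual add back up to the observed death rate.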
Simple Linear Regression Model
If there is only one explanatory variable, we
have a Simple Linear Regression Model.
Here data = pattern + residual becomes:
y = α + βx + ε
where α + βx = pattern and ε = residual.
• α is called the intercept
• β is called the slope
• the ε's represent the departure of the
observed values from the true line.
A Diagrammatic Representation
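[Diagram: observed points (×) scattered about the line y = α + βx, plotted
with x on the horizontal axis and y on the vertical axis. For the point at
xi, the vertical gap between the observed yi and the line is marked as εi.]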
Parameters of Model & Assumptions
• α and β are the unknown parameters in the
model. They are estimated from the data.
• The random error, ε, is assumed to have a
– normal distribution
– with constant variance (whatever the
value of x)
We shall return to these assumptions later.
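To make these assumptions concrete, here is a minimal simulation sketch of the model. It assumes the errors have mean zero (the usual convention, not stated on the slide), and the parameter values and x values are hypothetical, chosen only for illustration.

```python
import random

def simulate(xs, alpha, beta, sigma, seed=0):
    """Generate y = alpha + beta*x + eps, with eps ~ Normal(0, sigma^2).

    The same sigma is used for every x, which is the
    constant-variance assumption.
    """
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical values, for illustration only.
xs = [400, 600, 800, 1000, 1200, 1400, 1600]
ys = simulate(xs, alpha=30.0, beta=0.25, sigma=80.0)

for x, y in zip(xs, ys):
    print(x, round(y, 1))
```

Setting sigma to zero in this sketch puts every point exactly on the line, so sigma controls how far the observations scatter around it.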
Results of model fitting
--------------------------------------------------------------
deathrate |  Coef.   Std.Err.     t    P>|t|   [95% Conf.Int.]
----------+---------------------------------------------------
Cigars    |  .2410    .0544     4.43   0.001    .1245    .3577
Const.    |  28.31    46.92     0.60   0.556   -72.34   128.95
--------------------------------------------------------------
Because the data are a sample, these are
estimates of the coefficients of the regression
equation; their precision is quantified by the
standard errors.
Estimated equation is: ŷ = 28.31 + 0.241x
Note: The t and P>|t| columns will be discussed in
the next session.
The fitted line
[Figure: scatter plot of death rate (y) against cigarettes smoked (x), with
the fitted values drawn as a straight line]
Interpreting model parameters
• Slope (regression coefficient): for each
additional unit of cigarettes smoked (x),
the estimated death rate rises by about
0.24 units on average. Equivalently, an
increase of 100 units in cigarettes smoked
corresponds to an estimated increase of
about 24 units in the death rate.
• Intercept: 28.31 is the estimated death
rate when x = 0. It only has a direct
meaning if the range of x values
(cigarettes smoked) under study includes
zero. Here the observed x values run from
359 to 1662, so the estimated death rate
of 28.3 at zero cigarettes smoked is an
extrapolation.
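The two interpretations can be checked with a few lines of arithmetic. This is only a sketch: the slope is the rounded estimate from the fitted model, the x values are those read from the data slide, and the helper name `change_in_y` is made up for illustration.

```python
slope = 0.241   # rounded estimate from the fitted model

def change_in_y(delta_x, slope=slope):
    """Estimated change in death rate for a change delta_x in cigarettes smoked."""
    return slope * delta_x

print(change_in_y(1))     # effect of one extra unit of x
print(change_in_y(100))   # effect of 100 extra units of x

# Cigarettes smoked (x), as read from the data slide.
xs = [1378, 1662, 960, 632, 1066, 706, 478, 1296,
      465, 504, 760, 585, 455, 388, 359, 723]

# The intercept is directly interpretable only if x = 0 lies in the observed range.
zero_in_range = min(xs) <= 0 <= max(xs)
print(min(xs), max(xs), zero_in_range)
```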
Predictions from the line
The model equation can also be used to
predict y at a given value of x.
Thus, from ŷ = 28.31 + 0.241x, the
predicted death rate (ŷ) in a country where
the number of cigarettes smoked is x = 1000 is
given by
ŷ = α̂ + β̂x
  = 28.31 + 0.241 (1000)
  = 269.3
Note: Predictions will be discussed in greater
detail in Session 9.
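The same calculation can be written as a small function. This is a sketch using the rounded estimates from the fitted model as defaults; the function name `predict` is illustrative.

```python
def predict(x, intercept=28.31, slope=0.241):
    """Predicted death rate from the fitted line, y-hat = intercept + slope * x."""
    return intercept + slope * x

y_hat = predict(1000)
print(round(y_hat, 1))   # agrees with the 269.3 worked out by hand
```

Predictions of this kind are most trustworthy for x values inside the range actually observed in the data.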
Computation of model estimates
(for reference only)
\[
\hat{\beta} = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \frac{S_{xy}}{S_{xx}}
\]
\[
\hat{\alpha} = \frac{\sum y_i - \hat{\beta} \sum x_i}{n} = \bar{y} - \hat{\beta}\,\bar{x}
\]
Note: Can also write
\[
\frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\]
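These formulas can be applied directly to the sixteen data pairs. The sketch below assumes the pairing of values as read from the data slide; it computes Sxy and Sxx in both the raw-sums form and the centred form and checks that they agree.

```python
# (cigarettes smoked, death rate), as read from the data slide
data = [
    (1378, 461), (1662, 433), (960, 380), (632, 276),
    (1066, 254), (706, 236), (478, 216), (1296, 202),
    (465, 179), (504, 177), (760, 176), (585, 140),
    (455, 110), (388, 89), (359, 77), (723, 40),
]
n = len(data)

# Raw sums used in the first form of the formulas.
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x * x for x, _ in data)

sxy = sum_xy - sum_x * sum_y / n
sxx = sum_x2 - sum_x ** 2 / n

beta_hat = sxy / sxx                          # slope estimate
alpha_hat = (sum_y - beta_hat * sum_x) / n    # intercept estimate

# Centred form: sums of products of deviations from the means.
x_bar, y_bar = sum_x / n, sum_y / n
sxy_centred = sum((x - x_bar) * (y - y_bar) for x, y in data)
sxx_centred = sum((x - x_bar) ** 2 for x, _ in data)

print(f"beta_hat  = {beta_hat:.4f}")
print(f"alpha_hat = {alpha_hat:.2f}")
print(f"forms agree: {abs(sxy / sxx - sxy_centred / sxx_centred) < 1e-9}")
```

The centred form is usually the safer choice numerically, because the raw-sums form subtracts two large, nearly equal quantities.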
Practical work follows to ensure
learning objectives are
achieved…