Introduction to Regression
©2005 Dr. B. C. Paul
Things Favoring ANOVA Analysis

ANOVA tells you whether a factor is
controlling a result

It requires that the control factor be easily categorized

Example: Spring, Summer, Fall
Tends to work well on non-quantitative, unordered, or discontinuous controlling factors

Does not quantify the magnitude or type of effect, only its existence

Example: gas mileage is influenced by the season of the year, the driving distance, and the driver
Things Favoring Regression Analysis

Suppose your gas mileage data is recorded against

Outside Temperature
Distance Driven
Age of Driver

The data can be categorized only by arbitrary divisions
Suppose I want to know quantitatively how these continuous numeric variables control gas mileage
What Regression Does

Idea is that you have a “Dependent Variable” that is a function of some “Independent Variable”

Y = F(X)
Could be gas mileage as a function of temperature
The simplest form of a function is a straight line

Y = b0 + b1*X
Reminders on Linear Form
b1 represents the units of rise in Y per unit of run in X (i.e., it is the slope of the line)
b0 is the intercept of the line with the vertical axis at X = 0
Idea Behind Linear Regression




Most of the variation of Y can be explained as a linear function of X
The portion of variation in Y due to other known, unknown, or random causes is normally distributed about the regression line
The degree to which we missed predicting Y using X can be measured by squaring the difference between the actual and predicted value
We will select our linear coefficients b0 and b1 such that the sum of all these squared differences is minimized
For this class we will skip the derivation and the mathematical formulas used to get b0 and b1
Linear Regression is readily done by most calculators and our friend program SPSS
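For anyone curious what the calculator or SPSS is doing, here is a minimal sketch of the least-squares idea in Python. It is not the class procedure, the function names are my own, and the numbers are made up for illustration only. It uses the standard closed-form result for a single X variable and then checks that nudging the coefficients makes the sum of squared differences worse.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line Y = b0 + b1*X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line always passes through (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1


def sum_squared_error(xs, ys, b0, b1):
    """Sum of squared differences between actual and predicted Y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustration values, not the class data set.
distance = [2.0, 5.0, 10.0, 20.0, 40.0]
mpg = [15.0, 19.0, 21.0, 26.0, 30.0]

b0, b1 = fit_line(distance, mpg)
best = sum_squared_error(distance, mpg, b0, b1)
# Nudging either coefficient away from the fit can only increase the error.
nudged = sum_squared_error(distance, mpg, b0 + 0.5, b1 - 0.01)
print(f"b0={b0:.3f} b1={b1:.3f} SSE={best:.3f} nudged SSE={nudged:.3f}")
```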
Doing Linear Regression With SPSS
Begin by Entering the Data
In this case we will consider gas mileage as a function of the distance a car is driven. We believe that there may be a relationship because vehicles take a while to warm up and get better mileage after warm up.
Why Did I Pick Linear Regression?

Controlling variable was continuous, not categorical

If I had looked at gas mileage as a function of gender, the control variable would have been categorical (male, female)
A linear relationship is an easy one to consider
There are ways of plotting data to see if it appears there might be a linear trend.
A Note on Modeling



Statistical Methods are all about fitting mathematical models to real data
A linear regression attempts to fit a straight-line function of X through the data Y
Ultimately the quality of what I do depends on how well the model represents reality

A poorly fit model will still produce answers, but right answers cost more and are harder to get
Visually Examining Our Data Set
Go to Graphs and click to pull down the menu
Highlight and click on Scatter Plot
You Will Be Given A Choice of Types of Scatter Plots
The default is a simple scatter plot, which I am going to accept.
I will click on the Define button to move to the next screen.
I Need to Define What to Plot on the Y and X axis
The Y axis is my dependent variable.
In this case I believe that MPG is a function of distance traveled.
To make it my Y variable I will highlight MPG and then click the arrow by Y Axis
Next Choose my Independent Variable
Since I believe that MPG might be a function of distance driven, I next select Distance and click the arrow by X Axis to move the variable over to X Axis.
Then I click OK to go to the plot.
Out Comes my Plot
I see a fairly clear indication that gas mileage is improving with the length of the trip
Now Getting on to Regression
Click the pull-down menu for Analyze
Highlight Regression to pop the side menu out
Highlight and click Linear
Select the Regression Variables
Note that I selected MPG for my dependent variable and distance as my independent variable.
Click OK and Out Comes Stuff
First it tells me about the variables that entered (i.e., what did it try to make MPG a function of). I told it to make it a function of distance, and the table says it entered distance as the controlling variable. Method Enter means it entered that variable because I told it to.
Next Box Tells Me About How Well I Did Guessing a Linear Model
R^2 is called the coefficient of determination (R itself is the Pearson product-moment correlation coefficient). It tells me how much of the total scatter in the data is explained by my linear regression on one variable (distance).
0.393 means 39.3% was explained
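As a rough sketch of what "scatter explained" means, the Python below computes R^2 as one minus the unexplained scatter divided by the total scatter. The function name and the numbers are my own made-up illustration (the predicted values come from a least-squares line through these made-up points), not the class data.

```python
def r_squared(actual, predicted):
    """Share of the total scatter in `actual` explained by `predicted`."""
    mean_y = sum(actual) / len(actual)
    # Total scatter: squared distance of each point from the plain average.
    total_scatter = sum((y - mean_y) ** 2 for y in actual)
    # Unexplained scatter: squared distance of each point from the line.
    unexplained = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - unexplained / total_scatter


# Made-up illustration values, not the class data set.
actual_mpg = [18.0, 20.0, 25.0, 23.0, 30.0]
predicted_mpg = [17.8, 20.5, 23.2, 25.9, 28.6]

share = r_squared(actual_mpg, predicted_mpg)
print(f"R^2 = {share:.3f} ({share:.1%} of the scatter explained)")
```

A perfect prediction gives 1, and a prediction that is no better than quoting the average gives 0.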
More Interpretation
R value tells you how well your data followed a straight line. 1 means it is a straight line. 0 means it's nothing like a straight line (points on a circle would pull a 0 even though Y is related to X, because the relationship is not a linear one).
Standard Error of the Estimate is roughly how far on average you would miss your guess if you just gave the mileage predicted by the equation.
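The circle remark can be checked with a short sketch. This assumes the R in question is the ordinary Pearson correlation between X and Y (for a single X variable the R that SPSS reports is the size of that correlation). The function name and the points are my own illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Twelve points spaced evenly around a circle: X and Y are clearly related,
# but not by a straight line.
angles = [2 * math.pi * k / 12 for k in range(12)]
circle_x = [math.cos(a) for a in angles]
circle_y = [math.sin(a) for a in angles]

# Four points that sit exactly on a straight line.
line_x = [0.0, 1.0, 2.0, 3.0]
line_y = [1.0, 3.0, 5.0, 7.0]

r_circle = pearson_r(circle_x, circle_y)
r_line = pearson_r(line_x, line_y)
print(f"circle r = {r_circle:.6f}, line r = {r_line:.6f}")
```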
The ANOVA Table
SPSS does an ANOVA on the linear model as a predictor. The F value for the regression is 17.476. The chance of getting an F value that high if the model fit was a fluke is essentially 0.
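For a regression on one variable, F and R^2 are tied together: F is the explained share per regression degree of freedom divided by the unexplained share per error degree of freedom. The sketch below uses the two values quoted on the slides (R^2 = 0.393, F = 17.476) to back out the error degrees of freedom. That figure is my own inference from rounded numbers, not something stated in the SPSS output shown here, so treat it as approximate.

```python
def f_from_r_squared(r2, df_error, df_regression=1):
    """F statistic implied by R^2 for the given degrees of freedom."""
    return (r2 / df_regression) / ((1.0 - r2) / df_error)


# Values quoted on the slides.
r2_reported = 0.393
f_reported = 17.476

# Rearranging the formula above for the error degrees of freedom
# (one regression degree of freedom, since there is one X variable).
implied_df_error = f_reported * (1.0 - r2_reported) / r2_reported
print(f"implied error degrees of freedom: about {implied_df_error:.1f}")

# Going the other way as a consistency check.
print(f"F implied by R^2 with 27 error df: {f_from_r_squared(r2_reported, 27):.2f}")
```

If the inference holds, the data set would have roughly 29 trips in it (error degrees of freedom plus two fitted coefficients).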
The Coefficients Table
The coefficient table gives the regression constants
b0 = 14.962
b1 = 0.654
Y = 14.962 + 0.654*X
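Once the constants are known, predicting is just plugging in. A small sketch using the two coefficients from the table follows; the function name is my own, and the slides do not state the units of distance, so the trip lengths below are only illustrative.

```python
B0 = 14.962  # intercept from the coefficients table
B1 = 0.654   # slope from the coefficients table


def predict_mpg(distance):
    """Predicted gas mileage for a trip of the given distance."""
    return B0 + B1 * distance


# Illustrative trip lengths (units as recorded in the data set).
for trip in (0.0, 5.0, 10.0):
    print(f"distance {trip:5.1f} -> predicted MPG {predict_mpg(trip):.3f}")
```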
How Good are Our Coefficients?
The standard error (the standard deviation of the estimate) for each coefficient value is given here. The constant is 14.962 and the standard deviation of that estimate is 2.803.
A test statistic is computed for each coefficient in the equation. The “null hypothesis” is that the slope or intercept is actually 0. The test statistic has a t distribution.
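The test statistic itself is simple arithmetic: the coefficient divided by its standard deviation, since the null hypothesis value is 0. A sketch using the constant from the table is below. The variable names are mine, and the "about 2" yardstick in the comment is only the usual rule of thumb, not something taken from the SPSS output.

```python
constant = 14.962       # b0 from the coefficients table
constant_std = 2.803    # standard deviation of that estimate

# How many standard deviations the estimate sits away from zero.
t_constant = (constant - 0.0) / constant_std
print(f"t for the constant = {t_constant:.3f}")

# Rule of thumb (an assumption for illustration): a t much bigger than
# about 2 in size makes a true value of 0 hard to believe.
looks_nonzero = abs(t_constant) > 2.0
print("constant looks different from 0:", looks_nonzero)
```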
Significance of the Coefficients
Significance levels in this table indicate the chance of seeing a test statistic this large if the real value of the regression coefficient were 0.
As can be seen, for both coefficients there is essentially no chance of that, so neither coefficient should be taken as 0.
Some Conclusions



There is definitely a linear influence of miles driven on gas mileage; however, the linear relationship only explains about 40% of the variability in the data.
We know there is still something out there
We may also want to examine our residuals to see if there are any trends in the residuals indicating we might be missing something, or that our assumption of a constant normal distribution of residuals about the model is wrong

That assumption might come up wrong, for example, if we were wrong about a linear model being the best fit.
Examining Residuals of Regression
Set up your linear regression in the usual manner.
Selecting Plots
After setting your dependent and independent variables and before clicking OK, click Plots instead.
Picking Residual Plots
Plot the residual on the Y axis against the predicted value on the X axis.
Ask for histograms and normal probability plots.
More Plots
Use the Next button to allow you to select another plot.
Then enter the residual on the Y axis against the dependent variable on the X axis.
Finally tell the computer to continue.
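The quantities being plotted here are easy to build by hand. The sketch below fits a line to a few made-up points (not the class data), then forms the predicted value and the residual (actual minus predicted) for each point. The function name is my own.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line Y = b0 + b1*X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


# Made-up illustration values, not the class data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.5, 3.0, 6.5, 5.0, 8.0]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
# Residual: how far the actual value sits above (+) or below (-) the line.
residuals = [y - p for y, p in zip(ys, predicted)]

for x, y, p, r in zip(xs, ys, predicted, residuals):
    print(f"X={x:4.1f}  actual={y:5.2f}  predicted={p:5.2f}  residual={r:+.2f}")
```

For a least-squares line with a constant term the residuals always add up to zero, so the plots are about their pattern, not their total.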
You Will Still Get the Normal Tables we Saw Before
Scroll down to see what is new.
Some Abnormality in the Histogram
A histogram is a bar chart showing the number of results in different numeric intervals.
In this case we can see there may be two families of unexplained events, and one of them is causing the model to over-predict (note the negative tail).
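Counting results into numeric intervals can be sketched in a few lines. The residual values, interval width, and function name below are all made up for illustration; they are chosen to show a small separate group out in the negative tail, not to reproduce the class histogram.

```python
def histogram_counts(values, low, width, bins):
    """Count how many values fall in each interval [low + i*width, low + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - low) // width)
        if 0 <= index < bins:
            counts[index] += 1
    return counts


# Made-up residuals: most near zero, two far out on the negative side.
residuals = [-6.5, -5.8, -1.2, -0.4, 0.1, 0.6, 0.9, 1.4, 2.0]

counts = histogram_counts(residuals, low=-8.0, width=2.0, bins=6)
for i, count in enumerate(counts):
    start = -8.0 + 2.0 * i
    print(f"{start:5.1f} to {start + 2.0:5.1f}: {'#' * count}")
```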
We Have a Cumulative Probability Plot
Cumulative probability counts all the samples that should have come up by a certain point (it is an integration of the probability distribution).
Normal would plot on a straight line. This is somewhat straight, but the slope at the center is wrong and the tails drift off. (More commentary on reading cumulative probability plots later.)
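A rough sketch of how the points on such a plot can be built: sort the residuals, give each one an observed cumulative probability from its rank, and compare that with the cumulative probability a normal distribution would give the same value. The (rank - 0.5)/n rule used here is one common convention and an assumption on my part; SPSS may use a slightly different formula. The residual values and function name are made up.

```python
from statistics import NormalDist, mean, stdev


def pp_points(residuals):
    """Return (observed, expected) cumulative probabilities for each residual."""
    ordered = sorted(residuals)
    n = len(ordered)
    # Normal distribution with the same mean and spread as the residuals.
    normal = NormalDist(mean(ordered), stdev(ordered))
    points = []
    for rank, value in enumerate(ordered, start=1):
        observed = (rank - 0.5) / n       # share of samples seen by this point
        expected = normal.cdf(value)      # share a normal curve would predict
        points.append((observed, expected))
    return points


# Made-up residuals for illustration.
residuals = [-6.5, -5.8, -1.2, -0.4, 0.1, 0.6, 0.9, 1.4, 2.0]

points = pp_points(residuals)
for observed, expected in points:
    print(f"observed {observed:.3f}   expected if normal {expected:.3f}")
```

If the residuals were normal, the two columns would track each other closely, which is the straight line on the plot.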
Look for Trends that have been systematically missed
This plot shows the residual (amount we missed by) against the predicted value.
If there is a trend in the points it may tell us what we missed.
In this case it is pretty scattered.
Missing Trends
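We are still missing something because there is a definite trend in the residuals relative to the actual MPG.
We are missing a variable or factor (it might be linear).
Keep in mind that some upward trend is expected in this particular plot whenever R^2 is below 1, because each actual value contains its own residual, so look for a trend stronger than that.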
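A caution when reading a residual-versus-actual plot: for any least-squares line the residuals are uncorrelated with the predicted values, but they are always positively correlated with the actual values, with a correlation of the square root of (1 - R^2). The sketch below checks that on made-up points (not the class data); the function names are my own.

```python
import math


def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line Y = b0 + b1*X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up illustration values, not the class data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.5, 3.0, 6.5, 5.0, 8.0]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

r2 = pearson_r(xs, ys) ** 2
corr_with_predicted = pearson_r(residuals, predicted)
corr_with_actual = pearson_r(residuals, ys)
print(f"R^2 = {r2:.3f}")
print(f"residual vs predicted correlation = {corr_with_predicted:.3f}")
print(f"residual vs actual correlation    = {corr_with_actual:.3f}")
print(f"sqrt(1 - R^2)                     = {math.sqrt(1 - r2):.3f}")
```

With the R^2 of 0.393 from the gas mileage fit, that built-in correlation would be about 0.78, so a trend against actual MPG needs to be judged against that baseline.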
Consider Another Data Set
We have an independent and dependent variable. (The data set could represent any problem we wished to model.)
Tell it to do a regression of the dependent against the independent variable.
Be sure we also ask for our residual plots.
Go to Results
The R^2 value is 0.996. Darn, one is a straight line. How much closer do you want to be?
This regression looks like it fits like a glove. The Mean Square for regression is 5 orders of magnitude greater than the MS for error.
The F statistic blows the null hypothesis off the map.
No Chance the Slope or Constant are Zero
There is some evidence the distribution of residuals is a little skewed.
The residual distribution is definitely skewed off to one side.
Oh Boy – Can You See the Trend we missed here?
Here the residuals follow a clear and unmistakable shape of an effect we missed.
This Thing Has a Second Order or Curved Effect
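The same situation can be reproduced with made-up numbers. The sketch below builds data from a straight line plus a small squared term (my own invented data, not the data set on the slides), fits a straight line, and prints the sign of each residual in order of X. R^2 comes out very close to 1, yet the residuals are positive at both ends and negative in the middle, which is the curved shape a straight line cannot pick up.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line Y = b0 + b1*X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


# Made-up curved data: a straight line plus a small squared term.
xs = [float(x) for x in range(21)]
ys = [2.0 + 3.0 * x + 0.05 * x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

mean_y = sum(ys) / len(ys)
r2 = 1.0 - sum(r ** 2 for r in residuals) / sum((y - mean_y) ** 2 for y in ys)

# Crude text plot: the sign of the residual at each X, left to right.
pattern = "".join("+" if r > 0 else "-" for r in residuals)
print(f"R^2 = {r2:.4f}")
print(pattern)
```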
OK – Now What Do I Do?




Linear regression rapidly and quantitatively fits a simple linear function of one variable to another.
We noted that there had to be other effects present on the gas mileage, but simple linear regression only handles one independent variable.
We also noted that sometimes there are second or higher order effects of a variable present, and a straight line just doesn’t fit that
We may want to have some more powerful tools to fall back on (we just try the easy stuff first).