Linear Regression
Exploring relationships between two metric variables
Correlation
• The correlation coefficient measures the strength of a relationship between two variables
• The relationship involves our ability to estimate or predict one variable based on knowledge of the other
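A minimal sketch of computing this in R with cor(), assuming the Kalahari data frame that appears in the later slides:
# Pearson correlation between the two metric variables
cor(Kalahari$People, Kalahari$LMS)
# cor.test() adds a t-based test of whether the correlation differs from zero
cor.test(Kalahari$People, Kalahari$LMS)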
Linear Regression
• The process of fitting a straight line to a pair of variables
• The equation is of the form: y = a + bx
• x is the independent or explanatory variable
• y is the dependent or response variable
Linear Coefficients
• Given x and y, linear regression estimates values for a and b
• The coefficient a, the intercept, gives the value of y when x = 0
• The coefficient b, the slope, gives the amount that y increases (or decreases) for each one-unit increase in x
# Toy data that lie exactly on the line y = 1 + 2.5x
x <- 1:5
y <- 2.5*x + 1
plot(y~x, xlim=c(0, 5), ylim=c(0, 14),
     yaxp=c(0, 14, 14), las=1, pch=16)
abline(lm(y~x))                   # fitted least squares line
points(0, 1, pch=8)               # mark the y-intercept
points(mean(x), mean(y), cex=3)   # the line passes through the means
segments(c(1, 2), c(3.5, 3.5), c(2, 2), c(3.5, 6))  # rise/run triangle
text(c(1.5, 2.25), c(3, 4.75), c("1", "2.5"))
text(mean(x), mean(y), "x = mean(x), y = mean(y)",
     pos=4, offset=1)
text(0, 1, "y-intercept = 1", pos=4)
text(1.5, 5, "slope = 2.5/1 = 2.5", pos=2)
text(2, 12, "y = 1 + 2.5x", cex=1.5)
Least Squares
• Many lines could fit the data, depending on how we define the “best fit”
• Least squares regression minimizes the squared deviations between the y-values and the line (see the sketch below)
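To make “minimizes the squared deviations” concrete, here is a minimal sketch of the closed-form least squares solution, checked against lm() and using the toy data from the earlier plot:
x <- 1:5
y <- 2.5*x + 1
b <- cov(x, y)/var(x)      # slope that minimizes the squared deviations
a <- mean(y) - b*mean(x)   # fitted line passes through (mean(x), mean(y))
c(a, b)                    # matches coef(lm(y ~ x))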
lm()
• The function lm() performs least squares linear regression in R
• A formula specifies the Dependent/Response variable as predicted from the Independent/Explanatory variable
• The tilde (~) separates them: D~I or R~E
• Rcmdr: Statistics | Fit model | Linear regression
> RegModel.1 <- lm(LMS~People, data=Kalahari)
> summary(RegModel.1)

Call:
lm(formula = LMS ~ People, data = Kalahari)

Residuals:
    Min      1Q  Median      3Q     Max
-86.400 -24.657  -2.561  24.902  86.100

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -64.425     44.924  -1.434 0.175161
People        12.868      2.591   4.966 0.000258 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.88 on 13 degrees of freedom
Multiple R-squared: 0.6548,  Adjusted R-squared: 0.6282
F-statistic: 24.66 on 1 and 13 DF, p-value: 0.0002582
plot(LMS~People, data=Kalahari, pch=16, las=1)
RegModel.1 <- lm(LMS~People, data=Kalahari)
abline(RegModel.1)                # fitted regression line
segments(Kalahari$People, Kalahari$LMS, Kalahari$People,
         fitted(RegModel.1), lty=2)  # residuals as dashed segments
text(12, 250, paste("y = ",
     as.character(round(RegModel.1$coefficients[[1]], 2)),
     " + ",
     as.character(round(RegModel.1$coefficients[[2]], 2)),
     "x", sep=""), cex=1.25, pos=4)
Errors
• Linear regression assumes all errors are in the measurement of y
• There are also errors in the estimation of a (intercept) and b (slope)
• Significance for a and b is based on the t distribution
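A quick way to see these coefficient errors in R is confint(), which combines the standard errors with the t distribution to give confidence intervals for a and b (reusing RegModel.1 from the earlier slides):
confint(RegModel.1, level=0.95)  # t-based 95% intervals for intercept and slope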
Errors 2
• The errors in the intercept and slope can be combined to develop a confidence interval for the regression line
• We can also compute a prediction interval, which is the confidence we have in a single prediction
predict()
• predict() uses the results of a linear regression to predict values of the dependent/response variable
• It can also produce confidence and prediction intervals:
– predict(RegModel.1, data.frame(People = c(10, 20, 30)), interval="prediction")
RegModel.1 <- lm(LMS~People, data=Kalahari)
plot(LMS~People, data=Kalahari, pch=16, las=1)
xp <- seq(10, 25, .1)
# 95% confidence band for the regression line
yp <- predict(RegModel.1, data.frame(People=xp), interval="confidence")
matlines(xp, yp, lty=c(1, 2, 2), col="black")
# 95% prediction band for single new observations
yp <- predict(RegModel.1, data.frame(People=xp), interval="prediction")
matlines(xp, yp, lty=c(1, 3, 3), col="black")
legend("topleft", c("Confidence interval (95%)",
       "Prediction interval (95%)"), lty=c(2, 3))
Diagnostics
• Models | Graphs | Basic diagnostic plots
– Look for trend in residuals
– Look for change in residual variance
– Look for deviation from normally distributed residuals
– Look for influential data points
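Outside Rcmdr, the same basic diagnostic plots come from calling plot() on the model object; a minimal sketch:
par(mfrow=c(2, 2))  # show all four diagnostics on one device
plot(RegModel.1)    # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow=c(1, 1))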
Diagnostics 2
• influence(RegModel.1) returns
– Hat (leverage) coefficients
– Coefficient changes (leave one out)
– Sigma, residual changes (leave one out)
– wt.res, weighted residuals
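A sketch of pulling these components out of influence(), using the names documented for lm fits:
infl <- influence(RegModel.1)
infl$hat           # leverage for each observation
infl$coefficients  # change in a and b when each case is left out
infl$sigma         # residual standard error when each case is left out
infl$wt.res        # weighted residuals (ordinary residuals for an unweighted fit)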
Other Approaches
• rlm() in package MASS fits a robust line that is less influenced by outliers
• sma() in package smatr fits standardized major axis (aka reduced major axis) regression and major axis regression, which are used in allometry
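A hedged sketch of fitting these alternatives to the same data; rlm() requires package MASS, and the sma() arguments (including method="MA") follow the smatr documentation:
library(MASS)
Robust.1 <- rlm(LMS~People, data=Kalahari)              # robust fit, resistant to outliers
library(smatr)
SMA.1 <- sma(LMS~People, data=Kalahari)                 # standardized major axis
MA.1 <- sma(LMS~People, data=Kalahari, method="MA")     # major axis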