```
Regression Analysis Using JMP
Yiming Peng, Department of Statistics
February 12, 2013
Presentation and Data: http://www.lisa.stat.vt.edu (Short Courses: Regression Analysis Using JMP)
Presentation Outline
• Simple Linear Regression
• Multiple Linear Regression
Presentation Outline (if time permits)
• Regression with Binary Response Variables
• Individual Goals/Interests
Simple Linear Regression
1. Definition
2. Scatterplot and Correlation
3. Model and Estimation
4. Coefficient of Determination (R2)
5. Assumptions
6. Caution
7. Example
Simple Linear Regression
• Simple Linear Regression (SLR) is used to study the relationship
between a variable of interest and another variable.
• Both variables must be quantitative (continuous)
• Variable of interest known as Response or Dependent
Variable (Y)
• Other variable known as Explanatory or Independent Variable
(X)
Simple Linear Regression
• Objectives
• How is the response variable affected by changes in
explanatory variable? We would like a numerical description of
how both variables vary together.
• Determine the significance of the explanatory variable in
explaining the variability in the response (not necessarily
causation).
• Predict values of the response variable for given values of the
explanatory variable.
Simple Linear Regression
• Scatterplots are used to graphically examine the relationship
between two quantitative variables.
Student  Beers  BAC
1        5      0.1
2        2      0.03
3        9      0.19
6        7      0.095
7        3      0.07
9        3      0.02
11       4      0.07
13       5      0.085
4        8      0.12
5        3      0.04
8        5      0.06
10       5      0.05
12       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05
Simple Linear Regression
• After plotting two variables on a scatterplot, we describe the
relationship by examining the form, direction, and strength of
the association. We look for an overall pattern …
• Form: linear, curved, clusters, no pattern
• Direction: positive, negative, no direction
• Strength: how closely the points fit the “form”
• … and deviations from that pattern which cause significant
changes in the direction of the overall pattern.
• Outliers
Simple Linear Regression
[Figure: four scatterplot panels with Y on the vertical axis: Positive Linear Relationship, Negative Linear Relationship, Non-Linear Relationship, No Relationship]
Simple Linear Regression
• An outlier is a data value that has a very low probability of
occurrence (i.e., it is unusual or unexpected).
• In a scatterplot, outliers are points that fall outside of the overall
pattern of the relationship.
Simple Linear Regression
• Correlation
• Measures the direction and strength of a linear relationship
between two quantitative variables.
• Pearson Correlation Coefficient
• Assumption of normality
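• Calculation: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$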
• Spearman’s Rho and Kendall’s Tau are used for non-normal
quantitative variables.
Simple Linear Regression
• Properties of Pearson Correlation Coefficient
• -1 ≤ r ≤ 1
• Positive values of r: as one variable increases, the other
increases
• Negative values of r: as one variable increases, the other
decreases
• Values close to 0 indicate no linear relationship between the
two variables
• Values close to +1 or -1 indicate strong linear relationships
• r doesn’t distinguish explanatory and response variables
• r has no unit
• Important note: Correlation does not imply causation
Simple Linear Regression
• Pearson Correlation Coefficient: General Guidelines
• 0 ≤ |r| < 0.2 : Very Weak linear relationship
• 0.2 ≤ |r| < 0.4 : Weak linear relationship
• 0.4 ≤ |r| < 0.6 : Moderate linear relationship
• 0.6 ≤ |r| < 0.8 : Strong linear relationship
• 0.8 ≤ |r| ≤ 1.0 : Very Strong linear relationship
Simple Linear Regression
• The Simple Linear Regression Model
• Basic Model: response = deterministic + stochastic
• Deterministic: model of the linear relationship between X
and Y
• Stochastic: Variation, uncertainty, and miscellaneous
factors
• Model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
yi= value of the response variable for the ith observation
xi= value of the explanatory variable for the ith observation
β0= y-intercept
β1= slope
εi= random error, iid Normal(0,σ²)
Simple Linear Regression
But which line best
describes our data?
Simple Linear Regression
The least-squares regression line is the unique line such that the sum
of the signed vertical (y) deviations of the data points from the line is zero
and the sum of the squared vertical (y) distances between the data points and
the line is the smallest possible.
Distances between the points and the line are squared so that all are
non-negative. This keeps positive and negative deviations from canceling
when they are added.
Simple Linear Regression
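• Least Squares Estimation: $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
• Predicted Values: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
• Residuals: $e_i = y_i - \hat{y}_i$
• The distinction between explanatory and response variables is
essential in regression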
Simple Linear Regression
• Interpretation of Parameters
• β0: Mean value of Y when X=0
• β1: Change in the mean value of Y with an increase of 1 unit of X
(also known as the slope of the line)
• Hypothesis Testing
• β0- Test whether the true y-intercept is different from 0
Null Hypothesis: β0=0
Alternative Hypothesis: β0≠0
• β1- Test whether the slope is different from 0
Null Hypothesis: β1=0
Alternative Hypothesis: β1≠0
Simple Linear Regression
• Analysis of Variance (ANOVA) for Simple Linear Regression
Source   Df    Sum of Squares   Mean Square       F Ratio        P-value
Model    1     SSR              MSR = SSR/1       F1 = MSR/MSE   P(F > F1), where F ~ F(1, n-2)
Error    n-2   SSE              MSE = SSE/(n-2)
Total    n-1   SST
Simple Linear Regression
• Coefficient of Determination (R2)
• Proportion of variation in the response variable (Y) that is explained
by the least squares regression line
• 0 ≤ R2 ≤ 1
• Calculation: $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
• Prediction
Simple Linear Regression
• Assumptions of Simple Linear Regression
1. Independence
Residuals are independent of each other
Related to the method in which the data were collected or
time related data
Tested by plotting time collected vs. residuals
Parametric test: Durbin-Watson Test
2. Constant Variance
Variance of the residuals is constant
Tested by plotting predicted values vs. residuals
Parametric test: Brown-Forsythe Test
Simple Linear Regression
• Assumptions of Simple Linear Regression
3. Normality
Residuals are normally distributed
Tested by evaluating histograms and normal-quantile plots of
residuals
Parametric test: Shapiro-Wilk test
Simple Linear Regression
• Constant Variance: Plot of Fitted Values vs. Residuals
[Figure: two residual plots, each with Predicted Values on the horizontal axis; one is labeled "Good Residual Plot: No Pattern"]
Simple Linear Regression
• Normality: Histogram and Q-Q Plot of Residuals
[Figure: histograms and Q-Q plots of residuals for two cases: Normal Assumption Appropriate, Normal Assumption Not Appropriate]
Simple Linear Regression
• Some Remedies
• Non-Constant Variance: Weighted Least Squares
• Non-normality: Box-Cox Transformation
• Dependence: Auto-Regressive Models
Simple Linear Regression
• Do not use a regression on inappropriate data. Use residual plots for help.
  • Pattern in the residuals
  • Presence of large outliers
  • Clumped data falsely appearing linear
• Recognize when the correlation/regression is performed on averages.
• A relationship, however strong, does not itself imply causation.
• Beware of lurking variables.
Simple Linear Regression
A lurking variable is a variable not included in the study design that does
or may have an effect on the variables studied.
Lurking variables can falsely suggest a relationship.
What is the lurking variable here? Some are more obvious than others.
• Strong positive association between the number of firefighters at a fire
site and the amount of damage a fire does.
• Negative association between moderate amounts of wine-drinking and death
rates from heart disease in developed nations.
Simple Linear Regression
• Example Dataset: Fitness
• A researcher was interested in the relationship between oxygen
uptake and a number of potential explanatory variables separately,
including age, weight, running time, running pulse rate, rest pulse
rate, and max pulse rate.
• Filename: Fitness0.jmp (JMP sample data)
Multiple Linear Regression
1. Definition
2. Categorical Explanatory Variables
3. Model and Estimation
4. Assumptions
5. Model Selection
6. Example
Multiple Linear Regression
• Explanatory Variables
• Two Types: Continuous and Categorical
• Continuous Predictor Variables
• Examples – Time, Grade Point Average, Test Score, etc.
• Coded with one parameter – β#x#
• Categorical Predictor Variables
• Examples – Sex, Political Affiliation, Marital Status, etc.
• Actual value assigned to Category not important
• Ex) Sex - Male/Female, M/F, 1/2, 0/1, etc.
• Coded Differently than continuous variables
Multiple Linear Regression
• Similar to simple linear regression, except now there is more than
one explanatory variable, which may be continuous and/or
categorical.
• Model
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \varepsilon_i$
yi= value of the response variable for the ith observation
x#i= value of the explanatory variable # for the ith observation
β0= y-intercept
β#= parameter corresponding to explanatory variable #
εi= random error, iid Normal(0,σ²)
Multiple Linear Regression
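• Least Squares Estimation: choose $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$ to minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• Predicted Values: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \dots + \hat{\beta}_p x_{pi}$
• Residuals: $e_i = y_i - \hat{y}_i$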
Multiple Linear Regression
• Interpretation of Parameters
• β0: Mean value of Y when all explanatory variables equal 0
• β#: Change in the mean value of Y with an increase of 1 unit of X#
in the presence of the other explanatory variables (held fixed)
• Hypothesis Testing
• β0- Test whether the true y-intercept is different from 0
Null Hypothesis: β0=0
Alternative Hypothesis: β0≠0
• β#- Test of whether the change in Y with an increase
of 1 unit in X# is different from 0 in the presence of the
other explanatory variables.
Null Hypothesis: β#=0
Alternative Hypothesis: β#≠0
Multiple Linear Regression
• Coefficient of Determination (R2)
• Proportion of variation in the response variable (Y) that is explained
by the least squares regression line with explanatory variables
x1, x2,…,xp
• Calculation of R2: $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
• The R2 value never decreases as explanatory variables are added to
the model
• The adjusted R2 introduces a penalty for the number of
explanatory variables: $R^2_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}$
Multiple Linear Regression
• Other Model Evaluation Statistics
• Akaike Information Criterion (AIC or AICc)
• Schwarz Information Criterion (SIC), also known as the
Bayesian Information Criterion (BIC)
• Mallows’ Cp
• Prediction Sum of Squares (PRESS)
Multiple Linear Regression
• Model Selection
• 2 Goals: Complex enough to fit the data well
Simple to interpret, does not overfit the data
• Study the effect of each explanatory variable on the
response Y
• Continuous Variable – Graph Y versus X
• Categorical Variable - Boxplot of Y for categories of X
Multiple Linear Regression
• Model Selection cont.
• Multicollinearity
• Correlations among explanatory variables resulting in an
increase in the variance of the coefficient estimates
• Inflated standard errors make the affected variables appear less significant
• Can occur when several explanatory variables are used in the
model
• Diagnostic
• Variance Inflation Factor (VIF)
• Correlation Matrix
Multiple Linear Regression
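• Algorithmic Model Selection
• Backward Selection: start with all variables in the model and remove
those that are insignificant
• Forward Selection: start with no variables in the model and add
significant variables one at a time
• Stepwise Selection: start with forward selection, then alternate backward
and forward selection steps until there are no variables to add or remove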
Multiple Linear Regression
• Example Dataset: Fitness
• A researcher was interested in the relationship between
oxygen uptake and a number of potential explanatory
variables together, including age, weight, running time,
running pulse rate, rest pulse rate, and max pulse rate.
• Filename: Fitness0.jmp (JMP sample data)
Multiple Linear Regression
• Other Multiple Linear Regression Issues
• Outliers
• Interaction Terms
• Higher Order Terms
Regression with Non-Normal Response
• Logistic Regression with Binary Response
Logistic Regression
• Consider a binary response variable.
• Variable with two outcomes
• One outcome represented by a 1 and the other represented by a 0
• Examples:
Does the person have a disease? Yes or No
Who is the person voting for? Romney or Obama
Outcome of a baseball game? Win or loss
Logistic Regression
• Consider the linear probability model
yi = β0 + β1 xi
where
yi = response for observation i
xi = quantitative explanatory variable
Predicted values represent the probability of Y=1 given X
• Issue: Predicted probabilities for some subjects can fall outside of the
[0,1] range.
Logistic Regression
• Consider the logistic regression model
$E[Y_i] = P(Y_i = 1 \mid x_i) = \pi(x_i) = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}$
$\rightarrow \operatorname{logit}[\pi(x_i)] = \log\left(\frac{\pi(x_i)}{1 - \pi(x_i)}\right) = \beta_0 + \beta_1 x_i$
• Predicted values from the regression equation fall between 0 and 1
Logistic Regression
• Interpretation of Coefficient β – Odds Ratio
• The odds ratio is a statistic that compares the odds of one event
to the odds of another event.
• Say the probability of Event 1 is π1 and the probability of Event 2 is
π2. Then the odds ratio of Event 1 to Event 2 is:
$\text{Odds Ratio} = \frac{\text{Odds}(\pi_1)}{\text{Odds}(\pi_2)} = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}$
• In the logistic regression model, exp(β1) is the odds ratio for an increase of 1 unit in x
• Values of the odds ratio range from 0 to infinity
• A value between 0 and 1 indicates the odds of Event 2 are greater
• A value between 1 and infinity indicates the odds of Event 1 are greater
• A value equal to 1 indicates the odds of the two events are equal
Logistic Regression
• Example Dataset: APACHE II Score and Mortality in Sepsis
• The following figure shows 30 day mortality in a sample of septic patients
as a function of their baseline APACHE II Score. Patients are coded as 1
or 0 depending on whether they are dead or alive in 30 days,
respectively.
• We wish to predict death from baseline APACHE II score in these
patients.
• Filename: APACHE.jmp
Important Note: JMP models the probability of the 0 category
Thank you!
```
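The least-squares, correlation, and ANOVA quantities from the Simple Linear Regression slides can be tried on the Beers/BAC table outside of JMP. What follows is a minimal Python sketch, not the JMP workflow used in the course. The function name `simple_linear_regression` and the dictionary keys are illustrative choices, and the data are the 16 (Beers, BAC) pairs transcribed from the table in the transcript.

```python
import math

# (Beers, BAC) pairs transcribed from the table in the slides, in slide order.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.1, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.1, 0.09, 0.01, 0.05]


def simple_linear_regression(x, y):
    """Fit y = b0 + b1*x by least squares and return the usual summaries."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares (SST)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    sse = sum(e ** 2 for e in residuals)  # error sum of squares
    ssr = syy - sse                       # model sum of squares
    return {
        "slope": slope,
        "intercept": intercept,
        "r": sxy / math.sqrt(sxx * syy),          # Pearson correlation
        "r_squared": ssr / syy,                   # coefficient of determination
        "f_ratio": (ssr / 1) / (sse / (n - 2)),   # MSR / MSE
        "residuals": residuals,
    }


fit = simple_linear_regression(beers, bac)
print(f"slope = {fit['slope']:.4f}, intercept = {fit['intercept']:.4f}")
print(f"r = {fit['r']:.3f}, R^2 = {fit['r_squared']:.3f}, F = {fit['f_ratio']:.1f}")
```

In simple linear regression the R2 value equals the square of r, and the residuals sum to zero up to rounding error, so both make convenient sanity checks on a hand-rolled fit.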
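The Multiple Linear Regression slides mention the adjusted R2 penalty and the Variance Inflation Factor (VIF) without showing the arithmetic. The sketch below uses the standard textbook formulas, which are not taken from the slides: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), and VIF = 1/(1 - Rj2), where Rj2 comes from regressing explanatory variable j on the other explanatory variables. The function names and all input numbers are hypothetical and chosen only for illustration.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for a model with p explanatory variables and n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)


def variance_inflation_factor(r_squared_j):
    """VIF for variable j, given the R^2 from regressing it on the other predictors."""
    return 1 / (1 - r_squared_j)


# Hypothetical illustration: adding a second, nearly useless predictor raises
# R^2 from 0.800 to 0.801, but the adjusted R^2 goes down.
one_predictor = adjusted_r_squared(0.800, n=16, p=1)
two_predictors = adjusted_r_squared(0.801, n=16, p=2)
print(f"adjusted R^2: {one_predictor:.4f} -> {two_predictors:.4f}")

# Hypothetical illustration: a predictor that is 90% explained by the others.
vif = variance_inflation_factor(0.90)
print(f"VIF = {vif:.1f}")
```

A drop in adjusted R2 after adding a variable suggests the variable is not earning its place in the model, and a large VIF flags a predictor whose coefficient variance is inflated by multicollinearity.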
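The logistic regression slides define π(x), the logit, and the odds ratio. The following Python sketch checks how those pieces fit together. The coefficient values are hypothetical (they are not estimates from APACHE.jmp or any other dataset), and the function names are illustrative.

```python
import math


def logistic_probability(beta0, beta1, x):
    """pi(x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1 + math.exp(eta))


def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


def logit(p):
    """Log-odds of an event with probability p."""
    return math.log(odds(p))


def odds_ratio(p1, p2):
    """Odds of Event 1 relative to the odds of Event 2."""
    return odds(p1) / odds(p2)


# Hypothetical coefficients, for illustration only.
beta0, beta1 = -3.0, 0.15

p_at_20 = logistic_probability(beta0, beta1, 20)
p_at_21 = logistic_probability(beta0, beta1, 21)
p_at_60 = logistic_probability(beta0, beta1, 60)

# A hypothetical linear probability model can leave the [0, 1] range.
linear_at_60 = -0.1 + 0.02 * 60

print(f"pi(20) = {p_at_20:.3f}, pi(21) = {p_at_21:.3f}, pi(60) = {p_at_60:.4f}")
print(f"odds ratio for +1 unit of x = {odds_ratio(p_at_21, p_at_20):.4f}")
print(f"exp(beta1)                  = {math.exp(beta1):.4f}")
print(f"linear probability at x=60  = {linear_at_60:.2f}")
```

The odds ratio for a one-unit increase in x matches exp(β1) regardless of the starting value of x, which is why the coefficient is usually reported on the odds-ratio scale. When reading JMP output, recall the note in the slides that JMP models the probability of the 0 category.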