Regression Analysis Using JMP
Yiming Peng, Department of Statistics
February 12, 2013

Presentation and Data
http://www.lisa.stat.vt.edu
• Short Courses
• Regression Analysis Using JMP
• Download Data to Desktop

Presentation Outline
• Simple Linear Regression
• Multiple Linear Regression
• Questions/Comments

Presentation Outline (if time permits)
• Regression with Binary Response Variables
• Individual Goals/Interests

Simple Linear Regression
1. Definition
2. Scatterplot and Correlation
3. Model and Estimation
4. Coefficient of Determination (R2)
5. Assumptions
6. Caution
7. Example

Simple Linear Regression
• Simple Linear Regression (SLR) is used to study the relationship between a variable of interest and another variable.
• Both variables must be quantitative (continuous)
• Variable of interest known as Response or Dependent Variable (Y)
• Other variable known as Explanatory or Independent Variable (X)

Simple Linear Regression
• Objectives
• How is the response variable affected by changes in the explanatory variable? We would like a numerical description of how both variables vary together.
• Determine the significance of the explanatory variable in explaining the variability in the response (not necessarily causation).
• Predict values of the response variable for given values of the explanatory variable.

Simple Linear Regression
• Scatterplots are used to graphically examine the relationship between two quantitative variables.

Student   Beers   BAC
1         5       0.10
2         2       0.03
3         9       0.19
4         8       0.12
5         3       0.04
6         7       0.095
7         3       0.07
8         5       0.06
9         3       0.02
10        5       0.05
11        4       0.07
12        6       0.10
13        5       0.085
14        7       0.09
15        1       0.01
16        4       0.05

Simple Linear Regression
• After plotting two variables on a scatterplot, we describe the relationship by examining the form, direction, and strength of the association.
We look for an overall pattern …
• Form: linear, curved, clusters, no pattern
• Direction: positive, negative, no direction
• Strength: how closely the points fit the “form”
… and for deviations from that pattern that cause significant changes in the direction of the overall pattern.
• Outliers

Simple Linear Regression
[Figure: example scatterplots of Y with panels titled Positive Linear Relationship, Negative Linear Relationship, Non-Linear Relationship, and No Relationship]

Simple Linear Regression
• An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected).
• In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.

Simple Linear Regression
• Correlation
• Measures the direction and strength of a linear relationship between two quantitative variables.
• Pearson Correlation Coefficient
• Assumption of normality
• Calculation:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
• Spearman’s Rho and Kendall’s Tau are used for non-normal quantitative variables.
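As a rough illustration of the Pearson calculation, the sketch below computes r for the Beers/BAC data from the scatterplot slide as the sum of cross-deviations divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the choice of plain Python are assumptions made for this example; in the course itself the calculation is done in JMP.

```python
import math

# Beers/BAC data from the scatterplot slide, listed in student order 1..16.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(beers, bac)
print(f"r = {r:.3f}")  # roughly 0.89 for these data
```

Swapping the two arguments gives the same value, which is the sense in which r does not distinguish explanatory and response variables.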
Simple Linear Regression
• Properties of Pearson Correlation Coefficient
• -1 ≤ r ≤ 1
• Positive values of r: as one variable increases, the other increases
• Negative values of r: as one variable increases, the other decreases
• Values close to 0 indicate no linear relationship between the two variables
• Values close to +1 or -1 indicate strong linear relationships
• r doesn’t distinguish explanatory and response variables
• r has no unit
• Important note: Correlation does not imply causation

Simple Linear Regression
• Pearson Correlation Coefficient: General Guidelines
• 0 ≤ |r| < 0.2 : Very Weak linear relationship
• 0.2 ≤ |r| < 0.4 : Weak linear relationship
• 0.4 ≤ |r| < 0.6 : Moderate linear relationship
• 0.6 ≤ |r| < 0.8 : Strong linear relationship
• 0.8 ≤ |r| ≤ 1.0 : Very Strong linear relationship

Simple Linear Regression
• The Simple Linear Regression Model
• Basic Model: response = deterministic + stochastic
• Deterministic: model of the linear relationship between X and Y
• Stochastic: Variation, uncertainty, and miscellaneous factors
• Model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
yi = value of the response variable for the ith observation
xi = value of the explanatory variable for the ith observation
β0 = y-intercept
β1 = slope
εi = random error, iid Normal(0, σ2)

Simple Linear Regression
But which line best describes our data?

Simple Linear Regression
The least-squares regression line is the unique line for which the sum of the squared vertical (y) distances between the data points and the line is the smallest possible. For this line, the signed vertical distances also sum to zero.
The distances between the points and the line are squared so that all are positive values and positive and negative deviations do not cancel when added.
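To make the least-squares idea concrete, here is a minimal sketch that fits the line to the Beers/BAC data from the scatterplot slide using the standard closed-form estimates (slope = Sxy / Sxx, intercept = ȳ − slope · x̄). The helper names `fit_line` and `sse` are hypothetical and used only for this illustration; JMP reports the same quantities in its Fit Y by X output.

```python
# Beers/BAC data from the scatterplot slide, listed in student order 1..16.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared vertical distances between the points and a line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = fit_line(beers, bac)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(beers, bac)]

print(f"intercept = {b0:.4f}, slope = {b1:.5f}")
print(f"sum of residuals = {sum(residuals):.2e}")  # zero up to rounding error
print(f"SSE of fitted line  = {sse(beers, bac, b0, b1):.6f}")
print(f"SSE of shifted line = {sse(beers, bac, b0 + 0.01, b1):.6f}")  # larger
```

Any other line, such as the shifted one in the last print statement, gives a larger sum of squared distances than the fitted line.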
Simple Linear Regression
• Least Squares Estimation
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
• Predicted Values
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
• Residuals
$$e_i = y_i - \hat{y}_i$$
• The distinction between explanatory and response variables is essential in regression

Simple Linear Regression
• Interpretation of Parameters
• β0: Mean value of Y when X = 0
• β1: Change in the mean value of Y with an increase of 1 unit of X (also known as the slope of the line)
• Hypothesis Testing
• β0: Test whether the true y-intercept is different from 0
Null Hypothesis: β0 = 0
Alternative Hypothesis: β0 ≠ 0
• β1: Test whether the slope is different from 0
Null Hypothesis: β1 = 0
Alternative Hypothesis: β1 ≠ 0

Simple Linear Regression
• Analysis of Variance (ANOVA) for Simple Linear Regression

Source   Df     Sum of Squares   Mean Square       F Ratio        P-value
Model    1      SSR              MSR = SSR/1       F1 = MSR/MSE   P(F > F1), where F has an F(1, n-2) distribution
Error    n-2    SSE              MSE = SSE/(n-2)
Total    n-1    SST

Simple Linear Regression
• Coefficient of Determination (R2)
• Percent variation in the response variable (Y) that is explained by the least squares regression line
• 0 ≤ R2 ≤ 1
• Calculation:
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
• Prediction

Simple Linear Regression
• Assumptions of Simple Linear Regression
1. Independence
Residuals are independent of each other
Related to the method in which the data were collected, or to time-related data
Checked by plotting time collected vs. residuals
Parametric test: Durbin-Watson Test
2. Constant Variance
Variance of the residuals is constant
Checked by plotting predicted values vs. residuals
Parametric test: Brown-Forsythe Test

Simple Linear Regression
• Assumptions of Simple Linear Regression
3. Normality
Residuals are normally distributed
Checked by evaluating histograms and normal-quantile plots of residuals
Parametric test: Shapiro-Wilk Test

Simple Linear Regression
• Constant Variance: Plot of Fitted Values vs.
Residuals
[Figure: two plots of residuals against predicted values. One panel is titled "Good Residual Plot: No Pattern", the other "Bad Residual Plot: Variability Increasing".]

Simple Linear Regression
• Normality: Histogram and Q-Q Plot of Residuals
[Figure: two panels titled "Normal Assumption Appropriate" and "Normal Assumption Not Appropriate"]

Simple Linear Regression
• Some Remedies
• Non-Constant Variance: Weighted Least Squares
• Non-normality: Box-Cox Transformation
• Dependence: Auto-Regressive Models

Simple Linear Regression
Do not use a regression on inappropriate data. Use residual plots for help.
• Pattern in the residuals
• Presence of large outliers
• Clumped data falsely appearing linear
Recognize when the correlation/regression is performed on averages.
A relationship, however strong, does not itself imply causation.
Beware of lurking variables.

Simple Linear Regression
A lurking variable is a variable not included in the study design that does or may have an effect on the variables studied.
Lurking variables can falsely suggest a relationship.
What is the lurking variable here? Some are more obvious than others.
• Strong positive association between the number of firefighters at a fire site and the amount of damage a fire does.
• Negative association between moderate amounts of wine-drinking and death rates from heart disease in developed nations.

Simple Linear Regression
• Example Dataset: Fitness
• A researcher was interested in the relationship between oxygen uptake and a number of potential explanatory variables separately, including age, weight, running time, running pulse rate, rest pulse rate, and max pulse rate.
• Filename: Fitness0.jmp (JMP sample data)

Simple Linear Regression
• Questions/Comments about Simple Linear Regression

Multiple Linear Regression
1. Definition
2. Categorical Explanatory Variables
3. Model and Estimation
4. Adjusted Coefficient of Determination
5. Assumptions
6. Model Selection
7. Example

Multiple Linear Regression
• Explanatory Variables
• Two Types: Continuous and Categorical
• Continuous Predictor Variables
• Examples: Time, Grade Point Average, Test Score, etc.
• Coded with one parameter: β#x#
• Categorical Predictor Variables
• Examples: Sex, Political Affiliation, Marital Status, etc.
• Actual value assigned to a category is not important
• Ex) Sex: Male/Female, M/F, 1/2, 0/1, etc.
• Coded differently than continuous variables

Multiple Linear Regression
• Similar to simple linear regression, except now there is more than one explanatory variable, which may be continuous and/or categorical.
• Model
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \varepsilon_i$$
yi = value of the response variable for the ith observation
x#i = value of the explanatory variable # for the ith observation
β0 = y-intercept
β# = parameter corresponding to explanatory variable #
εi = random error, iid Normal(0, σ2)

Multiple Linear Regression
• Least Squares Estimation: the estimates are the parameter values that minimize the sum of squared residuals
• Predicted Values
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \dots + \hat{\beta}_p x_{pi}$$
• Residuals
$$e_i = y_i - \hat{y}_i$$

Multiple Linear Regression
• Interpretation of Parameters
• β0: Mean value of Y when every X# = 0
• β#: Change in the mean value of Y with an increase of 1 unit of X# in the presence of the other explanatory variables
• Hypothesis Testing
• β0: Test whether the true y-intercept is different from 0
Null Hypothesis: β0 = 0
Alternative Hypothesis: β0 ≠ 0
• β#: Test whether the change in Y with an increase of 1 unit in X# is different from 0 in the presence of the other explanatory variables.
Null Hypothesis: β# = 0
Alternative Hypothesis: β# ≠ 0

Multiple Linear Regression
• Adjusted Coefficient of Determination (ADJ R2)
• Percent variation in the response variable (Y) that is explained by the least squares regression line with explanatory variables x1, x2, …, xp
• Calculation of R2 and adjusted R2:
$$R^2 = 1 - \frac{SSE}{SST}, \qquad R^2_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}$$
• The R2 value never decreases as explanatory variables are added to the model
• The adjusted R2 introduces a penalty for the number of explanatory variables.

Multiple Linear Regression
• Other Model Evaluation Statistics
• Akaike Information Criterion (AIC or AICc)
• Schwarz Information Criterion (SIC)
• Bayesian Information Criterion (BIC)
• Mallows’ Cp
• Prediction Sum of Squares (PRESS)

Multiple Linear Regression
• Model Selection
• 2 Goals:
Complex enough to fit the data well
Simple to interpret, does not overfit the data
• Study the effect of each explanatory variable on the response Y
• Continuous Variable: Graph Y versus X
• Categorical Variable: Boxplot of Y for categories of X

Multiple Linear Regression
• Model Selection cont.
• Multicollinearity
• Correlations among explanatory variables that increase the variance of the coefficient estimates
• Reduces the apparent significance of the affected variables (larger standard errors, larger p-values)
• Can occur when several explanatory variables are used in the model
• Diagnostic
• Variance Inflation Factor (VIF)
• Correlation Matrix

Multiple Linear Regression
• Algorithmic Model Selection
• Backward Selection: Start with all explanatory variables in the model and remove those that are insignificant
• Forward Selection: Start with no explanatory variables in the model and add the best explanatory variables one at a time
• Stepwise Selection: Start with two forward selection steps, then alternate backward and forward selection steps until there are no variables to add or remove

Multiple Linear Regression
• Example Dataset: Fitness
• A researcher was interested in the relationship between oxygen uptake and a number of potential explanatory variables together, including age, weight, running time, running pulse rate, rest pulse rate, and max pulse rate.
• Filename: Fitness0.jmp (JMP sample data)

Multiple Linear Regression
• Other Multiple Linear Regression Issues
• Outliers
• Interaction Terms
• Higher Order Terms

Multiple Linear Regression
• Questions/Comments about Multiple Linear Regression

Regression with Non-Normal Response
• Logistic Regression with Binary Response

Logistic Regression
• Consider a binary response variable.
• Variable with two outcomes
• One outcome represented by a 1 and the other represented by a 0
• Examples:
Does the person have a disease? Yes or No
Who is the person voting for? Romney or Obama
Outcome of a baseball game? Win or loss

Logistic Regression
• Consider the linear probability model
$$y_i = \beta_0 + \beta_1 x_i$$
where
yi = response for observation i
xi = quantitative explanatory variable
Predicted values represent the probability of Y = 1 given X
• Issue: Predicted probabilities for some subjects can fall outside of the [0,1] range.
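The out-of-range issue with the linear probability model can be seen in a small sketch. The ten (x, y) pairs below are made up purely for illustration and are not from the course data; the helper `fit_line` is a hypothetical name for an ordinary least-squares fit of a 0/1 response on x.

```python
# Made-up illustration data (NOT from the course files): a 0/1 response
# that switches from 0 to 1 as x increases.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]

# Flag fitted "probabilities" that are not valid probabilities.
for xi, p in zip(x, predicted):
    flag = "" if 0.0 <= p <= 1.0 else "  <-- outside [0, 1]"
    print(f"x = {xi:2d}  predicted P(Y=1) = {p:6.3f}{flag}")
```

With these illustrative values the fitted line dips below 0 at the smallest x and rises above 1 at the largest x, which is the motivation for modelling the probability on a different scale.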
Logistic Regression
• Consider the logistic regression model
$$E[Y_i] = P(Y_i = 1 \mid x_i) = \pi(x_i) = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}$$
$$\rightarrow \ \operatorname{logit}[\pi(x_i)] = \log\left(\frac{\pi(x_i)}{1 - \pi(x_i)}\right) = \beta_0 + \beta_1 x_i$$
• Predicted values from the regression equation fall between 0 and 1

Logistic Regression
• Interpretation of Coefficient β: Odds Ratio
• The odds ratio is a statistic that compares the odds of one event to the odds of another event.
• Say the probability of Event 1 is π1 and the probability of Event 2 is π2. Then the odds ratio of Event 1 to Event 2 is:
$$\text{Odds Ratio} = \frac{\text{Odds}(\pi_1)}{\text{Odds}(\pi_2)} = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}$$
• Values of the odds ratio range from 0 to infinity
• Values between 0 and 1 indicate the odds of Event 2 are greater
• Values between 1 and infinity indicate the odds of Event 1 are greater
• A value equal to 1 indicates the two events have equal odds

Logistic Regression
• Example Dataset: APACHE II Score and Mortality in Sepsis
• The following figure shows 30-day mortality in a sample of septic patients as a function of their baseline APACHE II Score. Patients are coded as 1 or 0 depending on whether they are dead or alive at 30 days, respectively.
• We wish to predict death from baseline APACHE II score in these patients.
Filename: APACHE.jmp
Important Note: JMP models the probability of the 0 category

Thank you!