Single Variable Regression

Which Approach Is Appropriate When?
• Choosing the right method for the data is the key statistical expertise that you need to have.

Do I Need to Know the Formulas?
• You do not need to know the exact formulas.
• You do need to understand the concepts behind them and the general statistical concepts embedded in their use.
• You do not need to be able to do correlation and regression by hand.
• You must be able to do it on a computer using Excel.

Table of Contents
• Objectives
• Purpose of Regression
• Correlation or Regression?
• First Order Linear Model
• Probabilistic Linear Relationship
• Estimating Regression Parameters
• Assumptions
• Sum of Squares
• Tests
• Percent of Variation Explained
• Example
• Regression Analysis in Excel
• Normal Probability Plot
• Residual Plot
• Goodness of Fit
• ANOVA For Regression

Objectives
• To learn the assumptions behind and the interpretation of single and multiple variable regression.
• To use Excel to calculate regressions and test hypotheses.

Purpose of Regression
• To determine whether values of one or more variables are related to the response variable.
• To predict the value of one variable based on the value of one or more other variables.
• To test hypotheses.

Correlation or Regression?
• Use correlation if you are interested only in whether a relationship exists.
• Use regression if you are interested in building a mathematical model that can predict the response variable.
• Use regression if you are interested in the relative effectiveness of several variables in predicting the response variable.

First Order Linear Model
• A deterministic mathematical model between y and x: y = β0 + β1 * x
• β0 is the intercept with the y axis, the value of y at x = 0.
• β1 is the slope of the line, the ratio of rise to run in the figure. It measures the change in y for one unit of change in x.
[Figure: a straight line with rise and run marked; independent variable x on the horizontal axis, dependent variable y on the vertical axis.]

Probabilistic Linear Relationship
• The relationship between x and y is not always exact. Observations do not always fall on a straight line.
• To accommodate this, we introduce a random error term, referred to as epsilon (ε): y = β0 + β1 * x + ε
• The task of regression analysis is then to estimate the parameters b0 and b1 in the equation ŷ = b0 + b1 * x so that the difference between y and ŷ is minimized.

Estimating Regression Parameters
• Red dots show the observations.
• The solid line shows the estimated regression line.
• The distance between each observation and the solid line is called the residual.
• Minimize the sum of the squared residuals (differences between line and observations).
[Figure: scatter plot of Y against X with the fitted regression line and one residual marked.]
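
The least-squares idea above can be sketched in a few lines of Python. This is only an illustration, since the deck itself relies on Excel. The data are the eight waiting-time and satisfaction observations from the Example slide, with the pairing of values read from that slide. The function name `fit_line` and the variable names are assumptions made for this sketch.

```python
# Closed-form least-squares estimates for the line y-hat = b0 + b1 * x.
# Data: the eight observations from the deck's Example slide.
waiting_time = [9, 7, 5, 6, 8, 5, 7, 8]            # x, in minutes
satisfaction = [80, 90, 90, 100, 85, 100, 85, 75]  # y, rating


def fit_line(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-products of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx             # slope: change in y per unit change in x
    b0 = mean_y - b1 * mean_x  # intercept: the line passes through the means
    return b0, b1


b0, b1 = fit_line(waiting_time, satisfaction)

# Residual = observed value - predicted value, one per observation.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(waiting_time, satisfaction)]

print(f"Satisfaction = {b0:.2f} + ({b1:.2f}) * Waiting time")
print(f"Sum of squared residuals = {sum(e ** 2 for e in residuals):.2f}")
```

For these data the estimates come out at roughly b0 = 121.34 and b1 = -4.83, the coefficients reported on the Linear Equation slide. With an intercept in the model the residuals sum to zero, apart from rounding.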

Assumptions
• The dependent (response) variable is measured on an interval scale.
• The probability distribution of the error is Normal with mean zero.
• The standard deviation of the error is constant and does not depend on values of x.
• The error term associated with any particular value of Y is independent of the error terms associated with other values of Y.

Sum of Squares
• Variation in y = SSR + SSE
• MSR divided by MSE is the test statistic for the ability of the regression to explain the data.

Source             Sum of squares of differences between        Degrees of freedom
Regression (SSR)   Predicted values and mean of observations    1
Error (SSE)        Predicted values and observations            n-2
Variation in Y     Observations and mean of observations        n-1

• The mean sum of squares is obtained by dividing SS by its degrees of freedom.

Tests
• The hypothesis that the regression equation does not explain variation in Y can be tested using the F test.
• The hypothesis that the coefficient for x is zero can be tested using the t statistic.
• The hypothesis that the intercept is zero can be tested using the t statistic.

Percent of Variation Explained
• R² is the coefficient of determination.
• The minimum R² is zero. The maximum is 1.
• 1 - R² is the variation left unexplained.
• If Y is not related to X, or is related in a non-linear fashion, then R² will tend to be small.
• Adjusted R² shows the value of R² after adjustment for degrees of freedom. It protects against an artificially high R² obtained by increasing the number of variables in the model.

Example
• Is waiting time related to satisfaction ratings?
• Predict what will happen to satisfaction ratings if waiting time reaches 15 minutes.

Patient   Waiting time   Satisfaction rating
1         9              80
2         7              90
3         5              90
4         6              100
5         8              85
6         5              100
7         7              85
8         8              75

Regression Analysis in Excel
• Select Tools.
• Select Data Analysis.
• Select Regression.
• Identify the x and y data of equal length.
• Ask for residual plots to test assumptions.
• Ask for a normal probability plot to test assumptions.

Normal Probability Plot
• The normal probability plot compares the percent of errors falling in particular bins to the percentage expected from the Normal distribution.
• If the assumption is met, the plot should look like a straight line.
[Figure: Normal Probability Plot, satisfaction ratings against sample percentile.]

Residual Plot
• The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual.
• Residual = Observed value - Predicted value
• The residual plot is used to check that residuals have a mean of zero and constant standard deviation.
• It is also used to check that residuals do not depend on values of x.
[Figure: Waiting time Residual Plot, residuals against waiting time.]

Residual Plot
• A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis.
• If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
• The chart plots the residual (e) against the independent variable (X).
• A random pattern in this chart indicates that a linear model provides a decent fit to the data.

Residual Plot
• The residual plots below show three typical patterns.
• The first plot shows a random pattern, indicating a good fit for a linear model.
• The other plot patterns are non-random (U-shaped and inverted U), suggesting a better fit for a non-linear model.
[Figure: three residual plots: random pattern; non-random, U-shaped; non-random, inverted U.]

Linear Equation
• Satisfaction = 121.3 - 4.8 * Waiting time
• At 15 minutes waiting time, satisfaction is predicted to be 121.34 - 4.83 * 15 ≈ 48.9 (48.87 with unrounded coefficients).
• The t statistics for both the intercept and the waiting time coefficient are statistically significant.
• The hypotheses that the coefficients are zero are rejected.

               Coefficients   Standard Error   t Stat   P-value
Intercept      121.34         10.48            11.58    0.00
Waiting time   -4.83          1.50             -3.23    0.02

Goodness of Fit
• About 64% of the variation in satisfaction ratings is explained by the equation (R Square = 0.635); about 36% is left unexplained.
• After adjustment for degrees of freedom, the explained share is 57% (Adjusted R Square = 0.574), leaving 43% unexplained.

Regression Statistics
Multiple R          0.796902768
R Square            0.635054022
Adjusted R Square   0.574229692
Standard Error      5.7674349
Observations        8

ANOVA For Regression
• The regression model has a mean sum of squares (MSR) of 347.
• The mean sum of squares for error (MSE) is 33. Note that the error term is called "Residual" in Excel.
• The F statistic is 10.44. The probability of observing an F this large if the regression explained nothing is 0.02.
• The hypothesis that MSR and MSE are equal is rejected. Significant variation is explained by the regression.

ANOVA
             df   SS       MS       F       Significance F
Regression   1    347.30   347.30   10.44   0.02
Residual     6    199.58   33.26
Total        7    546.88

Null Hypothesis
• The null hypothesis corresponds to a general or default position.
• For example, the null hypothesis might be that there is no relationship between two measured phenomena or that a potential treatment has no effect.
• It is important to understand that the null hypothesis can never be proven.
• A set of data can only reject a null hypothesis or fail to reject it.
• For example, if a comparison of two groups (e.g., treatment, no treatment) reveals no statistically significant difference between the two, it does not mean that there is no difference in reality.
• It only means that there is not enough evidence to reject the null hypothesis (in other words, the experiment fails to reject the null hypothesis).

What is a P value?
• "P" stands for probability.
• It is the probability of seeing a result at least as extreme as the one observed if the null hypothesis were true, so it measures the strength of the evidence against the null hypothesis (here, that our regression has no significance).
• Smaller P values indicate stronger evidence against the null hypothesis.
• By convention, P values below .05 are often accepted as "statistically significant", but this is an arbitrary cut-off.
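
The figures reported for the waiting-time example (sums of squares, F, R Square, Adjusted R Square, and the P value for the waiting time coefficient) can be checked outside Excel. The sketch below is an illustration under stated assumptions, not the deck's method. It uses only the Python standard library and obtains the P value by numerically integrating the t density with Simpson's rule, where a statistics package would normally be used. The helper `t_two_sided_p` and all variable names are hypothetical.

```python
import math

# Data from the deck's Example slide.
x = [9, 7, 5, 6, 8, 5, 7, 8]            # waiting time, minutes
y = [80, 90, 90, 100, 85, 100, 85, 75]  # satisfaction rating
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * xi for xi in x]

# Sums of squares: variation in y = SSR + SSE.
ssr = sum((p - mean_y) ** 2 for p in predicted)         # regression
sse = sum((yi - p) ** 2 for yi, p in zip(y, predicted))  # error (residual)
sst = sum((yi - mean_y) ** 2 for yi in y)                # total

df_reg, df_err = 1, n - 2
msr, mse = ssr / df_reg, sse / df_err
f_stat = msr / mse

r2 = ssr / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)  # one predictor

# t statistic for the slope; with one predictor, F equals t squared.
se_b1 = math.sqrt(mse / sxx)
t_b1 = b1 / se_b1


def t_two_sided_p(t, df, steps=2000):
    """Two-sided P value from Simpson integration of the t density on [0, |t|]."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = density(0) + density(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 1 - 2 * (total * h / 3)


p_value = t_two_sided_p(t_b1, df_err)

print(f"SSR={ssr:.2f}  SSE={sse:.2f}  SST={sst:.2f}")
print(f"F={f_stat:.2f}  R2={r2:.3f}  adjusted R2={adj_r2:.3f}")
print(f"t(slope)={t_b1:.2f}  P value={p_value:.3f}")
```

Because there is a single predictor, the F test and the t test on the slope are the same test, which is why Significance F and the waiting-time P value both appear as 0.02 in the Excel output.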