More Statistics tutorial at www.dumblittledoctor.com
Lecture notes on Regression & SAS example demonstration

Simple Linear Regression & Correlation (p. 214)

Regression & Correlation (p. 215)
• When two variables are measured on a single experimental unit, the resulting data are called bivariate data.
• You can describe each variable individually, and you can also explore the relationship between the two variables.
• For quantitative variables, one can employ methods of regression analysis.

Regression analysis is an area of statistics concerned with finding a model that describes the relationship that may exist between variables and with determining the validity of that relationship.

Examples
• Do housing prices vary according to distance to a major freeway?
• Does respiration rate vary with altitude?
• Is snowfall related to elevation and, if so, what kind of relationship is there between these two variables?

Speaking of snow, let's consider wind chill.

Example 10.1 (p. 214)
Suppose we are interested in determining the wind chill temperature. For those of us from regions where the winters are extremely cold (like North Dakota), we know that this temperature depends on variables such as the wind velocity (speed and direction), the actual temperature, relative humidity, etc.

Dependent (response) variable: wind chill temperature
Independent (regressor/predictor) variables: temp, wind velocity, relative humidity

Is the wind chill temp important?
• California: +85°
• Minneapolis: -23°
  Wind chill temp: -78°
What do you say to that?
• Pretty cold if you ask me!

Regression analysis allows us
• to represent the relationship between the variables.
• to examine how the variable of interest (wind chill), often called the dependent or response variable, is affected by one or more control or independent variables (wind speed, actual temperature, relative humidity).

It provides us with
  - a simplified view of the relationship between variables,
  - a way of fitting a model with our data, and
  - a means for evaluating the importance of the variables included in the model and the correctness of the model.

Correlation analysis will be used as a measure of the strength of the given relationship.

Note the following concepts:
• Quantitative variables may be classified according to types.
  Response variable: a variable whose changes are of interest to an experimenter.
  Explanatory variable: a variable that explains or causes changes in a response variable.
NOTE: We will generally denote the explanatory variable by x and the response variable by y.

To study the relationship between variables, one could use the following as guides:
• Start by preparing a graph (scatterplot).
• Examine the graph for an overall pattern and deviations from that pattern (check for outliers, etc.).
• Add numerical descriptive measures for additional information and support.

Scatterplots
• Plot the explanatory (independent) variable on the horizontal axis and the response variable on the vertical axis.
• Look for pattern: form, direction & strength of relationship.

Association
Positive association: large values of one variable correspond to large values of the other.
Negative association: large values of one variable correspond to small values of the other.

Scatterplot of Diving Reflex
EXAMPLE 10.3 (p. 215): Physicians have used the so-called diving reflex to reduce abnormally rapid heartbeats in humans by submerging the patient's face in cold water.
(The reflex, triggered by cold water temperatures, is an involuntary neural response that shuts off circulation to the skin, muscles, and internal organs and diverts extra oxygen-carrying blood to the heart, lungs, and brain.) A research physician conducted an experiment to investigate the effects of various cold water temperatures on the pulse rates of 10 children, with the following results: (See Lecture Notes)

[Scatterplot of the data] The data look reasonably linear, with redpr (reduction in pulse rate) decreasing as temp (water temperature) increases.

Correlation (p. 220)
If two variables are related in such a way that the value of one is indicative of the value of the other, we say the variables are correlated. The correlation coefficient, ρ, is a measure of the strength of the linear relationship between two variables; r denotes its sample estimate. See formulas on this page.

SOME NOTES (p. 221)
• The closer r is to ±1, the stronger the linear relationship.
• The closer r is to 0, the weaker the linear relationship.
• If r = ±1, the relationship is perfectly linear (all the points lie exactly on the line).
• r > 0 → as x increases, y increases (positive association).
• r < 0 → as x increases, y decreases (negative association).
• r = 0 → no linear association.

Your Task
• Read general guidelines.

PROC CORR (p. 223)
• Produces a correlation matrix that lists Pearson's correlation coefficients between all pairs of included variables.
• Produces descriptive statistics and the p-value for testing the population correlation coefficient ρ = 0 for each pair of variables.

GENERAL FORM

    proc corr data = dataset name options;
      by variables;
      var variables;
      with variables;
      partial variables;

See Lecture Notes for options.

SAS (p. 223)

    proc corr;
      var list of variables;

NOTE: If you do not specify a list of variables, SAS will report the correlation between all pairs of variables.
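The correlation section above points to formulas that are not reproduced in this text. As a reference sketch only, the Pearson sample correlation that PROC CORR reports can be written in the S notation these notes use in the regression sections. The sample means (x-bar, y-bar) and the sample size n are the usual conventions and are assumed here; they are not taken from the missing formula page.

```latex
\[
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
\]
\[
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}, \qquad -1 \le r \le 1
\]
```

Because the denominator is positive, r always has the same sign as S_xy, which is the sign of the association described in the notes above.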
EXAMPLE 10.11 (p. 224)
Refer to the previous example on diving reflex. Use SAS to find the correlation between reduction in pulse rate and cold water temperature.

We write the following SAS code:

    options nocenter nodate ps=55 ls=70 nonumber;

    /* Set up temporary SAS dataset named diving */
    data diving;
      input temp redpr @@;
    datalines;
    68 2 65 5 70 1 62 10 60 9
    55 13 58 10 65 3 69 4 63 6
    ;

    /* Use proc corr to obtain correlation
       noprob   suppress printing of p-value for testing rho = 0
       nosimple suppress printing of desc stat */
    proc corr noprob nosimple;
      var temp redpr;
    run;
    quit;

Output:

                temp      redpr
    temp           1   -0.94135
    redpr   -0.94135          1

NOTE: The correlation matrix is symmetric, with 1's along the main diagonal and the correlation along the other diagonal.
NOTE: Corr(X,X) = 1, e.g. Corr(temp,temp) = 1, and Corr(X,Y) = Corr(Y,X).

Value & Interpretation
r = -0.94135 → strong inverse linear relationship between reduction in pulse rate and cold water temperature.

SIMPLE LINEAR REGRESSION
GOAL: Find the equation of the line that best describes the linear relationship between the dependent variable and a single independent variable.
Simple ↔ single independent variable
Linear ↔ equation of a line (linear in the parameters)

Deterministic Model: $y = \beta_0 + \beta_1 x$
• Requires that all points lie exactly on the line
• Perfect linear relationship

Probabilistic Model: $y = \beta_0 + \beta_1 x + \varepsilon$
• Does NOT require that all points lie exactly on the line
• Allows for some error/deviation from the line

Methods of Least Squares
$\beta_0$ and $\beta_1$ are unknown parameters and need to be estimated.
Want to estimate them so that the errors are minimized.

In the probabilistic model, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ represents random error; the errors are independent.

For a particular value of x:
Vertical distance = (observed value of y) - (predicted value of y obtained from the estimated regression equation)

Want to estimate the slope and y-intercept in such a way that

$$\min SSE = \min \sum_{i=1}^{n} \varepsilon_i^2 = \min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

which gives

$$b = \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad \text{(estimate of slope)}$$

$$a = \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad \text{(estimate of y-intercept)}$$

Estimated regression equation (least squares regression equation):

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = a + bx$$

PROC REG in SAS (p. 230)
GENERAL FORMAT:

    proc reg data = dataset options;
      by variables;
      model dependent variable = independent variables / options;
      plot yvariable*xvariable symbol / options;
      output out = new dataset keywords = names;

**See Lecture Notes for options

EXAMPLE 10.15 (p. 232)
Refer to the previous example on diving reflex. Use SAS to find the estimated regression equation relating reduction in pulse rate and cold water temperature.

We add the following SAS code to our existing code, just before the run statement:

    proc reg;
      model redpr = temp;

REMEMBER: model dependent = independent;

Output:

    The REG Procedure
    Model: MODEL1
    Dependent Variable: redpr

    Analysis of Variance
                            Sum of        Mean
    Source        DF       Squares      Square    F Value    Pr > F
    Model          1     127.69347   127.69347      62.26    <.0001
    Error          8      16.40653     2.05082
    Corr Total     9     144.10000

    Root MSE          1.43207    R-Square    0.8861
    Dependent Mean    6.30000    Adj R-Sq    0.8719
    Coeff Var        22.73122

    Parameter Estimates
                       Parameter    Standard
    Variable    DF      Estimate       Error    t Value    Pr > |t|
    Intercept    1      55.29417     6.22552       8.88      <.0001
    temp         1      -0.77156     0.09778      -7.89      <.0001

→ $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 55.29417 - 0.77156x$

Suppose x = 61, then $\hat{y} = 55.29417 - 0.771562(61) \approx 8.23$

Suppose x = 150. Would you use this equation? NO

Suppose x = 34.
Would you use this equation? NO

THE LESSON: BE CAREFUL! This equation is NOT universally valid. The observed water temperatures range only from 55 to 70, so x = 150 and x = 34 lie far outside the data used to fit the line.

Evaluating Regression Equation (p. 236)
Once we have the regression, we need to evaluate its effectiveness:
• Correlation
• Coefficient of determination
• Test slope
• Validate assumptions

Coefficient of Determination, $R^2$ (p. 236)
• $0 \le R^2 \le 1$
• The closer $R^2$ gets to 1, the better the fit.
• Represents the proportion of variability in the dependent variable, y, that can be accounted for by the variability in the independent variable, x.
• In SLR, $R^2 = (\text{corr coeff})^2$.
• Reduction in SSE from using the regression equation to predict y as opposed to just using the sample mean.

$$R^2 = \frac{\text{regression sum of squares}}{\text{total sum of squares}} = \frac{\text{model sum of squares}}{\text{total sum of squares}} = \frac{SSR}{TSS} = \frac{SSM}{TSS}$$

From the PROC REG output above (Model: MODEL1, Dependent Variable: redpr): model sum of squares = 127.69347, corrected total sum of squares = 144.10000, R-Square = 0.8861.

$R^2$ = 0.8861 → 88.61% of the variability in reduction in pulse rate can be accounted for by the variability in cold water temperature.
OR: One can get an 88.61% reduction in the SSE by using the model to predict the dependent variable instead of just using the sample mean to predict the dependent variable.
NOTE: This means that approximately 11.39% of the sample variability in reduction in pulse rate cannot be accounted for by the current model.

CI & Tests of Hypothesis
What if slope = 0? You would have a horizontal line. Thus knowing x would not help predict y. So our regression equation would not be useful!
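One common way to judge whether the slope differs from 0 is the usual t statistic, which divides the slope estimate by its standard error. The sketch below uses the temp row of the Parameter Estimates table for the diving reflex data; the symbol SE for the standard error is notation assumed here, and the n - 2 = 8 degrees of freedom match the Error DF in the Analysis of Variance table.

```latex
\[
t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}
  = \frac{-0.77156}{0.09778} \approx -7.89,
\qquad \mathrm{df} = n - 2 = 8
\]
```

In simple linear regression the square of this statistic equals the F value in the Analysis of Variance table, which agrees with the printed 62.26 up to rounding.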
We can perform a test of hypothesis to determine whether the slope is 0.

CI & Tests of Hypothesis
EXAMPLE 10.20 (p. 237)
Refer to the diving reflex example. Test whether the slope is significantly different from 0. This is the usual t-test; the relevant row of the Parameter Estimates table in the PROC REG output above is

    Variable    DF      Estimate       Error    t Value    Pr > |t|
    temp         1      -0.77156     0.09778      -7.89      <.0001

EXAMPLE 10.20 Soln
1. $H_0: \beta_1 = 0$
2. $H_a: \beta_1 \ne 0$
3. p-value < 0.0001
4. Reject $H_0$ if p-value < α = 0.05
5. Since p-value < 0.0001 < 0.05 → reject $H_0$ → conclude the slope is significantly different from 0.

Confidence Intervals

$$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2} \sqrt{\frac{s^2}{S_{xx}}}$$

where $\hat{\beta}_1$ is the point estimate, $t_{\alpha/2,\,n-2}$ is the distribution point, and $\sqrt{s^2/S_{xx}}$ is the standard deviation of the point estimate.

Soft Drink Example (Handout)
A soft drink vendor, set up near a beach for the summer (clearly summer has not yet arrived in Riverside), was interested in examining the relationship between sales of soft drinks, y (in gallons per day), and the maximum temperature of the day, x. See Handout for data.

Write a SAS program to read in and print out the data.

    options ls=78 nocenter nodate ps=55 nonumber;

    /* Add titles */
    title1 'Statistics 157 Extra SLR Example';
    title2 'Winter 2008';
    title3 'Linda M. Penas';
    title4 'Question 1';

    /* Create temporary SAS dataset and enter data */
    data e1q1;
      input x y @@;
    datalines;
    90 7.3 95 8.5 101 10.1 95 9.3
    87 6.7 97 9.2 102 10.2 88 6.7
    88 7.1 99 9.9 101 9.9 83 10.2
    ;

    /* Print the data as a check */
    proc print;
    run;

Correlation Coeff for Example
Find and interpret the correlation between sales of soft drinks and maximum temp of the day. Add the following lines of code:

    /* Use proc corr to generate correlation information
       nosimple suppress printing of desc. statistics
       noprob   suppress printing of p-value for testing rho=0 */
    proc corr nosimple noprob;
      var x y;
    run;

Correlation Output

    The CORR Procedure
    2 Variables: x y
    Pearson Correlation Coefficients, N = 12

                 x          y
    x      1.00000    0.62180
    y      0.62180    1.00000

r = 0.62180 → moderate positive linear relationship between max temp and soft drink sales.

Regression
Find the estimated regression equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.

    /* Use proc reg to generate regression information
       model dependent = independent */
    proc reg;
      model y = x;
    run;

Regression Output

    Parameter Estimates
                       Parameter    Standard
    Variable    DF      Estimate       Error    t Value    Pr > |t|
    Intercept    1      -4.19781     5.17157      -0.81      0.4359
    x            1       0.13808     0.05500       2.51      0.0309

$\hat{y} = -4.19781 + 0.13808x$

Coefficient of Determination
Find and interpret the coefficient of determination.

    Root MSE          1.17396    R-Square    0.3866
    Dependent Mean    8.75833    Adj R-Sq    0.3253

$R^2$ = 0.3866 → 38.66% of the variability in soft drink sales can be accounted for by the variability in max temperature. Bad model!

Intro to Residual Analysis (p. 243)
For each $x_i$:
residuals = $e_i$ = observed errors = $y_i - \hat{y}_i$, i = 1, 2, ..., n, where
$y_i$ = observed value (in the data)
$\hat{y}_i$ = corresponding predicted or fitted value (calculated from equation).
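As a small worked illustration of the residual definition, the fitted soft drink equation can be applied to two of the data points. The arithmetic uses the rounded coefficients printed in the Parameter Estimates table, so the last digit may differ slightly from what SAS reports.

```latex
\[
x = 90,\; y = 7.3:\quad
\hat{y} = -4.19781 + 0.13808(90) \approx 8.229,
\qquad e = 7.3 - 8.229 \approx -0.929
\]
\[
x = 83,\; y = 10.2:\quad
\hat{y} = -4.19781 + 0.13808(83) \approx 7.263,
\qquad e = 10.2 - 7.263 \approx 2.937
\]
```

The second residual is about 2.5 times the Root MSE of 1.17396, which suggests that the point (83, 10.2) deserves a closer look.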
For a given value of x,
Residual = difference between what we observe in the data and what is predicted by the regression equation
         = amount the regression equation has not been able to explain
         = observed errors if the model is correct

We can examine the residuals through the use of various plots. Abnormalities would be indicated if
• The plot shows a fan shape. (indicates violation of common variance assumption)
• The plot shows a definite linear trend. (indicates the need for a linear term in the model)
• The plot shows a quadratic shape. (indicates the need for quadratic or cross-product terms in the model)

[Residual plot sketches: looks random; quadratic term needed; fanning out (non-constant variance)]

NOTE: It is often easier to examine the standardized or studentized residuals. We can interpret them similarly to z-scores:
2 < | std residual | < 3 → suspect outlier
| std residual | > 3 → extreme outlier
(Outlier = doesn't seem to fit with the rest of the data = seems out of place)

Examine to see if there are any suspect or extreme outliers:

    Obs     x        y       Fit    SE Fit   Residual   St Resid
      1    1.0    50.00    30.06    12.63      19.94       1.44
      2    2.0   110.00   101.03     8.03       8.97       0.53
      3    2.0    90.00   101.03     8.03     -11.03      -0.65
      4    3.0   150.00   163.86     6.45     -13.86      -0.79
      5    3.0   140.00   163.86     6.45     -23.86      -1.36
      6    3.0   180.00   163.86     6.45      16.14       0.92
      7    4.0   190.00   218.54     7.15     -28.54      -1.65
      8    6.0   310.00   303.47     8.47       6.53       0.39
      9    6.0   330.00   303.47     8.47      26.53       1.59
     10    7.0   340.00   333.73     8.16       6.27       0.37
     11    8.0   360.00   355.84     7.84       4.16       0.25
     12   10.0   380.00   375.62    12.54       4.38       0.32
     13   10.0   360.00   375.62    12.54     -15.62      -1.12

CONCLUSION
The plot shows no apparent pattern. Since 0 < | std res | < 2 for every observation → no suspect or extreme outliers either.

To get residuals and residual plots in SAS:
EXAMPLE 10.26: Diving Reflex Example

    proc reg;
      /* P = predicted values
         R = residuals
         Student = studentized residuals (act like z-scores)
         output out = datasetname */
      model redpr = temp / P R;
      output out = a P = pred R = Resid Student = stdres;
    run;

Residual Plot
Generate a residual plot of student (studentized) residuals versus predicted values.

    proc plot vpercent = 70 hpercent = 70;
      plot stdres*pred;

To get residuals and residual plots in SAS:
EXAMPLE: Soft Drink Example

    proc reg;
      /* P = predicted values
         R = residuals
         Student = studentized residuals (act like z-scores)
         output out = datasetname */
      model y = x / P R;
      output out = a P = pred R = Resid Student = stdres;
    run;

Residual Info

    The REG Procedure
    Model: MODEL1
    Dependent Variable: y

    Output Statistics
           Dep Var   Predicted      Std Error               Std Error    Student
    Obs          y       Value   Mean Predict   Residual     Residual   Residual
      1     7.3000      8.2290         0.3991    -0.9290        1.104     -0.841
      2     8.5000      8.9194         0.3449    -0.4194        1.122     -0.374
      3    10.1000      9.7479         0.5198     0.3521        1.053      0.335
      4     9.3000      8.9194         0.3449     0.3806        1.122      0.339
      5     6.7000      7.8148         0.5060    -1.1148        1.059     -1.052
      6     9.2000      9.1956         0.3810   0.004426        1.110    0.00399
      7    10.2000      9.8860         0.5626     0.3140        1.030      0.305
      8     6.7000      7.9529         0.4667    -1.2529        1.077     -1.163
      9     7.1000      7.9529         0.4667    -0.8529        1.077     -0.792
     10     9.9000      9.4717         0.4423     0.4283        1.087      0.394
     11     9.9000      9.7479         0.5198     0.1521        1.053      0.145
     12    10.2000      7.2625         0.6854     2.9375        0.953      3.082

Residual Plot
Generate a residual plot of student (studentized) residuals versus predicted values.
    proc plot vpercent = 70 hpercent = 70;
      plot stdres*pred;

PART 2
Generate new information with the outlier (83, 10.2) removed.

    title4 'Question 2';
    data e1q2;
      input x y @@;
    datalines;
    90 7.3 95 8.5 101 10.1 95 9.3
    87 6.7 97 9.2 102 10.2 88 6.7
    88 7.1 99 9.9 101 9.9
    ;
    proc print;
    proc corr nosimple noprob;
      var x y;
    /* Make sure you use different names for your residuals
       so you do not overwrite the old ones */
    proc reg;
      model y = x / P R;
      output out = b P = pred1 R = resid1 Student = stdres1;
    proc plot vpercent = 70 hpercent = 70;
      plot stdres1*pred1;
    run;

New Output

    Output Statistics
           Dep Var   Predicted      Std Error               Std Error    Student
    Obs          y       Value   Mean Predict   Residual     Residual   Residual
      1     7.3000      7.4515         0.1114    -0.1515        0.254     -0.597
      2     8.5000      8.6716         0.0835    -0.1716        0.264     -0.650
      3    10.1000     10.1358         0.1262    -0.0358        0.247     -0.145
      4     9.3000      8.6716         0.0835     0.6284        0.264      2.380
      5     6.7000      6.7194         0.1459    -0.0194        0.235    -0.0823
      6     9.2000      9.1597         0.0899     0.0403        0.262      0.154
      7    10.2000     10.3799         0.1380    -0.1799        0.240     -0.749
      8     6.7000      6.9634         0.1336    -0.2634        0.243     -1.086
      9     7.1000      6.9634         0.1336     0.1366        0.243      0.563
     10     9.9000      9.6478         0.1052     0.2522        0.256      0.985
     11     9.9000     10.1358         0.1262    -0.2358        0.247     -0.957

One should continue to remove the potential outliers and generate new models, residuals, etc. until reaching the final information on pages 6-7.

Normality of Residuals (add-on)
Normality test

    proc univariate normal;
      ods select TestsForNormality;
      var stdres;
    run;

Example

    The UNIVARIATE Procedure
    Variable: stdres (Studentized Residual)

    Tests for Normality
    Test                   --Statistic---     -----p Value------
    Shapiro-Wilk           W       0.897717   Pr < W       0.2068
    Kolmogorov-Smirnov     D       0.250982   Pr > D       0.0739
    Cramer-von Mises       W-Sq    0.106686   Pr > W-Sq    0.0818
    Anderson-Darling       A-Sq    0.577399   Pr > A-Sq    0.0989

Normality Test
1. H0: errors are normally distributed
2. Ha: errors are not normally distributed
3. TS: p-value = 0.2068
4. RR: Reject H0 if p-value < α = 0.05
5. Since p-value = 0.2068 is not < α = 0.05 → do not reject H0 → ok to assume errors are normally distributed

Some Relationships
$S_{xy} \le 0 \rightarrow \hat{\beta}_1 \le 0,\ r \le 0$
$S_{xy} \ge 0 \rightarrow \hat{\beta}_1 \ge 0,\ r \ge 0$
$S_{xy} = 0 \rightarrow \hat{\beta}_1 = 0,\ r = 0$

SOME MORE INFO
Total sum of squares = TSS = $S_{yy}$ = SSE (sum of squares of the error) + SSR (sum of squares due to regression model)
TSS is constant for a given set of data.
SSE and SSR vary depending on the model: change the model, and SSE and SSR may/will change (but their sum is always constant = TSS).

$$TSS = S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = S_{yy} - \hat{\beta}_1 S_{xy}$$

$$SSR = TSS - SSE$$
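A quick numerical check of these identities with the diving reflex data may help. The value S_xy = -165.5 is computed here directly from the ten (temp, redpr) pairs and is not printed in the SAS output; the slope and the total sum of squares are the values reported by PROC REG.

```latex
\[
TSS = S_{yy} = 144.10000
\]
\[
SSE = S_{yy} - \hat{\beta}_1 S_{xy}
    = 144.1 - (-0.771562)(-165.5)
    \approx 144.1 - 127.6935 = 16.4065
\]
\[
SSR = TSS - SSE \approx 127.6935,
\qquad
R^2 = \frac{SSR}{TSS} \approx \frac{127.6935}{144.1} \approx 0.8861
\]
```

These agree with the Error and Model sums of squares (16.40653 and 127.69347) and the R-Square of 0.8861 in the Analysis of Variance output, up to rounding.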