Lecture Ten

Lecture
• Part I: Regression
• Part II: Experimental Method

Outline: Regression
• The Assumptions of Least Squares
• The Pathologies of Least Squares
• Diagnostics for Least Squares

Assumptions
• The expected value of the error is zero: E[e(t)] = 0.
• The error is independent of the explanatory variable: E{e(t)[x(t) − Ex(t)]} = 0.
• The errors are independent of one another: E[e(i)e(j)] = 0 for i ≠ j.
• The variance is homoskedastic: E[e(i)]² = E[e(j)]².
• The error is normal with mean zero and variance σ².

18.4 Error Variable: Required Conditions
• The error e is a critical part of the regression model.
• Four requirements involving the distribution of e must be satisfied:
– The probability distribution of e is normal.
– The mean of e is zero: E(e) = 0.
– The standard deviation of e is σe for all values of x.
– The sets of errors associated with different values of y are all independent.

The Normality of e
[Figure: normal curves for y at x1, x2, and x3, with means m1 = b0 + b1x1, m2 = b0 + b1x2, m3 = b0 + b1x3. The standard deviation remains constant, but the mean value changes with x.]
From the first three assumptions we have: y is normally distributed with mean E(y) = b0 + b1x and a constant standard deviation σe.

Pathologies
• Cross-section data: the error variance is heteroskedastic. For example, it could vary with firm size. Consequence: not all of the available information is used efficiently, and better estimates of the standard errors of the regression parameters are possible.
• Time-series data: the errors are serially correlated, i.e., autocorrelated. Consequence: inefficiency.

Pathologies (Cont.)
• The explanatory variable is not independent of the error. Consequence: inconsistency, i.e., larger sample sizes do not lead to lower standard errors for the parameters, and the parameter estimates (slope, etc.) are biased.
• The error is not distributed normally. For example, there may be fat tails. Consequence: using the normal distribution may understate the true 95% confidence intervals.

Pathologies (Cont.)
• Multicollinearity: the independent variables may be highly correlated. As a consequence, they do not truly represent separate causal factors, but instead a common causal factor.

18.9 Regression Diagnostics - I
• The three conditions required for the validity of the regression analysis are:
– the error variable is normally distributed.
– the error variance is constant for all values of x.
– the errors are independent of each other.
• How can we diagnose violations of these conditions?

Residual Analysis
• Examining the residuals (or standardized residuals) helps detect violations of the required conditions.
• Example 18.2 – continued:
– Nonnormality.
• Use Excel to obtain the standardized residual histogram.
• Examine the histogram and look for a bell-shaped diagram with a mean close to zero.

Diagnostics (Cont.)
• Multicollinearity may be suspected if the t-statistics for the coefficients of the explanatory variables are not significant but the coefficient of determination is high. The correlation between the explanatory variables can then be calculated to see whether it is high.

Diagnostics
• Is the error normal? In EViews, the view menu in the regression window provides a histogram of the distribution of the estimated error, along with the coefficients of skewness and kurtosis and the Jarque-Bera statistic testing for normality.

Diagnostics (Cont.)
• To detect heteroskedasticity: if there are sufficient observations, plot the estimated errors against the fitted dependent variable.

Heteroscedasticity
• When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
• Diagnose heteroscedasticity by plotting the residuals against the predicted values ŷ.
[Plot: residuals vs. ŷ; the spread increases with ŷ.]

Homoscedasticity
• When the requirement of a constant variance is not violated, we have a condition of homoscedasticity.
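The lecture runs these diagnostics in Excel and EViews. As a rough, self-contained sketch (using simulated data rather than Example 18.2), the standardized residuals, the Jarque-Bera statistic JB = (n/6)[S² + (K − 3)²/4], and the Durbin-Watson autocorrelation statistic used later in the lecture can be hand-rolled in Python:

```python
import numpy as np

# Simulated illustration (not Example 18.2): fit a line by least squares
# and compute the residual diagnostics discussed in the lecture.
rng = np.random.default_rng(42)
n = 200
x = np.linspace(0.0, 10.0, n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.5, size=n)

b1, b0 = np.polyfit(x, y, 1)          # slope, intercept
e = y - (b0 + b1 * x)                 # residuals

# Standardized residuals: look for a bell shape with mean near zero.
z = (e - e.mean()) / e.std(ddof=1)

# Jarque-Bera from sample skewness S and kurtosis K.
d = e - e.mean()
m2 = (d**2).mean()
S = (d**3).mean() / m2**1.5
K = (d**4).mean() / m2**2
jb = n / 6.0 * (S**2 + (K - 3.0)**2 / 4.0)

# Durbin-Watson: near 2 means no autocorrelation, near 0 means positive
# autocorrelation (meaningful when the observations are ordered in time).
dw = (np.diff(e)**2).sum() / (e**2).sum()

print(f"mean residual {e.mean():.2e}")  # ~0 by construction of least squares
print(f"Jarque-Bera   {jb:.2f}")
print(f"Durbin-Watson {dw:.2f}")
```

With well-behaved simulated errors, JB should be small and DW should sit near 2; heavy-tailed errors inflate JB, and serially correlated errors pull DW toward 0.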
• Example 18.2 – continued:
[Plot: residuals (−1000 to 1000) against predicted price (13,500 to 16,000).]

Diagnostics (Cont.)
• Autocorrelation: the Durbin-Watson statistic is a scalar index of autocorrelation, with values near 2 indicating no autocorrelation and values near zero indicating autocorrelation. Examine the plot of the residuals in the view menu of the regression window in EViews.

Non-Independence of Error Variables
– A time series is constituted if the data were collected over time.
– Examining the residuals over time, no pattern should be observed if the errors are independent.
– When a pattern is detected, the errors are said to be autocorrelated.
– Autocorrelation can be detected by graphing the residuals against time.

Non-Independence of Error Variables
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Plot: residuals vs. time; note the runs of positive residuals, replaced by runs of negative residuals.]
[Plot: residuals vs. time; note the oscillating behavior of the residuals around zero.]

Fix-Ups
• The error is not distributed normally. For example, in a regression of personal income on explanatory variables, a transformation, such as regressing the natural logarithm of income on the explanatory variables, may make the error closer to normal.

Fix-Ups (Cont.)
• If the explanatory variable is not independent of the error, look for a substitute that is highly correlated with the dependent variable but is independent of the error. Such a variable is called an instrument.

Data Errors: May Lead to Outliers
• Typos may lead to outliers, and looking for outliers is a good way to check for serious typos.

Outliers
• An outlier is an observation that is unusually small or large.
• Several possibilities need to be investigated when an outlier is observed:
– There was an error in recording the value.
– The point does not belong in the sample.
– The observation is valid.
• Identify outliers from the scatter diagram.
• It is customary to suspect an observation is an outlier if its |standardized residual| > 2.
[Figure: scatter diagrams contrasting an outlier with an influential observation; some outliers may be very influential, and such an outlier causes a shift in the regression line.]

Procedure for Regression Diagnostics
• Develop a model that has a theoretical basis.
• Gather data for the two variables in the model.
• Draw the scatter diagram to determine whether a linear model appears to be appropriate.
• Determine the regression equation.
• Check the required conditions for the errors.
• Check for the existence of outliers and influential observations.
• Assess the model fit.
• If the model fits the data, use the regression equation.

Part II: Experimental Method

Outline
• Critique of Regression

Critique of Regression
• Samples of opportunity rather than random samples.
• Uncontrolled causal variables:
– omitted variables
– unmeasured variables
• Insufficient theory to properly specify the regression equation.

Experimental Method: Examples
• Deterrence
• Aspirin
• Miles per Gallon

Deterrence and the Death Penalty

Isaac Ehrlich Study of the Death Penalty: 1933-1969
Homicide Rate Per Capita
Control variables:
• probability of arrest
• probability of conviction given charged
• probability of execution given conviction
Causal variables:
• labor force participation rate
• unemployment rate
• percent of population aged 14-24 years
• permanent income
• trend

Long Swings in the Homicide Rate in the US: 1900-1980
[Chart. Source: Report to the Nation on Crime and Justice.]

Ehrlich Results: Elasticities of Homicide with Respect to Controls

Control                              Elasticity   Average Value of Control
Prob. of Arrest                      -1.6         0.90
Prob. of Conviction Given Charged    -0.5         0.43
Prob. of Execution Given Convicted   -0.04        0.026

Source: Isaac Ehrlich, "The Deterrent Effect of Capital Punishment"

Critique of Ehrlich by Death Penalty Opponents
• Time period used: 1933-1968, a period of declining probability of execution.
• Ehrlich did not include the probability of imprisonment given conviction as a control variable.
• The causal variables included are unconvincing as causes of homicide.

United States Bureau of Justice Statistics
http://www.ojp.usdoj.gov/bjs/

Experimental Method
• Police intervention in family violence
[Charts: United States Bureau of Justice Statistics]

Police Intervention with Experimental Controls
• A 911 call from a family member; the case is randomly assigned for "treatment."
• A police patrol responds and visits the household.
• The police calm down the family members.
• Based on the treatment randomly assigned, the police carry out the sanctions.

Why Is Treatment Assigned Randomly?
• To control for unknown causal factors.
• Assign known numbers of cases, for example equal numbers, to each treatment.
• With this procedure, there should be an even distribution of difficult cases in each treatment group.

[Flowchart: a 911 call (characteristics of the household participants unknown) is randomly assigned to "code blue" or "code gold." Code blue: a patrol responds, settles the household, and verbally warns the husband. Code gold: a patrol responds, settles the household, and takes the husband to jail for the night.]

Experimental Method: Clinical Trials
• Doctors volunteer.
• Randomly assigned to two groups.
• The treatment group takes an aspirin a day; the control group takes a placebo (sugar pill) a day.
• After 5 years, the 11,037 experimentals have 139 heart attacks (fatal and non-fatal): pE = 0.0126.
• After 5 years, the 11,034 controls have 239 heart attacks: pC = 0.0217.

Conclusions from the Clinical Trials
• Hypotheses: H0: pC = pE, or pC − pE = 0; Ha: (pC − pE) ≠ 0.
• Statistic: Z = [(p̂C − p̂E) − (pC − pE)] / SE(p̂C − p̂E)
• Var(p̂C − p̂E) = Var(p̂C) + Var(p̂E)
• Recall, from the variance for a proportion:
SE(p̂C − p̂E) = {[p̂C(1 − p̂C)]/nC + [p̂E(1 − p̂E)]/nE}^(1/2)
= {[0.0217(1 − 0.0217)]/11,034 + [0.0126(1 − 0.0126)]/11,037}^(1/2)
= 0.00175
• So z = (0.0217 − 0.0126)/0.00175 = 5.2.

Experimental Method
• Experimental Design: Paired Comparisons

Table 1: Miles Per Gallon for Brand A and Brand B

Cab                  Brand A   Brand B   Difference
1                    27.01     26.95       0.06
2                    20.00     20.44      -0.44
3                    23.41     25.05      -1.64
4                    25.22     26.32      -1.10
5                    30.11     29.56       0.55
6                    25.55     26.60      -1.05
7                    22.23     22.93      -0.70
8                    19.78     20.23      -0.45
9                    33.45     33.95      -0.50
10                   25.22     26.01      -0.79
Sample Mean          25.20     25.80      -0.60
Standard Deviation    4.27      4.10       0.61
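Both calculations in this part can be checked in a few lines of Python. The two-proportion z statistic below reproduces the aspirin-trial arithmetic; the paired-comparison t statistic for Table 1 is an added illustration (the slides stop at the table itself):

```python
import math

# Aspirin clinical trial: two-proportion z test of H0: pC = pE.
nC, xC = 11_034, 239   # controls: sample size and heart attacks
nE, xE = 11_037, 139   # experimentals (aspirin): sample size and heart attacks
pC, pE = xC / nC, xE / nE

se = math.sqrt(pC * (1 - pC) / nC + pE * (1 - pE) / nE)
z = (pC - pE) / se
print(f"pC = {pC:.4f}, pE = {pE:.4f}, SE = {se:.5f}, z = {z:.1f}")  # z = 5.2

# Table 1: per-cab mileage differences, Brand A minus Brand B.
diffs = [0.06, -0.44, -1.64, -1.10, 0.55, -1.05, -0.70, -0.45, -0.50, -0.79]
n = len(diffs)
dbar = sum(diffs) / n
sd = math.sqrt(sum((d - dbar) ** 2 for d in diffs) / (n - 1))
t = dbar / (sd / math.sqrt(n))     # paired-comparison t with n - 1 = 9 df
print(f"mean diff = {dbar:.2f}, sd = {sd:.2f}, t = {t:.2f}")
```

Pairing each cab with itself removes cab-to-cab variation (compare the standard deviation of the differences, 0.61, with roughly 4.1 to 4.3 for each brand alone), which is what makes the small mean difference detectable.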