Lecture 7: Model Building with Multiple Regression (Nov 3)

Advanced Statistical Methods: Continuous Variables
http://statisticalmethods.wordpress.com
Model Building with Multiple Regression
[email protected]
Selecting appropriate variables;
Data Screening:
- frequencies/descriptives;
- correlations: are they ‘real’?
- missing data: what pattern? how much? why?
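A first screening pass along these lines, sketched in Python with pandas (the file name and variable names are hypothetical; the course labs use SPSS):

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical data file

# Frequencies / descriptives for every variable
print(df.describe(include="all"))

# Correlations among continuous variables (hypothetical names)
print(df[["income", "attitude", "age"]].corr())

# Missing data: fraction missing per variable
print(df.isna().mean().sort_values(ascending=False))
```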
Missing Data – Pattern
- missing at random
- non-random: affects generalizability of results
Test:
E.g.: construct a dummy variable: 1 = cases missing on income; 0 = nonmissing; then test mean differences in attitude between the two groups (a sketch follows below).
If the test is non-significant → the decision on how to handle missing data is open (various options would work)
If significant → be careful with the decision on missing data
SPSS: Missing Value Analysis (do in Lab)
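Outside SPSS, the same dummy-variable check can be sketched in Python with scipy (`income` and `attitude` are hypothetical variable names):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey.csv")  # hypothetical data file

# Dummy variable: True = case missing on income; False = nonmissing
missing_income = df["income"].isna()

# Test of mean differences in attitude between the two groups
t, p = stats.ttest_ind(
    df.loc[missing_income, "attitude"].dropna(),
    df.loc[~missing_income, "attitude"].dropna(),
    equal_var=False,  # Welch's t-test; no equal-variance assumption
)
print(f"t = {t:.2f}, p = {p:.3f}")  # significant p -> missingness is related to attitude
```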
Missing Data – How many?
- 5% or less & at random & a large file → different ways of dealing with missing data give similar solutions;
- else, problematic
Dealing with Missing Data (M.D.)
Deleting M.D.
- default in most software packages;
- problematic if missing values are scattered through cases & variables
Imputation of Missing Values (estimating M.D.; see the sketch after this list)
- Mean substitution
- Regression
- Expectation maximization
- Multiple imputation
M.D. correlation matrix
Treating Missing as Data
Repeating Analyses with and without M.D.
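Two of the estimation options above, sketched with scikit-learn on toy data (all values hypothetical). `IterativeImputer` here stands in for regression-style imputation; with `sample_posterior=True` it can also serve as one step of multiple imputation:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

# Toy data with missing values scattered across cases & variables (hypothetical)
X = pd.DataFrame({
    "income": [42.0, np.nan, 55.0, 61.0, np.nan, 48.0],
    "age":    [30.0, 44.0, np.nan, 52.0, 38.0, 41.0],
    "educ":   [12.0, 16.0, 14.0, np.nan, 12.0, 18.0],
})

# Mean substitution: each missing value is replaced by the variable's mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Regression-style imputation: each variable is iteratively regressed on the others
X_reg = IterativeImputer(random_state=0).fit_transform(X)

print(X_mean)
print(X_reg)
```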
The Multiple Regression Model
Ŷ = a + b1X1 + b2X2 + ... + bkXk
- the best prediction of a DV from several continuous (or dummy) IVs;
- also allows for non-linear relationships, by redefining the IV(s): squaring, cubing, etc. of the original IV
Regression coefficients:
- minimize the sum of squared deviations between Ŷ and Y;
- optimize the correlation between Ŷ and Y for the data set.
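A minimal fit of this model in Python with statsmodels, on simulated data, to make the pieces concrete (the coefficient values are arbitrary):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=200), rng.normal(size=200)
Y = 1.0 + 0.5 * X1 - 0.8 * X2 + rng.normal(size=200)  # simulated data

X = sm.add_constant(np.column_stack([X1, X2]))  # the constant column gives the intercept a
model = sm.OLS(Y, X).fit()  # least squares: minimizes the sum of squared deviations of Y-hat from Y

print(model.params)    # estimates of a, b1, b2
print(model.rsquared)  # squared correlation between Y-hat and Y
```

A squared IV for a curvilinear relation is just another column, e.g. `np.column_stack([X1, X1**2, X2])`.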
Assumptions
1. Random sampling;
2. DV = continuous; IV(s) = continuous (or can be treated as such), or dummies;
3. Linear relationship between the DV & the IVs (but we can model non-linear relations); normally distributed characteristics of Y in the population;
4. Normality, linearity, and homoskedasticity between predicted DV scores (Ŷs) and the errors of prediction (residuals);
5. Independence of errors;
6. No large outliers
4. Assumptions of normality, linearity, and homoskedasticity between predicted DV scores (Ŷs) and the errors of prediction (residuals)
4.a. Multivariate Normality
- each variable & all linear combinations of the variables are normally distributed;
- if this assumption is met → the residuals of the analysis are normally distributed & independent
For grouped data, the assumption pertains to the sampling distribution of the means of variables;
→ Central Limit Theorem: with a sufficiently large sample size, sampling distributions are normally distributed regardless of the distribution of the variables
What to look for (in ungrouped data):
- is each variable normally distributed?
Shape of the distribution: skewness & kurtosis. Frequency histograms; expected normal probability plots; detrended expected normal probability plots
- are the relationships between pairs of variables (a) linear, and (b) homoskedastic (i.e. the variance of one variable is the same at all values of other variables)?
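A sketch of these checks in Python, on a deliberately skewed simulated variable: skewness and kurtosis statistics, a frequency histogram, and an expected normal probability plot:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(1).lognormal(size=300)  # deliberately skewed variable

print("skewness:", stats.skew(x), "kurtosis:", stats.kurtosis(x))

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(x, bins=30)                      # frequency histogram
axes[0].set_title("Histogram")
stats.probplot(x, dist="norm", plot=axes[1])  # expected normal probability plot
plt.show()
```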
Homoskedasticity
- for ungrouped data: the variability in scores on one continuous variable is ~ the same at all values of another continuous variable
- for grouped data: the variability in the DV is expected to be ~ the same at all levels of the grouping variable
Heteroskedasticity can be caused by:
- non-normality of one of the variables;
- one variable being related to some transformation of the other;
- greater error of measurement at some levels of an IV
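The slides do not name a formal test, but a common one is the Breusch-Pagan test in statsmodels; a sketch on simulated data whose error variance grows with the IV:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(1, 5, 300)
y = 2 + 0.5 * x + rng.normal(scale=x, size=300)  # error variance grows with x: heteroskedastic

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(res.resid, X)
print(f"Breusch-Pagan p = {lm_p:.4f}")  # a small p-value rejects homoskedasticity
```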
Residuals Scatter Plots, to check that:
4.a. Errors of prediction are normally distributed around each & every Ŷ
4.b. Residuals have a straight-line relationship with the Ŷs
- if there is a genuine curvilinear relation between an IV and the DV, include the square of the IV in the model
4.c. The variance of the residuals about the Ŷs is ~ the same for all predicted scores (the assumption of homoskedasticity)
- heteroskedasticity may occur when:
- some of the variables are skewed and others are not → may consider transforming the variable(s);
- one IV interacts with another variable that is not part of the equation
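A sketch of such a residuals plot on simulated data with a genuine curvilinear relation; fitting only the linear term leaves a curved band of residuals, the signal to add the squared IV:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1 + 2 * x + 0.5 * x**2 + rng.normal(size=200)  # genuinely curvilinear relation

res = sm.OLS(y, sm.add_constant(x)).fit()  # linear-only model, deliberately misspecified
plt.scatter(res.fittedvalues, res.resid)   # residuals against predicted scores (Y-hat)
plt.axhline(0, color="grey")
plt.xlabel("predicted (Y-hat)")
plt.ylabel("residual")
plt.show()  # a curved band here -> include x**2 in the model
```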
5. Errors of prediction are independent of one another
The Durbin-Watson statistic is a measure of the autocorrelation of errors over the sequence of cases; if significant, it indicates non-independence of errors
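Computing the statistic is one line in statsmodels; a sketch on simulated data (values near 2 suggest independent errors, values well below 2 suggest positive autocorrelation):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 1 + x + rng.normal(size=100)  # simulated data with independent errors

res = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(res.resid))  # ~2 expected here, since the errors were generated independently
```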