SW388R7 Data Analysis & Computers II

Slide 1: Multiple Regression – Assumptions and Outliers
Multiple Regression and Assumptions; Multiple Regression and Outliers; Strategy for Solving Problems; Practice Problems

Slide 2: Multiple Regression and Assumptions
Multiple regression is most effective at identifying relationships between a dependent variable and a combination of independent variables when its underlying assumptions are satisfied: each of the metric variables is normally distributed, the relationships between metric variables are linear, and the relationship between metric and dichotomous variables is homoscedastic. Failing to satisfy the assumptions does not mean that our answer is wrong. It means that our solution may under-report the strength of the relationships.

Slide 3: Multiple Regression and Outliers
Outliers can distort the regression results. When an outlier is included in the analysis, it pulls the regression line towards itself. This can result in a solution that is more accurate for the outlier, but less accurate for all of the other cases in the data set. We will check for univariate outliers on the dependent variable and multivariate outliers on the independent variables.

Slide 4: Relationship between Assumptions and Outliers
The problems of satisfying assumptions and detecting outliers are intertwined. For example, if a case has a value on the dependent variable that is an outlier, it will affect the skew, and hence the normality, of the distribution. Removing an outlier may improve the distribution of a variable. Transforming a variable may reduce the likelihood that the value for a case will be characterized as an outlier.
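Slide 3's point, that a single extreme case pulls the fitted line toward itself, can be illustrated with a minimal pure-Python least-squares sketch. The data here are toy values, not cases from GSS2000.sav:

```python
# Ordinary least-squares slope and intercept for one predictor,
# computed from the usual closed-form sums.
def ols(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, my - slope * mx

clean = [(1, 1), (2, 2), (3, 3), (4, 4)]   # perfectly linear, slope 1
with_outlier = clean + [(5, 15)]           # one extreme case added

print(ols(clean))         # slope 1.0 for the clean data
print(ols(with_outlier))  # slope jumps to 3.0: the line is pulled toward the outlier
```

The fitted line for the contaminated data fits the outlier better at the cost of every other case, which is exactly the distortion the slides warn about.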
Slide 5: Order of Analysis Is Important
The order in which we check assumptions and detect outliers will affect our results, because we may get a different subset of cases in the final analysis. In order to maximize the number of cases available to the analysis, we will evaluate assumptions first. We will substitute any transformations of variables that enable us to satisfy the assumptions. We will use any transformed variables that are required in our analysis to detect outliers.

Slide 6: Strategy for Solving Problems
Our strategy for solving problems about violations of assumptions and outliers will include the following steps:
1. Run the type of regression specified in the problem statement on the variables, using the full data set.
2. Test the dependent variable for normality. If it does not satisfy the criteria for normality unless transformed, substitute the transformed variable in the remaining tests that call for the use of the dependent variable.
3. Test for normality, linearity, and homoscedasticity using the scripts. Decide which transformations should be used.
4. Substitute the transformations and run the regression entering all independent variables, saving studentized residuals and Mahalanobis distance scores. Compute the probabilities for D².
5. Remove the outliers (studentized residual greater than 3 or Mahalanobis D² with p <= 0.001), and run the regression with the method and variables specified in the problem.
6. Compare R² for the analysis using transformed variables and omitting outliers (step 5) to the R² obtained for the model using all data and the original variables (step 1).

Slide 7: Transforming Dependent Variables
We will use the following logic to transform variables. If the dependent variable is not normally distributed: try the log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria.
If no transformation satisfies the normality criteria, use the untransformed variable and add a caution for the violation of the assumption. If a transformation satisfies normality, use the transformed variable in the tests of the independent variables.

Slide 8: Transforming Independent Variables – 1
If an independent variable is normally distributed and linearly related to the dependent variable, use it as is. If an independent variable is normally distributed but not linearly related to the dependent variable: try the log, square root, square, and inverse transformations. Use the first transformed variable that satisfies the linearity criteria and does not violate the normality criteria. If no transformation satisfies the linearity criteria without violating the normality criteria, use the untransformed variable and add a caution for the violation of the assumption.

Slide 9: Transforming Independent Variables – 2
If an independent variable is linearly related to the dependent variable but not normally distributed: try the log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria and still has a significant correlation. If no transformation satisfies the normality criteria with a significant correlation, use the untransformed variable and add a caution for the violation of the assumption.

Slide 10: Transforming Independent Variables – 3
If an independent variable is not linearly related to the dependent variable and not normally distributed: try the log, square root, square, and inverse transformations. Use the first transformed variable that satisfies the normality criteria and has a significant correlation.
If no transformation satisfies the normality criteria with a significant correlation, use the untransformed variable and add a caution for the violation of the assumption.

Slide 11: Impact of Transformations and Omitting Outliers
We evaluate the regression assumptions and detect outliers with a view toward strengthening the relationship. This may not happen. The regression may be the same, weaker, or stronger. We cannot be certain of the impact until we run the regression again. In the end, we may opt not to exclude outliers and not to employ transformations; the analysis informs us of the consequences of doing either.

Slide 12: Notes
Whenever you start a new problem, make sure you have removed variables created for the previous analysis and have included all cases back into the data set. I have added the square transformation to the checkboxes for transformations in the normality script. Since this is an option for linearity, we need to be able to evaluate its impact on normality. If you change the options for output in pivot tables from labels to names, you will get an error message when you use the linearity script. To solve the problem, change the option for output in pivot tables back to labels.

Slide 13: Problem 1
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. The research question requires us to identify the best subset of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 14: Dissecting Problem 1 – 1
The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.01 as alpha for the regression analysis as well as for testing assumptions.
Slide 15: Dissecting Problem 1 – 2
The method for selecting variables is derived from the research question. In this problem we are asked to identify the best subset of predictors, so we do a stepwise multiple regression.

Slide 16: Dissecting Problem 1 – 3
The purpose of testing for assumptions and outliers is to identify a stronger model. The main question to be answered in this problem is whether or not the use of transformed variables to satisfy assumptions and the removal of outliers improves the overall relationship between the independent variables and the dependent variable, as measured by R². Specifically, the question asks whether or not the R² for a regression analysis after substituting transformed variables and eliminating outliers is 10.8% higher than a regression analysis using the original format for all variables and including all cases.
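The transformation logic of slides 7–10 can be sketched in pure Python. The transformation formulas match the labels the scripts use (LG10(1+X), SQRT(1+X), X**2, -1/(1+X)); the `is_normal` predicate is a stand-in for the skewness/kurtosis screen, and the pick-first loop is an illustration of the decision rule, not the SPSS script itself:

```python
import math

# Candidate transformations, in the deck's order of preference.
# The "+1" guards against log/sqrt/divide on zero values (EARNRS can be 0).
TRANSFORMS = [
    ("log",     lambda x: math.log10(1 + x)),
    ("sqrt",    lambda x: math.sqrt(1 + x)),
    ("square",  lambda x: x ** 2),
    ("inverse", lambda x: -1 / (1 + x)),
]

def pick_transform(values, is_normal):
    """Return the first transformation whose result passes the
    normality screen, or None if none does (in which case we use the
    untransformed variable and add a caution)."""
    for name, f in TRANSFORMS:
        transformed = [f(v) for v in values]
        if is_normal(transformed):
            return name, transformed
    return None

# Toy illustration: a screen that rejects everything, then one that accepts anything.
print(pick_transform([0, 1, 2], lambda vs: False))  # None -> keep original, add caution
name, _ = pick_transform([0, 1, 2], lambda vs: True)
print(name)                                         # "log" (first in the preference order)
```

For independent variables that fail linearity, the deck also tries the square; the same pick-first structure applies, with the linearity and correlation checks folded into the predicate.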
Slide 17: R² before Transformations or Removing Outliers
To start out, we run a stepwise multiple regression analysis with income98 as the dependent variable and sex, earnrs, and rincom98 as the independent variables. We select Stepwise as the method to select the best subset of predictors.

Slide 18: R² before Transformations or Removing Outliers
Prior to any transformations of variables to satisfy the assumptions of multiple regression, or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%. This is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.

Slide 19: R² before Transformations or Removing Outliers
For this particular question, we are not interested in the statistical significance of the overall relationship prior to transformations and removing outliers. In fact, it is possible that the relationship is not statistically significant due to variables that are not normal, relationships that are not linear, and the inclusion of outliers.

Slide 20: Normality of the Dependent Variable: Total Family Income
In evaluating assumptions, the first step is to examine the normality of the dependent variable. If it is not normally distributed, or cannot be normalized with a transformation, it can affect the relationships with all other variables. To test the normality of the dependent variable, run the script: NormalityAssumptionAndTransformations.SBS. First, move the dependent variable INCOME98 to the list box of variables to test. Second, click on the OK button to produce the output.
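The normality screen used throughout the deck (skewness and kurtosis both between -1.0 and +1.0) can be sketched with simple moment-based estimators. Note these are the plain sample moments, not the bias-adjusted estimators SPSS reports, so values will differ slightly for small samples:

```python
# Moment-based skewness (g1) and excess kurtosis (g2).
# SPSS reports adjusted versions of these; the plain moments are
# close enough to illustrate the -1.0 .. +1.0 screening rule.
def skew_kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

def passes_normality_screen(values):
    g1, g2 = skew_kurtosis(values)
    return -1.0 <= g1 <= 1.0 and -1.0 <= g2 <= 1.0

# A symmetric but flat sample: skewness 0.0, kurtosis about -1.3,
# so it fails the screen on kurtosis alone.
print(skew_kurtosis([1, 2, 3, 4, 5]))
print(passes_normality_screen([1, 2, 3, 4, 5]))  # False
```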
Slide 21: Normality of the Dependent Variable: Total Family Income
Descriptives for TOTAL FAMILY INCOME:

  Mean                        15.67  (Std. Error .349)
  95% CI for Mean             14.98 to 16.36
  5% Trimmed Mean             15.95
  Median                      17.00
  Variance                    27.951
  Std. Deviation              5.287
  Minimum / Maximum / Range   1 / 23 / 22
  Interquartile Range         8.00
  Skewness                    -.628  (Std. Error .161)
  Kurtosis                    -.248  (Std. Error .320)

The dependent variable "total family income" [income98] satisfies the criteria for a normal distribution. The skewness (-0.628) and kurtosis (-0.248) were both between -1.0 and +1.0. No transformation is necessary.

Slide 22: Linearity and Independent Variable: How Many in Family Earned Money
To evaluate the linearity of the relationship between number of earners and total family income, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS. First, move the dependent variable INCOME98 to the text box for the dependent variable. Second, move the independent variable, EARNRS, to the list box for independent variables. Third, click on the OK button to produce the output.

Slide 23: Linearity and Independent Variable: How Many in Family Earned Money
Pearson correlations with TOTAL FAMILY INCOME (all significant at the 0.01 level, 2-tailed; N = 228):

  HOW MANY IN FAMILY EARNED MONEY            .505
  Logarithm of EARNRS [LG10(1+EARNRS)]       .536
  Square of EARNRS [(EARNRS)**2]             .376
  Square Root of EARNRS [SQRT(1+EARNRS)]     .527
  Inverse of EARNRS [-1/(1+EARNRS)]          .526

The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the two variables was the statistical significance of the correlation coefficient (r = 0.505). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.

Slide 24: Normality of Independent Variable: How Many in Family Earned Money
After evaluating the dependent variable, we examine the normality of each metric variable and the linearity of its relationship with the dependent variable. To test the normality of the number of earners in the family, run the script: NormalityAssumptionAndTransformations.SBS. First, move the independent variable EARNRS to the list box of variables to test.
Second, click on the OK button to produce the output.

Slide 25: Normality of Independent Variable: How Many in Family Earned Money
Descriptives for HOW MANY IN FAMILY EARNED MONEY:

  Mean                        1.43  (Std. Error .061)
  95% CI for Mean             1.31 to 1.56
  5% Trimmed Mean             1.37
  Median                      1.00
  Variance                    1.015
  Std. Deviation              1.008
  Minimum / Maximum / Range   0 / 5 / 5
  Interquartile Range         1.00
  Skewness                    .742  (Std. Error .149)
  Kurtosis                    1.324  (Std. Error .296)

The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality. In evaluating normality, the skewness (0.742) was between -1.0 and +1.0, but the kurtosis (1.324) was outside the range from -1.0 to +1.0.

Slide 26: Normality of Independent Variable: How Many in Family Earned Money
The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98]. In evaluating normality, the skewness (-0.483) and kurtosis (-0.309) were both within the range of acceptable values from -1.0 to +1.0. The correlation coefficient for the transformed variable is 0.536. The square root transformation also has values of skewness and kurtosis in the acceptable range; however, by our order of preference for which transformation to use, the logarithm is preferred.

Slide 27: Transformation for How Many in Family Earned Money
The independent variable, how many in family earned money, had a linear relationship to the dependent variable, total family income. The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98].
We will substitute the logarithmic transformation of how many in family earned money in the regression analysis.

Slide 28: Normality of Independent Variable: Respondent’s Income
After evaluating the dependent variable, we examine the normality of each metric variable and the linearity of its relationship with the dependent variable. To test the normality of respondent’s income, run the script: NormalityAssumptionAndTransformations.SBS. First, move the independent variable RINCOM98 to the list box of variables to test. Second, click on the OK button to produce the output.

Slide 29: Normality of Independent Variable: Respondent’s Income
Descriptives for RESPONDENTS INCOME:

  Mean                        13.35  (Std. Error .419)
  95% CI for Mean             12.52 to 14.18
  5% Trimmed Mean             13.54
  Median                      15.00
  Variance                    29.535
  Std. Deviation              5.435
  Minimum / Maximum / Range   1 / 23 / 22
  Interquartile Range         8.00
  Skewness                    -.686  (Std. Error .187)
  Kurtosis                    -.253  (Std. Error .373)

The independent variable "income" [rincom98] satisfies the criteria for both the assumption of normality and the assumption of linearity with the dependent variable "total family income" [income98]. In evaluating normality, the skewness (-0.686) and kurtosis (-0.253) were both within the range of acceptable values from -1.0 to +1.0.

Slide 30: Linearity and Independent Variable: Respondent’s Income
To evaluate the linearity of the relationship between respondent’s income and total family income, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS. First, move the dependent variable INCOME98 to the text box for the dependent variable. Second, move the independent variable, RINCOM98, to the list box for independent variables. Third, click on the OK button to produce the output.
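The linearity evidence used on these slides is the statistical significance of Pearson's r. A minimal pure-Python sketch of the computation: the significance test is based on a t statistic with N − 2 degrees of freedom, and the toy assertion below plugs in the r = 0.505, N = 228 values the deck reports for [earnrs]:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), compared against
    Student's t with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# A perfectly linear toy relationship gives r = 1. The reported
# r = 0.505 with N = 228 yields t of about 8.8, far beyond any
# conventional critical value, hence the p < 0.001 on the slide.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(round(t_statistic(0.505, 228), 2))
```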
Slide 31: Linearity and Independent Variable: Respondent’s Income
Pearson correlations with TOTAL FAMILY INCOME (all significant at the 0.01 level, 2-tailed; N = 163):

  RESPONDENTS INCOME                             .577
  Logarithm of RINCOM98 [LG10(24-RINCOM98)]     -.595
  Square of RINCOM98 [(RINCOM98)**2]             .613
  Square Root of RINCOM98 [SQRT(24-RINCOM98)]   -.601
  Inverse of RINCOM98 [-1/(24-RINCOM98)]        -.434

The evidence of linearity in the relationship between the independent variable "income" [rincom98] and the dependent variable "total family income" [income98] was the statistical significance of the correlation coefficient (r = 0.577). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.

Slide 32: Homoscedasticity: Sex
To evaluate the homoscedasticity of the relationship between sex and total family income, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS. First, move the dependent variable INCOME98 to the text box for the dependent variable. Second, move the independent variable, SEX, to the list box for independent variables. Third, click on the OK button to produce the output.

Slide 33: Homoscedasticity: Sex
Based on the Levene Test, the variance in "total family income" [income98] is homogeneous for the categories of "sex" [sex]. The probability associated with the Levene Statistic (0.031) is greater than the level of significance (0.01), so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.

Slide 34: Adding a Transformed Variable
Even though we do not need a transformation for any of the variables in this analysis, we will demonstrate how to use a script, such as the normality script, to add a transformed variable to the data set, e.g. a logarithmic transformation for highest year of school. First, move the variable that we want to transform to the list box of variables to test. Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes. Third, clear the checkbox for Delete transformed variables from the data; this will save the transformed variable. Fourth, click on the OK button to produce the output.

Slide 35: The Transformed Variable in the Data Editor
If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set. Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.

Slide 36: The Regression to Identify Outliers
We use the regression procedure to identify both univariate and multivariate outliers.
We start with the same dialog we used for the last analysis, in which income98 was the dependent variable and sex, earnrs, and rincom98 were the independent variables. First, we substitute the logarithmic transformation of earnrs, logearn, into the list of independent variables. Second, we change the method of entry from Stepwise to Enter so that all variables will be included in the detection of outliers. Third, we want to save the calculated values of the outlier statistics to the data set. Click on the Save… button to specify what we want to save.

Slide 37: Saving the Measures of Outliers
First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set. Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables. Third, click on the OK button to complete the specifications.

Slide 38: The Variables for Identifying Outliers
The values for identifying univariate outliers on the dependent variable are in a column which SPSS has named sre_1. The values for identifying multivariate outliers on the independent variables are in a column which SPSS has named mah_1.

Slide 39: Computing the Probability for Mahalanobis D²
To compute the probability of D², we will use an SPSS function in a Compute command. First, select the Compute… command from the Transform menu.

Slide 40: Formula for Probability for Mahalanobis D²
First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score.
Second, complete the specifications for the CDF.CHISQ function: type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3. Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution. Third, click on the OK button to signal completion of the compute variable dialog.

Slide 41: Multivariate Outliers
Using the probabilities computed in p_mah_1 to identify outliers, scroll down through the list of cases to see if we can find cases with a probability less than 0.001. There are no outliers for the set of independent variables.

Slide 42: Univariate Outliers
Similarly, we can scroll down the values of sre_1, the studentized residuals, to find cases with a value larger than ±3.0. Based on these criteria, there are 4 cases that have a score on the dependent variable that is sufficiently unusual to be considered outliers (case 20000357: studentized residual = 3.08; case 20000416: studentized residual = 3.57; case 20001379: studentized residual = 3.27; case 20002702: studentized residual = -3.23).

Slide 43: Omitting the Outliers
To omit the outliers from the analysis, we select in the cases that are not outliers. First, select the Select Cases… command from the Data menu.

Slide 44: Specifying the Condition to Omit Outliers
First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases. Second, click on the If… button to specify the criteria for inclusion in the analysis.
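The SPSS expression described on slide 40 is 1 - CDF.CHISQ(mah_1, 3). For three independent variables (df = 3) the chi-square CDF has a closed form in terms of erf, so the probability can be sketched in pure Python. The check value below, D² = 16.266 giving p ≈ 0.001 at df = 3, is a standard chi-square table value:

```python
import math

def p_mahalanobis_df3(d2):
    """Upper-tail probability of a chi-square variate with df = 3,
    i.e. the Python equivalent of SPSS's 1 - CDF.CHISQ(d2, 3).
    For df = 3 the chi-square CDF reduces to:
    CDF(x) = erf(sqrt(x/2)) - sqrt(2*x/pi) * exp(-x/2)."""
    cdf = math.erf(math.sqrt(d2 / 2)) - math.sqrt(2 * d2 / math.pi) * math.exp(-d2 / 2)
    return 1 - cdf

# A case is flagged as a multivariate outlier when p <= 0.001,
# which for df = 3 corresponds to D² of roughly 16.27 or more.
print(round(p_mahalanobis_df3(16.266), 4))  # 0.001
print(p_mahalanobis_df3(2.0) > 0.001)       # True: an unremarkable D² is kept
```

For a different number of independent variables the closed form changes, which is why the SPSS function takes the degrees of freedom as its second argument.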
Slide 45: The Formula for Omitting Outliers
To eliminate the outliers, we request the cases that are not outliers. The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3 and the probability for Mahalanobis D² is higher than the level of significance, 0.001. After typing in the formula, click on the Continue button to close the dialog box.

Slide 46: Completing the Request for the Selection
To complete the request, we click on the OK button.

Slide 47: The Omitted Multivariate Outlier
SPSS identifies the excluded cases by drawing a slash mark through the case number. Most of the slashes are for cases with missing data, but we also see that the case with the low probability for Mahalanobis distance is included in those that will be omitted.

Slide 48: Running the Regression without Outliers
We run the regression again, excluding the outliers. Select the Regression | Linear command from the Analyze menu.

Slide 49: Opening the Save Options Dialog
We specify the dependent and independent variables, substituting any transformed variables required by assumptions. When we used regression to detect outliers, we entered all variables. Now we are testing the relationship specified in the problem, so we change the method to Stepwise. On our last run, we instructed SPSS to save studentized residuals and Mahalanobis distances. To prevent these values from being calculated again, click on the Save… button.

Slide 50: Clearing the Request to Save Outlier Data
First, clear the checkbox for Studentized residuals. Second, clear the checkbox for Mahalanobis distance. Third, click on the OK button to complete the specifications.
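The selection condition on slide 45 can be sketched as a filter over the two saved case statistics. The case records below are hypothetical illustrations, not values from GSS2000.sav:

```python
# Each case carries the two statistics saved by the regression:
# the studentized residual (sre_1) and the probability of
# Mahalanobis D² (p_mah_1).
def keep_case(sre_1, p_mah_1):
    """Python version of the SPSS selection formula:
    ABS(sre_1) < 3 & p_mah_1 > 0.001."""
    return abs(sre_1) < 3 and p_mah_1 > 0.001

cases = [
    ("A", 0.42, 0.60),    # typical case: kept
    ("B", 3.57, 0.45),    # univariate outlier on the DV: dropped
    ("C", -3.23, 0.30),   # a large negative residual also counts
    ("D", 1.10, 0.0004),  # multivariate outlier on the IVs: dropped
]
kept = [name for name, sre, p in cases if keep_case(sre, p)]
print(kept)  # ['A']
```

Note that the absolute value handles both tails of the studentized residuals, matching the ±3.0 criterion on slide 42.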
Slide 51: Opening the statistics options dialog
Once we have removed outliers, we need to check the sample size requirement for regression. Since we will need the descriptive statistics for this, click on the Statistics… button.

Slide 52: Requesting descriptive statistics
First, mark the checkbox for Descriptives. Second, click on the Continue button to complete the specifications.

Slide 53: Requesting the output
Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.

Slide 54: Sample size requirement
The minimum ratio of valid cases to independent variables for stepwise multiple regression is 5 to 1. After removing 4 outliers, there are 159 valid cases and 3 independent variables. The ratio of cases to independent variables for this analysis is 53.0 to 1, which satisfies the minimum requirement. In addition, the ratio of 53.0 to 1 satisfies the preferred stepwise ratio of 50 to 1.

Descriptive Statistics (N = 159 for every variable):
- TOTAL FAMILY INCOME: mean 17.09, std. deviation 4.073
- RESPONDENTS SEX: mean 1.55, std. deviation .499
- RESPONDENTS INCOME: mean 13.76, std. deviation 5.133
- Logarithm of EARNRS [LG10(1+EARNRS)]: mean .424896, std. deviation .1156559

Slide 55: Significance of regression relationship

ANOVA(d):
- Model 1: Regression SS 1122.398 (df 1, MS 1122.398); Residual SS 1499.187 (df 157, MS 9.549); Total SS 2621.585 (df 158); F = 117.541, Sig. .000(a)
- Model 2: Regression SS 1572.722 (df 2, MS 786.361); Residual SS 1048.863 (df 156, MS 6.723); Total SS 2621.585 (df 158); F = 116.957, Sig. .000(b)
- Model 3: Regression SS 1623.976 (df 3, MS 541.325); Residual SS 997.609 (df 155, MS 6.436); Total SS 2621.585 (df 158); F = 84.107, Sig. .000(c)
a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)], RESPONDENTS SEX
d. Dependent Variable: TOTAL FAMILY INCOME

The probability of the F statistic (84.107) for the regression relationship which includes these variables is <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that there is no relationship between the best subset of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the best subset of independent variables and the dependent variable.

Slide 56: Increase in proportion of variance

Model Summary:
- Model 1: R .654(a), R² .428, Adjusted R² .424, Std. Error of the Estimate 3.090
- Model 2: R .775(b), R² .600, Adjusted R² .595, Std. Error of the Estimate 2.593
- Model 3: R .787(c), R² .619, Adjusted R² .612, Std. Error of the Estimate 2.537
(Predictor footnotes a, b, and c as in the ANOVA table above.)

Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%. After transformed variables were substituted to satisfy assumptions and outliers were removed from the sample, the proportion of variance explained by the regression analysis was 61.9%, a difference of 10.8%. The answer to the question is true with caution. A caution is added because of the inclusion of ordinal level variables.
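The adjusted R² column in the Model Summary can be reproduced from R², the number of cases, and the number of predictors, using the standard adjustment formula. A small sketch, with the values taken from the slide:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the number of cases and k the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Model 3 in the Model Summary: R^2 = .619, n = 159 cases, k = 3 predictors.
print(round(adjusted_r2(0.619, 159, 3), 3))  # 0.612, matching the table
```

The same formula reproduces the other rows (.424 for model 1 and .595 for model 2), which is a useful sanity check when reading SPSS output.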
Slide 57: Problem 2
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.

The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80]. After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 58: Dissecting problem 2 - 1
(The problem statement is repeated on each of the dissecting slides; only the callout changes.) The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.05 as alpha for the regression analysis and the more conservative 0.01 as the alpha in testing assumptions.

Slide 59: Dissecting problem 2 - 2
The method for selecting variables is derived from the research question. If we are asked to examine a relationship without any statement about control variables or the best subset of variables, we do a standard multiple regression.

Slide 60: Dissecting problem 2 - 3
The purpose of testing for assumptions and outliers is to identify a stronger model. The main question to be answered in this problem is whether or not the use of transformed variables to satisfy assumptions and the removal of outliers improves the overall relationship between the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether or not the R² for a regression analysis after substituting transformed variables and eliminating outliers is 3.6% higher than that of a regression analysis using the original format for all variables and including all cases.

Slide 61: R² before transformations or removing outliers
To start out, we run a standard multiple regression analysis with prestg80 as the dependent variable and age, educ, and sex as the independent variables.

Slide 62: R² before transformations or removing outliers
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 27.1%. This is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers. For this particular question, we are not interested in the statistical significance of the overall relationship prior to transformations and removing outliers. In fact, it is possible that the relationship is not statistically significant due to variables that are not normal, relationships that are not linear, and the inclusion of outliers.

Slide 63: Normality of the dependent variable
In evaluating assumptions, the first step is to examine the normality of the dependent variable. If it is not normally distributed, or cannot be normalized with a transformation, it can affect the relationships with all other variables. To test the normality of the dependent variable, run the script NormalityAssumptionAndTransformations.SBS. First, move the dependent variable PRESTG80 to the list box of variables to test. Second, click on the OK button to produce the output.
Slide 64: Normality of the dependent variable
The dependent variable "occupational prestige score" [prestg80] satisfies the criteria for a normal distribution. The skewness (0.401) and kurtosis (-0.630) were both between -1.0 and +1.0. No transformation is necessary.

Slide 65: Normality of independent variable: Age
After evaluating the dependent variable, we examine the normality of each metric variable and the linearity of its relationship with the dependent variable. To test the normality of age, run the script NormalityAssumptionAndTransformations.SBS. First, move the independent variable AGE to the list box of variables to test. Second, click on the OK button to produce the output.

Slide 66: Normality of independent variable: Age

Descriptives for AGE OF RESPONDENT:
- Mean 45.99 (std. error 1.023); 95% confidence interval for mean 43.98 to 48.00
- 5% trimmed mean 45.31; median 43.50
- Variance 282.465; std. deviation 16.807
- Minimum 19; maximum 89; range 70; interquartile range 24.00
- Skewness .595 (std. error .148); kurtosis -.351 (std. error .295)

The independent variable "age" [age] satisfies the criteria for the assumption of normality, but does not satisfy the assumption of linearity with the dependent variable "occupational prestige score" [prestg80]. In evaluating normality, the skewness (0.595) and kurtosis (-0.351) were both within the range of acceptable values from -1.0 to +1.0.

Slide 67: Linearity and independent variable: Age
To evaluate the linearity of the relationship between age and occupational prestige, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS. First, move the dependent variable PRESTG80 to the text box for the dependent variable. Second, move the independent variable, AGE, to the list box for independent variables. Third, click on the OK button to produce the output.
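The normality rule of thumb used throughout these slides (skewness and kurtosis both between -1.0 and +1.0) is easy to express outside SPSS. The sketch below uses simple moment-based estimates of skewness and excess kurtosis; note that SPSS reports bias-corrected versions, so its values differ slightly from these. The sample data are invented for illustration.

```python
def skew_kurtosis(xs):
    """Simple moment-based skewness and excess kurtosis.
    (SPSS reports bias-corrected estimates, which differ slightly.)"""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def roughly_normal(xs):
    """Rule of thumb from the slides: both statistics within [-1.0, +1.0]."""
    s, k = skew_kurtosis(xs)
    return -1.0 <= s <= 1.0 and -1.0 <= k <= 1.0

# A symmetric toy sample passes the rule of thumb.
print(roughly_normal([20, 35, 40, 46, 52, 57, 72]))  # True
```

This is only a screening heuristic, as the slides emphasize, not a formal normality test.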
Slide 68: Linearity and independent variable: Age

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980), N = 255:
- AGE OF RESPONDENT: r = .024, Sig. (2-tailed) .706
- Logarithm of AGE [LG10(AGE)]: r = .059, Sig. .348
- Square of AGE [(AGE)**2]: r = -.004, Sig. .956
- Square Root of AGE [SQRT(AGE)]: r = .041, Sig. .518
- Inverse of AGE [-1/(AGE)]: r = .096, Sig. .128
(The transformations of age are, as expected, very highly intercorrelated with one another and with age itself, with r of at least .832 in every pair, N = 270, all significant at the 0.01 level.)

The evidence of nonlinearity in the relationship between the independent variable "age" [age] and the dependent variable "occupational prestige score" [prestg80] was the lack of statistical significance of the correlation coefficient (r = 0.024). The probability for the correlation coefficient was 0.706, greater than the level of significance of 0.01. We cannot reject the null hypothesis that r = 0, and cannot conclude that there is a linear relationship between the variables. Since none of the transformations to improve linearity were successful, it is an indication that the problem may be a weak relationship, rather than a curvilinear relationship correctable by using a transformation. A weak relationship is not a violation of the assumption of linearity, and does not require a caution.

Slide 69: Transformation for Age
The independent variable age satisfied the criteria for normality but did not have a linear relationship to the dependent variable occupational prestige, and none of the transformations linearized the relationship. No transformation will be used: it would not help linearity and is not needed for normality.

Slide 70: Linearity and independent variable: Highest year of school completed
To evaluate the linearity of the relationship between highest year of school and occupational prestige, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS. First, move the dependent variable PRESTG80 to the text box for the dependent variable. Second, move the independent variable, EDUC, to the list box for independent variables. Third, click on the OK button to produce the output.

Slide 71: Linearity and independent variable: Highest year of school completed

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980), N = 254:
- HIGHEST YEAR OF SCHOOL COMPLETED: r = .495**, Sig. (2-tailed) <.001
- Logarithm of EDUC [LG10(21-EDUC)]: r = -.512**, Sig. <.001
- Square of EDUC [(EDUC)**2]: r = .528**, Sig. <.001
- Square Root of EDUC [SQRT(21-EDUC)]: r = -.518**, Sig. <.001
- Inverse of EDUC [-1/(21-EDUC)]: r = -.423**, Sig. <.001
**. Correlation is significant at the 0.01 level (2-tailed).

The independent variable "highest year of school completed" [educ] satisfies the criteria for the assumption of linearity with the dependent variable "occupational prestige score" [prestg80], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the independent variable "highest year of school completed" [educ] and the dependent variable "occupational prestige score" [prestg80] was the statistical significance of the correlation coefficient (r = 0.495). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.

Slide 72: Normality of independent variable: Highest year of school completed
To test the normality of EDUC, highest year of school completed, run the script NormalityAssumptionAndTransformations.SBS. First, move the variable EDUC to the list box of variables to test. Second, click on the OK button to produce the output.
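The linearity script's strategy on slides 68 and 71 is to correlate the dependent variable with the raw predictor and with each of its candidate transformations, then see whether any transformation produces a markedly stronger linear relationship. A Python sketch of that screening loop, using a plain Pearson correlation and toy data with a deliberately quadratic relationship (the reflected forms used for educ, e.g. LG10(21-EDUC), are omitted here for simplicity):

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Candidate transformations tried by the linearity script.
transforms = {
    "log":     lambda x: math.log10(x),
    "square":  lambda x: x ** 2,
    "sqrt":    lambda x: math.sqrt(x),
    "inverse": lambda x: -1.0 / x,
}

# Toy data with a curvilinear (quadratic) relationship.
iv = [1, 2, 3, 4, 5, 6, 7, 8]
dv = [x ** 2 for x in iv]

for name, f in transforms.items():
    r = pearson_r([f(x) for x in iv], dv)
    print(f"{name:7s} r = {r:+.3f}")
# The "square" transformation yields r = +1.000 here, flagging it as the
# transformation that linearizes this particular relationship.
```

For age, by contrast, no transformation raised the correlation to significance, which is why the slides conclude the relationship is simply weak rather than curvilinear.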
Slide 73: Normality of independent variable: Highest year of school completed

Descriptives for HIGHEST YEAR OF SCHOOL COMPLETED:
- Mean 13.12 (std. error .179); 95% confidence interval for mean 12.77 to 13.47
- 5% trimmed mean 13.14; median 13.00
- Variance 8.583; std. deviation 2.930
- Minimum 2; maximum 20; range 18; interquartile range 3.00
- Skewness -.137 (std. error .149); kurtosis 1.246 (std. error .296)

In evaluating normality, the skewness (-0.137) was between -1.0 and +1.0, but the kurtosis (1.246) was outside the range from -1.0 to +1.0. None of the transformations for normalizing the distribution of "highest year of school completed" [educ] were effective.

Slide 74: Transformation for highest year of school
The independent variable, highest year of school, had a linear relationship to the dependent variable, occupational prestige, but did not satisfy the criteria for normality, and none of the transformations for normalizing its distribution were effective. No transformation will be used: it would not help normality and is not needed for linearity. A caution should be added to any findings.

Slide 75: Homoscedasticity: sex
To evaluate the homoscedasticity of the relationship between sex and occupational prestige, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS. First, move the dependent variable PRESTG80 to the text box for the dependent variable. Second, move the independent variable, SEX, to the list box for independent variables. Third, click on the OK button to produce the output.

Slide 76: Homoscedasticity: sex
Based on the Levene Test, the variance in "occupational prestige score" [prestg80] is homogeneous for the categories of "sex" [sex].
The probability associated with the Levene Statistic (0.808) is greater than the level of significance, so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied. Even if we had violated the assumption, we would not do a transformation, since it could impact the relationships of the other independent variables with the dependent variable.

Slide 77: Adding a transformed variable
Even though we do not need a transformation for any of the variables in this analysis, we will demonstrate how to use a script, such as the normality script, to add a transformed variable to the data set, e.g. a logarithmic transformation for highest year of school. First, move the variable that we want to transform to the list box of variables to test. Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes. Third, clear the checkbox for "Delete transformed variables from the data"; this will save the transformed variable. Fourth, click on the OK button to produce the output.

Slide 78: The transformed variable in the data editor
If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set. Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.

Slide 79: The regression to identify outliers
We can use the regression procedure to identify both univariate and multivariate outliers. We start with the same dialog we used for the last analysis, in which prestg80 was the dependent variable and age, educ, and sex were the independent variables. If we needed to use any transformed variables, we would substitute them now. We will save the calculated values of the outlier statistics to the data set. Click on the Save… button to specify what we want to save.
Slide 80: Saving the measures of outliers
First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set. Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables. Third, click on the OK button to complete the specifications.

Slide 81: The variables for identifying outliers
The values for identifying univariate outliers on the dependent variable are in a column which SPSS has named sre_1. The values for identifying multivariate outliers on the independent variables are in a column which SPSS has named mah_1.

Slide 82: Computing the probability for Mahalanobis D²
To compute the probability of D², we will use an SPSS function in a Compute command. First, select the Compute… command from the Transform menu.

Slide 83: Formula for probability for Mahalanobis D²
First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score. Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3. Third, click on the OK button to signal completion of the compute variable dialog. Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.
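The Compute command on slide 83 evaluates 1 - CDF.CHISQ(mah_1, 3). The same quantity can be checked outside SPSS. For exactly 3 degrees of freedom the chi-square CDF has a closed form using the error function, so the sketch below needs only the standard library; for other degrees of freedom you would instead use an incomplete-gamma routine (e.g. scipy.stats.chi2.cdf).

```python
import math

def chi2_cdf_df3(x):
    """Chi-square CDF for exactly 3 degrees of freedom (closed form):
    CDF(x) = erf(sqrt(x/2)) - sqrt(2x/pi) * exp(-x/2).
    Plays the role of SPSS's CDF.CHISQ(x, 3) for this analysis,
    where 3 is the number of independent variables."""
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# p_mah_1 = 1 - CDF.CHISQ(mah_1, 3), as on slide 83.
d2 = 16.97  # Mahalanobis D-squared for the flagged case on slide 84
p = 1 - chi2_cdf_df3(d2)
print(round(p, 4))  # 0.0007 -- below the 0.001 cutoff, so the case is an outlier
```

This reproduces the p = 0.0007 reported for case 20001984 on the next slide.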
Slide 84: The multivariate outlier
Using the probabilities computed in p_mah_1 to identify outliers, scroll down through the list of cases to see the one case with a probability less than 0.001. There is 1 case that has a combination of scores on the independent variables that is sufficiently unusual to be considered an outlier (case 20001984: Mahalanobis D² = 16.97, p = 0.0007).

Slide 85: The univariate outlier
Similarly, we can scroll down the values of sre_1, the studentized residual, to see the one outlier with a value larger than 3.0. There is 1 case that has a score on the dependent variable that is sufficiently unusual to be considered an outlier (case 20000391: studentized residual = 4.14).

Slide 86: Omitting the outliers
To omit the outliers from the analysis, we select in the cases that are not outliers. First, select the Select Cases… command from the Data menu.

Slide 87: Specifying the condition to omit outliers
First, mark the "If condition is satisfied" option button to indicate that we will enter a specific condition for including cases. Second, click on the If… button to specify the criteria for inclusion in the analysis.

Slide 88: The formula for omitting outliers
To eliminate the outliers, we request the cases that are not outliers. The formula specifies that we should include a case if its studentized residual (regardless of sign) is less than 3 and the probability for its Mahalanobis D² is higher than the level of significance, 0.001. After typing in the formula, click on the Continue button to close the dialog box.

Slide 89: Completing the request for the selection
To complete the request, we click on the OK button.
Slide 90: The omitted multivariate outlier
SPSS identifies the excluded cases by drawing a slash mark through the case number. Most of the slashes are for cases with missing data, but the case with the low probability for Mahalanobis distance is also among those that will be omitted.

Slide 91: Running the regression without outliers
We run the regression again, excluding the outliers. Select the Regression | Linear command from the Analyze menu.

Slide 92: Opening the save options dialog
We specify the dependent and independent variables. If we wanted to use any transformed variables, we would substitute them now. On our last run, we instructed SPSS to save studentized residuals and Mahalanobis distances. To prevent these values from being calculated again, click on the Save… button.

Slide 93: Clearing the request to save outlier data
First, clear the checkbox for Studentized residuals. Second, clear the checkbox for Mahalanobis distance. Third, click on the OK button to complete the specifications.

Slide 94: Opening the statistics options dialog
Once we have removed outliers, we need to check the sample size requirement for regression. Since we will need the descriptive statistics for this, click on the Statistics… button.

Slide 95: Requesting descriptive statistics
First, mark the checkbox for Descriptives. Second, click on the Continue button to complete the specifications.

Slide 96: Requesting the output
Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.

Slide 97: Sample size requirement
The minimum ratio of valid cases to independent variables for multiple regression is 5 to 1.
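The sample-size check used throughout these slides is a pair of simple ratio tests: a minimum of 5 cases per independent variable, and a preferred ratio of 15 to 1 (or 50 to 1 when the stepwise method is used). A small sketch with the counts from both problems in this deck:

```python
def sample_size_check(n_cases, n_ivs, stepwise=False):
    """Ratio rules from the slides: minimum 5:1; preferred 15:1,
    or 50:1 when the stepwise method is used."""
    ratio = n_cases / n_ivs
    minimum = ratio >= 5
    preferred = ratio >= (50 if stepwise else 15)
    return ratio, minimum, preferred

# Problem 2 after removing 2 outliers: 252 cases, 3 predictors, standard entry.
print(sample_size_check(252, 3))        # (84.0, True, True)
# Problem 1 after removing 4 outliers: 159 cases, 3 predictors, stepwise.
print(sample_size_check(159, 3, True))  # (53.0, True, True)
```

Failing the minimum ratio makes the problem an inappropriate application of a statistic; failing only the preferred ratio downgrades a "true" answer to "true with caution."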
After removing 2 outliers, there are 252 valid cases and 3 independent variables. The ratio of cases to independent variables for this analysis is 84.0 to 1, which satisfies the minimum requirement. In addition, the ratio of 84.0 to 1 satisfies the preferred ratio of 15 to 1.

Slide 98: Significance of regression relationship
The probability of the F statistic (36.639) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.

Slide 99: Increase in proportion of variance
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 27.1%. No transformed variables were substituted to satisfy assumptions, but outliers were removed from the sample. The proportion of variance explained by the regression analysis after removing outliers was 30.7%, a difference of 3.6%. The answer to the question is true with caution. A caution is added because of a violation of regression assumptions.

Slide 100: Impact of assumptions and outliers - 1
The following is a guide to the decision process for answering problems about the impact of assumptions and outliers on analysis:
- Is the dependent variable metric, and are the independent variables metric or dichotomous? If no, it is an inappropriate application of a statistic.
- Is the ratio of cases to independent variables at least 5 to 1? If no, it is an inappropriate application of a statistic.
- If yes to both, run the baseline regression and record R² for future reference, using the method for including variables identified in the research question.

Slide 101: Impact of assumptions and outliers - 2
- Is the dependent variable normally distributed? If not, try (1) a logarithmic transformation, (2) a square root transformation, or (3) an inverse transformation. If unsuccessful, add a caution.
- Are the metric independent variables normally distributed and linearly related to the dependent variable? If not, try (1) a logarithmic transformation, (2) a square root transformation, (3) a square transformation (for linearity), or (4) an inverse transformation. If unsuccessful, add a caution.
- Is the dependent variable homoscedastic across the categories of the dichotomous independent variables? If not, add a caution.

Slide 102: Impact of assumptions and outliers - 3
- Substituting any transformed variables, run the regression using direct entry to include all variables and request the statistics for detecting outliers.
- Are there univariate outliers (on the DV) or multivariate outliers (on the IVs)? If yes, remove the outliers from the data.
- Is the ratio of cases to independent variables still at least 5 to 1? If no, it is an inappropriate application of a statistic. If yes, run the regression again using the transformed variables and eliminating the outliers.

Slide 103: Impact of assumptions and outliers - 4
- Is the probability of the ANOVA test of the regression less than or equal to the level of significance? If no, the answer is false.
- Is the stated increase in R² correct? If no, the answer is false.
- Does the sample satisfy the preferred ratio of cases to independent variables, 15 to 1 (50 to 1 for stepwise)? If no, the answer is true with caution.

Slide 104: Impact of assumptions and outliers - 5
- Were other cautions added for ordinal variables or violations of assumptions? If yes, the answer is true with caution. If no, the answer is true.
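The final stages of the decision guide (slides 103-104) can be sketched as a small function. This is only a paraphrase of the flowchart's last steps, assuming the level-of-measurement and minimum sample-size checks have already passed; the parameter names are my own.

```python
def answer(p_anova, alpha, r2_increase_matches, ratio, stepwise=False, cautions=False):
    """Sketch of the last steps of the slides' decision flow for
    'increase in R-squared' problems. Assumes the earlier checks
    (metric variables, minimum 5:1 ratio) have already passed."""
    if p_anova > alpha:
        return "False"                 # overall relationship not significant
    if not r2_increase_matches:
        return "False"                 # stated R-squared increase is wrong
    preferred = 50 if stepwise else 15
    if ratio < preferred or cautions:
        return "True with caution"     # small sample or assumption cautions
    return "True"

# Problem 2: significant ANOVA, the 3.6% increase is confirmed, 84:1 ratio,
# but a caution was added for the violated normality assumption (educ).
print(answer(0.0005, 0.05, True, 84.0, cautions=True))  # True with caution
```

Mapping the flowchart into code like this makes the precedence explicit: significance and the correctness of the stated increase are checked before any caution downgrades the answer.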