Running Head: Normality and Outliers in ANOVA and MANOVA

Checking for Normality and Outliers in ANOVA and MANOVA
Lynne Cox
University of Calgary

Checking for non-normality and outliers in ANOVA and MANOVA

A parametric test is a statistical procedure that uses sample statistics to make inferences regarding the general population. To ensure the components of the test are compatible with each other, there are assumptions that must be met within each multivariate analysis (Stevens, 2009). Once the data have been collected, and before moving forward with the statistical analysis, the next step is to look at the quality of the data and take some necessary precautions. Data screening involves checking that the data have been entered correctly, checking for missing values and outliers, and checking for normality (Hindes, 2012). This paper covers two of these checks, non-normality and outliers, as they apply to the Analysis of Variance (ANOVA) and the Multivariate Analysis of Variance (MANOVA).

ANOVA is a statistical technique used to test whether the means of three or more groups differ. MANOVA is an extension of ANOVA that tests the difference between two or more groups on vectors of means, which allows two or more dependent variables to be examined at once.
Retrieved from http://www.creative-wisdom.com/pub/parametric_WUSS2002.pdf

Parametric tests assume that the variables are normally distributed, an assumption that rests on the Central Limit Theorem. The theorem states that as the sample size becomes large, the mean and the sum of the sample tend to follow a normal distribution, commonly referred to as the bell curve (Stevens, 2009).
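The Central Limit Theorem can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the skewed exponential population, the sample size of 50, the 2000 repetitions, the seed and the helper name `skewness` are all hypothetical choices, not part of the original analysis. It compares the skewness of raw scores with the skewness of repeated sample means.

```python
import random
import statistics

def skewness(values):
    """Mean cubed z-score: near 0 for a symmetric, bell-shaped distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

random.seed(607)  # arbitrary seed so the run is repeatable

SAMPLE_SIZE = 50   # hypothetical n per sample
N_SAMPLES = 2000   # hypothetical number of repeated samples

# Hypothetical, strongly right-skewed population: exponential with mean 1.0
raw_scores = [random.expovariate(1.0) for _ in range(N_SAMPLES)]

# Draw repeated samples and keep only each sample's mean
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

print(f"skewness of raw scores:   {skewness(raw_scores):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
print(f"mean of sample means:     {statistics.fmean(sample_means):.2f}")
print(f"SD of sample means:       {statistics.stdev(sample_means):.3f}")
```

Under these assumptions the raw scores should stay clearly skewed, while the sample means should be far more symmetric and cluster around the population mean of 1.0.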
If a dataset follows a normal distribution, then about 68% of the observations fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations. Although no method gives a definitive conclusion, two ways to evaluate normality are graphical representation and statistical methods.
Central Limit Theorem: A Simple Explanation of the CLT, Suite101.com. Retrieved from http://suite101.com/article/central-limit-theorem-a98157#ixzz1yOeaq8k5

In this example the lighter area shows about 68% of the observations falling within one standard deviation of the mean (-1, 1). About 95% of the observations fall within two standard deviations of the mean (-2, 2), and about 99.7% fall within three standard deviations of the mean (-3, 3), resulting in a bell curve.
Retrieved from http://www.stat.yale.edu/Courses/1997-98/101/normal.htm

Outliers

Outliers are data points that are extreme, atypical and infrequent. Their values are far from the mean and fall outside the distribution pattern. Outliers are not always random or due to chance and need to be given special notice, as a single outlier can have an excessive influence on the strength and direction of the linear relationship between two variables (Sattler, 2008, p. 99). In large data samples, you can expect to find a small number of outliers. A data sample will always have a sample minimum and a sample maximum, but these are not necessarily outliers, as they may lie close to the other data points. Outliers can be caused by a data recording or entry error, by an instrument error, or by subjects simply being different from the rest of the sample (Sattler, 2008). Since outliers might cause the data to be non-normal, it is important to identify the cause of each outlier and then decide what to do about it (Stevens, 2009, p.
11).

Examples with a small data set and a larger data set

Outliers are easy to spot when there are only two variables:

Case Number    x1     x2
1              111    68
2              92     46
3              90     50
4              107    59
5              98     50
6              150    66
7              118    54
8              110    51
9              117    59
10             94     97

In this data set, it is easy to see that case number 6 has an unexpectedly high value on x1, as does case number 10 on x2. Both can be identified as outliers just by inspecting the data set (Stevens, 2009, p. 11).

In the example below, it is harder to identify the outliers because the data set is larger and has four variables:

Case Number    x1     x2     x3     x4
1              111    68     17     81
2              92     46     28     67
3              90     50     19     83
4              107    59     25     71
5              98     50     13     92
6              150    66     20     90
7              118    54     11     101
8              110    51     26     82
9              117    59     18     87
10             94     97     12     69
11             130    57     16     97
12             118    51     19     78
13             155    40     9      58
14             118    61     20     103
15             109    66     13     88

In this example, case 13 does not split off dramatically from the other subjects' scores and does not stand out at first glance. A closer look shows that its scores on x2, x3 and x4 are low, while its score on x1 is high compared with the rest of the data set (Stevens, 2009, p. 12).

The boxplot is a graphical display in which the black line marks the median and the shaded box holds the middle 50% of the scores, with the whiskers above and below the box each covering about 25% of the scores. The ends of the whiskers mark the smallest and largest non-outlying scores. An open dot identifies a mild outlier (a score more than 1.5 IQR, or interquartile ranges, from the edge of the box), and a star identifies an extreme outlier (a score more than 3 IQR from the edge of the box).

The normal Q-Q plot is another graphical way to assess normality and identify outliers. In the example plot, all but one of the dots fall on or very close to the reference line.
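The 1.5 IQR and 3 IQR boxplot rules can be sketched in a few lines of Python using the x1 and x2 scores of the 15 cases listed above. This is a minimal sketch, not SPSS's own routine: the function name `boxplot_outliers` is hypothetical, and the "inclusive" quartile method is an assumption, since statistical packages compute quartiles in slightly different ways and borderline scores could be classified differently.

```python
import statistics

# x1 and x2 scores for the 15 cases in the four-variable example
x1 = [111, 92, 90, 107, 98, 150, 118, 110, 117, 94, 130, 118, 155, 118, 109]
x2 = [68, 46, 50, 59, 50, 66, 54, 51, 59, 97, 57, 51, 40, 61, 66]

def boxplot_outliers(scores):
    """Return (mild, extreme) case numbers using 1.5 IQR and 3 IQR fences."""
    # Assumed quartile convention; other packages may differ slightly
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for case, score in enumerate(scores, start=1):
        # Distance beyond the nearer edge of the box (0 if inside the box)
        beyond = max(q1 - score, score - q3, 0)
        if beyond > 3 * iqr:
            extreme.append(case)
        elif beyond > 1.5 * iqr:
            mild.append(case)
    return mild, extreme

for name, scores in (("x1", x1), ("x2", x2)):
    mild, extreme = boxplot_outliers(scores)
    print(f"{name}: mild outliers = cases {mild}, extreme outliers = cases {extreme}")
```

With these scores the sketch flags cases 6 and 13 on x1 and case 10 on x2 as mild outliers, and no case as an extreme outlier. Because each variable is screened on its own, a check like this cannot show whether a case is unusual across all four variables at once.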
Retrieved from http://www.psychwiki.com/wiki/How_do_I_determine_whether_my_data_are_normal%3F

When there are extreme values in a data set, it is better to use the median as a measure of central tendency, as the median is largely unaffected by outliers and is a robust measure of central tendency (Meyers, 2006). If an outlier is discovered, it is important to identify its cause before deciding on further analysis. If the value was entered incorrectly, it can be re-entered; if it was caused by an instrumentation error, it can be dropped. The analysis can also be run twice, once with the outlier and once without it.

Checking for non-normality and outliers in an ANOVA

As previously mentioned, the two main methods of assessing normality are:
Graphically: using a visual inspection
Numerically: relying on a statistical test

Beginning researchers are advised to carry out both methods rather than rely on just one. SPSS (Statistical Package for the Social Sciences) can be used to check for both normality and outliers. Using the Explore command in SPSS, we can first look for any outliers and then test for normality.

Using SPSS to look for univariate (ANOVA) outliers
1. Choose Analyze, Descriptive Statistics and then Explore.
2. Select the variable.
3. Click on Statistics and check off Outliers.
4. Click on Plots and unclick Stem and Leaf.
5. Click OK to produce the output.
Retrieved from https://statistics.laerd.com/spss-tutorials/testing-for-normality-using-spssstatistics.php

Descriptives Table
The mean and the trimmed mean help identify outliers. In this case the mean is 1.77 while the 5% trimmed mean is 1.74, only slightly lower. The 5% trimmed mean is the mean calculated after the highest 5% and the lowest 5% of the scores have been removed. Comparing the two values shows whether any extreme scores are having an influence on the variable; here the two values are close, which suggests that extreme scores have little influence on the mean.
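The comparison between the mean and the 5% trimmed mean can be sketched as follows. The scores are hypothetical (they are not the data behind the Descriptives table), the function name `trimmed_mean` is an illustrative choice, and the sketch drops only whole cases from each tail, which may differ from how SPSS handles fractional cases.

```python
import statistics

def trimmed_mean(scores, percent=5):
    """Mean after dropping `percent`% of the cases from each tail.

    Simplification: only whole cases are dropped, so very small samples
    may lose no cases at all.
    """
    ordered = sorted(scores)
    k = len(ordered) * percent // 100  # whole cases cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)

# Hypothetical scores: 19 typical values and one extreme value (60)
scores = [12, 14, 15, 15, 16, 16, 17, 17, 18, 18,
          18, 19, 19, 20, 20, 21, 21, 22, 23, 60]

mean = statistics.fmean(scores)
trimmed = trimmed_mean(scores)
print(f"mean = {mean:.2f}, 5% trimmed mean = {trimmed:.2f}")
```

For these hypothetical scores the mean is about 20.05 and the trimmed mean about 18.28. A gap of that size points to an extreme score pulling the mean upward, whereas a small gap such as 1.77 versus 1.74 does not.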
Extreme Values and the Boxplot
The boxplot and the Extreme Values table show the mild and extreme outliers. The Extreme Values table identifies the case number of each one. This information helps guide the decision on what is to be done with the outliers. You may choose to re-enter the data, remove the outlier, or run two analyses, one with the outlier and one without.

Using SPSS to check for non-normality in an ANOVA
1. Choose Analyze, Descriptive Statistics and then Frequencies.
2. Select the variables.
3. Click on Charts, then Histogram, with normal curve.
4. Click OK to produce the output.

Most tests rely on the assumption of normality. Referring to the Descriptives table, we can begin by looking at the measures of skewness and kurtosis. Skewness measures the symmetry of a distribution, while kurtosis measures its general peakedness. A normally distributed variable (showing mesokurtosis) will have skewness and kurtosis values around zero (Meyers, 2006).

Although the histogram is another approach to include when looking for non-normality in a univariate analysis, it does not provide a definitive indication that normality has been violated. The histogram should be used together with the normal probability plot. This plot ranks the data and plots them against a straight diagonal line; when the data fall directly on the line, normality is assumed. In this example the data fall off the line, so further analysis is needed.

Tests of Normality
1. Choose Analyze, Descriptive Statistics and then Explore.
2. Select the variables.
3. Click on Plots, unclick Stem and Leaf, and click Normality plots with tests.
4. Click OK to produce the output.

When looking at the Tests of Normality table, you want the tests to come out not significant, because a non-significant result means the distribution does not differ significantly from normal. In this example both tests are significant (p < .001), so both indicate non-normality, which is what the other approaches have also indicated.
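The skewness, kurtosis and probability plot checks described in this section can also be sketched numerically. This is a minimal sketch under stated assumptions: it uses the x2 scores of the 15 cases in the earlier four-variable example, simple moment-based formulas (packages such as SPSS apply small-sample corrections, so their values will differ somewhat), and one common plotting position, (rank - 0.5) / n. The function names are hypothetical, and the formal tests reported by SPSS are not reproduced here.

```python
import statistics

# x2 scores for the 15 cases in the four-variable example
x2 = [68, 46, 50, 59, 50, 66, 54, 51, 59, 97, 57, 51, 40, 61, 66]

def shape_statistics(scores):
    """Return (skewness, excess kurtosis) from standardized moments."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    z = [(s - mean) / sd for s in scores]
    skew = sum(v ** 3 for v in z) / len(z)
    excess_kurtosis = sum(v ** 4 for v in z) / len(z) - 3
    return skew, excess_kurtosis

def qq_points(scores):
    """Pair each ranked score with the z-score expected under normality."""
    n = len(scores)
    normal = statistics.NormalDist()
    # Assumed plotting position; other conventions exist
    expected = [normal.inv_cdf((rank - 0.5) / n) for rank in range(1, n + 1)]
    return list(zip(expected, sorted(scores)))

def pearson_r(pairs):
    """Pearson correlation between the two coordinates of each pair."""
    xs, ys = zip(*pairs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

skew, kurt = shape_statistics(x2)
points = qq_points(x2)
print(f"skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
print(f"correlation of Q-Q points = {pearson_r(points):.3f}")
for expected_z, score in points:
    print(f"{expected_z:6.2f}  {score}")
```

Skewness and kurtosis values near zero, together with Q-Q points that lie close to a straight line (a correlation near 1), are consistent with normality. A point that sits well away from the others, such as the score of 97, is the numerical counterpart of a dot falling off the line in the plot.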
Checking for Non-Normality and Outliers in a MANOVA
As MANOVA tests are sensitive to outliers, the data should be screened and run through normality tests and plots to see that the assumptions are met. Mahalanobis distances help identify outliers in a MANOVA. If the Mahalanobis distance for a case exceeds the critical value found in the table, the case is considered an outlier. “The Mahalanobis distance statistic D² measures the multivariate “distance” between each case and the group multivariate mean (known as a centroid)” (Meyers, 2006, p. 67). The critical value tables are located in the back of most textbooks.

Because more variables must be normally distributed for the assumption of normality to hold in a MANOVA, checking for non-normality is a more rigorous task than it is in an ANOVA. Two additional properties to check for the normality assumption are “(a) any linear combinations of the variables are normally distributed, and (b) all subsets of the set of variables have multivariate normal distributions” (Stevens, 2009, p. 222). The second property implies that the scatterplot for each pair of variables will be elliptical (Stevens, 2009).

[Figure: Scatterplot.gif. Example of a scatterplot showing an elliptical correlation between the variables. The higher the correlation, the thinner the ellipse (Sattler, 2008).]
The scatterplot also shows outliers as the data points that fall outside the oval (elliptical) shape.
Retrieved from: http://www.psychwiki.com/wiki/Analyzing_Data

As in the ANOVA, the distribution of each variable in a MANOVA should follow the bell-shaped curve. In the example, variable 2 shows positive skewness, while variable 1 appears normal, as it follows the bell shape.
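The Mahalanobis distance screening described in this section can be sketched for the 15 cases and four variables of the earlier example. This is a minimal sketch rather than the SPSS procedure: all function names are hypothetical, the sample covariance matrix with an n - 1 denominator is an assumption, and the code uses only the Python standard library, solving the linear system directly instead of forming the inverse matrix.

```python
import statistics

# The 15 cases (x1, x2, x3, x4) from the four-variable example
cases = [
    (111, 68, 17, 81), (92, 46, 28, 67), (90, 50, 19, 83),
    (107, 59, 25, 71), (98, 50, 13, 92), (150, 66, 20, 90),
    (118, 54, 11, 101), (110, 51, 26, 82), (117, 59, 18, 87),
    (94, 97, 12, 69), (130, 57, 16, 97), (118, 51, 19, 78),
    (155, 40, 9, 58), (118, 61, 20, 103), (109, 66, 13, 88),
]

def covariance_matrix(rows):
    """Return the centroid and the sample covariance matrix (n - 1 denominator)."""
    n, p = len(rows), len(rows[0])
    centroid = [statistics.fmean([row[j] for row in rows]) for j in range(p)]
    cov = [
        [sum((r[i] - centroid[i]) * (r[j] - centroid[j]) for r in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]
    return centroid, cov

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * size
    for row in reversed(range(size)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, size))
        x[row] = (aug[row][size] - tail) / aug[row][row]
    return x

def mahalanobis_d2(rows):
    """Squared Mahalanobis distance of every case from the centroid."""
    centroid, cov = covariance_matrix(rows)
    distances = []
    for row in rows:
        diff = [value - c for value, c in zip(row, centroid)]
        weighted = solve(cov, diff)  # equals cov^-1 @ diff
        distances.append(sum(d * w for d, w in zip(diff, weighted)))
    return distances

d2 = mahalanobis_d2(cases)

# List the cases from largest to smallest distance for comparison with the
# tabled critical value
for case, value in sorted(enumerate(d2, start=1), key=lambda cv: -cv[1]):
    print(f"case {case:2d}: D2 = {value:.2f}")
```

The cases at the top of the listing are the candidates to compare against the tabled critical value. Unlike a variable-by-variable check, the distance takes the correlations among the variables into account, which is why a case can have a large distance without having an extreme score on any single variable.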
When a variable has violated the assumption of normality, a data transformation can be used to modify the variable (Hindes, 2012). The square root transformation, the logarithmic transformation and the inverse transformation are three of the more common transformations used. Including a variable that is not normal will reduce the power of the test.

Conclusion

Research and design uses a systematic approach to collecting and analyzing data to help explain or predict a certain occurrence or trend. Using univariate and multivariate data analysis, we are able to obtain a more detailed description of the relationships among the variables being studied. Stronger results are reached if the data are screened and the assumptions of the test have not been violated. When using the SPSS software program and following a systematic process, checking for non-normality and outliers in an ANOVA and a MANOVA is straightforward and thorough, with the data being examined through both graphical and statistical methods.

References

Gay, L. R., Mills, G. E., & Airasian, P. (2012). Educational research: Competencies for analysis and applications (10th ed.). NJ: Pearson Education, Inc.
Hindes, Y. (2012). EDPS 607 L20 Multivariate Design and Analysis, Spring 2012. PowerPoint.
Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. Thousand Oaks, California: Sage Publications.
Sattler, J. M. (2008). Assessment of children: Cognitive foundations (5th ed.). La Mesa, California: Jerome M. Sattler, Publisher, Inc.
Todorov, V., & Filzmoser, P. (2010). Robust statistic for the one-way MANOVA. Computational Statistics and Data Analysis, 54, 37-48.
doi:10.1016/j.csda.2009.08.015
Retrieved from: http://www.creative-wisdom.com/pub/parametric_WUSS2002.pdf
Retrieved from: http://www.scribd.com/doc/49320849/115/Assumption-testing
Retrieved from: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Retrieved from: https://statistics.laerd.com/spss-tutorials/testing-for-normality-using-spssstatistics.php
Retrieved from: http://pathwayscourses.samhsa.gov/eval201/eval201_supps_pg16.htm
Scatterplot.gif retrieved from: https://www.google.ca/search?hl=en&q=scatterplot+elliptical&aq=f&aqi=g-lK1gbsK1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&biw=1202&bih=591&wrapid=tlif134043255544610&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=zGDlT5-sNMTg2gX5ofDZCQ