
Research Methods I: SPSS for Windows part 2

2: Exploratory Data Analysis using SPSS

The first stage in any data analysis is to explore the data collected. Usually we are interested in looking at descriptive statistics such as means, modes, medians, frequencies and so on. Often, we are also interested in checking the assumptions of the data (remember that parametric tests require normally distributed data, so we often want to assess the degree to which the data are normal).

2.1. Opening a File

Throughout this course you will work with data files that are provided on disk. It is, therefore, important that you know how to load these data files into SPSS. The procedure is very simple. To open a file, use the Open File icon on the toolbar (or use the menus: File ⇒ Open) to activate the dialogue box in Figure 2.1. First, you need to find the location at which your file is stored. If you are loading a file from a floppy disk then access the floppy drive; for this course, data files are on the drive labelled Courses on psyserver, in a folder called andyfield. Once the drive has been accessed you should see a list of files and folders that can be opened. If you are currently in the Data Editor window then SPSS will display only SPSS data files (if you are in the navigator window then only output files will be displayed). You can open a folder by double-clicking on the folder icon. Once you have tracked down the required file you can open it either by selecting it with the mouse and then clicking on Open, or by double-clicking on the icon next to the file you want. The data/output will then appear in the appropriate window.

The data we will use are in the file SPSSExam.sav. This file contains four variables: exam (first-year SPSS exam scores as a percentage), computer (a measure of computer literacy as a percentage), lecture (percentage of SPSS lectures attended), and numeracy (a measure of each student's numeracy out of 15).

Figure 2.1: Dialogue box to open a file.

2.2. Obtaining Summary Information for One Group

2.2.1. Running the Analysis

To see the distribution of our four variables, we can use the frequencies command, via the menu path Statistics ⇒ Summarize ⇒ Frequencies…, to access the main dialogue box in Figure 2.2. The variables in the data editor will be listed on the left-hand side, and can be transferred to the box labelled Variable(s) by clicking on a variable (or highlighting several with the mouse) and then clicking on the arrow button. Any analyses you choose to do will be done on every variable listed in the Variable(s) box. By default, SPSS produces a frequency distribution of all scores in table form. However, two further dialogue boxes provide other options. The Statistics dialogue box is accessed by clicking on the Statistics button, and the Charts dialogue box by clicking on the Charts button.

The Statistics dialogue box allows you to select several ways of describing a distribution of scores: measures of central tendency (mean, mode, median), measures of variability (range, standard deviation, variance, quartile splits) and measures of shape (kurtosis and skewness). To describe the characteristics of the data we should select the mean, mode, median, standard deviation, variance and range. To check that a distribution of scores is normal, we need to look at the values of kurtosis and skewness. The Charts option provides a simple way to plot the frequency distribution of scores (as a bar chart, a pie chart, or a histogram). The most useful chart is the histogram and, for the purpose of checking normality, we should select the option of displaying a normal curve on the histogram. When you have selected the appropriate options, return to the main dialogue box by clicking on Continue. Once in the main dialogue box, click OK to run the analysis.

Figure 2.2: Dialogue boxes for the frequencies command.

2.2.2. Output

SPSS Output 2.1 shows the table of descriptive statistics for the four variables. From this table, we can see that, on average, students attended nearly 60% of lectures, obtained 58% in their SPSS exam, scored only 51% on the computer literacy test, and scored only 5 out of 15 on the numeracy test. In addition, the standard deviation for computer literacy was relatively small compared to those for the percentage of lectures attended and the exam scores. Finally, these latter two variables had several modes.

The other important measures are the skewness and the kurtosis, both of which have an associated standard error. The values of skewness and kurtosis should be zero if the distribution is normal. Positive values of skewness indicate a pile-up of scores on the left of the distribution, whereas negative values indicate a pile-up on the right. Positive values of kurtosis indicate a pointy distribution, whereas negative values indicate a flat distribution. The further the value is from zero, the more likely it is that the data are not normally distributed. However, the actual values of skewness and kurtosis are not, in themselves, informative. Instead, we should take each value and convert it to a z-score using the z-score equation (skewness) and a variation on this equation (kurtosis):

$$z_{\text{skewness}} = \frac{S - 0}{SE_{\text{skewness}}} \qquad z_{\text{kurtosis}} = \frac{K - 0}{SE_{\text{kurtosis}}}$$

In these equations, the values of S (skewness) and K (kurtosis) and their respective standard errors are produced by SPSS. However, the significance of z should be tested conservatively (at p < 0.01) in small samples and probably not at all for large samples.

Statistics
                          Computer    Percentage on   Percentage of
                          literacy    SPSS exam       lectures attended   Numeracy
N Valid                   100         100             100                 100
N Missing                 0           0               0                   0
Mean                      50.7100     58.1000         59.7650             4.8500
Std. Error of Mean        .8260       2.1316          2.1685              .2706
Median                    51.5000     60.0000         62.0000             4.0000
Mode                      54.00       72.00 (a)       48.50 (a)           4.00
Std. Deviation            8.2600      21.3156         21.6848             2.7057
Variance                  68.2282     454.3535        470.2296            7.3207
Skewness                  -.174       -.107           -.422               .961
Std. Error of Skewness    .241        .241            .241                .241
Kurtosis                  .364        -1.105          -.179               .946
Std. Error of Kurtosis    .478        .478            .478                .478
Range                     46.00       84.00           92.00               13.00
Minimum                   27.00       15.00           8.00                1.00
Maximum                   73.00       99.00           100.00              14.00
a. Multiple modes exist. The smallest value is shown.

SPSS Output 2.1

The output also provides tabulated frequency distributions of each variable. These tables list each score and the number of times that it is found within the data. In addition, each frequency value is expressed as a percentage of the sample (in this case the frequencies and percentages are the same because the sample size was 100). Also, the cumulative percentage is given, which tells us how many cases (as a percentage) fell at or below a certain score. So, for example, we can see that 66% of numeracy scores were 5 or less, 74% were 6 or less, and so on. Looking in the other direction, we can work out that only 8% (100% − 92%) got scores greater than 8.

Numeracy
Value    Frequency   Percent   Valid Percent   Cumulative Percent
1.00     5           5.0       5.0             5.0
2.00     16          16.0      16.0            21.0
3.00     15          15.0      15.0            36.0
4.00     17          17.0      17.0            53.0
5.00     13          13.0      13.0            66.0
6.00     8           8.0       8.0             74.0
7.00     9           9.0       9.0             83.0
8.00     9           9.0       9.0             92.0
9.00     2           2.0       2.0             94.0
10.00    3           3.0       3.0             97.0
12.00    1           1.0       1.0             98.0
13.00    1           1.0       1.0             99.0
14.00    1           1.0       1.0             100.0
Total    100         100.0     100.0

Finally, we are given histograms of each variable with the normal distribution overlaid. These graphs are displayed in Figure 2.3 and show us several things. First, it looks as though computer literacy is fairly normally distributed (i.e. a few people are very good with computers and a few are very bad, but the majority of people have a similar degree of knowledge).
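Before turning to the remaining histograms, the z-score conversion described for the skewness and kurtosis values in SPSS Output 2.1 can be checked by hand. The following is a minimal Python sketch rather than anything SPSS itself runs: the names SHAPE_STATS, CRITICAL_Z, z_score and shape_z_scores are made up for this illustration, the numbers are copied from SPSS Output 2.1, and the cut-off of 2.58 assumes a two-tailed test at p < 0.01.

```python
# Skewness (S), kurtosis (K) and their standard errors, copied from SPSS Output 2.1.
SHAPE_STATS = {
    "Computer literacy": {"S": -0.174, "SE_S": 0.241, "K": 0.364, "SE_K": 0.478},
    "Percentage on SPSS exam": {"S": -0.107, "SE_S": 0.241, "K": -1.105, "SE_K": 0.478},
    "Percentage of lectures attended": {"S": -0.422, "SE_S": 0.241, "K": -0.179, "SE_K": 0.478},
    "Numeracy": {"S": 0.961, "SE_S": 0.241, "K": 0.946, "SE_K": 0.478},
}

# Assumption: a two-tailed test at p < 0.01, for which the critical |z| is about 2.58.
CRITICAL_Z = 2.58


def z_score(value, standard_error):
    """z = (value - 0) / SE: how many standard errors the value lies from zero."""
    return (value - 0) / standard_error


def shape_z_scores(stats):
    """Return {variable: (z_skewness, z_kurtosis)} for every variable in stats."""
    return {
        name: (z_score(s["S"], s["SE_S"]), z_score(s["K"], s["SE_K"]))
        for name, s in stats.items()
    }


for name, (z_skew, z_kurt) in shape_z_scores(SHAPE_STATS).items():
    # Collect the shape measures whose |z| exceeds the conservative cut-off.
    flagged = [label for label, z in (("skewness", z_skew), ("kurtosis", z_kurt))
               if abs(z) > CRITICAL_Z]
    verdict = ", ".join(flagged) if flagged else "none"
    print(f"{name}: z_skewness = {z_skew:.2f}, z_kurtosis = {z_kurt:.2f}, "
          f"significant at p < 0.01: {verdict}")
```

On these figures only the numeracy skewness (z of about 3.99) exceeds the conservative cut-off. The exam kurtosis (z of about -2.31) would pass the usual 1.96 threshold for p < 0.05 but not the 0.01 level recommended here.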
The exam scores are very interesting because this distribution is quite clearly not normal; in fact, it looks suspiciously bimodal (there are two peaks, indicative of two modes). This observation corresponds with the earlier information from the table of descriptive statistics. Lecture attendance is generally quite normal, but the tails of the distribution are quite heavy (i.e. although most people attend the majority of lectures, 60% or so, there are a reasonable number of dedicated souls who attend them all and a larger than 'normal' proportion who attend very few). This is why there are high frequencies at the two ends of the distribution. Finally, the numeracy test has produced very positively skewed data (i.e. the majority of people did very badly on this test and only a few did well; hence, most scores are clustered at the low end).

[Figure 2.3 panels, each with Frequency on the vertical axis: Computer literacy (Mean = 50.7, Std. Dev = 8.26, N = 100); Percentage on SPSS exam (Mean = 58.1, Std. Dev = 21.32, N = 100); Percentage of lectures attended (Mean = 59.8, Std. Dev = 21.68, N = 100); Numeracy (Mean = 4.9, Std. Dev = 2.71, N = 100).]

Figure 2.3: Histograms of computer literacy, exam scores, lecture attendance and numeracy.

Although we can obtain a lot of information from histograms and descriptive statistics, there are other ways in which we can assess the degree of normality in a set of data (see section 2.4).

2.3. Obtaining Summary Information for Several Groups: The Split File Command

2.3.1. Running the Analysis

There are several ways to produce basic descriptive statistics for separate groups of people (and we will come across some of these methods in due course). However, if you want to repeat any analysis on several groups of cases, there is a function called split file that allows this to be done. The split file function allows you to specify a grouping variable (remember we used these variables last week to specify categories of people). Any subsequent procedure in SPSS will then be carried out, in turn, on each category belonging to that grouping variable. For these data, there is a variable called Uni indicating whether the student was at Royal Holloway or Sussex University. If we wanted to obtain descriptive statistics for each of these samples, we could split the file, and then proceed using the frequencies command as in the previous section.

To split the file, use the menu path Data ⇒ Split File… or click on the Split File icon. The resulting dialogue box allows you to select the option Organise output by groups. Once this option is selected, the Groups based on box becomes active. Select the variable containing the group codes by which you wish to repeat the analysis (in this example select Uni), and transfer it to the box by clicking on the arrow button. By default, SPSS will then sort the file by these groups (i.e. it will list one category followed by the other in the data editor window). Once we have split the file, we can again use the frequencies command (see previous section, but this time request statistics only for numeracy and exam scores).

2.3.2. Output

The SPSS output will be split into two sections: first the results for students at Sussex University, then the same results for those attending Royal Holloway. SPSS Output 2.2 shows the two main summary tables.
From these tables it is clear that Royal Holloway students scored higher on their SPSS exam than their Sussex counterparts, and that their numeracy scores were higher too. In fact, looking at the means reveals that, on average, Royal Holloway students scored about 7 percentage points higher on the SPSS exam than Sussex students (61.76% against 54.44%), and had numeracy scores twice as high (6.52 against 3.18). The standard deviations for both variables are comparable.

Statistics (a): University = Sussex University
                          Percentage on SPSS exam   Numeracy
N Valid                   50                        50
N Missing                 0                         0
Mean                      54.4400                   3.1800
Std. Error of Mean        2.7779                    .2094
Median                    53.0000                   3.0000
Mode                      47.00                     2.00
Std. Deviation            19.6429                   1.4803
Variance                  385.8433                  2.1914
Skewness                  .259                      .621
Std. Error of Skewness    .337                      .337
Kurtosis                  -.893                     -.100
Std. Error of Kurtosis    .662                      .662
Range                     77.00                     6.00
Minimum                   22.00                     1.00
Maximum                   99.00                     7.00

Statistics (a): University = Royal Holloway
                          Percentage on SPSS exam   Numeracy
N Valid                   50                        50
N Missing                 0                         0
Mean                      61.7600                   6.5200
Std. Error of Mean        3.1774                    .3717
Median                    67.5000                   6.5000
Mode                      77.00                     5.00
Std. Deviation            22.4677                   2.6283
Variance                  504.7984                  6.9078
Skewness                  -.482                     .697
Std. Error of Skewness    .337                      .337
Kurtosis                  -.931                     .648
Std. Error of Kurtosis    .662                      .662
Range                     82.00                     12.00
Minimum                   15.00                     2.00
Maximum                   97.00                     14.00

SPSS Output 2.2

Figure 2.4 shows the histograms of these variables split according to the university attended. For exam marks, the distributions are both bimodal. So, it seems that regardless of the university, there is always a split between students: they either do really well (one mode around 70%) or really badly (a second mode at 35%). However, at Royal Holloway, there is a greater concentration of students around the higher mode (the peak is taller).
For numeracy scores, the distribution is slightly positively skewed in the Sussex group (there is a larger concentration at the lower end of scores) whereas Royal Holloway students are fairly normally distributed around a mean of 7. Therefore, the overall positive skew observed before is due to the mixture of universities (the Sussex students contaminate Royal Holloway's normally distributed scores!). When you have finished with the split file command, remember to switch it off (otherwise SPSS will carry on doing every analysis on each group separately). To switch this function off, return to the split file dialogue box and select Analyse all cases: do not create groups.

[Figure 2.4 panels, each with Frequency on the vertical axis: Percentage on SPSS exam, Sussex University (Mean = 54.4, Std. Dev = 19.64, N = 50); Percentage on SPSS exam, Royal Holloway (Mean = 61.8, Std. Dev = 22.47, N = 50); Numeracy, Sussex University (Mean = 3.2, Std. Dev = 1.48, N = 50); Numeracy, Royal Holloway (Mean = 6.5, Std. Dev = 2.63, N = 50).]

Figure 2.4: Distributions of exam and numeracy scores for Royal Holloway and Sussex students.

2.4. Testing whether a Distribution is Normal

2.4.1. Running the Analysis

It is all very well to look at histograms, but they tell us little about whether a distribution is close enough to normality to be useful. What is needed is an objective test to decide whether or not a distribution is normal. Fortunately, there is such a test: the Kolmogorov-Smirnov test. This test compares the set of scores to a normally distributed set of scores with the same mean and standard deviation.
Therefore, if the test is nonsignificant (p > 0.05) it tells us that the distribution we have is not significantly different from a normal distribution (i.e. it is probably normal). If, however, the test is significant (p < 0.05) then we know that the distribution in question is significantly different from a normal distribution (i.e. it is non-normal). This test is great: in one easy procedure, it tells us whether our sample of scores is normally distributed (nice!).

This test can be accessed through the Explore command (Analyze ⇒ Descriptive Statistics ⇒ Explore…). Figure 2.5 shows the dialogue boxes for the Explore command. First, enter any variables of interest in the box labelled Dependent List by highlighting them on the left-hand side and transferring them by clicking on the arrow button. For this example, just select the exam scores and numeracy scores. In addition, it is possible to select a factor (or grouping variable) by which to split the output (so, if you selected Uni and transferred it to the box labelled Factor List, SPSS would produce exploratory analysis for each group, a bit like the split file command). If you click on Statistics a dialogue box appears, but the default option is fine (it will produce means, standard deviations and so on). The more interesting option for our purposes is accessed by clicking on Plots. In this dialogue box select the option Normality plots with tests, and this will produce both the Kolmogorov-Smirnov test and Normal Q-Q plots for all of the variables selected. By default, SPSS will produce boxplots (split according to group if a factor has been specified) and stem-and-leaf diagrams as well. Click on Continue to return to the main dialogue box and then click OK to run the analysis.

Figure 2.5: Dialogue boxes for the Explore command.

2.4.2. Output

The first table produced by SPSS contains descriptive statistics (mean etc.) and should have the same values as the tables obtained using the frequencies procedure. The important table is that of the Kolmogorov-Smirnov test.
This table includes the test statistic itself, the degrees of freedom (which should equal the sample size) and the significance value of the test. Remember that a significant value (a value less than 0.05) indicates a deviation from normality. For both of these variables, the Kolmogorov-Smirnov test is highly significant, indicating that neither distribution is normal.

Tests of Normality
                          Kolmogorov-Smirnov (a)
                          Statistic   df    Sig.
Percentage on SPSS exam   .102        100   .012
Numeracy                  .153        100   .000
a. Lilliefors Significance Correction

This result is likely to reflect the bimodal distribution found for exam scores, and the positively skewed distribution observed in the numeracy scores. However, these tests confirm that these deviations were significant.

In addition, two Normal Q-Q plots are produced. The Normal Q-Q chart plots the values you would expect to get if the distribution were normal (expected values) against the values actually seen in the data set (observed values). If the data are normally distributed, then the observed values (the scores that you measured) should be the same as the scores you would expect to get in a normal distribution (i.e. values along the X and Y axes are the same). The green (straight) line represents this ideal situation. The red dots represent the actual data set. If the data are normally distributed, then the red dots should lie along the green line. Any deviation of the dots from the line represents a deviation from normality. For both of the variables analysed we already know that the data are not normal, and these plots confirm this observation (because the red dots deviate substantially from the line). It is noteworthy that the deviation is greater for the numeracy scores, and this is consistent with the more significant result for this variable on the Kolmogorov-Smirnov test (a larger test statistic and a smaller significance value).

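To make the comparison behind the Kolmogorov-Smirnov statistic concrete, the sketch below rebuilds the 100 numeracy scores from the frequency table given with SPSS Output 2.1 and measures the largest gap between their cumulative distribution and that of a normal distribution with the same mean and standard deviation. It is an illustration in plain Python, not SPSS's own routine: the names NUMERACY_FREQUENCIES, normal_cdf and ks_statistic are hypothetical, and the sketch computes only the test statistic, not the Lilliefors-corrected significance value that SPSS reports.

```python
import math
import statistics

# Numeracy scores rebuilt from the frequency table (score: frequency).
NUMERACY_FREQUENCIES = {1: 5, 2: 16, 3: 15, 4: 17, 5: 13, 6: 8, 7: 9,
                        8: 9, 9: 2, 10: 3, 12: 1, 13: 1, 14: 1}
numeracy_scores = [score for score, count in NUMERACY_FREQUENCIES.items()
                   for _ in range(count)]


def normal_cdf(x, mean, sd):
    """Proportion of a normal(mean, sd) distribution that lies at or below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def ks_statistic(scores):
    """Largest gap between the observed cumulative proportions and the
    cumulative proportions of a normal curve with the sample's mean and SD."""
    data = sorted(scores)
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample SD (n - 1 in the denominator)
    largest_gap = 0.0
    for i, x in enumerate(data, start=1):
        expected = normal_cdf(x, mean, sd)
        # The observed cumulative proportion jumps from (i - 1)/n to i/n at x,
        # so the gap is checked on both sides of the jump.
        largest_gap = max(largest_gap, i / n - expected, expected - (i - 1) / n)
    return largest_gap


print(f"Kolmogorov-Smirnov statistic for numeracy: {ks_statistic(numeracy_scores):.3f}")
```

Worked through by hand, the largest gap occurs at a score of 4, where 53% of the sample has been accumulated against roughly 38% expected under the normal curve, giving a statistic of about 0.153, in line with the Tests of Normality table.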
[Normal Q-Q plots: Normal Q-Q Plot of SPSS exam scores and Normal Q-Q Plot of Numeracy. Horizontal axis: Observed Value; vertical axis: Expected Normal.]

2.5. Crosstabulations (from Raw Scores)

2.5.1. Running the Analysis

Sometimes, we are interested not in test scores, or continuous measures, but in categorical variables (such as how many psychology students are male/female compared to computer science students). Examining the relationship between two (or more) categorical variables is known as cross-tabulation. In SPSS, this kind of analysis can be done using the Crosstabs command, which tabulates the data and then carries out numerous statistical tests. For example, a researcher was interested in whether animals could be trained to do line dancing. So, they took some cats and dogs (animal) and tried to train them to dance by giving them either food or affection as a reward for dance-like behaviour (training). At the end of the week a note was made of which animals line danced and which did not (dance). These data are in the file called cats.sav, and you should be able to identify the three variables described. Crosstabs is again in the Summarize menu (Statistics ⇒ Summarize ⇒ Crosstabs…).

To begin with, we are not interested in whether there is a distinction between dogs and cats on the task; we merely want to see whether animals can be trained using the two methods. Figure 2.6 shows the dialogue boxes for the Crosstabs command. First, enter one of the variables of interest in the box labelled Row(s) by highlighting it on the left-hand side and transferring it by clicking on the arrow button. For this example, I selected dance to be the rows of the table. Next, select the other variable of interest (training) and transfer it to the box labelled Column(s) by clicking on the arrow button.
In addition, it is possible to select a layer variable (i.e. you can split the rows of the table into further categories). In this case, it would make sense to place animal in this box because SPSS would then split the crosstabulation table into a section for dogs and a section for cats. However, for the time being don't select this variable. If you click on Statistics a dialogue box appears in which you can specify various statistical tests (most of which you won't have come across yet); select the chi-square test and then click on Continue. If you click on Cells a dialogue box appears in which you can specify the type of data displayed in the crosstabulation table. You should request expected counts (these should all be above 5 for the chi-square test to be accurate), and it is very useful to ask for row, column and total percentages too (these values are usually more easily interpreted than the actual frequencies). Once these options have been selected click on Continue to return to the main dialogue box and then click OK to run the analysis.

Figure 2.6: Dialogue boxes for the Crosstabs command.

2.5.2. Output

The crosstabulation table produced by SPSS contains the number of cases that fall into each combination of categories. So, for example, we can see that 49 animals danced when food was offered as a reward compared to only 30 when affection was given as a reward. Likewise, 15 did not dance when food was offered as a reward compared to 40 when affection was offered as a reward. These values are not that meaningful because they depend largely on the sample size, and so it is easier to interpret the percentages. Reading the % within Did they dance?, it is clear that of those animals that did dance, 62% had a food reward compared to 38% who had affection. This implies that food was a better motivator. Looking at those animals that did not dance, 27.3% had food as a reward compared to a larger 72.7% who had affection. This again supports the notion that affection resulted in fewer dancing animals!
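The expected counts requested in the Cells dialogue box, and the Pearson chi-square statistic built from them, follow directly from the four counts just quoted. The sketch below shows the arithmetic in plain Python; it is an illustration under stated assumptions rather than SPSS's own procedure, and the names observed, expected_counts and pearson_chi_square are made up for the example. An expected count is what each cell would hold if dancing were unrelated to the type of reward: row total × column total ÷ grand total.

```python
# Observed counts quoted in the text.
# Rows: did they dance? (yes, no); columns: reward (food, affection).
observed = [[49, 30],
            [15, 40]]


def expected_counts(table):
    """Expected count for each cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for observed_row, expected_row in zip(table, expected)
               for o, e in zip(observed_row, expected_row))


expected = expected_counts(observed)
chi_square = pearson_chi_square(observed)
smallest_expected = min(min(row) for row in expected)

print("Expected counts:", [[round(e, 1) for e in row] for row in expected])
print(f"Pearson chi-square = {chi_square:.3f}")
print(f"Smallest expected count = {smallest_expected:.2f} (should be above 5)")
```

With these counts the expected values come out at 37.7 and 41.3 for the animals that danced and 26.3 and 28.7 for those that did not, so every cell is comfortably above 5, and the statistic is roughly 15.7 on 1 degree of freedom.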
Reading down the columns, we should look at the % within Type of Training and see that when food was used as a reward, 76.6% danced and 23.4% did not. When affection was used, 42.9% danced and 57.1% did not. These results imply that affection resulted in roughly chance performance, but food resulted in lots of dancing animals!

Did they dance? * Type of Training Crosstabulation
                                               Type of Training
Did they dance?                          Food as reward   Affection as reward   Total
Yes     Count                            49               30                    79
        Expected Count                   37.7             41.3                  79.0
        % within Did they dance?         62.0%            38.0%                 100.0%
        % within Type of Training        76.6%            42.9%                 59.0%
        % of Total                       36.6%            22.4%                 59.0%
No      Count                            15               40                    55
        Expected Count                   26.3             28.7                  55.0
        % within Did they dance?         27.3%            72.7%                 100.0%
        % within Type of Training        23.4%            57.1%                 41.0%
        % of Total                       11.2%            29.9%                 41.0%
Total   Count                            64               70                    134
        Expected Count                   64.0             70.0                  134.0
        % within Did they dance?         47.8%            52.2%                 100.0%
        % within Type of Training        100.0%           100.0%                100.0%
        % of Total                       47.8%            52.2%                 100.0%

In addition to the crosstabulation table, SPSS produces a table of the chi-square statistic. The value of the chi-square statistic is given, along with its degrees of freedom and its significance value. For these data, the chi-square is highly significant, indicating that the type of training used had a significant effect on whether an animal would dance. The continuity-corrected chi-square is designed for situations in which you have two categorical variables, both containing two categories (as is the situation here). There is still some debate as to whether or not this correction is even accurate, let alone necessary, and so it may be wiser to ignore it.

Chi-Square Tests
                               Value        df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                 (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             15.696 (b)   1    .000
Continuity Correction (a)      14.334       1    .000
Likelihood Ratio               16.137       1    .000
Fisher's Exact Test                                            .000         .000
Linear-by-Linear Association   15.579       1    .000
N of Valid Cases               134
a. Computed only for a 2x2 table.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 26.27.

Homework: Re-run this crosstabulation procedure but include animal in the layers box of the main dialogue box. You should get a table divided up for dogs and cats: what does this table tell us about the differences between cats and dogs? Also, re-load the exam mark data used earlier on and carry out an analysis to find out whether computer literacy and percentage of lectures attended are normally distributed. Put your name on the outputs and show them to a demonstrator within one week of your SPSS session.

This handout contains large excerpts of the following text (so copyright exists!):

Field, A. P. (2000). Discovering statistics using SPSS for Windows: advanced techniques for the beginner. London: Sage. Go to http://www.sagepub.co.uk to order a copy.

© Dr. Andy Field, 3/12/00
