Research Methods I: SPSS for Windows part 2
2: Exploratory Data Analysis using SPSS
The first stage in any data analysis is to explore the data collected. Usually we are interested in looking at descriptive
statistics such as means, modes, medians, frequencies and so on. Often, we are interested in checking assumptions of
the data too (remember that parametric tests require normally distributed data and so we often want to assess the degree
to which the data are normal).
2.1. Opening a File
Throughout this course you will work with data files that are provided on disk. It is, therefore, important that you know
how to load these data files into SPSS. The procedure is very simple. To open a file, click the Open File icon on the
toolbar (or use the menus: File ⇒ Open) to activate the dialogue box in Figure 2.1.
First, you need to find the location at which your file is stored. If you are loading a file from a floppy disk, access
the floppy drive. For this course, data files are on the drive labelled Courses on psyserver, in a folder called andyfield.
Once the drive has been accessed you should see a list of files and folders that can be opened. If you are currently in the
Data Editor Window then SPSS will display only SPSS data files to be opened (if you are in the navigator window then
only output files will be displayed). You can open a folder by double-clicking on the folder icon. Once you have tracked
down the required file you can open it either by selecting it with the mouse and then clicking on the Open button, or by
double-clicking on the icon next to the file you want. The data/output will then appear in the
appropriate window. The data we will use are in the file SPSSExam.sav. This file contains four variables: exam (First
year SPSS exam scores as a percentage), computer (measure of computer literacy in percent), lecture (percentage of
SPSS lectures attended), and numeracy (a measure of each student’s numeracy out of 15).
Figure 2.1: Dialogue box to open a file.
2.2. Obtaining Summary Information for One group
2.2.1. Running the Analysis
To see the distribution of our four variables, we can use the Frequencies command via the menu path
Statistics ⇒ Summarize ⇒ Frequencies… to access the main dialogue box in Figure 2.2. The variables in the data editor
will be listed on the left-hand side, and can be transferred to the box labelled Variable(s) by clicking on a variable (or
highlighting several with the mouse) and then clicking on the arrow button. Any analyses you choose to do will be done
on every variable listed in the Variable(s) box. By default, SPSS produces a frequency distribution of all scores in table
form. However, there are two other dialogue boxes that provide further options. The Statistics dialogue box is accessed
by clicking on Statistics…, and the Charts dialogue box is accessed by clicking on Charts…. The Statistics dialogue box
allows you to select several ways in which a distribution of scores can be described, such as measures of central
tendency (mean, mode, median), measures of variability (range, standard deviation, variance, quartile splits) and
measures of shape (kurtosis and skewness). To describe the characteristics of the data we should select the mean, mode,
median, standard deviation, variance and range; to check that a distribution of scores is normal, we need to look at the
values of kurtosis and skewness. The Charts option provides a simple way to plot the frequency distribution of scores
(as a bar chart, a pie chart, or a histogram). The most useful chart is the histogram, and for the purpose of checking
normality, we should select the option of displaying a normal curve on the histogram. When you have selected the
appropriate options, return to the main dialogue box by clicking on Continue. Once in the main dialogue box, click OK
to run the analysis.
© Dr. Andy Field, 3/12/00
Figure 2.2: Dialogue boxes for the frequencies command.
2.2.2. Output
SPSS Output 2.1 shows the table of descriptive statistics for the four variables. From this table, we can see that, on
average, students attended nearly 60% of lectures, obtained 58% in their SPSS exam and scored only 51% on the
computer literacy test, and only 5 out of 15 on the numeracy test. In addition, the standard deviation for computer
literacy was relatively small compared to the percentage of lectures attended and the exam scores. Finally, these latter
two variables had several modes. The other important measures are the skewness and the kurtosis, both of which have
an associated standard error. The actual values of skew and kurtosis should be zero if the distribution is normal. Positive
values of skewness indicate a pile up of scores on the left of the distribution, whereas negative values indicate a pile up
on the right. Positive values of kurtosis indicate a pointy distribution whereas negative values indicate a flat distribution.
The further the value is from zero, the more likely it is that the data are not normally distributed. However, the actual
values of skewness and kurtosis are not, in themselves, informative. Instead, we should take each value and convert it
to a z-score using the z-score equation (skewness) and a variation on this equation (kurtosis):
$$z_{\text{skewness}} = \frac{S - 0}{SE_{\text{skewness}}} \qquad z_{\text{kurtosis}} = \frac{K - 0}{SE_{\text{kurtosis}}}$$
In these equations, the values of S (skewness) and K (kurtosis) and their respective standard errors are produced by
SPSS. However, the significance of z should be tested conservatively (at p < 0.01) in small samples and probably not at
all for large samples.
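The conversion above can be sketched in a few lines of Python. This is only an illustration, not part of the SPSS procedure: the function names are invented here, the skewness and kurtosis values are typed in from SPSS Output 2.1, and the two standard-error formulas are the usual textbook ones, which happen to reproduce the .241 and .478 that SPSS prints for N = 100.

```python
from math import sqrt

def se_skewness(n):
    """Textbook standard error of skewness for a sample of size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Textbook standard error of kurtosis for a sample of size n."""
    return 2 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))

def z_score(value, standard_error):
    """Distance of a skewness or kurtosis value from zero, in standard errors."""
    return (value - 0) / standard_error

# (skewness, kurtosis) copied from SPSS Output 2.1
output_2_1 = {
    "Computer literacy": (-0.174, 0.364),
    "Percentage on SPSS exam": (-0.107, -1.105),
    "Percentage of lectures attended": (-0.422, -0.179),
    "Numeracy": (0.961, 0.946),
}

# Standard errors as printed by SPSS for N = 100
SE_SKEW, SE_KURT = 0.241, 0.478

z_values = {
    name: (z_score(skew, SE_SKEW), z_score(kurt, SE_KURT))
    for name, (skew, kurt) in output_2_1.items()
}

for name, (z_skew, z_kurt) in z_values.items():
    print(f"{name:<32} z skew = {z_skew:6.2f}   z kurtosis = {z_kurt:6.2f}")
```

An absolute z above 2.58 corresponds to p < 0.01 (two-tailed), so on this sketch the numeracy skew (z of roughly 3.99) stands out, while computer literacy is close to zero on both measures.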
Statistics
                                          Percentage   Percentage
                             Computer     on SPSS      of lectures
                             literacy     exam         attended      Numeracy
N      Valid                 100          100          100           100
       Missing               0            0            0             0
Mean                         50.7100      58.1000      59.7650       4.8500
Std. Error of Mean           .8260        2.1316       2.1685        .2706
Median                       51.5000      60.0000      62.0000       4.0000
Mode                         54.00        72.00 a      48.50 a       4.00
Std. Deviation               8.2600       21.3156      21.6848       2.7057
Variance                     68.2282      454.3535     470.2296      7.3207
Skewness                     -.174        -.107        -.422         .961
Std. Error of Skewness       .241         .241         .241          .241
Kurtosis                     .364         -1.105       -.179         .946
Std. Error of Kurtosis       .478         .478         .478          .478
Range                        46.00        84.00        92.00         13.00
Minimum                      27.00        15.00        8.00          1.00
Maximum                      73.00        99.00        100.00        14.00
a. Multiple modes exist. The smallest value is shown
SPSS Output 2.1
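As a cross-check on what these statistics mean, the numeracy column can be recomputed by hand. The sketch below uses only Python's standard library; the scores are rebuilt from the numeracy frequency table that SPSS prints in the same output, and the variable names are invented for this illustration.

```python
from math import sqrt
from statistics import mean, median, multimode, stdev, variance

# Numeracy frequency table from the SPSS output (score -> number of students)
numeracy_freq = {1: 5, 2: 16, 3: 15, 4: 17, 5: 13, 6: 8, 7: 9,
                 8: 9, 9: 2, 10: 3, 12: 1, 13: 1, 14: 1}

# Expand the table back into one score per student
scores = [score for score, count in numeracy_freq.items() for _ in range(count)]

summary = {
    "N": len(scores),
    "Mean": mean(scores),
    "Std. Error of Mean": stdev(scores) / sqrt(len(scores)),
    "Median": median(scores),
    "Mode": multimode(scores),        # a list, because there can be several modes
    "Std. Deviation": stdev(scores),  # sample SD (divides by N - 1)
    "Variance": variance(scores),     # sample variance (divides by N - 1)
    "Range": max(scores) - min(scores),
    "Minimum": min(scores),
    "Maximum": max(scores),
}

for label, value in summary.items():
    print(f"{label:<20} {value}")
```

The figures agree with the Numeracy column of SPSS Output 2.1 to the precision shown there. Note that `multimode` returns every mode, whereas SPSS reports only the smallest when several exist.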
The output also provides tabulated frequency distributions of each variable. These tables list each score and the number
of times that it is found within the data. In addition, each frequency value is expressed as a percentage of the sample
(in this case the frequencies and percentages are the same because the sample size was 100). Also, the cumulative
percentage is given, which tells us how many cases (as a percentage) fell at or below a certain score. So, for example,
we can see that 66% of numeracy scores were 5 or less, 74% were 6 or less, and so on. Looking in the other direction,
we can work out that only 8% (100% − 92%) got scores greater than 8.

Numeracy
                                              Valid       Cumulative
                   Frequency     Percent      Percent     Percent
Valid    1.00      5             5.0          5.0         5.0
         2.00      16            16.0         16.0        21.0
         3.00      15            15.0         15.0        36.0
         4.00      17            17.0         17.0        53.0
         5.00      13            13.0         13.0        66.0
         6.00      8             8.0          8.0         74.0
         7.00      9             9.0          9.0         83.0
         8.00      9             9.0          9.0         92.0
         9.00      2             2.0          2.0         94.0
         10.00     3             3.0          3.0         97.0
         12.00     1             1.0          1.0         98.0
         13.00     1             1.0          1.0         99.0
         14.00     1             1.0          1.0         100.0
         Total     100           100.0        100.0

Finally, we are given histograms of each variable with the normal distribution overlaid. These graphs are displayed in
Figure 2.3 and show us several things. First, it looks as though computer literacy is fairly normally distributed (i.e. a
few people are very good with computers and a few are very bad, but the majority of people have a similar degree of
knowledge). The exam scores are very interesting because this distribution is quite clearly not normal; in fact, it looks
suspiciously bimodal (there are two peaks indicative of two modes). This observation corresponds with the earlier
information from the table of descriptive statistics. Lecture attendance is generally quite normal, but the tails of the
distribution are quite heavy (i.e. although most people attend the majority of lectures, 60% or so, there are a
reasonable number of dedicated souls who attend them all and a larger than ‘normal’ proportion who attend very few).
This is why there are high frequencies at the two ends of the distribution. Finally, the numeracy test has produced very
positively skewed data (i.e. the majority of people did very badly on this test and only a few did well, hence, most
scores are clustered at the low end).
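The percent and cumulative percent columns of the numeracy frequency table are simple running sums, which the following sketch reproduces. It is an illustration in plain Python rather than anything SPSS does internally; the function and variable names are made up for the example, and the counts are those of the numeracy frequency table.

```python
from itertools import accumulate

# Numeracy frequency table from the SPSS output (score -> number of students)
numeracy_freq = {1: 5, 2: 16, 3: 15, 4: 17, 5: 13, 6: 8, 7: 9,
                 8: 9, 9: 2, 10: 3, 12: 1, 13: 1, 14: 1}

def frequency_table(freq):
    """Return rows of (score, frequency, percent, cumulative percent)."""
    n = sum(freq.values())
    scores = sorted(freq)
    running_totals = accumulate(freq[score] for score in scores)
    return [
        (score, freq[score], 100 * freq[score] / n, 100 * running / n)
        for score, running in zip(scores, running_totals)
    ]

table = frequency_table(numeracy_freq)
for score, count, percent, cumulative in table:
    print(f"{score:>5} {count:>10} {percent:>8.1f} {cumulative:>11.1f}")

# Cumulative percent answers "what share scored this value or less?"
cumulative_by_score = {score: cumulative for score, _, _, cumulative in table}
percent_above_8 = 100 - cumulative_by_score[8]
print(f"Scored more than 8: {percent_above_8:.0f}%")
```

Subtracting a cumulative percentage from 100 gives the share of cases above that score, which is how the 8% figure for scores greater than 8 is obtained.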
[Figure 2.3 panels, each a histogram of Frequency: Computer literacy (Mean = 50.7, Std. Dev = 8.26, N = 100); Percentage on SPSS exam (Mean = 58.1, Std. Dev = 21.32, N = 100); Percentage of lectures attended (Mean = 59.8, Std. Dev = 21.68, N = 100); Numeracy (Mean = 4.9, Std. Dev = 2.71, N = 100).]
Figure 2.3: Histograms of computer literacy, Exam scores, lecture attendance and numeracy.
Although there is a lot of information that we can obtain from histograms and descriptive statistics about a
distribution, there are other ways in which we can assess the degree of normality in a set of data (see section 2.4).
2.3. Obtaining Summary Information for Several Groups: The Split File Command
2.3.1. Running the Analysis
There are several ways to produce basic descriptive statistics for separate groups of people (and we will come across
some of these methods in due course). However, if you want to repeat any analysis on several groups of cases, there is a
function called split file, which allows this to be done. The split file function allows you to specify a grouping variable
(remember we used these variables last week to specify categories of people). Any subsequent procedure in SPSS will
then be carried out, in turn, on each category belonging to that grouping variable. For these data, there is a variable
called Uni indicating whether the student was at Royal Holloway or Sussex University. If we wanted to obtain
descriptive statistics for each of these samples, we could split the file, and then proceed using the Frequencies command
as in the previous section. To split the file, simply use the menu path Data ⇒ Split File… or click on the Split File icon
on the toolbar. The resulting dialogue box allows you to select the option Organise output by groups. Once this option
is selected, the Groups based on box becomes active. Select the variable containing the group codes by which you wish
to repeat the analysis (in this example select Uni), and transfer it to the box by clicking on the arrow button. By default,
SPSS will then sort the file by these groups (i.e. it will list one category followed by the other in the data editor
window). Once we have split the file, we can again use the Frequencies command (see previous section, but this time
only request statistics for numeracy and exam scores).
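The idea behind split file (group the cases by a coding variable, then repeat the same analysis on each group) can be sketched outside SPSS as follows. The six rows are hypothetical values invented for the illustration and are not taken from SPSSExam.sav; the helper names are likewise assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical cases: (uni, exam percentage, numeracy score). NOT the real data file.
rows = [
    ("Sussex University", 40, 3),
    ("Sussex University", 60, 2),
    ("Sussex University", 50, 4),
    ("Royal Holloway", 70, 6),
    ("Royal Holloway", 50, 8),
    ("Royal Holloway", 75, 7),
]

def split_file(cases):
    """Group the cases by the value of the grouping variable (first field)."""
    groups = defaultdict(list)
    for uni, exam, numeracy in cases:
        groups[uni].append((exam, numeracy))
    return groups

def describe(values):
    """The same small analysis, to be repeated on every group."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}

summary = {}
for uni, cases in sorted(split_file(rows).items()):
    summary[uni] = {
        "exam": describe([exam for exam, _ in cases]),
        "numeracy": describe([numeracy for _, numeracy in cases]),
    }
    print(uni, summary[uni])
```

As in SPSS, the grouping step and the analysis step are independent: once the cases are split, any procedure can be run on each group in turn.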
2.3.2. Output
The SPSS output will be split into two sections: first the results for students at Sussex University, then the same results
for those attending Royal Holloway. SPSS Output 2.2 shows the two main summary tables. From these tables it is
clear that Royal Holloway students scored higher than their Sussex counterparts on both the SPSS exam and the
numeracy test. In fact, looking at the means reveals that, on average, Royal Holloway students scored about 7% more
on the SPSS exam than Sussex students (61.8% versus 54.4%), and had numeracy scores twice as high (6.5 versus 3.2).
The standard deviations for exam scores are comparable (22.5 versus 19.6), although numeracy scores were more
spread out at Royal Holloway (2.6 versus 1.5).
Sussex University

Statistics a
                             Percentage
                             on SPSS
                             exam          Numeracy
N      Valid                 50            50
       Missing               0             0
Mean                         54.4400       3.1800
Std. Error of Mean           2.7779        .2094
Median                       53.0000       3.0000
Mode                         47.00         2.00
Std. Deviation               19.6429       1.4803
Variance                     385.8433      2.1914
Skewness                     .259          .621
Std. Error of Skewness       .337          .337
Kurtosis                     -.893         -.100
Std. Error of Kurtosis       .662          .662
Range                        77.00         6.00
Minimum                      22.00         1.00
Maximum                      99.00         7.00
a. University = Sussex University

Royal Holloway

Statistics a
                             Percentage
                             on SPSS
                             exam          Numeracy
N      Valid                 50            50
       Missing               0             0
Mean                         61.7600       6.5200
Std. Error of Mean           3.1774        .3717
Median                       67.5000       6.5000
Mode                         77.00         5.00
Std. Deviation               22.4677       2.6283
Variance                     504.7984      6.9078
Skewness                     -.482         .697
Std. Error of Skewness       .337          .337
Kurtosis                     -.931         .648
Std. Error of Kurtosis       .662          .662
Range                        82.00         12.00
Minimum                      15.00         2.00
Maximum                      97.00         14.00
a. University = Royal Holloway
SPSS Output 2.2
Figure 2.4 shows the histograms of these variables split according to the university attended. For exam marks, the
distributions are both bimodal. So, it seems that regardless of the university, there is always a split between students:
they either do really well (one mode around 70%) or really badly (second mode at 35%). However, at Royal Holloway,
there is a greater concentration of students around the higher mode (the peak is taller). For numeracy scores, the
distribution is slightly positively skewed in the Sussex group (there is a larger concentration at the lower end of scores)
whereas Royal Holloway scores are fairly normally distributed around a mean of about 6.5. Therefore, the overall
positive skew observed before is due to the mixture of universities (the Sussex students contaminate Royal Holloway’s
normally-distributed scores!). When you have finished with the split file command, remember to switch it off
(otherwise SPSS will carry on doing every analysis on each group separately). To switch this function off, return
to the split file dialogue box and select Analyse all cases: do not create groups.
[Figure 2.4 panels, each a histogram of Frequency. Percentage on SPSS exam: Sussex University (Mean = 54.4, Std. Dev = 19.64, N = 50) and Royal Holloway (Mean = 61.8, Std. Dev = 22.47, N = 50). Numeracy: Sussex University (Mean = 3.2, Std. Dev = 1.48, N = 50) and Royal Holloway (Mean = 6.5, Std. Dev = 2.63, N = 50).]
Figure 2.4: Distributions of exam and numeracy scores for Royal Holloway and Sussex students.
2.4. Testing whether a distribution is Normal
2.4.1. Running the Analysis
It is all very well to look at histograms, but they tell us little about whether a distribution is close enough to normality to
be useful. What is needed is an objective test to decide whether or not a distribution is normal. Fortunately, there is such
a test: the Kolmogorov-Smirnov test. This test compares the set of scores to a normally-distributed set of scores with the
same mean and standard deviation. Therefore, if the test is nonsignificant (p > 0.05) it tells us that the distribution we
have is not significantly different from a normal distribution (i.e. it is probably normal). If, however, the test is
significant (p < 0.05) then we know that the distribution in question is significantly different from a normal distribution
(i.e. it is non-normal). This test is great: in one easy procedure, it tells us whether our sample of scores is normally
distributed (nice!).
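To make the comparison concrete, the sketch below computes the Kolmogorov-Smirnov statistic by hand: the largest vertical gap between the cumulative distribution of the observed scores and that of a normal distribution with the same mean and standard deviation. It uses the numeracy scores from the frequency table in section 2.2.2 and Python's standard library only. It yields the test statistic, not the significance value (SPSS applies the Lilliefors correction for that, which is not attempted here), and the function name is an invention for this example.

```python
from statistics import NormalDist, mean, stdev

# Numeracy frequency table from the SPSS output (score -> number of students)
numeracy_freq = {1: 5, 2: 16, 3: 15, 4: 17, 5: 13, 6: 8, 7: 9,
                 8: 9, 9: 2, 10: 3, 12: 1, 13: 1, 14: 1}
scores = [score for score, count in numeracy_freq.items() for _ in range(count)]

def ks_statistic(values):
    """Largest gap between the empirical CDF and a normal CDF that has
    the same mean and standard deviation as the sample."""
    xs = sorted(values)
    n = len(xs)
    reference = NormalDist(mean(xs), stdev(xs))
    largest_gap = 0.0
    for i, x in enumerate(xs):
        f = reference.cdf(x)
        # The empirical CDF steps from i/n up to (i + 1)/n at x,
        # so check the gap on both sides of the step.
        largest_gap = max(largest_gap, (i + 1) / n - f, f - i / n)
    return largest_gap

d_numeracy = ks_statistic(scores)
print(f"Kolmogorov-Smirnov statistic for numeracy: {d_numeracy:.3f}")
```

For the numeracy scores the gap is largest at a score of 4, where 53% of the sample has been reached but a normal curve would predict only about 38%.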
This test can be accessed through the Explore command (Analyze ⇒ Descriptive Statistics ⇒ Explore…). Figure 2.5
shows the dialogue boxes for the Explore command. First, enter any variables of interest in the box labelled Dependent
List by highlighting them on the left-hand side and transferring them by clicking on the arrow button. For this example,
just select the exam scores and numeracy scores. In addition, it is possible to select a factor (or grouping variable) by
which to split the output (so, if you selected Uni and transferred it to the box labelled Factor List, SPSS will produce
exploratory analysis for each group, a bit like the split file command). If you click on Statistics… a dialogue box
appears, but the default option is fine (it will produce means, standard deviations and so on). The more interesting
option for our purposes is accessed by clicking on Plots…. In this dialogue box select the option Normality plots with
tests, and this will produce both the Kolmogorov-Smirnov test and Normal Q-Q plots for all of the variables selected.
By default, SPSS will produce boxplots (split according to group if a Factor has been specified) and stem and leaf
diagrams as well. Click on Continue to return to the main dialogue box and then click OK to run the analysis.
Figure 2.5: Dialogue boxes for the Explore Command.
2.4.2. Output
The first table produced by SPSS contains descriptive statistics (mean etc.) and should have the same values as the
tables obtained using the frequencies procedure. The important table is that of the Kolmogorov-Smirnov test. This table
includes the test statistic itself, the degrees of freedom (which should equal the sample size) and the significance value
of this test. Remember that a significant value (a value less than 0.05) indicates a deviation from normality. For both of
these variables, the Kolmogorov-Smirnov test is significant (p = .012 for exam scores; for numeracy the value
displayed as .000 means p < .001), indicating that neither distribution is normal. This result is likely to reflect the
bimodal distribution found for exam scores, and the positively skewed distribution observed in the numeracy scores.
However, these tests confirm that these deviations were significant.

Tests of Normality
                               Kolmogorov-Smirnov a
                               Statistic     df        Sig.
Percentage on SPSS exam        .102          100       .012
Numeracy                       .153          100       .000
a. Lilliefors Significance Correction
In addition, two Normal Q-Q plots are produced. The Normal Q-Q chart plots the values you would expect to get if the
distribution were normal (expected values) against the values actually seen in the data set (observed values). If the data
are normally distributed, then the observed values (the scores that you measured) should be the same as the scores you
would expect to get in a normal distribution (i.e. values along the X and Y axis are the same). The green (straight) line
represents this ideal situation. The red dots represent the actual data set. If the data are normally distributed, then the red
dots should lie along the green line. Any deviation of the dots from the line represents a deviation from normality. In
both the variables analysed we already know that the data are not normal, and these plots confirm this observation
(because the red dots deviate substantially from the line). It is noteworthy that the deviation is greater for the numeracy
scores, and this is consistent with the larger test statistic and smaller significance value (i.e. the more significant
result) for this variable on the Kolmogorov-Smirnov test.
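The pairs of values behind a Normal Q-Q plot can be generated with a short sketch like the one below. Two assumptions are made loudly here: the seven scores are hypothetical, and the plotting positions use Blom's formula, (rank − 3/8) / (n + 1/4), which is one common convention and may not be exactly what a given SPSS version uses. Ties are not given any special treatment.

```python
from statistics import NormalDist

def expected_normal(scores):
    """Pair each observed value with the standard-normal quantile expected
    for its rank (Blom's plotting positions; an assumed convention)."""
    n = len(scores)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for rank, observed in enumerate(sorted(scores), start=1):
        proportion = (rank - 0.375) / (n + 0.25)
        pairs.append((observed, standard_normal.inv_cdf(proportion)))
    return pairs

# Hypothetical exam percentages, for illustration only
sample = [22, 35, 47, 53, 61, 68, 77]
pairs = expected_normal(sample)

for observed, expected in pairs:
    print(f"Observed Value = {observed:>3}   Expected Normal = {expected:6.2f}")
```

Plotting Observed Value on the X axis against Expected Normal on the Y axis gives the dots; when the data are normal they fall close to a straight line.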
[Normal Q-Q Plot of SPSS exam scores and Normal Q-Q Plot of Numeracy: Expected Normal (Y axis) plotted against Observed Value (X axis).]
2.5. Crosstabulations (from Raw Scores)
2.5.1. Running the Analysis
Sometimes, we are interested not in test scores, or continuous measures, but in categorical variables (such as how many
psychology students are male/female compared to computer science students). When we examine the relationship
between two (or more) categorical variables it is known as cross-tabulation. On SPSS, this kind of analysis can be done
using the Crosstabs command, which tabulates the data and then carries out numerous statistical tests. For example, a
researcher was interested in whether animals could be trained to do line dancing. So, they took some cats and dogs
(animal) and tried to train them to dance either by giving them food or affection as a reward for dance-like behaviour
(training). At the end of the week a note was made of which animals line danced and which did not (dance). These
data are in the file called cats.sav, and you should be able to identify the three variables described.
Crosstabs is again in the Summarize menu (Statistics ⇒ Summarize ⇒ Crosstabs…). To begin with, we are not
interested in whether there is a distinction between dogs and cats on the task, we merely want to see whether animals
can be trained using the two methods. Figure 2.6 shows the dialogue boxes for the Crosstabs command. First, enter one
of the variables of interest in the box labelled Row(s) by highlighting it on the left-hand side and transferring it by
clicking on the arrow button. For this example, I selected dance to be the rows of the table. Next, select the other
variable of interest (training) and transfer it to the box labelled Column(s) by clicking on the arrow button. In addition,
it is possible to select a layer variable (i.e. you can split the rows of the table into further categories). In this case, it
would make sense to place animal in this box because SPSS would then split the crosstabulation table into a section for
dogs and a section for cats. However, for the time being don’t select this variable. If you click on Statistics… a dialogue
box appears in which you can specify various statistical tests (most of which you won’t have come across yet); select
the chi-square test and then click on Continue. If you click on Cells… a dialogue box appears in which you can specify
the type of data displayed in the crosstabulation table. You should request expected counts (these should all be above 5
for the chi-square test to be accurate), and it is very useful to ask for row, column and total percentages too (these
values are usually more easily interpreted than the actual frequencies). Once these options have been selected click on
Continue to return to the main dialogue box and then click OK to run the analysis.
Figure 2.6: Dialogue boxes for the Crosstabs Command.
2.5.2. Output
The crosstabulation table produced by SPSS contains the number of cases that fall into each combination of categories.
So, for example, we can see that 49 animals danced when food was offered as a reward compared to only 30 when
affection was given as a reward. Likewise, 15 did not dance when food was offered as a reward compared to 40 when
affection was offered as a reward. These values are not that meaningful because they depend largely on the sample size,
and so it is easier to interpret the percentages. Reading the % within Did they dance?, it is clear that of those animals
that did dance, 62% had a food reward compared to 38% who had affection. This implies that food was a better
motivator. Looking at those animals that did not dance, 27.3% had food as a reward compared to a larger 72.7% who
had affection. This again supports the notion that affection resulted in fewer dancing animals! Reading down the
columns, we should look at the % within Type of Training and see that when food was used as a reward, 76.6% danced
and 23.4% did not. When affection was used, 42.9% danced and 57.1% did not. These results imply that affection
resulted in roughly chance performance, but food resulted in lots of dancing animals!

Did they dance? * Type of Training Crosstabulation
                                                      Type of Training
                                                      Food as      Affection
                                                      Reward       as reward     Total
Did they   Yes     Count                              49           30            79
dance?             Expected Count                     37.7         41.3          79.0
                   % within Did they dance?           62.0%        38.0%         100.0%
                   % within Type of Training          76.6%        42.9%         59.0%
                   % of Total                         36.6%        22.4%         59.0%
           No      Count                              15           40            55
                   Expected Count                     26.3         28.7          55.0
                   % within Did they dance?           27.3%        72.7%         100.0%
                   % within Type of Training          23.4%        57.1%         41.0%
                   % of Total                         11.2%        29.9%         41.0%
Total              Count                              64           70            134
                   Expected Count                     64.0         70.0          134.0
                   % within Did they dance?           47.8%        52.2%         100.0%
                   % within Type of Training          100.0%       100.0%        100.0%
                   % of Total                         47.8%        52.2%         100.0%

In addition to the crosstabulation table, SPSS produces a table of the chi-square statistic. The value of the chi-square
statistic is given (and the degrees of freedom) and the significance value. For these data, the chi-square is highly
significant, indicating that the type of training used had a significant effect on whether an animal would dance. The
continuity corrected chi-square is designed for situations in which you have two categorical variables, both containing
two categories (as is the situation here). There is still some debate as to whether or not this correction is even accurate,
let alone necessary, and so it may be wiser to ignore it.

Chi-Square Tests
                                                      Asymp. Sig.    Exact Sig.    Exact Sig.
                                 Value       df       (2-sided)      (2-sided)     (1-sided)
Pearson Chi-Square               15.696 b    1        .000
Continuity Correction a          14.334      1        .000
Likelihood Ratio                 16.137      1        .000
Fisher's Exact Test                                                  .000          .000
Linear-by-Linear Association     15.579      1        .000
N of Valid Cases                 134
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 26.27.
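The expected counts and the two chi-square values in these tables follow directly from the four observed counts, as the sketch below shows. It is a hand calculation in plain Python for illustration, not a description of SPSS internals; the variable names are made up, and the p-value line relies on the fact that, with 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(chi-square / 2)).

```python
from math import erfc, sqrt

# Observed counts from the crosstabulation: rows = Did they dance?, columns = Type of Training
observed = {
    "Yes": {"Food": 49, "Affection": 30},
    "No": {"Food": 15, "Affection": 40},
}

row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
column_totals = {
    column: sum(observed[row][column] for row in observed)
    for column in ("Food", "Affection")
}
n = sum(row_totals.values())

# Expected count = row total x column total / N
expected = {
    row: {column: row_totals[row] * column_totals[column] / n for column in column_totals}
    for row in observed
}

# Pearson chi-square, and the continuity-corrected version for a 2x2 table
chi_square = sum(
    (observed[r][c] - expected[r][c]) ** 2 / expected[r][c]
    for r in observed for c in column_totals
)
chi_square_corrected = sum(
    (abs(observed[r][c] - expected[r][c]) - 0.5) ** 2 / expected[r][c]
    for r in observed for c in column_totals
)

# Tail probability of a chi-square with 1 degree of freedom
p_value = erfc(sqrt(chi_square / 2))

# Percentage within Type of Training (column percentages)
column_percent = {
    row: {column: 100 * observed[row][column] / column_totals[column] for column in column_totals}
    for row in observed
}

print(f"Pearson chi-square = {chi_square:.3f}, corrected = {chi_square_corrected:.3f}")
print(f"p = {p_value:.6f}; smallest expected count = {min(min(r.values()) for r in expected.values()):.2f}")
```

The significance value that SPSS displays as .000 is therefore a rounded value below .001, not a probability of exactly zero.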
Homework: Re-run this crosstabulation procedure but include animal in the layers box of the main options. You
should get a table divided up for dogs and cats: what does this table tell us about the differences between cats and dogs?
Also, re-load the exam mark data used earlier on and carry out an analysis to find out whether computer literacy and
percentage of lectures attended are normally distributed. Put your name on the outputs and show them to a demonstrator
by 1 week after your SPSS session.
This handout contains large excerpts of the following text (so copyright exists!)
Field, A. P. (2000). Discovering statistics using SPSS for Windows:
advanced techniques for the beginner. London: Sage.
Go to http://www.sagepub.co.uk to order a copy