Maths 0146 Statistical Modelling in Biology
SPSS Practical
Introduction to SPSS: loading data,
descriptive statistics and t-tests for
comparing group means
Objectives
At the end of this session you should be able to:
• start up SPSS
• load data into SPSS by entering data by hand, or by reading from a file
• check and edit data fields
• summarise data columns
• make histograms and bar charts
• perform unpaired and paired t-tests
• print and store the results
• interpret the results of the statistical tests, and write a short summary of the test
findings
There are a number of supervisors for this session: please ask them for help if you
need it.
Introduction
SPSS stands for the Statistical Package for the Social Sciences, and is a popular
computer package for the statistical analysis of data in many disciplines. The version
which we will be using in this practical is version 9, which is designed for the
Windows operating system on PCs.
This exercise manual assumes that you have some knowledge of the Windows NT
workstations in the Medical School computer rooms. Please ask for help if you are
not familiar with using these machines.
In particular you should already be able to:
• Log on to a PC using your username and password
• Use a mouse to point and click
• Find, open, move, copy, and delete files using ‘My Computer’ or ‘Explorer’
• Use pull-down menus and fill-in forms in applications such as Word
• Change between different windows, and close windows
This manual contains a description of SPSS and some exercises to introduce you to
the use of SPSS in statistical testing.
You should be able to complete most of the exercises within the time available, and
the rest you can complete in your own time. These exercises are for your benefit and
will not be marked. However, since computer work will very likely form part of the
final examination, you would be well advised to be familiar with the operations
described in this manual.
Please read the notes carefully. All instructions are underlined, and you should follow
these exactly.
Many of you will already be sufficiently familiar with working with PC
computer programs that the instructions will be too detailed. By all means skim-read
through the exercises to the important parts of each. SPSS is a powerful
statistical package with a lot of functionality. During the exercise you may want
to try out functions which are not explicitly treated in this exercise. That’s fine
too.
Just be sure that at the end of the practical you have learned enough about SPSS
that you can do all of the tasks listed in the Objectives above.
Before you start
• Log in to the PC
• Make sure any applications such as web browsers or word processors
are closed down
You will need certain files containing data in this exercise. You should be able to
download any .sav (SPSS format) files from my homepage www.bath.ac.uk/~masgs.
Go into teaching, pick this course’s page and download the following files:
bweight.sav
surgeon.sav
When you download the files, save them in your own space or on a floppy disk.
Exercise 1: Reading and editing data
1.1 Start SPSS
• Click on the ‘Start’ button on the task bar in the lower left of the screen
• Click on ‘Programs’
• Click on ‘SPSS 9.0 for Windows’
A window will open entitled ‘Untitled – SPSS Data Editor’.
This is the window where datasets can be loaded, checked, edited, printed out and
saved to disk.
There are several other windows in SPSS - the most important being the Output
window. You can print out the contents of any window by choosing ‘Print’ from the
‘File’ menu.
You can also cut and paste graphs, tables and text into other documents (such as
Word) using the ‘Edit’ menu.
1.2 Getting Help in SPSS
In common with many other programs, SPSS has a built-in manual that you can
consult at any time.
Click on the ‘Help’ pull-down menu and choose ‘Topics’: a window appears with
options to look through a manual, its index or to search for a key word. (The first
time this is done on any particular PC you may be asked questions about creating the
search database: just take the default options that it offers – i.e. keep hitting the return
key – until it returns you to the ‘Find’ page.)
1.3 Enter a dataset by hand
There are several ways to get data into SPSS. The simplest for a small dataset is
simply to type the data in directly from the keyboard.
Consider the following dataset from your lecture notes: the peak expiratory flow rate
(PEFR) was measured for 31 children with asthma. The sample is split up by the distance
the children live from the nearest main road.
PEFR (litres/min)
> 0.5 km           <= 0.5 km
307, 300, 347,     341, 302, 356,
259, 371, 340,     359, 243, 330,
272, 314, 378,     249, 286, 278,
363, 350, 294,     242, 295, 348,
285, 290, 328,     261, 267, 311
307
We have data on 31 individuals, and we have two observations for each individual:
1. whether or not the child lives within 0.5 km of the nearest main road (a
qualitative or categorical variable), and
2. a measured PEFR value (a quantitative variable)
Each row in the data editor will refer to a case (in this case an individual child) and
each column relates to a variable, so that each cell in a row contains the value of the
appropriate variable observed for that individual.
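The row-and-column layout just described can be sketched outside SPSS as a list of records. This is only an illustration: the names Case, proxim and pefr are our own, and the two cases shown are the first two children, whose values are given in the data-entry steps of this exercise.

```python
from typing import NamedTuple


class Case(NamedTuple):
    """One row of the data editor: one child."""
    proxim: str  # categorical variable: "F" (far) or "N" (near)
    pefr: int    # quantitative variable, in litres/min


# The first two cases from the data-entry steps in this exercise.
cases = [
    Case(proxim="F", pefr=307),
    Case(proxim="F", pefr=300),
]

# A column is the same variable read across every case (row).
pefr_column = [case.pefr for case in cases]
print(pefr_column)
```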
Before entering the data we need to define the variables which we will enter.
At the top of each column the word ‘var’ appears in light grey.
• Double click on ‘var’ at the top of the first column.
The following ‘Define Variable’ fill-in form appears:
The first thing to do is to choose an appropriate name for the variable. By default
SPSS names variables var00001, var00002, var00003 etc., but these are not very
informative names, and it is always good practice to give a variable a meaningful
name.
Any variable name is possible as long as it satisfies the following conditions:
1. It must start with a letter;
2. Must not exceed 8 characters;
3. Can contain letters, digits and the characters @#_$ but not blanks or the characters
!?.*
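As a rough check of the three naming conditions above, the rules can be written as a single pattern. This is a sketch of the rules as stated in this manual, not of how SPSS itself validates names; the function name is_valid_name is our own, and we assume that ‘letter’ means an unaccented letter A-Z or a-z.

```python
import re

# The three rules above, as we read them:
#   1. the first character is a letter
#   2. the name has at most 8 characters in total
#   3. the remaining characters are letters, digits or one of @ # _ $
# ASSUMPTION: "letter" means A-Z or a-z only.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9@#_$]{0,7}")


def is_valid_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rules listed in this manual."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["PROXIM", "PEFR", "var00001", "2ndvar", "MAIN ROAD", "PROXIMITY", "why?"]:
    print(f"{candidate!r}: {is_valid_name(candidate)}")
```

Names such as PROXIM and PEFR pass, while a name that starts with a digit, contains a blank or one of the forbidden characters, or is longer than 8 characters is rejected.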

• Enter a variable name for the first variable: proximity to a main road. Call the
variable PROXIM
Proximity is a categorical variable, so we must tell SPSS to store it as categorical.
• Click on the ‘Type’ button on the ‘Define Variable’ form.
The ‘Define Variable Type’ form appears:
There are several options for variable type: in this exercise we store the categorical
variable as a String type (as in a string of characters).
• Click on the ‘String’ option
The form will display the maximum character length of each string. The default is 8,
which is fine.
• Click ‘Continue’ to return to the ‘Define Variable’ form
• Click ‘OK’ to return to the main Data Editor window
The first column is now defined.
• Following a similar procedure, define the second column to be a numerical
variable called PEFR. Define the variable to have width 4 and zero
decimal places.
Note: You can at any time check and change the definitions of variables which are
already defined just by double clicking on the column heading.
We are now ready to enter the data. To enter a value in a particular cell, just click on
the cell and type the value. Pressing return afterwards will move you to the next cell
down in the column.
• Click in the first cell of the first row
• Enter the letter F – the first individual is in the ‘Far’ group of children who
live more than 0.5 km from the nearest main road
• Click in the PEFR column of the first row
• Enter the PEFR value 307
• Click on the first cell of the second row, enter F in the PROXIM column
and 300 in the PEFR column
• Continue adding values in this way for the first 16 children who are all in
the ‘Far’ group
• In the next 15 rows enter N in the PROXIM column for the 15 children in
the ‘Near’ group, together with their PEFR values in the PEFR column
Notes:
• You can move up and down the dataset using the scrollbar at the right.
• To correct a value just click on the cell and re-enter the value
• SPSS refers to rows in the table as cases
• To delete a row click on the row number at the side, then in the ‘Edit’ pull-down
menu either choose ‘Clear’ to delete the values in the row, or ‘Cut’ to remove the
row altogether. After cutting a row, the rows below it will move up.
• Rows can be added by right-clicking on the row number after the point where you
want to insert the row, and selecting ‘Insert case’
• SPSS will not let you enter a character value for a numeric variable like PEFR – try
it!
• You can highlight groups of cells by clicking the mouse in a cell, and dragging
across the cells you want to select. Then using the ‘Edit’ menu these cells can be
cleared, cut, copied and pasted somewhere else.
When you have entered all the data it should look like this:
1.4 Saving a dataset to disk
Now save the dataset to disk so that we can use it later.
• On the ‘File’ menu click ‘Save As’
• On the ‘Save Data As’ form which appears give the file the name pefr.sav
and click ‘Save’
Note that the title bar of the Data Editor window has now changed and is labelled
‘pefr’.
Exercise 2: Summarising data
Once we have a dataset the first thing we will want to do is to visualise and
summarise its contents.
We will do this first with the PEFR dataset.
• Reload the pefr.sav dataset from disk
2.1 Make a histogram
Graphs and charts can be made by selecting from the ‘Graphs’ pull-down menu.
• Click ‘Histogram’ on the ‘Graphs’ pull-down menu
The ‘Histogram’ fill-in form will appear:
This form is of a type which is typical in SPSS. There are the normal buttons to click:
‘OK’ to finish entering information into the form, ‘Cancel’ to abandon the form (and
in this case to abandon making a histogram) and ‘Help’ for more information about
the form.
Notice that the ‘OK’ button is greyed out, and cannot be clicked. This is because at
the moment the form is incomplete: only when we have supplied enough information
will the ‘OK’ button become clickable.
A histogram is a graph of the frequencies of occurrence of values of a particular
variable, so before we can continue we must specify the variable that we want to
graph.
On the left of the form there is a list of variables: this is the list of all variables in the
dataset for which a histogram can be drawn. Histograms can only be made of
quantitative or numerical variables, and since PEFR is the only numerical variable
in the dataset, it is the only one to appear: and it is therefore also already highlighted,
as if we had clicked on it.
Next to the list is an arrow-select button.
This button shifts a selected (highlighted) value from a list on the left to a field or a
list on the right.
• Click the arrow-select button to move the highlighted variable name PEFR
into the selected variable field.
Notice that after doing this the ‘OK’ button becomes active.
• Click ‘OK’ to draw the histogram
This may take a short time: the status bar at the bottom of the Data Editor window
will display the message ‘Running GRAPH’ while SPSS is busy processing your
request. This status bar lets you know what SPSS is doing at any time, and in
particular when it is ready for more commands.
When the graph is ready to plot SPSS opens a second window entitled ‘Output 1 –
SPSS Output Navigator’. This window will contain all of the output from statistical
tests as well as graphs and data summaries.
The window is split into two areas: the actual output on the right, and a navigator on
the left, which contains a list of all the items in the output area.
Many of the pull-down menus in the Output Navigator window contain the same
commands as those in the Data Editor window, and it doesn’t matter which one you
use.
A list of all of the SPSS windows is in the ‘Windows’ pull-down menu, with the
currently active window marked with a tick. Using this menu is an easy way to switch
between windows.
Have a close look at the histogram: it shows the number of children observed to have
PEFR values in intervals of 10 litres/min centred on 240, 250, 260, …, 380 litres/min.
These intervals cover the ranges 235-244, 245-254, 255-264, …, 375-384.
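The grouping into intervals described above can be reproduced by hand as a cross-check on the SPSS chart. The sketch below counts the 31 PEFR values from section 1.3 (both groups together) in intervals of width 10 centred on 240, 250, …, 380. It assumes whole-number values and the interval boundaries quoted above; the function name bin_centre is our own, and this is not a description of how SPSS chooses its intervals.

```python
from collections import Counter

# All 31 PEFR values (litres/min) from the table in section 1.3, both groups together.
PEFR = [
    307, 300, 347, 341, 302, 356, 259, 371, 340, 359, 243, 330,
    272, 314, 378, 249, 286, 278, 363, 350, 294, 242, 295, 348,
    285, 290, 328, 261, 267, 311, 307,
]


def bin_centre(value: int, width: int = 10) -> int:
    """Centre of the interval containing a whole-number value, e.g. 235..244 -> 240."""
    return (value + width // 2) // width * width


counts = Counter(bin_centre(v) for v in PEFR)

# Text histogram: one '*' per child in each interval.
for centre in range(240, 390, 10):
    low, high = centre - 5, centre + 4
    print(f"{low}-{high}  {'*' * counts[centre]}")
```

The heights of the bars in the SPSS histogram should match the number of stars in each row, provided SPSS has used the same intervals.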

• How many children had PEFR less than or equal to 284 litres/min?
• What proportion of the total number of children is this?
• How many children had PEFR between 295 and 324 litres/min?
• What proportion of the total number of children is this?
The histogram graph is annotated with some descriptive statistics: the sample size,
mean and standard deviation.

• Record those statistics here:
Sample Size, N: ______
Mean: ______
Standard Deviation: ______
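If you want to check the values you have recorded, the same three statistics can be computed directly from the 31 values in section 1.3. This is a minimal sketch using Python's standard statistics module; we assume that the standard deviation shown on the histogram is the sample standard deviation (divisor n - 1).

```python
import statistics

# All 31 PEFR values (litres/min) from the table in section 1.3, both groups together.
PEFR = [
    307, 300, 347, 341, 302, 356, 259, 371, 340, 359, 243, 330,
    272, 314, 378, 249, 286, 278, 363, 350, 294, 242, 295, 348,
    285, 290, 328, 261, 267, 311, 307,
]

n = len(PEFR)
mean = statistics.mean(PEFR)
sd = statistics.stdev(PEFR)  # sample standard deviation: divides by n - 1

print(f"N = {n}, mean = {mean:.2f}, SD = {sd:.2f}")
```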
2.2 Draw a pie chart
Next draw a pie chart of the numbers of children in the Far and Near groups.
• Click on ‘Pie’ in the ‘Graphs’ pull-down menu
The ‘Pie Charts’ form will appear.
• Select the default: summaries for groups of cases, and click ‘Define’
This will bring up the ‘Define Pie Charts’ form. The pie chart we want to draw will
have each slice of pie representing the proportion of the number of cases in the two
groups Far (F) and Near (N). The grouping into Far and Near is determined by the
variable PROXIM so:
• Click on PROXIM in the list of variables, and use the arrow-select button
to move PROXIM into the ‘Define slices by’ field
• Click ‘OK’
This creates a pie chart showing the proportions of individuals in the two categories.
2.3 Draw a bar chart
If you have time, try drawing a simple bar chart using PROXIM as the category axis.
2.4 Summary statistics
In the histogram exercise the mean and standard deviation of PEFR values were
calculated automatically. These values are just two examples of the summary
statistics that we can calculate from the sample.
SPSS can easily calculate a number of different summary statistics either for the
whole sample, or for various subgroupings of the sample. Since we are most
interested in differences between the two groupings with PROXIM=F (Far) or
PROXIM=N (Near) we will want to calculate these statistics for the two groups
separately.


• Click ‘Descriptive Statistics’ in the ‘Analyze’ menu
• Click ‘Explore’
This brings up the ‘Explore’ fill-in form:
The variable for which we want to calculate means, standard deviations etc. is PEFR,
and it is called the dependent variable, since its probability distribution (i.e.
distribution of values) depends on which group an individual is in.
The variable which determines the grouping of cases or individuals is called a factor,
which in our case is the variable PROXIM.
• In the ‘Explore’ form highlight PEFR, by clicking on it, and move it into
the ‘Dependent List’ using the arrow-select button
• Likewise highlight PROXIM and move it into the ‘Factor List’
• In the ‘Display’ section at the bottom left of the form, click the ‘Statistics’
option
• Click ‘OK’
These choices mean that you want to calculate descriptive statistics for the variable
PEFR in groups where the individuals all have the same value of PROXIM.
In the output window two tables are created. The first is a summary table headed
‘Case Processing Summary’. It lists the number of cases in each of the two groups
(F=Far and N=Near) defined by PROXIM. It also shows how many values in each
group are missing - in other words how many have unknown values of PEFR. It often
happens that not all data in a study are recorded for each individual: items may have
been missed, or incorrectly recorded. During analysis such cases may need to be left
out of statistical calculations, but it is always important to know how many such cases
there are.
You should have no missing values in the PEFR dataset.
The second table, called ‘Descriptives’, is much larger and is split into two parts, one
for each group (F and N). This table is also largely self-explanatory, and contains the
mean, standard deviation, median, quartiles etc.
• Record the sample size, mean, standard deviation and standard error of
the mean for each of the two groups:

                              Group F (Far)    Group N (Near)
Sample Size, N
Mean
Standard Deviation
Standard Error of the Mean
The two PEFR samples have different means. In order to test whether or not this
difference is indicative of a true difference in mean PEFR for children living closer to
main roads we can perform a t-test for independent samples.
You can already get some idea of whether or not there is a significant difference by
looking at the 95% confidence intervals for the two means.
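The confidence intervals reported by SPSS can be cross-checked by hand as mean ± t × (standard error of the mean). The sketch below does this with the standard library only. Two things here are assumptions: the split of the 31 values between the two groups is our reading of the table in section 1.3 (check it against your lecture notes), and the critical values 2.131 (15 degrees of freedom) and 2.145 (14 degrees of freedom) are taken from standard t tables. The function name mean_ci is our own.

```python
import math
import statistics

# ASSUMPTION: this split of the 31 values into the two groups is our reading
# of the table in section 1.3; check it against your lecture notes.
FAR = [307, 300, 347, 259, 371, 340, 272, 314, 378, 363, 350, 294, 285, 290, 328, 307]
NEAR = [341, 302, 356, 359, 243, 330, 249, 286, 278, 242, 295, 348, 261, 267, 311]

# Two-sided 95% critical values of Student's t from standard tables,
# keyed by degrees of freedom (n - 1).
T_CRIT_95 = {14: 2.145, 15: 2.131}


def mean_ci(values):
    """95% confidence interval for the mean: mean +/- t * SEM."""
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = T_CRIT_95[n - 1] * sem
    return mean - half_width, mean + half_width


far_ci = mean_ci(FAR)
near_ci = mean_ci(NEAR)

# Two intervals overlap if each one starts before the other ends.
overlap = far_ci[0] <= near_ci[1] and near_ci[0] <= far_ci[1]

print(f"Far:  {far_ci[0]:.1f} to {far_ci[1]:.1f}")
print(f"Near: {near_ci[0]:.1f} to {near_ci[1]:.1f}")
print(f"Intervals overlap: {overlap}")
```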
• Do the intervals overlap? What does this suggest?
Exercise 3: t-tests
A t-test tests for the difference between the means in two groups.
3.1 t-test for two independent samples
An independent samples t-test is appropriate under the following conditions:
1. the data are collected for two separate samples
2. the measurements in each sample are independent
3. the measurements are of a quantitative variable
4. the variable has a (roughly) Normal probability distribution
It is reasonable to assume that all of these conditions apply to the PEFR data.
To perform the t-test:
• From the ‘Analyze’ pull-down menu choose ‘Compare Means’ and then
‘Independent-Samples T Test’
This brings up the ‘Independent-Samples T Test’ form:
The test is to be performed on the variable PEFR grouped by PROXIM.
• Select the test variable (PEFR) and the grouping variable (PROXIM)
When you move PROXIM into the ‘Grouping Variable’ field the ‘Define Groups’
button will become clickable. You need to use this button to tell SPSS which groups
you are going to be comparing.
Clearly there are only two groups and you might think that SPSS ought to be able to
work out that it must be those two you want to compare, but it can’t!
• Click ‘Define Groups’
• Enter F and N respectively into the two fields ‘Group 1’ and ‘Group 2’ in
the pop-up form which appears, then click ‘Continue’
• Click ‘OK’ to run the t-test
The output will be written to two tables in the output window. The first table entitled
‘Group Statistics’ should contain the same information that you recorded in Exercise
2: the group sizes, means, standard deviations and standard errors.
The second table is entitled ‘Independent Samples Test’ and contains the results of the
t-test. The test is actually applied twice: once under the assumption that the standard
deviations (and hence the variances) are equal in the two populations, and once
without that assumption. The first columns of the second table report the result of
Levene’s test for equality of variances, which assesses the validity of that assumption.
You should see that the p-value of Levene’s test (in the column ‘Sig.’) is large, which
means there is no evidence against equal variances, so the equal-variance t-test is
justified.
The t-test with equal variances assumed is the standard t-test. Its p-value is given in
the first row of the column ‘Sig. (2-tailed)’.
• Fill in the p-value of the t-test with equal variances assumed:
• Interpret this result in one sentence: is there a difference between the two
groups?
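To see where the two rows of the ‘Independent Samples Test’ table come from, the two t statistics can be computed by hand. The sketch below is an illustration under stated assumptions, not SPSS's own code: the split of the values between the groups is our reading of the table in section 1.3, the function names pooled_t and welch_t are our own, and we assume that the ‘equal variances not assumed’ row is the usual Welch test with Satterthwaite degrees of freedom. Only the t statistic and degrees of freedom are computed; turning them into a p-value needs the t distribution, which the Python standard library does not provide.

```python
import math
import statistics

# ASSUMPTION: this split of the 31 values into the two groups is our reading
# of the table in section 1.3; check it against your lecture notes.
FAR = [307, 300, 347, 259, 371, 340, 272, 314, 378, 363, 350, 294, 285, 290, 328, 307]
NEAR = [341, 302, 356, 359, 243, 330, 249, 286, 278, 242, 295, 348, 261, 267, 311]


def pooled_t(a, b):
    """Two-sample t statistic and degrees of freedom, equal variances assumed."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, df


def welch_t(a, b):
    """Two-sample t statistic and approximate df, equal variances not assumed."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t_equal, df_equal = pooled_t(FAR, NEAR)
t_unequal, df_unequal = welch_t(FAR, NEAR)
print(f"Equal variances assumed:     t = {t_equal:.3f}, df = {df_equal}")
print(f"Equal variances not assumed: t = {t_unequal:.3f}, df = {df_unequal:.1f}")
```

Compare the printed values with the columns ‘t’ and ‘df’ in the SPSS output; if they differ, first check the assignment of values to groups.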
3.2 t-test for paired samples
A paired t-test is appropriate when
1. you have collected two sets of measurements on the same sample of objects under
different conditions, or you have collected data on a sample of matched pairs of
subjects
2. The measurements are for a quantitative variable
3. The differences between each pair of measurements are independent
4. The differences follow a (roughly) Normal probability distribution
A recent paper in the Lancet investigated the ‘Effect of sleep deprivation on surgeons’
dexterity on laparoscopy simulator’ (Taffinder et al. 1998, Lancet 352, 1191). This
paper described an experiment in which a number of surgeons performed a simulated
operation before and after a night in which they were subjected to differing amounts
of sleep deprivation. The number of errors made by the surgeons was recorded for
each simulated operation, in order to test whether or not sleep deprivation impaired
performance.
The file surgeon.sav contains a (simulated) dataset of such an experiment.
• Load the dataset surgeon.sav
In the four columns of this data set the number of mistakes is recorded for each of 20
surgeons in a simulated operation in the following circumstances:
1. before an uninterrupted night (NORM1)
2. after an uninterrupted night (NORM2)
3. before a night of no sleep (NOSLEEP1)
4. after a night of no sleep (NOSLEEP2)
We are interested in the difference in the number of mistakes between morning and
evening, so we must first create two new variables:
1. morning-evening difference for the uninterrupted night
2. morning-evening difference for the night of no sleep
• In the ‘Transform’ pull-down menu choose ‘Compute’
The ‘Compute Variable’ form appears:
In this form the value of any variable can be calculated using a mathematical formula,
which may include values taken from other columns.
We wish to create two new variables, DNORM and DNOSLEEP, as follows:
• DNORM = NORM2-NORM1, and
• DNOSLEEP = NOSLEEP2-NOSLEEP1
• Fill in the target variable DNORM
• Using the arrow-select button shift NORM2 across to the ‘Numeric
Expression’ field
• Click the ‘-’ (minus) button
• Shift NORM1 to the ‘Numeric Expression’ field
• Click ‘OK’
This will create a new column and fill it with values.
• Likewise create DNOSLEEP (you will need to delete the contents of the
‘Numeric Expression’ field)
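The ‘Compute’ step simply applies the same formula to every row. As a sketch of what SPSS is doing, the snippet below adds the two difference columns to a small table. The error counts shown are HYPOTHETICAL values for three surgeons, made up for illustration; the real values are in surgeon.sav.

```python
# HYPOTHETICAL error counts for three surgeons (not taken from surgeon.sav).
surgeons = [
    {"NORM1": 4, "NORM2": 5, "NOSLEEP1": 4, "NOSLEEP2": 9},
    {"NORM1": 6, "NORM2": 5, "NOSLEEP1": 5, "NOSLEEP2": 8},
    {"NORM1": 3, "NORM2": 3, "NOSLEEP1": 4, "NOSLEEP2": 6},
]

# Apply the two formulas row by row, creating two new columns.
for row in surgeons:
    row["DNORM"] = row["NORM2"] - row["NORM1"]
    row["DNOSLEEP"] = row["NOSLEEP2"] - row["NOSLEEP1"]

for row in surgeons:
    print(row["DNORM"], row["DNOSLEEP"])
```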
We now want to apply a paired t-test to see whether, on average, the value of DNOSLEEP
for a given surgeon is significantly larger than the value of DNORM for that
surgeon.
To perform the paired t-test:
• From the ‘Analyze’ pull-down menu choose ‘Compare Means’ and then
‘Paired-Samples T Test’
This brings up the ‘Paired-Samples T Test’ form:
• Select the two variables we wish to compare (DNORM and DNOSLEEP)
and move them into the ‘Paired Variables’ field.
• Click ‘OK’ to run the test
The test creates three tables. The table ‘Paired Samples Statistics’ summarises the
two columns independently. The second table reports the result of a test which
checks whether there is a correlation between the values of DNORM and
DNOSLEEP. (Lecture 5 and Practical 2 will discuss correlation in more detail.)
The third table reports the result of the paired t-test.
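The statistic in the third table is the mean of the pairwise differences divided by its standard error, with n - 1 degrees of freedom. The sketch below shows the arithmetic on HYPOTHETICAL values for five surgeons (made up for illustration, not taken from surgeon.sav). The function name paired_t is our own, and the differences are taken here as second variable minus first; SPSS may subtract in the other order, which only changes the sign of t.

```python
import math
import statistics

# HYPOTHETICAL difference scores for five surgeons (not taken from surgeon.sav).
DNORM = [1, -1, 0, 2, 0]
DNOSLEEP = [5, 3, 2, 4, 3]


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for the differences y - x."""
    diffs = [b - a for a, b in zip(x, y)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return statistics.mean(diffs) / se, n - 1


t, df = paired_t(DNORM, DNOSLEEP)
print(f"t = {t:.3f}, df = {df}")
```

A large positive t here would indicate that DNOSLEEP tends to exceed DNORM for the same surgeon; the p-value itself comes from the t distribution with df degrees of freedom, which SPSS reports for you.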

• Fill in the p-value of the paired t-test:
• Interpret this result: is there a significant effect of sleep deprivation?
Exercise 4: Analysing the Red Blood Cell Data from your Skills Class

Earlier this term you all took part in a Skills Class where six red cell parameters
were measured on blood taken from half the students. The data set blood.sav
contains the measurements for 133 blood samples taken at that time.
• Open the data file blood.sav. The first column indicates the sex of the donor;
columns 2-7 give:
C2: red cell count (to be multiplied by 10 to the power 12 to give the
count per litre)
C3: haematocrit in % (the percentage of the volume of the blood that is
red cells)
C4: haemoglobin concentration of the blood in g/dl
C5: MCV (mean red cell volume) in fl
C6: MCH (mean cellular haemoglobin) in pg
C7: MCHC (mean cellular haemoglobin concentration) in g/dl
• Create separate histograms of each of the six red blood cell parameters for males
and females (use the option ‘Split File’ from the ‘Data’ pull-down menu and select
‘Organise output by groups’). Do any of the red blood cell parameters have
distributions which don’t look approximately Normal?
• Run independent samples t-tests to compare (i) mean red cell count, (ii) mean
haematocrit in males versus females. Is there evidence that males and females
have different mean red cell counts or haematocrit?