Maths 0146 Statistical Modelling in Biology

SPSS Practical

Introduction to SPSS: loading data, descriptive statistics and t-tests for comparing group means

Objectives

At the end of this session you should be able to:
- start up SPSS
- load data into SPSS by entering data by hand, or by reading from a file
- check and edit data fields
- summarise data columns
- make histograms and bar charts
- perform unpaired and paired t-tests
- print and store the results
- interpret the results of the statistical tests, and write a short summary of the test findings

There are a number of supervisors for this session: please ask them for help if you need it.

Introduction

SPSS stands for the Statistical Package for the Social Sciences, and is a popular computer package for the statistical analysis of data in many disciplines. The version which we will be using in this practical is version 9, which is designed for the Windows operating system on PCs.

This exercise manual assumes that you have some knowledge of the Windows NT workstations in the Medical School computer rooms. Please ask for help if you are not familiar with using these machines. In particular you should already be able to:
- Log on to a PC using your username and password
- Use a mouse to point and click
- Find, open, move, copy, and delete files using ‘My Computer’ or ‘Explorer’
- Use pull-down menus and fill-in forms in applications such as Word
- Change between different windows, and close windows

This manual contains a description of SPSS and some exercises to introduce you to the use of SPSS in statistical testing. You should be able to complete most of the exercises within the time available, and the rest you can complete in your own time. These exercises are for your benefit and will not be marked. However, since computer work will very likely form part of the final examination, you would be well advised to be familiar with the operations described in this manual.

Please read the notes carefully.
All instructions are underlined, and you should follow these exactly. Many of you will already be sufficiently familiar with working with PC computer programs that the instructions will be too detailed. By all means skim through each exercise to its important parts.

SPSS is a powerful statistical package with a lot of functionality. During the exercise you may want to try out functions which are not explicitly treated in this exercise. That’s fine too. Just be sure that at the end of the practical you have learned enough about SPSS that you can do all of the tasks listed in the Objectives above.

Before you start

- Log in to the PC
- Make sure any applications like web browsers or word processors are closed down

You will need certain files containing data in this exercise. You should be able to download any .sav (SPSS format) files from my homepage www.bath.ac.uk/~masgs. Go into teaching, pick this course’s page and download. When you download the files, save them in your own space or on a floppy disk:
- bweight.sav
- surgeon.sav

Exercise 1: Reading and editing data

1.1 Start SPSS

- Click on the ‘Start’ button on the task bar in the lower left of the screen
- Click on ‘Programs’
- Click on ‘SPSS 9.0 for Windows’

A window will open entitled ‘Untitled – SPSS Data Editor’. This is the window where datasets can be loaded, checked, edited, printed out and saved to disk. There are several other windows in SPSS, the most important being the Output window. You can print out the contents of any window by choosing ‘Print’ from the ‘File’ menu. You can also cut and paste graphs, tables and text into other documents (such as Word) using the ‘Edit’ menu.

1.2 Getting Help in SPSS

In common with many other programs, SPSS has a built-in manual that you can consult at any time.

- Click on the ‘Help’ pull-down menu and choose ‘Topics’: a window appears with options to look through a manual, its index or to search for a key word.
(The first time this is done on any particular PC you may be asked questions about creating the search database: just take the default options that it offers, i.e. keep hitting the return key, until it returns you to the ‘Find’ page.)

1.3 Enter a dataset by hand

There are several ways to get data into SPSS. The simplest for a small dataset is simply to type the data in directly from the keyboard. Consider the following dataset from your lecture notes:

The PEFR was measured for 31 children with asthma. The sample is split up by the distance the children live from the nearest main road.

PEFR (litres/min), for the two groups > 0.5 km and <= 0.5 km:
307, 300, 347, 341, 302, 356, 259, 371, 340, 359, 243, 330, 272, 314, 378, 249, 286, 278, 363, 350, 294, 242, 295, 348, 285, 290, 328, 261, 267, 311, 307

We have data on 31 individuals, and we have two observations for each individual:
1. whether or not the child lives within 0.5 km of the nearest main road (a qualitative or categorical variable), and
2. a measured PEFR value (a quantitative variable)

Each row in the data editor will refer to a case (in this case an individual child) and each column relates to a variable, so that each cell in a row contains the value of the appropriate variable observed for that individual.

Before entering the data we need to define the variables which we will enter. At the top of each column the word ‘var’ appears in light grey.

- Double click on ‘var’ at the top of the first column.

The following ‘Define Variable’ fill-in form appears:

The first thing to do is to choose an appropriate name for the variable. By default SPSS names variables var00001, var00002, var00003 etc., but these are not very informative names, and it is always good practice to give a variable a meaningful name. Any variable name is possible as long as it satisfies the following conditions:
1. It must start with a letter;
2. It must not exceed 8 characters;
3. It can contain letters, digits and the characters @#_$ but not blanks or the characters !?.*

- Enter a variable name for the first variable, proximity to a main road. Call the variable PROXIM

Proximity is a categorical variable, so we must tell SPSS to store it as categorical.

- Click on the ‘Type’ button on the ‘Define Variable’ form.

The ‘Define Variable Type’ form appears:

There are several options for variable type: a categorical variable is always stored as a String type (as in a string of characters).

- Click on the ‘String’ option

The form will display the maximum character length of each string. The default is 8, which is fine.

- Click ‘Continue’ to return to the ‘Define Variable’ form
- Click ‘OK’ to return to the main Data Editor window

The first column is now defined.

- Following a similar procedure, define the second column to be a numerical variable called PEFR. Define the variable to have width 4 and zero decimal places.

Note: You can at any time check and change the definitions of variables which are already defined just by double clicking on the column heading.

We are now ready to enter the data. To enter a value in a particular cell, just click on the cell and type the value. Pressing return afterwards will move you to the next cell down in the column.

- Click in the first cell of the first row
- Enter the letter F (the first individual is in the ‘Far’ group of children who live more than 0.5 km from the nearest main road)
- Click in the PEFR column of the first row
- Enter the PEFR value 307
- Click on the first cell of the second row, enter F in the PROXIM column and 300 in the PEFR column
- Continue adding values in this way for the first 16 children, who are all in the ‘Far’ group
- In the next 15 rows enter N in the PROXIM column for the 15 children in the ‘Near’ group

Notes:
- You can move up and down the dataset using the scrollbar at the left.
- To correct a value just click on the cell and re-enter the value
- SPSS refers to rows in the table as cases
- To delete a row click on the row number at the side, then in the ‘Edit’ pull-down menu either choose ‘Clear’ to delete the values in the row, or ‘Cut’ to remove the row altogether. After cutting a row, the rows below it will move up.
- Rows can be added by right-clicking on the row number after the point where you want to insert the row, and selecting ‘Insert case’
- SPSS will not let you enter a character value for a numeric variable like PEFR. Try it!
- You can highlight groups of cells by clicking the mouse in a cell, and dragging across the cells you want to select. Then using the ‘Edit’ menu these cells can be cleared, cut, copied and pasted somewhere else.

When you have entered all the data it should look like this:

1.4 Saving a dataset to disk

Now save the dataset to disk so that we can use it later.

- On the ‘File’ menu click ‘Save As’
- On the ‘Save Data As’ form which appears give the file the name pefr.sav and click ‘Save’

Note that the title bar of the Data Editor window has changed and now shows the name ‘pefr’.

Exercise 2: Summarising data

Once we have a dataset the first thing we will want to do is to visualise and summarise its contents. We will do this first with the PEFR dataset.

- Reload the pefr.sav dataset from disk

2.1 Make a histogram

Graphs and charts can be made by selecting from the ‘Graphs’ pull-down menu.

- Click ‘Histogram’ on the ‘Graphs’ pull-down menu

The ‘Histogram’ fill-in form will appear:

This form is of a type which is typical in SPSS. There are the normal buttons to click: ‘OK’ to finish entering information into the form, ‘Cancel’ to abandon the form (and in this case to abandon making a histogram) and ‘Help’ for more information about the form. Notice that the ‘OK’ button is greyed out, and cannot be clicked.
This is because at the moment the form is incomplete: only when we have supplied enough information will the ‘OK’ button become clickable.

A histogram is a graph of the frequencies of occurrence of values of a particular variable, so before we can continue we must specify the variable that we want to graph. On the left of the form there is a list of variables: this is the list of all variables in the dataset for which a histogram can be drawn. Histograms can only be made of quantitative or numerical variables, and since PEFR is the only numerical variable in the dataset, it is the only one to appear. It is therefore also already highlighted, as if we had clicked on it.

Next to the list is an arrow-select button. This button shifts a selected (highlighted) value from a list on the left to a field or a list on the right.

- Click the arrow-select button to move the highlighted variable name PEFR into the selected variable field.

Notice that after doing this the ‘OK’ button becomes active.

- Click ‘OK’ to draw the histogram

This may take a short time: the status bar at the bottom of the Data Editor window will display the message ‘Running GRAPH’ while SPSS is busy processing your request. This status bar lets you know what SPSS is doing at any time, and in particular when it is ready for more commands.

When the graph is ready to plot SPSS opens a second window entitled ‘Output 1 – SPSS Output Navigator’. This window will contain all of the output from statistical tests as well as graphs and data summaries. The window is split into two areas: the actual output on the right, and a navigator on the left, which contains a list of all the items in the output area. Many of the pull-down menus in the Output Navigator window contain the same commands as those in the Data Editor window, and it doesn’t matter which one you use.

A list of all of the SPSS windows is in the ‘Windows’ pull-down menu, with the currently active window marked with a tick.
Using this menu is an easy way to switch between windows.

Have a close look at the histogram: it shows the number of children observed to have PEFR values in intervals of 10 litres/min centred on 240, 250, 260, …, 380 litres/min. These intervals cover the ranges 235-244, 245-254, 255-264, …, 375-384.

How many children had PEFR less than or equal to 284 litres/min? ______
What proportion of the total number of children is this? ______
How many children had PEFR between 295 and 324 litres/min? ______
What proportion of the total number of children is this? ______

The histogram graph is annotated with some descriptive statistics: the sample size, mean and standard deviation. Record those statistics here:

Sample Size, N: ______
Mean: ______
Standard Deviation: ______

2.2 Draw a pie chart

Next draw a pie chart of the numbers of children in the Far and Near groups.

- Click on ‘Pie’ in the ‘Graphs’ pull-down menu

The ‘Pie Charts’ form will appear.

- Select the default, ‘Summaries for groups of cases’, and click ‘Define’

This will bring up the ‘Define Pie Charts’ form. The pie chart we want to draw will have each slice of pie representing the proportion of the number of cases in the two groups Far (F) and Near (N). The grouping into Far and Near is determined by the variable PROXIM, so:

- Click on PROXIM in the list of variables, and use the arrow-select button to move PROXIM into the ‘Define slices by’ field
- Click ‘OK’

This creates a pie chart showing the proportions of individuals in the two categories.

2.3 Draw a bar chart

If you have time, try drawing a simple bar chart using PROXIM as the category axis.

2.4 Summary statistics

In the histogram exercise the mean and standard deviation of PEFR values were calculated automatically. These values are just two examples of the summary statistics that we can calculate from the sample. SPSS can easily calculate a number of different summary statistics either for the whole sample, or for various subgroupings of the sample.
Since we are most interested in differences between the two groupings with PROXIM=F (Far) or PROXIM=N (Near) we will want to calculate these statistics for the two groups separately.

- Click ‘Descriptive Statistics’ in the ‘Analyze’ menu
- Click ‘Explore’

This brings up the ‘Explore’ fill-in form:

The variable for which we want to calculate means, standard deviations etc. is PEFR, and it is called the dependent variable, since its probability distribution (i.e. distribution of values) depends on which group an individual is in. The variable which determines the grouping of cases or individuals is called a factor, which in our case is the variable PROXIM.

- In the ‘Explore’ form highlight PEFR, by clicking on it, and move it into the ‘Dependent List’ using the arrow-select button
- Likewise highlight PROXIM and move it into the ‘Factor List’
- In the ‘Display’ section at the bottom left of the form, click the ‘Statistics’ option
- Click ‘OK’

These choices mean that you want to calculate descriptive statistics for the variable PEFR in groups where the individuals all have the same value of PROXIM.

In the output window two tables are created. The first is a summary table headed ‘Case Processing Summary’. It lists the number of cases in each of the two groups (F=Far and N=Near) defined by PROXIM. It also shows how many values in each group are missing, in other words how many have unknown values of PEFR. It often happens that not all data in a study are recorded for each individual: items may have been missed, or incorrectly recorded. During analysis such cases may need to be left out of statistical calculations, but it is always important to know how many such cases there are. You should have no missing values in the PEFR dataset.

The second table, called ‘Descriptives’, is much larger and is split into two parts, one for each group (F and N). This table is also largely self-explanatory, and contains the mean, standard deviation, median, quartiles etc.
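For anyone who wants to see the arithmetic behind the per-group figures in the ‘Descriptives’ table, the following is a minimal Python sketch, not part of SPSS. It groups cases by a factor and computes the sample size, mean, sample standard deviation (n - 1 divisor) and standard error of the mean for each group. The function names `describe` and `describe_by_factor` are hypothetical, and the values in `cases` are made up for illustration; they are not the PEFR dataset.

```python
import math
import statistics
from collections import defaultdict


def describe(values):
    """Return N, mean, sample SD (n - 1 divisor) and standard error of the mean."""
    n = len(values)
    sd = statistics.stdev(values)
    return {
        "n": n,
        "mean": statistics.mean(values),
        "sd": sd,
        "sem": sd / math.sqrt(n),  # standard error = SD / sqrt(N)
    }


def describe_by_factor(cases):
    """Group (factor level, value) pairs by level and summarise each group."""
    groups = defaultdict(list)
    for level, value in cases:
        groups[level].append(value)
    return {level: describe(values) for level, values in groups.items()}


# Made-up illustrative (PROXIM, PEFR) cases, NOT the practical's dataset.
cases = [
    ("F", 300), ("F", 320), ("F", 340), ("F", 360),
    ("N", 280), ("N", 300), ("N", 320),
]

summary = describe_by_factor(cases)
for level in sorted(summary):
    s = summary[level]
    print(f"{level}: N={s['n']} mean={s['mean']:.2f} SD={s['sd']:.2f} SEM={s['sem']:.2f}")
```

Replacing `cases` with the values you typed into the Data Editor should give figures you can compare against the SPSS output, assuming the same (F, N) coding.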
Record the sample size, mean, standard deviation and standard error of the mean for each of the two groups:

                              Group F (Far)    Group N (Near)
Sample Size, N                ______           ______
Mean                          ______           ______
Standard Deviation            ______           ______
Standard Error of the Mean    ______           ______

The two PEFR samples have different means. In order to test whether or not this difference is indicative of a true difference in mean PEFR for children living closer to main roads we can perform a t-test for independent samples. You can already get some idea of whether or not there is a significant difference by looking at the 95% confidence intervals for the two means. Do the intervals overlap? What does this suggest?

Exercise 3: t-tests

A t-test tests for the difference between the means in two groups.

3.1 t-test for two independent samples

An independent samples t-test is appropriate under the following conditions:
1. the data are collected for two separate samples
2. the measurements in each sample are independent
3. the measurements are of a quantitative variable
4. the variable has a (roughly) Normal probability distribution

It is reasonable to assume that all of these conditions apply to the PEFR data. To perform the t-test:

- From the ‘Analyze’ pull-down menu choose ‘Compare Means’ and then ‘Independent-Samples T Test’

This brings up the ‘Independent-Samples T Test’ form:

The test is to be performed on the variable PEFR grouped by PROXIM.

- Select the test and grouping variables

When you move PROXIM into the ‘Grouping Variable’ field the ‘Define Groups’ button will become clickable. You need to use this button to tell SPSS which groups you are going to be comparing. Clearly there are only two groups and you might think that SPSS ought to be able to work out that it must be those two you want to compare, but it can’t!

- Click ‘Define Groups’
- Enter F and N respectively into the two fields ‘Group 1’ and ‘Group 2’ in the pop-up form which appears, then click ‘Continue’
- Click ‘OK’ to run the t-test

The output will be written to two tables in the output window.
The first table, entitled ‘Group Statistics’, should contain the same information that you recorded in Exercise 2: the group sizes, means, standard deviations and standard errors.

The second table is entitled ‘Independent Samples Test’ and contains the results of the t-test. The test is actually applied twice: once under the assumption that the standard deviations (and hence the variances) are equal in the two populations, and once without that assumption. The first columns of the second table report the result of Levene’s test for equality of variances, which assesses the validity of that assumption. You should see that the p-value of Levene’s test is large (in the column ‘Sig.’), which implies that the equal variance test is justified.

The t-test with equal variances assumed is the standard t-test. Its p-value is given in the first row of the column ‘Sig. (2-tailed)’.

Fill in the p-value of the t-test with equal variances assumed: ______

Interpret this result in one sentence: is there a difference between the two groups?

3.2 t-test for paired samples

A paired t-test is appropriate when
1. you have collected two sets of measurements on the same sample of objects under different conditions, or you have collected data on a sample of matched pairs of subjects
2. the measurements are for a quantitative variable
3. the differences between each pair of measurements are independent
4. the differences follow a (roughly) Normal probability distribution

A recent paper in the Lancet investigated the ‘Effect of sleep deprivation on surgeons’ dexterity on laparoscopy simulator’ (Taffinder et al. 1998, Lancet 352, 1191). This paper described an experiment in which a number of surgeons performed a simulated operation before and after a night in which they were subjected to differing amounts of sleep deprivation. The number of errors made by the surgeons was recorded for each simulated operation, in order to test whether or not sleep deprivation impaired performance.
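The conditions for a paired test listed above are all stated in terms of within-pair differences, because the paired t-test is a one-sample test on those differences: t is the mean difference divided by its standard error, with n - 1 degrees of freedom. The sketch below illustrates that arithmetic in Python under stated assumptions: the function name `paired_t` is hypothetical, and the `before` and `after` error counts are made up, not taken from surgeon.sav or from the Lancet paper. It stops at the t statistic and degrees of freedom; the p-value that SPSS reports comes from the t distribution with that many degrees of freedom.

```python
import math
import statistics


def paired_t(first, second):
    """Return (t, df) for a paired t-test of H0: mean(second - first) = 0."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # The test works on the within-pair differences only.
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # Standard error of the mean difference (sample SD uses the n - 1 divisor).
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1


# Made-up error counts for five subjects, for illustration only.
before = [4, 6, 5, 7, 3]
after = [6, 9, 6, 10, 5]

t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

If every difference is identical the standard error is zero and the statistic is undefined, so this sketch would raise a division error in that case.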
The file surgeon.sav contains a (simulated) dataset of such an experiment.

- Load the dataset surgeon.sav

In the four columns of this data set the number of mistakes is recorded for each of 20 surgeons in a simulated operation in the following circumstances:
1. before an uninterrupted night (NORM1)
2. after an uninterrupted night (NORM2)
3. before a night of no sleep (NOSLEEP1)
4. after a night of no sleep (NOSLEEP2)

We are interested in the difference in the number of mistakes between morning and evening, so we must first create two new variables:
1. morning-evening difference for the uninterrupted night
2. morning-evening difference for the night of no sleep

- In the ‘Transform’ pull-down menu choose ‘Compute’

The ‘Compute Variable’ form appears:

In this form the value of any variable can be calculated using a mathematical formula, which may include values taken from other columns. We wish to create two new variables, DNORM and DNOSLEEP, as follows:

DNORM = NORM2 - NORM1, and
DNOSLEEP = NOSLEEP2 - NOSLEEP1

- Fill in the target variable DNORM
- Using the arrow-select button, shift NORM2 across to the ‘Numeric Expression’ field
- Click the ‘-’ (minus) button
- Shift NORM1 to the ‘Numeric Expression’ field
- Click ‘OK’

This will create a new column and fill it with values.

- Likewise create DNOSLEEP (you will need to delete the contents of the ‘Numeric Expression’ field)

We now want to apply a paired t-test to see whether, on average, the value of DNOSLEEP for a given surgeon is significantly larger than the value of DNORM for that surgeon. To perform the paired t-test:

- From the ‘Analyze’ pull-down menu choose ‘Compare Means’ and then ‘Paired-Samples T Test’

This brings up the ‘Paired-Samples T Test’ form:

- Select the two variables we wish to compare (DNORM and DNOSLEEP) and move them into the ‘Paired Variables’ field.
- Click ‘OK’ to run the test

The test creates three tables. The table ‘Paired Samples Statistics’ summarises the two columns independently.
The second table reports the result of a test which checks whether there is a correlation between the values of DNORM and DNOSLEEP. (Lecture 5 and Practical 2 will discuss correlation in more detail.) The third table reports the result of the paired t-test itself.

Fill in the p-value of the paired t-test: ______

Interpret this result: is there a significant effect of sleep deprivation?

Exercise 4: Analysing the Red Blood Cell Data from your Skills Class

Earlier this term you all took part in a Skills Class where six red cell parameters were measured on blood taken from half the students. The data set blood.sav contains the measurements for 133 blood samples taken at that time.

- Open the data file blood.sav.

The first column indicates the sex of the donor; columns 2-7 give:
C2: red cell count (to be multiplied by 10 to the power 12 to give the count per litre)
C3: haematocrit in % (the % of the volume of the blood that is red cells)
C4: haemoglobin concentration of the blood in g/dl
C5: MCV (mean red cell volume) in fl
C6: MCH (mean cellular haemoglobin) in pg
C7: MCHC (mean cellular haemoglobin concentration) in g/dl

- Create separate histograms of each of the six red blood cell parameters for males and females (use the option ‘Split File’ from the ‘Data’ pull-down menu and select ‘Organise output by groups’).

Do any of the red blood cell parameters have distributions which don’t look approximately Normal?

- Run independent samples t-tests to compare (i) mean red cell count, (ii) haematocrit in males versus females.

Is there evidence that males and females have different mean red cell counts or haematocrit?
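To cross-check an independent samples comparison outside SPSS, the two versions of the test described in section 3.1 (equal variances assumed, and not assumed) can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPSS implementation: the function name `independent_t` and its `equal_var` parameter are hypothetical, the unequal-variance branch uses the Welch-Satterthwaite approximation for the degrees of freedom, and the red cell counts are made up rather than read from blood.sav. Only the t statistic and degrees of freedom are computed; the p-value comes from the t distribution.

```python
import math
import statistics


def independent_t(sample1, sample2, equal_var=True):
    """Return (t, df) for a two-sample t-test of H0: the two means are equal."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    if equal_var:
        # Standard t-test: pool the two sample variances.
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(pooled * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        # Equal variances not assumed: Welch-Satterthwaite degrees of freedom.
        se = math.sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
    return (m1 - m2) / se, df


# Made-up red cell counts (x 10^12 per litre), for illustration only.
males = [5.1, 4.9, 5.3, 5.0, 5.2]
females = [4.5, 4.7, 4.4, 4.6, 4.8]

t_pooled, df_pooled = independent_t(males, females, equal_var=True)
t_welch, df_welch = independent_t(males, females, equal_var=False)
print(f"equal variances assumed:     t = {t_pooled:.3f}, df = {df_pooled}")
print(f"equal variances not assumed: t = {t_welch:.3f}, df = {df_welch:.2f}")
```

When the two groups have the same size, as in this made-up example, both versions give the same t statistic and differ at most in the degrees of freedom; with unequal group sizes and unequal variances the two rows of the SPSS table can diverge.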