BIOMETRICS I
Description, Visualisation and
Simple Statistical Tests
Applied to Medical Data
Lecture notes
Harald Heinzl and Georg Heinze
Core Unit for Medical Statistics and Informatics
Medical University of Vienna
Version 2010-07
Contents

Chapter 1  Data collection
1.1. Introduction
1.2. Data collection
1.3. Simple analyses
1.4. Aggregated data
1.5. Exercises

Chapter 2  Statistics and graphs
2.1. Overview
2.2. Graphs
2.3. Describing the distribution of nominal variables
2.4. Describing the distribution of ordinal variables
2.5. Describing the distribution of scale variables
2.6. Outliers
2.7. Missing values
2.8. Further graphs
2.9. Exercises

Chapter 3  Probability
3.1. Introduction
3.2. Probability theory
3.3. Exercises

Chapter 4  Statistical Tests I
4.1. Principle of statistical tests
4.2. t-test
4.3. Wilcoxon rank-sum test
4.4. Exercises

Chapter 5  Statistical Tests II
5.1. More about independent samples t-test
5.2. Chi-Square Test
5.3. Paired Tests
5.4. Confidence intervals
5.5. One-sided versus two-sided tests
5.6. Exercises

Appendices
A. Opening and importing data files
B. Data management with SPSS
C. Restructuring a longitudinal data set with SPSS
D. Measuring agreement
E. Reference values
F. SPSS-Syntax
G. Exact tests
H. Equivalence trials
I. Describing statistical methods for medical publications
J. Dictionary: English-German

References
Preface
These lecture notes are intended for the PGMW course “Biometrie I”. This
manuscript is the first part of the lecture notes of the “Medical Biostatistics
1” course for PhD students of the Medical University of Vienna.
The lecture notes are based on material previously used in the seminars
„Biometrische Software I: Beschreibung und Visualisierung von
medizinischen Daten“ and „Biometrische Software II: Statistische Tests
bei medizinischen Fragestellungen“. Statistical computations are based on
SPSS 17.0 for Windows, Version 17.0.1 (1st Dec 2008, Copyright © SPSS
Inc., 1993-2007).
The data sets used in the lecture notes can be downloaded at
http://www.muw.ac.at/msi/biometrie/lehre
Chapters 1 and 2, and Appendices A-E have been written by Georg
Heinze. Harald Heinzl is the author of chapters 3-6 and Appendices F-I.
Martina Mittlböck translated chapters 4 and 5 and Appendices G-I.
Andreas Gleiß translated Appendices F, J and K. Georg Heinze translated
chapters 1-3, 6 and Appendices A-E. Sincere thanks are given to the
translators, particularly to Andreas Gleiß who assembled all pieces into
one document.
Version 2009-03 contains fewer typing errors, mistranslations and wrong
citations than previous versions due to the efforts of Andreas Gleiß. The
contents of the lecture notes have been revised by Harald Heinzl (Version
2008-10).
Version 2009-03 is updated to SPSS version 17 by Daniela Dunkler,
Martina Mittlböck and Andreas Gleiß. Screenshots and SPSS output which
have changed only in minor aspects have not been replaced. Note that
older versions of SPSS save output files with .SPO extension, while SPSS
17 uses an .SPV extension. Old output files cannot be viewed in SPSS 17.
For this purpose the SPSS Smart Viewer 15 has to be used which is
delivered together with SPSS 17. Further note that the language of the
SPSS 17 user interface and of the SPSS 17 output can be changed
independently from each other in the options menu.
If you find any errors or inconsistencies, or if you come across statistical
terms that are worth including in the English-German dictionary (Appendix
J), please notify us via e-mail:
[email protected]
[email protected]
Chapter 1
Data collection
1.1. Introduction
Statistics¹ can be classified into descriptive and inferential statistics.
Descriptive statistics is a toolbox used to characterize the properties of the
members of a sample. The tools of this toolbox comprise
• statistics¹ (mean, standard deviation, median, etc.) and
• graphs (box plot, scatter plot, pie chart, etc.).

¹ Note that the word STATISTICS has different meanings. On this single page it is used to
denote both the entire scientific field and rather simple computational formulas. Besides
these, there can be other meanings as well.
By contrast, inferential statistics provides mathematical techniques that
help us in drawing conclusions about the properties of a population. These
conclusions are usually based on a subset of the population of interest,
called the sample.
The motivation of any medical study should be a meaningful scientific
question. The purpose of any study is to answer a scientific question, not
just to search a data base for any significant associations. The scientific
question always relates to a particular population. A sample is a randomly
drawn subset of that population. An important requirement often imposed
on a sample is that it is representative of the population. This requirement
is fulfilled if
• each individual of the population has an equal probability of being
selected, i.e., the selection process is independent of the scientific
question, and
• the individuals are selected independently of each other, i.e., the
selection of individual a has no influence on the probability of selecting
individual b.
Example 1.1.1: Hip joint endoprosthesis study. Consider a study that
should answer the question of how long hip joint endoprostheses can be used.
The population related to this scientific question consists of all patients who will
receive a hip joint endoprosthesis in the future. Clearly, this population
comprises a potentially unlimited number of individuals. Consider all patients
who received a hip joint endoprosthesis during the years 1990-1995 in the
Vienna General Hospital. These patients will be followed on a regular basis for
15 years or until their death. They constitute a representative sample of the
population. An example of a sample which is not suitable for drawing conclusions
about the properties of the population is the set of all patients who were
scheduled for a follow-up examination in the year 2000. This sample is not
representative because it misses all patients who died or underwent a revision
up to that year, and the results would be over-optimistic.
A sample always consists of
• observations on individuals (e.g., patients) and
• properties or variables that were observed (e.g., systolic blood
pressure before and after five minutes of training, sex, body mass
index, etc.) and that vary between the individuals.

A sample can be represented in a table, which is often called a data matrix.
In a data matrix,
• rows usually correspond to observations and
• columns to variables.
Example 1.1.2: Data matrix. Representation of the variables patient
number, sex, age, weight and height of four patients:

Pat. No.   Sex   Age (years)   Weight (kg)   Height (cm)
   1        M        35            80            185
   2        F        36            55            167
   3        F        37            61            173
   4        M        30            72            180
1.2. Data collection
Case report forms
Data are usually collected on paper using so-called case report forms
(CRFs). On these, all study-relevant data of a patient are recorded. Case
report forms should be designed such that three principles are observed:
• Unambiguousness: the exact format of data values should be given,
e.g., YYYYMMDD for date variables
• Clearness: the case report forms should be easy to use
• Parsimony: only required data should be recorded
The last principle is very important. If case report forms are too long, then
the motivation of the person recording the data will decrease and data
quality will be negatively affected.
Building a data base
After data collection on paper, the next step is data entry in a computer
system. While small data sets can be easily transferred directly to a data
matrix on the screen, one can make use of computer forms to enter large
amounts of data. These computer forms are the analogue of case report
forms on a computer screen. Data are typed into the various fields of the
form. Commercial programs allowing the design and use of forms are,
e.g., Microsoft Office Access or SAS. Epi Info™ is a freeware program
(http://www.cdc.gov/epiinfo/, cf. Fig. 1) using the format of Access to
save data, but with a graphical user interface that is easier to handle than
that of Access.
Fig. 1: Computer form prepared with Epi Info™ 3.5
After data entry using forms, the data values are saved in data matrices.
One row of the data matrix corresponds to one form, and each column of
a row corresponds to one field of a form.
Electronic data bases can usually be converted from one program to
another, e. g., from a data entry program to a statistical software system
like SPSS, which will be used throughout the lecture. SPSS also offers a
commercial program for the design of case report forms and data entry
(“SPSS Data Entry”).
When building a data base, no matter which program is used, the first
step is to decide which variables it should consist of. In a second step we
must define the properties of each variable.
The following rules apply:
• The first variable should contain a unique patient identification
number, which is also recorded on the case report forms.
• Each property that may vary between individuals can be considered
as a variable.
• With repeated measurements on individuals (e.g., before and after
treatment) there are several alternatives:
  o Wide format: one row (form) per individual; repeated
    measurements of the same property are recorded in multiple
    columns (or fields on the form); e.g., PAT = patient identification
    number; VAS1, VAS2, VAS3 = repeated measurements of VAS.
  o Long format: one row per individual and measurement; repeated
    measurements of the same property are recorded in multiple rows
    of the same column, using a separate column to define the time of
    measurement; e.g., PAT = patient identification number,
    TIME = time of measurement, VAS = value of VAS at the time of
    measurement.

If the number of columns (fields) in the wide format becomes so large
that computer forms get too complex, the long format will be chosen.
Note that we can always restructure data from the wide to the long
format and vice versa, as sketched below.
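For illustration, a minimal SPSS syntax sketch for restructuring from the
wide to the long format, assuming a data set with the variables PAT, VAS1,
VAS2 and VAS3 as in the example above (the menu-based procedure is
described in Appendix C):

* Restructure wide to long: one row per patient and measurement.
VARSTOCASES
  /MAKE VAS FROM VAS1 VAS2 VAS3
  /INDEX=TIME
  /KEEP=PAT.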
Building an SPSS data file
The statistical software package SPSS appears in several windows:
The data editor is used to build data bases, to enter data by hand, to
import data from other programs, to edit data, and to perform interactive
data analyses.
The viewer collects results of data analyses.
The chart editor facilitates the modification of diagrams prepared by
SPSS and allows identifying individuals on a chart.
Using the syntax editor, commands can be entered and collected in
program scripts, which facilitates non-interactive automated data analysis.
Data files are represented in the data editor, which consists of two tables
(views). The data view shows the data matrix with rows and columns
corresponding to individuals and variables, respectively. The variable
view contains the properties of the variables. It can be used to define
new variables or to modify properties of existing variables. These
properties are:
• Name: unique alphanumeric name of a variable, starting with an
alphabetic character. The name may consist of alphabetic and numeric
characters and the underscore ("_"). Neither spaces nor special
characters are allowed.
• Width: the maximum number of places.
• Decimals: the number of decimal places.
• Label: a description of the variable. The label will be used to name
variables in all menus and in the output viewer.
• Values: labels assigned to the values of nominal or ordinal variables.
The value labels replace values in any category listings. Example:
value 1 corresponds to the value label 'male', value 2 to the label
'female'. Data are entered as 1 and 2, but SPSS displays 'male' and
'female'. Using the corresponding toolbar button, one can switch
between the display of the values and the value labels in the data
view. Within the value label view, one can directly choose from the
defined value labels when entering data.
• Missing: particular values may be defined as missing values. These
values are not used in any analyses. Usually, missing data values are
represented as empty fields: if a data value is missing for an
individual, it is left empty in the data matrix.
• Columns: defines the width of the data matrix column of a variable as
a number of characters. This value is only relevant for display and
changes if a column is broadened using the mouse.
• Align: defines the alignment of the data matrix column of a variable
(left/right/center).
• Measure: nominal, ordinal or scale. The applicability of particular
statistical operations to a variable (e.g., computing the mean) depends
on the measure of the variable. The measure of a variable is called
  o nominal if each observation belongs to one of a set of categories,
  o ordinal if these categories have a natural order, and
  o 'scale' (in SPSS terminology) if observations are numerical values
    that represent different magnitudes of the variable.
Outside SPSS, scale variables are usually called 'metric', 'continuous'
or 'quantitative', as opposed to the 'qualitative' and 'semi-quantitative'
nature of nominal and ordinal variables, respectively. Examples of
nominal variables are sex, type of operation, or type of treatment.
Examples of ordinal variables are treatment dose, tumor stage,
response rating, etc. Scale variables are, e.g., height, weight, blood
pressure, and age.
• Type: the format in which data values are stored. The most important
types are the numeric, string, and date formats.
  o Nominal and ordinal variables: choose the numeric type. Categories
    should be coded as 1, 2, 3, … or 0, 1, 2, …; value labels should be
    used to paraphrase the codes.
  o Scale variables: choose the numeric type and pay attention to the
    correct number of decimal places, which applies to all computed
    statistics (e.g., if a variable is defined with 2 decimal places and
    you compute the mean of that variable, then 2 decimal places will
    be shown). The computational accuracy, however, is not affected
    by this option.
  o Date variables: choose the date type; otherwise SPSS won't be
    able to compute the length of time passing between two dates
    correctly.
The string format should only be used for text that will not be analyzed
statistically (e. g., addresses, remarks). For nominal or ordinal variables,
the numeric type should be used throughout as it requires an exact
definition of categories. The list of possible categories can be extended
while entering data, and category codes can be recoded after data entry.
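Variable properties can also be defined with syntax. A minimal sketch for
the variable "location" of Example 1.2.1 below; the missing-value code 9
is an assumption for illustration only:

* Define label, value labels, missing code and measure of 'location'.
VARIABLE LABELS location 'Location'.
VALUE LABELS location 1 'pancreas' 2 'stomach'.
MISSING VALUES location (9).
VARIABLE LEVEL location (NOMINAL).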
Example 1.2.1: Consider the variable "location" with the possible outcome
categories "pancreas" and "stomach". How should this variable be defined?
Proper definition:
The variable is defined properly if these categories are given two numeric
codes, say 1 and 2. Value labels paraphrase these codes:
In any output produced by SPSS, these value labels will be used instead of
the numeric codes. When entering data, the user may choose between
any of the predefined outcome categories:
Improper definition:
The variable is defined with string type of length 10. The user enters
alphabetical characters instead of choosing from a list (or entering
numbers). This may easily lead to various versions of the same category:
All entries in the column “location” will be treated as separate categories.
Thus the program works with six different categories instead of two.
Further remarks applying to data entry with any program:
• Numerical variables should only contain numbers and no units of
measurement (e.g., "kg", "mm Hg", "points") or other alphabetical or
special characters. This is of special importance if a spreadsheet
program like Microsoft Excel is used for data entry. Unlike real data
base programs, spreadsheet programs allow the user to enter any
type of data in any cell, so the user alone takes responsibility for the
entries.
• "True" missing values should be left empty rather than coded with
special codes (e.g., -999, -998, -997). If special codes are used, they
must be defined as missing value codes and they should be defined as
value labels as well. Special codes can be advantageous for
"temporary" missing values (e.g., -999 = "ask patient", -998 = "ask
nurse", -997 = "check CRF").
• A missing value means that the value is missing. By contrast, in
Microsoft® Office Excel® an empty cell is sometimes interpreted as
zero.
• Imprecise values can be characterized by adding a column showing
the degree of certainty associated with such values (e.g., 1 = exact
value, 0 = imprecise value). This allows the analyst to run two
analyses: one with exact values only, and one also using the imprecise
values. By no means should imprecisely collected data values be
tagged by a question mark! This would turn the column into string
format, and SPSS (or any other statistics program) would not be able
to use it for analyses.
• Enter numbers without using separators (e.g., enter 1000 as 1000,
not as 1,000).

If a variable in a data base or statistics program is defined as numeric,
then it is not possible to enter anything other than numbers. Therefore,
programs that do not distinguish variable types are error-prone (e.g.,
Excel).
For a more sophisticated discussion about data management issues the
reader is referred to Appendices A and B.
1.3. Simple analyses
In small data sets values can be checked by computing frequencies for a
variable. This can be done using the menu Analyze-Descriptive
Statistics-Frequencies. Put all variables into the field Variables and
press OK.
The SPSS Viewer window pops up and shows a frequency table for each
variable. Within each frequency table, values are sorted in ascending
order. This enables the user to quickly check the minimum and maximum
values and discover implausible values.
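The same analysis can be requested with syntax; a minimal sketch:

* Frequency table for every variable in the data set.
FREQUENCIES VARIABLES=ALL.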
lie_3

Value    Frequency   Percent   Valid Percent   Cumulative Percent
120          1          6.7         6.7                6.7
125          1          6.7         6.7               13.3
127          1          6.7         6.7               20.0
136          1          6.7         6.7               26.7
143          1          6.7         6.7               33.3
144          1          6.7         6.7               40.0
145          2         13.3        13.3               53.3
150          3         20.0        20.0               73.3
152          1          6.7         6.7               80.0
155          1          6.7         6.7               86.7
165          1          6.7         6.7               93.3
203          1          6.7         6.7              100.0
Total       15        100.0       100.0
The columns have the following meanings:
• Frequency: the absolute frequency of observations having the value
shown in the first column.
• Percent: the percentage of observations with values equal to the
value in the first column, relative to the total sample including
observations with missing values.
• Valid percent: the percentage of observations with values equal to
the value in the first column, relative to the total sample excluding
observations with missing values.
• Cumulative percent: the percentage of observations with values up
to the value shown in the first column. E.g., in the line for the value
145 a cumulative percentage of 53.3 means that 53.3% of the
probands have blood pressure values less than or equal to 145.
Cumulative percents refer to valid percents, that is, they exclude
missing values.
Frequency tables are particularly useful for describing the distribution of
nominal or ordinal variables:

lie_3c

Value       Frequency   Percent   Valid Percent   Cumulative Percent
Normal         11         73.3        73.3               73.3
High            3         20.0        20.0               93.3
Very high       1          6.7         6.7              100.0
Total          15        100.0       100.0
Obviously, variable “lie_3c” is ordinal. From the table we learn that 93.3%
of the probands have normal or high blood pressure.
1.4. Aggregated data
SPSS is able to handle data that are already aggregated, i. e. data sets
that have already been compiled to a frequency table. In the data set
shown below, each observation corresponds to a category constituted by a
unique combination of variable values. The variable “frequency” shows
how many observations fall in each category:
As we see, "frequency" is not a variable defining some property of the
patients; it rather acts as a counter. Therefore, we must inform SPSS
about the special meaning of "frequency". This is done by choosing
Data - Weight Cases from the menu and putting Number of patients
(frequency) into the field Frequency Variable:
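In syntax, the weighting can be switched on with a single command; a
minimal sketch using the variable name from this example:

* Treat 'frequency' as a case weight (counter).
WEIGHT BY frequency.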
Producing a frequency table for the variable "ageclass" we obtain:

Age class   Frequency   Percent   Valid Percent   Cumulative Percent
<40            111        45.5         45.5               45.5
40-60           23         9.4          9.4               54.9
>60            110        45.1         45.1              100.0
Total          244       100.0        100.0
If the weighting had not been performed, the table would erroneously
read as follows:

Age class   Frequency   Percent   Valid Percent   Cumulative Percent
<40              3        25.0         25.0               25.0
40-60            4        33.3         33.3               58.3
>60              5        41.7         41.7              100.0
Total           12       100.0        100.0
This table just counts the number of rows with the corresponding ageclass
values!
1.5. Exercises
1.5.1. Cornea:
Source: Guggenmoos-Holzmann and Wernecke (1995). In an
ophthalmology department the effect of age on cornea
temperature was investigated. Forty-three patients in four age
groups were measured:
Age group   Measurements
12-29       35.0 34.1 33.4 35.2 35.3 34.2 34.6 35.7 34.9
30-42       34.5 34.4 35.5 34.7 34.6 34.9 34.6 34.9 33.0 34.1 33.9 34.5
43-55       35.0 33.1 33.6 33.6 34.2 34.5 34.3 32.5 33.2 33.2
56-73       34.5 34.7 35.0 34.1 33.8 34.0 34.3 34.9 34.5 34.5 33.4 34.2
Create an SPSS data file. Save the file as “cornea.sav”.
1.5.2.
Psychosis and type of constitution
Source: Lorenz (1988). 8099 patients suffering from endogenous
psychosis or epilepsy were classified into five groups according to
their type of constitution (asthenic, pyknic, athletic, dysplastic,
and atypical) and into three classes according to type of
psychosis. Each patient falls into one of the 15 resulting
categories. The frequency of each category is depicted below:
              schizophrenia   manic-depressive disorder   epilepsy
asthenic           2632                  261                 378
pyknic              717                  879                  83
athletic            884                   91                 435
dysplastic          550                   15                 444
atypical            450                  115                 165
Which are the variables of this data set?
Enter the data set into SPSS and save it as “psychosis.sav”.
Chapter 2
Statistics and graphs
2.1. Overview
Typically, a descriptive statistical analysis of a sample takes two steps:
Step 1: The data is explored, graphically and by means of statistical
measures. The purpose of this step is to obtain an overview of
distributions and associations in the data set. Thereby we don’t restrict
ourselves to the main scientific question.
Step 2: For describing the sample (e. g., for a talk or a paper) only those
graphs and measures will be used, that allow the most concise conclusion
about the data distribution. Unnecessary and redundant measures will be
omitted. The statistical measures used for description are usually
summarized in a table (e. g., a “patient characteristics” table).
When choosing appropriate statistical measures (“statistics”) and graphs,
one has to consider the measurement type (in SPSS denoted by
“measure”) of the variables (nominal, ordinal, or scale). The following
statistics and graphs are available:
• Nominal variables (e.g., sex or type of operation)
  o Statistics: frequency, percentage
  o Graphs: bar chart, pie chart
• Ordinal variables (e.g., tumor stage, school grades, categorized
scales)
  o Statistics: frequency, percentage, median, quartiles, percentiles,
    minimum, maximum
  o Graphs: bar chart, pie chart, and - with reservations - box plot
• Scale variables (e.g., height, weight, age, leukocyte count)
  o Statistics: median, quartiles, percentiles, minimum, maximum,
    mean, standard deviation; if the data are categorized: frequency
    and percentage
  o Graphs: box plot, histogram, error bar plot, dot plot; bar chart and
    pie chart if the data are categorized
The statistical description of distributions is illustrated using a data set
called "cholesterol.sav". This data set contains the variables age, sex,
height, weight, cholesterol level, type of occupation, sports, abdominal
girth and hip dimension of 83 healthy probands. First, the total sample is
described, then the description is grouped by sex.
2.2. Graphs
SPSS distinguishes between graphs using the Chart Builder... and
older versions of (interactive and non-interactive) graphs using Legacy
Dialogs. Although the possibilities of these types of graphs overlap, there
are some diagrams that can only be achieved one way or the other.
Most diagrams needed in the course can be produced using the Chart
Builder. It is important to note that the chart preview window within the
Chart Builder dialogue does not represent the actual data but only gives a
sketch of a typical chart of the selected type. Further note that graphs
which had been constructed using the menu Graphs - Legacy Dialogs -
Interactive cannot be changed interactively in the same manner as in
version 14, but are also edited using the Chart Editor.
2.3. Describing the distribution of nominal variables
Sex and type of occupation are the nominal variables in our data set. A
pie chart showing the distribution of type of occupation is created by
choosing Graphs-Chart Builder... from the menu and dragging
Pie/Polar from the Gallery tab to the Chart Preview field:
Drag Type of occupation [occupation] into the field Slice by? and
press OK. The pie chart is shown in the SPSS Viewer window:
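The corresponding legacy syntax is a one-liner; a minimal sketch using
the variable name from this data set:

* Pie chart of the counts per type of occupation.
GRAPH /PIE=COUNT BY occupation.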
Double-clicking the chart opens the Chart Editor, which allows changing
colors, styles, labels, etc. E.g., the counts and percentages shown in the
chart above can be added by selecting Show Data Labels in the
Elements menu of the Chart Editor. In the Properties window which
pops up, Count and Percentages are moved to the Labels Displayed
field using the green arrow button. For the Label Position we select
labels to be shown outside the pie and finally press Apply.
Statistics can be summarized in a table using the menu Analyze -
Tables - Custom Tables.... Drag Type of occupation to the Rows part
in the preview field and press Summary Statistics:
Here we select Column N % and move it to the display field using the
arrow button. After clicking Apply to Selection and OK we arrive at the
following table:

Type of occupation   Count   Col %
mostly sitting         24    28.9%
mostly standing        37    44.6%
mostly in motion       16    19.3%
retired                 6     7.2%
The table may be sorted by cell count by pressing the Categories and
Totals... button in the custom tables window, which is sometimes useful
for nominal variables. With ordinal or scale variables, however, the order
of the value labels is crucial and has to be maintained.
We can repeat the analyses building subgroups by sex. In the Chart
Builder, check Columns panel variable in the Groups/Point ID tab and
then drag Sex into the appearing field Panel?:
The same information is provided by a grouped bar chart. It is produced
by pressing Reset in the Chart Builder and selecting Bar in the Gallery
tab and dragging the third bar chart variant (Stacked Bar) to the preview
field:
Drag Type of Occupation to the X-Axis ? and Sex to the Stack: set
color field:
Sometimes it is useful to compare percentages within groups instead of
absolute counts. Select Percentage in the Statistics field of the
Element Properties window beside the Chart Builder window and
then press Set Parameters... in order to select Total for Each X-Axis
Category. Do not forget to confirm your selections by pressing Continue
and Apply, respectively. After closing the Chart Builder with OK, the bars
of the four occupational types will now be scaled to the same height, so
that the proportion of males and females can be compared more easily
between occupational types. The scaling to 100% within each category
can also be done ex post in the Options menu of the Chart Editor.
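A legacy-syntax sketch for a stacked bar chart of percentages; note that
the STACK keyword and the PCT summary function are assumptions
based on the legacy GRAPH command, not taken from the text above:

* Stacked bar chart: percentage of each sex within each occupation.
GRAPH /BAR(STACK)=PCT BY occupation BY sex.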
The order of appearance can be changed (e.g., to bring female on top)
by double-clicking on any bar within the Chart Editor and selecting the
Categories tab. The sorting order can also be changed for the type of
occupation by double-clicking on any label on the X-axis within the
Chart Editor and selecting the Categories tab again.
To create a table containing frequencies and percentages of type of
occupation broken down by sex, choose Analyze-Tables-Custom Tables
and, in addition to the selections shown above, drag the variable Sex to
the Columns field, and request Row N % in the Summary Statistics
dialogue.
                             male                        female
Type of occupation    Count   Col %   Row %      Count   Col %   Row %
mostly sitting          10    25.0%   41.7%        14    32.6%   58.3%
mostly standing         22    55.0%   59.5%        15    34.9%   40.5%
mostly in motion         7    17.5%   43.8%         9    20.9%   56.3%
retired                  1     2.5%   16.7%         5    11.6%   83.3%
The Row% sum up to 100% within a row, the Col% values sum up within
each column. Therefore, we can compare the proportion of males between
types of occupation by looking at Row%, and we compare the proportion
of each type of occupation between the two sexes by looking at the Col%.
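Custom Tables has its own syntax (CTABLES); a similar table of counts
with row and column percentages can also be sketched with the
CROSSTABS command:

* Counts plus row and column percentages of occupation by sex.
CROSSTABS /TABLES=occupation BY sex /CELLS=COUNT ROW COLUMN.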
2.4. Describing the distribution of ordinal variables
All methods for nominal variables also apply to ordinal variables.
Additionally, one may use so-called non-parametric or distribution-free
statistics, which make specific use of the ordinal information contained in
the variable. In our data set, the only ordinal variable is "sports",
characterizing the intensity of leisure sports of the probands. Although the
calculation of nonparametric statistics is possible here, with such a crude
classification the methods for nominal variables do better. However, some
attention must be paid to the correct order of the categories - with ordinal
variables, categories must not be interchanged. The frequency table and
bar chart for sports, grouped by sex, look as follows:
                 male                        female
Sports     Count   Col %   Row %      Count   Col %   Row %
never         7    17.5%   50.0%        7     16.3%   50.0%
seldom       22    55.0%   52.4%       20     46.5%   47.6%
sometimes    11    27.5%   45.8%       13     30.2%   54.2%
often         0     0.0%    0.0%        3      7.0%  100.0%
2.5. Describing the distribution of scale variables
Histogram
The so-called histogram serves as a graphical tool showing the distribution
of a scale variable. It is mostly used in the explorative step of data
analysis. The histogram depicts the frequencies of a categorized scale
variable, similarly to a bar chart. However, in a histogram there is no
space between the bars (because consecutive categories border on each
other), and it is not allowed to interchange bars. In the following example,
143 students of the University of Connecticut were arranged according to
their height (source: Schilling et al, 2002). This resulted in a “living
histogram”:
Scale variables must be categorized before they can be characterized
using frequencies (e.g., age: 0-40 = young, >40 = old). Creating a
histogram with SPSS involves an automatic categorization which is
performed by SPSS before computing the category counts. The category
borders can later be edited "by hand". The histogram is created by
choosing the first variant (Simple Histogram) from the histograms
offered in the Gallery tab of the Chart Builder. To create a histogram
for abdominal girth, drag the variable Abdominal girth (cm) [waist]
into the field X-Axis ? and press OK.
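The equivalent legacy syntax is again a one-liner; a minimal sketch using
the variable name from this data set:

* Histogram of abdominal girth.
GRAPH /HISTOGRAM=waist.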
Suppose we want abdominal girth to be categorized into categories of
10 cm each. Double-click the graph and then double-click any bar of the
histogram. In the Properties window which pops up, select the Binning
tab and check Custom and Interval width in the X-Axis field. Entering
the value 10 and confirming your choice by clicking Apply updates the
histogram to the new settings:
From the shape of the histogram we learn that the distribution is not
symmetric: the tail on the right-hand side is longer than the tail on the
left-hand side. Distributions of that shape are called "right-skewed". They
often originate from a natural lower limit of the variable. The number of
intervals to be displayed is a matter of taste and depends on the total
sample size. The number should be chosen such that the frequency of
each interval is not too small; otherwise the histogram contains artificial
"holes".
Histograms can also be compared between subgroups. For this purpose
perform the same steps within the Chart Builder as in the case of pie
charts:
The histograms use the same scaling on both axes, such that they can be
compared easily. Note that a comparison of the abdominal girth
distributions between both sexes requires the use of relative frequencies
(by selecting Histogram Percent in the Element Properties window
beside the Chart Builder).
Dot plot
The dot plot serves to compare the individual values of a variable between
groups, e. g., the abdominal girth between males and females. In the
Chart Builder select the first variant (Simple Scatter) from the
Scatter/Dot plots in the Gallery tab and drag Abdominal girth and Sex
into the vertical and horizontal axis fields, respectively:
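A legacy-syntax sketch for this dot plot (a scatter plot of abdominal girth
against sex):

* Dot plot: abdominal girth (vertical) by sex (horizontal).
GRAPH /SCATTERPLOT(BIVAR)=sex WITH waist.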
The dot plot may suffer from ties in the data, i. e., two or more dots
superimposed, which may obscure the true distribution. Later a variant of
the dot plot is introduced that overcomes that problem by a parallel
depiction of superimposed dots.
Using proper statistical measures
Descriptive statistics should give a concise description of the distribution
of variables. There are measures describing the position and measures
describing the spread of a distribution.
We distinguish between parametric and nonparametric (distribution-free)
statistics. Parametric statistics assume that the data follow a particular
theoretical distribution, usually the normal distribution. If this assumption
holds, parametric measures allow a more concise description of the
distribution than nonparametric measures (because fewer numbers are
needed). If the assumption does not hold, then nonparametric measures
must be used in order to avoid confusing the target audience.
Nonparametric statistics
Nonparametric measures make sense for ordinal or scale variables as they
are based on the sorted sample. They are roughly defined as follows (for a
more stringent definition see below):
• Median: midpoint of the observations when they are ordered from the
smallest to the highest value (50% fall below and 50% above this
point)
• 25th percentile (first or lower quartile): the value such that 25 percent
of the observations fall below or at that value
• 75th percentile (third or upper quartile): the value such that 75
percent of the observations fall below or at that value
• Interquartile range (IQR): the difference between the 75th and the
25th percentile
• Minimum: the smallest value
• Maximum: the highest value
• Range: the difference between maximum and minimum

While the median, percentiles, minimum and maximum characterize the
position of a distribution, the interquartile range and the range describe
the spread of a distribution, i.e., the variation of a feature. IQR and range
can be used for scale but not for ordinal variables, as differences between
two values usually make no sense for the latter.

The exact definition of the median (and analogously of the quartiles) is
as follows: the median is that value for which at least 50% of the
observations are smaller than or equal to it and at least 50% of the
observations are greater than or equal to it.
The box plot depicts all of these nonparametric statistics in one graph.
Surely, it is one of the most important graphical tools in statistics. It
allows comparing groups at a glance without reducing the information
contained in the data too much and without assuming any theoretical
shape of the distribution. Box plots help us to decide whether a
distribution is symmetric, right-skewed or left-skewed and whether there
are very large or very small values (outliers). The only drawback of the
box plot compared to the histogram is that it is unable to depict
multi-modal distributions (histograms with two or more peaks).

The box contains the central 50% of the data. The line in the box marks
the median. Whiskers extend from the box up to the smallest/highest
observations that lie within 1.5 IQR of the quartiles. Observations that lie
farther away from the quartiles are marked by a circle (1.5 - 3 IQR) or an
asterisk (more than 3 IQR from the quartiles).
The use of the box plot for ordinal variables is limited as the outlier
definition refers to IQR.
Box plots are obtained in the Chart Builder by dragging the first variant of
the Boxplots offered in the Gallery tab to the preview field.
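A syntax sketch producing box plots of abdominal girth by sex (the
EXAMINE procedure also prints descriptive statistics unless they are
switched off):

* Box plots per sex, without the additional statistics tables.
EXAMINE VARIABLES=waist BY sex
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.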
Check Point ID Label in the Groups/Point ID tab and drag the
Pat.No.[id] variable to the Point Label Variable? field which appears
in the preview field. Thus potentially shown extreme values or outliers are
labelled by that variable instead of by the row number in the data view.
As before, checking Columns Panel Variable in the same tab would
allow specifying another variable that is used to generate multiple charts
according to the values of that variable.
The box plot does well in depicting the distribution actually observed, as
shown by the comparison with the original values (dot plot):
Parametric statistics and the normal distribution
When creating a histogram of body height we can check the Display
normal curve option in the Element Properties window beside the
Chart Builder window:
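In legacy syntax the normal curve is requested with the NORMAL
keyword; a minimal sketch:

* Histogram of height with an overlaid normal curve.
GRAPH /HISTOGRAM(NORMAL)=height.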
The bars of the histogram more or less follow the normal curve. This curve
follows a mathematical formula, which is characterized by two
parameters, the mean µ and the standard deviation σ. Knowing these two
parameters, we can specify areas where, e. g., the smallest 5% of the
data are located or where the middle 2/3 can be expected. These
proportions are obtained by the formula of the distribution function of the
normal distribution:
P(y \le x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma}
\exp\left( -\frac{1}{2} \left( \frac{t-\mu}{\sigma} \right)^{2} \right) dt

P(y ≤ x) denotes the probability that a value y, drawn randomly from a
normal distribution with mean µ and standard deviation σ, is equal to or
less than x. The usually unknown parameters µ and σ are estimated by
their sample values µ̂ (the sample mean) and σ̂ (the sample standard
deviation, SD):
\text{Mean} = \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i ,
\qquad
\text{SD} = \hat{\sigma} = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat{\mu})^2 }
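As a small worked example, using the four heights (in cm) from
Example 1.1.2:

\hat{\mu} = \frac{185 + 167 + 173 + 180}{4} = 176.25 ,
\qquad
\hat{\sigma} = \sqrt{ \frac{8.75^2 + (-9.25)^2 + (-3.25)^2 + 3.75^2}{3} }
= \sqrt{62.25} \approx 7.89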
Assuming a normal distribution and using the above formula we obtain the
following data areas:
• Mean: 50% of the observations fall below the mean and 50% fall
above the mean
• [Mean - SD, Mean + SD]: 68% (roughly two thirds) of the
observations fall into that area
• [Mean - 2 SD, Mean + 2 SD]: 95% of the observations fall into that
area
• [Mean - 3 SD, Mean + 3 SD]: 99.7% of the observations fall into that
area
Note: Although mean and standard deviation can be computed from any
variable, the above interpretation is only valid for normally distributed
variables.
Displaying mean and SD: how not to do it
Often mean and SD are depicted in bar charts like the one shown below.
However, experienced researchers discourage the use of such charts.
Their reasons follow from a comparison with the dot plot:
Some statistics programs offer bar charts with whiskers showing the
standard deviation to depict the mean and standard deviation of a
variable. The bar pretends that the data area begins at the origin of the
bar, which is 150 in our example. Comparing with the dot plot, we see
that the minimum for males is much higher, about 162 cm. Furthermore,
the mean, which is a single value, is represented by a bar, which covers a
range of values. On the other hand, the standard deviation, which is a
measure of the spread of a distribution, is only plotted above the mean.
This suggests that the variable spreads into one direction from the mean
only. The standard deviation should always be plotted in both directions
from the mean.

Furthermore, the origin of 150 depicted here is not a natural one;
therefore the choice of the length of the bars is completely arbitrary. The
relationships of the bars give a wrong idea of the relationships of the
means, which should always be seen relative to the spread of the
distributions. By changing the y-axis scale, the impression of a difference
can be enforced or attenuated easily, as can be seen from the following
comparison:
Displaying mean and SD: how to do it
Mean and SD are correctly depicted in an error bar chart. Select the third
variant in the second row (Simple Error Bar) of the Bar Charts offered
in the Gallery tab:
Put the variable of interest (height in our example) into the vertical axis
field, and the grouping variable (sex) into the horizontal axis field.
In the field Error Bars Represent of the Element Properties window
check Standard Deviation. The multiplier should be set to 1.0 to have
one SD plotted on top of and below the mean. If the multiplier is set to
2.0, then the error bars will cover an area of the mean plus/minus two
standard deviations. Assuming normal distributions, about 95% of the
observations fall into that area. Usually, one standard deviation is shown
in an error bar plot, corresponding to 2/3 of the observations.
We obtain a chart showing two error bars, corresponding to males and
females:
The menu also offers to use the standard error of the mean (SEM)
instead of the SD. This statistic measures the accuracy of the sample
mean with respect to estimating the population mean. It is defined as the
standard deviation divided by the square root of N. Therefore, the
precision of the estimate of the population mean increases with the
square root of the number of observations. However, the SEM does not
show the spread of the data. Therefore, it should not be used to describe
samples.
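For example, using the summary of the male heights shown further below
(SD = 6.83 cm, N = 40):

\mathrm{SEM} = \frac{\hat{\sigma}}{\sqrt{N}} = \frac{6.83}{\sqrt{40}} \approx 1.08 \text{ cm}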
Verifying the normal assumption
In small samples it can be very difficult to verify the assumption of a
normal distribution of a variable. The following illustration shows the
variety of observed distributions of normally distributed and right-skewed
data sets. Values were sampled from a normal distribution and from a
right-skewed distribution and then collected into data sets of size 10, 30
and 100. As can be seen, the histograms of the samples of size N=10
often look asymmetric and do not reflect a normal distribution. By
contrast, some of the histograms of the right-skewed variable are close to
a normal distribution. This effect is caused by random variation, which is
considerably high in small samples. Therefore it is very difficult to reach a
unique decision about the normal assumption in small samples.
Histograms of normally distributed data of sample size 10:
[Figure: histograms of several simulated normal samples, N = 10 each; x-axis: Normal, y-axis: Count]
Histograms of normally distributed data of sample size 30:
[Figure: histograms of several simulated normal samples, N = 30 each; x-axis: Normal, y-axis: Count]
Histograms of normally distributed data of sample size 100:
[Figure: histograms of simulated normal samples, N = 100 each; x-axis: Normal, y-axis: Count]
Histograms of right-skewed data of sample size 10:
[Figure: histograms of several simulated right-skewed samples, N = 10 each; x-axis: rightskewed, y-axis: Count]
Histograms of right-skewed data of sample size 30:
[Figure: histograms of several simulated right-skewed samples, N = 30 each; x-axis: rightskewed, y-axis: Count]
Histograms of right-skewed data of sample size 100:
[Figure: histograms of simulated right-skewed samples, N = 100 each; x-axis: rightskewed, y-axis: Count]
A comparison of an error bar plot with the bars extending up to two SDs
from the mean and a dot plot may help in deciding whether the normal
assumption can be adopted. Both charts must use the same scaling on the
vertical axis. If the error bar plot reflects the range of data as shown by
the dot plot, then it is meaningful to describe the distribution using the
mean and the SD. If the error bar plot gives the impression that the
distribution is shifted up or down compared to the dot plot, then the data
distribution should be described using nonparametric statistics.
As an example, let us first consider the variable “height”. A comparison of
the error bar plot and the dot plot shows good agreement:
[Figure: error bar plot (mean ± 2 SD) and dot plot of height (cm) by sex; n = 40 males, n = 43 females]
The situation is different for the variable "Abdominal girth (waist)". Both
error bars seem to be shifted downwards. This results from the
right-skewed distribution of abdominal girth. Therefore, the
nonparametric statistics should be used for description.
[Figure: error bar plot (mean ± 2 SD) and dot plot of abdominal girth (cm) by sex; n = 40 males, n = 43 females]
In practice, box plots are used more frequently than error bar plots to
show distributions in subgroups. Using SPSS one may also symbolize the
mean in a box plot (see later). However, error bar plots are useful when
illustrating repeated measurements of a variable, as they need less space
than box plots.
Computing the statistics
Statistical measures describing the position or spread of a distribution can
be computed using the menu Analyze-Tables-Custom Tables.... To
compute measures for the variable “height”, in separate rows according to
the groups defined by sex, choose that menu and first drag the variable
sex to the Rows area in the preview field. Then drag the height variable to
the produced two-row table such that the appearing red frame covers the
right-hand part of the male and the female cells:
We have seen from the considerations above that height can be assumed
to be normally distributed. Therefore, the most concise description uses
mean and standard deviation. Additionally, the frequency of each group
should be included in a summary table. Press Summary Statistics...
and select Count, Mean and Std. Deviation:
After pressing Apply to Selection and OK, the output table reads as
follows:
Sex                     Count    Mean    Std Deviation
male     Height (cm)     40     179.38        6.83
female   Height (cm)     43     167.16        6.50
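A syntax alternative that produces the same summary without the Custom
Tables module; a minimal sketch:

* Count, mean and SD of height per sex.
MEANS TABLES=height BY sex
  /CELLS=COUNT MEAN STDDEV.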
The variable “Abdominal girth” is best described using nonparametric
measures. These statistics have to be selected in the Summary
Statistics... submenu:
Sex                              Count   Median   Percentile 25   Percentile 75   Range
male     Abdominal girth (cm)     40      91.50       86.00          101.75       51.00
female   Abdominal girth (cm)     43      86.00       75.00          103.00       86.00
Transferring output to other programs
After explorative data analysis, some of the results will be used for
presentation of study results. Parts of the output collected in the SPSS
Viewer can be copied and pasted in other programs, e. g. MS Powerpoint
or MS Word 2007.
• Charts can simply be selected and copied (Ctrl-C). In the target
program (Word or PowerPoint 2007) choose Edit - Paste (Ctrl-V) to
paste the chart as an editable graph.
• Tables are transferred the following way: select the table in the SPSS
Viewer and choose Edit - Copy (Ctrl-C) to copy it; in the target
program choose Edit - Paste (Ctrl-V) to paste it as an editable table.
Summary
To decide which measure and chart to use for describing the distribution
of a scale variable, first verify the distribution using histograms (by
subgroups) or box plots.
[Figure: decision chart juxtaposing histograms, error bar plots and box plots of height (approximately normally distributed) and abdominal girth (not normally distributed) by sex]

Normally distributed:
• Measure of center: mean
• Measure of spread: standard deviation
• Chart: error bar plot

Not normally distributed:
• Measure of center: median
• Measure of spread: 1st and 3rd quartiles (large N) or minimum and
maximum (small N); additionally, minimum and maximum may be
used
• Chart: box plot (for large N, possibly including means)
The section “Further graphs” shows how to include means into box plots.
2.6. Outliers
Some extreme values in a sample may affect results exorbitantly. Such
values are called outliers. The decision on how to proceed with outliers
depends on their origin:
• The outliers were produced by measurement errors or data entry
errors, or may be the result of untypical circumstances at the time of
measurement. If the correct values are not available (e.g., because of
missing case report forms, or because the measurement cannot be
repeated), then these values should be excluded from further
analysis.
• If an error or an untypical circumstance can be ruled out, then the
outlying observation must be regarded as part of the distribution and
should not be excluded from further analysis.
An example, taken from the data file "Waist-Hip.sav":
There appears to be an observation with a height of less than 100 cm.
This points towards a data entry error. Using box plots, such implausible
values are easily detected. The outlier is identified by its Pat.No., which
has been set as the Point ID variable. Alternatively, we could identify the
outlier by sorting the data set by height (Data - Sort Cases). Then the
outlying observation will appear in the first row of data. To exclude the
observation from further analyses, choose Data - Select Cases... and
check the option If condition is satisfied.
Press If... and specify height > 100. Press Continue and OK. Now the
untypical observation is filtered out before analysis. A new variable
"filter_$" is created in the data set. If the box plot request is repeated, we
still obtain the same data area as before.
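The Select Cases dialog essentially pastes syntax of the following form; a
minimal sketch:

* Filter out the implausible height value (kept in the file, not analyzed).
USE ALL.
COMPUTE filter_$ = (height > 100).
FILTER BY filter_$.
EXECUTE.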
The scaling of the vertical axis must be corrected by hand, by double-clicking the chart and subsequently double-clicking the vertical axis:
In the Scale tab uncheck the Auto option at Minimum and insert a value of
150, say. Finally, after rescaling the vertical axis we obtain a correct
picture:
2.7. Missing values
Missing values (sometimes called "not-available values" or NAs) are a
problem common to all kinds of medical data. In the worst case they can
lead to biased results. There are several reasons for the occurrence of
missing data:
• Breakdown of measurement instruments
• Retrospective data collection may lead to values that are no longer
available
• Imprecise or fragmentary collection of data (e.g., questionnaires with
some questions the respondent refused to answer)
• Missing values by design: some values may only be available for a
subset of patients
• Drop-out of patients from studies:
  o Patients refusing further participation
  o Patients lost to follow-up
  o Patients who have died
Reports on studies following patients over a long time should always
include information about the frequency and time of drop-outs. This is
best accomplished by a flow chart, which is, e.g., mandatory for any study
published in the British Medical Journal. Such a flow chart may resemble
the following, taken from Kendrick et al (2001)
(http://bmj.bmjjournals.com/cgi/content/abstract/322/7283/400, a
comparison of referral to radiography of the lumbar spine and
conventional treatment for patients with low back pain):
Furthermore, values may be only partly available (censored). This is quite typical for many medical research questions. Censored means that we don’t know the exact value of the variable. All we know is an interval in which the exact value is contained:
• Right censoring: Quite common when studying survival times of patients. Assume, e.g., that we are interested in the survival of patients after colon cancer surgery. If a patient is still alive at the time of statistical analysis, then all we know is a minimum survival time of, say, 7.5 years for this patient. His true but unknown survival time lies between 7.5 years and infinity. This upper limit of infinity is a mathematical convenience; in our example an upper limit of, say, 150 years would suffice to cover all possible values for human survival times.
• Left censoring: Here a maximum for the unknown value is known. This happens quite often in laboratory settings, where values below a certain level of detection (LOD) cannot be observed. So all we know is that the true value falls within the interval between 0 and the LOD.
• Interval censoring: Both a lower and an upper limit for the unknown value are known. For instance, we are interested in the age at which a person becomes HIV-positive. The person is medically examined at an age of 22.1 years and of 27.8 years. If the HIV test is negative at the first examination but positive at the second one, then we know that the HIV infection happened between the ages of 22.1 and 27.8 years, but we won’t know the exact age at which the infection occurred.
According to their impact on results, one roughly distinguishes between
• Values missing at random: the reason for the missing values may be known, but is independent of the variable of interest
• Nonignorable missing values (also called missing not at random): the reason for the missing values depends on the magnitude of the (unobserved) true value; e. g., if patients in good condition refuse to show up at follow-up examinations
Strategies for data analysis in the presence of missing values include:
1. Complete case analysis: patients with missing values are excluded
from analysis
2. Reconstructing missing values (imputation; see the sketch after this list):
o In longitudinal studies with missing measurements, often the
last available value is carried forward (LVCF) or a missing
value is replaced by interpolating the neighboring observations
o Missing values are replaced by values from similar individuals
(nearest neighbor imputation)
o Missing values are replaced by a subgroup mean (mean
imputation)
3. Application of specialized techniques to account for missing or
censored data (e. g., survival analysis)
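To make the second strategy more concrete, here is a minimal Python sketch of two of the simple imputation rules named above (LVCF and mean imputation); the data values and names are invented purely for illustration:

    # Hypothetical series of measurements; None marks a missing value.
    measurements = [5.1, None, 4.8, None, None, 6.0]

    # Last value carried forward (LVCF): replace a missing value
    # by the most recently observed value.
    lvcf, last = [], None
    for value in measurements:
        if value is not None:
            last = value
        lvcf.append(last)

    # Mean imputation: replace missing values by the mean of the
    # observed values.
    observed = [v for v in measurements if v is not None]
    mean = sum(observed) / len(observed)
    mean_imputed = [v if v is not None else mean for v in measurements]

    print(lvcf)          # [5.1, 5.1, 4.8, 4.8, 4.8, 6.0]
    print(mean_imputed)  # missing entries replaced by 5.3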
Any method attempting to reconstruct missing values (second strategy, see above) needs to assume that the values are missing at random. This assumption cannot be verified directly from the data (unlike, say, the normality assumption). Therefore, the first option (complete case analysis) is often preferred unless special techniques are available (e.g., for survival data in the presence of right-censoring). The better alternative is an improved study design that tries to avoid missing values.
Especially in cases where nonignorable missing values are plausible, a careful sensitivity analysis should be performed.
Example 2.7.1: Quality of life. Consider a variable measuring quality of
life (a score consisting of 10 items) one year after an operation. Suppose that
all questionnaires have been filled in completely. Then the histogram of scores
looks like the following:
[Histogram of quality-of-life scores: Mean = 50.8947, Std. Dev. = 20.01785, N = 114; x-axis: Quality of life (0.00–120.00), y-axis: Frequency]
Now suppose that some patients could not answer some items, such that the score is missing for those patients (assuming missing at random). The histogram of the quality-of-life scores doesn’t change much:
[Histogram: Mean = 49.7727, Std. Dev. = 18.91584, N = 66]
Now consider a situation of nonignorable missing values: suppose that patients suffering from heavy side effects are not able to fill in the questionnaire. If the histogram is based on the data of the remaining patients, we obtain:
[Histogram: Mean = 61.8429, Std. Dev. = 15.88421, N = 70]
The third histogram shows a clear shift to the right, and the mean is about 10 units higher than in the other two histograms. This reflects the bias that occurs if the missing-at-random assumption is not fulfilled.
2.8. Further graphs
Dot plots with heavily tied data
With larger sample sizes and scale variables of moderate precision, data
values are often tied, i. e., multiple observations assume the same value.
A simple dot plot as shown above superimposes those observations such
that the density of the distribution at such values is obscured. To
overcome that problem, a dot plot that puts tied values in parallel can be
created. First create a Simple Scatter plot as shown above for
Abdominal girth by Sex in the “cholesterol.sav” data set. Then check the Stack identical values option in the Element Properties window:
After rescaling the vertical axis to a minimum of 50 and a maximum of
150 the resulting chart looks as follows:
2.9. Exercises
The following exercises are useful to practice the concepts of descriptive
statistics. All data files can be found in the filing tray “Chapter 2”. For
each exercise, create a chart and a table of summarizing statistics.
2.9.1. Coronary arteries
Open the data set “coronary.sav”. Compare the achieved
treadmill time between healthy and diseased individuals! Use
parametric and nonparametric statistics and compare the results.
2.9.2. Cornea (cont.)
Open the data set “cornea.sav”. Compare the distribution of
cornea temperature measurements between the age groups by
creating a chart and a table of summary statistics.
2.9.3. Hemoglobin (cont.)
Open the data set “hemoglobin.sav”. Compare the level of
hemoglobin between pre-menopausal and post-menopausal
patients using adequate charts and statistics. Repeat the analysis
for hematocrit (PCV).
2.9.4. Body-mass-index and waist-hip-ratio (cont.)
Open the data set “bmi.sav”. Compare body-mass-indices of the
patients in the four categories defined by sex and disease. Use
adequate charts and statistics. Repeat the analysis for waist-hip-ratio. Pay attention to outliers!
2.9.5. Cardiac fibrillation (cont.)
Open the data set “fibrillation.sav”. Compare body-mass-index
between patients that were successfully treated and patients that
were not (variable “success”). Repeat the analysis building
subgroups by treatment. Repeat the analysis comparing
potassium level and magnesium level between these subgroups.
2.9.6. Psychosis and type of constitution (cont.)
Open the data file “psychosis.sav”. Find an adequate way to
graphically depict the distribution of constitutional type, grouped
by type of psychosis. Define “frequency” as the frequency
variable! Create a table of suitable statistics.
2.9.7. Down’s syndrome (cont.)
Open the data file “down.sav”.
Depict the distribution of mother’s age. Is it possible to compute
the median?
Depict the proportion of children suffering from Down’s syndrome
in subgroups defined by mother’s age. Is there a difference when
compared to absolute counts?
2.9.8. Flow-chart
Use the data file “patientflow.sav” and fill in the following chart
showing the flow of patients through a clinical trial.
Registered patients N=
  → Refused to participate N=
Randomized N=

Active group N=                      Placebo group N=

3 weeks:
  Lost to follow up N=               Lost to follow up N=
  Refused to continue N=             Refused to continue N=
  Died N=                            Died N=
Treated per protocol N=              Treated per protocol N=

6 weeks:
  Lost to follow up N=               Lost to follow up N=
  Refused to continue N=             Refused to continue N=
  Died N=                            Died N=
Treatment completed                  Treatment completed
per protocol N=                      per protocol N=

2.9.9. Box plot including means
Use the data set “waist-hip.sav”. Generate box plots which include the means for the variables body-mass-index and waist-hip-ratio, grouped by sex.
2.9.10. Dot plot
Use the data set “waist-hip.sav”. Generate dot plots, grouped by
sex, for the variables body-mass-index and waist-hip-ratio as
described in section 2.5. Afterwards, generate dot plots which put
tied values in parallel as described in section 2.8.
Chapter 3
Probability
3.1. Introduction
In the first two chapters various methods to present medical data have
been introduced. These methods comprise graphical tools (box plot,
histogram, scatter plot, bar chart, etc.) and statistical measures (mean,
standard deviation, median, quartiles, quantiles, percentages, etc.).
These statistical tools have one particular purpose in the field of medicine:
to ease communication with colleagues. In direct conversations, in talks or in publications, the essentials of empirical data should be described comprehensibly and correctly.
Example 3.1.1. Of 34 patients, 18 were treated by ABC-NEW and 16 by XYZ. Treatment with ABC-NEW led to 9 successes (50%). With XYZ, only 4 successes (25%) could be observed.
In example 3.1.1, the essentials of the empirically collected data are described comprehensibly and correctly.
This information is only interesting to people other than those involved in the trial (and the patients, of course!) if the results can be generalized to other patients suffering from the same disease. Then it could be used to predict the treatment success in such patients.
Under which conditions can empirical results based on observations be generalized? Put another way, when can we draw conclusions from a part to the whole? Here we have arrived at an important but difficult point. Therefore, some more general thoughts have to be considered at this point.
Population – Sample
Assume a bin (e. g., an urn or a bag) containing many red and white
balls. We are interested in the proportion of red and white balls. Many practical problems are similar to this question:
• Delivery of 5000 blood bottles in the production of blood plasma: some of them are infected by Hepatitis-B (red balls), some not (white balls).
• Delivery of 1 million goulash tin cans for the military service: some of them are decomposed (red balls), some not.
• Various patients treated by XYZ: some can be healed (red balls), some not (white balls).
• Tossing a coin: each ball corresponds to a toss of a coin, red balls correspond to “head”, white balls to “tail”. Please note that here we are dealing with a very large (infinitely large) number of coin tosses.
The urn model with balls of two colors corresponds to a so-called binary
outcome. This model can be extended to nominal outcomes, by
introducing more colors. Also for ordinal or scale outcomes we could think
of similar models.
Although the inspection of goulash cans, the examination of the efficacy of clinical treatments and the coin toss are not comparable in subject matter, they have a formally related structure, which can be represented by the urn model.
How can the proportion of red (white) balls be found?
• Examine all balls: this is the only conceivable way for the blood bottles.
• Examine only a part: for goulash tins, treatment XYZ, and the coin toss.
• Examine none (use knowledge from other sources): for blood bottles, goulash cans, treatment XYZ, and the coin toss.
From now on, we will deal with the second option, the examination of a part (a sample), to draw conclusions that are valid for the whole (the population). We proceed in the following way:
(1) Draw a sample from the population
(2) Determine the required features
(3) Draw conclusions from the results of the sample to the properties of
the population
Ad (1). Drawing a sample corresponds to drawing balls out of the urn, selecting goulash tins for quality control, recruiting patients for a clinical trial, or to coin tosses. The crucial point with a sample is that it is representative of the population, which can be achieved by a random selection. Please note: only results from representative samples can be generalized to the population! Results that cannot be generalized are mostly uninteresting.
Ad (2). The determination of the required features is often difficult. It will
not be much of a problem to determine the color of a ball, but can be
more tricky in the goulash or treatment success examples, particularly
with borderline cases. Therefore, a good study protocol should contain
exact and comprehensible definitions that are free from contradiction.
Ad (3). How can we draw conclusions from a sample to properties of a
population? This difficult and very fundamental problem will be dealt with
in the following chapters.
Note: We will even extend the problem by considering two representative
samples, which should be used to answer the question whether the
underlying populations are the same or not. This corresponds to a
situation of two urns from which a particular number of balls are drawn in
order to decide whether the proportion of red balls is the same in both
urns or not. In clinical context this corresponds to the comparison of two
treatments: by using treatments ABC-NEW, are we able to heal more
patients than by using XYZ?
First of all, we will flip sides. Assume that we know the characteristics of a population. What happens if we draw a random sample from it? This will be the content of the next section, “Probability theory”.
3.2. Probability theory
An experiment with unpredictable outcome is called a random experiment.
Using this definition, we can call the application of a clinical treatment a
random experiment, because we do not know in advance the outcome of
the therapy (e. g., can the patient be healed? How long will the patient
survive? How much can the drug lower the blood pressure?).
The set of all possible outcomes of a random experiment (the set of all
elementary events) constitutes the sample space. Based on the sample
space, we can define random events.
• Random experiment: rolling a dice
Sample space: {1, 2, 3, 4, 5, 6}
Event A … “2”, A = {2}
Event B … an even number, B = {2, 4, 6}
Event C … a number greater than 3, C = {4, 5, 6}
• Random experiment: opinion poll
Sample space: {preferences of all Austrians older than 14 years with respect to food and stimulants}
Event D … person is a smoker
Event E … person prefers bread and margarine over bread and butter
Note: These two outcomes appear very simple at first glance, but are
defined very loosely. When is a person a smoker? Is it somebody smoking
one cigarette once a week? Is somebody a non-smoker, if he stopped
smoking three days ago? Outcome E is even more unclear: what about a
person who dislikes margarine and butter? What about a person who
prefers margarine over butter, but dislikes bread? What is the definition of
margarine, is it a certain brand or the product in general?
• Random experiment: survival of Ewing’s sarcoma patients after chemotherapy
Sample space: {all real numbers between 0 and 130 years}
Event F … patient dies within the first year
Event G … patient survives longer than 15 years
• Random experiment: sleeping duration within 24 hrs
Sample space: {all real numbers between 0 and 24}
Event H … less than 8 hrs, H = [0, 8)
Event I … abnormal sleeping duration, i. e. less than 4 hrs or more than 11 hrs, I = [0, 4) ∪ (11, 24]
The complementary event of an event A is the event that A will not happen, i. e., it consists of all elements of the sample space that are not covered by the event A.
Complementary event of B: an odd number, B^C = {1, 3, 5}
Complementary event of E: E^C = {person does not prefer bread and margarine over bread and butter}
A certain event is an event which is known a priori to be certain to occur, e. g., rolling a dice and obtaining a number less than or equal to 6, S = {1, 2, 3, 4, 5, 6}.
An impossible event is an event which is known a priori to be certain not to occur, e. g., rolling a dice and obtaining a number between 19 and 25, U = {}.
The union of two events is the event that at least one of the events occurs.
Union of B and C: the number is even, or greater than 3, or both. B ∪ C = {2, 4, 5, 6}
The event I (abnormal sleeping duration) is the union of two events K and L: I = K ∪ L
Event K … too short sleeping duration, K = [0, 4)
Event L … too long sleeping duration, L = (11, 24]
The intersection of two events is the event that both events occur.
Intersection of B and C, i. e. the number is even and greater than 3: B ∩ C = {4, 6}
What is the intersection of K and L? It is an impossible event, because it is impossible to sleep too long and too short at the same time (literally!). We say the events K and L are mutually exclusive or disjoint events.
The intersection of D^C and I^C: a person who does not smoke and sleeps between 4 and 11 hrs.
After defining events, we can proceed to compute probabilities.
Example 3.2.1: the probability of the outcome “2” when rolling a fair dice is
1/6.
Example 3.2.2: the probability of a male birth is 51.4%.
In both examples above there is a strong relationship to “relative
frequency”.
Ad example 3.2.1: from physical considerations we know that a fair dice has the same probability for each side. Therefore, we assume that when rolling the dice repeatedly, the relative frequency of “2” approaches the mathematically computed number of 1/6. This mathematical probability is calculated by the formula

    P = (number of favorable elementary events) / (total number of elementary events) = g/m = 1/6,
where a “favorable elementary event” is an elementary event that is an
element of the event of interest. This formula is also called Laplace
probability (Pierre Simon Marquis de Laplace, 1749-1827) or the classical
definition of probability. This definition is only reasonable if elementary
events are assigned equal probabilities. Therefore, we can apply it also to
a coin flip.
Ad example 3.2.2: this proposition is based on observation. After observing for a long time that the relative frequency of a male birth is about 51.4%, we assume that this empirical law also applies to future births. Therefore, we call the definition of probability used here the statistical definition of probability: the relative frequency in a very, very long series of experiments.
Difference between relative frequency and probability:
• Relative frequency: property of a sample
• Probability: property of a population; it refers to a future, non-predictable event
• Probability is the expectation of relative frequency.
Please note: Probabilities are always tied to events. If a probability is
stated somewhere, it must be assured, that the related event is defined
uniquely and comprehensibly. This is exemplified by the following:
• Gigerenzer, 2002: a psychiatrist regularly administered the drug Prozac to depressive patients. He informed the patients that sexual problems would occur with a probability of 30 to 50%. He meant that sexual problems would occur for 3 to 5 patients out of 10. However, many patients thought that problems would occur in 30 to 50 per cent of their own sexual activities.
• Randow, 1992: after the Gulf War of 1991, the US Navy reported a 99% success rate for their Cruise Missiles. After inquiries, it turned out that 99% of the rockets could be launched without problems. It was not a statement about the hit rate.
• Nowadays, better diagnostic procedures are able to detect metastases in lung cancer patients that could not be detected formerly. Therefore, such patients are now classified into the stage “bad” instead of the earlier classification “good”. This phenomenon is often called stage migration. As a consequence, the survival probabilities in both stages increased, although nothing changed with any patient’s individual outcome. The reason for the change in survival probabilities lies in the migration of patients from the stage “good” to the stage “bad”; these patients had poor outcomes when compared to others in the “good” stage, but better outcomes when compared to those in the “bad” stage.
Mathematically, three properties suffice for probability calculus:
I. The probability of an event is a number between 0 and 1 (0 and 100 per
cent).
II. Probability 1 (100%) is assigned to the certain event.
III. The probability of an event that is the union of mutually exclusive
events is the sum of probabilities corresponding to these events.
These properties are also called Kolmogorov’s axioms, where Kolmogorov put a little more mathematical effort into the formulation of his third axiom.
Ad I and II. Following these axioms, probability is just a measure quantifying the possible occurrence of an event. As the certain event always occurs, it is assigned the highest possible probability (1). The range from 0 to 1 is used just for convenience; we could use any other range (e.g., -94.23 to 2490.08), but we would be silly to do so.
Ad III. This is a simple and plausible rule. Consider rolling a fair dice: if
the probability for the outcome “2” is 1/6, and the probability of rolling “3”
is also 1/6, then the probability of rolling “2” or “3” is 1/6+1/6 = 2/6.
These events are mutually exclusive, because we cannot roll a “2” and a
“3” at the same time.
Ad I-III. All calculation rules concerning probabilities can be deduced from these three properties.
Often the letter “P” is used to denote probabilities. When beginning with
the calculation of probabilities, one tends to use phrases like “the
probability of a male birth is 0.514 or 51.4%” to describe the results. After
some practice this seems cumbersome and one seeks abbreviations like
“P(M)=0.514”.
In the following, we will mainly use abbreviated notation.
The probability of the complementary event
Assume that the probability of an event X is known and denoted by P(X).
One minus this probability gives us the probability of the complementary
event:
P(X^C) = 1 – P(X)
In order to visualize this probability calculation, one can use Venn diagrams. The sample space (certain event) is drawn as a rectangle. Events are represented as ellipses within the sample space. In our case the event X is represented by a gray-colored ellipse. The probability of X is the proportion of the rectangular area that is covered by the ellipse.
The area outside the ellipse represents the complementary event.
Example: the probability of a male birth is 0.514, i. e. the probability of a female birth is 1 – 0.514 = 0.486 or 48.6%.
Probability of the union of two events which are not mutually exclusive
The probability of the union of two events which are not mutually exclusive is computed by summing up the probabilities of the events and subtracting the probability of the intersection of the two events (otherwise, the probability of the intersection would be counted twice):
P(X ∪ Y) = P(X) + P(Y) – P(X ∩ Y)
The probability is again represented by a Venn diagram:
Example (rolling a dice): the probability of the event that we obtain an even number or a number greater than 3:
P(B ∪ C) = P(B) + P(C) – P(B ∩ C) = 3/6 + 3/6 – 2/6 = 4/6
Conditional probabilities
In daily life, our assessment of particular circumstances may change as
soon as we obtain new information. In probability theory, this observation
is reflected by the concept of conditional probability. As an example,
consider the probability of a multiple birth, which is in general very small.
Our assessment will change as soon as we know that the probability refers
to a woman who underwent hormonal treatment.
The conditional probability is the probability of an event when you know that the outcome lies in some particular part of the sample space. Put another way, it is the probability of an event when you know that another event has already occurred. Mathematically, it is written as
P(X|Y) … the conditional probability of X given Y
By means of a Venn diagram we can easily visualize that
P(X|Y) = P(X ∩ Y)/P(Y), i. e., P(X ∩ Y) = P(Y) P(X|Y)
Example: in diagnostic studies, clinical, radiological or laboratory procedures are evaluated as to whether they are suitable for detecting particular diseases. We use terms like sensitivity and specificity to quantify this suitability. These are basically conditional probabilities. In general, a diagnostic test
These are basically conditional probabilities. In general, a diagnostic test
for a condition is said to be positive if it states that the condition is
present and negative if it states that the condition is absent.
Sensitivity is defined as the probability that the test states the condition is
present, given it is actually present.
We are dealing with two events:
Event1 = {test is positive}
Event2 = {condition is actually present}
The sensitivity can thus be written as a conditional probability:
Sensitivity = P(Event1 | Event2)
Specificity is defined as the probability that the test states the condition is
absent, given it is actually absent.
Now we are dealing with the complementary events:
Event1^C = {test is negative}
Event2^C = {condition is absent}
Specificity = P(Event1^C | Event2^C)
In medical applications special interest is given to the positive predictive
value. It is defined as the probability that the condition is present, given
the diagnostic test is positive. There is an analogue referring to the
complementary events: the negative predictive value. It is defined as the
probability that the condition is absent, given the test states that it is
absent.
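These definitions lend themselves to a small computation. The following Python sketch (with hypothetical values for sensitivity, specificity and prevalence) derives the positive and negative predictive values from the conditional-probability formula given above:

    # Hypothetical inputs: sensitivity P(T+|D+), specificity P(T-|D-),
    # and disease prevalence P(D+) in the tested population.
    sensitivity = 0.98
    specificity = 0.95
    prevalence = 0.01

    # Total probability of a positive test result:
    # P(T+) = P(T+|D+) P(D+) + P(T+|D-) P(D-)
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

    # Positive predictive value P(D+|T+) = P(T+ and D+) / P(T+)
    ppv = sensitivity * prevalence / p_positive

    # Negative predictive value P(D-|T-) = P(T- and D-) / P(T-)
    npv = specificity * (1 - prevalence) / (1 - p_positive)

    print(round(ppv, 3), round(npv, 4))  # 0.165 and 0.9998

Note that, unlike sensitivity and specificity, the predictive values depend on the prevalence of the condition in the tested population.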
Independence of events
This concept also applies to daily life. Loosely speaking, independence of
events means that the events do not influence each other. This means
that occurrence of one event does not change the probability of the other
event.
As an example, consider once more the toss of a dice. The event that a “2” is rolled is independent of the event that the preceding toss yielded a number greater than 3 or lower than 3.
Another example: assume you read in a scientific journal that taking herbal drops and a subsequent healing success are statistically independent events. This is just a scientifically correct way of saying that a patient suffering from that particular disease will not be healed by taking herbal drops.
Events X and Y are independent if
P(X|Y) = P(X)
or P(X|Y) = P(X|Y^C)
or P(X ∩ Y) = P(X) P(Y) … this is also called the multiplication rule.
The opposite of independence is dependence. As an example, consider the
event “multiple birth”, its probability depends on the occurrence of the
event “hormonal treatment”.
Note: we distinguish stochastic and causal dependence. While stochastic
dependence is symmetrical, causal dependence is a one-way relationship
between events. Clearly, multiple birth is causally dependent on hormonal
treatment, because a reverse relationship doesn’t make sense. Causal
dependence means that the occurrence of a cause changes the probability
of an effect. This does not mean necessarily, that the effect follows
directly from the cause.
Many observable dependencies have to be regarded, for the time being, as merely stochastic, because the lack of subject-matter knowledge does not yet permit a causal explanation. A stochastic dependency may turn into a causal dependency if we later obtain additional knowledge.
When computing conditional probabilities you don’t have to worry about causal dependencies. We are allowed to calculate both P(effect|cause) and P(cause|effect). The latter, i. e. inferring the probability of a cause given that a particular effect has been observed, is typical for private investigators, historians and forensic doctors.
Example 3.2.3: Consider the game “ludo” (in German: “Mensch-ärgere-Dich-nicht”). A player must throw a 6 to move a piece from the starting circle onto the first square on the track. What is the probability of not throwing a “6” in three consecutive throws?
We start with the definition of three events:
E1 … no “6” at the first throw
E2 … no “6” at the second throw
E3 … no “6” at the third throw
We know that the throws are independent. Therefore, we can make use of the multiplication rule:
P(no “6” in three consecutive throws) = P(E1 ∩ E2 ∩ E3) = P(E1) P(E2) P(E3)
What we need now is the probability to throw no “6” at one throw. Clearly,
it is 5/6.
Thus,
P(no “6” in three consecutive throws) = 5/6 x 5/6 x 5/6 = 125/216 =
0.5787.
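The result can be double-checked by simulation; a short Python sketch (purely illustrative):

    # Simulation sketch for example 3.2.3: relative frequency of
    # "no 6 in three consecutive throws" of a fair dice.
    import random

    random.seed(42)
    repetitions = 1_000_000
    hits = sum(
        all(random.randint(1, 6) != 6 for _ in range(3))
        for _ in range(repetitions)
    )
    print(hits / repetitions)  # close to 125/216 = 0.5787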
Example 3.2.4: Patients may suffer from infectious diseases without
showing symptoms (asymptomatic infections). Assume that for a particular
disease, the probability of an asymptomatic infection is 40%. Consider two
patients that are infected. What is the probability that
(a) both infections are asymptomatic
(b) both infections are symptomatic
(c) exactly one infection is symptomatic?
a) P(asymptomatic and asymptomatic) = 0.4 * 0.4             = 0.16
b) P(symptomatic and symptomatic)   = 0.6 * 0.6             = 0.36
c) P(symptomatic and asymptomatic)  = 0.4 * 0.6 + 0.6 * 0.4 = 0.48
                                                     Total:   1.00
Ad c) The event “exactly one infection is symptomatic” consists of two
mutually exclusive events. These are “first infection is asymptomatic and
second infection is symptomatic” and “first infection is symptomatic and
second infection is asymptomatic”.
Example 3.2.5: Everyone has two copies of each gene (one passed over from
the mother and one from the father). Each gene may appear in different
alleles. As an example, consider the ABO gene which determines blood groups.
There are three main alleles (A, B and O). The relative population frequencies
of different alleles of a gene are called allele frequencies (gene frequencies),
where each individual contributes two alleles to the population of alleles. In a
Caucasian population the allele frequencies of the blood group alleles are:
P(A)=0.28, P(B)=0.06, P(O)=0.66
An individual's genotype for a gene is the pair of alleles it happens to possess.
With the blood group gene, we know of six genotypes, these are AA, AB, AO,
BB, BO, OO, where BA, OA and OB are the same as AB, AO and BO,
respectively.
The phenotypes of the blood groups (A, B, AB, O) are constituted by the
different genotypes, where A and B are dominant over O, and A and B are
mutually codominant. Thus, phenotype A is constituted by the genotypes AA
and AO, phenotype B by genotypes BB and BO, phenotype AB by genotype AB,
and phenotype O by the genotype OO.
Each parent transmits each of his/her pair of alleles with probability ½. The
transmitted alleles constitute the genotype (and the phenotype) of the
offspring.
In a large population
• with random mating (each individual has the same probability of mating with anybody else),
• without migration,
• without mutation,
• and without selection (influence of the genotype on the fertility of the individual and on the viability of his/her offspring),
the allele frequencies and the genotype frequencies are constant over generations. Such a population is said to be in Hardy-Weinberg-equilibrium, and the genotype frequencies are easily computed from the allele frequencies (guess why?).
Assume a Caucasian population in Hardy-Weinberg-equilibrium. What are
the genotype frequencies of the blood group gene in this population?
P(genotype AA) = P(allele A) * P(allele A)     = 0.0784
P(genotype AB) = 2 * P(allele A) * P(allele B) = 0.0336
P(genotype AO) = 2 * P(allele A) * P(allele O) = 0.3696
P(genotype BB) = P(allele B) * P(allele B)     = 0.0036
P(genotype BO) = 2 * P(allele B) * P(allele O) = 0.0792
P(genotype OO) = P(allele O) * P(allele O)     = 0.4356
                                        Total:   1.0000
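The same computation can be written down compactly in code; a Python sketch using the allele frequencies given above:

    # Hardy-Weinberg genotype frequencies from the allele frequencies
    # of example 3.2.5 (Caucasian population).
    freq = {"A": 0.28, "B": 0.06, "O": 0.66}

    genotypes = {}
    for a1 in freq:
        for a2 in freq:
            g = "".join(sorted(a1 + a2))  # e.g. OA and AO are the same genotype
            genotypes[g] = genotypes.get(g, 0.0) + freq[a1] * freq[a2]

    for g in sorted(genotypes):
        print(g, round(genotypes[g], 4))
    # AA 0.0784, AB 0.0336, AO 0.3696, BB 0.0036, BO 0.0792, OO 0.4356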
Drawing with or without replacement
What is the probability for “6 out of 6” in a lottery with 45 numbers?
Hint: consider the funnel containing 45 balls, with 6 balls colored red
(these refer to the numbers you have bet on), the other 39 balls are
white. What is the probability to draw
six red balls in six consecutive draws? It is crucial that the balls are not
returned to the funnel after being drawn (such that the same number
cannot appear twice among the winning numbers).
At the first draw, the probability of drawing a red ball is P(R1) = 6/45 = 0.1333. At the second draw, the probability of drawing a red ball depends on whether a red or a white ball was drawn at the first draw: P(R2|R1) = 5/44 = 0.1136, and P(R2|R1^C) = 6/44 = 0.1364. This means that the consecutive draws are no longer independent.
The same argument applies for the probabilities of the third, fourth, fifth and sixth draw. The probability for a “6 out of 6” is therefore not so easy to compute. In the end, it is given by
P(6 out of 6) = P(R1) * P(R2|R1) * P(R3|R1∩R2) * P(R4|R1∩R2∩R3) * P(R5|R1∩R2∩R3∩R4) * P(R6|R1∩R2∩R3∩R4∩R5)
= 6/45 * 5/44 * 4/43 * 3/42 * 2/41 * 1/40 = 0.000000123.
Note: when drawing without replacement from a large population, the effect on the probabilities for the next draw is usually very small, so that it can be ignored in computations. As an example, consider 5000 balls among which there are 1000 red balls. At the first draw the probability to draw a red ball is P(R) = 1000/5000 = 0.2. At the second draw, the probability to draw a red ball (without replacing the first ball) is now either P(R|R) = 999/4999 = 0.19984 or P(R|W) = 1000/4999 = 0.20004. On the other hand, the probability of drawing a red ball with replacement of the first ball is 0.2, which is very close to both values computed assuming no replacement.
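For completeness, the lottery probabilities can also be obtained by counting combinations (the hypergeometric scheme that will be named below); a Python sketch:

    # Lottery "6 out of 45": probability of x hits by counting,
    # reproducing the value computed step by step above.
    from math import comb

    total, red, drawn = 45, 6, 6

    def p_hits(x):
        # x red balls among the 6 drawn, without replacement
        return comb(red, x) * comb(total - red, drawn - x) / comb(total, drawn)

    print(p_hits(6))            # 1/8145060 = 0.000000123
    print(round(p_hits(3), 5))  # probability of exactly 3 hits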
Probability distributions
A lot of problems follow the same scheme when computing probabilities,
although they are completely different in their nature. Three examples
are:
• A number of n individuals not related to each other suffer from an infection. The probability p of an asymptomatic infection is known. What is the probability that exactly x (or equal to or less than x) infections are asymptomatic?
• An experienced surgeon operates on n not related patients on different days. The probability p for a complication during the routine operation is known. What is the probability that a complication will occur in exactly x (or in x or less) patients?
• In total n not related patients suffering from mushroom poisoning after eating death cap are admitted to the Vienna General Hospital. Let p denote the probability that a patient survives such a poisoning. What is the probability that exactly x (or at least x) patients survive the poisoning?
In order to compute these probabilities it is not necessary to reinvent the wheel every time. All the questions raised above can be answered by making use of the so-called binomial distribution. The probabilities for n=18 and a “success” probability of p=0.35 are tabulated below. By a “success”, we mean “asymptomatic infection”, “complication” or “survival after poisoning”.
Question: Why do we have to pay special attention to phrasings like
“experienced surgeon”, “not related individuals/patients”, “on different
days”?
x (number of successes)    Probability    Cumulative probability
 0    0.00043       0.00043
 1    0.00416       0.00459
 2    0.01903       0.02362
 3    0.05465       0.07827
 4    0.11035       0.18862
 5    0.16638       0.35500
 6    0.19411       0.54910
 7    0.17918       0.72828
 8    0.13266       0.86094
 9    0.07937       0.94031
10    0.03846       0.97877
11    0.01506       0.99383
12    0.00473       0.99856
13    0.00118       0.99974
14    0.00023       0.99996
15    0.00003       0.99999...
16    0.00000...    0.99999...
17    0.00000...    0.99999...
18    0.00000...    1.00000
How to read this table:
• If the success probability is 35%, then the probability of exactly 7 successes in 18 trials is 17.9%.
• The probability of at most 3 successes in 18 trials is 7.83%.
• The probability of at least 10 successes in 18 trials is 5.97% (Note: computed via the complementary probability. The probability of at most 9 successes is 94.03%, thus the probability of at least 10 successes is 100% – 94.03% = 5.97%.).
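The tabulated values can be reproduced directly from the binomial probability formula P(X = x) = C(n, x) p^x (1 – p)^(n – x); a Python sketch:

    # Binomial probabilities for n = 18 trials and success
    # probability p = 0.35, reproducing the table above.
    from math import comb

    n, p = 18, 0.35

    def pmf(x):
        return comb(n, x) * p**x * (1 - p)**(n - x)

    print(round(pmf(7), 5))                              # 0.17918
    print(round(sum(pmf(x) for x in range(4)), 5))       # P(at most 3) = 0.07827
    print(round(1 - sum(pmf(x) for x in range(10)), 5))  # P(at least 10) = 0.05969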
The table can also be visualized by a diagram showing the probability
function:
[Figure: probability function of the binomial distribution with n = 18 and p = 0.35; x-axis: number of successes (0–18), y-axis: probability (0.00–0.20)]
Please note that though in the constellation of our example it is possible to yield 15 or more successes in 18 trials, this is a highly implausible event. If we had succeeded in 17 out of 18 trials, then there are three potential explanations:
a) this is a truly fortunate coincidence
b) the success probability assumed to be 35% is wrong
c) other assumptions in our computations do not hold, e. g., the trials are not independent of each other
By the way, we will revisit these considerations on various other occasions during the course. The principle of statistical testing is based on these central and basic considerations.
Cumulative probabilities are often called the distribution function in statistics. (Note for students who have already come across Kaplan-Meier curves: these are a kind of reversed distribution function.)
[Figure: distribution function of the binomial distribution with n = 18 and p = 0.35; x-axis: number of successes (0–18), y-axis: cumulative probability (0.0–1.0)]
The binomial distribution is a discrete distribution. Other discrete distributions comprise:
• Binomial distribution: how probable are x successes in n trials, keeping the success probability p constant
• Hypergeometric distribution: how probable are x successes in n trials if drawn without replacement (lottery)
• Multinomial distribution: generalization of the binomial distribution to more than two possible conditions
• Poisson distribution: describes very rare events (e. g., in epidemiology)
• Geometric distribution: how many trials n are needed to yield the first success, with constant success probability p
• Negative binomial distribution: how many trials n are needed to yield x successes, with constant success probability p
• Discrete uniform distribution: e. g., rolling a fair dice
Besides discrete distributions there is the group of continuous distributions. We can apply a continuous distribution to the sleeping duration example, in which the point probability of a particular sleeping duration is of no importance at all; nobody cares about the probability that somebody sleeps exactly 8 hours 34 minutes 17.2348234230980901 seconds. We are only interested in probabilities for intervals, e. g., what is the probability that somebody
• sleeps longer than 11 hours?
• sleeps between 8 and 9.5 hours?
With continuous distributions, there is no probability function, because on
the one hand there are so many individual points, and on the other hand
each of these points is assigned an infinitely small probability (“virtually
zero”). Moreover, only probabilities for intervals are of interest (see
above).
Note: The transition from discrete to continuous distributions is one small
step for us. We do not have to bother about the fact that it is one giant
leap for mathematicians dealing with measure theory.
The analogue to the probability function with continuous distributions is
the so-called density function. Probabilities assigned to intervals are
derived by computing the area under the density function. The cumulative
density function is again called distribution function. The following
diagrams show the distribution and density functions for the most
important continuous distribution, the normal distribution.
[Figure: distribution function of the normal (Gaussian) distribution]
[Figure: density function of the normal (Gaussian) distribution]
The normal distribution is sometimes also called Gaussian distribution
(Carl Friedrich Gauß, a German mathematician, 1777-1855).
There are several other distributions besides the normal distribution. Among them the best-known are the t-distribution, the chi-square distribution, the F-distribution, the gamma distribution, the lognormal distribution, the exponential distribution, the Weibull distribution, the continuous uniform distribution and the beta distribution. Each of them has (similarly to the discrete distributions) its classical applications. Later, we will make use of the t-, chi-square- and F-distributions.
Why is the normal distribution of special importance?
• The results of many experiments in biological science are – at least approximately – normally distributed.
• The distribution of the mean and of the sum of random variables approaches the normal distribution with increasing sample size (the central limit theorem).
• Other known distributions stem from the normal distribution. As an example, the chi-square distribution can be seen as the distribution of the sum of squared normally distributed random variables.
Note: the denomination “normal distribution” doesn’t mean that this
distribution is the normal case, the most common distribution, or a kind of
standard.
Relationship between probability theory and descriptive statistics
Density function and probability function are strongly related to histogram
and bar chart, respectively. While these functions describe theoretical
thoughts or the population, histogram and bar chart are used to describe
sample results.
The following already known concepts can be used for all theoretical distributions, irrespective of whether they are discrete or continuous:
• Expectation: this is the theoretical mean, the mean of the population, the “true” mean
• Theoretical variance: the expected mean squared deviation from the expectation, the “true” variance
• These considerations also hold for statistics like the median, other quantiles, the skewness, etc.
The most important difference between an expectation and an empirically
determined mean is the following: the expectation is the true value in a
population, it is fixed and immovable. The empirical mean is a value
computed from random results (from a sample), thus it is a consequence
of randomness and possesses a probability distribution which can be
studied.
The same holds for the comparison of all other theoretical and empirical
statistical measures.
The law of large numbers and the central limit theorem
Now let’s consider two important basics of probability theory which
concern the mean computed from a sample. To keep it simple, let’s
assume the sample consists of independent identically distributed random
numbers.
The law of large numbers: The empirical mean computed from a
sample approaches the expectation (the population mean) with increasing
sample size.
By the way, it should be noted that the theoretical standard deviation of the empirical mean decreases with increasing sample size n (exactly by a factor of √n).
Intuitively, the law of large numbers is easy to understand: the more we
measure, i. e., the larger the sample, the more exact and precise are our
results.
Example: In a scientific meeting two high-quality studies concerning the
same topic are presented. The only difference lies in the sample size:
while in study A 30 patients had been recruited, study B is based on 200
patients. By nature we will consider the results of study B as more
plausible than those of study A, just because it is more probable that the
larger study B is closer to the truth than study A.
The central limit theorem: The distribution of the empirical mean
converges to a normal distribution with increasing sample size.
This statement is very useful for various statistical methods. However, it is
difficult to follow intuitively. For students interested in the topic it is
recommended to verify the central limit theorem by dice experiments or
computer simulation. There are some Java-applets around demonstrating
the central limit theorem:
http://medweb.uni-muenster.de/institute/imib/lehre/skripte/biomathe/bio/grza2.html
http://statistik.wu-wien.ac.at/mathstat/hatz/vo/applets/cenlimit/cenlim.html
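Alternatively, a few lines of code suffice for such a simulation; a Python sketch (illustrative; it draws from a skewed exponential distribution):

    # Simulation sketch of the central limit theorem: means of n
    # exponentially distributed values (a skewed distribution)
    # become approximately normally distributed as n grows.
    import random
    import statistics

    random.seed(1)

    def sample_means(n, repetitions=10_000):
        return [statistics.mean([random.expovariate(1.0) for _ in range(n)])
                for _ in range(repetitions)]

    for n in (1, 5, 30):
        means = sample_means(n)
        print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
    # The spread of the means shrinks roughly by a factor of sqrt(n),
    # and a histogram of `means` looks increasingly bell-shaped.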
Summary
Probability theory enables us to draw conclusions about samples, given we
know the truth (the population characteristics). And we are able to
evaluate “what if” scenarios. We will often come back to that possibility
later in the course.
Our next goal
In empirical clinical research we don’t know the truth (the population),
otherwise we wouldn’t have to do research. Usually only a sample limited
in size is available, which is used to draw conclusions about the
characteristics of the population. Since the sample size is limited we
cannot draw conclusions with absolute certainty, but we can make use of
probability theory and “what-if” scenarios to quantify the inevitably
inherent uncertainty.
3.3. Exercises
3.3.1. Consider a delivery of goulash tins.
Why doesn’t a complete examination of all goulash tins make
sense?
Name medical examples where a complete examination is not
reasonable.
3.3.2. Assume that among all goulash tins there are exactly five with
spoilt contents. A sample of 20 is drawn, among which the five
spoilt tins are found. Is this sample still representative for the
population of all goulash tins?
3.3.3. Name some conditions under which the patient selection in a
clinical study yields a representative sample.
3.3.4. Sometimes probabilities are indicated as odds. In contrast to the
classical definition of probabilities, where a ratio of “favorable”
and “possible” events is computed, odds are defined as the ratio
of “favorable” and “unfavorable” events. Thus, odds can assume
any numbers greater than or equal to zero, not bounded above.
As an example, consider the odds for a male birth: they are
0.514/0.486 = 1.0576.
Compute the odds for a female birth.
Assume that prior to the beginning of the European Soccer
Championship 2008 in Austria and Switzerland, the odds that
Switzerland will win the games is 2:7. What is the probability of a
European Soccer Champion Switzerland?
3.3.5. A newly developed diagnostic test for HIV discovers 98% of the
actually infected patients. On the other hand, it will also classify
5% of persons not infected by HIV as infected.
Compute sensitivity and specificity of this test.
Compute the positive and the negative predictive value of the
test.
3.3.6. Are two mutually exclusive events in general independent of each other? (Try to solve this by finding an appropriate example.)
3.3.7. During a scientific meeting a visit to a casino is offered. At the roulette table you meet an old friend who seems to have been lingering there for some time. He excitedly reports that the roulette ball has landed on a red number eight times in a row. Therefore, he suggests betting on “black”, because it is much more probable now. He does so and the ball really lands on a black number. What do you think about that?
3.3.8. (from Gigerenzer, 2002) A Bavarian minister of the interior spoke out on the dangers of drug abuse and explained that, since most heroin addicts have used marihuana before, it follows that most marihuana users will become heroin addicts later. Appraise the conclusion of the Bavarian minister of the interior from a probability theory point of view. (Instructions: Define the corresponding events and identify the conditional probabilities used in the conclusion. Use Venn diagrams.)
3.3.9. The course of infectious diseases may be asymptomatic. Assume
the probability of an asymptomatic course of a particular disease
is 40%. Further assume that three persons have been infected.
What is the probability for
(a) three asymptomatic infections,
(b) two asymptomatic infections and one symptomatic infection,
(c) one asymptomatic infection and two symptomatic infections,
(d) three symptomatic infections?
Assume you happen to know that the three infected persons are
closely related. Could this information affect the validity of the
computations of (a) to (d)?
3.3.10. Assume the Hardy-Weinberg-conditions apply.
Compute the phenotype probabilities of the blood groups in a
Caucasian population.
What is the probability that in this population two randomly
selected persons have the same blood group (the same
phenotype)?
What is the probability that in this population a biologic pair of
parents (both with phenotype A) have an offspring with
phenotype O?
What is the probability that in this population a biologic pair of
parents (both with phenotype O) have an offspring with
phenotype A?
3.3.11. What are the probabilities in a lottery with 45 numbers (e. g.,
“Lotto 6 aus 45”) to have 0, 1, 2, 3, 4, 5, 6 out of 6? Also
compute the probability for 5 out of 6 considering the bonus ball
(“Zusatzzahl”; this means that you bet on 5 of the 6 winning
numbers and on the bonus ball).
3.3.12. Assume that two therapies A (the standard treatment) and B (a
new treatment) exist to treat a particular type of cold. Treatment
by A yields a success rate of pA=24%, and treatment by B yields
pB=26%. The number needed to treat (NNT) to describe an
advantage of B over A is defined as follows:
NNT = 1/(pB-pA) = 1/(0.26-0.24) = 1/0.02 = 50
(Please note that percentages have to be given as proportions!)
Can you interpret this result? What would happen if pA=26% and
pB=24%?
Compute the NNT for pA=3.5% and pB=16.1%. Interpret the
result.
Assume treatment A would always and B never lead to a success.
Compute and interpret the NNT.
Assume both treatments have the same success rate. Compute
and interpret the NNT.
A colleague tells you that the NNT of Acetylsalicylic acid
(Aspirin) in treating headache is 4.2 patients. Are you contented
with this proposition?
Chapter 4
Statistical Tests I
4.1. Principle of statistical tests
Example 4.1.1.: (scale variable)
Animal experiment of Dr. X: two groups of female rats receive food with high
and low protein
Research question: Is there a difference between the groups in weight gain (from an age of 28 to 84 days)?
Result (weight gain in grams):
Group 1 (high protein):
134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123
Group 2 (low protein):
70, 118, 101, 85, 107, 132, 94
Analysis from Dr. X:
mean weight gain in group 1 (high protein): 120 g
mean weight gain in group 2 (low protein): 101 g
Conclusion from Dr. X:
High protein in the food of young female rats results in higher
weight gain than food with low protein.
(Generalization from the sample to the population!)
Objection of colleague Y:
The differences are due to CHANCE alone! In reality both diets cause identical weight gains. The conclusion above is worthless.
On the one hand:
Colleague Y could be right! Are the results pure chance?
On the other hand:
Every conclusion can be doubted with the argument “everything is chance”!
What to do?
Conclusion:
If we want to come to a decision, then the possibility of a false positive answer has to be accepted! Statisticians call this case a type I error or α-error.
Concretely, in our example: If we draw the conclusion that food with different protein contents leads to different weight gains, then this decision could be wrong. This is an inevitable fact. In other words: If we come to a positive answer in an empirical study, then we have to live with the uncertainty that it could be a false positive answer!
So, clever colleagues could declare every unimportant result to be empirical proof of their research hypothesis and counter criticism with “I always accept the possibility of a false positive answer!”. Consequently, one has to conclude that a potential false-positive decision should not be made arbitrarily often, nor at one’s own discretion.
If there is the possibility of a false-positive answer, then such an answer should
• be rare
• be made only in situations where the “pure chance” argument seems implausible given the observed data.
The considerations above can be formalized, and the result is a statistical test.
Principle of a statistical test:
• Define a null hypothesis, usually: “There is no difference!”
In example 4.1.1 the null hypothesis corresponds to the view of the critical colleague Y.
Remark: The null hypothesis is often the negation of the research question.
• Alternative hypothesis: “There is a difference!”
Of course, this is a very broad statement. But initially we are satisfied with this.
• Assume the null hypothesis applies. Calculate the probability to observe the actual (or a more extreme) result.
This probability is called the p-value. Nowadays, it is calculated nearly exclusively by computer programs.
Remember: the p-value does not correspond to the probability that the null hypothesis is true. The p-value can rather be seen as a conditional probability, where the condition is the assumed validity of the null hypothesis.
Remark: A mathematical statistician will object to the statement “the p-value is a conditional probability” in the same way as a physician objects to the description of a broken fibula as a “broken foot”. Both statements are acceptable in a broad sense for laypeople, but incorrect as professional terminology.
Only for interested people (and only for them): The p-value is based on a restriction of the parameter space and not on a restriction of the sample space (as would be necessary for the definition of a conditional probability).
• If the calculated p-value is smaller than or equal to 0.05, then the null hypothesis is rejected and the result is called “statistically significant”.
The limit of 5% is called the significance level, and often the Greek letter α is used as its abbreviation.
Remark: Every value between 0 and 1 could theoretically be used as significance level, but only small values like 0.01, 0.05 or 0.10 are useful. In medicine, the value 0.05 has been established as a standard.
A statistical test corresponds to a “what-if” scenario:
1. What if the null hypothesis was true?
2. Can our data be explained in a plausible way?
3. The p-value measures the degree of plausibility.
4. If the p-value is “small”, then the null hypothesis is not a plausible explanation for our observed data.
5. There “must be” another explanation (alternative hypothesis!) and we “reject” the null hypothesis.
4.2. t-test
Back to the rat diet example 4.1.1: If the p-value were smaller than 5%, then Dr. X could claim that the variation in the protein diet may lead to differences in the weight gain. The objection of the sceptical colleague Y would be disproved.
Remark: In reality colleague Y could still be right, but Dr. X could back up his observed differences with a significant result.
The statements above are in the subjunctive, as they would only be valid in the case of a significant result. How can we now detect a potential significant difference in the rat diet example? One possibility is offered by the t-test for independent samples (also called the unpaired t-test).
The null hypothesis is:
“No difference in weight gain between the two protein diets.”
Calculation
Assume that the null hypothesis is true. Calculate the probability to
observe the current or a more extreme result.
What is a “more extreme result”?
Every result farther away from the null hypothesis than the current result.
Wanted:
We need an intuitive measure to detect deviations from the null
hypothesis.
A suitable measure appears to be the difference of the means between the two groups. The bigger the difference between the two means, the more extreme is the result. Or in other words: the further we are away from the null hypothesis.
Concretely, for example 4.1.1: Consider that the two diets do not cause unequal weight gains. Calculate the probability to observe an absolute mean difference in the weights of 19 grams or more by chance (both directions are interesting – two-sided test).
If
• the data approximately follow a normal distribution, and
• the standard deviations within the groups are approximately equal,
then the mean difference divided by its estimated variation (“standard error”) follows a t-distribution under the null hypothesis. This quotient is also named the test statistic.
Why is the observed difference in means divided by its variation? First, to obtain a general measure without dimension, and second, to take into account that the relative importance of a difference of, e.g., 19 grams depends on the population variation.
If the conditions mentioned above apply then we can use the t-distribution
to calculate the p-value. Nowadays, this is nearly exclusively performed
by means of computer programs.
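Readers who prefer a scripting language to SPSS can reproduce the computation, e.g., with Python’s scipy; a sketch (assuming equal variances, matching the SPSS output shown below):

    # Unpaired t-test for the rat diet example 4.1.1.
    from scipy import stats

    high_protein = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
    low_protein = [70, 118, 101, 85, 107, 132, 94]

    t, p = stats.ttest_ind(high_protein, low_protein, equal_var=True)
    print(round(t, 3), round(p, 3))  # t = 1.891, p = 0.076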
Use of SPSS
Now we can use SPSS to analyze the example.
I.) First, the data have to be entered.
II.) Then we visualize the situation by graphical and/or other descriptive tools.
III.) By means of the results from II) we verify the assumptions of the intended test procedure.
IV.) Only after having verified the assumptions do we calculate the test.
V.) We interpret the results.
ad I.) For the rat diet example 4.1.1. we generate two variables, GROUP
and GAIN:
GROUP is a binary variable to discriminate the protein content in the food
of the rats (1=high protein, 2=low protein)
GAIN is a scale variable containing the weight gain in grams
ad II.) Boxplots are used for visualization.
ad III.) In both groups, the weight gain is symmetrically distributed
without any outliers. The spread (standard deviations) is
approximately equal in both groups. The usage of the planned t-test
is justified.
ad IV.) To compute the t-test we choose the menu
Analyze
Compare Means
Independent-Samples T Test...
We move the variable GAIN to the field
Test Variable(s):
and the variable GROUP to the field
Grouping Variable:
Beneath the GROUP variable two question marks appear. We click on the
button
Define Groups...
and enter the value 1 into the field
Group 1:
and the value 2 into the field
Group 2:
These are the group codes discriminating low and high protein groups.
Then we click on
Continue
and
OK
to calculate the requested t-test. First we receive the descriptive
measures per group.
Group Statistics

Weight gain (day 28 to 84)
Dietary group       N    Mean      Std. Deviation   Std. Error Mean
high protein diet   12   120.000   21.3882          6.1742
low protein diet     7   101.000   20.6236          7.7950
The test result is shown in a very wide and thus confusing table, which
has been split into three parts here for easier understanding. Important
parts have been highlighted by a grey background.
Independent Samples Test

                                     Levene's Test for Equality of Variances
                                     F       Sig.
Weight gain        Equal variances
(day 28 to 84)     assumed           .015    .905
                   Equal variances
                   not assumed
                                     t-test for Equality of Means
                                     t       df       Sig. (2-tailed)   Mean Difference
Weight gain        Equal variances
(day 28 to 84)     assumed           1.891   17       .076              19.0000
                   Equal variances
                   not assumed       1.911   13.082   .078              19.0000
                                     Std. Error   95% Confidence Interval
                                     Difference   of the Difference
                                                  Lower      Upper
Weight gain        Equal variances
(day 28 to 84)     assumed           10.0453      -2.1937    40.1937
                   Equal variances
                   not assumed        9.9440      -2.4691    40.4691
The "Mean Difference" of 19 divided by its variation ("standard error
difference") of 10.0453 results in the observed test statistic value "t" of
1.891. As the test statistic follows a t-distribution with 17 degrees of
freedom (“df”) under the null hypothesis, the p-value (“Sig. (2-tailed)“)
calculates to 0.076.
ad V.) Using the common significance level of 5 % it follows that the
null hypothesis of equality of the means cannot be rejected. Dr. X cannot
invalidate the „pure chance“ objection of Dr. Y!
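For readers who want to cross-check the SPSS output outside of SPSS, here is a minimal sketch in Python (assuming the scipy package; the weight gains are those listed in the rank table in section 4.3):

from scipy import stats

# Weight gains (grams) as listed in the rank table in section 4.3
high = [83, 97, 104, 107, 113, 119, 123, 124, 129, 134, 146, 161]  # high protein, n=12
low = [70, 85, 94, 101, 107, 118, 132]                             # low protein, n=7

# Unpaired t-test assuming equal variances (the first row of the SPSS table)
t, p = stats.ttest_ind(high, low, equal_var=True)
print(f"t = {t:.3f}, p = {p:.3f}")  # expected: t = 1.891, p = 0.076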
4.3. Wilcoxon rank-sum test
The Wilcoxon rank-sum test is a non-parametric test. It is equivalent to
the Mann-Whitney U-test. Thus, it is also often called the
Wilcoxon-Mann-Whitney U-test. We will get acquainted with this test by
applying it to the data of example 4.1.1.
The null hypothesis is: „No difference in weight gain between the two
protein diets."
Calculation: Consider the null hypothesis is true. Calculate the probability
to observe the actual result (or a more extreme result than the actual
one).
Wanted: a measure by which „extreme results“ can be defined, e.g. the
rank-sum of the smaller group (this is group 2 with only 7 rats for the rat
diet example).
Excursion: Calculation of the rank-sums

rank                   weight gain   ranks of group 1   ranks of group 2
(smallest value) 1      70                              1
2                       83           2
3                       85                              3
4                       94                              4
5                       97           5
6                      101                              6
7                      104           7
8.5                    107           8.5
8.5                    107                              8.5
10                     113           10
11                     118                              11
12                     119           12
13                     123           13
14                     124           14
15                     129           15
16                     132                              16
17                     134           17
18                     146           18
(highest value) 19     161           19
                       Σ 190         Σ 140.5            Σ 49.5
What is a more extreme result?
The total rank-sum is 190.
The mean rank per rat is 10.
The mean rank-sum of 7 rats should be 70.
The greater the deviation from 70, the more extreme is the result.
Concretely:
Assume there is no difference between the diet groups and calculate the
probability to observe a rank-sum for 7 rats of 49.5 or smaller or of 90.5
(= 70 + (70-49.5)) or higher (again a two-sided test).
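The mid-ranks and rank sums from the excursion can be reproduced with a few lines of Python (a sketch assuming scipy; rankdata assigns mid-ranks to ties, so both rats with 107 grams receive rank 8.5):

import numpy as np
from scipy.stats import rankdata

high = np.array([83, 97, 104, 107, 113, 119, 123, 124, 129, 134, 146, 161])
low = np.array([70, 85, 94, 101, 107, 118, 132])

ranks = rankdata(np.concatenate([high, low]))  # mid-ranks for the pooled sample
print(ranks[:len(high)].sum())  # rank sum of group 1: 140.5
print(ranks[len(high):].sum())  # rank sum of group 2: 49.5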
To perform the Wilcoxon-Mann-Whitney U-test with SPSS, we have to
click on
Analyze
Nonparametric Tests
2 Independent Samples...
We move the variable GAIN to the field
Test Variable List:
and the variable GROUP to the field
Grouping Variable:
Besides the GROUP variable two question marks appear. We click on the
button
Define Groups...
and enter the value 1 into the field
Group 1:
and the value 2 into the field
Group 2:
These are the group codes discriminating low and high protein groups.
Then we click on
Continue
In the box
Test Type
we choose Mann-Whitney U. Finally we click on
OK
and the requested Wilcoxon-Mann-Whitney U-Test will be calculated.
First, we receive descriptive measures of the ranks per group.
Ranks

Weight gain (day 28 to 84)
Dietary group       N    Mean Rank   Sum of Ranks
high protein diet   12   11.71       140.50
low protein diet     7    7.07        49.50
Total               19
The next table contains the test results. Again, important parts have a
grey background.
Test Statistics(b)

                                 Weight gain (day 28 to 84)
Mann-Whitney U                   21.500
Wilcoxon W                       49.500
Z                                -1.733
Asymp. Sig. (2-tailed)           .083
Exact Sig. [2*(1-tailed Sig.)]   .083(a)

a Not corrected for ties.
b Grouping Variable: Dietary group
The calculated p-value („Asymp. Sig. (2-tailed)“) of 0.083 indicates no
significant difference, similarly to the t-test.
The advantage of the Wilcoxon-Mann-Whitney U-test is that it can be
used even if the assumptions of the t-test do not apply. On the other
hand, if the assumptions of the t-test are valid (as in the rat diet
example), then the t-test should be preferred to the Wilcoxon-Mann-Whitney
U-test. A simple decision rule is: if means properly describe the
location of the distribution in both groups, then the use of the t-test is
justifiable, as it is based on the comparison of the means.
A rule of thumb warns of problems with the Wilcoxon-Mann-Whitney
U-test: if one or both groups have fewer than 10 observations, then the
asymptotic p-values („Asymp. Sig. (2-tailed)") can become imprecise. In
such situations, it is recommended to calculate "exact" p-values by using
the permutation distribution.
This rule of thumb is relevant for our example, as the smaller group
contains only 7 rats.
SPSS calculates the exact p-value for the Wilcoxon-Mann-Whitney U-test
automatically („Exact Sig. [2*(1-tailed Sig.)]"). In our example the exact
p-value of 0.083 is identical to the asymptotic p-value.
However, the calculation of exact p-values is computationally difficult. The
exact p-value calculated by SPSS in its standard version is not quite
correct in the case of ties in the data. SPSS explicitly refers to this
circumstance in footnote "a".
Remark: The statement that “the exact p-value is not quite correct” is –
admittedly – puzzling. This is due to the fact that the term “exact” only
refers to the use of the permutation distribution to compute the p-value.
Obviously, the term “exact” does not refer to the accurateness of this
computation.
As there are ties in our example (the value of 107 grams appears in
both groups), the question arises how to calculate a correct exact
p-value. To perform a correct exact Wilcoxon-Mann-Whitney U-test with
SPSS, we have to click on
Analyze
Nonparametric Tests
2 Independent Samples...
And fill in the fields as before. Click on the button Exact... and choose
the option Exact.
Test Statistics(b)

                                 Weight gain (day 28 to 84)
Mann-Whitney U                   21.500
Wilcoxon W                       49.500
Z                                -1.733
Asymp. Sig. (2-tailed)           .083
Exact Sig. [2*(1-tailed Sig.)]   .083(a)
Exact Sig. (2-tailed)            .087
Exact Sig. (1-tailed)            .044
Point Probability                .004

a. Not corrected for ties.
b. Grouping Variable: Dietary group
With this option the exact p-value („Exact Sig. (2-tailed)") calculates to
0.087. This p-value differs only slightly from the asymptotic p-value.
Remark: An exact p-value is often the only option in case of small sample
sizes. Unfortunately, the term "exact" suggests that this is a more
"preferable" p-value. A disadvantage of exact tests is thereby often
overlooked: they are in general conservative, i.e. the specified error
probability (of e.g. 5 per cent) is often not completely utilized (loosely
speaking, "exact p-values are often too large").
So, in the case of the Wilcoxon rank sum test: if both groups have at least
10 observations, then we can use the asymptotic p-values.
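Again as a hedged cross-check outside SPSS, a minimal Python sketch (assuming a recent scipy version). SPSS computes its asymptotic p-value without a continuity correction, so we switch that off; note that scipy reports the U statistic of the first sample (62.5 = 12·7 − 21.5), while SPSS reports the smaller of the two U values, but the p-value is the same:

from scipy.stats import mannwhitneyu

high = [83, 97, 104, 107, 113, 119, 123, 124, 129, 134, 146, 161]
low = [70, 85, 94, 101, 107, 118, 132]

# SPSS's "Asymp. Sig." is a normal approximation without continuity correction
u, p = mannwhitneyu(high, low, use_continuity=False,
                    alternative="two-sided", method="asymptotic")
print(f"U = {u}, p = {p:.3f}")  # U refers to the first sample (62.5); p about 0.083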
4.4. Exercises
4.4.1.
The 24-hour total energy consumption (in MJ/day) was
determined for 13 lean and 9 severely overweight women.
Values for the lean women:
6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11,
8.40, 10.15, 10.88
Values for the severely overweight women:
8.79, 9.19, 9.21, 9.68, 9.69, 9.97, 11.51, 11.85, 12.79
a.) Is there a difference in energy consumption between both
groups?
b.) You can find the study data in the data file b4_4_1.sav. An
error occurred when the data file was set up. Which one?
4.4.2.
You can find the data for the rat diet example in the data file
b4_1_1.sav. Change the highest value in group 1 to 450
grams.
(a) Perform a t-test
(b) Perform a Wilcoxon-Mann-Whitney U-test
(c) Compare and comment on the results
4.4.3.
Would it be useful to determine several characteristics of the
participants of this lecture (e.g. height in cm, age in years,
already graduated yes/no) and then to test differences between
men and women?
Chapter 5
Statistical Tests II
5.1. More about independent samples t-test
Revisiting example 4.1.1.: (scale variable)
Animal experiment of Dr. X: Two groups of female rats receive food with high
and low protein, respectively.
Research question: Are there differences in weight gain (from day 28 to day 84)?
The boxplot shows similar spread in both groups:
If we calculate the t-test with SPSS then the first table that is shown in
the output is a table with descriptive measures. There we can see that
both groups have very similar variation (standard deviations of 21.39 and
20.62).
Group Statistics

Weight gain (day 28 to 84)
Dietary group       N    Mean      Std. Deviation   Std. Error Mean
high protein diet   12   120.000   21.3882          6.1742
low protein diet     7   101.000   20.6236          7.7950
Interestingly, the result of the t-test is given in two rows, labelled „Equal
variances assumed“ and “Equal variances not assumed”. As the standard
deviations (and thus also the variances) are very similar in our example,
we assume that differences (in standard deviations) are due only to
chance and use the result of the first row.
Independent Samples Test

                                     t-test for Equality of Means
                                     t       df       Sig. (2-tailed)   Mean Difference
Weight gain        Equal variances
(day 28 to 84)     assumed           1.891   17       .076              19.0000
                   Equal variances
                   not assumed       1.911   13.082   .078              19.0000

                                     Std. Error   95% Confidence Interval
                                     Difference   of the Difference
                                                  Lower      Upper
Weight gain        Equal variances
(day 28 to 84)     assumed           10.0453      -2.1937    40.1937
                   Equal variances
                   not assumed        9.9440      -2.4691    40.4691
Why do we have to differentiate between a t-test with
equal and a t-test with unequal variances?
We already know: the „mean difference” of 19 divided by its standard
error of 10.045 results in the observed test statistic "t" of 1.891. If the
null hypothesis is valid then this test statistic follows a t-distribution with
17 degrees of freedom ("df"). Using this information we can calculate the
p-value.
However, this is only valid in case of equal variances in the
underlying populations.
In case of unequal variances there is a problem. This has been known
for a very long time and is called the Behrens-Fisher problem. Briefly, the
variation of the mean difference („standard error of the difference") and
the degrees of freedom ("df") for the t-distribution have to be corrected.
Thus we obtain a different p-value.
When should we use the t-test for equal variances and when should we
use the t-test for unequal variances?
In principle, the t-test is robust against small to moderate deviations
from its assumptions. A very rough rule of thumb for the question above
states that if the standard deviations differ by a factor of 3 or more, then
the version of the t-test for unequal variances should be used. However,
the problem of equal/unequal variances strongly depends on the sample
size. For large sample sizes it becomes more and more negligible.
Besides this imprecise rule of thumb there is another option. Some of you
may have thought: couldn’t variances be tested for equality using a
statistical test?
Yes, they could. SPSS uses the so-called Levene test for equality of
variances.
Null hypothesis: the two samples are drawn from populations having
equal variances, or from the same population
Alternative hypothesis: the variances are unequal
We already know:
A sensible measure to determine the deviation from the null hypothesis is
required. SPSS uses the Levene-statistic (we will not discuss it in detail).
Under the condition that the null hypothesis (equality of variances) is true,
the probability can be calculated to observe the actual value or a more
extreme value of the Levene-statistic. This probability is the p-value.
SPSS automatically performs the Levene-test together with the t-test.
Independent Samples Test

                                     Levene's Test for Equality of Variances
                                     F       Sig.
Weight gain        Equal variances
(day 28 to 84)     assumed           .015    .905
                   Equal variances
                   not assumed
For the Levene-test the p-value calculates to 0.9 for the rat diet example.
The null hypothesis of equality of variances cannot be rejected, but this
does not necessarily mean that the variances of the underlying
populations are really equal.
This is the big disadvantage of this test:
In case of small sample sizes, where the test result would be very
important for us, even huge differences in the variances may result in a
non-significant test. In case of large sample sizes, on the other hand,
where different variances no longer cause a problem, even very small and
unimportant differences in the variances may become statistically
significant.
Thus: Use graphs to visualize the data.
If there is uncertainty about the equality of the variances, then use the
t-test for unequal variances.
Remark: Other software packages (also earlier versions of SPSS) often
use an F-test instead of the Levene test to test equality of
variances. The F-test would give a p-value of 0.9788 for the rat diet
example 4.1.1. For this F-test the argument mentioned above
also applies.
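As a sketch of how both checks might be reproduced in Python (assuming scipy; note that scipy's levene() defaults to median-centering, the Brown-Forsythe variant, whereas SPSS centers at the mean):

from scipy import stats

high = [83, 97, 104, 107, 113, 119, 123, 124, 129, 134, 146, 161]
low = [70, 85, 94, 101, 107, 118, 132]

# Levene test with mean-centering, as in SPSS
f_lev, p_lev = stats.levene(high, low, center="mean")
print(f"Levene: F = {f_lev:.3f}, p = {p_lev:.3f}")  # expected: F = .015, p = .905

# Welch t-test ("Equal variances not assumed")
t, p = stats.ttest_ind(high, low, equal_var=False)
print(f"Welch: t = {t:.3f}, p = {p:.3f}")  # expected: t = 1.911, p = 0.078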
What if the data in the groups are not normally
distributed?
We know that in this case an important assumption for the t-test is not
valid.
The Wilcoxon-Mann-Whitney U-test offers a useful alternative.
We should not abandon the t-test too quickly, particularly if the data can
easily be transformed to approximate a normal distribution. E.g., a
logarithmic transformation often turns right-skewed distributions into
approximately normal ones.
But caution!
• Data transformations change the interpretation of the results.
• Data transformations are not always successful.
Association between independent samples t-test and
analysis of variance
This is an anticipation of a later chapter:
In many cases we would like to compare not only two but more than two
groups. This can be performed by an analysis of variance (abbreviated
ANOVA).
The t-test is the simplest version of the one-factorial (=one-way) analysis
of variance.
• The "factor" is an interesting characteristic measured on a nominal
scale (e. g., a factor defining experimental groups).
• "One-factorial" means that we are only interested in one
characteristic.
For illustration, we revisit the rat diet example 4.1.1. In order to perform
a one-factorial (=one-way) ANOVA, we click on
Analyze
Compare Means
One-Way ANOVA
We move the variable GAIN to the field
Dependent List:
and the variable GROUP to the field
Factor
Notice: This time no question marks appear. This method is not restricted
to two groups.
We click on
OK
and the requested one-way ANOVA will be computed.
ANOVA

Weight gain (day 28 to 84)
                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups   1596.000          1   1596.000      3.578   .076
Within Groups    7584.000         17    446.118
Total            9180.000         18
Besides incomprehensible items, we can find others which are already
known. The p-value ("Sig.") of 0.076 is the same as the p-value of the
t-test. Also the degrees of freedom ("df") with a value of 17 are familiar to
us. If we take the square root of the "F" value of 3.578, we obtain the
value of the test statistic "t" from the t-test, namely 1.891.
Thus, an ANOVA with a factor with two levels corresponds to the
independent samples t-test with equal variances.
We can consider an analysis of variance as generalization of the t-test.
Note that the same problems observed for the t-test also apply to ANOVA!
(Remark: And others come along …)
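The correspondence between the two procedures is easy to verify in Python (a sketch assuming scipy):

import math
from scipy import stats

high = [83, 97, 104, 107, 113, 119, 123, 124, 129, 134, 146, 161]
low = [70, 85, 94, 101, 107, 118, 132]

f, p = stats.f_oneway(high, low)        # one-way ANOVA with two groups
print(f"F = {f:.3f}, p = {p:.3f}")      # expected: F = 3.578, p = 0.076
print(f"sqrt(F) = {math.sqrt(f):.3f}")  # expected: 1.891, the t statistic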
5.2. Chi-Square Test
Example 5.2.1.: (binary outcome)
Therapy comparison by Dr. X: standard therapy will be compared with new
therapy, the outcome is binary (cured versus not cured)
Research question: Is there a difference in the cure rates between therapies?
                   cured
                   yes   no
standard therapy    4    12    16
new therapy         9     9    18
                   13    21    34

standard therapy: 25 % success
new therapy: 50 % success
Conclusion of Dr. X:
The new therapy is better than the standard therapy!
Claim of colleague Y
(obviously an unpleasant and jealous person):
This result is due to pure chance, in reality the success rates are equal in
both therapy groups!
Again, the question arises: What now? Obviously we have to use a
statistical test again. But the t-test and the Wilcoxon rank-sum test are
not appropriate here, as the outcome is binary.
Instead, the Chi-squared test can be used to compare two groups with
binary outcomes.
Null hypothesis:
„No difference in the cure rates between the two groups“
Calculation:
Assume that the null hypothesis is true. Calculate the probability to
observe the actual result (or even a more extreme result).
Wanted:
An intuitive measure to determine „more extreme results" is needed, e.g.
Pearson's Chi-square criterion. (Basic idea: use the squared differences
between the actually observed result and the result expected if the null
hypothesis were true.)
If the null hypothesis applies:
In total, there are 13 cured out of 34 cases, or 38.2 %,
so (if the null hypothesis were true):
• 38.2 % expected cured of 16 cases under standard therapy are 6.1 cured
• 38.2 % expected cured of 18 cases under new therapy are 6.9 cured
expected under null hypothesis
                   cured
                   yes    no
standard therapy   6.1     9.9   16
new therapy        6.9    11.1   18
                  13     21      34
The observed values are contrasted with the expected values under the
null hypothesis by the Pearson Chi-square criterion.
                   cured
                   yes         no
standard therapy   4   (6.1)   12   (9.9)
new therapy        9   (6.9)    9  (11.1)

Table: The observed values are given in each cell, with the expected
values under the null hypothesis in parentheses.
The Pearson Chi-square criterion is calculated as follows:

(4 − 6.1)²/6.1 + (12 − 9.9)²/9.9 + (9 − 6.9)²/6.9 + (9 − 11.1)²/11.1 = 2.2
Idea behind:
1.) observed minus expected (the bigger the differences the worse the
agreement with the null hypothesis).
2.) square the differences (so that bigger differences are „penalized“ more).
3.) scale with expected values (so that differences at a high level count less
than the same differences at a low level).
Example to illustrate the idea behind: A difference of 7 with an expected
value of 1000 is relatively small compared to a difference of 7 with an
expected value of 20.
4.) Sum over all four cells.
What is an extreme result? The higher the value of the Chi-square
criterion, the more extreme the result.
Concretely:
Assume there is no difference between therapies. Calculate the probability
to observe a Chi-square value of 2.2 or higher.
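A minimal Python sketch of this computation (assuming scipy; correction=False suppresses the Yates continuity correction, so that the plain Pearson statistic is returned):

from scipy.stats import chi2_contingency

observed = [[4, 12],   # standard therapy: cured yes / no
            [9, 9]]    # new therapy:      cured yes / no

chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(expected)                                      # about 6.1, 9.9, 6.9, 11.1
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p:.3f}")  # about 2.242, 1, 0.134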
To calculate the example 5.2.1 with SPSS, we create two binary variables,
THERAPY and CURED:
• THERAPY differentiates between the therapy groups (0=standard therapy,
1=new therapy)
• CURED contains the outcome, i.e. whether a cure has occurred (0=no, 1=yes)
To perform the Chi-squared test we click on
Analyze
Descriptive Statistics
Crosstabs...
We move the variable THERAPY to the field
Row(s):
and the variable CURED to the field
Column(s):
We click on the button
Statistics...
and choose Chi-square.
Then we click on
Continue
and
OK
and the requested Chi-square test will be calculated. First, information
about the number of valid cases is given in the output.
Case Processing Summary

                         Cases
                         Valid         Missing       Total
                         N   Percent   N   Percent   N   Percent
Therapy * cure chances   34  100.0%    0   .0%       34  100.0%
Then the 2×2 table is shown, in which therapy and cure are cross-tabulated.
Therapy * cure chances Crosstabulation
Count
                     cured
                     no    yes   Total
Therapy   standard   12     4    16
          new         9     9    18
Total                21    13    34
The last table contains the requested results of the Chi-squared test.
Important parts have a grey background.
Chi-Square Tests

                               Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             2.242(a)    1   .134
Continuity Correction(b)       1.308       1   .253
Likelihood Ratio               2.286       1   .131
Fisher's Exact Test                                          .172         .140
Linear-by-Linear Association   2.176       1   .126
N of Valid Cases               34

a 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.12.
b Computed only for a 2x2 table
The Pearson Chi-square criterion is a test statistic that, for a
2×2 cross-table, asymptotically follows a Chi-square distribution with one
degree of freedom ("df") under the null hypothesis. For the observed test
statistic value ("Value") of 2.242 the p-value ("Asymp. Sig. (2-sided)")
calculates to 0.134.
This means: Again the objection of colleague Y cannot be invalidated!
A rule of thumb points to problems of the Chi-square test: if the
expected frequency under the null hypothesis is smaller than 5 in one of
the four cells, then the so-called Fisher's exact test should be used
instead. SPSS reports this check in footnote "a"; in our example the
minimum expected count is 6.12, so the rule of thumb is not violated.
More about Fisher's exact test can be found in the Appendix about exact
tests.
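For completeness, Fisher's exact test for the same table can be sketched in Python as well (assuming scipy):

from scipy.stats import fisher_exact

odds_ratio, p = fisher_exact([[4, 12], [9, 9]], alternative="two-sided")
print(f"p = {p:.3f}")  # SPSS reports .172 for the two-sided exact test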
5.3. Paired Tests
Example 5.3.1.: The daily dietary intake of 11 young and healthy women was
measured over a longer period. To avoid any deliberate influence on the study
results, the women did not know in advance that the aim of the study was the
comparison of pre- and post-menstrual ingestion. The mean dietary intake
(in kJ) over 10 pre-menstrual (PREMENS) and 10 post-menstrual days (POSTMENS)
of each woman is given in the following table:
WOMAN   PREMENS   POSTMENS
1       5260      3910
2       5470      4220
3       5640      3885
4       6180      5160
5       6390      5645
6       6515      4680
7       6805      5265
8       7515      5975
9       7515      6790
10      8230      6900
11      8770      7335
The research question is: Is there a difference between pre- and post-menstrual dietary intake?
First idea: Use t-test or Wilcoxon-Mann-Whitney U-test
Second idea (or rather a question): Is this approach appropriate for the
situation? Contrary to the previous two-group comparisons, we now deal
with two measurements per individual. Thus we have a situation with
dependencies. We call this a paired situation. Two-group-comparisons
with only one measurement per individual are called unpaired.
Basically, it is possible to use the already known unpaired version of the
t-test or the Wilcoxon-Mann-Whitney U-test for paired situations. However,
as measurements within a person are usually more similar than
measurements taken from different persons, taking the paired situation
into account leads to higher power of the analysis.
The question „Is there a difference between pre- and post-menstrual
dietary intake?" is equivalent to the question „Is the mean difference
between pre- and post-menstrual dietary intake unequal to zero?"
The suitable test is the paired t-test. We proceed as usual (according to
the principle of statistical testing)!
Concretely:
• Null hypothesis: „The mean difference between pre- and post-menstrual
food intake is equal to zero!" (also: no effect)
• Two-sided alternative hypothesis: „The mean difference is unequal to
zero!"
• Intuitive measure for the deviation from the null hypothesis: the
absolute value of the observed mean difference
• When does a result count as more extreme than the observed one?
Obviously, if the absolute mean difference is higher than the observed
one.
If the distribution of the differences approximately follows a normal
distribution, then the use of the paired t-test is appropriate. The observed
mean difference has to be divided by its estimated variation to obtain a
t-distributed value. Then the p-value can be calculated.
Procedure in SPSS:
First, we have to verify graphically, if the distribution of the differences is
approximately normal. (To be clear: the differences PREMENS-POSTMENS
should be approximately normally distributed! The distributions of the variables
PREMENS and POSTMENS are not of importance here.)
We accept that deviations from a normal distribution are moderate. So we
can perform the paired t-test. We click on
Analyze
Compare Means
Paired-Samples T Test...
We choose the variables PREMENS and POSTMENS and move them to the
field
Paired Variables:
They appear as Pair 1 under
Variable1 and Variable2,
respectively.
We click on
OK
and the requested paired t-test will be calculated. First, we receive a
descriptive table and a table with correlation coefficients. Then we can see
the result
Paired Samples Test

Paired Differences
                                   Mean       Std.        Std. Error   95% Confidence Interval
                                              Deviation   Mean         of the Difference
                                                                       Lower       Upper
Pair 1  Pre-menstrual dietary
        intake (kJ) - Post-
        menstrual dietary
        intake (kJ)                1320.455   366.746     110.578      1074.072    1566.838
Here, the mean and standard deviation of the paired differences can be
seen. From the 95% confidence interval we learn that the dietary intake
differs statistically significantly during the female cycle (zero is not
covered).
Paired Samples Test

                                                    t        df   Sig. (2-tailed)
Pair 1  Pre-menstrual dietary intake (kJ) -
        Post-menstrual dietary intake (kJ)          11.941   10   .000
The p-value is <0.001, based on a two-sided test. It confirms what we
have seen before from the two-sided confidence interval.
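A minimal cross-check of the paired t-test in Python (a sketch assuming scipy):

from scipy.stats import ttest_rel

premens = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]
postmens = [3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335]

t, p = ttest_rel(premens, postmens)  # paired t-test on the differences
print(f"t = {t:.3f}, p = {p:.6f}")   # expected: t = 11.941, p < 0.001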
Is there a non-parametric alternative to the paired t-test? There are even
two, the Wilcoxon signed ranks test and the sign test.
The problem with the Wilcoxon signed ranks test is that its result may
depend on data transformations! This is completely unusual for a
non-parametric test. Thus the Wilcoxon signed ranks test should only be
used if the distribution of the differences is symmetric.
Some statisticians recommend the use of the sign test as a non-parametric
alternative to the paired t-test. However, the sign test is not very powerful.
Other statisticians recommend to transform the data if the raw data show
larger differences for higher baseline values and then to apply the
Wilcoxon signed ranks test. However, this is only recommendable if the
transformation actually results in a symmetrical distribution of the
differences.
How can we verify if higher differences occur at higher baseline values?
We can plot the differences (X-Y) versus the means (X+Y)/2, see the
appendix on page 178.
To perform the Wilcoxon signed ranks test and the sign test for example
5.3.1., we click on
Analyze
Nonparametric Tests
2 Related Samples...
We choose the variables PREMENS and POSTMENS and move them to the
field
Test Pairs:
They appear as Pair 1 under
Variable1 and Variable2,
respectively.
In the field
Test Type
we choose
Wilcoxon and Sign
Then we click on
OK
and the two requested tests will be calculated.
For the Wilcoxon signed ranks test a p-value of 0.003 and for the sign test
a p-value of 0.001 is calculated.
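Both tests can be sketched in Python as well (assuming a recent scipy version). SPSS reports the asymptotic p-value for the signed ranks test, so the normal approximation is requested below, and the sign test is carried out as an exact binomial test because all 11 differences are positive:

from scipy.stats import wilcoxon, binomtest

premens = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]
postmens = [3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335]

# Wilcoxon signed ranks test, normal approximation without continuity correction
stat, p = wilcoxon(premens, postmens, method="approx", correction=False)
print(f"signed ranks: p = {p:.3f}")  # about 0.003

# Sign test: count positive differences and compare against Binomial(11, 0.5)
n_pos = sum(a > b for a, b in zip(premens, postmens))  # 11 out of 11
print(f"sign test: p = {binomtest(n_pos, n=11, p=0.5).pvalue:.3f}")  # about 0.001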
Remarks:
• Paired situations are not restricted to only two measurements within
a patient; e.g., a matched case-control study also constitutes a paired
situation. The individual differences are calculated from the two
observations of each matched pair (=case and its control).
Caution! Matched case-control studies are easy to analyze, but their
dangers are great. Especially the choice of controls is a source of
biased and false statements.
• Paired situations are the simplest example of a more general concept
called blocking. In general, every experimental unit with more than
one measurement constitutes a block. The term block originates from
agriculture.
Example 5.3.2.: In 1968 a diagnostic study about vascular occlusion was
performed at the 1st Department for Surgery (AKH, Vienna) with 121 patients.
The diagnostic tools CLINICAL and DOPPLER were compared. The true
diagnosis was assessed by venography. Here, only patients with vascular
occlusion were considered.
Clinical   Doppler   number of patients with vascular occlusion
+          +         22
+          -          3
-          +         16
-          -          3
sum                  44
In 25 cases (22 plus 3) the underlying vascular occlusion was diagnosed
correctly by the clinical method. The correct diagnosis with the Doppler
tool was made in 38 cases (22 plus 16). Thus, the sensitivity for the
clinical diagnosis is 25/44=57 % and the sensitivity for Doppler is
38/44=86 %.
Obviously, the diagnosis with Doppler is more sensitive than the clinical
diagnosis. The question arises whether this difference can be sufficiently
explained by chance. Thus, we again need a statistical test.
The structures of examples 5.3.1 and 5.3.2 are very similar. In the first
example, the outcome was measured twice for each woman (pre- and
post-menstrually), and in the second, the outcome is measured twice for
each patient (clinical, Doppler). The main difference between the
examples is the type of the outcome. In example 5.3.1 the outcome
variable was scale (dietary intake in kJ), and now it is dichotomous (+, -).
The correct statistical test in example 5.3.1 was nothing else than a
paired version of the simple unpaired t-test. Similarly, the correct
statistical test for example 5.3.2 is a paired version of the simple
Chi-square test, the so-called McNemar test.
The analogy continues. Only the difference between two paired measures
is used for the paired t-test. The same is true for the McNemar test: the
information from two paired measurements is reduced to one value.
We keep in mind: the McNemar test is used for paired situations with
dichotomous outcome. It is a variant of the sign test.
Now the data in b5_3_2.sav are used. However, we should not forget to
Weight Cases....
If we cross-tabulate the diagnostic tools clinical and Doppler, we obtain
the following table:

                     diagnosis by Doppler
clinical diagnosis   -         +          total
-                    3         16 (=f)    19
+                    3 (=g)    22         25
total                6         38         44
In three patients both methods lead to a negative diagnosis, and in 22
patients both methods lead to a positive diagnosis. Such identical results
are called concordant.
To answer the question which diagnosis tool is more sensitive, the
concordant results are unimportant. The comparison of the sensitivities
25/44 and 38/44 reduces to the comparison of the number of correct
diagnoses 25 and 38. The 22 concordant results appear in both numbers,
which leads finally to a comparison of the so-called discordant results, 3
and 16 (marked in the table with “g” and “f”).
Thus, different diagnoses are discordant results. Only patients with
discordant diagnosis results provide relevant information to answer the
research question. This is the starting point of the McNemar test. We
proceed as before (principle of statistical testing!)
Concretely:
• Null hypothesis: „The difference between both types of discordant
results (f minus g) is equal to zero!" (also: both diagnostic tools are
equally sensitive)
• Two-sided alternative hypothesis: „This difference is different from
zero!"
• Intuitive measure for the distance from the null hypothesis: the
Chi-square criterion
• When is a result more extreme than the observed one? The higher the
Chi-square criterion, the more extreme the result!
If the null hypothesis was true:
19 discordant pairs were observed. If clinical diagnosis and Doppler
diagnosis were equally sensitive, then we expect
• 9.5 times clinical „-" and Doppler „+", and
• 9.5 times clinical „+" and Doppler „-".
Now we have to use these expected values to calculate the Chi-square
criterion. Remember: calculate (observed minus expected) squared for
each cell, divide by the expected value, and sum over the cells. Contrary
to the already known ordinary Chi-square test, only the two discordant
cells are used for the McNemar test:

(16 − 9.5)²/9.5 + (3 − 9.5)²/9.5 = 8.9

This formula can also be simplified to (f − g)²/(f + g).
If you insert f=16 and g=3, the simplified formula also results in a
value of 8.9. Of course, an exact version of the McNemar test also exists,
which is especially recommended for small values of f and g.
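A small Python sketch of both versions (assuming scipy; the exact version treats each discordant pair as a fair coin under the null hypothesis):

from scipy.stats import binomtest

f, g = 16, 3  # discordant pairs from the table above

# Chi-square version of the McNemar test (simplified formula)
chi2 = (f - g) ** 2 / (f + g)
print(f"Chi-square criterion = {chi2:.1f}")  # 8.9

# Exact McNemar test: two-sided binomial test on the discordant pairs
p = binomtest(g, n=f + g, p=0.5).pvalue
print(f"exact p = {p:.3f}")  # 0.004, as reported by SPSS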
SPSS offers two menus to calculate the McNemar test.
1st option: We click on
Analyze
Nonparametric Tests
2 Related Samples...
We click on the variables CLINICAL and DOPPLER and move them to the
field
Test Pairs:
They appear as Pair 1 under
Variable1 and Variable2,
respectively.
In the field
Test Type
we choose
McNemar.
Then we click on OK
and the requested test will be calculated. SPSS automatically gives the
exact p-value of 0.004 („Exact Sig. (2-tailed)").
2nd option: We click on
Analyze
Descriptive Statistics
Crosstabs...
Now, we move the variable CLINICAL to the field
Row(s):
and the variable DOPPLER to the field
Column(s):.
We click on the button
Statistics...
and choose McNemar. Then we click on
Continue
and
OK
and the requested test will be calculated. The exact p-value is again
0.004.
Finally, only the interpretation of the test result is missing. The p-value of
0.004 is smaller than the commonly used significance level in medicine of
0.05. Thus the test result is statistically significant and we can reject the
null hypothesis. Diagnosis of a vascular occlusion with Doppler is more
sensitive than a diagnosis by clinical assessment.
Summary
All statistical tests that have been introduced in this and the last chapter
are summarized in the following table:

Outcome variable                 Compare two independent groups   Compare two paired groups
scale and normally distributed   t-test                           paired t-test
scale or ordinal                 Wilcoxon rank-sum test           Wilcoxon signed-ranks test or sign test
binary                           Chi-square test                  McNemar's test
5.4. Confidence intervals
Let’s return to example 4.1.1, the rat diet data set. Recall, we computed a
p-value of 0.076. Comparing this p-value to the usual significance level of
5%, we conclude that the null hypothesis of equal means cannot be
rejected.
•
•
Should we be content with that conclusion?
Or is there more utilizable information behind?
The t-test uses the mean difference between two groups to detect
deviations from the null hypothesis. This mean difference has a
scientifically useful meaning of its own (in contrast, the rank sum
computed by the Wilcoxon rank-sum test has not).
We could replace the scientific question “Are there differences between
the underlying populations?” by “How large are these differences?”
Clearly, if we observe a difference in mean weight gain of 19 grams
between the high and low protein dietary groups this doesn’t mean that
the “true” difference is exactly 19 grams. On the other hand, we expect
the “true” or population difference to be close to 19 grams. It would be of
great value if we could specify an interval within which the “true”
difference is believed to fall.
Such a specification can never be certain, because inference based on a
few (a small sample) about all (the population) cannot offer 100%
certainty (unless we restrict the conclusions to uninteresting statements
like: “it is certain with 100% probability, that the absolute difference in
weight gain of the rats will be less than 5000 kg”). Usually, we decide to
make statements which can be assigned 95% certainty (sometimes also
90% or 99%).
We are seeking an interval, which covers the “true” weight difference with
high probability. Such an interval is referred to as confidence interval, and
along with the interval, we also specify the degree of confidence which is
assigned to that interval, e. g. a “95% confidence interval” or a
“confidence interval with level 95%”.
Confidence intervals are based on the following concept: assume we could
repeat the study very often (collecting new data each time) and for each
of the study repetitions we compute a 95% confidence interval for the
weight difference. Then 95% of these intervals would cover the “true”
difference.
For example 4.1.1, confidence intervals have already been computed.
They are output along with the t-test results.
Independent Samples Test

                                     t-test for Equality of Means
                                     Std. Error   95% Confidence Interval
                                     Difference   of the Difference
                                                  Lower      Upper
Weight gain        Equal variances
(day 28 to 84)     assumed           10.0453      -2.1937    40.1937
                   Equal variances
                   not assumed        9.9440      -2.4691    40.4691
The 95% confidence interval for the difference in weight gain between
high and low protein dietary groups amounts to -2.2 grams to 40.2
grams.
Confidence intervals are very closely related to statistical tests. They could
even replace the specification of p-values. If a confidence interval doesn’t
cover the difference stated in the null hypothesis (usually 0), then the test
result is statistically significant on the pre-specified significance level. A
95% confidence interval always corresponds to a test with significance
level of 5%, a 99% confidence interval to a test at the 1% level etc.
In our example, the 95% confidence interval covers the null hypothesis
(=0 grams weight gain difference). Therefore, the test fails to be
significant at the 5% level.
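The interval itself is easy to reconstruct from the quantities in the SPSS output (a sketch assuming scipy):

from scipy.stats import t

diff, se, df = 19.0, 10.0453, 17  # mean difference, its standard error, degrees of freedom
margin = t.ppf(0.975, df) * se    # 97.5% quantile for a two-sided 95% CI
print(f"95% CI: {diff - margin:.2f} to {diff + margin:.2f}")  # about -2.19 to 40.19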
Note: if equal variances in the two groups cannot be assumed, a modified
version of the confidence interval has to be used.
Make use of confidence intervals!
5.5. One-sided versus two-sided tests
Recall:
The p-value is the probability of a result at least as extreme as the
observed one assuming the null hypothesis applies.
Clearly, if the null hypothesis is true extreme results occur in any direction
with the same frequency.
Concretely with the rat diet example: If there is no difference in weight gains
between the two protein diets and we could repeat the experiment quite often,
then sometimes the rats with low protein would gain more weight and
sometimes the rats with high protein would gain more weight - distributed just
by pure chance!
We consider this case by performing two-sided tests and thus calculate
two-sided p-values. This is what we have done up to now. In the
majority of cases this will be the only correct approach.
In rare cases there are research questions where scientifically meaningful
differences can only appear in one direction. Observing a difference in the
opposite direction would always be regarded as due to chance, regardless
of the size of this difference. A typical example is dose-response
relationships in toxicology: an increase of the dose cannot lead to
decreased toxicity.
In such a case the alternative hypothesis would have to be restricted to
one direction only.
A potential alternative hypothesis for the rat diet example could be: "Food with
low protein causes higher weight gain than food with high protein."
Thus, we would perform a one-sided t-test and calculate a one-sided
p-value.
One-sided tests are rarely suitable. Even if we have the strong
presumption that the new therapy could not be worse than the present
therapy, we cannot be sure. If we were sure then we wouldn’t need to
perform an experiment!
Whether a one-sided test is considered appropriate has to be decided
before data analysis or, even better, before data collection. For prospective
studies, the intended test strategy (and the scientific reasons for it) is
usually recorded in the study protocol in order to avoid unpleasant
discussions afterwards.
The decision for a one-sided test should in no case depend on the result of
the experiment or the study.
In medical literature, one-sided p-values often fall between 0.025 and
0.05. This means: a two-sided test wouldn’t have been significant! One
can assume in many of these cases that there were no records about the
use of one-sided hypotheses in advance.
The nominal significance level will not be preserved by a one-sided
alternative hypothesis that depends on the result of the experiment. The
actual significance level would be double the nominal one, namely 10%.
• Strictly use only two-sided tests!
• Use one-sided tests only if you have planned this at the beginning of
the study and if you have stated scientifically justifiable reasons in
the study protocol!
• Be in principle suspicious about studies which use one-sided test
results without reasonable justification!
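To illustrate the doubling relationship mentioned above, here is a minimal sketch in Python (assuming a recent scipy version with the alternative argument): for a symmetric test statistic, the one-sided p-value is half the two-sided one whenever the observed effect points in the tested direction.

from scipy.stats import ttest_ind

high = [83, 97, 104, 107, 113, 119, 123, 124, 129, 134, 146, 161]
low = [70, 85, 94, 101, 107, 118, 132]

# One-sided alternative: high protein leads to MORE weight gain than low protein
t, p = ttest_ind(high, low, alternative="greater")
print(f"one-sided p = {p:.3f}")  # about 0.038, i.e. half of the two-sided 0.076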
5.5 Exercises
5.5.1.
A questionnaire about smoking behaviour revealed that 27 out of 40
16-year-old boys and 12 out of 32 girls of the same age were smokers.
You can find the aggregated data in the data file b5_5_1.sav.
Does the smoking behaviour depend on the sex of the teenagers?
5.5.2.
Values for thyroxin in the serum (nmol/l) of 16 children with
hypothyroidism can be found in the data file b5_5_2.sav,
differentiated by strength of the symptoms („no to slight
symptoms“ versus „pronounced symptoms“).
No to slight symptoms:
34, 45, 49, 55, 58, 59, 60, 62, 86
Pronounced symptoms:
5, 8, 18, 24, 60, 84, 96
Compare the thyroxin values between both symptom groups.
5.5.3.
You can find data from two patient groups in the data file
b5_5_3.sav. The first group (Hodgkin=1) consists of Hodgkin
patients in remission. The second group (Hodgkin=2) is a
comparable group of non-Hodgkin patients, also in
remission.
The numbers of T4 and T8 cells/mm3 blood were determined.
(a) Are there differences in T4 cell counts between both groups?
(b) Are there differences in T8 cell counts between both groups?
5.5.4.
Often, a logarithmic transformation is applied to right-skewed
data. The reason for this is simple: if the logarithmic values in
both groups are approximately normally distributed, then a t-test
can be applied to these data.
What does this mean for the interpretation of the results on the
original scale (without logarithmic transformation)?
Guidance:
Which mathematical operation has to be applied to return from
the logarithmic scale to the original scale?
What are the effects of this mathematical operation on a
difference?
What are the effects of this mathematical operation on the null
hypothesis of the t-test?
5.5.5.
Do your considerations of example 5.5.4. change your analysis of
example 5.5.3?
Guidance:
Try a logarithmic transformation on the cell counts of example
5.5.3.
Is the t-test after the transformation reasonable?
If yes, perform a t-test and interpret the results.
If no, specify reasons for your decision.
5.5.6.
Convert the data of the rat diet example 4.1.1 from SPSS to
EXCEL. Perform t-tests with equal and unequal variances and a
test for equality of variances.
5.5.7.
Use the data from example 5.3.1. and generate the variable DIFF
from the difference PREMENS minus POSTMENS. Click on
Analyze
Compare Means
One-Sample T Test...
and move the variable DIFF to the field Test Variable(s):.
Then click on OK.
(a) What attracts your attention compared to the paired t-test?
Do you have an explanation for that?
(b) Can you figure out what the field Test Value means?
Why is there a zero? What happens if you change this value?
5.5.8.
The following study was performed with 173 patients with skin
cancer. The skin reaction of each patient to the contact allergen
dinitrochlorobenzene (DNCB) and to the skin-irritating and
inflammation-inducing croton oil was assessed.
+ve ... skin reaction
-ve ... no skin reaction
The aim of this study was to answer the question whether, for
patients with skin cancer, contact with DNCB causes a different
skin reaction than contact with croton oil.
Here are the study results:

skin reaction to
DNCB   croton oil   frequency
+ve    +ve          81
+ve    -ve          23
-ve    +ve          48
-ve    -ve          21
Perform the corresponding analyses.
5.5.9.
Here are the complete data for the diagnosis study about vascular
occlusion (example 5.3.2). The diagnosis tools clinical and
Doppler were compared. The true diagnoses were assessed by
venography.
clinical   Doppler   number of patients
                     with occlusion   without occlusion
+          +         22                4
+          -          3               27
-          +         16                5
-          -          3               41
sum                  44               77

Compare the specificities of both diagnostic tools.
5.5.10. A matched case-control study was performed to evaluate if
undescended testicle (maldescensus testis) at birth leads to
increased risk for testicular cancer. Cases were 259 patients with
testicular cancer. For each case a control patient from the same
hospital was matched who was of about the same age (± 2
years), belonged to the same ethnic group and suffered from a
disease other than testicular cancer.
In the data file b5_5_10.sav you can find the corresponding data.
Answer the research question by using these data.
Additional remark: We have to be careful with the term „case“, as
it has two different meanings:
(i) A case in this case-control study is a patient with testicular
cancer.
(ii) A case in the data matrix is a pair, which consists of a patient
with testicular cancer and the corresponding matched control.
5.5.11. Example 5.5.5 continued: compute a 95% confidence interval for
the original scale (cells/mm3 blood). How can you interpret the
confidence interval?
5.5.12. Use the data set of the rat dietary example 4.1.1.
Compute a one-sided t-test corresponding to the alternative
hypothesis “low protein diet leads to more weight gain than high
protein diet” and specify a one-sided 95% confidence interval.
Compute a one-sided t-test corresponding to the alternative
hypothesis “high protein diet leads to more weight gain than low
protein diet” and specify a one-sided 80% confidence interval.
5.5.13. Use the rat dietary example data set (b4_1_1.sav). Add a fixed
value to all weight gains corresponding to group 1.
Choose the value such that the t-test yields a p-value of exactly
0.05.
Choose the value such that the t-test yields a p-value of exactly
0.01.
Hint: if you cannot proceed just by thinking, try to solve the
exercise by “trial-and-error”. Then, try to find distinctive features
according to your solutions. By doing so, pay special attention to
the mean difference between the groups and the corresponding
confidence limits.
5.5.14. A team of authors had submitted a paper to a medical journal. In
the review of the manuscript, the referees criticized the statistical
method that had been applied to analyze the data. They
encouraged the authors to present the results by means of
confidence intervals. The first author replied as follows:
“In studies as ours, however, given the relatively small
numbers, a confidence interval … is likely to contain zero, be
fairly wide and include both positive and negative values.
Therefore this is not an appropriate setting for this form of
analysis as the result always will be too diffuse to be
meaningful.”
In future, will you accept such a rationale? Whether or not, give
reasons for your decision.
5.5.15. Perform an independent sample t-test on the data of exercise
5.3.1 (female ingestion) and compare the results to those of the
paired sample t-test.
5.5.16. The following table contains the respiratory syncytial virus (RS)
antibody titers measured in nine different patients.
Patient number   Titer
010174           8
020284           16
019459           16
011301           32
000232           32
024336           32
015319           64
009803           64
007766           256
(a) Compute the geometric mean titer (GMT).
(b) Compute a 95% confidence interval for the GMT.
(c) How would you describe the computations of (a) and (b) to be
suitable for the statistical methods section of a scientific
publication?
Appendices
A. Opening and importing data files
Opening SPSS data files (*.sav)
SPSS data files can be opened directly using the menu File-Open-Data....
Not only SPSS data files can be opened with this menu; the most
important file types are SPSS (sav), Excel (xls), SAS Long File Name
(sas7bdat), and Text (txt).
Opening Excel data files
Be careful with data bases created by Excel. As mentioned above, Excel
imposes no restrictions on the type of data the user enters in a table.
Frequent errors when importing Excel files to SPSS are: data columns are
empty, date formats cannot be erased, or numeric columns are imported
as string types, etc.
In order to avoid such problems, some rules should be obeyed:
• Clean the data in Excel before importing it into SPSS: the first row
should contain variable names, all other rows should only contain
numbers!
• Pay attention to a consistent use of the decimal comma. Be careful
with different language versions.
• Erase any units of measurement in data columns.
• Pay attention to hidden spaces in data columns; they should be
erased by replacing them with "nothing".
• Don't use tags in data columns, like question marks for imprecisely
collected data (see above).
• Sometimes formats once assigned to data columns cannot be
erased. This problem can be solved by copying the complete data
area and pasting the "contents" (using "edit-paste contents") into a
new sheet.
• Wrong date formats in a column can be erased by selecting the
column and choosing General from the menu Format-Cells:
First, the Excel data file should be opened in Excel:
The first row should only contain SPSS compatible variable names. Data
values should follow from the second row. Column headers are changed
accordingly (if not, SPSS will generate variable names automatically, and
will use the column headers as variable labels):
Now the first row is erased and the file is saved using a different name
like “bloodpress2”.
Then the data file is opened using SPSS, selecting File-Open-Data....
Change Files of type to Excel (*.xls, *.xlsx, *.xlsm) and press
Open:
Now a window pops up in which the working sheet of the Excel file and the
data area can be specified. Choose Read variable names from first
row of data such that variable names are adopted from the Excel file.
Otherwise, the first row would be treated as a row containing data values.
The data is imported into SPSS, and should be saved as SPSS data file
immediately.
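As a side note, the same kind of import can be sketched in Python with pandas (assuming the file "bloodpress2.xls" from above, whose first row contains the variable names):

import pandas as pd

# header=0: the first row of the sheet supplies the variable names
df = pd.read_excel("bloodpress2.xls", header=0)
print(df.dtypes)  # check that numeric columns were not imported as strings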
Importing data from text files
A further possibility to import data from other programs is provided by the
text import facility. Text files can easily be created by virtually any
program (Word, Outlook, etc.). In Outlook, save an e-mail using File-Save As and choose Text files (*.txt) in the Save as type: field.
By double-clicking the text file (data.txt) in the explorer, the notepad
opens. Delete any rows that do not contain data values or variable names:
Save the file as Data2.txt.
In SPSS, select File-Open-Data and choose Text (*.txt, *.dat) from
Files of type:. Select Data2.txt and press Open. The so-called Text
Import Wizard opens and guides through the data import:
At step 2, change Are variable names included at the top of your
file? to Yes.
In our case, columns are separated by the Tab.
Steps 5 and 6 require no modification from the user. Finally, the data file
is imported and appears in the SPSS Data Editor window.
Opening multiple data sets
From version 14, SPSS is able to handle multiple data sets in multiple
windows of the SPSS Data Editor. These are distinguished by consecutive
numbers ([DataSet1], [DataSet2], …). The active window contains the
so-called active data set, i.e., the data set to which all subsequent
operations apply. While a dialogue is open, one cannot change to another
data set.
During an SPSS session, at least one SPSS Data Editor window must be
open. When closing the last Data Editor window, SPSS is shut down and
the user is asked to save changes.
Copy and paste data
One may also import data by copying data areas in Excel or Word and
pasting them into SPSS. However, keep in mind that
• In the data view, only data values can be pasted
• Numeric data can only be pasted into numeric variables, character
strings can only be pasted into variables defined as string type
Copy & paste is very error-prone; it should only be used for very small
amounts of data that can easily be checked visually, and only if there is no
other way to import the data.
Data may also be copied between SPSS Data Editor windows. Note that
when copying a range of cells in a data view table, only data values will be
pasted into another data view table. Variable descriptions will only be
pasted if a complete column is copied (including the column header).
Data from MS Access
Data bases from MS Access can be opened by SPSS using the Database
Wizard (File-Open database-New query). Database queries require a
working installation of the ODBC drivers, which come with a complete
Microsoft Office installation.
Exercises
A.1
Reading
Source: Timischl (2000). Eight probands completed a training
course to increase their reading ability. Reading ability was
measured as words per 20 minutes before the start and after
completion of the course. The data were entered into the Excel
table "reading.xls".
Import the data into SPSS.
Define a score that reflects the increase in reading ability and
compute the score for each proband ("Transform-Compute
Variable...").
Save the data set as "reading.sav".
A.2
Digoxin
The Excel table “digoxin.xls” contains digoxin readings for 10
patients, measured on six consecutive days.
Import the table into SPSS. Pay attention to the correct definition of
the variables.
Save the data file as “digoxin.sav”.
A.3
Alanine Aminotransferase
The Word document “Alanine Aminotransferase.doc” contains the
frequency distribution of ALT (Alanine Aminotransferase) in 240
medical students.
Import the table into SPSS. Use the text import wizard to properly
import the table!
Weight the rows by the frequency variable.
Save the data file as ALT.sav.
A.4
Down’s syndrome
Source: Andrews and Herzberg (1985). The text file “down.txt”
contains the frequency of all livebirths from 1942 to 1957 in
Victoria, Australia, and the number of newborn children with Down’s
syndrome. The data are sorted by age group of the mothers.
Import the data set into SPSS and save it as “down.sav”.
B. Data management with SPSS
Merging data tables
If a data base is saved in separate tables, then these tables can be
merged into one table. Tables can be merged
• one below the other, or
• side by side
Example B.1: multi-center trial. Clinical trials often recruit
patients in multiple hospitals or "centers". Data can be entered into the
computer system locally and merged only at the time of analysis. Thus, every
center supplies a data file with different patients, and these data files have to
be merged one below the other.
Example B.2: longitudinal clinical study. Clinical trials often involve
repeated measurements on patient characteristics (e. g., blood pressure before
treatment and every month after start of treatment). On the other hand, some
variables are collected only once (e. g., age at start of treatment, sex, etc.).
Therefore, it is most efficient to create separate tables for the baseline and the
repeated data. At time of analysis, the tables are merged using a unique
patient identifier (patient identification number).
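Both merge operations can be sketched in Python with pandas (the data frames and column names below are hypothetical, chosen only to mirror examples B.1 and B.2):

import pandas as pd

# Hypothetical center files with the same variables but different patients
center1 = pd.DataFrame({"proband": [1, 2], "sys_bp": [120, 135]})
center2 = pd.DataFrame({"proband": [3, 4], "sys_bp": [128, 141]})
age = pd.DataFrame({"proband": [1, 2, 3, 4], "age": [54, 61, 47, 58]})

# One below the other (example B.1): stack the center files
all_centers = pd.concat([center1, center2], ignore_index=True)

# Side by side (example B.2): match on the key variable "proband"
merged = all_centers.merge(age, on="proband", how="outer")
print(merged)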
Merging tables one below the other
Consider the data file "bloodpress.sav" and an additional data file
containing the data of five other patients, "appendix.sav". These two data
files should be merged. First, open "bloodpress.sav" and "appendix.sav"
with the SPSS Data Editor; now two windows of the SPSS Data Editor
should be open. Go to the window containing "bloodpress.sav" and
choose Data-Merge Files-Add Cases from the menu. Select
appendix.sav and press Continue.
As the variables do not have the same names in both data files, they must
be paired using the following sub-menu:
Variables corresponding to the active data set are marked by “(*)” and
those corresponding to appendix.sav by “(+)”. Now select two matching
variables from either data set and press Pair. Note that the second
variable can be selected by holding the “Strg” (“Ctrl” on English keyboards) key while clicking on the variable name.
Repeat this procedure for all pairs of variables until all variables appear at
Variables in New Active Dataset:
Press OK. Note that the asterisk (*) before the data set name in the active
SPSS Data Editor window indicates that changes have not yet been saved.
Merging tables side by side
Now assume that the ages of the patients are saved in a separate data file
(“age.sav”). First, open “age.sav” using the SPSS Data Editor. This data
file contains two variables: proband identification number and age. When
this data set is to be merged with our active data set, it is important
that the proband identification numbers agree. Choose Data-Merge
Files-Add Variables and select age.sav.
In order to ensure that the data files are merged correctly, we should
define “proband” as key variable and match the two files using this key
variable. Select proband from the field Excluded Variables:, choose
Match cases on key variables in sorted files and Non-active
dataset is keyed table and put proband into the field Key Variables
by pressing the corresponding arrow button. Press OK, the following
warning appears on the screen:
This warning reminds us that both data files must be sorted by the key
variable before merging. The data sets are already sorted in our example.
Please note that, in general, data sets should be sorted (Data-Sort Cases)
before merging!
After pressing OK the merged data set appears in the active Data Editor
window:
If some patients are completely missing in one of the data files, then
choose Both files provide cases. Otherwise, only patients present in
both files might appear in the merged data set.
Computing new variables
New variables can be computed using the menu Transform-Compute
Variable. Computations are carried out row-wise. Assume that we wish
to compute the change in blood pressure measured in sitting probands
between the time points “before treatment” and “after treatment”. This
can be done by computing the difference of the corresponding
measurements. Choose Transform-Compute Variable from the menu:
A new variable (“sit_31”) is created, in which the difference between post-treatment and pre-treatment measurement is computed for
each proband. Negative values indicate a decline in blood pressure.
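The same computation can also be written directly in SPSS syntax (a minimal sketch; the variable names sit_1 and sit_3 refer to the pre- and post-treatment measurements mentioned below):

COMPUTE sit_31 = sit_3 - sit_1.
EXECUTE.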
Note: Spreadsheet programs (Excel) often treat empty cells as 0.
Computations on empty cells can have surprising results! In SPSS, empty
cells are treated as missing data, and computations involving missing data
usually lead to empty cells in the outcome variable.
The values computed by Transform-Compute Variable are treated as observations on further variables, which are statically linked to the corresponding row (observation). SPSS does not memorize the expression leading to the computed data values; it can only save the expression in the variable label of the new variable. If some values of sit_3 or sit_1 are changed after the computation of sit_31, then the computation has to be repeated in order to update sit_31. This is a major difference between statistics programs, which usually assume fixed data values, and spreadsheet programs, in which computed values are dynamic relationships.
Creating a variable containing subsequent numbers
Observations can be numbered consecutively by the Special Variable
“$Casenum”. This variable is generated by SPSS automatically, but
usually not visible to the user. $Casenum contains, for each observation,
the current row number. $Casenum can be used to automatically
generate patient identification numbers in the following way:
Choose Transform-Compute Variable, enter PatID as target variable, and $Casenum as numeric expression (the variable can also be found by selecting Miscellaneous in the field Function group).
Press OK. Please note that the patient number “patid” will only be
generated for existing observations. If observations are added later, then
the procedure must be repeated. Again, the data values of patid are
statically linked to the observations. If you want to use $Casenum to
generate a patient identifier in an empty sheet, then you should first enter
an arbitrary value in the last row needed, such that the patient identifier is
generated for the correct number of observations.
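In syntax form, this is simply (a minimal sketch):

COMPUTE PatID = $CASENUM.
EXECUTE.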
Functions
The menu Transform-Compute Variable can be used to perform
mathematical operations of any kind. Basic arithmetical operations are
plus (+), minus (-), multiplied by (*) and divided by (/). The exponent
(“to the power of”) is symbolized by **. Boolean (logical) operations
include greater than (>), less than (<), equal to (=), not equal to (~=),
and the logical AND (&), OR (|) and NOT (~). Furthermore, various other functions are provided. The definitions of the functions can be directly accessed by selecting a function from a function group.
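As a hypothetical illustration of these operators (building on the variable sit_31 computed earlier; in SPSS, a logical expression evaluates to 1 if true and 0 if false):

* Flag probands whose sitting blood pressure declined under treatment.
COMPUTE declined = (sit_31 < 0).
EXECUTE.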
Recoding variables
The menu Transform-Recode helps in recoding variables. The most
common application of that menu is to define codes for particular value
ranges (domains). Assume we want to categorize the blood pressure into
“normal” (up to 150), “high” (151-200) and “very high” (>200). Choose
Transform-Recode-Into Different Variables from the menu:
First put all blood pressure variables into the field Numeric Variable ->
Output Variable:. Then press Old and New Values. Here, domains can
be assigned codes.
The first code (“New Value”), 1, will be assigned to the domain 150 or
less. Insert 150 into the field labeled “Range, LOWEST through value:”.
Press “Add” and define the other two codes, 2 for the domain 151 through
200 (the fourth option on the left hand side), and 3 for the domain 201
through HIGHEST (the sixth option, insert 201). The definition of each
domain/code assignment must be confirmed by pressing Add. Finally, we
arrive at:
Then press Continue. Now we have to define names for the six output
variables containing the codes. Select the first input variable from the field
“Numeric Variable -> Output Variable” and create a name for the
corresponding output variable in the field “Output Variable Name:”. Press
Change and proceed to the second input variable etc.
When all output variables have been named, press OK. Six new variables
containing domain codes have been created in the Data Editor window.
Now we can assign value labels to the codes using the Variable View of
the Data Editor. Switch to “Variable View” (lower left corner) and select
variable “sit_1c”. Then click on None in the column “Values”. Define the
value labels for the three possible codes that the variable may assume:
Click OK. The definition of the value labels can be copied and pasted into
the corresponding field of the other variables:
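The complete recoding can also be sketched in SPSS syntax (shown for the first variable only; the remaining five variables are recoded analogously):

RECODE sit_1 (LOWEST THRU 150=1) (151 THRU 200=2) (201 THRU HIGHEST=3) INTO sit_1c.
VALUE LABELS sit_1c 1 'normal' 2 'high' 3 'very high'.
EXECUTE.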
Exercises
B.1
Hemoglobin
Source: Campbell and Machin (1993). A data set of a study
investigating the relationship of hemoglobin (hb), hematocrit
(packed cell volume, PCV), age and menopausal status
(menopause) is separated into three tables:
Hemoglobin1.xls contains hb, pcv and meno from all premenopausal patients
Hemoglobin2.xls contains the same data from all postmenopausal
patients
Hemoglobin-age.xls contains the age of all patients.
Import all tables into SPSS and merge them into one SPSS data file.
Pay attention to sorting when adding the age table; the patient
number should be used as key variable! Save the file as
“hemoglobin.sav”.
B.2
Body-mass-index and waist-hip-ratio
Two tables, controls.xls and patients.xls contain data on age,
weight, height, abdominal girth (waist) and hip measurement (hip)
as well as sex for healthy control individuals and patients suffering
from diabetes. In both tables, sex is coded as 1 (male) and 2
(female).
Merge the two data tables into one SPSS data file. Be aware of the
different column headers in the tables.
Compute body mass index (bmi = weight (kg)/height² (m²)) and
waist-hip-ratio (whr = waist/hip). Save the data set as “bmi.sav”.
B.3
Malic acid
Source: Timischl (2000). Two Excel tables contain the malic acid concentrations of samples of ten different commercially available apple juices. The same samples have been subjected to either enzymatic (“encymatical.xls”) or chromatographic (“chromatographical.xls”) measurement of the malic acid concentration.
Merge both data tables.
For each product, compute the difference of both measurement techniques (using the menu “Transform-Compute”).
Save the data set as “malicacid.sav”.
B.4
Cardiac fibrillation
Source: Reisinger et al (1998). One-hundred and six patients suffering from ventricular fibrillation were treated using two different treatments. Patient data (V6=potassium, V7=magnesium, treat=treatment group, fibr_dur=duration of ventricular fibrillation in days) is contained in the Excel file “baseline.xls”. Sex is coded as 1 (male) and 2 (female). In a second Excel table (“results.xls”) you can find the variable V33, which is coded as 1 (successful treatment of fibrillation) and 0 (no success after 120 minutes of treatment).
Merge the two tables into one SPSS data file.
Compute the body mass index (bmi = weight (kg)/height² (m²)).
Save the SPSS data file as “fibrillation.sav”.
C. Restructuring a longitudinal data set with SPSS
A longitudinal data set, i. e., a data set involving repeated measurements
on the same subjects, can be represented in two formats:
• The ‘long’ format: each row of data corresponds to one time point at which measurements are taken. Each subject is represented by multiple rows.
• The ‘wide’ format: each row of data corresponds to one subject. Each of several serial measurements is represented by multiple columns.
The following screenshots show the cervical pain data set in long …
… and wide format:
With SPSS, SAS and other statistics programs, it is possible to switch
between these two formats. We exemplify the format switching on the
cervical pain data set.
Switching from wide to long
We start with the data set cervpain-wide.sav as depicted above. From the
menu, select Data-Restructure:
Choose ‘Restructure selected variables into cases’, press ‘Next >’:
Now the dialogue asks us whether we want to restructure one or several
variable groups. In our case, we have only one variable with repeated
measurements, but in general, one will have more than one such variable.
Now we have to define the subject identifier. This is done by changing
‘Case group identification’ to ‘Use selected variable’, and moving ‘Patient
ID’ into the field ‘Variable’:
Then we have to define the columns which correspond to the repeatedly
measured variable. We change ‘Target Variable’ to ‘VAS’ (directly writing
into the field), and move the 6 variables labeled ‘Pain VAS week 1’ to ‘Pain
VAS week 6’ into the field ‘Variables to be Transposed’:
Please take care of the correct sequence of these variables. All other
variables, which constitute the baseline characteristics, are moved to the
field ‘Fixed Variables’:
Then press ‘Next >’. The program now asks us to define an index variable.
This variable is later used to define the time points of the serial
measurements. Therefore, we could name it ‘week’. We only need one
index variable:
We request sequential numbers, and change the name to ‘week’:
In the options dialogue that appears subsequently, we request to keep
any variables that were not selected before as ‘fixed’ variables, and also to
keep rows with missing entries.
In the next dialogue, we are asked if we want to create SPSS syntax
which does the restructuring of the data for later reference (SPSS syntax
can be used to perform ‘automatic’ analyses, or to keep track of what we
have done):
After pressing ‘Finish’, the data set is immediately restructured into long
format. You should save the data set now using a different name.
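If syntax creation was requested, the pasted commands will look roughly like the following sketch (the column names vas1 to vas6 are assumptions, since the actual variable names in cervpain-wide.sav are not shown here):

VARSTOCASES
  /MAKE VAS FROM vas1 TO vas6
  /INDEX=week(6)
  /NULL=KEEP.

Variables not listed under /MAKE are kept as fixed variables by default; /NULL=KEEP retains rows with missing entries, as requested in the options dialogue.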
Switching from long to wide
We start with the data set in long format (cervpain-long.sav). Select Data-Restructure from the menu, and choose ‘Restructure selected cases into variables’:
Define ‘Patient ID’ as ‘Identifier Variable’ and ‘Week’ as ‘Index variable’:
Next, we are asked if the data should be sorted (always select ‘yes’):
The order of the new variable groups is only relevant if more than one
variable is serially measured. In our case, we have only the VAS scores as
repeated variable. Optionally, one may also create a column which counts
the number of observations that were combined into one row for each
subject.
Pressing ‘Finish’, we obtain the data set in wide format. The VAS scores
are automatically named ‘VAS.1’ to ‘VAS.6’.
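A corresponding syntax sketch (the variable names patid and week are assumptions):

SORT CASES BY patid week.
CASESTOVARS
  /ID=patid
  /INDEX=week
  /GROUPBY=VARIABLE.

The default separator ‘.’ produces the variable names ‘VAS.1’ to ‘VAS.6’ mentioned above.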
D. Measuring agreement
Agreement of scale variables
An example (Bland and Altman, 1986): two devices measuring the peak
expiratory flow rate (PEFR) are to be compared for their agreement. The
devices are called “Wright peak flow meter” and “mini Wright meter”. The
PEFR of 17 test persons was measured twice by each device:
“wright1” denotes the first measurement on the Wright device, “wright2”
the second measurement, “mini1” the first measurement on the Mini
device, “mini2” the second measurement. First we restrict our agreement
analysis to the first measurements of the Wright and Mini devices, and we
generate a scatter plot of “wright1” and “mini1” to depict the association
of the measurements:
The Pearson correlation coefficient for this data is r = 0.94.
• Can we conclude that both devices agree almost perfectly?
• What if the values measured by the Mini device were twice as high as those measured by the Wright device?
The Pearson correlation coefficient would still be r = 0.94. The Mini device would now yield twice the values as before, but neither the scatter plot nor the correlation coefficient is sensitive to such a transformation. Therefore, these are not adequate tools to describe agreement of measurements.
Instead, Bland and Altman (1986) suggested analyzing the agreement of measurements by the following procedure:
• For each subject, compute the difference of the two measurements and their mean.
• Describe the distribution of the differences by mean and standard deviation or by nonparametric measures (depending on their distribution).
• Generate a scatter diagram of the subject differences versus the subject means to see if the magnitude of the differences depends on the magnitude of the measurements.
The subject means act as approximation to the true values which are to be measured by the devices.
In our example, we have 17 subject differences. If we compute a mean of
these 17 values, we obtain a measure of the mean deviation of the two
methods.
Computing a standard deviation of the 17 subject differences, we obtain a
measure of the variation of the agreement.
Even if the original measurements are not normally distributed, the
distribution of the subject differences often is much closer to normal,
justifying its description by mean and standard deviation. From the
chapter on statistical measures we know that, assuming approximately
normal distributions, the range of mean ± 2 standard deviations covers
about 95% of the data. Thus, 95% limits of agreement can easily be
computed.
In our example we have the following mean and standard deviation:
Subject difference
Count              17
Mean             2.12
Std Deviation   38.77
The mean difference is 2.12 l/min, the 95% limits of agreement are
2.12 – 2 × 38.77 = -75.42 l/min and
2.12 + 2 × 38.77 = 79.66 l/min.
Thus, the devices may deviate in a range of -75.42 to +79.66 l/min. We
cannot speak of “fair agreement” in this example!
The subject differences can be computed by choosing Transform-Compute Variable from the menu and filling in the fields as follows (don’t forget to define a label by clicking on Type & Label...):
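In syntax form, the subject differences (d) and, as described below, the subject means (m) can be computed as follows (a sketch using the variable names introduced above):

COMPUTE d = wright1 - mini1.
COMPUTE m = (wright1 + mini1) / 2.
VARIABLE LABELS d 'Subject difference' /m 'Subject mean'.
EXECUTE.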
Mean and standard deviation can be computed using the menu item
Analyze-Tables-Custom Tables. A box plot or histogram of the subject
differences can be generated using the instructions of chapter 2. Such a
diagram helps us in deciding whether we can consider the subject
differences as being approximately normally distributed or not.
Similarly to the subject differences, the subject means are computed:
Bland and Altman (1986) also suggested drawing a so-called m-d plot to
describe the agreement of two measurement methods, i. e. a diagram
plotting subject differences (d) against subject means (m). Additionally,
such a plot should show the mean deviation (mean subject difference) and
the limits of agreement:
This diagram shows
• the variation in agreement,
• a systematic bias (if one method tends to measure higher values than the other), and
• a potential dependence of the variation in agreement on the magnitude of the measured values.
In the latter case a transformation (e. g., a log transformation) of the measurements is indicated.
The Bland-Altman-Plot can be generated using SPSS in the following way:
1. Compute subject differences and subject means as described above.
Don’t forget to define labels for these new variables.
2. Generate a scatter plot of subject differences against subject means
(Graphs-Chart Builder...):
3. Now the mean subject difference, and the lower and upper limits of
agreement can be inserted as so-called reference lines:
a. Double-click the graph
b. Click Options - Y Axis Reference Line.
c. Enter the value -75.42 in the Position field, click on Apply
and Close.
d. Repeat b and c with the values 2.12 and 79.66:
M-d plots can quite generally be used to describe the differences between two paired measurement series.
Agreement of nominal or ordinal variables
Example: efficacy of a treatment (data file “efficacy.sav”). 105 patients
and their physicians were asked about the efficacy of a particular treatment.
Three response categories were allowed: “very good”, “good”, “poor”. The
following responses were obtained:
                        Physician’s rating
Patient’s rating    Very good    Good    Poor
Very good                  36       6       5
Good                       10      16       8
Poor                        4       8      12
As a measure of agreement, we could compute the percentage of
matching responses (in our case 64 out of 105 = 61%). This measure has
the following disadvantage: assume physicians’ and patients’ ratings were
completely random. In this case we would still obtain a certain degree of
agreement, as physicians’ and patients’ ratings will match in some cases
just randomly. In our example, the expected percentage of matching
responses assuming random rating is 36%. This value can be computed
similarly to the expected cell frequencies in a chi-square-test.
The measure of agreement called kappa (greek letter κ) improves on the simple percentage of matching ratings by relating it to the expected percentage assuming random rating. It is defined as the ratio of the observed agreement beyond chance (A – C) to the maximum possible agreement beyond chance (1 – C). In other words:
κ = (A – C)/(1 – C),
with A and C denoting the observed proportion of matching ratings and the expected proportion of matching ratings assuming random rating. The κ measure can be computed by choosing the menu item Analyze-Descriptive Statistics-Crosstabs, clicking on Statistics and selecting Kappa:
Symmetric Measures

                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement  Kappa    .390                   .071          5.564           .000
N of Valid Cases                105

a Not assuming the null hypothesis.
b Using the asymptotic standard error assuming the null hypothesis.
Under the null hypothesis of random rating, we can expect 36% matching
ratings. We have observed 61% matching ratings. Kappa is thus
computed as (0.61 – 0.36)/(1 – 0.36) = 0.39. The significant p-value (<0.001) indicates that the assumption of random rating is not plausible and must be rejected.
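In syntax form (a sketch; the variable names patient and physician are assumptions, as the actual names in “efficacy.sav” are not shown here):

CROSSTABS
  /TABLES=patient BY physician
  /STATISTICS=KAPPA.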
Note: Some programs offer the possibility to compute a weighted version of the Kappa measure, which assigns more weight to more serious mismatches (e. g., a mismatch between “very good” and “poor” counts twice as much as one between “very good” and “good”). The weighted Kappa measure is not implemented in SPSS.
Exercises
C.1
Source: Little et al (2002). Use the data file “Whitecoat.sav”, which
contains two measurements of blood pressure for each of 176
patients. While the first measurement was taken in an outpatient
department by a nurse, the second one was taken by a physician in
a primary care unit (“white coat”). Is blood pressure higher if
measured by a physician? Analyze the agreement of the two
measurements.
C.2
Source: Schwarz et al (2003). Perfusion was measured in fifty-nine
dialysis patients by two different ultrasound methods (ultrasound
dilution technique, UDT; color Doppler ultrasonography, CDUS).
Data are collected in the data file “Stenosis.sav”. Evaluate the
agreement of the two methods in a proper way.
C.3
Source: Bakker et al (1999). Kidney volume was measured in 20
persons by two methods: ultrasound and magnetic resonance. (data
file “Renalvolume.sav”). Both kidneys of each test person have been
evaluated. Evaluate the agreement of the two methods in a proper
way.
C.4
Use the data file “CT_US.sav”, which contains measurements of
tumor size taken by computer tomography (ct_mm), by ultrasound
(us_mm) and measured histologically (hist_mm). Which
measurements are closer to the histologically measured values,
those by CT or those by US?
C.5
Source: Fisher and van Belle (1993). Coronary arteriography is a
key diagnostic procedure to detect narrowing or stenosis in coronary
arteries. In the coronary artery surgery study (CASS) the quality of
the arteriography was monitored by comparing patient’s clinical site
readings with readings taken by a quality control site (which only evaluated the angiographic films and did not see the patients). From
these readings the amount of disease can be classified as “none”
(entirely normal), “zero-vessel disease but some disease”, and one-,
two- and three-vessel disease:
quality control site reading * clinical site reading Crosstabulation (Count)

                                        clinical site reading
                            normal   some    one    two   three   Total
quality control   normal        13      8      1      0       0      22
site reading      some           6     43      9      4       5      77
                  one            1      9    155     54      24     243
                  two            0      2     18    162      68     250
                  three          0      0     11     27     240     278
Total                           20     62    204    247     337     870
Before opening the data file “CASS.sav”, reflect how to enter such
data. Evaluate the agreement between clinical site readings and
quality control site readings!
E. Reference values
Reference ranges or tolerance ranges are used in clinical medicine to
judge whether a particular measurement taken from a patient (e. g., of a
lab investigation) indicates a potential pathology because of its extreme
value. When computing reference ranges, we are looking for limits which
define a range of, say, 95% of the non-pathologic values. With symmetric
limits, we can expect about 2.5% of patients to show a value above the
upper limit and 2.5% below the lower limit. Reference ranges can also be
computed as a function of time, e. g. growth curves for children.
Reference ranges can be computed by parametric (assuming a normal
distribution) or non-parametric statistics. Parametric reference ranges are
simply computed by making use of mean and standard deviation (e. g.,
mean +/- 1.96 standard deviations defines a 95% reference range).
However, parametric reference ranges should only be used if the data
agree well with a normal distribution. Sometimes this can be achieved by
transforming the original values using a log transformation:
Y = log(X + C)
with X and C denoting the original values and a suitable constant, and Y denoting the transformed values. C can be chosen such that the distribution of Y is as close to a normal distribution as possible, but it is restricted to values C > -min X so that X + C is always positive (logarithms of non-positive values are undefined).
using mean and standard deviation of Y. It can simply be transformed to
the original scale using the equations
AX = exp(AY) – C
and
BX = exp(BY) – C
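In SPSS, such a transformation is a one-line computation (a sketch; x stands for the original variable and C = 10 is an arbitrary illustrative choice):

* Forward transformation; LN denotes the natural logarithm.
COMPUTE y = LN(x + 10).
* A computed limit ay is transformed back by EXP(ay) - 10.
EXECUTE.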
Even with only small deviations from a normal distribution, non-parametric measures (e. g., 2.5th and 97.5th percentiles) should be preferred. There are also hybrid methods, e. g. the method by Harrell and Davis (1982). As minimum sample sizes to compute 95% reference ranges, 100 (for parametric methods) and 200 (for non-parametric methods) have been proposed (cf. Altman, 1991). In samples of such size it should be easy to judge whether a normal distribution can be assumed or not, e. g. by a histogram (see chapter 2) or by a Q-Q plot, which is described below.
Q-Q-Plot
The normal assumption can be verified in two ways:
• formally, by applying a statistical test (a test of normality) which assesses significant deviations from the null hypothesis of a normal distribution, or
• visually, using diagrams.
The former alternative offers the convenience of an automatic judgment, but it also has some drawbacks: with large samples (N>100), even irrelevant deviations from a normal distribution can lead to a significant test of normality (i. e., rejecting the normal assumption). On the other hand, with small samples (N<50), tests of normality may suffer from poor statistical power; departures from normality then have to be large to be detected by a significant test of normality.
The Q-Q plot is a visual tool to answer the question whether a scale variable follows a normal distribution. It is more sensitive than the simple comparison of dot plot and error bar plot presented in chapter 2. The Q-Q plot contrasts the original values, possibly after a log transformation, with the quantiles of an ideal normal distribution. If the original values are normally distributed, then all dots in the plot lie on a straight line extending from the lower left to the upper right corner of the plot.
The Q-Q plots are generated by choosing the menu item Graphs-Q-Q...
(exemplified on the data set “Dallal.sav”):
[Figure: Normal Q-Q plot (left) and detrended normal Q-Q plot (right) of pretest. X axes: Observed Value; y axes: Expected Normal Value and Deviation from Normal, respectively.]
SPSS generates two different diagrams. The first one shows the Q-Q plot described above. The second one, called detrended normal Q-Q plot, visualizes departures from normality in a fashion similar to the Bland-Altman plot described earlier, i. e. the differences between the observed values and the expected quantiles assuming a normal distribution are plotted against the observed values in order to show where the data are close to normal and where they are not. We see that the empirical distribution of the data departs from the normal particularly at its edges.
The detrended Q-Q plot often exaggerates departures from the normal distribution; this is caused by the range of the y axis, which is adapted to the largest deviation. Therefore, it should always be evaluated in conjunction with the normal Q-Q plot, and it should not show any systematic departures from the straight line, as exemplified below:
[Figure: Normal Q-Q plot (left) and detrended normal Q-Q plot (right) of Marker1. X axes: Observed Value; y axes: Expected Normal Value and Deviation from Normal, respectively.]
The C-shaped and U-shaped impressions of the Q-Q and detrended Q-Q
plots, respectively, indicate that a log transformation might help in
transforming the observed distribution into normal.
If a log transformation is used, one may try the transformation Y = log(X + C) with various values for the constant C, and choose the one that yields the best approximation to a normal distribution.
The visual impression of the Q-Q plot can also be quantified by a statistical measure. With a perfect normal distribution, all dots lie on a straight line, i. e. the Pearson correlation coefficient assumes the value 1. So we can compute the correlation coefficient from a Q-Q plot and use it as an indicator of a normal distribution if it is close enough to 1. Clearly, it will never be exactly equal to 1 in finite samples, but a value of 0.98 or higher can be considered satisfactory. Using the correlation coefficient, we can also judge which value of C is optimal in transforming the empirical distribution into normal.
As an example, consider the data set “Dallal.sav”. Assume we would like
to evaluate the normal assumption on the variable “pretest” using a Q-Q
plot.
In order to compute the correlation coefficient from the Q-Q plot, we must have access to the values that are depicted by that plot. Thus, we have to compute the theoretical normal distribution quantiles on our own. This can be done using the menu item Transform-Rank Cases, selecting pretest as variable:
After clicking on Rank Types... we select Normal scores:
Afterwards, we compute the Pearson correlation coefficient (menu item Analyze-Correlate-Bivariate) of the original variable pretest and the normal scores. It is fairly high, assuming a value of 0.9986.
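The corresponding syntax sketch (the name nscore for the normal scores is an assumption; by default, SPSS derives a name from the source variable):

RANK VARIABLES=pretest
  /NORMAL INTO nscore
  /PRINT=NO.
CORRELATIONS /VARIABLES=pretest nscore.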
The following table can be used to assess values of the Q-Q plot
correlation coefficient for various sample sizes. This table was generated
by simulating normally distributed samples of various sizes. If data come
from a normally distributed population, then the stated values of the Q-Q
correlation will be exceeded with a probability of 99%. With a sample size
N > 100, which is suggested to compute parametric reference ranges, the
Q-Q correlation coefficient should exceed 0.983.
N        Q-Q correlation coefficient
100      >0.983
200      >0.992
300      >0.994
400      >0.995
500      >0.996
1000     >0.998
Exercise
D.1
Normal range. The data file “ALT.sav” contains measurements of the
parameter ALT on a sample of 240 representative test persons. The
data are already categorized, therefore they should be weighted by
the frequency variable. Compute a parametric 95% reference range
for ALT using an appropriate transformation! Try various values of C
to yield a distribution as close as possible to a normal distribution.
F. SPSS-Syntax
SPSS Syntax Editor
If similar analyses have to be repeated several times, it can be
cumbersome to repeat choosing menu items using the mouse. Therefore,
SPSS offers the possibility to perform analyses using the SPSS syntax
language. This language is easy to learn because SPSS translates each
command that is called by selecting particular menu items into SPSS
syntax. The translations can be made visible by clicking on Paste in a
menu:
After clicking on Paste the corresponding SPSS syntax is pasted into the
SPSS syntax editor window. Each SPSS syntax command begins with a
key word (e. g., COMPUTE) and ends with a full stop.
To run a syntax (several commands), click on Run. Now you can choose among:
• All: executes all commands in the syntax editor window.
• Selection: executes all (partly) selected commands.
• Current: executes the command where the cursor is currently located.
• To End: executes all commands from the current cursor location to the end.
The contents of a syntax editor window can be saved to be recalled later,
e. g. with other data sets, or to reanalyze a data set during a revision.
If the same analysis should be rerun using various variables, one can save
a lot of time using SPSS syntax. Assume we want to generate box plots
for all scale variables in the data set “bmi.sav” (height, weight, abdominal
girth, hip measurement, BMI), grouped by sex and age group. We just call
the menu defining a box plot:
Instead of clicking on OK, we click Paste. In the syntax window the
command which generates a grouped box plot is displayed:
Now this command can be selected, copied and pasted several times
below. Each time we replace “weight” by the names of the other scale
variables:
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=sex weight age_group[LEVEL=NOMINAL]
MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: sex=col(source(s), name("sex"), unit.category())
DATA: weight=col(source(s), name("weight"))
DATA: age_group=col(source(s), name("age_group"), unit.category())
DATA: id=col(source(s), name("$CASENUM"), unit.category())
COORD: rect(dim(1,2), cluster(3,0))
GUIDE: axis(dim(3), label("Sex"))
GUIDE: axis(dim(2), label("Weight"))
GUIDE: legend(aesthetic(aesthetic.color), label("age_group"))
SCALE: cat(dim(3), include("1", "2"))
SCALE: linear(dim(2), include(0))
ELEMENT: schema(position(bin.quantile.letter(age_group*weight*sex)), color(age_group), label(id))
END GPL.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=sex height age_group[LEVEL=NOMINAL]
MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: sex=col(source(s), name("sex"), unit.category())
DATA: height=col(source(s), name("height"))
DATA: age_group=col(source(s), name("age_group"), unit.category())
DATA: id=col(source(s), name("$CASENUM"), unit.category())
COORD: rect(dim(1,2), cluster(3,0))
GUIDE: axis(dim(3), label("Sex"))
GUIDE: axis(dim(2), label("Height"))
GUIDE: legend(aesthetic(aesthetic.color), label("age_group"))
SCALE: cat(dim(3), include("1", "2"))
SCALE: linear(dim(2), include(0))
ELEMENT: schema(position(bin.quantile.letter(age_group*height*sex)), color(age_group), label(id))
END GPL.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=sex waist age_group[LEVEL=NOMINAL]
MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: sex=col(source(s), name("sex"), unit.category())
DATA: waist=col(source(s), name("waist"))
DATA: age_group=col(source(s), name("age_group"), unit.category())
DATA: id=col(source(s), name("$CASENUM"), unit.category())
COORD: rect(dim(1,2), cluster(3,0))
GUIDE: axis(dim(3), label("Sex"))
GUIDE: axis(dim(2), label("Waist"))
GUIDE: legend(aesthetic(aesthetic.color), label("age_group"))
SCALE: cat(dim(3), include("1", "2"))
SCALE: linear(dim(2), include(0))
ELEMENT: schema(position(bin.quantile.letter(age_group*waist*sex)), color(age_group), label(id))
END GPL.
And so on …
Now the cursor is located at the first of the commands, and we choose the
menu item Run-To End. All box plots are generated.
Assume we notice an outlier (id 4) with implausible values for height and bmi. This outlier can be removed and the analysis simply repeated by again locating the cursor at the first GGRAPH command and choosing Run-To End.
The Session journal
The syntax of all commands that SPSS executes is collected in a file called
the Session journal. This journal can be used to learn SPSS syntax
language or to save analyses for later reference.
The SPSS session journal is saved somewhere on the hard disk, depending
on the installation. The folder where it is saved can be queried by
choosing the menu item Edit-Options and selecting the File Locations
view:
Here, the location of the Session Journal can even be changed by choosing
a different folder after clicking on Browse.... By default, new commands
are appended to the existing journal. Generating a new Session Journal
each time SPSS is invoked could be useful under certain circumstances.
The Session Journal file can be opened using the Notepad editor (or the
SPSS Syntax editor). Here we can also select and copy commands to use
them in other SPSS programs.
G. Exact tests
Recall example 5.2.1: comparison of therapies of Dr. X. The standard therapy is compared with a new therapy; the outcome is binary (cured versus not cured). The research question is: Is there a difference between the cure rates of the two therapies?
                        cured
                      yes    no
standard therapy        4    12    16
new therapy             9     9    18
                       13    21    34
A chi-square test was calculated:
Chi-Square Tests

                                Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              2.242(b)   1   .134
Continuity Correction(a)        1.308      1   .253
Likelihood Ratio                2.286      1   .131
Fisher's Exact Test                                          .172         .126
Linear-by-Linear Association    2.176      1   .140
N of Valid Cases                34

a Computed only for a 2x2 table
b 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.12.
We already know that under the null hypothesis the Pearson Chi-Square
criterion for a 2×2 crosstable follows a chi-square distribution with one
degree of freedom (“df”).
For the observed value of the test statistic of 2.242, the p-value is 0.134.
When reading the output table, two issues arise. One of them concerns the terms “Asymptotic Significance” and “Exact Significance”, and the other concerns the remark “0 cells (.0%) have expected count less than 5. The minimum expected count is 6.12.”.
These will be investigated in greater detail now. Remember the expected
number of patients in case of validity of the null hypothesis:
expected, if null hypothesis applies:

                        cured
                      yes     no
standard therapy      6.1    9.9    16
new therapy           6.9   11.1    18
                     13     21      34
If we only knew the marginal sums of example 5.2.1, then the following 14 different 2×2 cross tables would be possible:
X    cured (ST)   not cured (ST)   cured (NT)   not cured (NT)
0         0             16              13              5
1         1             15              12              6
2         2             14              11              7
3         3             13              10              8
4         4             12               9              9
5         5             11               8             10
6         6             10               7             11
7         7              9               6             12
8         8              8               5             13
9         9              7               4             14
10       10              6               3             15
11       11              5               2             16
12       12              4               1             17
13       13              3               0             18

(Each line represents one 2×2 cross table with standard therapy (ST) and new therapy (NT) as rows; all tables share the fixed margins 16, 18, 13, 21 and the total of 34.)
Notice that with fixed margins it is sufficient to know one field of the 2×2 cross table. We choose the upper left field (this corresponds to the number of cured patients under standard therapy).
Assume the null hypothesis is true (= both therapies work equally well); then we can calculate the probability of observing each potential cross table just by chance. For those who are interested: the probabilities are calculated using the hypergeometric distribution.
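In SPSS, these probabilities can be reproduced with the built-in hypergeometric density function (a sketch; it assumes that a variable x holding the values 0 to 13 has been entered, and that PDF.HYPER takes its arguments in the order quantile, total, sample size, hits):

COMPUTE p = PDF.HYPER(x, 34, 16, 13).
EXECUTE.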
X      Pearson's chi-square criterion    Probability (under the null hypothesis)
0*                18.7                                0.0000092
1*                13.1                                0.0003201
2*                 8.5                                0.0041152
3*                 4.9                                0.0264062
4*                 2.2                                0.0953555
5                  0.62                               0.2059680
6                  0.0069                             0.2746240
7                  0.39                               0.2288533
8                  1.8                                0.1188277
9*                 4.2                                0.0377231
10*                7.5                                0.0070416
11*               11.9                                0.0007202
12*               17.3                                0.0000353
13*               23.7                                0.0000006

(* cross tables whose chi-square criterion is equal to or greater than the observed value of 2.2)
X denotes the value of the upper left cell of the 2×2 cross table. We see that X = 5, 6, 7 are most likely. This is not surprising, as these cross tables correspond closely to the expected cross table under the null hypothesis. Pearson's chi-square criterion for every potential cross table is given in the middle column. For X = 6 the smallest value results, for X = 7 the second smallest value, and so on.
In example 5.2.1 we have observed a cross table with X = 4. We can now use the probability distribution shown above to determine an exact p-value. For this we have to add the probabilities of all cross tables whose chi-square criteria are equal to or greater than 2.2 (marked with * in the table above). We obtain an exact p-value of 0.1717. This procedure is also known as Fisher's exact test.
The SPSS output for example 5.2.1 confirms our calculation.
Remark: The SPSS output gives a p-value, labelled “Exact Sig. (1-sided)”, of 0.126 for Fisher's exact test in example 5.2.1. From our calculations above we can understand how SPSS computes this value: the probabilities in the table above were summed from X = 0 to 4.
Note that this one-sided alternative hypothesis is generated and tested automatically based on the observed data, which is scientifically questionable. One-sided p-values therefore should not be presented automatically in this way.
[Figure: bar chart (y axis: Probability function, 0.00 to 0.30; x axis: Chi-Square criterion, 0 to 24).]
Graphical presentation of the probability function of the chi-square criterion from example 5.2.1 under the assumption that the null hypothesis is true.
One may ask why we obtain an asymptotic p-value of 0.134 from the chi-square test when the exact p-value for example 5.2.1 is 0.1717.
For the time being: the asymptotic p-value (which is based on the chi-square distribution) is an approximation of the exact p-value. Like most approximations in statistics, this approximation becomes better (i. e., more precise) with increasing sample size.
[Figure: step plot (y axis: Distribution function, 0.0 to 1.0; x axis: Chi-Square criterion, 0 to 24).]
Graphical presentation of the asymptotic (dashed line) and exact (solid line) cumulative distribution functions of the chi-square criterion for example 5.2.1 under the assumption that the null hypothesis is true.
This approximation of the exact distribution by the asymptotic distribution seems to be satisfactory. Nevertheless, why do we use the asymptotic p-value if the exact p-value is available?
Answer: Though the term "exact" gives the impression that it is the better (i. e., the more precise) version of the test, we have to take two arguments into account. First, exact tests in general require special computer programs and powerful computers. Second, exact tests are conservative due to the discrete test statistic. In its statistical meaning, a conservative test adheres to the null hypothesis in too many situations; in other words, the accepted error probability (the significance level) is not fully used.
A rule of thumb for the decision between the asymptotic and exact versions of the chi-square test is:
• If all expected cell frequencies (under the null hypothesis) are higher than 5, then the asymptotic p-value can be used. Otherwise, use the exact p-value.
For example 5.2.1 the smallest expected cell frequency is 6.1.
Sometimes another rule of thumb is used as an additional step:
• If the sample size is smaller than 60, then use the asymptotic p-value corrected by Yates. This version is also named the continuity-corrected chi-square test. In the SPSS output it can be found in the row “Continuity Correction”.
For example 5.2.1 a continuity-corrected p-value of 0.253 is calculated.
Are there other exact tests?
Yes. In theory, exact p-values could be calculated for any statistical test. However, in practice only non-parametric tests are calculated in an exact version, e.g. the already familiar Wilcoxon-Mann-Whitney U test. The procedure is similar to the algorithm shown for example 5.2.1. However, for problems exceeding a test for a 2×2 table, the required computing effort can be considerable.
Nowadays, more and more software packages offer exact tests (like SPSS, SAS, …). However, for many problems one has to switch to special software.
For very small sample sizes
⇒ use exact instead of asymptotic p-values!!
H. Equivalence trials
Most clinical studies are designed to detect differences between two treatments. Sometimes, however, one would like to show that a certain therapy (NT) is equal to the standard therapy (ST). Such studies are called equivalence trials. Reasons for such studies could be that NT has fewer side effects, is cheaper or easier to apply.
Equivalence trials are also called similarity trials. This second expression refers to the circumstance that exact similarity of two treatments can never be shown, even if they are equivalent in reality. What we can show is “sufficient similarity” of the treatments.
First, we have to consider what “sufficient similarity” means. It does not bother us if NT works better than ST. If NT = ST, then everything is alright. Even if NT is slightly worse than ST, we find NT acceptable. Thus we have the situation of a one-sided hypothesis.
But what does “slightly worse” mean? To define it, we need clinical considerations about which difference is still acceptable. This smallest acceptable difference (derived from clinical rationale) is abbreviated by Θ0.
Example: comparison of cure rates, Θ0 = 0.03.
Null hypothesis: ST is at least 3 percentage points better than NT, ST – NT ≥ 0.03 (no equivalence; it is again the negation of the research question!)
Alternative hypothesis: ST is worse than, equal to, or at most 3 percentage points better than NT, ST – NT < 0.03 (equivalence)
Solution: one-sided test based on a significance level of 5%, or one-sided 95% confidence interval, or two-sided 90% confidence interval
Note:
Type 1 error: falsely claiming equivalence although the null hypothesis is true.
Type 2 error: falsely claiming no equivalence although the alternative hypothesis is true.
If we want to show that the effects of two therapies are not too different in either direction, we have the situation of a two-sided research question. First, we have to define two smallest acceptable differences (one for each direction of deviation, above and below). For this we define two one-sided null hypotheses:
E.g. comparison of cure rates, Θ01 = 0.03, Θ02 = 0.07.
Null hypothesis 1: ST is at least 3 percentage points better than NT, ST – NT ≥ 0.03 (no equivalence)
Null hypothesis 2: NT is at least 7 percentage points better than ST, NT – ST ≥ 0.07 (no equivalence)
Alternative hypothesis: ST is at most 3 percentage points better than NT, equally good, or at most 7 percentage points worse than NT, -0.07 < ST – NT < 0.03 (equivalence)
Only when both one-sided null hypotheses are rejected at the 5% significance level can we assume equivalence.
Two-sided hypotheses occur primarily in bioequivalence/bioavailability trials addressing pharmacokinetic or pharmacodynamic questions.
Remark: Sample size calculations can also be made for equivalence trials. The “better” sample size calculation programs offer separate menu items for that.
I. Describing statistical methods for medical
publications
If statistical methods have been used in a medical study then they must
be described adequately in the resulting research paper. The “statistical
methods” section is usually positioned at the end of the “material and
methods” chapter (before the “results” chapter).
The following principles should be observed:
• On the one hand, descriptions of the statistical principles should be short and precise, as the medical aspects are in the foreground in medical manuscripts.
• On the other hand, the description should be detailed enough. In other words: following the description in the statistical methods section and using the same set of data, all results should be reproducible by an independent researcher.
• Empirical results do not belong in the statistical methods section.
How should a statistical methods section be organized?
• Description of any descriptive measures, data transformations, etc. that were used.
• Description of any statistical tests, statistical models, and methods for multiplicity adjustment that were applied.
• Description of the software used.
• Description of the significance level used and the type of alternative hypotheses tested (one- or two-sided).
Example (statistical methods section for the rat diet example 4.1.1):
Weight gain in both groups was described by mean and standard deviation. Differences between the groups were assessed by the unpaired t-test. The SPSS statistical software system (SPSS Inc., Chicago, IL) was used for statistical calculations. The reported p-value is the result of a two-sided test. A p-value smaller than or equal to 5% is considered statistically significant.
J. Dictionary: English-German
English | German | Remarks
adjusted R-squared measure | korrigiertes R-Quadrat | See also: coefficient of determination
alternative hypothesis | Alternativhypothese |
analysis of variance | Varianzanalyse | Abbr.: ANOVA
ANOVA | ANOVA | Abbr. for: ANalysis Of VAriance
arithmetic mean | arithmetisches Mittel |
coefficient of determination | Bestimmtheitsmaß, R2, R-Quadrat | See also: R-squared measure
confidence interval | Konfidenzintervall |
correlation | Korrelation |
degree of freedom | Freiheitsgrad | Abbr.: df
dependent variable | abhängige Variable |
distribution | Verteilung |
estimation | Schätzung |
exact test | exakter Test |
geometric mean | geometrisches Mittel |
independent | unabhängig | Clearly, we intend the statistical meaning (not the colloquial) here
independent variable | unabhängige Variable |
interaction | Wechselwirkung |
leverage point | einflussreiche Beobachtung (einflussreicher Punkt, Hebelwert) | Such an observation has a potentially large “leverage effect” on the regression line
linear regression | lineare Regression |
logarithm | Logarithmus |
mean | Mittelwert |
median | Median |
multiple comparison problem | Multiplizitätsproblem |
multiple testing problem | Multiplizitätsproblem |
nonparametric test | nichtparametrischer Test |
null hypothesis | Nullhypothese |
one-sided test, one-tailed test | einseitiger Test |
one-way ANOVA | einfaktorielle Varianzanalyse |
outcome, outcome variable | Zielgröße |
outlier | Ausreißer |
paired test | gepaarter Test |
percent | Prozent |
percentage point | Prozentpunkt |
population | Grundgesamtheit |
power | Mächtigkeit |
probability | Wahrscheinlichkeit |
prognostic factor | Prognosefaktor |
p-value | p-Wert |
quartile | Quartile |
random experiment | Zufallsexperiment |
range | Spannweite |
regression model | Regressionsmodell |
residual | Residuum |
response variable | Zielgröße |
R-squared measure | Bestimmtheitsmaß, R2, R-Quadrat | Abbr.: R2
sample | Stichprobe |
sample size | Fallzahl, Stichprobengröße |
sensitivity | Sensitivität |
sign test | Vorzeichentest |
significance level | Signifikanzniveau |
skewed distribution | schiefe Verteilung |
smallest difference important to detect | minimal klinisch relevante Alternative |
specificity | Spezifizität |
standard deviation | Standardabweichung |
standard error | Standardfehler | The standard error is the standard deviation of the mean
statistically significant | statistisch signifikant |
summary measure | problemorientierter Parameter |
test | Test |
t-test | t-Test | also: Student's t-test ("Student" was the pseudonym of W.S. Gosset, the inventor of the t-test)
two related samples | zwei verbundene Stichproben | SPSS jargon
two-sided test, two-tailed test | zweiseitiger Test |
unpaired test | ungepaarter Test |
variable | Variable |
variable transformation | Transformation einer Variablen |
variance | Varianz |
Wilcoxon rank sum test | Wilcoxon Rangsummentest |
Wilcoxon signed-rank test | Wilcoxon Vorzeichen-Rangtest |
References
Citations in bold typeface are introductory books in agreement with the contents of these lecture notes which may serve as supplemental material to the student. Other citations are either interesting further reading, specialized statistics books or sources of examples used in these lecture notes.

• D.G. Altman (1991): Practical Statistics for Medical Research. Chapman and Hall, London, UK.
• D.G. Altman (1992): Practical Statistics for Medical Research. Chapman and Hall.
• D.F. Andrews and A.M. Herzberg (1985): Data - A Collection of Problems from many Fields for the Student and Research Worker. Wiley, New York.
• J. Bakker, M. Olree, R. Kaatee, E.E. de Lange, K.G.M. Moons, J.J. Beutler and F.J.A. Beek (1999): Renal volume measurements: Accuracy and repeatability of US compared with that of MR imaging. Radiology, 211:623-628.
• R. Bender and S. Lange (2001): Adjusting for multiple testing - when and how? Journal of Clinical Epidemiology, 54(4):343-349.
• M. Bland (1995): An Introduction to Medical Statistics. Second Edition. Oxford University Press.
• J.M. Bland and D.G. Altman (1986): Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 1:307-310.
• A. Bühl (2006): SPSS Version 14. Einführung in die moderne Datenanalyse. 10. überarbeitete Auflage. Pearson Studium (German).
• M.J. Campbell and D. Machin (1993): Medical Statistics - A Commonsense Approach. John Wiley & Sons, New York.
• B. Dawson and R.G. Trapp (2004): Basic & Clinical Biostatistics. Fourth Edition. McGraw-Hill.
• A.R. Feinstein, D.M. Sosin and C.K. Wells (1985): The Will Rogers phenomenon. Stage migration and new diagnostic techniques as a source of misleading statistics for survival in cancer. The New England Journal of Medicine, 312(25):1604-1608.
• L.D. Fisher and G. van Belle (1993): Biostatistics - Methodology for the Health Sciences. Wiley, New York.
• R.H. Fletcher and S.W. Fletcher (2005): Clinical Epidemiology. The Essentials. Fourth Edition. Lippincott Williams & Wilkins. (A German version of the book appeared in 2007 as "Klinische Epidemiologie: Grundlagen und Anwendung". 2. Auflage, Huber, Bern.)
• R.J. Freund and P.D. Minton (1979): Regression Methods. A Tool for Data Analysis. Marcel Dekker.
• G. Gigerenzer (2002): Das Einmaleins der Skepsis. Über den richtigen Umgang mit Zahlen und Risiken. Berlin Verlag (German).
• I. Guggenmoos-Holzmann and K.-D. Wernecke (1995): Medizinische Statistik. Blackwell Wiss.-Verlag (German).
• F.E. Harrell and C.E. Davis (1982): A new distribution-free quantile estimator. Biometrika, 69:635-640.
• R.-D. Hilgers, P. Bauer and V. Scheiber (2003): Einführung in die Medizinische Statistik. Springer-Verlag (German).
• S. Holm (1979): A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics, 6:65-70.
• J.C. Hsu (1996): Multiple Comparisons. Theory and Methods. Chapman and Hall.
• M.H. Katz (1999): Multivariable Analysis. A Practical Guide for Clinicians. Cambridge University Press.
• D. Kendrick, K. Fielding, E. Bentley, R. Kerslake, P. Miller and M. Pringle (2001): Radiography of the lumbar spine in primary care patients with low back pain: randomised controlled trial. British Medical Journal, 322:400-405.
• S. Landau and B.S. Everitt (2004): A Handbook of Statistical Analyses using SPSS. Chapman & Hall/CRC.
• P. Little, J. Barnett, L. Barnsley, J. Marjoram, A. Fitzgerald-Barron and D. Mant (2002): Comparison of agreement between different measures of blood pressure in primary care and daytime ambulatory blood pressure. British Medical Journal, 325:254-257.
• R.J. Lorenz (1988): Grundbegriffe der Biometrie. G. Fischer, Stuttgart (German).
• D. Machin, M. Campbell, P. Fayers and A. Pinol (1997): Sample Size Tables for Clinical Studies. 2nd Edition. Blackwell Science.
• D.E. Matthews and V.T. Farewell (1988): Using and Understanding Medical Statistics. 2nd, revised edition. Karger.
• R. Matthews (2001), translated by J. Engel: Der Storch bringt die Babies zur Welt (p=0.008). Stochastik in der Schule, 21:21-23.
• H. Motulsky (1995): Intuitive Biostatistics. Oxford University Press.
• J. Pallant (2005): SPSS Survival Manual. 2nd edition. Open University Press.
• G. v. Randow (1992): Das Ziegenproblem. Denken in Wahrscheinlichkeiten. Rowohlt Verlag (German).
• J. Reisinger, E. Gatterer, G. Heinze, K. Wiesinger, E. Zeindlhofer, M. Gattermeier, G. Poelzl, H. Kratzer, A. Ebner, W. Hohenwallner, K. Lenz, J. Slany and P. Kuhn (1998): Prospective comparison of flecainide versus sotalol for immediate cardioversion of atrial fibrillation. American Journal of Cardiology, 81:1450-1454.
• M.F. Schilling, A.E. Watkins and W. Watkins (2002): Is human height bimodal? The American Statistician, 56:223-229.
• M. Schumacher and G. Schulgen (2002): Methodik klinischer Studien. Methodische Grundlagen der Planung, Durchführung und Auswertung. Springer-Verlag (German).
• C. Schwarz, C. Mitterbauer, M. Boczula, T. Maca, M. Funovics, G. Heinze, M. Lorenz, J. Kovarik and R. Oberbauer (2003): Flow monitoring: Performance characteristics of ultrasound dilution versus color Doppler ultrasound compared with fistulography. American Journal of Kidney Diseases, 42:539-545.
• J.P. Shaffer (1986): Modified Sequentially Rejective Multiple Test Procedures. Journal of the American Statistical Association, 81(395):826-831.
• W. Timischl (2000): Biostatistik. Springer, Wien (German).