December 11, 2006
Introductory Statistics: Exercises
Files used in the following exercises may be downloaded from
http://www.med.uio.no/imb/stat/kursfiler/.
Some exercises from Altman’s book are included at the end.
Exercise 1
We shall use the statistical computer package SPSS. After login you may start SPSS by
clicking on the appropriate program from the start menu.
1) You shall start by writing the following data into column 1 in the data editor (these are
measurements of weight (kg) for 20 students):
50 75 70 74 95 83 65 94 66 65
65 75 84 55 73 68 72 67 53 65
Double click on the column name, and write in a variable name of your own choice.
2) You shall make a small descriptive analysis of the data by running the following
commands: Click Analyze - Descriptive Statistics - Explore. Transfer the relevant variable
to Dependent List. Click at Plots, remove the tick at “Stem and leaf” and put instead a tick at
“Histogram”. Click at Continue to leave this menu. Then click OK to have the job done.
3) Look at the output and interpret it. The table of “Descriptives” will contain some things
you have learnt and some things that are unknown. Look closely to see which you recognize
and write them down on paper.
4) Compute the median of the data by hand and compare with the computer result.
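As a cross-check on the hand calculation in 4), the same summary can be reproduced outside SPSS. The following is only a minimal Python sketch and not part of the course material; the variable names are chosen here for illustration.

```python
import statistics

# Weights (kg) of the 20 students from Exercise 1
weights = [50, 75, 70, 74, 95, 83, 65, 94, 66, 65,
           65, 75, 84, 55, 73, 68, 72, 67, 53, 65]

ordered = sorted(weights)
n = len(ordered)

# With an even number of observations, the median is the average
# of the two middle values of the sorted data.
median_by_hand = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print("mean:  ", statistics.mean(weights))
print("median:", median_by_hand)
```

The "by hand" rule and statistics.median should agree; SPSS reports the same quantity in the Descriptives table.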
Exercise 2: Analysis of data concerning lung function
Lung function has been measured on 106 medical students. Peak expiratory flow rate (PEF,
measured in liters per minute) was measured three times in a sitting position and three times
in a standing position. The measurements may be found in the file PEFH98-english.SAV.
An SPSS-file containing data can be read in by clicking in SPSS: File - Open. You then have
to look in the directory where you have saved your file and double click the file name. You
will then get the data on the screen. Check them to see that you understand what the numbers
say.
The file PEFH98-english.SAV contains the following variables:
Age (years)
Gender (1-female, 2-male)
Height (cm)
Weight (kg)
PEF measured three times in a sitting position (pefsit1, pefsit2, pefsit3)
PEF measured three times in a standing position (pefsta1, pefsta2, pefsta3)
Mean of the three measurements made in a sitting position (pefsitm)
Mean of the three measurements made in a standing position (pefstam)
Mean of all six PEF-values (pefmean)
Carry out the following exercises:
1)
Make histograms of height, weight, age, pefsitm and pefstam. Compute mean and
median. Interpret the results.
Click Analyze - Descriptive Statistics - Explore. Mark the relevant variables and transfer
them to Dependent List. Click at Plots, remove the tick at “Stem and leaf” and put instead a
tick at “Histogram”. Click at Continue to leave this menu. Then click OK to have the job
done.
2)
Make histograms for the variables height and pefmean for men and women
separately. Conclusion? You should also look at ”Box plots” to compare the genders.
Do just like in 1), but here you should put the variable gender in the Factor List. (If you need
a reminder as to what Box plots are, click at Help and Topics and write box. Then double
click at Box-and-whiskers-plot.)
3)
Include normal distribution curves in the histograms you made. Do you think the
curves fit well?
Double click in the figure. This brings you into the SPSS Chart Editor. Then double click
again on the histogram. Go to histogram options, and check off ”Display normal curve” in
the relevant box.
4)
Use scatter diagrams to compare the pef-measurements on the one hand and height,
weight and age on the other hand. What sort of associations do you see?
You get a scatter diagram by clicking Graphs – Legacy dialogs – Scatter/dot - Simple. You
can for instance choose pefmean on the Y-Axis and weight on the X-Axis. If you want to make
separate diagrams according to gender, then you put this variable into “Set markers by”.
5)
Use SPSS to draw lines in the scatter diagrams.
Double click in the figure (i.e. the scatter diagram). This brings you into the SPSS Chart
Editor. Click the right mouse button at one of the points in the new diagram, and choose e.g.
“Add Fit Line at Total”. This gives you a straight line (a so-called regression line). You also
get a menu which allows you to choose between several curves. Try out some of the
possibilities.
Exercise 3
Lung function was measured on 106 medical students. Peak expiratory flow rate (PEF,
measured in liters per minute) was measured three times in a sitting position and three times
in a standing position.
The measurements may be found in the file PEFH98-english.SAV. This file can be read in by
clicking in SPSS: File - Open. You then have to look in the directory where you have saved
your file and double click the file name. You will then get the data on the screen. Check them
to see that you understand what the numbers say.
The file PEFH98-english.SAV contains the following variables:
Age (years)
Gender (1-female, 2-male)
Height (cm)
Weight (kg)
PEF measured three times in a sitting position (pefsit1, pefsit2, pefsit3)
PEF measured three times in a standing position (pefsta1, pefsta2, pefsta3)
Mean of the three measurements made in a sitting position (pefsitm)
Mean of the three measurements made in a standing position (pefstam)
Mean of all six PEF-values (pefmean)
Do the following exercises:
1)
Make a scatter plot of pefsit1 versus pefsit2, and a separate scatter plot of pefsit1
versus weight. Edit the scatter plots to insert a regression line. Also, insert confidence
and prediction curves. Interpret the results.
You get a scatter plot by clicking Graphs – Legacy dialogs – Scatter/dot - Simple. You can
for instance choose pefsit2 on the Y-Axis and pefsit1 on the X-Axis. Double click in the figure
(i.e. the scatter diagram). This brings you into the SPSS Chart Editor. Then right click the
figure and choose Add Fit Line at Total. This gives a line fitting the data.
2)
Compute the correlation between pefsit1 and pefsit2, and between pefsit1 and weight.
Also, choose some more of the variables to correlate. Why are some correlations close
to 1 while others are smaller?
You find the correlation by clicking Analyze - Correlate - Bivariate. Click the relevant
variables and move them in to the right field. Click OK.
3)
Make two regression analyses: (i) pefsit2 as dependent variable and pefsit1 as
independent variable, (ii) pefsit1 as dependent variable and weight as independent
variable. Interpret the results in relation to the scatter plots.
You do regression analysis by clicking Analyze - Regression - Linear and transferring the
relevant variables to Dependent and Independent.
4)
Make residual analysis for the analyses you performed above. Interpret the results.
Under Linear regression click Plots. Under Standardized Residual Plots choose Histogram
and Normal probability plot.
5)
Make a regression analysis with pefsit1 as dependent variable and sex and weight as
independent variables. Interpret the results.
6)
Finally, make a regression analysis with pefmean as dependent variable and sex,
height and weight as independent variables. Interpret the results.
Exercise 4
A social researcher interviews 25 newly married couples. Each husband and wife are
independently asked the question: "How many children would you like to have?" The
following data are obtained:
Couple   Answer of husband   Answer of wife
1        3                   2
2        1                   1
3        2                   1
4        2                   3
5        5                   1
6        0                   1
7        0                   2
8        1                   3
9        2                   2
10       3                   1
11       4                   2
12       1                   2
13       3                   3
14       2                   1
15       3                   2
16       2                   2
17       0                   0
18       1                   2
19       2                   1
20       3                   2
21       4                   3
22       3                   1
23       0                   0
24       1                   2
25       1                   1
Do the data show a significant difference of opinions between husbands and wives regarding
an ideal family? Use a nonparametric test.
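One possible nonparametric test for paired answers is the sign test. The sketch below is an illustration only, not the course's prescribed solution (the Wilcoxon signed-rank test is another common choice). It assumes the table is read column by column, so that the first husband answer belongs with the first wife answer; the variable names are made up for this example.

```python
from math import comb

# Answers for couples 1-25 (assumed pairing: same position = same couple)
husband = [3, 1, 2, 2, 5, 0, 0, 1, 2, 3, 4, 1, 3,
           2, 3, 2, 0, 1, 2, 3, 4, 3, 0, 1, 1]
wife = [2, 1, 1, 3, 1, 1, 2, 3, 2, 1, 2, 2, 3,
        1, 2, 2, 0, 2, 1, 2, 3, 1, 0, 2, 1]

diffs = [h - w for h, w in zip(husband, wife)]
n_pos = sum(d > 0 for d in diffs)   # husband wants more children
n_neg = sum(d < 0 for d in diffs)   # wife wants more children
n = n_pos + n_neg                   # ties are dropped in the sign test

# Under H0 the number of positive signs is Binomial(n, 0.5).
k = min(n_pos, n_neg)
p_two_sided = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

print(n_pos, n_neg, round(p_two_sided, 4))
```

The sign test uses only the direction of each difference; a signed-rank test also uses the size of the differences and will usually be more powerful.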
Exercise 5
(Taken from an exercise by Ørnulf Borgan). In this exercise you should use the actuarial
method to calculate a survival curve until the age of 5 years for the cohort of Norwegian
women born in 1880. Data (from Statistics Norway) is given in the table below. The censored
data corresponds to net emigration.
Interval (years)   Alive at interval start   Deaths   Censored
0-1                26967                     1377     279
1-2                25311                     848      309
2-3                24154                     477      233
3-4                23444                     343      130
4-5                22971                     257      111
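The actuarial calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual actuarial convention that censored individuals count as being at risk for half of the interval; check the lecture notes for the exact convention used in the course.

```python
# (alive at interval start, deaths, censored) for the intervals 0-1, ..., 4-5
table = [
    (26967, 1377, 279),
    (25311, 848, 309),
    (24154, 477, 233),
    (23444, 343, 130),
    (22971, 257, 111),
]

survival = []
cumulative = 1.0
for alive, deaths, censored in table:
    # Effective number at risk: censored persons count for half the interval
    at_risk = alive - censored / 2
    cumulative *= 1 - deaths / at_risk
    survival.append(cumulative)

for end_age, prob in enumerate(survival, start=1):
    print(f"S({end_age}) = {prob:.4f}")
```

Each printed value estimates the probability of surviving to the end of the corresponding interval.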
Exercise 6
(Taken from Danish course notes). The following data is from 35 patients with ovarian
cancer. The observation time is in days from the start of treatment to deterioration of disease.
Censored data is marked with *.
15 patients with tumor of “low grade” type:
28, 89, 175, 195, 309, 377*, 393*, 421*, 447*, 462,
709*, 744*, 770*, 1106*, 1206*
20 patients with tumour of "high grade" type:
34, 88, 137, 199, 280, 291, 299*, 300*, 309, 351, 358,
369, 369, 370, 375, 382, 392, 429*, 451, 1119*
Compute and draw a Kaplan-Meier survival curve for the patients with low grade tumour. Do
the same computation for the other group and draw the curve in the same diagram as the first
one. Compare the two survival curves.
(Data from Fleming et al., Biometrics, 1980, 36, 607-625.)
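The Kaplan-Meier computation for the low grade group can be sketched as follows. This is an illustrative Python sketch with a hypothetical helper function, kaplan_meier, written for this example; it assumes the common convention that an individual censored at time t is still counted as at risk at t.

```python
def kaplan_meier(observations):
    """Return [(event time, estimated survival)] for (time, is_event) pairs.

    is_event is False for censored observations (marked * in the exercise).
    """
    event_times = sorted({t for t, is_event in observations if is_event})
    surv = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for u, _ in observations if u >= t)
        events = sum(1 for u, is_event in observations if u == t and is_event)
        surv *= 1 - events / at_risk
        curve.append((t, surv))
    return curve


low_grade = [(28, True), (89, True), (175, True), (195, True), (309, True),
             (377, False), (393, False), (421, False), (447, False),
             (462, True), (709, False), (744, False), (770, False),
             (1106, False), (1206, False)]

low_curve = kaplan_meier(low_grade)
for t, s in low_curve:
    print(t, round(s, 3))
```

The same function can be applied to the high grade group after entering its 20 observations in the same (time, is_event) form.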
Exercise 7
In the table below you will find data describing the relationship between age and blood
pressure of 20 healthy adults.
Age   Blood pressure
20    120
43    128
63    141
26    126
53    134
31    128
58    136
46    132
58    140
70    144
46    128
53    136
70    146
20    124
63    143
43    130
26    124
19    121
31    126
23    123
Find the correlation between age and blood pressure and test if it is significant. Compute a
95% confidence interval for the regression parameter. Find also the squared correlation
coefficient between age and blood pressure. What does it mean?
What is the predicted blood pressure for a person aged 40? For a person aged 75? Comment.
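The correlation and the least-squares line can be computed directly from the sums of squares. The Python sketch below is for illustration only and assumes that the ages and blood pressures are paired in the order listed; the significance test and the confidence interval for the slope are left for SPSS or hand calculation.

```python
from math import sqrt

age = [20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
       46, 53, 70, 20, 63, 43, 26, 19, 31, 23]
bp = [120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
      128, 136, 146, 124, 143, 130, 124, 121, 126, 123]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(bp) / n
sxx = sum((x - mean_x) ** 2 for x in age)
syy = sum((y - mean_y) ** 2 for y in bp)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, bp))

r = sxy / sqrt(sxx * syy)            # Pearson correlation
slope = sxy / sxx                    # regression coefficient for age
intercept = mean_y - slope * mean_x


def predict(a):
    """Predicted blood pressure at age a from the fitted line."""
    return intercept + slope * a


print(round(r, 3), round(r ** 2, 3), round(slope, 3), round(intercept, 1))
print(round(predict(40), 1), round(predict(75), 1))
```

Note that age 75 lies outside the observed age range (19 to 70), so the second prediction is an extrapolation.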
Exercise 8 (3.13 in Aalen)
(Edited exercise from Larsen & Marx 1986). The U.S. Senate Committee on Labour and Public
Welfare studied the possibility of mapping child abuse. A team of experts was consulted,
suggesting the following probabilities:
i. Approximately 1 out of 100 children are exposed to abuse,
ii. a medical doctor can diagnose existing abuse in approximately 90% of the cases,
iii. a survey in large population groups would lead to approximately 3% of the non-abused children being classified as abused.
Compute the probability that a child classified as abused actually is abused. How
does the probability change if only 1 child out of 1000 is abused? What if 1 child out of 50 is
abused? Based on these calculations, how would you assess the possibilities of screening for
child abuse in a population?
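The calculation is an application of Bayes' formula. A minimal Python sketch, with a function name invented for this illustration, makes it easy to vary the assumed prevalence:

```python
def prob_abused_given_classified(prevalence,
                                 sensitivity=0.90,
                                 false_positive_rate=0.03):
    """P(abused | classified as abused) by Bayes' formula."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


for prevalence in (1 / 1000, 1 / 100, 1 / 50):
    print(prevalence, round(prob_abused_given_classified(prevalence), 3))
```

The probability depends strongly on the prevalence, which is the point of the last question in the exercise.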
Exercise 9 (4.1 in Aalen)
Kari likes to play a game of dice. On one occasion she makes one throw with 5 dice, and is
interested in the number of sixes. Calculate the probability distribution for the number of
sixes and draw a probability diagram.
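The number of sixes in 5 independent throws is binomially distributed with n = 5 and p = 1/6. The short Python sketch below tabulates the distribution and prints a crude text version of the probability diagram; it is an illustration only.

```python
from math import comb

n, p = 5, 1 / 6

# P(X = k) = C(n, k) * p^k * (1 - p)^(n - k) for k = 0, ..., 5
dist = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

for k, prob in enumerate(dist):
    print(f"{k} sixes: {prob:.4f} " + "#" * round(prob * 50))
```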
Exercise 10 (8.6 in Aalen)
In 1974 in Tromsø a survey was carried out on the dietary habits of 16 high school boys. The
youths, randomly chosen, were supposed to register how much they ate of different types of
food during one week. We will look at the consumption of milk. The data below gives the
number of decilitres of milk per day for each individual:
6.3 3.6 6.9 1.2 3.0 1.1 3.8 3.6
4.7 3.3 4.6 9.4 3.0 5.6 3.3 3.0
a) Make a histogram.
b) Compute the mean and median.
c) You find that the mean is 4.15 dl. The standard deviation is 2.1 dl (you don’t need to
calculate this). Make a 95% confidence interval for the expected consumption of milk
for a boy at high school in Tromsø. Explain what such an interval says. Discuss
briefly the assumptions behind the calculation. If some of the boys were close friends,
could this influence one of the assumptions?
d) For 16 girls, the mean milk consumption is 2.59 dl per day and the standard deviation
1.2 dl. Do these figures give a clear indication that boys drink more milk than girls?
Choose the level of significance yourself and discuss the assumptions for the test you
perform.
Exercise 11 (5.6 in Aalen)
Kari is a midwife at a birth clinic. During one day she assists at four births.
a) Suppose that it is equally likely to have a boy and a girl. Calculate the probability that
it is 2 boys and 2 girls Kari delivers.
b) Suppose that the birth weight of a child is normally distributed with mean 3.5 kg and
standard deviation 0.5 kg. What is the probability that all the four children Kari
delivers weigh over 3 kg?
Exercise 12 (3.29 in Aalen)
(From exam 1987) Breast cancer is one of the most common cancer forms for women. With a
special X-ray examination, mammography, the tumour might be detected at an earlier stage
than it would otherwise. This increases the chance of recovery. Many have been eager to
conduct mass examinations of women (for instance of all women over 40 years) by
mammography. Detecting the cancer at an early stage would save lives. A well-known
problem with such mass screening is the occurrence of false positive cases. These cases
would demand comprehensive further screening before the diagnosis is invalidated. In
1986, in the periodical of the Norwegian Medical Association, there was a big discussion
about the value of mammography, where among other things the problem of false positive
tests was emphasized. The calculations to follow are inspired by this discussion.
The following (fairly realistic) values will be used in the calculations: If a woman has breast
cancer, the probability of detecting it at mammography is 80%. If she does not have breast
cancer, the probability of a false positive test is 10%. The prevalence of breast cancer in the
population in question is estimated at 0.5%.
a) What is the sensitivity and the specificity of mammography from this information?
b) If a woman gets a positive result from mammography, what is the probability that she
really suffers from breast cancer? If the woman gets a negative result, what is the
probability that she really is healthy?
c) We can also look at it in a different way: Imagine that 50 000 women are examined
with mammography. What is the expected number of cancer cases among these
women? What is the expected number of true positive tests? The number of false
positives?
d) Explain the importance of the computations above when considering whether such a
mass examination should be carried out.
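The "different way" of question c) amounts to counting expected numbers of women in each category. A small Python sketch of that bookkeeping, using the values given in the exercise (the variable names are chosen for this illustration):

```python
n_women = 50_000
prevalence = 0.005           # proportion with breast cancer
sensitivity = 0.80           # P(positive test | cancer)
false_positive_rate = 0.10   # P(positive test | no cancer)

cancer_cases = n_women * prevalence
true_positives = cancer_cases * sensitivity
false_positives = (n_women - cancer_cases) * false_positive_rate

# Proportion of positive tests that are true cancer cases
ppv = true_positives / (true_positives + false_positives)

print(cancer_cases, true_positives, false_positives, round(ppv, 4))
```

Dividing the expected true positives by all expected positives gives the same probability as Bayes' formula in question b).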
Exercise 13
The weight of the hearts of 20 men with age between 25 and 55 years has been evaluated and
is given in the following table:
11.50 10.50 14.75 11.75 13.75
10.00 10.50 14.50 14.75 12.00
13.50 11.00 10.75 14.00  9.50
15.00 11.75 11.50 12.00 10.25
(Weight in ounces, 1 ounce = 28g)
1. Calculate the mean weight of the hearts (by hand).
2. Calculate the 95%-confidence interval of the expectation value of the heart weights
(by hand, use 2 decimals during the calculation).
Hints:
a. Calculate the empirical standard deviation from your data (as a check: 1.78)
b. Evaluate the corresponding percentile from the t-distribution
c. Use the confidence interval formula from the lecture
3. Based on this dataset and the confidence interval for the expectation: How would you
answer the question “Is the expected value of the weight equal to 11 ounces?”
Hints:
a. Formulate the null hypothesis “in a statistical way”
b. Formulate the conclusion in your own words.
4. Perform a one-sample t-test for the hypothesis: H0: μ = 11.
Hint: Use the table at p. 474 in Kirkwood to evaluate the p-value.
5. Use SPSS to verify the results from 1-4. The dataset can be found as heart.sav.
Hint: Click Analyze -> Compare Means -> One Sample t-test (choose WEIGHTS as
test variable and 11 as a test value)
6. Using the SPSS-output: What are the one-sided p-values for the hypotheses: H0: μ <
11 and H0: μ ≥ 11?
Exercise 14
Exercise 14.1:
Learn to use the normal distribution table! Aalen p. 328, Kirkwood and Sterne p. 470f.
Evaluate the following probabilities for a standard normal distributed random variable X:
1. P( X ≤ 1.37 )
2. P( X > 0.46 )
3. P( X ≤ -1.96 )
Evaluate the following probabilities for a normal distributed random variable Y with mean 10
and standard deviation 4:
1. P ( Y ≤ 13 )
2. P ( Y > 14 )
Evaluate the percentiles P ( X ≤ percentile ) = p of the standard normal distributed random
variable X for the following probabilities p:
1. p = 0.975
2. p = 0.025
3. p = 0.95
Exercise 14.2:
The probability of being blood group B is 0.08. One pint of blood is taken
from 1000 unrelated individuals.
1. How is the number of individuals being blood group B (= random variable Y)
distributed?
2. How many individuals being blood group B do you expect? Which standard deviation
of the underlying distribution do you expect?
3. What is the probability of at most 70 individuals being blood group B in the
sample?
Hints:
1. Use the normal distribution as an approximation. Why is it possible to do so?
2. Standardize the random variable Y.
3. Use a statistical table for the normal distribution and remember
P(x > y)= 1 – P(x ≤ y)
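The hints can be followed step by step in Python, which is useful for checking a table lookup. The sketch below uses NormalDist from the standard library; whether to apply a continuity correction (using 70.5 instead of 70) is a choice the exercise leaves open, so both versions are shown.

```python
from math import sqrt
from statistics import NormalDist

n, p = 1000, 0.08

mean = n * p                    # expected number with blood group B
sd = sqrt(n * p * (1 - p))      # standard deviation of the binomial count

approx = NormalDist(mean, sd)   # normal approximation to Binomial(n, p)

z = (70 - mean) / sd            # standardized value to look up in the table
p_plain = approx.cdf(70)        # P(Y <= 70) without continuity correction
p_corrected = approx.cdf(70.5)  # with continuity correction

print(round(z, 2), round(p_plain, 3), round(p_corrected, 3))
```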
Exercise 15
A study was made of all 26 astronauts on the first eight space shuttle flights. On a voluntary
basis 17 astronauts consumed large quantities of salt and fluid prior to landing as a
countermeasure to space deconditioning, while nine did not. The table below shows supine
heart rates (beats/minute) before and after flights in the space shuttle.
You can use SPSS for this exercise and the dataset can also be found as astronaut.sav
Countermeasure taken (group 1)
Pre   Post   Change
71    61     -10
65    59     -6
52    47     -5
68    65     -3
69    69     0
49    50     1
49    51     2
57    60     3
51    57     6
55    64     9
58    67     9
57    69     12
59    72     13
53    69     16
53    72     19
53    75     22
48    77     29

Countermeasure not taken (group 2)
Pre   Post   Change
61    61     0
59    66     7
52    61     9
54    68     14
53    77     24
78    103    25
52    77     25
54    80     26
52    79     27
1. Compare the pre- and post-flight measurements in the countermeasure group using a
proper t-test.
Hint: Answer the question H0:μPRE = μPOST by using the test-scheme from the lecture
(see below)
2. Calculate the 95%-confidence interval of the change in the countermeasure group.
3. Perform a suitable analysis to compare the changes in heart rate in the two groups.
Hint: Answer the question H0:μ1=μ2 by using the test-scheme from the lecture (see
below). You have to reorganize your data. SPSS needs a dependent variable and a
group variable.
4. Calculate the 95%-confidence interval of the difference in heart rate in the two
groups.
5. Two astronauts each flew on two missions and are thus represented twice in the data
set. Does this matter?
6. Comment on the voluntary aspect of the study, and how it might affect the
interpretation of the results.
Test-Scheme
1. Formulate the null hypothesis and remember that it is only possible to prove the alternative
2. Choose an appropriate test, threshold α
3. Calculate the test-statistic
4. Calculate the p-value
5. Compare the p-value with the threshold α
6. Decide whether the null hypothesis is to be rejected or not
7. Formulate the conclusion
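As an illustration of step 3 of the test scheme, the paired t-statistic for question 1 of Exercise 15 can be computed from the Change column of group 1. This Python sketch stops at the test statistic; the p-value is then read from a t-table with the printed degrees of freedom, or taken from SPSS.

```python
from math import sqrt
from statistics import mean, stdev

# Change in heart rate (post minus pre) for the 17 astronauts in group 1
change = [-10, -6, -5, -3, 0, 1, 2, 3, 6, 9, 9, 12, 13, 16, 19, 22, 29]

n = len(change)
mean_change = mean(change)
se = stdev(change) / sqrt(n)     # standard error of the mean change
t_stat = mean_change / se        # paired t-statistic for H0: mean change = 0
df = n - 1

print(round(mean_change, 2), round(se, 2), round(t_stat, 2), df)
```

A paired test is appropriate here because the pre- and post-flight values come from the same astronaut.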
Exercise 16 (Aalen 6.2)
(Former exam) Close to one per thousand live-born children die suddenly without any
proven cause. Most of these deaths take place within the first year, and intensive work has
been put into understanding the reason for these deaths. In a period of three years 222 deaths
of this kind took place in Norway, 132 of them were boys and 90 girls. The figures could
indicate that cot death is more common for boys than girls, but we want to examine this
claim. 51.3% of all live-born children are boys, 48.7% are girls.
a) Formulate a null hypothesis and an alternative hypothesis.
b) Test the null hypothesis. Choose a 5% level of significance.
c) Formulate the test result in words.
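One way to carry out the test in b) is a normal approximation to the binomial distribution. The Python sketch below assumes a one-sided alternative (cot death more common among boys); that choice belongs to question a) and is only an assumption here.

```python
from math import sqrt
from statistics import NormalDist

n_deaths = 222
n_boys = 132
p0 = 0.513    # proportion of boys among live births, used as H0 value

expected = n_deaths * p0
se = sqrt(n_deaths * p0 * (1 - p0))

z = (n_boys - expected) / se
p_one_sided = 1 - NormalDist().cdf(z)

print(round(z, 2), round(p_one_sided, 4))
```

For a two-sided alternative the p-value would be doubled.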
Exercise 17 (Aalen 6.20)
(Former exam) Workers in the aluminium industry run a certain risk of getting asthma from
exposure at the work place. Important symptoms of asthma are chronic cough or wheezing. A
person could have only one symptom or both at the same time.
An examination was carried out at a Norwegian aluminium works, where the
occurrence of asthma among the workers was studied. Among 270 workers, it was found
that 180 did not have any of the symptoms, 71 had chronic cough and 49 had wheezing.
a) Estimate the probability for the symptom of wheezing, and state the confidence
interval.
b) How many workers had both chronic cough and wheezing?
c) Estimate from the numbers, the conditional probability that a person suffering from
chronic cough also has wheezing. Do your calculations indicate that there is a relation
between the occurrence of the two symptoms?
Exercise 18: Birth weight data with regression
In this problem you shall analyse a data set given in the file birth.sav. Open the file, look at
the data and make sure you understand what they mean.
Description of the file BIRTH.SAV:
In a study in Massachusetts, USA, birth weight was measured for the children of 189 women.
The main variable in the study was birth weight, BWT, which is an important indicator of the
condition of a newborn child. Low birth weight (below 2500 g) may be a medical risk factor.
A major question is whether smoking during pregnancy influences the birth weight. One has
also studied whether a number of other factors are related to birth weight, such as
hypertension in the mother. The variables of the study are:
No.   Name   Description
1     ID     Identification number
2     LOW    Low birth weight (1=BWT<2500g, 0=BWT>2500g)
3     AGE    Age of the mother
4     LWT    Weight in pounds at last menstrual period
5     RAC    Race (1=White, 2=Black, 3=Other)
6     SMK    Smoking status (1=current smoker, 0=not smoking during pregnancy)
7     PTL    History of premature labour (0,1,2,...)
8     HT     History of hypertension (1=yes, 0=no)
9     UI     Uterine irritability (1=yes, 0=no)
10    FVT    First trimester visits (0,1,2,3,...)
11    TTV    Third trimester visits (0,1,2,3,...)
12    BWT    Birth weight

The first 10 lines in the data file look as follows:
4 1 28 120 3 1 1 0 1 0 0 709
10 1 29 130 1 0 0 0 1 2 0 1021
11 1 34 187 2 1 0 1 0 0 0 1135
13 1 25 105 3 0 1 1 0 0 0 1330
15 1 25 85 3 0 0 0 1 0 4 1474
16 1 27 150 3 0 0 0 0 0 5 1588
17 1 23 97 3 0 0 0 1 1 5 1588
18 1 24 128 2 0 1 0 0 1 2 1701
19 1 24 132 3 0 0 1 0 0 5 1729
20 1 21 165 1 1 0 1 0 1 4 1790
Questions:
a) Make scatter plots of birth weight (BWT) versus age of the mother (AGE), and versus
weight of mother (LWT). Edit the scatter plots to insert a regression line. Make also
separate regression lines for smokers and non-smokers. Make also confidence curves
around the regression line. Interpret the results.
b) Compute the correlation between birth weight and weight of mother.
You find the correlation by clicking Analyze - Correlate - Bivariate. Click the relevant
variables and move them in to the right field. Click OK.
c) Make box plots of birth weight for smokers and non-smokers separately.
d) You shall make regression analyses with birth weight as the dependent variable. First, use
only smoking as independent variable. In the second analysis, use also weight of mother
as independent variable. Use also more independent variables if you have time. Interpret
the results in relation to the earlier results of this exercise.
You do regression analysis by clicking Analyze - Regression - Linear and transferring the
relevant variables to Dependent and Independent.
Exercise 19: Sample size
The effect of two inhalation steroids for asthma shall be compared. Pulmicort® (budesonide)
has been on the market for many years, while “Spiros” (xxxx) is new and does not yet have a
marketing authorisation. Asthma patients treated with β2-agonists only, but unsatisfied by the
effect, will be included in the study. The patients shall be randomized, receiving either Spiros
or Pulmicort.
The primary effect variable is chosen to be FEV1 (forced expiratory volume, in litres). The
effect is measured after 12 weeks of treatment. The standard deviation for FEV1 is 0.8. A 0.2
difference between the treatment groups is considered relevant.
a) How many patients should be included in the trial? Choose a 5% level of significance and
a test power of 90%.
In the computations under a) you only considered the end measurement after treatment, not
the patients’ start value. Assume that the standard deviation for the change in FEV1 after 12
weeks of treatment with steroids is 0.4.
b) How many patients should now be included in the trial? Use the same level of
significance and test power as before.
c) Why is the number of patients reduced compared with the result in a)?
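A common normal-approximation formula for comparing two group means is n per group = 2 (z(1-α/2) + z(power))² σ² / Δ². The Python sketch below applies it to a) and b); the function name is invented for this illustration, and the formula taught in the course (for instance one with tabulated constants) may give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sd, delta, alpha=0.05, power=0.90):
    """Patients per group for a two-sided comparison of two means."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / delta ** 2)


n_a = n_per_group(sd=0.8, delta=0.2)   # a) end measurement only
n_b = n_per_group(sd=0.4, delta=0.2)   # b) change from start value

print(n_a, n_b)
```

Halving the standard deviation reduces the required sample size by a factor of about four, since the sample size is proportional to the variance.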
d) Could the trial have been carried out with another trial plan? Discuss advantages and
disadvantages with cross-over studies.
e) In reality superiority studies with two active drugs are hardly ever done. Instead, so-called
non-inferiority studies are performed, defining a margin as ”a difference so small that it has no
clinical significance”. Discuss whether the difference chosen for the computation of
sample size in a) and b) also could be used in a non-inferiority study. Justify why/why
not, and calculate the sample size with a new value of the margin if needed.
f) What characterizes a good effect measure? Why should one pick one primary?
g) In asthma studies the lung function is measured by different spirometric values. In
addition clinical end points like the use of β2-agonists, time to first exacerbation, nightly
awakenings, own evaluation of breath trouble and quality of life measures are included.
All these could be used to compare effect of different treatments. Suggest alternatives to
FEV1 and discuss advantages and disadvantages.
http://www.emea.europa.eu/pdfs/human/ewp/292201en.pdf
Exercise 20: Planning
Childhood asthma is claimed to be increasing in frequency. You are to take part in a clinical
trial of a new drug that is claimed to prevent worsening of the disease in children with
incipient asthma. How would you plan such a study? Consider, among others, the following aspects:
- inclusion criteria
- exclusion criteria
- parallel or cross-over study?
- what should be recorded? Effect measures?
- how often should recordings be made?
- time frame
Exercise 21: Planning
We wish to study the effect of a new blood pressure lowering drug A in a group of
patients with mild to moderate essential hypertension. The new treatment is to be
compared with a well-established drug B. The study is to be designed as a
parallel study.
Outline roughly how a clinical trial can be set up in a practically
feasible way.
Exercise 22: Birth weight data with regression – Dummy variables,
confounding and interaction
Consider again the data set given in the file birth.sav. Open the file, look at the data, and
make sure you understand what they mean.
No.   Name   Description
1     ID     Identification number
2     LOW    Low birth weight (1=BWT<2500g, 0=BWT>2500g)
3     AGE    Age of the mother
4     LWT    Weight in pounds at last menstrual period
5     RAC    Race (1=White, 2=Black, 3=Other)
6     SMK    Smoking status (1=current smoker, 0=not smoking during pregnancy)
7     PTL    History of premature labour (0,1,2,...)
8     HT     History of hypertension (1=yes, 0=no)
9     UI     Uterine irritability (1=yes, 0=no)
10    FVT    First trimester visits (0,1,2,3,...)
11    TTV    Third trimester visits (0,1,2,3,...)
12    BWT    Birth weight
Questions:
a) Construct dummy variables for RAC. You need to use Transform->Recode into
different variables. Move RAC to the right, write a name for the new dummy
variable (e.g. BLACK), and click Change. Click Old and new values. Write 2 for
Old value, 1 for New value, and click Add. This means that people coded as
2=black in the old variable (RAC) get coded as 1 in the new variable (BLACK).
Click All other values and write 0 and click Add. This means that people who are
not black will be coded as 0 in the new dummy variable. Click Continue and OK.
Repeat this procedure, in order to construct a dummy variable for OTHER also.
Each time, remember to remove the old commands in the menus! Why do you
need two dummy variables for RAC, not three? To check if the new variables are
ok, you can compare frequency tables of the original RAC variable to tables of the
two new variables (Analyze->Frequencies).
b) Do a regression of BWT vs the new dummy variables BLACK and OTHER.
What’s the birth weight of a black infant compared to a white infant? What’s the
birth weight of a black infant compared to an “OTHER” infant?
c) Do a regression of BWT vs BLACK, OTHER and LWT, mother’s weight. Does
mother’s weight look like a confounder for ethnicity? Why/why not?
d) You want to study if there is an interaction between mother’s weight and smoking.
Construct a new interaction variable, which is a product of the two variables LWT
and SMK. Use Transform->Compute. Call the new variable LWTSMK, and
specify that it equals LWT*SMK.
e) Do a regression of BWT vs LWT, SMK and LWT*SMK. Does it look like there
is an interaction? What would an interaction mean in plain words? What is the
predicted effect on birth weight of gaining 100 pounds when considering smoking
mothers? What’s the corresponding effect if she is a non-smoker?
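To see what the recoding in a) and the product variable in d) do to the data, the same transformations can be sketched in Python on the RAC, SMK and LWT values from the first 10 lines of birth.sav shown in Exercise 18. This is only an illustration of the coding, not a replacement for the SPSS steps; the lower-case names are invented for the example.

```python
# RAC, SMK and LWT for the first 10 women in the data file
rac = [3, 1, 2, 3, 3, 3, 3, 2, 3, 1]
smk = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
lwt = [120, 130, 187, 105, 85, 150, 97, 128, 132, 165]

# Dummy variables: 1 if the woman belongs to the category, else 0.
# White (RAC = 1) is the reference category and gets no dummy of its own.
black = [1 if value == 2 else 0 for value in rac]
other = [1 if value == 3 else 0 for value in rac]

# Interaction variable: mother's weight for smokers, 0 for non-smokers
lwtsmk = [weight * smoker for weight, smoker in zip(lwt, smk)]

print(black)
print(other)
print(lwtsmk)
```

Every woman has at most one of the two dummies equal to 1; a woman with both equal to 0 is in the reference category.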
Exercise 23: Logistic regression
Again, load the data set birth.sav. In this analysis, LOW is the dependent variable. LOW is a binary outcome on
whether the birth weight is below 2500g or not. Low birth weight is an important predictor for several medical
complications for infants.
The main focus of the study was to see whether smoking during pregnancy affected the birth weight or not.
a) Look at the relationship between smoking and birth weight in a frequency table (Analyze->Descriptive
Statistics->Crosstabs, and check the relevant percentages). Does it look like there is a relationship?
b) Do a logistic regression with LOW as the dependent variable, and SMK as the independent variable. Is
there a statistically significant effect of smoking? What is the odds ratio? How do you interpret the
odds ratio? What is the 95% confidence interval for the odds ratio? (Analyze->Regression->Binary
logistic. Move LOW to Dependent and SMK to Covariates. Click Options, and check CI for exp(B),
and click Continue. Click Categorical, move SMK over to the right, check First and click Change.
Click Continue and OK).
c) Repeat point b), but this time use smokers instead of non-smokers as the reference category/baseline
(click Categorical, click on SMK, but check Last and click Change and Continue). What has happened
to the odds ratio? What is the interpretation of the odds ratio in this case? What is the interpretation of
the constant (not very important to know, but still)?
d) Let’s look at a continuous, independent variable. Do a regression on LOW vs LWT, mother’s weight
(Remember to remove SMK from the model! Do not click Categorical, since LWT is a continuous
variable!). Is there a significant effect of mother’s weight? What is the interpretation of the odds ratio
in this case? What is the predicted odds ratio if mother’s weight increases by 30 pounds?
What do we implicitly assume when using mother’s weight as a continuous variable?
e) Let’s look at a categorical variable with more than two categories. Do a regression on LOW vs RAC,
ethnicity in three categories (Again, you have to remove LWT from the previous analysis, and click
Categorical, move RAC over to the right, and click First and Change). How do you interpret the odds
ratios in this case? Is there a significant effect of ethnicity?
f) Let’s look at another categorical variable with more than two categories: PTL, history of premature
labour (which could, in principle, be considered as a continuous variable, but not in these data). Do a
regression of LOW vs PTL. What is the problem with this analysis? Why do you not get confidence
intervals for the group ptl(3)? Also, do you notice anything strange when comparing the odds ratios for
ptl(1) and ptl(2)? (Hint: Look at the SE for B and the frequency of individuals in each group in one of
the first tables of the SPSS-output)
g) We would like to recode the PTL-variable, so that it is 0 for no premature labour, and 1 for at least one
previous instance of premature labour. Choose Transform->Recode into different variables. Move PTL to
the right, and write PTL2 as the name of the new variable. Click Change, and click Old and new
values. Write 0 for Old value, 0 for New value, and click Add. This ensures that those with no
premature labour remain unchanged. Now choose Range, write 1 Through 3 under Old value, and 1
under New value, and click Add. This ensures that all with 1, 2 or 3 instances of premature labour are
coded as 1. Click Continue and OK. Do a regression on LOW vs PTL2. This is also useful when you
want to recode continuous variables into categories.
h) Now, let’s do an analysis with both SMK and LWT as independent variables (remember that SMK is
categorical, but not LWT!!). What are the interpretations of the odds ratios in this case? Can smoking
be said to be a confounder of mother’s weight or vice versa?
Exercises from “Practical statistics for medical research” (Altman)