
December 11, 2006

Introductory Statistics: Exercises

Files used in the following exercises may be downloaded from http://www.med.uio.no/imb/stat/kursfiler/. Some exercises from Altman’s book are included at the end.

Exercise 1

We shall use the statistical computer package SPSS. After logging in, you may start SPSS by clicking on the appropriate program in the Start menu.

1) Start by entering the following data into column 1 of the data editor (these are measurements of weight (kg) for 20 students):

50 75 70 74 95 83 65 94 66 65 65 75 84 55 73 68 72 67 53 65

Double-click on the column name and type a variable name of your own choice.

2) Make a small descriptive analysis of the data by running the following commands: Click Analyze - Descriptive Statistics - Explore. Transfer the relevant variable to Dependent List. Click Plots, remove the tick at “Stem and leaf” and tick “Histogram” instead. Click Continue to leave this menu. Then click OK to run the analysis.

3) Look at the output and interpret it. The “Descriptives” table will contain some quantities you have learnt about and some that are unfamiliar. Look closely to see which ones you recognize and write them down on paper.

4) Compute the median of the data by hand and compare with the computer result.

Exercise 2: Analysis of data concerning lung function

Lung function has been measured on 106 medical students. Peak expiratory flow rate (PEF, measured in liters per minute) was measured three times in a sitting position and three times in a standing position. The measurements may be found in the file PEFH98-english.SAV.

An SPSS file containing data can be read in by clicking in SPSS: File - Open. You then have to look in the directory where you have saved your file and double-click the file name. You will then get the data on the screen. Check them to see that you understand what the numbers say.
The file PEFH98-english.SAV contains the following variables:

Age (years)
Gender (1 = female, 2 = male)
Height (cm)
Weight (kg)
PEF measured three times in a sitting position (pefsit1, pefsit2, pefsit3)
PEF measured three times in a standing position (pefsta1, pefsta2, pefsta3)
Mean of the three measurements made in a sitting position (pefsitm)
Mean of the three measurements made in a standing position (pefstam)
Mean of all six PEF values (pefmean)

Carry out the following exercises:

1) Make histograms of height, weight, age, pefsitm and pefstam. Compute the mean and median. Interpret the results.
Click Analyze - Descriptive Statistics - Explore. Mark the relevant variables and transfer them to Dependent List. Click Plots, remove the tick at “Stem and leaf” and tick “Histogram” instead. Click Continue to leave this menu. Then click OK to run the analysis.

2) Make histograms for the variables height and pefmean for men and women separately. What do you conclude? You should also look at box plots to compare the genders.
Proceed as in 1), but put the variable gender in the Factor List. (If you need a reminder of what box plots are, click Help and Topics and type box. Then double-click Box-and-whiskers-plot.)

3) Include normal distribution curves in the histograms you made. Do you think the curves fit well?
Double-click in the figure. This brings you into the SPSS Chart Editor. Then double-click again on the histogram. Go to the histogram options and tick “Display normal curve” in the relevant box.

4) Use scatter diagrams to compare the PEF measurements on the one hand with height, weight and age on the other. What sort of associations do you see?
You get a scatter diagram by clicking Graphs – Legacy dialogs – Scatter/dot - Simple. You can for instance choose pefmean on the Y-axis and weight on the X-axis. If you want to distinguish the genders in the diagram, put the variable gender into “Set markers by”.
5) Use SPSS to draw lines in the scatter diagrams.
Double-click in the figure (i.e. the scatter diagram). This brings you into the SPSS Chart Editor. Right-click one of the points in the new diagram and choose e.g. “Add Fit Line at Total”. This gives you a straight line (a so-called regression line). You also get a menu which allows you to choose between several curves. Try out some of the possibilities.

Exercise 3

Lung function was measured on 106 medical students. Peak expiratory flow rate (PEF, measured in liters per minute) was measured three times in a sitting position and three times in a standing position. The measurements may be found in the file PEFH98-english.SAV.

This file can be read in by clicking in SPSS: File - Open. You then have to look in the directory where you have saved your file and double-click the file name. You will then get the data on the screen. Check them to see that you understand what the numbers say.

The file PEFH98-english.SAV contains the following variables:

Age (years)
Gender (1 = female, 2 = male)
Height (cm)
Weight (kg)
PEF measured three times in a sitting position (pefsit1, pefsit2, pefsit3)
PEF measured three times in a standing position (pefsta1, pefsta2, pefsta3)
Mean of the three measurements made in a sitting position (pefsitm)
Mean of the three measurements made in a standing position (pefstam)
Mean of all six PEF values (pefmean)

Do the following exercises:

1) Make a scatter plot of pefsit1 versus pefsit2, and a separate scatter plot of pefsit1 versus weight. Edit the scatter plots to insert a regression line. Also insert confidence and prediction curves. Interpret the results.
You get a scatter plot by clicking Graphs – Legacy dialogs – Scatter/dot - Simple. You can for instance choose pefsit2 on the Y-axis and pefsit1 on the X-axis. Double-click in the figure (i.e. the scatter diagram). This brings you into the SPSS Chart Editor. Then right-click the figure and choose Add Fit Line at Total.
This gives a line fitting the data.

2) Compute the correlation between pefsit1 and pefsit2, and between pefsit1 and weight. Also choose some more of the variables to correlate. Why are some correlations close to 1 while others are smaller?
You find the correlation by clicking Analyze - Correlate - Bivariate. Click the relevant variables and move them into the field on the right. Click OK.

3) Make two regression analyses: (i) pefsit2 as dependent variable and pefsit1 as independent variable, (ii) pefsit1 as dependent variable and weight as independent variable. Interpret the results in relation to the scatter plots.
You do regression analysis by clicking Analyze - Regression - Linear and transferring the relevant variables to Dependent and Independent.

4) Make a residual analysis for the analyses you performed above. Interpret the results.
Under Linear regression click Plots. Under Standardized Residual Plots choose Histogram and Normal probability plot.

5) Make a regression analysis with pefsit1 as dependent variable and sex and weight as independent variables. Interpret the results.

6) Finally, make a regression analysis with pefmean as dependent variable and sex, height and weight as independent variables. Interpret the results.

Exercise 4

A social researcher interviews 25 newly married couples. Each husband and wife are independently asked the question: "How many children would you like to have?" The following data are obtained:

Couple              1  2  3  4  5  6  7  8  9 10 11 12 13
Answer of husband   3  1  2  2  5  0  0  1  2  3  4  1  3
Answer of wife      2  1  1  3  1  1  2  3  2  1  2  2  3

Couple             14 15 16 17 18 19 20 21 22 23 24 25
Answer of husband   2  3  2  0  1  2  3  4  3  0  1  1
Answer of wife      1  2  2  0  2  1  2  3  1  0  2  1

Do the data show a significant difference of opinion between husbands and wives regarding the ideal family size? Use a nonparametric test.

Exercise 5

(Taken from an exercise by Ørnulf Borgan.)
In this exercise you should use the actuarial method to calculate a survival curve up to the age of 5 years for the cohort of Norwegian women born in 1880. Data (from Statistics Norway) are given in the table below. The censored observations correspond to net emigration.

Interval (years)   Alive at interval start   Deaths   Censored
0-1                26967                     1377     279
1-2                25311                      848     309
2-3                24154                      477     233
3-4                23444                      343     130
4-5                22971                      257     111

Exercise 6

(Taken from Danish course notes.)

The following data are from 35 patients with ovarian cancer. The observation time is in days from the start of treatment to deterioration of the disease. Censored observations are marked with *.

15 patients with tumour of "low grade" type:
28, 89, 175, 195, 309, 377*, 393*, 421*, 447*, 462, 709*, 744*, 770*, 1106*, 1206*

20 patients with tumour of "high grade" type:
34, 88, 137, 199, 280, 291, 299*, 300*, 309, 351, 358, 369, 369, 370, 375, 382, 392, 429*, 451, 1119*

Compute and draw a Kaplan-Meier survival curve for the patients with low grade tumour. Do the same computation for the other group and draw the curve in the same diagram as the first one. Compare the two survival curves.

(Data from Fleming et al., Biometrics, 1980, 36, 607-625.)

Exercise 7

In the table below you will find data describing the relationship between age and blood pressure of 20 healthy adults.

Age              20  43  63  26  53  31  58  46  58  70
Blood pressure  120 128 141 126 134 128 136 132 140 144

Age              46  53  70  20  63  43  26  19  31  23
Blood pressure  128 136 146 124 143 130 124 121 126 123

Find the correlation between age and blood pressure and test if it is significant. Compute a 95% confidence interval for the regression parameter. Find also the squared correlation coefficient between age and blood pressure. What does it mean? What is the predicted blood pressure for a person aged 40? For a person aged 75? Comment.

Exercise 8 (3.13 in Aalen)

(Edited exercise from Larsen & Marx 1986.)

The U.S. Senate Committee on Labour and Public Welfare studied the possibility of mapping child abuse.
A team of experts was consulted, suggesting the following probabilities:

i. approximately 1 out of 100 children is exposed to abuse,
ii. a medical doctor can diagnose existing abuse in approximately 90% of the cases,
iii. a survey in large population groups would lead to approximately 3% of the non-abused children being classified as abused.

Compute the probability that a child classified as abused actually is abused. How does the probability change if only 1 child out of 1000 is abused? What if 1 child out of 50 is abused? Based on these calculations, how would you judge the prospects of screening for child abuse in a population?

Exercise 9 (4.1 in Aalen)

Kari likes to play a game of dice. On one occasion she makes one throw with 5 dice, and is interested in the number of sixes. Calculate the probability distribution for the number of sixes and draw a probability diagram.

Exercise 10 (8.6 in Aalen)

In 1974 a survey was carried out in Tromsø on the dietary habits of 16 high school boys. The youths, randomly chosen, were asked to register how much they ate of different types of food during one week. We will look at the consumption of milk. The data below give the number of decilitres of milk per day for each individual:

6.3 3.6 6.9 1.2 3.0 1.1 3.8 3.6 4.7 3.3 4.6 9.4 3.0 5.6 3.3 3.0

a) Make a histogram.

b) Compute the mean and median.

c) You find that the mean is 4.15 dl. The standard deviation is 2.1 dl (you don’t need to calculate this). Make a 95% confidence interval for the expected consumption of milk for a boy at high school in Tromsø. Explain what such an interval says. Discuss briefly the assumptions behind the calculation. If some of the boys were close friends, could this influence one of the assumptions?

d) For 16 girls, the mean milk consumption is 2.59 dl per day and the standard deviation 1.2 dl. Do these figures give a clear indication that boys drink more milk than girls?
Choose the level of significance yourself and discuss the assumptions for the test you perform.

Exercise 11 (5.6 in Aalen)

Kari is a midwife at a birth clinic. During one day she assists at four births.

a) Suppose that a boy and a girl are equally likely. Calculate the probability that Kari delivers 2 boys and 2 girls.

b) Suppose that the birth weight of a child is normally distributed with mean 3.5 kg and standard deviation 0.5 kg. What is the probability that all four children Kari delivers weigh over 3 kg?

Exercise 12 (3.29 in Aalen)

(From exam 1987)

Breast cancer is one of the most common forms of cancer in women. With a special X-ray examination, mammography, the tumour might be detected at an earlier stage than it would be otherwise. This increases the chance of recovery. Many have been eager to conduct mass examination of women (for instance of all women over 40 years) by mammography. Detecting the cancer at an early stage would save lives. A well-known problem with such mass screening is the occurrence of false positive cases. These cases would require comprehensive further examination before the diagnosis is ruled out. In 1986 there was a big discussion in the periodical of the Norwegian Medical Association about the value of mammography, in which, among other things, the problem of false positive tests was emphasized. The calculations that follow are inspired by this discussion.

The following (fairly realistic) values will be used in the calculations: If a woman has breast cancer, the probability of detecting it at mammography is 80%. If she does not have breast cancer, the probability of a false positive test is 10%. The prevalence of breast cancer in the population in question is estimated at 0.5%.

a) What are the sensitivity and the specificity of mammography according to this information?

b) If a woman gets a positive result from mammography, what is the probability that she really suffers from breast cancer?
If the woman gets a negative result, what is the probability that she really is healthy?

c) We can also look at it in a different way: Imagine that 50 000 women are examined with mammography. What is the expected number of cancer cases among these women? What is the expected number of true positive tests? The number of false positives?

d) Explain the importance of the computations above when considering whether such a mass examination should be carried out.

Exercise 13

The hearts of 20 men aged between 25 and 55 years have been weighed, with the results given in the following table:

11.50 10.50 14.75 11.75 13.75 10.00 10.50 14.50 14.75 12.00
13.50 11.00 10.75 14.00  9.50 15.00 11.75 11.50 12.00 10.25

(Weight in ounces, 1 ounce = 28 g)

1. Calculate the mean weight of the hearts (by hand).

2. Calculate the 95% confidence interval for the expected value of the heart weights (by hand, use 2 decimals during the calculation).
Hints:
a. Calculate the empirical standard deviation from your data (for checking: 1.78)
b. Find the corresponding percentile of the t-distribution
c. Use the confidence interval formula from the lecture

3. Based on this dataset and the confidence interval for the expectation: How would you answer the question “Is the expected value of the weight equal to 11 ounces?”
Hints:
a. Formulate the null hypothesis “in a statistical way”
b. Formulate the conclusion in your own words.

4. Perform a one-sample t-test for the hypothesis H0: μ = 11.
Hint: Use the table at p. 474 in Kirkwood to evaluate the p-value.

5. Use SPSS to verify the results from 1-4. The dataset can be found as heart.sav.
Hint: Click Analyze -> Compare Means -> One Sample t-test (choose WEIGHTS as test variable and 11 as test value)

6. Using the SPSS output: What are the one-sided p-values for the hypotheses H0: μ < 11 and H0: μ ≥ 11?

Exercise 14

Exercise 14.1: Learn to use the normal distribution table! (Aalen p. 328; Kirkwood and Sterne p. 470f.)
Evaluate the following probabilities for a standard normally distributed random variable X:
1. P(X ≤ 1.37)
2. P(X > 0.46)
3. P(X ≤ -1.96)

Evaluate the following probabilities for a normally distributed random variable Y with mean 10 and standard deviation 4:
1. P(Y ≤ 13)
2. P(Y > 14)

Find the percentiles, defined by P(X ≤ percentile) = p, of the standard normally distributed random variable X for the following probabilities p:
1. p = 0.975
2. p = 0.025
3. p = 0.95

Exercise 14.2: The probability of having blood group B is 0.08. One pint of blood is taken from each of 1000 unrelated individuals.

1. How is the number of individuals with blood group B (= random variable Y) distributed?
2. How many individuals with blood group B do you expect? What is the standard deviation of the underlying distribution?
3. What is the probability of at most 70 individuals with blood group B in the sample?

Hints:
1. Use the normal distribution as an approximation. Why is it possible to do so?
2. Standardize the random variable Y.
3. Use a statistical table for the normal distribution and remember that P(X > x) = 1 – P(X ≤ x)

Exercise 15

A study was made of all 26 astronauts on the first eight space shuttle flights. On a voluntary basis 17 astronauts consumed large quantities of salt and fluid prior to landing as a countermeasure to space deconditioning, while nine did not. The tables below show supine heart rates (beats/minute) before and after flights in the space shuttle. You can use SPSS for this exercise, and the dataset can also be found as astronaut.sav.

Countermeasure taken (group 1)
Pre   Post   Change
71    61     -10
65    59      -6
52    47      -5
68    65      -3
69    69       0
49    50       1
49    51       2
57    60       3
51    57       6
55    64       9
58    67       9
57    69      12
59    72      13
53    69      16
53    72      19
53    75      22
48    77      29

Countermeasure not taken (group 2)
Pre   Post   Change
61     61      0
59     66      7
52     61      9
54     68     14
53     77     24
78    103     25
52     77     25
54     80     26
52     79     27

1. Compare the pre- and post-flight measurements in the countermeasure group using an appropriate t-test.
Hint: Test H0: μPRE = μPOST using the test scheme from the lecture (see below).

2. Calculate the 95% confidence interval for the change in the countermeasure group.

3. Perform a suitable analysis to compare the changes in heart rate in the two groups.
Hint: Test H0: μ1 = μ2 using the test scheme from the lecture (see below). You have to reorganize your data. SPSS needs a dependent variable and a group variable.

4. Calculate the 95% confidence interval for the difference in heart rate change between the two groups.

5. Two astronauts each flew on two missions and are thus represented twice in the data set. Does this matter?

6. Comment on the voluntary aspect of the study, and how it might affect the interpretation of the results.

Test scheme
1. Formulate the null hypothesis and remember that it is only possible to prove the alternative
2. Choose an appropriate test and a threshold α
3. Calculate the test statistic
4. Calculate the p-value
5. Compare the p-value with the threshold α
6. Decide whether the null hypothesis is to be rejected or not
7. Formulate the conclusion

Exercise 16 (Aalen 6.2)

(Former exam)

Close to one per thousand live-born children die suddenly without any proven cause. Most of these deaths take place within the first year, and intensive work has been put into understanding the reason for these deaths. In a period of three years 222 deaths of this kind took place in Norway; 132 of them were boys and 90 girls. The figures could indicate that cot death is more common among boys than girls, but we want to examine this claim. Of all live-born children, 51.3% are boys and 48.7% are girls.

a) Formulate a null hypothesis and an alternative hypothesis.

b) Test the null hypothesis. Choose a 5% level of significance.

c) Formulate the test result in words.

Exercise 17 (Aalen 6.20)

(Former exam)

Workers in the aluminium industry run a certain risk of getting asthma from exposure at the workplace.
Important symptoms of asthma are chronic cough and wheezing. A person could have only one of the symptoms or both at the same time. A study was carried out at a Norwegian aluminium works to examine the occurrence of asthma among the workers. Among 270 workers, it was found that 180 did not have any of the symptoms, 71 had chronic cough and 49 had wheezing.

a) Estimate the probability of the symptom wheezing, and state the confidence interval.

b) How many workers had both chronic cough and wheezing?

c) Estimate from the numbers the conditional probability that a person suffering from chronic cough also has wheezing. Do your calculations indicate that there is a relation between the occurrence of the two symptoms?

Exercise 18: Birth weight data with regression

In this problem you shall analyse a data set given in the file birth.sav. Open the file, look at the data and make sure you understand what they mean.

Description of the file BIRTH.SAV:

In a study in Massachusetts, USA, birth weight was measured for the children of 189 women. The main variable in the study was birth weight, BWT, which is an important indicator of the condition of a newborn child. Low birth weight (below 2500 g) may be a medical risk factor. A major question is whether smoking during pregnancy influences the birth weight. The study also examined whether a number of other factors are related to birth weight, such as hypertension in the mother.

The variables of the study are:

No.  Name  Description
1    ID    Identification number
2    LOW   Low birth weight (1=BWT<2500g, 0=BWT>2500g)
3    AGE   Age of the mother
4    LWT   Weight in pounds at last menstrual period
5    RAC   Race (1=White, 2=Black, 3=Other)
6    SMK   Smoking status (1=current smoker, 0=not smoking during pregnancy)
7    PTL   History of premature labour (0,1,2,...)
8    HT    History of hypertension (1=yes, 0=no)
9    UI    Uterine irritability (1=yes, 0=no)
10   FVT   First trimester visits (0,1,2,3,...)
11   TTV   Third trimester visits (0,1,2,3,...)
12   BWT   Birth weight

The first 10 lines in the data file look as follows:

ID  LOW  AGE  LWT  RAC  SMK  PTL  HT  UI  FVT  TTV  BWT
 4    1   28  120    3    1    1   0   1    0    0   709
10    1   29  130    1    0    0   0   1    2    0  1021
11    1   34  187    2    1    0   1   0    0    0  1135
13    1   25  105    3    0    1   1   0    0    0  1330
15    1   25   85    3    0    0   0   1    0    4  1474
16    1   27  150    3    0    0   0   0    0    5  1588
17    1   23   97    3    0    0   0   1    1    5  1588
18    1   24  128    2    0    1   0   0    1    2  1701
19    1   24  132    3    0    0   1   0    0    5  1729
20    1   21  165    1    1    0   1   0    1    4  1790

Questions:

a) Make scatter plots of birth weight (BWT) versus age of the mother (AGE), and versus weight of the mother (LWT). Edit the scatter plots to insert a regression line. Also make separate regression lines for smokers and non-smokers, and add confidence curves around the regression line. Interpret the results.

b) Compute the correlation between birth weight and weight of the mother.
You find the correlation by clicking Analyze - Correlate - Bivariate. Click the relevant variables and move them into the field on the right. Click OK.

c) Make box plots of birth weight for smokers and non-smokers separately.

d) Make regression analyses with birth weight as the dependent variable. First, use only smoking as independent variable. In the second analysis, also use weight of the mother as independent variable. Add more independent variables if you have time. Interpret the results in relation to the earlier results of this exercise.
You do regression analysis by clicking Analyze - Regression - Linear and transferring the relevant variables to Dependent and Independent.
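As a cross-check outside SPSS, the correlation in b) and a one-variable regression as in d) can be sketched in Python. This is only an illustration: it uses the ten rows printed above (LWT and BWT columns), which all have LOW = 1, so the numbers will not match an analysis of the full file with 189 women. The function name simple_linear_regression is a choice made for this sketch and is not part of the course material.

```python
# (LWT in pounds, BWT in grams) for the first ten lines of birth.sav as printed
# in the exercise. All ten have LOW = 1, so the fit is illustrative only.
rows = [
    (120, 709), (130, 1021), (187, 1135), (105, 1330), (85, 1474),
    (150, 1588), (97, 1588), (128, 1701), (132, 1729), (165, 1790),
]


def simple_linear_regression(pairs):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, r)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5  # Pearson correlation coefficient
    return intercept, slope, r


intercept, slope, r = simple_linear_regression(rows)
print(f"BWT = {intercept:.1f} + {slope:.2f} * LWT, r = {r:.3f}, r^2 = {r * r:.3f}")
```

With the full data set loaded, the same function applied to all (LWT, BWT) pairs should agree with the B coefficients and the correlation that SPSS reports for the regression of BWT on LWT alone.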
Exercise 19: Sample size

The effect of two inhalation steroids for asthma shall be compared. Pulmicort® (budesonide) has been on the market for many years, while “Spiros” (xxxx) is new and does not yet have a marketing authorisation. Asthma patients treated with β2-agonists only, but dissatisfied with the effect, will be included in the study. The patients shall be randomized to receive either Spiros or Pulmicort. The primary effect variable is chosen to be FEV1 (forced expiratory volume, in litres). The effect is measured after 12 weeks of treatment. The standard deviation for FEV1 is 0.8. A difference of 0.2 between the treatment groups is considered relevant.

a) How many patients should be included in the trial? Choose a 5% level of significance and a test power of 90%.

In the computations under a) you only considered the end measurement after treatment, not the patients’ start value. Assume that the standard deviation for the change in FEV1 after 12 weeks of treatment with steroids is 0.4.

b) How many patients should now be included in the trial? Use the same level of significance and test power as before.

c) Why is the number of patients reduced compared with the result in a)?

d) Could the trial have been carried out with another trial design? Discuss advantages and disadvantages of cross-over studies.

e) In reality, superiority studies with two active drugs are hardly ever done. Instead, so-called non-inferiority studies are performed, defining the non-inferiority margin as ”a difference so small that it has no clinical significance”. Discuss whether the difference chosen for the computation of sample size in a) and b) could also be used in a non-inferiority study. Justify why/why not, and calculate the sample size with a new value of the margin if needed.

f) What characterizes a good effect measure? Why should one pick a single primary effect measure?

g) In asthma studies the lung function is measured by different spirometric values.
In addition, clinical end points such as the use of β2-agonists, time to first exacerbation, nightly awakenings, the patient’s own evaluation of breathing trouble, and quality of life measures are included. All of these could be used to compare the effect of different treatments. Suggest alternatives to FEV1 and discuss advantages and disadvantages.

http://www.emea.europa.eu/pdfs/human/ewp/292201en.pdf

Exercise 20: Planning

Childhood asthma is claimed to be increasing in frequency. You are to take part in a clinical trial of a new drug that is claimed to prevent worsening of the disease in children with early-stage asthma. How would you plan such a study? Consider, among other things, the following aspects:

inclusion criteria
exclusion criteria
parallel or cross-over study?
what should be recorded? Effect measures?
how often should it be recorded?
time frame

Exercise 21: Planning

We wish to investigate the effect of a new blood-pressure-lowering drug A in a group of patients with mild to moderate essential hypertension. The new treatment is to be compared with a well-established drug B. The study is to be designed as a parallel-group study. Outline roughly how a clinical trial can be set up in a practically feasible way.

Exercise 22: Birth weight data with regression – Dummy variables, confounding and interaction

Consider again the data set given in the file birth.sav. Open the file, look at the data, and make sure you understand what they mean.

No.  Name  Description
1    ID    Identification number
2    LOW   Low birth weight (1=BWT<2500g, 0=BWT>2500g)
3    AGE   Age of the mother
4    LWT   Weight in pounds at last menstrual period
5    RAC   Race (1=White, 2=Black, 3=Other)
6    SMK   Smoking status (1=current smoker, 0=not smoking during pregnancy)
7    PTL   History of premature labour (0,1,2,...)
8    HT    History of hypertension (1=yes, 0=no)
9    UI    Uterine irritability (1=yes, 0=no)
10   FVT   First trimester visits (0,1,2,3,...)
11   TTV   Third trimester visits (0,1,2,3,...)
12   BWT   Birth weight

Questions:

a) Construct dummy variables for RAC. You need to use Transform->Recode into different variables. Move RAC to the right, type a name for the new dummy variable (e.g. BLACK), and click Change. Click Old and new values. Enter 2 for Old value and 1 for New value, and click Add. This means that people coded as 2=black in the old variable (RAC) get coded as 1 in the new variable (BLACK). Click All other values, enter 0 and click Add. This means that people who are not black will be coded as 0 in the new dummy variable. Click Continue and OK. Repeat this procedure in order to construct a dummy variable for OTHER as well. Each time, remember to remove the old commands in the menus! Why do you need two dummy variables for RAC, not three? To check that the new variables are OK, you can compare frequency tables of the original RAC variable with tables of the two new variables (Analyze->Frequencies).

b) Do a regression of BWT on the new dummy variables BLACK and OTHER. What is the birth weight of a black infant compared to a white infant? What is the birth weight of a black infant compared to an “OTHER” infant?

c) Do a regression of BWT on BLACK, OTHER and LWT, mother’s weight. Does mother’s weight look like a confounder for ethnicity? Why/why not?

d) You want to study whether there is an interaction between mother’s weight and smoking. Construct a new interaction variable, which is the product of the two variables LWT and SMK. Use Transform->Compute.
Call the new variable LWTSMK, and specify that it equals LWT*SMK.

e) Do a regression of BWT on LWT, SMK and LWTSMK (= LWT*SMK). Does it look like there is an interaction? What would an interaction mean in plain words? What is the predicted effect on birth weight of a 100-pound increase in mother’s weight for smoking mothers? What is the corresponding effect for a non-smoker?

Exercise 23: Logistic regression

Again, load the data set birth.sav. In this analysis, LOW is the dependent variable. LOW is a binary outcome indicating whether the birth weight is below 2500 g or not. Low birth weight is an important predictor of several medical complications in infants. The main focus of the study was to see whether smoking during pregnancy affected the birth weight or not.

a) Look at the relationship between smoking and low birth weight in a frequency table (Analyze->Descriptive Statistics->Crosstabs, and tick the relevant percentages). Does it look like there is a relationship?

b) Do a logistic regression with LOW as the dependent variable and SMK as the independent variable. Is there a statistically significant effect of smoking? What is the odds ratio? How do you interpret the odds ratio? What is the 95% confidence interval for the odds ratio? (Analyze->Regression->Binary logistic. Move LOW to Dependent and SMK to Covariates. Click Options, tick CI for exp(B), and click Continue. Click Categorical, move SMK over to the right, tick First and click Change. Click Continue and OK.)

c) Repeat point b), but this time use smokers instead of non-smokers as the reference category/baseline (click Categorical, click on SMK, tick Last, then click Change and Continue). What has happened to the odds ratio? What is the interpretation of the odds ratio in this case? What is the interpretation of the constant (not very important to know, but still)?

d) Let’s look at a continuous independent variable. Do a regression of LOW on LWT, mother’s weight (Remember to remove SMK from the model!
Do not click Categorical, since LWT is a continuous variable!). Is there a significant effect of mother’s weight? What is the interpretation of the odds ratio in this case? What is the predicted change in the odds ratio if mother’s weight increases by 30 pounds? What do we implicitly assume when using mother’s weight as a continuous variable?

e) Let’s look at a categorical variable with more than two categories. Do a regression of LOW on RAC, ethnicity in three categories (Again, you have to remove LWT from the previous analysis, then click Categorical, move RAC over to the right, and click First and Change). How do you interpret the odds ratios in this case? Is there a significant effect of ethnicity?

f) Let’s look at another categorical variable with more than two categories: PTL, history of premature labour (which could, in principle, be considered a continuous variable, but not in these data). Do a regression of LOW on PTL. What is the problem with this analysis? Why do you not get confidence intervals for the group ptl(3)? Also, do you notice anything strange when comparing the odds ratios for ptl(1) and ptl(2)? (Hint: Look at the SE for B and the frequency of individuals in each group in one of the first tables of the SPSS output.)

g) We would like to recode the PTL variable, so that it is 0 for no premature labour, and 1 for at least one previous instance of premature labour. Choose Transform->Recode into different variables. Move PTL to the right, and type PTL2 as the name of the new variable. Click Change, and click Old and new values. Enter 0 for Old value and 0 for New value, and click Add. This ensures that those with no premature labour remain unchanged. Now choose Range, enter 1 through 3 under Old value and 1 under New value, and click Add. This ensures that all with 1, 2 or 3 instances of premature labour are coded as 1. Click Continue and OK. Do a regression of LOW on PTL2. This recoding technique is also useful when you want to recode continuous variables into categories.
h) Now, let’s do an analysis with both SMK and LWT as independent variables (remember that SMK is categorical, but LWT is not!). What are the interpretations of the odds ratios in this case? Can smoking be said to be a confounder of mother’s weight or vice versa?

Exercises from “Practical statistics for medical research” (Altman)
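For the logistic regression questions in Exercise 23, the following Python sketch shows how an odds ratio and its Wald 95% confidence interval follow from a coefficient B and its standard error, what happens when the reference category is swapped, and how a change of several units in a continuous covariate is handled. The coefficient values are hypothetical and are NOT taken from birth.sav; the function name odds_ratio is likewise just a choice made for this sketch.

```python
import math


def odds_ratio(b, se, units=1.0, z=1.96):
    """Odds ratio and Wald 95% CI for a `units`-step increase in a covariate.

    b is a logistic regression coefficient (log odds per unit) and se its
    standard error, i.e. the B and S.E. values of a fitted model.
    Intended for units > 0, so that lower <= upper.
    """
    estimate = math.exp(units * b)
    lower = math.exp(units * (b - z * se))
    upper = math.exp(units * (b + z * se))
    return estimate, lower, upper


# Hypothetical coefficient for a binary covariate (NOT from birth.sav).
b_bin, se_bin = 0.70, 0.32
or_bin, lo_bin, hi_bin = odds_ratio(b_bin, se_bin)

# Swapping the reference category flips the sign of B, so the odds ratio and
# the interval limits become the reciprocals of the original ones.
or_swapped, lo_swapped, hi_swapped = odds_ratio(-b_bin, se_bin)

# Hypothetical coefficient per pound for a continuous covariate.
b_cont, se_cont = -0.014, 0.006
or_1, _, _ = odds_ratio(b_cont, se_cont)             # per 1 pound
or_30, _, _ = odds_ratio(b_cont, se_cont, units=30)  # per 30 pounds = or_1 ** 30

print(f"OR = {or_bin:.2f} (95% CI {lo_bin:.2f} to {hi_bin:.2f})")
print(f"OR with swapped reference = {or_swapped:.2f}")
print(f"OR per 30 units = {or_30:.2f}")
```

The 30-unit odds ratio is the one-unit odds ratio raised to the 30th power, which reflects the assumption that the log odds change linearly with the continuous covariate.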
