I. Paired Sample Design: 30 pts
Download the data set bph-samp.sav and use SPSS to complete the following
calculations:
1) Run a paired t test to compare if there is a mean change in QoL at the baseline and at 3
months.
Ho: µD = 0
Ha: µD ≠ 0
Paired Samples Statistics
                    Mean     N    Std. Deviation   Std. Error Mean
Pair 1  qol_base    3.8000   10   1.13529          .35901
        qol_3mo     2.1000   10   1.28668          .40689

Paired Samples Correlations
                              N    Correlation   Sig.
Pair 1  qol_base & qol_3mo    10   .243          .498

Paired Samples Test
Pair 1: qol_base - qol_3mo
  Paired Differences
    Mean                                        1.70000
    Std. Deviation                              1.49443
    Std. Error Mean                             .47258
    95% Confidence Interval of the Difference
      Lower                                     .63095
      Upper                                     2.76905
  t                                             3.597
  df                                            9
  Sig. (2-tailed)                               .006

Since the p-value (.006) is smaller than 0.05, we reject the null hypothesis. We have
sufficient evidence to conclude that the mean QoL at the baseline differs from the mean
QoL at 3 months.
2) Run a one-sample t test to test the same hypothesis as in (1) but on the variable
DELTA.
Ho: µDelta = 0
Ha: µDelta ≠ 0
One-Sample Statistics
           N    Mean     Std. Deviation   Std. Error Mean
delt_qol   10   1.7000   1.49443          .47258

One-Sample Test (Test Value = 0)
delt_qol
  t                                             3.597
  df                                            9
  Sig. (2-tailed)                               .006
  Mean Difference                               1.70000
  95% Confidence Interval of the Difference
    Lower                                       .6309
    Upper                                       2.7691
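Since the p-value (.006) is smaller than 0.05, we reject the null hypothesis. We have
sufficient evidence to conclude that the mean delta is different from zero.
The two methods gave the same result (t = 3.597, df = 9, p = .006), as expected: a paired
t test is a one-sample t test on the differences.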
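The shared t statistic can be checked by hand from the summary statistics SPSS reports for the differences. The sketch below is only an illustration in Python (the function name `t_from_summary` is made up for this example and is not an SPSS feature); it uses the mean, standard deviation and N of the differences shown in the output.

```python
import math

def t_from_summary(mean_diff, sd_diff, n, mu0=0.0):
    """Return (t, df) for a one-sample t test computed from summary statistics."""
    se = sd_diff / math.sqrt(n)          # standard error of the mean difference
    return (mean_diff - mu0) / se, n - 1

# Mean and SD of the differences (qol_base - qol_3mo), as reported by SPSS.
t, df = t_from_summary(mean_diff=1.7, sd_diff=1.49443, n=10)
print(f"t = {t:.3f}, df = {df}")
```

Because the paired test and the one-sample test on DELTA both reduce to this calculation, they must agree.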
3) Check if the QoL at the baseline and at 3 months follow Normal distributions.
Ho: QoL at the baseline is normally distributed
Ha: QoL at the baseline is not normally distributed
[Figure: Histogram of qol_base (Mean = 3.8, Std. Dev. = 1.135, N = 10) and Normal Q-Q Plot of qol_base (x-axis: Observed Value, y-axis: Expected Normal)]
Tests of Normality
            Kolmogorov-Smirnov (a)        Shapiro-Wilk
            Statistic   df   Sig.         Statistic   df   Sig.
qol_base    .230        10   .143         .933        10   .479
a. Lilliefors Significance Correction
Since both p-values (Kolmogorov-Smirnov .143, Shapiro-Wilk .479) are greater than 0.05, we
fail to reject the null hypothesis. So QoL at the baseline can be assumed normal.
Ho: QoL at 3 months is normally distributed
Ha: QoL at 3 months is not normally distributed
[Figure: Histogram of qol_3mo (Mean = 2.1, Std. Dev. = 1.287, N = 10) and Normal Q-Q Plot of qol_3mo (x-axis: Observed Value, y-axis: Expected Normal)]
Tests of Normality
            Kolmogorov-Smirnov (a)        Shapiro-Wilk
            Statistic   df   Sig.         Statistic   df   Sig.
qol_3mo     .231        10   .139         .824        10   .028
a. Lilliefors Significance Correction
The Kolmogorov-Smirnov p-value (.139) is greater than 0.05, so by that test we fail to
reject the null hypothesis and QoL at 3 months can be assumed normal.
Note, however, that the data do not pass the Shapiro-Wilk test (p = .028 < 0.05). With only
10 observations it is hard to settle the question of normality, so we may want to do a
nonparametric test.
4) If you have a concern about the small sample size and perhaps non-normal data,
choose an appropriate nonparametric test to compare the median QoL score at the
baseline and at 3 months. (Hint: You need to research nonparametric tests.)
Test Statistics (b)
                          qol_3mo - qol_base
Z                         -2.448 (a)
Asymp. Sig. (2-tailed)    .014
a. Based on positive ranks.
b. Wilcoxon Signed Ranks Test

Test Statistics (b)
                          qol_3mo - qol_base
Exact Sig. (2-tailed)     .039 (a)
a. Binomial distribution used.
b. Sign Test
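Both the Wilcoxon Signed Ranks Test (p = .014) and the Sign Test (p = .039) indicate that
the median scores at the baseline and at 3 months are different.
The Sign Test may be more suitable for this case because the histograms don't seem to be
symmetric, and the Wilcoxon test assumes that the paired differences are symmetrically
distributed.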
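To make the "Binomial distribution used" footnote concrete, here is a minimal Python sketch of an exact two-sided sign test. The counts passed in are an assumption: the SPSS frequency table is not shown, and 8 decreases, 1 increase and 1 tie is simply one split of the 10 pairs that is consistent with the reported exact p-value of .039. The function name `sign_test_p` is illustrative.

```python
from math import comb

def sign_test_p(n_pos, n_neg):
    """Exact two-sided sign test p-value; tied pairs are dropped before calling."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# ASSUMED counts (not shown in the SPSS output): 8 decreases, 1 increase, 1 tie.
p = sign_test_p(n_pos=8, n_neg=1)
print(round(p, 3))
```

The test uses only the direction of each change, which is why it does not need the differences to be symmetric.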
II. Independent Sample Design: 30pts
Download the data set lactation.sav and use SPSS to complete the following calculations.
1) Produce a side-by-side boxplot for the percentage of bone loss in the breast feeding
group vs. the control group.
[Figure: side-by-side boxplots of percentc (percent bone loss) by group: breast-feeding, control]
2) Produce a side-by-side histogram for the percentage of bone loss in the breast feeding
group vs. the control group. (Hint: You need to search for SPSS functions not covered in
the lecture to produce the histograms side-by-side.)
[Figure: side-by-side histograms titled "Cholesterol"; panels: control, breast-feeding; x-axis: Percent bone loss; y-axis: Percent]
3) Test if the mean percentage of bone loss in the two groups are the same using the right
version of the t test based on the SPSS output.
H0: µ1 = µ2
Ha: µ1 ≠ µ2
Group Statistics
Percent bone loss
Ref = Laskey      N    Mean     Std. Deviation   Std. Error Mean
control           22   .309     1.2983           .2768
breast-feeding    47   -3.587   2.5056           .3655

Independent Samples Test (Percent bone loss)
                                          Equal variances   Equal variances
                                          assumed           not assumed
Levene's Test for Equality of Variances
  F                                       11.255
  Sig.                                    .001
t-test for Equality of Means
  t                                       6.857             8.499
  df                                      67                66.197
  Sig. (2-tailed)                         .000              .000
  Mean Difference                         3.8963            3.8963
  Std. Error Difference                   .5682             .4585
  95% Confidence Interval of the Difference
    Lower                                 2.7621            2.9810
    Upper                                 5.0305            4.8116
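Levene's test is significant (p = .001), so equal variances cannot be assumed and the
"Equal variances not assumed" row is the right version of the t test (t = 8.499,
df = 66.197). Since the p-value (reported as .000, i.e. less than .001) is very small, we
reject the null hypothesis of equal means. We have sufficient evidence to conclude that the
control and the breast-feeding groups have different means.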
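The unequal-variances (Welch) row can be reproduced from the Group Statistics table alone. The following Python sketch is an illustration, not SPSS's own code; the function name `welch_t` is made up, and because it starts from the rounded means and standard deviations printed above, it agrees with SPSS only to within rounding.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch t statistic, standard error and Satterthwaite df from summary statistics."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2      # squared standard errors of each mean
    se = math.sqrt(v1 + v2)                  # standard error of the difference
    t = (m1 - m2) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, se, df

# control vs. breast-feeding, values taken from the Group Statistics table.
t, se, df = welch_t(0.309, 1.2983, 22, -3.587, 2.5056, 47)
print(f"t = {t:.3f}, SE = {se:.4f}, df = {df:.3f}")
```

Unlike the pooled test, this version does not combine the two sample variances, which is why its standard error (.4585) differs from the pooled one (.5682).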
4) Calculate a 95% confidence interval for the mean difference in the percentage of bone
loss between the two groups.
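The 95% confidence interval for the mean difference is given in the previous table. Because
Levene's test rejects equal variances (p = .001), the "Equal variances not assumed" interval
applies: (2.9810, 4.8116). The interval under equal variances would be (2.7621, 5.0305).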
III. Cross-Tabulation: 30pts
Download the data set bd1.sav and use SPSS to complete the following calculations.
1) Test the association between esophageal cancer and alcohol consumption (using the
original alcohol consumption variable). Write down the hypotheses, the test used, the
p-value and the interpretation.
H0: There is no association between esophageal cancer and alcohol consumption
Ha: There is an association
Chi-Square Tests
                               Value         df   Asymp. Sig. (2-sided)
Pearson Chi-Square             158.955 (a)   3    .000
Likelihood Ratio               146.498       3    .000
Linear-by-Linear Association   152.974       1    .000
N of Valid Cases               975
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 13.74.
Using the Pearson chi-square test (value 158.955, df = 3), the p-value is reported as .000,
i.e. less than .001. Since it is below 0.05 we reject the null hypothesis. We have
sufficient evidence to conclude that there is an association between esophageal cancer and
alcohol consumption.
2) Is there a need to use Fisher’s Exact test? Why?
Esophageal cancer * Alcohol consumption Crosstabulation
Count
                     Alcohol consumption
Esophageal cancer    0 - 39 gm/day   40 - 79 gm/day   80 - 119 gm/day   120+ gm/day   Total
case                 29              75               51                45            200
control              386             280              87                22            775
Total                415             355              138               67            975
Fisher's Exact test is not necessary because the cell counts are not small: no cell has an
expected count less than 5 (the minimum expected count is 13.74).
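Both the Pearson chi-square value and the minimum expected count follow directly from the crosstabulation. The Python sketch below is an illustration of that arithmetic (the function name `pearson_chi_square` is made up for this example); each expected count is row total times column total divided by N.

```python
def pearson_chi_square(table):
    """Return (chi-square, df, minimum expected count) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2, min_expected = 0.0, float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df, min_expected

counts = [
    [29, 75, 51, 45],      # cases, by alcohol consumption category
    [386, 280, 87, 22],    # controls
]
chi2, df, min_expected = pearson_chi_square(counts)
print(f"chi-square = {chi2:.3f}, df = {df}, min expected = {min_expected:.2f}")
```

The smallest expected count belongs to the case / 120+ gm/day cell (200 x 67 / 975), which is well above the usual threshold of 5.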
3) Test the association between esophageal cancer and alcohol consumption (using the
dichotomized alcohol consumption variable). Write down the hypotheses, the test used,
the p-value and the interpretation.
H0: There is no association between esophageal cancer and dichotomized alcohol
consumption
Ha: There is an association
Esophageal cancer * Alcohol dichotomized Crosstabulation
Count
                     Alcohol dichotomized
Esophageal cancer    80+ gms/day   0-79 gms/day   Total
case                 96            104            200
control              109           666            775
Total                205           770            975

Chi-Square Tests
                               Value         df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                  (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             110.255 (b)   1    .000
Continuity Correction (a)      108.221       1    .000
Likelihood Ratio               96.433        1    .000
Fisher's Exact Test                                             .000         .000
Linear-by-Linear Association   110.142       1    .000
N of Valid Cases               975
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 42.05.
Using the Pearson chi-square test (value 110.255, df = 1), the p-value is reported as .000,
i.e. less than .001. Since it is below 0.05 we reject the null hypothesis. We have
sufficient evidence to conclude that there is an association between esophageal cancer and
dichotomized alcohol consumption.
4) Hand calculate a 95% CI for the odds ratio estimate in (3).
a = 96, b = 104, c = 109, d = 666
OR = (a*d)/(b*c) = (96*666)/(104*109) = 5.640
SE of ln(OR) = sqrt(1/a + 1/b + 1/c + 1/d) = sqrt(1/96 + 1/104 + 1/109 + 1/666) = 0.175
Upper = exp(ln(OR) + 1.96*SE) = 7.95
Lower = exp(ln(OR) - 1.96*SE) = 4.00
95% CI: (4.00, 7.95)
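The hand calculation can be double-checked with a few lines of Python. This is a sketch of the same log-based (Woolf) interval, with 1.96 as the normal multiplier; the function name `odds_ratio_ci` is made up for this example.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for a 2x2 table with a log-scale (Woolf) confidence interval."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, se_log, lower, upper

# a, b = cases with 80+ and 0-79 gms/day; c, d = controls with 80+ and 0-79 gms/day.
odds_ratio, se_log, lower, upper = odds_ratio_ci(96, 104, 109, 666)
print(f"OR = {odds_ratio:.3f}, SE = {se_log:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is built on the log scale because ln(OR) is approximately normal, which is why the limits are not symmetric around 5.64.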
IV. ANOVA: 30pts
Complete Exercise 13.9 items (a) through (d) found on pages 292 and 293 in the
textbook. Apply the nonparametric ANOVA analysis to test if the groups are different in
the decrease of body temperatures.
A trial evaluated the fever reducing effects of three substances. Study subjects were
adults seen in an emergency room with diagnosis of flu with body temperatures between
100.0 degrees F to 100.9 degrees F. The three treatments (aspirin, ibuprofen, and
acetaminophen) were assigned randomly to study subjects. Body temperatures were reevaluated 2 hours after administration of treatments. Table 13.14 lists the data.
a) Explore these data with side by side boxplots. Discuss your findings.
b) Calculate the mean and standard deviation of each group
c) Complete an ANOVA for the problem. What do you conclude?
d) Conduct post hoc comparisons with the LSD method. Which groups differ
significantly at α = 0.05?
Table 13.14 Data for Exercise 13.19. Decreases in body Temperature (degrees
Fahrenheit)
Group 1 (aspirin):     0.95   1.48   1.33   1.28
Group 2 (ibuprofen):   0.39   0.44   1.31   2.48   1.39
Group 3 (acetamin):    0.19   1.02   0.07   0.01   0.62   -0.39
a)
[Figure: side-by-side boxplots of temperature decrease by Group (aspirin, ibuprofen, acetamin); y-axis range -0.50 to 2.50]
Looking at the box-plot we can see that aspirin is more effective than acetamin.
Ibuprofen has the best result on some patients but it's not as consistent as the other two.
b)
Descriptives
Temp_Decrease
                                                        95% Confidence Interval for Mean
            N    Mean     Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
aspirin     4    1.2600   .22346           .11173       .9044         1.6156        .95       1.48
ibuprofen   5    1.2020   .85444           .38212       .1411         2.2629        .39       2.48
acetamin    6    .2533    .49657           .20273       -.2678        .7745         -.39      1.02
Total       15   .8380    .74301           .19184       .4265         1.2495        -.39      2.48
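The group means and standard deviations can be verified from the data in Table 13.14. A minimal Python sketch using only the standard library (the dictionary name `groups` is chosen for this example):

```python
from statistics import mean, stdev

# Decreases in body temperature (degrees F), copied from Table 13.14.
groups = {
    "aspirin":   [0.95, 1.48, 1.33, 1.28],
    "ibuprofen": [0.39, 0.44, 1.31, 2.48, 1.39],
    "acetamin":  [0.19, 1.02, 0.07, 0.01, 0.62, -0.39],
}

# (N, mean, sample standard deviation) for each group.
summary = {name: (len(v), mean(v), stdev(v)) for name, v in groups.items()}
for name, (n, m, sd) in summary.items():
    print(f"{name:10s} N={n}  mean={m:.4f}  sd={sd:.5f}")
```

`stdev` uses the n - 1 denominator, which is the same sample standard deviation SPSS reports.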
c)
Ranks
Temp_Decrease
Group       N    Mean Rank
aspirin     4    11.00
ibuprofen   5    10.00
acetamin    6    4.33
Total       15

Test Statistics (a, b)
              Temp_Decrease
Chi-Square    6.833
df            2
Asymp. Sig.   .033
a. Kruskal Wallis Test
b. Grouping Variable: Group
According to the Kruskal Wallis Test, the p-value is .033, which is less than 0.05. So we
can conclude that aspirin, ibuprofen, and acetamin do not all have the same effect in
reducing the temperature.
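The mean ranks and the Kruskal Wallis statistic can be reproduced from Table 13.14. The Python sketch below is an illustration under two stated assumptions: there are no tied values (true for these data, so no tie correction is applied), and the p-value uses the closed form exp(-H/2), which holds for the chi-square distribution only when df = 2. The function name `kruskal_wallis` is made up for this example.

```python
import math

groups = {
    "aspirin":   [0.95, 1.48, 1.33, 1.28],
    "ibuprofen": [0.39, 0.44, 1.31, 2.48, 1.39],
    "acetamin":  [0.19, 1.02, 0.07, 0.01, 0.62, -0.39],
}

def kruskal_wallis(samples):
    """Return (H, mean ranks) for a dict of samples; assumes no tied values."""
    pooled = sorted(x for values in samples.values() for x in values)
    rank = {x: i + 1 for i, x in enumerate(pooled)}    # valid only without ties
    n = len(pooled)
    rank_sums = {k: sum(rank[x] for x in v) for k, v in samples.items()}
    h = 12 / (n * (n + 1)) * sum(
        rank_sums[k] ** 2 / len(v) for k, v in samples.items()
    ) - 3 * (n + 1)
    mean_ranks = {k: rank_sums[k] / len(v) for k, v in samples.items()}
    return h, mean_ranks

h, mean_ranks = kruskal_wallis(groups)
p = math.exp(-h / 2)    # chi-square upper tail, exact only for df = 2
print(f"H = {h:.3f}, p = {p:.3f}", mean_ranks)
```

With three groups the test has 2 degrees of freedom, which is what makes the simple exponential form usable here.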
d)
Multiple Comparisons
Dependent Variable: Temp_Decrease
LSD
                                                                    95% Confidence Interval
(I) Group   (J) Group   Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
aspirin     ibuprofen   .05800                  .40170       .888   -.8172        .9332
aspirin     acetamin    1.00667*                .38654       .023   .1645         1.8489
ibuprofen   aspirin     -.05800                 .40170       .888   -.9332        .8172
ibuprofen   acetamin    .94867*                 .36260       .023   .1586         1.7387
acetamin    aspirin     -1.00667*               .38654       .023   -1.8489       -.1645
acetamin    ibuprofen   -.94867*                .36260       .023   -1.7387       -.1586
*. The mean difference is significant at the .05 level.
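Post hoc analysis shows that acetamin produces a significantly smaller temperature decrease
than both aspirin and ibuprofen (p = .023 for each comparison), while aspirin and ibuprofen
do not differ significantly (p = .888).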
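The LSD standard errors and confidence limits come from the pooled within-group variance of the one-way ANOVA. The Python sketch below is an illustration from the Table 13.14 data; the critical value 2.1788 is the 0.975 quantile of the t distribution with 12 degrees of freedom, taken from a t table rather than computed, and the function name `lsd` is made up for this example.

```python
import math
from statistics import mean, variance

groups = {
    "aspirin":   [0.95, 1.48, 1.33, 1.28],
    "ibuprofen": [0.39, 0.44, 1.31, 2.48, 1.39],
    "acetamin":  [0.19, 1.02, 0.07, 0.01, 0.62, -0.39],
}
T_CRIT = 2.1788   # t(0.975, df = 12) from a t table (assumed, not computed here)

n_total = sum(len(v) for v in groups.values())
df_error = n_total - len(groups)                      # 15 - 3 = 12
# Mean square error: pooled within-group variance.
mse = sum((len(v) - 1) * variance(v) for v in groups.values()) / df_error

def lsd(a, b):
    """Mean difference, standard error and 95% LSD interval for groups a and b."""
    diff = mean(groups[a]) - mean(groups[b])
    se = math.sqrt(mse * (1 / len(groups[a]) + 1 / len(groups[b])))
    return diff, se, diff - T_CRIT * se, diff + T_CRIT * se

for a, b in [("aspirin", "ibuprofen"), ("aspirin", "acetamin"), ("ibuprofen", "acetamin")]:
    diff, se, lo, hi = lsd(a, b)
    flag = "*" if lo > 0 or hi < 0 else ""            # interval excludes zero
    print(f"{a} - {b}: {diff:.5f}{flag}  SE={se:.5f}  CI=({lo:.4f}, {hi:.4f})")
```

A pair is flagged when its interval excludes zero, which matches the starred rows in the SPSS table.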
V. Multiple Linear Regression: 30pts
Download the data set hdur.sav and use SPSS to complete the following calculations.
1) Fit a multiple linear regression model for predicting the hospital duration using age,
sex, body temperature, white blood cell counts, antibiotic use, blood culture and service
(medication vs. surgery). Use a cut-off value of 0.10 to assess the significance of the
predictors. Identify all significant predictors.
Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .643 (a)   .414       .172                5.201
a. Predictors: (Constant), serv, Antibiotic use, Blood culture, Body temp, sex, age, White blood cell count taken
ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   324.194          7    46.313        1.712   .172 (a)
Residual     459.806          17   27.047
Total        784.000          24
a. Predictors: (Constant), serv, Antibiotic use, Blood culture, Body temp, sex, age, White blood cell count taken
b. Dependent Variable: Durration of hospitalization
Coefficients (a)
                               Unstandardized Coefficients   Standardized Coefficients
Model 1                        B          Std. Error         Beta      t        Sig.
(Constant)                     -303.218   179.431                      -1.690   .109
age                            .093       .067               .328      1.390    .183
sex                            -1.195     2.502              -.106     -.478    .639
Body temp                      3.307      1.827              .394      1.810    .088
White blood cell count taken   -.168      .463               -.094     -.363    .721
Antibiotic use                 -3.449     2.675              -.277     -1.289   .215
Blood culture                  -1.645     2.829              -.125     -.581    .569
serv                           -3.122     2.755              -.268     -1.133   .273
a. Dependent Variable: Durration of hospitalization
At the 0.10 cut-off, the only significant predictor is Body temp (p = .088 < .10). All
other predictors have p-values above .10.
2) Assess the model fit for the multiple linear regression model using appropriate
statistics and graphics.
R Square is .414, so the model explains about 41.4% of the variation in the duration of
hospitalization in this sample. The Adjusted R Square, which penalizes the model for its
seven predictors, is only .172, which is very low. So it's not a good fit.
The ANOVA F test results in a p-value of .172, which is greater than .10, so the model as a
whole is not significant. This also indicates that the fit is not good.
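The fit statistics quoted here all follow from the sums of squares in the ANOVA table. A small Python sketch of that arithmetic (variable names are chosen for this example):

```python
# Sums of squares and degrees of freedom from the SPSS ANOVA table.
ss_reg, ss_res = 324.194, 459.806
df_reg, df_res = 7, 17

ss_total = ss_reg + ss_res
df_total = df_reg + df_res

r2 = ss_reg / ss_total                                  # proportion of variation explained
adj_r2 = 1 - (ss_res / df_res) / (ss_total / df_total)  # penalized for number of predictors
f_stat = (ss_reg / df_reg) / (ss_res / df_res)          # overall F test statistic

print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, F = {f_stat:.3f}")
```

With 7 predictors and only 25 observations, the adjustment is large, which is why the adjusted value falls so far below R Square.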
3) Assess the assumptions of linear regression in this data using appropriate statistics and
graphics.
[Figure: Normal P-P Plot of Regression Standardized Residual; Dependent Variable: Durration of hospitalization; x-axis: Observed Cum Prob, y-axis: Expected Cum Prob]
Points are closely located around the line, so the "Normality" assumption seems to be
valid.
[Figure: Scatterplot; Dependent Variable: Durration of hospitalization; x-axis: Regression Standardized Residual, y-axis: Regression Standardized Predicted Value]
The spread of the residuals seems to be increasing, so the assumption of homoscedasticity
may be violated.
The predictors should also not be strongly related to one another. Body measurements are
more or less dependent on each other, so multicollinearity is likely to be a problem here.
Also, age and the body parameters may not actually be linearly related to the duration of
hospitalization.
4) Identify outliers and influential observations using appropriate statistics that can be
generated in SPSS. (Hint: You need to do your own research because this is not covered
in the textbook or lectures).
[Figure: Scatterplot; Dependent Variable: Durration of hospitalization; x-axis: Regression Standardized Predicted Value, y-axis: Regression Studentized Residual]
Looking at the studentized residuals plotted against the standardized predicted values,
several observations stand out as potential outliers.