Intro to Statistics – Part 2
Maureen J. Donlin
January 18, 2012
Take home exercises
 Exercise 1:
 Using the child nutrition data set, answer the questions posed in the handout.
 Exercise 2:
 Using the breast cancer data set, conduct an exploratory analysis of the data.
 We will use this data set and others during the next session
Child nutrition data set
 Dataset: NutritionChildren.sav
 Does the amount of juice consumed by the children
affect their growth?
 Variables: ChildID, Weight_lbs, Height_cm, Juice, Soda, Energy, Age
 Ages: 94 are 2 years old and 74 are 5 years old
 Gender: unknown in our data set
Recoding variables
 Define short as height at least 1.5 SD below the mean for age group
 82.7 cm for age 2 and 102.5 cm for age 5
 6 met criteria for age 2 and 3 for age 5
 Define overweight as BMI at least 1.5 SD above the mean for age group (BMI_level)
 18.8 for age 2 and 18.4 for age 5
 6 met criteria for age 2 and 3 for age 5
Recoding variables cont.
 Excessive juice consumption (JuiceLevel)
 Mean 5.5 oz/day ± 4.6 (SD)
 Excessive juice ≥ 1.5 SD above the mean (12 oz/day)
 19 children drank ≥ 12 oz juice/day
 Cross-tab of JuiceLevel * Short
 p-value = 0.001
 Cross-tab of JuiceLevel * BMI_level
 p-value = 0.067
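The juice cutoff follows directly from the mean and SD quoted on this slide. A minimal Python sketch of that arithmetic, assuming the cutoff is mean + 1.5 SD (the variable names are illustrative and do not come from the SPSS file):

```python
# Values quoted on the slide: juice consumption in oz/day
mean_juice = 5.5
sd_juice = 4.6

# Assumption: "excessive" means at least 1.5 SD above the mean
excessive_cutoff = mean_juice + 1.5 * sd_juice
print(f"Excessive juice cutoff: {excessive_cutoff:.1f} oz/day")
```

The result is close to the rounded 12 oz/day cutoff used for JuiceLevel.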
Breast cancer data
 Dataset: BreastCancerData.sav
 Explore the data
 338/1207 missing estrogen receptor status
 356/1207 missing progesterone receptor status
 86/1207 missing pathological
 12/1207 with a
tumor size
tumor size > 5 cm
Exploring breast cancer data, cont.
 Distribution of the continuous variables
 Age
 Pathological tumor size
 Number of positive lymph nodes
 Use Explore with those 3 variables and no factors
Exploring breast cancer data, cont.
 Dependence of pathological tumor size on the
categorical variables estrogen and progesterone
receptor status
 Is there a difference in the size of the tumor when
they are positive for estrogen or progesterone
receptors?
 Use Explore with pathological tumor size as the
dependent variable and estrogen and progesterone
receptor status as factors
Exploring breast cancer data, cont.
 Is there a dependence of the tumor size on the
presence of positive lymph nodes?
Recoding age
Recode into 4 groups: 20-45, 46-55, 56-66, 67 & older
Recode into 2 groups: ≤ 55, 56 & older
Explore the new category using a histogram
Looking for evenly sized categories
Breast cancer data, cont.
 Is there a dependence of pathological tumor size on
age?
Significant difference between groups?
                  Sum of Squares   df     Mean Square   F       Sig.
Between groups    44.79            3      14.93         15.65   0.000
Within groups     1065.1           1117   0.954
Total             1110.7           1120
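The F statistic in a one-way ANOVA is the ratio of the between-groups mean square to the within-groups mean square. A short sketch that recomputes it from the sums of squares and degrees of freedom reported above (the variable names are illustrative; small differences from the SPSS output come from rounding in the reported values):

```python
# Sums of squares and degrees of freedom as reported in the ANOVA table
ss_between, df_between = 44.79, 3
ss_within, df_within = 1065.1, 1117

# Mean square = sum of squares / degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F = between-groups mean square / within-groups mean square
f_ratio = ms_between / ms_within
print(f"MS between = {ms_between:.2f}, MS within = {ms_within:.3f}, F = {f_ratio:.2f}")
```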
Post-hoc testing (Bonferroni)
(I) Age groups   (J) Age groups   Mean diff. (I – J)   Std. error   Sig.
20 – 45          46 - 55          0.269                0.083        0.008
                 56 - 66          0.413                0.085        0.000
                 67 & older       0.543                0.081        0.000
46 – 55          20 - 45          -0.268               0.083        0.008
                 56 - 66          0.144                0.083        0.500
                 67 & older       0.273                0.079        0.004
56 – 66          20 - 45          -0.413               0.085        0.000
                 46 - 55          -0.144               0.083        0.500
                 67 & older       0.129                0.082        0.691
67 & older       20 - 45          -0.543               0.082        0.000
                 46 - 55          -0.273               0.079        0.004
                 56 - 66          -0.129               0.082        0.691
Univariate modeling
 Analyze -> General Linear Model -> Univariate
 Dependent variable: Pathological tumor size
 Fixed factors: agecat3 & Lymph nodes?
 Model: full factorial
 Plots: ln_yesno*agecat3 & agecat3*ln_yesno
 Options: Display means for all 3 variables & check:
    Descriptive statistics
    Estimates of effect size
    Observed power
Error type model
 Type II: assumes balanced design
 Type III: works with balanced and unbalanced
designs (default option)
 Type IV: can be used when there is missing data
Presence of positive lymph node is associated with larger
tumor size at all age categories, but the effect is larger for
the younger ages.
Univariate analysis of effect of age on tumor size
            Sum of Squares   df     Mean Square   F        Sig.   Partial Eta Squared
Contrast    28.219           3      9.406         10.185   .000   .027
Error       1027.894         1113   .924
Univariate analysis of effect of positive lymph node on tumor size
            Sum of Squares   df     Mean Square   F        Sig.   Partial Eta Squared
Contrast    34.112           1      34.112        36.937   .000   .032
Error       1027.894         1113   .924
Source               Type IV Sum of Squares   df     Mean Square   F          Sig.   Partial Eta Squared
Corrected Model      82.844                   7      11.835        12.815     .000   .075
Intercept            2592.196                 1      2592.196      2806.820   .000   .716
agecat3              28.219                   3      9.406         10.185     .000   .027
ln_yesno             34.112                   1      34.112        36.937     .000   .032
agecat3 * ln_yesno   1.339                    3      .446          .483       .694   .001
Error                1027.894                 1113   .924
Total                4479.321                 1121
Corrected Total      1110.738                 1120
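• The model explains ~7.5% of the variance (partial eta squared = .075, which corresponds to r ≈ 0.27).
• Age and the presence of positive lymph nodes account for comparable shares of the explained variance (partial eta squared = .027 and .032).
• The interaction of age and lymph node status is negligible (partial eta squared = .001, Sig. = .694). They do not interact.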
Effect size
Magnitude of the observed effect:
$r = \sqrt{\dfrac{t^2}{t^2 + df}}$
t = t from a t-test and df = degrees of freedom
r = 0.10 (small effect; ~1% of the total variance)
r = 0.30 (medium effect; ~9% of the total variance)
r = 0.50 (large effect; ~25% of the total variance)
Effect size: Eta² (η²)
 Effect size used in Anova & univariate modeling from SPSS
 η² varies between 0 and 1
 Interpretation:
 0.01 ~ small
 0.06 ~ medium
 0.14 ~ large
 Square root of η² approximates r
Source               Type IV Sum of Squares   df   Mean Square   F        Sig.   Partial Eta Squared
Corrected Model      82.844                   7    11.835        12.815   .000   .075
agecat3              28.219                   3    9.406         10.185   .000   .027
ln_yesno             34.112                   1    34.112        36.937   .000   .032
agecat3 * ln_yesno   1.339                    3    .446          .483     .694   .001
Effect size:
• The dependence of tumor size on age has approximately
the same effect size as the dependence on the presence of
positive lymph nodes.
• The interaction of age and positive lymph nodes has very
little effect.
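Partial eta squared for an effect is its sum of squares divided by the sum of that value and the error sum of squares. The sketch below recomputes the column from the sums of squares in the table, using the error sum of squares of 1027.894 reported in the full model output. The dictionary layout and names are illustrative only.

```python
from math import sqrt

# Sums of squares from the SPSS output
ss_error = 1027.894
ss_effects = {
    "agecat3": 28.219,
    "ln_yesno": 34.112,
    "agecat3 * ln_yesno": 1.339,
}

# Partial eta squared = SS_effect / (SS_effect + SS_error)
partial_eta_sq = {name: ss / (ss + ss_error) for name, ss in ss_effects.items()}

for name, value in partial_eta_sq.items():
    # The square root of eta squared approximates r
    print(f"{name}: partial eta squared = {value:.3f}, approx. r = {sqrt(value):.2f}")
```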
Calculating effect size from t-test
$r = \sqrt{\dfrac{t^2}{t^2 + df}}$
T-test of dependence of tumor size on presence of positive lymph nodes:
t         df        Sig. (2-tailed)
-7.164    1119      0
-6.572    381.059   0
Using t = -7.164 and df = 1119: r = 0.21, a small-to-medium effect
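The conversion from a t statistic to r can be checked in a few lines. This sketch applies the formula above to the first row of the t-test output; the function name is illustrative.

```python
from math import sqrt

def effect_size_r(t, df):
    """Effect size r from a t statistic and its degrees of freedom."""
    return sqrt(t**2 / (t**2 + df))

# First row of the t-test output: t = -7.164, df = 1119
r = effect_size_r(-7.164, 1119)
print(f"r = {r:.2f}")
```

Because t is squared, the sign of the t statistic does not affect r.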
Other questions to consider
 Is tumor size associated with receptor (estrogen or
progesterone) status?
 Tumor size was recoded into categories (pathcat)
 Do cross-tabs with pathcat*estrogen status (er) or pathcat*progesterone status (pr)
 Is positive estrogen or progesterone receptor status
associated with larger or smaller tumors?
Survival analysis
 What fraction of population will survive past a
certain time?
 What is the probability of survival on condition A
versus condition B?
 Kaplan-Meier estimator
 Estimates survival function from life-time data
 Can deal with some types of censored data (e.g. patient withdraws from study before final outcome)
• The steps down represent each point where a subject has
died.
• The tick marks represent censored data
Censoring
 Removing a patient from the survival curve at the
end of their follow-up time is “censoring” the
patient.
 Shown as a tick mark on the survival curve
 Once a patient is censored, the curve becomes an
estimate of survival because we no longer know the
end point for censored patients
Kaplan-Meier estimator
 S(t): probability of surviving beyond time t
 Rank death times in order:
 0 < t(1) < t(2) < t(3) < t(4) < ... < t(r)
 Within each interval, calculates probability of dying within that interval
 Probability of dying in interval 4:
    # deaths in interval 4 / number alive at t(3)
 S(t(4)) = probability of surviving beyond interval 3 * probability of surviving interval 4
 S(4) = S(3) * (1 - probability of dying in interval 4)
 7 patients with survivals of:
 1, 2+, 3+, 4, 5+, 10, 12+
 + indicates censored patient
            # At Risk   # Censored   # At Risk   # Died     Proportion   Cumulative
            Start of    During       End of      End of     Surviving    Survival End
Interval    Interval    Interval     Interval    Interval   Interval     of Interval
0-1         7           0            7           1          6/7 = 0.86   0.86
1-4         6           2            4           1          3/4 = 0.75   0.86 * 0.75 = 0.64
4-10        3           1            2           1          1/2 = 0.5    0.86 * 0.75 * 0.5 = 0.32
10-12       1           0            1           0          1/1 = 1.0    0.86 * 0.75 * 0.5 * 1.0 = 0.32
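The interval-by-interval calculation for the 7 patients can be sketched as a small product-limit (Kaplan-Meier) function. This is an illustrative pure-Python version, not the SPSS procedure; the function name and the 1/0 event coding (1 = died, 0 = censored) are assumptions made for the example.

```python
def kaplan_meier(times, events):
    """Return [(time, S(time))] at each distinct death time.

    times  : follow-up time for each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    data = list(zip(times, events))
    survival = 1.0
    curve = []
    for t in sorted({ti for ti, e in data if e == 1}):
        # Patients still under observation just before time t
        at_risk = sum(1 for ti, _ in data if ti >= t)
        deaths = sum(1 for ti, e in data if ti == t and e == 1)
        # Multiply by the proportion surviving this death time
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# 1, 2+, 3+, 4, 5+, 10, 12+  (+ indicates a censored patient)
times = [1, 2, 3, 4, 5, 10, 12]
events = [1, 0, 0, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"S({t}) = {s:.2f}")
```

Working with unrounded fractions, the final value is 6/7 * 3/4 * 1/2 = 18/56, about 0.32.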
KM survival curve
Kaplan-Meier estimator
 Dataset: leukemia.sav
 Remission times of acute leukemia in weeks
 2 treatment groups, 42 observations
 Placebo:
    1 1 2 2 3 4 4 5 5 8 8 8 8 11 11 12 12 15 17 22 23
 6-mercaptopurine:
    6 6 6 6* 7 9* 10 10* 11* 13 16 17* 19* 20* 22 23 25* 32* 32* 34* 35*
 * marks a censored time. The first censored time is 6*: the patient was observed for 6 weeks of follow-up without the event (end of remission) occurring
 Outcome: 0 = censored; 1 = death
 Analyze -> Survival -> Kaplan-Meier
 Time: time to remission
 Status: outcome (1)
    Define event: Single value: 1
 Factor: treatment
 Compare factor: Log rank, Pooled over strata
 Options:
    Statistics: Survival table(s); Mean & Median survival
    Plots: Survival
Interpretation of Kaplan-Meier
 Survival table
 Provides estimate of survival for each event
 Means & Medians Survival Time
 Data summarized in a table that you can report
 Estimated survival times:
    Placebo: 8.6 weeks
    6-mercaptopurine: 23.2 weeks
 Highly significant difference between the 2 groups
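The log-rank comparison of the two treatment groups can be approximated outside SPSS. The sketch below is a hand-rolled two-group log-rank test on the remission times listed for leukemia.sav, coded as (weeks, event) with event = 0 for the starred (censored) times. The function and variable names are illustrative, and the p-value uses the chi-square distribution with 1 degree of freedom.

```python
from math import erfc, sqrt

# (weeks, event): event = 1 for an observed event, 0 for a censored time (*)
placebo = [(t, 1) for t in
           [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]]
six_mp = [(6, 1), (6, 1), (6, 1), (6, 0), (7, 1), (9, 0), (10, 1), (10, 0),
          (11, 0), (13, 1), (16, 1), (17, 0), (19, 0), (20, 0), (22, 1),
          (23, 1), (25, 0), (32, 0), (32, 0), (34, 0), (35, 0)]

def log_rank(group_a, group_b):
    """Two-group log-rank test; returns (chi-square, p-value) for 1 df."""
    pooled = group_a + group_b
    observed_a = expected_a = variance = 0.0
    for t in sorted({ti for ti, e in pooled if e == 1}):
        n = sum(1 for ti, _ in pooled if ti >= t)        # at risk, both groups
        n_a = sum(1 for ti, _ in group_a if ti >= t)     # at risk, group A
        d = sum(1 for ti, e in pooled if ti == t and e == 1)
        d_a = sum(1 for ti, e in group_a if ti == t and e == 1)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 df
    p_value = erfc(sqrt(chi_square / 2))
    return chi_square, p_value

chi_square, p_value = log_rank(six_mp, placebo)
print(f"Log-rank chi-square = {chi_square:.2f}, p = {p_value:.5f}")
```

A large chi-square with a very small p-value is consistent with the highly significant difference reported for the two groups.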
Linear regression
 Model relationship between scalar variable y and one or more explanatory variables X
 Used for:
 Prediction, what is y given X?
 Strength of a relationship between y and Xj
Linear regression
 Dataset: LifeExpectancybyTVandPhysicans.sav
 Handout describes the dataset
 Question:
 Is there a relationship between life expectancy in the different countries and the ratio of people/TV or people/physicians?
Linear regression
 Analyze -> Regression -> Linear
 Dependent: LifeExp
 Independents: TV & Physicians (do as separate analyses)
 Method: Enter
 Statistics: Estimates, Model fit, Descriptives
 Plots: Y: *SDRESID; X: *ZPRED; Histogram and Normal probability plot
Modeling effect of TVs
             Unstandardized Coefficients   Standardized Coefficients
             B         Std. Error          Beta                        t        Sig.
(Constant)   69.648    1.101                                           63.256   0
TV           -0.036    0.008               -0.606                      -4.569   0
The life expectancy is equal to: -0.036*Ratio people/TV + 69.6
For a country with 500 people/TV, the life expectancy is
predicted to be 51.6 years.
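The prediction is just the fitted line evaluated at a chosen ratio. A minimal sketch using the unrounded coefficients from the table (the function name is illustrative):

```python
# Coefficients from the SPSS output for LifeExp regressed on TV
INTERCEPT = 69.648
SLOPE_TV = -0.036  # change in life expectancy per additional person per TV

def predict_life_expectancy(people_per_tv):
    """Predicted life expectancy (years) from the people/TV ratio."""
    return INTERCEPT + SLOPE_TV * people_per_tv

prediction = predict_life_expectancy(500)
print(f"Predicted life expectancy at 500 people/TV: {prediction:.1f} years")
```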
Model summary
R      R Square   Adjusted R Square   Std. Error of the Estimate
.606   0.367      0.349               6.2929
• R: linear correlation between the observed and model-predicted values. A moderate value indicates a moderately strong relationship
• R square: coefficient of determination; about 37% of the variation in LifeExp is explained by the model (about 35% after adjustment)
Once the ratio of people/TV or people/physicians has been lowered, there appears to be no further effect on life expectancy.
Or is there? Try a log-transformation of both TV and Physicians.
Redo the linear regression, using either logTV or logPhysicians as the independent variable
                Unstandardized Coefficients   Standardized Coefficients
                B          Std. Error         Beta                        t        Sig.
(Constant)      102.873    3.942                                          26.098   0
logPhysicians   -11.454    1.238              -0.832                      -9.252   0
• What is the predicted life expectancy for a country with 1000
people/physician?
• Hint: Need to take the log value of 1000 first
• The life expectancy is equal to: -11.45*log(1000) + 102.8
Answer: 68.5 years
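The same calculation in Python, as a sketch. It assumes the transformation was a base-10 logarithm, which is the base that reproduces the 68.5-year answer; the slide does not state the base explicitly.

```python
from math import log10

# Coefficients from the SPSS output for LifeExp regressed on logPhysicians
INTERCEPT = 102.873
SLOPE_LOG_PHYSICIANS = -11.454

def predict_from_physicians(people_per_physician):
    """Predicted life expectancy (years); assumes a base-10 log transform."""
    return INTERCEPT + SLOPE_LOG_PHYSICIANS * log10(people_per_physician)

prediction_log = predict_from_physicians(1000)
print(f"Predicted life expectancy at 1000 people/physician: {prediction_log:.1f} years")
```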
Normal PP plot
Take home points
 Plot your data
 Anscombe’s quartet:
    4 datasets with nearly identical summary statistics.
 AnscombesData.xlsx
 Consider effect size
 Statistical significance does not mean clinical significance
 Does the relationship make sense?
 Association, but is it causative?
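The point about plotting can be illustrated with Anscombe's quartet itself. The sketch below uses the values from Anscombe's published quartet, typed in directly rather than read from AnscombesData.xlsx, and shows that the four datasets share almost the same means and correlation even though their scatter plots look very different. The helper names are illustrative.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {name: (mean(xs), mean(ys), pearson_r(xs, ys))
           for name, (xs, ys) in quartet.items()}

for name, (mx, my, r) in summary.items():
    print(f"Dataset {name}: mean x = {mx:.1f}, mean y = {my:.2f}, r = {r:.2f}")
```

Only a plot of each dataset reveals the curvature, the outlier, and the single influential point that the summary statistics hide.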