Intro to Statistics – Part 2
Maureen J. Donlin
January 18, 2012

Take home exercises
• Exercise 1: Using the child nutrition data set, answer the questions posed in the handout.
• Exercise 2: Using the breast cancer data set, conduct an exploratory analysis of the data. We will use this data set and others during the next session.

Child nutrition data set
• Dataset: NutritionChildren.sav
• Does the amount of juice consumed by the children affect their growth?
• Variables: ChildID, Weight_lbs, Height_cm, Juice, Soda, Energy, Age
• Ages: 94 are 2 years old and 74 are 5 years old
• Gender: unknown in our data set

Recoding variables
• Define short as a height at least 1.5 SD below the mean for the age group
    • Cutoffs: 82.7 cm for age 2 and 102.5 cm for age 5
    • 6 met the criterion for age 2 and 3 for age 5
• Define overweight as a BMI at least 1.5 SD above the mean for the age group (BMI_level)
    • Cutoffs: 18.8 for age 2 and 18.4 for age 5
    • 6 met the criterion for age 2 and 3 for age 5

Recoding variables, cont.
• Excessive juice consumption (JuiceLevel)
    • Mean 5.5 oz/day ± 4.6 (SD)
    • Excessive juice: ≥ 1.5 SD above the mean (12 oz/day)
    • 19 children drank ≥ 12 oz juice/day
• Cross-tab of JuiceLevel * Short: p-value = 0.001
• Cross-tab of JuiceLevel * BMI_level: p-value = 0.067

Breast cancer data
• Dataset: BreastCancerData.sav
• Explore the data
    • 338/1207 missing estrogen receptor status
    • 356/1207 missing progesterone receptor status
    • 86/1207 missing pathological tumor size
    • 12/1207 with a tumor size > 5 cm

Exploring breast cancer data, cont.
• Distribution of the continuous variables
    • Age
    • Pathological tumor size
    • Number of positive lymph nodes
• Use Explore with those 3 variables and no factors

Exploring breast cancer data, cont.
• Dependence of pathological tumor size on the categorical variables estrogen and progesterone receptor status
• Is there a difference in the size of the tumor when they are positive for estrogen or progesterone receptors?
• Use Explore with pathological tumor size as the dependent variable and estrogen and progesterone receptor status as factors

Exploring breast cancer data, cont.
• Is there a dependence of the tumor size on the presence of positive lymph nodes?

Recoding age
• Recode into 4 groups: 20-45, 46-55, 56-66, 67 & older
• Recode into 2 groups: ≤ 55, 56 & older
• Explore the new category using a histogram
• Looking for evenly sized categories

Breast cancer data, cont.
• Is there a dependence of pathological tumor size on age?
• Significant difference between groups?

                 Sum of Squares   df     Mean Square   F       Sig.
Between groups   44.79            3      14.93         15.65   0.000
Within groups    1065.1           1117   0.954
Total            1110.7           1120

Post-hoc testing (Bonferroni)

(I) Age group   (J) Age group   Mean diff. (I - J)   Std. error   Sig.
20-45           46-55            0.269               0.083        0.008
                56-66            0.413               0.085        0.000
                67 & older       0.543               0.081        0.000
46-55           20-45           -0.268               0.083        0.008
                56-66            0.144               0.083        0.500
                67 & older       0.273               0.079        0.004
56-66           20-45           -0.413               0.085        0.000
                46-55           -0.144               0.083        0.500
                67 & older       0.129               0.082        0.691
67 & older      20-45           -0.543               0.082        0.000
                46-55           -0.273               0.079        0.004
                56-66           -0.129               0.082        0.691

Univariate modeling
• Analyze -> General Linear Model -> Univariate
• Dependent variable: Pathological tumor size
• Fixed factors: agecat3 & Lymph nodes? (ln_yesno)
• Model: full factorial
• Plots: ln_yesno*agecat3 & agecat3*ln_yesno
• Options: Display means for all 3 variables & check:
    • Descriptive statistics
    • Estimates of effect size
    • Observed power

Error type model (sum of squares type)
• Type II: assumes balanced design
• Type III: works with balanced and unbalanced designs (default option)
• Type IV: can be used when there is missing data

Presence of a positive lymph node is associated with larger tumor size in all age categories, but the effect is larger for the younger ages.

Univariate analysis of effect of age on tumor size

           Sum of Squares   df     Mean Square   F        Sig.   Partial Eta Squared
Contrast   28.219           3      9.406         10.185   .000   .027
Error      1027.894         1113   .924

Univariate analysis of effect of positive lymph node on tumor size

           Sum of Squares   df     Mean Square   F        Sig.   Partial Eta Squared
Contrast   34.112           1      34.112        36.937   .000   .032
Error      1027.894         1113   .924

Source               Type IV Sum of Squares   df     Mean Square   F          Sig.   Partial Eta Squared
Corrected Model      82.844                   7      11.835        12.815     .000   .075
Intercept            2592.196                 1      2592.196      2806.820   .000   .716
agecat3              28.219                   3      9.406         10.185     .000   .027
ln_yesno             34.112                   1      34.112        36.937     .000   .032
agecat3 * ln_yesno   1.339                    3      .446          .483       .694   .001
Error                1027.894                 1113   .924
Total                4479.321                 1121
Corrected Total      1110.738                 1120

• The model explains ~7.5% of the variance (partial η² = 0.075 for the corrected model; the equivalent r is ~0.27).
• Age and the presence of positive lymph nodes account for similar shares of the explained variance (partial η² of 0.027 and 0.032).
• The interaction of age and lymph node status is negligible (partial η² = 0.001, Sig. = 0.694). They do not interact.

Effect size
• Magnitude of the observed effect:

  $r = \sqrt{\frac{t^2}{t^2 + df}}$

  where t is the t statistic from a t-test and df is the degrees of freedom
• r = 0.10 (small effect; ~1% of the total variance)
• r = 0.30 (medium effect; ~9% of the total variance)
• r = 0.50 (large effect; ~25% of the total variance)

Effect size: Eta² (η²)
• Effect size used in Anova & univariate modeling from SPSS
• η² varies between 0 and 1
• Interpretation: 0.01 ~ small, 0.06 ~ medium, 0.14 ~ large
• Square root of η² approximates r

Source               Type IV Sum of Squares   df   Mean Square   F        Sig.   Partial Eta Squared
Corrected Model      82.844                   7    11.835        12.815   .000   .075
agecat3              28.219                   3    9.406         10.185   .000   .027
ln_yesno             34.112                   1    34.112        36.937   .000   .032
agecat3 * ln_yesno   1.339                    3    .446          .483     .694   .001

Effect size:
• The dependence of tumor size on age has approximately the same effect size as the dependence on the presence of positive lymph nodes.
• The interaction of age and positive lymph nodes has very little effect.

Calculating effect size from t-test

  $r = \sqrt{\frac{t^2}{t^2 + df}}$

T-test of dependence of tumor size on presence of positive lymph nodes:

t        df        Sig. (2-tailed)
-7.164   1119      0
-6.572   381.059   0

Using t = -7.164 and df = 1119: r = 0.21, a moderately small effect

Other questions to consider
• Is tumor size associated with receptor (estrogen or progesterone) status?
• Tumor size was recoded into categories (pathcat)
• Do cross-tabs with pathcat*estrogen status (er) or pathcat*progesterone status (pr)
• Is positive estrogen or progesterone receptor status associated with larger or smaller tumors?

Survival analysis
• What fraction of the population will survive past a certain time?
• What is the probability of survival on condition A versus condition B?

Kaplan-Meier estimator
• Estimates the survival function from life-time data
• Can deal with some types of censored data (e.g. a patient withdraws from the study before the final outcome)
• The steps down represent each point where a subject has died.
• The tick marks represent censored data

Censoring
• Removing a patient from the survival curve at the end of their follow-up time is "censoring" the patient.
• Shown as a tick mark on the survival curve
• Once a patient is censored, the curve becomes an estimate of survival because we no longer know the end point for censored patients

Kaplan-Meier estimator
• S(t): probability of surviving beyond time t
• Rank death times in order: 0 < t(1) < t(2) < t(3) < t(4) ... t(r)
• Within each interval, calculate the probability of dying within that interval
• Probability of dying in interval 4: # deaths in interval 4 / number alive at t(3)
• S(t(4)) = probability of surviving beyond interval 3 * probability of surviving interval 4
• S(4) = S(3) * (1 - probability of dying in interval 4)

7 patients with survivals of: 1, 2+, 3+, 4, 5+, 10, 12+ (+ indicates a censored patient)

Interval   # At Risk,          # Censored        # At Risk,        # Died,           Proportion           Cumulative Survival,
           Start of Interval   During Interval   End of Interval   End of Interval   Surviving Interval   End of Interval
0-1        7                   0                 7                 1                 6/7 = 0.86           0.86
1-4        6                   2                 4                 1                 3/4 = 0.75           0.86 * 0.75 = 0.64
4-10       3                   1                 2                 1                 1/2 = 0.5            0.86 * 0.75 * 0.5 = 0.32
10-12      1                   0                 1                 0                 1/1 = 1.0            0.86 * 0.75 * 0.5 * 1.0 = 0.32

KM survival curve

Kaplan-Meier estimator
• Dataset: leukemia.sav
• Remission times of acute leukemia in weeks
• 2 treatment groups, 42 observations
• Placebo: 1 1 2 2 3 4 4 5 5 8 8 8 8 11 11 12 12 15 17 22 23
• 6-mercaptopurine: 6 6 6 6* 7 9* 10 10* 11* 13 16 17* 19* 20* 22 23 25* 32* 32* 34* 35*
• * marks a censored time. The first censored time is 6, which means the patient was observed for 6 weeks of follow-up without the event occurring
• Outcome: 0 = censored; 1 = death

Analyze -> Survival -> Kaplan-Meier
• Time: time to remission
• Status: outcome (1)
    • Define event: Single value: 1
• Factor: treatment
• Compare factor: Log rank, Pooled over strata
• Options:
    • Statistics: Survival table(s); Mean & Median survival
    • Plots: Survival

Interpretation of Kaplan-Meier
• Survival table
    • Provides an estimate of survival for each event
• Means & Medians Survival Time
    • Data summarized in a table that you can report
• Estimated survival times:
    • Placebo: 8.6 weeks
    • 6-mercaptopurine: 23.2 weeks
• Highly significant difference between the 2 groups

Linear regression
• Models the relationship between a scalar variable y and one or more explanatory variables X
• Used for:
    • Prediction: what is y given X?
    • Strength of a relationship between y and Xj

Linear regression
• Dataset: LifeExpectancybyTVandPhysicans.sav
• Handout describes the dataset
• Question: Is there a relationship between life expectancy in the different countries and the ratio of people/TV or people/physicians?

Linear regression
• Analyze -> Regression -> Linear
• Dependent: LifeExp
• Independents: TV & Physicians (do as separate analyses)
• Method: Enter
• Statistics: Estimates, Model fit, Descriptives
• Plots: Y: *SDRESID; X: *ZPRED; Histogram and Normal probability plot

Modeling effect of TVs

             Unstandardized Coefficients   Standardized Coefficients
             B        Std. Error           Beta                        t        Sig.
(Constant)   69.648   1.101                                            63.256   0
TV           -0.036   0.008                -0.606                      -4.569   0

• The life expectancy is equal to: -0.036 * (ratio people/TV) + 69.6
• For a country with 500 people/TV, the life expectancy is predicted to be -0.036 * 500 + 69.6 = 51.6 years.

Model summary

R       R Square   Adjusted R Square   Std. Error of the Estimate
0.606   0.367      0.349               6.2929

• R: linear correlation between the observed and model-predicted values. A moderate value indicates a moderately strong relationship
• R square: coefficient of determination; about 37% of the variation in LifeExp is explained by the model (about 35% after adjustment)

Once you've lowered the ratio of people/TV or people/physicians, there is no further effect of those on the life expectancy. Or is there?
• Try a log-transformation of both TV and physicians
• Redo the linear regression, using either logTV or logPhysicians as the independent

                Unstandardized Coefficients   Standardized Coefficients
                B         Std. Error          Beta                        t        Sig.
(Constant)      102.873   3.942                                           26.098   0
logPhysicians   -11.454   1.238               -0.832                      -9.252   0

• What is the predicted life expectancy for a country with 1000 people/physician?
• Hint: take the log (base 10) of 1000 first
• The life expectancy is equal to: -11.454 * log10(1000) + 102.873
• Answer: 68.5 years

Normal P-P plot

Take home points
• Plot your data
    • Anscombe's quartet: 4 datasets with nearly identical summary statistics that look very different when plotted
    • AnscombesData.xlsx
• Consider effect size
    • Statistical significance does not mean clinical significance
• Does the relationship make sense?
    • There is an association, but is it causal?
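
The seven-patient worked example from the Kaplan-Meier slides can be checked with a short product-limit calculation. What follows is a minimal Python sketch, not the SPSS procedure used in the deck; the function name `kaplan_meier` and the `(time, died)` tuple layout are assumptions made for illustration. It follows the rule from the slides: at each death time, multiply the running survival by (1 - deaths / number at risk), and let censored patients leave the risk set without creating a step.

```python
def kaplan_meier(observations):
    """Product-limit estimate of survival.

    observations: list of (time, died) pairs, where died is True for a
    death and False for a censored patient (hypothetical layout).
    Returns one (time, at_risk, deaths, survival) tuple per death time.
    """
    obs = sorted(observations)
    at_risk = len(obs)      # everyone is at risk at time 0
    survival = 1.0
    steps = []
    i = 0
    while i < len(obs):
        t = obs[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(obs) and obs[i][0] == t:
            deaths += 1 if obs[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            # Survival only steps down at times where a death occurs.
            survival *= 1 - deaths / at_risk
            steps.append((t, at_risk, deaths, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= leaving
    return steps


# Survival times from the slide: 1, 2+, 3+, 4, 5+, 10, 12+ (+ = censored)
patients = [(1, True), (2, False), (3, False), (4, True),
            (5, False), (10, True), (12, False)]

for t, n, d, s in kaplan_meier(patients):
    print(f"t={t:>2}  at risk={n}  deaths={d}  S(t)={s:.2f}")
```

The three steps fall at times 1, 4, and 10 with 7, 4, and 2 patients at risk, and the cumulative survival is 6/7 ≈ 0.86, then 0.64, then 6/7 × 3/4 × 1/2 ≈ 0.32. The final censored patient at time 12 leaves the estimate unchanged.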