Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Introduction to Statistics Lecture 4 1 Outline Regression Correlation in more detail Multiple Regression ANCOVA Normality Checks Non-parametrics Sample Size Calculations 2 Linear Regression & Correlation Sometimes we have two quantitative measurements which are related in some way e.g. the peak height on a chromatograph and the concentration of analyte in a specimen They are called the dependent variable (Y) and the independent variable (X) Y & X may be arbitrary, but if a variable is set by the experimenter rather than simply observed then that variable is X 3 Regression & Correlation A correlation measures the “degree of association” between two variables (interval (50,100,150…) or ordinal (1,2,3...)) Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other) 4 Correlation coefficient r The (Pearson) sample correlation coefficient is defined as r ( xi x )( yi y ) 2 2 ( xi x ) ( yi y ) Describes how y varies with x (numerator) and scaled (by dividing by the variability of x and of y) so -1 ≤ r ≤ 1 -1 = 100% negative correlation, +1 = 100% positive correlation 5 Interpreting r r=0 no relationship |r|=1 means a perfect linear relationship |r| < 0.4: weak relationship 0.4 ≤ |r| ≤ 0.7: moderate relationship |r| > 0.7: strong relationship P-value assigned to r tells us how likely it is that r is right, i.e. confidence in our statement 6 Example Daytime SBP and age collected for 447 hypertensive males. SBP 115 130 128 123 126 … Age 34 40 28 21 39 … 7 Example (contd) Is there a linear relationship between SBP and Age? r=0.145 weak positive relationship 8 Correlation examples 9 Example 2: Height vs. Weight Graph One: Relationship between Height and Weight 180 Strong positive correlation between height and weight Can see how the relationship works, but cannot predict one from the other If 120cm tall, then how heavy? Weight (kgs) 160 140 120 100 80 60 40 20 0 0 50 100 150 200 Height (cms) 10 Regression Problem: to draw a straight line through the points that best explains the variance 9 8 7 Line can then be used to predict Y from X 6 5 4 3 2 1 0 0 1 2 3 4 5 6 11 Example: Symptom Index vs Drug A Graph Three: Relationship between Symptom Index and Drug A (with best-fit line) “Best fit line” allows us to describe relationship between variables more accurately. We can now predict specific values of one variable from knowledge of the other All points are close to the line 180 Symptom Index 160 140 120 100 80 60 40 20 0 0 50 100 150 200 250 Drug A (dose in mg) 12 Simple Linear Regression Assume the population regression line: y = α+βx 200 180 160 Where: α = y intercept (constant) β = slope of line y = dependent variable x = independent variable 140 120 100 80 60 40 20 0 0 50 100 150 200 250 yi = α+βxi+εi 13 Regression Establish equation for the best-fit line: y = a+bx Best-fit line same as regression line b is the regression coefficient for x x is the predictor or regressor variable for y 14 Fit a line to the data: Not great: 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 15 Fit a line to the data: Better: 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 16 Least Squares Minimise the (squared) distance between the points and the line a and b are the estimates of α and β which minimise yi ( xi ) 2 17 Least Squares Estimates Using calculus (partial derivatives), we get ( xi x )( yi y ) b 2 ( xi x ) a y bx Note b is related to the correlation coefficient r (same numerator)- if x and y are positively correlated then the slope is positive 18 Example from the literature Predicting tandem repeat variability 19 Example: symptom Index versus Drug A dose Graph Three: Relationship between Symptom Index and Drug A (with best-fit line) “Best fit line” Allows us to describe relationship between variables more accurately. We can now predict specific values of one variable from knowledge of the other All points are close to the line 180 Symptom Index 160 140 120 100 80 60 40 20 0 0 50 100 150 200 250 Drug A (dose in mg) 20 Example: Symptom Index versus Drug B dose We can still predict specific values of one variable from knowledge of the other Will predictions be as accurate? Why not? Large “residual” variation (random error) Graph Four: Relationship between Symptom Index and Drug B (with best-fit line) 160 Symptom Index 140 120 100 80 60 40 20 0 0 50 100 150 Drug B (dose in mg) 200 250 = Difference between observed data and that predicted by the equation 21 Regression Hypothesis Tests Hypotheses about the intercept H0: α = 0 HA: α 0 But most research focuses on the slope H0: β = 0 HA: β 0 This addresses the general question “Is X predictive of Y?” 22 Regression Estimates of a slope (b) have a sampling distribution, like any other statistic If certain assumptions are met (NB normality, homogeneity of variance) the sampling distribution approximates the t-distribution Thus, we can assess the probability that a given value of b would be observed, if = 0 → hypothesis tests & confidence intervals 23 Regression R2, the coefficient of determination, is the percentage of variation explained by the “regression”. R2 > 0.6 is deemed reasonably good. Note, the model must also be significant, e.g. 24 Back to SBP and Age example a=123 and b=0.159 approximately What does b mean? Is age predictive of BP? i.e. is there evidence that b ≠ 0? How good is the fit of the regression line? 25 Regression 2 R Interpretation R2= proportion of variation explained by (or predictive ability of) the regression Y variation Variation in y is almost fully explained by x: R2 ≈ 1 Y variation Still some variation in y left over (not explained by x): R2 < 1 26 25 UNexplained error 20 TOTAL error 15 explained error 10 MEAN 5 52 48 44 40 36 32 28 24 20 16 12 8 REGRESSION LINE 4 0 0 27 Regression – four possibilities b≠0 P-value nonsignificant Relationship but not much evidence b≈0 P-value nonsignificant No relationship & not much evidence b≠0 P-value significant Plenty of evidence for a relationship b≈0 P-value significant Plenty of evidence for no relationship 28 Regression Assumption: Homoskedasticity (Equal Error Variance) 10 8 Examine error at different values of X. Is it roughly equal? 6 Here, things look pretty good. 4 2 0 0 20000 40000 60000 80000 100000 INCOME 29 Heteroskedasticity: Unequal Error Variance At higher values of X, error variance increases a lot. 10 8 6 A transformation of data (e.g. log) can remove heterskedasticity 4 2 0 0 20000 10000 INCOME 40000 30000 60000 50000 80000 70000 100000 90000 30 50% of slides complete! 31 Multiple Regression Extension of simple linear regression to more than one (continuous/ordinal) independent variables We use least squares in exactly the same way to obtain estimates of the regression coefficients e.g. with 2 independent variables x and z, we fit the regression y=a+bx+cz… where a,b and c are the regression coefficients. This represents a plane in 3d space Previous example 32 Notes on multiple regression Make sure variables are normal. If not, transform them. If still not, can split into 2 groups (categories (0/1)) for e.g. high vs. low responders Can combine with “stepwise selection”: instead of using every variable and forcing them into a final model, can drop out variables automatically, e.g. petri dish temperature, that are not predictive 33 Example Study to evaluate the effect of the duration of anesthesia and degree of trauma on percentage depression of lymphocyte transformation 35 patients Trauma factor classified as 0, 1, 3 and 4, depending upon severity 34 Duration Trauma Depression Duration Trauma Depression 4 6 1.5 4 2.5 3 3 2.5 3 3 2 8 5 2 2.5 2 1.5 1 3 3 2 2 2 2 2 2 3 3 3 3 4 2 2 2 2 1 36.7 51.3 40.8 58.3 42.2 34.6 77.8 17.2 -38.4 1 53.7 14.3 65 5.6 4.5 1.6 6.2 12.2 3 3 3 3 4 4 2 2 2 1 1 1 1 1 3 4 2 29.9 76.1 11.5 19.8 64.9 47.8 35 1.7 51.5 20.2 -9.3 13.9 -19 -2.3 41.6 18.4 9.9 3 4 3 3 7 6 2 4 2 1 1 2 1 3 4 8 2 35 Example (con’t) Fitted regression line is y=-2.55+10.375x+1.105z or Depression= -2.55+10.375*Trauma+1.105*Duration Both slopes are non-significant (p-value=0.1739 for trauma, 0.7622 for duration) R2=16.6% of the variation in lymphocyte depression is explained by the regression Conclusion: Trauma score and duration of anesthesia are poor explanations for lymphocyte depression 36 Collinearity If two (independent) variables are closely related its difficult to estimate their regression coefficients because they tend to get confused This difficulty is called collinearity Solution is to exclude one of the highly correlated variables 37 Example (cont’d) Correlation between trauma and duration= 0.762 (quite strong) Drop trauma from regression analysis Depression=9.73+4.94*Duration P-value for duration is 0.0457, statistically significant! However, the R2 is still small (11.6%) Conclusion: Although there is evidence for a non-zero slope or linear relationship with duration, there is still considerable variation not explained by the regression. 38 Outliers in Regression Outliers: cases with extreme values that differ greatly from the rest of your sample Even a few outliers can dramatically change estimates of the slope (b) Outliers can result from: Errors in coding or data entry (→rectify) Highly unusual cases (→exclude?) Or, sometimes they reflect important “real” variation (→include?) 39 Outliers: Example Extreme case that pulls regression line up 4 2 -4 -2 0 -2 -4 2 4 Regression line with extreme case removed from sample 40 What about non-linear relationships? Fail to reject β=0 Reject β=0 41 Non-linear models Linear Regression fits a straight line to your data Non-linear models may, in some cases, be more appropriate Non-linear models are usually used in situations where nonlinear relationships have some biological explanation e.g. non-linear models are commonly used in pharmacokinetics studies and compartmental analysis Computationally intensive - estimates may not “converge” and results may be difficult to interpret 42 Analysis of Covariance-ANCOVA Modelling both categorical and continuous independent variables (covariates) Justification: Consideration of a covariate may improve the precision (reduce variability) of the comparisons of categories made in ANOVA Often used as a correction for baseline imbalances ANOVA + regression combined 43 44 Example Antihypertensive Drug Clinical Trial: 84 patients randomly allocated to either Aliskiren or Losartan. Blood pressure measured at baseline and after several weeks treatment. Age and gender was also recorded for each patient. Is there a significant difference in the two treatments for BP reduction? 45 Age 66 62 48 32 61 68 60 33 69 53 63 64 58 52 54 43 46 46 66 29 50 49 40 52 68 35 49 47 Treatment SPP75 SPP75 Los100 Los100 Los100 SPP75 Los100 SPP75 Los100 SPP75 SPP75 Los100 SPP75 SPP75 SPP75 SPP75 SPP75 Los100 Los100 SPP75 SPP75 SPP75 SPP75 Los100 Los100 SPP75 Los100 SPP75 Ba_daySBP Red_daySBP gender 131.1 0.6 Female 160.4 -0.7 Male 147.3 17.9 Male 144.8 -2.4 Male 150.7 21.6 Female 152.4 6 Male 143.6 15.3 Male 143.2 5.1 Male 166.6 24.6 Male 147.6 Female 163.7 -0.2 Male 145.7 15.5 Male 168.3 0.5 Male 156.8 0.6 Male 154.9 7.3 Male 170.5 -2.7 Female 155.5 18.3 Female 173.1 26.7 Male 151.2 -1.7 Male 139.8 -1.5 Male 162.6 13 Male 178.8 17.2 Female 146.8 0 Male 157.1 -0.2 Female 152 8.9 Male 145.4 2.8 Male 153.7 14.9 Male 139.2 10 Male Age 45 68 48 69 64 40 61 47 60 35 61 62 54 45 57 48 59 47 54 45 61 69 21 69 66 42 47 60 Treatment Los100 Los100 Los100 SPP75 Los100 SPP75 SPP75 Los100 Los100 SPP75 SPP75 Los100 Los100 Los100 SPP75 SPP75 Los100 SPP75 Los100 SPP75 SPP75 Los100 SPP75 SPP75 Los100 SPP75 SPP75 SPP75 Ba_daySBP Red_daySBP gender 156 17 Male 149.9 6.8 Female 147 16 Female 145.3 -1.2 Male 142.5 20.6 Female 168.6 2.6 Male 165.6 8.1 Male 156.3 2.3 Female 147.7 11.5 Female 157.4 12.5 Male 143.7 -1.8 Male 148.6 6.6 Male 164.6 39.6 Female 145.3 -6.1 Female 143.9 7 Female 144.3 2.4 Male 147.8 1.1 Female 150.4 -2.3 Male 143.9 -0.2 Male 145.1 6.6 Male 158 3 Male 154.8 Male 142.1 Male 171.3 -2 Male 140.3 2.8 Male 146 5.7 Male 159.3 14.6 Female 157.8 6.8 Male 46 Age 45 62 57 54 54 63 35 52 66 55 61 35 52 60 37 54 62 55 57 40 56 53 52 46 Treatment SPP75 SPP75 SPP75 Los100 Los100 Los100 SPP75 SPP75 Los100 SPP75 Los100 Los100 Los100 SPP75 SPP75 Los100 Los100 Los100 Los100 SPP75 SPP75 Los100 Los100 SPP75 Ba_daySBP Red_daySBP gender 162.9 10.7 Male 173.4 55.9 Female 141.1 0.2 Female 147.6 11.1 Male 140.5 -16.5 Male 156 19.3 Female 150.9 26.8 Male 143.8 -9.7 Female 150.9 0.6 Male 155.8 -3 Female 162.1 21 Male 149.1 21.3 Male 177 -2.7 Male 157.8 6.8 Male 143.4 -3.9 Male 163.9 26.6 Male 147.4 Female 141.9 8.7 Female 163.1 3.6 Male 153.3 0.9 Male 165.9 11.9 Male 144.1 -1.9 Female 144.1 18.6 Male 147.7 -5.6 Male Age Treatment 58 41 42 57 56 Los100 SPP75 Los100 SPP75 Los100 Ba_daySBP Red_daySBP gender 153 149.9 165.8 154.8 146.6 -10.3 4.3 35.5 16.1 18.1 Male Male Male Female Female 47 Analysis without covariates Since treatment has only 2 levels (Losartan & Aliskiren), the ANOVA is equivalent to the two-sample t-test Treatment difference (LosartanAliskiren)=5.06 with P-value=0.0554 Borderline non-significant at the 5% level of significance 48 ANCOVA analysis We have as factors (categorical independent variables) Treatment Gender As covariates (continuous independent variables) Age Baseline BP 49 Results: ANOVA table Source df SS MS F P-value Baseline BP Treatment Age Gender 1 1 1 1 1572.90 651.73 71.57 279.02 1572.90 651.73 71.57 279.02 14.36 5.58 0.61 2.39 0.0005 0.0208 0.4363 0.1264 50 Distributional assumptions Classification of statistical methods based on distributional assumptions Likelihood Based Methods Assume a distribution of data explicitly at the outset Can fit very complex models e.g. model correlation structures with the data, allow for unequal variances, etc. T-tests, ANOVA, regression, etc. rely on normality of data or that the data be of a sufficiently large size (Central Limit Theorem) Non-parametric methods Relaxes assumptions about the shape of the distribution Methods are based on ranking of the data points 51 So how normal is your data? Difficult to see visually if a histogram looks normal Use normal probability (quantile-quantile) plots Points must lie along a line Also useful for detecting outliers 52 Significant Deviations from Normality: Alternatives Transform your data to make it more “normal” and use standard parametric tests e.g. log transform (eliminates skew) Difficulty - may wish to reverse-transform results back (e.g. exponentiate parameter estimates) Use non-parametric methods Make less assumptions on distribution shape These may have less power than parametric methods 53 Non-parametric Equivalents Parametric Non-parametric Paired T-test Wilcoxon Signed Rank Test Two-sample T-test Wilcoxon Rank-Sum ANOVA Kruskal Wallis Analysis Pearson’s Correlation Spearman’s Rank Correlation ANCOVA ANCOVA on ranked data 54 Other Multivariate Methods Multivariate Analysis of Variance (MANOVA) ANOVA with more than one dependent variable or response Also MANCOVA Low power is a problem Not so widely used The objectives behind multivariate analyses can be quite different (to those presented), namely Discriminant Analysis Classification Clustering Pattern Recognition (principal components analysis) 55 Sample Size Calculations How many observations are needed in a study? No mystery to this: sample sizes are often calculated by inverting the computations involved in determining statistical significance Unfortunately, several quantities are required before we can do any calculations (and the argument is a little circular!) 56 Need to specify: Study design and statistical test to be used e.g. we wish to compare 2 treatments, use the two-sample t-test, two-tailed Level of significance α (usually 0.05 i.e. 5%) Power β (commonly set to 0.80) But we also need (with more difficulty): Difference we would like to detect? An estimate of the variability σ2 one expects to see in the data? 57 Example: Two-sample t-test The sample size calculation for the two-sample t-test is (with n per group) n Where 2( z1 z1 )2 2 1 2 (called the standardized difference) 58 Example: Two-sample t-test calculation In order to detect a standardised difference of 0.3 at the 5% level of significance (one-tail) and 80% power (z0.95=1.645, z0.80=0.84), we need a sample size of 138 per group. 59 Recommended Books/Links “Statistical Methods in Medical Research” by Armitage, Berry, Matthews (4th edition, 2002). Oxford, Blackwell Science “Practical Statistics for Medical Research” by Altman (1999). Kluwer Academic Publishers# Statnotes: Topics in Multivariate Analysis, by G. David Garson: http://www2.chass.ncsu.edu/garson/PA765/statnote.htm Statistical computing resources – UCLA: http://www.ats.ucla.edu/stat/ 60 Statistics Software Excel stats add-in is not recommended STATA, S-Plus, Minitab, SAS, JMP, SPSS, R Stats calculations cannot be trusted (e.g. standard deviations may be inaccurate) Doesn’t have the functionality of other software Certain amount of programming skill required but well worth it R is free; SAS is most expensive (up to 10k per person per annum) DataDesk, InStata –user friendly. “point-and-click”, with reasonable functionality 61 You’re free!! Questions? [email protected] 62