Statistics 1: tests and linear models

How to get started?
• Exploring data graphically:
  – Scatterplot
  – Boxplot
  – Histogram

Important things to check
• Are all the variables in the correct format?
• Do there seem to be outliers?
  – Mistake in data coding?

Initial structure of the analyses
• What is the response variable?
• What are the explanatory variables?
• Explore patterns visually
  – Correlations?
  – Differences between groups?

Summary statistics
• summary(data), summary(x)
• mean(x), median(x)
• range(x)
• var(x), sd(x)
• min(x), max(x)
• quantile(x, p)
• tapply(), table()

Tests
• Test for normality
  – Shapiro-Wilk test: shapiro.test()
  – QQ plot: qqnorm(), qqline()
• Homogeneity of variance
  – var.test() (for two groups)
  – bartlett.test() (for several groups)

Tests for differences in means
• Student's t-test: t.test()
  – One- or two-sample test
    • Testing whether a sample mean differs from, e.g., 0
    • Testing whether the sample means of two groups differ
  – Paired/unpaired
    • Are pairs of measurements associated?
  – Variances homogeneous/non-homogeneous
  – Assumes normally distributed data
• Wilcoxon test: wilcox.test()
  – Normality not required
  – Paired/unpaired

DEMO 1

Correlation
• cor(x, y) calculates the correlation coefficient between two numeric variables
  – close to 0: no correlation
  – close to 1 (or -1): strong correlation
• Is the correlation significant?
  – cor.test(y, x)
  – Note: check also graphically!

Confidence intervals and standard errors
• Typical ways of describing uncertainty in a parameter estimate (e.g. a mean)
  – Standard error (the SE of the mean is sqrt(var(xx)/n))
  – Confidence interval (95%)
    • The range within which the value lies with probability 95%
    • Normal approximation: 1.96*SE, so that the 95% CI for mean(xx) is [mean(xx) - 1.96*SE(xx), mean(xx) + 1.96*SE(xx)]
• If the data are not normally distributed, bootstrapping can be helpful
  – Assume we have measured age at death for 100 rats; a 95% CI for the mean age at death can be derived as follows:
    1. take a sample of 100 rats with replacement from the original data
    2. calculate the mean
    3. repeat steps 1 & 2 e.g. 1000 times, always recording the mean
    4. the 2.5% and 97.5% quantiles of the recorded means give the 95% CI for the mean

EXERCISE TOMORROW!

Linear model and regression
• Models the response variable through additive effects of explanatory variables
  – E.g. how does the stopping distance of a car depend on its speed?
  – Or how does the weight of an animal depend on its length?

The formula
Y = a + b1*x1 + … + bn*xn + ε
  – Y: response variable
  – a: intercept
  – x1 … xn: explanatory variables
  – ε: normally distributed error term, i.e. 'random noise'
Regression, ANOVA or ANCOVA?

How to interpret…
• Intercept:
  – Baseline value for Y
  – The value Y is expected to take if all the predictors are 0
  – If one or more of the predictors are factors, this is the value predicted for the reference levels of those factors
• Coefficients bn
  – If xn is a numeric variable, then increasing xn by one unit increases the value of Y by bn
  – If xn is a factor, then the parameter bn takes a different value for each factor level, so that Y increases by the value of bn corresponding to the level of xn
    • Note: the reference level of xn is included in the intercept

Fitting the model in R
• lm(y ~ x, data = "name of your dataset")
• Formula:
  – y ~ x        intercept + the effect of x
  – y ~ x - 1    no intercept
  – y ~ x + z    multiple regression with main effects
  – y ~ x * z    multiple regression with main effects and interactions
• Exploring the model: summary(), anova(), plot("model")

plot() command in lm
Produces four diagnostic figures:
  1. Residuals against fitted values
  2. QQ plot for the residuals
  3. Standardized residuals against fitted values
  4. 'Influence' (leverage) plotted against residuals: identifies outliers
Residuals should be normally distributed and not show any systematic trends. If not OK, then:
  -> transformation of the response: sqrt(), ln(), …
  -> transformations of the explanatory variables
  -> should a generalized linear model be used?

How to predict?
Y = a + b1*x1 + … + bn*xn
  – left-hand side: the expected value of Y
  – x1 … xn: values of the predictors
  – a, b1 … bn: estimated model parameters
In R: the predict() function.

Briefly about model selection
• The aim: the simplest adequate model
  – Few parameters preferred over many
  – Main effects preferred over interactions
  – Untransformed variables preferred over transformed ones
  – The model should still not be oversimplified
• Simplifying a model
  – Are the effects of the explanatory variables significant?
  – Does deletion of a term increase the residual variation significantly?
• Model selection tools:
  – anova()  tests the difference in residual variation between alternative (nested) models
  – step()   stepwise model selection based on AIC values

DEMO 2
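As a sketch of the bootstrap recipe from the confidence-interval slides, the four numbered steps can be written in a few lines of R. The data here are simulated (`ages` is a made-up stand-in for the measured ages at death of 100 rats), so the numbers are illustrative only:

```r
# Bootstrap 95% CI for a mean -- sketch under assumed, simulated data.
set.seed(1)                         # for reproducibility
ages <- rexp(100, rate = 1 / 24)    # fake data: 100 "ages at death"

# Steps 1-3: resample with replacement, record the mean, repeat 1000 times
boot_means <- replicate(1000, mean(sample(ages, replace = TRUE)))

# Step 4: the 2.5% and 97.5% quantiles of the means give the 95% CI
ci <- quantile(boot_means, c(0.025, 0.975))
print(ci)
```

Note that `sample(ages, replace = TRUE)` draws a resample of the same size as the original data, which is what the recipe requires.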
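The lm() workflow from the linear-model slides can be sketched end to end on R's built-in `cars` dataset, which happens to match the stopping-distance example (`dist` = stopping distance, `speed` = speed). This is a minimal illustration, not the course demo itself:

```r
# Fit the linear model dist = a + b*speed + error
fit <- lm(dist ~ speed, data = cars)

summary(fit)    # coefficients, R^2, significance of 'speed'
coef(fit)       # intercept a and slope b
# plot(fit) would draw the four diagnostic figures described above

# Prediction: expected stopping distance at new speeds
predict(fit, newdata = data.frame(speed = c(10, 20)))

# Model selection: compare against the intercept-only model
fit0 <- lm(dist ~ 1, data = cars)
anova(fit0, fit)    # does deleting 'speed' increase residual variation?
step(fit)           # stepwise selection by AIC (trivial with one predictor)
```

With one predictor the `step()` call is of course trivial; it becomes useful once several main effects and interactions compete for inclusion.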