Statistics 1: tests and linear models

How to get started?
• Exploring data graphically:
Scatterplot
Boxplot
Histogram
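These exploratory plots can be drawn with base R graphics; a minimal sketch using the built-in cars and iris datasets:

```r
# Exploratory plots with base R graphics (built-in datasets)
plot(cars$speed, cars$dist)                    # scatterplot: stopping distance vs speed
boxplot(Sepal.Length ~ Species, data = iris)   # boxplots by group
hist(iris$Sepal.Length)                        # histogram of one numeric variable
```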
Important things to check
• Are all the variables in the correct format?
• Do there seem to be outliers?
– Mistake in data coding?
Initial structure of the analyses
• What is the response variable?
• What are the explanatory variables?
• Explore patterns visually
– Correlations?
– Differences between groups?
Summary statistics
• summary(data), summary(x)
• mean(x), median(x)
• range(x)
• var(x), sd(x)
• min(x), max(x)
• quantile(x,p)
• tapply(), table()
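A quick sketch of these functions on the built-in iris data:

```r
# Summary statistics on the built-in iris data
summary(iris$Sepal.Length)                       # min, quartiles, mean, max
mean(iris$Sepal.Length); median(iris$Sepal.Length)
range(iris$Sepal.Length)                         # min and max together
var(iris$Sepal.Length); sd(iris$Sepal.Length)
quantile(iris$Sepal.Length, 0.9)                 # 90th percentile
tapply(iris$Sepal.Length, iris$Species, mean)    # group-wise means
table(iris$Species)                              # counts per factor level
```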
Tests
• Test for normality
– Shapiro–Wilk test: shapiro.test()
– QQ plot: qqnorm(), qqline()
• Homogeneity of variance
– var.test() (for two groups)
– bartlett.test() (for several groups)
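A sketch of these tests on simulated data (the variable names here are illustrative):

```r
# Normality and variance-homogeneity checks on simulated data
set.seed(1)
x <- rnorm(30)              # normal data
y <- rnorm(30, sd = 2)      # normal data with larger variance

shapiro.test(x)             # large p-value: no evidence against normality
qqnorm(x); qqline(x)        # points should fall close to the line

var.test(x, y)              # F test comparing two variances
bartlett.test(list(x, y))   # Bartlett's test; also works for several groups
```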
Tests for differences in means
• Student’s t-test: t.test()
– One or two sample test
• Testing if sample mean differs e.g. from 0
• Testing if sample means of two groups differ
– Paired/non-paired
• Are pairs of measurements associated?
– Variance homogeneous/non-homogeneous
– Assumes normally distributed data
• Wilcoxon’s test: wilcox.test()
– Normality not required
– Paired/non-paired
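A sketch using the built-in sleep data (the same 10 subjects measured under two drugs, so a paired test is also meaningful):

```r
# One- and two-sample tests on the built-in 'sleep' data
t.test(sleep$extra, mu = 0)            # one-sample: does the mean differ from 0?
t.test(extra ~ group, data = sleep)    # two-sample (Welch) t-test

# Paired version: same subjects under both conditions
with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))

wilcox.test(extra ~ group, data = sleep)  # non-parametric alternative
```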
DEMO 1
Correlation
• cor(x,y) calculates the correlation coefficient between two numeric
variables
– close to 0: no correlation
– close to ±1: strong (positive or negative) correlation
• Is the correlation significant?
– cor.test(y,x)
– Note: check also graphically!
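For example, with the built-in cars data:

```r
# Correlation between speed and stopping distance (built-in 'cars')
cor(cars$speed, cars$dist)        # Pearson correlation, about 0.81
cor.test(cars$speed, cars$dist)   # significance test with a 95% CI
plot(cars$speed, cars$dist)       # always check the pattern visually
```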
Confidence intervals and standard errors
• Typical ways of describing uncertainty in a parameter value (e.g. the
mean)
– Standard error (SE of the mean is sqrt(var(xx)/n))
– Confidence interval (95%)
• The range that contains the true value with 95% probability
• Normal approximation: 1.96*SE, so that the 95% CI for mean(xx) is
[mean(xx) - 1.96*SE(xx), mean(xx) + 1.96*SE(xx)]
• If the data are not normally distributed, bootstrapping can be helpful
– Assume we have measured age at death for 100 rats; a
95% CI for the mean age at death can be derived by:
» 1. take a sample of 100 rats with replacement from the original data
» 2. calculate the mean
» 3. repeat steps 1 & 2 e.g. 1000 times, always recording the mean
» 4. the 2.5% and 97.5% quantiles of the means give the 95% CI for the mean
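The recipe above can be sketched in R; the rat lifespans here are simulated for illustration:

```r
# Bootstrap 95% CI for a mean, following the four steps above
set.seed(42)
age <- rgamma(100, shape = 20, rate = 0.01)   # hypothetical ages at death

# Steps 1-3: resample with replacement, record the mean, repeat 1000 times
boot_means <- replicate(1000, mean(sample(age, replace = TRUE)))

# Step 4: the 2.5% and 97.5% quantiles give the 95% CI
quantile(boot_means, c(0.025, 0.975))

# For comparison, the normal approximation:
se <- sqrt(var(age) / length(age))
mean(age) + c(-1.96, 1.96) * se
```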
EXERCISE TOMORROW!
Linear model and regression
• Models the response variable through additive effects of explanatory
variables
– E.g. how does stopping distance of a car depend on speed?
– Or how does the weight of an animal depend on its length?
The formula
Y = a + b1x1 + … + bnxn + ε
– Y: response variable
– a: intercept
– x1 … xn: explanatory variables
– ε: normally distributed error term, i.e. ‘random noise’
Regression, ANOVA or ANCOVA?
How to interpret…
• Intercept:
– Baseline value for Y
– The value that Y is expected to get if all the predictors are 0
– If one/some of the predictors are factors, then this is the value
predicted for the reference levels of the factors
• Coefficients bn
– If xn is a numeric variable, then an increase of xn by one unit
increases the value of Y by bn
– If xn is a factor, then bn gets a different value for each
factor level, so that Y increases by the value of bn corresponding
to the level of xn
• Note: the reference level of x is included in the intercept
Fitting the model in R
• lm(y ~ x, data = mydata), where mydata is the name of your data frame
• Formula:
y ~ x        intercept + the effect of x
y ~ x - 1    no intercept
y ~ x + z    multiple regression with main effects
y ~ x * z    multiple regression with main effects and interactions
• Exploring the model: summary(), anova(), plot(model)
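Putting this together on the built-in cars data:

```r
# Fit stopping distance as a function of speed (built-in 'cars')
model <- lm(dist ~ speed, data = cars)
summary(model)   # coefficients, standard errors, R-squared
anova(model)     # ANOVA table for the fitted terms
coef(model)      # intercept a and slope b
```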
plot() command in lm
Produces four figures:
1. Residuals against fitted values
2. QQ plot for residuals
3. Standardized residuals (scale–location)
4. Residuals against leverage (‘influence’): identifies outliers
Residuals should be normally distributed and not show any systematic
trends. If not OK, then:
-> transformation of the response: sqrt(), log(), …
-> transformations of explanatory variables
-> should a generalized linear model be used?
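A sketch of the diagnostics and one possible transformation (the sqrt refit is illustrative, not a recommendation for these data):

```r
# Diagnostic plots for a fitted linear model (built-in 'cars')
model <- lm(dist ~ speed, data = cars)
par(mfrow = c(2, 2))
plot(model)                     # all four diagnostic figures at once
par(mfrow = c(1, 1))

# If the residuals look bad, try a transformed response:
model2 <- lm(sqrt(dist) ~ speed, data = cars)
plot(model2, which = 1)         # residuals vs fitted only
```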
How to predict?
Y = a + b1x1 + … + bnxn
– Y: expected value of the response
– x1 … xn: values of the predictors
– a, b1 … bn: estimated model parameters
In R, use the predict() function.
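For example, with a model fitted to the built-in cars data:

```r
# Predictions from a fitted model (built-in 'cars')
model <- lm(dist ~ speed, data = cars)
newdata <- data.frame(speed = c(10, 20, 30))    # predictor values of interest
predict(model, newdata)                          # expected stopping distances
predict(model, newdata, interval = "confidence") # with 95% confidence intervals
```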
Briefly about model selection
• The aim: simplest adequate model
– Few parameters preferred over many
– Main effects preferred over interactions
– Untransformed variables preferred over transformed
– Model should still not be oversimplified
• Simplifying a model
– Are effects of explanatory variables significant?
– Does deletion of a term increase residual variation significantly?
• Model selection tools:
– anova() Tests difference in residual variation between alternative models
– step() Stepwise model selection based on AIC values
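A sketch of both tools on the built-in mtcars data (the models here are illustrative):

```r
# Compare nested models and run stepwise selection (built-in 'mtcars')
full <- lm(mpg ~ wt * hp, data = mtcars)   # main effects + interaction
main <- lm(mpg ~ wt + hp, data = mtcars)   # main effects only

anova(main, full)   # does dropping the interaction increase residual variation significantly?
step(full)          # stepwise simplification based on AIC
```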
DEMO 2