Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Corecursion wikipedia , lookup
Inverse problem wikipedia , lookup
Computer simulation wikipedia , lookup
Expectation–maximization algorithm wikipedia , lookup
Data analysis wikipedia , lookup
Predictive analytics wikipedia , lookup
Regression analysis wikipedia , lookup
Generalized linear model wikipedia , lookup
Least squares wikipedia , lookup
Introduction to Biostatistical Analysis Using R Statistics course for PhD students in Veterinary Sciences Session 2 Lecture: Introduction to statistical hypothesis testing Null and alternate hypothesis. Types of error. Two-sample hypotheses. Correlation. Analysis of frequency data. Model simplification Lecturer: Lorenzo Marini, PhD Department of Environmental Agronomy and Crop Production, University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova. E-mail: [email protected] Tel.: +39 0498272807 http://www.biodiversity-lorenzomarini.eu/ Inference A statistical hypothesis test is a method of making statistical decisions from and about experimental data. Null-hypothesis testing just answers the question of "how well do the findings fit the possibility that chance factors alone might be responsible?”. sampling Sample Estimation Population (Uncertainty!!!) testing Statistical Model Key concepts: Session 1 Statistical testing in five steps: 1. Construct a null hypothesis (H0) 2. Choose a statistical analysis (assumptions!!!) 3. Collect the data (sampling) 4. Calculate P-value and test statistic 5. Reject/accept (H0) if P is small/large Remember the order!!! Concept of replication vs. pseudoreplication 1. Spatial dependence (e.g. spatial autocorrelation) 2. Temporal dependence (e.g. repeated measures) 3. Biological dependence (e.g. siblings) n=6 yi Key quantities mean var yi n deviance SS ( yi mean) ( yi mean) (n 1) 2 y residual mean 2 x Hypothesis testing • 1 – Hypothesis formulation (Null hypothesis H0 vs. alternative hypothesis H1) • 2 – Compute the probability P that a particular result we observed (e.g. difference in means) could have occurred by chance, if the null hypothesis were true; • 3 – In short P is a measure of the credibility of the null hypothesis • 3 – If this probability P is lower than a defined threshold (e.g. <0.05) we can reject the null hypothesis Hypothesis testing: Types of error Wrong conclusions: Type 1 and 2 errors Actual situation Reject H0 Accept H0 Effect Correct Effect detected Type 2 error () Effect not detected No effect Type 1 error () Effect detected, none exists Correct, No effect detected, None exists As power increases, the chances of a Type II error decreases Statistical power depends on: the statistical significance criterion used in the test the size of the difference or the strength of the similarity (effect size) in the population the sensitivity of the data. Statistical analyses Mean comparisons for 2 populations Test the difference between the means drawn by two samples Correlation In probability theory and statistics, correlation, (often measured as a correlation coefficient), indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation refers to the departure of two variables from independence. Analysis of count or proportion data Whole number or integer numbers (not continuous, different distributional properties) or proportion Mean comparisons for 2 samples The t test H0: means do not differ H1: means differ Assumptions • • • • Independence of cases (work with true replications!!!) - this is a requirement of the design. Normality - the distributions in each of the groups are normal Homogeneity of variances - the variance of data in groups should be the same (use Fisher test or Fligner's test for homogeneity of variances). These together form the common assumption that the errors are independently, identically, and normally distributed Normality Before we can carry out a test assuming normality of the data we need to test our distribution (not always before!!!) Normal qqplot RESIDUALS MUST BE NORMAL 0 5 10 2.5 1.5 0.5 50 30 0 10 Frequency In many cases we must check this assumption after having fitted the model (e.g. regression or multifactorial ANOVA) Observed quantiles Graphics analysis -2 15 0 1 2 norm quantiles mass hist(y) lines(density(y)) -1 library(car) qq.plot(y) or qqnorm(y) Test for normality Shapiro-Wilk Normality Test Skew + kurtosis (t test) shapiro.test() Normality: Histogram and Q-Q Plot Histogram of log(fishes$mas) 30 20 0 10 Frequency 30 0 10 Frequency 50 40 Histogram of fishes$mas 0 5 10 15 -0.5 0.5 1.0 1.5 2.0 2.5 log(fishes$mas) 2.0 0.0 1.0 log(fishes$mass) 10 5 fishes$mass 15 fishes$mas -3 -2 -1 0 1 norm quantiles 2 3 -3 -2 -1 0 1 norm quantiles 2 3 Normality: Quantile-Quantile Plot Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. The quantiles are the data values marking the boundaries between consecutive subsets 400 Normality: Histogram 0 100 200 300 Normal distribution must be symmetrical around the mean 2.5 4.5 6.5 8.5 10.5 12.5 library(animation) ani.options(nmax = 2000 + 15 -2, interval = 0.003) freq = quincunx(balls = 2000, col.balls = rainbow(1)) # frequency table barplot(freq, space = 0) Normality In case of non-normality: 2 possible approaches 1. Change the distribution (use GLMs) Advanced statistics Probit (proportion) Box-Cox transformation 30 20 10 Frequency 30 0 Arcsin (percentage) 0 10 Square-root Frequency Logaritmic (skewed data) 50 2. Data transformation 40 E.g. Poisson (count data) E.g. Binomial (proportion) 0 5 10 mass 15 -0.5 0.5 1.5 2.5 fishes$logmass Homogeneity of variance: two samples Before we can carry out a test to compare two sample means, we need to test whether the sample variances are significantly different. The test could not be simpler. It is called Fisher’s F To compare two variances, all you do is divide the larger variance by the smaller variance. E.g. Students on the left vs. Students on the right F<-var(A)/var(B) qf(0.975,nA-1,nB-1) F calculated F critical if the calculated F is larger than the critical value, we reject the null hypothesis Test can be carried out with the var.test() Homogeneity of variance : > two samples It is important to know whether variance differs significantly from sample to sample. Constancy of variance (homoscedasticity) is the most important assumption underlying regression and analysis of variance. For multiple samples you can choose between the Bartlett test and the Fligner–Killeen test. Bartlett.test(response,factor) Fligner.test(response,factor) There are differences between the tests: Fisher and Bartlett are very sensitive to outliers, whereas Fligner–Killeen is not Mean comparison In many cases, a researcher is interesting in gathering information about two populations in order to compare them. As in statistical inference for one population parameter, confidence intervals and tests of significance are useful statistical tools for the difference between two population parameters. Ho: the two means are the same H1: the two means differ - All Assumptions met? Parametric t.test() - t test with independent or paired sample -Some assumptions not met? Non-parametric Wilcox.test() - The Wilcoxon signed-rank test is a non-parametric alternative to the Student's t-test for the case of two samples. Mean comparison: 2 independent samples Two independent samples Students on the left Students on the right The two samples are statistically independent meana y i meanb n SEdiff t y i n vara varb na nb meana meanb SEdiff Test can be carried out with the t.test() function Mean comparison: t test for paired samples Paired sampling in time or in space E.g. Test your performance before or after the course. I measure twice on the same student Time 1 a: 1, 2, 3, 2, 3, 2 ,2 t ( ai bi ) / n Time 2 b: 1, 2, 1, 1, 5, 1, 2 SDdiff / n If we have information about dependence, we have to use this!!! We can deal with dependence Test can be carried out with the t.test() function Mean comparison: Wilcoxon A 3 4 4 3 2 3 1 3 5 2 Rank procedure U n1n2 n1 (n1 1) R1 2 n1 and n2 are the number of observations R1 is the sum of the ranks in the sample 1 Test can be carried out with the wilcox.test() function B 5 5 6 7 4 4 3 5 6 5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ozone 1 2 2 3 3 3 3 3 4 4 4 4 5 5 5 5 5 6 6 7 -NB Tied ranks correction label ranks A 1 A 2.5 A 2.5 A 6 A 6 A 6 A 6 B 6 A 10.5 A 10.5 B 10.5 B 10.5 A 15 B 15 B 15 B 15 B 15 B 18.5 B 18.5 B 20 Correlation Correlation, (often measured as a correlation coefficient), indicates the strength and direction of a linear relationship between two random variables Bird species Plant species richness richness 1 x1 l1 2 x2 l2 3 x3 l3 4 x4 l4 … … … 458 x458 l458 Sampling unit Three alternative approaches 1. Parametric - cor() 2. Nonparametric - cor() 3. Bootstrapping - replicate(), boot() Correlation: causal relationship? Which is the response variable in a correlation analysis? Bird species Plant species richness richness 1 2 3 4 … 458 x1 x2 x3 x4 … x458 l1 l2 l3 l4 … l458 Sampling unit Correlation Plot the two variables in a Cartesian space A correlation of +1 means that there is a perfect positive LINEAR relationship between variables. A correlation of -1 means that there is a perfect negative LINEAR relationship between variables. A correlation of 0 means there is no LINEAR relationship between the two variables. Correlation Same correlation coefficient! r= 0.816 Parametric correlation: when is significant? Pearson product-moment correlation coefficient Correlation coefficient: cor ( xy) x y 2 2 Hypothesis testing using the t distribution: Ho: Is cor = 0 H1: Is cor ≠ 0 cor t SEcor t critic value for d.f. = n-2 Assumptions -Two random variables from a random populations - cor() detects ONLY linear relationships SEcor (1 cor 2 ) n2 Nonparametric correlation Rank procedures Distribution-free but less power Spearman correlation index cor.spearman (rank rank ) rank rank x y 2 x 2 y The Kendall tau rank correlation coefficient cor.kendall 4P 1 n(n 1) P is the number of concordant pairs n is the total number of pairs Scale-dependent correlation NB Don’t use grouped data to compute overall correlation!!! 7 sites Issues related to correlation 1. Temporal autocorrelation Values in close years are more similar Dependence of the data 2. Spatial autocorrelation Values in close sites are more similar Dependence of the data Moran's I or Geary’s C Measures of global spatial autocorrelation Moran's I = 0 Moran's I = 1 Three issues related to correlation 2. Temporal autocorrelation Values in close years are more similar Dependence of the data Working with time series is likely to have temporal pattern in the data E.g. Population dynamics Autoregressive models (not covered!) Three issues related to correlation 3. Spatial autocorrelation Values in close sites are more similar Dependence of the data ISSUE: can we explain the spatial autocorrelation with our models? Moran's I or Geary’s C (univariate response) Measures of global spatial autocorrelation Raw response Residuals after model fitting Hint: If you find spatial autocorrelation in your residuals, you should start worrying Frequency data Properties of frequency data: -Count data -Proportion data Count data: where we count how many times something happened, but we have no way of knowing how often it did not happen (e.g. number of students coming at the first lesson) Proportion data: where we know the number doing a particular thing, but also the number not doing that thing (e.g. ‘mortality’ of the students who attend the first lesson, but not the second) Count data Straightforward general linear methods (assuming constant variance, normal errors) are not appropriate for count data for four main reasons: • The linear model might lead to the prediction of negative counts. • The variance of the response variable is likely to increase with the mean. • The errors will not be normally distributed. • Many zeros are difficult to handle in transformations. - Classical test with contingency tables - Generalized linear models with Poisson distribution and log-link function (extremely powerful and flexible!!!) Count data: contingency tables We can assess the significance of the differences between observed and expected frequencies in a variety of ways: - Pearson’s chi-squared (χ2) - G test - Fisher’s exact test Group 1 Group 2 Row total (R) Trait 1 a b a+b Trait 2 c d c+d b+d a+b+c+d (G) Column total (C) a+c H0: frequencies found in rows are independent from frequencies in columns Count data: contingency tables Pearson’s chi-squared: An example Following a staff training in insemination techniques in cattle, an insemination centre compared three different methods Method I Method II Method III Row total (Ri) Pregnant 275 192 261 728 Non pregnant 78 64 123 265 256 386 993 (G) Column total (Ci) 353 What about the sampling? Where are the potential biases? H0: the pregnant rate is independent from the method. Method I (78%) = method II (73%) = method III (68%)? The test uses expected and observed frequencies The table above presents the observed (Oi) frequencies We need a model to get the expected (Ei) frequencies X Count data: contingency tables - Pearson’s chi-squared (χ2) We need a model to define the expected frequencies (E) in case of perfect independence Ei (R i x Ci ) G df (r - 1) x (c - 1) 2 (O E ) 2 i i Ei O1=275 Critic value E1=(728*353)/993 O2=192 E2=(728*256)/993… Method I Method II Method III Row total (Ri) Pregnant 275 192 261 728 Non pregnant 78 64 123 265 256 386 993 (G) Column total (Ci) 353 Count data: contingency tables - G test 1. We need a model to define the expected frequencies (E) (many possibilities) – E.g. perfect independence (R i x Ci ) Ei G Oi G 2 Oi ln Ei χ2 distribution - Fisher’s exact test fisher.test() If expected values are less than 4 o 5 Proportion data Proportion data have three important properties that affect the way the data should be analyzed: • the data are strictly bounded (0-1); • the variance is non-constant (it depends on the mean); • errors are non-normal. - Classical test with probit or arcsin transformation - Generalized linear models with binomial distribution and logit-link function (extremely powerful and flexible!!!) Proportion data: traditional approach Transform the data! Arcsine transformation The arcsine transformation takes care of the error distribution p' arcsin p p are percentages (0-100%) Probit transformation The probit transformation takes care of the non-linearity p are proportions (0-1) Proportion data: modern analysis An important class of problems involves data on proportions such as: • studies on percentage mortality (LD50), • infection rates of diseases, • proportion responding to clinical treatment (bioassay), • sex ratios, or in general • data on proportional response to an experimental treatment 2 approaches 1. It is often needed to transform both response and explanatory variables or 2. To use Generalized Linear Models (GLM) using different error distributions Statistical modelling MODEL Generally speaking, a statistical model is a function of your explanatory variables to explain the variation in your response variable (y) E.g. Y=a+bx1+cx2+ dx3 Y= response variable (performance of the students) xi= explanatory variables (ability of the teacher, background, age) The object is to determine the values of the parameters (a, b, c and d) in a specific model that lead to the best fit of the model to the data The best model is the model that produces the least unexplained variation (the minimal residual deviance), subject to the constraint that all the parameters in the model should be statistically significant (many ways to reach this!) deviance SS ( yi mean) 2 Statistical modelling Getting started with complex statistical modeling It is essential, that you can answer the following questions: • Which of your variables is the response variable? • Which are the explanatory variables? • Are the explanatory variables continuous or categorical, or a mixture of both? • What kind of response variable do you have: is it a continuous measurement, a count, a proportion, a time at death, or a category? Statistical modelling: multicollinearity 1. Multicollinearity Correlation between predictors in a non-orthogonal multiple linear models Confounding effects difficult to separate Variables are not independent This makes an important difference to our statistical modelling because, in orthogonal designs, the variation that is attributed to a given factor is constant, and does not depend upon the order in which factors are removed from the model. In contrast, with non-orthogonal data, we find that the variation attributable to a given factor does depend upon the order in which factors are removed from the model The order of variable selection makes a huge difference (please wait for session 4!!!) Statistical modelling: multicollinearity 1. Multicollinearity E.g. Y=a+bx1+cx2+ dx3 Y= response variable (performance of the students) xi= explanatory variables (ability of the teacher, background, age) Do you see potential collinearity here? Collinearity is a major problem in situation where we cannot control all the factors. Typical of multiple regression Statistical modelling Getting started with complex statistical modeling The explanatory variables (a) All explanatory variables continuous - Regression (b) All explanatory variables categorical - Analysis of variance (ANOVA) (c) Explanatory variables both continuous and categorical Analysis of covariance (ANCOVA) The response variable (a) Continuous - Normal regression, ANOVA or ANCOVA (b) Proportion - Logistic regression, GLM logit-linear models (c) Count - GLM Log-linear models (d) Binary - GLM binary logistic analysis (e) Time at death - Survival analysis Statistical modelling Each analysis estimate a MODEL You want the model to be minimal (parsimony), and adequate (must describe a significant fraction of the variation in the data) It is very important to understand that there is not just one model. • given the data, • and given our choice of model, • what values of the parameters of that model make the observed data most likely? Model building: estimate of parameters (slopes and level of factors) Occam’s Razor Statistical modelling Occam’s Razor • Models should have as few parameters as possible; • linear models should be preferred to non-linear models; • experiments relying on few assumptions should be preferred to those relying on many; • models should be pared down until they are minimal adequate; • simple explanations should be preferred to complex explanations. MODEL SIMPLIFICATION The process of model simplification is an integral part of hypothesis testing in R. In general, a variable is retained in the model only if it causes a significant increase in deviance when it is removed from the current model. Statistical modelling: model simplification Parsimony requires that the model should be as simple as possible. This means that the model should not contain any redundant parameters or factor levels. Model simplification • remove non-significant interaction terms; • remove non-significant quadratic or other non-linear terms; • remove non-significant explanatory variables; • group together factor levels that do not differ from one another; • in ANCOVA, set non-significant slopes of continuous explanatory variables to zero.