TAU R Workshop 2015
Transcript
Basic statistical inference in R
Shai Meiri
Everything differs!!!
"We expect to find differences between x and y" is a trivial statement.
The statistician within you asks: "Are the differences we found larger than expected by chance?"
The biologist within you asks: "Why are the differences I found in the direction, and of the magnitude, that they are?"
Moments of central tendency
1. Mean
Arithmetic mean: Σxi/n
Geometric mean: (x1*x2*…*xn)^(1/n)
Harmonic mean: n/Σ(1/xi)
Moments of central tendency in R
1. Arithmetic mean: Σxi/n
Use the function mean():
data<-c(2,3,4,5,6,7,8)
mean(data)
[1] 5
2. Geometric mean: (x1*x2*…*xn)^(1/n)
Example:
data<-c(2,3,4,5,6,7,8)
exp(mean(log(data)))
[1] 4.549163
You can also use the .csv file:
dat<-read.csv("island_type_final2.csv")
attach(dat)
mean(lat)
[1] 17.40439
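The harmonic mean has no built-in function in base R, but it follows directly from its formula, n/Σ(1/xi) = 1/mean(1/x). A minimal sketch:

```r
# harmonic mean: n / sum(1/x), i.e. the reciprocal of the mean of the reciprocals
data <- c(2,3,4,5,6,7,8)
hmean <- 1/mean(1/data)
hmean
# [1] 4.074844
```

As expected, it is smaller than the geometric mean (4.549163), which is smaller than the arithmetic mean (5).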
Moments of central tendency
1. A. mean
B. Median
C. Mode
General example:
data<-c(2,3,4,5,6,7,8)
median(data)
[1] 5
Example from the .csv:
median(mass)
[1] 0.69
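Base R has no function for the statistical mode (R's mode() returns the storage type of an object), so a small helper is needed; stat_mode below is our own name for the sketch, not a base R function:

```r
# most frequent value in a vector; ties are broken by first appearance
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(c(2,3,3,4,5,5,5,8))
# [1] 5
```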
Moments of central tendency
http://www.statmethods.net/management/functions.html
1. Mean
2. Variance = Σ(xi-μ)²/n
Is the mean a good measure of what is happening in the population when the variance is low?
Example:
data<-c(2,3,4,5,6,7,8)
var(data)
[1] 4.666667
var(lat)
[1] 89.20388
(Note that R's var() computes the sample variance, dividing by n-1 rather than by n.)
Moments of central tendency
1. Mean
2. Variance
The second moment of central tendency measures how much the data are scattered around the first moment (the mean).
Examples of second-moment measures are the variance, the standard deviation, the standard error, the coefficient of variation, and the 90%, 95% or 99% confidence interval of something.
Moments of central tendency
# for:
data<-c(2,3,4,5,6,7,8)
Sample size:
length(data)
Variance:
var(data)
Standard deviation:
sd(data)
Standard error:
se<-(sd(data)/length(data)^0.5)
se
[1] 0.8164966
Coefficient of variation:
CV<-sd(data)/mean(data)
CV
[1] 0.4320494
Moments of central tendency
1.Mean
2.Variance
3.Skew
A skewed frequency distribution is not symmetric.
Do you think the arithmetic mean is a good measure of central tendency for a skewed frequency distribution?
What is the mean salary of the students here together with Bill Gates?
Moments of central tendency
Skew
skew<-function(data){
m3<-sum((data-mean(data))^3)/length(data)
s3<-sqrt(var(data))^3
m3/s3}
skew(data)
The SE of
skewness:
sdskew<-function(x) sqrt(6/length(x))
Moments of central tendency
1.Mean
2.Variance
3.Skew
4.Kurtosis
Moments of central tendency
Kurtosis
kurtosis<-function(x){
m4<-sum((x-mean(x))^4)/length(x)
s4<-var(x)^2
m4/s4-3 }
kurtosis(x)
SE of kurtosis:
sdkurtosis<-function(x) sqrt(24/length(x))
A normal distribution can take any value of mean and variance, but its skewness and kurtosis equal zero.
Estimates of skew and kurtosis have their own variance; zero should fall outside their confidence interval for them to be significantly different from zero.
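That rule can be checked directly with the skew() and sdskew() helpers from the earlier slides (repeated here so the snippet is self-contained), using a rough 95% interval of estimate ± 2 SE:

```r
skew <- function(data){
  m3 <- sum((data - mean(data))^3)/length(data)
  s3 <- sqrt(var(data))^3
  m3/s3
}
sdskew <- function(x) sqrt(6/length(x))

x <- c(2,3,4,5,6,7,8)   # a perfectly symmetric vector, so its skew is exactly 0
s  <- skew(x)
ci <- c(s - 2*sdskew(x), s + 2*sdskew(x))  # rough 95% confidence interval
zero_inside <- ci[1] < 0 && 0 < ci[2]
zero_inside   # TRUE: the skew is not significantly different from zero
```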
Residuals
When doing statistics we are creating models of reality.
One of the simplest models is the mean:
The mean height of Israeli citizens is 173 cm
The mean salary is ₪9,271 (correct for April 2014)
The mean IDF service is 24 months (I guess)
(photo captions: Rabbi Dov Lior; 2.06 m; served in the IDF for 1 month; ₪46,699 a month (excluding the bottles))
http://www.haaretz.co.il/1.2057452
Residuals
When doing statistics we are creating models of reality.
We can see here that our models (24 months, ₪9,271, 173 cm) are not very successful.
The residual is how far a given value is from the model's prediction.
Omri Caspi is 32 cm away from the model "Israeli = 173 cm", and 29 cm from the more complicated model "Israeli man = 177 cm, Israeli woman = 168 cm".
Residual = ₪37,428
Residual = -23 months of IDF service
Residual = 33 cm
Residuals
When doing statistics we’re creating models of the
reality
dat<-read.csv("island_type_final2.csv")
model<-lm(mass~iso+area+age+lat, data=dat)
out<-model$residuals
out
write.table(out, file =
"residuals.txt",sep="\t",col.names=F,row.names=F)
#note that residual values are in the order entered (i.e., not alphabetic, not
by residual size – first in, first out)
Theoretical statistics and statistical
inference
When we have data it is best to describe them first: plot graphs, calculate the mean, and so on.
In statistical inference we test the behavior of our data against a certain hypothesis.
We can present our hypothesis as a statistical model. For example:
• The distribution of heights is normal
• The number of species increases with area
• The number of species increases with area as a power function with exponent 0.25
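The last hypothesis can be written as a linear model, since S = cA^z implies log S = log c + z·log A. A sketch on invented species-area numbers (the data are made up for illustration, constructed so the true exponent is about 0.25):

```r
# hypothetical species-area data, invented for illustration
area    <- c(1, 10, 100, 1000, 10000)
species <- c(5, 9, 16, 28, 50)

m <- lm(log(species) ~ log(area))
coef(m)[["log(area)"]]   # estimated exponent z; the hypothesis says z = 0.25
```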
Frequency distribution*
How many observations are in each bin?
dat<-read.csv("island_type_final2.csv")
attach(dat)
names(dat)
hist(mass)
*graphic form = “histogram”
Describes the distribution
of all observations
Frequency distribution
What did we learn?
dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass)
• There are no masses smaller than one tenth of a gram or larger than 100 kg
• Lizards with masses between 1 and 10 g are very common; larger or smaller lizards are rare
• The distribution is unimodal and skewed to the right
Frequency distribution
Histograms don’t have to be so ugly
dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass, col="purple",breaks=25,xlab="log mass (g)",main="masses of
island lizards - great data by Maria",cex.axis=1.2,cex.lab=1.5)
Presenting a categorical predictor with a
continuous response variable
dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(type,brood)
Always prefer boxplot to barplot
Presenting a continuous variable against
another continuous variable
dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(mass,clutch)
plot(mass,clutch,pch=16, col="blue")
Which test should we choose?
It depends on the nature of our response variable (= the y variable), and mostly on the nature of our predictor variables:
• If the response variable is "success or failure" and the null hypothesis is that both are equally likely, we use a binomial test
• If the response variable is counts, we usually use a chi-square or G test
• In many cases our response variable is continuous (14 species, 78 individuals, 54 heartbeats per second, 7.3 eggs, 23 degrees)
Which test should we choose?
What is your response variable?
• Continuous (14 species, 78 individuals, 23 degrees, 7.3 eggs): soon…
• Counts (frequency: 6 females, 4 males): chi-square or G (= log-likelihood)
• Success or failure (found the cheese / idiot): binomial
Binomial test in R
You need to give the number of successes and the total sample size.
For example: 19 out of 34 is not significant; 19 out of 20 is significant.
binom.test(19,34)
Exact binomial test
data: 19 and 34
number of successes = 19, number of trials = 34, p-value = 0.6076
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.3788576 0.7281498
sample estimates: probability of success 0.5588235
binom.test(19,20)
Exact binomial test
data: 19 and 20
number of successes = 19, number of trials = 20, p-value = 4.005e-05
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.7512672 0.9987349
sample estimates: probability of success 0.95
Chi-square test in R
chisq.test
Data: lizard insularity & diet:

habitat   diet       species#
island    carnivore  488
island    herbivore  43
island    omnivore   177
mainland  carnivore  1901
mainland  herbivore  101
mainland  omnivore   269

M<-as.table(rbind(c(1901,101,269),c(488,43,177)))
chisq.test(M)
data: M
X-squared = 80.04, df = 2, p-value < 2.2e-16
Chi-square test in R
Now let's use our dataset:
chisq.test
dat<-read.csv("island_type_final2.csv")
install.packages("reshape")
library(reshape)
cast(dat, type ~ what, length)

type         anoles  else  gecko
Continental  7       45    45
Land_bridge  1       30    14
Oceanic      23      110   44

M<-as.table(rbind(c(7,45,45),c(1,30,14),c(23,110,44)))
chisq.test(M)
data: M
X-squared = 17.568, df = 4, p-value = 0.0015
Which test should we choose?
If our response variable is continuous, we choose our test based on the predictor variables:
• If our predictor variable is categorical (Area 1, Area 2, Area 3, or species A, species B, species C) we use ANOVA
• If our predictor variable is continuous (temperature, body mass, height) we use REGRESSION
t-test in R
t.test(x,y)
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males)

(data preview)
Sex     size
female  79.7
male    85
male    120
female  133.0
male    118
female  126.0
female  105.8
male    112
male    106
female  121.0
male    95
female  111.0
male    86
female  93.0
male    65
female  75.0
male    230
female  240.0

Welch Two Sample t-test
data: females and males
t = -2.1541, df = 6866.57, p-value = 0.03127
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -7.5095545 -0.3536548
sample estimates: mean of x 88.17030, mean of y 92.10191
t-test in R (2)
lm(y~x)
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
model<-lm(size~Sex,data=dimorphism)
summary(model)

             Estimate  standard error  t      p value
(Intercept)  88.17     1.291           68.32  <2e-16 ***
Sexmale      3.932     1.825           2.154  0.031 *

(the slide shows the same Sex/size data preview as before, with a Species column: Xenagama_zonura, Xenosaurus_grandis, Xenosaurus_newmanorum, Xenosaurus_penai, Xenosaurus_platyceps, Xenosaurus_rectocollaris, Zonosaurus_anelanelany, Zootoca_vivipara, Zygaspis_nigra, Zygaspis_quadrifrons; one female and one male row per species)
Paired t-test in R
t.test(x,y,paired=TRUE)
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males, paired=TRUE)

(the slide shows the same Sex/size data preview as before)

Paired t-test
data: females and males
t = -10.192, df = 3503, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -4.688 -3.175
sample estimates: mean of the differences -3.931
tapply(size,Sex,mean)
female  male
88.17   92.10
ANOVA in R
aov
model<-aov(x~y)
island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"
model<-aov(clutch~type,data=island)
summary(model)

           Df   Sum sq  Mean sq  F value  Pr(>F)
type       2    0.466   0.23296  2.784    0.0635 .
Residuals  289  24.184  0.08368

(data preview)
species                    type         clutch
Trachylepis_sechellensis   Continental  0.6
Trachylepis_wrightii       Continental  0.65
Tropidoscincus_boreus      Continental  0.4
Tropidoscincus_variabilis  Continental  0.45
Urocotyledon_inexpectata   Continental  0.3
Varanus_beccarii           Continental  0.58
Algyroides_fitzingeri      Land_bridge  0.4
Anolis_wattsi              Land_bridge  0
Archaeolacerta_bedriagae   Land_bridge  0.65
Cnemaspis_affinis          Land_bridge  0.3
Cnemaspis_limi             Land_bridge  0.18
Cnemaspis_monachorum       Land_bridge  0
Amblyrhynchus_cristatus    Oceanic      0.35
Ameiva_erythrocephala      Oceanic      0.6
Ameiva_fuscata             Oceanic      0.6
Ameiva_plei                Oceanic      0.41
Anolis_acutus              Oceanic      0
Anolis_aeneus              Oceanic      0
Anolis_agassizi            Oceanic      0
Anolis_bimaculatus         Oceanic      0.18
Anolis_bonairensis         Oceanic      0
Post-hoc test for ANOVA in R
TukeyHSD(model)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = clutch ~ type, data = island)
$type
                         diff     lwr      upr     p adj
Land_bridge-Continental  0.124    -0.0025  0.2505  0.0561
Oceanic-Continental      0.0218   -0.0671  0.1108  0.8318
Oceanic-Land_bridge      -0.102   -0.2206  0.0163  0.1066

The differences are not significant; notice that zero is always inside the confidence interval. The difference between land-bridge and continental islands is very close to significance (p = 0.056).
Correlation in R
cor.test(x,y)
island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"
attach(island)
cor.test(mass,lat)
Pearson's product-moment correlation
data: mass and lat
t = -1.138, df = 317, p-value = 0.256
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.17239 0.04635
sample estimates: cor -0.06378
"cor" is the correlation coefficient r

(data preview)
lat  mass
5    1.21
5    0.83
4    1.84
18   1.39
18   0.42
18   0.29
20   0.45
18   1.54
18   0.36
18   0.27
18   0.04
18   0.01
5    1.21
21   0.95
21   0.51
21   0.29
22   0.74
21   0.92
Regression in R
Same data as in the previous example.
lm (= "linear model"): lm(y~x)
model<-lm(mass~lat,data=island)
summary(model)
Call: lm(formula = mass ~ lat, data = island)
Residuals:
   Min     1Q      Median  3Q     Max
   -4.708  -1.774  0.470   1.465  3.725
Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  0.958034   0.096444    9.934    <2e-16 ***
lat          -0.00554   0.004872    -1.138   0.256
Residual standard error: 0.8206 on 317 degrees of freedom
Multiple R-squared: 0.004069, Adjusted R-squared: 0.0009268
F-statistic: 1.295 on 1 and 317 DF, p-value: 0.256
lm vs. aov
We can also use 'lm' on data that fits an ANOVA.
In this case 'summary' will give everything it gives for a regression 'lm': parameter estimates, SEs, differences between factor levels, and p-values for contrasts between category pairs of our predictor variable.
island<-read.csv("island_type_final2.csv",header=T)
model<-aov(clutch~type,data=island)
model2<-lm(clutch~type,data=island)
summary(model)
summary(model2)

aov results:
           Df   Sum sq  Mean sq  F value  Pr(>F)
type       2    0.466   0.23296  2.784    0.0635 .
Residuals  289  24.184  0.08368

lm results:
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      0.33149   0.02984     11.11    <2e-16   ***
typeLand_bridge  0.12399   0.05369     2.309    0.0216   *
typeOceanic      0.02184   0.03777     0.578    0.5635
Residual standard error: 0.2893 on 289 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.0189, Adjusted R-squared: 0.01211
F-statistic: 2.784 on 2 and 289 DF, p-value: 0.06346
More later on
Assumptions of statistical tests (all statistical tests)
(photo caption: a non-random, non-independent sample of Israeli people)
1. Random sampling (an assumption of all tests, not only parametric ones)
2. Independence (spatial, phylogenetic, etc.)
Assumptions of parametric tests. A. ANOVA
In addition to the assumptions of all tests:
1. Homoscedasticity
2. Normal distribution of the residuals
"Comments on earlier drafts of this manuscript made it clear that for many readers who analyze data but who are not particularly interested in statistical questions, any discussion of statistical methods becomes uncomfortable when the term ''error variance'' is introduced."
Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.
(photo caption: Richard Smith & 3 friends)
Reading material: Sokal & Rohlf 1995. Biometry. 3rd edition. Pages 392-409 (especially 406-407 for normality)
Always look at your data
Don't just rely on the statistics!
Anscombe's quartet: the summary statistics are the same for all four data sets:
• n = 11
• means of x & y (9, 7.5)
• variances of x & y (11, ~4.13)
• regression & residual sums of squares
• correlation r = 0.816
• regression line (y = 3 + 0.5x)
Anscombe 1973. Graphs in statistical analysis. The American Statistician 27: 17–21.
http://en.wikipedia.org/wiki/Anscombe%27s_quartet
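R ships Anscombe's data as the built-in data frame anscombe (columns x1–x4 and y1–y4), so the claim is easy to verify yourself:

```r
data(anscombe)
sapply(anscombe[, 1:4], mean)       # the four x columns: every mean is 9
sapply(anscombe[, 5:8], mean)       # the four y columns: every mean is ~7.5
coef(lm(y1 ~ x1, data = anscombe))  # ~ 3 + 0.5x
coef(lm(y2 ~ x2, data = anscombe))  # (almost) the same regression line
# but plot(anscombe$x1, anscombe$y1), plot(anscombe$x2, anscombe$y2), etc.
# look completely different
```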
Assumptions of parametric tests. B. Regression
1. Homoscedasticity
2. The explanatory variable was sampled without error
3. Normal distribution of the residuals of each response variable
4. Equality of variance across the values of the explanatory variables
5. A linear relationship between the response and the predictor
Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.
How will we test whether our model follows the assumptions?
R has very useful model diagnostic functions that let us evaluate graphically how well our model follows the assumptions (especially in regression).
https://www.youtube.com/watch?v=eTZ4VUZHzxw
See also: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/plot.lm.html
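For example, calling plot() on a fitted lm object draws the four standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage). A sketch on simulated data (simulated only so the snippet runs without the workshop's .csv; with the island data you would plot lm(mass~lat,data=island) the same way):

```r
set.seed(1)
x <- runif(100)
y <- 2 + 3*x + rnorm(100)   # a model that actually meets the assumptions
model <- lm(y ~ x)

par(mfrow = c(2, 2))   # 2x2 grid: all four diagnostic plots on one page
plot(model)
par(mfrow = c(1, 1))
```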
What can we do when our data don't follow the assumptions?
1. We can ignore it and hope that our test is robust enough to violations of the assumptions: this is not as unreasonable as it sounds
2. Use non-parametric tests
3. Use generalized linear models (glm), which means:
• Transformation (in glm this means changing the link function)
• Changing the error distribution in glm (to a non-normal distribution)
4. Use non-linear tests
5. Use randomization (more about it in Roi's lessons)
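Option 3 can look like this: for count data, a Poisson glm with a log link replaces log-transforming the response. A sketch on simulated counts (the data are invented; with real data you would supply your own response and predictor):

```r
set.seed(42)
x <- runif(200, 0, 2)
counts <- rpois(200, lambda = exp(0.5 + 1.0*x))  # true relationship is log-linear

# Poisson error distribution, log link function
m <- glm(counts ~ x, family = poisson(link = "log"))
summary(m)$coefficients   # slope estimate should be close to 1 on the log scale
```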
Non-parametric tests
I think it is really wrong to have a presentation without any animal pictures in it.
Non-parametric tests do not assume equality of variance or normal distribution. They are based on ranks.
Disadvantages:
• There are no tests for models with multiple predictors
• Their statistical power is often much lower than that of the equivalent parametric test
• They do not give you parameter estimates (slopes, intercepts)
A few useful non-parametric tests
(photo caption: Orycteropus afer. The photographed is not related to the lectures.)
• The chi-square test is a non-parametric test
• Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare "our" distribution to a known distribution, for example a normal distribution)
• Mann-Whitney U (= Wilcoxon rank sum) is the non-parametric equivalent of Student's t-test
• The Wilcoxon two-sample (= Wilcoxon signed-rank) test replaces the paired t-test
• Kruskal-Wallis replaces one-way ANOVA
• The Spearman and Kendall's tau tests replace correlation tests
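The base R calls for the tests listed above, sketched on two small made-up vectors:

```r
a <- c(2,3,4,5,6,7,8)
b <- c(3,5,6,8,9,11,13)

wilcox.test(a, b)                    # Mann-Whitney U / Wilcoxon rank sum
wilcox.test(a, b, paired = TRUE)     # Wilcoxon signed-rank (paired)
kruskal.test(list(a, b))             # Kruskal-Wallis
cor.test(a, b, method = "spearman")  # Spearman rank correlation
cor.test(a, b, method = "kendall")   # Kendall's tau
```

Since both vectors increase monotonically, the Spearman and Kendall coefficients here are exactly 1.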
Non-parametric tests in R
Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare "our" distribution to a known distribution, for example a normal distribution).
We need to define in R the grouping variable and the response: let's say we want to compare the frequency distributions of lizard body mass on oceanic and land-bridge islands.
island<-read.csv("island_type_final2.csv",header=T)
attach(island)
levels(type)
[1] "Continental" "Land_bridge" "Oceanic"
Land_bridge<-mass[type=="Land_bridge"]
Oceanic<-mass[type=="Oceanic"]
ks.test(Land_bridge, Oceanic)
Two-sample Kolmogorov-Smirnov test
data: Land_bridge and Oceanic
D = 0.1955, p-value = 0.1288
alternative hypothesis: two-sided