TAU R Workshop 2015
Transcript
Basic statistical inference in R
Shai Meiri
Everything differs!!!
"We expect to find differences between x and y" is a trivial statement.
The statistician within you asks: "Are the differences we found larger than expected by chance?"
The biologist within you asks: "Why are the differences I found in the direction, and of the magnitude, that they are?"
Moments of central tendency
1. Mean
Arithmetic mean: Σxi/n
Geometric mean: (x1*x2*…*xn)^(1/n)
Harmonic mean: n/Σ(1/xi)
Moments of central tendency in R
1. Arithmetic mean: Σxi/n
Use the function mean():
data<-c(2,3,4,5,6,7,8)
mean(data)
[1] 5
2. Geometric mean: (x1*x2*…*xn)^(1/n)
Example:
data<-c(2,3,4,5,6,7,8)
exp(mean(log(data)))
[1] 4.549163
You can also use the .csv file:
dat<-read.csv("island_type_final2.csv")
attach(dat)
mean(lat)
[1] 17.40439
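The harmonic mean has no built-in function in base R, but it follows directly from its formula, n/Σ(1/xi) = 1/mean(1/x). A minimal sketch:

```r
# harmonic mean: n / sum(1/x), i.e. the reciprocal of the mean of the reciprocals
data <- c(2,3,4,5,6,7,8)
hmean <- 1/mean(1/data)
hmean
# [1] 4.074844
```

As expected, it is smaller than the geometric mean (4.549163), which is smaller than the arithmetic mean (5).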
Moments of central tendency
1. A. mean
B. Median
C. Mode
General example:
data<-c(2,3,4,5,6,7,8)
median(data)
[1] 5
Example from the .csv:
median(mass)
[1] 0.69
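Base R has no function for the statistical mode (R's mode() returns the storage type of an object), so a small helper is needed; stat_mode below is our own name for the sketch, not a base R function:

```r
# most frequent value in a vector; ties are broken by first appearance
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(c(2,3,3,4,5,5,5,8))
# [1] 5
```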
Moments of central tendency
http://www.statmethods.net/management/functions.html
1. Mean
2. Variance = Σ(xi-μ)²/n
Is the mean a good measure of what is happening in the population when the variance is low?
Example:
data<-c(2,3,4,5,6,7,8)
var(data)
[1] 4.666667
var(lat)
[1] 89.20388
(Note that R's var() computes the sample variance, dividing by n-1 rather than by n.)
Moments of central tendency
1. Mean
2. Variance
The second moment of central tendency measures how much the data are scattered around the first moment (the mean).
Examples of second-moment measures are the variance, the standard deviation, the standard error, the coefficient of variation, and the 90%, 95% or 99% confidence interval of something.
Moments of central tendency
# for:
data<-c(2,3,4,5,6,7,8)
Sample size:
length(data)
Variance:
var(data)
Standard deviation:
sd(data)
Standard error:
se<-(sd(data)/length(data)^0.5)
se
[1] 0.8164966
Coefficient of variation:
CV<-sd(data)/mean(data)
CV
[1] 0.4320494
Moments of central tendency
1.Mean
2.Variance
3.Skew
A skewed frequency distribution is not symmetric.
Do you think the arithmetic mean is a good measure of central tendency for a skewed frequency distribution?
What is the mean salary of the students here together with Bill Gates?
Moments of central tendency
Skew
skew<-function(data){
m3<-sum((data-mean(data))^3)/length(data)
s3<-sqrt(var(data))^3
m3/s3}
skew(data)
The SE of
skewness:
sdskew<-function(x) sqrt(6/length(x))
Moments of central tendency
1.Mean
2.Variance
3.Skew
4.Kurtosis
Moments of central tendency
Kurtosis
kurtosis<-function(x){
m4<-sum((x-mean(x))^4)/length(x)
s4<-var(x)^2
m4/s4-3 }
kurtosis(x)
SE of kurtosis:
sdkurtosis<-function(x) sqrt(24/length(x))
A normal distribution can take any value of mean and variance, but its skewness and kurtosis equal zero.
Estimates of skew and kurtosis have their own variance; zero should fall outside their confidence interval for them to be significantly different from zero.
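That rule can be checked directly with the skew() and sdskew() helpers from the earlier slides (repeated here so the snippet is self-contained), using a rough 95% interval of estimate ± 2 SE:

```r
skew <- function(data){
  m3 <- sum((data - mean(data))^3)/length(data)
  s3 <- sqrt(var(data))^3
  m3/s3
}
sdskew <- function(x) sqrt(6/length(x))

x <- c(2,3,4,5,6,7,8)   # a perfectly symmetric vector, so its skew is exactly 0
s  <- skew(x)
ci <- c(s - 2*sdskew(x), s + 2*sdskew(x))  # rough 95% confidence interval
zero_inside <- ci[1] < 0 && 0 < ci[2]
zero_inside   # TRUE: the skew is not significantly different from zero
```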
Residuals
When doing statistics we are creating models of reality.
One of the simplest models is the mean:
The mean height of Israeli citizens is 173 cm
The mean salary is ₪9,271 (correct for April 2014)
The mean IDF service is 24 months (I guess)
(photo captions: Rabbi Dov Lior; 2.06 m; served in the IDF for 1 month; ₪46,699 a month (excluding the bottles))
http://www.haaretz.co.il/1.2057452
Residuals
When doing statistics we are creating models of reality.
We can see here that our models (24 months, ₪9,271, 173 cm) are not very successful.
The residual is how far a given value is from the model's prediction.
Omri Caspi is 32 cm away from the model "Israeli = 173 cm", and 29 cm from the more complicated model "Israeli man = 177 cm, Israeli woman = 168 cm".
Residual = ₪37,428
Residual = -23 months of IDF service
Residual = 33 cm
Residuals
When doing statistics we’re creating models of the
reality
dat<-read.csv("island_type_final2.csv")
model<-lm(mass~iso+area+age+lat, data=dat)
out<-model$residuals
out
write.table(out, file =
"residuals.txt",sep="\t",col.names=F,row.names=F)
#note that residual values are in the order entered (i.e., not alphabetic, not
by residual size – first in, first out)
Theoretical statistics and statistical
inference
When we have data it is best to describe them first: plot graphs, calculate the mean, and so on.
In statistical inference we test the behavior of our data against a certain hypothesis.
We can present our hypothesis as a statistical model. For example:
• The distribution of heights is normal
• The number of species increases with area
• The number of species increases with area as a power function with exponent 0.25
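The last hypothesis can be written as a linear model, since S = cA^z implies log S = log c + z·log A. A sketch on invented species-area numbers (the data are made up for illustration, constructed so the true exponent is about 0.25):

```r
# hypothetical species-area data, invented for illustration
area    <- c(1, 10, 100, 1000, 10000)
species <- c(5, 9, 16, 28, 50)

m <- lm(log(species) ~ log(area))
coef(m)[["log(area)"]]   # estimated exponent z; the hypothesis says z = 0.25
```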
Frequency distribution*
How many observations are in each bin?
dat<-read.csv("island_type_final2.csv")
attach(dat)
names(dat)
hist(mass)
*graphic form = “histogram”
Describes the distribution
of all observations
Frequency distribution
What did we learn?
dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass)
• There are no masses smaller than one tenth of a gram or larger than 100 kg
• Lizards with masses between 1 and 10 g are very common; larger or smaller lizards are rare
• The distribution is unimodal and skewed to the right
Frequency distribution
Histograms don’t have to be so ugly
dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass, col="purple",breaks=25,xlab="log mass (g)",main="masses of
island lizards - great data by Maria",cex.axis=1.2,cex.lab=1.5)
Presenting a categorical predictor with a
continuous response variable
dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(type,brood)
Always prefer boxplot to barplot
Presenting a continuous variable against
another continuous variable
dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(mass,clutch)
plot(mass,clutch,pch=16, col="blue")
Which test should we choose?
It depends on the nature of our response variable (= the y variable), and mostly on the nature of our predictor variables:
• If the response variable is "success or failure" and the null hypothesis is that both are equally likely, we use a binomial test
• If the response variable is counts, we usually use a chi-square or G test
• In many cases our response variable is continuous (14 species, 78 individuals, 54 heartbeats per second, 7.3 eggs, 23 degrees)
Which test should we choose?
What is your response variable?
• Continuous (14 species, 78 individuals, 23 degrees, 7.3 eggs): soon…
• Counts (frequency: 6 females, 4 males): chi-square or G (= log-likelihood)
• Success or failure (found the cheese / idiot): binomial
Binomial test in R
You need to give the number of successes and the total sample size.
For example: 19 out of 34 is not significant; 19 out of 20 is significant.
binom.test(19,34)
Exact binomial test
data: 19 and 34
number of successes = 19, number of trials = 34, p-value = 0.6076
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.3788576 0.7281498
sample estimates: probability of success 0.5588235
binom.test(19,20)
Exact binomial test
data: 19 and 20
number of successes = 19, number of trials = 20, p-value = 4.005e-05
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.7512672 0.9987349
sample estimates: probability of success 0.95
Chi-square test in R
chisq.test
Data: lizard insularity & diet:

habitat   diet       species#
island    carnivore  488
island    herbivore  43
island    omnivore   177
mainland  carnivore  1901
mainland  herbivore  101
mainland  omnivore   269

M<-as.table(rbind(c(1901,101,269),c(488,43,177)))
chisq.test(M)
data: M
X-squared = 80.04, df = 2, p-value < 2.2e-16
Chi-square test in R
Now let's use our dataset:
chisq.test
dat<-read.csv("island_type_final2.csv")
install.packages("reshape")
library(reshape)
cast(dat, type ~ what, length)

type         anoles  else  gecko
Continental  7       45    45
Land_bridge  1       30    14
Oceanic      23      110   44

M<-as.table(rbind(c(7,45,45),c(1,30,14),c(23,110,44)))
chisq.test(M)
data: M
X-squared = 17.568, df = 4, p-value = 0.0015
Which test should we choose?
If our response variable is continuous, we choose our test based on the predictor variables:
• If our predictor variable is categorical (Area 1, Area 2, Area 3, or species A, species B, species C) we use ANOVA
• If our predictor variable is continuous (temperature, body mass, height) we use REGRESSION
t-test in R
t.test(x,y)
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males)

(data preview)
Sex     size
female  79.7
male    85
male    120
female  133.0
male    118
female  126.0
female  105.8
male    112
male    106
female  121.0
male    95
female  111.0
male    86
female  93.0
male    65
female  75.0
male    230
female  240.0

Welch Two Sample t-test
data: females and males
t = -2.1541, df = 6866.57, p-value = 0.03127
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -7.5095545 -0.3536548
sample estimates: mean of x 88.17030, mean of y 92.10191
t-test in R (2)
lm(y~x)
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
model<-lm(size~Sex,data=dimorphism)
summary(model)

             Estimate  standard error  t      p value
(Intercept)  88.17     1.291           68.32  <2e-16 ***
Sexmale      3.932     1.825           2.154  0.031 *

(the slide shows the same Sex/size data preview as before, with a Species column: Xenagama_zonura, Xenosaurus_grandis, Xenosaurus_newmanorum, Xenosaurus_penai, Xenosaurus_platyceps, Xenosaurus_rectocollaris, Zonosaurus_anelanelany, Zootoca_vivipara, Zygaspis_nigra, Zygaspis_quadrifrons; one female and one male row per species)
Paired t-test in R
t.test(x,y,paired=TRUE)
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males, paired=TRUE)

(the slide shows the same Sex/size data preview as before)

Paired t-test
data: females and males
t = -10.192, df = 3503, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -4.688 -3.175
sample estimates: mean of the differences -3.931
tapply(size,Sex,mean)
female  male
88.17   92.10
ANOVA in R
aov
model<-aov(x~y)
island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"
model<-aov(clutch~type,data=island)
summary(model)

           Df   Sum sq  Mean sq  F value  Pr(>F)
type       2    0.466   0.23296  2.784    0.0635 .
Residuals  289  24.184  0.08368

(data preview)
species                    type         clutch
Trachylepis_sechellensis   Continental  0.6
Trachylepis_wrightii       Continental  0.65
Tropidoscincus_boreus      Continental  0.4
Tropidoscincus_variabilis  Continental  0.45
Urocotyledon_inexpectata   Continental  0.3
Varanus_beccarii           Continental  0.58
Algyroides_fitzingeri      Land_bridge  0.4
Anolis_wattsi              Land_bridge  0
Archaeolacerta_bedriagae   Land_bridge  0.65
Cnemaspis_affinis          Land_bridge  0.3
Cnemaspis_limi             Land_bridge  0.18
Cnemaspis_monachorum       Land_bridge  0
Amblyrhynchus_cristatus    Oceanic      0.35
Ameiva_erythrocephala      Oceanic      0.6
Ameiva_fuscata             Oceanic      0.6
Ameiva_plei                Oceanic      0.41
Anolis_acutus              Oceanic      0
Anolis_aeneus              Oceanic      0
Anolis_agassizi            Oceanic      0
Anolis_bimaculatus         Oceanic      0.18
Anolis_bonairensis         Oceanic      0
Post-hoc test for ANOVA in R
TukeyHSD(model)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = clutch ~ type, data = island)
$type
                         diff     lwr      upr     p adj
Land_bridge-Continental  0.124    -0.0025  0.2505  0.0561
Oceanic-Continental      0.0218   -0.0671  0.1108  0.8318
Oceanic-Land_bridge      -0.102   -0.2206  0.0163  0.1066

The differences are not significant; notice that zero is always inside the confidence interval. The difference between land-bridge and continental islands is very close to significance (p = 0.056).
Correlation in R
cor.test(x,y)
island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"
attach(island)
cor.test(mass,lat)
Pearson's product-moment correlation
data: mass and lat
t = -1.138, df = 317, p-value = 0.256
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.17239 0.04635
sample estimates: cor -0.06378
"cor" is the correlation coefficient r

(data preview)
lat  mass
5    1.21
5    0.83
4    1.84
18   1.39
18   0.42
18   0.29
20   0.45
18   1.54
18   0.36
18   0.27
18   0.04
18   0.01
5    1.21
21   0.95
21   0.51
21   0.29
22   0.74
21   0.92
Regression in R
Same data as in the previous example.
lm (= "linear model"): lm(y~x)
model<-lm(mass~lat,data=island)
summary(model)
Call: lm(formula = mass ~ lat, data = island)
Residuals:
   Min     1Q      Median  3Q     Max
   -4.708  -1.774  0.470   1.465  3.725
Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  0.958034   0.096444    9.934    <2e-16 ***
lat          -0.00554   0.004872    -1.138   0.256
Residual standard error: 0.8206 on 317 degrees of freedom
Multiple R-squared: 0.004069, Adjusted R-squared: 0.0009268
F-statistic: 1.295 on 1 and 317 DF, p-value: 0.256
lm vs. aov
We can also use 'lm' on data that fits an ANOVA.
In this case 'summary' will give everything it gives for a regression 'lm': parameter estimates, SEs, differences between factor levels, and p-values for contrasts between category pairs of our predictor variable.
island<-read.csv("island_type_final2.csv",header=T)
model<-aov(clutch~type,data=island)
model2<-lm(clutch~type,data=island)
summary(model)
summary(model2)

aov results:
           Df   Sum sq  Mean sq  F value  Pr(>F)
type       2    0.466   0.23296  2.784    0.0635 .
Residuals  289  24.184  0.08368

lm results:
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      0.33149   0.02984     11.11    <2e-16   ***
typeLand_bridge  0.12399   0.05369     2.309    0.0216   *
typeOceanic      0.02184   0.03777     0.578    0.5635
Residual standard error: 0.2893 on 289 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.0189, Adjusted R-squared: 0.01211
F-statistic: 2.784 on 2 and 289 DF, p-value: 0.06346
More later on
Assumptions of statistical tests (all statistical tests)
(photo caption: a non-random, non-independent sample of Israeli people)
1. Random sampling (an assumption of all tests, not only parametric ones)
2. Independence (spatial, phylogenetic, etc.)
Assumptions of parametric tests. A. ANOVA
In addition to the assumptions of all tests:
1. Homoscedasticity
2. Normal distribution of the residuals
"Comments on earlier drafts of this manuscript made it clear that for many readers who analyze data but who are not particularly interested in statistical questions, any discussion of statistical methods becomes uncomfortable when the term ''error variance'' is introduced."
Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.
(photo caption: Richard Smith & 3 friends)
Reading material: Sokal & Rohlf 1995. Biometry. 3rd edition. Pages 392-409 (especially 406-407 for normality)
Always look at your data
Don't just rely on the statistics!
Anscombe's quartet: the summary statistics are the same for all four data sets:
• n = 11
• means of x & y (9, 7.5)
• variances of x & y (11, ~4.13)
• regression & residual sums of squares
• correlation r = 0.816
• regression line (y = 3 + 0.5x)
Anscombe 1973. Graphs in statistical analysis. The American Statistician 27: 17–21.
http://en.wikipedia.org/wiki/Anscombe%27s_quartet
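R ships Anscombe's data as the built-in data frame anscombe (columns x1–x4 and y1–y4), so the claim is easy to verify yourself:

```r
data(anscombe)
sapply(anscombe[, 1:4], mean)       # the four x columns: every mean is 9
sapply(anscombe[, 5:8], mean)       # the four y columns: every mean is ~7.5
coef(lm(y1 ~ x1, data = anscombe))  # ~ 3 + 0.5x
coef(lm(y2 ~ x2, data = anscombe))  # (almost) the same regression line
# but plot(anscombe$x1, anscombe$y1), plot(anscombe$x2, anscombe$y2), etc.
# look completely different
```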
Assumptions of parametric tests. B. Regression
1. Homoscedasticity
2. The explanatory variable was sampled without error
3. Normal distribution of the residuals of each response variable
4. Equality of variance across the values of the explanatory variables
5. A linear relationship between the response and the predictor
Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.
How will we test whether our model follows the assumptions?
R has very useful model diagnostic functions that let us evaluate graphically how well our model follows the assumptions (especially in regression).
https://www.youtube.com/watch?v=eTZ4VUZHzxw
See also: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/plot.lm.html
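For example, calling plot() on a fitted lm object draws the four standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage). A sketch on simulated data (simulated only so the snippet runs without the workshop's .csv; with the island data you would plot lm(mass~lat,data=island) the same way):

```r
set.seed(1)
x <- runif(100)
y <- 2 + 3*x + rnorm(100)   # a model that actually meets the assumptions
model <- lm(y ~ x)

par(mfrow = c(2, 2))   # 2x2 grid: all four diagnostic plots on one page
plot(model)
par(mfrow = c(1, 1))
```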
What can we do when our data don't follow the assumptions?
1. We can ignore it and hope that our test is robust enough to violations of the assumptions: this is not as unreasonable as it sounds
2. Use non-parametric tests
3. Use generalized linear models (glm), which means:
• Transformation (in glm this means changing the link function)
• Changing the error distribution in glm (to a non-normal distribution)
4. Use non-linear tests
5. Use randomization (more about it in Roi's lessons)
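Option 3 can look like this: for count data, a Poisson glm with a log link replaces log-transforming the response. A sketch on simulated counts (the data are invented; with real data you would supply your own response and predictor):

```r
set.seed(42)
x <- runif(200, 0, 2)
counts <- rpois(200, lambda = exp(0.5 + 1.0*x))  # true relationship is log-linear

# Poisson error distribution, log link function
m <- glm(counts ~ x, family = poisson(link = "log"))
summary(m)$coefficients   # slope estimate should be close to 1 on the log scale
```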
Non-parametric tests
I think it is really wrong to have a presentation without any animal pictures in it.
Non-parametric tests do not assume equality of variance or normal distribution. They are based on ranks.
Disadvantages:
• There are no tests for models with multiple predictors
• Their statistical power is often much lower than that of the equivalent parametric test
• They do not give you parameter estimates (slopes, intercepts)
A few useful non-parametric tests
(photo caption: Orycteropus afer. The photographed is not related to the lectures.)
• The chi-square test is a non-parametric test
• Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare "our" distribution to a known distribution, for example a normal distribution)
• Mann-Whitney U (= Wilcoxon rank sum) is the non-parametric equivalent of Student's t-test
• The Wilcoxon two-sample (= Wilcoxon signed-rank) test replaces the paired t-test
• Kruskal-Wallis replaces one-way ANOVA
• The Spearman and Kendall's tau tests replace correlation tests
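The base R calls for the tests listed above, sketched on two small made-up vectors:

```r
a <- c(2,3,4,5,6,7,8)
b <- c(3,5,6,8,9,11,13)

wilcox.test(a, b)                    # Mann-Whitney U / Wilcoxon rank sum
wilcox.test(a, b, paired = TRUE)     # Wilcoxon signed-rank (paired)
kruskal.test(list(a, b))             # Kruskal-Wallis
cor.test(a, b, method = "spearman")  # Spearman rank correlation
cor.test(a, b, method = "kendall")   # Kendall's tau
```

Since both vectors increase monotonically, the Spearman and Kendall coefficients here are exactly 1.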
Non-parametric tests in R
Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare "our" distribution to a known distribution, for example a normal distribution).
We need to define in R the grouping variable and the response: let's say we want to compare the frequency distributions of lizard body mass on oceanic and land-bridge islands.
island<-read.csv("island_type_final2.csv",header=T)
attach(island)
levels(type)
[1] "Continental" "Land_bridge" "Oceanic"
Land_bridge<-mass[type=="Land_bridge"]
Oceanic<-mass[type=="Oceanic"]
ks.test(Land_bridge, Oceanic)
Two-sample Kolmogorov-Smirnov test
data: Land_bridge and Oceanic
D = 0.1955, p-value = 0.1288
alternative hypothesis: two-sided