Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Dealing With Statistical
Uncertainty
Richard Mott
Wellcome Trust Centre for Human
Genetics
Synopsis
•
•
•
•
•
•
•
•
Hypothesis testing
P-values
Confidence intervals
T-test
Non-parametric tests
Sample size calculations
Permutations
Bootstrap
Hypothesis testing
Null Hypothesis H0
Examples of H0:
Alternative Hypothesis H1
• Examples of H1:
– Mean of a population is 3.0
– Mean of population > 3.0
– In a genetic association study,
there is no association
between disease state and
the genotypes of a particular
SNP
– There is an association
between disease and
genotype
P-values
• The P-value of a statistic Z is the
probability of sampling a value
more extreme than that
observed value z if the null
hypothesis is true.
• For a one-sided test, the p-value
would be
a = Prob( Z > z | H0)
• For a two-sided test it is
a = Prob( |Z| > z | H0)
• Note this is not the probability
density at x, but the tails of the
cumulative distribution function
Upper and lower 2.5% tails of the
N(0,1) distribution,
corresponding to |z|=1.96
Power
• The power of a statistic Z is the
probability of sampling a value
more extreme than that observed
value z if the alternative
hypothesis is true.
• For a one-sided test, the power
would be
b = Prob( Z > z | H1)
• For a two-sided test it is
b = Prob( |Z| > z | H1)
• Note this is not the probability
density at x, but the tails of the
cumulative distribution function
H0: N(1,0) H1: N(1.5,1) H2: N(4,1)
0.05 = Prob( Z> 1.644|H0) (5% upper tail)
0.44 = Prob( Z> 1.644|H1)
0.99 = Prob( Z> 1.644|H2)
Example: The mean of a Normal
Distribution
H0: mean = m0
vs
H1: mean = m1
data: y1, … yn iid N(m,s2) Assume variance is the same and is known
Therefore we base all inferences on the sample mean compared to the difference in the
hypothesised means
The Normal and T distributions
•
For a sample from a Normal distribution with known mean and variance, the sample mean can be
standardised to follow the standard Normal distribution
•
And the 95% confidence interval for the mean is given by
•
But we often wish to make inferences about samples from a Normal Distribution when the variance
is unknown: The variance must be estimated from the data as the sample variance
•
Because of this uncertainty the distribution of the sample mean is broader, and follows the Tn-1
distribution
The T distribution
• The Tn and Normal
distributions are almost
identical for n>20,
• As a rule of thumb, an
approximate 95%
confidence interval for
n>20 is
T-tests
• The T-test compares the means of samples
• The T-test is optimal (in LR sense) if the data are
sampled from a Normal distribution
• Even if the data are not Normal the test is useful
in large samples
• There are several versions, depending on the
details
T tests for the comparison of sample means
One Sample Test
• One sample y1 ….. yn
– H0: the sample mean = m0
– H1: the sample mean != m0
– Reject H0 if
– tn-1(a/2) is the quantile of the Tn-1 distribution such that the
probability of exceeding tn-1(a/2) is a/2
– Two sided test
– in R, tn-1(0.025) = pt(0.025,n-1)
T tests for the comparison of sample means
Paired T-test
• Two samples of paired data (x1,y1), (x2-y2) …. (xn,yn)
• Example:
– two experimental replicates taken from n individuals, we
wish to test if the replicates are statistically similar
•
•
•
•
H0: mean(x) = mean(y)
Take differences d1= x1 - y1 … dn= xn - yn
H0: mean(d) = 0
One sample T-test with m0=0
T tests for the comparison of sample means
Two-Sample T-test
•
Two samples of unpaired data
x1, x2 ….. xn
y1, y2 ….. ym
•
•
H0: mean(x) = mean(y)
If we assume the variance is the same in each group then we can estimate the variance
as the pooled estimator
•
The test statistic is
(The case with unequal variances is more complicated)
T tests in R
t.test(x, y = NULL, alternative = c("two.sided", "less",
"greater"), mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
•
•
•
•
•
•
•
•
•
•
x a numeric vector of data values.
y an optional numeric vector data values.
alternative a character string specifying the alternative hypothesis, must be one of
"two.sided" (default), "greater" or "less".
mu a number indicating the true value of the mean
paired a logical indicating whether you want a paired t-test.
var.equal a logical variable indicating whether to treat the two variances as being
equal.
conf.level confidence level of the interval.
formula a formula of the form lhs ~ rhs where lhs is a numeric variable giving the
data values and rhs a factor with two levels giving the corresponding groups.
data an optional data frame containing the variables in the model formula.
subset an optional vector specifying a subset of observations to be used.
T tests in R
samp1 <- rnorm( 20, 0, 1) # sample 20 numbers from N(0,1) distribution
samp2 <- rnorm( 20, 1.5, 1) # sample from N(1.5,1)
t.test( samp1, samp2,var.equal=TRUE )
Two Sample t-test
data: samp1 and samp2
t = -3.5892, df = 38, p-value = 0.000935
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.9272037 -0.5372251
sample estimates:
mean of x mean of y
0.4040969 1.6363113
Outliers
A small number of contaminating observations with different distribution
• Outliers can destroy statistical
genuine significance
Reason: the estimate of the
variance is inflated, reducing
the T statistic
• And sometimes create false
significance :
Type III error: "correctly
rejecting the null hypothesis
for the wrong reason".
• There are Robust Alternatives
to the T-test: Wilcoxon tests
> samp3 <- samp2
> samp3[1] = 40 # add one outlier
> t.test( samp1, samp3,var.equal=TRUE )
Two Sample t-test
data: samp1 and samp3
t = -1.5911, df = 38, p-value = 0.1199
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-7.0519594 0.8450837
sample estimates:
mean of x mean of y
0.4040969 3.5075347
> var(samp2)
[1] 1.160251
> var(samp3)
[1] 74.88977
Non-parametric tests:
Wilcoxon signed rank test
(alternative to Paired T-test)
•
•
•
•
•
•
•
•
Two samples of paired data (x1,y1), (x2-y2) …. (xn,yn)
Take differences
d1= x1 - y1 … dn= xn – yn
H0: distribution of d’s is symmetric about 0
Compute the ranks of the absolute differences, rank(|d1|) etc
Compute the signs of the differences
Compute W, the sum of the ranks with positive signs
If H0 true then W should be close to the mean rank n/2
The distribution of W is known and for n>20 a Normal approximation is used.
•
in R
wilcox.test(x, y, alternative = c("two.sided", "less", "greater"), mu = 0,
paired = TRUE, exact = NULL, correct = TRUE, conf.int = FALSE, conf.level
= 0.95, ...)
Wilcoxon rank sum test
(alternative to unpaired T-test)
•
Two samples of unpaired data
x1, x2 ….. xn
y1, y2 ….. ym
•
•
•
•
Arrange all the observations into a single ranked series of length N = n+m.
R1 = sum of the ranks for the observations which came from sample 1. The sum of
ranks in sample 2 follows by calculation, since the sum of all the ranks equals
N(N + 1)/2
Compute U = min(R1 ,R2) - n(n + 1)/2.
z = (U –m)/s Normal approximation
m = nm/2
s2 = nm(n=m+1)/12
•
The Wilcoxon Rank sum test is only about 5% less efficient than the unpaired T-test (in
terms of the sample size required to achieve the same power) in large samples.
Wilcoxon rank sum test
(alternative to unpaired T-test)
>wilcox.test(samp1,samp2)
Wilcoxon rank sum test
data: samp1 and samp2
W = 82, p-value = 0.001041
alternative hypothesis: true mu is not equal to 0
> wilcox.test(samp1,samp3)
# with outliers
Wilcoxon rank sum test
data: samp1 and samp3
W = 82, p-value = 0.001041
alternative hypothesis: true mu is not equal to 0
Binomial Sign Test
(T-test on one sample)
• One sample y1 ….. yn
H0: the sample mean = m0
• Count M, the number of y’s that are > m0
• If H0 is true, this should be about n/2, and
should be distributed as a Binomial random
variable B(n/2,n)
Contingency Tables
• Data are a table of counts, with the row and column
margins fixed.
n11
n12
r1c1/N
r1c2/N
n21
n22
r2c1/N
r2c2/N
c1
= n11+ n21
c2
= n12+ n22
r1 = n11+ n12
r2 = n21+ n22
N
• Ho: the counts in each cell are consistent with the rows
and columns acting independently
The Chi-Squared statistics
2
(O
E
)
ij ij / Eij
X2
ij
log LikRatio
2Oij log( Oij / Eij )
ij
2
(R1)(C 1)
Fisher’s Exact Test
for 2 x 2 Contigency tables
• Similar idea to a permutation test
• Given the marginal totals in a 2x2 table, we can
compute the probability P of getting the
observed cell counts assuming the null
hypothesis of no associatons between classifying
factors.
– hypergeometric distribution.
• Find all tables with same margins and with
probabilities no bigger than P
• P-value of the table is the sum of these
probabilities
Fisher’s Exact Test
Male
Female
Smoker
a
c
a+c
Non-smoker
b
d
b+d
a+b
c+d
n
Male
Female
Smoker
10
5
15
Non-smoker
10
12
22
20
17
37
a b c d
a c
P
n
a c
R implementation:
m <- matrix(c(10,10,5,12),nrow=2)
fisher.test(m)
Fisher's Exact Test for Count Data
data: m
p-value = 0.3152
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.513082 11.942154
sample estimates:
odds ratio
2.342796
>
Permutation Tests
Another non-parametric Alternative
• Permutation is a general principle with many applications to hypothesis
testing.
• But it is not useful for parameter estimation
• The method is best understood by an example:
– Suppose a data set is classified into N groups. We are interested in whether
the differences between group means (eg if N=2 then the two sample T test
or Wilcoxon Rank Sum test is appropriate)
– Compute a statistic S (eg the difference between the maximum and minimum
group mean)
– If there are no differences between groups then any permutation of the group
labellings between individuals should produce much the same result for S
– Repeat i = 1 to M times:
• permute the data
• compute the statistic on the permuted data, Si
– The Permutation p-value is the fraction of permutations that exceed S
• It’s that simple!
Bootstrapping
•
Given a set of n things S = {x1 …. x2}, a bootstrap sample is formed by sampling with
replacement from S, to create a new set S* of size n
•
•
Original
Resample
•
Some elements of S will be missing in S* and some will be present multiply often (about
1/e = 63% of the elements will be present in S* on average)
•
S* can be used as a replacement for S in any algorithm that works on S, and we can
repeatedly draw further bootstrap samples.
•
Any statistic Z that can be calculated on S can also be calculated on S*, and so we can
construct the bootstrap sampling distribution of Z from repeated samples.
•
From this we can make inferences about the variation of Z, eg construct a 95%
confidence interval (if Z is numeric)
•
Can be used for parameter estimation
Bootstrapping
• Bootstrapping is a general way to evaluate the uncertainty in estimates of
statistics whilst making few assumptions about the data
• It is particularly useful for complex statistics whose distribution cannot be
evaluated analytically
• Example:
– In a gene expression study, we wish to find a minimal set of SNPs that in
combination will predict the variation in a phenotype.
– Suppose we have a deterministic algorithm A that will find us a plausible set of
SNPs (ie a model).
• This is likely to be a complex programme and contain some arbitrary steps.
• It is also unlikely to give any measure of uncertainty to the model.
– There may be many possible sets of SNPs with similar claims, so we would like
a way to average across different models, and evaluate the evidence for a SNP
as the fraction of models in which the SNP occurs
• To do this, we need a way to perturb the data, to generate new data sets on which we
can apply the procedure A
• Bootstrapping is one way to do this