Statistics for bioinformatics
Filtering microarray data
Aims of filtering
• Suppose
– We have a set of 10000 genes whose expression
is measured on our microarray chips.
– We are looking at an experiment where gene
expression is measured in 11 cancer patients and
7 normal individuals.
• We want to know which genes have altered
expression in cancerous cells (maybe they
can be used as drug targets).
• Genes whose expression is similar between
cancer and normal individuals are not
interesting and we want to filter them out.
What will be discussed
• General background on statistics
– Distributions
– P-values, significance
– Hypothesis testing
– T-test
– Analysis of variance
– Nonparametric statistics
• Application of statistics to filtering microarray data
Distributions
• Distributions help to assign probabilities to subsets of
the possible set of outcomes of an experiment.
• The distribution function $F:\mathbb{R}\to[0,1]$ of a random
variable X is given by
$F(x) = P(X \le x)$.
• Random variables can be discrete or continuous. X
is discrete if it takes values in a countable subset of
$\mathbb{R}$ (eg. number of heads in two coin tosses is 0, 1 or 2)
and continuous if its distribution can be written as the
integral of an integrable function f:
$F(x) = \int_{-\infty}^{x} f(u)\,du$.
• f is the probability density function (pdf) of X (f = F′).
Normal distribution
• Also known as Gaussian
• Symmetrical about the mean, “bell-shaped”
• Completely specified by mean and variance –
denoted X is $N(\mu, \sigma^2)$
• Can transform to standard form, $Z = (X - \mu)/\sigma$
– Z is N(0,1)
• Pdf is:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$
[Figure: the normal pdf plotted against x]
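As a quick check of the pdf formula and the standard-form transform above, here is a minimal Python sketch (an added illustration, assuming numpy and scipy are available; the parameter values are arbitrary):

```python
# Check the normal pdf formula and the transform Z = (X - mu)/sigma with scipy.
import numpy as np
from scipy.stats import norm

mu, sigma = 2.0, 1.5          # example parameters (arbitrary choice)
x = 3.1

# pdf evaluated directly from the formula on the slide
pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
print(pdf_manual, norm.pdf(x, loc=mu, scale=sigma))   # the two values should agree

# probabilities for X ~ N(mu, sigma^2) can be read off the standard normal via Z
z = (x - mu) / sigma
print(norm.cdf(x, loc=mu, scale=sigma), norm.cdf(z))  # P(X <= x) = P(Z <= z)
```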
Central limit theorem
• A lot of the statistical tests that we will discuss
apply specifically for normal distributions…
• …however, the central limit theorem says:
• If $X_1, X_2, \ldots, X_n$ are (independent) items from
a random sample drawn from any distribution
with mean $\mu$ and positive variance $\sigma^2$ then
$\sqrt{n}(\bar{X}_n - \mu)/\sigma$
has a limiting distribution ($n \to \infty$) which is
normal with mean 0 and variance 1, where
$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.
Central limit theorem
• For a sample $X_1, X_2, \ldots, X_n$ drawn from a normal
distribution, $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ is exactly normally
distributed with mean 0 and variance 1.
• For other distributions, $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ is approximately
normally distributed with mean 0 and variance 1, for
large enough n.
• This approximate normal distribution can be used to
compute approximate probabilities concerning the
sample mean, $\bar{X}_n$.
• In practice, convergence is very rapid, eg. means of
samples of 10 observations from the uniform distribution
on [0,1] are very close to normal.
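The claim about uniform samples of size 10 can be checked with a small simulation; this sketch (an added illustration, assuming numpy) standardizes the sample means and compares them with the standard normal:

```python
# Means of samples of 10 Uniform[0,1] observations should already be close to normal.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 100_000
sample_means = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)

mu, sigma = 0.5, np.sqrt(1 / 12)          # mean and sd of Uniform[0,1]
z = np.sqrt(n) * (sample_means - mu) / sigma

# If the CLT approximation is good, z should have mean ~0, sd ~1 and roughly
# 97.7% of values below 2 (the standard normal value of Phi(2)).
print(z.mean(), z.std(), (z < 2).mean())
```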
Chi-squared (χ²) distribution and F distribution
• If $X_1, X_2, \ldots, X_r$ are independent and N(0,1) then
$\chi^2 = \sum_{i=1}^{r} X_i^2$ has a chi-squared distribution with r
degrees of freedom.
• If you add chi-squared random variables $\chi_i^2$ with $r_i$
degrees of freedom, i = 1, …, k, you get a chi-squared
random variable with $\sum_{i=1}^{k} r_i$ degrees of freedom.
• Let $\chi_m^2$ and $\chi_n^2$ be independent variates distributed
as chi-squared with m and n degrees of freedom. The
ratio $F_{m,n} = \dfrac{\chi_m^2 / m}{\chi_n^2 / n}$ has an F distribution with
parameters m and n.
• NB the F distribution is completely determined by m & n.
• Useful for statistical tests – see later.
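A brief simulation sketch of both constructions (an added illustration, assuming numpy and scipy; the parameter values are arbitrary):

```python
# Sum of r squared N(0,1) variables should match chi-squared with r d.o.f.,
# and the ratio of two scaled chi-squared variables should match F(m, n).
import numpy as np
from scipy.stats import chi2, f

rng = np.random.default_rng(1)
r, reps = 5, 100_000
chisq_samples = (rng.standard_normal((reps, r)) ** 2).sum(axis=1)
print(chisq_samples.mean(), chi2(r).mean())          # both should be close to r

m, n = 4, 6
F_samples = (rng.chisquare(m, reps) / m) / (rng.chisquare(n, reps) / n)
print(np.mean(F_samples > 3.0), f(m, n).sf(3.0))     # empirical vs exact tail probability
```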
Statistics
• What is a statistic? A function of one or
more random variables that does not depend
on any unknown parameter
– Eg. sample mean
– $Z = (X - \mu)/\sigma$ is not a statistic unless $\mu$ & $\sigma$ are known
• If interested in a random variable X, may only
have partial knowledge of its distribution. Can
sample & use statistics to infer more info, eg.
estimate unknown parameters.
• Primary purpose of theory of statistics:
provide mathematical models for experiments
involving randomness; make inferences from
noisy data.
Hypothesis testing
• A statistical hypothesis is an assertion about the
distribution of one or more random variables (r.v.s).
• In hypothesis testing, have a null hypothesis H0 (eg.
suppose we have a r.v. which we know is N(μ, 1) &
our null is that μ = 0) which we want to test against an
alternative hypothesis H1 (eg. μ = 1).
• The test is a rule for deciding based on an
experimental sample – usually ask if a particular
statistic is in some acceptance region or in the
rejection (also called critical) region; if in
acceptance region keep the null, else reject.
• Test has power function which maps a potential
underlying distribution for the r.v. to the probability of
rejecting the null hypothesis given that distribution.
Significance and P-values
• Significance level of a hypothesis test is the
maximum value (actually supremum) of the power
function of the test if H0 is true- ie. the worst case
probability of rejecting the null if it is true. Typical
values are 0.01 or 0.05 (often expressed as 1% or
5%)
– NB some texts refer to 95% significance, which by my
definition would be 5%.
• P-value = The probability that a statistic would
assume a value greater than or equal to the observed
value strictly by chance.
– Eg. suppose we sample 1 value from our normal distn. with
variance 1 and use this as our statistic. If the sample value is
0.9, this has P-value 0.184, since P(X ≤ 0.9) = 0.816 for the null
hypothesis N(0,1). If we were testing at 5% significance, we
would keep the null, since our P-value is > 0.05.
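The numbers in this example can be reproduced with scipy's standard normal functions (a software convenience, not the lecture's lookup-table approach):

```python
# A single observation of 0.9 under the null N(0,1) has one-sided P-value
# P(X >= 0.9) = 1 - P(X <= 0.9) ~= 0.184.
from scipy.stats import norm

p_value = norm.sf(0.9)          # survival function, 1 - cdf
print(norm.cdf(0.9), p_value)   # ~0.816 and ~0.184, as in the example
```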
Student t-test
• Suppose you have a sample X1, …, Xn of
independent random variables each with
distribution $N(\mu, \sigma^2)$, then
– Sample mean, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, has distribution $N(\mu, \sigma^2/n)$
– Sample variance, $S^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$; $nS^2/\sigma^2$ has
distribution $\chi^2(n-1)$
– $\bar{X}$ and $S^2$ are stochastically independent
• Suppose you don’t know the actual mean and
variance. If you want to test (at some
significance level) whether the actual mean
takes a certain value then you can’t look up
P-values directly from the sample mean
because you don’t know $\sigma^2/n$.
Student’s t-test
• Consider instead the t-ratio (t-statistic), given by
$T = \dfrac{\bar{X} - \mu}{S/\sqrt{n-1}} = \dfrac{U}{\sqrt{V/(n-1)}}$
where $U = \sqrt{n}(\bar{X} - \mu)/\sigma$ is N(0,1) and $V = nS^2/\sigma^2$
is $\chi^2(n-1)$.
[S is the sample standard deviation]
• So by dividing (by an estimate of the standard
deviation of $\bar{X}$), we have eliminated the unknown σ.
• This statistic $T = \dfrac{U}{\sqrt{V/(n-1)}}$ has a “t distribution
with n−1 degrees of freedom”.
Student’s t-test
• A one-sample t-test compares the mean of a
single column of numbers against a
hypothetical mean μ0 you define:
– H0: μ = μ0
– H1: μ ≠ μ0
• Assume H0 is true and calculate the t-statistic:
$t = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n-1}}$.
• A P-value is calculated from the t-statistic,
using the pdf. This value is a measure of the
significance of the deviation of the sample
(column of numbers) from the hypothetical mean. The
normal way of assessing significance is to use a
look-up table [cf example in next section].
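In practice the look-up can be done in software. This sketch (an added illustration, with invented data) runs scipy's one-sample t-test and repeats the slide's t-ratio by hand as a cross-check:

```python
# One-sample t-test on made-up data.
import numpy as np
from scipy.stats import ttest_1samp, t

x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2])
mu0 = 5.0                                  # hypothetical mean under H0

res = ttest_1samp(x, popmean=mu0)          # two-sided by default
print(res.statistic, res.pvalue)

# Equivalent manual calculation matching the slide's t-ratio (dividing the
# biased S by sqrt(n-1) equals dividing the unbiased sd by sqrt(n)).
n = len(x)
t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p_val = 2 * t.sf(abs(t_stat), df=n - 1)
print(t_stat, p_val)
```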
Two-sample T-test
• A two-sample t-test compares the means of two
columns of numbers (independent samples) against
one another on the assumption that they are normally
distributed with the same (although unknown)
variance $\sigma^2$.
• Suppose we have a sample X1, …, Xn and another
Y1, …, Ym drawn from $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$
respectively, then the difference in sample means is
distributed as $N(\mu_1 - \mu_2, \sigma^2(1/n + 1/m))$ and the t-ratio is
given by
$T = \dfrac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{nS_X^2 + mS_Y^2}{n+m-2}\left(\dfrac{1}{n} + \dfrac{1}{m}\right)}}$
Two-sample T-test
• We lay out our null and alternative hypotheses:
– H0: μ1 = μ2
– H1: μ1 ≠ μ2
• Assume H0 is true and calculate the T statistic:
$T = \dfrac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{nS_X^2 + mS_Y^2}{n+m-2}\left(\dfrac{1}{n} + \dfrac{1}{m}\right)}}$
• The T statistic follows a t-distribution with n+m−2
degrees of freedom.
Two-sample T test
• From the T statistic can calculate a P-value, using the
p.d.f. of a t-distribution with n+m-2 degrees of
freedom. If the P-value is smaller than the desired
significance level (T greater than a critical value),
then reject the null hypothesis (there is a significant
difference in means between the two samples).
• Usually we just see if the T statistic exceeds a critical
value, corresponding to some significance level, by
looking up in a table. (Often the significance level is
5%, sometimes written 95%.)
• [Example in next section of lecture].
Two-sample T-test
[Figure: are the sample means different? The significance of the
difference in means depends on the variances.
http://trochim.human.cornell.edu/kb/stat_t.htm]
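A sketch of the pooled-variance two-sample t-test in scipy (an added illustration; the expression values below are invented, loosely echoing the 11-patient vs 7-individual setup from the introduction):

```python
# Two-sample pooled-variance t-test on made-up expression values.
import numpy as np
from scipy.stats import ttest_ind

cancer = np.array([7.2, 6.8, 7.9, 7.4, 6.5, 7.1, 7.8, 6.9, 7.3, 7.6, 7.0])
normal = np.array([5.9, 6.1, 5.4, 6.3, 5.8, 6.0, 5.6])

# equal_var=True gives the classical pooled-variance test with n+m-2 d.o.f.,
# matching the T statistic defined on the previous slides.
t_stat, p_value = ttest_ind(cancer, normal, equal_var=True)
print(t_stat, p_value)   # reject H0 at 5% significance if p_value < 0.05
```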
Analysis of variance (ANOVA)
• Another test to work out if the means of a set
of samples are the same is called analysis of
variance (ANOVA).
– Eg. used for working out whether the expression of
gene A in a microarray experiment is significantly
different in cells from patients with cancer type A,
with cancer type B, and in normal individuals.
• For two groups (eg. cancer and normal),
ANOVA turns out to be equivalent to a T test,
but can use ANOVA for more than two
samples.
One-way ANOVA
• The assumptions of analysis of variance are that the
samples of interest are normally distributed,
independent & have the same variances; however,
research shows that the results of a hypothesis test
using ANOVA are pretty robust to the assumptions
being violated. If this happens, ANOVA tends to be
conservative, ie. it will not reject the null hypothesis of
equal means when it actually should – thus it will tend
to underestimate significant effects of eg. drug
response.
• Suppose we have m samples, with the jth sample
given by $X_{1j}, \ldots, X_{n_j j}$, from distributions $N(\mu_j, \sigma^2)$,
where $\sigma^2$ is the same for each but unknown.
• The null hypothesis is H0: μ1 = μ2 = … = μm = μ, with μ
unspecified.
• H1: at least one mean is different.
One-way ANOVA
• We will test the hypothesis using two different
estimates of the variance.
• One estimate (called the Mean Square Error or
"MSE" for short) is based on the variances within the
samples. The MSE is an estimate of $\sigma^2$ whether or
not the null hypothesis is true.
• The 2nd estimate (Mean Square Between or "MSB" for
short) is based on the variance of the sample means.
The MSB is only an estimate of $\sigma^2$ if the null
hypothesis is true.
• If the null hypothesis is true, then MSE and MSB
should be about the same since they are both
estimates of the same quantity ($\sigma^2$); however, if H0 is
false then MSB can be expected to be > MSE since
MSB is estimating a quantity larger than $\sigma^2$.
Variance between groups
• Let $\bar{X}_j$ represent the sample mean of the jth group
(sample) and $\bar{X}$ the “grand mean” of all elements
from all the groups. The variance between groups
measures the deviations of the group means around
the grand mean.
• Sum of squares between groups (SSB):
$SS_B = \sum_{j=1}^{m} n_j (\bar{X}_j - \bar{X})^2$
• [where $\bar{X}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} X_{ij}$, $\bar{X} = \frac{1}{N}\sum_{j=1}^{m} n_j \bar{X}_j$, $N = \sum_{j=1}^{m} n_j$.]
• The variance between groups, also known as Mean
square between (MSB), is given by the sum of squares
divided by the degrees of freedom between ($df_B$):
$MSB = \dfrac{SS_B}{df_B}$, where $df_B = m - 1$.
Variance within groups
• Here we want to know the total variance due to
deviations within groups.
• Sum of squares within groups (SSW):
$SS_W = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$
• To get the variance within, also known as the mean
squared error (MSE), we must divide by the degrees
of freedom within, $df_W = N - m$. Roughly speaking this is
because we have used up m degrees of freedom in
estimating the group means (by their sample values)
and so only have N−m independent ones left to
estimate this variance:
$MSE = \dfrac{SS_W}{df_W}$
F-statistics
• The F-statistic is the ratio of the variance between
groups to the variance within groups:
$F = \dfrac{MSB}{MSE} = \dfrac{\text{estimate of variance between}}{\text{estimate of variance within}}$.
• If the F-statistic is sufficiently large then we will reject
the null hypothesis that the means are equal.
• The F-statistic is distributed according to an F
distribution with degree of freedom for the numerator
= dfB and degree of freedom for the denominator =
dfW, ie. Fm-1,N-m. We can look up in an F table or
calculate using the probability density function the
P-value corresponding to a given value of the statistic
on the distribution with parameters as given. We
reject the null if this P-value is less than our
significance level.
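The MSB, MSE and F computation can be written out directly. The sketch below (an added illustration with invented data) follows the formulas above, using scipy only for the F tail probability and as a cross-check:

```python
# Manual one-way ANOVA on made-up expression values, cross-checked with scipy.
import numpy as np
from scipy.stats import f, f_oneway

groups = [np.array([4.1, 3.8, 4.5, 4.0]),       # e.g. cancer type A
          np.array([5.2, 5.6, 4.9, 5.4, 5.1]),  # e.g. cancer type B
          np.array([3.2, 3.5, 3.0, 3.6])]       # e.g. normal

N = sum(len(g) for g in groups)
m = len(groups)
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

msb = ss_between / (m - 1)        # variance between groups
mse = ss_within / (N - m)         # variance within groups
F = msb / mse
p_value = f.sf(F, m - 1, N - m)   # upper tail of F(m-1, N-m)
print(F, p_value)

print(f_oneway(*groups))          # should give the same F and P-value
```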
Two-way analysis of variance
• What analysis of variance actually does is to split the
squared deviation from the grand mean into 2 parts:
$\sum_{i,j} (X_{ij} - \bar{X})^2 = \sum_{i,j} (X_{ij} - \bar{X}_j)^2 + \sum_{i,j} (\bar{X}_j - \bar{X})^2$
• In order to estimate the mean from a sample we
actually find a value which minimizes the sum of
squared residuals. Eg. to find the group means we use
values $\bar{X}_j$ which minimize the within-group (second)
term above, and to find the grand mean we minimize the
LHS term.
• The values of these sums of squared residuals when
the means take their maximum likelihood values (the
variance terms above) give a measure of the
likelihood of the means taking those values. So, as
we have seen, the variances can be used to see how
likely certain hypotheses about the mean are.
Two-way analysis of variance
• Measures of the relative sizes of the LHS term to the
2nd term tell us how good a fit the single parameter
model with all means equal is compared to the
multiple means model.
• We use some degrees of freedom (independent
sample data) to estimate the means and other d.o.f.s
to see how good our hypotheses about the means
are (via estimation of the variances).
• Suppose now that we have 2 different factors
affecting our microarray samples: eg. yeast cells in
different concentrations of glucose at different
temperatures.
• Our model for the expression of gene A might involve
both factors influencing the mean…
Two-way analysis of variance
• We suppose that the sample at temperature j with
glucose concentration k is $N(\mu_{jk}, \sigma^2)$ with
$\mu_{jk} = a + b_j + c_k$, where $\sum_j b_j = \sum_k c_k = 0$.
• According to our model, the mean expression level
can vary both with temperature and with glucose
concentration.
• If we want to test whether temperature affects gene
expression level at 5% significance, then we take
H0: $b_j = 0$ for all j
and proceed in a similar manner (although with
different components of variance in the F-statistic)
to before.
• Clearly this can be extended to more than 2 factors –
see Kerr et al (handout for homework).
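A sketch of how such a two-factor model might be fitted in practice, assuming the statsmodels package is available; the expression values and factor levels below are invented purely for illustration:

```python
# Additive two-way ANOVA: mean expression = a + b_temp + c_glucose (no interaction).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "expr":    [4.1, 4.3, 5.0, 5.2, 4.0, 4.4, 5.1, 5.3],
    "temp":    ["30C", "30C", "30C", "30C", "37C", "37C", "37C", "37C"],
    "glucose": ["low", "low", "high", "high", "low", "low", "high", "high"],
})

model = ols("expr ~ C(temp) + C(glucose)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # F-statistic and P-value per factor
```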
Nonparametric statistics
• So far we have looked at statistical tests which are valid
for normally-distributed data.
• If we know that our data is (approximately) Gaussian
(eg. in large sample-size limits by Central Limit
Theorem) these are useful and easy to use.
• If our data deviates a lot from normal then we need
other techniques.
• Nonparametric techniques make no assumptions
about the underlying distributions.
• We will briefly discuss such an example: a rank
randomization test equivalent to the Mann-Whitney
U-test.
Randomization test
• Best described by example:
Group 1: 11, 14, 7, 8   (mean 10)
Group 2: 2, 9, 0, 5     (mean 4)
• Want to know if the two groups have significantly
different means
1. Work out the difference in means.
2. Work out how many ways there are of dividing the total
sample into two groups of four. [?]
3. Count how many of these lead to a bigger difference than
the original two groups.
Randomization tests
• Difference in means is 6
• There are 70=8!/(4!4!) ways of dividing the data
• There are only two other combinations that give a
difference in means which is as large or larger:
• Probability of getting a difference in mean in favour of
group 1 (one-tailed test) as high as the original is
approximately 3/70 = 0.0429. There are also 3
combinations that give differences in favour of group 2
of ≥ 6. So the 2-tailed p-value is 6/70 = 0.0857.
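The whole randomization test is small enough to enumerate directly; a sketch (an added illustration using only the standard library) on the example data above:

```python
# Exhaustive randomization test on the example groups.
from itertools import combinations

data = [11, 14, 7, 8, 2, 9, 0, 5]                       # group 1 then group 2
observed_diff = sum(data[:4]) / 4 - sum(data[4:]) / 4   # 10 - 4 = 6

count_one_tailed = 0
count_two_tailed = 0
n_splits = 0
for group1_idx in combinations(range(len(data)), 4):
    g1 = [data[i] for i in group1_idx]
    g2 = [data[i] for i in range(len(data)) if i not in group1_idx]
    diff = sum(g1) / 4 - sum(g2) / 4
    n_splits += 1
    if diff >= observed_diff:
        count_one_tailed += 1
    if abs(diff) >= observed_diff:
        count_two_tailed += 1

print(n_splits)                      # 70 possible divisions
print(count_one_tailed / n_splits)   # one-tailed p-value, 3/70 ~= 0.0429
print(count_two_tailed / n_splits)   # two-tailed p-value, 6/70 ~= 0.0857
```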
Mann-Whitney U test
• The problem with randomization tests is that as the
number of samples and groups increases, the
number of possible ways of dividing becomes
extremely large- thus randomization tests are hard to
compute.
• A simplification involves replacing the data by ranks
(ie. the smallest value is replaced by 1, the next by 2,
…). A randomization test is then performed on the
ranks:
Group 1: 11, 14, 7, 8  →  ranks 7, 8, 4, 5
Group 2: 2, 9, 0, 5    →  ranks 2, 6, 1, 3
Rank randomization test
• Calculate the difference in the summed ranks of the
two groups: 12 here.
• The problem is then to work out how many of the 70
ways of rearranging the numbers 1, …, 8 into two
groups give a difference in group sum which is ≥ 12
(one-tailed; has modulus ≥ 12 for two-tailed).
• This problem doesn’t depend on the exact data, so
standard values can be tabulated. For a given data
set just use a lookup table.
• The rank randomization test for the differences
between two groups is called the Wilcoxon Rank
Sum test. It is the same as the Mann-Whitney U-test,
although this uses a different test statistic.
• Clearly information is lost in converting from real data
to ranks so the test is not as powerful as
randomization tests, but is easier to compute.
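For reference, the same example run through scipy's Mann-Whitney U test (a software shortcut rather than the lecture's lookup table; with no ties and such small groups scipy can use the exact distribution of the U statistic):

```python
# Mann-Whitney U test on the example groups.
from scipy.stats import mannwhitneyu

group1 = [11, 14, 7, 8]
group2 = [2, 9, 0, 5]

res = mannwhitneyu(group1, group2, alternative='two-sided', method='exact')
print(res.statistic, res.pvalue)   # exact two-sided p-value should be close to 8/70 ~= 0.114
```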
Statistics summary
• We have discussed several ways of
assessing significant differences between the
means of two or more samples.
• For normally distributed samples with equal
variances, we have two methods:
– T test (for comparing two samples)
– Analysis of variance (for comparing two or more
groups)
• The central limit theorem shows that the
mean of a very large sample follows an
approximately normal distribution; however,
for small & non-normally distributed samples
non-parametric methods may be necessary.
Statistics summary
• These techniques are useful in analysing
microarray data because we want to infer
from noisy data which genes vary significantly
in their expression over a variety of conditions
- NB since the conditions correspond to
the groups, we will generally need several
repeats of the microarray experiments
under the “same” conditions in order to
apply these techniques.
[References for more info on statistics, esp.
statistical tests: Introduction to mathematical
statistics by Hogg & Craig (Maxwell Macmillan);
http://davidmlane.com/hyperstat/index.html ]