Lecture 11
Differential Expression: A Summary
The Purposes of Statistical Tests
• Microarray data is often used as a guide to further, more precise studies of gene expression by qRT-PCR or other methods.
• Then the goal of the statistical analysis is heuristic: to provide the
experimenter with an ordered list of good candidate genes to follow
up.
• Sometimes the experimenter plans to publish microarray results as
evidence for changes in gene abundance; in this case it is important to
state the correct degree of evidence: the ‘p-value’.
• Many microarray papers still present p-values, as if each gene had
been tested in isolation, even though many genes were actually tested
in parallel; these p-values are wrong in the context of testing thousands
of genes.
Transforms: Justifications for Log
• Often the first step is transforming the values to log scale, and doing
all subsequent steps on the log-transformed values.
• Taking logarithms is common practice and helpful in several ways, but there are other options.
• The main justification for transforms in statistics is to better detect
differences between groups whose within-group variances are very
different.
• Most commonly the within-group variances are higher in those groups
where the mean is also higher.
• A different kind of variation, the measurement error in expression level
estimates, grows with the mean level.
• If the measurement error is proportional to the mean, then the log-transformed values will have consistent variance for all genes.
• For both reasons many researchers argue that gene expression
measures should be analyzed on a logarithmic scale.
Transforms: Others to consider
• However the gene abundances on the log scale usually
show greater variation near the bottom end, giving rise to
the characteristic ‘funnel’ shape of many microarray plots.
• The log transform is too ‘strong’, and partly reverses the
inequality of measurement variances.
• A weaker transform, such as a cube root, would bring the
variances closer to equal. Some researchers have devised
‘variance-stabilizing’ transformations, which attempt to
equalize measurement errors for all genes.
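• As a concrete illustration, a minimal Python sketch (purely simulated data, so all numbers are hypothetical) comparing how the raw, log, and cube-root scales spread out a low- and a high-abundance gene when the measurement error has both an additive and a proportional component; with this error model the printout should show a huge imbalance on the raw scale, over-correction by the log (the low gene now varies more), and the cube root coming closer to equal spreads:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true abundances for a low- and a high-expressed gene
low_true, high_true = 100.0, 10_000.0
n = 5_000

def simulate(mean, size):
    # measurement error = additive background noise + a component proportional to the mean
    return mean + rng.normal(0.0, 50.0, size) + rng.normal(0.0, 0.2 * mean, size)

for name, transform in [("raw", lambda x: x), ("log2", np.log2), ("cube root", np.cbrt)]:
    low = transform(np.clip(simulate(low_true, n), 1.0, None))
    high = transform(np.clip(simulate(high_true, n), 1.0, None))
    # a ratio near 1 means the transform has roughly equalized the spreads
    print(f"{name:>9}: SD(low) / SD(high) = {low.std() / high.std():.2f}")
```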
Transforms: concluding remarks
• We are not in principle comparing different genes, but rather the same
genes across different groups, and in most experiments, few genes
change more than three-fold in mean levels.
• For studies where gene levels are fairly constant, and changes are
expected to be modest, such as neuroscience studies, there is no need
to transform data.
• For studies of cancerous tissue, where often some genes are elevated
ten-fold or more, and these increases are highly variable between
individuals, the log transform is very useful.
• For studies with at most moderate fold changes – such as most
experimental treatments on healthy animals – it would be better to use
a weaker transform, such as a cube-root transform or a ‘variance-stabilizing’ transform.
• It is worth doing the analysis in parallel in both log and true scale, and
discarding genes whose significance disappears entirely under one or
the other.
Comparing two groups: the t-test
• The long-time standard test statistic for comparing two
groups is the t-statistic:
• t_i = (x̄_i,1 − x̄_i,2) / s_i,
• where x̄_i,1 is the mean value of gene i in group 1, x̄_i,2 is the mean in group 2, and s_i is the (non-pooled) within-groups standard error (SE) for gene i.
• Usually the majority of genes are not consistently changed
between groups. For the unchanged genes, the values of
their t-statistic would follow the t-distribution indexed by
n1+n2 – 2 degrees of freedom, provided there are no
outliers in the samples.
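• A minimal Python sketch of computing this statistic for every gene at once, assuming a hypothetical genes × samples NumPy array `expr` and a 0/1 group label per sample; the non-pooled SE and the n1 + n2 − 2 degrees of freedom follow the slide (strictly, a non-pooled SE pairs with Welch degrees of freedom, but the slide's convention is kept here):

```python
import numpy as np
from scipy import stats

def gene_t_scores(expr, labels):
    """Per-gene t-scores: expr is a genes x samples array, labels is 0/1 per sample."""
    g1, g2 = expr[:, labels == 0], expr[:, labels == 1]
    diff = g1.mean(axis=1) - g2.mean(axis=1)
    # non-pooled within-group standard error for each gene
    se = np.sqrt(g1.var(axis=1, ddof=1) / g1.shape[1] +
                 g2.var(axis=1, ddof=1) / g2.shape[1])
    t = diff / se
    df = g1.shape[1] + g2.shape[1] - 2      # degrees of freedom quoted on the slide
    p = 2 * stats.t.sf(np.abs(t), df)       # two-sided p-values for unchanged genes
    return t, p
```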
Assumptions and Effects of Outliers
• Gene expression values don’t follow a normal curve, but the p-values from a standard t-test aren’t far from the truth when the distribution is moderately asymmetric.
• However the t-test fails badly when there are outliers.
• For this reason, when doing a t-test, it is wise to confirm that neither
group contains outliers.
• In practice, it often happens that genes detected as different between groups are actually expressed very highly in only one individual of the higher group.
• Probably the best way to avoid the problem with outliers is to use
robust estimates of mean (trimmed mean) and variability (mean
absolute deviation).
• However no theoretical distribution is known for these; and their
significance levels (p-values) would have to be computed by
permutations.
t-test: follow-up: Quantile plot
• As a matter of practice, in almost all cases, the values of the t-statistics
for all of the unchanged genes in a single experiment also follow a t-distribution.
• If the researcher expects only a small number of genes are greatly
changed, then a convenient way of assessing the most likely true
positives is to plot the t-scores obtained, against the t-distribution.
• Usually the t-scores of the unchanged genes follow the t-distribution; if
the unchanged genes are the overwhelming majority, then this plot will
be a straight line at 45°.
• The t-scores for the few really significant genes are more extreme, and
stand out from the straight line, where they can be easily seen.
t quantile plot
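• A minimal Python sketch of drawing such a quantile plot, assuming per-gene t-scores have already been computed (for example with the hypothetical `gene_t_scores` helper from the earlier sketch) and using matplotlib for the plotting:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def t_quantile_plot(t_scores, df):
    """Plot sorted observed t-scores against t-distribution quantiles."""
    n = len(t_scores)
    theo = stats.t.ppf((np.arange(1, n + 1) - 0.5) / n, df)   # theoretical quantiles
    plt.scatter(theo, np.sort(t_scores), s=5)
    lims = [theo.min(), theo.max()]
    plt.plot(lims, lims, color="red")        # 45-degree reference line
    plt.xlabel("t-distribution quantiles")
    plt.ylabel("observed t-scores")
    plt.show()
```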
Comments on Quantile Plot
• A drawback of the t-statistic for microarray datasets is that most
experiments have only a few samples in each group (n1 and n2 are
small), and so the standard error si is not very reliable.
• In a modest fraction of cases, si is (by normal sampling variation)
greatly under-estimated, and genes that are little changed give rise to
extreme t-values, and therefore false positives.
• The theory compensates for some of these false positives, in that the
tails of the t-distribution are much further out than those of a Normal
curve.
• This causes a reverse problem: a truly changed gene has to change a
great deal to give convincing evidence; therefore many of the
moderately changed genes won’t show convincing evidence of change.
Volcano Plot: A Visual Technique
• The ‘volcano plot’ arranges genes along dimensions of biological and statistical significance.
• The horizontal dimension is the fold change between the two groups
(on a log scale, so that up and down regulation appear symmetric), and
the vertical axis represents the p-value for a t-test of differences
between samples (most conveniently on a negative log scale – so
smaller p-values appear higher up).
• The x axis indicates biological impact of the change; the y axis
indicates the statistical evidence, or reliability of the change.
• The researcher can then make judgements about the most promising
candidates for follow-up studies, by trading off both these criteria by
eye.
• R allows one to attach names to genes that appear promising.
Volcano Plot: An example
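• The slide mentions doing this in R; here is a minimal Python sketch of how such a plot can be drawn, with hypothetical inputs (per-gene log2 fold changes, t-test p-values, and optional gene names):

```python
import numpy as np
import matplotlib.pyplot as plt

def volcano_plot(log2_fc, p_values, gene_names=None, p_cut=0.01, fc_cut=1.0):
    """Fold change (log2) on x, -log10 p-value on y; highlight and label promising genes."""
    y = -np.log10(p_values)
    plt.scatter(log2_fc, y, s=5, color="grey")
    promising = (p_values < p_cut) & (np.abs(log2_fc) > fc_cut)
    plt.scatter(log2_fc[promising], y[promising], s=8, color="red")
    if gene_names is not None:
        for x, yy, name in zip(log2_fc[promising], y[promising],
                               np.asarray(gene_names)[promising]):
            plt.annotate(name, (x, yy), fontsize=7)
    plt.xlabel("log2 fold change (biological impact)")
    plt.ylabel("-log10 p-value (statistical evidence)")
    plt.show()
```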
Moderated t: An Alternative
• Tusher and Tibshirani suggested (a form of) the following test statistic
t* rather than the usual t statistic.
• The SE in the denominator is adjusted toward a common value for all
genes; this prevents gross underestimation of SE’s, and so eliminates
from the gene list those genes that change only a little:
• t*_i = (x̄_i,1 − x̄_i,2) / ((s_i + s_0)/2),
• where s_0 is the median of the distribution of standard errors for all genes.
• The range of values of the revised denominator is narrower than the
range of sample SEs; it is as if all the SE’s are ‘shrunk’ toward the
common value s0.
• Unfortunately no theoretical distribution is known for this statistic t*,
and hence the significance levels of this statistic must be computed by
a permutation method (see below).
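• A minimal Python sketch of this statistic, keeping the /2 in the denominator exactly as written above; its significance levels would still have to come from permutations, as in the sketch that follows the permutation-test list below:

```python
import numpy as np

def t_star(expr, labels, s0=None):
    """Moderated t* per gene: expr is a genes x samples array, labels is 0/1 per sample."""
    g1, g2 = expr[:, labels == 0], expr[:, labels == 1]
    diff = g1.mean(axis=1) - g2.mean(axis=1)
    # per-gene non-pooled standard error, as for the ordinary t-score
    se = np.sqrt(g1.var(axis=1, ddof=1) / g1.shape[1] +
                 g2.var(axis=1, ddof=1) / g2.shape[1])
    if s0 is None:
        s0 = np.median(se)                  # s0 = median of the SEs over all genes
    return diff / ((se + s0) / 2.0)         # denominator shrunk toward the common value s0
```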
Another approach: Pooling Info from past data
• If a good deal of prior data exists on the tissue and strain used, measured on the same microarray platform, then it is defensible to pool the estimates of variation from each of the prior studies, and use this as the denominator in the t-scores.
• The t-scores should then be compared to the t-distribution on a number
of degrees of freedom, equal to that used in computing the pooled
standard error.
• A variant of this approach may be used in a study where many groups
are compared in parallel. The within-group variances for each gene
may be pooled across the different groups to obtain a more accurate
estimate of variation.
• This presumes that treatments applied to different groups affect mostly
the mean expression levels, and not the variation among individuals.
• Use F-ratios to test that the discrepancies in variance estimates are not
too large for many of the genes that are selected as differentially
expressed.
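• A minimal Python sketch of the pooling idea, with a simple per-gene F-ratio check of the equal-variance assumption (hypothetical inputs: one genes × samples array per group; this is one straightforward way to implement the checks described above, not the only one):

```python
import numpy as np
from scipy import stats

def pooled_variance(groups):
    """Pool per-gene within-group variances across several groups.
    groups: list of genes x samples arrays, one array per experimental group."""
    dfs = np.array([g.shape[1] - 1 for g in groups])
    variances = np.stack([g.var(axis=1, ddof=1) for g in groups])   # groups x genes
    pooled = (dfs[:, None] * variances).sum(axis=0) / dfs.sum()
    return pooled, dfs.sum()        # pooled variance per gene, and its degrees of freedom

def f_ratio_pvalues(g1, g2):
    """Two-sided per-gene F-test for equality of variances between two groups."""
    v1, v2 = g1.var(axis=1, ddof=1), g2.var(axis=1, ddof=1)
    p_upper = stats.f.sf(v1 / v2, g1.shape[1] - 1, g2.shape[1] - 1)
    return 2.0 * np.minimum(p_upper, 1.0 - p_upper)   # small values flag a variance discrepancy
```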
Wilcoxon Signed Rank: A nonparametric alternative
• Since gene expression values are not distributed according to a Normal
curve, some researchers have used non-parametric tests.
• The most commonly used test of this type is the Wilcoxon signed-rank
test.
• This is NOT a good choice, because the Wilcoxon test assumes that the distribution of the gene expression values is strictly symmetric, and in practice the distribution of values of a single gene is frequently quite asymmetric.
• A thoroughly robust non-parametric test is the rank, or binomial test,
which counts the number of values in each group greater than the
median value, and compares this number to the binomial distribution.
• This test has very little power to detect differences in small groups,
which are typical of microarray studies, and is never used in practice to
detect differential expression.
Permutation Test
• Permutation testing is an approach that is widely applicable and copes with distributions that are far from Normal; this approach is particularly useful for microarray studies because it can be easily adapted to estimate significance levels for many genes in parallel. The procedure (a code sketch follows this list):
1. Choose a test statistic, e.g. a t-score for a comparison of two groups.
2. Compute the test statistic for the gene of interest.
3. Permute the labels on samples at random, and re-compute the test statistic for the rearranged labels; repeat for a large number (perhaps 1,000) of permutations.
4. Finally, compute the fraction of cases in which the test statistics from step 3 exceed the real test statistic from step 2.
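• A minimal Python sketch of steps 1–4 for a single gene of interest (hypothetical inputs: a vector of that gene's expression values and a 0/1 group label per sample; an ordinary two-sample t is used as the test statistic):

```python
import numpy as np
from scipy import stats

def permutation_test_one_gene(values, labels, n_perm=1000, seed=0):
    """values: one gene's expression across samples; labels: 0/1 group per sample."""
    rng = np.random.default_rng(seed)
    # steps 1-2: choose a test statistic (two-sample t) and compute it for the real labels
    observed = abs(stats.ttest_ind(values[labels == 0], values[labels == 1]).statistic)
    exceed = 0
    for _ in range(n_perm):
        # step 3: permute the labels at random and re-compute the statistic
        perm = rng.permutation(labels)
        t_perm = abs(stats.ttest_ind(values[perm == 0], values[perm == 1]).statistic)
        exceed += t_perm >= observed
    # step 4: fraction of permutations at least as extreme as the real statistic
    return (exceed + 1) / (n_perm + 1)      # the +1s keep the estimate away from exactly zero
```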
Permutation Based p-values
• To calculate corrected p-values, first calculate single-step p-values for
all genes: p1, …, pN.
• Then order the p-values: p(1), …, p(N), from least to greatest.
• Next permute the sample labels at random, and compute the test statistics (and their p-values) for all genes between the two randomized groups. For each position k, keep track of whether at least one permuted p-value, from gene k or from any of the genes further down the list (k+1, k+2, …, N), is more significant than p(k).
• After all permutations, compute the fraction of permutations in which at least one such p-value was less than p(k). This is the corrected p-value for gene k.
• Although this procedure is complicated, it is much more powerful than
other methods. This is known as the Westfall–Young correction.
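• A simplified Python sketch of this step-down bookkeeping (hypothetical genes × samples input; the full Westfall–Young minP procedure would also obtain the per-gene p-values themselves by permutation, whereas this sketch uses parametric t-test p-values inside each permutation):

```python
import numpy as np
from scipy import stats

def westfall_young_minp(expr, labels, n_perm=1000, seed=0):
    """Step-down minP corrected p-values (simplified sketch).
    expr: genes x samples array; labels: 0/1 group membership per sample."""
    rng = np.random.default_rng(seed)

    def raw_p(lab):
        return stats.ttest_ind(expr[:, lab == 0], expr[:, lab == 1], axis=1).pvalue

    p_obs = raw_p(labels)
    order = np.argsort(p_obs)               # gene indices, most significant first
    count = np.zeros(len(p_obs))
    for _ in range(n_perm):
        p_perm = raw_p(rng.permutation(labels))[order]
        # most significant permuted p-value among gene k and everything further down the list
        min_tail = np.minimum.accumulate(p_perm[::-1])[::-1]
        count += min_tail <= p_obs[order]
    adj_sorted = np.maximum.accumulate(count / n_perm)   # enforce monotonicity down the list
    adjusted = np.empty_like(adj_sorted)
    adjusted[order] = adj_sorted             # put the corrected p-values back in gene order
    return adjusted
```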
The Bayesian Approach
• Bayesian approaches make assumptions about the parameters to be
estimated (such as the differences between gene levels in treatment and
control groups).
• A pure Bayes approach assumes specific distributions (prior
distributions) for the mean differences of gene levels, and their
standard deviations.
• The empirical Bayes approach assumes less, usually that the form of
these distributions is known; the parameters of the prior distribution
are estimated from data.
• As we get more detailed knowledge of the variability of individual
genes, it should become possible to make detailed useful prior
estimates based on past experience.
Philosophy and Idea for EB
• Assume that all unchanged genes have a range of variances that follow a known prior distribution.
• Estimate the mean of that distribution as the mean of the experimental variances for all the genes.
• Estimate the variance of that distribution as the variance of the gene variances. Both of these estimates assume that only a small minority of genes are actually changed.
• Then consider each gene individually. We would like an estimate of the variability of that gene that is more reliable than its sample standard deviation.
• We can compute the probability of any value for sample variance,
given a supposed value for the (unknown) true variance of that gene.
EB Contd
• If we combine that with the prior assignment of probability for each
true variance of a gene, we can work out the probabilities that various
possible true values underlie the one observed value of the variance.
• It is more probable that a very high sample variance comes about through over-estimation of a moderately high true variance than through accurate estimation of a genuinely very high variance, simply because there are many more moderate variances than high variances, and over-estimates are not rare compared to precise estimates.
EB: the Variance estimate
• In the Bayesian tradition we construct a probability density function for the true value of the gene variance (called the posterior distribution).
• To come up with a single value we might estimate the expected value
of the posterior distribution.
• This expected value turns out to be a weighted average of the sample
variance for the gene, and the mean of the prior distribution, which
was set to the mean of all the sample variances.
• Thus we can obtain a theoretically more reliable estimate of variation,
by pooling information about variation, derived from all genes
simultaneously.
• This more reliable estimate of variation should then translate into more
reliable t-scores, with more power to pick up moderate differences in
gene abundance between samples.
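• One common concrete form of this weighted average is limma's moderated variance, where the weights are degrees of freedom; a minimal Python sketch under that assumption (here the prior mean `prior_var` and prior weight `prior_df` are plain inputs, whereas limma estimates them from the spread of all the sample variances, and a pooled-variance standard error is used):

```python
import numpy as np

def moderated_variance(sample_var, residual_df, prior_var, prior_df):
    """Posterior expected variance per gene: a degrees-of-freedom-weighted average of the
    gene's own sample variance and the prior mean variance."""
    return (prior_df * prior_var + residual_df * sample_var) / (prior_df + residual_df)

def moderated_t(mean_diff, sample_var, n1, n2, prior_var, prior_df):
    """Moderated t-score for a two-group comparison, using a pooled-variance standard error."""
    residual_df = n1 + n2 - 2
    s2 = moderated_variance(sample_var, residual_df, prior_var, prior_df)
    se = np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
    return mean_diff / se
```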
EB: Comments on Moderated t
• The empirical Bayes estimate of variance lies closer to the mean variance more often than the sample variance does, which means that the moderated t-score is more likely to take a moderate (not very small, not very large) value than the ordinary t-score.
• The tails of the moderated t distribution are fairly close to the tails of a
t distribution based on a more reliable variance estimate, that is, with a
higher number of degrees of freedom.
• This means that a given difference in means often gives a more
significant p-value, and that more genes are selected at a particular
threshold.
EB: estimating proportion of changed genes using Prior
Idea:
• Suppose we believe that some fraction p1 of genes are actually changed by the
treatment, and the remaining fraction p0 = 1 - p1 are unchanged.
• Then we examine the distribution of the p-values from the raw t-scores of all the genes in the experiment.
• By the way p-values are constructed from a t-score, the p-values of unchanged genes are uniformly distributed between 0 and 1, so if no genes were changed we should find equal numbers of p-values in any two intervals of the same length.
• We find generally that there are fewer p-values in the range 0.5 to 1.0 than
there should be (there should be N/2 such p-values), and correspondingly more
in the range 0.0 to 0.5.
• The natural assumption is that some genes are missing from the range 0.5 to
1.0, because they are really differentially expressed.
• The number of genes remaining in that range, relative to the N/2 that we would expect, gives an estimate of p0, the fraction of genes that are unchanged. Let M be the number of p-values between 0.5 and 1; then p0 ≈ M / (N/2) = 2M/N.
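• A minimal Python sketch of this estimate (the 0.5 cut-off follows the slide; some methods tune it):

```python
import numpy as np

def estimate_p0(p_values):
    """Estimate the fraction of unchanged genes from the p-values above 0.5."""
    n = len(p_values)
    m = np.sum(np.asarray(p_values) > 0.5)   # unchanged genes contribute ~(N * p0) / 2 of these
    return min(1.0, m / (n / 2.0))           # p0 ≈ M / (N/2) = 2M/N, capped at 1
```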