Comp. Genomics
Recitation 10
4/7/09
Differential expression
detection
Outline
• Clustering vs. Differential expression
• Fold change
• T-test
• Multiple testing
• FDR/SAM
• Mann-Whitney
• Examples
Microarray preliminaries
• General input: A matrix of probes
(sequences) and intensities
• We assume the hard work is over:
• Probes are assigned to genes
• The data is properly (?) normalized
• We have an expression matrix
• Rows correspond to genes
• Columns correspond to conditions
Microarray analysis
• Common scenarios:
• We tested the behavior of genes across
several time points
• We test a large number of different conditions
• Clustering is the solution
• We compared a small number of conditions
(2) and have multiple replicates for each
condition
• E.g., we measured blood expression in 10 sick and
10 healthy individuals
• Differential expression analysis
Identification of differential
genes
• The most basic experimental design:
comparison between 2 conditions – ‘treatment’
vs. control
• More complex: sick/treatment/control
• The goal: identify genes that are differentially
expressed in the examined conditions
• Number of replicates is usually low (n=2-4)
• Statistics are important
Slides: Rani Elkon
Approaches for identification
of differential genes
1. Fold Change
2. T-test
3. SAM
1. Fold Change
• Consider genes whose mean expression
level changed by at least 1.75-2 fold as
differential genes
• Pros:
• Very simple!
• Cons:
• Usually no estimation of false positive rate is
provided
• Biased to genes with low expression level
• Ignores the variability of gene levels over
replicates.
Fold Change limit – Biased to
low expression levels
Determine ‘floor’ cut-off and set all expression levels below it to this
floor level
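A minimal sketch of fold change with a floor, assuming the fold change is the ratio of mean treatment to mean control levels. The function names, the floor value of 20 and the 2-fold threshold are illustrative assumptions, not values from the slides.

```python
def fold_change(control, treatment, floor=20.0):
    """Ratio of mean treatment to mean control after flooring low values.

    `floor` is a hypothetical cut-off: every expression level below it
    is raised to it before the means are taken.
    """
    c = [max(x, floor) for x in control]
    t = [max(x, floor) for x in treatment]
    return (sum(t) / len(t)) / (sum(c) / len(c))


def is_differential(control, treatment, floor=20.0, threshold=2.0):
    """Flag a gene whose mean level went up or down by at least `threshold` fold."""
    fc = fold_change(control, treatment, floor)
    return fc >= threshold or fc <= 1.0 / threshold


# A well-expressed gene that doubles is flagged.
print(is_differential([90, 100, 110], [190, 200, 210]))
# A gene near background shows a 3-fold raw change, but the floor removes it.
print(is_differential([2, 3, 4], [8, 9, 10]))
```

Without the floor, the second gene would pass the threshold even though both of its conditions sit at noise level.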
Fold Change limit – ignores
variability over replicates
          control                        treatment
      C1   C2   C3  mean_c  std_c    t1   t2   t3  mean_t  std_t      t-score    p-value
g1    90  100  110   100    10       190  200  210   200    10        12.24745   0.000128
g2    50  100  150   100    50       100  150  350   200    132.2876  1.227079   0.143932
• We need a score that ‘punishes’ genes
with high variability over replicates
2. T-test
• Compute a t-score for each gene:
$$t = \frac{m_t - m_c}{\sqrt{S_c^2/n_c + S_t^2/n_t}}$$
$m_c$, $m_t$ – mean levels in Control and Treatment
$S_c^2$, $S_t^2$ – variance estimates in Control and Treatment
$n_c$, $n_t$ – number of replicates in Control and Treatment
          control                  treatment
      C1   C2   C3  mean_c    t1   t2   t3  mean_t    t
g1    90  100  110   100      190  200  210   200     12
g2    50  100  150   100      100  150  350   200     1.3
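The t-scores in the table can be approximated with a short sketch. It assumes the unequal-variance form of the statistic, (mean_t − mean_c) / sqrt(Sc²/nc + St²/nt), with sample variances (n − 1 in the denominator); the function name is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def t_score(control, treatment):
    """Difference of means divided by its estimated standard error."""
    standard_error = sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    return (mean(treatment) - mean(control)) / standard_error


g1 = t_score([90, 100, 110], [190, 200, 210])
g2 = t_score([50, 100, 150], [100, 150, 350])
print(round(g1, 2), round(g2, 2))
```

Both genes have the same 2-fold change in means, but g1 scores roughly ten times higher than g2 because its replicates agree closely. Under this assumed formula the values come out near 12.2 and 1.2, close to the rounded figures in the table.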
T-test
• The t-score is useful because it comes from a
well-known statistical hypothesis test
• Assume the samples are normally distributed
(with unknown variance) and compare two
hypotheses:
• H0 – All the measurements come from the same
distribution
• H1 – The control and treatment measurements come
from different normal distributions
• In this case a p-value can be derived for every t-score
T-test
• Set cut-off for p-value (α=0.01) and
consider all genes with p-value < α as
differential genes
      C1   C2   C3  mean_c    t1   t2   t3  mean_t    t      p-val
g1    90  100  110   100      190  200  210   200     12.2   0.0001
g2    50  100  150   100      100  150  350   200     1.3    0.14
Multiple Testing
• Pg, the p-value associated with the t-score tg, is the
probability of obtaining by chance a t-score that is at
least as extreme as tg.
• Multiplicity problem: thousands of genes are
tested simultaneously (all the genes on the
array!)
• Simple example:
• 10,000 genes on a chip
• not a single one is differentially expressed (everything
is random)
• α=0.01
• 10,000 x 0.01 = 100 genes are expected to have a p-value < 0.01 just by chance.
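The expected count can be checked with a small simulation. This sketch assumes that under the null hypothesis each gene's p-value is uniform on [0, 1]; the function name and the seed are illustrative.

```python
import random


def count_chance_hits(n_genes=10_000, alpha=0.01, seed=0):
    """Count null genes whose p-value falls below alpha purely by chance."""
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_genes)]  # uniform under H0
    return sum(p < alpha for p in p_values)


# The count varies with the seed but stays close to n_genes * alpha = 100.
print(count_chance_hits())
```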
Multiple testing
• Individual p–values of e.g. 0.01 no longer
correspond to significant findings.
• Need to adjust for multiple testing when
assessing the statistical significance of
findings
• Actually this is a somewhat common
problem in statistics
Multiple Testing
• Simple solution (Bonferroni): consider as
differential genes only those with p-value <
(α/N)
• N: number of tests
• α=0.01, N=10,000: cut-off=0.000001
• Ensures a very low probability of having any false
positive genes (less than α)
• Advantage: very clean list of differential genes
• Limit: the list usually contains very few genes…
unacceptably high rate of false negatives
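A minimal sketch of the Bonferroni cut-off, assuming the p-values are held in a plain list indexed by gene; the function name is illustrative.

```python
def bonferroni_select(p_values, alpha=0.01):
    """Return indices of genes whose p-value is below alpha / N."""
    cutoff = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < cutoff]


# 10,000 tests at alpha = 0.01 give a per-gene cut-off of 1e-6.
p_values = [1e-7, 5e-6] + [0.5] * 9_998
print(bonferroni_select(p_values))
```

The second gene, with p = 5e-6, would look highly significant on its own but does not survive the correction, which illustrates the false-negative cost noted above.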
FDR correction (Benjamini &
Hochberg)
• False Discovery Rate
• In high-throughput studies a certain proportion of
false positives is tolerable
• Control the expected proportion of false
positives among the genes declared as
differential (q=10%).
• Scheme:
• Rank genes according to their p-values: p(1) ≤ p(2) ≤ … ≤ p(N)
• Consider as differential the top k genes, where
k = max{i: p(i) ≤ i*(q/N)}
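The ranking scheme can be sketched as follows. This is a minimal sketch, assuming p-values in a plain list and using ≤ in the comparison (the usual Benjamini-Hochberg convention); the function name is illustrative.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return indices of the top-k genes, k = max{i : p(i) <= i * q / N}."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # gene indices by p-value
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            k = rank  # keep the largest rank that passes
    return sorted(order[:k])


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, q=0.05))
```

Because k is the largest passing rank, a gene can be selected even if its own p-value misses its threshold, as long as a gene ranked after it passes.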
3. SAM (Tusher, Tibshirani
& Chu)
• ‘Significance Analysis of Microarray’
• Limit of analytical FDR approach: assumes
that the tests are independent
• In the microarray context, the expression
levels of some genes are highly correlated
→ unreliable FDR estimate
• SAM uses permutations to get an
‘empirical’ estimate for the FDR of the
reported differential genes
SAM
• Scheme:
• Compute for each gene a statistic that measures its
relative expression difference in control vs ‘treatment’
(t-score or a variant)
• Rank the genes according to their ‘difference score’
• Set a cut off (d0) and consider all genes above it as
differential (Nd)
• Permute the condition labels, and count how many
genes got score above d0 (Np)
• Repeat on many (all possible) permutations and
count (Npj)
• estimate FDR as the proportion: Average(Npj)/Nd
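The scheme above can be sketched in a few lines. This is an illustration under stated assumptions, not the SAM software itself: the score is a t-like statistic with a small constant `s0` added to the denominator to keep it nonzero, genes are counted when the absolute score exceeds d0, and every way of relabeling the columns is enumerated, which is only feasible for a handful of samples. All function names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def diff_score(control, treatment, s0=1e-9):
    """t-like difference score; s0 is an assumed stabilizing constant."""
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    return (mean(treatment) - mean(control)) / (se + s0)


def count_above(matrix, control_cols, d0):
    """Count genes with |score| > d0 when control_cols are labeled as control."""
    control_set = set(control_cols)
    hits = 0
    for row in matrix:
        control = [x for j, x in enumerate(row) if j in control_set]
        treatment = [x for j, x in enumerate(row) if j not in control_set]
        if abs(diff_score(control, treatment)) > d0:
            hits += 1
    return hits


def sam_fdr(matrix, n_control, d0):
    """Return (Nd, average Np over all relabelings, estimated FDR).

    The first n_control columns of each row are the real control samples.
    Each group needs at least two samples for the variance estimates.
    """
    n_samples = len(matrix[0])
    n_d = count_above(matrix, range(n_control), d0)
    perm_counts = [
        count_above(matrix, cols, d0)
        for cols in combinations(range(n_samples), n_control)
    ]
    avg_np = sum(perm_counts) / len(perm_counts)
    fdr = avg_np / n_d if n_d else 0.0
    return n_d, avg_np, fdr


matrix = [
    [90, 100, 110, 190, 200, 210],   # g1
    [50, 100, 150, 100, 150, 350],   # g2
]
print(sam_fdr(matrix, n_control=3, d0=5.0))
```

With larger designs, sampling a random subset of relabelings is a common substitute for full enumeration.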
Permutation on condition
labels
        expression values (8 samples)       D score   scores under permuted labels
G1      e11 e12 e13 e14 e15 e16 e17 e18     d1        d1p1  d1p2
G2      e21 e22 e23 e24 e25 e26 e27 e28     d2        d2p1  d2p2
G3      e31 e32 e33 e34 e35 e36 e37 e38     d3        d3p1  d3p2
SAM example
• Ionizing radiation response experiment
• After setting the threshold:
• 46 genes found significant
• 36 permutations
• 8.4 genes on average pass the threshold
• False discovery rate is 18%
Mann-Whitney/Wilcoxon
• In general, the normality assumption of the t-test
is problematic
• Nonparametric statistics are very useful in
many bioinformatics-related problems
• Assume nothing about the distribution of
the samples
• Less powerful (more false negatives, but
fewer false positives)
Mann-Whitney/Wilcoxon
• MW/Wilcoxon test for two samples:
• H0 – The medians of both distributions are
the same
• H1 – The medians of the distributions are
different
• Assumes:
• The two samples are independent
• The observations can be ordered (ordinal)
Mann-Whitney/Wilcoxon
• Computes a U-score whose distribution is
known under H0 (& can be approximated
by normal distribution in large samples)
• Arrange all the observations into a single
ranked series
• Add up the ranks in sample 1. The sum of
ranks in sample 2 follows by calculation, since
the sum of all the ranks equals N(N+1)/2
• U-score:
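The slide's own formula is not reproduced in this transcript. The sketch below follows the ranking steps above and assumes the standard definition U1 = R1 − n1(n1+1)/2, where R1 is the rank sum of sample 1, with tied observations given the average of their ranks. The function name is illustrative.

```python
def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) from the rank sums of two independent samples."""
    pooled = sorted(list(sample1) + list(sample2))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy 1-based ranks i+1 .. j; assign their average.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    n1, n2 = len(sample1), len(sample2)
    r1 = sum(rank_of[x] for x in sample1)   # rank sum of sample 1
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1                       # U1 + U2 always equals n1 * n2
    return u1, u2


print(mann_whitney_u([90, 100, 110], [190, 200, 210]))
print(mann_whitney_u([50, 100, 150], [100, 150, 350]))
```

A U value of 0 means the two samples do not overlap at all, which is the most extreme result possible for the given sample sizes.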