Microarrays:
Common Analysis Approaches
Outline
• Missing Value Estimation
• Differentially Expressed Genes
• Clustering Algorithms
• Principal Components Analysis
Missing Data: Outline
• Missing data problem, basic concepts and terminology
• Classes of procedures
  - Case deletion
  - Single imputation
    - Filling with zeroes
    - Row averaging
    - SVD imputation
    - KNN imputation
  - Multiple imputation
The Missing Data Problem
Causes for missing data
• Low resolution
• Image corruption
• Dust/scratched slides
• Missing measurements
Why estimate missing values?
• Many algorithms cannot deal with missing values
  - Distance measure-dependent algorithms (e.g., clustering, similarity searches)
Basic concepts and terminology
Statistical overview
[Figure: Population of complete data (θ), sampled to give a Sample of complete data (θs), which passes through the missing data mechanism to give a Sample of incomplete data (θi)]
Need to estimate θ from the incomplete data and investigate its performance over repetitions of the sampling procedure
Basic concepts
Y = sample data
f(Y;θ) = distribution of sample data
θ = parameters to be estimated
R = indicators, whether elements of Y are observed or missing
g(R|Y) = missing data mechanism (maybe with other params)
Y = (Yobs, Ymis)
Yobs = observed part of Y
Ymis = missing part of Y
Goal:
Propose methods to estimate θ from Yobs and accurately
assess its error
Basic concepts (cont.)
Classes of mechanisms (cf. Rubin, 1976, Biometrika)
• Missing Completely At Random (MCAR)
  - g(R|Y) does not depend on Y
• Missing At Random (MAR)
  - g(R|Y) may depend on Yobs but not on Ymis
• Missing Not At Random (MNAR)
  - g(R|Y) depends on Ymis
Example
Suppose we measure age and income of a collection of
individuals…
• MCAR
• The dog ate the response sheets!
• MAR
• Probability that the income measurement is missing
varies according to the age but not income
• MNAR
• Probability that an income is recorded varies according to the income level within each age group
Note: we can disprove MCAR by examining the data, but
we cannot disprove MAR or MNAR.
Classes of procedures: Case Deletion
• Remove subjects with missing
values on any item needed for
analysis
     Y1   Y2   Y3
1    1    3    4
2    5    ?    1
3    4    4    ?
4    1    2    3
Advantages
• Easy
• Valid analysis under MCAR
• OK if proportion of missing cases is small and they are not
overly influential
Disadvantages
• Can be inefficient, may discard a very high proportion of
cases (5669 out of 6178 rows discarded in Spellman
yeast data)
• May introduce substantial bias, if missing data are not
MCAR (complete cases may be un-representative of the
population)
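Case deletion on the small table above can be sketched in a few lines. This is a minimal illustration, assuming missing values ("?") are stored as `None`; the names `rows` and `case_deletion` are hypothetical.

```python
# Toy table from the slide; None marks a missing value ("?").
rows = {
    1: [1, 3, 4],
    2: [5, None, 1],
    3: [4, 4, None],
    4: [1, 2, 3],
}

def case_deletion(table):
    """Keep only the rows in which every value was observed."""
    return {key: values for key, values in table.items()
            if all(v is not None for v in values)}

complete = case_deletion(rows)
print(complete)  # only rows 1 and 4 remain
```

Half of this toy table is discarded, which is the inefficiency noted under Disadvantages.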
Classes of procedures: Single Imputation (I)
Replace with zeroes
• Fill-in all missing values with
zeroes
     Y1   Y2   Y3
1    1    3    4
2    5    0    1
3    4    4    0
4    1    2    3
Advantages
• Easy
Disadvantages
• Distorts the data disproportionately (changes statistical
properties)
• May introduce bias
• Why zero?
Classes of procedures: Single Imputation (II)
Row averaging
• Replace missing values by the
row average for that row
     Y1   Y2     Y3
1    1    3      4
2    5    2.67   1
3    4    4      3.67
4    1    2      3
Advantages
• Easy
• Keeps same mean
Disadvantages
• Distorts distributions and relationships between variables
[Figure: scatter plot of data points]
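Row averaging can be sketched as follows. This is a minimal illustration on a hypothetical row (not the slide's table), assuming missing values are stored as `None`; the function name `row_average_impute` is made up for the example.

```python
def row_average_impute(row):
    """Replace each missing entry (None) with the mean of the observed entries of the same row."""
    observed = [v for v in row if v is not None]
    if not observed:
        return list(row)  # nothing observed, nothing to average
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in row]

# Hypothetical expression row with one missing measurement.
row = [2.0, None, 4.0, 6.0]
filled = row_average_impute(row)
print(filled)

# The row mean is unchanged by the imputation ("keeps same mean").
observed_mean = sum(v for v in row if v is not None) / 3
filled_mean = sum(filled) / len(filled)
```

The imputed value sits exactly at the row mean, so the mean is preserved while the spread of the row shrinks.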
Classes of procedures: Single Imputation (III)
“Hot deck” imputation
• Replace each missing value by a randomly drawn observed value
     Y1   Y2   Y3
1    1    3    4
2    5    1    1
3    4    4    2
4    1    2    3
Advantages
• Easy
• Preserves distributions very well
Disadvantages
• May distort relationships
• Can use, e.g., “similar” rows to draw random values from (to help constrain distortion)
  - Depends on definition of “similar”
Classes of procedures: Single Imputation (IV)
Regression imputation
• Fit regression to observed values, use it to obtain
predictions for missing ones
• SVD imputation
• Fill missing entries with regressed values from a set of
characteristic patterns, using coefficients determined
by the proximity of the missing row to the patterns
• KNN imputation (more later)
• Isolate rows whose values are similar to those of the
one with missing values (choosing (i) similarity
measure, and (ii) size of this set)
• Fill missing values with averages from this set of genes, with weights inversely proportional to distance (i.e., more similar genes get larger weights)
• Computationally intensive
• May distort relationships between variables (could use Ŷ + random residual)
Classes of procedures: Multiple Imputation
Main Idea
• Replace Ymis by M > 1 independent draws
• $\{Y_{mis}^{1}, \ldots, Y_{mis}^{M}\} \sim P(Y_{mis} \mid Y_{obs})$
• Produce M different versions of complete data
• Analyse each one in same fashion and combine results
at the end, with standard error estimates (Rubin, 1987)
• More difficult to implement
• Requires (initially) more computations
• More work involved in interpreting results
KNN Imputation
• Troyanskaya et al., Bioinformatics, 2001
The Algorithm
0. Given gene A with a missing value in experiment 1
1. Find K other genes with values present in
experiment 1, with expression most similar to A in
other experiments
2. Weighted average of values in experiment 1 from
the K closest genes is used as an estimate for the
missing value in A
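The steps above can be sketched as follows. This is a simplified illustration, not the published KNNimpute code: it assumes missing values are stored as `None`, uses a root-mean-square Euclidean distance over the experiments both genes have in common, and weights neighbours by inverse distance. The function name `knn_impute` and the toy matrix are hypothetical.

```python
import math

def knn_impute(matrix, gene, column, k=2):
    """Estimate matrix[gene][column] from the k most similar genes.

    matrix: list of rows (genes); None marks a missing value.
    """
    target = matrix[gene]
    candidates = []
    for idx, row in enumerate(matrix):
        if idx == gene or row[column] is None:
            continue  # a neighbour must have a value in the experiment of interest
        # Compare over the other experiments observed in both genes.
        shared = [j for j in range(len(row))
                  if j != column and row[j] is not None and target[j] is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((row[j] - target[j]) ** 2 for j in shared) / len(shared))
        candidates.append((dist, row[column]))
    candidates.sort(key=lambda pair: pair[0])
    neighbours = candidates[:k]
    if not neighbours:
        raise ValueError("no gene has a value in this experiment")
    # Closer neighbours get larger weights; the small constant avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in neighbours]
    return sum(w * v for w, (_, v) in zip(weights, neighbours)) / sum(weights)

expr = [
    [None, 1.0, 2.0, 3.0],     # gene A, missing in experiment 1
    [1.0, 1.0, 2.0, 3.0],      # identical to A in the other experiments
    [2.0, 1.5, 2.5, 3.5],      # close to A
    [-5.0, -1.0, -2.0, -3.0],  # far from A
]
estimate = knn_impute(expr, gene=0, column=0, k=2)
print(round(estimate, 3))
```

Because the second gene matches gene A exactly in the other experiments, it dominates the weighted average and the estimate lands very close to its value.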
KNN Imputation: Considerations
• K – the number of nearest neighbours
• Method appears to be relatively insensitive to K within
the range 10-20
• Distance metric to be used for computing gene similarity
• Troyanskaya: “Euclidean is sufficient”
• No clear comparison or reason – would expect that
metric to be used depends on the type of experiment
• Not recommended on matrices with fewer than four columns
• Computationally intensive!
• ~O(m²n) for m rows (genes) and n columns (experiments)
• “3.23 minutes on a Pentium III 500 MHz for 6153 genes,
14 experiments with 10% of the entries missing”
KNN Imputation: Expression Profiler
Identifying Differentially Expressed Genes
[Slides courtesy of John Quackenbush, TIGR]
Two vs. Multiple conditions
• Two conditions
- t-test
- Significance analysis of microarrays (SAM)
- Volcano Plots
- ANOVA
• Multiple conditions
- Clustering
- K-means
- PCA
How Many Replicates??
$n = \dfrac{4\,(z_{\alpha/2} + z_{\beta})^2}{\left(\dfrac{\delta}{1.4\,\sigma}\right)^2}$
where $z_{\alpha/2}$ and $z_{\beta}$ are normal percentile values at
• false positive rate α (Type I error rate)
• false negative rate β (Type II error rate),
δ represents the minimum detectable log2 ratio,
and σ represents the SD of log ratio values.
For α = 0.001 and β = 0.05, get $z_{\alpha/2}$ = -3.29 and $z_{\beta}$ = -1.65.
Assume δ = 1.0 (2-fold change) and σ = 0.25,
⇒ n = 12 samples (6 query and 6 control)
(Simon et al., Genetic Epidemiology 23: 21-36, 2002)
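The worked example can be checked numerically. This is a small sketch using Python's standard library; the function name `replicates_needed` is hypothetical. It uses the positive percentile values, which give the same result as the negative ones quoted above because the sum is squared.

```python
import math
from statistics import NormalDist

def replicates_needed(alpha, beta, delta, sigma):
    """n = 4 (z_{alpha/2} + z_beta)^2 / (delta / (1.4 sigma))^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 3.29 for alpha = 0.001
    z_beta = NormalDist().inv_cdf(1 - beta)        # about 1.65 for beta = 0.05
    return 4 * (z_alpha + z_beta) ** 2 / (delta / (1.4 * sigma)) ** 2

n = replicates_needed(alpha=0.001, beta=0.05, delta=1.0, sigma=0.25)
print(math.ceil(n))  # rounds up to 12 samples
```

Halving the detectable log2 ratio δ multiplies the required n by four, since δ enters the formula squared.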
Some Concepts from
Statistics
Probability Distributions
The probability of an event is the likelihood of its occurring.
It is sometimes computed as a relative frequency (rf), where
$rf = \dfrac{\text{the number of “favorable” outcomes for an event}}{\text{the total number of possible outcomes for that event}}$
The probability of an event can sometimes be inferred from a
“theoretical” probability distribution, such as a normal distribution.
Normal Distribution
[Figure: normal curve centred at X = μ (mean of the distribution), with σ = standard deviation of the distribution]
[Figure: two normal curves, Population 1 (Mean 1) and Population 2 (Mean 2), with a sample mean “s” marked]
Less than a 5 % chance that the sample with mean s came from Population 1
s is significantly different from Mean 1 at the p < 0.05 significance level.
But we cannot reject the hypothesis that the sample came from Population 2
Probability and Expression Data
• Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.
• But expression measurements? Probably not.
• Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests.
• Randomization / resampling based tests can be used to get around the violation of the normality assumption.
• Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.
Outline of a Randomisation Test
1. Compute the value of interest (i.e., the test-statistic s) from your original data set.
2. Make “fake” data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion. Re-compute s from each “fake” data set.
[Figure: the original data set yields s; each randomized “fake” data set yields a “fake” s]
Outline of a Randomisation Test (II)
3. Repeat step 2 many times (often several hundred to several thousand times) and record the “fake” s values from step 2
4. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized (“fake”) s values
[Figure: range of randomized s values; the original s value could be significant as it exceeds most of the randomized s values]
Outline of a Randomisation Test (III)
• Rationale
• Ideally, we want to know the “behavior” of the larger population
from which the sample is drawn, in order to make statistical
inferences.
• Here, we don’t know that the larger population “behaves” like a
normal distribution, or some other idealized distribution. All we have
to work with are the data in hand.
• Our “fake” data sets are our best guess about this behavior (i.e., if
we had been pulling data at random from an infinitely large
population, we might expect to get a distribution similar to what we
get by pulling random sub-samples, or by reshuffling the order of
the data in our sample)
The Problem of Multiple Testing (I)
• Let’s imagine there are 10,000 genes on a chip, and
• none of them is differentially expressed.
• Suppose we use a statistical test for differential expression,
where we consider a gene to be differentially expressed if it
meets the criterion at a p-value of p < 0.05.
The Problem of Multiple Testing (II)
• Let’s say that applying this test to gene “G1” yields a p-value of p = 0.01
• Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, a result at least this extreme would turn up by chance only 1% of the time, i.e.,
• Even though we conclude that the gene is differentially expressed (because p < 0.05), a gene that is not differentially expressed would still produce such a result 1% of the time, so our conclusion may be wrong.
• We might be willing to live with such a low probability of being wrong
• BUT .....
The Problem of Multiple Testing (III)
• We are testing 10,000 genes, not just one!!!
• Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to “live with” a p-value of 0.05
• If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn’t sound too good.
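The figure of about 500 false positives can be reproduced with a small simulation. This sketch assumes that, when no gene is differentially expressed, each gene's p-value is uniformly distributed between 0 and 1; the seed and variable names are arbitrary.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
n_genes, alpha = 10_000, 0.05

# Under the null hypothesis every p-value is a uniform draw on [0, 1).
p_values = [rng.random() for _ in range(n_genes)]
false_positives = sum(p < alpha for p in p_values)

expected = n_genes * alpha      # 500 genes
print(false_positives, expected)
```

The simulated count fluctuates around 500 from run to run, but it is never close to zero.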
The Problem of Multiple Testing (IV)
• There are “tricks” we can use to reduce the severity of this problem.
• They all involve “slashing” the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.
• We’ll go into some of these techniques later.
The Problem of Multiple Testing (V)
• Don’t get too hung up on p-values.
• Ultimately, what matters is biological relevance.
• P-values should help you evaluate the strength of the
evidence, rather than being used as an absolute yardstick of
significance.
• Statistical significance is not necessarily the same as
biological significance.
Finding Significant Genes
• Assume we will compare two conditions with multiple
replicates for each class
• Our goal is to find genes that are significantly different
between these classes
• These are the genes that we will use for later data
mining
Finding Significant Genes (II)
• Average Fold Change Difference for each gene
• suffers from being arbitrary and not taking into
account systematic variation in the data
Finding Significant Genes (III)
• t-test for each gene
• Tests whether the means of the query and reference groups are the same
• Essentially measures signal-to-noise
• Calculate p-value (permutations or distributions)
• May suffer from intensity-dependent effects
t = signal / noise = (difference between means) / (variability of groups) = (<Xq> – <Xc>) / SE(Xq – Xc)
$t = \dfrac{\bar{X}_q - \bar{X}_c}{\sqrt{\dfrac{s_q^2}{n_q} + \dfrac{s_c^2}{n_c}}}$
T-Tests
[Figure: two examples of group distributions, labelled “A significant difference” and “Probably not”]
T-Tests (I)
1. Assign experiments to two groups, e.g., in the expression
matrix below, assign Experiments 1, 2 and 5 to group A,
and experiments 3, 4 and 6 to group B.
[Figure: expression matrix of Gene 1 to Gene 6 across Exp 1 to Exp 6, split into Group A (Exp 1, Exp 2, Exp 5) and Group B (Exp 3, Exp 4, Exp 6)]
2. Question: Is mean expression level of a gene in group A
significantly different from mean expression level in group B?
T-Tests (II)
3. Calculate t-statistic for each gene
4. Calculate probability value of the t-statistic for each
gene either from:
A. Theoretical t-distribution
OR
B. Permutation tests.
T-Tests (III)
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene between groups A
and B, such that the reshuffled groups A and B respectively
have the same number of elements as the original groups A
and B.
[Figure: Gene 1. Original grouping: Group A = Exp 1, Exp 2, Exp 5; Group B = Exp 3, Exp 4, Exp 6. Randomized grouping: Group A = Exp 3, Exp 2, Exp 6; Group B = Exp 4, Exp 5, Exp 1]
T-Tests (IV)
Permutation tests - continued
iii) Compute t-statistic for the randomized gene
iv) Repeat steps ii-iii n times (where n is specified by the user).
v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute value of the randomized t-statistic over n randomizations.
vi) Then, the p-value associated with the gene = 1 – (x/n)
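Steps i to vi can be sketched for a single gene as follows. This is a minimal illustration using the unequal-variance t-statistic shown earlier; the function names, the seed and the example values are hypothetical, and the data are assumed to contain enough distinct values that the standard error is never zero.

```python
import math
import random
from statistics import mean, variance

def t_statistic(a, b):
    """Difference between group means divided by its standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def permutation_p_value(a, b, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))           # step i
    pooled = list(a) + list(b)
    x = 0  # times the original |t| exceeds the randomized |t|
    for _ in range(n_perm):                     # step iv
        rng.shuffle(pooled)                     # step ii: reshuffle, keeping group sizes
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if observed > abs(t_statistic(perm_a, perm_b)):  # steps iii and v
            x += 1
    return 1 - x / n_perm                       # step vi

# Hypothetical log ratios for one gene in two groups of five experiments.
group_a = [2.1, 2.4, 1.9, 2.3, 2.2]
group_b = [0.1, -0.2, 0.3, 0.0, 0.2]
p = permutation_p_value(group_a, group_b)
print(p)
```

With only five experiments per group there are 252 distinct ways to split the ten values, so the smallest p-value this gene can reach is limited by the design, however large n is.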
T-Tests (V)
5. Determine whether a gene’s expression levels are significantly
different between the two groups by one of three methods:
A) “Just alpha” (α significance level): If the calculated p-value for a gene is less than or equal to the user-input α (critical p-value), the gene is considered significant.
OR
Use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant.
B) Standard Bonferroni correction: The user-input alpha is divided by the total number of genes to give a critical p-value that is used as above –> pcritical = α/N.
T-Tests (VI)
5C) Adjusted Bonferroni:
i) The t-values for all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the critical p-value becomes (α/N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (α/[N-1]), and so on.
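The two corrections can be compared on a handful of p-values. This is a sketch under two assumptions: genes are ranked by ascending p-value (equivalent to descending t-value when all genes share one null distribution), and the adjusted procedure stops at the first gene that fails its threshold, as in Holm's step-down method. The slides do not state a stopping rule, and the function names and p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Standard Bonferroni: every gene is tested at alpha / N."""
    critical = alpha / len(p_values)
    return [p <= critical for p in p_values]

def adjusted_bonferroni(p_values, alpha=0.05):
    """Step-down version: alpha/N for the top gene, alpha/(N-1) for the next, and so on."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # most significant first
    significant = [False] * n
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (n - rank):
            significant[i] = True
        else:
            break  # assumption: stop at the first failure (Holm-style)
    return significant

p = [0.001, 0.012, 0.02, 0.04, 0.30]
standard = bonferroni(p)            # critical p-value 0.01 for all five genes
adjusted = adjusted_bonferroni(p)   # critical p-values 0.01, 0.0125, 0.0167, ...
print(standard)
print(adjusted)
```

The second gene (p = 0.012) fails the standard threshold of 0.01 but passes its adjusted threshold of 0.0125, which shows that the adjusted version is slightly less conservative.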
Finding Significant Genes (IV)
• Significance Analysis of Microarrays (SAM)
- Uses a modified t-test by estimating and adding a small
positive constant to the denominator
- Significant genes are those which exceed the expected
values from permutation analysis.
SAM
• SAM can be used to select significant genes based on
differential expression between sets of conditions
• Currently implemented for two-class unpaired design –
i.e., we can select genes whose mean expression level
is significantly different between two groups of samples
(analogous to t-test).
• Stanford University, Rob Tibshirani
http://www-stat.stanford.edu/~tibs/SAM/index.html
SAM
• SAM gives estimates of the False Discovery Rate
(FDR), which is the proportion of genes likely to have
been wrongly identified by chance as being significant.
• It is a very interactive algorithm – allows users to
dynamically change thresholds for significance (through
the tuning parameter delta) after looking at the
distribution of the test statistic.
• The ability to dynamically alter the input parameters
based on immediate visual feedback, even before
completing the analysis, should make the data-mining
process more sensitive.
SAM Two-class
1. Assign experiments to two groups
- in the expression matrix below:
Experiments 1, 2 and 5 to group A
Experiments 3, 4 and 6 to group B
[Figure: expression matrix of Gene 1 to Gene 6 across Exp 1 to Exp 6, split into Group A (Exp 1, Exp 2, Exp 5) and Group B (Exp 3, Exp 4, Exp 6)]
2. Question: Is mean expression level of a gene in group A
significantly different from mean expression level in group B?
SAM Two-class
Permutation tests
i) For each gene, compute d-value (analogous to t-statistic).
This is the observed d-value for that gene.
ii) Randomly shuffle the values of the gene between groups A
and B, such that the reshuffled groups A and B have the
same number of elements as the original groups A and B.
Compute the d-value for each randomized gene
[Figure: Gene 1. Original grouping: Group A = Exp 1, Exp 2, Exp 5; Group B = Exp 3, Exp 4, Exp 6. Randomized grouping: Group A = Exp 3, Exp 2, Exp 6; Group B = Exp 4, Exp 5, Exp 1]
SAM Two-class
• Repeat step (ii) many times, so that each gene has
many randomized d-values. Take the average of the
randomized d-values for each gene. This is the
expected d-value of that gene.
• Plot the observed d-values vs. the expected d-values
SAM Two-class
[Figure: SAM plot of observed d-values vs. expected d-values, annotated as follows]
• “Observed d = expected d” line
• Significant positive genes (mean expression of group B > mean expression of group A) in red
• Significant negative genes (mean expression of group A > mean expression of group B) in green
• Tuning parameter “delta” limits, can be dynamically changed by using the slider bar or entering a value in the text field.
The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or –ve direction on the x-axis (including the first gene), whose observed exceeds the expected by at least delta, is considered significant.
SAM Two-class
• For each permutation of the data, compute the number of positive and negative significant genes for a given delta. The median number of significant genes from these permutations estimates the number of false discoveries; dividing it by the number of genes called significant in the original data gives the median False Discovery Rate.
• The rationale:
Any genes designated as significant from the randomized data are being picked up purely by chance (i.e., “falsely” discovered). Therefore, the median number picked up over many randomisations is a good estimate of the number of false discoveries.
Finding Significant Genes (V)
Volcano Plots
• Effect vs. Significance
• Selections of items that have both a large effect and are highly significant can be identified easily.
[Figure: volcano plot of effect (x axis, from -ve effect to +ve effect) against significance (y axis, p); regions labelled “High Effect & Significance” and “Boring stuff”]
Volcano Plots
[Figure: volcano plot axes. Y axis uses log10: p < 0.1 (1 decimal place), p < 0.01 (2 decimal places). X axis uses log2.]
Volcano Plots (II)
[Figure: volcano plot axes. Y axis uses log10. X axis uses log2, with marks at 2^0.5 (2 raised to the power of 0.5) and 2^1 (2 raised to the power of 1, a two-fold change); annotations “Effect has doubled” and “Effect has halved”.]
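The two axis transformations can be sketched as a small helper. This is an illustration only: it assumes the common convention of plotting the negative log10 of the p-value, so that more significant genes sit higher, and the function name and input values are hypothetical.

```python
import math

def volcano_point(mean_query, mean_control, p_value):
    """Return (x, y) for one gene: log2 fold change and -log10 p-value."""
    effect = math.log2(mean_query / mean_control)  # +1 means doubled, -1 means halved
    significance = -math.log10(p_value)            # 2 corresponds to p = 0.01
    return effect, significance

doubled = volcano_point(200.0, 100.0, 0.01)
halved = volcano_point(50.0, 100.0, 0.1)
print(doubled, halved)
```

On the log2 scale a doubling and a halving are equally far from zero, which is what makes the plot symmetric around the “no effect” line.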
Finding Significant Genes (VI)
• Analysis of Variance (ANOVA)
- Which genes are most significant for separating
classes of samples?
- Calculate p-value (permutations or distributions)
- Reduces to a t-test for 2 samples
- May suffer from intensity-dependent effects
Multiple Conditions/Experiments
• Goal is to identify genes (or conditions) which have
“similar” patterns of expression
• This is a problem in data mining
• “Clustering Algorithms” are most widely used
• All depend on how one measures distance
Pattern analysis
• Supervised Learning
• Unsupervised Learning
  - Hierarchical
    - Agglomerative: Single linkage, Complete linkage, Average linkage
    - Divisive
  - Non-hierarchical
    - K-means
    - SOMs
Expression Vectors
• Each gene is represented by a vector where coordinates
are its values log(ratio) in each experiment
- x = log(ratio)exp1
- y = log(ratio)exp2
- z = log(ratio)exp3
- etc.
• For example, if we do six experiments,
- Gene1 = (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
- Gene2 = (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
- Gene3 = (1.2, 0.5, 0, -0.25, -0.75, -1.4)
- etc.
Expression Matrix
• These gene expression vectors of log(ratio) values can be used to construct an expression matrix
         Exp 1   Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Gene1    -1.2    -0.5    0       0.25    0.75    1.4
Gene2    0.2     -0.5    1.2     -0.25   -1.0    1.5
Gene3    1.2     0.5     0       -0.25   -0.75   -1.4
• This is often represented as a red/green colored matrix
Expression Matrix
The Expression Matrix is a representation of data from multiple microarray experiments.
[Figure: colored expression matrix, rows Gene 1 to Gene 6, columns Exp 1 to Exp 6]
• Each element is a log ratio, usually log2(Cy5/Cy3)
• Black indicates a log ratio of zero (Cy5 ≈ Cy3)
• Green indicates a negative log ratio (Cy5 < Cy3)
• Red indicates a positive log ratio (Cy5 > Cy3)
• Gray indicates missing data
Expression Vectors as points in “Expression Space”
[Figure: genes plotted as points on axes x = Experiment 1, y = Experiment 2, z = Experiment 3; nearby points have similar expression]
Distance measures
• Distances are measured “between” expression vectors
• Distance measures define the way we measure distances
• Many different ways to measure distance:
- Euclidean distance
- Manhattan distance
- Pearson correlation
- Spearman correlation
- etc.
• Each has different properties and can reveal different
features of the data
Euclidean distance
• Measures the 'as-the-crow-flies' distance
• Deriving the Euclidean distance between two data points involves computing the square root of the sum of the squares of the differences between corresponding values (Pythagoras' theorem)
$D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan distance
• Computes the distance that would be traveled to get from
one data point to the other if a grid-like path is followed
• Manhattan distance between two items is the sum of the absolute differences of their corresponding components
$D = \sum_{i=1}^{n} |x_i - y_i|$
Pearson and Pearson squared
• Pearson Correlation measures the similarity in shape between two profiles
• Pearson Squared distance measures the similarity in shape between two profiles, but can also capture inverse relationships
[Figure: two panels of expression profiles plotted across samples]
$D = 1 - (Z(x) \cdot Z(y) / n)$
$D = 1 - 2(Z(x) \cdot Z(y) / n)$
Spearman Rank Correlation
• Spearman Rank Correlation measures the correlation
between two sequences of values.
• The two sequences are ranked separately and the
differences in rank are calculated at each position, i.
• Use Spearman Correlation to cluster together genes
whose expression profiles have similar shapes or show
similar general trends, but whose expression levels may
be very different
$D = 1 - \dfrac{6 \sum_{i=1}^{n} \big(\mathrm{rank}(x_i) - \mathrm{rank}(y_i)\big)^2}{n(n^2 - 1)}$
where $x_i$ and $y_i$ are the ith values of sequences X and Y respectively
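The measures above can be sketched in plain Python and tried on Gene1 and Gene3 from the expression vector example. The function names are hypothetical, the rank helper assumes no tied values, and `spearman` returns the value of the rank formula exactly as written above.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1 (assumption: all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """1 - 6 * sum(rank differences squared) / (n (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

gene1 = [-1.2, -0.5, 0, 0.25, 0.75, 1.4]
gene3 = [1.2, 0.5, 0, -0.25, -0.75, -1.4]
print(euclidean(gene1, gene3), manhattan(gene1, gene3))
print(pearson(gene1, gene3), spearman(gene1, gene3))
```

Gene3 is the mirror image of Gene1, so both correlation measures report a perfect inverse relationship while the Euclidean and Manhattan distances simply report that the two profiles are far apart.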
Distance Matrix
• Once a distance metric has been selected, the starting point for all clustering methods is a “distance matrix”
         Gene1   Gene2   Gene3   Gene4   Gene5   Gene6
Gene1    0       1.5     1.2     0.25    0.75    1.4
Gene2    1.5     0       1.3     0.55    2.0     1.5
Gene3    1.2     1.3     0       1.3     0.75    0.3
Gene4    0.25    0.55    1.3     0       0.25    0.4
Gene5    0.75    2.0     0.75    0.25    0       1.2
Gene6    1.4     1.5     0.3     0.4     1.2     0
• The elements of this matrix are the pair-wise distances (the matrix is symmetric around the diagonal)
Hierarchical Clustering
1. Calculate the distance between all genes. Find the smallest
distance. If several pairs share the same similarity, use a
predetermined rule to decide between alternatives.
2. Fuse the two selected clusters to produce a new cluster that now
contains at least two objects. Calculate the distance between the
new cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single cluster remains.
4. Draw a tree representing the results.
[Figure: genes G1 to G6 fused step by step into a tree]
Hierarchical Clustering
Start: G1 G2 G3 G4 G5 G6 G7 G8
• G1 is most like G8: {G1, G8} G2 G3 G4 G5 G6 G7
• G4 is most like {G1, G8}: {G1, G8, G4} G2 G3 G5 G6 G7
• G5 is most like G7: {G1, G8, G4} G2 G3 {G5, G7} G6
• {G5, G7} is most like {G1, G4, G8}: {G1, G8, G4, G5, G7} G2 G3 G6
Hierarchical Tree
[Figure: dendrogram with leaf order G1 G8 G4 G5 G7 G2 G3 G6]
Agglomerative Linkage Methods
• Linkage methods are rules that determine which
elements (clusters) should be linked.
• Three linkage methods that are commonly used:
- Single Linkage
- Average Linkage
- Complete Linkage
Single Linkage
Cluster-to-cluster distance is defined as the minimum
distance between members of one cluster and
members of another cluster. Single linkage tends to
create ‘elongated’ clusters with individual genes chained
onto clusters.
$D_{AB} = \min\big(d(u_i, v_j)\big)$
where $u_i \in A$ and $v_j \in B$
for all i = 1 to $N_A$ and j = 1 to $N_B$
[Figure: clusters A and B with the distance D_AB marked]
Average Linkage
Cluster-to-cluster distance is defined as the average
distance between all members of one cluster and all
members of another cluster. Average linkage has a slight
tendency to produce clusters of similar variance.
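$D_{AB} = \dfrac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$
where $u_i \in A$ and $v_j \in B$
for all i = 1 to $N_A$ and j = 1 to $N_B$
[Figure: clusters A and B with the distance D_AB marked]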
Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$D_{AB} = \max\big(d(u_i, v_j)\big)$
where $u_i \in A$ and $v_j \in B$
for all i = 1 to $N_A$ and j = 1 to $N_B$
[Figure: clusters A and B with the distance D_AB marked]
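The agglomerative procedure and the three linkage rules can be combined in one short sketch. This is a naive illustration, assuming a precomputed symmetric distance matrix and the tie rule “first pair found wins”; the function name `agglomerate` and the four example points are hypothetical.

```python
from statistics import mean

def agglomerate(dist, linkage=min):
    """Naive agglomerative clustering on a symmetric distance matrix.

    linkage: min (single), max (complete) or statistics.mean (average).
    Returns the merges as (cluster_a, cluster_b, distance), in order.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage([dist[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:  # ties: the first pair found wins
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Four hypothetical genes placed on a line, with absolute difference as distance.
points = [0.0, 1.0, 5.0, 7.0]
dist = [[abs(p - q) for q in points] for p in points]

single = [m[2] for m in agglomerate(dist, min)]
complete = [m[2] for m in agglomerate(dist, max)]
average = [m[2] for m in agglomerate(dist, mean)]
print(single, complete, average)
```

All three rules merge the same pairs here and differ only in the height of the final merge, which is where the linkage choice changes the shape of the tree.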
Comparison of Linkage Methods
[Figure: trees produced by Single, Average and Complete linkage]
K-Means/Medians Clustering
1. Specify number of clusters, e.g., 5
2. Randomly assign genes to clusters
[Figure: genes G1 to G13 assigned at random to clusters]
3. Calculate mean/median expression profile of each cluster
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean expression profile (calculated in step 3) is the closest to that gene’s expression profile
[Figure: genes G1 to G13 regrouped by closest cluster profile]
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached
K-Means is most useful when the user has an a priori hypothesis about the
number of clusters the genes should group into.
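Steps 1 to 5 can be sketched as follows. This is a minimal illustration using squared Euclidean distance and cluster means; the optional `initial_labels` argument, the re-seeding of empty clusters and the example profiles are assumptions made so that the run is reproducible, not part of the method as described.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(profiles, k, initial_labels=None, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: random assignment, unless a starting assignment is supplied.
    labels = (list(initial_labels) if initial_labels is not None
              else [rng.randrange(k) for _ in profiles])
    centroids = []
    for _ in range(max_iter):
        # Step 3: mean expression profile of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if not members:
                members = [rng.choice(profiles)]  # assumption: re-seed an empty cluster
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in profiles]
        if new_labels == labels:  # Step 5: stop when nothing moves
            break
        labels = new_labels
    return labels, centroids

# Six hypothetical genes measured in two experiments, forming two obvious groups.
profiles = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
labels, centroids = k_means(profiles, k=2, initial_labels=[0, 1, 0, 1, 0, 1])
print(labels)
```

Starting from a deliberately mixed assignment, one round of reassignment is enough to separate the two groups, and the next round confirms that no gene moves.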
Clustering Comparison
MOTIVATION: Using different clustering methods often produces different results. How do these clustering results relate to each other?
• Clustering comparison method that finds a many-to-many correspondence in two different clustering results.
  - comparison of two flat clusterings
  - comparison of a flat and a hierarchical clustering.
Comparison of flat clusterings
C1 = {A1, A2, A3, A4}
C2 = {B1, B2, B3, B4}
[Figure: overlapping clusters A1 to A4 and B1 to B4]
We are interested in finding:
$g : C_1 \to C_2$
where the clusters are mapped as follows:
$A_1 \to B_1 \cup B_2$
$A_2 \to B_3$
$A_3 \cup A_4 \to B_4$
Indices to measure the overlapping
• Intersection size: $I_{ij} = \mathrm{card}(A_i \cap B_j)$
• Simpson´s index: $s_{ij} = \dfrac{\mathrm{card}(A_i \cap B_j)}{\min\{\mathrm{card}(A_i), \mathrm{card}(B_j)\}}$
• Jaccard index: $J_{ij} = \dfrac{\mathrm{card}(A_i \cap B_j)}{\mathrm{card}(A_i \cup B_j)}$
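The three indices map directly onto set operations. This is a small sketch with hypothetical clusters given as sets of gene names; the function names are made up for the example.

```python
def intersection_size(a, b):
    return len(a & b)

def simpson_index(a, b):
    """Overlap relative to the smaller of the two clusters."""
    return len(a & b) / min(len(a), len(b))

def jaccard_index(a, b):
    """Overlap relative to the union of the two clusters."""
    return len(a & b) / len(a | b)

# Hypothetical clusters from two different clustering results.
A1 = {"g1", "g2", "g3", "g4"}
B1 = {"g3", "g4", "g5"}

print(intersection_size(A1, B1), simpson_index(A1, B1), jaccard_index(A1, B1))
```

Simpson's index reaches 1 whenever one cluster is contained in the other, while the Jaccard index reaches 1 only when the two clusters are identical.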
Comparison of flat and hierarchical clusterings
[Figure: dendrogram with a scale from 1 to 0]
Selecting a point to cut the dendrogram leads to s disjoint groups.
Results
ARTIFICIAL DATA: Four data sets with four clusters,
constructed with the same four seeds and different levels of noise.
• 1000 genes, 10 conditions
• d = 20 initial partitions
Visualisation in Expression Profiler
PCA
(Dimensionality Reduction Methods)
Outline
• Dimensionality Problem
• Techniques/Methods
  - Multidimensional Scaling
  - Eigenanalysis-based ordination methods
    - Principal Component Analysis (PCA)
    - Correspondence Analysis (CA)
Dimensionality problem
Problem?
• “Curse of dimensionality”
  - Convergence of any estimator to the true value of a smooth function on a space of high dimension is very slow
  - In other words, need many observations to obtain a good “estimate” of gene function
• “Blessing?”: very few things really matter
Solutions
• Statistical techniques (corrections, etc.)
• Reduce dimensionality
  - Ignore non-variable genes
  - Feature subset selection
  - Eliminate coordinates that are less relevant
Multidimensional Scaling
Idea: place data in a low-dimensional space so that “similar”
objects are close to each other.
The Algorithm (roughly)
1. Assign points to arbitrary coordinates in p-dimensional space.
2. Compute all-against-all distances, to form a matrix D’.
3. Compare D’ with the input matrix D by evaluating the stress function. The smaller the value, the greater the correspondence between the two.
4. Adjust coordinates of each point in the direction that best reduces stress.
5. Repeat steps 2 through 4 until stress won't get any lower.
However:
• Computationally intensive
• Axes are meaningless, orientation of the MDS map is
arbitrary
• Difficult to interpret
Eigenanalysis: Background
Basic Concepts
An eigenvalue and eigenvector of a square matrix A are a scalar λ
and a nonzero vector x so that
Ax = λx
Q: What is a matrix?
A: A linear transformation.
Q: What are eigenvectors?
A: Directions in which the transformation “takes place the most”
Exploratory example: EigenExplorer
Eigenanalysis: Background
Finding eigenvalues
Ax = λx
(A – λI)x = 0
Interpreting eigenvalues
• Eigenvectors of the covariance matrix provide a rigid rotation to the directions of highest variance; each eigenvalue measures the variance along its direction
• Can pick the N largest eigenvalues, capture a large proportion of the variance, and represent every row of the original matrix as a linear combination of the corresponding eigenvectors, e.g., $x_i = a_1 v_1 + \ldots + a_N v_N$
• Call this collection {aj} the eigengene/eigenarray (depending on which way we compute these)
PCA
1. PCA simplifies the “views” of the data.
2. Suppose we have measurements for each gene on multiple
experiments.
3. Suppose some of the experiments are correlated.
4. PCA will ignore the redundant experiments, and will take a weighted
average of some of the experiments, thus possibly making the trends
in the data more interpretable.
5. The components can be thought of as axes in n-dimensional space,
where n is the number of components. Each axis represents a
different trend in the data.
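The first principal component can be found with the eigenanalysis described earlier. This is a minimal sketch that builds the covariance matrix of the experiments and runs power iteration to find its leading eigenvector; it assumes the starting vector is not orthogonal to that eigenvector, and the function names and data are hypothetical. A numerical library would normally be used instead.

```python
import math

def covariance_matrix(data):
    """Sample covariance between columns (experiments); rows are genes."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(data, iterations=200):
    """Leading eigenvalue and eigenvector of the covariance matrix (power iteration)."""
    cov = covariance_matrix(data)
    d = len(cov)
    v = [1.0] * d  # assumption: not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

# Four hypothetical genes in two perfectly correlated experiments (exp2 = 2 * exp1).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
eigenvalue, direction = first_principal_component(data)
print(eigenvalue, direction)
```

Because the second experiment is redundant with the first, a single component pointing along (1, 2) carries all of the variance, which is the sense in which PCA “ignores” redundant experiments.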
PCA
[Figure: “cloud” of data points (e.g., genes) in N-dimensional space, N = # hybridizations, drawn on axes x, y, z]
[Figure: data points resolved along 3 principal component axes]
In this example:
• x-axis could mean a continuum from over- to under-expression
• y-axis could mean that “blue” genes are over-expressed in first five expts and under-expressed in the remaining expts, while “brown” genes are under-expressed in the first five expts, and over-expressed in the remaining expts.
• z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for “purple” genes.
Interpretation of components is somewhat subjective.
Principal Components pick out the directions in the data that capture the greatest variability
The “new” axes are linear combinations of the old axes, typically combinations of genes or experiments.
$x' = a_1 x + b_1 y + c_1 z$
$y' = a_2 x + b_2 y + c_2 z$
$z' = a_3 x + b_3 y + c_3 z$
Projecting the data into a lower dimensional space can help visualize relationships
[Figure: data projected onto the x’ and y’ axes]
PCA in Expression Profiler
Further Reading
• MDS
– http://www.analytictech.com/borgatti/mds.htm
• PCA, SVD
– http://www.statsoftinc.com/textbook/stfacan.html
– http://linneus20.ethz.ch:8080/2_2_1.html
– Alter et al., Singular value decomposition for genome-wide
expression data processing and modelling, PNAS, 2000
• COA
– Fellenberg et al., “Correspondence analysis applied to
microarray data”, PNAS, 2001
• General ordination
– http://www.okstate.edu/artsci/botany/ordinate/
– Legendre P. and Legendre L., Numerical Ecology, 1998