Download Making Sense of Complicated Microarray Data Part II - MGH-PGA

Making Sense of Complicated
Microarray Data
Part II
Gene Clustering and Data Analysis
Gabriel Eichler
Boston University
Some slides adapted from: MeV documentation slides
Why Cluster?

• Clustering is a process by which you can explore your data in an efficient manner.
• Visualization of data can help you review the data quality.
• Assumption: Guilt by association – similar gene expression patterns may indicate a biological relationship.
Expression Vectors
Gene Expression Vectors encapsulate the
expression of a gene over a set of
experimental conditions or sample types.
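[Figure: one expression vector shown three ways. Numeric Vector: -0.8, 1.5, 1.8, 0.5, -0.4, -1.3, 0.8, 1.5. Line Graph: y-axis from -2 to 2, conditions 1 to 8 on the x-axis. Heatmap: color scale from -2 to 2.]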
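The three views of an expression vector can be imitated in a few lines. This is only an illustrative sketch: the values are the numeric vector from the slide in the order they appear in this transcript, while the `heat_row` helper and the character ramp `SHADES` are hypothetical stand-ins for a real color scale.

```python
# Expression vector from the slide: one value per experimental condition.
vector = [-0.8, 1.5, 1.8, 0.5, -0.4, -1.3, 0.8, 1.5]

# Hypothetical 8-level "color" ramp, from low expression to high expression.
SHADES = " .:-=+*#"

def heat_row(values, lo=-2.0, hi=2.0, shades=SHADES):
    """Map each value onto the ramp, clipping to the [lo, hi] scale."""
    cells = []
    for v in values:
        clipped = min(max(v, lo), hi)
        # Scale to an index in 0 .. len(shades) - 1, rounding to nearest.
        idx = int((clipped - lo) / (hi - lo) * (len(shades) - 1) + 0.5)
        cells.append(shades[idx])
    return "".join(cells)

print("numeric:", vector)
print("heatmap: [" + heat_row(vector) + "]")
```

Each character plays the role of one heatmap cell, so a gene becomes one row and a dataset becomes a stack of such rows.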
Expression Vectors As Points in ‘Expression Space’
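[Table: expression values for genes G1 to G5 at time points t1, t2 and t3. Values as listed: -0.8, -0.4, -0.6, 0.9, 1.3, -0.3, -0.8, -0.8, 1.2, 0.9, -0.7, -0.7, -0.4, 1.3, -0.6.]
[Figure: the genes plotted as points in a space with axes Experiment 1, Experiment 2 and Experiment 3; genes with similar expression lie close together.]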
Distance and Similarity
- The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.
- Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression.
- Selection of a distance metric defines the concept of distance.
Distance: a measure of how dissimilar the expression patterns of two genes are; a small distance means similar expression.
        Exp 1  Exp 2  Exp 3  Exp 4  Exp 5  Exp 6
Gene A  x1A    x2A    x3A    x4A    x5A    x6A
Gene B  x1B    x2B    x3B    x4B    x5B    x6B
Some distances (MeV provides 11 metrics):
1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$
2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$
3. Pearson correlation
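The three metrics named above can be sketched in plain Python. The gene values are taken from the 16-gene, six-time-point example used for the SOM slides later in this deck. The function names are illustrative, and turning the Pearson correlation into a distance as 1 - r is one common convention, assumed here rather than taken from the slides.

```python
import math

# Six measurements per gene (a_1hr .. b_3hr) from the SOM example in this deck.
gene_1 = [1, 2, 4, 5, 7, 9]
gene_16 = [1, 2, 5, 7, 8, 9]
gene_14 = [9, 7, 5, 3, 2, 1]

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient r, between -1 and 1."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pearson_distance(a, b):
    """Assumed convention: 0 for perfectly correlated, 2 for anti-correlated."""
    return 1.0 - pearson(a, b)

for name, other in (("Gene 16", gene_16), ("Gene 14", gene_14)):
    print("Gene 1 vs", name,
          "euclidean=%.3f" % euclidean(gene_1, other),
          "manhattan=%d" % manhattan(gene_1, other),
          "pearson_distance=%.3f" % pearson_distance(gene_1, other))
```

Gene 1 and Gene 16 both rise over time and come out close under all three metrics, while Gene 14 falls over time and comes out far from Gene 1. Note that Pearson compares the shape of the profile and ignores its magnitude, so the choice of metric changes which genes count as neighbours.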
Clustering Algorithms

• Be wary: confounding computational artifacts are associated with all clustering algorithms. You should always understand the basic concepts behind an algorithm before using it.
• Anything will cluster! Garbage in means garbage out.
Hierarchical Clustering
• IDEA: Iteratively combine genes into groups based on similar patterns of observed expression.
• By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.
• Display the data as a heatmap and dendrogram
• Cluster genes, samples or both
(HCL-1)
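The iterative merging described above can be sketched as a small agglomerative procedure. The slides do not say which distance or linkage rule is used, so Euclidean distance and average linkage are assumptions here, and the function names are illustrative. The five genes are taken from the SOM example table in this deck.

```python
import math
from itertools import combinations

# Five genes from the SOM example in this deck (a_1hr .. b_3hr).
GENES = {
    "Gene 1": [1, 2, 4, 5, 7, 9],
    "Gene 16": [1, 2, 5, 7, 8, 9],
    "Gene 6": [8, 7, 7, 6, 5, 3],
    "Gene 14": [9, 7, 5, 3, 2, 1],
    "Gene 5": [1, 2, 3, 4, 5, 6],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters (assumed rule)."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(euclidean(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def hierarchical_cluster(vectors):
    """Return the merge history as a list of (members_a, members_b, distance)."""
    clusters = [(name,) for name in vectors]  # every gene starts on its own
    merges = []
    while len(clusters) > 1:
        # Find the closest pair: gene with gene, or gene with group.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], vectors))
        merges.append((a, b, average_linkage(a, b, vectors)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)  # the merged group replaces its two parts
    return merges

merges = hierarchical_cluster(GENES)
for a, b, d in merges:
    print("merge", a, "+", b, "at distance %.2f" % d)
```

The merge history is the dendrogram: each entry is one internal node, and the recorded distance is the height at which the two branches join. With these five genes the rising profiles and the falling profiles end up in two separate branches that are joined last.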
[Figure sequence: a dendrogram is built step by step over Gene 1 to Gene 8. In the final frames the leaves appear in the order Gene 7, Gene 1, Gene 2, Gene 4, Gene 5, Gene 3, Gene 8, Gene 6.]
[Figure: clustered heatmap with axes labelled Genes and Samples; color scale labelled L to H.]
The Leaf Ordering Problem:
• Find ‘optimal’ layout of branches for a given dendrogram architecture
• $2^{N-1}$ possible orderings of the branches
• For a small microarray dataset of 500 genes there are about $1.6 \times 10^{150}$ branch configurations
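The count follows from the fact that a binary dendrogram over N leaves has N-1 internal nodes, and the two branches under each node can be swapped without changing the tree. A short sketch, with illustrative function names, checks the formula by brute force on a tiny tree and then evaluates it for 500 genes:

```python
def leaf_orderings(n_leaves):
    """Number of leaf orders for a fixed binary dendrogram: 2 ** (N - 1)."""
    return 2 ** (n_leaves - 1)

def enumerate_orderings(tree):
    """List every leaf order of a nested-tuple tree by flipping each node.

    Leaves are strings; internal nodes are (left, right) tuples.
    """
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    result = []
    for l in enumerate_orderings(left):
        for r in enumerate_orderings(right):
            result.append(l + r)  # branches as drawn
            result.append(r + l)  # branches swapped
    return result

small_tree = (("a", "b"), ("c", "d"))
print(len(enumerate_orderings(small_tree)), leaf_orderings(4))
print("%.1e" % leaf_orderings(500))
```

The last line prints 1.6e+150, matching the figure quoted on the slide, which is why an exhaustive search for the best leaf order is not an option.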
Hierarchical Clustering

Pros:
– Commonly used algorithm
– Simple and quick to calculate

Cons:
– Real genes probably do not have a
hierarchical organization
Self-Organizing Maps (SOMs)
IDEA: Place genes onto a grid so that genes with similar patterns of expression are placed on nearby squares.
[Figure: genes a, b, c and d placed onto a grid with squares A, B, C and D.]
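The grid-placement idea can be sketched with a minimal SOM in plain Python. Everything about the training schedule here (random initialisation, a linearly decaying learning rate, a shrinking Gaussian neighbourhood, 200 epochs, a fixed seed) is an assumption made for illustration and is not taken from the slides or from MeV. The data are the 16 genes of the example used in this deck.

```python
import math
import random

# 16 genes x 6 time points (a_1hr, a_2hr, a_3hr, b_1hr, b_2hr, b_3hr).
GENES = {
    "Gene 1": [1, 2, 4, 5, 7, 9],   "Gene 2": [2, 3, 7, 7, 6, 3],
    "Gene 3": [4, 4, 5, 5, 4, 4],   "Gene 4": [3, 4, 3, 4, 3, 3],
    "Gene 5": [1, 2, 3, 4, 5, 6],   "Gene 6": [8, 7, 7, 6, 5, 3],
    "Gene 7": [4, 4, 4, 4, 5, 4],   "Gene 8": [5, 6, 5, 4, 3, 2],
    "Gene 9": [3, 3, 1, 3, 6, 8],   "Gene 10": [2, 4, 8, 5, 4, 2],
    "Gene 11": [1, 5, 6, 9, 8, 7],  "Gene 12": [1, 3, 5, 8, 8, 6],
    "Gene 13": [4, 3, 3, 4, 5, 6],  "Gene 14": [9, 7, 5, 3, 2, 1],
    "Gene 15": [1, 2, 2, 3, 4, 4],  "Gene 16": [1, 2, 5, 7, 8, 9],
}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matching_unit(nodes, vector):
    """Grid position whose reference vector is closest to the gene."""
    return min(nodes, key=lambda pos: distance(nodes[pos], vector))

def train_som(vectors, rows=3, cols=3, epochs=200, seed=0):
    rng = random.Random(seed)
    names = list(vectors)
    dim = len(vectors[names[0]])
    lo = min(min(v) for v in vectors.values())
    hi = max(max(v) for v in vectors.values())
    # One reference vector per grid square, initialised at random.
    nodes = {(r, c): [rng.uniform(lo, hi) for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                          # learning rate decays
        radius = max(rows, cols) / 2 * (1.0 - frac) + 0.5  # neighbourhood shrinks
        for name in rng.sample(names, len(names)):
            v = vectors[name]
            bmu = best_matching_unit(nodes, v)
            for pos, w in nodes.items():
                grid_dist = distance(pos, bmu)
                if grid_dist <= radius:
                    # The winner and its grid neighbours move toward the gene.
                    pull = rate * math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                    for k in range(dim):
                        w[k] += pull * (v[k] - w[k])
    return nodes

nodes = train_som(GENES)
placement = {name: best_matching_unit(nodes, v) for name, v in GENES.items()}
for name, pos in sorted(placement.items(), key=lambda item: item[1]):
    print(pos, name)
```

Because neighbouring squares are pulled together during training, genes that land on the same or adjacent squares tend to have similar profiles. The exact placement depends on the seed and on the schedule, so different runs or different software can give different but comparably organised maps.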
Self-organizing Maps (SOMs)
Gene     a_1hr  a_2hr  a_3hr  b_1hr  b_2hr  b_3hr
Gene 1   1      2      4      5      7      9
Gene 2   2      3      7      7      6      3
Gene 3   4      4      5      5      4      4
Gene 4   3      4      3      4      3      3
Gene 5   1      2      3      4      5      6
Gene 6   8      7      7      6      5      3
Gene 7   4      4      4      4      5      4
Gene 8   5      6      5      4      3      2
Gene 9   3      3      1      3      6      8
Gene 10  2      4      8      5      4      2
Gene 11  1      5      6      9      8      7
Gene 12  1      3      5      8      8      6
Gene 13  4      3      3      4      5      6
Gene 14  9      7      5      3      2      1
Gene 15  1      2      2      3      4      4
Gene 16  1      2      5      7      8      9
[Figure sequence: a grid of nine nodes, A to I, rearranges over several frames until the nodes are in order.]
Self-organizing Maps (SOMs)
[Figure: the 16 genes assigned to a 3 x 3 grid of nodes A to I. Gene groups shown on the grid:]
• Genes 11 and 12
• Genes 1, 16 and 5
• Gene 10
• Gene 15
• Gene 8
• Genes 6 and 14
• Genes 9 and 13
• Gene 3
• Genes 4, 7 and 2
The Gene Expression Dynamics Inspector – GEDI
[Figure: expression matrix with genes as rows (Gene 1 to Gene 6, …) and samples as columns, with the samples divided into Group A, Group B and Group C.]
GEDI’s Features:
• Allows for simultaneous analysis of several time courses or datasets
• Displays the data in an intuitive, mathematically driven visualization that can be compared across samples
• The same genes map to the same tiles
[Figure: GEDI mosaics for Group A, Group B and Group C at samples 1 to 4; color scale from L to H.]
Software Demonstrations
MeV available at
http://www.tigr.org/software/tm4/mev.html
GEDI available at
http://www.chip.org/~ge/gedihome.htm
Comparison of GEDI vs. Hierarchical Clustering
Hierarchical clustering of random data
(GIGO)
G.E.D.I. allows the direct visual assessment of
the quality of conventional cluster analysis
From: CreateGEP_Journal.wpd, random_A
Questions