Clustering Gene Expression Data
DNA Microarrays Workshop
Feb. 26 – Mar. 2, 2001, UNIL & EPFL, Lausanne
Gaddy Getz, Weizmann Institute, Israel
• Gene Expression Data
• Clustering of Genes and Conditions
• Methods
– Agglomerative Hierarchical: Average Linkage
– Centroids: K-Means
– Physically motivated: Super-Paramagnetic Clustering
• Coupled Two-Way Clustering
Feb 2001 (GG)
Gene Expression Technologies
• DNA Chips (Affymetrix) and MicroArrays can measure
mRNA concentration of thousands of genes simultaneously
• General scheme: extract RNA, synthesize labeled cDNA,
hybridize with DNA on chip.
Single Experiment
• After hybridization
– Scan the Chip and obtain an image file
– Image Analysis (find spots, measure signal and noise)
Tools: ScanAlyze, Affymetrix, …
• Output File
– Affymetrix chips: For each gene a reading proportional
to the concentrations and a present/absent call.
(Average Difference, Absent Call)
– cDNA MicroArrays: competing hybridization of target
and control. For each gene the log ratio of target and
control. (CH1I-CH1B, CH2I-CH2B)
Preprocessing: From one experiment to many
• Chip and Channel Normalization
– Aim: bring the readings of all experiments onto the
same scale
– Cause: different RNA amounts, labeling efficiency and
image acquisition parameters
– Method: multiply the readings of each array/channel by a
scaling factor, chosen in one of two ways:
• so that the sum of the scaled readings is the same for all arrays
• by a linear fit of the highly expressed genes
– Note: In multi-channel experiments normalize each
channel separately.
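The sum-based scaling can be sketched as follows. This is a minimal illustration, not the workshop's own code: the function name `normalize_arrays`, the choice of the mean total as the common target, and the toy readings are all assumptions.

```python
def normalize_arrays(readings, target_sum=None):
    """Scale each array so that all arrays share the same total signal.

    readings: list of arrays, each a list of per-gene readings.
    target_sum: common total after scaling (assumption: defaults to the
    mean of the original totals).
    """
    totals = [sum(array) for array in readings]
    if target_sum is None:
        target_sum = sum(totals) / len(totals)
    # One multiplicative scaling factor per array/channel.
    factors = [target_sum / total for total in totals]
    scaled = [[value * factor for value in array]
              for array, factor in zip(readings, factors)]
    return scaled, factors


# Toy data: the second array has the same profile at half the intensity.
arrays = [
    [100.0, 200.0, 700.0],
    [50.0, 100.0, 350.0],
]
scaled, factors = normalize_arrays(arrays)
print(factors)
print([sum(array) for array in scaled])
```

For multi-channel data, the same function would be called once per channel, in line with the note above.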
Preprocessing: From one experiment to many
• Filtering of Genes
– Remove genes that are absent in most experiments
– Remove genes that are constant in all experiments
– Remove genes with low readings which are not reliable.
[Figure: Colon cancer data (Alon et al.); vertical axis: Genes, horizontal axis: Experiments]
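A minimal sketch of the three filtering rules, assuming each gene comes with its readings and its present/absent calls. The function name `keep_gene`, the thresholds and the toy genes are hypothetical and would need tuning for real data.

```python
def keep_gene(readings, present_calls,
              min_present_fraction=0.5, min_spread=0.0, min_reading=20.0):
    """Return True if a gene passes all three filters.

    All thresholds are illustrative assumptions, not values from the talk.
    """
    present_fraction = sum(present_calls) / len(present_calls)
    if present_fraction < min_present_fraction:
        return False  # absent in most experiments
    if max(readings) - min(readings) <= min_spread:
        return False  # constant in all experiments
    if max(readings) < min_reading:
        return False  # readings too low to be reliable
    return True


# Hypothetical genes: (readings, present calls) over four experiments.
genes = {
    "g_absent": ([120.0, 5.0, 3.0, 4.0], [True, False, False, False]),
    "g_constant": ([50.0, 50.0, 50.0, 50.0], [True, True, True, True]),
    "g_low": ([2.0, 6.0, 4.0, 8.0], [True, True, True, True]),
    "g_good": ([30.0, 250.0, 90.0, 400.0], [True, True, True, True]),
}
kept = [name for name, (readings, calls) in genes.items()
        if keep_gene(readings, calls)]
print(kept)
```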
Noise and Repeats
• >90% 2 to 3 fold
• Multiplicative noise
• Repeat experiments
• Log scale: dist(4,2)=dist(2,1)
[Figure: log-log plot]
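The identity dist(4,2)=dist(2,1) can be checked directly. The sketch below assumes base-2 logarithms (any base gives the same equality) and uses a hypothetical helper `log_distance`.

```python
import math


def log_distance(a, b):
    """Distance between two positive readings on a log2 scale."""
    return abs(math.log2(a) - math.log2(b))


d_42 = log_distance(4, 2)   # a two-fold change
d_21 = log_distance(2, 1)   # also a two-fold change
linear_42 = abs(4 - 2)      # on a linear scale the two differ
linear_21 = abs(2 - 1)
print(d_42, d_21, linear_42, linear_21)
```

On the log scale, equal fold changes give equal distances, which matches multiplicative noise.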
We can ask many questions
Supervised Methods (use predefined labels)
• Which genes are expressed differently in two
known types of conditions?
• What is the minimal set of genes needed to
distinguish one type of conditions from the others?
Unsupervised Methods (use only the data)
• Which genes behave similarly in the experiments?
• How many different types of conditions are there?
Unsupervised Analysis
• Goal A: Find groups of genes that have correlated
expression profiles.
These genes are believed to belong to the same
biological process and/or are co-regulated.
• Goal B: Divide conditions into groups with similar
gene expression profiles.
Example: divide drugs according to their effect on
gene expression.
Clustering Methods
What is clustering?
• Input: N data points, Xi, i=1,2,…,N in a D
dimensional space.
• Goal: Find “natural” groups or clusters.
Data points in the same cluster are “more similar”
• Note: the number of clusters must also be determined
Clustering is ill-posed
• Problem specific definitions
• Similarity: which points should be
considered close?
– Correlation coefficient
– Euclidean distance
• Resolution: specify/hierarchical results
• Shape of clusters: general, spherical.
Similarity Measure
• Similarity measures
– Centered Correlation
– Uncentered Correlation
– Absolute correlation
– Euclidean
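The four measures can be sketched as below. The definitions are assumptions, since the slide only names the measures: centered correlation is taken as the Pearson coefficient, uncentered correlation as the same formula without subtracting the means, and absolute correlation as the absolute value of the centered one.

```python
import math


def uncentered_correlation(x, y):
    """Correlation without mean subtraction (cosine of the angle)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def centered_correlation(x, y):
    """Pearson correlation: subtract each profile's mean first."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return uncentered_correlation([a - mean_x for a in x],
                                  [b - mean_y for b in y])


def absolute_correlation(x, y):
    """Treats correlated and anti-correlated profiles as similar."""
    return abs(centered_correlation(x, y))


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical profiles: y is x shifted upward, z is x reversed.
x = [1.0, 2.0, 3.0, 4.0]
y = [11.0, 12.0, 13.0, 14.0]
z = [4.0, 3.0, 2.0, 1.0]
print(centered_correlation(x, y), uncentered_correlation(x, y))
print(centered_correlation(x, z), absolute_correlation(x, z))
print(euclidean_distance(x, y))
```

The shifted profile y has the same shape as x, so the centered correlation treats the two as identical while the Euclidean distance does not.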
Agglomerative Hierarchical Clustering
• Need to define the distance between the new cluster and the other clusters.
– Single Linkage: distance between closest pair.
– Complete Linkage: distance between farthest pair.
– Average Linkage: average distance between all pairs, or distance between cluster centers
• The dendrogram induces a linear ordering of the data points
[Figure: five data points labeled 1 to 5 and their dendrogram; dendrogram axis: distance between joined clusters]
Agglomerative Hierarchical Clustering
• Results depend on distance update method
– Single Linkage: elongated clusters
– Complete Linkage: sphere-like clusters
• Greedy iterative process
• NOT robust against noise
• No inherent measure to choose the clusters
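A minimal sketch of the greedy agglomerative process with average linkage (average distance between all pairs), assuming Euclidean distance. It is a naive illustration on hypothetical points, not an efficient or reference implementation; the function names are made up for this example.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(points):
    """Greedily merge the two closest clusters until one remains.

    Returns the merge history as (members_a, members_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [euclidean(points[i], points[j])
                         for i in clusters[a] for j in clusters[b]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical 2-D points: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (12.0, 0.0)]
merges = average_linkage(points)
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 3))
```

Swapping the mean for `min` or `max` over the pair distances would give single or complete linkage.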
Centroid Methods - K-means
• Start with random positions of K centroids.
• Iterate until centroids are stable
– Assign each point to its nearest centroid
– Move each centroid to the center of its assigned points
[Figure: animation frames showing the centroid positions at Iterations 0 to 3]
Centroid Methods - K-means
• Result depends on initial centroids’ position
• Fast algorithm: compute distances from data
points to centroids
• No way to choose K.
• Example: 3 clusters / K=2, 3, 4
• Breaks long clusters
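The K-means loop described above can be sketched as follows. This is a minimal version under stated assumptions: squared Euclidean distance, initial centroids drawn from the data points, a fixed random seed, and hypothetical toy points.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iterations=100):
    """Plain K-means; the result can depend on the random initialization."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the center of its assigned points.
        new_centroids = []
        for c, group in enumerate(groups):
            if group:
                center = tuple(sum(axis) / len(group) for axis in zip(*group))
                new_centroids.append(center)
            else:
                new_centroids.append(centroids[c])  # keep an empty centroid
        if new_centroids == centroids:
            break  # centroids are stable
        centroids = new_centroids
    return centroids, groups


# Two well-separated hypothetical blobs.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, groups = kmeans(points, k=2)
print(sorted(centroids))
```

K is an input here, which reflects the point above that the method itself gives no way to choose it.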
Super-Paramagnetic Clustering (SPC)
M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation
• The idea behind SPC is based on the physical
properties of dilute magnets.
• Calculating correlation between magnet
orientations at different temperatures (T).
[Figure: magnet orientations at T=Low, T=High and T=Intermediate]
Super-Paramagnetic Clustering (SPC)
• The algorithm simulates the magnets' behavior at a range of
temperatures and calculates their correlation
• The temperature (T) controls the resolution
• Example: N=4800 points in D=2
Output of SPC
• A function χ(T) that peaks when stable clusters break
• Size of largest clusters as function of T
• Dendrogram
• Stable clusters “live” over a large range of T
Choosing a value for T
Advantages of SPC
• Scans all resolutions (T)
• Robust against noise and initialization: calculates collective correlations.
• Identifies “natural” clusters and clusters that remain stable over a range of T
• No need to pre-specify number of clusters
• Clusters can be any shape
Many clustering methods applied
to expression data
• Agglomerative Hierarchical
– Average Linkage (Eisen et al., PNAS 1998)
• Centroid (representative)
– K-Means (Golub et al., Science 1999)
– Self-Organizing Maps (Tamayo et al., PNAS 1999)
• Physically motivated
– Deterministic Annealing (Alon et al., PNAS 1999)
– Super-Paramagnetic Clustering (Getz et al., Physica A 2000)
Available Tools
• M. Eisen’s programs for clustering and display of
results (Cluster, TreeView)
– Predefined set of normalizations and filtering
– Agglomerative, K-means, 1D SOM
• Matlab
– Agglomerative, public m-files.
• Dedicated software packages (SPC)
• Web sites: e.g. http://ep.ebi.ac.uk/EP/EPCLUST/
• Statistical programs (SPSS, SAS, S-plus)
Back to gene expression data
• 2 Goals: Cluster Genes and Conditions
• 2 independent clusterings:
– Genes are represented as vectors of expression in all conditions
– Conditions are represented as vectors of expression of all genes
[Figure: Colon cancer data (normalized genes); vertical axis: Genes, horizontal axis: Experiments]
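The two representations are the rows and the columns of the same expression matrix. A minimal sketch, with hypothetical gene names and values:

```python
# Hypothetical expression matrix: one row per gene, one column per condition.
expression = {
    "gene_a": [0.8, 0.6, -0.2, -0.4],
    "gene_b": [0.7, 0.5, -0.1, -0.3],
    "gene_c": [-0.4, -0.2, 0.6, 0.8],
}

# Clustering genes: each gene is a vector over all conditions (the rows).
gene_vectors = list(expression.values())

# Clustering conditions: each condition is a vector over all genes
# (the columns), obtained by transposing the matrix.
condition_vectors = [list(column) for column in zip(*gene_vectors)]

print(len(gene_vectors), len(gene_vectors[0]))
print(len(condition_vectors), len(condition_vectors[0]))
```

Any clustering method can then be run once on `gene_vectors` and once on `condition_vectors`.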
First clustering - Experiments
1. Identify tissue classes (tumor/normal)
Second Clustering - Genes
2. Find Differentiating And Correlated Genes
Ribosomal proteins
Cytochrome C
metabolism
HLA2
Two-way Clustering
Coupled Two-Way Clustering (CTWC)
G. Getz, E. Levine and E. Domany (2000) PNAS
• Why use all the genes to represent conditions and all
conditions to represent genes?
Different structures emerge when clustering
sub-matrices.
• New Goal: Find significant structure in subsets of the
data matrix.
• A non-trivial task – exponential number of subsets.
• Recently we proposed a heuristic to solve this
problem.
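To see why exhaustive search is out of reach, the number of candidate sub-matrices can be counted directly. The sketch below assumes a sub-matrix is any non-empty subset of genes combined with any non-empty subset of conditions. The 2000 x 60 size is illustrative, roughly the scale of the colon cancer matrix, and is not an exact figure from the talk.

```python
def count_submatrices(n_genes, n_conditions):
    """Non-empty gene subsets times non-empty condition subsets."""
    return (2 ** n_genes - 1) * (2 ** n_conditions - 1)


small = count_submatrices(3, 2)      # 7 gene subsets * 3 condition subsets
large = count_submatrices(2000, 60)  # illustrative matrix size (assumption)
print(small)
print(len(str(large)), "decimal digits")
```

Even the toy 3 x 2 matrix has 21 sub-matrices. At the illustrative size the count runs to hundreds of digits, which is why a heuristic is needed.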
CTWC of colon cancer data
[Figure: CTWC results on the colon cancer data, panels (A) and (B)]
Biological Work
• Literature search for the genes
• Genomics: search for common regulatory
signal upstream of the genes
• Proteomics: infer functions.
• Design next experiment – get more data to
validate result.
• Find what is in common with sets of
experiments/conditions.
Summary
• Clustering methods are used to
– find genes from the same biological process
– group the experiments into similar conditions
• Different clustering methods can give different
results. The physically motivated ones are more
robust.
• Focusing on subsets of the genes and conditions
can uncover structure that is masked when using
all genes and conditions
www.weizmann.ac.il/physics/complex/compphys