Unsupervised analysis of gene
expression data
Bing Zhang
Department of Biomedical Informatics
Vanderbilt University
[email protected]
Overall workflow of a microarray study
Biological question
Experiment design
Microarray experiment
Image analysis
Pre-processing
Data Analysis
Experimental verification
Hypothesis
Applied Bioinformatics, Spring 2011
Three major goals of gene expression studies
Class comparison (supervised analysis)
 
e.g. disease biomarker discovery
 
Differential expression analysis
 
Input: gene expression data, class label of the samples
 
Output: differentially expressed genes
Class detection (unsupervised analysis)
 
e.g. patient subgroup detection
 
Clustering analysis
 
Input: gene expression data
 
Output: groups of similar samples or genes
Class prediction (supervised learning)
 
e.g. disease diagnosis and prognosis
 
Machine learning techniques
 
Input: gene expression data, class label of the samples (training data)
 
Output: prediction model
[Table: example gene expression matrix; rows are probe sets (probe_set_id), columns are six samples; values not recoverable]
What is clustering
 
Clustering algorithms are methods to divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities
 
Unsupervised techniques that do not require sample annotation in
the process
Genes (rows) by samples (columns)

Gene      Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  ……
TNNC1     14.82     14.46     14.76     11.22     11.55     ……
DKK4      10.71     10.37     11.23     19.74     19.73     ……
ZNF185    15.20     14.96     15.07     12.57     12.37     ……
CHST3     13.40     13.18     13.15     11.18     10.99     ……
FABP3     15.87     15.80     15.85     13.16     12.99     ……
MGST1     12.76     12.80     12.67     14.92     15.02     ……
DEFA5     10.63     10.47     10.54     15.52     15.52     ……
VIL1      11.47     11.69     11.87     13.94     14.01     ……
AKAP12    18.26     18.10     18.50     15.60     15.69     ……
HS3ST1    10.61     10.67     10.50     12.44     12.23     ……
……        ……        ……        ……        ……        ……        ……
Why clustering?
 
Exploratory data analysis, providing rough maps and suggesting
directions for further study
 
Representing distances among high-dimensional expression profiles
in a concise, visually effective way, such as a tree or dendrogram
 
Identifying candidate subgroups in complex data, e.g. novel sub-types in cancer or groups of co-expressed genes
 
Functional annotation based on guilt by association
Clustering methods
 
Hierarchical clustering: generates a hierarchy of clusters, going from 1 cluster to n clusters
 
Partitioning: divides the data into g groups using a reallocation algorithm, e.g. K-means
Hierarchical clustering
 
Agglomerative clustering (bottom-up)
  Start out with all sample units in n clusters of size 1.
  At each step of the algorithm, the pair of clusters with the shortest distance is combined into a single cluster.
  The algorithm stops when all sample units are combined into a single cluster of size n.
 
Divisive clustering (top-down)
  Start out with all sample units in a single cluster of size n.
  At each step of the algorithm, a cluster is partitioned into a pair of daughter clusters, selected to maximize the distance between the two daughters.
  The algorithm stops when the sample units are partitioned into n clusters of size 1.
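The bottom-up procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`agglomerate`, `closest_pair_distance`), the one-dimensional toy values, and the use of the closest pair of members as the distance between two clusters are all assumptions made for the example.

```python
from itertools import combinations


def closest_pair_distance(values, cluster_a, cluster_b):
    """Assumed cluster distance: smallest absolute difference between members."""
    return min(abs(values[i] - values[j]) for i in cluster_a for j in cluster_b)


def agglomerate(values, cluster_distance=closest_pair_distance):
    """Bottom-up clustering; returns the merge history.

    Each history entry is (cluster_a, cluster_b, distance), where a cluster
    is a tuple of indices into `values`.
    """
    # Start out with n clusters of size 1.
    clusters = [(i,) for i in range(len(values))]
    history = []
    # Stop when all sample units sit in a single cluster of size n.
    while len(clusters) > 1:
        # Find the pair of clusters with the shortest distance ...
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(values, pair[0], pair[1]),
        )
        history.append((a, b, cluster_distance(values, a, b)))
        # ... and combine them into a single cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history


# Hypothetical expression values for five sample units.
toy_values = [10, 12, 50, 51, 90]
merge_history = agglomerate(toy_values)
for a, b, d in merge_history:
    print(f"merge {a} + {b} at distance {d}")
```

With n sample units the loop always performs n - 1 merges. The distance recorded at each merge is the quantity a tree drawing would use as the height of the corresponding join.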
Agglomerative clustering
 
Requires a distance measurement
  Between two objects
  Between clusters
Between objects distance measurement
 
Euclidean distance
  Focus on the absolute expression value
  $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
 
Pearson correlation coefficient
  Focus on the expression profile shape
  Parametric: assumes the data are normally distributed and follow the linear regression model
  $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
  $d = 1 - r$
 
Spearman correlation coefficient
  Focus on the expression profile shape
  Non-parametric, no distributional assumption
  Less sensitive but more robust than Pearson
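The three measures can be written out directly in plain Python, as a sketch under stated assumptions. The function names and the toy profiles are made up for illustration. The Spearman coefficient is computed as the Pearson coefficient of the ranks, with tied values given their average rank, and turning it into a distance with 1 - r mirrors the Pearson case and is an assumption here.

```python
import math


def euclidean_distance(x, y):
    """d = sqrt(sum_i (x_i - y_i)^2): sensitive to absolute expression values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


def correlation_distance(r):
    """d = 1 - r, so perfectly correlated profiles have distance 0."""
    return 1 - r


# Hypothetical profiles over five time points.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [3.0, 4.0, 5.0, 6.0, 7.0]    # same shape as gene_a, higher level
curved = [1.0, 4.0, 9.0, 16.0, 25.0]   # same ordering as gene_a, not linear

print(euclidean_distance(gene_a, shifted))
print(correlation_distance(pearson_r(gene_a, shifted)))
print(correlation_distance(pearson_r(gene_a, curved)))
print(correlation_distance(spearman_r(gene_a, curved)))
```

The shifted profile is far from `gene_a` in Euclidean terms but at correlation distance of essentially zero, which is the sense in which the correlation-based measures focus on shape. The curved profile keeps the same ordering, so its Spearman distance is also essentially zero, while its Pearson distance is small but positive.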
Different measurement, different distance
Most similar profile to GeneA
(blue) based on different
distance measurement:
Euclidean: GeneB (pink)
Pearson: GeneC (green)
Spearman: GeneD (red)
[Figure: expression profiles of GeneA, GeneB, GeneC and GeneD; x-axis: Time (hr); y-axis: Gene expression level (log2)]
Between cluster distance measurement
 
Single linkage: the smallest of all pairwise distances between members of the two clusters
 
Complete linkage: the largest of all pairwise distances between members of the two clusters
 
Average linkage: the average of all pairwise distances between members of the two clusters
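The three linkage rules differ only in how they summarize the list of pairwise distances, which a short sketch makes concrete. The function names, the one-dimensional toy clusters, and the absolute difference used as the between-object distance are assumptions for the example.

```python
def pairwise_distances(cluster_a, cluster_b, distance):
    """Distances between every member of cluster_a and every member of cluster_b."""
    return [distance(p, q) for p in cluster_a for q in cluster_b]


def single_linkage(cluster_a, cluster_b, distance):
    """Smallest pairwise distance."""
    return min(pairwise_distances(cluster_a, cluster_b, distance))


def complete_linkage(cluster_a, cluster_b, distance):
    """Largest pairwise distance."""
    return max(pairwise_distances(cluster_a, cluster_b, distance))


def average_linkage(cluster_a, cluster_b, distance):
    """Mean of all pairwise distances."""
    distances = pairwise_distances(cluster_a, cluster_b, distance)
    return sum(distances) / len(distances)


def absolute_difference(p, q):
    """Assumed between-object distance for one-dimensional toy values."""
    return abs(p - q)


# Hypothetical clusters of one-dimensional values.
cluster_a = [1.0, 2.0]
cluster_b = [4.0, 7.0]

single = single_linkage(cluster_a, cluster_b, absolute_difference)
complete = complete_linkage(cluster_a, cluster_b, absolute_difference)
average = average_linkage(cluster_a, cluster_b, absolute_difference)
print(single, complete, average)
```

Here the pairwise distances are 3, 6, 2 and 5, so the same two clusters are 2 apart under single linkage, 6 under complete linkage, and 4 under average linkage.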
Visualization and interpretation of hierarchical clustering results
 
Dendrogram
  Output of a hierarchical clustering
  Tree structure with the genes or samples as the leaves
  The height of the join indicates the distance between the left branch and the right branch
 
Heat map
  Graphical representation of data where the values are represented as colors.
Partitioning
 
General idea
  Select the number of groups, g
  Randomly divide the objects into g groups
  Iteratively rearrange the objects until a stop condition is met
 
Representative methods
  K-means
  Self Organizing Map (SOM)
K-means
 
1. Define k = number of clusters
2. Randomly initialize a seed vector for each cluster
3. Go through all objects, and assign each object to the cluster which it is most similar to
4. Recalculate all seed vectors as means of the patterns in each cluster
5. Repeat 3 & 4 until a stop condition is met (e.g. until all objects get assigned to the same partition twice in a row)
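The steps above can be sketched in plain Python as follows. Several details are assumptions made for the example and are not specified in the slides: similarity is measured by squared Euclidean distance, the random seed vectors are k objects drawn from the data, a cluster that ends up empty keeps its previous seed, and the names `k_means`, `initial_seeds` and `max_iterations` are hypothetical.

```python
import random


def squared_euclidean(p, q):
    """Assumed dissimilarity between an object and a seed vector."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(objects, k, initial_seeds=None, rng=None, max_iterations=100):
    """Return (assignment, seeds); assignment[i] is the cluster of objects[i]."""
    rng = rng or random.Random()
    # Step 2: one seed vector per cluster (k random objects unless seeds are given).
    seeds = [list(s) for s in (initial_seeds or rng.sample(objects, k))]
    assignment = None
    for _ in range(max_iterations):
        # Step 3: assign each object to the cluster with the closest seed.
        new_assignment = [
            min(range(k), key=lambda c: squared_euclidean(obj, seeds[c]))
            for obj in objects
        ]
        # Stop condition: same partition twice in a row.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recalculate each seed as the mean of its cluster members.
        for c in range(k):
            members = [obj for obj, a in zip(objects, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old seed
                seeds[c] = mean_vector(members)
    return assignment, seeds


# Hypothetical two-dimensional profiles forming two obvious groups.
profiles = [
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
]
assignment, seeds = k_means(profiles, k=2, rng=random.Random(0))
print(assignment)
print(seeds)
```

Because the seeds are random, the cluster labels (0 or 1) can swap between runs, and on harder data different starts can give different partitions. Passing `initial_seeds` makes a run reproducible.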
K-means
Randomly initialize seeds (seed vector 1, seed vector 2)
Objects join with closest seed
Recalculate seeds
Reassign objects
Recalculate seeds
Reassign objects
Seeds become stable: final clusters
Cool animations
 
Hierarchical clustering
  http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
 
K-means
  http://animation.yihui.name/mvstat:k-means_cluster_algorithm
Resources
 
Data source
  Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
  ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
 
Microarray data analysis tools
  Bioconductor: http://www.bioconductor.org/
  Expression profiler: http://www.ebi.ac.uk/expressionprofiler/
Summary
 
Agglomerative clustering
  Bottom-up
  Between objects distance measurement
    Euclidean distance
    Pearson’s correlation coefficient
    Spearman’s correlation coefficient
  Between cluster distance measurement
    Single linkage
    Complete linkage
    Average linkage
  Visualization
    Dendrogram
    Heat map
 
Partitioning
  k-means clustering
Exercise
 
Data set: evan_deneris_2010_5ht_top500diff.txt
 
500 selected probe sets
 
Four groups (Rostral_5ht, Rostral_non5ht, Caudal_5ht, Caudal_non5ht)
 
No missing value; Already normalized; Already log transformed
 
Use hierarchical clustering in Expression profiler (http://www.ebi.ac.uk/expressionprofiler)
to generate a heat map