Download Intro Data Clustering - Genomics & Bioinformatics at Purdue

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Pathogenomics wikipedia , lookup

Epigenetics in learning and memory wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Polycomb Group Proteins and Cancer wikipedia , lookup

Copy-number variation wikipedia , lookup

Minimal genome wikipedia , lookup

Saethre–Chotzen syndrome wikipedia , lookup

Oncogenomics wikipedia , lookup

Genetic engineering wikipedia , lookup

Public health genomics wikipedia , lookup

Neuronal ceroid lipofuscinosis wikipedia , lookup

Biology and consumer behaviour wikipedia , lookup

History of genetic engineering wikipedia , lookup

Genomic imprinting wikipedia , lookup

Gene therapy of the human retina wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

Genome evolution wikipedia , lookup

Gene wikipedia , lookup

Gene therapy wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Epigenetics of human development wikipedia , lookup

The Selfish Gene wikipedia , lookup

NEDD9 wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Gene desert wikipedia , lookup

Gene nomenclature wikipedia , lookup

Mir-92 microRNA precursor family wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Genome (book) wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

RNA-Seq wikipedia , lookup

Gene expression programming wikipedia , lookup

Ridge (biology) wikipedia , lookup

Microevolution wikipedia , lookup

Gene expression profiling wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Designer baby wikipedia , lookup

Transcript
An Overview of Clustering Methods
Michael D. Kane, Ph.D.
Topics
• What is clustering?
• Clustering mechanics (how the computer does it).
• Parameter choices and their effect.
• Examples.
What is clustering?
Grouping by similarity.
Samples
Gene clustering
Similar genes.
Group genes that have
similar expression
profiles when
observed over
multiple samples.
Genes
Samples
Sample clustering
Similar samples.
Genes
Group samples that
are similar when
observed over
multiple genes.
Why cluster?
• Similar gene expression infers common biology.
Function of uncharacterized genes may be deduced from coexpression with known genes.
• Associate expression patterns with:
Response to environmental change.
Disease pathology/progression.
Clustering Mechanics
For gene clustering, we must
measure similarity between genes.
+
E2
E1
E2
c
Gene a
Gene b
Gene c
Gene d
a
e
d
E1
b
Gene e
f
Gene f
-
+
Distance (similarity) measure
Euclidean distance
+
E2
c
(1.0, 1.7)
dbe
a
-
e
d
E1
b
dbe 
 4.6 1.0  0.5 1.7
2
(4.6, 0.5)
2
f
-
+
Distance Measure
Pearson Correlation
1
S a, b  
N
N

i 1
 ai  a  bi  b 



    
 a  b 
S=(-1 . . . +1)
Used in “Eisen” clustering
Hierarchical Clustering
+
E2
c
a
e
d
E1
+
b
f
-
a
b
c d
e
f
Measuring distance between clusters
Single linkage
The minimum distance between clusters.
May form loose clusters.
Produces “chained” clusters.
Complete linkage
The maximum distance between clusters.
Tends to form compact clusters.
Methods for joining clusters
UPGMA unweighted pair group method (Average linkage)
The average distance between clusters.
Weighted pair group method
Same as UPGMA but the distance is weighted by cluster size.
Use when clusters are expected to be significantly uneven in size!
Effect of distance measure
Euclidean
Single Linkage
Euclidean
Complete Linkage
Effect of distance measure
Euclidean
UPGMA
Euclidean
Ward’s Method
Alternatives to hierarchical clustering
k-means
• Number of clusters specified by user.
• Good when prior knowledge available.
k-means clustering
+
E2
1. Number of clusters specified by user.
c
2. Genes randomly assigned to clusters.
a
e
d
E1
b
3. Assess inter and intra-cluster similarity.
f
-
4. Move genes to alternative cluster if distance is reduced.
+
Alternatives to hierarchical clustering
SOM
Self-organizing maps
• Number of clusters specified by user.
• Good when prior knowledge available.
SOM
E1
E2
E1 E2
Gene a
+
0
-
Gene b
+
0
-
Gene c
+
0
-
Gene d
+
0
-
Gene e
+
0
-
Gene f
+
0
-
E1 E2
+
0
-
cluster 1
+
0
-
cluster 2
+
0
-
cluster 3
E1 E2
E1 E2
Usera specified
For
Increase
Iteratively
After
gene,
training,
thetrain
find
similarity
assign
number
the
the cluster
most
each
by
of similar
adjusting
clusters.
gene
representations.
tocluster
the
themost
cluster
representation.
similar
representation.
cluster.
Each initially given a random expression representation.
“Training”
Gene clustering
Eisen et al.,
Cluster analysis and display of genome-wide
expression patterns.
PNAS v95,14863-14868, 1998
cholesterol biosynthesis
24 hour time course after re-introduction of
serum to serum-deprived human
fibroblasts.
Pearson correlation, average linkage.
cell cycle
immediate-early response
signaling
wound healing
Sample
clustering
Ross et al.,
Systematic variation in gene expression patterns in
human cancer cell lines.
Nature Genetics v24, 227-235, 2000
Note breast cancer cell
lines, derived from the
same patient.
64 cancer cell lines clustered.
8,000 genes.
Clustering performed with 2 different
subsets of genes. Similar results.
Pearson correlation, average linkage.
Summary
• Different methods often provide different clusters.
• No overall “best” clustering method.
• Clustering applied to unrelated data will still provide clusters.
• Use biological insight in method selection and interpretation.
Clustering
+
E2
c
a
e
d
E1
+
b
f
-
a
b
c d
e
f
SOM
E1
E2
Gene a
+
0
-
Gene b
+
0
-
Gene c
+
0
-
Gene d
+
0
-
Gene e
+
0
-
Gene f
+
0
-
E1 E2
+
0
-
cluster 1
+
0
-
cluster 2
+
0
-
cluster 3
E1 E2
E1 E2
After training, assign each gene to the most similar cluster.