Download Soft clustering

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Public health genomics wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

Pathogenomics wikipedia , lookup

Gene nomenclature wikipedia , lookup

History of genetic engineering wikipedia , lookup

Polycomb Group Proteins and Cancer wikipedia , lookup

Gene desert wikipedia , lookup

NEDD9 wikipedia , lookup

Quantitative trait locus wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Genomic imprinting wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Gene wikipedia , lookup

Genome evolution wikipedia , lookup

Minimal genome wikipedia , lookup

Genome (book) wikipedia , lookup

Biology and consumer behaviour wikipedia , lookup

Microevolution wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Gene expression programming wikipedia , lookup

RNA-Seq wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Designer baby wikipedia , lookup

Gene expression profiling wikipedia , lookup

Mir-92 microRNA precursor family wikipedia , lookup

Ridge (biology) wikipedia , lookup

Transcript
Soft clustering of gene expression data
Matthias E. Futschik
Institute for Theoretical Biology
Humboldt-University, Berlin, Germany
Clustering methodology
Hierachical clustering
• can be divisive or agglomerative producing nested clusters.
• Results are usually visualised by tree structures dendrogram.
• Clustering depends on the linkage procedure used: single, complete, average,...
Partitional clustering
• divides data into a (pre-)chosen number of classes.
• Examples: k-means, SOMs, fuzzy c-means, simulated annealing,
model-based clustering, HMMs,...
• Setting the number of clusters is problematic
Cluster validity:
• Most cluster algorithms always detect clusters, even in random data.
• Cluster validation approaches address the number of existing clusters.
• Approaches are based on objective functions, figures of merits,
resampling, adding noise ....
Hard clustering vs. soft clustering
Hard clustering:
• Based on classical set theory
• Assigns a gene to exactly one cluster
• No differentiation how well gene is represented by
cluster centroid
• Examples: hierachical clustering, k-means, SOMs, ...
Soft clustering:
• Can assign a gene to several cluster
• Differentiate grade of representation (cluster membership)
• Example: Fuzzy c-means, HMMs, ...
Hard clustering is sensitive to noise
Example data set:
Yeast cell cylce data by Cho et al.
Standard deviation of
expression
Standard procedure is pre-filtering
of genes based on variation due to
noise sensitvity of hard clustering.
However, no obvious threshold exists!
(Heyer et al.: ca. 4000 genes, Tavazoe et al.: 3000
genes, Tamayo et al.: 823 genes)
=> Risk of essential losing information
=> Need of noise robust clustering method
Soft clustering is more noise robust
Hard clustering always detects
clusters, even in random data
Soft clustering differentiates cluster
strength and, thus, can avoid
detection of 'random' clusters
Genes with high membership values
cluster together inspite of added noise
Differentiation in cluster membership
allows profiling of cluster cores
●
●
●
●
●
A gene can be assigned to several clusters
Each gene is assigned to a cluster with a membership value between 0
and 1
The membership values of a gene add up to one
Genes with lower membership values are not well represented by the
cluster centroid
Expression of genes with high membership values are close to cluster
centroid
=> Clusters have internal structures
Hard clustering
Membership value > 0.5 Membership value > 0.7
Varitation in cluster parameter reveals
cluster stability
m=1.1
Variation of fuzzification
parameter m determines 'hardness'
of clustering:
m → 1: Fuzzy c-means clustering
becomes equivalent to k-means
m → ∞: All genes are equivally assigned
to all clusters.
m=1.3
Strong clusters maintain their core for
increasing m
By variation of m clusters can be
distinguished by their stability.
Weak cluster lose their core
Periodic and aperiodic clusters
Periodic clusters of yeast cell cycle:
Aperiodic clusters:
=> Aperiodic clusters were generally weaker than
periodic clusters
Global clustering structure
=> Sub-clustering reveals
sub-structures
Increasing number of clusters
c-means clustering allows definition
of overlap of clusters i.e. how many
genes are shared by two clusters.
This enables to define a similarity
measure between clusters.
Global clustering structures can be
visualised by graphs i.e. edges
representing overlap.
Non-linear 2D-projection by
Sammon's Mapping
M. Futschik and B. Carlisle, Noise robust, soft clustering of gene expression data
(JBCB, Vol. 3, No. 4, 965-988, 2005)