Download Parent Techniques and their Problems Hierarchical Clustering SOM

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Polycomb Group Proteins and Cancer wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Mir-92 microRNA precursor family wikipedia , lookup

NEDD9 wikipedia , lookup

Transcript
A hierarchical unsupervised growing
neural network for clustering gene
expression patterns
Javier Herrero, Alfonso Valencia & Joaquin Dopazo
Seminar “Neural Networks in Bioinformatics”
by Barbara Hammer
Presentation by Nicolas Neubauer
January, 25th, 2003
Topics
• Introduction
– Motivation
– Requirements
• Parent Techniques and their Problems
• SOTA
• Conclusion
25.1.2002
SOTA
2
Motivation
• DNA arrays create huge
masses of data.
• Clustering may provide
first orientation.
• Clustering: Group
vectors so that similar
vectors are together.
• Vectorizing DNA array
data:
.3 .7
.2 .6
.2 .5 .1 .5
.3 .5
.4 .5
.1
.2
.3
.5
.6
.7
.4
.3
.2
.5
.5
.5
( )( )( )( )
– Each gene is one point
in input space
• in reality:
– Each condition (i.e.,
each DNA array)
– several thousands of genes,
contributes 1 component
– several dozens of DNA
of an input vector.
arrays
25.1.2002
SOTA
3
Requirements
• Clustering algorithm should…
… be tolerant to noise
… capture high-level (inter-cluster) relations
… be able to scale topology based on
• topology of input data
• users’ required level of detail
• Clustering is based on similarity
measure
– Biological sense of distance function must
be validated
25.1.2002
SOTA
4
Topics
• Introduction
• Parent Techniques and their Problems
– Hierarchical Clustering
– SOM
• SOTA
• Conclusion
25.1.2002
SOTA
5
Hierarchical Clustering
• Vectors are arranged
in a binary tree
• Similar vectors are
close in the tree
hierarchy
• One node for each
vector
• Quadratic runtime
• Result may depend
on order of data
25.1.2002
SOTA
6
Hierarchical Clustering (II)
Clustering algorithm should…
… be tolerant to noise
no - data is directly used to define position
… capture high-level (inter-cluster) relations
yes - tree structure gives very clear relationships
between clusters
… be able to scale topology based on
– topology of input data
yes - tree is built to fit distribution in input data
– users’ required level of detail
no - tree is fully built; may be reduced by later analysis,
but has to be fully built before
25.1.2002
SOTA
7
Self-Organising Maps
• Vectors are assigned
to clusters
• Clusters are defined
by the neurons
which serve as
“prototypes” for that
cluster
• Many vectors per
cluster
• Linear runtime
25.1.2002
SOTA
8
Self-Organising Maps (II)
Clustering algorithm should…
… be tolerant to noise
yes - data is not aligned directly but in relation to
prototypes which are averages
… capture high-level (inter-cluster) relations
? - paper says no, but neighbourhood of neurons?
… be able to scale topology based on
– topology of input data
no - number of clusters is set before-hand, data may be
stretched to fit the SOM’s topology
“if some particular type of profile is abundant, … this type
of data will populate the vast majority of clusters”
– users’ required level of detail
yes - choice of number of clusters influences detail
25.1.2002
SOTA
9
Topics
• Introduction
• Parent Techniques and their Problems
• SOTA
–
–
–
–
Growing Cell Structure
Learning Algorithm
Distance Measures
Abortion Criteria
• Conclusion
25.1.2002
SOTA
10
SOTA Overview
• SOTA stands for
self-organising tree algorithm
• SOTA combines best things from
hierarchical clustering and SOMs:
– Align clusters in a hierarchical structure
– Use cluster prototypes created in a SOMlike way
• New idea: Growing Cell Structures
– Topology is built up incrementally as data
requires it
25.1.2002
SOTA
11
Growing Cell Structures
• Topology consists of
– Cells, the clusters, and
– Nodes, connections
between the cells
• Cells can become
nodes and get two
daughter cells
• Result: Binary tree
• SOTA: Cells have
codebook serving as
prototypes for clusters
25.1.2002
SOTA
• Good splitting
criteria: topology
adapts to data
12
Learning Algorithm
• Compare pattern Pj to
cell Ci
• Cell for which
d(Pj, Ci) is smallest wins
Repeat cycle
Repeat epoch
For each pattern,
adapt cells:
• Find winning cell
• Adapt cells
Until updating-finished()
Split cell containing
most heterogenity
Until all-finished()
25.1.2002
SOTA
13
Learning Algorithm
Repeat cycle
Repeat epoch
For each pattern,
adapt cells:
• Find winning cell
• Adapt cells
Until updating-finished()
Split cell containing
most heterogenity
Until all-finished()
25.1.2002
Ci(t+1) = Ci(t)+n*(Pj - Ci(t))
• Move cell into direction of
of pattern, with learning
factor n depending on
proximity
• Only three cells (at most)
are updated:
– winner cell,
– ancestor cell and
– sister cell
• nwinner > nancestor > nsister
• If sister is no longer cell,
but node, only winner is
adapted
SOTA
14
Learning Algorithm
Repeat cycle
Repeat epoch
For each pattern,
adapt cells:
• Find winning cell
• Adapt cells
Until updating-finished()
Split cell containing
most heterogenity
Until all-finished()
25.1.2002
When is updating finished?
• Each pattern “belongs” to
its winner cell from this
epoch
• So, each cell has a set of
patterns assigned
• Resource (Ri) is the
average distance between
the cell’s codebook and its
patterns
• The sum of all Ris is the
error t of epoch t
• Stop repeating epochs if
|(t - t-1)/(t-1)| < E
SOTA
15
Learning Algorithm
• A new cluster is created.
• Most efficient: split cell
where patterns are most
heterogenous
• Measure for heterogenity:
Repeat cycle
Repeat epoch
For each pattern,
adapt cells:
• Find winning cell
• Adapt cells
– Resource
mean distance between
patterns and cell
– Variability
maximum distance
between patterns
Until updating-finished()
Split cell containing
most heterogenity
Until all-finished()
25.1.2002
• Cell turns into node
• Two daughter cell inherit
mother’s codebook
SOTA
16
Learning Algorithm
Repeat cycle
Repeat epoch
For each pattern,
adapt cells:
• Find winning cell
• Adapt cells
Until updating-finished()
Split cell containing
most heterogenity
The big question:
When to stop iterating cycles
• When each pattern has its
own cell
• When maximum number
of nodes is reached
• When maximum resource
or variability value drops
under a certain level
Until all-finished()
25.1.2002
SOTA
– See later for sophisticated
calculations of such a
threshold
17
Learning Algorithm
Repeat cycle
Repeat epoch
For each pattern,
adapt cells:
• Find winning cell
• Adapt cells
Until updating-finished()
Split cell containing
most heterogenity
Until all-finished()
25.1.2002
SOTA
18
Learning Algorithm
Repeat cycle
Repeat epoch
For each pattern,
adapt cells:
• Find winning cell
• Adapt cells
Until updating-finished()
Split cell containing
most heterogenity
Until all-finished()
25.1.2002
SOTA
19
Distance Measures
• How does d(x,y)
really look like?
• Distance function
has to contain
biological similarity!
• Euclidean distance:
d(x,y) = √(xi-yi)2
• Pearson correlation:
d(x,y)=(1-r)
r = (exi-êx)(eyi-êy)
SexSey
-1≤ r ≤ 1
25.1.2002
.1
.2
.3
.5
.6
.7
.4
.3
.2
.5
.5
.5
• Empirical evidence
suggests that Pearson
correlation better
grasps similarity in case
of DNA array data.
SOTA
20
SOTA evaluation
Clustering algorithm should…
… be tolerant to noise
yes - data is averaged via codebooks just as in SOM
… capture high-level (inter-cluster) relations
yes - hierarchical structure as in hierarch. clustering
… be able to scale topology based on
– topology of input data
yes - tree is extended to meet the distribution of variance
in the input data
– users’ required level of detail
yes - tree can be adjusted to desired level of detail;
criteria may also be set to meet certain confidence levels...
25.1.2002
SOTA
21
Abortion Criteria
• What we are looking for:
“an upper level of
distance at which two
genes can be considered
to be similar at their
profile expression levels”
• Distribution of distances
has to do with nonbiological characteristics
of the data
– Many points with few
components cause a lot
of high correlations
25.1.2002
SOTA
22
Abortion Criteria (II)
• Idea:
• Problem:
– If we knew the
distribution in the
data that is random,
– A confidence level 
could be given
– Meaning that having a
given distance given
two unrelated genes
is not more probable
than .
25.1.2002
– We cannot know the
random distribution:
– We only know the real
distribution which is
partially due to
random properties,
partially due to “real”
correlations
• Solution:
SOTA
– Approximation by
shuffling
23
Abortion Criteria (III)
• Shuffling
• Claim:
– For each pattern, the
components are
randomly shuffled
– Correlation is
destroyed
– Number of points,
ranges of values,
frequency of values
are conserved
25.1.2002
– Random distance
distribution in this
data approximates
random distance
distribution in real
data
• Conclusion
– If p(corr.>a)<=5%
in random data,
– Finding corr.>a in
the real data is
meaningful with
95% confidence
SOTA
24
• Only 5% of the
random data pairs
have a correlation >
.178
• Choosing .178 as
threshold, there is a
95% confidence that
genes in the same
cluster are there for
non-statistical, but
biological reasons
25.1.2002
SOTA
25
Topics
• Introduction
• Parent Techniques and their Problems
• SOTA
• Conclusion
– Additional nice properties
– Summary of differences compared to parent
techniques
25.1.2002
SOTA
26
Additional properties
• As patterns do not have to be compared
to each other, runtime is approximately
linear in the number of patterns
– Like SOM
– Hierarchical clustering uses a distance
matrix relating each pattern to each other
pattern
• The cell’s vectors approach very closely
the average of the assigned data points
25.1.2002
SOTA
27
Summary of SOTA
• Compared to SOMs
– SOTA builds up a
topology that reflects
higher-order relations
– Level of detail can be
defined very flexible
• Compared to
standard hierarchical
clustering
– SOTA is more robust
to noise
– Has better runtime
properties
– Has a more flexible
concept of cluster
• Nicer topological
properties
• Adding of new data
into an existing tree
would be
problematic(?)
25.1.2002
SOTA
28