Revealing the internal structures of gene expression data sets
Matthias E. Futschik
Institute for Theoretical Biology
Humboldt-University, Berlin, Germany
Hvar Summer School, 2004
Overview
• It is a two-way road: Top-down vs. bottom-up approaches
• Good to see: Visualisation
  – PCA and multi-dimensional scaling
• Guilt by association: Clustering
  – Hard clustering vs. soft clustering
• Gattaca becomes alive: Classification
Approaches of modelling in molecular biology
[Diagram: the bottom-up approach proceeds from sets of measurements of single components to a network of interactions of single components; the top-down approach proceeds from system-wide measurements to the underlying molecular mechanism.]
Visualisation
• An important tool for the detection of patterns remains the visualisation of results.
• Examples are MA-plots, dendrograms, Venn diagrams or projections derived from multi-dimensional scaling.
• Do not underestimate the ability of the human eye.
Principal component analysis
PCA:
• Linear projection of the data onto the major principal components, defined by the eigenvectors of the covariance matrix.
• PCA is also used for reducing the dimensionality of the data.
• Criterion to be minimised: the square of the distance between the original and the projected data. This is fulfilled by the Karhunen-Loève transformation

$x_P = P x$

P is composed of the eigenvectors of the covariance matrix

$C = \frac{1}{n-1} \sum_i (x_i - \mu)(x_i - \mu)^t$

Example: Leukemia data sets by Golub et al.: classification of ALL and AML
Multi-dimensional scaling
Sammon's mapping:
• Non-linear multi-dimensional scaling methods such as Sammon's mapping aim to optimally conserve the distances of the higher-dimensional space in the 2- or 3-dimensional projection.
• Mathematically: minimisation of the error function E by the steepest descent method:

$E = \frac{1}{\sum_{i<j}^{N} D_{ij}} \sum_{i<j}^{N} \frac{(D_{ij} - d_{ij})^2}{D_{ij}}$

Example: DLBCL prognosis – cured vs. fatal cases
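A rough sketch of this minimisation, assuming plain gradient descent with an arbitrary step size (Sammon's original algorithm uses a second-order update; this is only a toy illustration of the error function E above):

```python
# Toy Sammon mapping: gradient descent on the stress E defined above.
import numpy as np

def sammon(X, n_iter=500, lr=0.1, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # original distances D_ij
    scale = D[np.triu_indices_from(D, k=1)].sum()               # normalisation sum_{i<j} D_ij
    np.fill_diagonal(D, 1.0)                                    # avoid division by zero
    Y = rng.random((X.shape[0], 2))                             # random 2-D start
    for _ in range(n_iter):
        d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
        np.fill_diagonal(d, 1.0)
        W = (D - d) / (D * d + eps)                             # weight of each point pair
        np.fill_diagonal(W, 0.0)
        # gradient of E with respect to each projected point Y_i
        grad = (-2.0 / scale) * (W[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * grad                                          # steepest descent step
    return Y
```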
Clustering: Birds of a feather flock together
• Clustering of genes
  – Co-expression indicates co-regulation: functional annotation
  – Clustering of time series
• Clustering of arrays:
  – finding new subclasses in sample space
• Two-way clustering:
  – parallel clustering of samples and genes
Clustering methods
Unsupervised classification of genes and/or samples.
Motivation: Co-expression indicates co-regulation.
General division into hierarchical and partitional clustering.
Hierarchical clustering
• can be divisive or agglomerative, producing nested clusters.
• Results are usually visualised by tree structures (dendrograms).
• Clustering depends on the linkage procedure used: single, complete, average, Ward, ...
• A related family of methods is based on graph-theoretical approaches (e.g. CLICK).
Examples for hierarchical clustering:
Profiling of breast cancer by Perou et al.
A. Alizadeh et al., Nature, 2000: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
Clustering methods II
Partitional clustering
• divides the data into a (pre-)chosen number of classes.
• Examples: k-means, SOMs, fuzzy c-means, simulated annealing, model-based clustering, HMMs, ...
• Setting the number of clusters is problematic.
Cluster validity:
• Most cluster algorithms always detect clusters, even in random data.
• Cluster validation approaches address the number of existing clusters.
• Approaches are based on objective functions, figures of merit, resampling, adding noise, ...
Hard clustering vs. soft clustering
Hard clustering:
• Based on classical set theory
• Assigns a gene to exactly one cluster
• No differentiation of how well a gene is represented by its cluster centroid
• Examples: hierarchical clustering, k-means, SOMs, ...
Soft clustering:
• Can assign a gene to several clusters
• Differentiates the grade of representation (cluster membership)
• Examples: fuzzy c-means, HMMs, ...
K-means clustering
• Partitional clustering splits the data into k partitions for a given integer k.
• A partition can be represented by a partition matrix U that contains the membership values μij of each object j for each cluster i.
• For clustering methods based on classical set theory, clusters are mutually exclusive. This leads to the so-called hard partitioning of the data.
Hard partitions are defined as

$M_{hc} = \left\{ U \in \mathbb{R}^{k \times N} \;\middle|\; \mu_{ij} \in \{0,1\}\ \forall\, i,j;\ \sum_{i=1}^{k} \mu_{ij} = 1\ \forall\, j;\ 0 < \sum_{j=1}^{N} \mu_{ij} < N\ \forall\, i \right\}$

where k is the number of clusters and N is the number of data objects.
Partitional clustering is frequently based on the optimisation of a given objective function. If the data is given as a set of N data vectors, a common objective function is the square error function

$E = \sum_{j} \sum_{x_i \in C_j} d(x_i, c_j)^2$

where d is the distance metric and cj is the centre of cluster j.
K-means algorithm
1. Initiation: choose k random vectors as cluster centres cj.
2. Partitioning: assign xi to cluster j if d(xi, cj) ≤ d(xi, ck) for all clusters k ≠ j.
3. Calculation of cluster centres cj based on the partition derived in step 2: the cluster centre cj is defined as the mean value of all vectors within the cluster.
4. Calculation of the square error function E.
5. If the chosen stop criterion is met, stop; otherwise continue with step 2.
For the distance metric d, the Euclidean distance is generally chosen.
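The five steps above in a short NumPy sketch; initialisation by sampling data points and stopping when the centres no longer move are common choices, not prescribed by the slide:

```python
# Plain k-means following the steps above (sketch; assumes Euclidean distance).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random centres
    for _ in range(n_iter):
        # step 2: assign each x_i to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # step 3: recompute each centre as the mean of its cluster
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # steps 4-5: stop when the partition (and hence E) no longer changes
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```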
Hard clustering is sensitive to noise
Example data set: yeast cell cycle data by Cho et al.
[Plot: standard deviation of expression across genes]
The standard procedure is pre-filtering of genes based on their variation, owing to the noise sensitivity of hard clustering.
However, no obvious threshold exists!
(Heyer et al.: ca. 4000 genes, Tavazoie et al.: 3000 genes, Tamayo et al.: 823 genes)
=> Risk of losing essential information
=> Need for a noise-robust clustering method
Soft clustering is more noise robust
• Hard clustering always detects clusters, even in random data.
• Soft clustering differentiates cluster strength and can thus avoid the detection of 'random' clusters.
• Genes with high membership values cluster together in spite of added noise.
Differentiation in cluster membership allows profiling of cluster cores
• A gene can be assigned to several clusters.
• Each gene is assigned to a cluster with a membership value between 0 and 1.
• The membership values of a gene add up to one.
• Genes with lower membership values are not well represented by the cluster centroid.
• The expression of genes with high membership values is close to the cluster centroid.
=> Clusters have internal structures
[Figure panels: hard clustering; membership value > 0.5; membership value > 0.7]
Variation in cluster parameter reveals cluster stability
Variation of the fuzzification parameter m determines the 'hardness' of the clustering:
• m → 1: fuzzy c-means clustering becomes equivalent to k-means
• m → ∞: all genes are equally assigned to all clusters
Strong clusters maintain their core for increasing m; weak clusters lose their core.
By variation of m, clusters can thus be distinguished by their stability.
[Figure panels: m = 1.1 and m = 1.3]
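One fuzzy c-means iteration, showing where m enters, can be sketched as follows; these are the standard Bezdek update rules, not necessarily the exact implementation behind the talk:

```python
# One fuzzy c-means iteration; m is the fuzzification parameter discussed above.
import numpy as np

def fcm_step(X, centres, m=1.3, eps=1e-12):
    # distance of every gene x_i to every cluster centre c_j
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1) + eps
    # membership update: mu_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1));
    # as m -> 1 this approaches hard (k-means) assignment
    mu = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
    # centre update: c_j = sum_i mu_ij^m x_i / sum_i mu_ij^m
    w = mu ** m
    return mu, (w.T @ X) / w.sum(axis=0)[:, None]
```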
Periodic and aperiodic clusters
Periodic clusters of yeast cell cycle:
Aperiodic clusters:
=> Aperiodic clusters were generally weaker than periodic clusters
Global clustering structure
Non-linear 2D-projection by Sammon's mapping
=> Sub-clustering reveals sub-structures
M. Futschik and B. Carlisle, Noise robust, soft clustering of gene expression data (in preparation)
Increasing number of clusters
c-means clustering allows definition of the overlap of clusters, i.e. how many genes are shared by two clusters. This makes it possible to define a similarity measure between clusters.
The global clustering structure can be visualised by graphs, with edges representing the overlap.
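One way to compute such an overlap-based similarity from a membership matrix; the exact measure used in the study is not given on the slide, so the product-of-memberships definition below is an assumption:

```python
# Sketch: cluster-cluster similarity from shared genes. The product-of-memberships
# definition is an assumption; the slide only states that overlap counts genes
# shared by two clusters.
import numpy as np

def cluster_overlap(mu):
    """mu: genes x clusters membership matrix.
    Entry (j, l) = sum_i mu_ij * mu_il, the 'expected' number of shared genes."""
    return mu.T @ mu

def overlap_edges(mu, cutoff=1.0):
    """Edges of the overlap graph: cluster pairs whose overlap exceeds a cutoff."""
    O = cluster_overlap(mu)
    k = O.shape[0]
    return [(j, l) for j in range(k) for l in range(j + 1, k) if O[j, l] > cutoff]
```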
Classification of microarray data
• Many diseases involve (unknown) complex interactions of multiple genes; thus the "single gene approach" is limited.
• => Genome-wide approaches may reveal these interactions.
• To detect these patterns, supervised learning techniques from pattern recognition, statistics and artificial intelligence can be applied.
• The medical applications of these "arrays of hope" are various and include the identification of markers for classification, diagnosis, disease outcome prediction, therapeutic responsiveness and target identification.
Gattaca – the Art of Classification
• In contrast to clustering approaches, algorithms for supervised classification are based on labelled data.
• Labels assign data objects to a predefined set of classes.
• Frequently the class distributions are not known, so the learning of the classifiers is inductive.
• The task for classification methods is the correct assignment of new examples, based on a set of examples of known classes (generalisation).
[Figure: points of Class 1 and Class 2 with new examples marked '?' to be assigned]
Challenges in classification of microarray data
• Microarray data exhibit large experimental and biological variances
  – experimental bias + tissue heterogeneity
  – cross-hybridisation
  – 'bad design': confounding effects
• Microarray data are sparse
  – high dimensionality of the gene (feature) space
  – low number of samples/arrays
  – curse of dimensionality
• Microarray data are highly redundant
  – Many genes are co-expressed, thus their expression is strongly correlated.
Classification I: Models
• K-nearest neighbour
  – Simple and quick method
• Decision trees
  – Easy to follow the classification process
• Bayesian classifier
  – Inclusion of prior knowledge possible
• Neural networks
  – No model assumed
• Support vector machines
  – Based on statistical learning theory; today's state of the art.
Criteria for classification
• Accuracy: how close the results are to the true values
• Precision: how variable the results are compared to the true value
• Sensitivity: how many true positives are detected
• Specificity: how many of the selected genes are true positives
[Figure: ROC curve]
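A minimal sketch of these criteria computed from binary predictions. The confusion-matrix conventions are assumptions: specificity is taken here as the true-negative rate, while the slide's phrasing ("how many of the selected are true positives") corresponds to precision/positive predictive value:

```python
# Toy computation of classification criteria from a confusion matrix.
import numpy as np

def criteria(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
    return {
        "accuracy":    (tp + tn) / len(y_true),  # closeness to the true labels
        "sensitivity": tp / (tp + fn),           # fraction of true positives detected
        "specificity": tn / (tn + fp),           # true-negative rate
        "precision":   tp / (tp + fp),           # selected that are true positives
    }
```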
Getting a good team: Feature extraction – gene selection
Selection of single genes based on:
• parametric tests, e.g. t-test
• non-parametric tests, e.g. Wilcoxon rank test
But the 11 best players do not necessarily form the best team!
Selection of groups of genes which act well together, based on:
• sensitivity measures
• genetic algorithms
• decision trees
• SVD
Bayesian Classifiers
The most fundamental classifier in statistical pattern recognition is the Bayes classifier, which is directly derived from the Bayes theorem.
Suppose a vector x belongs to one of k classes. The probability P(C, x) of observing x belonging to class C is

$P(C, x) = P(x \mid C) \cdot P(C) \quad (1)$

P(x|C): conditional probability for x given that class C is observed
P(C): prior probability for class C
Similarly, the joint probability P(C, x) can be expressed by

$P(C, x) = P(C \mid x) \cdot P(x) \quad (2)$

P(C|x): conditional probability for C given that object x is observed
P(x): prior probability of observing x
Bayesian Classifiers
Since equations (1) and (2) describe the same probability P(C, x), we can derive

$P(x \mid C) \cdot P(C) = P(C \mid x) \cdot P(x)$

$\Rightarrow \quad P(C \mid x) = \frac{P(x \mid C) \cdot P(C)}{P(x)}$

This is the famous Bayes theorem, which can be applied for classification as follows. We assign x to class Cj if

$P(C_j \mid x) \geq P(C_k \mid x)$

for all classes Ck with k ≠ j. This rule constitutes the Bayes classifier.
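A sketch of this decision rule with Gaussian class-conditional densities P(x|C); the Gaussian assumption is added here for concreteness, as the slides leave P(x|C) generic:

```python
# Bayes classifier sketch: assign x to the class maximising P(x|C_j) * P(C_j).
# P(x) is the same for all classes and can be dropped from the comparison.
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(x, means, covs, priors):
    posteriors = [multivariate_normal.pdf(x, mean=m, cov=c) * p
                  for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(posteriors))   # index j of the winning class C_j
```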
Decision trees
• Stepwise classification into two classes
• Comprehensible (in contrast to black-box approaches)
• Overfitting can occur easily
• Complex (non-linear) interactions of genes may not be reflected in the tree structure
Artificial neural networks
• ANNs were originally inspired by the functioning of biological neurons.
• Two major components: neurons and the connections between them.
• A neuron receives inputs xi and determines the output y based on an activation function. An example of a non-linear activation function is the sigmoid function

$y = \frac{1}{1 + e^{-(\sum_i w_i x_i + b)}}$

• Multi-layer perceptrons are hierarchically structured and trained by backpropagation.
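The sigmoid neuron above in a few lines of NumPy (the weights and inputs are toy values; the formula follows the reconstruction given above):

```python
# A single sigmoid neuron: y = 1 / (1 + exp(-(sum_i w_i x_i + b))).
import numpy as np

def neuron(x, w, b):
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

print(neuron(np.array([0.5, -1.2]), np.array([0.8, 0.3]), b=0.1))  # toy inputs
```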
Support vector machines
• SVMs are based on statistical learning theory and belong to the class of kernel-based methods.
• The basic concept of SVMs is the transformation of the input vectors into a high-dimensional feature space, where a linear separation may be possible between the positive and negative class members.

$K(x, y) = (x \cdot y)^d$

[Figure: the mapping f(·) takes points from the input space to the feature space]
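The polynomial kernel shown on the slide evaluates the dot product in feature space without constructing the mapping f explicitly; a small sketch, with degree d = 2 chosen arbitrarily:

```python
# Polynomial kernel K(x, y) = (x . y)^d: compares points in feature space
# without computing the mapping f explicitly (the 'kernel trick').
import numpy as np

def poly_kernel(x, y, d=2):
    return np.dot(x, y) ** d

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, y))   # equals an inner product in the degree-d feature space
```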
Example study: Tumour/Normal classification
Motivation: colon cancer should be detected as early as possible to avoid invasive treatment.
Data: study by Alon et al. based on expression profiling of 60 samples with Affymetrix GeneChips containing over 6000 genes.
Method: adaptive neural networks. The network structure can be translated into linguistic rules.
M. Futschik et al., AI in Medicine, 2003