Improved Bayesian segmentation
with a novel application in genome
biology
Petri Pehkonen, Kuopio University
Garry Wong, Kuopio University
Petri Törönen, HY, Institute of
Biotechnology
Outline
• a little motivation
• some heuristics used
• the proposed Bayes model
– also presents a modified Dirichlet prior
• proposed testing with artificial data
– discusses the use of prior information in the
evaluation
• a brief analysis of real datasets
Biological problem setup
Input
• Genes and their associations with biological features like
regulation, expression clusters, functions etc.
Assumption
• Neighbouring genes in the genome may share the same features
Aim
• Find the chromosomal regions "over-related" to some
biological feature or combination of features, look for
non-random localization of features
I will mainly discuss the gene expression data application
A comparison with some earlier
work with expression data
• Our aim is to analyze the
gene expression from the
genome with a new
perspective
– standard: Consider very
local areas of ~constant
expression levels
– our view: How about
looking at larger regions
that have clearly more
active genes (under certain
conditions)?
Our perspective is related to the idea of active and
passive regions of the genome
Further comparison with earlier
work
• Standard: Up/Down/No regulation classification
or real value from each experiment as input vector
for gene
• Our idea: One can also associate genes with clusters
in varying clustering solutions
=> a multinomial variable/vector for each gene
By using a varying number of clusters one should obtain
broader and narrower classes
This is related to the idea of combining weak coherent
signals occurring in various measurements with clusters
Methodological problem setup
Gene participance in co-expression clusters
• Genes can be partitioned into separate clusters according to
expression similarity: first 2 clusters, then 3, then 4 etc.
• Aim is to find chromosomal regions where consecutive genes
are in same expression clusters in different clustering results
Broader vs. specific expression similarity across the clustering
solutions, for 15 consecutive genes (0 = gene not assigned to any cluster):

Gene order in chromosome:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
6 clusters (most specific):   0  0  1  5  2  3  6  5  0  3  3  3  4  0  0
5 clusters:                   0  0  5  4  5  2  1  2  0  4  4  4  4  0  0
4 clusters:                   0  0  3  3  4  3  3  3  0  2  2  1  2  0  0
3 clusters:                   0  0  3  3  3  3  3  3  0  1  1  1  1  1  0
2 clusters (broadest):        0  0  1  1  1  1  1  1  1  1  1  1  1  0  0
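The stack of clustering solutions above can be encoded as one multinomial vector per gene; a minimal sketch (my illustration, not the authors' code) using the toy labels from the example:

```python
# Turn several clustering solutions into one multinomial vector per gene,
# in chromosome order. Label 0 marks genes absent from a solution.

def gene_vectors(solutions):
    """solutions: dict {n_clusters: [label per gene, in chromosome order]}.
    Returns one tuple of labels per gene, one entry per clustering solution."""
    keys = sorted(solutions)
    n_genes = len(solutions[keys[0]])
    return [tuple(solutions[k][g] for k in keys) for g in range(n_genes)]

# The toy data from the slide: 15 genes, 2- to 6-cluster solutions.
solutions = {
    2: [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    3: [0, 0, 3, 3, 3, 3, 3, 3, 0, 1, 1, 1, 1, 1, 0],
    4: [0, 0, 3, 3, 4, 3, 3, 3, 0, 2, 2, 1, 2, 0, 0],
    5: [0, 0, 5, 4, 5, 2, 1, 2, 0, 4, 4, 4, 4, 0, 0],
    6: [0, 0, 1, 5, 2, 3, 6, 5, 0, 3, 3, 3, 4, 0, 0],
}
vectors = gene_vectors(solutions)
print(vectors[2])  # gene 3's labels in the 2-,3-,4-,5-,6-cluster solutions
```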
Existing segmentation algorithms
Non-heuristic:
• Dynamic programming
Heuristic:
• Hierarchical
– Top-down/bottom-up
– Recursive/iterative
• K-means-like solutions (EM methods)
• Sliding window with adaptive window size (?)
• etc. ..
Hierarchical vs. Non-hierarchical
Heuristic methods
• Non-hierarchical heuristic methods usually
produce only a single solution
• cf. k-means in clustering
• these often require a parameter (the number of change-points)
• they aim for a (locally) optimal solution for that number of
change-points
• Hierarchical heuristic methods produce a large
group of solutions with varying numbers of
change-points
• a large group of solutions can be created in one run
• the solutions can usually be optimized further
Recursive vs. Iterative hierarchical
heuristics
• Recursive hierarchical heuristics
• slice until some stopping rule (e.g. a BIC penalty) is
fulfilled
• each segment is sliced independently of the rest of the data
• hard to obtain solutions for a varying number of change-points
• designed to stop at an optimum (which can be a local optimum)
• Top-Down (?) hierarchical heuristics
• slice until a stopping rule or a maximum number of clusters is
fulfilled
• each new change-point is placed after all segments are
analyzed; the best change-point over all segments is selected
• create a chain of solutions with varying numbers of segments
• can be run past the (local) optimum to see if a
better solution appears after a few bad results
Our choice for heuristic search
• Top-Down hierarchical segmentation
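The generic Top-Down loop can be sketched as follows (an illustrative sketch of the search described above, not the authors' implementation); the scoring function is pluggable, and the toy midpoint score below is only a stand-in:

```python
# Greedy top-down hierarchical segmentation: repeatedly place the single
# best new change-point over all current segments and record the whole
# chain of nested solutions.

def top_down_segment(n, score, max_segments):
    """Return the chain of border lists for 1..max_segments segments.
    score(lo, hi, cut) -> gain of splitting segment [lo, hi) at 'cut'."""
    breaks = [0, n]                              # current segment borders
    chain = [list(breaks)]
    while len(breaks) - 1 < max_segments:
        best = None
        for lo, hi in zip(breaks, breaks[1:]):   # every current segment
            for cut in range(lo + 1, hi):        # every admissible cut
                g = score(lo, hi, cut)
                if best is None or g > best[0]:
                    best = (g, cut)
        if best is None:                         # no admissible cut left
            break
        breaks = sorted(breaks + [best[1]])
        chain.append(list(breaks))
    return chain

# Toy score that simply prefers cuts near a segment's midpoint:
midpoint = lambda lo, hi, cut: -abs(cut - (lo + hi) / 2)
print(top_down_segment(8, midpoint, 3))  # -> [[0, 8], [0, 4, 8], [0, 2, 4, 8]]
```

In the real algorithm the score would be the log likelihood-ratio (or Bayes factor) of splitting the segment.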
How to place a new change-point
The new change-point position is usually selected using a
statistical measure:
• optimization of the log likelihood-ratio (a ratio of ML-based
solutions)
• lighter to calculate
• often referred to as the Jensen-Shannon divergence
• optimization of our Bayes Factor
• the Bayes model is discussed later
• the natural choice (as this is what we want to optimize)
The Bayes factor would seem natural, but:
in testing we noticed that we started splitting only the smallest
segments?? Why???
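For one multinomial dimension, the log likelihood-ratio score for a candidate cut can be sketched like this (my sketch, not the paper's code); up to scaling, maximizing it is equivalent to maximizing the weighted Jensen-Shannon divergence between the two sides' class distributions:

```python
import math
from collections import Counter

def multinomial_loglik(xs):
    """Maximized multinomial log-likelihood of a label sequence."""
    n = len(xs)
    return sum(c * math.log(c / n) for c in Counter(xs).values())

def llr_score(xs, cut):
    """Log likelihood-ratio gain of splitting xs into xs[:cut], xs[cut:]."""
    return (multinomial_loglik(xs[:cut]) + multinomial_loglik(xs[cut:])
            - multinomial_loglik(xs))

xs = [1, 1, 1, 1, 2, 2, 2, 2]
best = max(range(1, len(xs)), key=lambda c: llr_score(xs, c))
print(best)  # the true change point: 4
```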
Bias in the Bayesian score
The first figure shows random data
(no preferred change-point position).
The second figure shows the
behaviour of the log likelihood-ratio
profile (ML method).
The third figure shows the behaviour
of the Bayes Factor (BF).
• The highest point of each profile is
taken as the change point
• Notice the bias in BF that favours
cutting near the ends
• Still, all BFs are negative (against
splitting)
This causes problems when we force the algorithm
to go past the local optimum
=> we chose the ML score for the
change-point search
What we have obtained so far…
• Top-Down hierarchical heuristic
segmentation
• an ML-based measure (JS divergence) used to
select the next change-point
Selecting optimal solution from
hierarchy
• A hierarchical segmentation of n-sized data
contains n different nested solutions
• The solutions must be evaluated in order to
find a proper one: not too general, not too
complex
• We need model selection
Model selection used for
segmentation models
Two ideas occurring in the most used methods:
• evaluating "the fit" of the model (usually an ML score)
• penalization for the parameters used (in the model)
– segmentation model parameters: the data classes in segments
and the positioning of change-points
We used a few (ML-based) model selection methods:
• AIC
• BIC
• modified BIC (designed for segmentation tasks)
We were not happy with their performance, therefore ..
Our model selection criterion
• Bayesian approach => takes into account uncertainty
and a priori information on parameters
• The change-point model M includes two varying
parameter groups:
A. class proportions within segments
B. change-points (segment borders)
• The posterior probability of the model M fitted to data
D is obtained by integrating over the A and B parameter
spaces:

$$
P(M \mid D) \propto P(D \mid M) = \iint P(D \mid M, A, B)\,P(A)\,P(B)\,dA\,dB
$$
Our approximations/assumptions:
A
• clusters do not affect each other (independence)
• data dimensions do not affect each other
(independence)
These two allow simple multiplication.
• segmentation does not directly affect
the modelling of the data in a cluster
Only the model and the prior A affect the likelihood
of the data: P(D|M,A,B) = P(D|M,A)
Our model selection criterion
• Therefore a multinomial model with a Dirichlet prior
can be used to calculate the integrated likelihood
(multiplication over the dimensions v = 1..V and, within each
dimension, over its classes i = 1..I_v):

$$
\int P(D \mid M, A)\,P(A)\,dA
= \prod_{v=1}^{V}
\frac{\Gamma\!\left(\sum_{i=1}^{I_v}\alpha_{vi}\right)}
     {\Gamma\!\left(\sum_{i=1}^{I_v}x_{vi} + \sum_{i=1}^{I_v}\alpha_{vi}\right)}
\prod_{i=1}^{I_v}
\frac{\Gamma(x_{vi}+\alpha_{vi})}{\Gamma(\alpha_{vi})}
$$
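In log form the integrated likelihood of one dimension is easy to compute with the log-gamma function; a sketch (my code, not the authors'):

```python
from math import lgamma, exp

def log_marginal(counts, alphas):
    """Log Dirichlet-multinomial integrated likelihood of one dimension:
    log[ Gamma(sum a) / Gamma(sum x + sum a) * prod Gamma(x+a)/Gamma(a) ]."""
    a0, n = sum(alphas), sum(counts)
    return (lgamma(a0) - lgamma(n + a0)
            + sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alphas)))

# One observation of class 1 among two classes, flat Dirichlet(1, 1) prior:
print(exp(log_marginal([1, 0], [1, 1])))  # ~0.5, as expected analytically
```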
Further assumptions/approximations: B
• We assume all the change-points exchangeable
• the order of finding change-points does not matter for the solution
• We do not integrate over the parameter space B, but
analyze only the MAP solution
• we need a proper prior for B..
Our model evaluation criterion
• We select a flat prior for simplicity
– this makes the MAP equal to the ML solution
• The prior of the parameter group B is 1 divided by the number of
ways the current m change-point estimates can be
positioned into data of size n:

$$
P(B) = \frac{1}{\binom{n-1}{m}}
$$
Our model evaluation criterion
• The final form of our criterion is (without the log) the
"flat" MAP estimate for the parameters B times the posterior
probability of the parameter group A:

$$
P(D \mid M) \approx P_{MAP}(B)\int P(D \mid M, A)\,P(A)\,dA
= \frac{1}{\binom{N-1}{m}}
\prod_{c=1}^{C}\prod_{v=1}^{V}
\frac{\Gamma\!\left(\sum_{i=1}^{I_v}\alpha_{cvi}\right)}
     {\Gamma\!\left(\sum_{i=1}^{I_v}x_{cvi} + \sum_{i=1}^{I_v}\alpha_{cvi}\right)}
\prod_{i=1}^{I_v}
\frac{\Gamma(x_{cvi}+\alpha_{cvi})}{\Gamma(\alpha_{cvi})}
$$

The multiplication goes over the various clusters (c) and the
various dimensions (v). Quite a simple equation.
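The whole criterion, in log form, is a sum of the change-point prior term and one Dirichlet-multinomial term per cluster and dimension; a sketch under the independence assumptions above (my code, not the authors'):

```python
from math import lgamma, log

def log_binom(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_criterion(counts, alphas, n, m):
    """log P(D|M): flat change-point prior 1/C(n-1, m) times one
    Dirichlet-multinomial term for every cluster c and dimension v.
    counts[c][v] and alphas[c][v] are lists over the classes i."""
    total = -log_binom(n - 1, m)              # log of the flat prior P(B)
    for seg_x, seg_a in zip(counts, alphas):
        for x, a in zip(seg_x, seg_a):
            a0, x0 = sum(a), sum(x)
            total += lgamma(a0) - lgamma(x0 + a0)
            total += sum(lgamma(xi + ai) - lgamma(ai)
                         for xi, ai in zip(x, a))
    return total
```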
What about the Dirichlet prior
weights
A multinomial model requires prior parameters:
• Standard Dirichlet prior weights:
I) all the prior weights the same (FLAT)
II) prior probabilities equal to the class probabilities in the
whole dataset (CSP)
• These require the definition of a prior sum (ps)
We used ps = 1 (CSP1, FLAT1) and ps = number of
classes (CSP, FLAT) for both of the previous priors
• Empirical Bayes (?) prior (EBP): prior II with ps =
√N_c (Carlin, others; 'scales according to the std')
…Dirichlet prior weights
• We considered EBP reasonable but…
• With small class proportions and small clusters
EBP is problematic
– the gamma function of the Dirichlet equation probably
approaches infinity (as the prior weight approaches zero)
• The modified EBP (MEBP) mutes this behaviour:

$$
\alpha_i = \sqrt{N_c\,P(X=i)} \quad\text{instead of}\quad \alpha_i = \sqrt{N_c}\,P(X=i)
$$

• now the prior weights approach 0 more slowly when
the class proportion is small
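The two weight schemes can be sketched side by side (my reading of the slides: N_c is the relevant count, taken here as the cluster size, and P(X=i) the class probability in the whole dataset):

```python
import math

def ebp_weights(n_c, probs):
    """EBP: alpha_i = sqrt(N_c) * P(X=i)."""
    return [math.sqrt(n_c) * p for p in probs]

def mebp_weights(n_c, probs):
    """MEBP: alpha_i = sqrt(N_c * P(X=i)); approaches 0 more slowly for
    rare classes, keeping Gamma(alpha_i) from exploding."""
    return [math.sqrt(n_c * p) for p in probs]

# A rare class (p = 0.001) in a cluster of 100 data points:
print(ebp_weights(100, [0.001, 0.999])[0])   # tiny weight
print(mebp_weights(100, [0.001, 0.999])[0])  # clearly larger, muted behaviour
```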
…Dirichlet prior weights
• Also, the ps in MEBP now depends on
the class distribution (a more even
distribution => bigger ps). Also, a larger
number of classes => bigger ps. Both of these
sound natural…
• The prior weight can also be linked to the Chi-Square
test.
What we have obtained so far…
• Top-Down hierarchical heuristic
segmentation
• an ML-based measure (JS divergence) used to
select the next change-point
• results from the heuristic are analyzed using the
proposed Bayes model
– a flat prior over the number of potential segmentations
with the same m
– the MEBP prior for multinomial data
Evaluation
• testing using artificial data
• we can vary the number of clusters, the number of classes, and the
class distributions, and monitor the performance
• do hierarchical segmentation
• select the best result with the various methods
• Standard measure for evaluation: compare how
well the obtained clusters correspond to the clusters
used in the data generation
• But is a good correlation/correspondence always
what we want to see?
When correlation fails
• many clusters/segments and few data points
• consecutive small segments
• similar neighbouring segments
• one segment in the obtained segmentation (or in
the data generation) => no correspondence
Problem: correlation does not account for Occam's
Razor
Our proposal
• Base the evaluation on the similarity of the statistical
model used to generate each data point (DGM) vs. the
data model obtained from the clustering for that data point
(DEM)
– reminiscent of standard cross-validation
• Use a probability distribution distance measure to
monitor how similar they are
– one can think of this as an infinite-size testing data set
• We only need to select the distance measure
• extra plus: with hierarchical results we can look at the
optimal result and see whether a method overestimates or
underestimates it
Probability distribution distance
measures
• Kullback-Leibler divergence (most natural)

$$
D_{KL}(X \| Y) = E_X[\log(X/Y)] = \sum_i P(X=i)\,\log\frac{P(X=i)}{P(Y=i)}
$$

• Inverse of the KL

$$
D_{KL\_Inv}(X \| Y) = D_{KL}(Y \| X)
$$

• Jensen-Shannon divergence

$$
D_{JS}(X \| Y) = D_{KL}\!\left(X \,\middle\|\, \tfrac{X+Y}{2}\right) + D_{KL}\!\left(Y \,\middle\|\, \tfrac{X+Y}{2}\right)
$$

• Other measures were also tested..
Here X is the DGM and Y is the DEM (obtained from the segments).
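The three measures are a few lines of code each (a sketch, not the authors' code); note that the slide's D_JS sums the two KL terms to the midpoint without the conventional 1/2 factors, and that form is kept here:

```python
import math

def kl(p, q):
    """D_KL(P || Q) with 0*log(0) := 0; infinite when q_i = 0 but p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += math.inf if qi == 0 else pi * math.log(pi / qi)
    return total

def js(p, q):
    """D_JS as on the slide: KL of both P and Q to the mixture (P+Q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

print(kl([1.0, 0.0], [0.0, 1.0]))  # inf: the D_KL zero-probability problem
print(js([1.0, 0.0], [0.0, 1.0]))  # 2*log(2): always finite
```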
The Good, the Bad and…
• The DEM can have data points with P(X=i) = 0
– these create an infinite score in D_KL
– it under-estimates the optimal model
• D_KL_Inv was considered to correct this, but
– P(X=i) = 0 now causes too many zero scores
• x·log(x) as x → 0 was defined as 0
– it over-estimates the model heavily
• D_JS was selected as a compromise between these
two phenomena
Do we want to use prior info
• Standard cross-validation: the Bayes method uses prior
information, the ML method does not
• Is this fair?
– the same result gets different scores with and without the prior
– the method with the prior usually gets better results
• Our (=my!) opinion: the evaluation should use the same amount
of prior info for all the methods!
– we would get the same score for the same result (independently of the
method)
– we would pick the model from the model group that usually
performs better
• Selecting the prior for the evaluation process is now an open
question!
Defending note
• The amount of prior only affects the results for one
group of the artificial datasets analyzed (sparse signal
/small clusters)
• These are the datasets where the Bayes methods behave
differently.
• Revelation from the results: the ML methods perform
worse also on datasets where the prior has little effect
• => The use of the prior matters mainly for the comparisons
between our Bayes method's priors…
Rules for selecting prior
to model evaluation
• The obtained DEM should be as close to the DGM as
possible (=more correct, smaller D_JS)
• The prior used should be based on something other
than our favourite MEBP
• hoping we would not get good results with MEBP just because
of the same prior
• Use as little prior as possible
• we want the segment area to have as much effect as possible
• Better ideas?
Comparison of model evaluation
priors
• Used small-cluster data with 10 and 30 classes
(=the prior affects the results)
• Used CSP (class prior = class probability · ps), with ps
= 1, 2, c/4, c/2, 3·c/4, c, 10·c (c = number of
classes)
• Looked at the obtained D_JS for various segmentation
outcomes (from the hierarchical results) with 1 – n
clusters (n = max(5,k), k = the artificial data's cluster
number)
• The analysis was done with artificial datasets
Comparison of model evaluation
priors
• ps = 1, 2, c/4, c/2, 3·c/4, c, 10·c
• The approximate minimum is at ps = number of classes
[Figure: Jensen-Shannon divergence vs. prior sum, for data with 10 classes and for data with 30 classes]
Comparison of priors
• We did not look for the minimum, but wanted a compromise
between the minimum and a weak prior effect:
• We chose ps = c/2
• The choice is quite arbitrary, but a quick analysis with neighbouring
priors gave similar results
[Figure: Jensen-Shannon divergence vs. prior sum, for data with 10 classes and for data with 30 classes]
Proposed method+
artificial data evaluation
• Top-Down hierarchical heuristic segmentation with
ML used to select the next change-point
• Results from the heuristic are analyzed using the proposed
Bayes model
• Evaluation of the results using the artificial data
– estimate how well the obtained model predicts future
data sets
– compare the models with D_JS, which also uses prior
information
More on evaluation
• Three data types (with varying numbers of
classes):
i) several (1 – 10) large segments (each 30 – 300 data points)
• this should be ~easy to analyze
ii) few (1 – 4) large segments (30 – 300 data points)
• this should have a less reliable prior class distribution
iii) several (1 – 10) small segments (15 – 60 data points)
• most difficult to analyze
• the prior affects these results
• Number of classes used in each: 2, 10, 30
– data sparseness increases with an increasing number of
classes
• Data classes were made skewed
…evaluation…
• Data segmented by Top-Down: 1 – 100 segments
• Model selection methods used to pick the optimal
segmentation
– ML methods: AIC, BIC, modified BIC
– Bayes method with Dirichlet priors: FLAT1, FLAT,
CSP1, CSP, EBP, MEBP
• Each test was replicated 100 times
• D_JS was calculated between the DGM and the obtained
DEM
…still evaluating
• As mentioned: the smaller the JS-distance between
the DGM and the DEM, the better the model selection
method
• For simplification, we subtracted the JS-distances
obtained with our own Bayesian method from the
distances obtained with the other methods
• We took the average of these differences over the 100
replicates
The upper box shows the Z-scores (mean(diff)/std(diff)·√100), the
lower box the average differences. Shaded Z-scores (x > 3): strong
support in favour of our method. Underlined Z-scores (x < 0): a result
against our method.

Z-scores:

Data   Classes    AIC    BIC   BIC2   CSP   EBP  CSP1  Flat  Flat1
i.           2   16.8    0.6   -0.6  -1.4  -2.2  -1.2  -1.8   -1.5
i.          10    1.6    7.0    3.8   1.6   1.0   3.5   1.6    3.2
i.          30    4.1   13.5   10.3   1.9   1.7   4.3   2.0    4.5
ii.          2    8.7    0.9    2.4   1.6  -1.3   2.2   1.6    2.5
ii.         10    0.4    7.4    2.5   4.0   0.8   3.0   0.5    2.0
ii.         30    1.5   15.4   14.7   8.2   2.2   7.4  -1.4    5.3
iii.         2    7.0    2.8    1.3  -1.3  -0.7   1.0   1.2    1.0
iii.        10    1.9   13.8    8.1   1.6   2.4   4.6   4.2    5.6
iii.        30   11.9   13.9   13.9   5.0   4.9   8.7   5.5    9.7
Average          0.60   0.84   0.63  0.24  0.10  0.37  0.15   0.36

Average differences:

Data   Classes    AIC    BIC   BIC2   CSP   EBP  CSP1  Flat  Flat1
i.           2   5.65   0.06  -0.05 -0.09 -0.06 -0.07 -0.12  -0.09
i.          10   0.16   4.89   0.55  0.04  0.02  0.45  0.08   0.39
i.          30   0.78  58.12  17.48  0.24  0.08  5.74  0.23   4.28
ii.          2   1.42   0.01   0.08  0.03 -0.03  0.07  0.05   0.07
ii.         10   0.03   3.01   0.25  0.47  0.05  0.32  0.02   0.18
ii.         30   0.22  12.64  12.19  1.66  0.30  4.15 -0.15   3.47
iii.         2   1.13   0.27   0.11 -0.08 -0.03  0.09  0.11   0.09
iii.        10   0.15  13.61   3.67  0.19  0.21  1.90  0.65   1.47
iii.        30   5.82  13.88  13.88  0.59  0.51  8.93  1.72  10.70
Average          1.70  11.83   5.35  0.34  0.12  2.40  0.29   2.29
Summary: AIC is bad with two classes (overestimates);
BIC (and Mod-BIC) are bad with 10 and 30 classes (underestimate);
Flat1 and CSP1 are weak with 10 and 30 classes (overestimate)
Large segments:
detailed view
The rows show the D results for datasets with
2, 10 and 30 classes:
D from the segmentation selected by the Bayes
model with MEBP.
Positive results => the BM with MEBP
outperforms;
negative results => the method in question
outperforms the BM with MEBP.
Column 1: mainly worse methods.
Column 2: mainly better methods.
These results did not depend on the D_JS
prior.
Large segments
in small data:
this is data where the prior information is
less reliable (a smaller dataset).
The flat class prior outperforms our prior on the 30-
class dataset.
Small segments:
the hardest data to model.
This is data where the prior affects the
evaluation significantly.
Without the prior, the BIC methods give the best
result (=1 segment is considered best).
Summary from art. data
• MEBP had the better overall result in 23/24 pairwise
comparisons with the 30-class datasets (in 18/24, Z-score > 3)
• MEBP had the better overall result in all pairwise
comparisons with the 10-class datasets (in 12/24, Z-score > 3)
• Our method was slightly outperformed by the other Bayes
methods on dataset i with 2 classes; EBP also slightly
outperforms it on every 2-class dataset
– EBP might be better for smaller class numbers
– MEBP underestimates the optimum here
• The ML methods and the priors with ps = 1 (Flat1, CSP1) had the
weakest performance
Analysis of real biological data
• Yeast cell cycle time-series gene expression data
• Genes were clustered with k-means into 3, 4,
5, and 6 groups
• The order of the genes in the chromosomes, and the gene
associations with the expression clusters, were turned
into multidimensional multinomial data
• The aim was to locate regional similarities in gene
expression in the yeast cell cycle
Anything in real data
• Each chromosome was segmented
• The segmentation score of each chromosome was
compared to the scores from randomized
data (100 randomizations)
• Goodness: (x – mean(rand)) / std(rand)

CHR  Rand. mean  Rand. std  log(P(M|D))  Goodness
  1     -726.39       3.86      -711.47      3.87
  2    -2783.24       5.17     -2759.31      4.62
  3    -1134.89       6.65     -1103.91      4.66
  4    -5331.72       8.80     -5160.64     19.44
  5    -1899.52       3.62     -1889.82      2.68
  6     -792.07       4.90      -752.02      8.17
  7    -3548.24       6.34     -3523.82      3.85
  8    -1982.86       2.46     -1969.82      5.31
  9    -1502.43       6.71     -1492.22      1.52
 10    -2589.06       3.36     -2543.79     13.48
 11    -2185.09       9.37     -2167.20      1.91
 12    -3693.34       4.60     -3658.42      7.58
 13    -3176.61       5.06     -3166.51      2.00
 14    -2641.54       6.02     -2612.29      4.86
 15    -3719.47       6.80     -3693.68      3.79
 16    -3157.52       3.77     -3150.92      1.75
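The Goodness column can be reproduced as follows (a sketch; I assume a population standard deviation over the 100 randomization scores):

```python
import math

def goodness(score, random_scores):
    """(x - mean(rand)) / std(rand) over the randomization scores."""
    n = len(random_scores)
    mean = sum(random_scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in random_scores) / n)
    return (score - mean) / std
```

For example, with mean(rand) = −726.39 and std(rand) = 3.86 as for chromosome 1 above, a score of −711.47 gives (−711.47 + 726.39)/3.86 ≈ 3.87, matching the table.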
Conclusions
• Showed a Bayes model that overall outperforms the
ML-based methods
• Proposed a modified prior that performs better than
the other tested priors on datasets with many
classes
• Proposed a way of testing the various methods
– it avoids picking too detailed models
– the use of a prior can be considered a drawback
• Showed the preference for the ML score when
segmenting data with very weak signals
• The real data has localized signal
Future points
• Improve the heuristic (optimize the results)
• Use of fuzzy vs. hard cluster classifications
• Various other potential applications (no certainty
of their rationality yet..)
• Should clusters be merged? (work done in HIIT,
Mannila's group)
• Consider sound ways of setting the prior for the D_JS
calculation
• Gene length, density of genes?
Thank you!
=Wake up!