Improved Bayesian segmentation with a novel application in genome biology
Petri Pehkonen, Kuopio University
Garry Wong, Kuopio University
Petri Törönen, HY, Institute of Biotechnology

Outline
• a little motivation
• some heuristics used
• the proposed Bayes model
  – also presents a modified Dirichlet prior
• the proposed testing with artificial data
  – discusses the use of prior information in the evaluation
• a brief analysis of real datasets

Biological problem setup
Input
• Genes and their associations with biological features such as regulation, expression clusters, functions, etc.
Assumption
• Neighbouring genes in the genome may share the same features.
Aim
• Find the chromosomal regions "over-related" to some biological feature or combination of features; look for non-random localization of features.
I will mainly discuss the gene expression data application.

A comparison with some earlier work on expression data
• Our aim is to analyze gene expression along the genome from a new perspective
  – standard: consider very local areas of ~constant expression level
  – our view: how about looking at larger regions that have clearly more active genes (under certain conditions)?
• Our perspective is related to the idea of active and passive regions of the genome.

Further comparison with earlier work
• Standard: an up/down/no-regulation classification, or a real value from each experiment, as the input vector for a gene
• Our idea: one can also associate genes with clusters in varying clustering solutions
  – a multinomial variable/vector for each gene
  – by using a varying number of clusters one obtains both broader and narrower classes
• This is related to the idea of combining weak coherent signals occurring across various measurements with clusters.

Methodological problem setup
Gene participation in co-expression clusters
• Genes can be partitioned into separate clusters according to expression similarity: first 2 clusters, then 3, then 4, etc.
• The aim is to find chromosomal regions where consecutive genes fall in the same expression clusters across the different clustering results.

Example (rows: clustering resolutions, from specific to broad expression similarity; columns: gene order in the chromosome; 0 = no cluster assignment):

Gene order in chromosome:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
6 gene expression clusters: 0  0  1  5  2  3  6  5  0  3  3  3  4  0  0
5 gene expression clusters: 0  0  5  4  5  2  1  2  0  4  4  4  4  0  0
4 gene expression clusters: 0  0  3  3  4  3  3  3  0  2  2  1  2  0  0
3 gene expression clusters: 0  0  3  3  3  3  3  3  0  1  1  1  1  1  0
2 gene expression clusters: 0  0  1  1  1  1  1  1  1  1  1  1  1  0  0

Existing segmentation algorithms
Non-heuristic:
• dynamic programming
Heuristic:
• hierarchical (top-down/bottom-up, recursive/iterative)
• k-means-like solutions (EM methods)
• sliding window with adaptive window size (?)
• etc.

Hierarchical vs. non-hierarchical heuristic methods
• Non-hierarchical heuristic methods usually produce only a single solution
  – compare k-means in clustering
  – these often require a parameter (the number of change-points)
  – they aim to create a (locally) optimal solution for that number of change-points
• Hierarchical heuristic methods produce a large group of solutions with a varying number of change-points
  – a large group of solutions can be created in one run
  – the solutions could usually be optimized further

Recursive vs. iterative hierarchical heuristics
• Recursive hierarchical heuristics
  – slice usually until some stopping rule (e.g., a BIC penalty) is fulfilled
  – each segment is sliced independently of the rest of the data
  – hard to obtain solutions for a varying number of change-points
  – designed to stop at an optimum (which can be a local optimum)
• Iterative (top-down?) hierarchical heuristics
  – slice until a stopping rule or a maximum number of segments is fulfilled
  – each new change-point is placed only after all current segments have been analyzed; the best change-point over all segments is selected
  – creates a chain of solutions with a varying number of segments
  – can be run past the (local) optimum to see whether a better solution appears after a few bad steps

Our choice for heuristic search
• Top-down hierarchical segmentation

How to place a new change-point
The position of the new change-point is usually selected using a statistical measure:
• maximization of the log likelihood ratio (a ratio of ML-based solutions)
  – lighter to calculate
  – often referred to as the Jensen-Shannon divergence
  – a sketch of this option follows below
• maximization of our Bayes factor
  – the Bayes model is discussed later
  – the natural choice (as this is what we want to optimize)
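To make the search concrete, here is a minimal Python sketch of the log-likelihood-ratio option for a single multinomial dimension encoded as integer class labels; under the independence assumptions used later, the multidimensional score would simply sum over dimensions. All function names and the data layout are ours for illustration, not from the paper's implementation:

```python
import numpy as np

def class_counts(labels, k):
    """Class counts for one segment (labels are ints in 0..k-1)."""
    return np.bincount(labels, minlength=k)

def log_lik(cnt):
    """Maximized multinomial log-likelihood of one segment."""
    n = cnt.sum()
    nz = cnt[cnt > 0]
    return float((nz * np.log(nz / n)).sum())

def best_split(labels, k):
    """Best change-point inside one segment by the log-likelihood-ratio
    gain over keeping the segment whole."""
    total = log_lik(class_counts(labels, k))
    best_pos, best_gain = None, -np.inf
    for pos in range(1, len(labels)):          # every candidate border
        gain = (log_lik(class_counts(labels[:pos], k))
                + log_lik(class_counts(labels[pos:], k)) - total)
        if gain > best_gain:
            best_pos, best_gain = pos, gain
    return best_pos, best_gain

def top_down(labels, k, max_segments):
    """Top-down hierarchical segmentation: at every step score all current
    segments, place the single best change-point over all of them, and
    record the whole chain of nested solutions."""
    labels = np.asarray(labels)
    borders = [0, len(labels)]
    solutions = [list(borders)]
    while len(borders) - 1 < max_segments:
        candidates = []
        for a, b in zip(borders, borders[1:]):  # score every segment
            pos, gain = best_split(labels[a:b], k)
            if pos is not None:
                candidates.append((gain, a + pos))
        if not candidates:
            break
        gain, cut = max(candidates)             # best over all segments
        borders = sorted(borders + [cut])
        solutions.append(list(borders))
    return solutions
```

The gain above equals $n_1 D_{KL}(p_1 \| p) + n_2 D_{KL}(p_2 \| p)$ with $p$ the pooled class distribution, i.e. a (weighted) Jensen-Shannon divergence between the two halves, which is why the slides use the two names interchangeably.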
The Bayes factor would seem the natural choice, but in testing we noticed that we started splitting only the smallest segments. Why?

Bias in the Bayesian score
• The first figure shows random data (no preferred change-point position).
• The second figure shows the behaviour of the log likelihood ratio model (the ML method).
• The third figure shows the behaviour of the Bayes factor (BF).
• The highest point of each profile is taken as the change-point.
• Notice the bias in BF that favours cutting near the ends.
• Still, all BFs are negative (against splitting).
• This causes problems when we force the algorithm to go past the local optimum ⇒ we chose the ML score for the change-point search.

What we have obtained so far…
• Top-down hierarchical heuristic segmentation
• An ML-based measure (JS divergence) used to select the next change-point

Selecting the optimal solution from the hierarchy
• A hierarchical segmentation of data of size n contains n different nested solutions.
• The solutions must be evaluated in order to find a proper one: not too general, not too complex.
• We need model selection.

Model selection used for segmentation models
Two ideas occur in the most used methods:
• evaluating "the fit" of the model (usually an ML score)
• penalizing for the parameters used in the model
  – segmentation model parameters: the data classes within segments and the positions of the change-points
We used a few (ML-based) model selection methods:
• AIC
• BIC
• modified BIC (designed for segmentation tasks)
We were not happy with their performance, therefore…

Our model selection criterion
• Bayesian approach ⇒ takes into account uncertainty and a priori information on the parameters.
• The change-point model M includes two varying parameter groups:
  A. class proportions within segments
  B. change-points (segment borders)
• The posterior probability of the model M fitted to data D is obtained by integrating over the A and B parameter spaces:

$$P(M \mid D) \propto P(D \mid M) = \iint P(D \mid M, A, B)\, P(A)\, P(B)\, dA\, dB$$

Our approximations/assumptions for A:
• clusters (segments) do not affect each other (independence)
• data dimensions do not affect each other (independence)
  – these two allow a simple multiplication
• the segmentation does not directly affect the modelling of the data within a cluster; only the model and the prior on A affect the likelihood of the data: P(D|M,A,B) = P(D|M,A)

Our model selection criterion
• Therefore a multinomial model with a Dirichlet prior can be used to calculate the integrated likelihood:

$$\int P(D \mid M, A)\, P(A)\, dA \;=\; \prod_{v=1}^{V} \frac{\Gamma\!\big(\sum_{i=1}^{I_v} \alpha_{vi}\big)}{\Gamma\!\big(\sum_{i=1}^{I_v} (x_{vi} + \alpha_{vi})\big)} \prod_{i=1}^{I_v} \frac{\Gamma(x_{vi} + \alpha_{vi})}{\Gamma(\alpha_{vi})}$$

where the outer multiplication runs over the dimensions v and the inner multiplication over the classes i of one dimension ($x_{vi}$ are the class counts, $\alpha_{vi}$ the Dirichlet prior weights). A sketch of this computation follows below.
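The integrated likelihood above has a closed form, so in practice it can be evaluated in log space with log-gamma functions. A minimal sketch, assuming NumPy/SciPy; the function name and data layout are ours:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(x, alpha):
    """Log of the Dirichlet-multinomial integrated likelihood for one
    segment and one dimension: class counts x, Dirichlet weights alpha.
      log G(sum(alpha)) - log G(sum(x + alpha))
        + sum_i [ log G(x_i + alpha_i) - log G(alpha_i) ]"""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (gammaln(alpha.sum()) - gammaln((x + alpha).sum())
            + (gammaln(x + alpha) - gammaln(alpha)).sum())

# Under the independence assumptions above, the full integrated likelihood
# is the sum of log_marginal over all segments and all dimensions.
```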
Further assumptions/approximations for B:
• We assume all the change-points exchangeable
  – the order in which the change-points are found does not matter for the solution.
• We do not integrate over the parameter space B, but analyze only the MAP solution
  – this requires a proper prior for B…

Our model evaluation criterion
• We select a flat prior for simplicity
  – this makes the MAP solution equal to the ML solution.
• The prior of the parameters B is 1 divided by the number of ways the current m change-point estimates can be positioned in data of size n:

$$P(B) = \binom{n-1}{m}^{-1}$$

Our model evaluation criterion
• The final form of our criterion is (without the log): a "flat" MAP estimate for the parameters B times the posterior probability of the parameter group A:

$$P(D \mid M) \approx P_{MAP}(B) \int P(D \mid M, A)\, P(A)\, dA = \binom{N-1}{m}^{-1} \prod_{c=1}^{C} \prod_{v=1}^{V} \frac{\Gamma\!\big(\sum_{i=1}^{I_v} \alpha_{cvi}\big)}{\Gamma\!\big(\sum_{i=1}^{I_v} (x_{cvi} + \alpha_{cvi})\big)} \prod_{i=1}^{I_v} \frac{\Gamma(x_{cvi} + \alpha_{cvi})}{\Gamma(\alpha_{cvi})}$$

• The multiplication goes over the various clusters (segments, c) and the various dimensions (v). Quite a simple equation.

What about the Dirichlet prior weights?
The multinomial model requires prior parameters:
• Standard Dirichlet prior weights:
  I) all the prior weights the same (FLAT)
  II) prior probabilities equal to the class probabilities in the whole dataset (CSP)
• These require the definition of a prior sum (ps). We used ps = 1 (FLAT1, CSP1) and ps = the number of classes (FLAT, CSP) for both of the previous priors.
• Empirical Bayes (?) prior (EBP): prior II with ps = sqrt(N_c) (Carlin and others; 'scales according to the std').

…Dirichlet prior weights
• We considered EBP reasonable, but with small class proportions and small clusters EBP is problematic
  – the gamma function in the Dirichlet equation probably approaches infinity (as a prior weight approaches zero).
• Our modified EBP (MEBP) mutes this behaviour:

$$\alpha_i = \sqrt{N_c\, P(X=i)} \quad \text{instead of} \quad \alpha_i = \sqrt{N_c}\, P(X=i)$$

• Now the prior weights approach 0 more slowly when a class proportion is small.

…Dirichlet prior weights
• With MEBP the ps also depends on the class distribution (a more even distribution ⇒ a bigger ps) and on the number of classes (more classes ⇒ a bigger ps). Both of these sound natural…
• The prior weight can also be linked to the Chi-Square test.
A comparison of the EBP and MEBP weights is sketched below.
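A small Python illustration of the two weighting schemes, assuming the definitions on the slides and taking N_c to be the relevant data count (the subscript suggests a per-cluster count); the function names are ours:

```python
import numpy as np

def ebp_weights(class_probs, n_c):
    """Empirical-Bayes prior (EBP): alpha_i = sqrt(N_c) * P(X = i),
    so the prior sum is sqrt(N_c)."""
    return np.sqrt(n_c) * np.asarray(class_probs, dtype=float)

def mebp_weights(class_probs, n_c):
    """Modified EBP (MEBP): alpha_i = sqrt(N_c * P(X = i)).
    For small probabilities sqrt(p) >> p, so the weights approach zero
    more slowly, muting the blow-up of Gamma(alpha) as alpha -> 0.
    The prior sum sqrt(N_c) * sum_i sqrt(p_i) grows with the number of
    classes and with the evenness of the class distribution."""
    return np.sqrt(n_c * np.asarray(class_probs, dtype=float))

# Example: a rare class (p = 0.01) with N_c = 100 data points gets
# EBP weight 0.1 but MEBP weight 1.0.
print(ebp_weights([0.01, 0.99], 100))   # [0.1   9.9  ]
print(mebp_weights([0.01, 0.99], 100))  # [1.0   9.9499...]
```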
What we have obtained so far…
• Top-down hierarchical heuristic segmentation
• An ML-based measure (JS divergence) used to select the next change-point
• The results from the heuristic are analyzed with the proposed Bayes model
  – a flat prior using the number of potential segmentations with the same m
  – the MEBP prior for the multinomial data

Evaluation
• Testing uses artificial data
  – we can vary the number of segments, the number of classes, and the class distributions, and monitor the performance.
• Do the hierarchical segmentation.
• Select the best result with the various methods.
• Standard measure for evaluation: compare how well the obtained segments correspond to the segments used in the data generation.
• But is good correlation/correspondence always what we want to see?

When correlation fails
• many clusters/segments and few data points
• consecutive small segments
• similar neighbouring segments
• a single segment in the obtained segmentation (or in the data generation) ⇒ no correspondence
Problem: correlation does not account for Occam's razor.

Our proposal
• Base the evaluation on the similarity between the statistical model used to generate each data point (DGM) and the data model obtained from the segmentation for that data point (DEM)
  – this resembles standard cross-validation.
• Use a probability distribution distance measure to monitor how similar they are
  – one can think of this as an infinitely large test data set.
• We only need to select the distance measure.
• An extra plus: with hierarchical results we can locate the optimal result and see whether a method overestimates or underestimates it.

Probability distribution distance measures
Here X is the DGM and Y is the DEM (obtained from the segments).
• Kullback-Leibler divergence (the most natural):

$$D_{KL}(X \| Y) = E_X[\log(X/Y)] = \sum_i P(X=i)\, \log\frac{P(X=i)}{P(Y=i)}$$

• The inverse of the KL divergence:

$$D_{KL\_Inv}(X \| Y) = D_{KL}(Y \| X)$$

• The Jensen-Shannon divergence:

$$D_{JS}(X \| Y) = D_{KL}\big(X \,\|\, (X+Y)/2\big) + D_{KL}\big(Y \,\|\, (X+Y)/2\big)$$

• Other measures were also tested…

The Good, the Bad and…
• The DEM can have data points with P(X = i) = 0
  – these create an infinite score in D_KL
  – this under-estimates the optimal model.
• D_KL_Inv was considered as a correction, but
  – P(X = i) = 0 now causes too many zero scores, since x·log(x) is defined as 0 when x → 0
  – this heavily over-estimates the model.
• D_JS was selected as a compromise between these two phenomena (a sketch follows below).
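A minimal sketch of the three measures' zero-handling behaviour, following the slide's definition of D_JS (which, as written there, omits the conventional 1/2 factor); function names are ours:

```python
import numpy as np

def d_kl(p, q):
    """Kullback-Leibler divergence for categorical distributions.
    Terms with p_i = 0 contribute 0 (x*log x -> 0); terms with
    p_i > 0 but q_i = 0 make the result infinite."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def d_js(p, q):
    """Jensen-Shannon divergence as defined on the slide:
    D_KL(p || m) + D_KL(q || m), with m = (p + q)/2. Since m_i = 0
    only where both p_i and q_i are 0, D_JS is always finite -- the
    compromise between D_KL (infinite on unseen classes) and its
    inverse (too forgiving of mass placed on unseen classes)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2.0
    return d_kl(p, m) + d_kl(q, m)
```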
Do we want to use prior information?
• Standard cross-validation: the Bayes method uses prior information, ML does not.
• Is this fair?
  – the same result with and without a prior gets a different score
  – the method with a prior usually gets better results.
• Our (= my!) opinion: the evaluation should use the same amount of prior information for all the methods!
  – we would get the same score for the same result (independent of the method)
  – we would pick the model from the model group that usually performs better.
• Selecting the prior for the evaluation process is now an open question!

A defending note
• The amount of prior information only affects the results from one group of the analyzed artificial datasets (sparse signal / small segments).
• These are the datasets where the Bayes methods behave differently.
• A revelation from the results: the ML methods perform worse also in the datasets where the prior has little effect.
• ⇒ The use of the prior is mainly important for the comparisons of our Bayes method against the other priors…

Rules for selecting the prior for model evaluation
• The obtained DEM should be as close to the DGM as possible (= more correct, smaller D_JS).
• The prior used should be based on something other than our favourite MEBP
  – hoping we would not get good results with MEBP just because of the same prior.
• Use as little prior as possible
  – we want the segment area to have as much effect as possible.
• Better ideas?

Comparison of model evaluation priors
• We used small-segment data with 10 and 30 classes (= the prior affects the results).
• We used CSP (class prior = class probability × ps), with ps = 1, 2, c/4, c/2, 3·c/4, c, 10·c (c = the number of classes).
• We looked at the obtained D_JS for various segmentation outcomes (from the hierarchical results) with 1 – n segments (n = max(5, k), k = the number of segments in the artificial data).
• The analysis was done with artificial datasets.

[Figure: Jensen-Shannon divergence vs. prior sum (ps = 1, 2, c/4, c/2, 3·c/4, c, 10·c) for data with 10 classes and with 30 classes; the approximate minimum lies at ps = the number of classes.]

Comparison of priors
• We did not look for the minimum, but wanted a compromise between the minimum and a weak prior effect:
• We chose ps = c/2.
• The choice is quite arbitrary, but a quick analysis with neighbouring priors gave similar results.

[Figure: the same Jensen-Shannon divergence vs. prior sum curves for the 10-class and 30-class data.]

The proposed method + artificial data evaluation
• Top-down hierarchical heuristic segmentation with ML used to select the next change-point.
• The results from the heuristic are analyzed with the proposed Bayes model.
• The results are evaluated with the artificial data
  – estimate how well the obtained model predicts future data sets
  – compare the models with a D_JS that also uses prior information.

More on evaluation
Three data types (with a varying number of classes):
i) several (1 – 10) large segments (each 30 – 300 data points)
  • this should be ~easy to analyze
ii) few (1 – 4) large segments (30 – 300 data points)
  • this should have a less reliable prior class distribution
iii) several (1 – 10) small segments (15 – 60 data points)
  • the most difficult to analyze
  • the prior affects these results
• The number of classes used in each: 2, 10, 30
  – data sparseness increases with an increasing number of classes.
• The data classes were made skewed.

…evaluation…
• Data segmented by the top-down method: 1 – 100 segments.
• Model selection methods were used to pick the optimal segmentation
  – ML methods: AIC, BIC, modified BIC
  – the Bayes method with the Dirichlet priors: FLAT1, FLAT, CSP1, CSP, EBP, MEBP.
• Each test was replicated 100 times.
• D_JS was calculated between the DGM and the obtained DEM.

…still evaluating
• As mentioned: the smaller the JS distance between the DGM and the DEM, the better the model selection method.
• For simplification, we subtracted the JS distances obtained with our own Bayesian method from the distances obtained with the other methods.
• We took the average of these differences over the 100 replicates, and the Z-score mean(diff)/std(diff)·sqrt(100); a sketch of this computation follows below.
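A small sketch of the paired comparison described above, assuming one D_JS value per replicate for each method; the function name is ours:

```python
import numpy as np

def paired_comparison(d_other, d_ours):
    """Paired comparison over replicates: positive values mean the other
    method's selected model lies farther (in D_JS) from the DGM than ours.
    Returns the Z-score mean(diff)/std(diff)*sqrt(n) and the mean
    difference, matching the two boxes of the results table below."""
    diff = np.asarray(d_other, float) - np.asarray(d_ours, float)
    z = diff.mean() / diff.std() * np.sqrt(len(diff))
    return z, diff.mean()
```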
Results on the artificial data
The upper table shows the Z-scores, mean(diff)/std(diff)·sqrt(100); the lower table shows the average differences. In the original slide, Z-scores above 3 (strong support in favour of our method) were shaded, and Z-scores below 0 (any result against our method) were underlined.

Z-scores:
Data      AIC    BIC   BIC2   CSP   EBP   CSP1  Flat  Flat1
i, 2     16.8    0.6   -0.6  -1.4  -2.2  -1.2  -1.8  -1.5
i, 10     1.6    7.0    3.8   1.6   1.0   3.5   1.6   3.2
i, 30     4.1   13.5   10.3   1.9   1.7   4.3   2.0   4.5
ii, 2     8.7    0.9    2.4   1.6  -1.3   2.2   1.6   2.5
ii, 10    0.4    7.4    2.5   4.0   0.8   3.0   0.5   2.0
ii, 30    1.5   15.4   14.7   8.2   2.2   7.4  -1.4   5.3
iii, 2    7.0    2.8    1.3  -1.3  -0.7   1.0   1.2   1.0
iii, 10   1.9   13.8    8.1   1.6   2.4   4.6   4.2   5.6
iii, 30  11.9   13.9   13.9   5.0   4.9   8.7   5.5   9.7
Average   0.60   0.84   0.63  0.24  0.10  0.37  0.15  0.36

Average differences:
Data      AIC    BIC    BIC2   CSP    EBP    CSP1   Flat   Flat1
i, 2      5.65   0.06  -0.05  -0.09  -0.06  -0.07  -0.12  -0.09
i, 10     0.16   4.89   0.55   0.04   0.02   0.45   0.08   0.39
i, 30     0.78  58.12  17.48   0.24   0.08   5.74   0.23   4.28
ii, 2     1.42   0.01   0.08   0.03  -0.03   0.07   0.05   0.07
ii, 10    0.03   3.01   0.25   0.47   0.05   0.32   0.02   0.18
ii, 30    0.22  12.64  12.19   1.66   0.30   4.15  -0.15   3.47
iii, 2    1.13   0.27   0.11  -0.08  -0.03   0.09   0.11   0.09
iii, 10   0.15  13.61   3.67   0.19   0.21   1.90   0.65   1.47
iii, 30   5.82  13.88  13.88   0.59   0.51   8.93   1.72  10.70
Average   1.70  11.83   5.35   0.34   0.12   2.40   0.29   2.29

Summary: AIC is bad on two classes (it overestimates); BIC (and modified BIC) are bad on 10 and 30 classes (they underestimate); FLAT1 and CSP1 are weak on 10 and 30 classes (they overestimate).

Large segments: detailed view
• The rows show the D results for the datasets with 2, 10 and 30 classes, relative to the D from the segmentation selected by the Bayes model with MEBP.
• Positive results ⇒ the Bayes model with MEBP outperforms; negative results ⇒ the method in question outperforms the Bayes model with MEBP.
• Column 1: the mainly worse methods; column 2: the mainly better methods.
• These results did not depend on the D_JS prior.

Large segments in small data
• This is the data where the prior information is less reliable (a smaller dataset).
• The flat class prior outperforms our prior on the 30-class dataset.

Small segments
• The hardest data to model.
• This is the data where the prior affects the evaluation significantly.
• Without the prior, the BIC methods give the best result (= 1 segment is considered best).

Summary from the artificial data
• MEBP had the better overall result in 23/24 pairwise comparisons on the 30-class datasets (in 18/24 the Z-score > 3).
• MEBP had the better overall result in all pairwise comparisons on the 10-class datasets (in 12/24 the Z-score > 3).
• Our method was slightly outperformed by the other Bayes methods on dataset i with 2 classes. Also, EBP slightly outperforms it on every 2-class dataset
  – EBP might be better for smaller class numbers; MEBP underestimates the optimum here.
• The ML methods and the priors with ps = 1 (FLAT1, CSP1) had the weakest performance.

Analysis of real biological data
• Yeast cell cycle time series gene expression data.
• The genes were clustered with k-means into 3 groups, 4 groups, 5 groups, and 6 groups.
• The order of the genes in the chromosomes and the gene associations with the expression clusters were turned into multidimensional multinomial data.
• The aim was to locate regional similarities in gene expression in the yeast cell cycle.

Anything in the real data?
• Each chromosome was segmented.
• The segmentation score of each chromosome was compared to the scores from randomized data (100 randomizations).
• Goodness: (x − mean(rand)) / std(rand); a sketch follows after the table.

CHR  Rand. mean  Rand. std  log(P(M|D))  Goodness
1      -726.39      3.86      -711.47      3.87
2     -2783.24      5.17     -2759.31      4.62
3     -1134.89      6.65     -1103.91      4.66
4     -5331.72      8.80     -5160.64     19.44
5     -1899.52      3.62     -1889.82      2.68
6      -792.07      4.90      -752.02      8.17
7     -3548.24      6.34     -3523.82      3.85
8     -1982.86      2.46     -1969.82      5.31
9     -1502.43      6.71     -1492.22      1.52
10    -2589.06      3.36     -2543.79     13.48
11    -2185.09      9.37     -2167.20      1.91
12    -3693.34      4.60     -3658.42      7.58
13    -3176.61      5.06     -3166.51      2.00
14    -2641.54      6.02     -2612.29      4.86
15    -3719.47      6.80     -3693.68      3.79
16    -3157.52      3.77     -3150.92      1.75
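A minimal sketch of the randomization test behind the table, assuming `score_fn` stands for the whole segment-and-score pipeline returning log P(M | D); the pipeline details and names are ours, not from the paper:

```python
import numpy as np

def goodness(observed, rand_scores):
    """Standardized score against the randomization distribution:
    (x - mean(rand)) / std(rand), as in the table above."""
    r = np.asarray(rand_scores, dtype=float)
    return (observed - r.mean()) / r.std()

def chromosome_goodness(gene_labels, score_fn, n_rand=100, seed=0):
    """Randomization test for one chromosome: shuffle the gene order
    n_rand times, re-run the segmentation score on each shuffle, and
    standardize the observed score against the shuffled scores."""
    rng = np.random.default_rng(seed)
    observed = score_fn(gene_labels)
    rand = [score_fn(rng.permutation(gene_labels)) for _ in range(n_rand)]
    return goodness(observed, rand)
```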
Conclusions
• We showed a Bayes model that overall outperforms the ML-based methods.
• We proposed a modified prior that performs better than the other tested priors on datasets with many classes.
• We proposed a way of testing the various methods
  – it avoids picking too detailed models
  – the use of a prior can be considered a drawback.
• We showed the preference for the ML score when segmenting data with very weak signals.
• The real data has localized signal.

Future points
• Improve the heuristic (optimize the results).
• Use of fuzzy vs. hard cluster classifications.
• Various other potential applications (no certainty of their rationality yet…).
• Should clusters be merged? (Work done in HIIT, in Mannila's group.)
• Consider sound ways of setting the prior for the D_JS calculation.
• The length of a gene, the density of genes?

Thank you! (= Wake up!)