A New Biclustering Algorithm for Analyzing Biological Data
Prashant Paymal
Advisor: Dr. Hesham Ali

Introduction
• Microarray technology is used to study the expression of many genes at once
• Microarray experiments produce large amounts of data
• Proper analysis of the data is important to get meaningful information from it
• There is a need for new analysis techniques

Data Analysis
• From data to knowledge
• We need to process data by grouping and synthesizing information into a “big picture” based upon characteristics and relationships
• One of the most widely used analysis techniques is traditional clustering

Traditional Clustering
[Figure: data matrices with axes Genes and Conditions]
• Applied to either the rows or the columns of the data matrix separately
• Each gene is defined using all the conditions
• Each condition is characterized by the activity of all the genes that belong to it

Motivation
• The large amount of data poses great challenges for analysis
• Clustering algorithms consider all the conditions to group genes and all the genes to group conditions
• Biologically, data may show similar behavior not in all conditions but only in a subset of them
• Traditional clustering algorithms will very likely miss some important information

Biclustering
• The term “Biclustering” was first used in gene expression data analysis by Cheng and Church (2000)
• Clusters do not need to include all parameters (genes in bioinformatics) for all conditions
• Data Matrix
  ▫ Each gene – One row
  ▫ Each condition – One column
  ▫ Each element – Expression level of a gene under a specific condition

Biclustering (Cont.)
[Figure: data matrix with axes Genes and Conditions]
• Performs clustering in these two dimensions simultaneously
• Each gene is selected using only a subset of the conditions
• Each condition is selected using only a subset of the genes

Goal of Biclustering
• To identify subgroups of genes and subgroups of conditions by performing simultaneous clustering of both rows and columns of the gene expression matrix, instead of clustering these two dimensions separately
• Finding biclusters is an NP-hard problem: it is a generalized version of traditional clustering

Previous Work
• A systematic comparison and evaluation of biclustering methods for gene expression data, Amela Prelic et al. (2006)
• Algorithms:
  ▫ Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)
  ▫ Order Preserving Submatrix Algorithm (OPSM)
  ▫ Iterative Signature Algorithm (ISA)
  ▫ Cheng and Church algorithm
  ▫ xMotif
  ▫ Bimax

Previous Work (Cont.)
• Comparative Analysis of Biclustering Algorithms – Doruk Bozdag … (2010)
• Algorithms:
  ▫ Correlated Pattern Bicluster Algorithm (CPB)
  ▫ Cheng and Church algorithm
  ▫ Order Preserving Submatrix Algorithm (OPSM)
  ▫ HARP algorithm
  ▫ Minimum Sum-Squared Residue-based CoClustering Algorithm (MSSRCC)
  ▫ Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)

The Importance of Assessment
• Different algorithms give different solutions for the same data
• There is no agreed-upon guideline for choosing among them
• Validation Techniques
  ▫ External validation measures: evaluate a result based on knowledge of the correct class labels
  ▫ Internal validation measures: evaluate a result based on information intrinsic to the data alone

Validation
• In most biclustering papers, external validation measures are used to assess the methods
  ▫ It is not clear how to extend notions such as homogeneity and separation to the biclustering context (Gat-Viks et al. 2003)
  ▫ Internal measures don’t work well for biclustering, which is why Gat-Viks et al. (2003) and Handl et al. (2005) recommend external measures

Objectives of the Project
• Comprehensive Assessment Technique
  ▫ Internal measures as well as external measures
• Customized Biclustering Method
  ▫ Input domain

Validation using Synthetic Data
• Testing using manufactured data
  ▫ The portion of the implanted bicluster that the algorithm was able to return
  ▫ The portion external or irrelevant to the implanted bicluster that the algorithm returns
  ▫ Two metrics to evaluate cluster quality
    U: Uncovered portion of the implanted bicluster
    E: Portion of the output cluster external to the implanted bicluster

Validation using Synthetic Data
• Testing using real (domain-specific) data, for example using the gene match score
  ▫ Let M1, M2 be two sets of biclusters
  ▫ Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2
• Potential improvements
  ▫ The existing score doesn’t consider samples / conditions
  ▫ Specificity and sensitivity

Proposed Assessment
• Calculate sensitivity and specificity scores
  ▫ Specificity: proportion of actual negatives that are correctly identified
  ▫ Sensitivity: proportion of actual positives that are correctly identified
• Improve existing measures:
  ▫ Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2 (considering both genes and samples)
• Assessment based on knowledge of domain data
  ▫ The resulting biclusters were evaluated based on the enrichment of Gene Ontology (GO) terms

Experiments
• Given two biclustering results
  ▫ M1: Result of a biclustering algorithm
  ▫ M2: True result
  ▫ $(G_1, C_1) \in M_1$ and $(G_2, C_2) \in M_2$
• Calculate similarity score (Jaccard coefficient)
  ▫ $\frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$ and $\frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$
• Calculate the two scores
  ▫ Score 1: % of an algorithm's result that is included in the true result
  ▫ Score 2: % of the true result that an algorithm can find

Results
• Synthetic data: 100 genes and 100 samples
• 10 implanted biclusters, each of size 10 × 10 (10 genes and 10 samples)
• Used different publicly available biclustering algorithm implementations
Algorithm                                                       No. of biclusters   Score 1    Score 2
Cheng and Church Algorithm (CC)                                 8                   0.475      0.38
Iterative Signature Algorithm (ISA)                             9                   1          0.9
Order Preserving Submatrix Algorithm (OPSM)                     32                  0.273139   0.874044
Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)   9                   0.5        0.45
xMotif Algorithm                                                87                  0.100023   0.870204

• Score 1: % of an algorithm's result that is included in the true result
• Score 2: % of the true result that an algorithm can find

Conclusion
• Traditional clustering is too restrictive a technique for analyzing datasets in various application domains
• We need new, flexible analysis techniques like biclustering to deal with possible imperfections in the input datasets
• Assessment of data analysis is critical and must be considered when selecting the right tool for each application domain
• Biclustering is a powerful tool for data analysis in a variety of domains and can be applied to datasets outside biology

References
• Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: A survey
• Prelic, A., et al.: A systematic comparison and evaluation of biclustering methods for gene expression data
• http://cheng.ececs.uc.edu/biclustering
• http://www.tik.ethz.ch/~sop/bicat/
• http://acgt.cs.tau.ac.il/expander/

Thank you…
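
Supplementary sketch: Jaccard-based match score
• The match score from the Experiments slide (average, over the biclusters in one set, of the best Jaccard similarity to any bicluster in the other set) could be computed roughly as shown here. This is a minimal sketch, not the implementation behind the reported results. The names jaccard, bicluster_similarity and match_score are hypothetical, and averaging the gene and condition coefficients is an assumption, since the slides do not say how the two are combined. Reading match_score(found, truth) as Score 1 and match_score(truth, found) as Score 2 is likewise only one plausible interpretation.

```python
from typing import FrozenSet, List, Tuple

# A bicluster is a pair (gene indices, condition indices).
Bicluster = Tuple[FrozenSet[int], FrozenSet[int]]


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bicluster_similarity(b1: Bicluster, b2: Bicluster) -> float:
    """Similarity of two biclusters using both genes and conditions.

    Assumption: the two Jaccard coefficients are combined by a plain average.
    """
    (g1, c1), (g2, c2) = b1, b2
    return 0.5 * (jaccard(g1, g2) + jaccard(c1, c2))


def match_score(m1: List[Bicluster], m2: List[Bicluster]) -> float:
    """Average, over biclusters in m1, of the best similarity to any bicluster in m2."""
    if not m1 or not m2:
        return 0.0
    best = [max(bicluster_similarity(b1, b2) for b2 in m2) for b1 in m1]
    return sum(best) / len(best)


# Toy data (made up for illustration): one implanted 10 x 10 bicluster,
# and an algorithm output with one partial hit and one unrelated bicluster.
truth = [(frozenset(range(0, 10)), frozenset(range(0, 10)))]
found = [
    (frozenset(range(0, 5)), frozenset(range(0, 10))),
    (frozenset(range(50, 60)), frozenset(range(50, 60))),
]

score_found_vs_truth = match_score(found, truth)  # how much of the output is correct
score_truth_vs_found = match_score(truth, found)  # how much of the truth is recovered
print(score_found_vs_truth, score_truth_vs_found)
```

• The score is not symmetric: an algorithm that returns many spurious biclusters lowers match_score(found, truth) while leaving match_score(truth, found) unchanged, which matches the pattern of a low Score 1 next to a high Score 2 in the results table.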