1/42 BicAT_Plus: An Automatic Bi/Clustering Comparative Tool of Gene Expression Data Obtained Using Microarrays
Fadhl M. Al-Akwaa: Biomedical Eng. Dept., Univ. of Science & Technology, Sana’a, Yemen; Biomedical Engineering Department, Cairo University, Giza, Egypt
Mohamed H. Ali: Computer Science School, Nottingham University, Nottingham, United Kingdom
Yasser M. Kadah: Center for Informatics Sciences, Nile University, Egypt; Biomedical Engineering Department, Cairo University, Giza, Egypt

2/42 What is Bioinformatics?
Bioinformatics is defined as the creation and advancement of databases, algorithms, computational and statistical techniques, and theory for understanding biological processes.

3/42 The Central Dogma
[Figure: DNA (nucleus) -> transcription -> RNA (cytoplasm) -> translation -> protein (cytoplasm); the RNA amount is labelled "Gene Expression Level".]

4/42 Biological Balance Feedback System
[Figure: feedback loop for GENE A linking transcription rate, gene expression level, translation rate and protein level with positive and negative signs; labels: gene on, gene off, disease, drug, external or internal stimuli.]

5/42 Transcriptome data: Microarray Technology
[Figure: gene expression data matrix with genes G1..Gn as rows and conditions C1..Cm as columns; example rows 0.5 1 2 and 3 2 1.]

6/42 Biological Balance Feedback System
[Figure: two coupled feedback loops for GENE A and GENE B, each linking transcription rate, gene expression level, translation rate and protein level; external or internal stimuli.]

7/42 Biological Balance Feedback System
[Figure: Gene A and Gene B expanded into genes g1..g4 connected by positive and negative edges; labels: balance feedback loop system, Gene Regulatory Network (GRN).]

8/42 Gene Regulatory Network (GRN)
[Figure]

9/42 Biological Data Base
[Figure: DNA -> transcription -> RNA -> translation -> protein.]

10/42 Drug Discovery
• One of the main objectives of bioinformatics is to integrate these databases to advance human health.
[Figure: Drug Discovery, Disease, Ontology]

11/42 Drug Discovery & GRN
• The cost of bringing a new drug to market varies from around 500 million to 2,000 million dollars.
• Drug design requires a sophisticated understanding of how genes interact with each other, that is, constructing the GRN.
[Figure: network of genes g1..g4 with positive and negative edges.]

12/42 Drug Discovery: GRN steps
• Experimental design: prepare microarray chip, sampling rate, error, experimental condition
• Data extraction: microarray image segmentation
• Preprocessing: normalization, discretization, filtration (missing value, low entropy, low variance)
• Gene expression matrix: genes g1..gn by conditions c1..cm
• Gene clustering: traditional clustering methods, bicluster methods
• Network generation: dynamic Bayesian network, probabilistic Boolean network, fuzzy network, ...
• Drug testing

13/42 Gene Expression Data Analysis: Clustering
[Figure: an n genes x m assays expression matrix is converted into an n genes x n genes similarity matrix; genes are clustered based on similarity. Similarity measures: Euclidean distance, Pearson correlation coefficient.]

14/42 Hierarchical Clustering
        g2      g3      g4      g5
g1      0.23    0.00    0.95    -0.63
g2              0.91    0.56    0.56
g3                      0.32    0.77
g4                              -0.36
• Find largest value in similarity matrix.
• Join genes together.
• Recompute matrix and iterate.
[Figure: dendrogram joining g1 and g4.]

15/42 Hierarchical Clustering
            g2      g3      g5
g1, g4      0.37    0.16    0.52
g2                  0.91    0.56
g3                          0.77
• Find largest value in similarity matrix.
• Join clusters together.
• Recompute matrix and iterate.
[Figure: dendrogram with g1, g4 joined and g2, g3 joined.]

16/42 Hierarchical Clustering
            g2, g3  g5
g1, g4      0.27    0.52
g2, g3              0.68
• Find largest value in similarity matrix.
• Join clusters together.
• Recompute similarity matrix and iterate.
[Figure: dendrogram over g1, g4, g5, g2, g3.]

17/42 Hierarchical Clustering: dendrogram
Eisen et al. (1998), PNAS, 95(25): 14863-14868

18/42 Gene Expression Data Analysis: Clustering
• A cluster is a group of genes that show a similar expression profile across the experiments.
• Examples
– K-means
– Hierarchical
– Self-Organizing Map
– CLICK
– Model-based clustering
Eisen et al. (1998), PNAS, 95(25): 14863-14868

19/42 Gene Expression Data Analysis: Clustering Limitations
      c1    c2    c3   c4   c5    c7    c8   c9   c10
g1    3     4     1    1    7     10    11   1    1
g2    5     6     1    1    0.5   0.1   1    1    1
g3    2     2     2    2    2     2     2    2    2
g4    1     1     1    1    2     2     2    1    1
g5    3     4     4    2    5     4     7    9    8
g6    6     7     1    9    0     6     4    2    1
g7    0.5   0.1   1    2    2     2     2    2    5

20/42 Gene Expression Data Analysis: Biclustering
The mean squared residue score (MSRS). George M. Church, Professor of Genetics, Harvard Medical School

21/42 Biclustering Algorithms
Algorithm             Author
Bivisu                Kin-On Cheng et al., 2008
pClusters             Haixun Wang, 2002
RMSBE                 Xiaowen Liu and Lusheng Wang, 2006
Bimax                 Prelic et al., 2006
ROBA                  Alain B. Tchagang and Ahmed H. Tewfik, 2005
x-motif               Murali and Kasif, 2003
SAMBA                 Tanay et al., 2002
OPSM                  Ben-Dor et al., 2002
Plaid                 Laura Lazzeroni and Art Owen, 2000
ISA                   Ihmels et al., 2002
CC / δ-biclusters     Cheng and Church, 2000

22/42 Paper IDEA
Which algorithm is suitable for my dataset? Which algorithm is better? And do some algorithms have advantages over others?
Generally, comparing different biclustering algorithms is not straightforward, as they differ in strategy, approach, computational complexity, number of parameters, and prediction ability. Moreover, such methods are strongly influenced by user-selected parameter values.

23/42 BicAT-plus
• To the best of our knowledge, no bicluster comparison toolbox has been available in the literature.
• We have developed a comparative tool, which we call “Bicat-plus”, that includes a biological comparative methodology. It enables researchers and biologists to compare the different bi/clustering methods based on a set of biological values and to draw conclusions on the biological meaning of the results.

24/42 BicAT
BicAT-plus is an extension of the BicAT toolbox, a popular gene expression analysis toolbox that contains 5 biclustering and 2 traditional clustering algorithms.
• OPSM
• CC
• ISA
• x-motif
• BIMAX
• K-means
• Hierarchical

25/42 BicAT-plus Comparison Methodology
[Figure: the gene sets of the biclusters from Algorithm A (n biclusters) and Algorithm A (m biclusters) are checked against four annotation sources: Function (GO), Pathway (KEGG), PPI (BioGRID), Promoter (GenBank). Each bicluster is marked enriched or not enriched; an enriched bicluster is one that has biological meaning.]

26/42 BicAT-plus Comparison Methodology
• Percentage of enriched bi/clusters:
  Percentage of enriched biclusters (at a significance level) = (number of enriched biclusters at this level) / (total number of biclusters)
• Percentage of annotated genes per each bi/cluster:
  Study fraction of a GO term = (number of genes sharing the GO term in a bicluster x 100) / (total number of genes in this bicluster)
• The predictive power of an algorithm to recover a pattern of interest selected by the user.

27/42 BicAT-plus Features
1. Adding more algorithms to the BicAT-plus tool in order to have one software package that employs most of the commonly used biclustering algorithms.

28/42 BicAT-plus Features
2. Performing functional analysis (Gene Ontology) of bicluster genes using different GO categories:
• Biological Process
• Molecular Function
• Cellular Component

29/42 BicAT-plus Features
3. Displaying the analysis and comparing results using graphical and statistical chart visualizations in multiple modes (2D and 3D).

30/42 BicAT-plus Features
4. Comparing the different biclustering algorithms based on their respective methodologies.

31/42 BicAT Comparison Steps
Manual file: http://home.k-space.org/FADL/Downloads/BicAT_plus.zip

32/42 Results
We used the Gasch gene expression data: http://genome-www.stanford.edu/yeast_stress/
We used the default parameters that the authors recommend in their publications.
Bi/clustering algorithm    Parameter settings
ISA                        tg = 2.0, tc = 2.0, seeds = 500
CC                         δ = 0.5, α = 1.2, M = 100
OPSM                       l = 100
BiVisu                     ε = 60, Nr = 10, Nc = 5, = 25
K-means                    K = 100

33/42 Percentage of enriched bi/clusters
[Figure]

34/42 Percentage of annotated genes per each bi/cluster
[Figure]

35/42 The predictive power of an algorithm to recover a pattern of interest
• The conditions applied in the Gasch experiments included temperature shocks, hydrogen peroxide, the superoxide-generating drug menadione, the sulfhydryl-oxidizing agent diamide, the disulfide-reducing agent dithiothreitol, ...
• The user can compare biclustering algorithms by which of them recovers a defined pattern, for example which of them recovers biclusters that respond to the conditions applied in the Gasch experiments.

36/42
GO term / (number of annotated genes)                          K-means   CC   ISA   Bivisu   OPSM
GO:0006970 response to osmotic stress / (83)                   3         5    6     3        0
GO:0006979 response to oxidative stress / (79)                 2         7    11    0        0
GO:0046686 response to cadmium ion / (102)
GO:0043330 response to exogenous dsRNA / (7)
                                                               2         3    2     2        0
                                                               2         3    2     2        0
                                                               2         0    2     2        0
                                                               3         0    2     2        0
                                                               0         0    2     0        0
                                                               0         2    0     0        0
GO:0006995 cellular response to nitrogen starvation / (5)      4         4    4     0        0
GO:0042149 cellular response to glucose starvation / (5)       0         2    0     0        0
GO:0009651 response to salt stress / (15)                      2         7    0     0        0
GO:0042542 response to hydrogen peroxide / (5)                 0         0    0     2        0
GO:0000304 response to singlet oxygen / (4)                    2         0    0     0        0
GO:0046685 response to arsenic / (77)
GO:0009408 response to heat / (24)
GO:0009409 response to cold / (7)
GO:0009267 cellular response to starvation / (44)

37/42 Conclusion
http://home.k-space.org/FADL/Downloads/BicAT_plus.zip
• BicAT-plus is a flexible, open-source software tool written in Java Swing. Its well-structured design can be extended easily to employ more comparative methodologies that help biologists extract the best results of each algorithm and interpret these results into useful biological meaning.
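
The two numeric measures from the Comparison Methodology slide (percentage of enriched bi/clusters at a significance level, and percentage of annotated genes per bi/cluster) can be sketched in a few lines of Python. This is only an illustration, not BicAT-plus code: the function names, the input layout (one best enrichment p-value per bicluster, gene sets as Python sets) and the multiplication by 100 in the first measure are assumptions.

```python
def percent_enriched(best_p_values, alpha):
    """Share of bi/clusters whose best GO-term p-value is <= alpha, in percent.

    best_p_values: one p-value per bicluster (hypothetical input layout).
    """
    if not best_p_values:
        raise ValueError("need at least one bicluster")
    enriched = sum(1 for p in best_p_values if p <= alpha)
    return 100.0 * enriched / len(best_p_values)


def percent_annotated(bicluster_genes, go_term_genes):
    """Percent of a bicluster's genes that share a given GO term."""
    genes = set(bicluster_genes)
    if not genes:
        raise ValueError("bicluster has no genes")
    shared = genes & set(go_term_genes)
    return 100.0 * len(shared) / len(genes)


if __name__ == "__main__":
    # Made-up p-values for four biclusters, evaluated at alpha = 0.01.
    print(percent_enriched([0.0005, 0.02, 0.3, 0.004], alpha=0.01))
    # Made-up gene sets: two of the four bicluster genes carry the term.
    print(percent_annotated({"g1", "g2", "g3", "g4"}, {"g1", "g2", "g9"}))
```

Computing the first measure at several significance levels gives one curve per algorithm, which is presumably what the "Percentage of enriched bi/clusters" chart plots.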

38/42 BicAT-plus
[Figure] This figure is for people who want to extend BicAT-plus by adding new features (or fixing bugs).

39/42 Conclusion
• The comparison methodology used in this study confirms that bicluster and cluster algorithms can be considered integrated modules: no single algorithm can recover all the interesting patterns. What algorithm A succeeds in recovering in certain data sets, algorithm B might fail to recover, and vice versa.

40/42 Conclusion
• Using BicAT-plus, we can identify the highly enriched bi/clusters across all compared algorithms and integrate them to address the dimensionality reduction problem in constructing gene regulatory networks from gene expression data, where the number of samples is smaller than the number of genes in the microarray dataset.

41/42 Thanks

42/42 BicAT-Plus
http://home.k-space.org/FADL/Downloads/BicAT_plus.zip

43/42 Availability and Requirements
• Availability: you can free download from
• System requirements
1. Java Runtime Environment (JRE); version 6 is recommended.
2. ActivePerl version 5.10
Note: BicAT-plus has been tested on a PC with the following configuration: CPU: Pentium 4, 1.5 GHz; RAM: 2.0 GB; platform: Windows XP Professional with SP2.

44/42 Algorithms comparison
• Generally, comparing different biclustering algorithms is not straightforward, as they differ in strategy, approach, computational complexity, number of parameters, and prediction ability.
• Moreover, such methods are strongly influenced by user-selected parameter values. As a result, the quality of biclustering results is often considered more important than the required computation time.

45/42 Algorithms comparison
• Although there are some analytical comparative studies that evaluate traditional clustering algorithms (Azuaje, 2002; Datta and Datta, 2003; Yeung, et al.), no such comprehensive comparison of biclustering methods can be found in the literature so far (Prelic, et al., 2006).

46/42 Cluster/bi-cluster algorithm performance comparison: Cluster Evaluation
[Figure: Cluster 1, Cluster 2, ..., Cluster n, each a gene set g1, g2, g3, g4, g5, ...]
• Homogeneity between cluster genes
• Separation between clusters
“it is not clear how to extend notions such as homogeneity and separation (Gat-Viks et al., 2003) to the biclustering context (to our best knowledge, no general internal indices have been suggested so far for biclustering)” Prelic, et al., 2006

47/42 Cluster/bi-cluster algorithm performance comparison: Bicluster Evaluation
[Figure: bicluster 1, bicluster 2, ..., bicluster n, each a gene set g1, g2, g3, g4, g5, ...]
• Function: hypergeometric test with the Gene Ontology database
• Pathway: KEGG
• PPI: BioGRID database
• Promoter: scan motif program

48/42 Hyper Geometric Test
[Figure: Cluster 1 as the test set (X genes: g1, g2, ..., gX) drawn from the reference set (N genes: g1, g2, ..., gN).]
“When sampling X genes (test set) out of N genes (reference set), what is the probability that x or more of these genes belong to a functional category C shared by n of the N genes in the reference set?” (Maere, et al., 2005)

49/42 The Gene Ontology
• The Gene Ontology (GO) is a project that groups annotated genes (genes of known function).
• Example in S. cerevisiae
• Function name = cellular response to glucose starvation; function ID = GO:0042149; genes: g1, g2, g3, g4, g5, g6

50/42 Hyper Geometric Test: Example
Cellular response to glucose starvation, GO:0042149 (6): g1, g2, g3, g4, g5, g6
Cluster1 (10): g1, g2, g3, g4, g5, g6, g7, g8, g9, g10
2, 3, 4, 5, 6

51/42 GO enrichment programs with Hypergeometric Test
• FuncAssociate
• GeneMerge
• GoMiner
• FatiGO
• GOstat
• GO::TermFinder
• http://www.geneontology.org/GO.tools.shtml
• We used the GeneMerge program, which was developed at the University of Maryland (C. I. Castillo-Davis, 2003).

52/42 GO Analysis programs Limitations
[Figure: reference set, test set]

53/42 BicAT
Swiss Federal Institute of Technology Zurich, ETH Zentrum, 8092 Zurich, Switzerland
OPSM, CC, ISA, x-motif, BIMAX, K-means, Hierarchical

54/42 BicAT-plus
• To the best of our knowledge, no such automatic gene ontology comparison tool has been available in the literature.
• We have developed a comparative tool, which we call “Bicat-plus”, that includes the biological comparative methodology and serves as an extension to the BicAT program.

55/42 BicAT-plus
• Moreover, BicAT-plus helps researchers compare and evaluate the algorithms' results multiple times, according to the user-selected parameter values as well as the required biological perspective, on various datasets.

56/42 Gene Expression Data Analysis: Biclustering
• Recent understanding of cellular processes leads us to expect subsets of genes to be coregulated and coexpressed under certain experimental conditions, but to behave almost independently under other conditions. A. Prelic, 2006, Bioinformatics
• A bicluster is a group of genes that show a similar expression profile under certain conditions.

57/42 Cluster/bi-cluster algorithm performance comparison: Bicluster Evaluation
[Figure: bicluster 1, bicluster 2, ..., bicluster n, each a gene set g1, g2, g3, g4, g5, ...]
• Function: hypergeometric test with the Gene Ontology database
• Pathway: KEGG
• PPI: BioGRID database
• Promoter: scan motif program
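
The question posed on the Hyper Geometric Test slide (the probability that x or more of X sampled genes fall in a category shared by n of the N reference genes) is the upper tail of the hypergeometric distribution. The sketch below computes it directly from binomial coefficients. It is a minimal illustration and not the GeneMerge implementation; the function name is made up, and the reference-set size N = 6000 in the usage example is an assumed value, since the slides do not state it.

```python
from math import comb


def hypergeom_enrichment_p(N, n, X, x):
    """P(at least x of the X test-set genes belong to category C).

    N: genes in the reference set
    n: reference genes annotated with category C
    X: genes in the test set (the bi/cluster)
    x: test-set genes annotated with category C
    """
    if not (0 <= n <= N and 0 <= X <= N):
        raise ValueError("require 0 <= n <= N and 0 <= X <= N")
    total = comb(N, X)  # number of ways to draw the test set
    upper = min(n, X)   # the overlap cannot exceed either set
    # Sum the probabilities of overlaps x, x+1, ..., upper.
    favourable = sum(
        comb(n, i) * comb(N - n, X - i) for i in range(max(x, 0), upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Slide example: a cluster of 10 genes and a GO term with 6 annotated genes.
    # N = 6000 is an assumption for illustration only.
    for overlap in (2, 3, 4, 5, 6):
        print(overlap, hypergeom_enrichment_p(N=6000, n=6, X=10, x=overlap))
```

A bi/cluster would then be called enriched for a GO term when this p-value falls below the chosen significance level; correcting for the many GO terms tested is a separate step that this sketch leaves out.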