Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
7th December 2010 A Visual Analytics Approach to Augmenting Formal Concepts with Relational Background Knowledge in a Biological Domain Elma Akand*, Mike Bain, Mark Temple *CSE, UNSW/School of Biomedical and Health Sciences,UWS The Sixth Australasian Ontology Workshop, Adelaide University of South Australia 1 Outline Machine learning and data mining in bioinformatics Domain Ontologies in biomedical applications Formal Concept Analysis MCW algorithm (Mining Closed itemsets for Web apps) BioLattice – a web based browser Experimental Application: systems biology Part-1: Concept ranking by gene interaction Part-2: Relational learning of multiple-stress rules Machine learning & Data mining in Bioinformatics Bioinformatics “Bioinformatics is the study of information content and information flow in biological systems and processes” (Michael Liebman,1995) Machine Learning & Data mining -Can offer automatic knowledge acquisition -Process to discover knowledge by analyzing data from different perspectives and can contribute greatly in building knowledge base Our work: focus on knowledge-based machine learning - Previous work: learning from ontologies - Current work: ontology construction by learning - Potential application areas: ontologies – central to eCommerce, eHealth - Current application area: systems biology – predict gene function, data integration Ontology In philosophy - concerned with nature and relations of being In knowledge representation - study of categorization of things: Ontology – "specification of a conceptualization” (Gruber, 1993) Conceptualization – "formalization of knowledge in declarative form” (Genesereth and Nilsson, 1987) Ontology Informal Ontology Upper Ontology Natural language General Formal Ontology Domain Ontology First order logic or a variant Specific Gene Ontology a Missing concepts and relations One gene annotated with different GO terms with a term specialization of other gene: x concepts : a ,b relations : (i) x a (ii) x b and (iii) b a b x x y Formal Concept Analysis (FCA) Mathematical order theory (Rudolf Wille in the early 80s) -Derives conceptual structures out of data -Method for data analysis, knowledge representation and information management Components -Formal context, concept , concept lattice fourlegged haircovered cats x x dogs x x dolphins gibbons x intelligent marine x x thumbed x x humans x x whales x x Formal concepts in a concept lattice Top ({cats, gibbons, dogs, dolphins, humans, whales}, {}) ({cats, gibbons, dogs}, {hair-covered}) ({gibbons, dolphins, humans, whales}, {intelligent}) 5 6 ({dolphins, whales}, {intelligent, marine}) ({cats, dogs}, {haircovered, four-legged}) 3 2 Formal context: an n by m Boolean matrix m attributes n objects A O columns rows 1 ({gibbons, humans}, {intelligent, thumbed}) ({gibbons}, {intelligent, hair-covered, thumbed}) Formal concept: Galois connection <X, Y> X is a subset of A, Y is a subset of O Concept lattice loosely interpretable in ontology terms: concept definitions and sub-concept relations cf. T-box concept membership by objects cf. A-box Bottom ({}, {intelligent, hair-covered, thumbed, marine, four-legged}) 4 FCA in data mining FCA can be seen as a clustering technique in machine learning -Most of the work is in a propositional framework In data mining closed itemset mining is an efficient alternative to FCA A frequent itemset X is closed if there exists no proper superset Y such that Y⊃X with support(Y)=support(X) E.g., if X = {a,b,c,d} and Y ={a,b,c,d,e} and support(Y)=support(X), then X is not closed Parameters to avoid building entire lattice -Extent size must be greater than minsup Existing closed itemset mining algorithms -Data structures to speed up closed itemset mining -But may not build lattice, or include extents MCW algorithm (Mining Closed itemsets for Web apps) Vertical data format IT-tree (itemset-tidset tree) search space -node has X x t(X) and all children have prefix X Pruning - 4 set difference closure operators Subsumption check - A look-up table to record all attributes and their occurrences in closed concepts Is {TA}{135} closed? i(135)={TAWC} Lattice - adding concepts following a general to specific order D T A W C 2 1 1 1 1 4 3 3 2 2 5 5 4 3 3 6 6 5 4 4 5 5 6 attribute Concept_id D C1,C2 T C3,C4 A C4,C5 W C2,C4,C5,C6 C C1,C2,C3,C4, C5,C6,C7 Closure operators Based on CHARM (Zaki, 2005) D T A W C 2 1 1 1 1 4 3 3 2 2 5 5 4 3 3 6 6 5 4 4 5 5 6 {TA}{135}={TW}{135} ->{TAW}{135} {D}{2456}⊂{C}{123456}->{DC}{2456} {D}{2456} and {W}{12345}->{DW}{245} Concept lattice as a visual analytics approach Visual analytics -combination of information visualization with machine learning and data analysis (Keim et al., 2008) Visualization of concept lattice - provides overview of the structure of the domain - means for further data analysis, e.g., classification, clustering, implication discovery, rule learning Previous work - lattice navigation since Godin et al. (1993) -Browsable concept lattice, e.g., Kim & Compton (2004) Our current work - on augmenting concept lattice by integrating multiple sources of knowledge (Gene Ontology, protein interactions) for further analysis & machine learning Case study: Yeast systems biology Browsable concept lattice more general Biological validation (1) : synthetic lethality Synthetic lethal interaction if cell is viable when either gene A or B are individually deleted, but cannot grow when both are deleted. Our results show that 72 (119) concepts in the lattice more likely than random chance at p < 0.01 (p < 0.05) to contain synthetic lethal pairs. Biological validation (2) : ILP learning of concept definitions Biochemical pathway data Protein-protein interaction data Ontology data Inductive Logic Programming Microarray geneexpression data First-order rule concept(A):- ppi(B,A,C), ppi(B,A,E), ppi(B,C,E) tfbinds(D,C),fbinds(F,E) Transcription factor binding data (ChIP-chip) Example rule: RSM19 required for H2O2 response; RSM19, RSM22 and MRPS17 in “mitochondrial ribosomal small subunit” stable complex; and RSM22, MRPS17 bound by transcription factors under amino acid starvation. Transcription factors Conclusions Many real-world domains are data-intensive Machine learning and data mining applications required to generate predictive and useful outputs We focus on knowledge-based learning for comprehensibility – use ontologies Formal concept analysis as a framework for ontology structure Use data mining techniques for efficient concept lattice generation Visual analytics approach: browsable lattice, added background knowledge Initial validation on a case study from yeast systems biology Future work Investigate pseudo-intents to simplify concept lattice Investigate variants of concept lattice structures -e.g., concept lattice of inverse context Add concept definitions to background knowledge in ILP