Heterogeneous Association Rules Mining
Badr Al-Daihani
School of Computer Science, Cardiff University
Edinburgh, UK BNCOD21

Overview
• Motivation
• Challenges of Bioinformatics Databases Management
• Approaches to integration of bioinformatics databases
• Association rule mining
• Hypothesis
• Basic concepts
• Material and methods

Motivation
• Very large heterogeneous databases.
• Need to link them.
• Integration.
• Complex relations.

Challenges of Bioinformatics Databases Management
Bioinformatics database formats:
• Flat files: GenBank, EMBL, DDBJ, PDB.
• Relational databases: HGMD, MGMD.
• Object-oriented database: AceDB.
• XML databases: PIR, SwissProt, InterPro.
Characteristics:
• The diversity/variety of data.
• The representational heterogeneity.
• Autonomous and web-based sources.
• Varied interfaces and query capabilities.

Approaches to integration of bioinformatics databases
Multiple models of data integration:
• Federation
• Warehousing
• Mediation

Federation
• Provides access to distributed data while preserving database autonomy.
• Examples: K2/BioKleisli, Entrez.

Warehousing
• Imports data from remote sources and copies it to a local server.
• Examples: GUS (Genome Unified Schema), Sequence Retrieval System (SRS).

Mediation
• Stores no data of its own.
• Instead, it provides a virtual view of the integrated sources.
• Examples: Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS), Knowledge-based Integration of Neuroscience Data (KIND).

Hypothesis
It is possible to mine diverse databases to recover datasets relating a disease to its associated gene mutations and mutagens, which aids scientists' understanding of the disease's cause.

Association Rules
• Association rules: interesting association relationships among huge numbers of transactions.
• An association rule is an expression of the form X => Y, where X and Y are sets of items.
• Goal: to find all association rules that satisfy user-specified minimum support and minimum confidence thresholds.
• Examples:
  - Rule form: "Body => Head [support, confidence]".
  - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]

Association Rules: Applications
• Basket data analysis
• Genomic data
• Cross-marketing
• Catalog design
• Sales campaign analysis
• Web personalization
• Clustering, classification, etc.

Basic Concepts
• The discovery of interesting association relationships among large numbers of gene mutations can help in determining the cause of mutation in tumours and diseases.
• A gene is a segment of a DNA molecule that contains all the information required for synthesis of a product.
• A gene mutation is any change in the DNA sequence of a gene.
• Types of mutations: Insertion, Deletion, Insertion/Deletion, Complex, and Multiple Substitution.

Material and Methods: HGMD database
• The Human Gene Mutation Database (HGMD) is run by the University of Wales College of Medicine.
• Holds known (published) gene lesions responsible for human inherited disease.
• Provides information about practical diagnoses.

Material and Methods: MGMD database
• The Mammalian Gene Mutation Database (MGMD) is run by the Centre of Molecular Genetics and Toxicology, University of Wales Swansea.
• Holds profiles of known (published) mutagen-induced gene mutations.
• Stores the mutation spectra information.
• It has 39134 records.

Material and Methods
Sets of items whose elements tend to occur in both databases will be retrieved in order to discover interesting association rules among genes, mutations, mutagens and diseases.
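
As a rough illustration of the idea above, the following Python sketch joins records from two sources on shared fields and scores a rule X => Y by support and confidence. It is only a minimal sketch under stated assumptions: the record layout, the join keys (gene, mutation), the helper names and every value in the toy data are hypothetical placeholders, not the actual HGMD or MGMD schemas or contents, and not the author's implementation.

```python
# Toy records. All values are invented placeholders, NOT real HGMD/MGMD data.
hgmd_like = [
    {"gene": "GENE_A", "mutation": "deletion", "disease": "disease_1"},
    {"gene": "GENE_A", "mutation": "substitution", "disease": "disease_1"},
    {"gene": "GENE_B", "mutation": "insertion", "disease": "disease_2"},
    {"gene": "GENE_C", "mutation": "deletion", "disease": "disease_3"},
]
mgmd_like = [
    {"gene": "GENE_A", "mutation": "deletion", "mutagen": "mutagen_x"},
    {"gene": "GENE_A", "mutation": "substitution", "mutagen": "mutagen_y"},
    {"gene": "GENE_B", "mutation": "insertion", "mutagen": "mutagen_x"},
    {"gene": "GENE_D", "mutation": "deletion", "mutagen": "mutagen_x"},
]


def build_transactions(left, right, keys=("gene", "mutation")):
    """Join two record lists on `keys` (an assumed join condition).

    Each matching pair becomes one transaction: a set of 'field=value' items.
    """
    index = {}
    for rec in right:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    transactions = []
    for rec in left:
        for match in index.get(tuple(rec[k] for k in keys), []):
            merged = {**rec, **match}
            transactions.append(frozenset(f"{k}={v}" for k, v in merged.items()))
    return transactions


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions) if transactions else 0.0


def confidence(body, head, transactions):
    """Fraction of transactions containing `body` that also contain `head`."""
    body_count = count(body, transactions)
    if body_count == 0:
        return 0.0
    return count(body | head, transactions) / body_count


transactions = build_transactions(hgmd_like, mgmd_like)

# Score the rule {gene=GENE_A} => {disease=disease_1} on the joined data.
body, head = {"gene=GENE_A"}, {"disease=disease_1"}
rule_support = support(body | head, transactions)
rule_confidence = confidence(body, head, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule would be kept only if both values reach the user-specified minimum support and minimum confidence. A full miner would enumerate candidate item sets (for example with the Apriori algorithm of Agrawal et al.) rather than score one hand-picked rule.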

Material and Methods
[Architecture diagram; components as labelled on the slide]
• Graphical User Interface (GUI)
• Mining tools
• Query interpreter
• Wrappers: Wrapper HGMD, Wrapper MGMD, Wrapper DBn

References
[1] Hernandez T. and Kambhampati S. (2004) Integration of Biological Sources: Current Systems and Challenges Ahead. Proc. of the ACM SIGMOD Conference.
[2] Goble C. et al. (2001) Transparent access to multiple bioinformatics information sources. IBM Systems Journal, 40(2).
[3] Eckman B., Lacroix Z. and Raschid L. (2001) Optimized Seamless Integration of Biomolecular Data. IEEE International Conference on Bioinformatics and Biomedical Engineering, 23-32.
[4] Lacroix Z., Boucelma O. and Essid M. (2003) The Biological Integration System. Proc. of the 5th ACM Workshop on Web Information and Data Management, pp. 45-49.
[5] Aldana J., Roldán M., Navas I., Pérez A. and Trelles O. (2004) Integrating Biological Data Sources and Data Analysis Tools through Mediators. Proceedings of the 2004 ACM Symposium on Applied Computing.
[6] Agrawal R., Imielinski T. and Swami A. (1993) Mining Association Rules Between Sets of Items in Large Databases. Proc. ACM SIGMOD: 207-216.
[7] Lewis P.D., Harvey J.S., Waters E.M. and Parry J.M. (2000) The Mammalian Gene Mutation Database. Mutagenesis, 15(5): 411-414.
[8] Krawczak M., Ball E.V., Fenton I., Stenson P.D., Abeysinghe S., Thomas N. and Cooper D.N. (2000) Human Gene Mutation Database - a biomedical information and research resource. Human Mutation, 15(1): 45-51.