A Short Introduction to Analyzing Biological Data Using Relational Databases
Part I: Introduction to Using the Relational Model of Data in Biology

Alex Ropelewski [email protected]
Pittsburgh Supercomputing Center
National Resource for Biomedical Supercomputing

Bienvenido Vélez [email protected]
University of Puerto Rico at Mayagüez
Department of Electrical and Computer Engineering

Learning Objectives
• Awareness of the diverse nature of data managed by biologists
• Understanding how to model biological data and their relationships using tables
• Minimizing data redundancy through normalization

Bioinformatics
The interdisciplinary science of using computational approaches to analyze, classify, collect, represent and store biological data with the goal of accelerating and enhancing the understanding of DNA, RNA and protein molecules.

[Figure: multiple sequence alignment of five proteins; the number at the end of each row is the position of the last residue shown]

CATD_H      ------MQPSSLLPLALCLLAAPASALVRIPLHKFTSIRRTMSEVG--------------GSVEDLIAKGPV   52
CHYM_B      --------MRCLVVLLAVFALSQGTEITRIPLYKGKSLRKALKEHG---------------LLEDFLQKQQY   49
CARP_YEAST  --------MFSLKALLPLALLLVSANQVAAKVHKAKIYKHELSDEMKEVTFEQHLAHLGQKYLTQFEKANPE   64
CARP_RHICH  MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQ-------------KAIAKYNKHKIN   59
PEPA_ASPAW  -------MVVFSKTAALVLGLSSAVSAAPAPTRKGFTINQIARPANKTRTIN-------LPGMYARSLAKFG   58

CATD_H      SKYSQAVPAVTEGPIPEVLKNYMDAQYYGEIGIGTPPQCFTVVFDTGSSNLWVPSIHCKLLDIACWIHHKYN  124
CHYM_B      GISSKYSGFGEVASVP--LTNYLDSQYFGKIYLGTPPQEFTVLFDTGSSDFWVPSIYCKSN--ACKNHQRFD  117
CARP_YEAST  VVFSREHPFFTEGGHDVPLTNYLNAQYYTDITLGTPPQNFKVILDTGSSNLWVPSNECGSL--ACFLHSKYD  134
CARP_RHICH  TSTGGIVPDAGVGTVP-MTDYGNDVEYYGQVTIGTPGKKFNLDFDTGSSDLWIASTLCTNCG---SRQTKYD  127
PEPA_ASPAW  GTVPQSVKEAASKGSAVTTPQNNDEEYLTPVTVGKS--TLHLDFDTGSADLWVFSDELPSSE---QTGHDLY  125

CATD_H      SDKSSTYVKNGTSFDIHYGSGS-LSGYLSQDTVSVPCQSASSASALGGVKVERQVFGEATKQPGITFIAAKF  195
CHYM_B      PRKSSTFQNLGKPLSIHYGTGS-MQGILGYDTVTVSN-----------IVDIQQTVGLSTQEPGDVFTYAEF  177
CARP_YEAST  HEASSSYKANGTEFAIQYGTGS-LEGYISQDTLSIGD-----------LTIPKQDFAEATSEPGLTFAFGKF  194
CARP_RHICH  PKQSSTYQADGRTWSISYGDGSSASGILAKDNVNLGG------------LLIKGQTIELAKREAASFANGPN  187
PEPA_ASPAW  TPSSSATKLSGYTWDISYGDGSSASGDVYRDTVTVGG-----------VTTNKQAVEAASKISSEFVQNTAN  186

Data Management for Bioinformatics

How to Represent a Molecule?
C43H66N12O12S2

[Figure: oxytocin, with residues labeled by one-letter amino acid codes C, Y, I, Q, N, C, P, L, G. Image from Wikipedia Commons: http://en.wikipedia.org/wiki/File:Oxytocin.jpg]

MRLLVLAALLTVGAGQAGLNSRALWQFNGM
IKCKIPSSEPLLDFNNYGCYCGLGGSGTPV
DDLDRCCQTHDNCYKQAKKLDSCKVLVDNP
YTNNYSYSCSNNEITCSSENNACEAFICNC
DRNAAICFSKVPYNKEHKNLDKKNC

Storing Biological Data
• Data organization from a biologist's perspective:
  – Sequence (amino acids represented as letters)
  – Structure
  – Family/Domain
  – Species
  – Taxonomy
  – Function/Pathway
  – Disease/Variation
  – Publication
  – Journal
  – Many others
• How will the data be used?

Storing Biological Data
• Data organization from a computer-science perspective:
  – In a flat text file
  – In a spreadsheet
  – In an image
  – In a video animation
  – In a relational database
  – In a networked (hyperlinked) model
  – In any combination of the above
  – Others
• How will the data be used?

Retrieving Biological Data
• Reference:
  – Find something that I have seen before
  – Examples:
    • Find out who discovered a DNA sequence or protein
    • Find some characteristic of a known sequence or protein
• Discovery:
  – Find something new. Infer new knowledge.
  – Examples:
    • Find new sequences that evolved from a known common ancestor
    • Find sequences that may have similar function in other organisms

Finding Reference Information
• Reference information searches can be accomplished:
  – By key
    • Find a DNA sequence by its accession number
  – By attribute (exact)
    • Find sequences belonging to C. elegans
  – By attribute (inexact)
    • Find proteins known to be related to some type of cancer

Motivation: Storing Experimental Results
• Recent phenomenon from the biological experimenter's perspective
  – Too many results to keep track of by hand
  – Need to summarize/aggregate data in order to visualize and extract valuable information
  – Need to repeat the same discovery searches to better mine the data
    • With different parameters
    • Over time to pick up database changes (more sequences, better annotation)

Structured Databases
• All information organized in the same way (Data Model)
• Language available to
  – describe (create) the database
  – insert data
  – manipulate data
  – update
• Language establishes an abstract data model: Data Independence
• Programs using the language can work across systems
• Facilitates communication and data sharing

Structured Databases
• Examples
  – Hierarchical Databases
  – Networked Databases
  – XML Databases
  – Relational Databases

Relational Databases
• Model originally described by Edgar F. Codd in the early seventies.
• Defined using relational algebra, an offshoot of the algebra of sets and first-order logic.
• Model implemented in many products:
  – Commercial: Oracle, Microsoft SQL Server, IBM DB2
  – Open Source/Free: MySQL, PostgreSQL, SQLite

Relations
• Set of tuples that have the same attributes
  – base relations = stored data
  – derived relations = computed by applying relational operators

[Figure: a table with labels marking the relation, a tuple (row), and an attribute (column)]

Example Relational Database Design
Store results from multiple sequence database BLAST searches

Accession  Description                  Species                          Matrix    eValue     Date
P14555     Group IIA Phospholipase A2   Human                            Pam70     4.18 E-32  7/21/07
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper          Pam70     2.68 E-52  7/21/07
P14555     Group IIA Phospholipase A2   Human                            Blosum80  3.47 E-33  7/20/07
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper          Blosum80  1.20 E-54  7/20/07
P00623     Phospholipase A2             Eastern Diamondback Rattlesnake  Blosum80  1.21 E-08  7/20/07

An Improved Relational Database Design
Reduce redundancy through normalization

Sequences
Accession  Description                  Species
P14555     Group IIA Phospholipase A2   Human
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper
P00623     Phospholipase A2             Eastern Diamondback Rattlesnake

Matches (Acc# is a foreign key referring to Accession in Sequences)
Acc#    Date     Matrix    eValue
P14555  7/21/07  Pam70     4.18 E-32
P81479  7/21/07  Pam70     2.68 E-52
P14555  7/20/07  Blosum80  3.47 E-33
P81479  7/20/07  Blosum80  1.20 E-54
P00623  7/20/07  Blosum80  1.21 E-08

Still redundant: the same Date and Matrix pair is repeated in several rows of Matches.

An Improved Relational Database Design

Sequences
Accession  Description                  Species
P14555     Group IIA Phospholipase A2   Human
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper
P00623     Phospholipase A2             Eastern Diamondback Rattlesnake

Matches
Accession  RunNum  eValue
P14555     1       4.18 E-32
P81479     1       2.68 E-52
P14555     2       3.47 E-33
P81479     2       1.20 E-54
P00623     2       1.21 E-08

Runs
RunNum  Date     Matrix
1       7/21/07  Pam70
2       7/20/07  Blosum80

Key Concepts
• Biologists need to manage information of a large and diverse nature
• Computer scientists have developed a variety of data models for storing information
• Relational databases are ideal for representing large collections of data of a regular tabular nature
• A relational database is a collection of tables
• Tables use rows to represent objects and columns to represent their attributes
• Tables should be optimized through normalization to avoid redundancy
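
Worked Example: The Normalized Design in SQLite
The normalized design (Sequences, Matches, Runs) can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions, not the authors' implementation: the column types, the in-memory database, and the helper names build_database and flat_view are hypothetical choices, and run numbers are assigned so that joining the three base relations reproduces the single flat table of BLAST results. The join is an example of a derived relation computed from base relations.

```python
import sqlite3

# Assumed schema: table and column names follow the slides, but the
# column types and constraints are illustrative choices.
SCHEMA = """
CREATE TABLE Sequences (
    Accession   TEXT PRIMARY KEY,
    Description TEXT NOT NULL,
    Species     TEXT NOT NULL
);
CREATE TABLE Runs (
    RunNum INTEGER PRIMARY KEY,
    Date   TEXT NOT NULL,
    Matrix TEXT NOT NULL
);
CREATE TABLE Matches (
    Accession TEXT    NOT NULL REFERENCES Sequences(Accession),
    RunNum    INTEGER NOT NULL REFERENCES Runs(RunNum),
    eValue    REAL    NOT NULL,
    PRIMARY KEY (Accession, RunNum)
);
"""


def build_database():
    """Create an in-memory database and load the example rows."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO Sequences VALUES (?, ?, ?)",
        [
            ("P14555", "Group IIA Phospholipase A2", "Human"),
            ("P81479", "Phospholipase A2 isozyme IV", "Indian Green Tree Viper"),
            ("P00623", "Phospholipase A2", "Eastern Diamondback Rattlesnake"),
        ],
    )
    # Each search run stores its date and scoring matrix exactly once.
    conn.executemany(
        "INSERT INTO Runs VALUES (?, ?, ?)",
        [(1, "7/21/07", "Pam70"), (2, "7/20/07", "Blosum80")],
    )
    conn.executemany(
        "INSERT INTO Matches VALUES (?, ?, ?)",
        [
            ("P14555", 1, 4.18e-32),
            ("P81479", 1, 2.68e-52),
            ("P14555", 2, 3.47e-33),
            ("P81479", 2, 1.20e-54),
            ("P00623", 2, 1.21e-08),
        ],
    )
    conn.commit()
    return conn


def flat_view(conn):
    """Derived relation: join the base relations back into one flat table."""
    query = """
        SELECT s.Accession, s.Description, s.Species,
               r.Matrix, m.eValue, r.Date
        FROM Matches AS m
        JOIN Sequences AS s ON s.Accession = m.Accession
        JOIN Runs AS r ON r.RunNum = m.RunNum
        ORDER BY m.RunNum, s.Accession
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in flat_view(build_database()):
        print(row)
```

Because each description, species, date, and matrix is stored once, correcting a value such as a matrix name means updating a single row in Runs rather than every matching row of a flat table. The foreign keys also reject a match that refers to a sequence or run that does not exist.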