Bioinformatics & LIS
A brief talk for librarians, information scientists, and computer scientists about resources and collaborative opportunities with biology.
April 18, 2006
G. Benoit

Outline of the talk
• Bioinformatics defined
• Generation of data
• Tools and databases
• Activities for Librarianship, Computer and Information Science
• Examples:
  – Entrez, NCBI, Visualization
• Collaborations

Bioinformatics defined
• Over 70 definitions
• Differences arise from the work
• National Center for Biotechnology Information (NCBI) definition:
  – The development of new algorithms and statistics with which to assess relationships among members of large data sets;
  – The analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and
  – The development and implementation of tools that enable efficient access and management of different types of information.

Without getting into the science…
• How the data started …
• Four chemical bases: purines [adenine (A), guanine (G)] and pyrimidines [cytosine (C), thymine (T)]
• Their precise order and linking (each base attached to a sugar molecule and to a phosphate molecule to create a nucleotide) … DNA
• A pairs with T, and G with C, to make unique and very long strings, called sequences
• E.g., AATGACCAT codes for a different gene than GGGCCATAG would
• Transcription: RNA consists of A, G, C, and uracil (U) and has ribose instead of deoxyribose
• Point is one can predict missing data, sometimes…

In short…
The nucleotides are linked in a certain order or sequence through the phosphate group; their precise order and linking within the DNA determine what proteins the gene produces and the phenotype of the organism.

Generation of Data
• Raw data from sequencing
• Expression data
• Data generated by linking other raw data in very large, multidimensional databases (e.g., OMIM)
• Research literature (full-text journals)
• Data models to describe the literature for retrieval, linking to other data, and linking to the raw data
• New data models to support greater flexibility in describing & manipulating data …

Generation of Data
• To support integrated search and retrieval
• To focus on single organisms or find similarities across them
• To feed other technology
• Visualization of natural phenomena and of abstract phenomena

Tools & Databases
• A host of tools for database searching…
  – BLAST (basic local alignment search tool)
  – FASTA (sequence strings)
  – ChopUp (protein analysis)
  – Integrated packages (Lasergene Sequence Analysis Software)
  – The many services offered through NCBI and NLM
• Take a look at handout, Table 1, publicly accessible databases

Data Categories
• Monographs, Journals, Announcements (text)
• Datasets:
  – Bibliographic (http://www.expasy.org/links.html)
  – Taxonomic
  – Nucleic acid
  – Genomic (e.g., GDB, OMIM)
  – Protein DB (SwissProt, TrEMBL)
  – Protein families, domains, and functional sites
  – Proteomics initiative
  – Enzyme/metabolic pathways

Sequence Retrieval System (SRS) and NCBI Data Model
• Take a look at handout, Table 2, publicly accessible databases defined, and then
• Entrez sample, Table 3

Entrez example
• Notice the familiar access points (author, journal, title) as well as domain-specific ones (exon, gene, organism)
• Notice, too, the DNA …

NCBI Homepage
• http://www.ncbi.nih.gov/
• Notice the variety of tools (left menu)
• Site map: http://www.ncbi.nih.gov/Sitemap/index.html
• Alpha list: http://www.ncbi.nih.gov/Sitemap/AlphaList.html

Linking across resources
• http://www.ncbi.nlm.nih.gov/entrez/query/static/linking.html
• NCBI's structure database is called the Molecular Modeling Database (MMDB); it is a subset of the 3D structures (non-theoretical models only) obtained from the Protein Data Bank (PDB).
• Data are obtained from X-ray crystallography and NMR spectroscopy.
• Goal is to make it easier to compare structures.
• Searching: a variety of access points: author, title, text terms, a PDB 4-character code, or a numerical MMDB-id
• MMDB data: PDB records are parsed to extract sequences, citations, and structural information, then converted to ASN.1.
• Taxonomy: used to help end users see term relationships and databases, along with literature references. Examples:
  – http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html/
  – http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&name=Escherichia+coli&lvl=0&srchmode=1

Linking across resources
• XML - there are hundreds of XML schemas used in biology
• Calls for mapping to ASN.1 records [see NCBI example]
• Calls for mapping across schemas
• Calls for exporting data for different devices…

Visualization
• Cn3D - uses MMDB (Entrez's structure database)
  – http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml
• RasMol http://www.umass.edu/microbio/rasmol/
• Protein Explorer http://www.umass.edu/microbio/rasmol/rotating.htm
• OpenRasMol http://www.openrasmol.org/
• MolviZ.org http://www.umass.edu/microbio/chime
• World Index of Molecular Visualization http://molvis.sdsc.edu/visres/index.html

Recap main points
• Very large data sets "homogenized" through ASN.1
• Goal to integrate (text-text, visualization-text, text-visualization)
• Raw data + research literature + visualization
• Biologists provide domain knowledge
• XML is a big player
• CS and IS provide technology
• Librarians provide maintenance and access to resources

Collaborative Opportunities
• For LIS and CS:
  – Domain analysis: information use, communication, theories of information
  – Systems analysis and design
  – Data modeling
  – Classification
  – Storage and retrieval
  – HCI
• All mapped onto a generalized model of a molecular biology experimental cycle [Denn & MacMullen, 2002, p. 556]

Collaborative Opportunities
• "Insertion Points" - development of new tools and methods for managing, integrating, and visualizing data
• For local use: download selected data sets for local needs (Stapley & Benoit, 2000)
• XML Transformations
• XML - SVG - X3D
• Automated retrieval
• Clustering (data- and text-mining)

Collaborative Opportunities
• Biologists' needs:
  – To go beyond mining of genomic data to investigate causal entailments in inter- and intracellular dynamics
• LIS's response:
  – To aid understanding of the scientific processes through visualization of literature, metadata, and graphic representations in general and for disease-specific analysis

Back to you…
• Thanks …
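
Supplementary sketch for the "Without getting into the science…" slide
For the computer scientists in the audience, the base-pairing rule (A with T, G with C) and the RNA alphabet (U in place of T) can be illustrated with plain strings. This is a minimal sketch rather than a tool from the talk: the function names are made up for illustration, and it assumes input containing only the letters A, C, G, and T. It also shows why missing data can sometimes be predicted: given one strand, the pairing rule determines the other.

```python
# Pairing rule from the slide: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(dna: str) -> str:
    """Return the base-by-base complement of a DNA string (A/C/G/T only)."""
    return "".join(DNA_COMPLEMENT[base] for base in dna.upper())


def reverse_complement(dna: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(dna: str) -> str:
    """Spell the same strand in the RNA alphabet: thymine (T) becomes uracil (U)."""
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    # The two example sequences from the slide.
    for seq in ("AATGACCAT", "GGGCCATAG"):
        print(seq, complement(seq), reverse_complement(seq), transcribe(seq))
```

Applying complement twice returns the original string, which is the sense in which one strand "predicts" the other.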