
INTRODUCING BIOINFORMATICS DATABASES
Tan Tin Wee / Victor Tong / Susan Moore, Dept of Biochemistry, NUS
Mohammad Asif Khan, Perdana University Graduate School of Medicine

SOURCES OF BIOLOGICAL KNOWLEDGE
- Past: textbooks, monographs, books, journals.
- Today: online accessible databases, keyword searchable (e.g. Google).
- Every class of biological molecule has at least a few databases associated with it.
- Every area of biology, biotechnology, medicine and life science research will have some kind of database associated with it.
- You must be aware of, and familiar with, the MAJOR databases.
- You must be able to discover NEW databases and master them as and when they appear.

BIOLOGICAL KNOWLEDGE TODAY!
- STORED digitally: almost all critical biological data, information and knowledge is currently stored in computers.
- ACCESSIBLE globally: all current critical biological knowledge is publicly accessible via the Internet.
- SHARED extensively: most research data is exchanged via the Internet today; if not publicly and freely, then among international collaborators.
- PUBLISHED online: most scientific journals are now published with a digital version accessible online, either free (open access) or for a subscription fee paid by the individual or the institution.
10 years ago, this was not so. There has been tremendous change.

UNSTOPPABLE DATA GROWTH
[Figure: Growth of GenBank DNA sequence (2005 – 2009). >100,000,000 sequences; exponential increase; next-generation sequencing technologies. http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html]
[Figure: Growth of PDB protein and macromolecular structures, driven by various structural genomics initiatives such as the Protein Structure Initiative (http://www.nigms.nih.gov/Initiatives/PSI) and JCSG (http://www.jcsg.org/). http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100]

RELENTLESS INCREASE IN DATABASES
Michael Y. Galperin and Guy R. Cochrane (2009) Nucl. Acids Res. 37:D1-D4.
Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009 (doi:10.1093/nar/gkn942)
http://nar.oxfordjournals.org/cgi/content/full/37/suppl_1/D1

A lot of data. A lot of databases. What do they mean?
- Most of the data begins to make sense once it is integrated.
- But many plans to integrate these databases have failed.

BIOLOGICAL DATABASES – EXAMPLES AND GENERAL CONSIDERATIONS
- Biological databases: what they are; purpose
- Some general considerations
- Sample databases

BIOLOGICAL DATABASES
Many (but not all) definitions of “database” include:
- Storage of data on a computer in an organized way
- Provision for searching and data extraction
By these definitions web pages, books, journal articles, text files, and spreadsheet files cannot be considered databases.

Purposes of biological databases:
1. To disseminate biological data and information
2. To provide biological data in computer-readable form
3. To allow analysis of biological data

BUT FIRST… A FEW TERMS
- Database Record: “A collection of related data, arranged in fields and treated as a unit. The data for each [item] in a database make up a record.”
- Field: “the part of a record reserved for a particular type of data…”
Sources: www.d.umn.edu/lib/reference/skills/vocab.html; www.amberton.edu/VL_terms.htm

Example from the “Grocery Shopping Database”: a different view of the first “record”, showing each field and its value.
Field    Value
Date     18/08/2006
Item     White bread
Store    Dover Provision
Price    $1.29

SOME FEATURES OF BIOLOGICAL DATABASES
Data/information…
- Stored in records according to some predetermined structure/format
- +/- evidence
- +/- unique identifiers
- +/- additional annotation
- +/- DB Xrefs (cross references)

AUTHORITATIVE AND RELIABLE
- Most biological databases are from authoritative and reliable sources, however…
- Not all websites and databases are reliable.
- Not all data and information stored in authoritative and reliable websites or databases are accurate, correct, or up to date.
- Nevertheless, most of them are useful and instructive.
- Many of them contain valuable information and knowledge.
- Identification of authority and evaluation of reliability are very important.
- Every serious scientist must be critical of the information they read, whether online or not.

DISCOVERABILITY
- Most publications, books and courses include online references, i.e. a Web address (URL), e.g. http://www.pdb.org/ for protein structural data.
- Most useful resources are also listed and taught in courses, or spread by word of mouth.
- Most databases are searchable by appropriate keywords, and their authority can be judged from their web addresses, the institutions behind the databases, or the authors’ reputation.
- Most databases have full details of their content and how to use them.

NAR DATABASE CATEGORIES LIST
From: http://nar.oxfordjournals.org

TABLE OF NAR DATABASES ISSUE
http://en.wikipedia.org/wiki/Biological_database
http://www.oxfordjournals.org/nar/database/c/
- Nucleotide Sequence Databases
- RNA sequence databases
- Protein sequence databases
- Structure Databases
- Genomics Databases (non-vertebrate)
- Metabolic and Signaling Pathways
- Human and other Vertebrate Genomes
- Human Genes and Diseases
- Microarray Data and other Gene Expression Databases
- Proteomics Resources
- Other Molecular Biology Databases
- Organelle databases
- Plant databases
- Immunological databases
- Bibliographic databases

DATABASE OF BIOLOGICAL DATABASES
- Alphabetical order: http://www.oxfordjournals.org/nar/database/a/
- By category: http://www3.oup.co.uk/nar/database/cap/

INFORMATION FLOW IN BIOLOGY
- Human Genome Project: DNA sequence
- Microarray: RNA expression and levels
- Proteomics: protein expression and concentration in cells
- Structural proteomics or genomics: protein structure (and function)
- Functional genomics: protein function

EXAMPLES OF MAJOR
BIOINFORMATICS RESOURCES
- Browsing databases
  - NCBI Entrez: http://www.ncbi.nlm.nih.gov/sites/gquery
  - EBI Ensembl: http://www.ensembl.org/index.html
- Retrieving sequences
  - SRS (Sequence Retrieval System): http://srs.ebi.ac.uk/
  - ExPASy (Expert Protein Analysis System), proteomics server: http://au.expasy.org/

BIBLIOGRAPHIC INFORMATION
- PubMed and Medline
  - Recent National Institutes of Health (USA) policy
- Google Scholar
- Web of Science and Science Citation Index
- Online journals
  - SuperTier top journals: Nature, Science, Cell, PNAS, etc.
  - Open access journals: Public Library of Science (PLoS), BioMed Central

LITERATURE - PUBMED
- Citations and abstracts for articles from approx. 5000 (not all!) biomedical journals
- Text searching to identify citations of interest
- Links to full-text articles (free or otherwise)
- More than 16,000,000 records*
* 16,000,000 as of Dec 29 2005. PubMed News. http://www.ncbi.nlm.nih.gov/feed/rss.cgi?ChanKey=PubMedNews

LITERATURE – PUBMED
Search: p53 cancer. Each result shows:
- Authors
- Article title
- Bibliographic information (journal name, date, volume, issue, page numbers)
- PMID: unique ID for this record

ABSTRACTPLUS VIEW - PUBMED

STORING YOUR OWN BIBLIOGRAPHIC INFORMATION
- Online: Wizfolio, http://www.wizfolio.com
- Software: EndNote or RefMan

GENETIC AND GENOMIC DATABASES
- From sequencing of specific genes or genomic sequence of entire genomes
- Data are prepared, annotated and stored in databases
  - GenBank, NCBI
  - DDBJ, NIG
  - EBI/EMBL
- Making deposits: http://www.ncbi.nlm.nih.gov/Genbank/update.html
  - BankIt
  - Sequin

NUCLEIC ACID DATABASES
Include:
- GenBank, DDBJ, EMBL
  - Archives of primary data
  - Exchange data amongst themselves
- RefSeq
  - Summary/integration of primary data

GENBANK
- Data from:
  - Individual laboratories
  - Sequencing centres
- Any organism
- Individual records may be incomplete or inaccurate
  - Eg: sequencing errors
  - Eg: incomplete sequences
Source: NCBI Handbook

SEARCHING ENTREZ NUCLEOTIDE FOR HUMAN P53

P53 GENBANK RECORD: GI
48094186

P53 GENBANK RECORD: HEADER
- Identifiers, version, definition line
- Organismal source
- Data sources

P53 GENBANK RECORD: FEATURES
- Cross-references to other DBs
- Protein product

P53 GENBANK RECORD: SEQUENCE

THE LINKED PROTEIN RECORD: GENBANK → GENPEPT

LINKS FROM P53 GENPEPT RECORD
Available links vary from one record to another.

WITH SO MANY RECORDS HOW DO WE KNOW WHICH ONE TO WORK WITH?
They may:
- Come from different source databases
  - eg DDBJ, GenBank, EMBL (nucleotide)
- Have the same or different sequence information
  - Single changes in nucleotides/amino acids
  - Incomplete sequence
- Have variable extra annotation
  - Eg: signal peptide; domains; DB Xrefs etc

THE REFSEQ PROJECT
http://www.ncbi.nlm.nih.gov/RefSeq/index.html
- Info from:
  - Predictions from genomic sequence
  - Analysis of GenBank records
  - Collaborating databases
- Goal: a “comprehensive, integrated, nonredundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.”

REFSEQ EXAMPLE: P53 REFSEQ MRNA RECORD

P53 REFSEQ MRNA FEATURES
p53 RefSeq mRNA features include…
- Links:
  - GeneID: locus and display of genomic, mRNA and protein sequences; extensive additional annotation
  - OMIM (Online Mendelian Inheritance in Man): disease information
  - CDD: conserved protein domain
  - HGNC: official nomenclature for human genes
  - HPRD: Human Protein Reference Database
- CDS (CoDing Sequence):
  - Gene Ontology terms applied to the protein
  - Nucleotide sequence range of translated product
  - Translation: the protein sequence
  - Link to RefSeq Protein record
- Other features (sequence ranges refer to the nucleotide sequence):
  - Nuclear localization signal
  - Polyadenylation site etc

P53 REFSEQ PROTEIN
Sequence ranges in features refer to the amino acid sequence.
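The two coordinate systems mentioned above (nucleotide ranges on the mRNA record, amino acid ranges on the protein record) are related through the CDS range. The following is a minimal sketch of that conversion, assuming 1-based inclusive coordinates as used in GenBank-style feature tables. The function names and the CDS coordinates are hypothetical and are not taken from the p53 record.

```python
def nucleotide_to_residue(nt_pos, cds_start, cds_end):
    """Map a 1-based mRNA nucleotide position to a 1-based codon/residue number.

    Returns None when the position lies outside the CDS (for example in a UTR).
    Note that the last codon of a CDS is normally the stop codon, which has
    no corresponding amino acid in the translation.
    """
    if not (cds_start <= nt_pos <= cds_end):
        return None
    return (nt_pos - cds_start) // 3 + 1


def residue_to_nucleotides(residue, cds_start):
    """Return the (first, last) 1-based mRNA positions of the codon for a residue."""
    first = cds_start + (residue - 1) * 3
    return first, first + 2


# Hypothetical CDS range, chosen only for illustration.
CDS_START, CDS_END = 101, 400

print(nucleotide_to_residue(101, CDS_START, CDS_END))  # first base of codon 1
print(nucleotide_to_residue(104, CDS_START, CDS_END))  # first base of codon 2
print(nucleotide_to_residue(50, CDS_START, CDS_END))   # in the 5' UTR
print(residue_to_nucleotides(2, CDS_START))            # codon 2 spans 104..106
```

A feature annotated at residues 10..20 on the protein record would, under these assumptions, correspond to the nucleotides from the first base of codon 10 to the last base of codon 20 on the mRNA record.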
INTERPRETING REFSEQ IDENTIFIERS
- Genomic DNA
  - NC_123456: complete genome, complete chromosome, complete plasmid
  - NG_123456: genomic region
  - NT_123456: genomic contig
- mRNA: NM_123456
- Protein: NP_123456
- Gene and protein models from genome annotation projects:
  - XM_123456: mRNA
  - XR_123456: RNA (non-coding transcripts)
  - XP_123456: protein

REFSEQ STATUS
Most confident
- Validated
- Reviewed
- Provisional
--------------
- Predicted
- Model
- Inferred
- Genome Annotation
Least confident

PROTEIN DATABASE – SWISS-PROT
- Swiss-Prot: a curated database of protein sequences
  - Trained biologists extract and analyze relevant evidence from scientific publications
  - Post-translational modifications, sequence variations, functions, etc
- TrEMBL = Translated EMBL
- UniProtKB = Swiss-Prot + TrEMBL

STRUCTURES: PDB
- Three-dimensional structures of biomolecules
Image: Eric Martz, RasMol Gallery. http://www.umass.edu/microbio/rasmol/galmz.htm (Accessed Aug 16, 2006)

PDB

RESULTS SUMMARY PAGE

PDB – STRUCTURE SUMMARY

INTERACTIONS: BIND
- Physical and genetic interaction data
- Curated from published experimental evidence
- All organisms
- Physical interactions span all molecule types:
  - Protein-Protein
  - Protein-RNA
  - Protein-DNA
  - Protein-Small Molecule
  - Etc
- Details characterizing the interaction, eg binding sites
[Figure: p53 and AP2Alpha]

P53 PROTEIN-PROTEIN INTERACTIONS IN BIND – QUERY RESULTS

A BIND INTERACTION RECORD

BIND INTERACTION STATISTICS - PROTEIN

FUNCTION AND PATHWAYS DATABASES - KEGG
Several interconnected databases including:
• PATHWAY contains info on metabolic and regulatory networks.
  • 40,568 pathways generated from 301 reference pathways
• GENES contains information on genes and proteins.
• LIGAND contains information on chemical compounds and reactions involved in cellular processes.

SEARCHING KEGG GENES

LINKING FROM GENE TO PATHWAYS

KEGG HUMAN CELL CYCLE PATHWAY

SUMMARY: BIOLOGICAL DATABASES – EXAMPLES AND GENERAL CONSIDERATIONS
- Scope and sample records from selected databases: PubMed, GenBank, RefSeq, PDB, BIND, KEGG
- Primary archival databases vs. derived databases
- Relative numbers of database records:
  PubMed > RefSeq > Interactions > Structures > Reference Pathways

EXTRACTING DATA FROM THE DATABASES
Databases have variable means of accessing and working with the data:
- Keyword (simple) searches
- +/- query by ID (eg PMID)
- +/- advanced queries: Boolean; field-specific
- +/- different views of the data
- +/- ways to export or store your results
- +/- visualization
- Getting the data

A PROBLEM WITH KEYWORD SEARCHES
- Matches can occur in potentially irrelevant parts of the record.
- Eg: if we ONLY want records describing the sequence of p53 and we do a keyword search of Entrez Nucleotide with p53, a record for another gene can match merely because p53 is mentioned in a GeneRIF.
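The keyword-search problem described above can be illustrated with a toy example. This is only a sketch: the two records, their field names and the helper functions are hypothetical and do not reproduce the Entrez record format or search engine.

```python
# Two made-up records; only the first one actually describes p53.
RECORDS = [
    {
        "id": "REC1",
        "definition": "Homo sapiens tumor protein p53 (TP53), mRNA",
        "comments": "",
    },
    {
        "id": "REC2",
        "definition": "Homo sapiens hypothetical gene X, mRNA",
        "comments": "GeneRIF: gene X expression is regulated by p53",
    },
]


def keyword_search(records, term):
    """Return IDs of records where the term appears in ANY field."""
    term = term.lower()
    return [
        rec["id"]
        for rec in records
        if any(term in str(value).lower() for value in rec.values())
    ]


def field_search(records, field, term):
    """Return IDs of records where the term appears in ONE named field."""
    term = term.lower()
    return [rec["id"] for rec in records if term in rec.get(field, "").lower()]


print(keyword_search(RECORDS, "p53"))              # matches both records
print(field_search(RECORDS, "definition", "p53"))  # matches only REC1
```

Restricting the match to one field is the same idea as a field-specific advanced query: the second record drops out because p53 appears only in its comment.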
PDB ADVANCED QUERY – FIELD SPECIFIC

PDB ADVANCED QUERY – BOOLEAN FIELD SPECIFIC QUERY
“Match ALL of the following conditions”
Molecule Name: p53 AND Ligand Name: zinc

PDB CUSTOM REPORTS

SELECTING FIELDS FOR PDB CUSTOM REPORTS

PDB CUSTOM REPORT

PDB CUSTOM REPORT – SAVE REPORT IN CSV (COMMA SEPARATED VALUE)

VIEWS: GENBANK FLAT FILE VS FASTA

GETTING THE DATA
- Large-scale analyses → large-scale data retrieval
- Querying through the web interface may be ineffective
- Some DBs also have programming interfaces
- Many DBs also store their data at their FTP sites
  - → can download entire datasets for programmatic manipulation
  - Eg: flat files → parse → into tables

EG OF KEGG API
http://www.genome.jp/kegg/soap/

EXTRACTING THE DATA - SUMMARY
- Understanding database records allows us to query more effectively
- Saving our results allows us to manipulate them offline
- Different views are suited for different purposes
- The web interface is not the only way to extract data

LIMITATIONS OF DATABASES…
- May have redundant information
- May be incomplete
- May have errors
- May not be actively updated
  - Including new data
  - Including corrections to old data
  - Including updates of info from other DBs

WHERE DO THE DATA AND ADDITIONAL ANNOTATION COME FROM?
- Direct deposition of data
  - eg PDB, GenBank
- Manually extracted from the literature
  - eg BIND, SwissProt, old GenBank
- Text-mining
  - Automatically extracting biological information from the literature using computer programs
- Electronic annotation
  - Eg automated assignment of GO terms to proteins based on sequence similarity
All can be +/- human validation.

REDUNDANCY AND INCOMPLETENESS IN BIOLOGICAL DATABASES
- We’ve already seen redundancy WITHIN databases…
  - Eg multiple entries for human p53 in GenBank
- And… there can be multiple databases for a single data type.
  - Eg: SwissProt – RefSeq Protein
  - Eg: BIND – MINT – DIP – HPRD etc

REDUNDANCY AND INCOMPLETENESS:
Overlap of human protein-protein interactions between 2 databases
- HPRD (Human Protein Reference Database): 24385 interactions
- DIP (the Database of Interacting Proteins): 1049 interactions
- 73% overlap
(Gandhi et al, Nature Genetics 38, 285-293 (2006))
It is likely that NEITHER of these databases is complete.

INCOMPLETE PRIMARY DATABASES
- Not all published observations have been curated into databases
- Not all experimental observations have been published
- Not all experiments have been conducted yet
- Not all experiments can make all possible observations

STRIVING FOR COMPLETE DATASETS AND MINIMAL REDUNDANCY
- Eg GenBank – DDBJ – EMBL
- Eg for interaction data: IMEx (International Molecular Exchange consortium)
  - BIND, DIP, MINT, IntAct, MPact, BioGRID
  - Share curation (data entry + checking) workload in a non-redundant fashion
  - Curate according to common standards
  - Exchange data using the PSI MI exchange format

ERRORS
Can include, but are not limited to:
- Typographical errors (impact on keyword searches??)
- Incorrect interpretation of source data
- Experimental errors
- Text mining not validated by a human
- Incorrect automated analysis (eg predicting mRNAs from genomic sequence)

ERROR EXAMPLE: A RETRACTED RECORD
GI 4504946, RefSeq mRNA for water buffalo alpha-lactalbumin

RETRACTED RECORD: NM_002289.1 VS NM_002289.2
Re-interpretation of genomic sequence → different mRNA, same protein

SO:
1) Where possible, verify by consulting the evidence
   - Eg: for crucial information, go back to the original publication
2) Search multiple databases
3) Keep a record of unique IDs and DB versions used
4) Search using appropriate identifiers (eg from CVs, see next section) where possible

STANDARD NOMENCLATURE AND CONTROLLED VOCABULARIES
- Standard Nomenclature
- Limited computational value of free text
- CVs and Ontologies: definitions
  - Grocery Shopping example
  - Gene Ontology
  - NCBI Taxonomy

STANDARD NOMENCLATURE
A plague of biology: many names for the same unique biological objects
- MAPK14: CSBP1, CSBP2, CSPB1, EXIP…SAPK2A, p38, p38ALPHA
- MAPK1: ERK, ERK2, ERT1….PRKM2, p38, p40, p41, p41mapk
- GRAP2: RP3-370M22.1, GADS, GRAP-2, GRB2L, GRBLG, GRID, GRPL, GrbX, Grf40, Mona, P38
- AHSA1: AHA1, C14orf3, p38
- TP53: LFS1, TRP53, p53
Imagine a PubMed search on p38.
Imagine a sequence database search on p38.

STANDARD NOMENCLATURE
1) Use identifiers!
   - When referring to database entries that you have used
   - When querying databases
2) Include the “official gene name” when describing your research
   - eg HUGO Gene Nomenclature Committee approved name: HGNC:1189, “AHSA1”

LIMITED COMPUTATIONAL VALUE OF FREE TEXT
Text mining vs human interpretation of free text: Humans win!
Which would be easier for a computer to assess?
1. MoleculeA was found in the cytoplasm, whereas MoleculeB was not; rather it continued to accumulate in the nucleus.
2.
MoleculeACellPlace: Cytoplasm
MoleculeBCellPlace: Nucleus

CONTROLLED VOCABULARIES AND ONTOLOGIES
- Define: controlled vocabulary (CV)
  “a set of official descriptors assigned to a particular entry in a database, illustrating the relationship between synonyms and preferred usage terms.”
  truncated from: www.library.appstate.edu/tutorial/glossary/glossary.html
- Define: ontology
  “specification of a conceptualisation of a knowledge domain. An ontology is a controlled vocabulary that describes objects and the relations between them in a formal way…”
  truncated from: members.optusnet.com.au/~webindexing/Webbook2Ed/glossary.htm
These terms are sometimes used interchangeably in bioinformatics.
We will think of ontologies as “hierarchical CVs that specify relationships”.

EXAMPLE FROM THE “GROCERY SHOPPING DATABASE”
- CV: we might use different words for the same thing
  - Bread – le pain – das Brot
- Ontology: we can formally classify our concepts of bread

A SAMPLE (AND SIMPLE) ONTOLOGY FOR BREAD
Others are possible…
Grain product
  Breakfast cereal
  Bread (synonyms: le pain, das Brot)
    Loaf bread
      White bread (synonym: WonderBread)
    Flat bread
      Roti Prata
      Pita
      Naan

THE GENE ONTOLOGY
- What: a database of terms to describe gene (or gene product) information
  - Terms are applied to gene products
- Why:
  1) So we can use a common language to describe the same biological observations
  2) So that we can compute on these observations

GO – THE 3 ASPECTS OF DESCRIBING GENES AND THEIR PRODUCTS
- Cellular Component ~ where it is in the cell (can also be extracellular), eg nucleus
- Molecular Function ~ actions of the gene product at a molecular level, eg catalysis, binding
- Biological Process ~ biological events mediated by ordered assemblies of molecular functions, eg signal transduction
An Introduction to the Gene Ontology. http://www.geneontology.org/GO.doc.shtml

INCREASINGLY SPECIFIC TERMS WITHIN EACH “ASPECT”
Parent terms (less granular) → Child terms (more granular)

P53 CELLULAR COMPONENT ANNOTATION FROM REFSEQ PROTEIN RECORD
- cytoplasm [pmid 7720704]
- insoluble fraction [pmid 12915590]
- mitochondrion [pmid 12667443]
- nuclear matrix [pmid 11080164]
- nucleolus [pmid 12080348]
- nucleoplasm [pmid 11080164] [pmid 12915590]
- nucleus [pmid 7720704]
In general, the most specific (most “granular”) term possible is applied, given the evidence.

A GO TERM RECORD: NUCLEAR MATRIX

“NUCLEAR MATRIX” IN TREE VIEW (QUICKGO)

GROUPING ENTRIES BY LESS GRANULAR TERMS
Eg:
- ProteinA → nucleus
- ProteinB → nuclear matrix
- ProteinC → mitochondria
Can group by A COMMON PARENT TERM (a less granular term): “Intracellular Membrane-Bound Organelle”

EG PDB: BROWSE BY CELLULAR COMPONENT
“Show me all of the structures where one or more of the molecules can reside in an intracellular membrane-bound organelle”

NCBI TAXONOMY: HOMO SAPIENS

NCBI TAXONOMY: CLASS MAMMALIA

GROUPING ENTRIES BY LESS GRANULAR TERMS: NCBI TAXONOMY
Eg2: Trying to draw global patterns about host-virus interactions
“Find me all of the protein-protein interactions where one protein is from a virus and the other protein is from a mammal”

VALUE OF CVS/ONTOLOGIES IN QUERYING
- Increased ability to retrieve specific records
  - Eg: return all the records that have the exact GO term “Nuclear Matrix” in them
- Grouping observations at multiple levels
  - Eg: return all the records that have the GO term “nucleus”, or any of its child terms

SUMMARY OF L2
- Contents of some databases: PubMed, GenBank, RefSeq, etc
- Databases have limitations
- Understanding database records allows us to query more effectively
- Controlled Vocabularies can make queries more powerful
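The two kinds of query listed under the value of CVs/ontologies in querying (exact term versus a term plus its child terms) can be sketched with a toy hierarchy. This is a simplified illustration, not the real Gene Ontology: the parent links below are assumptions made for the example, each term is given a single parent, and the real ontology is a graph with several relationship types. The protein names follow the ProteinA/B/C grouping example.

```python
# Toy child -> parent links (assumed for illustration; one parent per term).
PARENT = {
    "nuclear matrix": "nucleus",
    "nucleus": "intracellular membrane-bound organelle",
    "mitochondrion": "intracellular membrane-bound organelle",
}

# One annotation per protein, as in the ProteinA/B/C grouping example.
ANNOTATIONS = {
    "ProteinA": "nucleus",
    "ProteinB": "nuclear matrix",
    "ProteinC": "mitochondrion",
}


def term_and_ancestors(term):
    """Return the term followed by all of its ancestors, nearest first."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def exact_match(term):
    """Proteins annotated with exactly this term."""
    return sorted(p for p, t in ANNOTATIONS.items() if t == term)


def match_with_children(term):
    """Proteins annotated with this term or any of its child terms."""
    return sorted(
        p for p, t in ANNOTATIONS.items() if term in term_and_ancestors(t)
    )


print(exact_match("nucleus"))
print(match_with_children("nucleus"))
print(match_with_children("intracellular membrane-bound organelle"))
```

With free-text annotations the second and third queries would require guessing every possible wording; with a hierarchy they reduce to walking the parent links.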
