Review of Biological Database Utilization

Biological Databases
We will discuss:
• Usefulness to the bioinformaticist
• Database types
• Search methods and tools
http://www.sequenceanalysis.com/

Importance of the Public Databases
• The data provide the basis for sequence-based biology
  – Open access is key
• Supported by the Human Genome Project, the International Nucleotide Sequence Database Collaboration and others
• The amount of biological data is enormous
  – Biologists are dependent on computers for storing, organizing, searching, manipulating, and retrieving the data/information

Why Search Biological Databases?
• Generate new sequence
  – Is it already in the bank?
  – Homologous sequences?
• Find out about the gene
  – Annotation
  – Literature

Why Search Biological Databases?
• Similar non-coding sequences
  – Repetitive elements
  – Regulatory regions
• Homologous proteins; families
• Identify and verify PCR priming sites

Biological Databases
Types of Databases
• Generalized databases (DNA, proteins and carbohydrates, 3D-structures)
• Specialized databases (EST, STS, SNP, RNA, genomes, protein families, pathways, microarray data ...)

Generalized Databases
• 2 Main Classes
  – DNA (nucleotide). The large databases are:
    • GenBank at NCBI (US)
    • EMBL at EBI (Europe - UK)
    • DDBJ (Japan)
  – Protein
    • SWISS-PROT/TrEMBL (high level of annotation)
    • PIR (Protein Information Resource)

Specialized Databases
• ESTs (Expressed Sequence Tags)
• STSs (Sequence-Tagged Sites)
• SNPs (Single Nucleotide Polymorphisms)
• Organismal genomic databases: human (GDB), mouse (MGD), yeast (SGD), fly
• HTGS (High Throughput Genomic Sequences)
• RNA
  – tRNAs, rRNAs, small RNAs & others

Specialized Databases
• Protein families
  – PROSITE, PRINTS, BLOCKS
• Pathways: metabolic, regulatory etc.
  – EMP, PathDB, KEGG
• Microarray data: expression data
  – 4 major: GeneX, ArrayExpress, Stanford, Gene Expression Omnibus (GEO)
To find specialized databases: http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm#

Types of Database
• Primary: archival
  – experimental data with some annotation (interpretation)
• Secondary: curated

What is annotation?
• Extraction, definition and interpretation of features on the genome sequence
• Derived by integrating computational tools and biological knowledge
  – for example, known and predicted genes
• Some databases are referred to as “annotated databases”
  – this means that they contain sequence, comments, literature references, notes on experiments…

Curated Databases
• Records are added only after they have been through a curation process
  – checked for accuracy, additional information (annotation)
  – scientific judgments are made as data are cleaned up and merged
• Examples of curated databases:
  – SWISS-PROT, OMIM, RefSeq, LocusLink

SWISS-PROT
http://www.expasy.ch/sprot/
• SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.

Organismal Databases
These databases often serve a specific research community
• Human
• Mouse
• Drosophila
• C. elegans
• Yeast
• Livestock
• Arabidopsis
• Maize
• Plasmodium
• Other
http://tolweb.org/tree/home.pages/linksdb.html#organismal

Multi-Organism Resources
www.ncbi.nlm.nih.gov
www.tigr.org
www.expasy.org

Biological Databases
Types of Database Search
• Text-based database search (SRS, Entrez)
• Sequence-based database search (sequence similarity search) (BLAST, FASTA...)
• Motif-based database search (ScanProsite, eMOTIF)
• Structure-based database search (structure similarity search) (VAST, DALI...)
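
To make the motif-based search type listed above more concrete, here is a minimal sketch in Python of what such a search does: it translates a PROSITE-style pattern into a regular expression and scans a protein sequence for hits. This is not how ScanProsite or eMOTIF are implemented. The function names `prosite_to_regex` and `scan` and the test sequence are hypothetical, and only a simplified subset of the pattern syntax is handled (`x`, `[..]`, `{..}`, repeat counts, and `<`/`>` anchors at the ends of the pattern).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (e.g. 'N-{P}-[ST]-{P}') into a regex.

    Simplified, illustrative subset of the syntax only.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")  # '<' pins the match to the N-terminus
        anchor_end = element.endswith(">")      # '>' pins the match to the C-terminus
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                          # x = any residue
            token = "."
        elif core.startswith("[") and core.endswith("]"):
            token = core                         # [ST] = S or T
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"      # {P} = anything except P
        else:
            token = re.escape(core)              # a single literal residue
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + token + ("$" if anchor_end else ""))
    return "".join(parts)


def scan(sequence: str, pattern: str):
    """Return (1-based position, matched text) for every hit, overlapping hits included."""
    # The lookahead lets the regex engine report matches that overlap each other.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-{P}-[ST]-{P} is the N-glycosylation site pattern; the sequence is made up.
    for position, hit in scan("MKNGSAANPTLNFSQ", "N-{P}-[ST]-{P}"):
        print(position, hit)
```

Short patterns like this one match by chance very often, which is one reason motif hits are usually treated as candidates to be checked against annotation rather than as conclusions.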

Database Search Tools
Text-based: querying the annotation
• SRS6 at http://srs6.ebi.ac.uk/srs6bin/cgibin/wgetz?-page+top
• ENTREZ at http://www.ncbi.nlm.nih.gov/Entrez/
• DBGET/LinkDB at http://www.genome.ad.jp/dbgetbin/www_bfind?linkdb

Sequence-based Searches
Considerations:
• Should I compare DNA or protein sequences?
• More random matches with DNA
  http://www.people.virginia.edu/~rjh9u/codetabl.html
• Protein “matches” may be more relevant
• DNA databases are larger

Sequence-based Searches
• Sensitivity vs. Selectivity
• Sensitivity: the ability to find true positive matches, even at the cost of admitting some false positives
• Selectivity: the ability to reject false positives
• There is a trade-off between the two when choosing an algorithm

Database Search Tools
Sequence-Based
• FASTA (FASTA at EBI, UK)
• BLAST (Basic Local Alignment Search Tool at NCBI, USA)
• MPsrch (Smith-Waterman algorithm-based search at EBI, UK)

More Sequence-based Tools
• BLAST Microbial Genomes at http://www.ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html (search finished and unfinished genomic sequences at NCBI)
• Genome and proteome FASTA (at EBI, UK) at http://www2.ebi.ac.uk/fasta3/genomes.html

More Sequence-based Tools
• Protein search in genomes at http://searchlauncher.bcm.tmc.edu/seqsearch/protein-search-genomes.html (BLAST and FASTA species-specific protein sequence searches at Baylor College of Medicine, USA)
• SectionSearch (FASTA or TFASTA search against predefined sections of sequence databanks at IUBIO Indiana, USA)
• NRL-3D at http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html (sequence-structure database search at Johns Hopkins University, USA)

Tools to Search Special Databases for Sequences with Similar Motifs or Patterns
ProfileScan
• uses pfscan to find similarities between a query sequence and a profile library
• PROSITE is one such database
• an ExPASy database (Expert Protein Analysis System, http://www.expasy.ch/)
• similarities are based on fingerprints or common patterns

BLOCKS Database
• a block is a motif or region of similar structure
• no gaps are introduced
• a block refers to the alignment, not the individual sequences
• the BLOCKS database is derived from PROSITE
• searches can be done at Fred Hutchinson Cancer Center in Seattle

3 Major Portals into the Genome Data
• UCSC Genome Browser at Univ. of California Santa Cruz
• Ensembl at European Bioinformatics Institute (EBI)
  – http://www.ensembl.org
• Entrez at NCBI
  – http://www.ncbi.nlm.nih.gov/Entrez/

Entrez Databases
• PubMed: the biomedical literature
  – the PubMed database contains Medline abstracts as well as links to full text articles on sites maintained by journal publishers
• Nucleotide sequence database (GenBank)
• Protein sequence database
• Structure: three-dimensional macromolecular structures
• Genome: complete genome assemblies
• PopSet: population study data sets

Entrez Databases
• OMIM: Online Mendelian Inheritance in Man
• Taxonomy: organisms in GenBank
• Books: online books
• ProbeSet: Gene Expression Omnibus (GEO)
• 3D Domains: domains from Entrez Structure

Entrez sequence searching
• can find sequences for a given gene or protein
• can download a copy of the sequence

NCBI BLAST
NCBI offers several “flavors” of BLAST

The Take Home Lessons
• Search often, search with multiple parameters
• Use specialized DBs where possible, use protein sequence if appropriate
• There are many tools available.
• You must know what tools are relevant.
• You must know how to use available tools.
• Look for sites that have multiple resources
• Google is your best friend.
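
As a rough illustration of the Entrez sequence searching slide (find sequences for a gene or protein, then download a copy), the sketch below builds query URLs for NCBI's E-utilities, the programmatic interface to Entrez. E-utilities is not mentioned in the slides, so treat its use here as an assumption. The helper names `esearch_url` and `efetch_fasta_url`, the example search term, and the IDs are hypothetical. The code only constructs URLs and makes no network request.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (an assumption of this sketch, not from the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build a text-search URL: which records in `db` match `term`?"""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_fasta_url(db: str, ids: list[str]) -> str:
    """Build a download URL that returns the given records as FASTA text."""
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


if __name__ == "__main__":
    # Step 1: text-based search of the protein database.
    print(esearch_url("protein", "insulin AND human[Organism]", retmax=5))
    # Step 2: fetch sequences for IDs returned by step 1 (placeholder IDs here).
    print(efetch_fasta_url("protein", ["123", "456"]))
```

The two steps mirror the two bullets on the slide: the search returns record identifiers, and the fetch retrieves the sequences themselves.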