Review of Biological Database Utilization

Biological Databases
We will discuss:
• Usefulness to the bioinformaticist
• Database types
• Search methods and tools
http://www.sequenceanalysis.com/

Importance of the Public Databases
• The data provide the basis for sequence-based biology
– Open access is key
• Supported by the Human Genome Project, the International Nucleotide Sequence Database Collaboration and others
• The amount of biological data is enormous
– Biologists depend on computers for storing, organizing, searching, manipulating, and retrieving the data

Why Search Biological Databases?
• You have generated a new sequence
– Is it already in the databank?
– Are there homologous sequences?
• Find out about the gene
– Annotation
– Literature
• Similar non-coding sequences
– Repetitive elements
– Regulatory regions
• Homologous proteins; families
• Identify and verify PCR priming sites

Biological Databases: Types of Databases
• Generalized databases (DNA, proteins and carbohydrates, 3D structures)
• Specialized databases (EST, STS, SNP, RNA, genomes, protein families, pathways, microarray data ...)

Generalized Databases
• 2 main classes
– DNA (nucleotide). The large databases are:
  • GenBank at NCBI (US)
  • EMBL at EBI (Europe, UK)
  • DDBJ (Japan)
– Protein
  • SWISS-PROT/TrEMBL (high level of annotation)
  • PIR (Protein Information Resource)

Specialized Databases
• ESTs (Expressed Sequence Tags)
• STSs (Sequence-Tagged Sites)
• SNPs (Single Nucleotide Polymorphisms)
• Organismal genomic databases: human (GDB), mouse (MGD), yeast (SGD), fly
• HTGS (High Throughput Genomic Sequences)
• RNA
– tRNAs, rRNAs, small RNAs and others
• Protein families
– PROSITE, PRINTS, BLOCKS
• Pathways: metabolic, regulatory etc.
– EMP, PathDB, KEGG
• Microarray data: expression data
– 4 major: GeneX, ArrayExpress, Stanford, Gene Expression Omnibus (GEO)
To find specialized databases: http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm#

Types of Database
• Primary: archival
– experimental data with some annotation (interpretation)
• Secondary: curated

What is annotation?
• Extraction, definition and interpretation of features on the genome sequence
• Derived by integrating computational tools and biological knowledge
– for example, known and predicted genes
• Some databases are referred to as “annotated databases”
– this means that they contain sequence, comments, literature references, notes on experiments…

Curated Databases
• Records are added only after they have been through a curation process
– checked for accuracy and supplemented with additional information (annotation)
– scientific judgments are made as data are cleaned up and merged
• Examples of curated databases:
– SWISS-PROT, OMIM, RefSeq, LocusLink

Swissprot
http://www.expasy.ch/sprot/
• SWISS-PROT is a curated protein sequence database that strives to provide a high level of annotation (such as the description of a protein's function, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.

Organismal Databases
These databases often serve a specific research community
• Human
• Mouse
• Drosophila
• C. elegans
• Yeast
• Livestock
• Arabidopsis
• Maize
• Plasmodium
• Other
http://tolweb.org/tree/home.pages/linksdb.html#organismal

Multi-Organism Resources
www.ncbi.nlm.nih.gov
www.tigr.org
www.expasy.org

Biological Databases: Types of Database Search
• Text-based database search (SRS, Entrez)
• Sequence-based database search (sequence similarity search) (BLAST, FASTA...)
• Motif-based database search (ScanProsite, eMOTIF)
• Structure-based database search (structure similarity search) (VAST, DALI...)
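
To make the idea of a motif-based search concrete, here is a minimal sketch of how a PROSITE-style pattern can be turned into a regular expression and scanned against a protein sequence. It is not how ScanProsite or eMOTIF are implemented. The function names `prosite_to_regex` and `find_motif` and the test sequence are hypothetical, and only a simplified subset of the pattern syntax is handled (`x`, `[..]`, `{..}`, repeat counts, and `<` / `>` anchors at the ends of the pattern). The example pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")   # '<' pins the match to the N-terminus
        anchor_end = element.endswith(">")       # '>' pins the match to the C-terminus
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat, e.g. x(2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            regex = "."                          # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"      # any residue except those listed
        else:
            regex = core                         # a single residue or a [..] set
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(("^" if anchor_start else "") + regex + ("$" if anchor_end else ""))
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (1-based position, matched text) for every hit, including overlapping hits."""
    # The lookahead lets finditer report matches that overlap one another.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Made-up sequence used only for illustration.
    print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTLNKTQ"))
```

A pattern search like this is highly selective but only as good as the pattern itself; profile-based tools score a match rather than accepting or rejecting it outright.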
Database Search Tools
Text-based: querying the annotation
• SRS6 at http://srs6.ebi.ac.uk/srs6bin/cgibin/wgetz?-page+top
• ENTREZ at http://www.ncbi.nlm.nih.gov/Entrez/
• DBGET/LinkDB at http://www.genome.ad.jp/dbgetbin/www_bfind?linkdb

Sequence-based Searches
Considerations:
• Should I compare DNA or protein sequences?
• DNA gives more random matches (four-letter alphabet, degenerate genetic code): http://www.people.virginia.edu/~rjh9u/codetabl.html
• Protein “matches” may be more relevant
• DNA databases are larger

Sequence-based Searches
• Sensitivity vs. selectivity
• Sensitivity: the ability to find true positive matches; a highly sensitive search misses few true matches but may also report false positives
• Selectivity: the ability to reject false positives
• There is a trade-off between the two when choosing an algorithm

Database Search Tools: Sequence-Based
• FASTA (FASTA at EBI, UK)
• BLAST (Basic Local Alignment Search Tool at NCBI, USA)
• MPsrch (Smith-Waterman algorithm-based search at EBI, UK)

More Sequence-based Tools
• BLAST Microbial Genomes at http://www.ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html (search finished and unfinished genomic sequences at NCBI)
• Genome and proteome FASTA (at EBI, UK) at http://www2.ebi.ac.uk/fasta3/genomes.html
• Protein search in genomes at http://searchlauncher.bcm.tmc.edu/seqsearch/protein-search-genomes.html (BLAST and FASTA species-specific protein sequence searches at Baylor College of Medicine, USA)
• SectionSearch (FASTA or TFASTA search against predefined sections of sequence databanks at IUBIO Indiana, USA)
• NRL-3D at http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html (sequence-structure database search at Johns Hopkins University, USA)

Tools to Search Special Databases for Sequences with Similar Motifs or Patterns
ProfileScan
• uses pfscan to find similarities between a query sequence and a profile library
• PROSITE is one such database
• an ExPASy database (Expert Protein Analysis System, http://www.expasy.ch/)
• similarities are based on fingerprints or common patterns

BLOCKS Database
• a block is a motif or region of similar structure
• no gaps are introduced
• a block refers to the alignment, not the individual sequences
• BLOCKS database is derived from PROSITE
• searches can be done at the Fred Hutchinson Cancer Center in Seattle

3 Major Portals into the Genome Data
• UCSC Genome Browser at Univ. of California Santa Cruz
• Ensembl at European Bioinformatics Inst (EBI)
– http://www.ensembl.org
• Entrez at NCBI
– http://www.ncbi.nlm.nih.gov/Entrez/

Entrez Databases
• PubMed: the biomedical literature
– the PubMed database contains Medline abstracts as well as links to full-text articles on sites maintained by journal publishers
• Nucleotide sequence database (GenBank)
• Protein sequence database
• Structure: three-dimensional macromolecular structures
• Genome: complete genome assemblies
• PopSet: population study data sets
• OMIM: Online Mendelian Inheritance in Man
• Taxonomy: organisms in GenBank
• Books: online books
• ProbeSet: Gene Expression Omnibus (GEO)
• 3D Domains: domains from Entrez Structure

Entrez sequence searching
• can find sequences for a given gene or protein
• can download a copy of the sequence

NCBI BLAST
NCBI offers several “flavors” of BLAST

The Take Home Lessons
• Search often, search with multiple parameters
• Use specialized DBs where possible, use protein sequence if appropriate
• There are many tools available.
• You must know what tools are relevant.
• You must know how to use available tools.
• Look for sites that have multiple resources
• Google is your best friend.
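
As one illustration of knowing which tool is relevant, the BLAST “flavors” can be tabulated by the type of query and the type of database searched. The slides do not list the programs; the five names below are the standard BLAST programs, while the lookup table `BLAST_FLAVORS` and the helper `candidate_programs` are hypothetical names used only for this sketch.

```python
# Standard BLAST programs keyed by (query type, database type).
# The dictionary layout and the helper below are illustrative assumptions.
BLAST_FLAVORS = {
    ("nucleotide", "nucleotide"): [
        ("blastn", "nucleotide query against a nucleotide database"),
        ("tblastx", "translated nucleotide query against a translated nucleotide database"),
    ],
    ("protein", "protein"): [
        ("blastp", "protein query against a protein database"),
    ],
    ("nucleotide", "protein"): [
        ("blastx", "nucleotide query translated in six frames against a protein database"),
    ],
    ("protein", "nucleotide"): [
        ("tblastn", "protein query against a nucleotide database translated in six frames"),
    ],
}


def candidate_programs(query_type, database_type):
    """Return the names of the BLAST programs that fit the given query and database types."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_FLAVORS:
        raise ValueError("types must be 'nucleotide' or 'protein'")
    return [name for name, _description in BLAST_FLAVORS[key]]


if __name__ == "__main__":
    for (query, database), programs in BLAST_FLAVORS.items():
        for name, description in programs:
            print(f"{name:8s} {description}")
```

For example, `candidate_programs("nucleotide", "protein")` suggests blastx, which compares at the protein level in line with the advice to use protein sequence where appropriate.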