Download Will Entrez Find Every Sequence Record?

NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries January 2016 Entrez Nucleotides Entrez Nucleotides (GenBank) • • • Database of nucleotide sequences (ATGC) Actually contains data from several databases - GenBank, EMBL, DDBJ, RefSeq Hard to search because many submitting scientists send in redundant information and poorly annotated information Nucleotide Data Domain • As of December 15, 2015 • Over 203,939,111,071 bases • Over 189,232,925 sequence records • Some complete genomes and chromosomes So Why So Hard to Search? • • • No controlled vocabulary - lose power of MeSH must OR synonyms. Often miss the records you want. Archival - quality of annotations depends on the submitter (especially features field); little to no quality control; spelling errors! Often miss the records you want. Redundant - lots of records for the same gene; partial records, etc. Often pull up records you don’t want. GenBank Sample Record • • Before searching, we will look at a GenBank sample record Note that the “Features” field provides useful biological information, and may be searched “Definition” field acts as record title – search [titl] Click any link in sample record to access definition of field and search tips Unique identifier; assigned by NCBI; required by journals/grants Link to PubMed citation/abstract The “Features” field provides the most biological information; search as [fkey] Numbers indicate location on the nucleotide sequence …3158 GenBank Identifiers • Accession Number - U49845 [accn] • Unique identifier; does not change • Letter prefix no longer has significance • Version - U49845.1 • If any change to sequence, version U49845.2 created • GenInfo Identifier (GI number) [uid] • Run parallel to accession.version system; change in sequence changes number Searching “Nucleotides” • Database is difficult to search: • Redundant records • Archival - poor or missing annotation • • Best searches are done using commands; need a class to learn all Practice search – search for sequences for human presenilin 1 • Is there anything odd about the some of the retrieved results? Choose “nucleotide” from dropdown, then click “search” Search for HUMAN presenilin 1 But end up with rat, mouse, etc. Searching “Nucleotides” • • We retrieved the non-human and PSEN2 (rather than PSEN1) records because the computer looked for the terms “human” and “presenilin 1” ANYWHERE in the record (click on details tab to see how the computer parsed your search) Use complex boolean searching to clean this up: term [field] AND term [field] Searching “Nucleotides” • • How to get rid of non-human sequences? • Search human [orgn] (this works for any taxon) How to get rid of non-presenilin 1 sequences? • Another trick – search PSEN1 [gene] • Note – you may miss relevant sequences, but should not pick up irrelevant sequences • The sequences that you miss are the ones that have not been annotated with the current official gene symbol in the “gene” field • DO NOT use this method if you need to find every sequence for a particular gene • Human [orgn] AND PSEN1 [gene] This is the search that was completed using fields (orgn, gene) and filters Use these filters to choose molecule type, confine to RefSeq records How Can I Find “Best” Sequences • • • • • • Non-redundant, curated subset of the sequence data domains Contains one record for each gene or splice variant from each organism represented Records can be thought of as “review articles” for sequences “Best” (usually longest) sequence used as seed Value-added annotations provided by experts Easy – a tab now exists to limit retrieval to just RefSeq The typical RefSeq accession number format: 2 letters, an underscore, and then numbers Click on the RefSeq link to retrieve only the “best” sequences (highly annotated, complete, nonredundant) Viewing Formats • • • The “Default” view is the standard GenBank record Researchers often use the “FASTA” format for analysis Change the record format at the “Display” pull-down menu Entrez Proteins Entrez Proteins • • Contains data from several databases: • SwissProt, PIR, PRF, PDB • Translations from annotated coding regions in GenBank and RefSeq Redundant archival data domain of publicly available protein sequences Searching Entrez Proteins • • Searched like Entrez Nucleotides “Filters” choices differ; includes molecular weight and sequence length filters Entrez Gene Entrez Gene • • • • Pulls together information (sequences, structures, literature, gene models, pathways, etc.) for genes Best place to start for “gene-centered” info One record per gene per organism Search by names, symbols, accessions, publications, GO terms, chromosome numbers, E.C. numbers, etc. Search using gene symbol Could have searched under any of these aliases (unlike GenBank where you would have to try them all) Official gene symbol as determined by the Human Genome Nomenclature Commission Summary of protein, function and diseasecausing mutations; from RefSeq record Links to PubMed records that provide evidence of function – any researcher can add these Gene Ontology terms form a controlled vocabulary with three components – biological process, molecular function, and cellular component Links to OMIM records of phenotype/ disease Links to protein interactions Links to homology maps Pathway info may be available from the Kyoto Encyclopedia of Genes and Genomes Sequence and domain links Links to GeneReviews – clinical resource Taxonomy Browser Search Taxonomy Browser • • How many genera from the family Iguanidae are represented by sequence data? How many nucleotide and protein sequences are available for the family? Entrez Searching Summary To Find Everything(?) Broaden Search • • • • • • OR together synonyms OR together related terms (gene name, gene symbol, protein name, alternate spellings, disorder) Don’t specify a field- search entire record Truncation - use * at end of word root Click “Similar articles” Try using Taxonomy Browser to pick up all taxa in a particular group Fewer/Best Records Narrow Search • • • Search particular fields: • PubMed - MeSH Browser, subheadings, major MeSH • Nucleotide - features, title, gene, properties, organism Use “Filters” Search only the RefSeq database Will Entrez Find Every Sequence Record? • No!!! • Entrez relies on annotation of records, so you are searching solely on “terminology” • Some records are not annotated, some records are poorly or incorrectly annotated • To find all useful sequences – need to search on sequence itself • BLAST Entrez “Similar Articles” • PubMed similar articles • Based on a “word weight” algorithm – MeSH, title, abstract words • In order by weight (highest weight first)

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Will Entrez Find Every Sequence Record?