Using Entrez
The Life Sciences Search Engine

Searching NCBI Databases Efficiently
• Knowing how to retrieve exactly the information you need, efficiently, is the most fundamental skill in bioinformatics.
• Every NCBI database is designed and built for a specific purpose.
• A common mistake among bioinformatics novices is searching for information in an inappropriate database.
• Entrez links records among and within databases, making it easier to search for information.

What is Entrez?
• Entrez is an NCBI retrieval system designed for searching several linked databases.
• Entrez is a search tool for integrated access to the biological literature and sequence data.
• Entrez is extremely powerful, enabling the user to move quickly between the different specialized databases.

Entrez
• Entrez is divided into sites for nucleotide, protein, structure, genomes, OMIM, and more. You can use limits (such as RefSeq) to focus your Entrez search.
• When you conduct a search via Entrez, your query generates a results screen that tells you the number of hits to your query.

The Entrez System / The Big Picture
[Diagram labels: Entrez, Books, PubMed, PopSet, ProbeSet, e!, GDB, MGC, Protein, LocusLink, OMIM, Structure, HGMD, Homologene, SNP, CDD, 3D Domains, UCSC, Nucleotide, Genome, Taxonomy, UniGene, UniSTS, MapViewer]

Entrez and LocusLink
• Entrez doesn’t link to all the databases that contain sequences, however!
• LocusLink has its own groups of links to specialty databases, since it doesn’t cover all the genomes yet.
Entrez: Database Integration
[Diagram labels: PubMed abstracts (word weight), Nucleotide sequences (BLAST), Protein sequences (BLAST), 3-D Structure (VAST), Genomes, Taxonomy (phylogeny)]

The (ever) Expanding Entrez System
[Diagram labels: Entrez, UniGene, PubMed, Nucleotide, Protein, Journals, Structure, CDD, Genome, SNP, PopSet, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, Books]

Entrez Databases
• PubMed: biomedical literature
• Books: online textbooks
• Nucleotide: GenBank, EMBL, DDBJ, RefSeq, PDB
• Protein: [GenBank, EMBL, DDBJ], RefSeq, SWISS-PROT, PIR, PRF, PDB
• Genome: complete genomes
• Taxonomy: organisms in NCBI sequence databases
• Structure: MMDB, experimental 3D structures
• Domains: CDD, conserved protein domains
• 3D Domains: compact 3D protein domains in MMDB
• OMIM: Online Mendelian Inheritance in Man
• SNP: single nucleotide polymorphisms
• UniSTS: Sequence Tagged Site markers
• ProbeSet: gene expression and microarray datasets
• PopSet: population study datasets
• UniGene: gene-based expressed sequence clusters

Nucleotide Database
• The Nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite international collaboration of sequence databases.
• EMBL is the European Molecular Biology Laboratory at Hinxton Hall, UK.
• DDBJ is the DNA Database of Japan in Mishima, Japan.
• Sequence data are also incorporated from the Genome Sequence Data Base (GSDB), Santa Fe, NM.
• Patent sequences are incorporated through arrangements with the U.S. Patent and Trademark Office (USPTO) and, via the collaborating international databases, from other international patent offices.
Entrez Nucleotides
• Primary: GenBank / EMBL / DDBJ: 35,116,960
• Derivative: RefSeq: 259,219
• Third Party Annotation: 3,182
• PDB: 4,703
Total: 35,384,248

Database Searching with Entrez
• Using limits and field restriction to find plant g6pdh
• Linking and neighboring with g6pdh

Entrez Nucleotides
[Screenshot: search for "glucose 6 phosphate dehydrogenase"]

The G6PD enzyme catalyzes the oxidation of glucose-6-phosphate to 6-phosphogluconolactone (which is then hydrolyzed to 6-phosphogluconate), while reducing nicotinamide adenine dinucleotide phosphate (NADP+ to NADPH). In terms of electron transfer, glucose-6-phosphate loses two electrons and NADP+ gains two electrons to become NADPH. This is the first step in the pentose phosphate pathway. This pathway, or shunt, as it is sometimes called, produces the 5-carbon sugar ribose (as ribose-5-phosphate), which is an essential component of both DNA and RNA.

Limits Are Helpful
• Limits allow restriction of a search to a defined subset of the database.
• Limits can be set to restrict a search to a particular database field (e.g., the Author field).
• Limits can be set to search everything but a particular type of data (e.g., “exclude patent records”).
• Alternatively, limits can be set to search only a particular type of data (e.g., Genomic RNA/DNA) or to search only data from a particular source database (e.g., EMBL). Date limits and sequence length limits are also possible.
• The contents of each Entrez database differ, and therefore the Limits available for each database differ.

Entrez Nucleotides: Limits & Preview/Index
[Screenshot: search for "glucose 6 phosphate dehydrogenase"]
Try using the Limits and Preview/Index functions to hone your search and find the plant G6PD genes.
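The field restrictions described above can also be written directly into the query string. The following is a minimal sketch, assuming the bracketed field-tag syntax that Entrez accepts (for example `[Title]` and `[Organism]`) and the Boolean operators AND and NOT; the helper names `field_term` and `build_query` are hypothetical and exist only for illustration.

```python
def field_term(text, field):
    """Return one search term restricted to a field, e.g. 'x[Organism]'."""
    # Quote multi-word phrases so they are searched as a unit.
    phrase = f'"{text}"' if " " in text else text
    return f"{phrase}[{field}]"


def build_query(required, excluded=()):
    """AND together the required terms, then subtract excluded ones with NOT."""
    query = " AND ".join(required)
    for term in excluded:
        query = f"({query}) NOT {term}"
    return query


# Restrict the enzyme name to the title and the organism to green plants,
# mirroring the Limits and Preview/Index steps on the slides.
query = build_query(
    required=[
        field_term("glucose 6 phosphate dehydrogenase", "Title"),
        field_term("green plants", "Organism"),
    ],
)
print(query)
# "glucose 6 phosphate dehydrogenase"[Title] AND "green plants"[Organism]
```

The resulting string can be pasted into the Entrez search box; which field tags are available depends on the database being searched.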
Entrez Nucleotides: Limits
[Screenshot: field restriction menu for the query "glucose 6 phosphate dehydrogenase", with "Exclude bulk sequences" available. Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume]

Entrez Nucleotides: Limits
[Screenshot: limits applied to "glucose 6 phosphate dehydrogenase": Title (== Definition), Exclude Bulk Sequences, Nuclear gene, mRNA molecule type]

Document Summaries: Limits

Adding Terms: Preview/Index
[Screenshot: the same field menu, with "green plants" added in the Organism field]

Plant cytosolic g6pdh mRNAs

Database Neighbors and Interlinking
• What makes Entrez more powerful than many services is that most of its records are linked to other records, both within a given database (such as Nucleotide) and between databases.
• Links within a database are called “neighbors” (e.g., Nucleotide neighbors).

Links Between Databases
• Protein and Nucleotide neighbors are determined by similarity searches: the BLAST algorithm compares the entry's amino acid or DNA sequence to all other amino acid or DNA sequences in the database. We will discuss BLAST in more detail later.
• Nucleotide sequence records in the Nucleotide database are linked to the PubMed citation of the article in which the sequences were published.
• Protein sequence records are linked to the nucleotide sequence from which the protein was translated.
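The links between databases can also be followed programmatically. The slides only cover the web interface, so the following is a sketch under assumptions: it uses the ELink endpoint of NCBI's E-utilities with its `dbfrom`, `db`, and `id` parameters, and only builds the request URLs without sending them. The helper `elink_url` and the record ID `12345` are hypothetical placeholders.

```python
from urllib.parse import urlencode

# ELink endpoint of the NCBI E-utilities (not mentioned on the slides).
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(dbfrom, db, uid):
    """Build an ELink URL asking for records in `db` linked to `uid` in `dbfrom`."""
    params = {"dbfrom": dbfrom, "db": db, "id": uid}
    return f"{EUTILS_ELINK}?{urlencode(params)}"


# Hypothetical record ID, used only to show the URL shape.
HYPOTHETICAL_UID = "12345"

# Protein record -> the nucleotide sequence it was translated from.
protein_to_nucleotide = elink_url("protein", "nuccore", HYPOTHETICAL_UID)

# Nucleotide record -> the PubMed citation in which it was published.
nucleotide_to_pubmed = elink_url("nuccore", "pubmed", HYPOTHETICAL_UID)

print(protein_to_nucleotide)
print(nucleotide_to_pubmed)
```

Each URL corresponds to one of the link types listed above; fetching it would return the linked record IDs rather than the records themselves.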
Plant cytosolic g6pdh mRNAs
Links and neighbors (related records)
[Screenshot labels. Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list, LinkOut. Links: PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links]

LinkOut
• LinkOut is a feature of Entrez that provides links from PubMed and other Entrez databases to a wide variety of relevant web-accessible resources:
– Full-text publications
– Other biological databases
– Consumer health information
– Research tools
• The goal is to facilitate access to relevant online resources beyond the Entrez system, to extend, clarify, or supplement information found in the Entrez databases.

Protein Database
• The Protein database includes proteins from translated regions of DNA in GenBank as well as sequences from PIR.
• The entry includes:
– The name of the protein
– How the protein sequence was derived
– An accession and a PID number
– The number of amino acids

Protein Entry
The entry also includes:
• Structural information for the protein (if known)
– Helices and sheets
– Domains
– Etc.
• The sequence of amino acids comprising the protein

Setting Protein Database search limits
• Choose Protein from the drop-down menu
– Can do a Boolean search
– Or can set LIMITS
• Fields (e.g., Author, Journal, etc.)
• Gene Location (genomic, mitochondrial, etc.)
• Segmented Sequence
• Only from (database to check)
• Modification date

Linking Between Databases
• Sometimes you will pull up a record and have no idea what organism the gene you are looking at comes from.
• For example, in the following record, what is Medicago sativa?
[Screenshot: Entrez GenBank / GenPept record]

Taxonomy to the Rescue
• Entrez lets you click a live link from the record to find out what organism Medicago sativa is.
• It is alfalfa.
• You can also tell what it is related to taxonomically, because sometimes the common name isn’t very useful either!
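The organism name and its taxonomic lineage are also printed in the record itself. The following is a minimal sketch, assuming the standard GenBank flat-file layout in which the `ORGANISM` line is followed by indented lineage lines ending in a period. The record excerpt is hand-written for illustration, its lineage is deliberately abbreviated, and `parse_organism` is a hypothetical helper.

```python
# Hand-written, abbreviated excerpt in GenBank flat-file layout (illustrative only).
SAMPLE_RECORD = """\
SOURCE      Medicago sativa (alfalfa)
  ORGANISM  Medicago sativa
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta.
REFERENCE   1
"""


def parse_organism(flatfile_text):
    """Return (organism_name, lineage_list) from a GenBank flat-file excerpt."""
    name = None
    lineage_parts = []
    in_block = False
    for line in flatfile_text.splitlines():
        if line.startswith("  ORGANISM"):
            name = line[len("  ORGANISM"):].strip()
            in_block = True
        elif in_block and line.startswith(" " * 12):
            # Continuation lines carry the semicolon-separated lineage.
            lineage_parts.append(line.strip())
        elif in_block:
            break  # The next keyword ends the ORGANISM block.
    lineage_text = " ".join(lineage_parts).rstrip(".")
    lineage = [taxon.strip() for taxon in lineage_text.split(";") if taxon.strip()]
    return name, lineage


name, lineage = parse_organism(SAMPLE_RECORD)
print(name)
print(" > ".join(lineage))
```

Reading the lineage this way answers the same question as the Taxonomy link: even when the common name is unfamiliar, the higher taxa show what the organism is related to.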
Taxonomy Link

Advanced Neighbors: BLink

What is BLink
• BLink = BLAST Link
• Someone has already done a BLAST search, and you can simply retrieve the results!
• BLink displays the graphical output of precomputed blastp results against the protein non-redundant (nr) database. This graphical output includes:
• Alignment of up to 200 BLAST hits on the query sequence
• Best hits to each organism
• List of known protein domains in the query sequence
• Filtering of hits by BLAST cutoff score
• Distribution of hits by taxonomic grouping
• Display of similar sequences with known 3D structure
• Filtering of hits by database and/or by taxonomic grouping
• Display of a taxonomic tree of all organisms with similar sequences

PopSet Links
• The PopSet database contains aligned sequences submitted as a set resulting from a population, phylogenetic, or mutation study.
• These alignments describe such events as evolution and population variation.
• The PopSet database contains both nucleotide and protein sequence data.

Protein Neighbors->PopSet Links

Protein Neighbors->Genome Links

PopSet search results
• The results of a PopSet search
• The PopSet database includes alignments of genes from multiple organisms OR different gene families OR mutational analyses

PopSet Entry
• The PopSet entry includes:
– The title of the paper/study
– The length of the sequence(s) aligned
– The number of aligned sequences

PopSet Entry without alignment
• The PopSet entry without an alignment includes:
– Title of the study
– The number of sequences included
– Links to the sequences

Entrez Structures
Protein structures are also stored in databases. A useful review tutorial is available at http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html

Entrez links to structure databases
• The Structure database, or Molecular Modeling Database (MMDB), contains experimental data from crystallographic and NMR structure determinations.
• The data for MMDB are obtained from the Protein Data Bank (PDB).
• The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.
• Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez.

Structure Search results
• Protein structures are also stored in a database
• Search as before
• Your search results are similar

Structure Entry
• The Structure entry has links to the other databases
• It also lets you download a file to open with a structure viewer program
• Proteins with similar structures and functions have been identified in the databases

BLink: Advanced Protein Neighbors

BLink: Related Structures

Viewing Structure in Cn3D
• You can download Cn3D (a structure viewer program) from NCBI
• This will allow you to view the structures from the Structure database

Cn3D Text Window
• The Text window of Cn3D aligns two or more proteins so you can compare their structures

BLink: Human Homologue

Human RefSeqs: Genome Reagents

MMDB: Molecular Modeling Data Base
• Derived from experimentally determined PDB records
• Value added to PDB records, including:
– Addition of explicit chemical graph information
– Validation
– Inclusion of Taxonomy, Citation, and other information
– Conversion to ASN.1 data description language
• Structure neighbors determined by the Vector Alignment Search Tool (VAST)

Structure Summary
[Screenshot labels: Cn3D viewer, Structure Neighbors, Conserved Domains, 3D Domain Neighbors]

Cn3D 4.1

Cn3D 4.1: Structural Alignment
[Screenshot labels: Conserved ATP binding site; Src Kinase, H. sapiens; Casein kinase, S. pombe]

Cn3D: Simple Homology Modeling
[Screenshot labels: human, swordtail]

Using Cn3D to model domains

Other services and databases from the NCBI
• LocusLink links to all available information from NCBI and beyond for a few well-characterized model organisms.
• LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 8 organisms.
• RefSeq provides a curated, optimal accession number for each mRNA (NM_006744) or protein (NP_007635)

Locus Links
• The results of a LocusLink search include:
– Locus ID
– Species
– Locus symbol
– Locus name
– Locus location
– Links
• Protein Database
• OMIM
• Reference Sequence
• Related GenBank Sequences
• Homologene Data
• UniGene
• Variation Data

LocusLink: Selected Higher Genomes
[Screenshot labels: OMIM, PubMed, Map Viewer, HomoloGene, UniGene, RefSeq, Full report, GenBank, dbSNP, Protein]

Protein Database
• The Protein database contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL, and DDBJ, as well as protein sequences submitted to:
– Protein Information Resource (PIR)
– SWISS-PROT
– Protein Research Foundation (PRF)
– Protein Data Bank (PDB) (sequences from solved structures)

NCBI Protein Databases
• GenPept: GenBank, EMBL, DDBJ CDS translations
• RefSeq: mRNA based (NP_) and genome based (XP_)
• Swiss-Prot: curated, high-quality protein reviews
• PIR: Protein Information Resource, Georgetown University
• PRF: Protein Research Foundation
• PDB: Protein Data Bank, sequences from structures

Entrez Protein
• GenPept (GB, EMBL, DDBJ): 3,442,298
• RefSeq: 856,191
• Third Party Annotation: 3,834
• Swiss-Prot: 144,508
• PIR: 282,821
• PRF: 12,079
Total: 3,442,298
BLAST nr: 1,642,191

Protein Link
[Screenshot labels: BLAST Link, Conserved Domains]

Related Proteins: Redundancy
[Screenshot label: Redundant Sequences]

Related Proteins: Links
[Screenshot label: Sequence from MutL structure]

BLink: non-redundant relatives
[Screenshot labels: Arabidopsis homolog, Conserved Domain]

MLH1 Domain Structure: CDD
[Screenshot labels: ATPase Domain, Mismatch Repair Domain]

MLH1: ATPase Domain

1BGQ: ATPase Domain in Cn3D
[Screenshot labels: Yeast HSP90, ATP Binding site, helix, Variations]

Human MLH1 BLink
[Screenshot label: Finding structural models]

Mapping Variation Onto Structure
[Screenshot labels: Loads sequence alignment and structure in Cn3D; Bacterial DNA mismatch repair proteins]

Mapping Variation Onto Structure
[Screenshot labels: Asn, Ile – Val, Conserved Asn, Ile]

NCBI Genome Databases
• The Genome database provides views for a variety of genomes, complete chromosomes, sequence maps with contigs, and integrated genetic and physical maps.

Microbial Genomes
[Screenshot label: ZWF]

Genome search results
• Genome search results
• The Genome database includes full (and some partial) genomes, from viruses to complex organisms

Genome Entry
• Genome entries include:
– Maps of the genome
– Links to the sequence
– The organism for the genome

Genes Database: All Genomes
Coming soon!

But wait! There’s more!
• There is even more at NCBI than I have covered here.
• The NCBI site map is also a guide to NCBI resources. Each link leads to a brief description of the resource on that page, then to the resource itself: http://www.ncbi.nlm.nih.gov/Sitemap/

There are many bioinformatics servers outside NCBI.
• Try ExPASy’s sequence retrieval system at http://www.expasy.ch/
• (ExPASy = Expert Protein Analysis System)
• Or try ENSEMBL at www.ensembl.org for a premier human genome web browser.