Linear time algorithm for parsing RNA secondary structure
Database Resources of the National Center for Biotechnology Information
David L. Wheeler et al., Nucleic Acids Research, Vol. 33, Database issue
Baharak Rastegari
MEDG 505 presentation, February 3, 2005

NCBI! What is it?
• Created in 1988
• At the National Institutes of Health
• To develop information systems for molecular biology
• Maintains: GenBank(R) nucleic acid sequence database
• Provides: data retrieval systems & computational resources

DB Resources Categories
• Database retrieval tools
• The BLAST family of sequence-similarity search programs
• Resources for gene-level sequences
• Resources for genome-scale analysis
• Resources for the analysis of patterns of gene expression and phenotypes
• The molecular modeling database, the conserved domain database search, CDART and protein interactions

DB Resources Categories: Database retrieval tools

Entrez
• Text searching
  → using Boolean queries
  → of a diverse set of over 20 databases
• Simultaneous searches across all Entrez databases at speeds comparable to a single database search

Entrez
• Retrieved records can be displayed in a wide variety of formats
  → GenBank Flatfile, FASTA, XML, …
• Graphical display is offered for some types of records
• Search history
  → allows users to recall results of previous searches and combine them using Boolean logic

Entrez
• PubMed
  → includes 12.8 million references and abstracts in MEDLINE(R)
  → with links to the full text of more than 4400 journals available on the web
• PubMed Central
  → digital archive of peer-reviewed journals in the life sciences
  → access to over 300 000 full-text articles
  → over 160 journals
• Books database
  → contains more than 35 online scientific textbooks

Taxonomy
• Indexes over 165 000 named organisms
• Can be used to view the taxonomic position of, or retrieve data from a database for, a particular organism or group
• Searches can be made on whole, partial or phonetically spelled organism names
• Links to organisms commonly used in biological research are provided
• Displays custom taxonomic trees, representing user-defined subsets of the full NCBI taxonomy

Entrez Gene
• Successor to LocusLink
• Provides an interface to curated sequences and descriptive information about genes
• With links to gene-related resources
  → NCBI’s Map Viewer, Evidence Viewer, BLAST Link, …

DB Resources Categories: The BLAST family of sequence-similarity search programs

BLAST Family
• BLAST
  → Basic Local Alignment Search Tool
  → performs sequence-similarity searches against a variety of sequence databases
  → returns a set of gapped alignments between the query and database sequences
• BLAST2Sequences
  → compares two DNA or protein sequences
  → produces a dot-plot representation of the alignments

BLAST Family
• MegaBLAST
  → designed to search for nearly exact matches
  → handles batch nucleotide queries
  → operates up to 10 times faster than standard nucleotide BLAST
• BLASTLink (BLink)
  → displays pre-computed protein BLAST alignments for each protein in the Entrez databases
  → can display subsets of these alignments by taxonomic criteria, database of origin, …

DB Resources Categories: Resources for gene-level sequences

UniGene
• System for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters
• Each cluster contains sequences that represent a unique gene, and is linked to related information
• Human UniGene
  → over 4.5 million human ESTs
  → reduced 42-fold in number to approximately 107 000 sequence clusters
• Has been used as a source of unique sequences for the fabrication of microarrays for the large-scale study of gene expression

ProEST
• Analogous to BLASTLink
• Presents pre-computed BLAST alignments between protein sequences from model organisms and six-frame translations of UniGene nucleotide sequences
• Reports are updated in tandem with UniGene protein similarities

Trace & Assembly Archives
• Trace Archive allows for flexible searching and download of sequencing traces
• Assembly Archive links the raw sequence information found in the Trace Archive with assembly information found in GenBank

HomoloGene
• System for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes
• New HomoloGene build is guided by the taxonomic tree and relies on:
  → conserved gene order & measures of DNA similarity among closely related species
  → protein similarity for more distantly related organisms

…HomoloGene
• ‘Ancestor’ field
  → refers to the taxonomic group of the last common ancestor of the species represented in a HomoloGene entry
  → using it, it is possible to limit a search to genes conserved in one of 22 ancestral groups
• ‘Pairwise Score’ display gives a table of pairwise statistics for members of a HomoloGene group that includes
  → percent amino acid and nucleotide identities
  → Jukes-Cantor genetic distance parameter
  → the ratio of non-synonymous to synonymous amino acid substitutions (Ka/Ks)

Reference Sequences
• RefSeq provides curated references for
  → transcripts, proteins and genomic regions
  → computationally derived nucleotide sequences and proteins
• Contains 1.3 million sequences
  → including more than 1 million protein sequences
  → representing more than 2400 organisms

ORF Finder and Spidey
• ORF Finder
  → performs a six-frame translation of a nucleotide sequence
  → returns the location of each ORF within a specified size range
• Spidey
  → alignment tool for eukaryotic genomic sequences
  → takes into account predicted splice sites in constructing its alignment, and can use one of four splice-site models
  → returns exon alignments, protein translations and a summary showing the alignment quality, …

Electronic PCR (e-PCR)
• Forward e-PCR
  → searches for matches to STS primer pairs in the UniSTS database of over 450 000 markers
  → to increase sensitivity, allows adjusting the size of the primer segment to be matched, the number of mismatches, the number of gaps and the size of the STS
• Reverse e-PCR
  → used to estimate the genomic binding site, amplicon size and specificity for sets of primer pairs by searching against the genomic and transcript databases

dbSNP
• Database of single nucleotide polymorphisms
• Repository for single-base nucleotide substitutions and short deletion and insertion polymorphisms
• Contains 9.8 million human SNPs as well as about 5 million from a variety of other organisms

DB Resources Categories: Resources for genome-scale analysis

Entrez Genomes
• Provides access to genomic data contributed by the scientific community for species whose sequencing and mapping is complete or in progress
• Includes:
  → over 180 complete microbial genomes
  → more than 1600 viral genomes
  → over 550 reference sequences for eukaryotic organelles
  → …
• Complete genomes can be accessed hierarchically starting from either
  → an alphabetical listing
  → a phylogenetic tree for each of six principal taxonomic groups

COGs database
• Clusters of orthologous groups
• Presents a compilation of orthologous groups of proteins from 66 completely sequenced organisms
• Eukaryotic version, KOGs, is available for seven eukaryotes

Map & Evidence Viewer
• Map Viewer displays
  → genome assemblies
  → genetic and physical markers
  → the results of annotation and other analyses, using sets of aligned maps
• Evidence Viewer displays the alignments to a genomic contig of
  → RefSeq transcripts
  → GenBank mRNAs
  → known or potential transcripts
  → ESTs supporting a gene model

Cancer Chromosomes
• Consists of
  → NCI/NCBI SKY, M-FISH and CGH databases
  → NCI Mitelman Database of Chromosome Aberrations in Cancer
  → NCI Recurrent Chromosome Aberrations in Cancer database
• Three search formats are available
  → conventional Entrez query
  → Quick/Simple search: set of menus to select a disease site or diagnosis
  → Advanced search: combination of forms for more complex queries

DB Resources Categories: Resources for the analysis of patterns of gene expression and phenotypes

SAGEmap
• Provides two-way mapping between
  → regular (10 base) and LongSAGE (17 base) SAGE tags
  → UniGene clusters
• SAGEmap repository contains
  → 381 SAGE experiments from 11 organisms
• Can also construct a user-configurable table of data comparing one group of SAGE libraries with another
• Is updated weekly

Gene Expression Omnibus
• Data repository and retrieval system for any high-throughput gene expression or molecular abundance data
• Contains
  → microarray-based experiments measuring the abundance of mRNA, genomic DNA and protein molecules
  → non-array-based technologies such as SAGE
  → mass spectrometry peptide profiling
• Now contains
  → high-throughput gene expression data from about 30 000 hybridization experiments
  → about 1000 array definitions
  → half a billion individual spot measurements derived from over 100 organisms

OMIM
• Catalog of human genes and genetic disorders authored and edited by Victor A. McKusick at the Johns Hopkins University
• Contains information on disease phenotypes and genes
• Contains
  → about 16 000 entries

DB Resources Categories: The molecular modeling database, the conserved domain database search, CDART and protein interactions

MMDB
• Built by processing entries from the Protein Data Bank
• Structures are linked to sequences in Entrez and to the Conserved Domain Database (CDD)
• Conserved Domain Search can be used to search a protein sequence for conserved domains in CDD
• Wherever possible, CDD hits are linked to structures, which can be viewed with NCBI’s 3D molecular structure viewer, Cn3D

HIV-1/Human Protein Interaction DB
• Concise summary of documented interactions between HIV-1 proteins and
  → host cell proteins
  → other HIV-1 proteins
  → proteins from disease organisms associated with HIV or AIDS
• Summaries, including protein RefSeq accession numbers, Entrez Gene ID numbers, …, are presented

Summary / Conclusion
• NCBI provides many tools for retrieval and analysis of data in GenBank and other biological databases
• All of the tools and resources can be found easily on the website http://www.ncbi.nih.gov/ along with documentation and explanatory material
• The NCBI Handbook and several tutorials are available
• One can search for tools and information on the NCBI website by choosing NCBI Website as the database

Thank you!

Outline
• Introduction
• Related work
• Components of a Pseudoknotted Sec. Str.
• Parsing algorithm
• Enumerating loops
• Akutsu’s structure class
• Conclusion & Future work