5/19/10

NCBI Resources
Sequence, Gene and Protein Resources
Benjamin King
Mount Desert Island Biological Laboratory

Genes at NCBI
[Diagram labels: Genes, Transcripts, Genome Sequence, Proteins, Variation, Functional Annotation, Publications, Expression Data, Sequence Data]

NCBI Gene (“Entrez Gene”) Database
• Comprehensive catalog of genes in many organisms
• Uses information from model organism databases
  – WormBase
  – FlyBase
  – Mouse Genome Informatics
    • http://www.informatics.jax.org
  – Rat Genome Database
  – Zebrafish Information Resource
  – …

Hierarchy of Sequence Resources
• Non-redundant protein databases
  – One sequence per protein
• Non-redundant transcript databases
  – One sequence per isoform per gene
• Raw DNA and RNA sequences
  – Foundation of the above data sets
[Diagram labels: DNA, RNA, Protein]

Nucleotide Databases
Historical archives of nucleotide sequence data:
• GenBank / EMBL / DDBJ
• NCBI Trace Archive (genome sequence)
• NCBI Short Read Archive (next gen. sequence)
[Diagram labels: DNA, RNA, Protein]

International Nucleotide Sequence Database Collaboration
• GenBank (NCBI)
• EMBL (European Bioinformatics Institute)
• DDBJ (Japan)
Information mirrored daily

GenBank / EMBL / DDBJ
• Historical repository of nucleotide sequences
  – Any species
  – Any type
    • Genomic – entire genes, entire chromosomes, fragments, BACs, constructs
    • cDNA – full-length, partial, Expressed Sequence Tags (ESTs)
    • Other – tRNA, …
  – Any quality
    • Draft vs. finished genomic sequences (e.g., BACs)
    • ESTs vs. high quality cDNAs (e.g., Mammalian Gene Collection)
• Lots of redundancy

Exponential Growth of GenBank
• GenBank version 177 (April 15, 2010)
  – 119.1 million sequences
  – 114.3 billion nucleotides

Other DNA Databases
Combines:
• GenBank/EMBL/DDBJ
• Trace Archive
• Sequence Read Archive
Assembled Genomic Sequences
Reference Sequence Project

Assembled Genomic Sequences
Mammalian Genomes
• Viewed in a “genome browser”
  – Ensembl
  – UCSC Genome Browser
  – NCBI MapViewer
Microbial Genomes
• Comprehensive Microbial Resource at J. Craig Venter Institute

Assembled Genomic Sequences
Mammalian Genomes
• Human and mouse assembled by NCBI into “builds”
  – Current human: “NCBI Build 37”
  – Current mouse: “NCBI Build 37”
• Sequencing consortiums release assemblies for other species
Microbial Genomes
• Sequencing center releases assembly

Non-Redundant Transcript Sequences
• RefSeq from NCBI – primarily human, mouse, rat
• View in genome browser or search using Entrez at NCBI

Non-Redundant Protein Sequences
• SWISS-PROT – best annotated protein sequence database
>sp|P31273|HXC8_HUMAN Homeobox protein Hox-C8 (Hox-3A) - Homo sapiens (Human).
MSSYFVNPLFSKYKAGESLEPAYYDCRFPQSVGRSHALVYGPGGSAPGFQHASHHVQDFF
HHGTSGISNSGYQQNPCSLSCHGDASKFYGYEALPRQSLYGAQQEASVVQYPDCKSSANT
NSSEGQGHLNQNSSPSLMFPWMRPHAPGRRSGRQTYSRYQTLELEKEFLFNPYLTRKRRI
EVSHALGLTERQVKIWFQNRRMKWKKENNKDKLPGARDEEKVEEEGNEEEEKEEEEKEEN
KD

Derive protein sequence from these transcripts

Other resources (transcripts):
1. Dana Farber Gene Indices – animal, plant, protist, and fungal species
2. DoTS at UPenn – human and mouse only
3. STACK at SANBI (S. Africa) – just human

Other resources (proteins):
1. UniProt = SWISS-PROT + TrEMBL – translated EMBL added
2. RefSeq at NCBI
http://www.expasy.org

RefSeq (Reference Sequence) Project

Easiest Way to Obtain Sequences
• NCBI Entrez
  – GenBank
  – RefSeq
  – UniProt
  – Trace Archives
  Extensive links to other resources and also pre-computed BLAST searches
• Genome Sequences
  – Genome browser (e.g., UCSC or Ensembl)
    • Query by chromosomal coordinates
    • Repeat masked sequence at UCSC
  – Comprehensive Microbial Resource

What is Available Via Entrez?
• PubMed
• Protein
• Nucleotide
• Structure
• Genome
• PopSet
• OMIM
• Taxonomy
• Books
• GEO
• 3D Domains
• UniSTS
• SNP
• CDD

Bulk Sequence Retrieval
Batch Entrez
A method for obtaining large numbers of sequences by supplying a file containing a list of GI or accession numbers.
BioMart at Ensembl
Obtain sequences for a set of genes, or their transcripts or proteins, that lie in a chromosomal region, are associated with a list of identifiers, or share a common classification.

Worked Examples
Worked Example #1: Use Entrez to explore the information NCBI has about the human gene BAX. In this example, you will explore the Entrez Gene detail page and use links from the page to explore:
– homologs
– gene expression information
– protein function information
– OMIM
– links out to the mouse and zebrafish model organism databases
– Gene Ontology terms
You will also explore gene expression information in BioGPS.
Worked Example #2: Retrieve the sequence for the skate HOXA cluster and view the annotation information contained within the GenBank record.
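
Worked Example #2 is meant to be done in the Entrez web interface, but the same retrieval can be scripted. The sketch below only builds the request URL for NCBI's E-utilities EFetch service and does not contact the network. The function name `genbank_record_url` and the accession string are hypothetical: the slides do not give the accession of the skate HOXA record, so a placeholder stands in for it.

```python
from urllib.parse import urlencode

# Public E-utilities EFetch endpoint at NCBI.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str, rettype: str = "gb") -> str:
    """Build an EFetch URL for one nucleotide record.

    rettype="gb" requests the annotated GenBank flat file;
    rettype="fasta" requests the sequence only.
    """
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


# Placeholder only: replace with the accession found through Entrez.
PLACEHOLDER_ACCESSION = "ACCESSION_GOES_HERE"

print(genbank_record_url(PLACEHOLDER_ACCESSION))
print(genbank_record_url(PLACEHOLDER_ACCESSION, rettype="fasta"))
```

Opening the first URL in a browser (with a real accession) should show the flat-file record, including its annotated features; the second returns the sequence in FASTA format.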
Worked Example #3: Find all skate ESTs in GenBank.
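
A search like the one in Worked Example #3 can also be written as an Entrez query string. The sketch below builds an E-utilities ESearch URL without sending it. Several things here are assumptions rather than content from the slides: the species name (the little skate, Leucoraja erinacea, is used as an example of "skate"), the helper name `est_search_url`, and the use of the `gbdiv_est[PROP]` property to restrict a Nucleotide search to the EST division of GenBank.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint at NCBI.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def est_search_url(organism: str, retmax: int = 20) -> str:
    """Build an ESearch URL for EST records from one organism.

    The organism name is quoted so that a two-word species name is
    treated as a single phrase in the [Organism] field.
    """
    # Assumed filter: gbdiv_est[PROP] limits results to the EST division.
    term = f'"{organism}"[Organism] AND gbdiv_est[PROP]'
    params = {"db": "nucleotide", "term": term, "retmax": retmax}
    return f"{ESEARCH_BASE}?{urlencode(params)}"


# Assumed example species; the slides only say "skate".
url = est_search_url("Leucoraja erinacea")
print(url)
```

The ESearch response reports a total hit count and a list of record identifiers, which could then be passed to Batch Entrez or EFetch to retrieve the sequences themselves.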