NCBI FieldGuide
NCBI Molecular Biology
Resources
A Field Guide
August 2-3, 2005
University of Massachusetts
• The NCBI Entrez System
• NCBI Sequence Databases
– Primary data: GenBank
– Derivative data: RefSeq, Gene, Genome
– Beyond RefSeq: UniGene, Trace Archive
• NCBI Genomic Resources
** Intermission **
• BLAST
• Protein Structure and Function
• Sequence polymorphisms and phenotypes
NCBI Resources
Bethesda, MD
The National Institutes of Health
• Created as a part of NLM in 1988
– Establish public databases
– Perform research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
The National Center for
Biotechnology Information
Web Access
Text: Entrez
Sequence: BLAST
Structure: VAST
NCBI Web Traffic
[Figure: NCBI users per day, 1998 to 2005, on a scale of 100,000 to 600,000, shown alongside world and US Internet users. Annotation: Christmas and New Year’s Day.]
30,000 files per day
620 Gigabytes per day
The NCBI ftp site
• NCBI accepts submissions of primary data
• NCBI develops tools to analyze these data
• NCBI uses these tools to create derivative
databases based on the primary data
• NCBI provides free search, link, and
retrieval of these data, primarily through
the Entrez system
What does NCBI do?
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, SNP, GEO, PubChem Substance
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, TPA, RefSNP, UniGene, Protein, Structure,
Conserved Domain, PubChem Compound
Types of Databases
Primary vs. Derivative Databases
[Diagram: labs and sequencing centers submit sequences to GenBank (divisions PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL, EST, STS, GSS, HTG), which is updated ONLY by submitters. Algorithms and curators build the derivative databases (UniGene, UniSTS, and RefSeq through the Annotation Pipeline and the LocusLink and Genomes Pipelines), which are updated continually by NCBI.]
What is Entrez?
• A system of 29 linked databases
• A text search engine
• A tool for finding biologically linked data
• A retrieval engine
• A virtual workspace for manipulating large datasets
The Entrez System: Text Searches
• Each record is assigned a UID
– unique integer identifier for internal tracking
– GI number for Nucleotide
• Each record is given a Document Summary
– a summary of the record’s content (DocSum)
• Each record is assigned links to biologically
related UIDs
• Each record is indexed by data fields
– [author], [title], [organism], and many others
Entrez Databases
The backbone of NCBI
[organism]
Entrez Taxonomy
• GenBank: Primary Data (97.9%)
– original submissions by experimentalists
– submitters retain editorial control of records
– archival in nature
• RefSeq: Derivative Data (2.1%)
– curated by NCBI staff
– NCBI retains editorial control of records
– record content is updated continually
An Entrez Database - Nucleotide
Entrez Nucleotide
Primary Data
• DDBJ / EMBL / GenBank 56,865,268
Derivative Data
• RefSeq 1,226,084
• PDB 5,973
• Third Party Annotation 4,650
Total 58,101,975
NCBI’s Primary Sequence Database
What is GenBank?
• Nucleotide only sequence database
• Archival in nature
• Each record is assigned a stable accession number
• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– ftp accounts (genome data)
• Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database
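The International Sequence Database Collaboration
[Diagram: three collaborating databases. GenBank (NCBI, NIH): submissions and updates through Sequin, BankIt, and ftp; retrieval through Entrez. EMBL (EBI): submissions and updates; retrieval through SRS. DDBJ (CIB, NIG): submissions and updates; retrieval through getentry.]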
GenBank Releases
Release 148, June 2005
Records: 45,236,251
Nucleotides: 49,398,852,122
Species: >140,000
172 Gigabytes, 785 files
• full release every two months
• incremental and cumulative updates daily
• available only through internet
ftp://ftp.ncbi.nih.gov/genbank/
The Growth of GenBank
[Figure: base pairs (billions) and records (millions) plotted by date, Jun-82 to Jun-04.]
Release 148: 45.2 million records, 49.4 billion nucleotides
Average doubling time ≈ 14 months*
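The doubling-time figure can be turned into a rough projection. The following is a minimal sketch, assuming growth stays exponential at the 14-month doubling time quoted on the slide; the function names are illustrative and not part of any NCBI tool.

```python
import math


def projected_size(current: float, months_ahead: float,
                   doubling_months: float = 14.0) -> float:
    """Size after months_ahead months of steady exponential growth."""
    return current * 2 ** (months_ahead / doubling_months)


def months_to_reach(current: float, target: float,
                    doubling_months: float = 14.0) -> float:
    """Months needed to grow from current to target at the same rate."""
    return doubling_months * math.log2(target / current)


# Release 148 held about 49.4 billion nucleotides (value from the slide).
release_148 = 49.4
print(projected_size(release_148, 14))      # size one doubling time later
print(months_to_reach(release_148, 100.0))  # months until 100 billion
```

Real growth will not follow a fixed rate indefinitely, so the projection is only a back-of-the-envelope estimate.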
GenBank Divisions
Traditional
•Direct Submissions (Sequin/BankIt)
•Accurate (~1 error per 10,000 bp)
•Well characterized
•Organized by taxonomy
PRI (28) Primate
ROD (14) Rodent
PLN (13) Plant and Fungal
BCT (10) Bacterial/Archaeal
INV (7) Invertebrate
VRT (7) Other Vertebrate
VRL (4) Viral
MAM (2) Mammalian
PHG (1) Phage
SYN (1) Synthetic
UNA (1) Unannotated
Bulk
•From sequencing projects
•Batch submissions (ftp/email)
•Inaccurate
•Poorly Characterized
•Organized by sequence type
EST (349) Expressed Sequence Tag
GSS (120) Genome Survey Sequence
HTG (62) High Throughput Genomic
HTC (6) High Throughput cDNA
STS (5) Sequence Tagged Site
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
/translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
LSLLFQPLVN"
ORIGIN
1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
1921 aaaaaaaaaa a
//
The Flatfile Format
Header
Feature Table
Sequence
A Traditional GenBank Record
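The three parts of the flat file (header, feature table, sequence) can be separated with a few lines of code. The sketch below is a simplified illustration, not a full parser: it assumes a single record with standard `FEATURES`, `ORIGIN`, and `//` marker lines, and the function name `split_flatfile` is hypothetical. The sample record is a shortened excerpt of AY182241.

```python
def split_flatfile(text: str) -> dict:
    """Split one GenBank flat file record into header, features, and sequence."""
    header, features, sequence = [], [], []
    current = header
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):  # feature table starts here
            current = features
        elif line.startswith("ORIGIN"):  # sequence starts on the next line
            current = sequence
            continue
        current.append(line)
    # Sequence lines hold a position number and blocks of bases; keep letters only.
    bases = "".join(ch for ln in sequence for ch in ln if ch.isalpha())
    return {"header": header, "features": features, "sequence": bases}


record = """LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
FEATURES             Location/Qualifiers
     gene            1..1931
                     /gene="AFS1"
ORIGIN
        1 ttcttgtatc ccaaacatct
//
"""
parts = split_flatfile(record)
print(len(parts["header"]), len(parts["features"]), parts["sequence"])
```

For real work, an established parser is a safer choice than hand-rolled string handling; this sketch only makes the three-part layout concrete.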
An Example Record – M17755
Indexing for Nucleotide UID 4680720
Field                 Indexed Terms
[primary accession]   M17755
[title]               Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]            Homo sapiens
[sequence length]     3060
[modification date]   1999/04/26
[properties]          biomol mrna
                      gbdiv pri
                      srcdb genbank
M17755: Feature Table
TPO [gene name]
CDS position in bp
thyroiditis [text word]
thyroid peroxidase [protein name]
protein accession
The sequence itself
is not indexed…
Use BLAST for that!
Sequence: 99.99% Accurate
Entrez Protein
• GenPept (DDBJ, EMBL, GenBank) 4,444,405
• RefSeq 1,753,167
• PIR 222,395
• Swiss-Prot 189,005
• PDB 68,621
• PRF 12,079
• Third Party Annotation 4,219
Total 6,693,891
Protein Sources and Links
PIR: no mRNA!
RefSeq → NM_000537
SWISS-PROT: no mRNA!
GenPept → M17755
First seen at NCBI, not
first seen at GenBank!
Version and GI change only if the sequence changes
The accession number always retrieves the most recent version
Sequence Revisions
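The revision rule above (the version changes only when the sequence changes, and the bare accession always points at the latest version) can be illustrated by splitting identifiers such as AY182241.2. This is a minimal sketch; the names `SeqId`, `parse_seq_id`, and `sequence_changed` are hypothetical helpers, not NCBI functions.

```python
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    accession: str
    version: Optional[int]


def parse_seq_id(text: str) -> SeqId:
    """Split 'AY182241.2' into accession and version; version is None if absent."""
    accession, dot, version = text.strip().partition(".")
    return SeqId(accession, int(version) if dot else None)


def sequence_changed(old: str, new: str) -> bool:
    """True when two identifiers share an accession but differ in version."""
    a, b = parse_seq_id(old), parse_seq_id(new)
    if a.accession != b.accession:
        raise ValueError("identifiers refer to different records")
    return a.version != b.version


print(parse_seq_id("AY182241.2"))
print(sequence_changed("AY182241.1", "AY182241.2"))
```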
Update without a Sequence Change
June 15, 1989!
GenBank came
to NCBI in 1992!
Update with a Sequence Change
ASN.1 – The Raw Data
flat file
XML (4 flavors)
FASTA
GenBank File Formats
NCBI Toolbox
/************************************************************************
*
*   asn2ff.c
*
*   convert an ASN.1 entry to flat file format, using the FFPrintArray.
*
**************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#ifdef ENABLE_ID1
#include <accid1.h>
#endif
FILE *fpl;
Args myargs[] = {
{"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
{"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
{"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
{"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
{"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},
Toolbox Sources
ftp> open ftp.ncbi.nih.gov
.
.
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools
Text Searches in Entrez
term1 term2
If no [limit] is specified…
Organism? → [organism]
Journal? → [journal]
User compounds? → search as phrase
Author? → [author]
else → [All Fields]
term1[limit] OP term2[limit] OP …
where
limit = Entrez indexing field (organism, author, …)
OP = AND, OR, NOT
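The `term[limit] OP term[limit]` pattern is easy to assemble programmatically. The sketch below is one possible helper, assuming all terms are joined by the same operator; `fielded` and `build_query` are hypothetical names, and the function only builds the query string without contacting Entrez.

```python
from typing import Iterable, Optional, Tuple

VALID_OPS = {"AND", "OR", "NOT"}


def fielded(term: str, limit: Optional[str] = None) -> str:
    """Attach an Entrez field limit to a term, e.g. human[organism]."""
    return f"{term}[{limit}]" if limit else term


def build_query(terms: Iterable[Tuple[str, Optional[str]]], op: str = "AND") -> str:
    """Join (term, limit) pairs with one Boolean operator."""
    op = op.upper()
    if op not in VALID_OPS:
        raise ValueError(f"operator must be one of {sorted(VALID_OPS)}")
    return f" {op} ".join(fielded(term, limit) for term, limit in terms)


print(build_query([("thyroid peroxidase", "title"), ("human", "organism")]))
```

A term passed with `None` as its limit is left unfielded, so Entrez applies the default handling described on the slide.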
Limits
Provides a simple form for applying commonly used Entrez limits
Preview/Index
Allows access to the full indexing of each Entrez database
and aids in constructing complex queries
History
Provides access to previous searches in the current Entrez database
Clipboard
A temporary storage area for selected records
Details
Displays the detailed parsing of the current Entrez query, and
lists errors and terms without matches
Entrez Tabs
Programming Entrez: E-Utilities
http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html
Entrez query → ESearch → UID list or History
UID list or History → ESummary → Document summaries
UID list or History → EFetch → Formatted data
UID list or History → ELink → UID list or History
UID list → EPost → History
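An ESearch followed by an EFetch can be sketched as two URLs. The example below only constructs the request URLs and makes no network call. It assumes the current public E-utilities base address (which differs from the 2005 help URL on the slide) and uses the `db`, `term`, `usehistory`, `id`, `rettype`, and `retmode` parameters; the helper names are illustrative.

```python
from urllib.parse import urlencode

# Assumption: current public base address of the E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, use_history: bool = False) -> str:
    """URL for an ESearch request: Entrez query in, UID list (or History) out."""
    params = {"db": db, "term": term}
    if use_history:
        params["usehistory"] = "y"
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db: str, ids, rettype: str = "gb", retmode: str = "text") -> str:
    """URL for an EFetch request: UID list in, formatted records out."""
    params = {
        "db": db,
        "id": ",".join(str(uid) for uid in ids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("nucleotide", "M17755[primary accession]", use_history=True))
print(efetch_url("nucleotide", [4680720]))
```

Here 4680720 is the Nucleotide UID shown for M17755 earlier in the deck.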
• Search Entrez Nucleotide
– 97.9% GenBank (primary data)
– 2.1% RefSeq (curated data)
Possible queries we’ve seen so far…
M17755 [primary accession]
thyroid peroxidase [title]
Homo sapiens [organism]
3060 [sequence length]
biomol mrna [properties]
srcdb genbank [properties]
TPO [gene name]
thyroiditis [text word]
thyroid peroxidase [protein name]
1999/04/26 [modification date]
gbdiv pri [properties]
Finding Primary Sequences
A Starting Query
Find nucleotide records for human thyroid peroxidase
human thyroid peroxidase
309 records
(("Homo sapiens"[Organism] OR human[All Fields]) AND thyroid peroxidase[All Fields])
Field Limit!
human[organism] AND thyroid peroxidase
298 records
("Homo sapiens"[Organism] AND thyroid peroxidase[All Fields])
11 records aren’t human sequences!!
Limit by Title and Database
Entrez Nucleotide
GenBank: srcdb ddbj/embl/genbank[properties]
RefSeq: srcdb refseq[properties]
#1: thyroid peroxidase AND human[orgn]  298
#2: thyroid peroxidase[title] AND human[orgn]  169
#3: #2 AND srcdb refseq[properties]  5
#4: #2 AND srcdb ddbj/embl/genbank[properties]  164 (primary data)
Limit by GenBank Division
EST Division: gbdiv est[prop]
Primate Division: gbdiv pri[prop]
#1: thyroid peroxidase AND human[orgn]  298
#2: thyroid peroxidase[title] AND human[orgn]  169
#3: #2 AND srcdb refseq[properties]  5
#4: #2 AND srcdb ddbj/embl/genbank[properties]  164
#5: #4 AND gbdiv est[prop]  20
#6: #4 AND gbdiv pri[prop]  144 (traditional GenBank records)
Limit by Biomolecule Type
Genomic DNA: biomol genomic[prop]
cDNA: biomol mrna[prop]
#1: thyroid peroxidase AND human[orgn]  298
#2: thyroid peroxidase[title] AND human[orgn]  169
#3: #2 AND srcdb refseq[properties]  5
#4: #2 AND srcdb ddbj/embl/genbank[properties]  164
#5: #4 AND gbdiv est[prop]  20
#6: #4 AND gbdiv pri[prop]  144
#7: #6 AND biomol genomic[prop]  26 (genomic DNA)
#8: #6 AND biomol mrna[prop]  118 (mRNA / cDNA)
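Each history entry that cites an earlier one (`#2`, `#4`, `#6`) stands for a longer query. The sketch below expands such references recursively so the full query behind the mRNA step becomes visible. The `expand` function is a hypothetical illustration of the idea and not how Entrez itself stores history; the entries follow the steps in which each limit narrows the previous result.

```python
import re


def expand(history: dict, number: int) -> str:
    """Replace each #n reference in a history entry with the query it names."""
    def substitute(match: re.Match) -> str:
        return "(" + expand(history, int(match.group(1))) + ")"
    return re.sub(r"#(\d+)", substitute, history[number])


history = {
    1: "thyroid peroxidase AND human[orgn]",
    2: "thyroid peroxidase[title] AND human[orgn]",
    4: "#2 AND srcdb ddbj/embl/genbank[properties]",
    6: "#4 AND gbdiv pri[prop]",
    8: "#6 AND biomol mrna[prop]",
}
print(expand(history, 8))
```

Parentheses are added around every substituted query so that operator grouping stays the same as in the stepwise search.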
thyroid peroxidase[protein name] AND human[orgn] AND
gbdiv pri[prop] AND biomol mrna[prop]
118 records [title] → 4 records [protein name]
Limit by Protein Name
Links menu
Click the accession to view the record
Links to other
Entrez databases
computed for
M17755
Entrez Document Summaries
Gene annotation based on M17755
Full text online articles about M17755
All polymorphisms in the TPO gene
DNA/RNA sequences similar to M17755
Graphical view of TPO gene annotation
Human phenotypes involving TPO
Microarray datasets for M17755
Protein translation of M17755
Literature abstracts about M17755
Sequence polymorphisms in M17755
Source organism of M17755
STS markers in the TPO gene
TPO links beyond NCBI
Entrez Links for GI 4680720
Viewing M17755
Which one is the best sequence???
GenBank Sequences for Human TPO
RefSeq: NCBI’s Derivative Sequence Database
RefSeq Benefits
• Non-redundant
• Explicitly linked nucleotide and protein sequences
• Updated to reflect current sequence data and biology
• Validated by hand
• Format consistency
• Distinct accession series
• Stewardship by NCBI staff and collaborators
ftp://ftp.ncbi.nih.gov/refseq/release
RefSeq: NCBI’s Derivative Sequence Database
Nucleotide → Protein
• Curated transcripts and proteins
– NM_123456 → NP_123456
– NR_123456 (non-coding RNA)
• Model transcripts and proteins
– XM_123456 → XP_123456
– XR_123456 (non-coding RNA)
• Assembled Genomic Regions (contigs)
– NT_123456 (BAC clones)
– NW_123456 (WGS)
• Other Genomic Sequence
– NG_123456 (complex regions, pseudogenes)
– NZ_ABCD12345678 (WGS) → ZP_123456
• Chromosome records in Entrez Genome
– NC_123456 (chromosome; microbial or organelle genome)
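Because each RefSeq record type has its own accession prefix, the type can be read straight from the identifier. The lookup below is a minimal sketch covering only the prefixes listed on this slide; the descriptions paraphrase the slide, and the `classify` helper is hypothetical.

```python
# Prefix table limited to the accession series named on the slide.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NR": "curated non-coding RNA",
    "NP": "curated protein",
    "XM": "model mRNA",
    "XR": "model non-coding RNA",
    "XP": "model protein",
    "NT": "genomic contig (BAC clones)",
    "NW": "genomic contig (WGS)",
    "NG": "other genomic sequence (complex regions, pseudogenes)",
    "NZ": "other genomic sequence (WGS)",
    "ZP": "protein annotated on an NZ record",
    "NC": "chromosome; microbial or organelle genome",
}


def classify(accession: str) -> str:
    """Describe a RefSeq accession by its two-letter prefix."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return "not a RefSeq accession"
    return REFSEQ_PREFIXES.get(prefix.upper(), "unknown RefSeq prefix")


for acc in ("NM_000547", "XM_496543", "NC_000002", "M17755"):
    print(acc, "->", classify(acc))
```

GenBank accessions such as M17755 carry no underscore, which is what the first check relies on.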
Genome annotation
Longest mRNA
NMs must have
cDNA support
Creating NM Records
NM_000547: variant 1
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff.
The reference sequence was derived from M17755.2 and AW874082.1.
On Feb 25, 2003 this sequence version replaced gi:21361188.
NM_175719: variant 2
EST that completes 3’ end
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff.
The reference sequence was derived from J02970.1, AW874082.1 and M17755.2.
Nucleotide
Protein
NM/NP Records in Entrez
Genomic DNA
(NC, NT, NW)
Scanning....
Model mRNA (XM)
(XR)
Curated mRNA (NM)
(NR)
RefSeq
GenBank
Sequences
Annotating the Gene
Model protein (XP)
= ?!
Curated Protein (NP)
GenBank
RefSeq
Gene
Nucleotide
• Entrez Gene is the central depository for information about a gene
available at NCBI, and often provides links to sites beyond NCBI
• Entrez Gene includes records for organisms that have NCBI Reference
Sequences (RefSeqs)
• Entrez Gene records contain RefSeq mRNAs, proteins, and genomic
DNA (if known) for a gene locus, plus links to other Entrez databases
• NCBI RefSeqs are based on primary sequence data in GenBank
Entrez Gene and RefSeq
Entrez Gene: RefSeq Annotations
NM/NP Records in Entrez Gene
NM
Entrez Gene RefSeq Graphics
NP
Entrez Gene
What about LOC440844?
Is there any GenBank support for this mRNA?
srcdb ddbj/embl/genbank[prop] AND biomol mrna[prop]
no full-length hit
BLAST Results for XM_496543
XM records are models based only on genomic sequence, and are subject
to revision or removal with each new build of that genome.
BLAST the XM against the RefSeq database to look for a replacement:
Query= gi|20850420|ref|XM_124429.1|
Mus musculus expressed sequence AA553001 (AA553001), mRNA
gi|19527087|ref|NM_133873.1|
Mus musculus DNA segment, Chr 4, Wayne State University 114,
expressed (D4Wsu114e), mRNA Length=1898
Score = 3701.55 bits (1867), Expect = 0
Identities = 1870/1871 (99%), Gaps = 0/1871 (0%) Strand=Plus/Plus
The Perils of the XM
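The BLAST report prints identity as a whole percent, which hides how close the XM and NM sequences really are. The sketch below recomputes the value from the `Identities` line; the regular expression assumes the line layout shown in this hit, and `percent_identity` is a hypothetical helper.

```python
import re

# Matches the "Identities = matched/aligned" part of a BLAST alignment header.
IDENTITIES = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def percent_identity(line: str) -> float:
    """Return matched/aligned positions as a percentage."""
    match = IDENTITIES.search(line)
    if match is None:
        raise ValueError("no Identities field found")
    matched, aligned = map(int, match.groups())
    return 100.0 * matched / aligned


line = "Identities = 1870/1871 (99%), Gaps = 0/1871 (0%) Strand=Plus/Plus"
print(f"{percent_identity(line):.3f}%")
```

A single mismatch over 1871 aligned bases supports treating the NM record as the replacement for the model.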
Eukaryotic NM/XM Records
Bos taurus: 37541
Oryza sativa (japonica cultivar-group): 36836
Danio rerio: 30577
Homo sapiens: 29261
Arabidopsis thaliana: 28953
Mus musculus: 27033
Rattus norvegicus: 23975
Pan troglodytes: 21810
Caenorhabditis elegans: 21124
Drosophila melanogaster: 19412
Aspergillus nidulans FGSC A4: 18951
Gallus gallus: 18120
Canis familiaris: 16891
Anopheles gambiae str. PEST: 15328
Plasmodium chabaudi: 14747
Candida albicans SC5314: 13672
Dictyostelium discoideum: 13570
Ustilago maydis 521: 13044
Plasmodium berghei: 11778
Gibberella zeae PH-1: 11640
Magnaporthe grisea 70-15: 11109
Neurospora crassa: 10079
Aspergillus fumigatus Af293: 9923
Entamoeba histolytica HM-1:IMSS: 9772
Cryptococcus neoformans var. neoformans JEC21: 6594
Giardia lamblia ATCC 50803: 6569
Yarrowia lipolytica CLIB99: 6521
Debaryomyces hansenii CBS767: 6318
Apis mellifera: 6292
Kluyveromyces lactis NRRL Y-1140: 5327
Candida glabrata CBS138: 5181
Schizosaccharomyces pombe 972h-: 5035
Eremothecium gossypii: 4718
Theileria parva: 4079
Xenopus tropicalis: 4069
Cryptosporidium hominis: 3886
Cryptosporidium parvum: 3396
Sus scrofa: 938
Trypanosoma brucei: 599
Ovis aries: 253
Strongylocentrotus purpuratus: 215
Felis catus: 162
Plasmodium yoelii yoelii: 105
Takifugu rubripes: 7
Ciona intestinalis: 3
Trypanosoma cruzi: 3
GenBank Components
(clones, WGS)
NT/NW Contigs
NC
Genome
Assembly
NM/XM
Master
mRNA
Components
Components
Genome Annotation in Entrez Nucleotide
curated mRNA
genomic contig on human chromosome 2
containing NM_000547
human chromosome 2
the 21 contigs of the
chromosome 2 assembly
Genome Annotation Links
Genomic sequence
Getting the Annotation Details
ACCESSION NC_000002 REGION: 1396242..1525502
exon-intron structure
These flat files contain all annotations in the gene and the full, explicit sequence
Getting the Annotation Details
Searching Entrez Gene
Gene symbol: human thyroid peroxidase (TPO)
tpo [sym] AND human [organism]
Protein name: topoisomerase genes from Archaea
topoisomerase[gene/protein name] AND archaea [organism]
Chromosome and Links: genes on human chromosome 2 with OMIM links
2 [chromosome] AND gene omim [filter] AND human [organism]
RefSeq status and variants: Reviewed RefSeqs with transcript variants
srcdb refseq reviewed[prop] AND has transcript variants[prop]
Disease and Gene Ontology: Membrane proteins linked to cancer
integral to plasma membrane[gene ontology] AND cancer [dis]
Gene Links in Entrez
Microarray datasets for TPO
Gene homologs for TPO
DNA and RNA sequences for TPO
Phenotypes involving TPO
Protein sequences for TPO
Literature abstracts about TPO
Sequence polymorphisms in TPO
Species whose genome has this TPO gene
STS markers in the TPO gene
ESTs aligned to the TPO gene
Third Party Annotation (TPA) Database
NCBI now accepts the submission of new annotations of existing GenBank sequences.
• Submissions must be published in a peer-reviewed journal.
• Facilitates the annotation of sequences by experts.
Examples of sequences appropriate for TPA are:
Annotation of features on gene and/or mRNA sequences
Assembled “full length” genes and/or mRNAs
What should not be submitted to TPA?
Synthetic constructs (such as cloning vectors) that use well-characterized,
publicly available genes, promoters, or terminators
Updates or changes to existing sequence data
Sequence annotations without experimental evidence
If your organism does not have RefSeqs…
• UniGene : gene-based clusters of cDNAs and ESTs
• WGS sequences in Entrez Nucleotide (wgs[prop])
• Trace Archive
Beyond RefSeq
A gene-oriented view of sequence entries
•MegaBlast based automated sequence clustering
•Now informed by genome hits New!
•Nonredundant set of gene oriented clusters
•Each cluster a unique gene
•Information on tissue types and map locations
•Includes known genes and uncharacterized ESTs
•Useful for gene discovery and selection of
mapping reagents
What is UniGene?
Top Ten
1. Human
2. Rice
3. Mouse
4. Cow
5. Wheat
6. Zebrafish
7. Pig
8. Chicken
9. Frog (X. laevis)
10. Frog (X. tropicalis)
Organisms in UniGene
by link
by Entrez search
Finding UniGene Clusters
UniGene Cluster for TPO
GPL: Platform descriptions (submitted by manufacturer*)
GSM: Raw/processed spot intensities from a single slide/chip (submitted by experimentalists)
GSE: Grouping of slide/chip data, “a single experiment” (submitted by experimentalists)
GDS: Grouping of experiments (curated by NCBI)
Entrez GEO
Entrez GEO Datasets
Linking to GEO
GEO Datasets
Whole Genome Shotgun Projects
• Traditional GenBank Divisions
• 300 + projects
– Viruses
– Bacteria
– Environmental sequences
– Archaea
– 73 Eukaryotes featuring:
• Cow, Chicken, Rat, Mouse, Dog, Chimpanzee, Human
• Pufferfish (2), Zebrafish
• Honeybee, Anopheles, Fruit Flies (4), Silkworm
• Nematode (C. briggsae)
• Yeasts (9), Aspergillus (3)
• Rice
Trace Archive
Short-tailed opossum traces
All are RefSeq NC records in Entrez Genome
• Full chromosomal
sequences are provided
• Genes are annotated
• The annotation can be
shown graphically and
linked to sequence records
Viewing Simple Genomes
mutL
NCBI Map Viewer
• Map Viewer Home Page
– Shows all supported organisms
– Provides links to genomic BLAST
• Genome Overview Page
– Provides links to individual chromosomes
– Shows hits on a genome graphically
• Chromosome Viewing Page
– Allows interactive views of annotation details
– Provides numerous maps unique to each genome
Viewing Complex Genomes
Map Viewer Home Page
Search the maps
Genomic BLAST
Species-specific help!
Genome Overview Page
Map Summary
Add or remove maps
Master Map
with exploded content
Genes
UniGene
Contigs
Zooming
Controls
Ideogram
Chromosome Viewing Page
TPO’s contig!
Map Summary
Map content varies greatly by species!
• Sequence Maps
• Core assembly
• Annotation evidence
• Clones & Markers
• Polymorphisms
• Links & Features
• Genetic Maps
• Cytogenetic maps
• Linkage maps
• Radiation hybrid maps
Assembly
Contig
Component
Transcript
Gene
Map Content
View the Assembly near TPO
NT_033000
1255072
1563756
Assembly of Chr. 2
Assembly of Chromosome 2
Zooming
View of TPO
Links to Entrez Nucleotide
Links to Entrez Gene
Links to Tools and Data
Gap in assembly
Map content varies greatly by species!
• Sequence Maps
• Core assembly
• Annotation evidence
• Clones & Markers
• Polymorphisms
• Links & Features
• Genetic Maps
• Cytogenetic maps
• Linkage maps
• Radiation hybrid maps
Ab initio (model)
GenBank DNA
EST
UniGene
Gene
Map Content
GenBank records not used in
assembly
Aligned ESTs
Annotation Evidence
UniGene Clusters
Ab initio models
Homologs by protein BLAST
Entrez Homologene