Flow of genetic information
DNA --> RNA --> PROTEIN --> CONFORMATION --> BIOLOGICAL FUNCTION
Overview of molecular biology databases
- Sequence
DNA
Genbank (www.ncbi.nlm.nih.gov)
- BLAST
- Entrez
EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk)
- SRS : srs.ebi.ac.uk, www.sanger.ac.uk/srs6/
DDBJ (DNA Data Bank of Japan)
Protein
Swissprot (www.ebi.ac.uk)
NCBI
Protein classification databases
Prosite (expasy.hcuge.ch)
Pfam (www.sanger.ac.uk/Pfam)
InterPro (www.ebi.ac.uk/interpro)
Gene ontology
www.geneontology.org
- Structure
PDB
Protein Data Bank, www.rcsb.org/pdb/cgi/queryForm.cgi
(RCSB, Research Collaboratory for Structural Bioinformatics, rcsb.rutgers.edu)
X-ray crystallography
NMR
modeling
KLOTHO (small molecules, www.ibc.wustl.edu/moirai/klotho/compound_list.html)
- Genome
GDB (Human Genome Data Base, www.gdb.org)
Mouse genome database (www.informatics.jax.org)
Yeast genome (genome-ftp.stanford.edu/Saccharomyces)
Bacterial genomes (www.tigr.org)
- Human genome browsers
NCBI (www.ncbi.nlm.nih.gov)
UCSC (genome.ucsc.edu)
EBI (www.ensembl.org)
Celera (www.celera.com)
- Genetic disorders
OMIM (Online Mendelian Inheritance in Man, www.ncbi.nlm.nih.gov)
- Taxonomy (www.ncbi.nlm.nih.gov)
- Literature
PubMed (www.ncbi.nlm.nih.gov/Entrez)
Molecular biology databases
DNA sequence
Genome data
Protein sequence
Protein classification
Protein structure
Major bioinformatics sites / public sequence database administrators
Genbank (NCBI, NIH, US)
DDBJ (Japan)
EMBL (EBI, UK)
DNA sequence data :
EMBL - Genbank - DDBJ
EMBL and Genbank formats
EMBL format
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
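The two-letter line codes (ID, AC, OS, OC, ...) make an entry like this easy to read by program. The following is a minimal sketch, not an official parser: the function name `parse_flat_file` and the shortened sample text are illustrative assumptions, and only the header lines are handled.

```python
from collections import defaultdict

def parse_flat_file(text):
    """Group the lines of one EMBL-style entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code = line[:2]
        # Skip XX spacer lines, sequence lines (leading blanks) and the // terminator.
        if code == "XX" or not (code.isalpha() and code.isupper()):
            continue
        fields[code].append(line[2:].strip())
    return dict(fields)

# Shortened sample taken from the LISOD entry shown above.
SAMPLE = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
"""

entry = parse_flat_file(SAMPLE)
# Accession numbers are separated by semicolons on the AC line.
accessions = [a.strip() for a in entry["AC"][0].split(";") if a.strip()]
print(entry["ID"][0].split()[0], accessions)
```

Keeping every line code as a list preserves multi-line fields such as OC without any special cases.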
3.2.4 Feature key examples
Key             Description
conflict        Separate determinations of the "same" sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein-coding sequence
misc_RNA        Generic label for an undefined RNA
insertion_seq   Insertion element
D-loop          Mitochondrial or other D-loop structure
3.3.4 Qualifier examples
Key             Location/Qualifiers
CDS             86..742
                /product="hypoxanthine phosphoribosyltransferase"
                /label=hprt
                /note="hprt catalyzes vital steps in the
                reutilization pathway for purine biosynthesis
                and its deficiency leads to forms of ""gouty"" arthritis"
rep_origin      234..243
                /direction=left
CDS             109..564
                /usedin=X10009:catalase
3.5.3 Location examples
The following is a list of common location descriptors with their meanings:
Location                    Description
467                         Points to a single base in the presented sequence
340..565                    Points to a continuous range of bases bounded by and
                            including the starting and ending bases
<345..500                   Indicates that the exact lower boundary point of a
                            feature is unknown. The location begins at some base
                            previous to the first base specified (which need not
                            be contained in the presented sequence) and continues
                            to and includes the ending base
<1..888                     The feature starts before the first sequenced base and
                            continues to and includes base 888
(102.110)                   Indicates that the exact location is unknown but that
                            it is one of the bases between bases 102 and 110,
                            inclusive
(23.45)..600                Specifies that the starting point is one of the bases
                            between bases 23 and 45, inclusive, and the end point
                            is base 600
(122.133)..(204.221)        The feature starts at a base between 122 and 133,
                            inclusive, and ends at a base between 204 and 221,
                            inclusive
123^124                     Points to a site between bases 123 and 124
145^177                     Points to a site between two adjacent bases anywhere
                            between bases 145 and 177
complement(34..(122.126))   Start at one of the bases complementary to those
                            between 122 and 126 on the presented strand and finish
                            at the base complementary to base 34 (the feature is
                            on the strand complementary to the presented strand)
join("acct",449..670)       Concatenate the four bases 'acct' to the 5' end of the
                            sequence from bases 449 to 670, inclusive
J00193:hladr                Points to a feature whose location is described in
                            another entry: the feature labelled 'hladr' in the
                            entry (in this database) with primary accession
                            number 'J00193'
J00194:(100..202)           Points to bases 100 to 202, inclusive, in the entry (in
                            this database) with primary accession number 'J00194'
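As an illustration of how such descriptors can be read by program, here is a small sketch that handles only the simplest forms from the list above: a single base, a range, a range with an unknown lower boundary (`<`), and `complement(...)` around one of these. The function name `parse_simple_location` and the returned dictionary layout are assumptions made for this example; uncertain positions such as (102.110), `^` sites, `join` and references to other entries are deliberately rejected.

```python
import re

def parse_simple_location(loc):
    """Parse a small subset of feature-table locations into a dictionary."""
    strand = 1
    m = re.fullmatch(r"complement\((.+)\)", loc)
    if m:
        # The feature lies on the strand complementary to the presented one.
        strand, loc = -1, m.group(1)
    m = re.fullmatch(r"(<?)(\d+)(?:\.\.(\d+))?", loc)
    if m is None:
        raise ValueError(f"unsupported location: {loc!r}")
    start = int(m.group(2))
    # A single base such as "467" starts and ends at the same position.
    end = int(m.group(3)) if m.group(3) else start
    return {"start": start, "end": end, "strand": strand,
            "start_unknown": m.group(1) == "<"}

cds = parse_simple_location("109..717")
# Both end points are included, so the length is end - start + 1.
print(cds["end"] - cds["start"] + 1)
```

For the CDS of the LISOD entry (109..717) this gives 609 bases, that is 203 codons including the stop codon.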
Common sequence formats
1. EMBL release format
2. Genbank (ASN.1)
3. FASTA format :
>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGA
GGTCTTATCGGACGAGCGACT
...
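The FASTA layout above (a `>` header line followed by sequence lines) can be read with a few lines of code. This is a minimal sketch; the function name `parse_fasta` is an assumption for illustration, and the sample is the example record from this slide without its trailing "...".

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

SAMPLE = """\
>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGA
GGTCTTATCGGACGAGCGACT
"""

records = parse_fasta(SAMPLE)
header, sequence = records[0]
identifier = header.split()[0]  # the first word of the header is the identifier
print(identifier, len(sequence))
```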
EMBL divisions
Human
Mus musculus
Rodents
Other Mammals
Other Vertebrates
Invertebrates
Plants
Fungi
Prokaryotes (+ Archaea)
Organanelles
Viruses
Bacteriophages
Patented
Synthetic
EST
HTG
STS
GSS
EST (Expressed Sequence Tag)
Expressed Sequence Tags (ESTs) are partial mRNA sequences: they are
sequences of cDNA that has been reverse-transcribed from mRNA.
Short sequences (~500-1000 bases); each is the result of a single
sequencing experiment -> high frequency of errors
Applications:
Discovery of new genes
Mapping of various genomes
Identification of coding regions in genomic sequences.
EST libraries are used to answer questions like:
Which genes are expressed in a specific cell or tissue?
UniGene clusters
UniGene partitions GenBank sequences into a non-redundant
set of gene-oriented clusters. Each UniGene cluster contains sequences
that represent a unique gene. A majority of sequences are ESTs.
The mouse dataset contains 84,247 clusters with a total of 2,332,864 sequences.
[Figure labels: mRNA with 5' UTR, CDS and 3' UTR; public ESTs]
High-Throughput Genomic Sequences
The High Throughput Genomic (HTG) Sequences division was created to accommodate a growing need to
make 'unfinished' genomic sequence data rapidly available to the scientific community. It was done in a
coordinated effort between the three International Nucleotide Sequence databases: DDBJ, EMBL, and
GenBank. The HTG division contains 'unfinished' DNA sequences generated by the high-throughput
sequencing centers. Sequence data in this division are available for BLAST homology searches against either
the "htgs" database or the "month" database, which includes all new submissions for the prior month. The HTG
division of GenBank was described in a [Genome Research (1997) 7(10)] article by Ouellette and Boguski.
Location of HTG records:
Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and
deposited in the HTG division. A typical HTG record might consist of all the first pass sequence data generated
from a single cosmid, BAC, YAC, or P1 clone which together comprise more than 2 kb and contain one or
more gaps. A single accession number is assigned to this collection of sequences and each record includes a
clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data is "unfinished" and
may contain errors. The accession number does not change as sequence records are updated; only the most
recent version of a HTG record remains in GenBank. 'Finished' HTG sequences (phase 3) retain the same
accession number, but are moved into the relevant primary GenBank division. An example of a submission
(one accession number) that has progressed through phase 1, phase 2, and phase 3 is available
Genome Survey Sequence (GSS)
This division is similar in nature to the EST division, except that its sequences will be genomic rather than cDNA
(mRNA). The GSS division will contain (but not be limited to) the following types of data:
- random "single pass read" genome survey sequences
- single pass reads from cosmid/BAC/YAC ends
- exon trapped genomic sequences
- Alu PCR sequences
STS (Sequence Tagged Sites)
Sequence Tagged Sites (STS) are short DNA segments with a single location in the genome. This feature of STS
makes them useful tags for mapping.
Genome sequencing: published complete microbial genomes
Genome                                Strain                                      Domain  Size (Mb)  Institution                                         Year
Haemophilus influenzae Rd             KW20                                        B       1.83       TIGR                                                1995
Mycoplasma genitalium                 G-37                                        B       0.58       TIGR                                                1995
Methanococcus jannaschii              DSM 2661                                    A       1.66       TIGR                                                1996
Mycoplasma pneumoniae                 M129                                        B       0.81       Univ. of Heidelberg                                 1996
Synechocystis sp.                     PCC 6803                                    B       3.57       Kazusa DNA Research Inst.                           1996
Archaeoglobus fulgidus                DSM4304                                     A       2.18       TIGR                                                1997
Bacillus subtilis                     168                                         B       4.2        International Consortium                            1997
Borrelia burgdorferi                  B31                                         B       1.44       TIGR                                                1997 /
Escherichia coli                      K-12 Strain MG1655                          B       4.6        University of Wisconsin                             1997
Helicobacter pylori                   26695                                       B       1.66       TIGR                                                1997
Methanobacterium thermoautotrophicum  delta H                                     A       1.75       -                                                   1997
Saccharomyces cerevisiae              S288C                                       E       13         International Consortium                            1996/1997
Aquifex aeolicus                      VF5                                         B       1.5        Diversa                                             1998
Chlamydia trachomatis                 serovar D (D/UW3/Cx)                        B       1.05       UC Berkeley Stanford                                1998
Mycobacterium tuberculosis            H37Rv (lab strain)                          B       4.4        Sanger Centre                                       1998
Pyrococcus horikoshii                 OT3                                         A       1.8        Biotechnology Center                                1998
Rickettsia prowazekii                 Madrid E                                    B       1.1        University of Uppsala                               1998
Treponema pallidum                    Nichols                                     B       1.14       TIGR                                                1998
Aeropyrum pernix                      K1                                          A       1.67       Biotechnology Center                                1999
Chlamydia pneumoniae                  CWL029                                      B       1.23       UC Berkeley Stanford                                1999
Deinococcus radiodurans               R1                                          B       3.28       TIGR                                                1999
Helicobacter pylori                   J99                                         B       1.64       Astra Research Center Boston / Genome Therapeutics  1999
Thermotoga maritima                   MSB8                                        B       1.8        TIGR                                                1999
Bacillus halodurans                   C-125                                       B       4.2        Japan Marine Science and Technology Center          2000
Buchnera sp.                          APS                                         B       0.64       Univ. Tokyo / RIKEN                                 2000
Campylobacter jejuni                  NCTC 11168                                  B       1.64       Sanger Centre                                       2000
Chlamydia pneumoniae                  AR39                                        B       1.23       TIGR                                                2000
Chlamydia trachomatis                 MoPn                                        B       1.07       TIGR                                                2000
Halobacterium sp.                     NRC-1                                       A       2.57       Halobacterium genome consortium                     2000
Neisseria meningitidis                MC58                                        B       2.27       TIGR                                                2000
Neisseria meningitidis                serogroup A strain Z2491                    B       2.18       Sanger Centre                                       2000
Pseudomonas aeruginosa                PAO1                                        B       6.3        University of Washington                            2000
Thermoplasma acidophilum              -                                           A       1.56       Max-Planck-Institute for Biochemistry               2000
Thermoplasma volcanium                GSS1                                        A       1.58       AIST                                                2000
Ureaplasma urealyticum                serovar 3                                   B       0.75       Applied Biosystems /                                2000
Vibrio cholerae                       serotype O1, Biotype El Tor, strain N16961  B       4          TIGR                                                2000
Xylella fastidiosa                    9a5c                                        B       2.68       ONSA Consortium                                     2000
Escherichia coli                      O157:H7 strain EDL933                       B       4.1        University of Wisconsin                             2001
Domain: A = Archaea, B = Bacteria, E = Eukaryota. "-" marks a cell that is missing in the source table.
Nucleotide sequence database statistics - distribution among organisms
Comparison of fully sequenced genomes
Organism                  Size (Mb)    Genes
Bacteria                  0.6 - 7.5    500-7,000
S. cerevisiae             12           6,000
S. pombe                  13           6,000
Caenorhabditis elegans    97           20,000
Drosophila melanogaster   120          14,000
Arabidopsis thaliana      110          26,000
Fugu rubripes             365          ~38,000?
Mus musculus              ~3000        >40,000?
H. sapiens                3200         >40,000?
Sites for exploring fully sequenced genomes of man, mouse
and other higher eukaryotes.
NCBI
www.ncbi.nlm.nih.gov
UCSC
genome.ucsc.edu
EBI
www.ensembl.org
Celera
www.celera.com
Genome MOT, Genome monitoring table
http://www.ebi.ac.uk/genomes/mot/index.html
March 2003:
Organism       % Finished    % Finished+Draft
Drosophila     100
C. elegans     100
A. thaliana    100
H. sapiens     118           183
Danio rerio    5             23
Mouse          110           181
Rat            0.5           169
Taxonomy database
www3.ncbi.nlm.nih.gov/Taxonomy/tax.html
This is the top level of the taxonomy database maintained by
NCBI/GenBank. You can explore any of the taxa listed below by
clicking it.
Archaea
Eubacteria
Eukaryotae
Viroids
Viruses
Other
Unclassified
Most entries in protein sequence databases are
computational translations from gene sequences
DNA -> RNA -> protein -> conformation
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
The flow of genetic information
DNA -> RNA -> protein -> conformation
Translation products of DNA - Amino acids in three letter code
ValArgIleArgIleSerAsp
TyrGlyPheGlyPheArgMet
ThrAspSerAspPheGlyCys
5' GUACGGAUUCGGAUUUCGGAUGC 3'
3' CAUGCCUAAGCCUAAAGCCUACG 5'
TyrProAsnProAsnArgIle
ValSerGluSerLysProHis
ArgIleArgIleGluSerAla
Amino acids in one letter code
V R I R I S D
Y G F G F R M
T D S D F G C
5' GUACGGAUUCGGAUUUCGGAUGC 3'
3' CAUGCCUAAGCCUAAAGCCUACG 5'
Y P N P N R I
V S E S K P H
R I R I E S A
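The six reading frames shown above can be reproduced with a short script. This is a sketch under stated assumptions: the helper names (`translate`, `six_frames`) are illustrative, the standard genetic code is packed into one string in U, C, A, G order, and the reverse-strand frames are returned in 5' to 3' reading order, so they appear mirrored relative to the slide, which prints them underneath the 3' to 5' strand.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(rna):
    """Complement every base, then reverse to read the other strand 5' to 3'."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))[::-1]

def translate(rna):
    """Translate complete codons only; trailing one or two bases are ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))

def six_frames(rna):
    rev = reverse_complement(rna)
    return {
        "forward": [translate(rna[offset:]) for offset in range(3)],
        "reverse": [translate(rev[offset:]) for offset in range(3)],
    }

frames = six_frames("GUACGGAUUCGGAUUUCGGAUGC")
for strand, peptides in frames.items():
    print(strand, peptides)
```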
Three- and one-letter codes of the amino acids.
Alanine          Ala   A
Arginine         Arg   R
Asparagine       Asn   N
Aspartate        Asp   D
Cysteine         Cys   C
Glutamate        Glu   E
Glutamine        Gln   Q
Glycine          Gly   G
Histidine        His   H
Isoleucine       Ile   I
Leucine          Leu   L
Lysine           Lys   K
Methionine       Met   M
Phenylalanine    Phe   F
Proline          Pro   P
Serine           Ser   S
Threonine        Thr   T
Tryptophan       Trp   W
Tyrosine         Tyr   Y
Valine           Val   V
4. THE GENETIC CODE
UUU Phe    UCU Ser    UAU Tyr     UGU Cys
UUC Phe    UCC Ser    UAC Tyr     UGC Cys
UUA Leu    UCA Ser    UAA Stop    UGA Stop
UUG Leu    UCG Ser    UAG Stop    UGG Trp

CUU Leu    CCU Pro    CAU His     CGU Arg
CUC Leu    CCC Pro    CAC His     CGC Arg
CUA Leu    CCA Pro    CAA Gln     CGA Arg
CUG Leu    CCG Pro    CAG Gln     CGG Arg

AUU Ile    ACU Thr    AAU Asn     AGU Ser
AUC Ile    ACC Thr    AAC Asn     AGC Ser
AUA Ile    ACA Thr    AAA Lys     AGA Arg
AUG Met    ACG Thr    AAG Lys     AGG Arg

GUU Val    GCU Ala    GAU Asp     GGU Gly
GUC Val    GCC Ala    GAC Asp     GGC Gly
GUA Val    GCA Ala    GAA Glu     GGA Gly
GUG Val    GCG Ala    GAG Glu     GGG Gly
Table I. The genetic code
Deviations from the standard genetic code
# Ciliate protozoa
UAA = Gln:Q
UAG = Gln:Q
# Yeast mitochondria
UGA = Trp:W
CUU = Thr:T
CUC = Thr:T
CUA = Thr:T
CUG = Thr:T
AUA = Met:M
# Mammalian mitochondria
UGA = Trp:W
AUU = Ile:I
AUC = Ile:I
AUA = Met:M
AGA = * :*
AGG = * :*
# Drosophila mitochondria
UGA = Trp:W
AUU = Ile:I
AUA = Met:M
AGA = Ser:S
AGG = Ser:S
# Mycoplasma
UGA = Trp
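One way to work with these deviations in code is to store only the codons that differ and overlay them on the standard table. The sketch below assumes this design; the dictionary keys and the names `code_for` and `translate` are illustrative. The AUU and AUC entries listed above are left out because they are identical to the standard code.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Only the codons that differ from the standard code, as listed above.
DEVIATIONS = {
    "ciliate protozoa": {"UAA": "Q", "UAG": "Q"},
    "yeast mitochondria": {"UGA": "W", "CUU": "T", "CUC": "T",
                           "CUA": "T", "CUG": "T", "AUA": "M"},
    "mammalian mitochondria": {"UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"},
    "drosophila mitochondria": {"UGA": "W", "AUA": "M", "AGA": "S", "AGG": "S"},
    "mycoplasma": {"UGA": "W"},
}

def code_for(name):
    """Return the full codon table for one of the variant codes."""
    return {**STANDARD, **DEVIATIONS[name]}

def translate(rna, table=STANDARD):
    return "".join(table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))

message = "AUGUGAAUA"
print(translate(message))                                      # standard code
print(translate(message, code_for("mammalian mitochondria")))  # variant code
```

The same nine bases therefore give a stop after the first residue under the standard code, but a three-residue peptide in mammalian mitochondria.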
Sequence symbols: Nucleotides
Symbol   Meaning                  Complement
A        A                        T
C        C                        G
G        G                        C
T/U      T                        A
M        A or C                   K
R        A or G                   Y
W        A or T                   W
S        C or G                   S
Y        C or T                   R
K        G or T                   M
V        A or C or G              B
H        A or C or T              D
D        A or G or T              H
B        C or G or T              V
X/N      G or A or T or C         X
.        not G or A or T or C     .
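The complement column above is all that is needed to reverse-complement a sequence containing ambiguity symbols. A minimal sketch follows; the function name is an assumption, and mapping N to N (rather than to X) is a choice made here for symmetry.

```python
# Complement of every symbol in the table above; U is treated like T.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V",
    "N": "N", "X": "X", ".": ".",
}

def reverse_complement(seq):
    """Complement each symbol and reverse the order (result reads 5' to 3')."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("AAGC"))
print(reverse_complement("ACGTMRWSYKVHDBN"))
```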
‘Reverse’ translation
Protein                 A - G - K - M
DNA - most ambiguous    GCN GGN AAR ATG
DNA - most likely       GCU GGU AAA ATG
Codon usage for enteric bacterial (highly expressed) genes 7/19/83
AmAcid   Codon    Number    /1000   Fraction
Gly      GGG       13.00     1.89       0.02
Gly      GGA        3.00     0.44       0.00
Gly      GGU      365.00    52.99       0.59
Gly      GGC      238.00    34.55       0.38

Glu      GAG      108.00    15.68       0.22
Glu      GAA      394.00    57.20       0.78
Asp      GAU      149.00    21.63       0.33
Asp      GAC      298.00    43.26       0.67

Val      GUG       93.00    13.50       0.16
Val      GUA      146.00    21.20       0.26
Val      GUU      289.00    41.96       0.51
Val      GUC       38.00     5.52       0.07

Ala      GCG      161.00    23.37       0.26
Ala      GCA      173.00    25.12       0.28
Ala      GCU      212.00    30.78       0.35
Ala      GCC       62.00     9.00       0.10

Arg      AGG        1.00     0.15       0.00
Arg      AGA        0.00     0.00       0.00
Ser      AGU        9.00     1.31       0.03
Ser      AGC       71.00    10.31       0.20

Lys      AAG      111.00    16.11       0.26
Lys      AAA      320.00    46.46       0.74
Asn      AAU       19.00     2.76       0.06
Asn      AAC      274.00    39.78       0.94

Met      AUG      170.00    24.68       1.00
Ile      AUA        1.00     0.15       0.00
Ile      AUU       70.00    10.16       0.17
Ile      AUC      345.00    50.09       0.83

Thr      ACG       25.00     3.63       0.07
Thr      ACA       14.00     2.03       0.04
Thr      ACU      130.00    18.87       0.35
Thr      ACC      206.00    29.91       0.55
..
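The "most likely" reverse translation can be derived from such a table by picking, for each amino acid, the codon with the highest fraction. The sketch below copies the fractions from the excerpt above for the amino acids whose codons are fully listed there; Arg and Ser are omitted because the excerpt shows only part of their codons. The names `CODON_FRACTION` and `most_likely_codons` are illustrative.

```python
# Fractions copied from the codon usage excerpt above (enteric bacteria,
# highly expressed genes). Amino acids are keyed by their one-letter code.
CODON_FRACTION = {
    "G": {"GGG": 0.02, "GGA": 0.00, "GGU": 0.59, "GGC": 0.38},
    "E": {"GAG": 0.22, "GAA": 0.78},
    "D": {"GAU": 0.33, "GAC": 0.67},
    "V": {"GUG": 0.16, "GUA": 0.26, "GUU": 0.51, "GUC": 0.07},
    "A": {"GCG": 0.26, "GCA": 0.28, "GCU": 0.35, "GCC": 0.10},
    "K": {"AAG": 0.26, "AAA": 0.74},
    "N": {"AAU": 0.06, "AAC": 0.94},
    "M": {"AUG": 1.00},
    "I": {"AUA": 0.00, "AUU": 0.17, "AUC": 0.83},
    "T": {"ACG": 0.07, "ACA": 0.04, "ACU": 0.35, "ACC": 0.55},
}

def most_likely_codons(protein):
    """Pick the most frequently used codon for every residue.

    Raises KeyError for residues that are not part of the excerpt.
    """
    return [max(CODON_FRACTION[aa], key=CODON_FRACTION[aa].get) for aa in protein]

codons = most_likely_codons("AGKM")
print(" ".join(codons))
```

For the peptide A-G-K-M this selects GCU, GGU, AAA and AUG, written here in the RNA alphabet used by the codon usage table.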
Protein sequence databases - Content
The SWISS-PROT Protein Sequence Data Bank is a database of protein
sequences produced collaboratively by Amos Bairoch (University of
Geneva) and the EBI. It contains high-quality annotation, is non-redundant,
and is cross-referenced to many other databases.
Release 40.44 of SWISS-PROT contains 122,214 sequence entries
comprising 44,864,044 amino acids.
SWISS-PROT is accompanied by TrEMBL, a computer-annotated
supplement to SWISS-PROT. TrEMBL contains the translations of all
coding sequences (CDS) present in the EMBL Nucleotide Sequence
Database not yet integrated into SWISS-PROT.
TrEMBL (March 2003) contains 725,373 sequence entries
NCBI protein database: 1,335,897 sequences
Growth of Swissprot protein sequence database
Swissprot and relation to other databases
RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR DATABASES
[Diagram: the SWISS-PROT Protein Sequence Data Bank in the centre, linked by arrows to the following databases]
EMBL Nucleotide Sequence Database [EBI]
FlyBase
SubtiList [B.subtilis]
Mendel [Plant]
MaizeDb [Zea mays]
WormPep [C.elegans]
REBASE [Restriction enzymes]
StyGene [S.Typhimurium]
TRANSFAC
Harefield [2D]
PROSITE [Patterns and profiles]
PDB [3D structures]
MGD [Mouse]
GCRDb [7TM recep.]
EcoGene [E.coli]
SGD [Yeast]
DictyDB [D.disco.]
ENZYME [Nomencl.]
OMIM [Human]
ECO2DBASE [2D]
Maize-2DPAGE [2D]
SWISS-2DPAGE [2D]
Aarhus/Ghent [2D]
YEPD [Yeast] [2D]
HSSP [3D similar.]
Example of Swissprot entry
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
GN   PRNP.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86300093.
RA   KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H.,
RA   PRUSINER S.B., DEARMOND S.J.;
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE; 86261778.
RA   LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.;
RL   SCIENCE 233:364-367(1986).
RN   [3]
RP   VARIANT AMYLOID GSS, SEQUENCE OF 58-85 AND 111-150.
RX   MEDLINE; 91160504.
RA   TAGLIAVINI F., PRELLI F., GHISO J., BUGIANI O., SERBAN D.,
RA   PRUSINER S.B., FARLOW M.R., GHETTI B., FRANGIONE B.;
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   REVIEW ON VARIANTS.
RX   MEDLINE; 93372867.
RA   PALMER M.S., COLLINGE J.;
RL   HUM. MUTAT. 2:168-173(1993).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE:
CC       CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME
CC       (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE
CC       IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN
CC       CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM
CC       ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY
CC       (EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES ILLUSTRATE
CC       THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE,
CC       EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
CC   -!- DISEASE: CJD OCCURS PRIMARILY AS A SPORADIC DISORDER (1 PER
CC       MILLION), WHILE 10-15% ARE FAMILIAL. ACCIDENTAL TRANSMISSION OF
CC       CJD TO HUMANS APPEARS TO BE IATROGENIC (CONTAMINATED HUMAN GROWTH
CC       HORMONE (HGH), CORNEAL TRANSPLANTATION, ELECTROENCEPHALOGRAPHIC
CC       ELECTRODE IMPLANTATION. . .). EPIDEMIOLOGIC STUDIES HAVE FAILED TO
CC       IMPLICATE THE INGESTION OF INFECTED ANNIMAL MEAT IN THE
CC       PATHOGENESIS OF CJD IN HUMAN. THE TRIAD OF MICROSCOPIC FEATURES
CC       THAT CHARACTERIZE THE PRION DISEASES CONSISTS OF (1) SPONGIFORM
CC       DEGENERATION OF NEURONS, (2) SEVERE ASTROCYTIC GLIOSIS THAT OFTEN
CC       APPEARS TO BE OUT OF PROPORTION TO THE DEGREE OF NERF CELL LOSS,
CC       AND (3) AMYLOID PLAQUE FORMATION. CJD IS CHARACTERIZED BY
CC       PROGRESSIVE DEMENTIA AND MYOCLONIC SEIZURES, AFFECTING ADULTS IN
CC       MID-LIFE. SOME PATIENTS PRESENT SLEEP DISORDERS, ABNORMALITIES OF
CC       HIGH CORTICAL FUNCTION, CEREBELLAR AND CORTICOSPINAL DISTURBANCES.
CC       THE DISEASE ENDS IN DEATH AFTER A 3-12 MONTHS ILLNESS.
CC   -!- DISEASE: GSS IS A HETEROGENEOUS DISORDER AND WAS DEFINED AS A
CC       "SPINOCEREBELLAR ATAXIA WITH DEMENTIA AND PLAQUELIKE DEPOSITS".
CC       GSS INCIDENCE IS LESS THAN 2 PER 100 MILLION.
CC   -!- DISEASE: KURU IS TRANSMITTED DURING RITUALISTIC CANNIBALISM, AMONG
CC       NATIVES OF THE NEW GUINEA HIGHLANDS. PATIENTS EXHIBIT VARIOUS
CC       MOVEMENT DISORDERS LIKE CEREBELLAR ABNORMALITIES, RIGIDITY OF THE
CC       LIMBS, AND CLONUS. EMOTIONNAL LABILITY IS PRESENT, AND DEMENTIA IS
CC       CONSPICUOUSLY ABSENT. DEATH USUALLY OCCURS FROM 3 TO 12 MONTH
CC       AFTER ONSET.
CC   -!- SIMILARITY: TO OTHER PRP.
CC   -!- DATABASE: NAME=HotMolecBase; NOTE=PrP entry;
CC       WWW="http://bioinformatics.weizmann.ac.il/hotmolecbase/entries/prp.htm".
FT   SIGNAL        1     22
FT   CHAIN        23    230       MAJOR PRION PROTEIN.
FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD    181    181       PROBABLE.
FT   CARBOHYD    197    197       PROBABLE.
FT   DISULFID    179    214       BY SIMILARITY.
FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
FT                                Q.
FT   REPEAT       51     59       1.
FT   REPEAT       60     67       2.
FT   REPEAT       68     75       3.
FT   REPEAT       76     83       4.
FT   REPEAT       84     91       5.
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     105    105       P -> L (IN GSS).
FT   VARIANT     117    117       A -> V (LINKED TO DEVELOPMENT OF
FT                                DEMENTING GSS).
FT   VARIANT     129    129       M -> V (DETERMINES THE DISEASE PHENOTYPE
FT                                IN PATIENTS WHO HAVE A PRP MUTATION AT
FT                                CODON 178: PATIENTS WITH MET DEVELOP FFI,
FT                                THOSE WITH VAL DEVELOP CJD).
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT   VARIANT     180    180       V -> I (IN CJD).
FT   VARIANT     198    198       F -> S (IN A ATYPICAL FORM OF GSS WITH
FT                                NEUROFIBRILLARY TANGLES).
FT   VARIANT     200    200       E -> K (IN CJD).
FT   VARIANT     210    210       V -> I (IN CJD).
FT   VARIANT     217    217       Q -> R (IN GSS WITH NEUROFIBRILLARY
FT                                TANGLES).
FT   VARIANT     232    232       M -> R (IN CJD).
FT   CONFLICT    118    118       MISSING (IN REF. 2).
SQ   SEQUENCE   253 AA;  27661 MW;  FD5373AD CRC32;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
     HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
     VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
     NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
     ILLISFLIFL IVG
//
Protein classification databases
PROSITE
Pfam
InterPro
Prosite: Patterns are identified from multiple alignments of
protein sequences
PROSITE
Release 16.45 : 1483 patterns
Example 1
ID   ATP_GTP_A; PATTERN.
AC   PS00017;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); NOV-1990 (INFO UPDATE).
DE   ATP/GTP-binding site motif A (P-loop).
PA   [AG]-x(4)-G-K-[ST].
CC   /TAXO-RANGE=ABEPV;
3D   1EFM; 1ETU; 1Q21; 2Q21; 4Q21; 5Q21; 6Q21;
DO   PDOC00017;
Example 2
ID   ZINC_FINGER_C2H2; PATTERN.
AC   PS00028;
DT   APR-1990 (CREATED); JUN-1994 (DATA UPDATE); NOV-1997 (INFO UPDATE).
DE   Zinc finger, C2H2 type, domain.
PA   C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
NR   /RELEASE=35,69113;
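The PA lines above use a compact syntax that maps almost directly onto regular expressions: `x` is any residue, `[...]` lists allowed residues, and `(n)` or `(n,m)` gives a repeat count. The sketch below converts that subset; it also handles `{...}` (residues that are not allowed), which is part of PROSITE syntax but does not occur in these two examples. The function name and the test peptide are made up for illustration, and anchors such as `<` and `>` are not supported.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, plus optional (n) or (n,m).
_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any amino acid except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
hit = re.search(p_loop, "MTGAGESGKSTL")  # hypothetical peptide containing a P-loop
print(p_loop, zinc_finger, hit.start())
```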
Pfam
www.sanger.ac.uk/Pfam/
Pfam is a database of multiple alignments of
protein domains or conserved protein
regions. These alignments are expected to
represent evolutionarily conserved structures
that have implications for the protein's function.
Version 6.6, August 2001, 3071 families
Over 65% of the proteins in SWISS-PROT
38 and TrEMBL 11 have at least one match
to a Pfam family.
72% of protein sequences have at least one match to Pfam.
Applications of gene ontology databases:
1. Cases where protein sequence database hits are non-informative
Query= FWR602467643.F1 664 23 640 ABI cut from 23 to 663. Remaining:
640 bases.
(640 letters)
Database: /pubdata/ncbi/nr
771,594 sequences; 245,249,561 total letters
Searching..................................................done
                                                                    Score     E
Sequences producing significant alignments:                         (bits)  Value
dbj|BAB23278.1| (AK004371) putative [Mus musculus]                    369   e-101
gb|AAH08101.1|AAH08101 (BC008101) Similar to hypothetical protei...   174   7e-43
2. You want to answer questions like:
Which proteins are linked to a specific biological process,
like glycolysis ?
Gene ontology consortium
Major principles
•Molecular function
•Biological process
•Cellular component
Gene ontology
www.geneontology.org
Extract of gene association table:
SP   O00115   DRN2_HUMAN   GO:0003677   F   Deoxyribonuclease II precursor
SP   O00115   DRN2_HUMAN   GO:0004519   F   Deoxyribonuclease II precursor
SP   O00115   DRN2_HUMAN   GO:0004531   F   Deoxyribonuclease II precursor
SP   O00115   DRN2_HUMAN   GO:0005764   C   Deoxyribonuclease II precursor
SP   O00116   ADAS_HUMAN   GO:0005777   C   Alkyldihydroxyacetonephosphate..
SP   O00116   ADAS_HUMAN   GO:0005777   C   Alkyldihydroxyacetonephosphate..
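A question such as "which proteins are linked to a given GO term?" becomes a simple lookup once the association rows are indexed by GO identifier. The sketch below uses the six rows of the extract in a simplified, whitespace-separated layout; real association files contain more columns, and the function name `index_by_go_term` is an assumption for illustration.

```python
from collections import defaultdict

# The six rows of the extract above: database, accession, entry name,
# GO identifier, aspect (F = molecular function, C = cellular component), description.
ROWS = """\
SP O00115 DRN2_HUMAN GO:0003677 F Deoxyribonuclease II precursor
SP O00115 DRN2_HUMAN GO:0004519 F Deoxyribonuclease II precursor
SP O00115 DRN2_HUMAN GO:0004531 F Deoxyribonuclease II precursor
SP O00115 DRN2_HUMAN GO:0005764 C Deoxyribonuclease II precursor
SP O00116 ADAS_HUMAN GO:0005777 C Alkyldihydroxyacetonephosphate..
SP O00116 ADAS_HUMAN GO:0005777 C Alkyldihydroxyacetonephosphate..
"""

def index_by_go_term(rows):
    """Map every GO identifier to the set of protein entry names annotated with it."""
    index = defaultdict(set)
    for line in rows.splitlines():
        if not line.strip():
            continue
        # The description may contain spaces, so split off only the first five fields.
        database, accession, name, go_id, aspect, description = line.split(None, 5)
        index[go_id].add(name)
    return index

go_index = index_by_go_term(ROWS)
print(sorted(go_index["GO:0005777"]))
```

Using a set per term removes the duplicated ADAS_HUMAN row automatically.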
Databases of protein/nucleic acid structure
HIV protease
Secondary structure elements of proteins: alpha helix
Secondary structure elements of proteins: beta sheet
Schematic pictures of proteins
highlight secondary structure
Determination of protein structure
* X ray crystallography
* NMR
Example of PDB entry
HEADER    HORMONE                                 30-OCT-92   1BPH      1BPH   2
COMPND    INSULIN (CUBIC) IN 0.1M SODIUM SALT SOLUTION AT PH9           1BPH   3
SOURCE    BOVINE (BOS $TAURUS) PANCREAS                                 1BPH   4
AUTHOR    O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                           1BPH   5
REVDAT   2   31-OCT-93 1BPHA   1       REMARK HET    FORMUL             1BPHA  1
REVDAT   1   15-JAN-93 1BPH    0                                        1BPH   6
JRNL        AUTH   O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                  1BPH   7
JRNL        TITL   CONFORMATIONAL CHANGES IN CUBIC INSULIN CRYSTALS     1BPH   8
JRNL        TITL 2 IN THE PH RANGE 7-11                                 1BPH   9
JRNL        REF    BIOPHYS.J.                    V.  63  1210 1992      1BPH  10
JRNL        REFN   ASTM BIOJAU  US ISSN 0006-3495                 030   1BPH  11
REMARK   1                                                              1BPH  12
REMARK   1 REFERENCE 1
ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87      1BPH 129
ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67      1BPH 130
ATOM      3  C   GLY A   1      15.574  45.507  31.085  1.00 31.18      1BPH 131
ATOM      4  O   GLY A   1      16.078  45.660  32.217  1.00 22.60      1BPH 132
ATOM      5  N   ILE A   2      16.088  44.766  30.126  1.00 28.39      1BPH 133
ATOM      6  CA  ILE A   2      17.342  44.034  30.404  1.00 23.76      1BPH 134
ATOM      7  C   ILE A   2      18.526  44.939  30.686  1.00 25.29      1BPH 135
ATOM      8  O   ILE A   2      19.425  44.457  31.392  1.00 18.74      1BPH 136
ATOM      9  CB  ILE A   2      17.571  43.072  29.158  1.00 27.36      1BPH 137
ATOM     10  CG1 ILE A   2      18.638  42.049  29.605  1.00 18.03      1BPH 138
ATOM     11  CG2 ILE A   2      17.859  43.936  27.903  1.00 25.54      1BPH 139
ATOM     12  CD1 ILE A   2      18.914  40.930  28.590  1.00 17.07      1BPH 140
ATOM     13  N   VAL A   3      18.619  46.195  30.192  1.00 24.42      1BPH 141
ATOM     14  CA  VAL A   3      19.774  47.080  30.436  1.00 30.26      1BPH 142
ATOM     15  C   VAL A   3      19.952  47.453  31.895  1.00 19.08      1BPH 143
ATOM     16  O   VAL A   3      21.018  47.421  32.561  1.00 28.15      1BPH 144
ATOM     17  CB  VAL A   3      19.719  48.274  29.462  1.00 33.87      1BPH 145
ATOM     18  CG1 VAL A   3      20.847  49.225  29.754  1.00 30.40      1BPH 146
ATOM     19  CG2 VAL A   3      19.868  47.724  28.044  1.00 24.51
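ATOM records are fixed-column text, so they are read by slicing character positions rather than by splitting on spaces. The sketch below assumes the usual PDB column layout (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54, occupancy 55-60, B-factor 61-66). The helpers `format_atom` and `parse_atom` are illustrative names, and the sample record is built with `format_atom` from the values of the first atom of entry 1BPH so that the column positions are exact.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occupancy, b_factor):
    """Build a fixed-column ATOM record (atom names of up to three characters)."""
    return (f"ATOM  {serial:5d}  {name:<3} {res_name:>3} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}")

def parse_atom(line):
    """Slice one ATOM record into its fields using fixed column positions."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

# First atom of the 1BPH entry shown above.
record = format_atom(1, "N", "GLY", "A", 1, 13.994, 47.196, 31.798, 1.00, 35.87)
atom = parse_atom(record)
print(record)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```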
3D viewers
Several programs are available for viewing protein and nucleic acid 3D structures:
Rasmol             www.umass.edu/microbio/rasmol/
Weblab             www.msi.com
Kinemage           www.cryst.bbk.ac.uk/PPS/vsns-pps/technology/kinemage.html
Chime              www.umass.edu/microbio/rasmol/
Protein explorer   www.umass.edu/microbio/chime/explorer/
Cn3D               www.ncbi.nlm.nih.gov/Entrez
SwissPDB viewer    expasy.proteome.org.au/spdbv/
(Molscript         www.avatar.se/molscript/)