Data and databases
What are we looking for?
Data & databases
©CMBI 2001
Your questions
Lookup
Compare
Predict
©CMBI 2002
Your questions
Lookup
• Is the gene known for my protein (or vice versa)?
• On which chromosome is the gene located?
• What sequence patterns are present in my protein?
• Are the mutations known which cause this disease?
• To what class or family does my protein belong?
• What is known about this family?
Your questions
Compare
• Are there protein sequences in the database which
resemble the protein I cloned?
• How can I optimally align the members of this protein
family?
• Are these two proteins similar?
Sequence similarity
Imagine you sequenced this human protein.
MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
You know it is a serine protease.
Which residues belong to the active site?
Is its sequence similar to the mouse serine protease?
Alignment
MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ
*::* .**** **. :. :   *:**:*** : .** * *.*  *********: ****** *::

WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK
******* ** *:******** *.***:**** ***.*::**  *********: **.**.****

VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW
**:*** ***  ******: * ********:* ******:*** ****:*::** *:*.***:**

GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG
********** ********** ******:*.  ******** . ***.****** **********

DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG----
********** *. ***:*** *******: : ***** ** * *****::*** ******
=> Transfer of information
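How similar the two aligned sequences are can be put into a number. The sketch below counts identical columns in the first block of the alignment (human on top, mouse below). The function name `percent_identity` and the choice to skip gapped columns are assumptions made for illustration, not part of the original slides.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns without a gap character in either sequence.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# First alignment block from the slide, written without the spaces.
human = "MVVSGAPPALGGGCLGTFTSLLLLASTAILNAARIPVPPACGKPQQLNRVVGGEDSTDSE"
mouse = "MMISRPPPALGGDQFSILILLVLLTSTAPISAATIRVSPDCGKPQQLNRIVGGEDSMDAQ"

print(f"{percent_identity(human, mouse):.1f}% identical")
```

Identical columns correspond to the `*` symbols in the consensus line; the `:` and `.` symbols mark similar but not identical residues and are not counted here.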
Are these structures similar?
©CMBI 2000 J Leunissen
Your questions
Predict
• Can I predict the active site residues of this enzyme?
• Why are these patients ill?
• Can I make a 3D model for my protein?
• Can I predict a (better) drug for this target?
• How can I improve the thermostability of this protein?
(protein engineering)
• How can I predict the genes located on this genome?
How to find the answers to these questions?
Outline
Morning
• Data in databases
Afternoon
• Programs (tools) to search these databases
• Knowledge of how to search the databases with these
tools (hands-on)
Biological Databases
The number of databases
- DBCAT currently lists over 500 databases
The size of databases
- Grows exponentially
- EMBL database: New entries entered at 6.3 sec/seq!
(July 2001)
©CMBI 2001 J Leunissen
Primary and Secondary Databases
Primary databases
REAL EXPERIMENTAL DATA
Biomolecular sequences or structures and associated
annotation information (organism, function, mutation linked to
disease, functional/structural patterns, bibliographic references, etc.)
Secondary databases
DERIVED INFORMATION
Fruits of analyses of sequences in the primary sources
(patterns, blocks, profiles etc. which represent the most conserved
features of multiple alignments)
Primary Databases
Sequence Information
– DNA: EMBL, Genbank, DDBJ
– Protein: SwissProt, TREMBL, PIR, OWL
Genome Information
– GDB, MGD, ACeDB, ENSEMBL
Structure Information
– PDB, NDB, CCDB/CSD
Secondary Databases
Sequence-related Information
– ProSite, REBase
Genome-related Information
– OMIM, TransFac
Structure-related Information
– DSSP, HSSP, FSSP, PDBFinder
Pathway Information
– KEGG, Pathways
Function-related
– Enzyme, GO
Databases
Data must be in a certain format for the programs to
recognize them.
Every database can have its own format, but some data
elements are essential for every database:
1. Unique identifier, or accession code
2. Name of depositor
3. Literature references
4. Deposition date
5. The real data
3 examples
1. SwissProt
2. EMBL
3. PDB
Quality of databases
SwissProt
• Data is only entered by annotation experts
EMBL, PDB
• Everybody can submit data
• Data are accepted the way they are submitted
SwissProt database
• Database of protein sequences
• Produced by Amos Bairoch (University of Geneva) and the
EMBL Data Library
• Data derived from:
– translations of DNA sequences (from EMBL Database)
– adapted from the PIR collection
– extracted from the literature
– and directly submitted by researchers
• SwissProt & SwissNew
• July 2001:
– ~86,600 entries, ~15,000 new entries / year
– Swissnew: 53,000 entries
• Ca. 200 Annotation experts worldwide
• Keyword-organised flatfile
SwissProt records (1)
ID identification line
ID   ENTRY_NAME     DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA.
Format for the ENTRY_NAME:
NAME_SPECIES (up to 10 characters)
For a number of organisms (16), SPECIES has a recognizable name:
HUMAN, MOUSE, CHICK, BOVIN, YEAST, ECOLI, ...
N.B. The ID can change, e.g. the serotonin receptors were given a new
nomenclature
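The ID line layout described above can be read with a small regular expression. This is a minimal sketch that assumes the line looks exactly like the CRAM_CRAAB example; the names `ID_LINE` and `parse_id_line` are hypothetical helpers, not part of any SwissProt tool.

```python
import re

# Pattern for: ID   ENTRY_NAME  DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID_LINE = re.compile(
    r"^ID\s+(?P<entry_name>\S+)\s+(?P<data_class>\w+);\s+"
    r"(?P<molecule_type>\w+);\s+(?P<length>\d+)\s+AA\.$"
)


def parse_id_line(line: str) -> dict:
    """Split a SwissProt ID line into its named fields."""
    match = ID_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a SwissProt ID line: {line!r}")
    fields = match.groupdict()
    fields["length"] = int(fields["length"])
    # ENTRY_NAME has the form NAME_SPECIES.
    fields["name"], _, fields["species"] = fields["entry_name"].partition("_")
    return fields


record = parse_id_line("ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA.")
print(record["name"], record["species"], record["length"])
```

Because the ID can change, a program that stores references to entries would more safely keep the accession number (AC) rather than the entry name.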
SwissProt records (2)
AC accession number
AC   P01542;
AC is unique:
Name, sequence, everything can change but AC stays the same
DT deposition date
DT   21-JUL-1986 (Rel. 01, Created)
DT   30-MAY-2000 (Rel. 39, Last sequence update)
DT   30-MAY-2000 (Rel. 39, Last annotation update)
1) You cannot see what the last annotation update was
2) No depositor record (Implicit: author of first reference)
SwissProt records (3)
DE description
DE   CRAMBIN.
DE   6-phosphofructo-2-kinase 1 (EC 2.7.1.105) (Phosphofructokinase 2 I)
1) General descriptive information
2) Free-format
GN gene name
GN   THI2.
OS & OC & OG
Organism Species; Organism Classification; OrGanelle
OS   Crambe abyssinica (Abyssinian crambe).
OC   Eukaryota; Viridiplantae; Embryophyta; Tracheophyta; Spermatophyta;
OC   Magnoliophyta; eudicotyledons; Rosidae; eurosids II; Brassicales;
OC   Brassicaceae; Crambe.
SwissProt records (4)
RN References
RN   [1]
RP   SEQUENCE.
RX   MEDLINE; 82046542.
RA   Teeter M.M., Mazer J.A., L'Italien J.J.;
RT   "Primary structure of the hydrophobic plant protein crambin.";
RL   Biochemistry 20:5437-5443(1981).
CC Comments or notes
CC   -!- FUNCTION: THE FUNCTION OF THIS HYDROPHOBIC PLANT SEED PROTEIN
CC       IS NOT KNOWN.
CC   -!- MISCELLANEOUS: TWO ISOFORMS EXISTS, A MAJOR FORM PL (SHOWN HERE)
CC       AND A MINOR FORM SI.
CC   -!- SIMILARITY: BELONGS TO THE PLANT THIONIN FAMILY.
SwissProt records (5)
DR Database Cross Reference
DR   PIR; A01805; KECX.
DR   PDB; 1CRN; 16-APR-87.
DR   PDB; 1CBN; 31-JAN-94.
DR   PDB; 1CCM; 31-OCT-93.
DR   PDB; 1CCN; 31-JAN-94.
DR   PDB; 1CNR; 31-AUG-94.
DR   PDB; 1AB1; 12-AUG-97.
DR   INTERPRO; IPR001010; -.
DR   PFAM; PF00321; plant_thionins; 1.
DR   PRINTS; PR00287; THIONIN.
DR   PROSITE; PS00271; THIONIN; 1.
KW Keyword
Not standardized (under control of depositor)
KW   Thionin; 3D-structure.
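The DR lines are what link a SwissProt entry to other databases, so they are often the first thing a program extracts. The following sketch groups the cross references by database name; `parse_cross_references` is a hypothetical helper, and it assumes every DR line has at least a database name and one identifier separated by semicolons, as in the crambin entry.

```python
from collections import defaultdict


def parse_cross_references(lines):
    """Group the primary identifiers on DR lines by target database."""
    xrefs = defaultdict(list)
    for line in lines:
        if not line.startswith("DR"):
            continue
        # Drop the line code and the trailing full stop, then split on ';'.
        fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
        database, primary_id = fields[0], fields[1]
        xrefs[database].append(primary_id)
    return dict(xrefs)


dr_lines = [
    "DR   PIR; A01805; KECX.",
    "DR   PDB; 1CRN; 16-APR-87.",
    "DR   PDB; 1CBN; 31-JAN-94.",
    "DR   INTERPRO; IPR001010; -.",
    "DR   PROSITE; PS00271; THIONIN; 1.",
]
xrefs = parse_cross_references(dr_lines)
print(xrefs["PDB"])
```

With the full set of DR lines from the slide, the `PDB` key would list all six structure entries for crambin.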
SwissProt records (6)
FT Feature table data
FT   DISULFID      3     40
FT   DISULFID      4     32
FT   DISULFID     16     26
FT   VARIANT      22     22       P -> S (IN ISOFORM SI).
FT   VARIANT      25     25       L -> I (IN ISOFORM SI).
FT   STRAND        2      3
FT   HELIX         7     16
FT   TURN         17     19
FT   HELIX        23     30
FT   TURN         31     31
FT   STRAND       33     34
FT   TURN         42     43
Feature table
Other features: post-translational modifications, binding sites,
enzyme active sites, local secondary structure or other
characteristics reported in the cited references. Sequence conflicts
between references are also included.
FT   CONFLICT     33     33       MISSING (IN REF. 2).
FT   MUTAGEN     123    123       G->R,L,M: DNA BINDING LOST.
FT   MOD_RES      11     11       PHOSPHORYLATION (BY PKC).
FT   LIPID         1      1       MYRISTATE.
FT   CARBOHYD    103    103       GLUCOSYLGALACTOSE.
FT   METAL        87     87       COPPER (POTENTIAL).
FT   BINDING      14     14       HEME (COVALENT).
FT   PROPEP       27     28       ACTIVATION PEPTIDE.
FT   DOMAIN       22    788       EXTRACELLULAR (POTENTIAL).
FT   ACT_SITE    193    193       ACCEPTS A PROTON DURING CATALYSIS.
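Each FT line carries a feature key, a start position, an end position and an optional free-text description. A minimal sketch of reading such lines follows; the `Feature` tuple and `parse_ft_line` are illustrative names, and the code assumes whitespace-separated fields as in the examples above rather than the exact column positions of the flat file.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    key: str
    start: int
    end: int
    description: str


def parse_ft_line(line: str) -> Feature:
    """Parse one FT line: line code, key, from, to, optional description."""
    parts = line.split(None, 4)
    if len(parts) < 4 or parts[0] != "FT":
        raise ValueError(f"not a feature table line: {line!r}")
    description = parts[4].strip() if len(parts) > 4 else ""
    return Feature(parts[1], int(parts[2]), int(parts[3]), description)


ft_lines = [
    "FT   DISULFID      3     40",
    "FT   VARIANT      22     22       P -> S (IN ISOFORM SI).",
    "FT   HELIX         7     16",
    "FT   ACT_SITE    193    193       ACCEPTS A PROTON DURING CATALYSIS.",
]
features = [parse_ft_line(line) for line in ft_lines]

# Example query: which positions are annotated as active site residues?
active_site = [f.start for f in features if f.key == "ACT_SITE"]
print(active_site)
```

A lookup like the last two lines is one way to answer the earlier question "Which residues belong to the active site?" when the entry has been annotated.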
SwissProt records (7)
SQ sequence header
SQ   SEQUENCE   46 AA;  4736 MW;  919E68AF159EF722 CRC64;
Sequence data
TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN
//
Termination line
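Because every line starts with a line code and the entry ends with `//`, pulling the sequence out of a record takes only a few lines. The sketch below, with the hypothetical helper `read_sequence`, also compares the length declared on the SQ line with the number of residues actually read; it assumes the sequence lines directly follow the SQ line, as in the crambin entry.

```python
def read_sequence(record_lines):
    """Return (declared_length, sequence) from the SQ block of a flat-file entry."""
    declared, chunks, in_sequence = None, [], False
    for line in record_lines:
        if line.startswith("SQ"):
            # "SQ   SEQUENCE   46 AA; ..." -> the length is the third token.
            declared = int(line.split()[2])
            in_sequence = True
        elif line.startswith("//"):
            break  # termination line: end of the entry
        elif in_sequence:
            chunks.append("".join(line.split()))  # drop the spacing between blocks
    return declared, "".join(chunks)


entry = [
    "SQ   SEQUENCE   46 AA;  4736 MW;  919E68AF159EF722 CRC64;",
    "     TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN",
    "//",
]
declared, sequence = read_sequence(entry)
print(declared, len(sequence), sequence)
```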
EMBL database
• Nucleotide database
• EMBL & EMNEW
• July 2001:
• EMBL: 3,951,820 entries, EMNEW: 323,703
• EMEST*: 8,092,600, EMNEWEST*: 619,777
*) EMEST/EMNEWEST = EST-section of EMBL, EST = expressed
sequence tag
• EMBL records follow roughly the same scheme as
SwissProt
• Obligatory deposit of sequence in EMBL (or
SwissProt) before publication
Protein Data Bank (PDB)
• Databank for macromolecular structure data (3-dimensional coordinates)
• Obligatory deposit of coordinates in the PDB before
publication
• ~16,000 entries (October 2001)
• PDB file is a keyword-organised flat-file (80 column)
1) human readable
2) every line starts with a keyword (3-6 letters)
3) platform independent
• Started ca. 25 years ago (on punched cards!)
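Since every line of a PDB file starts with a keyword, a first look at an entry can be as simple as counting the record types. This is a minimal sketch; `count_record_types` is a hypothetical helper, and the sample text is a handful of abbreviated lines from entry 1CRN rather than a complete file.

```python
from collections import Counter


def count_record_types(pdb_text: str) -> Counter:
    """Count lines per record name (the keyword in the first six columns)."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.strip():
            counts[line[:6].strip()] += 1
    return counts


pdb_text = """\
HEADER    PLANT SEED PROTEIN                      30-APR-81   1CRN
COMPND    CRAMBIN
ATOM      1  N   THR     1      17.047  14.099   3.625  1.00 13.79
ATOM      2  CA  THR     1      16.967  12.784   4.338  1.00 10.80
TER     328      ASN    46
"""
counts = count_record_types(pdb_text)
print(counts.most_common())
```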
PDB records (1)
Filename = accession number = PDB code
1) Filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN)
2) Be aware: 0HYK means entry HYK does not contain coordinates
HEADER
describes molecule & gives deposition date
HEADER    PLANT SEED PROTEIN                      30-APR-81   1CRN      1CRND  1
COMPND
name of molecule
COMPND    CRAMBIN                                                       1CRN   4
SOURCE
organism
SOURCE    ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED                   1CRN   5
PDB records (2)
AUTHOR
The depositor
AUTHOR    W.A.HENDRICKSON,M.M.TEETER                                    1CRN   6
JRNL
JRNL        AUTH   M.BLABER,X.-J.ZHANG,B.W.MATTHEWS                     111L  10
JRNL        TITL   STRUCTURAL BASIS OF ALPHA-HELIX PROPENSITY AT TWO    111L  11
JRNL        TITL 2 SITES IN T4 LYSOZYME                                 111L  12
JRNL        REF    SCIENCE                       V. 260  1637 1993      111L  13
JRNL        REFN   ASTM SCIEAS  US ISSN 0036-8075                 038   111L  14
REMARK
Not standardized: many different REMARK records & subrecords!
REMARK   1 REFERENCE 3                                                  1CRNC 10
REMARK   1  AUTH   M.M.TEETER,W.A.HENDRICKSON                           1CRN  16
REMARK   1  TITL   HIGHLY ORDERED CRYSTALS OF THE PLANT SEED PROTEIN    1CRN  17
REMARK   1  TITL 2 CRAMBIN                                              1CRN  18
REMARK   1  REF    J.MOL.BIOL.                   V. 127   219 1979      1CRN  19
REMARK   1  REFN   ASTM JMOBAK  UK ISSN 0022-2836                 070   1CRN  20
REMARK   2                                                              1CRN  21
REMARK   2 RESOLUTION. 1.5 ANGSTROMS.                                   1CRN  22
PDB records (3)
SEQRES
Sequence of protein;
Be aware: Not always all 3D-coordinates are present for all the amino
acids in SEQRES!!
SEQRES   1     46  THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE  1CRN  51
SEQRES   2     46  ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS  1CRN  52
SEQRES   3     46  ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR  1CRN  53
SEQRES   4     46  CYS PRO GLY ASP TYR ALA ASN                          1CRN  54
HET & FORMUL
metals, cofactors, ions, etc.
HET    NAD  A   1      44     NAD CO-ENZYME                             4MDH 219
HET    SUL  A   2       5     SULFATE                                   4MDH 220
HET    NAD  B   1      44     NAD CO-ENZYME                             4MDH 221
HET    SUL  B   2       5     SULFATE                                   4MDH 222
FORMUL   3  NAD    2(C21 H28 N7 O14 P2)                                 4MDH 223
FORMUL   4  SUL    2(O4 S1)                                             4MDH 224
FORMUL   5  HOH   *471(H2 O1)                                           4MDH 225
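SEQRES stores the sequence in three-letter codes, whereas SwissProt uses one-letter codes, so comparing the two requires a conversion. The sketch below converts the four SEQRES lines of 1CRN; `THREE_TO_ONE` and `seqres_to_one_letter` are hypothetical names, only the 20 standard amino acids are handled, and any other token (serial numbers, entry codes, non-standard residues) is simply skipped.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_one_letter(lines):
    """Concatenate the residues on SEQRES lines into a one-letter sequence."""
    residues = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "SEQRES":
            # Keep only tokens that are standard three-letter residue names.
            residues.extend(f for f in fields[1:] if f in THREE_TO_ONE)
    return "".join(THREE_TO_ONE[r] for r in residues)


seqres = [
    "SEQRES   1     46  THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE  1CRN  51",
    "SEQRES   2     46  ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS  1CRN  52",
    "SEQRES   3     46  ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR  1CRN  53",
    "SEQRES   4     46  CYS PRO GLY ASP TYR ALA ASN                          1CRN  54",
]
pdb_sequence = seqres_to_one_letter(seqres)
print(pdb_sequence)
```

Compared with the SwissProt sequence of CRAM_CRAAB, the SEQRES sequence shown on the slide has ILE instead of L at position 25, which matches the VARIANT line for isoform SI in the feature table.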
PDB records (4)
HELIX/SHEET/TURN
Secondary structure elements as provided by the crystallographer
(subjective)
HELIX    1  H1 ILE      7  PRO     19  1 3/10 CONFORMATION RES 17,19    1CRN  55
SHEET    2  S1 2 CYS    32  ILE    35 -1                                1CRN  58
TURN     1  T1 PRO     41  TYR     44                                   1CRN  59
SSBOND
disulfide bridges
SSBOND   1 CYS      3    CYS     40                                     1CRN  60
SSBOND   2 CYS      4    CYS     32                                     1CRN  61
CRYST1, ORIGX1, ORIGX2, ORIGX3, SCALE1, SCALE2,
SCALE3
crystallographic parameters
CRYST1   40.960   18.650   22.520  90.00  90.77  90.00 P 21          2  1CRN  63
ORIGX1      1.000000  0.000000  0.000000        0.00000                 1CRN  64
ORIGX2      0.000000  1.000000  0.000000        0.00000                 1CRN  65
ORIGX3      0.000000  0.000000  1.000000        0.00000                 1CRN  66
SCALE1       .024414  0.000000 -.000328        0.00000                  1CRN  67
SCALE2      0.000000   .053619  0.000000        0.00000                 1CRN  68
SCALE3      0.000000  0.000000   .044409        0.00000                 1CRN  69
PDB records (5)
ATOM
one line for each atom with its unique name and its x,y,z coordinates
ATOM      1  N   THR     1      17.047  14.099   3.625  1.00 13.79      1CRN  70
ATOM      2  CA  THR     1      16.967  12.784   4.338  1.00 10.80      1CRN  71
ATOM      3  C   THR     1      15.685  12.755   5.133  1.00  9.19      1CRN  72
ATOM      4  O   THR     1      15.268  13.825   5.594  1.00  9.85      1CRN  73
ATOM      5  CB  THR     1      18.170  12.703   5.337  1.00 13.02      1CRN  74
ATOM      6  OG1 THR     1      19.334  12.829   4.463  1.00 15.06      1CRN  75
ATOM      7  CG2 THR     1      18.150  11.546   6.304  1.00 14.23      1CRN  76
ATOM      8  N   THR     2      15.115  11.555   5.265  1.00  7.81      1CRN  77
ATOM      9  CA  THR     2      13.856  11.469   6.066  1.00  8.31      1CRN  78
ATOM     10  C   THR     2      14.164  10.785   7.379  1.00  5.80      1CRN  79
ATOM     11  O   THR     2      14.993   9.862   7.443  1.00  6.94      1CRN  80
TER record terminates the amino acid chain
ATOM    325  OD1 ASN    46      11.982   4.849  15.886  1.00 11.00      1CRN 394
ATOM    326  ND2 ASN    46      13.407   3.298  15.015  1.00 10.32      1CRN 395
ATOM    327  OXT ASN    46      12.703   4.973  10.746  1.00  7.86      1CRN 396
TER     328      ASN    46                                              1CRN 397
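With the x, y, z coordinates of each atom available, geometric questions become simple arithmetic. The following sketch reads two ATOM lines and computes the distance between the atoms in Ångström. The helpers `parse_atom` and `distance` are hypothetical, and splitting on whitespace is an assumption that holds for the lines shown here; real PDB files are fixed-column, and neighbouring fields can touch, so a production parser would slice by column instead.

```python
import math


def parse_atom(line: str) -> dict:
    """Parse a whitespace-separated ATOM line without chain identifier."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "res_name": f[3],
        "res_seq": int(f[4]),
        "xyz": (float(f[5]), float(f[6]), float(f[7])),
        "occupancy": float(f[8]),
        "b_factor": float(f[9]),
    }


def distance(atom_a: dict, atom_b: dict) -> float:
    """Euclidean distance between two atoms, in the units of the file."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


n = parse_atom("ATOM      1  N   THR     1      17.047  14.099   3.625  1.00 13.79      1CRN  70")
ca = parse_atom("ATOM      2  CA  THR     1      16.967  12.784   4.338  1.00 10.80      1CRN  71")
print(f"N-CA distance in THR 1: {distance(n, ca):.3f}")
```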
PDB records (6)
HETATM
atomic coordinate records for atoms within “HET & FORMUL”-lines
(metals, cofactors, ions, …) and for water molecules
HETATM 5158  AP  NAD B   1      42.641  30.361  41.284  1.00 26.73      4MDH5495
HETATM 5159  AO1 NAD B   1      43.440  31.570  40.868  1.00 20.69      4MDH5496
HETATM 5160  AO2 NAD B   1      41.161  30.484  41.376  1.00 33.73      4MDH5497
HETATM 5207  O   HOH     0      15.379   1.907   3.295  1.00 58.12      4MDH5544
HETATM 5208  O   HOH     1      58.861   0.984  17.024  1.00 37.58      4MDH5545
HETATM 5209  O   HOH     2      24.384   1.184  74.398  1.00 35.92      4MDH5546
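HETATM lines mix water molecules with cofactors, ions and other ligands, and many analyses start by separating the two. A minimal sketch, assuming whitespace-separated fields as in the lines above and the residue name `HOH` for water; `split_hetatm` is a hypothetical helper.

```python
def split_hetatm(lines):
    """Separate HETATM lines into water molecules and all other hetero atoms."""
    waters, ligands = [], []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "HETATM":
            continue
        # fields: HETATM, serial, atom name, residue name, ...
        (waters if fields[3] == "HOH" else ligands).append(line)
    return waters, ligands


hetatm_lines = [
    "HETATM 5158  AP  NAD B   1      42.641  30.361  41.284  1.00 26.73      4MDH5495",
    "HETATM 5159  AO1 NAD B   1      43.440  31.570  40.868  1.00 20.69      4MDH5496",
    "HETATM 5207  O   HOH     0      15.379   1.907   3.295  1.00 58.12      4MDH5544",
    "HETATM 5208  O   HOH     1      58.861   0.984  17.024  1.00 37.58      4MDH5545",
]
waters, ligands = split_hetatm(hetatm_lines)
print(len(waters), "water atoms,", len(ligands), "ligand atoms")
```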