What are we looking for? Data & databases
©CMBI 2001

Your questions
• Lookup
• Compare
• Predict

Your questions: Lookup
• Is the gene known for my protein (or vice versa)?
• On which chromosome is the gene located?
• What sequence patterns are present in my protein?
• Are the mutations that cause this disease known?
• To what class or family does my protein belong?
• What is known about this family?

Your questions: Compare
• Are there protein sequences in the database which resemble the protein I cloned?
• How can I optimally align the members of this protein family?
• Are these two proteins similar?

Sequence similarity
Imagine you sequenced this human protein:

    MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
    WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
    VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
    GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
    DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG

You know it is a serine protease. Which residues belong to the active site? Is its sequence similar to the mouse serine protease?

Alignment

    human  MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
    mouse  MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ
           *::* .**** **. :. :   *:**:*** : .** * *.*  *********: ****** *::

    human  WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
    mouse  WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK
           ******* ** *:******** *.***:**** ***.*::**  *********: **.**.****

    human  VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
    mouse  VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW
           **:*** ***  ******: * ********:* ******:*** ****:*::** *:*.***:**

    human  GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
    mouse  GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG
           ********** ********** ******:*.  ******** . ***.****** **********

    human  DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
    mouse  DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG---
           ********** *. ***:*** *******: : ***** ** * *****::*** ******

=> Transfer of information

Are these structures similar?
©CMBI 2000 J Leunissen

Your questions: Predict
• Can I predict the active site residues of this enzyme?
• Why are these patients ill?
• Can I make a 3D model for my protein?
• Can I predict a (better) drug for this target?
• How can I improve the thermostability of this protein? (protein engineering)
• How can I predict the genes located on this genome?

How to find the answers to these questions?

Outline
Morning
• Data in databases
Afternoon
• Programs (tools) to search these databases
• Knowledge of how to search the databases with these tools (hands-on)

Biological Databases
The number of databases
- DBCAT currently lists over 500 databases
The size of databases
- Grows exponentially
- EMBL database: a new entry is added every 6.3 seconds! (July 2001; ©CMBI 2001 J Leunissen)

Primary and Secondary Databases
Primary databases: REAL EXPERIMENTAL DATA
Biomolecular sequences or structures and associated annotation information (organism, function, mutation linked to disease, functional/structural patterns, bibliographic etc.)
Secondary databases: DERIVED INFORMATION
Fruits of analyses of sequences in the primary sources (patterns, blocks, profiles etc. which represent the most conserved features of multiple alignments)

Primary Databases
Sequence Information
– DNA: EMBL, Genbank, DDBJ
– Protein: SwissProt, TREMBL, PIR, OWL
Genome Information
– GDB, MGD, ACeDB, ENSEMBL
Structure Information
– PDB, NDB, CCDB/CSD

Secondary Databases
Sequence-related Information
– ProSite, REBase
Genome-related Information
– OMIM, TransFac
Structure-related Information
– DSSP, HSSP, FSSP, PDBFinder
Pathway Information
– KEGG, Pathways
Function-related
– Enzyme, GO

Databases
Data must be in a certain format for the programs to recognize them. Every database can have its own format, but some data elements are essential for every database:
1. Unique identifier, or accession code
2. Name of depositor
3. Literature references
4. Deposition date
5. The real data

3 examples
1. SwissProt
2. EMBL
3. PDB

Quality of databases
SwissProt
• Data is only entered by annotation experts
EMBL, PDB
• Everybody can submit data
• Data are accepted the way they are submitted

SwissProt database
• Database of protein sequences
• Produced by Amos Bairoch (University of Geneva) and the EMBL Data Library
• Data derived from:
– translations of DNA sequences (from EMBL Database)
– adapted from the PIR collection
– extracted from the literature
– and directly submitted by researchers
• SwissProt & SwissNew
• July 2001:
– ~86,600 entries, ~15,000 new entries / year
– Swissnew: 53,000 entries
• Ca. 200 annotation experts worldwide
• Keyword-organised flatfile

SwissProt records (1)
ID identification line

    ID   ENTRY_NAME     DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
    ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA.

Format for the ENTRY_NAME: NAME_SPECIES (up to 10 characters)
For a number of organisms (16), SPECIES is a recognizable name: HUMAN, MOUSE, CHICK, BOVIN, YEAST, ECOLI….
N.B. The ID can change, e.g. serotonin receptors were given a new nomenclature

SwissProt records (2)
AC accession number

    AC   P01542;

AC is unique: name, sequence, everything can change but the AC stays the same

DT deposition date

    DT   21-JUL-1986 (Rel. 01, Created)
    DT   30-MAY-2000 (Rel. 39, Last sequence update)
    DT   30-MAY-2000 (Rel. 39, Last annotation update)

1) You cannot see what the last annotation update was
2) No depositor record (implicit: author of first reference)

SwissProt records (3)
DE description

    DE   CRAMBIN.
    DE   6-phosphofructo-2-kinase 1 (EC 2.7.1.105) (Phosphofructokinase 2 I)

1) General descriptive information
2) Free-format

GN gene name

    GN   THI2.

OS & OC & OG

    OS   Crambe abyssinica (Abyssinian crambe).
    OC   Eukaryota; Viridiplantae; Embryophyta; Tracheophyta; Spermatophyta;
    OC   Magnoliophyta; eudicotyledons; Rosidae; eurosids II; Brassicales;
    OC   Brassicaceae; Crambe.

Organism Species; Organism Classification; OrGanelle

SwissProt records (4)
RN References

    RN   [1]
    RP   SEQUENCE.
    RX   MEDLINE; 82046542.
    RA   Teeter M.M., Mazer J.A., L'Italien J.J.;
    RT   "Primary structure of the hydrophobic plant protein crambin.";
    RL   Biochemistry 20:5437-5443(1981).

CC Comments or notes

    CC   -!- FUNCTION: THE FUNCTION OF THIS HYDROPHOBIC PLANT SEED PROTEIN
    CC       IS NOT KNOWN.
    CC   -!- MISCELLANEOUS: TWO ISOFORMS EXISTS, A MAJOR FORM PL (SHOWN HERE)
    CC       AND A MINOR FORM SI.
    CC   -!- SIMILARITY: BELONGS TO THE PLANT THIONIN FAMILY.

SwissProt records (5)
DR Database Cross Reference

    DR   PIR; A01805; KECX.
    DR   PDB; 1CRN; 16-APR-87.
    DR   PDB; 1CBN; 31-JAN-94.
    DR   PDB; 1CCM; 31-OCT-93.
    DR   PDB; 1CCN; 31-JAN-94.
    DR   PDB; 1CNR; 31-AUG-94.
    DR   PDB; 1AB1; 12-AUG-97.
    DR   INTERPRO; IPR001010; -.
    DR   PFAM; PF00321; plant_thionins; 1.
    DR   PRINTS; PR00287; THIONIN.
    DR   PROSITE; PS00271; THIONIN; 1.

KW Keyword
Not standardized (under control of depositor)

    KW   Thionin; 3D-structure.
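
Because every SwissProt line starts with a two-letter line code, an entry can be split into fields with very little code. The following is a minimal sketch, not an official parser: it assumes the flat-file layout in which the line code occupies columns 1-2 and the data start at column 6, and the names `parse_swissprot_entry`, `pdb_cross_references` and `ENTRY` are hypothetical, introduced only for this illustration. The sample text is a shortened copy of the crambin entry shown on the slides.

```python
from collections import defaultdict

# Shortened crambin entry, retyped from the slides (assumed column layout:
# two-letter line code, three blanks, then the data).
ENTRY = """\
ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA.
AC   P01542;
DE   CRAMBIN.
DR   PIR; A01805; KECX.
DR   PDB; 1CRN; 16-APR-87.
DR   PDB; 1CBN; 31-JAN-94.
KW   Thionin; 3D-structure.
//
"""


def parse_swissprot_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # termination line ends the entry
            break
        code, value = line[:2].strip(), line[5:].strip()
        if code:  # skip sequence lines, which have no line code
            fields[code].append(value)
    return dict(fields)


def pdb_cross_references(fields):
    """Return the PDB codes listed in the DR (cross-reference) lines."""
    codes = []
    for dr in fields.get("DR", []):
        parts = [part.strip() for part in dr.rstrip(".").split(";")]
        if parts[0] == "PDB":
            codes.append(parts[1])
    return codes


fields = parse_swissprot_entry(ENTRY)
entry_name = fields["ID"][0].split()[0]
accession = fields["AC"][0].split(";")[0]
print(entry_name, accession, pdb_cross_references(fields))
```

Looking entries up by the accession number rather than the entry name follows the advice on the AC slide: the ID may change, the AC does not.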

SwissProt records (6)
FT Feature table data

    FT   DISULFID      3     40
    FT   DISULFID      4     32
    FT   DISULFID     16     26
    FT   VARIANT      22     22       P -> S (IN ISOFORM SI).
    FT   VARIANT      25     25       L -> I (IN ISOFORM SI).
    FT   STRAND        2      3
    FT   HELIX         7     16
    FT   TURN         17     19
    FT   HELIX        23     30
    FT   TURN         31     31
    FT   STRAND       33     34
    FT   TURN         42     43

Feature table
Other features: post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included.

    FT   CONFLICT     33     33       MISSING (IN REF. 2).
    FT   MUTAGEN     123    123       G->R,L,M: DNA BINDING LOST.
    FT   MOD_RES      11     11       PHOSPHORYLATION (BY PKC).
    FT   LIPID         1      1       MYRISTATE.
    FT   CARBOHYD    103    103       GLUCOSYLGALACTOSE.
    FT   METAL        87     87       COPPER (POTENTIAL).
    FT   BINDING      14     14       HEME (COVALENT).
    FT   PROPEP       27     28       ACTIVATION PEPTIDE.
    FT   DOMAIN       22    788       EXTRACELLULAR (POTENTIAL).
    FT   ACT_SITE    193    193       ACCEPTS A PROTON DURING CATALYSIS.

SwissProt records (7)
SQ sequence header

    SQ   SEQUENCE   46 AA;  4736 MW;  919E68AF159EF722 CRC64;

Sequence data

         TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN

// Termination line

    //

EMBL database
• Nucleotide database
• EMBL & EMNEW
• July 2001:
– EMBL: 3,951,820 entries, EMNEW: 323,703
– EMEST*: 8,092,600, EMNEWEST*: 619,777
*) EMEST/EMNEWEST = EST section of EMBL, EST = expressed sequence tag
• EMBL records follow roughly the same scheme as SwissProt
• Obligatory deposit of sequence in EMBL (or SwissProt) before publication

Protein Data Bank (PDB)
• Databank for macromolecular structure data (3-dimensional coordinates)
• Obligatory deposit of coordinates in the PDB before publication
• ~16,000 entries (October 2001)
• PDB file is a keyword-organised flat-file (80 column)
1) human readable
2) every line starts with a keyword (3-6 letters)
3) platform independent
• Started ca. 25 years ago (on punched cards!)
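
Since every line of a PDB file starts with a keyword, a first look at a file can be as simple as tallying those keywords. The sketch below is only an illustration under the assumption that the keyword occupies the first six columns of each line; `record_type`, `count_record_types` and `SAMPLE` are hypothetical names, and the sample lines are abbreviated 1CRN records with their spacing retyped by hand.

```python
from collections import Counter


def record_type(line):
    """Return the keyword stored in the first six columns of a PDB line."""
    return line[:6].strip()


def count_record_types(lines):
    """Count how often each record keyword occurs, ignoring blank lines."""
    return Counter(record_type(line) for line in lines if line.strip())


# Abbreviated records from entry 1CRN (spacing is an assumption).
SAMPLE = [
    "HEADER    PLANT SEED PROTEIN                      30-APR-81   1CRN",
    "COMPND    CRAMBIN",
    "SOURCE    ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED",
    "AUTHOR    W.A.HENDRICKSON,M.M.TEETER",
    "REMARK   2 RESOLUTION. 1.5 ANGSTROMS.",
    "SSBOND   1 CYS      3    CYS     40",
    "SSBOND   2 CYS      4    CYS     32",
    "TER",
]

counts = count_record_types(SAMPLE)
for keyword, n in counts.items():
    print(keyword, n)
```

A tally like this quickly shows, for example, whether a file contains any ATOM records at all before more detailed parsing is attempted.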

PDB records (1)
Filename = accession number = PDB code
1) Filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN)
2) Be aware: 0HYK means that entry HYK does not contain coordinates

HEADER describes molecule & gives deposition date

    HEADER    PLANT SEED PROTEIN                      30-APR-81   1CRN      1CRND  1

COMPND name of molecule

    COMPND    CRAMBIN                                                       1CRN   4

SOURCE organism

    SOURCE    ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED                   1CRN   5

PDB records (2)
AUTHOR The depositor

    AUTHOR    W.A.HENDRICKSON,M.M.TEETER                                    1CRN   6

JRNL

    JRNL        AUTH   M.BLABER,X.-J.ZHANG,B.W.MATTHEWS                     111L  10
    JRNL        TITL   STRUCTURAL BASIS OF ALPHA-HELIX PROPENSITY AT TWO    111L  11
    JRNL        TITL 2 SITES IN T4 LYSOZYME                                 111L  12
    JRNL        REF    SCIENCE                       V. 260  1637 1993      111L  13
    JRNL        REFN   ASTM SCIEAS  US ISSN 0036-8075                 038   111L  14

REMARK Not standardized: many different REMARK records & subrecords!

    REMARK   1 REFERENCE 3                                                  1CRNC 10
    REMARK   1  AUTH   M.M.TEETER,W.A.HENDRICKSON                           1CRN  16
    REMARK   1  TITL   HIGHLY ORDERED CRYSTALS OF THE PLANT SEED PROTEIN    1CRN  17
    REMARK   1  TITL 2 CRAMBIN                                              1CRN  18
    REMARK   1  REF    J.MOL.BIOL.                   V. 127   219 1979      1CRN  19
    REMARK   1  REFN   ASTM JMOBAK  UK ISSN 0022-2836                 070   1CRN  20
    REMARK   2                                                              1CRN  21
    REMARK   2 RESOLUTION. 1.5 ANGSTROMS.                                   1CRN  22

PDB records (3)
SEQRES Sequence of protein
Be aware: 3D coordinates are not always present for all the amino acids in SEQRES!

    SEQRES   1  46  THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE     1CRN  51
    SEQRES   2  46  ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS     1CRN  52
    SEQRES   3  46  ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR     1CRN  53
    SEQRES   4  46  CYS PRO GLY ASP TYR ALA ASN                             1CRN  54

HET & FORMUL metals, cofactors, ions, etc.

    HET    NAD  A   1      44     NAD CO-ENZYME                             4MDH 219
    HET    SUL  A   2       5     SULFATE                                   4MDH 220
    HET    NAD  B   1      44     NAD CO-ENZYME                             4MDH 221
    HET    SUL  B   2       5     SULFATE                                   4MDH 222
    FORMUL   3  NAD    2(C21 H28 N7 O14 P2)                                 4MDH 223
    FORMUL   4  SUL    2(O4 S1)                                             4MDH 224
    FORMUL   5  HOH   *471(H2 O1)                                           4MDH 225

PDB records (4)
HELIX/SHEET/TURN Secondary structure elements as provided by the crystallographer (subjective)

    HELIX    1  H1 ILE      7  PRO     19  1 3/10 CONFORMATION RES 17,19    1CRN  55
    SHEET    2  S1 2 CYS    32  ILE    35 -1                                1CRN  58
    TURN     1  T1 PRO     41  TYR     44                                   1CRN  59

SSBOND disulfide bridges

    SSBOND   1 CYS      3    CYS     40                                     1CRN  60
    SSBOND   2 CYS      4    CYS     32                                     1CRN  61

CRYST1, ORIGX1, ORIGX2, ORIGX3, SCALE1, SCALE2, SCALE3 crystallographic parameters

    CRYST1   40.960   18.650   22.520  90.00  90.77  90.00 P 21          2  1CRN  63
    ORIGX1      1.000000  0.000000  0.000000        0.00000                 1CRN  64
    ORIGX2      0.000000  1.000000  0.000000        0.00000                 1CRN  65
    ORIGX3      0.000000  0.000000  1.000000        0.00000                 1CRN  66
    SCALE1       .024414  0.000000  -.000328        0.00000                 1CRN  67
    SCALE2      0.000000   .053619  0.000000        0.00000                 1CRN  68
    SCALE3      0.000000  0.000000   .044409        0.00000                 1CRN  69

PDB records (5)
ATOM one line for each atom with its unique name and its x,y,z coordinates

    ATOM      1  N   THR     1      17.047  14.099   3.625  1.00 13.79      1CRN  70
    ATOM      2  CA  THR     1      16.967  12.784   4.338  1.00 10.80      1CRN  71
    ATOM      3  C   THR     1      15.685  12.755   5.133  1.00  9.19      1CRN  72
    ATOM      4  O   THR     1      15.268  13.825   5.594  1.00  9.85      1CRN  73
    ATOM      5  CB  THR     1      18.170  12.703   5.337  1.00 13.02      1CRN  74
    ATOM      6  OG1 THR     1      19.334  12.829   4.463  1.00 15.06      1CRN  75
    ATOM      7  CG2 THR     1      18.150  11.546   6.304  1.00 14.23      1CRN  76
    ATOM      8  N   THR     2      15.115  11.555   5.265  1.00  7.81      1CRN  77
    ATOM      9  CA  THR     2      13.856  11.469   6.066  1.00  8.31      1CRN  78
    ATOM     10  C   THR     2      14.164  10.785   7.379  1.00  5.80      1CRN  79
    ATOM     11  O   THR     2      14.993   9.862   7.443  1.00  6.94      1CRN  80

TER record terminates the amino acid chain

    ATOM    325  OD1 ASN    46      11.982   4.849  15.886  1.00 11.00      1CRN 394
    ATOM    326  ND2 ASN    46      13.407   3.298  15.015  1.00 10.32      1CRN 395
    ATOM    327  OXT ASN    46      12.703   4.973  10.746  1.00  7.86      1CRN 396
    TER     328      ASN    46                                              1CRN 397

PDB records (6)
HETATM atomic coordinate records for atoms within "HET & FORMUL" lines (metals, cofactors, ions, …) and for water molecules

    HETATM 5158  AP  NAD B   1      42.641  30.361  41.284  1.00 26.73      4MDH5495
    HETATM 5159  AO1 NAD B   1      43.440  31.570  40.868  1.00 20.69      4MDH5496
    HETATM 5160  AO2 NAD B   1      41.161  30.484  41.376  1.00 33.73      4MDH5497

    HETATM 5207  O   HOH     0      15.379   1.907   3.295  1.00 58.12      4MDH5544
    HETATM 5208  O   HOH     1      58.861   0.984  17.024  1.00 37.58      4MDH5545
    HETATM 5209  O   HOH     2      24.384   1.184  74.398  1.00 35.92      4MDH5546
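
ATOM and HETATM records are fixed-column lines, so they are read by position and not by splitting on whitespace. The sketch below assumes the standard PDB column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54, occupancy in 55-60, B-factor in 61-66). The names `Atom`, `format_atom_line` and `parse_atom_line` are hypothetical. Because the spacing of the slide text cannot be trusted, the example first builds a line in that layout and then parses it back.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    record: str       # "ATOM" or "HETATM"
    serial: int
    name: str
    res_name: str
    chain: str        # empty string when the file has no chain identifier
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def format_atom_line(record, serial, name, res_name, chain, res_seq,
                     x, y, z, occupancy, b_factor):
    """Build a 66-column coordinate line in the assumed PDB layout."""
    return "%-6s%5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        record, serial, name, res_name, chain, res_seq,
        x, y, z, occupancy, b_factor)


def parse_atom_line(line):
    """Read one ATOM/HETATM line by column position (0-based slices)."""
    return Atom(
        record=line[0:6].strip(),
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21].strip(),
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
    )


# First atom of 1CRN and one water of 4MDH, values taken from the slides.
first = parse_atom_line(format_atom_line(
    "ATOM", 1, " N", "THR", "", 1, 17.047, 14.099, 3.625, 1.00, 13.79))
water = parse_atom_line(format_atom_line(
    "HETATM", 5207, " O", "HOH", "", 0, 15.379, 1.907, 3.295, 1.00, 58.12))
print(first)
print(water)
```

Reading by column matters in practice: in real files neighbouring fields can touch without a separating blank, for example a large negative coordinate, and whitespace splitting would then merge two values.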