Bioinformatics databases & sequence retrieval

Content of lecture
I. Introduction
II. Bioinformatics data & databases
III. Sequence Retrieval with MRS

Celia van Gelder
CMBI UMC Radboud
September 2014

I. Bioinformatics questions

Lookup
• Is the gene known for my protein (or vice versa)?
• What sequence patterns are present in my protein?
• To what class or family does my protein belong?

Compare
• Are there sequences in the database which resemble the protein I cloned?
• How can I optimally align the members of this protein family?

Predict
• Can I predict the active site residues of this enzyme?
• Can I predict a (better) drug for this target?
• How can I predict the genes located on this genome?

©CMBI 2009

Sequence similarity

Imagine you sequenced this human protein:

MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG

You know it is a serine protease. Which residues belong to the active site? Is its sequence similar to the mouse serine protease?

Sequence Alignment

In each block the first row is the human sequence, the second row the mouse sequence, and the third row the consensus line (* identical, : and . similar). The column spacing of the consensus line was lost where the residues differ.

MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ
*::* .**** **. :. : *:**:*** : .** * *.* *********: ****** *::

WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK
******* ** *:******** *.***:**** ***.*::** *********: **.**.****

VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW
**:*** *** ******: * ********:* ******:*** ****:*::** *:*.***:**

GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG
********** ********** ******:*. ******** . ***.****** **********

DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG---
********** *. ***:*** *******: : ***** ** * *****::*** ******

=> Transfer of information

II. Bioinformatics data and databases
• mRNA expression profiles
• MS data
• Large amount of data
• Growing very fast
• Heterogeneous data types

EMBL DNA database
Note: in 2012, 247 million entries and 429 billion nucleotides.

Genome projects

Biological databases (1)

Primary databases contain biomolecular sequences or structures (experimental data!) and associated annotation information.

Sequences
• Nucleic acid sequences: EMBL, Genbank, DDBJ
• Protein sequences: SwissProt, trEMBL, UniProt
Structures
• Protein structures: PDB
• Structures of small compounds: CSD
Genomes
• Ensembl
• UCSC

Biological databases (2)

Secondary databases contain data derived from primary database(s).
• Patterns, motifs, domains: PROSITE, PFAM, PRINTS, INTERPRO, ...
• Disease mutations: OMIM / MIM
• SNPs: dbSNP
• Pathways: KEGG

Databases

Data must be in a defined format for software to recognize it. Every database can have its own format, but some data elements are essential for every database:
1. Unique identifier, or accession code
2. Name of depositor
3. Literature references
4. Deposition date
5. The real data

Quality of Data

SwissProt
• Data is only entered by annotation experts
EMBL, PDB
• “Everybody” can submit data
• No human intervention when submitted; some automatic checks

SwissProt database
• Database of protein sequences
• 546,000 sequence entries (Sept 2014)
• SwissProt is manually annotated and reviewed
• Obligatory deposit of sequence in SwissProt before publication
• SwissProt is part of UniProt
• The other main part of UniProt is TrEMBL (translated EMBL). TrEMBL is automatically annotated and is not reviewed.

Important records in SwissProt (1)

ID   HBA_HUMAN   Reviewed;   142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 2.
DT   23-SEP-2008, entry version 63.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Hemoglobin alpha chain;
DE   AltName: Full=Alpha-globin;

Important records in SwissProt (2)

Cross references section: hyperlinks to all entries in other databases that are relevant for the protein sequence. For HBA_HUMAN these cover:
• genes & mRNA
• protein domains
• diseases
• structures

Important records in SwissProt (3)

Features section: post-translational modifications, signal peptides, binding sites, enzyme active sites, domains, disulfide bridges, local secondary structure, sequence conflicts between references, etc.

And finally, the amino acid sequence!
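The SwissProt line codes shown above (ID, AC, DT, DE) make an entry easy to read by program. The following is a minimal sketch, not the UniProt reference parser: it assumes each line starts with a two-letter code followed by spaces and the value, and the names `parse_flatfile` and `accessions` are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict

# Header lines of the HBA_HUMAN entry from the slide. The exact column
# layout is an assumption: two-letter line code, spaces, then the value.
RECORD = """\
ID   HBA_HUMAN   Reviewed;   142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 2.
DT   23-SEP-2008, entry version 63.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Hemoglobin alpha chain;
DE   AltName: Full=Alpha-globin;
"""


def parse_flatfile(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)


def accessions(fields):
    """Split the AC line(s) into individual accession codes."""
    return [
        acc.strip()
        for line in fields.get("AC", [])
        for acc in line.split(";")
        if acc.strip()
    ]


fields = parse_flatfile(RECORD)
entry_name = fields["ID"][0].split()[0]    # first token of the ID line
primary_accession = accessions(fields)[0]  # first code on the AC line
print(entry_name, primary_accession, len(fields["DT"]), "DT lines")
```

The first accession code on the AC line is the primary one; the others are secondary accession codes kept so that older references to the entry still resolve.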
EMBL database
• Nucleotide database
• EMBL: 470 million sequence entries comprising 998 billion nucleotides (Sept 2014)
• EMBL records follow roughly the same scheme as SwissProt
• Obligatory deposit of sequence in EMBL before publication
• Most EMBL sequences are never seen by a human

Protein Data Bank (PDB)

Databank for 3-dimensional structures of biomolecules (by X-ray & NMR):
• Protein
• DNA
• RNA
• Ligands

Obligatory deposit of coordinates in the PDB before publication
~84000 entries (Sep 2012) (~6000 “unique” structures)

A PDB file is a keyword-organised flat file (80 columns):
1) human readable
2) every line starts with a keyword (3-6 letters)
3) platform independent

PDB important records (1)

PDB nomenclature: filename = accession number = PDB code
The filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN)

HEADER describes the molecule & gives the deposition date
HEADER    PLANT SEED PROTEIN    30-APR-81    1CRN

COMPND name of the molecule
COMPND    CRAMBIN

SOURCE organism
SOURCE    ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED

PDB important records (2)

SEQRES sequence of the protein; be aware: 3D coordinates are not always present for all the amino acids in SEQRES!

SEQRES   1  46  THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE   1CRN  51
SEQRES   2  46  ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS   1CRN  52
SEQRES   3  46  ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR   1CRN  53
SEQRES   4  46  CYS PRO GLY ASP TYR ALA ASN                           1CRN  54

SSBOND disulfide bridges
SSBOND   1 CYS  3  CYS 40
SSBOND   2 CYS  4  CYS 32

PDB important records (3)

And at the end of the PDB file, the “real” data:

ATOM one line for each atom, with its unique name and its x, y, z coordinates

ATOM    1  N    THR   1   17.047  14.099   3.625  1.00  13.79   1CRN  70
ATOM    2  CA   THR   1   16.967  12.784   4.338  1.00  10.80   1CRN  71
ATOM    3  C    THR   1   15.685  12.755   5.133  1.00   9.19   1CRN  72
ATOM    4  O    THR   1   15.268  13.825   5.594  1.00   9.85   1CRN  73
ATOM    5  CB   THR   1   18.170  12.703   5.337  1.00  13.02   1CRN  74
ATOM    6  OG1  THR   1   19.334  12.829   4.463  1.00  15.06   1CRN  75
ATOM    7  CG2  THR   1   18.150  11.546   6.304  1.00  14.23   1CRN  76
ATOM    8  N    THR   2   15.115  11.555   5.265  1.00   7.81   1CRN  77
ATOM    9  CA   THR   2   13.856  11.469   6.066  1.00   8.31   1CRN  78
ATOM   10  C    THR   2   14.164  10.785   7.379  1.00   5.80   1CRN  79
ATOM   11  O    THR   2   14.993   9.862   7.443  1.00   6.94   1CRN  80

Structure Visualization

Structures from PDB can be visualized with:
1. Yasara / Yasaraview (www.yasara.org)
2. SwissPDBViewer (http://spdbv.vital-it.ch/)
3. Protein Explorer (http://www.umass.edu/microbio/rasmol/)
4. Cn3D (http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml)

Part III: Sequence Retrieval with MRS

Google
• The best generic search and retrieval system
• Google searches everywhere for everything

MRS
• Maarten’s Retrieval System (http://mrs.cmbi.ru.nl)
• MRS searches in selected data environments
• MRS is the Google of the biological database world

Search engine (like Google)
• Input/Query = word(s)
• Output = entry/entries from database

Other programs exist: Entrez, SRS, ...
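As a worked aside on the PDB ATOM records from Part II: because every ATOM line carries an atom name and x, y, z coordinates, simple geometry can be computed directly from the file. The sketch below is a simplification and makes two assumptions. It splits lines on whitespace, which works for these sample lines but not in general (real PDB files are fixed-column, and fields can touch). The helper `parse_atom` is a hypothetical name, not part of any PDB library.

```python
import math

# A few ATOM records of 1CRN as listed in the slides (first ten fields only).
ATOM_LINES = """\
ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79
ATOM 2 CA THR 1 16.967 12.784 4.338 1.00 10.80
ATOM 3 C THR 1 15.685 12.755 5.133 1.00 9.19
ATOM 8 N THR 2 15.115 11.555 5.265 1.00 7.81
ATOM 9 CA THR 2 13.856 11.469 6.066 1.00 8.31
""".splitlines()


def parse_atom(line):
    """Parse one whitespace-separated ATOM line into a dictionary."""
    _, serial, name, res_name, res_seq, x, y, z, occ, b = line.split()[:10]
    return {
        "serial": int(serial),
        "name": name,
        "res_name": res_name,
        "res_seq": int(res_seq),
        "xyz": (float(x), float(y), float(z)),
        "occupancy": float(occ),
        "b_factor": float(b),
    }


atoms = [parse_atom(line) for line in ATOM_LINES if line.startswith("ATOM")]

# Map residue number -> coordinates of its C-alpha atom.
ca = {a["res_seq"]: a["xyz"] for a in atoms if a["name"] == "CA"}

# Euclidean distance between the C-alpha atoms of residues 1 and 2.
distance = math.dist(ca[1], ca[2])
print(f"CA(1)-CA(2) distance: {distance:.2f}")
```

For these two residues the distance comes out at about 3.8, in the units of the coordinates (ångström), which is the usual spacing between consecutive C-alpha atoms in a protein chain.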
MRS Search Steps
• Select database(s) of choice
• Formulate your query
• Hit “Search”
• The result is a “query set” or “hitlist”
• Analyze the results

http://mrs.cmbi.ru.nl

MRS Database Selection

You can choose between selecting all databases or just one of them.
But think about your query first!

MRS Search options

Simply type your keywords in the keyword field and choose SEARCH. If you know the fields of the database you are searching in, you can specify your query further.
But think about your query first!

MRS Hitlist (1) and (2) [figures]

MRS Options

MRS creates a result, also called a “query set” or “hitlist”. With the result you can do different things in MRS:
• View the hits
• Blast single hit sequences
• Clustal multiple hit sequences

MRS - View Hits [figure]

Combine in MRS
• AND or & (AND is implicit)
• OR or |
• NOT or !

MRS - Options
• Home brings you back to the start page of MRS. That is the page from which you can do keyword searches.
• Blast brings you to the MRS page from which you can do Blast searches.
• Status lists all the currently indexed databases.
• Align brings you to the MRS page from which you can do Clustal alignments.
• Databank: uniprot shows the database you selected.
• Help provides some help.

Try it yourself with the exercises!

Ground rules for bioinformatics
• Don't always believe what programs tell you: they're often misleading & sometimes wrong!
• Don't always believe what databases tell you: they're often misleading & sometimes wrong!
• Don't always believe what lecturers tell you: they're sometimes wrong!
• Don't be a naive user: computers don’t do biology & bioinformatics, you do!
(freely adapted from Terri Attwood)
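The Boolean operators listed under “Combine in MRS” can be read as set operations on hitlists. The sketch below only illustrates that reading and says nothing about how MRS works internally. The keyword index `TOY_INDEX` and the helper `hits` are invented for this example.

```python
# Toy keyword index: which entries mention which word (invented data).
TOY_INDEX = {
    "hemoglobin": {"HBA_HUMAN", "HBB_HUMAN", "HBA_MOUSE"},
    "human": {"HBA_HUMAN", "HBB_HUMAN", "MYG_HUMAN"},
    "alpha": {"HBA_HUMAN", "HBA_MOUSE"},
}


def hits(word):
    """Return the hitlist (a set of entry names) for a single keyword."""
    return TOY_INDEX.get(word.lower(), set())


# hemoglobin AND human: entries found by both keywords (set intersection)
both = hits("hemoglobin") & hits("human")

# alpha OR human: entries found by at least one keyword (set union)
either = hits("alpha") | hits("human")

# hemoglobin NOT human: entries found by the first keyword only (set difference)
excluded = hits("hemoglobin") - hits("human")

print(sorted(both))      # ['HBA_HUMAN', 'HBB_HUMAN']
print(sorted(either))    # ['HBA_HUMAN', 'HBA_MOUSE', 'HBB_HUMAN', 'MYG_HUMAN']
print(sorted(excluded))  # ['HBA_MOUSE']
```

Because AND is implicit, typing two keywords next to each other narrows the hitlist, while OR widens it. That is one reason to think about the query before searching.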