Download protein sequence

Bioinformatics for Proteomics
Shu-Hui Chen (陳淑慧)
Department of Chemistry, National Cheng Kung University

Bioinformatics I

[Figure: flow of genetic information]
DNA (5' to 3') -> Transcription -> Splicing -> mRNA -> Translation -> Polypeptide -> Folding -> Protein -> Function
Between protein and function:
• Transport / Localization
• Oligomerization
• PTM (Post-Translational Modification)
Question: How do we find protein-coding regions, introns, and exons in genomic DNA sequences?

What is Proteomics?
Systematic analysis of:
• All protein sequences
• All protein expression patterns
• All protein interactions
This involves:
• Protein isolation
• Protein separation
• Protein identification
• Functional characterization of all proteins

The tools of Proteomics
Traditional protein chemistry assay methods struggle to establish identity.
Identity requires:
• Specificity of measurement (precision)
  - Mass spectrometry
  - MS-based data acquisition algorithm
• A reference for comparison
  - Protein sequence databases
  - Search algorithms

MS-based Proteomics and Bioinformatics
• MS instruments are so far not sensitive enough to resolve the proteins in a biological system based solely on the measured signals.
• MS can, however, acquire sufficient data to map a protein to a database entry when computer algorithms are used to analyze the data.
• This is the field of bioinformatics.

Instrumentation
[Figure: mass spectrometer block diagram: sample inlet, ion source, mass analyzer, data acquisition; under vacuum]
Source: "Bioanalytical Chemistry", Mikkelsen, S.R., published by John Wiley & Sons, Inc.

MS-based Protein Identification
► Mass Mapping
   Peptide Sequencing

Conventional Methodology - Expression Proteomics

Trypsin Digestion
We know that trypsin cleaves polypeptides C-terminal to basic amino acids (lysine and arginine).
-NH-CH(R1)-CO-NH-CH(R2)-CO-  --trypsin-->  -NH-CH(R1)-COOH  +  H2N-CH(R2)-CO-

Mass Spectrometry
[Figure: mass spectrum, ion intensity vs. m/z]
Protein identified by database mapping

Automated Database Search
Number 1 match: tumor necrosis factor type 1 receptor associated protein TRAP-1
(Mr): 76030.27

  1  RALRRAPALA AVPGGKPILC PRRTTAQLGP RRNPAWSLQA GRLFSTQTAE
 51  DKEEPLHSII SSTESVQGST SKHEFQAETK KLLDIVARSL YSEKEVFIRE
101  LISNASDALE KLRHKLVSDG QALPEMEIHL QTNAEKGTIT IQDTGIGMTQ
151  EELVSNLGTI ARSGSKAFLD ALQNQAEASS KIIGQFGVGF YSAFMVADRV
201  EVYSRSAAPG SLGYQWLSDG SGVFEIAEAS GVRTGTKIII HLKSDCKEFS
251  SEARVRDVVT KYSNFVSFPL YLNGRRMNTL QAIWMMDPKD VGEWQHEEFY
301  RYVAQAHDKP RYTLHYKTDA PLNIRSIFYV PDMKPSMFDV SRELGSSVAL
351  YSRKVLIQTK ATDILPKWLR FIRGVVDSED IPLNLSRELL QESALIRKLR
401  DVLQQRLIKF FIDQSKKDAE KYAKFFEDYG LFMREGIVTA TEQEVKEDIA
451  KLLRYESSAL PSGQLTSLSE YASRMRAGTR NIYYLCAPNR HLAEHSPYYE
501  AMKKKDTEVL FCFEQFDELT LLHLREFDKK KLISVETDIV VDHYKEEKFE
551  DRSPAAECLS EKETEELMAW MRNVLGSRVT NVKVTLRLDT HPAMVTVLEM
601  GAARHFLRMQ QLAKTQEERA QLLQPTLEIN PRHALIKKLN HCAQASLAWL
651  SCWWIRYTRT P

Total coverage: 33.4%

Minimal content of a « protein sequence » db
• Sequences !!
• Accession number (AC)
• Taxonomic data
• References
• ANNOTATION/CURATION
• Keywords
• Cross-references
• Documentation

SWISS-PROT/TrEMBL
• Collaboration between the SIB (CH) and EMBL/EBI (UK).
• SWISS-PROT: fully annotated (manually), non-redundant, cross-referenced, documented protein sequence database.
• TrEMBL: automatically generated from annotated EMBL coding sequences (CDS) and annotated using software tools.
http://www.expasy.org/sprot/

ExPASy Web Server
ExPASy = Expert Protein Analysis System

History for MS Searching
1993  MOWSE (Molecular Weight Search), by Pappin and Bleasby
1994  SEQUEST, by Yates and Eng
1996  MOWSE II
1997  MOWSE III
1998  MASCOT, by Matrix Science

Scoring algorithm
• Final score = -10 × log10(P), where P is the absolute probability that the observed match is a random event. For example, P = 0.001 gives a score of 30.
• E value (expected value): the number of hits one can expect to see by chance when searching a database of a particular size. A value of zero indicates that no matches would be expected by chance.
• Significant hits: at the 95% confidence level (p < 0.05), there is less than a 1 in 20 chance that the observed match is a random event.
[Figure: increasing mass tolerance; panel labels 5 and 7]

MS-based Protein Identification
   Mass Mapping
► Peptide Sequencing

Tandem Mass Spectrometry (MS/MS)
MS/MS acquisition is controlled by software settings.

Protein Identification
Peptide Sequencing using MS/MS
[Figure: CID of the precursor ion of peptide ABCDEF gives the fragment pairs A + BCDEF, AB + CDEF, ABC + DEF, ABCD + EF, and ABCDE + F. On the m/z axis the fragments A, AB, ABC, ABCD, ABCDE form a ladder, and the spacing between adjacent peaks corresponds to the residues B, C, D, E.]

Nomenclature used for CID peptide fragmentation
Low energy (eV): Q, TOF, FT
Source: "Bioanalytical Chemistry", Mikkelsen, S.R., published by John Wiley & Sons, Inc.

Protein Identification by Database Search
Trypsin Digestion
We know that trypsin cleaves polypeptides C-terminal to basic amino acids (lysine and arginine).
-NH-CH(R1)-CO-NH-CH(R2)-CO-  --trypsin-->  -NH-CH(R1)-COOH  +  H2N-CH(R2)-CO-
[Figure: mass spectrum, ion intensity vs. m/z]

Sequence Tag Approach for Peptide Sequencing
Source: "Bioanalytical Chemistry", Mikkelsen, S.R., published by John Wiley & Sons, Inc.

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.
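To make "regions of local similarity" concrete, here is a minimal sketch of a local alignment. It is not how BLAST works internally: BLAST uses a seeded heuristic for speed, whereas this sketch uses plain Smith-Waterman dynamic programming as a simpler stand-in. The function name `local_align` and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration only.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment (illustrative scoring values).

    Returns (best_score, aligned_a, aligned_b) for the best-scoring
    local alignment; '-' marks an insertion/deletion.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new alignment
                          H[i - 1][j - 1] + s,  # match or mismatch
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = local_align("MYTAILORISRICH", "MONTAILLEURESTRICHE")
    print(score)
    print(top)
    print(bottom)
```

The two strings are the toy sequences used on the alignment slide. Changing the gap penalty changes whether the result is one long gapped alignment or a short ungapped segment such as TAIL or RICH, which is the distinction between the global and local views of the same pair.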
BLAST: Basic Local Alignment Search Tool
NCBI BLAST
http://www.ncbi.nlm.nih.gov/blast/

Sequence alignments and comparison
1: MYTAILORISRICH
2: MONTAILLEURESTRICHE

Global Alignment
1: MY-TAIL--ORIS-RICH-
   ¦x ¦¦¦¦  x¦x¦ ¦¦¦¦
2: MONTAILLEURESTRICHE

Two Local Alignments
1: TAILO      1: RICH
   ¦¦¦¦x         ¦¦¦¦
2: TAILL      2: RICHE

¦ = Identity
x = Mismatch
- = Insertion / Deletion

Multiple Sequence Alignment (MSA)
Programs:
• CLUSTALW
• T_COFFEE
• MULTALIGN

HBA_CHICK   VL-SAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYFPHF-DL  48
HBAD_CHICK  ML-TAEDKKLIQQAWEKAASHQEEFGAEALTRMFTTYPQTKTYFPHF-DL  48
HBPI_CHICK  AL-TQAEKAAVTTIWAKVATQIESIGLESLERLFASYPQTKTYFPHF-DV  48
HBB_CHICK   VHWTAEEKQLITGLWGKV--NVAECGAEALARLLIVYPWTQRFFASFGNL  48
HBE_CHICK   VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFASFGNL  48
HBRH_CHICK  VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFDNFGNL  48
MYG_CHICK   GL-SDQEWQQVLTIWGKVEADIAGHGHEVLMRLFHDHPETLDRFDKFKGL  49
Conservation line (column spacing lost in extraction): .... . ..* . .. * * * *.. .* * * * ..

HBA_CHICK   SH-----GSAQIKGHGKKVVAALIEAANHIDDIAGTLSKLSDLHAHKLRV  93
HBAD_CHICK  SP-----GSDQVRGHGKKVLGALGNAVKNVDNLSQAMAELSNLHAYNLRV  93
HBPI_CHICK  SQ-----GSVQLRGHGSKVLNAIGEAVKNIDDIRGALAKLSELHAYILRV  93
HBB_CHICK   SSPTAILGNPMVRAHGKKVLTSFGDAVKNLDNIKNTFSQLSELHCDKLHV  98
HBE_CHICK   SSPTAIMGNPRVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCDKLHV  98
HBRH_CHICK  SSPTAIIGNPKVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCEKLHV  98
MYG_CHICK   KTPDQMKGSEDLKKHGATVLTQLGKILKQKGNHESELKPLAQTHATKHKI  99
Conservation line (column spacing lost in extraction): . *. .. ** .*.. . . .. .. . *.. * ..

HBA_CHICK   DPVNFKLLGQCFLVVVAIHHPAALTPEVHASLDKFLCAVGTVLTAKYR--  141
HBAD_CHICK  DPVNFKLLSQCIQVVLAVHMGKDYTPEVHAAFDKFLSAVSAVLAEKYR--  141
HBPI_CHICK  DPVNFKLLSHCILCSVAARYPSDFTPEVHAEWDKFLSSISSVLTEKYR--  141
HBB_CHICK   DPENFRLLGDILIIVLAAHFSKDFTPECQAAWQKLVRVVAHALARKYH--  146
HBE_CHICK   DPENFRLLGDILIIVLASHFARDFTPACQFAWQKLVNVVAHALARKYH--  146
HBRH_CHICK  DPENFRLLGNILIIVLAAHFTKDFTPTCQAVWQKLVSVVAHALAYKYH--  146
MYG_CHICK   PVKYLEFISEVIIKVIAEKHAADFGADSQAAMKKALELFRNDMASKYKEF  149
Conservation line (column spacing lost in extraction): . .... . .* . . ... . .* . .. **.

HBA_CHICK   ----  141
HBAD_CHICK  ----  141
HBPI_CHICK  ----  141
HBB_CHICK   ----  146
HBE_CHICK   ----  146
HBRH_CHICK  ----  146
MYG_CHICK   GFQG  153

Consensus length: 154; Identity: 19 (12.3%); Similarity: 51 (33.1%)
'*' marks a position in the alignment that is perfectly conserved.
'.' marks a position that is well conserved.

Searching databases with multiple alignments
PSI-BLAST: Position-Specific Iterative BLAST (Altschul et al., 1997)
1. Starting with a single sequence, PSI-BLAST searches a database using BLAST and builds a multiple sequence alignment and a profile.
2. The profile is then used to search the protein database again.
3. Running the program several times can further refine the profile and increase search sensitivity.

Error tolerance search
[Figure: searches at tolerance settings 0.2 Da/0.2 Da, 0.05 Da/0.05 Da, and 0.5 Da/0.5 Da; values shown: 32, 27, 33]

MS/MS Scan Functions
[Figure: triple quadrupole with a collision chamber (gas, N2) between Q1 and Q3; each quadrupole operates either in single mass transmission mode (Fix) or in mass scan mode (Scan)]

Scan mode                            Q1    Q3
Product Ion Scan (PI)                Fix   Scan
Multiple Reaction Monitoring (MRM)   Fix   Fix
Precursor Ion Scan (PS)              Scan  Fix
Neutral Loss Scan (NL)               Scan  Scan

IP + MS/ID for searching protein interaction complexes

Conclusions
• Protein identification by MS is a key element of proteomics, and the identification process is an informatics-based methodology.
• MS combined with sequence databases represents a huge leap for protein biochemistry: a large-scale analysis approach.
• Biochemical manipulation combined with protein identification can provide functional information about proteins.
• Bioinformatics tools are needed to link proteomics data to protein interactions and biological pathways.
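Worked example: mass mapping
The mass-mapping workflow from the earlier slides (digest with trypsin, measure peptide masses, compare against masses predicted from a database sequence) can be sketched as follows. This is a minimal illustration, not the algorithm used by MOWSE, SEQUEST, or MASCOT. The function names are hypothetical, the "no cleavage before proline" rule is a common convention that the slides do not state, and the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # added for a singly protonated ion [M+H]+


def tryptic_digest(seq, proline_rule=True):
    """Cleave C-terminal to K or R.

    proline_rule=True skips cleavage when the next residue is P
    (an assumption; the slides only state 'C-terminal to basic residues').
    """
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR":
            nxt = seq[i + 1] if i + 1 < len(seq) else ""
            if proline_rule and nxt == "P":
                continue
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_masses(observed, peptides, tol):
    """Pair each observed [M+H]+ value with peptides within +/- tol Da."""
    hits = []
    for obs in observed:
        for pep in peptides:
            if abs(obs - (peptide_mass(pep) + PROTON)) <= tol:
                hits.append((obs, pep))
    return hits


if __name__ == "__main__":
    # First 30 residues of the TRAP-1 fragment shown in the search result.
    fragment = "RALRRAPALAAVPGGKPILCPRRTTAQLGP"
    for pep in tryptic_digest(fragment):
        print(pep, round(peptide_mass(pep) + PROTON, 4))
```

Widening `tol` lets more observed masses match, which is why the mass tolerance setting affects the number of hits and the chance of random matches.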
