Bioinformatics
Jiří Vondrášek, Ústav organické chemie a biochemie (Institute of Organic Chemistry and Biochemistry), [email protected]
Jan Pačes, Ústav molekulární genetiky (Institute of Molecular Genetics), [email protected]
http://bio.img.cas.cz/kurs

Prediction of secondary structure in proteins

Secondary structure elements in proteins
[Figure: β-sheets, atom representations]
[Figure: α-helices, atom representations]

The Chou-Fasman method
• Uses tables of conformational parameters derived from known structures and CD spectroscopy.
• The table gives, for every amino acid, a propensity for each secondary structure element.

1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(a-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(a-helix) > P(b-sheet) for that segment, the segment can be assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify a region where 3 out of 5 of the residues have P(b-sheet) > 100. That region is declared a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(b-sheet) < 100 is reached. That is declared the end of the beta-sheet. Any segment of the region located by this procedure is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > P(a-helix) for that region.
5. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(a-helix) > P(b-sheet) for that region. It is a beta-sheet if the average P(b-sheet) > P(a-helix) for that region.
6. To identify a bend at residue number j, calculate the value
   p(t) = f(j) × f(j+1) × f(j+2) × f(j+3)
   where f(j) is the f(i) value of residue j, f(j+1) is the f(i+1) value of residue j+1, f(j+2) is the f(i+2) value of residue j+2, and f(j+3) is the f(i+3) value of residue j+3. A beta-turn is predicted at that location if:
   (1) p(t) > 0.000075;
   (2) the average value of P(turn) in the tetrapeptide is > 1.00 (that is, > 100 on the scale of the table below); and
   (3) the averages for the tetrapeptide obey the inequality P(a-helix) < P(turn) > P(b-sheet).

CHOU-FASMAN RULES FOR ALPHA HELIX
(thresholds are on the unit scale; 1.0 corresponds to 100 in the table below)
• Helical residues = > 1.0 for helix.
• Helical breakers = < 1.0.
• 4/6 > 1.0 nucleates helix.
• Helix continues both ways until 4 contiguous residues average < 1.0.
• Special rules for Proline.
• Segment 5 residues or longer and P(a) > P(b) = helix.

CHOU-FASMAN RULES FOR BETA STRAND
• Beta residues = > 1.0 for strand.
• Beta breakers = < 1.0.
• 3/5 > 1.0 nucleates strand.
• Strand continues both ways until 4 contiguous residues average < 1.0.
• Segment with average P(b) > 1.05 and P(b) > P(a) = strand.

Name            P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine          142    83     66    0.060  0.076   0.035   0.058
Arginine          98    93     95    0.070  0.106   0.099   0.085
Aspartic Acid    101    54    146    0.147  0.110   0.179   0.081
Asparagine        67    89    156    0.161  0.083   0.191   0.091
Cysteine          70   119    119    0.149  0.050   0.117   0.128
Glutamic Acid    151    37     74    0.056  0.060   0.077   0.064
Glutamine        111   110     98    0.074  0.098   0.037   0.098
Glycine           57    75    156    0.102  0.085   0.190   0.152
Histidine        100    87     95    0.140  0.047   0.093   0.054
Isoleucine       108   160     47    0.043  0.034   0.013   0.056
Leucine          121   130     59    0.061  0.025   0.036   0.070
Lysine           114    74    101    0.055  0.115   0.072   0.095
Methionine       145   105     60    0.068  0.082   0.014   0.055
Phenylalanine    113   138     60    0.059  0.041   0.065   0.065
Proline           57    55    152    0.102  0.301   0.034   0.068
Serine            77    75    143    0.120  0.139   0.125   0.106
Threonine         83   119     96    0.086  0.108   0.065   0.079
Tryptophan       108   137     96    0.077  0.013   0.064   0.167
Tyrosine          69   147    114    0.082  0.065   0.114   0.125
Valine           106   170     50    0.062  0.048   0.028   0.053

Chou-Fasman propensities (partial table)

Amino acid   Pa     Pb     Pt
Glu          1.51   0.37   0.74
Met          1.45   1.05   0.60
Ala          1.42   0.83   0.66
Val          1.06   1.70   0.50
Ile          1.08   1.60   0.50
Tyr          0.69   1.47   1.14
Pro          0.57   0.55   1.52
Gly          0.57   0.75   1.56

Conformation        Phi (deg)  Psi (deg)  Omega (deg)  Residues per turn  Translation per residue (Å)
Antiparallel beta     -139       +135       -178            2.0                 3.4
Parallel beta         -119       +113        180            2.0                 3.2
Alpha helix            -57        -47        180            3.6                 1.5
3-10 helix             -49        -26        180            3.0                 2.0
Pi helix               -57        -70        180            4.4                 1.15
Polyproline I          -83       +158          0            3.33                1.9
Polyproline II         -78       +149        180            3.0                 3.12
Polyproline III        -80       +150        180            3.0                 3.1

Garnier-Osguthorpe-Robson (GOR)
• Uses a table of propensities derived primarily from crystal structures.
• The table contains one probability for each structure and each amino acid at every position of a window 17 amino acids long.

Information theory applied to structure prediction
What information do we gain about the probability that residue j is in a given state (H, E, T, C) from knowing which residue occupies position j+m (|m| ≤ 8), regardless of what residue j itself is? For m = 0 this is similar to Chou-Fasman.

[Table: GOR helix (H) information values for the residues Asn, Glu, Asp, Glu, Leu, Lys, His, Gly at window offsets m = -7 to +7, with their sums]

Prediction accuracy
• Both methods have an accuracy of about 55-65%.
• The main reason is that they overweight local context at the expense of global context, that is, the type of protein.
• The same amino acids can adopt different conformations in a cytoplasmic protein and in a membrane protein.

PSIPred output window

PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence

Conf: 952010265389973742568774158851022313889854542110122124543202
Pred: CCCCEECCCEEEEEECCHHHHHHHHCCCCCHHHHHCCCCCCCCCCEEECCCCEEEEEEEC
  AA: PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYD
              10        20        30        40        50        60

Conf: 102122066401257647861344327778750531369
Pred: CEEEEECCCCCEEEEEECCCCHHHHHHHHHHHHCCCCCC
  AA: QIIIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
              70        80        90

Calculate PostScript, PDF and JPEG graphical output for this result using:
http://bioinf.cs.ucl.ac.uk/cgi-bin/psipred/graphics/nph-view.cgi?id=103942638010041

1. Avoid the Chou and Fasman algorithm.
2. Note the accuracy of the algorithms on standard benchmarks and in "real life" situations.
3. Use methods based on multiple alignments. Check the alignments carefully (avoid redundancies).
4. Use several independent methods of similar accuracy.
5. In case of disagreement, trust PHD, Jnet and Psipred.

Secondary structure elements: problem formulation
• Given a protein sequence:
  – NWVLSTAADMQGVVTDGMASGLDKD...
• Predict the sequence of secondary structure states:
  – LLEEEELLLLHHHHHHHHHHLHHHL...
• "3-state" problem: {ARNDCQEGHILKMFPSTWYV}^n -> {L,H,E}^n

Prediction of secondary structure elements in proteins
Motivation for predicting secondary structure elements:
- efficient conformational sampling for 3D protein folding
- improvement of other sequence and structure analysis methods:
  : sequence alignment
  : homology and "threading" modelling (CASP)
  : analysis of experimental data
  : protein design

Even for proteins of known structure, assigning secondary structure is neither unambiguous nor simple.
The two basic classification programs and procedures for assigning SS from crystal structures are DSSP and STRIDE. The results of the two procedures differ only slightly.

Reference
Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983 Dec;22(12):2577-637.
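The 3-state formulation above makes it easy to compare a predicted state string with a reference assignment, for example one produced by DSSP or STRIDE. The following is only a minimal sketch of per-residue agreement. The function name three_state_accuracy is hypothetical, the example strings are made up for illustration, and treating 'C' (coil, as in the PSIPRED output) and 'L' as the same state is an assumption.

```python
def three_state_accuracy(predicted: str, reference: str) -> float:
    """Return the fraction of positions where predicted and reference states agree."""
    # Assumption: 'C' (coil) and 'L' (loop) denote the same third state.
    pred = predicted.upper().replace("C", "L")
    ref = reference.upper().replace("C", "L")
    if len(pred) != len(ref):
        raise ValueError("predicted and reference strings must have equal length")
    if not pred:
        raise ValueError("state strings must not be empty")
    unknown = (set(pred) | set(ref)) - set("LHE")
    if unknown:
        raise ValueError(f"unexpected state symbols: {sorted(unknown)}")
    # Count positions with identical states and normalise by the length.
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / len(pred)


if __name__ == "__main__":
    # Hypothetical strings, not taken from any real protein.
    predicted = "CCHHHHEEC"
    reference = "LLHHHEEEL"
    print(f"agreement: {three_state_accuracy(predicted, reference):.2f}")
```

A score computed this way depends on which reference assignment is used, since DSSP and STRIDE do not agree on every residue.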
STRIDE: http://www.embl-heidelberg.de/argos/stride/stride_info.html
DSSP: http://www.cmbi.kun.nl/gv/dssp/

The occurrence of amino acids and their distribution across the individual secondary structure elements should serve as a guide when predicting SS elements (degree of determination).

Classical methods
- Chou-Fasman
- GOR (Garnier-Osguthorpe-Robson)

Adaptive methods
- Neural network method: a trained network uses a set of known proteins to predict the desired structure from sequence data <nnpredict>
- Method based on homology of the query sequence to known proteins <SOPM> <PHD>

Neural Network methods
• A neural network with multiple layers is presented with known sequences and structures; the network is trained until it can predict those structures given those sequences.
• This allows the network to adapt as needed (it can consider neighboring residues, like GOR).

The different approaches
Only the original works and the more recent implementations are presented here.

• First Generation: information comes from a single residue of a single sequence.
• Second Generation: local interactions are taken into account.
• Third Generation: information coming from homologous sequences is incorporated.

Single residue statistics and explicit rules:
Chou and Fasman 1974; GOR1 1978; Lim 1974; GOR3 1987; Zvelebil et al. 1987; PREDATOR 1996; DSC 1996

Nearest-neighbors and neural-network based prediction:
Levin et al. 1986; Nishikawa and Ooi 1986; Holley and Karplus 1989; Qian and Sejnowski 1988; Yi and Lander 1993; NNSSP 1995; PHD 1993; Jnet 1999; Psipred 1999
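As an illustration of the single-residue statistics used by Chou and Fasman, the helix nucleation rule (4 out of 6 contiguous residues with P(a-helix) > 100) can be sketched as follows. This is a sketch under stated assumptions, not a full implementation: it covers only the nucleation scan, leaving out helix extension, the proline rules and the comparison with P(b-sheet). The P(a-helix) values are the ones given in the Chou-Fasman table; the one-letter residue codes and the names P_ALPHA and helix_nuclei are introduced here for illustration.

```python
# P(a-helix) values on the x100 scale, keyed by one-letter residue code.
# The one-letter keys are an assumption; the table itself uses full names.
P_ALPHA = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70,
    "E": 151, "Q": 111, "G": 57, "H": 100, "I": 108,
    "L": 121, "K": 114, "M": 145, "F": 113, "P": 57,
    "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}


def helix_nuclei(sequence, window=6, min_formers=4, threshold=100):
    """Return 0-based start indices of windows that satisfy the nucleation rule.

    A window qualifies when at least `min_formers` of its `window` residues
    have P(a-helix) strictly greater than `threshold`.
    Raises KeyError for residue codes that are not in P_ALPHA.
    """
    seq = sequence.upper()
    starts = []
    # Slide a fixed-size window over the sequence; short sequences yield no windows.
    for i in range(len(seq) - window + 1):
        formers = sum(P_ALPHA[aa] > threshold for aa in seq[i:i + window])
        if formers >= min_formers:
            starts.append(i)
    return starts


if __name__ == "__main__":
    # First 30 residues of the target sequence shown in the PSIPRED output.
    print(helix_nuclei("PQITLWQRPLVTIKIGGQLKEALLDTGADD"))
```

Overlapping qualifying windows would normally be merged and extended in both directions before a helix is assigned; that step is intentionally not shown here.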