Segmenting G-Protein Coupled Receptors using Language Models
Betty Yee Man Cheng, Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell

The Segmentation Problem
* Segment a protein sequence according to its secondary structure
* Related to secondary structure prediction
* Often viewed as a classification problem
* Best performance so far is 78%
* A large portion of the problem lies with the boundary cases

Limited Domain: GPCRs
* G-Protein Coupled Receptors
* One of the largest superfamilies of proteins known
* 2955 sequences and 1654 fragments found so far
* Transmembrane proteins
* Play a central role in many diseases
* Only 1 protein has been crystallized

Distinguishing Characteristic of GPCRs
* The order of segments is known: N-terminus, helix, intracellular loop, extracellular loop, C-terminus

Methodology: Topicality Measures
* Based on "Statistical Models for Text Segmentation" by D. Beeferman, A. Berger, and J. Lafferty [1]
* Topicality measures are log-ratios of 2 different models
* Short-range model versus long-range model in topic segmentation in text
* Models of different segments in proteins

Short-Range Model vs. Long-Range Model

Problem: Not Enough Data!

Family                          Number of Proteins
Class A                         1081
Class B                         83
Class C                         28
Class D                         11
Class E                         4
Class F                         45
Drosophila Odorant Receptors    31
Ocular Albinism Proteins        2
Orphan A                        35
Orphan B                        2
Plant Mlo Receptors             10
Nematode Chemoreceptors         1
Total                           1333

* Over 90% are shorter than 750 amino acids
* Average sequence length is 441 amino acids
* Average segment length is 25 amino acids

3 Topicality Models in GPCRs
* Previous segmentation experiments with mutual information and Yule's measures have shown a similarity between:
  - All helices
  - All intracellular loops and the C-terminus
  - All extracellular loops and the N-terminus
* No two helices or loops occur consecutively
* 3 models instead of 15, trained across all families of GPCRs

Model of a Segment
* Each model is an interpolated model of 6 basic probability models:
  - Unigram model (20 amino acids)
  - Bi-gram model (20 amino acids)
  - Tri-gram model (20 amino acids)
  - 3 tri-gram models on reduced alphabets of 11, 3, and 2 groups:
    LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P
    LVIMFYAGCW, KREDH, STNQP
    LVIMFYAGCW, KREDHSTNQP

Why Use Reduced Alphabets?
Figure 1. Snake-like diagram of the human β2-adrenergic receptor.

Interpolation Oddity
* The interpolation weights were trained to maximize the sum, over the training data, of the probability assigned to the amino acid at each position
* First attempt: all of the weight went to the tri-gram model with the smallest reduced alphabet
* Reason: the smaller vocabulary size means the probability mass is spread less thinly

Interpolation Oddity, Take 2
* Normalize the probabilities from the reduced alphabet models by group size
* E.g., for LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P: use P(L | context) / 4 and P(F | context) / 2
* This time, all of the weight went to the tri-gram model with the normal 20 amino acid alphabet

An Example: D3DR_RAT
* Class A dopamine receptor

Figure 3 - Graph of the log probability of the amino acid at each position in the D3DR_RAT sequence under the 3 segment models (extracellular, helix, intracellular). The 3 segment models fluctuate frequently in their performance, making it difficult to detect which model is doing best and where the boundaries should be drawn.
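The per-position curves in Figure 3 come from evaluating each of the three interpolated segment models at every residue. As a rough illustration of that computation, here is a minimal Python sketch assuming count-based n-gram models with add-one smoothing; the function names, the smoothing, and the way contexts are truncated at the start of a sequence are assumptions, not the poster's actual implementation.

```python
import math
from collections import defaultdict

# The 11-group reduced alphabet listed on the poster; each amino acid maps
# to the string naming its group (e.g. 'L' -> 'LVIM').
REDUCED_11 = {aa: group
              for group in ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]
              for aa in group}

def ngram_model(sequences, n, alphabet=None):
    """Build P(symbol | context) from counts, with add-one smoothing (an assumption)."""
    joint, context, symbols = defaultdict(int), defaultdict(int), set()
    for seq in sequences:
        s = [alphabet[a] for a in seq] if alphabet else list(seq)
        symbols.update(s)
        for i in range(n - 1, len(s)):
            ctx, sym = tuple(s[i - n + 1:i]), s[i]
            joint[ctx, sym] += 1
            context[ctx] += 1
    vocab = len(symbols)
    return lambda ctx, sym: (joint[ctx, sym] + 1) / (context[ctx] + vocab)

def position_log_prob(seq, i, basic_models, weights):
    """Interpolated log-probability of the amino acid at position i of seq.
    basic_models is a list of (prob_fn, n, alphabet_or_None) triples, one per
    basic model; weights are the interpolation weights (summing to 1)."""
    p = 0.0
    for (prob, n, alphabet), w in zip(basic_models, weights):
        s = [alphabet[a] for a in seq] if alphabet else list(seq)
        ctx = tuple(s[max(0, i - n + 1):i])  # shorter context near the start
        p += w * prob(ctx, s[i])
    return math.log(p)
```

In the poster's setup, each of the three segment models (helices; intracellular loops plus C-terminus; extracellular loops plus N-terminus) interpolates six such basic models: unigram, bi-gram and tri-gram over the 20-letter alphabet, plus tri-grams over the 11-, 3- and 2-group reduced alphabets. The "Take 2" normalization would divide each reduced-alphabet probability by its group size (e.g. by len(REDUCED_11[seq[i]]) for the 11-group model) before interpolating.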
D3DR_RAT @ Positions 0-100

Figure 4 - Enlargement of the graph in Figure 3 for amino acid positions 0-100 (N-terminus, helix, intracellular loop, helix). The true segment boundaries are marked with dotted vertical lines.

Running Averages & Look-Ahead

Figure 5 - Running averages of the log probabilities of each amino acid between positions 0 and 100 in the D3DR_RAT sequence, with predicted and true boundaries marked. Running averages were computed using a window-size of 2 and boundaries were predicted using a look-ahead of 5. The predicted boundaries are indicated by dotted vertical lines at positions 38, 53, 65 and 88, while the true boundaries are indicated by dashed vertical lines at positions 32, 55, 66 and 92.

Predicted Boundaries for D3DR_RAT
* Window-size of 2 from the current amino acid
* Look-ahead interval of 5 amino acids

Predicted boundaries:       38  53  65  88  107  135  150  171  188  212  374  394  413  431
Offset:                      6   2   1   4    3    9    1    1    3    3    1    3    1    3
Synthetic true boundaries:  32  55  66  92  104  126  149  172  185  209  375  397  412  434

The Only Truth: OPSD_HUMAN
* The only GPCR that has been crystallized so far

Predicted boundaries:  37  61  72  97  113  130  153  173  201  228  250  275  283  307
Offset:                 1   0   1   1    0    3    1    3    1    2    2    1    1    2
True boundaries:       36  61  73  98  113  133  152  176  202  230  252  276  284  309

* Average offset for this protein is 1.357 amino acids

Evaluation Metrics
* Accuracy
  - Score 1: perfect match
  - Score 0.5: offset of 1
  - Score 0.25: offset of 2
  - Score 0 otherwise
* Offset: absolute difference between the predicted and true boundary positions
* 10-fold cross validation

Results: Trained Interpolated Models

Figure 6 - Results of our approach using trained interpolation weights. Window-size: 2; look-ahead interval: 5.

Test Set   Size   Accuracy   Offset
                             Average    E-H        H-I        I-H        H-E
A          130    0.2383     49.9698    48.4827    48.5115    51.1564    52.7103
B          130    0.2691     21.3005    22.1250    21.6981    19.5744    21.3974
C          130    0.2426     34.4385    34.9635    34.6077    33.4205    34.5308
D          130    0.2353     22.9654    23.0442    22.3865    21.8949    24.7026
E          130    0.2501     34.9154    35.6519    35.6808    33.0051    34.8231
F          130    0.2269     21.5857    22.7269    21.9135    18.9513    22.2615
G          130    0.2343     32.1989    32.1808    31.6827    31.3590    33.7513
H          130    0.2250     42.7929    43.5135    43.4462    41.2103    42.5436
I          129    0.2438     33.1179    32.0872    32.0213    33.6512    35.4212
J          129    0.2445     62.1717    62.2519    62.5039    60.8269    62.9664
Overall    1298   0.2410     35.5270    35.6851    35.4270    34.4854    36.4913

Distribution of Offset between Predicted and Synthetic True Boundary
* Removing the 10% of proteins with the worst average offset causes the average offset for the dataset to drop to 10.51.
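The window-size and look-ahead parameters quoted in Figures 5-8 drive the boundary search. The poster does not spell out the exact decision rule, so the sketch below is only one plausible reading: smooth each segment model's per-position log probabilities with a small forward window, then place a boundary where the model of the next expected segment overtakes the current one and stays ahead for the whole look-ahead interval. The function names and the "stays ahead" criterion are assumptions.

```python
def running_average(scores, window=2):
    """Average each position's log probability with the next `window` positions
    (reading 'window-size 2 from the current amino acid' as a forward window)."""
    return [sum(scores[i:i + window + 1]) / len(scores[i:i + window + 1])
            for i in range(len(scores))]

def predict_boundaries(logprobs, segment_order, window=2, lookahead=5):
    """logprobs: dict mapping model name ('extracellular', 'helix',
    'intracellular') to per-position log probabilities for one sequence.
    segment_order: the known GPCR segment order expressed in model names,
    e.g. ['extracellular', 'helix', 'intracellular', 'helix', ...]."""
    smoothed = {name: running_average(s, window) for name, s in logprobs.items()}
    length = len(next(iter(smoothed.values())))
    boundaries, seg = [], 0
    for i in range(length - lookahead):
        if seg + 1 >= len(segment_order):
            break
        cur, nxt = segment_order[seg], segment_order[seg + 1]
        # Assumed rule: switch segments when the next segment's model scores
        # higher over the entire look-ahead interval.
        if all(smoothed[nxt][j] > smoothed[cur][j] for j in range(i, i + lookahead)):
            boundaries.append(i)
            seg += 1
    return boundaries
```

The Accuracy and Offset columns in the result tables follow the scoring scheme from the Evaluation Metrics list; a minimal sketch, pairing the k-th predicted boundary with the k-th true boundary (the helper names are hypothetical):

```python
def boundary_accuracy(predicted, true):
    """1 for an exact match, 0.5 for an offset of 1, 0.25 for an offset of 2,
    0 otherwise; averaged over all boundary pairs."""
    score = {0: 1.0, 1: 0.5, 2: 0.25}
    return sum(score.get(abs(p - t), 0.0) for p, t in zip(predicted, true)) / len(true)

def average_offset(predicted, true):
    """Mean absolute difference between predicted and true boundary positions."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)

# With the OPSD_HUMAN boundaries from the poster:
pred = [37, 61, 72, 97, 113, 130, 153, 173, 201, 228, 250, 275, 283, 307]
true = [36, 61, 73, 98, 113, 133, 152, 176, 202, 230, 252, 276, 284, 309]
print(average_offset(pred, true))     # ~1.357, matching the poster
print(boundary_accuracy(pred, true))  # ~0.446 under this pairing assumption
```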
Results: Using All Probability Models

Figure 7 - Results of our approach using pre-set model weights in the interpolation: 0.1 for the unigram and bi-gram models, 0.2 for each of the tri-gram models. Running averages were computed over a window-size of 5 and a look-ahead interval of 4 was used.

Test Set   Size   Accuracy   Offset
                             Average    E-H        H-I        I-H        H-E
A          130    0.2309     64.2923    63.9038    63.6750    63.4359    66.4897
B          130    0.2291     33.1368    34.1462    33.5077    30.9077    33.5256
C          130    0.2352     45.0154    45.4231    45.1115    43.3744    45.9846
D          130    0.2223     31.2264    31.3096    30.9365    29.8333    32.8949
E          130    0.2137     51.0593    52.8019    52.1962    47.3000    50.9795
F          130    0.2468     27.1764    27.9519    27.7500    24.8923    27.6615
G          130    0.2169     40.4791    41.1673    40.5558    38.1846    41.7538
H          130    0.2118     57.1110    56.8558    56.4673    56.3179    59.1026
I          129    0.2193     39.3272    41.3353    40.1143    35.4600    39.4677
J          129    0.2014     83.3162    84.2655    84.4302    79.9018    83.9793
Overall    1298   0.2228     47.1923    47.8931    47.4517    44.9412    48.1631

Results: Using Only Tri-gram Models

Figure 8 - Results of our approach using pre-set model weights in the interpolation: 0.25 for each of the tri-gram models. Window-size of 4 and a look-ahead interval of 4.

Test Set   Size   Accuracy   Offset
                             Average    E-H        H-I        I-H        H-E
A          130    0.2234     70.7082    70.8231    70.7000    69.2359    72.0385
B          130    0.2462     33.4071    34.0231    33.4731    31.4615    34.4436
C          130    0.2359     45.6275    45.6019    45.6173    44.5077    46.7949
D          130    0.2224     31.7533    32.0019    32.0731    30.0410    32.7077
E          130    0.2271     51.1978    53.0673    52.7346    47.3308    50.5231
F          130    0.2286     29.1319    30.7000    30.1077    26.0872    28.7846
G          130    0.2363     43.0967    43.1115    42.4288    41.8923    45.1718
H          130    0.2310     64.5154    64.3077    64.0923    63.4308    66.4410
I          129    0.2251     41.9873    43.5504    43.0581    38.5297    41.9328
J          129    0.2168     86.6235    87.2209    87.6919    83.6951    87.3308
Overall    1298   0.2293     49.7825    50.4178    50.1743    47.6004    50.5953

Conclusions
* Average accuracy of 0.241, which corresponds to an offset of roughly 2 on average
* But the average offsets are much higher
* Missing a boundary has detrimental effects on the prediction of the remaining boundaries in the sequence, especially with a small segment
* The large offsets are concentrated in a small number of proteins

Future Work
* Cue words
  - Unigrams, bi-grams, tri-grams and 4-grams in a window of +/- 25 amino acids from a boundary
* Long-range contact
  - Distribution tables of how likely 2 amino acids are to be in long-range contact with each other
* Evaluation
  - How much homology is needed between training and testing data?

References
1. Doug Beeferman, Adam Berger, and John Lafferty. "Statistical Models for Text Segmentation." Machine Learning, special issue on Natural Language Learning, C. Cardie and R. Mooney, eds., 34(1-3), pp. 177-210, 1999. http://www-2.cs.cmu.edu/~lafferty/ps/ml-final.ps
2. F. Campagne, J.M. Bernassau, and B. Maigret. Viseur program (Release 2.35). Copyright 1994, 1995, 1996, Fabien Campagne. All Rights Reserved.