Segmenting G-Protein Coupled
Receptors using Language Models
Betty Yee Man Cheng
Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman
Jaime Carbonell
The Segmentation Problem

- Segment a protein sequence according to its secondary structure
- Related to secondary structure prediction
  - Often viewed as a classification problem
  - The best accuracy reported so far is 78%
  - A large portion of the errors lie in the boundary cases
Limited Domain: GPCRs

- G-Protein Coupled Receptors
- One of the largest superfamilies of proteins known
  - 2955 sequences and 1654 fragments found so far
- Transmembrane proteins
- Play a central role in many diseases
- Only 1 GPCR has been crystallized
Distinguishing Characteristic of GPCRs

- The order of the segments is known:
  - N-terminus
  - Helix
  - Intracellular loop
  - Extracellular loop
  - C-terminus
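Since GPCRs have 7 transmembrane helices with an extracellular N-terminus, the known segment order can be enumerated mechanically. A minimal sketch (the function name and the 7-helix default are assumptions; the segment labels are from the slide):

```python
def gpcr_segment_order(n_helices=7):
    """Enumerate the fixed segment order of a GPCR: an extracellular
    N-terminus, then helices alternating with intracellular and
    extracellular loops, ending in the C-terminus."""
    segments = ["N-terminus"]
    for i in range(n_helices):
        segments.append("Helix")
        if i < n_helices - 1:
            # The N-terminus is extracellular, so the loop after the
            # first helix is intracellular, and loops alternate from there.
            segments.append("Intracellular loop" if i % 2 == 0 else "Extracellular loop")
    segments.append("C-terminus")
    return segments

print(len(gpcr_segment_order()))  # 15 segments in total
```

With 7 helices this yields 15 distinct segments, which is where the "3 models instead of 15" count later in the talk comes from.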
Methodology: Topicality Measures

- Based on "Statistical Models for Text Segmentation" by D. Beeferman, A. Berger, and J. Lafferty
- Topicality measures are log-ratios of 2 different models
  - Short-range model versus long-range model in topic segmentation of text
  - Models of different segments in proteins
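The log-ratio idea can be sketched in a few lines. This is a toy illustration, not the talk's implementation: the two dictionaries are hypothetical stand-ins for segment models, and the 1e-9 floor is an assumed substitute for real smoothing.

```python
import math

def topicality(sequence, model_a, model_b):
    """Log-ratio of two probability models over a stretch of sequence;
    positive values mean model_a explains the residues better."""
    score = 0.0
    for aa in sequence:
        # Floor unseen residues to avoid log(0); real models would smooth.
        score += math.log(model_a.get(aa, 1e-9)) - math.log(model_b.get(aa, 1e-9))
    return score

# Toy stand-ins for two segment models (hypothetical numbers):
helix_model = {"L": 0.6, "S": 0.4}   # favours hydrophobic leucine
loop_model = {"L": 0.3, "S": 0.7}    # favours polar serine
print(topicality("LLLS", helix_model, loop_model) > 0)  # True: helix fits better
```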
Short-Range Model vs. Long-Range Model
Problem - Not Enough Data!

Family Name                    Number of Proteins
Class A                        1081
Class B                        83
Class C                        28
Class D                        11
Class E                        4
Class F                        45
Drosophila Odorant Receptors   31
Ocular Albinism Proteins       2
Orphan A                       35
Orphan B                       2
Plant Mlo Receptors            10
Nematode Chemoreceptors        1
Total                          1333

- Over 90% of sequences are shorter than 750 amino acids
- Average sequence length is 441 amino acids
- Average segment length is 25 amino acids
3 Topicality Models in GPCRs

- Previous segmentation experiments with mutual information and Yule's measures have shown a similarity between:
  - All helices
  - All intracellular loops and the C-terminus
  - All extracellular loops and the N-terminus
- No two helices or loops occur consecutively
- 3 models instead of 15, trained across all families of GPCRs
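The pooling of 15 segments into 3 models can be written down directly. A minimal sketch (the function and the model names "helix" / "intracellular" / "extracellular" are assumed labels; the grouping itself is from the slide):

```python
def model_for(segment):
    """Map a GPCR segment to one of the 3 pooled topicality models:
    helices; intracellular loops plus the C-terminus; extracellular
    loops plus the N-terminus."""
    if segment == "Helix":
        return "helix"
    if segment in ("Intracellular loop", "C-terminus"):
        return "intracellular"
    if segment in ("Extracellular loop", "N-terminus"):
        return "extracellular"
    raise ValueError(f"unknown segment: {segment}")

print(model_for("C-terminus"))  # intracellular
```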
Model of a Segment

- Each model is an interpolated model of 6 basic probability models:
  - Unigram model (20 amino acids)
  - Bi-gram model (20 amino acids)
  - Tri-gram model (20 amino acids)
  - 3 tri-gram models on reduced alphabets of 11, 3, and 2 letters:
    - LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P
    - LVIMFYAGCW, KREDH, STNQP
    - LVIMFYAGCW, KREDHSTNQP
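Rewriting a sequence in a reduced alphabet is a simple table lookup. A sketch using the groups from the slide (representing each group by its first letter is an assumed convention; the example sequence is arbitrary):

```python
# The three reduced alphabets from the slide, written as residue groups.
ALPHABET_11 = ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]
ALPHABET_3 = ["LVIMFYAGCW", "KREDH", "STNQP"]
ALPHABET_2 = ["LVIMFYAGCW", "KREDHSTNQP"]

def reduction_table(groups):
    """Map each amino acid to the first letter of its group."""
    return {aa: group[0] for group in groups for aa in group}

def reduce_sequence(seq, groups=ALPHABET_11):
    """Rewrite a protein sequence in the given reduced alphabet."""
    table = reduction_table(groups)
    return "".join(table[aa] for aa in seq)

print(reduce_sequence("MKTAYIAK"))              # LKSAFLAK
print(reduce_sequence("MKTAYIAK", ALPHABET_2))  # LKKLLLLK
```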
Why Use Reduced Alphabets?

Figure 1. Snake-like diagram of the human β2-adrenergic receptor.
Interpolation Oddity

- Weights were trained so that the sum of the probability assigned to the amino acid at each position in the training data is maximized
- First attempt: all of the weight went to the tri-gram model with the smallest reduced alphabet
  - Reason: a smaller vocabulary size causes the probability mass to be less spread out
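The interpolation itself is a weighted sum over the basic models. A toy sketch (the function, the two stand-in models, and all numbers are assumptions for illustration; the real system interpolates the 6 n-gram models described earlier):

```python
def interpolated_prob(aa, history, models, weights):
    """Linear interpolation of basic probability models: each model is a
    function giving P(aa | history); weights sum to 1."""
    return sum(w * m(aa, history) for w, m in zip(weights, models))

# Two toy "models" (hypothetical numbers): a uniform unigram over the
# 20 amino acids, and a model that puts extra mass on leucine.
uniform = lambda aa, h: 1.0 / 20
leucine_heavy = lambda aa, h: 0.5 if aa == "L" else 0.5 / 19

p = interpolated_prob("L", "", [uniform, leucine_heavy], [0.5, 0.5])
print(p)  # 0.275 = 0.5 * 0.05 + 0.5 * 0.5
```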
Interpolation Oddity, Take 2

- Normalize the probabilities from the reduced-alphabet models
  - E.g. with groups LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P:
    - P(L | ·) / 4
    - P(F | ·) / 2
- Result: all of the weight went to the tri-gram model with the normal 20-amino-acid alphabet
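The normalization divides a group's probability evenly among its members, so the group sizes (4 for LVIM, 2 for FY) appear as the divisors on the slide. A minimal sketch, with hypothetical group probabilities:

```python
GROUPS = ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]
GROUP_OF = {aa: g for g in GROUPS for aa in g}

def normalized_prob(aa, group_probs):
    """Share a reduced-alphabet group's probability evenly among its
    members, e.g. P(L | .) = P(LVIM | .) / 4."""
    group = GROUP_OF[aa]
    return group_probs[group] / len(group)

# Hypothetical probabilities from a reduced-alphabet tri-gram model:
probs = {"LVIM": 0.4, "FY": 0.2}
print(normalized_prob("L", probs))  # 0.4 / 4 = 0.1
print(normalized_prob("F", probs))  # 0.2 / 2 = 0.1
```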
An Example: D3DR_RAT

[Graph: log probability of the amino acid at each position of the D3DR_RAT sequence (a Class A dopamine receptor) under the Extracellular, Helix, and Intracellular models; y-axis 0 to -7000, x-axis positions 0 to 450.]

Figure 3 - Graph of the Log Probability of the Amino Acid at Each Position in the D3DR_RAT Sequence from the 3 Segment Models. The 3 segment models fluctuate frequently in their performance, making it difficult to detect which model is doing best and where the boundaries should be drawn.
D3DR_RAT @ Positions 0-100

[Graph: log probability of the amino acid at positions 0-100 under the Extracellular, Helix, and Intracellular models; the true segments in this stretch are N-terminus, Helix, Intracellular, Helix.]

Figure 4 - Enlargement of the Graph in Figure 3 for the Amino Acid Positions 0-100. The true segment boundaries are marked in dotted vertical lines.
Running Averages & Look-Ahead

[Graph: running averages of the log probability of the amino acid at positions 0-100 under the Extracellular, Helix, and Intracellular models; the true segments in this stretch are N-terminus, Helix, Intracellular, Helix.]

Figure 5 - Graph of Running Averages of Log Probabilities of Each Amino Acid between Positions 0 and 100 in the D3DR_RAT Sequence with Predicted and True Boundaries Marked. Running averages were computed using a window-size of 2 and boundaries were predicted using a look-ahead of 5. The predicted boundaries are indicated by dotted vertical lines at positions 38, 53, 65 and 88, while the true boundaries are indicated by dashed vertical lines at positions 32, 55, 66 and 92.
Predicted Boundaries for D3DR_RAT

- Window-size of 2 from the current amino acid
- Look-ahead interval of 5 amino acids

Predicted Boundaries:        38  53  65  88  107  135  150  171  188  212  374  394  413  431
Offset:                       6   2   1   4    3    9    1    1    3    3    1    3    1    3
Synthetic True Boundaries:   32  55  66  92  104  126  149  172  185  209  375  397  412  434
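The window/look-ahead scheme can be sketched as: smooth each model's per-position log probabilities with a running average, then predict a boundary where the best-scoring model changes and the new model stays best for the whole look-ahead interval. This is my reading of the slides, not the talk's exact decision rule, and the toy scores below are hypothetical:

```python
def running_avg(scores, window):
    """Average each position's score over a trailing window."""
    out = []
    for i in range(len(scores)):
        lo = max(0, i - window + 1)
        out.append(sum(scores[lo:i + 1]) / (i + 1 - lo))
    return out

def predict_boundaries(model_scores, window=2, lookahead=5):
    """Predict boundaries where the best-scoring model changes and the
    new model remains best for `lookahead` positions."""
    smoothed = [running_avg(s, window) for s in model_scores]
    n = len(smoothed[0])
    best = []
    for i in range(n):
        best.append(max(range(len(smoothed)), key=lambda m: smoothed[m][i]))
    boundaries = []
    current = best[0]
    for i in range(1, n - lookahead + 1):
        if best[i] != current and all(b == best[i] for b in best[i:i + lookahead]):
            boundaries.append(i)
            current = best[i]
    return boundaries

# Toy scores for two models over 10 residues: model 0 fits the first
# half, model 1 the second half (hypothetical numbers).
scores = [[0, 0, 0, 0, 0, -5, -5, -5, -5, -5],
          [-5, -5, -5, -5, -5, 0, 0, 0, 0, 0]]
print(predict_boundaries(scores, window=1, lookahead=3))  # [5]
```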
The Only Truth: OPSD_HUMAN

- The only GPCR that has been crystallized so far

Predicted Boundaries:  37  61  72  97  113  130  153  173  201  228  250  275  283  307
Offset:                 1   0   1   1    0    3    1    3    1    2    2    1    1    2
True Boundaries:       36  61  73  98  113  133  152  176  202  230  252  276  284  309

- Average offset for this protein is 1.357 amino acids
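The 1.357 figure is just the mean of the per-boundary absolute offsets, which can be checked directly from the numbers on the slide:

```python
# Predicted and true boundary positions for OPSD_HUMAN, from the slide.
predicted = [37, 61, 72, 97, 113, 130, 153, 173, 201, 228, 250, 275, 283, 307]
true_bnds = [36, 61, 73, 98, 113, 133, 152, 176, 202, 230, 252, 276, 284, 309]

offsets = [abs(p - t) for p, t in zip(predicted, true_bnds)]
print(offsets)                                 # [1, 0, 1, 1, 0, 3, 1, 3, 1, 2, 2, 1, 1, 2]
print(round(sum(offsets) / len(offsets), 3))   # 1.357
```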
Evaluation Metrics

- Accuracy
  - Score 1 - perfect match
  - Score 0.5 - offset of 1
  - Score 0.25 - offset of 2
  - Score 0 otherwise
  - Offset - absolute difference between the predicted and true boundary positions
- 10-fold cross-validation
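The accuracy score above is a small lookup on the offset. A direct transcription (the function name is mine):

```python
def boundary_score(predicted, true):
    """Boundary accuracy score: 1 for an exact match, 0.5 for an offset
    of 1, 0.25 for an offset of 2, 0 otherwise."""
    offset = abs(predicted - true)
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(offset, 0.0)

print(boundary_score(61, 61))    # 1.0  (exact match)
print(boundary_score(37, 36))    # 0.5  (offset 1)
print(boundary_score(250, 252))  # 0.25 (offset 2)
print(boundary_score(38, 32))    # 0.0  (offset 6)
```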
Results: Trained Interpolated Models

Test Set   Size   Accuracy   Average    E-H       H-I       I-H       H-E
A          130    0.2383     49.9698    48.4827   48.5115   51.1564   52.7103
B          130    0.2691     21.3005    22.1250   21.6981   19.5744   21.3974
C          130    0.2426     34.4385    34.9635   34.6077   33.4205   34.5308
D          130    0.2353     22.9654    23.0442   22.3865   21.8949   24.7026
E          130    0.2501     34.9154    35.6519   35.6808   33.0051   34.8231
F          130    0.2269     21.5857    22.7269   21.9135   18.9513   22.2615
G          130    0.2343     32.1989    32.1808   31.6827   31.3590   33.7513
H          130    0.2250     42.7929    43.5135   43.4462   41.2103   42.5436
I          129    0.2438     33.1179    32.0872   32.0213   33.6512   35.4212
J          129    0.2445     62.1717    62.2519   62.5039   60.8269   62.9664
Overall    1298   0.2410     35.5270    35.6851   35.4270   34.4854   36.4913

(Offset columns: Average over all boundaries, then broken down by boundary type E-H, H-I, I-H, H-E.)

Figure 6 - Results of Our Approach using Trained Interpolation Weights. Window-size: 2; look-ahead interval: 5.
Distribution of Offset between Predicted and Synthetic True Boundary

Removing the 10% of proteins with the worst average offset causes the average offset for the dataset to drop to 10.51.
Results: Using All Probability Models

Test Set   Size   Accuracy   Average    E-H       H-I       I-H       H-E
A          130    0.2309     64.2923    63.9038   63.6750   63.4359   66.4897
B          130    0.2291     33.1368    34.1462   33.5077   30.9077   33.5256
C          130    0.2352     45.0154    45.4231   45.1115   43.3744   45.9846
D          130    0.2223     31.2264    31.3096   30.9365   29.8333   32.8949
E          130    0.2137     51.0593    52.8019   52.1962   47.3000   50.9795
F          130    0.2468     27.1764    27.9519   27.7500   24.8923   27.6615
G          130    0.2169     40.4791    41.1673   40.5558   38.1846   41.7538
H          130    0.2118     57.1110    56.8558   56.4673   56.3179   59.1026
I          129    0.2193     39.3272    41.3353   40.1143   35.4600   39.4677
J          129    0.2014     83.3162    84.2655   84.4302   79.9018   83.9793
Overall    1298   0.2228     47.1923    47.8931   47.4517   44.9412   48.1631

(Offset columns: Average over all boundaries, then broken down by boundary type E-H, H-I, I-H, H-E.)

Figure 7 - Results of Our Approach using Pre-set Model Weights in the Interpolation: 0.1 for the unigram and bi-gram models, 0.2 for each of the tri-gram models. Running averages were computed over a window-size of 5 and a look-ahead interval of 4 was used.
Results: Using Only Tri-gram Models

Test Set   Size   Accuracy   Average    E-H       H-I       I-H       H-E
A          130    0.2234     70.7082    70.8231   70.7000   69.2359   72.0385
B          130    0.2462     33.4071    34.0231   33.4731   31.4615   34.4436
C          130    0.2359     45.6275    45.6019   45.6173   44.5077   46.7949
D          130    0.2224     31.7533    32.0019   32.0731   30.0410   32.7077
E          130    0.2271     51.1978    53.0673   52.7346   47.3308   50.5231
F          130    0.2286     29.1319    30.7000   30.1077   26.0872   28.7846
G          130    0.2363     43.0967    43.1115   42.4288   41.8923   45.1718
H          130    0.2310     64.5154    64.3077   64.0923   63.4308   66.4410
I          129    0.2251     41.9873    43.5504   43.0581   38.5297   41.9328
J          129    0.2168     86.6235    87.2209   87.6919   83.6951   87.3308
Overall    1298   0.2293     49.7825    50.4178   50.1743   47.6004   50.5953

(Offset columns: Average over all boundaries, then broken down by boundary type E-H, H-I, I-H, H-E.)

Figure 8 - Results of Our Approach using Pre-set Model Weights in the Interpolation: 0.25 for each of the tri-gram models. Window-size of 4 and a look-ahead interval of 4.
Conclusions

- Average accuracy of 0.241
  - Corresponds to an offset of ~2 on average
- But the average offsets are much higher
  - Missing a boundary has detrimental effects on the prediction of the remaining boundaries in the sequence, especially with a small segment
  - Large offsets are concentrated in a small number of proteins
Future Work

- Cue words
  - Unigrams, bi-grams, tri-grams, and 4-grams in a window of +/- 25 amino acids from the boundary
- Long-range contact
  - Distribution tables of how likely 2 amino acids are to be in long-range contact with each other
- Evaluation
  - How much homology is needed between training and testing data?
References

1. Doug Beeferman, Adam Berger, and John Lafferty. "Statistical Models for Text Segmentation." Machine Learning, special issue on Natural Language Learning, C. Cardie and R. Mooney, eds., 34(1-3), pp. 177-210, 1999. http://www-2.cs.cmu.edu/~lafferty/ps/ml-final.ps
2. F. Campagne, J.M. Bernassau, and B. Maigret. Viseur program (Release 2.35). Copyright 1994, 1995, 1996, Fabien Campagne, All Rights Reserved.