Predicting Protein-Protein Interactions from Protein
Domains Using a Set Cover Approach
Chengbang Huang1, Simon P. Kanaan1, Stefan Wuchty2, Danny Z. Chen1, Jesus A. Izaguirre1
1 Computer Science and Engineering Department,
University of Notre Dame,
46556 Notre Dame, IN, USA
{CHuang1, SKanaan, Chen, Izaguirr}@nd.edu
2 Physics Department,
University of Notre Dame,
46556 Notre Dame, IN, USA
[email protected]
Abstract. Protein interactions serve as the chemical basis for all living
organisms. Proteins fold and interact in intricate arrangements that provide
functionality to the components of a cell. These components work
cooperatively to form whole body systems. Proteins are composed of different
domains. Domains are composed of distinct peptides and are the key to the
intricate arrangements that drive proteins to fold and interact as they do. A
single protein molecule can possess multiple domains, which makes it difficult
to discover a simple formula that dictates how protein interactions occur.
There is no known way to attribute a protein-protein interaction to a specific
domain pair. Yet certain affinities exist between protein domains and are
frequently seen in living organisms. This drives researchers to study the
mechanism of protein interactions by focusing on domain interactions as a
factor. A method capable of accurately predicting protein interactions suggests
functions for proteins which have not been tested yet, and helps researchers
understand the underlying biological network. The Minimum Set Cover (MSC)
approach is able to predict protein interactions with a higher specificity than
other methods using the same information while maintaining a high sensitivity.
MSC could also be used to aid any existing domain-based method by removing
unnecessary domain interactions and thereby cutting down the number of
predicted protein interactions. This allows these methods to increase their
specificity while maintaining a high sensitivity.
1 Introduction
brief intro/Goal/assumptions
The model system used for these proceedings is the yeast cell, with several of its
proteins serving as the test cases. The data used for training and testing our methods
come from the PFAMA and PFAMB databases, which contain protein-protein interactions
and protein structures, that is, the domains contained in each protein.
1.1 The Function of a Protein
A protein is one of the basic building blocks of all living organisms. Proteins are
typically composed of an intricate sequence drawn from twenty common amino acids.
Each amino acid, in turn, consists of a central carbon atom bonded to four groups:
an amino group, a carboxyl group, a hydrogen atom, and a variable side chain.
Proteins provide evidence for evolution, and they maintain the processes that keep
organisms alive, often playing major roles in catalyzing the chemical reactions
that control the delicate homeostatic balance. Fundamental life processes such as
oxygen transport throughout the body of an organism, filtration of molecules seeking
passage through a cell membrane, and duplication of nucleic acids during
reproduction all involve proteins.
1.2 Significance of Protein-Protein Interactions
The amino acids of a protein form stable peptide bonds between the carbonyl carbon
of one residue and the amide nitrogen of the next, assembling into polypeptide
chains. This chain is the primary protein structure, and interactions between the
attached functional groups drive the way the protein folds into its secondary and
tertiary structures.
1.3 Domains as the Possible Cause of the Interaction
The primary sequence of a protein gives it a multitude of polar and/or non-polar
characteristics that would otherwise prevent interactions from occurring between
two proteins. Such an arrangement prevents proteins from aggregating into useless
blobs. However, a multitude of proteins do associate with each other in living
organisms, which suggests that a specific geometric arrangement allows interaction
between two or more proteins. Further, a piece-wise assembly of multiple proteins
allows organisms to detect and correct defects in a protein complex effectively and
expediently, as opposed to analyzing a single large, awkward protein for a
minuscule error.
2 Related Work
AM, MLE, InterDom etc…
3 Methods & Implementation
The Minimum Set Cover Problem, MSC, is the following problem: given a collection of
sets, find the smallest sub-collection whose union is equal to the union of all the
sets. We approximate the minimum number of domain pairs which are needed to explain
all the protein interactions read from the training data. A protein interaction,
(P1, P2), is explained if a domain pair, (D1, D2), is chosen such that P1 includes
either D1 or D2 as one of its domains, while P2 includes the other as one of its
domains. The "Minimum Set Cover Problem" is an NP-complete problem, which is why we
approximate a solution instead of finding the exact solution, which is
computationally expensive.
There are different methods to approximate MSC. One method is to choose the most
frequently interacting domain pair, that is, the most common domain pair observed
in interacting proteins. Once this pair is chosen it covers the set of interacting
protein pairs which contain this domain pair. This set of protein pairs is assumed
to interact due to this domain pair and therefore does not need to be explained
anymore. This process is repeated until there are no more protein pairs to be
explained. The union of all the chosen sets is equal to the training data.
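The greedy selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the experiments: the function name `greedy_msc_by_count`, the dictionary `domains_of` (protein name to list of domains), and the toy proteins are assumptions made only for this example, and ties are broken arbitrarily by domain name.

```python
from itertools import product


def unordered(a, b):
    """Return the pair (a, b) in a canonical order, so (x, y) == (y, x)."""
    return (a, b) if a <= b else (b, a)


def greedy_msc_by_count(domains_of, interactions):
    """Greedy set-cover approximation: repeatedly choose the domain pair
    that explains the most still-unexplained interacting protein pairs."""
    # Map every domain pair to the interacting protein pairs it could explain.
    covers = {}
    for p1, p2 in interactions:
        for d1, d2 in product(domains_of[p1], domains_of[p2]):
            covers.setdefault(unordered(d1, d2), set()).add(unordered(p1, p2))

    unexplained = {unordered(p1, p2) for p1, p2 in interactions}
    chosen = []
    while unexplained and covers:
        # Weight = number of unexplained interactions covered by the pair.
        best = max(covers, key=lambda dp: (len(covers[dp] & unexplained), dp))
        newly_explained = covers[best] & unexplained
        if not newly_explained:
            break  # the rest involve proteins without any known domain
        chosen.append(best)
        unexplained -= newly_explained
    return chosen


# Hypothetical toy data: six proteins, three observed interactions.
domains_of = {"P1": ["A"], "P2": ["B"], "P3": ["A", "C"],
              "P4": ["B"], "P5": ["C"], "P6": ["D"]}
interactions = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
print(greedy_msc_by_count(domains_of, interactions))
```

In this toy data set the pair (A, B) explains two interactions and is chosen first; (C, D) then explains the remaining one, so (B, C) is never needed.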
Another method chooses the domain pair with the greatest probability of
interacting. The probability of a domain pair (D1, D2) is the number of interacting
protein pairs containing (D1, D2) divided by the number of possible protein pairs
containing (D1, D2). The number of possible protein pairs containing (D1, D2) is
simply the number of proteins containing D1 multiplied by the number of proteins
containing D2.
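As a small worked example of this ratio, the following Python sketch computes the probability for one domain pair directly from the definition above. The function name `domain_pair_probability` and the toy data are assumptions for illustration; the sketch also assumes each interacting protein pair appears only once in the input.

```python
def domain_pair_probability(d1, d2, domains_of, interactions):
    """Interacting protein pairs containing (d1, d2) divided by the number
    of possible protein pairs containing (d1, d2)."""
    hosts1 = {p for p, ds in domains_of.items() if d1 in ds}
    hosts2 = {p for p, ds in domains_of.items() if d2 in ds}

    # An interaction contains (d1, d2) if one protein hosts d1 and the
    # other protein hosts d2, in either order.
    interacting = 0
    for p1, p2 in interactions:
        if (p1 in hosts1 and p2 in hosts2) or (p1 in hosts2 and p2 in hosts1):
            interacting += 1

    # Possible pairs: proteins hosting d1 times proteins hosting d2.
    possible = len(hosts1) * len(hosts2)
    return interacting / possible if possible else 0.0


# Hypothetical toy data.
domains_of = {"P1": ["A"], "P2": ["B"], "P3": ["A", "C"],
              "P4": ["B"], "P5": ["C"], "P6": ["D"]}
interactions = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
print(domain_pair_probability("A", "B", domains_of, interactions))  # 2 / (2 * 2)
```

Here two proteins host A and two host B, giving four possible protein pairs, of which two are observed to interact.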
The first method relies on the assumption that the most commonly observed
interacting domain pair among the protein interactions is probably the cause of the
protein interactions. The second method relies on the completeness and accuracy of
the training data. The closer the training data set is to completeness, the closer
the calculated probability is to the actual probability.
3.1 Implementing Minimum Set Cover Method
The weight function is the criterion used to choose each set when approximating the
minimum number of sets. Implementing MSC with the number of observed interactions
as the weight function is similar to implementing MSC with the domain-pair
probability. Instead of dealing with both the numerator and the denominator, as in
MSC by probability, it deals only with the numerator, the number of interacting
protein pairs.
3.2 Implementing Minimum Set Cover Method Using Probability
Before going into detail about MSC by probability, let us take care of the input and
the data structures used to implement MSC. MSC requires a training data set. This
data set needs to include several protein interactions and the protein structure of
every protein involved in an interaction, in other words the domains contained in
each protein involved in an interaction. This data set needs to be read in and
stored in meaningful data structures. One such data structure is a vector of linked
lists. Three such data structures are recommended. The size of the first vector is
the number of proteins available. Every protein has a unique id, which is its index
in the vector. Every protein also has a linked list of domains. This list contains
all the domains this protein contains. The second data structure is similar to the
first, but instead of a list of the domains a protein contains, each entry holds a
list of the proteins it interacts with. The third data structure helps in finding
the proteins which host a domain. The number of domains is the vector's size. Every
domain has a unique id, which is its index in the vector. Every domain also has a
linked list of proteins. This list contains all the proteins which host this
domain.
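The three structures can be sketched as follows. This is only an illustration under stated assumptions: Python lists stand in for both the vectors and the linked lists, proteins and domains are already numbered from zero, and the function name `build_structures` is hypothetical.

```python
def build_structures(protein_domains, interactions, n_domains):
    """Build the three index structures described in the text.

    protein_domains[p] lists the domain ids of protein p;
    interactions is a list of (protein id, protein id) pairs.
    """
    n_proteins = len(protein_domains)

    # 1) protein id -> domains contained in that protein
    domains_of = [list(ds) for ds in protein_domains]

    # 2) protein id -> proteins it interacts with
    partners_of = [[] for _ in range(n_proteins)]
    for p, q in interactions:
        partners_of[p].append(q)
        if p != q:
            partners_of[q].append(p)

    # 3) domain id -> proteins which host that domain
    hosts_of = [[] for _ in range(n_domains)]
    for p, ds in enumerate(domains_of):
        for d in ds:
            hosts_of[d].append(p)

    return domains_of, partners_of, hosts_of


# Hypothetical toy input: four proteins, three domains, two interactions.
domains_of, partners_of, hosts_of = build_structures(
    [[0], [1], [0, 2], [1]], [(0, 1), (2, 3)], n_domains=3)
print(partners_of)  # [[1], [0], [3], [2]]
print(hosts_of)     # [[0, 2], [1, 3], [2]]
```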
A data structure is also needed in order to choose the domain pair according to the
weight function. One such data structure is a domain-domain matrix. The matrix's
size is n x n, where n is the number of domains available in the training data.
Each entry Dij in the matrix represents the probability that the domain pair
(Di, Dj) is the cause of the interaction of a protein pair containing (Di, Dj). For
a protein pair to contain (Di, Dj), one of the proteins needs to contain either Di
or Dj while the other protein contains the other. The probability of (Di, Dj)
causing the protein-protein interaction is equal to:
(# interacting protein pairs containing (Di, Dj)) / (# protein pairs containing (Di, Dj)) .
Instead of storing a double it is simpler to store two integers, the numerator and
the denominator, since this number needs to be updated, as the pseudo code later in
this section does.
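A possible way to fill such a matrix is sketched below. The sketch assumes integer domain and protein ids, takes the denominator to be the number of proteins hosting Di times the number hosting Dj (as in the definition of the probability given earlier), and keeps the matrix symmetric. The function name `build_domain_matrix` is an assumption, not part of the original implementation.

```python
def build_domain_matrix(domains_of, hosts_of, interactions):
    """Return (numer, denom): two n x n integer matrices holding, for each
    domain pair, the interacting and the possible protein-pair counts."""
    n = len(hosts_of)
    numer = [[0] * n for _ in range(n)]
    denom = [[0] * n for _ in range(n)]

    # Denominator: proteins hosting Di times proteins hosting Dj.
    for i in range(n):
        for j in range(n):
            denom[i][j] = len(hosts_of[i]) * len(hosts_of[j])

    # Numerator: each interacting protein pair counts once per domain pair.
    for p, q in interactions:
        seen = set()
        for i in domains_of[p]:
            for j in domains_of[q]:
                key = (min(i, j), max(i, j))
                if key in seen:
                    continue  # same unordered domain pair already counted
                seen.add(key)
                numer[i][j] += 1
                if i != j:
                    numer[j][i] += 1  # keep the matrix symmetric
    return numer, denom


# Hypothetical toy input matching four proteins and three domains.
domains_of = [[0], [1], [0, 2], [1]]
hosts_of = [[0, 2], [1, 3], [2]]
numer, denom = build_domain_matrix(domains_of, hosts_of, [(0, 1), (2, 3)])
print(numer[0][1], denom[0][1])  # 2 4
```

Keeping two integer matrices makes later updates exact, at the cost of storing many entries that stay zero.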
After all the training data is read in and this matrix is built it is time to approximate
MSC. Recall for MSC by probability the domain pair with the greatest probability of
interacting is chosen and used to explain a set of protein pairs. This is repeated until
there are no more protein pairs to explain. Pseudo code highlighting these steps
follows:
MSC by probability (Training Data)
begin
  Chosen := empty set of domain pairs;
  (Di, Dj) := domain pair with the largest probability in Domain matrix;
  while probability of (Di, Dj) ≠ 0
  begin
    Add (Di, Dj) to Chosen;
    For every protein Pi which hosts Di
      For every protein Pj which hosts Dj
        If (Pi, Pj) has not been removed yet
          Remove effect of (Pi, Pj) from Domain matrix;
    (Di, Dj) := domain pair with the largest probability in Domain matrix;
  end
  return Chosen;
end.
Remove effect of (Pi, Pj) from Domain matrix
begin
  For every domain Dx contained in Pi
    For every domain Dy contained in Pj
    begin
      If (Pi, Pj) is an interacting pair, decrement Dxy numerator by one;
      Decrement Dxy denominator by one;
    end
end.
(1)
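For readers who prefer running code, the selection loop can be sketched in Python as follows. Several choices here are assumptions rather than details from the text: counts are recomputed from the set of protein pairs not yet removed instead of being decremented in a matrix, protein pairs are treated as unordered, the probability recorded for a chosen pair is the one it has at the moment it is chosen, and the name `msc_by_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def unordered(a, b):
    """Canonical form of an unordered pair."""
    return (a, b) if a <= b else (b, a)


def msc_by_probability(domains_of, interactions):
    """Choose domain pairs by greatest probability until every interacting
    protein pair is explained. Returns {domain pair: probability when chosen}."""
    interacting = {unordered(p, q) for p, q in interactions}

    hosts = {}  # domain -> proteins hosting it
    for p, ds in domains_of.items():
        for d in ds:
            hosts.setdefault(d, set()).add(p)

    def protein_pairs(d1, d2):
        return {unordered(p, q) for p in hosts[d1] for q in hosts[d2]}

    # Only domain pairs seen in at least one interaction can have weight > 0.
    candidates = sorted({unordered(d1, d2) for p, q in interacting
                         for d1, d2 in product(domains_of[p], domains_of[q])})
    removed, chosen = set(), {}
    while True:
        best, best_prob = None, Fraction(0)
        for d1, d2 in candidates:
            remaining = protein_pairs(d1, d2) - removed
            if remaining:
                prob = Fraction(len(remaining & interacting), len(remaining))
                if prob > best_prob:
                    best, best_prob = (d1, d2), prob
        if best is None:  # the maximum is zero: nothing left to explain
            return chosen
        chosen[best] = best_prob
        removed |= protein_pairs(*best)  # remove the effect of these pairs


# Hypothetical toy data.
domains_of = {"P1": ["A"], "P2": ["B"], "P3": ["A", "C"],
              "P4": ["B"], "P5": ["C"], "P6": ["D"]}
interactions = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
print(msc_by_probability(domains_of, interactions))
```

Each pass rescans every candidate pair, which mirrors the cost of searching the matrix for its largest entry.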
3.3 Method’s Significance
MSC minimizes the number of domain pairs which will be used for predicting protein
interactions by choosing the minimum set of domain pairs which explain all the
protein interactions in the training data. Minimizing the number of domain pairs
chosen decreases the number of predicted protein pairs, which in turn decreases the
number of false positive interactions. False positive interactions are protein
interactions which are predicted but not contained in the test data set. However,
enough domain pairs are chosen to explain the training data set, and therefore these
domain pairs can predict at least all the protein pairs in the training data set.
Any additional protein pairs which are predicted are less likely to be false
positives, as will be seen through experimentation. Two metrics are used to measure
how good the protein predictions are: specificity and sensitivity.
Specificity = # matches / # predicted protein interactions .    (2)
Sensitivity = # matches / # protein interactions in test data set .    (3)
The number of matches is the number of predicted protein interactions which are
included in the testing data set. MSC aims to maximize specificity while maintaining
a high sensitivity.
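Both metrics are simple ratios over sets of protein pairs, as the following Python sketch illustrates. The function name `specificity_sensitivity` and the sample pairs are assumptions for this example. Note that specificity as defined in equation (2) is the quantity that is often called precision elsewhere.

```python
def specificity_sensitivity(predicted, test):
    """Compute equations (2) and (3) for two collections of protein pairs.
    Pairs are unordered, so (P1, P2) and (P2, P1) count as the same pair."""
    predicted = {tuple(sorted(pair)) for pair in predicted}
    test = {tuple(sorted(pair)) for pair in test}

    # A match is a predicted interaction that is also in the test data set.
    matches = len(predicted & test)
    specificity = matches / len(predicted) if predicted else 0.0
    sensitivity = matches / len(test) if test else 0.0
    return specificity, sensitivity


# Hypothetical example: four predictions, three test interactions, two matches.
predicted = [("P1", "P2"), ("P3", "P4"), ("P1", "P4"), ("P5", "P6")]
test = [("P2", "P1"), ("P3", "P4"), ("P7", "P8")]
spec, sens = specificity_sensitivity(predicted, test)
print(spec, sens)  # 2/4 and 2/3
```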
3.5 Prediction
Once MSC has finished choosing all the domain pairs needed to explain the training
data, these domain pairs are used to predict protein interactions. The probability
of each observed protein pair is calculated. If the probability is greater than
some threshold, this protein pair is predicted to interact, and the number of
matches between the predicted data set and the observed data set is increased by
one. Otherwise the protein pair is predicted not to interact, which increases the
number of false negatives by one.
The following pseudo code tests whether an observed interacting protein pair is
predicted to interact or not:
Prediction (observed data, Domain matrix)
begin
  For every observed protein Pi
    For every observed protein Pj interacting with Pi
    begin
      probability := Predict the probability(Pi, Pj);
      If probability > threshold
        (Pi, Pj) is predicted to interact;
        increment number of matches by one;
      Else
        (Pi, Pj) is not predicted to interact;
        increment number of false_negatives by one;
    end
end.
Predict the probability(Pi, Pj)
begin
  non_interaction := 1.0;
  For every domain Di contained in Pi
    For every domain Dj contained in Pj
    begin
      Dij := see equation 1;
      non_interaction := non_interaction * (1 - Dij);
    end
  return 1 - non_interaction;
end.
This algorithm can easily be modified to predict all protein interactions given a
set of proteins and their structure. Instead of iterating over the observed protein
interactions, as in the first for loop, iterate over all possible protein pairs. In
the "if statement", instead of incrementing the number of matches, increment the
number of predicted protein interactions.
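The modified procedure might look as follows in Python. This sketch assumes that the chosen domain pairs and their probabilities are available in a dictionary `pair_prob` (a hypothetical name), that domain pairs not chosen by MSC contribute probability zero, and that self-interactions are not considered. A protein pair is predicted to interact when one minus the product of the non-interaction probabilities of its domain pairs exceeds the threshold.

```python
from itertools import combinations, product


def interaction_probability(p, q, domains_of, pair_prob):
    """1 - product over the domain pairs of (p, q) of (1 - Dij)."""
    keys = {(d1, d2) if d1 <= d2 else (d2, d1)
            for d1, d2 in product(domains_of[p], domains_of[q])}
    non_interaction = 1.0
    for key in keys:
        # Domain pairs that were not chosen contribute probability 0.
        non_interaction *= 1.0 - pair_prob.get(key, 0.0)
    return 1.0 - non_interaction


def predict_all(domains_of, pair_prob, threshold):
    """Return every protein pair whose probability exceeds the threshold."""
    predicted = []
    for p, q in combinations(sorted(domains_of), 2):
        if interaction_probability(p, q, domains_of, pair_prob) > threshold:
            predicted.append((p, q))
    return predicted


# Hypothetical toy data and chosen domain pairs.
domains_of = {"P1": ["A"], "P2": ["B"], "P3": ["A", "C"],
              "P4": ["B"], "P5": ["C"], "P6": ["D"]}
pair_prob = {("A", "B"): 0.5, ("C", "D"): 0.5}
print(predict_all(domains_of, pair_prob, threshold=0.4))
```

With these toy values, every protein pair that contains (A, B) or (C, D) is predicted, including pairs such as (P1, P4) that were never observed to interact.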
4 Results
4.1 Comparison between AM, MLE, MSC, and MSC by Probability
4.2 Comparison with Other Methods
5 Future Work
Any number of different weight functions could be implemented, and different weight
functions could be more meaningful depending on the data used. Some weight
functions could rely more on the presence of protein interactions, others on the
absence of protein interactions, while others could take both into consideration,
for example MSC with probability. MSC can also incorporate different methods which
rely on domain interaction by using these methods as weight functions. It can be
used to aid methods which use multiple sources to determine domain interaction,
such as InterDom.
Different assumptions could also be made. Instead of looking at a domain pair as
the cause of interactions, one could consider groups of domains interacting with
other groups, or consider interactions which depend on previous interactions, since
protein interactions are not independent.
Some work could also be done on the optimization of MSC. It currently uses a
matrix which contains a significant amount of wasted space, and it is costly to
find the largest number in the matrix every time a domain pair is chosen. A heap
data structure could be implemented to reduce these costs.
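One way such a heap could look is sketched below using Python's standard `heapq` module. This is an assumption about a possible design, not something that has been implemented or evaluated here: weights are kept in a dictionary holding only the domain pairs that actually occur, and outdated heap entries are skipped lazily when they reach the top. The class name `MaxPairHeap` is hypothetical.

```python
import heapq


class MaxPairHeap:
    """Max-heap over domain pairs keyed by weight, with lazy invalidation."""

    def __init__(self, weights):
        # Only pairs that occur are stored, avoiding an n x n matrix.
        self.weights = dict(weights)
        # heapq is a min-heap, so weights are negated.
        self.heap = [(-w, pair) for pair, w in self.weights.items()]
        heapq.heapify(self.heap)

    def update(self, pair, weight):
        """Record a new weight; the old heap entry becomes stale."""
        self.weights[pair] = weight
        heapq.heappush(self.heap, (-weight, pair))

    def pop_max(self):
        """Return (pair, weight) with the largest current weight, or None."""
        while self.heap:
            neg_weight, pair = heapq.heappop(self.heap)
            if self.weights.get(pair) == -neg_weight:  # entry is up to date
                del self.weights[pair]
                return pair, -neg_weight
        return None


# Hypothetical weights for three domain pairs.
heap = MaxPairHeap({("A", "B"): 0.5, ("B", "C"): 0.25, ("C", "D"): 0.4})
heap.update(("A", "B"), 0.1)   # weight drops after some pairs are explained
print(heap.pop_max())          # (('C', 'D'), 0.4)
```

Each update and each removal of the maximum costs time logarithmic in the number of heap entries, instead of a scan over the whole matrix.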