Survey
NEIGHBORHOOD REGULARIZED LOGISTIC MATRIX FACTORIZATION FOR DRUG-TARGET INTERACTION PREDICTION
YONG LIU1,2*, MIN WU2, CHUNYAN MIAO1, PEILIN ZHAO2, XIAOLI LI2
PLOS COMPUT BIOL 12(2): E1004760. DOI:10.1371/JOURNAL.PCBI.1004760
ABSTRACT
• In pharmaceutical sciences, a crucial step of the drug discovery process is the identification of drug-target interactions. However, only a small portion of drug-target interactions have been experimentally validated, because experimental validation is laborious and costly. To improve drug discovery efficiency, there is a great need for accurate computational approaches that can predict potential drug-target interactions and direct experimental verification.
• In this paper, the authors propose a novel drug-target interaction prediction algorithm, namely neighborhood regularized logistic matrix factorization (NRLMF). Specifically, NRLMF models the probability that a drug would interact with a target by logistic matrix factorization, where the properties of drugs and targets are represented by drug-specific and target-specific latent vectors, respectively.
• They conducted extensive experiments over four benchmark datasets, and NRLMF demonstrated its effectiveness compared with five state-of-the-art approaches.
INTRODUCTION
• Drug discovery is one of the primary objectives of the pharmaceutical sciences, an interdisciplinary research field covering biology, chemistry, physics, statistics, etc. In the drug discovery process, the prediction of drug-target interactions (DTIs) is an important step that aims to identify potential new drugs or new targets for existing drugs.
• In general, traditional computational methods proposed for DTI prediction can be categorized into two main groups: docking simulation approaches and ligand-based approaches.
The docking simulation approaches predict potential DTIs by considering the structural information of target proteins. However, docking simulation is extremely time-consuming, and the structural information may not be available for some protein families, for example the G-protein coupled receptors (GPCRs). In the ligand-based approaches, potential DTIs are predicted by comparing a candidate ligand with the known ligands of the target proteins. This kind of approach may not perform well for targets with a small number of ligands.
• In this paper, the authors propose a novel matrix factorization approach, namely neighborhood regularized logistic matrix factorization (NRLMF), for DTI prediction. NRLMF focuses on predicting the probability that a drug would interact with a target. Specifically, the properties of a drug and a target are represented by two latent vectors in a shared low-dimensional latent space.
MATERIAL
• The performance of DTI prediction algorithms was evaluated on four benchmark datasets: Nuclear Receptors, G-Protein Coupled Receptors (GPCR), Ion Channels, and Enzymes.
• Each dataset contains three types of information: 1) the observed DTIs, 2) the drug similarities, and 3) the target similarities.
• The observed DTIs were retrieved from the public databases KEGG BRITE, BRENDA, SuperTarget, and DrugBank. The drug similarities were computed from the chemical structures of the compounds derived from the DRUG and COMPOUND sections of the KEGG LIGAND database. For a pair of compounds, the similarity between their chemical structures was measured by the SIMCOMP algorithm. The target similarities, on the other hand, were calculated from the amino acid sequences of the target proteins, which were retrieved from the KEGG GENES database.
METHOD: PROBLEM FORMALIZATION
• In this paper, the set of drugs is denoted by D = {d_i}, i = 1, …, m, and the set of targets is denoted by T = {t_j}, j = 1, …, n, where m and n are the number of drugs and the number of targets, respectively. The interactions between drugs and targets are represented by a binary matrix Y, where each element y_ij ∈ {0, 1}.
• If a drug d_i has been experimentally verified to interact with a target t_j, y_ij is set to 1; otherwise, y_ij is set to 0.
METHOD: LOGISTIC MATRIX FACTORIZATION
• Matrix factorization techniques have been successfully applied to DTI prediction in previous studies. In this work, the authors develop the DTI prediction model based on logistic matrix factorization (LMF). The primary idea of applying LMF to DTI prediction is to model the probability that a drug would interact with a target. In particular, both drugs and targets are mapped into a shared latent space with a low dimensionality r, where r ≪ min(m, n). The properties of a drug d_i and a target t_j are described by two latent vectors u_i and v_j, respectively. Then, the interaction probability p_ij of a drug-target pair (d_i, t_j) is modeled by the following logistic function:
• For simplicity, they further denote the latent vectors of all drugs and all targets by U and V respectively, where u_i is the i-th row of U and v_j is the j-th row of V.
• In DTI prediction tasks, the observed interacting drug-target pairs have been experimentally verified, so they are more trustworthy and important than the unknown pairs. For a more accurate model, the authors propose to assign higher importance levels to the interaction pairs than to the unknown pairs. In particular, each interaction pair is treated as c (c >= 1) positive training examples, and each unknown pair is treated as a single negative training example. Here, c is a constant that controls the importance level of observed interactions and is empirically set to 5 in the experiments.
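The slide's equations are not reproduced in this text, so the following is only a minimal sketch of the LMF ideas described above. It assumes the standard logistic (sigmoid) form applied to the inner product of u_i and v_j, and it weights each known interaction c times in the log-likelihood, as the text describes. The function names, the toy matrix Y, and the random initialization are hypothetical and chosen for illustration.

```python
import numpy as np


def interaction_probabilities(U, V):
    """Return P with P[i, j] = sigmoid(u_i . v_j) for every drug-target pair.

    Assumption: the logistic function is applied to the inner product of the
    latent vectors, the usual form in logistic matrix factorization.
    """
    scores = U @ V.T  # inner products of latent vectors, shape (m, n)
    return 1.0 / (1.0 + np.exp(-scores))


def weighted_log_likelihood(Y, U, V, c=5.0):
    """Log-likelihood in which each known interaction counts as c positives."""
    P = interaction_probabilities(U, V)
    eps = 1e-12  # guards against log(0)
    return float(np.sum(c * Y * np.log(P + eps) + (1.0 - Y) * np.log(1.0 - P + eps)))


# Toy data (hypothetical): m = 3 drugs, n = 4 targets, latent dimension r = 2.
rng = np.random.default_rng(0)
Y = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=float)
U = 0.1 * rng.standard_normal((3, 2))
V = 0.1 * rng.standard_normal((4, 2))
P = interaction_probabilities(U, V)
ll = weighted_log_likelihood(Y, U, V, c=5.0)
```

With c = 1 the function reduces to the ordinary Bernoulli log-likelihood; larger c pushes the model to fit the verified interactions more closely than the unknown pairs.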
• By assuming that all the training examples are independent, the probability of the observations is as follows:
METHOD: REGULARIZED BY NEIGHBORHOOD
• By mapping both drugs and targets into a shared latent space, the LMF model can effectively estimate the global structure of the DTI data. However, LMF ignores the strong neighborhood associations among a small set of closely related drugs or targets. Thus, the authors propose to exploit the nearest neighborhood of a drug and that of a target to further improve DTI prediction accuracy.
• For a drug d_i, the set of its nearest neighbors is denoted by N(d_i), where N(d_i) is constructed by choosing the K1 drugs most similar to d_i. Likewise, the set N(t_j) consists of the K1 targets most similar to t_j.
• In this paper, the drug neighborhood information is represented using an adjacency matrix A, where the (i, μ) element a_iμ is defined as follows:
• Similarly, the adjacency matrix used to describe the target neighborhood information is denoted by B, where its (j, ν) element b_jν is defined as follows:
• The primary idea of exploiting the drug neighborhood information for DTI prediction is to minimize the distances between d_i and its nearest neighbors N(d_i) in the latent space. This can be achieved by minimizing the following objective function:
• Moreover, the neighborhood information of targets is also exploited for DTI prediction by minimizing the following objective function:
METHOD: NRLMF
• By plugging in all equations, the proposed NRLMF model is formulated as follows:
• This optimization problem can be solved by an alternating gradient ascent procedure.
Denoting the objective function by L, the partial gradients with respect to U and V are as follows:
EXPERIMENT
• Following previous studies, the performance of the DTI prediction methods was evaluated under five trials of 10-fold cross-validation (CV), and both AUC and AUPR were used as the evaluation metrics. In particular, for each method, 10-fold CV was performed five times, each time with a different random seed. An AUC score was calculated in each repetition of CV, and the final AUC score reported was the average over the five repetitions. The AUPR score was calculated in the same manner.
• The drug-target interaction matrix Y had m rows for drugs and n columns for targets. CV was conducted under three different settings:
• CVS1 (CV on drug-target pairs): random entries in Y (i.e., drug-target pairs) were selected for testing.
• CVS2 (CV on drugs): random rows in Y (i.e., drugs) were blinded for testing.
• CVS3 (CV on targets): random columns in Y (i.e., targets) were blinded for testing.
• All settings used 90% of the data for training and 10% for testing.
RESULT
• NRLMF
DISCUSSION
• These results indicate that NRLMF outperforms existing state-of-the-art methods in predicting new pairs and new drugs, and is comparable to or even better than existing methods in predicting new targets.
A MULTIPLE KERNEL LEARNING ALGORITHM FOR DRUG-TARGET INTERACTION PREDICTION
ANDRÉ C. A. NASCIMENTO, RICARDO B. C. PRUDÊNCIO AND IVAN G. COSTA
NASCIMENTO ET AL. BMC BIOINFORMATICS (2016) 17:46 DOI 10.1186/S12859-016-0890-3
KERNEL METHODS
• In machine learning, kernel methods are a class of algorithms for pattern analysis, whose best known member is the support vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example clusters, rankings, principal components, correlations, classifications) in datasets.
For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map. In contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over pairs of data points in raw representation.
ABSTRACT
• Drug-target networks have received a lot of attention in recent years, given their relevance for pharmaceutical innovation and drug lead discovery. Different in silico approaches have been proposed for the identification of new drug-target interactions, many of which are based on kernel methods. Despite technical advances in recent years, these methods are not able to cope with large drug-target interaction spaces or to integrate multiple sources of biological information.
• The authors propose KronRLS-MKL, which models the drug-target interaction problem as a link prediction task on bipartite networks. This method allows the integration of multiple heterogeneous information sources for the identification of new interactions, and can also work with networks of arbitrary size. Moreover, it automatically selects the more relevant kernels by returning weights indicating their importance in the drug-target prediction at hand. Empirical analysis on four data sets using twenty distinct kernels indicates that their method has predictive performance higher than or comparable to that of 18 competing methods in all prediction tasks. Moreover, the predicted weights reflect the predictive quality of each kernel in exhaustive pairwise experiments, which indicates that the method succeeds in automatically revealing relevant biological sources.
BACKGROUND
• Most network approaches are based on bipartite graphs, in which the nodes are drugs (small molecules) and biological targets (proteins). Edges between drugs and targets indicate a known DTI (Fig).
Given a known interaction network, kernel-based methods can be used to predict unknown drug-target interactions. A kernel can be seen as a similarity matrix estimated on all pairs of instances. The main assumption behind network kernel methods is that similar ligands tend to bind to similar targets and vice versa. These approaches use base kernels to measure the similarity between drugs (or targets) using distinct sources of information (e.g., structural, pharmacophore, sequence and function similarity). A pairwise kernel function, which measures the similarity between drug-target pairs, is obtained by combining a drug base kernel and a protein base kernel via the kernel product.
• The majority of previous network approaches use classification methods, such as Support Vector Machines (SVM), to perform predictions over the drug-target interaction space. However, such techniques have major limitations.
• First, they can only incorporate one pair of base kernels at a time (one for drugs and one for proteins) to perform predictions.
• Second, the computation of the pairwise kernel matrix for the whole interaction space (all possible drug-target pairs) is computationally infeasible even for a moderate number of drugs and targets.
• Moreover, most drug-target interaction databases provide no true negative interaction examples. The common solution to these issues is to randomly sample a small proportion of unknown interactions to be used as negative examples. While this approach provides a computationally tractable small drug-target pairwise kernel, it creates an easier but unrealistic classification task with balanced class sizes.
• An emerging machine learning (ML) discipline, called Multiple Kernel Learning (MKL), focuses on the search for an optimal combination of kernels. MKL-like methods have previously been proposed for the problem of DTI prediction and the closely related protein-protein interaction (PPI) prediction problem.
This is extremely relevant, as it allows the use of distinct sources of biological information to define similarities between molecular entities. However, since traditional MKL methods are SVM-based, they are subject to the memory limitations imposed by the pairwise kernel, and are not able to perform predictions in the complete drug vs. protein space. Moreover, MKL approaches used in the PPI prediction problem and in protein function prediction cannot be applied to bipartite graphs, as in the problem at hand. The authors are aware of only two recent works proposing an MKL approach to integrate similarity measures for drugs and targets.
• In this work, the authors propose a new MKL algorithm to automatically select and combine kernels on a bipartite drug-protein prediction problem, the KronRLS-MKL algorithm (Fig 1). For this, they extend the KronRLS method to an MKL scenario. Their method uses L2 regularization to produce a non-sparse combination of base kernels. The proposed method can cope with large drug vs. target interaction matrices, does not require sub-sampling of the drug-target network, and is also able to combine and select relevant kernels. They perform an empirical analysis using previously described drug-target datasets and a diverse set of drug kernels (10) and protein kernels (10).
METHOD RLS
• Consider a set of drugs D = {d_1, …, d_nd}, a set of targets T = {t_1, …, t_nt}, and a set of training inputs x_i (drug-target pairs) with binary labels y_i ∈ ℝ (where 1 stands for a known interaction and 0 otherwise), with 1 ≤ i ≤ n and n = |D||T| (the number of drug-target pairs). The RLS approach minimizes the following function:
• where ∥f∥_K is the norm of the prediction function f on the Hilbert space associated with the kernel K, and λ > 0 is a regularization parameter which determines the compromise between the prediction error and the complexity of the model.
According to the representer theorem, a minimizer of the above objective function admits a dual representation of the following form:
• where K : |D||T| × |D||T| → ℝ is named the pairwise kernel function and a is the vector of dual variables corresponding to each separation constraint. The RLS algorithm obtains the minimizer of Eq. 1 by solving a system of linear equations defined by (K + λI)a = y, where a and y are both n-dimensional vectors consisting of the parameters a_i and labels y_i.
• One can construct such a pairwise kernel as the product of two base kernels, namely K((d, t), (d′, t′)) = K_D(d, d′) K_T(t, t′), where K_D and K_T are the base kernels for drugs and targets, respectively. This is equivalent to the Kronecker product of the two base kernels: K = K_D ⊗ K_T. The size of the kernel matrix makes model training computationally infeasible even for a moderate number of drugs and targets.
METHOD KRONRLS
• The KronRLS algorithm is a modification of RLS that takes advantage of two specific algebraic properties of the Kronecker product to speed up model training: the so-called vec trick and the relation of the eigendecomposition of the Kronecker product to the eigendecomposition of its factors.
• The solution a can be given by solving the following equation
• where vec(·) is the vectorization operator that stacks the columns of a matrix into a vector, and C is a matrix defined as:
METHOD KRONRLS MKL
• In this work, a vector of different kernels is considered, i.e., k_D = (K_D^1, K_D^2, …, K_D^{P_D}) and k_T = (K_T^1, K_T^2, …, K_T^{P_T}), where P_D and P_T indicate the number of base kernels defined over the drug set and the target set, respectively. In this section, the authors propose an extension of KronRLS to handle multiple kernels.
• The kernels can be combined by a linear function, i.e., the weighted sum of base kernels, corresponding to the optimal kernels K*_D
• The classification function of Eq.
2 can be written in matrix form, f_a = Ka, and applying the well-known property of the Kronecker product, (A ⊗ B) vec(X) = vec(B X A^T), we have:
• More specifically, Eq. 1 can be redefined when a is fixed, and knowing that ∥f∥²_K = a^T K a [28], we have:
• Since the second term does not depend on K (and thus does not depend on the kernel weights), and since y and a are fixed, it can be discarded from the weight optimization procedure. Note that the authors are not interested in a sparse selection of base kernels; therefore they introduce an L2 regularization term to control the sparsity of the kernel weights, also known as a ball constraint. This term is parameterized by the regularization coefficient σ. The optimal value for the combination vector is obtained by solving the optimization problem defined as:
• The optimization method used here is the interior-point optimization algorithm.
DATA
• Each dataset consists of a binary matrix containing the known interactions for a given class of drug targets, namely Enzyme (E), Ion Channel (IC), GPCR and Nuclear Receptors (NR), based on information extracted from the KEGG BRITE, BRENDA, SuperTarget and DrugBank databases. All four datasets are extremely unbalanced if the whole drug-target interaction space is considered, i.e., the number of known interactions is far lower than the number of unknown interactions, as presented in Table 1.
PROTEIN KERNELS
• The following information sources about target proteins are used: amino acid sequence, functional annotation and proximity in the protein-protein network. Concerning sequence information, the authors consider the normalized score of the Smith-Waterman alignment of the amino acid sequence (SW) [23], as well as different parametrizations of the Mismatch (MIS) [40] and the Spectrum (SPEC) [41] kernels.
For the Mismatch kernel, they evaluated four combinations of distinct values for the k-mer length (k=3 and k=4) and the maximal number of mismatches per k-mer (m=1 and m=2), namely MIS-k3m1, MIS-k3m2, MIS-k4m1 and MIS-k4m2; for the Spectrum kernel, they varied the k-mer length (k=3 and k=4, giving SPEC-k3 and SPEC-k4, respectively). Both Mismatch and Spectrum kernels were calculated using the R package KeBABS [42].
• The Gene Ontology semantic similarity kernel (GO) was used to encode functional information. GO terms were extracted from the BioMART database [43], and the semantic similarity scores between the GO annotation terms were calculated using the csbl.go R package [44], with the Resnik algorithm [45]. They also extracted a similarity measure from the human protein-protein network (PPI), obtained from the BioGRID database [46]. The similarity between each pair of targets was calculated based on the shortest distance on the corresponding PPI network, according to:
DRUG KERNELS
• As drug information sources, the authors consider 6 distinct chemical structure kernels and 3 side-effect kernels. Chemical structure similarity between drugs was obtained by applying the SIMCOMP algorithm [47] (obtained from [23]), defined as the ratio of common substructures between two drugs based on the chemical graph alignment. They also computed the Lambda-k kernel (LAMBDA) [48], the Marginalized kernel [49] (MARG), the MINMAX kernel [50], the Spectrum kernel [48] (SPEC) and the Tanimoto kernel [50] (TAN). These latter kernels were calculated with the R package Rchemcpp [48] with default parameters.
• Two distinct side-effect data sources were also considered. The first was the FDA adverse event reporting system (AERS), from which side effect keyword (adverse event keyword) similarities for drugs were first retrieved by [51].
The authors introduced two types of pharmacological profiles for drugs, one based on the frequency information of side effect keywords in adverse event reports (AERS-freq) and another based on the binary information (presence or absence) of a particular side effect in adverse event reports (AERS-bit). Since not every drug in the Nuclear Receptors, Ion Channel, GPCR and Enzyme datasets is also present in the AERS-based data, the similarities of the drugs in AERS were extracted, and zero similarity was assigned to drugs not present.
• The second side-effect resource was the SIDER database [52]. This database contains information about commercial drugs and their recorded side effects or adverse drug reactions. Each drug is represented by a binary profile, in which the presence or absence of each side effect keyword is coded 1 or 0, respectively. Both AERS-based and SIDER-based profile similarities were obtained by the weighted cosine correlation coefficient between each pair of drug profiles [51].
EXPERIMENT
• Previous work suggests that, in the context of paired input problems, one should consider separately the experiments where the training and test sets share common drugs or proteins. In order to obtain a clear notion of the performance of each method, all competing approaches were evaluated under 5 runs of three distinct 5-fold cross-validation (CV) procedures:
• 1. ‘new drug’ scenario: it simulates the task of predicting targets for new drugs. In this scenario, the drugs in a dataset were divided into 5 disjoint subsets (folds). Then the pairs associated with 4 folds of drugs were used to train the classifier and the remaining pairs were used for testing;
• 2. ‘new target’ scenario: it corresponds in turn to predicting interacting drugs for new targets. This is analogous to the above scenario, but considers 5 folds of targets;
• 3. pair prediction: it consists of predicting unknown interactions between known drugs and targets.
All drug-target interactions were split into five folds, of which 4 were used for training and 1 for testing.
• The evaluation metric considered was the AUPR, as it allows a good quantitative estimate of the ability to separate the positive interactions from the negative ones.
COMPETING METHODS
• The authors compare the predictive performance of the KronRLS-MKL algorithm against other MKL approaches, as well as in a single kernel context (one kernel for drugs, and one for targets). In the latter, they evaluate the performance of each possible combination of base kernels with the KronRLS algorithm, recently reported as the best method for predicting drug-target pairs with single paired kernels.
RESULT
• Mean AUPR ranking of each method when compared to the new interactions found in updated databases. The KronRLS-based methods achieved superior performance when compared to other integration strategies.
MULTI-KERNELS APPROACHES
• First, it is noticeable that the KA weights are very similar to the average selection (0.10). This indicates that no clear kernel selection is performed. WANGMKL and KRONRLS-MKL give low weights to the drug kernels LAMBDA, MARG, MINMAX, SPEC and TAN and to the protein kernel MIS-k3m2. These kernels have the overall worst AUPR in the single kernel experiments, which indicates agreement between both selection procedures.
CONCLUSION
• The authors have presented a new Multiple Kernel Learning algorithm for the bipartite link prediction problem, which is able to identify and select the most relevant information sources for DTI prediction. Most previous MKL methods mainly solve the problem of MKL when kernels are built over the same set of entities, which is not the case for the bipartite link prediction problem, e.g., drug-target networks. Regarding predictions in drug-target networks, the sampling of negative/unknown examples, as a way to cope with large data sets, is a clear limitation [2].
Their method takes advantage of the KronRLS framework to efficiently perform link prediction on data of arbitrary size.
• In the experiments, the KronRLS-MKL algorithm demonstrated an interesting balance between accuracy and computational cost in relation to other approaches. It performed best in the ‘pair’ prediction problem and the ‘new target’ problem. In the ‘new drug’ and ‘new target’ prediction tasks, BLM-KA was also top ranked. However, this method has a high computational cost, since it requires a classifier for each DT pair [2]. Moreover, it obtained poor results in the evaluation scenario of predicting novel drug-protein pair interactions.
• The convex constraint estimation of kernel weights correlated well with the accuracy of a brute-force pair kernel search. This non-sparse combination of kernels possibly increased the generalization of the model by reducing the bias toward a specific type of kernel. This usually leads to better performance, since the model can benefit from different heterogeneous information sources in a systematic way [33]. Finally, the algorithm's performance was not sensitive to class imbalance, and it can be trained over the whole interaction space without sacrificing performance.
POLITICAL SPEECH GENERATION
VALENTIN KASSARNIG
ARXIV:1601.03313 [CS.CL]
INTRODUCTION
• Many political speeches show the same structures and characteristics regardless of the actual topic. Some phrases and arguments appear again and again and indicate a certain political affiliation or opinion. The author wants to use these recurring patterns to train a system that generates new speeches. Since there are major differences between the political parties, the system should consider the political affiliation and the opinion of the intended speaker. The goal is to generate speeches that no one can tell apart from hand-written speeches.
DATA SET
• The dataset contains almost 4,000 political speech segments from 53 U.S.
Congressional floor debates.
• These speeches consist of over 50,000 sentences, each containing 23 words on average. Kassarnig also categorized the speeches by political party, whether Democrat or Republican, and by whether the speech was in favor of or against a given topic.
• Kassarnig begins by telling the algorithm what type of speech it is supposed to write: whether for Democrats or Republicans.
MODEL: LANGUAGE MODEL
• The author uses a simple statistical language model based on n-grams, in particular 6-grams. That is, for each sequence of six consecutive words, the model calculates the probability of seeing the sixth word given the previous five.
• 1. Determine the beginnings
• From the language model of the selected class, the probabilities for each 5-gram that starts a speech are obtained. From that distribution, one of the 5-grams is picked at random and used as the beginning of the opening sentence.
• 2. Determine the candidates for next words
• All words which have been seen in the training data following the previous 5-gram are candidates.
• 3. Calculate p_language
• p_language tells how likely a candidate word is to occur after the previous five words.
MODEL: TOPIC MODEL
• 1. Extract topics:
• The topic model uses a Justeson and Katz (J&K) POS tag filter for two- and three-word terms. The POS tags were determined for each sentence in the corpus, and then all two- and three-word terms that match one of the patterns were identified. For the POS tagging, the maxent treebank POS tagging model from the Natural Language Toolkit (NLTK) for Python was used.
• The significance score Z is defined by the ratio of the probability of seeing a word w in a certain class c to the probability of seeing the word in the entire corpus:
• Z(w, c) = P(w | c) / P(w)
• This significance score gives information about how often a term occurs in a certain class compared to the entire corpus. That is, every score greater than 1.0 indicates that in the given class a certain term occurs more often than average.
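The significance score defined above can be sketched with plain counting. This is an illustrative sketch rather than Kassarnig's implementation: it assumes the candidate terms have already been extracted per class, estimates P(w | c) and P(w) by relative frequency, and uses a hypothetical `min_count` parameter for the minimum corpus frequency. The toy term lists are invented; only the class labels DN and RY follow the deck.

```python
from collections import Counter


def significance_scores(term_lists_by_class, min_count=1):
    """Return {class: {term: Z}} with Z(w, c) = P(w | c) / P(w).

    Probabilities are relative frequencies; terms seen fewer than
    min_count times in the whole corpus are skipped.
    """
    class_counts = {c: Counter(terms) for c, terms in term_lists_by_class.items()}
    corpus_counts = Counter()
    for counts in class_counts.values():
        corpus_counts.update(counts)
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for c, counts in class_counts.items():
        class_total = sum(counts.values())
        scores[c] = {
            w: (n / class_total) / (corpus_counts[w] / corpus_total)
            for w, n in counts.items()
            if corpus_counts[w] >= min_count
        }
    return scores


# Invented toy corpus: extracted terms per class.
toy = {
    "DN": ["health care", "health care", "tax cut", "minimum wage"],
    "RY": ["tax cut", "tax cut", "tax cut", "health care"],
}
z = significance_scores(toy)
```

In this toy corpus, "minimum wage" appears only in class DN, so its score there is above 1, while "tax cut" is under-represented in DN and scores below 1.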
All phrases which occur at least 20 times in the corpus and have a ratio greater than 1 are considered.
• 2. Determine the topics
• This is done by checking whether each topic term appears in the speech. For every occurring term, the topic coverage TC in the speech is calculated. The topic coverage is an indicator of how well a certain topic t is represented in a speech S. The following equation shows the definition of the topic coverage:
• All topics are ranked by their topic coverage values, and the top 3 terms are picked as the current topic set T. For these 3 terms, the values of the ratios are normalized so that they sum to 1. This gives the probability P(t | S, c) of seeing a topic t in the current speech S of class c.
• 3. Calculate p_topic
• p_topic tells how likely the word w is to occur in a speech which covers the current topics T.
• Furthermore, a phrase should not be repeated again and again. Thus, the system checks how often the phrase consisting of the previous five words and the current candidate word has already occurred in the generated speech, and divides the combined probability by this value squared plus 1.
RESULT
• In this experiment, ten speeches were generated, five for class DN and five for class RY.
• Note that each criterion scores between 0 and 3, which leads to a maximum total score of 12. The achieved total scores range from 5 to 10, with an average of 8.1. In particular, the grammatical correctness and the sentence transitions were very good. Each of them scored on average 2.3 out of 3. The speech content yielded the lowest scores. This indicates that the topic model may need some improvement.
• Thank you!