Application of Latent Semantic Analysis to Protein Remote Homology Detection
Wu Dongyin, 4/13/2015

Outline: Abstract; Related Work on Remote Homology Detection; LSA; LSA-based SVM and Data Set; Results and Discussion; Conclusion

ABSTRACT

Motivation
- Remote homology detection is a central problem in computational biology: the classification of proteins into functional and structural classes given their amino acid sequences.
- Discriminative methods such as the SVM are among the most effective.
- Explicit feature sets are usually large, and noisy features may be introduced, which leads to the peaking phenomenon.

Results
- We introduce LSA, an efficient feature-extraction technique from natural language processing (NLP).
- The LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms; its performance is comparable to complex kernel methods such as SVM-LA and better than other sequence-based methods.

Related Work on Remote Homology Detection
- Structure is more conserved than sequence, so detecting very subtle sequence similarities (remote homology) is important.
- Most methods can detect homology at a high level of similarity, while remote homologs are often difficult to separate from pairs of proteins that are similar merely by chance: the "twilight zone".
- Pairwise sequence comparison: dynamic programming and related algorithms (BLAST, FASTA, PSI-BLAST, etc.).
- Generative models of protein families: HMMs, etc.
- Discriminative classifiers: SVM, SVM-Fisher, SVM-k-spectrum, mismatch-SVM, SVM-pairwise, SVM-I-sites, SVM-LA, SVM-SW, etc.
- The success of an SVM classification method depends on the choice of the feature set used to describe each protein. Most of these research efforts focus on finding useful representations of protein sequence data for SVM training, using either explicit feature-vector representations or kernel functions.
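To make "explicit feature-vector representation" concrete, here is a minimal sketch of the k-spectrum idea mentioned above (the toy sequence and function name are mine, not the paper's): each protein is mapped to the counts of its length-k subsequences, and an SVM can then be trained on these vectors.

```python
from collections import Counter

def k_spectrum(seq, k=3):
    """Explicit feature map: counts of each length-k subsequence.
    Over the 20-letter amino acid alphabet there are 20**k possible
    features (8000 for k = 3), which is why these vectors get large."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = k_spectrum("MKVLKVL")   # 'KVL' occurs twice, the other 3-grams once
```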
LSA
- Latent semantic analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text.
- LSA analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to both the documents and the terms.
- LSA assumes that words that are close in meaning occur in similar pieces of text.

Bag-of-Words Model
[Word-Document Matrix (M×N): toy example counting the terms human, interface, computer, user, system, response, time, EPS, survey, tree, graph, minor over documents c1–c5 and m1–m4]
- This representation does not recognize synonymous or related words, and its dimensionality is too large.

LSA Decomposition
W ≈ U S V^T, where W is the M×N word-document matrix, U is M×R, S is R×R diagonal, V^T is R×N, and R ≤ min(M, N).

LSA + Protein Sequences
[The same word-document matrix, now with protein sequences playing the role of documents]

Folding in New Sequences
- For a new document (sequence) that is not in the training set, strictly one would have to add the unseen document to the original training set and recompute the LSA model. Instead, the new vector t can be approximated as t = dU, where d is the raw count vector of the new document, analogous to the columns of the matrix W.

LSA-based SVM and Data Set
- Structural Classification of Proteins (SCOP) 1.53; sequences from the ASTRAL database; 54 families; 4352 distinct sequences.
- Remote homology is simulated by holding out all members of a target SCOP 1.53 family from a given superfamily.
- Three basic building blocks of proteins serve as "words":
  - N-grams: N = 3, giving 20^3 = 8000 words.
  - Patterns: over the alphabet Σ ∪ {'.'}, where Σ is the set of the 20 amino acids and '.' matches any amino acid; χ² feature selection yields 8000 patterns.
  - Motifs: limited, highly conserved regions of proteins; 3231 motifs.

Results and Discussion
Two metrics are used to evaluate the experimental results:
- The receiver operating characteristic (ROC) score.
- The median rate of false positives (M-RFP) score: the fraction of negative test sequences that score as high as or better than the median score of the positive sequences.

In the family-by-family scatter plots, when a family falls in the upper-left area, the method labeled on the y-axis outperforms the method labeled on the x-axis for that family.

Hold-out settings (negative train/test sets: 3033/1137 sequences in each case):
1. Family level: within superfamily1.1 of fold1, family1.1.1 provides the positive training set (20 sequences) and family1.1.2 the positive test set (13 sequences).
2. Superfamily level: the positive test set (33 sequences) and the positive training set (88 sequences) are drawn from different families of the same superfamily's fold.
3. Fold level: the positive test set (33 sequences) and the positive training set (61 sequences) are drawn from different superfamilies of the same fold.

Computational efficiency:

Method            Vectorization   Optimization
SVM-pairwise      O(n^2 l^2)      O(n^3)
SVM-LA            O(n^2 l^2)      O(n^2 p)
SVM-Ngram         O(nml)          O(n^2 m)
SVM-Pattern       O(nml)          O(n^2 m)
SVM-Motif         O(nml)          O(n^2 m)
SVM-Ngram-LSA     O(nmt)          O(n^2 R)
SVM-Pattern-LSA   O(nmt)          O(n^2 R)
SVM-Motif-LSA     O(nmt)          O(n^2 R)

where
- n: the number of training examples
- l: the length of the longest training sequence
- m: the total number of words
- t = min(m, n)
- p: the length of the representation vector (p = n for SVM-pairwise; p = m for the methods without LSA; p = R for the LSA methods)

In computational efficiency, the LSA methods are better than SVM-pairwise and SVM-LA, but worse than the corresponding methods without LSA and than PSI-BLAST.

CONCLUSION
In this paper, the LSA model from natural language processing is successfully applied to protein remote homology detection, and improved performance has been achieved in comparison with the basic formalisms.
Each document is represented as a linear combination of hidden abstract concepts, which arise automatically from the SVD mechanism. LSA defines a transformation between high-dimensional discrete entities (the vocabulary) and a low-dimensional continuous vector space S, the R-dimensional space spanned by the columns of U, leading to noise removal and an efficient representation of each protein sequence. As a result, the LSA model achieves better performance than the methods without LSA.
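The pipeline described above (3-gram "words", a word-document matrix, truncated SVD, fold-in of unseen sequences, and the two evaluation metrics) can be sketched end to end. This is a toy illustration with made-up sequences and my own function names, standing in for whatever implementation the authors used:

```python
import numpy as np

# --- word extraction: overlapping 3-grams (N-grams with N = 3) ---
def ngrams(seq, n=3):
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

# --- M x N word-document matrix W over a toy "training set" ---
train = ["MKVLAT", "MKVLGG", "AATKVL"]
vocab = sorted({w for s in train for w in ngrams(s)})
index = {w: i for i, w in enumerate(vocab)}
W = np.zeros((len(vocab), len(train)))
for j, s in enumerate(train):
    for w in ngrams(s):
        W[index[w], j] += 1

# --- truncated SVD: W ~ U S V^T, keeping R latent dimensions ---
R = 2
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U, S, Vt = U[:, :R], S[:R], Vt[:R, :]

# --- fold-in: t = d U for an unseen sequence, d its raw count vector ---
def fold_in(seq):
    d = np.zeros(len(vocab))
    for w in ngrams(seq):
        if w in index:            # out-of-vocabulary words are dropped
            d[index[w]] += 1
    return d @ U

t = fold_in("MKVLAA")             # R-dimensional latent representation

# --- evaluation metrics on real-valued classifier scores ---
def roc_score(pos, neg):
    """AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos, neg = np.asarray(pos)[:, None], np.asarray(neg)[None, :]
    return ((pos > neg).sum() + 0.5 * (pos == neg).sum()) / (pos.size * neg.size)

def median_rfp(pos, neg):
    """Fraction of negatives scoring as high as or better than the
    median positive score."""
    return (np.asarray(neg) >= np.median(pos)).mean()
```

The fold-in step avoids recomputing the SVD for every query sequence, at the cost of projecting the query through the latent space fitted only on the training set.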