Abstract:
We propose a frequent-pattern-based algorithm for predicting the functions and localizations of
proteins from their primary structure (amino acid sequence). We use reduced alphabets that
capture the higher rate of substitution between amino acids that are physicochemically similar.
Frequent substrings are mined from the training sequences, transformed into different alphabets,
and used as features to train an ensemble of SVMs. We evaluate the performance of our
algorithm on protein subcellular localization and protein function datasets. Pairwise
sequence-alignment-based nearest-neighbor and basic SVM k-gram classifiers are included as
comparison algorithms. Results show that the frequent-substring-based SVM classifier
outperforms the other classifiers on the subcellular localization
datasets and performs competitively with the nearest-neighbor classifier on the protein function
datasets. Our results also show that the use of reduced alphabets provides statistically significant
performance improvements for half of the classes studied.
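
The feature-construction steps named in the abstract can be illustrated with a minimal sketch. It is one possible reading, not the authors' implementation. The residue grouping in GROUPS, the function names, the length and support thresholds, and the binary presence encoding are all assumptions made for illustration. The abstract does not specify any of them.

```python
from collections import Counter

# Hypothetical reduced alphabet: each symbol stands for a group of
# physicochemically similar residues. Illustrative only, not from the paper.
GROUPS = {
    "a": "AVLIMC",  # aliphatic / hydrophobic
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "k": "KR",      # positively charged
    "d": "DE",      # negatively charged
    "g": "GP",      # small / conformationally special
}
REDUCE = {aa: sym for sym, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group symbol; unknown residues become 'x'."""
    return "".join(REDUCE.get(aa, "x") for aa in seq.upper())


def frequent_substrings(seqs, min_len, max_len, min_support):
    """Return substrings occurring in at least min_support of the sequences."""
    support = Counter()
    for seq in seqs:
        seen = set()
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        support.update(seen)  # each substring counts once per sequence
    return sorted(s for s, c in support.items() if c >= min_support)


def featurize(seq, patterns):
    """Binary vector: 1 if the pattern occurs in seq, else 0."""
    return [1 if p in seq else 0 for p in patterns]


# Toy training set (made-up sequences, not real proteins).
train = ["MKVLAA", "MKILAG", "GGDEKR"]

# Step 1: mine frequent substrings in the original 20-letter alphabet.
patterns = frequent_substrings(train, min_len=2, max_len=3, min_support=2)

# Step 2: transform the mined patterns into the reduced alphabet.
reduced_patterns = sorted({reduce_sequence(p) for p in patterns})

# Step 3: build feature vectors for a new sequence in both alphabets.
query = "MRLVAG"
raw_features = featurize(query, patterns)
reduced_features = featurize(reduce_sequence(query), reduced_patterns)

print(patterns, raw_features)
print(reduced_patterns, reduced_features)
```

In this toy case the query matches none of the mined patterns in the original alphabet, but matches both after reduction, because K/R and the aliphatic residues fall into the same groups. Vectors of this kind, one set per alphabet, could then be passed to any SVM implementation to train the per-alphabet members of an ensemble.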