Download List of protein families currently covered by SVMProt

Appendix S2 Method for computing the feature vector of a protein sequence

A protein sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties, including amino acid composition, hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility for each residue in the sequence. For each of these properties, amino acids are divided into three groups such that those in a particular group are regarded as having the same property. For instance, amino acids can be divided into hydrophobic (CVLIMFW), neutral (GASTPHY), and polar (RKEDQN) groups. The groupings of amino acids for each of the properties are given in Table 1.

Table 1. Division of amino acids into 3 different groups for different physicochemical properties

Property                              Group 1    Group 2       Group 3
Hydrophobicity         Type           Polar      Neutral       Hydrophobic
                       Amino acids    RKEDQN     GASTPHY       CVLIMFW
Van der Waals volume   Value          0~2.78     2.95~4.0      4.43~8.08
                       Amino acids    GASCTPD    NVEQIL        MHKFRYW
Polarity               Value          4.9~6.2    8.0~9.2       10.4~13.0
                       Amino acids    LIFWCMVY   PATGS         HQRKNED
Polarizability         Value          0~0.108    0.128~0.186   0.219~0.409
                       Amino acids    GASDT      CPNVEQIL      KMHFRYW

Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of the properties. C is the number of amino acids of a particular property (such as hydrophobicity) divided by the total number of amino acids in a protein sequence, expressed as a percentage. T characterizes the percent frequency with which amino acids of a particular property are followed by amino acids of a different property. D measures the chain length within which the first, 25%, 50%, 75% and 100% of the amino acids of a particular property are located, respectively. Figure 1 shows a hypothetical protein sequence AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE.

Figure 1.
The sequence of a hypothetical protein used to illustrate the derivation of the feature vector of a protein. Sequence index indicates the position of an amino acid in the sequence. The index for each type of amino acid in the sequence (A or E) indicates the position of the first, second, third, … of that type of amino acid (the positions of the first, second, third, … A are 1, 3, 4, …). A/E transition indicates the positions of AE or EA pairs in the sequence.

The sequence has 16 alanines (n1=16) and 14 glutamic acids (n2=14). The compositions for these two amino acids are n1×100.00/(n1+n2)=53.33 and n2×100.00/(n1+n2)=46.67 respectively. There are 15 transitions from A to E or from E to A among the 29 adjacent residue pairs in this sequence, so the percent frequency of these transitions is (15/29)×100.00=51.72. The first, 25%, 50%, 75% and 100% of As are located within the first 1, 5, 12, 20, and 29 residues respectively. The D descriptor for As is thus 1/30×100.00=3.33, 5/30×100.00=16.67, 12/30×100.00=40.0, 20/30×100.00=66.67, 29/30×100.00=96.67. Likewise, the D descriptor for Es is 6.67, 26.67, 60.0, 76.67, 100.0. Overall, the amino acid composition descriptors for this sequence are C=(53.33, 46.67), T=(51.72), and D=(3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0) respectively.

Descriptors for other properties can be computed by a similar procedure. For a property divided into three groups, the three descriptors comprise 21 elements: 3 for C, 3 for T and 15 for D (Bock & Gough, 2001; Cai et al., 2003). The feature vector of a protein is constructed by combining the 21 elements of each of these properties and the 20 elements of amino acid composition in sequential order. Table 2 gives the computed descriptors of the human insulin precursor (SwissProt AC P01308).

Table 2. Characteristic descriptors of human insulin precursor (SwissProt AC P01308).
The feature vector of this protein is constructed by combining all of the descriptors in sequential order.

Property                 Elements of descriptors
Amino acid composition   9.09 5.45 1.82 7.27 2.73 10.91 1.82 1.82 1.82 18.18
                         1.82 2.73 5.45 6.36 4.55 4.55 2.73 5.45 1.82 3.64
Hydrophobicity           24.55 38.18 37.27 15.60 16.51 30.28 5.45 40.91 54.55 80.00
                         100.0 1.82 21.82 47.27 68.18 98.18 0.91 12.73 37.27 72.37 99.09
Van der Waals volume     40.00 41.82 18.18 29.36 11.01 13.76 1.82 21.82 52.73 71.82
                         99.09 2.73 25.45 56.36 78.18 100.0 0.91 15.45 41.82 50.00 98.18
Polarity                 40.91 32.73 26.36 24.77 20.18 13.76 0.91 14.55 38.18 74.55
                         99.09 1.82 20.91 49.09 68.18 91.82 5.45 33.64 53.64 79.09 100.0
Polarizability           29.09 52.73 18.18 31.19 9.17 15.60 1.82 21.82 52.73 68.18
                         91.82 2.73 25.45 56.36 79.09 100.0 0.91 15.45 41.82 50.00 98.18

References

Bock JR, Gough DA. 2001. Predicting protein-protein interactions from primary structure. Bioinformatics 17: 455-460.

Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. 2003. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Research 31: 3692-3697.
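The composition, transition and distribution calculation worked through for the hypothetical A/E sequence in this appendix can be sketched in Python as follows. This is a minimal illustration, not the SVMProt implementation: the function name ctd_descriptors and its arguments are hypothetical, the rule of taking the floor of 25%, 50% and 75% of the residue count is inferred from the worked example (it reproduces the quoted values for E), and returning zeros for a group that does not occur in the sequence is an assumption. The sketch also assumes a sequence of at least two residues, all of which belong to one of the supplied groups.

```python
from itertools import combinations


def ctd_descriptors(sequence, groups):
    """Return (C, T, D) as lists of percentages.

    `groups` is a list of strings; each string holds the residues of one group.
    Hypothetical helper written for illustration only.
    """
    # Map every residue letter to the index of its group.
    group_of = {res: g for g, members in enumerate(groups) for res in members}
    encoded = [group_of[res] for res in sequence]
    length = len(encoded)
    n_groups = len(groups)

    # C: percentage of residues that fall in each group.
    composition = [100.0 * encoded.count(g) / length for g in range(n_groups)]

    # T: percentage of adjacent pairs that switch between two given groups,
    # in either direction, out of the length - 1 adjacent pairs.
    pairs = list(zip(encoded, encoded[1:]))
    transition = [
        100.0 * sum(1 for p in pairs if set(p) == {a, b}) / (length - 1)
        for a, b in combinations(range(n_groups), 2)
    ]

    # D: chain position (as a percentage of the length) of the first residue
    # of a group and of the residues at 25%, 50%, 75% and 100% of its count.
    distribution = []
    for g in range(n_groups):
        positions = [i + 1 for i, code in enumerate(encoded) if code == g]
        if not positions:
            # Assumption: an absent group contributes five zeros.
            distribution.extend([0.0] * 5)
            continue
        n = len(positions)
        # Assumption inferred from the worked example: floor of the fraction,
        # but never below the first occurrence.
        ranks = [1] + [max(1, n * pct // 100) for pct in (25, 50, 75, 100)]
        distribution.extend(100.0 * positions[r - 1] / length for r in ranks)

    return composition, transition, distribution


seq = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"
C, T, D = ctd_descriptors(seq, ["A", "E"])
print([round(x, 2) for x in C])
print([round(x, 2) for x in T])
print([round(x, 2) for x in D])
```

Run as written, the three printed lists reproduce the C, T and D values quoted above for this sequence. Passing the three hydrophobicity groups from Table 1, for example ["RKEDQN", "GASTPHY", "CVLIMFW"], yields 3 + 3 + 15 = 21 elements for a real protein sequence.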
