List of protein families currently covered by SVMProt

Appendix S2
Method for computing the feature vector of a protein sequence
A protein sequence is represented by a specific feature vector assembled from encoded representations of
tabulated residue properties, including amino acid composition, hydrophobicity, normalized Van der Waals
volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility, for each
residue in the sequence.
For each of these properties, amino acids are divided into three groups such that those in a particular
group are regarded as having the same property. For instance, amino acids can be divided into hydrophobic
(CVLIMFW), neutral (GASTPHY), and polar (RKEDQN) groups. The groupings of amino acids for each of
the properties are given in Table 1:
Table 1. Division of amino acids into 3 different groups for different physicochemical properties
Property              Row                   Group 1   Group 2       Group 3
Hydrophobicity        Type                  Polar     Neutral       Hydrophobic
                      Amino acids in group  RKEDQN    GASTPHY       CVLIMFW
Van der Waals volume  Value                 0~2.78    2.95~4.0      4.43~8.08
                      Amino acids in group  GASCTPD   NVEQIL        MHKFRYW
Polarity              Value                 4.9~6.2   8.0~9.2       10.4~13.0
                      Amino acids in group  LIFWCMVY  PATGS         HQRKNED
Polarizability        Value                 0~0.108   0.128~0.186   0.219~0.409
                      Amino acids in group  GASDT     CPNVEQIL      KMHFRYW
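The grouping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `encode` and the test peptide are hypothetical, and the group numbering (1 = polar, 2 = neutral, 3 = hydrophobic) follows the hydrophobicity grouping given above.

```python
# Hydrophobicity grouping from the text: polar, neutral, hydrophobic.
HYDROPHOBICITY_GROUPS = ("RKEDQN", "GASTPHY", "CVLIMFW")


def encode(sequence, groups):
    """Replace every residue by the index (1, 2 or 3) of its group.

    Raises KeyError for residues outside the 20 standard amino acids;
    how such residues should be handled is not stated in the text.
    """
    lookup = {aa: idx
              for idx, members in enumerate(groups, start=1)
              for aa in members}
    return [lookup[aa] for aa in sequence.upper()]


# Hypothetical ten-residue peptide, used only for illustration.
print(encode("MALWMRLLPL", HYDROPHOBICITY_GROUPS))
# → [3, 2, 3, 3, 3, 1, 3, 3, 2, 3]
```

The same function works for any other property by passing that property's three groups.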
Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global
composition of each of the properties. C is the number of amino acids of a particular property (such as
hydrophobicity) divided by the total number of amino acids in a protein sequence, expressed as a percentage.
T characterizes the percent frequency with which amino acids of a particular property are followed by amino
acids of a different property. D measures the chain length, as a percentage of the full sequence length, within
which the first, 25%, 50%, 75% and 100% of the amino acids of a particular property are located, respectively.
Figure 1 shows a hypothetical protein sequence AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE.
Figure 1. The sequence of a hypothetical protein, used to illustrate the derivation of the feature vector of a protein.
Sequence index indicates the position of an amino acid in the sequence. The index for each type of amino acid in the
sequence (A or E) indicates the position of the first, second, third, … of that type of amino acid (the positions of the
first, second, third, … A are 1, 3, 4, …). A/E transition indicates the positions of AE or EA pairs in the sequence.
This sequence has 16 alanines (n1=16) and 14 glutamic acids (n2=14). The compositions of these two amino acids
are n1×100.00/(n1+n2)=53.33 and n2×100.00/(n1+n2)=46.67, respectively. There are 15 transitions from A to E or
from E to A among the 29 adjacent residue pairs in this sequence, so the percent frequency of these transitions
is (15/29)×100.00=51.72. The first, 25%, 50%, 75% and 100% of As are located within the first 1, 5, 12, 20, and 29
residues, respectively. The D descriptor for As is thus 1/30×100.00=3.33, 5/30×100.00=16.67,
12/30×100.00=40.0, 20/30×100.00=66.67, 29/30×100.00=96.67. Likewise, the D descriptor for Es is 6.67, 26.67,
60.0, 76.67, 100.0. Overall, the descriptors for this sequence are C=(53.33, 46.67), T=(51.72), and D=(3.33,
16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0), respectively. Descriptors for other properties can be
computed by a similar procedure.
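The worked example can be reproduced with a short Python sketch. The function names are hypothetical. One detail is an assumption inferred from the numbers above rather than stated in the text: when 25%, 50% or 75% of the residue count is not a whole number (for example 25% of 14 Es is 3.5), the rank is rounded down.

```python
SEQ = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"  # hypothetical sequence of Figure 1


def composition(seq, symbols):
    """Percentage of residues belonging to each symbol."""
    return [round(100.0 * seq.count(s) / len(seq), 2) for s in symbols]


def transition(seq, a, b):
    """Percentage of adjacent pairs that are 'ab' or 'ba'."""
    pairs = list(zip(seq, seq[1:]))
    hits = sum(1 for x, y in pairs if {x, y} == {a, b})
    return round(100.0 * hits / len(pairs), 2)


def distribution(seq, symbol):
    """Chain length (percent of sequence) holding the first, 25%, 50%,
    75% and 100% of the residues of one symbol."""
    positions = [i for i, s in enumerate(seq, start=1) if s == symbol]
    if not positions:
        return [0.0] * 5  # assumption: a symbol that never occurs gives zeros
    n = len(positions)
    # Fractional ranks are floored (assumption inferred from the example).
    ranks = [1] + [max(1, n * pct // 100) for pct in (25, 50, 75, 100)]
    return [round(100.0 * positions[r - 1] / len(seq), 2) for r in ranks]


print(composition(SEQ, "AE"))   # → [53.33, 46.67]
print(transition(SEQ, "A", "E"))  # → 51.72
print(distribution(SEQ, "A"))   # → [3.33, 16.67, 40.0, 66.67, 96.67]
print(distribution(SEQ, "E"))   # → [6.67, 26.67, 60.0, 76.67, 100.0]
```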
Overall, for each property there are 21 elements representing these three descriptors: 3 for C, 3 for T and 15
for D (Bock & Gough, 2001; Cai et al., 2003). The feature vector of a protein is constructed by combining the 21
elements of all of these properties and the 20 elements of amino acid composition in sequential order. Table 2
gives the computed descriptors of the human insulin precursor (SwissProt AC P01308).
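A minimal sketch of how the 20 amino acid composition elements and the 21 elements per property might be assembled is given below. It is not the authors' implementation. The names `ctd` and `feature_vector`, the alphabetical residue order, the pair order used for T (groups 1-2, 1-3, 2-3), the rounding down of fractional ranks in D, and the test peptide are all assumptions made for illustration. The groupings are those listed for hydrophobicity, Van der Waals volume, polarity and polarizability.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # residue order is an assumption

# (group 1, group 2, group 3) for each tabulated property.
PROPERTIES = {
    "hydrophobicity": ("RKEDQN", "GASTPHY", "CVLIMFW"),
    "van_der_waals_volume": ("GASCTPD", "NVEQIL", "MHKFRYW"),
    "polarity": ("LIFWCMVY", "PATGS", "HQRKNED"),
    "polarizability": ("GASDT", "CPNVEQIL", "KMHFRYW"),
}


def ctd(sequence, groups):
    """Return the 21 elements (3 C + 3 T + 15 D) for one property."""
    lookup = {aa: g for g, members in enumerate(groups, start=1)
              for aa in members}
    enc = [lookup[aa] for aa in sequence]
    n = len(enc)
    comp = [100.0 * enc.count(g) / n for g in (1, 2, 3)]
    pairs = list(zip(enc, enc[1:]))  # needs at least two residues
    trans = [100.0 * sum(1 for p in pairs if set(p) == {a, b}) / len(pairs)
             for a, b in combinations((1, 2, 3), 2)]
    dist = []
    for g in (1, 2, 3):
        pos = [i for i, x in enumerate(enc, start=1) if x == g]
        if not pos:
            dist += [0.0] * 5  # assumption: empty group gives zeros
            continue
        ranks = [1] + [max(1, len(pos) * pct // 100)
                       for pct in (25, 50, 75, 100)]
        dist += [100.0 * pos[r - 1] / n for r in ranks]
    return comp + trans + dist


def feature_vector(sequence, properties=PROPERTIES):
    """Amino acid composition followed by the CTD block of each property."""
    vec = [100.0 * sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    for groups in properties.values():
        vec += ctd(sequence, groups)
    return vec


peptide = "MALWMRLLPL"  # hypothetical peptide for illustration
vec = feature_vector(peptide)
print(len(vec))  # → 104, i.e. 20 + 4 × 21
```

With the four properties listed here the vector has 104 elements; including further properties would add 21 elements each.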
Table 2. Characteristic descriptors of human insulin precursor (SwissProt AC P01308). The feature vector
of this protein is constructed by combining all of the descriptors in sequential order.
Property                Elements of Descriptors
Amino acid composition  9.09 5.45 1.82 7.27 2.73 10.91 1.82 1.82 1.82 18.18
                        1.82 2.73 5.45 6.36 4.55 4.55 2.73 5.45 1.82 3.64
Hydrophobicity          24.55 38.18 37.27 15.60 16.51 30.28 5.45 40.91 54.55 80.00
                        100.0 1.82 21.82 47.27 68.18 98.18 0.91 12.73 37.27 72.37
                        99.09
Van der Waals volume    40.00 41.82 18.18 29.36 11.01 13.76 1.82 21.82 52.73 71.82
                        99.09 2.73 25.45 56.36 78.18 100.0 0.91 15.45 41.82 50.00
                        98.18
Polarity                40.91 32.73 26.36 24.77 20.18 13.76 0.91 14.55 38.18 74.55
                        99.09 1.82 20.91 49.09 68.18 91.82 5.45 33.64 53.64 79.09
                        100.0
Polarizability          29.09 52.73 18.18 31.19 9.17 15.60 1.82 21.82 52.73 68.18
                        91.82 2.73 25.45 56.36 79.09 100.0 0.91 15.45 41.82 50.00
                        98.18
References
Bock JR, Gough DA. 2001. Predicting protein-protein interactions from primary structure. Bioinformatics
17: 455-460.
Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. 2003. SVM-Prot: Web-based support vector machine
software for functional classification of a protein from its primary sequence. Nucleic Acids Research 31:
3692-3697.