Predicting functionally related proteins based on regulatory features

Shih-Feng Wang, Tzu-Wen Lin, Shao-Ting Jang and Darby Tien-Hao Chang§

Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan

§ Corresponding author

Email addresses:
SFW: [email protected]
TWL: [email protected]
STJ: [email protected]
DTHC: [email protected]

Abstract

Background

Protein functions are essential to many biological processes. Elucidating these protein functions and linking functionally related proteins improves our understanding of the mechanisms of biological systems at the molecular level. Many features intrinsic to proteins (e.g., protein sequences, structures, and functions) have been studied to predict functionally related proteins. However, no study has analyzed the regulatory features (e.g., the transcription factors that regulate the gene of a protein) of two interacting proteins. This study examines whether regulatory features still affect functional relationships across the gap between genes and their protein products, and builds a regulatory feature-based prediction model for functionally related proteins.

Results

This work comprehensively analyzed regulatory features. Eight transcriptional characteristics were identified (DNA bendability, gene size, gene distance, transcription direction, nucleosome occupancy, TATA box status, transcription factor (TF) binding evidence, and TF knockout evidence), together with transcription factor binding site (TFBS) similarity. Experimental results show that adding gene distance and TATA box status improved accuracy when predicting functionally related proteins, and indicate that regulatory features influence functional relations across the gap from genes to their protein products. For Saccharomyces cerevisiae, the proposed prediction method is more accurate than previous methods.

Conclusions

This work is the first to assess the effectiveness of using regulatory features to predict functionally related proteins.
The proposed encoding method for regulatory characteristics can determine whether two proteins are functionally related.

Background

Many protein functions are essential to biological processes in living cells. Elucidating these protein functions and linking functionally related proteins improves our understanding of the mechanisms of biological systems at the molecular level [1]. As the number of sequenced genomes increases, conducting biological experiments to identify all functionally related protein pairs is impractical in terms of both time and cost. Thus, new computational methods are needed.

Various computational methods have been applied to predict functional linkages between proteins based on the observation that functionally related proteins show co-occurrence patterns. Shoemaker and Panchenko have reviewed these methods [2]. Gene neighbor and gene cluster methods infer functional linkages based on the observation that genes whose protein products interact are usually clustered within a transcriptional unit, an operon, in a genome [3-5]. The Rosetta Stone method, conversely, is based on the finding that certain interacting proteins in one organism have homologues in another organism that are fused into a single protein chain, a Rosetta Stone protein [6-9]. The gene neighbor, gene cluster, and Rosetta Stone methods share the disadvantage that only a very limited number of functional linkages have such specific co-occurrence patterns. Thus, recent co-occurrence-based methods have shifted to phylogenetic profiling (PP)-based methods, which identify a more general co-occurrence pattern. The basic assumption of PP-based methods is that the co-presence and co-absence of proteins across organisms, the coevolved pattern, result from functional linkages between proteins [10-14]. These PP-based methods, which perform adequately, have been applied mainly to prokaryotes.
A previous study proposed a two-stage framework that integrated machine learning (ML) with a PP-based method to overcome the limitations of PP [15]. ML techniques have been widely used to predict protein relations, and several techniques have been developed to capture important features of protein pairs. In this work, eight regulatory characteristics are added to the two-stage predictor to increase prediction accuracy, and hybrid feature selection is further used to strengthen their effect. For Saccharomyces cerevisiae, the proposed method predicts more accurately than previous methods. Additionally, the prediction results indicate that regulatory characteristics affect functional linkages between proteins.

Methods

Data collection

This work retrieved the 6,717 gene sequences of S. cerevisiae proteins from the Saccharomyces Genome Database (SGD) release of February 3, 2011. A protein-protein interaction dataset containing 198,376 pairs was collected from the Biological General Repository for Interaction Datasets (BioGRID), version 3.1.89. The eight regulatory features came from different databases. Gene size and gene distance were calculated from the SGD gene sequences. DNA bendability, nucleosome occupancy, TATA box status, and TFBS similarity came from the Yeast Promoter Atlas (YPA) database [24]. Transcription factor binding evidence and transcription factor knockout evidence came from the YEASTRACT database [27].

In addition to an evaluation organism, the first stage of the proposed framework requires a reference collection to construct phylogenetic profiles. This work used the 132 eukaryotes in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database to compile a eukaryotic reference collection. For these reference organisms, only the gene and protein sequences were required.
Since the functional linkage information of the reference organisms was not used, these organisms were not training data for machine learning.

Phylogenetic profile

PP-based methods rely on the observation that genes with similar phylogenetic profiles tend to exist in the same protein complex, biochemical pathway, or sub-cellular compartment. Here, the phylogenetic profile of a gene is a vector representing the presence or absence of homologues to that gene across the reference collection. PP-based methods raise two issues: (i) how to construct the phylogenetic profile of a given gene; and (ii) how to determine the similarity of two phylogenetic profiles.

First, the presence or absence of homologues can be determined by sequence alignment scores, such as a BLAST E-value. A protein is considered "present" in an organism when the sequence alignment score between the protein and at least one protein in the organism exceeds a threshold. Such binary vectors can be improved by using real-valued vectors of normalized alignment scores, which avoids choosing an arbitrary score threshold. A real-valued phylogenetic profile is adopted in this work. Suppose a collection of n reference organisms is used to build the phylogenetic profile of a query gene. The first step is to compare the open reading frame (ORF) of the query gene with all ORFs of the n reference organisms using BLAST. The best bit score between the query gene a and all ORFs of a reference organism b is used to measure the presence of a in b, called the "S-value of gene a and organism b" ($S_{ab}$). Because non-homologous genes can still align with each other by chance, the S-value is set to zero when it is lower than 50. Because the bit score depends on the sequence of a, the S-value is further normalized as an R-value by the following equation:

$$R_{ab} = \frac{S_{ab}}{S_{aa}},$$

where $S_{aa}$ is the score obtained by aligning a to itself.
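The trimming and self-normalization steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name, argument names, and example bit scores are hypothetical, and only the cutoff of 50 and the ratio of the S-value to the self-alignment score come from the text.

```python
def r_values(best_bit_scores, self_score, min_bit_score=50.0):
    """Turn best BLAST bit scores into trimmed, self-normalized R-values.

    best_bit_scores: best bit score of query gene a against each reference
                     organism b (the S-values), one number per organism.
    self_score:      bit score obtained by aligning a to itself.
    """
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    profile = []
    for s_ab in best_bit_scores:
        # Trim weak alignments: S-values below the cutoff are set to zero.
        if s_ab < min_bit_score:
            s_ab = 0.0
        # R-value = S-value divided by the self-alignment score.
        profile.append(s_ab / self_score)
    return profile


# Hypothetical bit scores of one gene against four reference organisms.
example_profile = r_values([320.0, 49.9, 50.0, 0.0], self_score=400.0)
print(example_profile)
```

The sketch stops at the self-normalized profile; any further normalization across genes would be applied to its output.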
The n-dimensional vector of R-values obtained by BLASTing a gene against the n reference organisms represents the phylogenetic profile of that gene. In addition, the non-zero R-values of all genes of the query organism against a reference organism are normalized by dividing by their average. This procedure prevents the similarity of two phylogenetic profiles from being dominated by a few large R-values that result from phylogenetically close organisms.

Second, any similarity or distance function on vectors, such as cosine similarity or Euclidean distance [33], can be used to define the similarity of phylogenetic profiles. Enault et al. examined four widely used distance functions and concluded that the inner product is a good indicator [12]. In this work, the similarity between two genes, i and j, is defined as follows:

$$\mathrm{Sim}(i, j) = \frac{\sum_{k=1}^{n} R_{ik} R_{jk}}{\left( \sum_{k=1}^{n} R_{ik}^{2} \, \sum_{k=1}^{n} R_{jk}^{2} \right)^{1/2}}.$$

Feature encoding

The second stage of this work retrieves the 20,000 gene pairs with the highest similarity from the first stage, encodes each gene pair as a feature vector, and then invokes a classifier to perform the prediction. This subsection describes the feature encoding process, while the next subsection describes the classification algorithm. The feature set can be divided into two parts.

The first feature set considers the conjoint triads observed in a protein sequence. A conjoint triad regards three continuous residues as a unit. Each gene pair is then encoded by concatenating the feature vectors of the two individual genes. However, to consider all $20^3$ conjoint triads, one must use a 16,000-dimensional feature vector to encode a gene pair, which exceeds the size limit of contemporary classifiers. Thus, Shen et al. clustered the 20 amino acid types into seven groups based on dipole strength and side chain volume, thereby reducing the dimensions of the feature vector. Table 1 lists these seven amino acid groups. Figure 11 shows the process of encoding a protein sequence.
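As a rough illustration of the conjoint triad encoding introduced here, the following Python sketch maps residues to the seven groups of Table 1, counts overlapping triads of group labels, and concatenates the vectors of two sequences. It is an assumption-laden sketch: the ordering of the 343 triads, the use of one-letter residue codes, and the decision to skip non-standard residues are choices made for this example and are not specified in the text.

```python
from itertools import product

# Seven amino acid groups (one-letter codes), following Table 1.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}

# Fixed (assumed) ordering of the 7^3 = 343 possible triads of group indices.
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(7), repeat=3))}


def triad_vector(sequence):
    """Count conjoint triads of amino acid groups in a protein sequence."""
    counts = [0] * len(TRIAD_INDEX)
    # Assumption: residues outside the 20 standard amino acids are skipped.
    groups = [GROUP_OF[aa] for aa in sequence.upper() if aa in GROUP_OF]
    # Slide a window of three consecutive group labels along the sequence.
    for i in range(len(groups) - 2):
        counts[TRIAD_INDEX[tuple(groups[i:i + 3])]] += 1
    return counts


def pair_vector(seq_a, seq_b):
    """Concatenate two 343-dimensional vectors into one of 686 dimensions."""
    return triad_vector(seq_a) + triad_vector(seq_b)
```

A sequence of length L with only standard residues contributes L - 2 triad counts, so longer proteins yield larger counts; the sketch leaves any length normalization out because the text does not describe one.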
First, the protein sequence is transformed into a sequence of amino acid groups. Then the triads are scanned along the sequence of amino acid groups. Each scanned triad is counted in an occurrence vector, O. Each element $o_i$ in O represents the number of triads of the i-th type observed in the sequence of amino acid groups. Accordingly, each protein sequence is represented as a 343-dimensional occurrence vector. For a protein pair, the vectors of the two protein sequences are concatenated to form a 686-dimensional feature vector.

The second feature set considers different regulatory features, which are discussed in the subsections of the "Adding eight regulatory features" section. The encoding process of each feature is described in the corresponding subsection.

Feature selection

Feature selection is a common way to strengthen the effect of the eight added regulatory features and to remove redundant dimensions among the 686 sequence features of the two-stage predictor. There are two types of feature selection: filters and wrappers. In this work, filters were chosen and further divided into a supervised part and an unsupervised part. Because each part may be biased, a two-stage feature selection is designed to combine their results.

First, for the supervised part, seven well-known feature selection methods, namely the Chi-square test, Pearson correlation, Distance correlation, Kendall’s tau correlation, Spearman’s rank correlation, Random forest, and the Maximal information coefficient (MIC), are used to rank the importance of each of the 686 dimensions of the two-stage predictor. The training data are used as input, and each method is run independently, generating seven different results. For the unsupervised part, there are four methods: Principal component analysis (PCA), Laplacian score, variance, and Spectral feature selection.
The last three methods rank the importance of each dimension, whereas PCA merges the original dimensions into new dimensions. This part takes the 6,717 gene sequences as input data. The 343 dimensions of each gene sequence are either selected or merged and then combined to form the new dimensions of a gene pair.

Second, the best number of dimensions must be determined. Because 686 dimensions is far more than the 12 dimensions of the eight regulatory features, we start with about half of it, 350 dimensions, and then reduce the number by 50 dimensions each time. Once the best number of dimensions is known, the best three supervised methods and the best two unsupervised methods are chosen. We then combine one supervised method with one unsupervised method in different proportions to obtain the best performance.

TFBS similarity scores

This work uses seven TFBS similarity scores to encode a protein pair. First, van Helden used the Poisson distribution to compute the probability of a common transcription factor that binds to the sequence between two genes [16]. Second, Garten et al. used the cumulative hypergeometric test to estimate the significance of the overlap of two TF sets [17]. Third, Veerla and Höglund used the Jaccard index to determine the regulatory similarity of two genes [18]. Fourth, Kafri et al. proposed a formula to compute the ratio of the union to the intersection of TFs to determine regulatory similarity [19]. Fifth, Kim et al. used the number of common TFs in two sequences to determine regulatory similarity [20]. Sixth, Park et al. considered the proportion of TFBSs in common and introduced a penalty term for TFBSs appearing in only one gene’s promoter [21]. Seventh, Shalgi et al. proposed a formula that replaces the denominator of the Jaccard index with the smaller number of TFs that regulate either gene [22].
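Three of the seven scores listed above are described precisely enough to sketch directly from the text: the number of common TFs, the Jaccard index, and the Jaccard variant whose denominator is the size of the smaller TF set. The Python sketch below is illustrative only. The function names, the handling of empty sets, and the example TF sets are assumptions, and the cited papers may define further details that are not reproduced here.

```python
def common_tf_count(tfs_a, tfs_b):
    """Number of TFs shared by the two genes."""
    return len(set(tfs_a) & set(tfs_b))


def jaccard_index(tfs_a, tfs_b):
    """Size of the intersection divided by the size of the union.

    Assumption: returns 0.0 when both sets are empty.
    """
    a, b = set(tfs_a), set(tfs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def min_normalized_overlap(tfs_a, tfs_b):
    """Jaccard-like score whose denominator is the smaller TF set size.

    Assumption: returns 0.0 when either set is empty.
    """
    a, b = set(tfs_a), set(tfs_b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical TF sets for two genes, for illustration only.
gene_a = {"GCN4", "RAP1", "ABF1"}
gene_b = {"RAP1", "ABF1", "REB1", "CBF1"}
print(common_tf_count(gene_a, gene_b))
print(jaccard_index(gene_a, gene_b))
print(min_normalized_overlap(gene_a, gene_b))
```

The scores based on the Poisson distribution, the hypergeometric test, and the penalty term are omitted because the text does not give enough detail to reconstruct them.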
Results

Feature selection

The eleven methods (seven supervised, four unsupervised) are carried out to determine the best number of dimensions. The AUC of each method at each number of dimensions is shown in Table 2. According to Table 2, 300 dimensions achieve the highest AUC for most methods and for the average of all methods. Therefore, we conclude that 300 is the best number of dimensions. Further, three supervised methods (Distance correlation, Kendall’s tau correlation, and Spearman’s rank correlation) and two unsupervised methods (Laplacian score and Spectral feature selection) are selected and paired to construct six combinations, each with the proportion that achieves the highest AUC. The AUCs of the 686 dimensions and of the six combinations are shown in Table 3. Among the combinations in Table 3, the combination of Laplacian score (120 dimensions) and Distance correlation (180 dimensions) has the highest AUC and improves performance compared with the 686 dimensions.

Adding eight regulatory features

This work added eight regulatory characteristics (DNA bendability, gene distance, gene size, transcription direction, nucleosome occupancy, TATA box status, transcription factor (TF) binding evidence, and TF knockout evidence) as well as transcription factor binding site (TFBS) similarity. Figure 1 shows the prediction performance of individually adding the first eight regulatory features; Figure 2 shows the prediction performance of TFBS, which includes seven TFBS similarity scores; Table 4 summarizes these two figures.

Adding DNA bendability

The DNA bendability data used in this work are collected from the YPA database [24] and are based on the DNase I experiments conducted by Brukner et al. [23]. In this work, each gene has a corresponding DNA bendability score, which is the average of the bendability at each position on the gene.
In this analysis, the two DNA bendability scores of a protein pair are repeated five times and appended to the original vector of 686 features to form a new one of 696 features.

Adding gene distance

In this analysis, if two genes are on the same chromosome, their gene distance is the shortest distance from one gene to the other. If two genes are on the same chromosome and overlap, their distance is zero. If two genes are not on the same chromosome, their distance is -1. The gene distance of a protein pair is repeated ten times and appended to the original vector of 686 features to form a new one of 696 features.

Adding gene size

From the YPA database, this work collects the positions of the start codon and stop codon of each gene in the yeast genome. Gene size is the distance between the start codon and the stop codon. In this analysis, the two gene sizes of a protein pair are repeated five times and appended to the original vector of 686 features to form a new one of 696 features.

Adding nucleosome occupancy

The nucleosome occupancy data used in this work are collected from the YPA database and are based on two models proposed in 2009 [25, 26]. In this work, each gene has a corresponding nucleosome occupancy score, which is the average of the nucleosome occupancy at each position on the gene. For every gene pair, the average and the difference of the nucleosome occupancy scores of the two genes are added as features. In this analysis, the two nucleosome occupancy scores of a protein pair are repeated five times and appended to the original vector of 300 features to form a new one of 310 features.

Adding TATA box status

This study collects TATA box status from the YPA database, which states whether a TATA box exists in a gene’s promoter. If a TATA box exists in a gene’s promoter, the TATA box status of the gene is 1. If no TATA box exists in a gene’s promoter, the TATA box status of the gene is 0.
In this analysis, the two TATA box statuses of a protein pair are repeated five times and appended to the original vector of 300 features to form a new one of 310 features.

Adding transcription factor binding evidence

This work uses TF binding evidence, which indicates whether a TF binds a gene’s promoter based on ChIP experiments. The TF binding score of a protein pair a and b is

$$\frac{|TF_a \cap TF_b|}{|TF_a \cup TF_b|},$$

where $TF_a$ and $TF_b$ are the sets of TFs that bind gene a and gene b, respectively. In this analysis, the TF binding score of a protein pair is repeated ten times and appended to the original vector of 300 features to form a new one of 310 features.

Adding transcription factor knockout evidence

This work uses TF knockout evidence, which indicates whether the expression of a gene changes significantly after a TF is knocked out. The TF knockout score of a protein pair is defined in the same way as the TF binding score, except that $TF_a$ and $TF_b$ are the sets of TFs whose knockout results in a significant expression change of gene a and gene b, respectively. In this analysis, the TF knockout score is repeated ten times and appended to the original vector of 300 features to form a new one of 310 features.

Adding TFBS similarity

This work uses TFBS data from the YPA database, which include 422,576 TFBS locations for 164 TFs. For each protein pair, seven TFBS similarity scores are calculated: van Helden [16], Garten et al. [17], Veerla and Höglund [18], Kafri et al. [19], Kim et al. [20], Park et al. [21] and Shalgi et al. [22]. Table 4 and Figure 2 show the prediction performance after adding these seven TFBS similarity scores. In this analysis, each TFBS similarity score is individually repeated ten times and appended to the original vector of 686 features to form a new one of 696 features.
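The repeat-and-append encoding used throughout these subsections can be sketched as follows, using gene distance as the example feature. This is a minimal sketch under stated assumptions: genes are represented as hypothetical (chromosome, start, end) tuples, the distance between non-overlapping genes is taken as the gap between their nearer ends, and each value is repeated consecutively at the end of the vector. None of these representation details is specified in the text.

```python
def gene_distance(gene_a, gene_b):
    """Distance between two genes given as (chromosome, start, end).

    Returns -1 for genes on different chromosomes, 0 for overlapping genes,
    and otherwise the gap between the nearer ends (an assumed definition of
    "shortest distance").
    """
    chrom_a, start_a, end_a = gene_a
    chrom_b, start_b, end_b = gene_b
    if chrom_a != chrom_b:
        return -1
    return max(0, max(start_a, start_b) - min(end_a, end_b))


def append_repeated(base_vector, extra_values, repeats):
    """Return a copy of base_vector with each extra value appended `repeats` times."""
    extended = list(base_vector)
    for value in extra_values:
        extended.extend([value] * repeats)
    return extended


base = [0] * 686  # stand-in for the 686 sequence features
d = gene_distance(("IV", 1000, 2500), ("IV", 4000, 5200))  # hypothetical coordinates
with_distance = append_repeated(base, [d], repeats=10)
print(d, len(with_distance))
```

A single pair-level value repeated ten times and two gene-level values repeated five times each both add ten dimensions, which matches the 686-to-696 and 300-to-310 counts reported in the text.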
Adding the van Helden similarity score slightly decreases prediction accuracy (>90% for the top 2,631 predictions), although the prediction AUC (0.6447) is slightly higher than that of the original vector; adding the Garten et al. TFBS similarity score decreases both prediction accuracy (>90% for the top 2,488 predictions) and AUC (0.6403); adding the Veerla and Höglund TFBS similarity score decreases both prediction accuracy (>90% for the top 2,523 predictions) and AUC (0.6428); adding the Kafri et al. TFBS similarity score decreases both prediction accuracy (>90% for the top 2,525 predictions) and AUC (0.6427); adding the Kim et al. TFBS similarity score decreases both prediction accuracy (>90% for the top 2,525 predictions) and AUC (0.6428); adding the Park et al. TFBS similarity score slightly decreases prediction accuracy (>90% for the top 2,614 predictions), although the prediction AUC (0.6444) is slightly higher than that of the original vector; and adding the Shalgi et al. TFBS similarity score decreases both prediction accuracy (>90% for the top 2,522 predictions) and AUC (0.6427).

As a result, adding TFBS information based on the van Helden and Park et al. scores improves prediction performance. This result indicates not only that TFBS information may improve the identification of functional linkages between proteins but also that the encoding process of regulatory features is important. Adding all seven TFBS similarity scores, which forms a 693-dimensional vector, achieves a prediction AUC of 0.6650, better than that of adding any individual TFBS similarity score.

Summary

After adding gene distance and TATA box status, prediction performance is much improved. Although adding the eight regulatory features slightly decreases prediction performance in the high-precision (>90%) region, adding seven of them (all except TF binding evidence and TF knockout evidence) increases AUC. This finding shows that most functionally related proteins are affected by regulatory features.
Adding all these features, which forms an 836-dimensional vector, achieves a prediction AUC of 0.6509, better than that of adding any individual regulatory feature.

The weights when adding regulatory features

The "Adding eight regulatory features" subsection used a fixed number of repeats (ten for each regulatory feature) when adding regulatory features to the original 686-dimensional vector. For machine learning, this number of repeats acts as the weight of the added regulatory features relative to the original 686 features. This subsection further discusses the weights when adding regulatory features to sequence features. The results are shown in Table and Figures 3-10. The best numbers of repeats for DNA bendability, gene distance, gene size, transcription direction, nucleosome occupancy, TATA box status, TF binding evidence and TF knockout evidence are 100, 100, 60, 40, 30, 10, 90 and 70, achieving AUCs of 0.6463, 0.6567, 0.6469, 0.6459, 0.6477, 0.6460, 0.6448 and 0.6465, respectively. Most of the AUCs for the other numbers of repeats are also slightly higher than that of the original 686 features.

The best numbers of repeats for the van Helden, Garten et al., Veerla and Höglund, Kafri et al., Kim et al., Park et al. and Shalgi et al. TFBS similarity scores are 40, 10, 10, 10, 10, 40 and 10. Only adding the van Helden and Park et al. TFBS similarity scores achieves better prediction performance than the original feature vector. The conclusion about adding TFBS information therefore does not depend on the number of repeats.

To sum up, adding any of the eight regulatory features introduced in this work improves the overall prediction performance for functionally related proteins. TF binding evidence and TF knockout evidence, the two regulatory features that do not improve prediction AUC with ten repeats, improve prediction AUC after the number of repeats is adjusted.
The analytical results in this section indicate that regulatory features have different best numbers of repeats and therefore different weights relative to sequence features.

Conclusions

This work is the first to discuss the use of regulatory features for predicting functionally related proteins. Experimental results show that adding gene distance and TATA box status improved accuracy when predicting functionally related proteins, and indicate that regulatory features influence functional relations across the gap from genes to their protein products.

Acknowledgements

The authors would like to thank the Ministry of Science and Technology of the Republic of China, Taiwan, for financially supporting this research under Contract No. NSC 102-2221-E-006-085-MY2.

References

1. Ge, H., A.J. Walhout, and M. Vidal, Integrating 'omic' information: a bridge between genomics and systems biology. Trends Genet, 2003. 19(10): p. 551-60.
2. Shoemaker, B.A. and A.R. Panchenko, Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS computational biology, 2007. 3(4): p. e43.
3. Salgado, H., et al., Operons in Escherichia coli: genomic analyses and predictions. Proceedings of the National Academy of Sciences of the United States of America, 2000. 97(12): p. 6652.
4. Strong, M., et al., Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach. Genome Biol, 2003. 4(9): p. R59.
5. Bowers, P., et al., Prolinks: a database of protein functional linkages derived from coevolution. Genome Biology, 2004. 5(5): p. R35.
6. Marcotte, E., et al., Detecting protein function and protein-protein interactions from genome sequences. Science, 1999. 285(5428): p. 751.
7. Enright, A., et al., Protein interaction maps for complete genomes based on gene fusion events. Nature, 1999. 402(6757): p. 86-90.
8. Yanai, I., A. Derti, and C. DeLisi, Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes. Proceedings of the National Academy of Sciences, 2001. 98(14): p. 7940.
9. Marcotte, C. and E. Marcotte, Predicting functional linkages from gene fusions with confidence. Applied Bioinformatics, 2002. 1(2): p. 93-100.
10. Date, S. and E. Marcotte, Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nature Biotechnology, 2003. 21(9): p. 1055-1062.
11. Sun, J., et al., Refined phylogenetic profiles method for predicting protein-protein interactions. Bioinformatics, 2005. 21(16): p. 3409.
12. Enault, F., et al., Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics, 2003. 19(Suppl 1): p. i105.
13. Snitkin, E., et al., Comparative assessment of performance and genome dependence among phylogenetic profiling methods. BMC bioinformatics, 2006. 7(1): p. 420.
14. Ruano-Rubio, V., O. Poch, and J. Thompson, Comparison of eukaryotic phylogenetic profiling approaches using species tree aware methods. BMC bioinformatics, 2009. 10(1): p. 383.
15. Lin, T.W., J.W. Wu, and D.T. Chang, Combining phylogenetic profiling-based and machine learning-based techniques to predict functional related proteins. PLoS One, 2013. 8(9): p. e75940.
16. van Helden, J., Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics, 2004. 20(3): p. 399-406.
17. Garten, Y., S. Kaplan, and Y. Pilpel, Extraction of transcription regulatory signals from genome-wide DNA-protein interaction data. Nucleic Acids Res, 2005. 33(2): p. 605-15.
18. Veerla, S. and M. Hoglund, Analysis of promoter regions of co-expressed genes identified by microarray analysis. BMC Bioinformatics, 2006. 7: p. 384.
19. Kafri, R., A. Bar-Even, and Y. Pilpel, Transcription control reprogramming in genetic backup circuits. Nat Genet, 2005. 37(3): p. 295-9.
20. Kim, R.S., H. Ji, and W.H. Wong, An improved distance measure between the expression profiles linking co-expression and co-regulation in mouse. BMC Bioinformatics, 2006. 7: p. 44.
21. Park, P.J., A.J. Butte, and I.S. Kohane, Comparing expression profiles of genes with similar promoter regions. Bioinformatics, 2002. 18(12): p. 1576-84.
22. Shalgi, R., et al., Global and local architecture of the mammalian microRNA-transcription factor regulatory network. PLoS Comput Biol, 2007. 3(7): p. e131.
23. Brukner, I., et al., Sequence-dependent bending propensity of DNA as revealed by DNase I: parameters for trinucleotides. EMBO J, 1995. 14(8): p. 1812-8.
24. Chang, D.T., et al., YPA: an integrated repository of promoter features in Saccharomyces cerevisiae. Nucleic Acids Res, 2011. 39(Database issue): p. D647-52.
25. Kaplan, N., et al., The DNA-encoded nucleosome organization of a eukaryotic genome. Nature, 2009. 458(7236): p. 362-6.
26. Segal, E. and J. Widom, From DNA sequence to transcriptional behaviour: a quantitative approach. Nat Rev Genet, 2009. 10(7): p. 443-56.
27. Teixeira, M.C., et al., The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Res, 2006. 34(Database issue): p. D446-51.
28. Bhardwaj, N. and H. Lu, Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics, 2005. 21(11): p. 2730-8.
29. Gyenesei, A., et al., Mining co-regulated gene profiles for the detection of functional associations in gene expression data. Bioinformatics, 2007. 23(15): p. 1927-35.
30. Reimand, J., et al., Comprehensive reanalysis of transcription factor knockout expression data in Saccharomyces cerevisiae reveals many new targets. Nucleic Acids Res, 2010. 38(14): p. 4768-77.
31. Yang, T.H. and W.S. Wu, Identifying biologically interpretable transcription factor knockout targets by jointly analyzing the transcription factor knockout microarray and the ChIP-chip data. BMC Syst Biol, 2012. 6: p. 102.
32. Svetnik, V., et al., Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci, 2003. 43(6): p. 1947-58.
33. Witten, I.H. and E. Frank, Data mining: practical machine learning tools and techniques. 2nd ed. Morgan Kaufmann series in data management systems. 2005, Amsterdam; Boston, MA: Morgan Kaufmann. xxxi, 525.
34. Yu, C., L. Chou, and D. Chang, Predicting protein-protein interactions in unbalanced data using the primary structure of proteins. BMC bioinformatics, 2010. 11(1): p. 167.
35. Artin, E., The Gamma Function. 1964, New York: Holt, Rinehart and Winston.

Figures

Figure 1 - Prediction performance by adding eight regulatory features for functionally related proteins
The eight added regulatory features are DNA bendability, gene distance, gene size, transcription direction, nucleosome occupancy, TATA box status, TF binding evidence and TF knockout evidence.

Figure 2 - Prediction performance by adding seven TFBS similarity scores for functionally related proteins
The seven added TFBS similarity scores are van Helden [16], Garten et al. [17], Veerla and Höglund [18], Kafri et al. [19], Kim et al. [20], Park et al. [21] and Shalgi et al. [22].
Figure 3- Prediction performance by adding DNA bendability with different number of repeats Figure 4 - Prediction performance by adding gene distance with different number of repeats Figure 5 - Prediction performance by adding gene size with different number of repeats Figure 6 - Prediction performance by adding transcription direction with different number of repeats Figure 7 - Prediction performance by adding nucleosome occupancy with different number of repeats Figure 8 - Prediction performance by adding TATA box status with different number of repeats - 19 - Figure 9 - Prediction performance by adding TF binding evidence with different number of repeats Figure 10 - Prediction performance by adding TF knockout evidence with different number of repeats Figure 11 - Schematic diagram of encoding a protein sequence into a feature vector. Step 1: Transform an amino acid sequence into a group sequence. Step 2: Scan the group sequence and count the triads to an occurrence vector O. - 20 - Tables Table 1 - Amino acid groups adopted in this work Group no. 1 2 3 4 5 6 7 Amino acids Ala, Gly, Val Ile, Leu, Phe, Pro Tyr, Met, Thr, Ser His, Asn, Gln, Trp Arg, Lys Asp, Glu Cys This table follows the Shen et al.’s work. 
Table 2 - AUC of different feature selection methods in different dimensions

                                              Number of dimensions
Feature selection method     50      100     150     200     250     300     350
Chi-square test              0.2255  0.2682  0.2937  0.3007  0.3045  0.3135  0.3104
Distance correlation         0.2860  0.3227  0.3387  0.3378  0.3360  0.3396  0.3350
Kendall’s tau correlation    0.2729  0.3176  0.3287  0.3339  0.3357  0.3388  0.3323
MIC                          0.3124  0.3132  0.3212  0.3167  0.3217  0.3239  0.3234
Pearson correlation          0.2860  0.3227  0.3387  0.3378  0.3360  0.3396  0.3350
Random forest                0.2411  0.2833  0.3008  0.3174  0.3220  0.3180  0.3138
Spearman’s correlation       0.2866  0.3190  0.3281  0.3373  0.3387  0.3374  0.3313
Variance                     0.3149  0.3197  0.3192  0.3292  0.3361  0.3270  0.3208
PCA                          0.3121  0.3314  0.3234  0.3237  0.3267  0.3152  0.3131
Laplacian score              0.2798  0.3310  0.3284  0.3333  0.3334  0.3321  0.3310
Spectral feature selection   0.2328  0.3022  0.3157  0.3267  0.3306  0.3277  0.3177
Average                      0.2773  0.3120  0.3215  0.3281  0.3279  0.3285  0.3240

Table 3 - The prediction performance of hybrid feature selection

Hybrid feature selection               Area under curve (AUC)
Sequence¹                              0.3132
Spectral (150) + Distance (150)        0.3429
Spectral (90) + Kendall’s tau (210)    0.3372
Spectral (120) + Spearman’s (180)      0.3359
Laplacian (120) + Distance (180)       0.3507
Laplacian (120) + Kendall’s (180)      0.3496
Laplacian (120) + Spearman’s (180)     0.3486

¹ The original 686-dimensional feature vector proposed by Lin et al. [15], which is the baseline. All other features are added to this feature.

Table 4 - The prediction performance by adding eight regulatory features for functionally related proteins

Feature                    Area under curve (AUC)
Hybrid¹                    0.3507
DNA bendability            0.3402
Gene distance              0.3413
Gene size                  0.3517
Nucleosome occupancy       0.3492
TATA box                   0.3436
TF binding                 0.3501
TF knockout                0.3413
TFBS similarity
  van Helden               0.3403
  Garten et al.            0.3413
  Veerla and Höglund       0.3400
  Kafri et al.             0.3400
  Kim et al.               0.3401
  Park et al.              0.3412
  Shalgi et al.            0.3399
All features               0.3567

¹ Hybrid indicates the combination of Laplacian score and distance correlation, which achieved the best performance among the hybrid combinations. For example, DNA bendability indicates the original feature vector with the information of DNA bendability added. Prediction performance that is better than this baseline is marked in bold.

Table - The area under curve by adding each regulatory feature with different numbers of repeats

                                                      Number of repeats
Feature                    10      20      30      40      50      60      70      80      90      100
DNA bendability            0.6451  0.6428  0.6463  0.6441  0.6446  0.6451  0.6455  0.6459  0.6459  0.6463
Gene distance              0.6477  0.6511  0.6527  0.6538  0.6545  0.6553  0.6557  0.6560  0.6564  0.6567
Gene size                  0.6441  0.6451  0.6457  0.6459  0.6466  0.6469  0.6448  0.6453  0.6455  0.6458
Transcription direction    0.6448  0.6451  0.6425  0.6459  0.6436  0.6440  0.6445  0.6449  0.6451  0.6479
Nucleosome occupancy       0.6451  0.6464  0.6477  0.6456  0.6459  0.6488  0.6462  0.6463  0.6467  0.6469
TATA box                   0.6460  0.6451  0.6441  0.6440  0.6439  0.6439  0.6439  0.6439  0.6439  0.6439
TF binding                 0.6421  0.6425  0.6448  0.6447  0.6448  0.6447  0.6448  0.6421  0.6448  0.6421
TF knockout                0.6428  0.6464  0.6431  0.6432  0.6432  0.6436  0.6465  0.6438  0.6439  0.6443
TFBS similarity
  van Helden               0.6446  0.6447  0.6450  0.6450  0.6448  0.6421  0.6423  0.6425  0.6426  0.6417
  Garten et al.            0.6402  0.6389  0.6382  0.6376  0.6373  0.6368  0.6365  0.6363  0.6363  0.6369
  Veerla and Höglund       0.6427  0.6414  0.6403  0.6370  0.6364  0.6388  0.6386  0.6385  0.6383  0.6390
  Kafri et al.             0.6426  0.6411  0.6398  0.6394  0.6389  0.6381  0.6380  0.6378  0.6376  0.6386
  Kim et al.               0.6427  0.6412  0.6400  0.6397  0.6391  0.6383  0.6381  0.6379  0.6377  0.6387
  Park et al.              0.6443  0.6413  0.6447  0.6449  0.6419  0.6420  0.6420  0.6420  0.6420  0.6420
  Shalgi et al.            0.6426  0.6411  0.6398  0.6394  0.6390  0.6382  0.6380  0.6379  0.6377  0.6387

Prediction performance is marked in bold when it is better than the original 686-dimensional feature vector and is the best number of repeats for the
corresponding regulatory feature.
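How a regulatory feature might enter the feature vector with a given number of repeats can be illustrated with a short Python sketch. It assumes, since the tables do not spell this out, that "repeats" means appending the same feature value several times to the 686-dimensional baseline vector. The function name and the example value are hypothetical.

```python
def add_repeated_feature(base_vector, feature_value, repeats):
    """Append `repeats` copies of one regulatory feature value to a baseline vector."""
    if repeats < 0:
        raise ValueError("repeats must be non-negative")
    # Build a new list so the caller's baseline vector is left unchanged.
    return list(base_vector) + [feature_value] * repeats


baseline = [0.0] * 686  # placeholder for a 686-dimensional sequence-based vector
extended = add_repeated_feature(baseline, 1250.0, repeats=100)  # hypothetical gene distance
print(len(extended))  # 786
```

Under this reading, repeating a value could give it more weight in a classifier that samples a random subset of features at each split, as a random forest does, because the duplicated columns are more likely to be drawn.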