Predicting functionally related proteins based on
regulatory features
Shih-Feng Wang, Tzu-Wen Lin, Shao-Ting Jang and Darby Tien-Hao Chang§
Department of Electrical Engineering, National Cheng Kung University, Tainan,
Taiwan
§
Corresponding author
Email addresses:
SFW: [email protected]
TWL: [email protected]
STJ: [email protected]
DTHC: [email protected]
Abstract
Background
Protein functions are essential to many biological processes. Elucidating these protein
functions and linking functionally related proteins improves our understanding of the
mechanisms of biological systems at the molecular level. Currently, many features
intrinsic to proteins (e.g., protein sequences, structures, and functions) have been
studied to predict functionally related proteins. However, no study has analyzed the
regulatory features (e.g., transcription factors that regulate the gene of a protein) of
two interacting proteins. This study determines whether regulatory features still affect
functional relationships across the gap between genes and their protein products, and
builds a regulatory feature-based prediction model for functionally related proteins.
Results
This work comprehensively analyzed regulatory features. Eight transcriptional
characteristics were examined (DNA bendability, gene size, gene distance,
transcription direction, nucleosome occupancy, TATA box status, transcription factor
(TF) binding evidence, and TF knockout evidence), together with transcription factor
binding site (TFBS) similarity. Experimental results show that adding gene distance and TATA box status
improved accuracy when predicting functionally related proteins, and indicate that
regulatory features influence functional relations even across the gap from genes to their
protein products. For Saccharomyces cerevisiae, the proposed prediction method is
more accurate than previous methods.
Conclusions
This work is the first to assess the effectiveness of using regulatory features to predict
functionally related proteins. The proposed encoding method for regulatory
characteristics can determine whether two proteins are functionally related.
Background
Many protein functions are essential to biological processes in living cells.
Elucidating these protein functions and linking their functionally related proteins
improves our understanding of the mechanisms of biological systems at the molecular
level [1]. As the number of sequenced genomes is increasing, conducting biological
experiments to identify all protein pairs that are functionally related is impractical in
terms of both time and cost. Thus, new computational methods are needed.
Various computational methods have been applied to predict functional linkages
between proteins based on the observation that functionally related proteins have
some co-occurrence patterns. Shoemaker and Panchenko have reviewed
these methods [2]. Gene neighbor and gene clustering methods infer functional
linkages based on the observation that genes whose protein products interact are
usually clustered within a transcriptional unit, an operon, in a genome [3-5]. The
Rosetta Stone method, conversely, is based on the finding that certain interacting
proteins in one organism have homologues in another organism that are fused into a single
protein chain, a Rosetta Stone protein [6-9]. The gene neighbor, gene cluster, and
Rosetta Stone methods share the disadvantage that only a very limited number of functional
linkages have such specific co-occurrence patterns. Thus, recent co-occurrence-based
methods have shifted to phylogenetic profiling (PP)-based methods, which
identify a more general co-occurrence pattern. The basic assumption of PP-based
methods is that the co-presence and co-absence of proteins across organisms, the co-evolved pattern, result from functional linkages between proteins [10-14]. These PP-based methods, which perform adequately, have been applied mainly to prokaryotes.
A previous study proposed a two-stage framework that integrated machine learning
(ML) with a PP-based method to overcome the limitations of PP [15]. ML techniques
have been widely used by many studies to predict protein relations, and several
techniques have been developed to capture important features of protein pairs. In this
work, eight regulatory characteristics are added to the two-stage predictor to increase
prediction accuracy, and hybrid feature selection techniques are used to
strengthen their effect. For Saccharomyces cerevisiae, the prediction by the
proposed method is more accurate than that by previous methods. Additionally, the
prediction result indicates that regulatory characteristics affect functional linkages
between proteins.
Methods
Data collection
This work retrieved 6,717 gene sequences of S. cerevisiae proteins from the
Saccharomyces Genome Database (SGD) release of February
3, 2011. A protein-protein interaction dataset containing 198,376 pairs was
collected from the Biological General Repository for Interaction Datasets (BioGRID),
version 3.1.89.
The eight regulatory features came from different databases. The gene size and
distance data were calculated from the gene sequences in SGD. The DNA
bendability, nucleosome occupancy, TATA box status and TFBS similarity came
from the Yeast Promoter Atlas (YPA) database [24]. The transcription factor binding
evidence and transcription factor knockout evidence came from the YEASTRACT
database [27].
In addition to an evaluation organism, the first stage of the proposed framework
requires a reference collection to construct phylogenetic profiles. This work used the
132 eukaryotes in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database
to compile a eukaryotic reference collection. In these reference organisms, only the
gene and protein sequences were required. Since the functional linkage information of
the reference organisms was not used, these organisms were not training data for
machine learning.
Phylogenetic profile
The PP-based methods are based on the observation that genes with similar
phylogenetic profiles tend to exist in the same protein complex, biochemical pathway,
or sub-cellular compartment. Here, the phylogenetic profile of a gene is a vector,
representing the presence or absence of homologues to that gene across the reference
collection. PP-based methods have two issues: (i) how to construct a phylogenetic
profile of a given gene; and (ii) how to determine the similarity of two phylogenetic
profiles.
First, the presence or absence of homologues can be determined by sequence
alignment scores, such as a BLAST E-value. A protein is considered “present” in an
organism when the sequence alignment score between the protein and at least one
protein in the organism exceeds a threshold. Such binary vectors can be improved by
using real-valued vectors of normalized alignment scores, which avoids choosing an arbitrary score
threshold. A real-valued phylogenetic profile is adopted in this work. Suppose a
collection of n reference organisms is used to build the phylogenetic profile of a query
gene. The first step is to compare the open reading frame (ORF) of the query gene to
all ORFs of the n reference organisms using BLAST. The best bit score between the query
gene a and all ORFs of a reference organism b is used to measure the presence of a in
b, called the “S-value of gene a and organism b” (Sab). Because non-homologous genes
can align with each other by chance, the S-value
is set to zero when it is lower than 50. Because the bit score depends on the
sequence of a, the S-value is further normalized as an R-value by the following
equation:
\[ R_{ab} = \frac{S_{ab}}{S_{aa}}, \]
where Saa is the score obtained by aligning a to itself. The n-dimensional vector of R-values obtained by BLASTing a gene against n reference organisms represents the
phylogenetic profile of that gene. In addition, the non-zero R-values of all genes of the
query organism to a reference organism are normalized by dividing by the average score.
This procedure prevents the similarity of the phylogenetic profiles of two genes from being
dominated by a few large R-values resulting from phylogenetically close organisms.
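As an illustration only, the R-value computation described above might be sketched in Python as follows. The function names, the list-based data layout, and the reading of "the average score" as the mean of the non-zero R-values for one reference organism are assumptions made for this sketch, not details stated in the text.

```python
def r_values(best_bit_scores, self_score, threshold=50.0):
    """Convert the best bit scores of one gene against n organisms into R-values.

    best_bit_scores: best BLAST bit score per reference organism (S-values).
    self_score: bit score of the gene aligned to itself (S_aa).
    """
    profile = []
    for s in best_bit_scores:
        if s < threshold:      # S-values lower than 50 are set to zero
            s = 0.0
        profile.append(s / self_score)
    return profile


def normalize_by_organism(profiles):
    """Divide each organism column by the mean of its non-zero R-values.

    Assumption: the 'average score' is taken over the non-zero entries only.
    """
    normalized = [list(p) for p in profiles]
    for k in range(len(profiles[0])):
        nonzero = [p[k] for p in profiles if p[k] > 0]
        if not nonzero:
            continue           # no gene has a homologue in this organism
        mean = sum(nonzero) / len(nonzero)
        for row in normalized:
            row[k] = row[k] / mean
    return normalized


# Hypothetical bit scores for two genes against three reference organisms.
gene_a = r_values([100.0, 40.0, 200.0], self_score=200.0)
gene_b = r_values([300.0, 60.0, 0.0], self_score=300.0)
profiles = normalize_by_organism([gene_a, gene_b])
```

In this toy input the score of 40 falls below the threshold, so gene_a is treated as absent from the second organism.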
Second, any similarity/distance function, such as cosine similarity or Euclidean
distance [33], of vectors can be used to define the similarity of phylogenetic profiles.
Enault et al. examined four widely used distance functions and concluded that the
inner product is a good indicator [12]. In this work, similarity between two genes, i
and j, is defined as follows:
\[ \mathrm{Sim}(i, j) = \frac{\sum_{k=1}^{n} R_{ik} R_{jk}}{\left[ \left( \sum_{k=1}^{n} R_{ik}^{2} \right) \left( \sum_{k=1}^{n} R_{jk}^{2} \right) \right]^{1/2}}. \]
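A minimal Python sketch of this similarity measure, which is the cosine of the angle between two R-value vectors, is given below. The function name and the convention of returning zero for an all-zero profile are assumptions of the sketch.

```python
import math


def profile_similarity(r_i, r_j):
    """Normalized inner product (cosine similarity) of two phylogenetic profiles."""
    dot = sum(a * b for a, b in zip(r_i, r_j))
    norm = math.sqrt(sum(a * a for a in r_i) * sum(b * b for b in r_j))
    # Assumption: a gene with an all-zero profile is given similarity 0.
    return dot / norm if norm > 0 else 0.0


same = profile_similarity([1.0, 0.0, 2.0], [1.0, 0.0, 2.0])
disjoint = profile_similarity([1.0, 0.0], [0.0, 1.0])
partial = profile_similarity([1.0, 1.0], [1.0, 0.0])
```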
Feature encoding
The second stage of this work retrieves the 20,000 gene pairs with the highest similarity from
the first stage, encodes each gene pair into a feature vector, and then invokes a classifier to
perform the prediction. This subsection describes the feature encoding process while
the next subsection elucidates the classification algorithm.
The feature set can be divided into two parts. The first part considers the
conjoint triads observed in a protein sequence. A conjoint triad regards three
consecutive residues as a unit. Each gene pair is then encoded by concatenating the
two feature vectors of the two individual genes. However, to consider all 20^3 (8,000) conjoint
triads, one must use a 16,000-dimensional feature vector to encode a gene pair, which
exceeds the size limit of contemporary classifiers. Thus, Shen et al. clustered the 20
amino acid types into seven groups based on dipole strength and side chain volume,
thereby reducing the dimensions of the feature vector. Table 1 lists these seven
amino acid groups.
Figure 11 shows the process of encoding a protein sequence. First, the protein
sequence is transformed into a sequence of amino acid groups. Then the triads are
scanned along the sequence of amino acid groups. Each scanned triad is counted in an
occurrence vector, O. Each element oi in O represents the number of occurrences of the i-th type of
triad in the sequence of amino acid groups. Accordingly, each protein
sequence is represented as a 343-dimensional (7^3) occurrence vector. For a protein pair,
the vectors of the two protein sequences are concatenated to form a 686-dimensional
feature vector.
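The encoding can be sketched in Python as follows, using the seven amino acid groups of Table 1 in one-letter code. The ordering of the 343 triad types within the vector and the choice to skip residues outside the 20 standard amino acids are assumptions of this sketch; the text does not specify either.

```python
# Amino acid groups from Table 1, written with one-letter codes.
GROUPS = {1: "AGV", 2: "ILFP", 3: "YMTS", 4: "HNQW", 5: "RK", 6: "DE", 7: "C"}
AA_TO_GROUP = {aa: group for group, members in GROUPS.items() for aa in members}


def triad_vector(sequence):
    """Count conjoint triads of amino acid groups in a protein sequence."""
    # Step 1: transform the amino acid sequence into a group sequence.
    # Assumption: non-standard residues (e.g. X) are skipped.
    groups = [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]
    # Step 2: scan the group sequence and count each triad.
    counts = [0] * (7 ** 3)
    for a, b, c in zip(groups, groups[1:], groups[2:]):
        index = (a - 1) * 49 + (b - 1) * 7 + (c - 1)  # assumed triad ordering
        counts[index] += 1
    return counts


def pair_vector(seq_a, seq_b):
    """Concatenate the two 343-dimensional vectors of a protein pair."""
    return triad_vector(seq_a) + triad_vector(seq_b)


example = triad_vector("AGVC")        # group sequence 1, 1, 1, 7
pair = pair_vector("AGVC", "RKDE")
```

A sequence of length L contributes L - 2 triads, so sequences shorter than three residues yield an all-zero vector.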
The second feature set considers different regulatory features, which are discussed in
subsections of the “Adding eight regulatory features” section. The encoding processes
of these features are described in the corresponding subsections.
Feature selection
Feature selection is used to strengthen the effect of adding the eight regulatory features and to remove
redundant dimensions from the 686-dimensional vector of the two-stage predictor. There are two
types of feature selection: filters and wrappers.
In this work, filters were chosen and further divided into two parts: a supervised part
and an unsupervised part. Because each part may be biased, a two-stage
feature selection is designed to combine their results.
First, for the supervised part, seven well-known feature selection methods (the
chi-square test, Pearson correlation, distance correlation, Kendall’s tau correlation,
Spearman’s rank correlation, random forest, and the maximal information coefficient
(MIC)) are used to rank the importance of each of the 686 dimensions of the two-stage
predictor. Each method is applied independently to the training data, generating
seven different rankings. For the unsupervised part, four methods are used:
principal component analysis (PCA), Laplacian score, variance and spectral feature
selection. The last three methods rank the importance of each dimension, whereas
PCA merges the original dimensions into new
dimensions. This part takes the 6,717 gene sequences as input data. The 343 dimensions
of each gene sequence are either selected or merged, and the results for the two genes are then
combined to form the new dimensions of the gene pair.
Second, the best number of dimensions must be determined. Because
686 dimensions is much greater than the 12 dimensions of the eight regulatory features,
the search starts at about half of that, 350 dimensions, and then reduces the number
by 50 dimensions each time. Once the best number of dimensions is known,
the best three supervised methods and the best two unsupervised
methods are chosen. One
supervised method is then combined with one unsupervised method in different proportions to obtain the
best performance.
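The following Python sketch illustrates the idea of combining one supervised and one unsupervised filter in a chosen proportion. It uses Pearson correlation and variance, two of the methods named above, as simple stand-ins. How overlapping selections are resolved is not stated in the text; skipping duplicates when filling from the unsupervised ranking is an assumption, as are all function names.

```python
def variance_scores(rows):
    """Unsupervised filter: population variance of each feature column."""
    n = len(rows)
    scores = []
    for column in zip(*rows):
        mean = sum(column) / n
        scores.append(sum((x - mean) ** 2 for x in column) / n)
    return scores


def pearson_scores(rows, labels):
    """Supervised filter: absolute Pearson correlation of each column with the label."""
    n = len(rows)
    mean_y = sum(labels) / n
    ss_y = sum((y - mean_y) ** 2 for y in labels)
    scores = []
    for column in zip(*rows):
        mean_x = sum(column) / n
        ss_x = sum((x - mean_x) ** 2 for x in column)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(column, labels))
        denom = (ss_x * ss_y) ** 0.5
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    return scores


def rank(scores):
    """Feature indices ordered from most to least important."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def combine(supervised_rank, unsupervised_rank, n_supervised, n_unsupervised):
    """Take the top supervised features, then fill with unsupervised ones."""
    chosen = list(supervised_rank[:n_supervised])
    for index in unsupervised_rank:
        if len(chosen) >= n_supervised + n_unsupervised:
            break
        if index not in chosen:   # assumption: duplicates are skipped
            chosen.append(index)
    return sorted(chosen)


# Hypothetical toy data: four samples, three features, binary labels.
rows = [[0, 5, 1], [1, 5, 3], [0, 5, 9], [1, 5, 2]]
labels = [0, 1, 0, 1]
selected = combine(rank(pearson_scores(rows, labels)),
                   rank(variance_scores(rows)), 1, 1)
```

With real data, a call such as combine(sup_rank, unsup_rank, 180, 120) would correspond to a 300-dimensional selection split 180/120 between the two filters.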
TFBS similarity scores
This work uses seven TFBS similarity scores to encode a protein pair. First, van
Helden used the Poisson distribution to compute the probability of a common
transcription factor that binds to the sequence between two genes [16]. Second,
Garten et al. used the cumulative hyper-geometric test to estimate the significance of
the overlap of two TF sets [17]. Third, Veerla and Höglund used the Jaccard index to
determine the regulatory similarity of two genes [18]. Fourth, Kafri et al. proposed a
formula to compute the ratio of the union to the intersection of TFs to determine
regulatory similarity [19]. Fifth, Kim et al. used the number of common TFs in two
sequences to determine regulatory similarity [20]. Sixth, Park et al. considered the
proportion of TFBSs in common and introduced a penalty term for TFBSs appearing
in only one gene’s promoter [21]. Seventh, Shalgi et al. proposed a formula that
replaced the denominator of Jaccard index with the smaller number of TFs that
regulate either gene [22].
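Three of the simpler set-based scores can be sketched in Python directly from the descriptions above: the Jaccard index [18], the number of common TFs [20], and the variant whose denominator is the smaller TF set [22]. The function names and the example TF sets are hypothetical, and the exact formulas in the cited papers may differ in detail.

```python
def jaccard(tfs_a, tfs_b):
    """Jaccard index: shared TFs divided by all TFs regulating either gene."""
    union = tfs_a | tfs_b
    return len(tfs_a & tfs_b) / len(union) if union else 0.0


def common_count(tfs_a, tfs_b):
    """Number of TFs shared by the two genes."""
    return len(tfs_a & tfs_b)


def overlap_over_smaller(tfs_a, tfs_b):
    """Shared TFs divided by the size of the smaller TF set."""
    smaller = min(len(tfs_a), len(tfs_b))
    return len(tfs_a & tfs_b) / smaller if smaller else 0.0


# Hypothetical TF sets for two genes, for illustration only.
gene_a_tfs = {"GAL4", "MIG1", "RAP1"}
gene_b_tfs = {"GAL4", "RAP1", "ABF1", "REB1"}

scores = (jaccard(gene_a_tfs, gene_b_tfs),
          common_count(gene_a_tfs, gene_b_tfs),
          overlap_over_smaller(gene_a_tfs, gene_b_tfs))
```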
Results
Feature selection
The eleven methods (seven supervised, four unsupervised) are carried out to determine
the best number of dimensions. The AUC of each method at each number of dimensions is
shown in Table 2. According to Table 2, 300 dimensions achieve the highest
AUC for most methods and for the average over all methods. Therefore, 300
dimensions is taken as the best number of dimensions. Further, three supervised methods
(distance correlation, Kendall’s tau correlation and Spearman’s rank correlation)
and two unsupervised methods (Laplacian score and spectral feature
selection) are selected and paired with each other to construct six
combinations, each using the proportion that achieves the highest AUC. The AUCs of the 686
dimensions, the eleven feature selection methods and the six combinations are
shown in Table 3. Among the combinations in Table 3, the combination of
Laplacian score (120 dimensions) and distance correlation (180 dimensions) has the
highest AUC and improves performance compared with the full 686 dimensions.
Adding eight regulatory features
This work added eight regulatory characteristics (DNA bendability, gene
distance, gene size, transcription direction, nucleosome occupancy, TATA box status,
transcription factor (TF) binding evidence and TF knockout evidence) as well as transcription
factor binding site (TFBS) similarity. Figure 1 shows the prediction performance of individually
adding each of the eight regulatory features; Figure 2 shows the prediction performance
of TFBS, which includes seven TFBS similarity scores; Table 4 summarizes
these two figures.
Adding DNA bendability
The DNA bendability data used in this work are collected from the YPA database [24],
which is based on the DNase I experiments conducted by Brukner et al. [23]. In this
work, each gene has a corresponding DNA bendability score, which is the average of
bendability at each position on the gene. In this analysis, the two DNA bendability
scores of a protein pair are repeated five times and appended to the original vector of
686 features to form a new one of 696 features.
Adding gene distance
In this analysis, if two genes are on the same chromosome, their gene distance is the
shortest distance from one gene to the other. If two genes are on the same
chromosome and overlap, their distance is zero. If two genes are not on the same
chromosome, their distance is -1. The gene distance of a protein pair is repeated ten
times and appended to the original vector of 686 features to form a new one of 696
features.
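The three cases of the gene distance rule can be sketched in Python as follows. The coordinate convention (start not greater than end, both on a shared chromosomal axis) and the definition of the shortest distance as the gap between the nearest ends of the two genes are assumptions of this sketch.

```python
def gene_distance(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """Distance between two genes following the three cases in the text.

    Assumes start <= end for both genes, in base-pair coordinates.
    """
    if chrom_a != chrom_b:
        return -1                         # different chromosomes
    gap = max(start_a, start_b) - min(end_a, end_b)
    return max(gap, 0)                    # overlapping genes have distance zero


# Hypothetical coordinates, for illustration only.
apart = gene_distance("IV", 100, 200, "IV", 300, 400)
overlapping = gene_distance("IV", 100, 250, "IV", 200, 400)
elsewhere = gene_distance("IV", 100, 200, "XII", 300, 400)
```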
Adding gene size
From the YPA database, this work collects the positions of the start codon and stop
codon of each gene in the yeast genome. Gene size is the distance between the start
codon and the stop codon. In this analysis, the two gene sizes of a protein pair are
repeated five times and appended to the original vector of 686 features to form a new
one of 696 features.
Adding nucleosome occupancy
The nucleosome occupancy data used in this work are collected from the YPA
database, which is based on two models proposed in 2009 [25, 26]. In this work, each
gene has a corresponding nucleosome occupancy score, which is the average of
nucleosome occupancy at each position on the gene. For every gene pair, the average and
difference of the nucleosome occupancy scores of the two genes are added as features. In this
analysis, the two nucleosome occupancy scores of a protein pair are repeated five
times and appended to the original vector of 300 features to form a new one of 310
features.
Adding TATA box status
This study collects TATA box status from the YPA database, which states whether a
TATA box exists in a gene’s promoter. If a TATA box exists in a gene’s promoter,
the TATA box status of the gene is 1. If no TATA box exists in a gene’s promoter, the
TATA box status of the gene is 0. In this analysis, the two TATA box statuses of a
protein pair are repeated five times and appended to the original vector of 300 features
to form a new one of 310 features.
Adding transcription factor binding evidence
This work uses TF binding evidence, which indicates whether a TF binds a gene’s
promoter based on ChIP experiments. The TF binding score of a protein pair a and b is
\[ \frac{|TF_a \cap TF_b|}{|TF_a \cup TF_b|}, \]
where TFa and TFb are the sets of TFs that bind genes a and b, respectively. In this
analysis, the TF binding score of a protein pair is repeated ten times and appended to
the original vector of 300 features to form a new one of 310 features.
Adding transcription factor knockout evidence
This work uses TF knockout evidence, which indicates whether the expression of a gene
changes significantly after knocking out a TF. The TF knockout score of a protein
pair is defined in the same way as the TF binding score, except that TFa and TFb are the sets of
TFs whose knockout results in a significant expression change of genes a and b,
respectively. In this analysis, the TF knockout score is repeated ten times and
appended to the original vector of 300 features to form a new one of 310 features.
Adding TFBS similarity
This work uses TFBS data, which includes 422,576 TFBS locations for 164 TFs, from
the YPA database. For each protein pair, seven TFBS similarity scores are calculated:
van Helden [16], Garten et al. [17], Veerla and Höglund [18], Kafri et al. [19], Kim et
al. [20], Park et al. [21] and Shalgi et al. [22]. Table 4 and Figure 2 show the
prediction performance after adding these seven TFBS similarity scores. In this
analysis, each TFBS similarity score is repeated ten times and appended to the
original vector of 686 features to form a new one of 696 features, individually.
Adding the van Helden similarity score slightly decreases prediction accuracy (>90%
for the top 2,631 predictions), but the prediction AUC (0.6447) is slightly higher than that
of the original vector. Adding the Garten et al. TFBS similarity score decreases both
prediction accuracy (>90% for the top 2,488 predictions) and AUC (0.6403). Adding
the Veerla and Höglund TFBS similarity score decreases both prediction accuracy
(>90% for the top 2,523 predictions) and AUC (0.6428). Adding the Kafri et al. TFBS
similarity score decreases both prediction accuracy (>90% for the top 2,525
predictions) and AUC (0.6427). Adding the Kim et al. TFBS similarity score
decreases both prediction accuracy (>90% for the top 2,525 predictions) and AUC
(0.6428). Adding the Park et al. TFBS similarity score slightly decreases prediction
accuracy (>90% for the top 2,614 predictions), but the prediction AUC (0.6444) is
slightly higher than that of the original vector. Adding the Shalgi et al. TFBS
similarity score decreases both prediction accuracy (>90% for the top 2,522
predictions) and AUC (0.6427).
As a result, adding TFBS information based on the van Helden and Park et al.
scores improves prediction AUC. This analytical result indicates not only that
TFBS may improve the identification of functional linkages between proteins but also
that the encoding process of regulatory features is important. Adding all seven TFBS
similarity scores, which forms a 693-dimensional vector, achieves a prediction AUC
of 0.6650, better than that of adding any individual TFBS similarity score.
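The accuracy figures quoted in this subsection take the form ">90% for the top N predictions". One plausible reading, which is an assumption rather than a definition given in the text, is that N is the largest number of top-ranked predictions for which the fraction of true functional linkages stays above 90%. Under that reading, N could be computed as in this Python sketch (function name and toy data are hypothetical):

```python
def top_n_above_precision(scores, labels, min_precision=0.9):
    """Largest N such that precision among the N top-scored pairs exceeds the threshold.

    scores: classifier scores, higher meaning more likely related.
    labels: 1 for a known functional linkage, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_n = 0
    hits = 0
    for n, (_, label) in enumerate(ranked, start=1):
        hits += label
        if hits / n > min_precision:
            best_n = n
    return best_n


# Hypothetical scores and labels for six predicted pairs.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
n_high_precision = top_n_above_precision(toy_scores, toy_labels)
```

In the toy data the first two predictions are correct and the third is not, so precision drops to 2/3 at N = 3 and never recovers above 90%.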
Summary
After adding gene distance and TATA box status, prediction performance is much
improved. Although adding the eight regulatory features slightly decreases prediction
performance in the high-precision (>90%) region, adding seven of them (except TF
binding evidence and TF knockout evidence) increases AUC. This finding suggests that
most functionally related proteins are affected by regulatory features. Adding all these
features, which forms an 836-dimensional vector, achieves a prediction AUC of
0.6509, better than that of adding any individual regulatory feature.
The weights when adding regulatory features
The “Adding eight regulatory features” subsection used a fixed number of repeats (ten
times for each regulatory feature) when adding regulatory features to the original
686-dimensional vector. For machine learning, this number of repeats acts as the weight
of the added regulatory features relative to the original 686 features. This subsection
further discusses the weights when adding regulatory features to sequence features.
The results are shown in Table and Figures 3-10.
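The repeat-and-append scheme can be sketched in Python as follows. The function name is hypothetical, and the order in which repeated values are laid out in the extended vector is an assumption, since the text only gives the resulting dimensions.

```python
def add_weighted_features(base_vector, regulatory_values, repeats):
    """Append the regulatory values to the base vector, repeated 'repeats' times.

    More repeats give the regulatory feature more dimensions, and hence more
    weight, relative to the sequence-based features.
    """
    extended = list(base_vector)
    for _ in range(repeats):
        extended.extend(regulatory_values)   # assumed layout: one copy per repeat
    return extended


sequence_features = [0.0] * 686

# Two per-gene values (e.g. two bendability scores) repeated five times.
with_two_values = add_weighted_features(sequence_features, [0.31, 0.27], 5)
# One per-pair value (e.g. a gene distance) repeated ten times.
with_one_value = add_weighted_features(sequence_features, [1200.0], 10)
```

Both calls yield 696 dimensions, matching the vector sizes reported for the individual features.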
The best numbers of repeats for DNA bendability, gene distance, gene size,
transcription direction, nucleosome occupancy, TATA box status, TF binding
evidence and TF knockout evidence are 100, 100, 60, 40, 30, 10, 90 and 70, achieving
AUCs of 0.6463, 0.6567, 0.6469, 0.6459, 0.6477, 0.6460, 0.6448 and 0.6465,
respectively. Most of the AUCs obtained with other numbers of repeats are also slightly higher
than that of the original 686 features.
The best numbers of repeats for the van Helden, Garten et al., Veerla and Höglund,
Kafri et al., Kim et al., Park et al. and Shalgi et al. TFBS similarity scores are 40, 10,
10, 10, 10, 40 and 10. Only adding the van Helden and Park et al. TFBS similarity
scores achieves better prediction performance than the original feature vector.
Thus, the conclusion about adding TFBS information does not depend on
the number of repeats.
To sum up, adding any of the eight regulatory features introduced in this work
improves the overall prediction performance for functionally related proteins. TF
binding evidence and TF knockout evidence, the two regulatory features that do not
improve prediction AUC with ten repeats, improve prediction AUC after adjusting the
number of repeats. The analytical results in this section indicate that regulatory
features have different best numbers of repeats and therefore different weights
relative to sequence features.
Conclusions
This work is the first to discuss the use of regulatory features for predicting
functionally related proteins. Experimental results show that adding gene distance and
TATA box status improved accuracy when predicting functionally related proteins,
and indicate that regulatory features influence functional relations even across the gap
from genes to their protein products.
Acknowledgements
The authors would like to thank the Ministry of Science and Technology of the
Republic of China, Taiwan, for financially supporting this research under Contract
No. NSC 102-2221-E-006-085-MY2.
References
1. Ge, H., A.J. Walhout, and M. Vidal, Integrating 'omic' information: a bridge between genomics and systems biology. Trends Genet, 2003. 19(10): p. 551-60.
2. Shoemaker, B.A. and A.R. Panchenko, Deciphering protein–protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Computational Biology, 2007. 3(4): p. e43.
3. Salgado, H., et al., Operons in Escherichia coli: genomic analyses and predictions. Proceedings of the National Academy of Sciences of the United States of America, 2000. 97(12): p. 6652.
4. Strong, M., et al., Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach. Genome Biol, 2003. 4(9): p. R59.
5. Bowers, P., et al., Prolinks: a database of protein functional linkages derived from coevolution. Genome Biology, 2004. 5(5): p. R35.
6. Marcotte, E., et al., Detecting protein function and protein-protein interactions from genome sequences. Science, 1999. 285(5428): p. 751.
7. Enright, A., et al., Protein interaction maps for complete genomes based on gene fusion events. Nature, 1999. 402(6757): p. 86-90.
8. Yanai, I., A. Derti, and C. DeLisi, Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes. Proceedings of the National Academy of Sciences, 2001. 98(14): p. 7940.
9. Marcotte, C. and E. Marcotte, Predicting functional linkages from gene fusions with confidence. Applied Bioinformatics, 2002. 1(2): p. 93-100.
10. Date, S. and E. Marcotte, Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nature Biotechnology, 2003. 21(9): p. 1055-1062.
11. Sun, J., et al., Refined phylogenetic profiles method for predicting protein-protein interactions. Bioinformatics, 2005. 21(16): p. 3409.
12. Enault, F., et al., Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics, 2003. 19(Suppl 1): p. i105.
13. Snitkin, E., et al., Comparative assessment of performance and genome dependence among phylogenetic profiling methods. BMC Bioinformatics, 2006. 7(1): p. 420.
14. Ruano-Rubio, V., O. Poch, and J. Thompson, Comparison of eukaryotic phylogenetic profiling approaches using species tree aware methods. BMC Bioinformatics, 2009. 10(1): p. 383.
15. Lin, T.W., J.W. Wu, and D.T. Chang, Combining phylogenetic profiling-based and machine learning-based techniques to predict functional related proteins. PLoS One, 2013. 8(9): p. e75940.
16. van Helden, J., Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics, 2004. 20(3): p. 399-406.
17. Garten, Y., S. Kaplan, and Y. Pilpel, Extraction of transcription regulatory signals from genome-wide DNA-protein interaction data. Nucleic Acids Res, 2005. 33(2): p. 605-15.
18. Veerla, S. and M. Hoglund, Analysis of promoter regions of co-expressed genes identified by microarray analysis. BMC Bioinformatics, 2006. 7: p. 384.
19. Kafri, R., A. Bar-Even, and Y. Pilpel, Transcription control reprogramming in genetic backup circuits. Nat Genet, 2005. 37(3): p. 295-9.
20. Kim, R.S., H. Ji, and W.H. Wong, An improved distance measure between the expression profiles linking co-expression and co-regulation in mouse. BMC Bioinformatics, 2006. 7: p. 44.
21. Park, P.J., A.J. Butte, and I.S. Kohane, Comparing expression profiles of genes with similar promoter regions. Bioinformatics, 2002. 18(12): p. 1576-84.
22. Shalgi, R., et al., Global and local architecture of the mammalian microRNA-transcription factor regulatory network. PLoS Comput Biol, 2007. 3(7): p. e131.
23. Brukner, I., et al., Sequence-dependent bending propensity of DNA as revealed by DNase I: parameters for trinucleotides. EMBO J, 1995. 14(8): p. 1812-8.
24. Chang, D.T., et al., YPA: an integrated repository of promoter features in Saccharomyces cerevisiae. Nucleic Acids Res, 2011. 39(Database issue): p. D647-52.
25. Kaplan, N., et al., The DNA-encoded nucleosome organization of a eukaryotic genome. Nature, 2009. 458(7236): p. 362-6.
26. Segal, E. and J. Widom, From DNA sequence to transcriptional behaviour: a quantitative approach. Nat Rev Genet, 2009. 10(7): p. 443-56.
27. Teixeira, M.C., et al., The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Res, 2006. 34(Database issue): p. D446-51.
28. Bhardwaj, N. and H. Lu, Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics, 2005. 21(11): p. 2730-8.
29. Gyenesei, A., et al., Mining co-regulated gene profiles for the detection of functional associations in gene expression data. Bioinformatics, 2007. 23(15): p. 1927-35.
30. Reimand, J., et al., Comprehensive reanalysis of transcription factor knockout expression data in Saccharomyces cerevisiae reveals many new targets. Nucleic Acids Res, 2010. 38(14): p. 4768-77.
31. Yang, T.H. and W.S. Wu, Identifying biologically interpretable transcription factor knockout targets by jointly analyzing the transcription factor knockout microarray and the ChIP-chip data. BMC Syst Biol, 2012. 6: p. 102.
32. Svetnik, V., et al., Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci, 2003. 43(6): p. 1947-58.
33. Witten, I.H. and E. Frank, Data mining: practical machine learning tools and techniques. 2nd ed. Morgan Kaufmann series in data management systems. 2005, Amsterdam; Boston, MA: Morgan Kaufman. xxxi, 525.
34. Yu, C., L. Chou, and D. Chang, Predicting protein-protein interactions in unbalanced data using the primary structure of proteins. BMC Bioinformatics, 2010. 11(1): p. 167.
35. Artin, E., The Gamma Function. 1964, New York: Holt, Rinehart and Winston.
Figures
Figure 1 - Prediction performance by adding eight regulatory features for functionally related proteins
The eight added regulatory features are DNA bendability, gene distance, gene size, transcription
direction, nucleosome occupancy, TATA box status, TF binding evidence and TF knockout evidence.
Figure 2 - Prediction performance by adding seven TFBS similarity scores for functionally related proteins
The seven added TFBS similarity scores are van Helden [16], Garten et al. [17], Veerla and Höglund
[18], Kafri et al. [19], Kim et al. [20], Park et al. [21] and Shalgi et al. [22].
Figure 3 - Prediction performance by adding DNA bendability with different numbers of repeats
Figure 4 - Prediction performance by adding gene distance with different numbers of repeats
Figure 5 - Prediction performance by adding gene size with different numbers of repeats
Figure 6 - Prediction performance by adding transcription direction with different numbers of repeats
Figure 7 - Prediction performance by adding nucleosome occupancy with different numbers of repeats
Figure 8 - Prediction performance by adding TATA box status with different numbers of repeats
Figure 9 - Prediction performance by adding TF binding evidence with different numbers of repeats
Figure 10 - Prediction performance by adding TF knockout evidence with different numbers of repeats
Figure 11 - Schematic diagram of encoding a protein sequence into a feature vector
Step 1: Transform an amino acid sequence into a group sequence. Step 2: Scan the
group sequence and count the triads into an occurrence vector O.
Tables
Table 1 - Amino acid groups adopted in this work
Group no.   Amino acids
1           Ala, Gly, Val
2           Ile, Leu, Phe, Pro
3           Tyr, Met, Thr, Ser
4           His, Asn, Gln, Trp
5           Arg, Lys
6           Asp, Glu
7           Cys
This table follows the work of Shen et al.
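The two encoding steps described for Figure 11 can be sketched with the groups of Table 1. This is a minimal illustration rather than the authors' implementation: the names `GROUPS`, `encode` and `encode_pair` are hypothetical, residues outside the 20 standard amino acids are simply skipped (an assumption), and the concatenation of two 343-dimensional vectors into a 686-dimensional pair vector is one plausible reading of the 686-dimensional baseline, not something stated in this section.

```python
from itertools import product

# Amino acid groups from Table 1, written with one-letter codes.
GROUPS = {
    1: "AGV",   # Ala, Gly, Val
    2: "ILFP",  # Ile, Leu, Phe, Pro
    3: "YMTS",  # Tyr, Met, Thr, Ser
    4: "HNQW",  # His, Asn, Gln, Trp
    5: "RK",    # Arg, Lys
    6: "DE",    # Asp, Glu
    7: "C",     # Cys
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}

# Fixed ordering of the 7^3 = 343 possible triads of group numbers.
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(1, 8), repeat=3))}


def encode(sequence):
    """Return the 343-dimensional triad occurrence vector O of a sequence."""
    # Step 1: amino acid sequence -> group sequence.
    # Assumption: unknown residues (e.g. 'X') are dropped.
    groups = [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]
    # Step 2: slide a window of three groups and count each triad.
    occurrence = [0] * len(TRIAD_INDEX)
    for i in range(len(groups) - 2):
        occurrence[TRIAD_INDEX[tuple(groups[i:i + 3])]] += 1
    return occurrence


def encode_pair(seq_a, seq_b):
    """Concatenate two occurrence vectors (assumed 2 x 343 = 686 layout)."""
    return encode(seq_a) + encode(seq_b)
```

For example, `encode("ARDC")` maps the sequence to the group sequence 1, 5, 6, 7 and counts the two triads (1, 5, 6) and (5, 6, 7) once each.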
Table 2 - AUC of different feature selection methods in different dimensions
Feature selection method      Number of dimensions
                              50      100     150     200     250     300     350
Chi-square test               0.2255  0.2682  0.2937  0.3007  0.3045  0.3135  0.3104
Distance correlation          0.2860  0.3227  0.3387  0.3378  0.3360  0.3396  0.3350
Kendall’s tau correlation     0.2729  0.3176  0.3287  0.3339  0.3357  0.3388  0.3323
MIC                           0.3124  0.3132  0.3212  0.3167  0.3217  0.3239  0.3234
Pearson correlation           0.2860  0.3227  0.3387  0.3378  0.3360  0.3396  0.3350
Random forest                 0.2411  0.2833  0.3008  0.3174  0.3220  0.3180  0.3138
Spearman’s correlation        0.2866  0.3190  0.3281  0.3373  0.3387  0.3374  0.3313
Variance                      0.3149  0.3197  0.3192  0.3292  0.3361  0.3270  0.3208
PCA                           0.3121  0.3314  0.3234  0.3237  0.3267  0.3152  0.3131
Laplacian score               0.2798  0.3310  0.3284  0.3333  0.3334  0.3321  0.3310
Spectral feature selection    0.2328  0.3022  0.3157  0.3267  0.3306  0.3277  0.3177
Average                       0.2773  0.3120  0.3215  0.3281  0.3279  0.3285  0.3240
Table 3 - The prediction performance of hybrid feature selection
Hybrid feature selection               Area under curve (AUC)
Sequence¹                              0.3132
Spectral (150) + Distance (150)        0.3429
Spectral (90) + Kendall’s tau (210)    0.3372
Spectral (120) + Spearman’s (180)      0.3359
Laplacian (120) + Distance (180)       0.3507
Laplacian (120) + Kendall’s (180)      0.3496
Laplacian (120) + Spearman’s (180)     0.3486
¹ The original 686 dimensional feature vector proposed by Lin et al. [15], which is the baseline. All
other features are added to this feature.
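The idea of combining two feature rankings, as in the "Laplacian (120) + Distance (180)" rows of Table 3, can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this work: variance and absolute Pearson correlation (both listed in Table 2) stand in for Laplacian score and distance correlation to keep the example short, the function names are hypothetical, and filling the second quota only with features not already chosen is an assumption about how overlaps are handled.

```python
import math
from statistics import mean, pvariance


def variance_scores(X):
    """Unsupervised score: population variance of each feature column."""
    return [pvariance([row[j] for row in X]) for j in range(len(X[0]))]


def abs_pearson_scores(X, y):
    """Supervised score: |Pearson correlation| of each column with labels y."""
    my = mean(y)
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx = mean(col)
        sx = math.sqrt(sum((v - mx) ** 2 for v in col))
        if sx == 0 or sy == 0:
            scores.append(0.0)  # constant column or constant labels
        else:
            cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
            scores.append(abs(cov / (sx * sy)))
    return scores


def ranked(scores):
    """Feature indices sorted from highest to lowest score (stable)."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def hybrid_select(scores_a, k_a, scores_b, k_b):
    """Take the top k_a features by scores_a, then the top k_b remaining
    features by scores_b (assumed overlap handling)."""
    chosen = ranked(scores_a)[:k_a]
    taken = set(chosen)
    for j in ranked(scores_b):
        if len(chosen) == k_a + k_b:
            break
        if j not in taken:
            chosen.append(j)
            taken.add(j)
    return chosen
```

With this sketch, a call such as `hybrid_select(variance_scores(X), 120, abs_pearson_scores(X, y), 180)` would return 300 distinct column indices of a feature matrix `X` with at least 300 columns.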
Table 4 - The prediction performance by adding eight regulatory features for functionally related proteins
Feature                   Area under curve (AUC)
Hybrid¹                   0.3507
DNA bendability           0.3402
Gene distance             0.3413
Gene size                 0.3517
Nucleosome occupancy      0.3492
TATA box                  0.3436
TF binding                0.3501
TF knockout               0.3413
TFBS similarity
  van Helden              0.3403
  Garten et al.           0.3413
  Veerla and Höglund      0.3400
  Kafri et al.            0.3400
  Kim et al.              0.3401
  Park et al.             0.3412
  Shalgi et al.           0.3399
All features              0.3567
¹ Hybrid denotes the combination of Laplacian score and distance correlation, which achieved the best
performance among the hybrid feature selections. Each other row denotes the original feature vector
with the named information added; for example, DNA bendability indicates the original feature vector
with the DNA bendability information added. Prediction performance that is better than this baseline
is marked in bold.
Table 5 - The area under curve by adding regulatory features with different numbers of repeats
Feature                   Number of repeats
                          10      20      30      40      50      60      70      80      90      100
DNA bendability           0.6451  0.6428  0.6463  0.6441  0.6446  0.6451  0.6455  0.6459  0.6459  0.6463
Gene distance             0.6477  0.6511  0.6527  0.6538  0.6545  0.6553  0.6557  0.6560  0.6564  0.6567
Gene size                 0.6441  0.6451  0.6457  0.6459  0.6466  0.6469  0.6448  0.6453  0.6455  0.6458
Transcription direction   0.6448  0.6451  0.6425  0.6459  0.6436  0.6440  0.6445  0.6449  0.6451  0.6479
Nucleosome occupancy      0.6451  0.6464  0.6477  0.6456  0.6459  0.6488  0.6462  0.6463  0.6467  0.6469
TATA box                  0.6460  0.6451  0.6441  0.6440  0.6439  0.6439  0.6439  0.6439  0.6439  0.6439
TF binding                0.6421  0.6425  0.6448  0.6447  0.6448  0.6447  0.6448  0.6421  0.6448  0.6421
TF knockout               0.6428  0.6464  0.6431  0.6432  0.6432  0.6436  0.6465  0.6438  0.6439  0.6443
TFBS similarity
  van Helden              0.6446  0.6447  0.6450  0.6450  0.6448  0.6421  0.6423  0.6425  0.6426  0.6417
  Garten et al.           0.6402  0.6389  0.6382  0.6376  0.6373  0.6368  0.6365  0.6363  0.6363  0.6369
  Veerla and Höglund      0.6427  0.6414  0.6403  0.6370  0.6364  0.6388  0.6386  0.6385  0.6383  0.6390
  Kafri et al.            0.6426  0.6411  0.6398  0.6394  0.6389  0.6381  0.6380  0.6378  0.6376  0.6386
  Kim et al.              0.6427  0.6412  0.6400  0.6397  0.6391  0.6383  0.6381  0.6379  0.6377  0.6387
  Park et al.             0.6443  0.6413  0.6447  0.6449  0.6419  0.6420  0.6420  0.6420  0.6420  0.6420
  Shalgi et al.           0.6426  0.6411  0.6398  0.6394  0.6390  0.6382  0.6380  0.6379  0.6377  0.6387
Prediction performance is marked in bold when it is better than that of the original 686 dimensional
feature vector and is obtained with the best number of repeats for the corresponding regulatory feature.