Learning Phenotype Specific Gene Network by
Knowledge Driven Matrix Factorization
Xuerui Yang¹, Yang Zhou², Zheng Li¹, Shireesh Srivastava¹,
Rong Jin², and Christina Chan¹,²
¹ Chemical Engineering and Materials Science Department
² Computer Science and Engineering Department
Michigan State University, East Lansing, MI 48824 USA
{yangxuer, zhouyang, lizheng1, srivas14, rongjin, krischan}@msu.edu
Abstract. A popular method for reconstructing gene networks from
microarray data is Bayesian structure learning. However, most Bayesian
structure learning algorithms suffer from three major shortcomings:
high computational cost, inefficiency in exploiting qualitative knowledge, and inability to reconstruct phenotype specific gene networks. We
address these three shortcomings by presenting a new framework, which
first identifies the genes relevant to the given phenotype using a mixture
regression model, and then reconstructs the network for the selected
genes with a Knowledge driven Matrix Factorization (KMF) algorithm.
We applied the proposed framework to gene expression and phenotypic
data and identified highly enriched gene clusters with distinct cellular
functions and processes together with the interactions between the clusters. Most of the interactions predicted by KMF were indeed biologically
relevant. In summary, we have developed a framework that can correctly
reconstruct the gene network that is specific to a given phenotype.
Key words: microarray data, Knowledge driven Matrix Factorization,
phenotype specific genes, network reconstruction, numerical optimization
1 Introduction
Cellular processes are regulated by genes and their products within a complex
network. These networks are usually organized in modules such as the pathways
in the citric acid (TCA) cycle of a metabolic network [1]. Disease states may
ensue when biological functions are abnormally regulated, for example, cancer
arises from abnormal regulation of apoptosis. Identifying the gene modules and
their interactions may provide an understanding on how a biological function or
process is regulated and may help provide insights into the disease mechanism.
With the advent of high throughput technologies one can obtain a comprehensive gene expression profile for a cellular or tissue state. Uncovering the gene
modules and module network from the micro-array data could provide insight
into the underlying mechanisms involved. A number of clustering methods, such
as hierarchical clustering, K-means and self-organizing map [2], have been applied to identify gene modules. The main disadvantage of clustering methods
is that they are unable to identify the interaction among different modules,
which is crucial to the understanding of disease mechanisms. This problem is
addressed by several studies that integrate clustering methods with structure
learning. In [3], a clustering method is combined with the Graphical Gaussian
Model (GGM) for module network reconstruction. In [4], the authors presented
a Bayesian framework that incorporates the clustering method into Bayesian
network learning. Despite the limited success, these approaches are mainly data
driven and therefore could be sensitive to the noise within the expression data. In
addition, as revealed by several studies [5, 6], structure learning methods tend to
suffer from the sparse data problem when the number of experimental conditions
is limited. Finally, these approaches are unable to construct phenotype-specific
gene networks, which is important to our study.
The aim of this work is to develop a phenotype specific gene network. This is
particularly important when the biological system is comprised of a large number
of genes, and only a subset of genes are relevant to the target phenotype. To
this end, we divide the process into two phases. In the first phase, we select the
subset of genes that are relevant to the target biological process using a mixture
of regression models. We refer to the first phase as “gene selection phase”. In
the second phase, we apply the proposed matrix factorization algorithm to the
selected genes to reconstruct the gene network. We refer to the second phase as
“network reconstruction phase”.
The goal of the gene selection phase is to efficiently identify the genes that
regulate the desired metabolic/phenotypic response of the cells and identify
better targets to regulate cellular processes. This problem is related to feature selection, which is an important problem that has been extensively studied in machine learning. The example algorithms for feature selection include
Wilcoxon’s rank sum test [7], Fisher’s Discriminant Analysis (FDA) [8], partial
least squares (PLS) [9] or genetic algorithm (GA)-based [10] classification and
clustering [11], minimum redundancy and maximum relevance (mRMR) [12], the
approaches based on Support Vector Machine (SVM) [13] and LASSO regression
(LASSO) [14], kernel Fisher discriminant analysis (KFDA) [15], and multi-layer perceptrons. Since the above approaches are data driven, they strongly depend on
the quality of the microarray data. Furthermore, they are unable to incorporate
the vast amount of information already available on the functions of the genes.
To circumvent these issues, alternative analysis methods are being developed
which incorporate prior information of the genes [16]. In these knowledge-based
methods, the association of gene ontology (GO) categories to the target biological process is evaluated, and the relevant GO categories with high association
are used to identify the relevant genes. The problem with these approaches is
that they are unable to integrate both the prior knowledge and gene expression
data into a unified framework. In this work, these problems will be addressed
by a Bayesian mixture regression model that incorporates the prior knowledge
of gene functions.
The second phase of this work aims to reconstruct the gene network using
the selected genes. Previous studies have recognized the importance of exploiting prior knowledge for network reconstruction when expression data are sparse
and noisy [17–23]. Often, a Bayesian prior is constructed for the directed edges
of the gene network to reflect the known regulator-regulatee relationships that
are derived from data such as protein interaction data. Unfortunately, it is very
difficult to extend this approach for the co-regulation relationships that can be
easily derived from the GO database [24]. Exploiting GO database for network
reconstruction is especially important for mammalian systems, where interaction
data are not as readily available as GO information. One aim of this work is to
develop a framework of knowledge-driven analysis using high-throughput data
that effectively exploits the prior knowledge of co-regulation relationships that
may be obtained from GO. The key challenge with using GO for network reconstruction is that the co-regulation relationships derived from GO may be noisy
and inaccurate. We address this problem by developing a knowledge driven matrix factorization, or KMF, algorithm. The key features of the proposed matrix
factorization framework are (1) it derives both the gene modules and their interaction, (2) it incorporates the noisy prior knowledge into network reconstruction
via a regularization scheme, and (3) it presents an efficient learning algorithm
based on non-negative matrix factorization and semi-definite programming. We
emphasize that the difference between our work and the previous work on matrix
factorization methods for gene clustering is that the proposed framework is able
to derive both the gene modules and their interaction simultaneously.
2 Materials and Methods
In this section, we describe in detail the proposed framework for phenotype
specific gene network reconstruction. It consists of two phases, i.e., the gene
selection phase and the network reconstruction phase.
2.1 Microarray Gene Expression Data
The gene expression data were obtained for HepG2 cells exposed to free fatty
acids (FFAs) and tumor necrosis factor (TNF-α) [34]. Gene expression data were
obtained for 15 different conditions. For each condition, 2 microarray replicates
were obtained with a color swap for each replicate. The data consisted of 19458
genes. The analysis of variance (ANOVA) was applied to the entire list of genes
with P < 0.01 to compare the effect of treatment (e.g. FFA or TNF-α) and to
determine whether a treatment had a significant effect. The expression levels of
830 genes were found to be significant due to either TNF-α or FFA [35]. The
analysis was performed using Matlab 6.3.
2.2 Gene Selection by Knowledge-driven Bayesian Mixture Regression Model
Here, we present a method for identifying a subset of genes that are relevant to
a target phenotype. The main idea is to integrate the ontology information of
the genes with their expression data (X) to perform unsupervised classification
and to identify the genes that regulate the cellular responses (Y ). For a given
cellular response, regression models are constructed to approximate the cellular
response by the linear combination of gene expression data. The genes with the
largest absolute weights are deemed to be important to the target phenotype
and therefore will be selected. The phenotype of interest in the study is cytotoxicity. As extensively discussed in the statistics literature, a simple regression
model tends to suffer from the sparse data problem and is also sensitive to the
noise within the expression data. To overcome these problems, we incorporate
the prior knowledge of GO into the regression model via a Bayesian prior. A
central assumption behind this method is that the genes within a GO category
would have similar function or effect on a cellular process. Thus, genes belonging
to the same GO category were constrained to have similar regression weights.
The second problem with using the standard linear regression model for gene
selection is that the genes with the largest regression weights may not be specific
to the target cellular process since they may also be important to a number of
biological processes other than the target process. We address this problem by extending the linear regression model into a mixture of regression models. The main
idea behind the mixture models is to cluster the experimental conditions into
two subgroups: the subgroup of conditions that are related to the target phenotype, i.e. cytotoxicity, and the subgroup of conditions that are not. A different
regression model is built for each subgroup of conditions, and the genes with
the largest difference in their regression weights between these two subgroups of
conditions are deemed to be specific to the target phenotype. It is important
to note that the clustering of the experimental conditions is based on the regression weights of the genes. Since the regression weights of each subgroup of
experimental conditions depend on the clustering results, we are left with a circular cause and consequence problem. We resolve this dilemma by exploring the
Expectation Maximization (EM) algorithm. A full description of the Bayesian
mixture regression model can be found in our technical report [28]. We omit
the mathematical details due to space limitation. Figure 1 shows the genes that
are ordered by the absolute difference in their regression coefficients (denoted
by DRC) between two regression models. We observe that the DRC values
decrease exponentially for the top ranked genes, followed by a slow linear reduction. To decide the subset of genes with large DRC values, we extract only
the genes that are related to the exponential component of the DRC curve. This
is done by identifying the point of the DRC curve where the second order derivative becomes zero or negative. As a result, 250 genes are selected by this
process.
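The constraint that genes in the same GO category receive similar regression weights can be illustrated with a much simpler stand-in than the Bayesian mixture model, whose details the text defers to [28]. The sketch below is not the authors' model: it assumes a single linear regression with a Gaussian-style coupling prior, which reduces to Laplacian-regularized least squares. The function name, the `groups` encoding of GO categories, the penalty weight `lam`, and all data are hypothetical.

```python
import numpy as np

def go_regularized_weights(X, y, groups, lam=1.0, ridge=1e-6):
    """Least squares plus a penalty that pulls the weights of genes
    sharing a (hypothetical) GO category toward each other."""
    groups = np.asarray(groups)
    # S[i, j] = 1 when genes i and j share a category, 0 otherwise.
    S = (groups[:, None] == groups[None, :]).astype(float)
    np.fill_diagonal(S, 0.0)
    L = np.diag(S.sum(axis=1)) - S              # graph Laplacian of S
    # Closed-form minimizer of ||y - Xw||^2 + lam * w'Lw + ridge * ||w||^2.
    A = X.T @ X + lam * L + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 6))      # 15 conditions x 6 genes (synthetic)
y = rng.normal(size=15)           # synthetic phenotype values
groups = [0, 0, 0, 1, 1, 1]       # two made-up GO categories

w_free = go_regularized_weights(X, y, groups, lam=0.0)
w_tied = go_regularized_weights(X, y, groups, lam=1e6)
```

With a very large `lam`, the weights inside each category become nearly identical, which is the limiting case of the similarity constraint described above. The mixture over two subgroups of conditions and the EM procedure are deliberately left out of this sketch.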
Fig. 1. Sorted DRC for all the genes (x-axis: gene number; y-axis: abs(DRC)).
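The cut-off rule described for the DRC curve (keep the genes that precede the point where the second order derivative becomes zero or negative) can be sketched with finite differences. This is a minimal illustration under assumptions: the function name, the tolerance `tol`, and the toy values are invented here, and real DRC values would likely need smoothing before differencing.

```python
import numpy as np

def count_before_curvature_vanishes(drc, tol=1e-9):
    """Sort |DRC| in descending order and return how many top-ranked genes
    precede the first point whose discrete second derivative is <= tol."""
    y = np.sort(np.abs(np.asarray(drc, dtype=float)))[::-1]
    d2 = np.diff(y, n=2)              # d2[j] approximates curvature at point j + 1
    flat = np.flatnonzero(d2 <= tol)
    return len(y) if flat.size == 0 else int(flat[0]) + 1

# Made-up values: a convex (fast-decaying) head followed by a linear tail.
drc = [10.0, 6.0, 3.5, 2.0, 1.5, 1.0, 0.5, 0.0]
n_selected = count_before_curvature_vanishes(drc)
```

In this toy curve the second differences are positive for the first three interior points and zero afterwards, so the first four values are treated as the fast-decaying head.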
2.3 The Knowledge driven matrix factorization (KMF)
We denote the gene expression data by $X = (x_1, x_2, \ldots, x_n)$ where $n$ is the
number of genes, and each $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,m}) \in \mathbb{R}^m$ is the expression levels of the $i$th gene measured under $m$ conditions. We can compute the pairwise
correlation between any two genes using a number of statistical correlation metrics like Pearson correlation, mutual information and chi-square statistics. This
computation results in a symmetric matrix $W = [w_{i,j}]_{n \times n}$ where $w_{i,j}$ measures
the correlation between gene $x_i$ and $x_j$. This estimated correlation matrix $W$
provides us valuable information about the structure of the gene network since a
high correlation $w_{i,j}$ between two genes $x_i$ and $x_j$ could suggest that: 1) genes $x_i$
and $x_j$ belong to the same module, or 2) gene $x_i$ regulates the expression levels
of gene $x_j$ or vice versa. To derive these two types of interactions simultaneously,
we factorize $W$ as follows:
$$W \approx M C M^\top.$$
In the above equation, $M$ is a matrix of size $n \times r$ and $C$ is a matrix of size
$r \times r$, where $r \ll n$ is the number of modules that can be determined empirically
as we will discuss later. Matrix $M = [m_{i,j}]_{n \times r}$ represents the memberships
of the $n$ genes in $r$ modules, and every $m_{i,j} \ge 0$ indicates the confidence of
assigning the $i$th gene to the $j$th module. Matrix $C = [c_{i,j}]_{r \times r}$ represents the
relationships among $r$ modules, and each $c_{i,j} \ge 0$ indicates the confidence for the
two gene modules to interact (regulate) with each other. Note that in this study,
we focus on the undirected network since the gene module regulation matrix $C$
is symmetric.
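To make the shapes in this factorization concrete, the following sketch builds a Pearson correlation matrix from synthetic expression data and forms a product of the same shape from arbitrary non-negative factors. The sizes and random data are made up; the factors are not fitted, so the product is not expected to approximate W here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 8, 15, 3                  # genes, conditions, modules (toy sizes)
X = rng.normal(size=(n, m))         # row i holds the expression profile of gene i

W = np.corrcoef(X)                  # Pearson correlation between rows, shape (n, n)

M = rng.random((n, r))              # non-negative membership matrix
C = rng.random((r, r))
C = (C + C.T) / 2                   # symmetric module-interaction matrix
np.fill_diagonal(C, 1.0)            # unit diagonal, as in the constraints on C
W_hat = M @ C @ M.T                 # low-rank model with the same shape as W
```

One point the sketch makes visible: Pearson correlations can be negative, whereas a product of non-negative factors cannot. Whether W is transformed (for example by taking absolute values) before factorization is not stated in the text, so no such step is assumed here.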
To determine the appropriate factorization of matrix $W$, we first define a
loss function $l_d(W, M C M^\top)$ that measures the difference between $W$ and the
factorized matrices $M$ and $C$ as follows:
$$l_d(W, M C M^\top) = \|W - M C M^\top\|_F^2 = \sum_{i,j=1}^{n} \left( W_{i,j} - [M C M^\top]_{i,j} \right)^2$$
Second, we regularize the solution of $M$ using the prior knowledge from GO
information. We encode the information within GO by a similarity matrix $S$,
where $S_{i,j} \ge 0$ represents the similarity between two genes in their biological
functions. The discussion of gene similarity by GO can be found in [29]. To
ensure the modules to be consistent with the prior knowledge within the GO,
we introduce another loss function $l_m(M, S)$ that measures the inconsistency
between $M$ and $S$ as follows:
$$l_m(M, S) = \sum_{k=1}^{r} m_k^\top L(S)\, m_k = \mathrm{tr}(M^\top L(S) M)$$
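The regularizer above can be checked numerically. The sketch below uses the standard definition of the combinatorial Laplacian, L(S) = D − S with D the diagonal matrix of row sums of S, and verifies the well-known identity that the trace form equals half the similarity-weighted sum of squared distances between membership rows. The matrices are random toy data and the function names are chosen for this illustration.

```python
import numpy as np

def laplacian(S):
    """Combinatorial Laplacian L(S) = D - S, with D the diagonal degree matrix."""
    return np.diag(S.sum(axis=1)) - S

def l_m(M, S):
    """Inconsistency between memberships M and similarity S: tr(M' L(S) M)."""
    return float(np.trace(M.T @ laplacian(S) @ M))

rng = np.random.default_rng(1)
S = rng.random((5, 5))
S = (S + S.T) / 2                   # symmetric toy similarity matrix
np.fill_diagonal(S, 0.0)
M = rng.random((5, 2))

# Equivalent pairwise form: 0.5 * sum_ij S_ij * ||M_i - M_j||^2
diff = M[:, None, :] - M[None, :, :]
pairwise = 0.5 * float(np.sum(S * np.sum(diff ** 2, axis=2)))
```

The pairwise form shows why the term acts as a consistency penalty: it is small when genes with high GO similarity have similar membership rows, and it vanishes when all rows of M are identical.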
where $m_k$ is the $k$th column of matrix $M$ and $L(S)$ is the combinatorial Laplacian
of matrix $S$. The definition of the combinatorial Laplacian and its application to
regularizing numerical solutions can be found in [30]. Furthermore, we regularize
the solution for $C$ by the regularizer $l_c(C) = \|C\|_F^2$. This regularizer enforces
sparse regulation among the gene modules, which is consistent with the scale
free structure of the gene module network. By combining the above factors together,
we obtain the following optimization problem:
$$\begin{aligned}
\arg\min_{M \in \mathbb{R}^{n \times r},\; C \in \mathbb{R}^{r \times r}} \quad & l_d(W, Z) + \alpha\, l_m(M, S) + \beta\, l_c(C) \\
\text{s.t.} \quad & C \succeq 0, \quad C_{i,i} = 1, \; i = 1, 2, \ldots, r, \quad C_{i,j} \ge 0, \; i, j = 1, 2, \ldots, r, \\
& M_{i,j} \ge 0, \; i = 1, 2, \ldots, n, \; j = 1, 2, \ldots, r, \quad Z = M C M^\top
\end{aligned}$$
We solve the above optimization problem through alternating optimization,
which alternates between optimizing $M$ with $C$ fixed and optimizing
$C$ with $M$ fixed until the solution converges to a local optimum. We
describe these two processes as follows:
Optimize M by fixing C: The related optimization problem is:
$$\arg\min_{M \in \mathbb{R}^{n \times r}} F_m(M) = \|W - M C M^\top\|_F^2 + \alpha\, \mathrm{tr}(M^\top L(S) M) \quad \text{s.t.} \quad M_{i,j} \ge 0, \; i = 1, 2, \ldots, n, \; j = 1, 2, \ldots, r$$
To find an optimal solution for $M$, we propose an iterative bound optimization algorithm, the detailed derivation of which can be found in our technical
report [31]. The key idea is to upper bound the objective function using the
properties of convex functions, and iteratively update the solution based on the
solution of the previous iteration. The new solution for $M$ in each iteration is computed as:
$$M_{i,k} = \tilde{M}_{i,k} \left[ \frac{2\, c_{i,k}}{b_{i,k} + \sqrt{b_{i,k}^2 + 4\, a_{i,k}\, c_{i,k}}} \right]^{1/2}$$
where
$$a_{i,k} = [\tilde{M} C \tilde{M}^\top \tilde{M} C]_{i,k}, \quad b_{i,k} = \alpha \tilde{M}_{i,k} D_i, \quad c_{i,k} = \alpha [S \tilde{M}]_{i,k} + [W \tilde{M} C]_{i,k}$$
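The update rule for M translates almost directly into array operations. The sketch below follows the formula as printed; it assumes that D_i is the i-th row sum of S (the degree used in the Laplacian), which the text does not state explicitly, and it adds a small constant `eps` to avoid division by zero. The function name and the toy data are hypothetical.

```python
import numpy as np

def update_M(W, S, M, C, alpha, eps=1e-12):
    """One multiplicative update of M with C held fixed."""
    D = S.sum(axis=1)                       # assumption: D_i = sum_j S_ij
    a = M @ C @ M.T @ M @ C                 # a_ik = [M C M' M C]_ik
    b = alpha * M * D[:, None]              # b_ik = alpha * M_ik * D_i
    c = alpha * (S @ M) + W @ M @ C         # c_ik = alpha [S M]_ik + [W M C]_ik
    return M * np.sqrt(2.0 * c / (b + np.sqrt(b ** 2 + 4.0 * a * c) + eps))

rng = np.random.default_rng(2)
n, r = 6, 2
M_true = rng.random((n, r)) + 0.1
C = np.array([[1.0, 0.3], [0.3, 1.0]])
W = M_true @ C @ M_true.T                   # toy W that factorizes exactly
S_zero = np.zeros((n, n))                   # no prior knowledge in this case

# With alpha = 0 and an exact factorization, the update leaves M unchanged.
M_fixed = update_M(W, S_zero, M_true, C, alpha=0.0)

# With a random start and a random similarity matrix, entries stay non-negative.
S = rng.random((n, n))
S = (S + S.T) / 2
M_next = update_M(W, S, rng.random((n, r)), C, alpha=0.5)
```

Because the rule multiplies each entry by a non-negative factor, the constraint that M is non-negative is preserved automatically, which is the usual motivation for multiplicative updates in non-negative matrix factorization.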
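Optimize C by fixing M: The related optimization problem is:
$$\begin{aligned}
\arg\min_{C \in \mathbb{R}^{r \times r}} \quad & \eta + \beta \xi - 2\, \mathrm{tr}(M^\top W M C) \\
\text{s.t.} \quad & C_{i,i} = 1, \; i = 1, 2, \ldots, r, \quad C_{i,j} \ge 0, \; i, j = 1, 2, \ldots, r, \quad C \succeq 0, \\
& B = M^\top M C, \quad \eta \ge \sum_{i,j=1}^{r} B_{i,j} B_{j,i}, \quad \xi \ge \sum_{i,j=1}^{r} C_{i,j}^2
\end{aligned}$$
The above problem can be solved effectively using semi-definite programming
techniques [32].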
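The auxiliary variables $\eta$ and $\xi$ can be related back to the original objective by expanding the Frobenius norm. The expansion below is a reader's derivation rather than a step reproduced from the source; it assumes $W$ and $C$ are symmetric, as stated earlier, and $\beta \ge 0$.

```latex
\begin{aligned}
\|W - M C M^\top\|_F^2 + \beta \|C\|_F^2
 &= \mathrm{tr}(W^\top W) - 2\,\mathrm{tr}(M^\top W M C)
    + \mathrm{tr}(M^\top M C\, M^\top M C) + \beta \sum_{i,j=1}^{r} C_{i,j}^2 \\
 &= \mathrm{const} - 2\,\mathrm{tr}(M^\top W M C)
    + \sum_{i,j=1}^{r} B_{i,j} B_{j,i} + \beta \sum_{i,j=1}^{r} C_{i,j}^2,
    \qquad B = M^\top M C .
\end{aligned}
```

Since $\mathrm{tr}(W^\top W)$ does not depend on $C$, minimizing $\eta + \beta\xi - 2\,\mathrm{tr}(M^\top W M C)$ with $\eta$ and $\xi$ bounded below by the two sums gives the same minimizer: both bounds are tight at the optimum.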
2.4 Tuning the Parameters
According to our experience, there are two key parameters that can significantly
affect the outcome of the proposed algorithm: a) α, i.e., the weight of the
regularization term $l_m(M, S)$, and b) the number of modules (clusters). In this section, we will
present the evaluation metric that is used to measure the accuracy and stability
of the proposed algorithm, followed by the description of the procedures for
automatically determining the parameter α and the number of clusters.
The evaluation metric: If we already know the modules of the genes, namely
the ground truth, we can quantitatively evaluate the performance of KMF over
the ground truth using the Pairwise F-measure (PWF1) defined in the following
equation [25].
$$\text{precision} = \frac{\text{\# of pairs correctly predicted to be in the same module}}{\text{Total \# of pairs predicted to be in the same module}}$$
$$\text{recall} = \frac{\text{\# of pairs correctly predicted to be in the same module}}{\text{Total \# of pairs actually in the same module}}$$
$$\text{PWF1} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
The precision measures the accuracy in identifying the co-regulated genes,
and the recall measures the percentage of co-regulated genes that are correctly
identified. PWF1 combines these two factors by their harmonic mean. Note that
neither the precision nor the recall alone is appropriate for evaluation since a
high precision can be achieved by making almost no prediction and a high recall
can be achieved by predicting all the genes in the same cluster.
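The pairwise F-measure defined above is straightforward to compute from two label vectors. The following is a minimal sketch; the function name and the toy labels are chosen for illustration, and returning 0 when no pair is correctly predicted is a convention adopted here, not one stated in the text.

```python
from itertools import combinations

def pairwise_f1(true_labels, pred_labels):
    """Pairwise F-measure (PWF1) between a reference and a predicted partition."""
    pairs = list(combinations(range(len(true_labels)), 2))
    same_true = {p for p in pairs if true_labels[p[0]] == true_labels[p[1]]}
    same_pred = {p for p in pairs if pred_labels[p[0]] == pred_labels[p[1]]}
    correct = len(same_true & same_pred)
    if correct == 0:
        return 0.0                      # convention for the degenerate case
    precision = correct / len(same_pred)
    recall = correct / len(same_true)
    return 2 * precision * recall / (precision + recall)

# Four genes: the reference puts genes 0-2 together, the prediction splits them.
score = pairwise_f1([0, 0, 0, 1], [0, 0, 1, 1])
```

In this example one of the two predicted pairs is correct (precision 1/2) and one of the three true pairs is recovered (recall 1/3), giving PWF1 = 0.4. The measure depends only on co-membership, so renaming the module labels does not change the score.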
Tuning the parameter α: In the knowledge driven matrix factorization algorithm, parameter α is used to balance the prior knowledge from GO against the
information from the gene expression data. To tune this parameter to achieve
the best result, we apply a supervised learning technique. First, we collect a
number of gene pairs that should belong to the same module based on well-established biological knowledge about the functions of the genes. Second,
we gradually change the parameter within a predefined range (i.e., [0.1, 10] in
our experiment), evaluate the results against the given gene relationships, and
select the parameter value that gives the best performance in terms of PWF1.
Determining the number of clusters: Another important parameter in the
algorithm is the number of modules. Estimating the optimal number of clusters
is a major challenge in the clustering analysis [27]. Some algorithms, like the
gap statistic [26], have been proposed to address this problem. In this work,
we determine the optimal number of clusters using stability analysis based on
the PWF1 measure. The basic assumption of stability analysis is that if the
estimated number of clusters is close to the true number of clusters in the data,
we would expect that clustering runs with different random initialization should
result in more or less similar results. This can be evaluated through the stability
analysis in which a PWF1 measurement is computed between the clustering
results of two runs to reflect the stability.
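The stability analysis can be sketched as follows: run the clustering several times with different random starts for each candidate number of clusters, compute PWF1 between every pair of runs, and prefer the number with the most consistent results. The clustering routine below is a deterministic, hypothetical stand-in for a KMF run (it always recovers the same partition for k = 2 and alternates between two partitions for k = 3); averaging over all pairs of runs is also a choice made for this sketch.

```python
import numpy as np
from itertools import combinations

def co_membership(labels):
    """Boolean vector: for each pair (i < j), are i and j in the same cluster?"""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return same[np.triu_indices(len(labels), k=1)]

def pwf1(a, b):
    pa, pb = co_membership(a), co_membership(b)
    both = np.sum(pa & pb)
    if both == 0:
        return 0.0
    precision, recall = both / pb.sum(), both / pa.sum()
    return float(2 * precision * recall / (precision + recall))

def stability(cluster, k, seeds):
    """Mean PWF1 over all pairs of runs started from different seeds."""
    runs = [cluster(k, seed) for seed in seeds]
    return float(np.mean([pwf1(x, y) for x, y in combinations(runs, 2)]))

def toy_cluster(k, seed):
    """Hypothetical stand-in for one clustering run with a random start."""
    if k == 2:
        base = np.array([0, 0, 0, 1, 1, 1])
        return (base + seed) % 2            # label names vary, partition does not
    return np.roll(np.array([0, 0, 1, 1, 2, 2]), seed)   # partition varies

scores = {k: stability(toy_cluster, k, seeds=range(4)) for k in (2, 3)}
best_k = max(scores, key=scores.get)
```

Here k = 2 yields identical partitions in every run (stability 1), while k = 3 yields partitions that agree only for some pairs of runs, so k = 2 is preferred.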
Furthermore, we apply the above procedure in a hierarchical fashion. More
specifically, in our experiment, the application of the above procedure results in
a split into two major modules. After applying the procedure to the two major
modules, we further split them into 4 and 5 smaller modules, respectively. Hence,
the final number of clusters is 9 in our experiment.
3 Experimental Results and Discussion
3.1 Application of KMF to identify gene clusters in liver cells
In this section, we applied the matrix factorization algorithm, KMF, to gene expression data obtained from liver cells where the objective was to identify the
gene clusters and the interactions between them. In particular, we want to uncover which clusters of genes are involved in palmitate-induced cytotoxicity and
how the clusters interact with each other to produce the toxicity. In our experiment 9
clusters were identified by the KMF from the 250 genes that were selected from
phase 1, the gene selection phase. We found that genes with similar functions
were highly enriched in their own separate clusters/modules. For example, 30 out
of 31 genes in cluster 1 encode the enzymes involved in “lipid metabolism processing”. Cluster 2 has a variety of genes involved in different cellular signaling
activities. These genes encode proteins in G protein-coupled receptor signaling,
ion channel-related signaling pathways, and chemokine/cytokine receptor signaling pathways, rendering the major function of cluster 2 to be “signaling”. 5 of
7 genes in cluster 3 belong to glycolysis, and one of the other 2 genes, phosphogluconate dehydrogenase, is involved in the pentose phosphate pathway (PPP),
which is primarily an anabolic pathway that utilizes the 6 carbons of glucose to
generate 5 carbon sugars and reducing equivalents. Thus, we assigned the function of “glucose metabolism” (glycolysis and PPP) to cluster 3. Most of the genes
in cluster 4 are involved in the “post-translational modification of proteins”, including ubiquitin-proteasome pathway, protein folding, protein transportation,
and phosphorylation or dephosphorylation, while cluster 7 consisted primarily
of “transcription factors and translational initiation factors” that regulate the
synthesis of proteins. 8 of 10 genes of cluster 6 encode enzymes that catalyze
“ATP metabolism”, and similarly, in cluster 8, 16 of 17 genes encode enzymes
involved in “amino acid metabolism and the urea cycle”. Most of the genes in
cluster 9 encode proteins involved in “apoptosis”, including both the intrinsic
and extrinsic apoptosis.
A majority of the genes in cluster 5 are involved in lipid peroxidation, electron
transport chain (ETC), reactive oxygen species (ROS) homeostasis, oxidative
stress responses, and TCA cycle. It is well-established that lipid peroxidation,
ETC, ROS homeostasis and oxidative stress responses are highly connected with
each other in the redox signaling system and therefore regulate some very important cellular activities such as free radical formation, detoxification, immune
reactions, and cell death. The TCA cycle is essential in all oxidative organisms
and provides precursors for anabolic processes and reducing factors (NADH and
FADH2) that drive the generation of energy. It also plays a role in the oxidative defense machinery, namely, alpha-ketoglutarate, one of the products of the
TCA cycle, is a key participant in the detoxification of ROS [36]. Another study
that connected the TCA cycle and redox signaling through alpha-ketoglutarate,
found that three of the enzymes in the TCA cycle were diminished upon oxidative stress [37]. Thus, the TCA cycle plays a crucial role in modulating the
cellular redox environment. Disorder of this system is responsible for the development of atherosclerosis, degenerative diseases such as Parkinson’s disease,
Alzheimer’s disease, and ageing. Therefore, the major function of cluster 5 is
“ROS homeostasis, redox system regulation and TCA cycle”.
In summary, most of the gene-groups could be assigned a particular function/process based upon the list of genes enriched in them. The full list of gene
clusters is available online at http://www.chems.msu.edu/groups/chan/GO_KMF_genecluster.xls.
3.2 Application of KMF to identify the interactions between gene clusters
Next, we examined whether KMF is able to correctly uncover the interaction
among the different clusters of cellular functions. KMF uncovered how these
modules interacted based upon the C matrix (see Table 1), whose coefficients indicated the interactions between the modules, analogous to a correlation matrix.
Rows 1-9 indicate the interaction values between the clusters, and the bottom
row (sum − 1) is the summation of each column minus 1. A higher sum − 1 value
indicates that the cluster is more closely connected to the others, and thereby
takes a more central position in the network. As the molecular currency of intracellular energy transfer, ATP is either produced or consumed by most cellular
activities, such as the metabolism and signaling pathways, respectively. Indeed,
cluster 6 (ATP metabolism) has the highest interaction values among each row,
and the summation of column 6 is also the largest. Meanwhile, cluster 6 has very
high interaction values with clusters 5 and 3, which reflects the facts that glucose metabolism and TCA cycle are the major metabolic pathways that produce
ATP and that ETC and ROS homeostasis are also related to the synthesis and
consumption of ATP, respectively.

Table 1. C matrix of the clusters. Rows 1–9 were filled with the interaction values
between two clusters, and the bottom row (sum − 1) is the summation of each column
minus 1.

Clusters   1      2      3      4      5      6      7      8      9
1          1      0.152  0.234  0.195  0.191  0.275  0.101  0.236  0.176
2          0.152  1      0.177  0.155  0.152  0.214  0.092  0.183  0.140
3          0.234  0.177  1      0.236  0.215  0.305  0.107  0.284  0.209
4          0.195  0.155  0.236  1      0.204  0.295  0.120  0.249  0.188
5          0.191  0.152  0.215  0.204  1      0.302  0.122  0.253  0.186
6          0.275  0.214  0.305  0.295  0.302  1      0.170  0.360  0.267
7          0.101  0.092  0.107  0.120  0.122  0.170  1      0.138  0.108
8          0.236  0.183  0.284  0.249  0.253  0.360  0.138  1      0.227
9          0.176  0.140  0.209  0.188  0.186  0.267  0.108  0.227  1
sum − 1    1.560  1.265  1.767  1.642  1.625  2.188  0.958  1.930  1.501
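The sum − 1 row of Table 1 can be recomputed directly from the interaction values, which also confirms which cluster is most central. The sketch below copies the entries of Table 1 into an array; the variable names are chosen for this illustration.

```python
import numpy as np

# Interaction values between clusters 1-9, copied from Table 1.
C = np.array([
    [1.000, 0.152, 0.234, 0.195, 0.191, 0.275, 0.101, 0.236, 0.176],
    [0.152, 1.000, 0.177, 0.155, 0.152, 0.214, 0.092, 0.183, 0.140],
    [0.234, 0.177, 1.000, 0.236, 0.215, 0.305, 0.107, 0.284, 0.209],
    [0.195, 0.155, 0.236, 1.000, 0.204, 0.295, 0.120, 0.249, 0.188],
    [0.191, 0.152, 0.215, 0.204, 1.000, 0.302, 0.122, 0.253, 0.186],
    [0.275, 0.214, 0.305, 0.295, 0.302, 1.000, 0.170, 0.360, 0.267],
    [0.101, 0.092, 0.107, 0.120, 0.122, 0.170, 1.000, 0.138, 0.108],
    [0.236, 0.183, 0.284, 0.249, 0.253, 0.360, 0.138, 1.000, 0.227],
    [0.176, 0.140, 0.209, 0.188, 0.186, 0.267, 0.108, 0.227, 1.000],
])

centrality = C.sum(axis=0) - 1.0            # column sums minus the unit diagonal
most_central = int(np.argmax(centrality)) + 1    # 1-based cluster index
least_central = int(np.argmin(centrality)) + 1
```

The recomputed values match the bottom row of the table: cluster 6 (ATP metabolism) has the largest value, 2.188, and cluster 7 the smallest, 0.958.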
In glucose metabolism, glycolysis is followed by the TCA cycle, and amino
acid metabolism and the urea cycle are connected to the TCA, which together
produce, as well as use, ATP. Furthermore, most of the amino acids give rise
to a net production of pyruvate or TCA cycle intermediates, such as alpha-ketoglutarate or oxaloacetate, all of which are precursors to glucose via gluconeogenesis. High interaction values between clusters 3 (glucose metabolism),
5 (ROS homeostasis, redox system regulation and TCA cycle), 8 (amino acid
metabolism and the urea cycle) and 6 (ATP metabolism) were identified by
KMF. Therefore, the connections identified by KMF demonstrate that the algorithm was able to capture the interactions involved in ATP generation.
Taken together, KMF is able to identify highly enriched gene clusters with
distinct cellular functions and the interactions between the clusters. We are currently investigating how clusters 5 and 9 interact with each other in order to
uncover potential pathways and mechanisms that may be involved in producing the phenotype specific response of cell death or cytotoxicity in liver cells
exposed to saturated FFAs. For example, inhibiting complex I has been found
to be effective in preventing palmitate induced toxicity [38], suggesting the involvement of energy production pathways in the toxicity. In summary, KMF is
an approach that can be applied to uncover pathways specific to a phenotype,
such as palmitate-induced toxicity, and may be used to elucidate mechanisms
involved in diseases by integrating gene expression and prior knowledge.
Acknowledgements This work was supported in part by the National Science
Foundation (BES 0425821 and IIS-0643494), the National Institutes of Health
(1R01GM079688-01 and 1R21CA126136-01), the MSU Foundation and the Center for Systems Biology.
References
1. Segal, E., Friedman, N., Koller, D., Regev, A.: A Module Map Showing Conditional
Activity of Expression Modules in Cancer. Nature Genetics 36(10) (2004) 1090–1098
2. Eisen, M. B., Spellman, P. T., Brown, P. O., Botstein, D.: Cluster Analysis and
Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U.S.A. 95 (1998)
14863–14868
3. Toh, H., Horimoto, K.: Inference of a genetic network by a combined approach
of cluster analysis and graphical Gaussian modeling. Bioinformatics 18(2) (2002)
287–297
4. Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., Friedman,
N.: Module Networks: Identifying Regulatory Modules and their Condition Specific
Regulators from Gene Expression Data. Nature Genetics 34(2) (2003) 166-176
5. Husmeier, D.: Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics
19(17) (2003) 2271–2282
6. Yu, J., Smith, V. A., Wang, P. P., Hartemink, A. J., Jarvis, E. D.: Advances to
Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20(18) (2004) 3594–3603
7. Troyanskaya, O. G., Garber, M. E., Brown, P. O., et al.: Nonparametric methods for
identifying differentially expressed genes in microarray data. Bioinformatics 18(11)
(2002) 1454–1461
8. Chan, C., Hwang, D., Stephanopoulos, G. N., et al.: Application of Multivariate
Analysis to Optimize Function of Cultured Hepatocytes. Biotechnol. Prog. 19(2)
(2003) 580–598
9. Tan, Y., Shi, L., Tong, W., et al.: Multi-class tumor classification by discriminant
partial least squares using microarray gene expression data and assessment of classification models. Computational Biology and Chemistry 28(3) (2004) 235–243
10. Liu, J. J., Cutler, G., Li, W., et al.: Multiclass cancer classification and biomarker
discovery using GA-based algorithms. Bioinformatics 21(11) (2005) 2691–2697
11. Guo, L., Ma, Y., Ward, R., et al.: Constructing Molecular Classifiers for the Accurate Prognosis of Lung Adenocarcinoma. Clinical Cancer Research 12 (2006) 3344–3354
12. Ding, C., Peng, H.: Minimum redundancy feature selection from microarray gene
expression data. J. Bioinform. Comput. Biol 3 (2005) 185-205
13. Brown, M. P., Grundy, W. N., Lin, D., et al.: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad.
Sci. U.S.A. 97 (2000) 262–267
14. Roth, V.: The generalized LASSO. IEEE Trans. Neural Networks 15(1) (2004)
16–28
15. Cho, J., Lee, D., Park, J., Lee, I.: Gene selection and classification from microarray
data using kernel machine. FEBS Letters 571 (2004) 93–98
16. Le, P. P., Bahl, A., Ungar, L. H.: Using Prior Knowledge to Improve Genetic
Network Reconstruction from Microarray Data. In Silico Biology 4(3) (2004) 335–
353
17. Bar-Joseph, Z., Gerber, G. K., Lee, T. I., et al.: Computational discovery of gene
modules and regulatory networks. Nature Biotechnology 21 (2003) 1337–1342
18. Berman, B. P., Nibu, Y., Pfeiffer, B. D., et al.: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation
in the Drosophila genome. Proceedings of the National Academy of Sciences 99(2)
(2002) 757–762
19. Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., Young, R. A.: Combining location
and expression data for principled discovery of genetic regulatory networks. Pacific
Symposium on Biocomputing (2002) 437–449
20. Ideker, T., Thorsson, V., Ranish, J. A. et al.: Integrated genomic and proteomic
analyses of a systematically perturbed metabolic network. Science 292 (2001) 929–934
21. Ihmel, J., Friedlander, G., Bergmann, S., et al.: Revealing modular organization
in the yeast transcriptional network. Nature Genetics 31 (2002) 370–377
22. Pilpel, Y., Sudarsanam, P., Church, G. M.: Identifying regulatory networks by
combinatorial analysis of promoter elements. Nature Genetics 29 (2001) 151–159
23. Li, F., Yang, Y.: Recovering Genetic Regulatory Networks from Micro-Array Data
and Location Analysis Data. Genome Informatics 15(2) (2004) 131–140
24. Heckerman, D.: A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research (1996)
25. Liu, Y., Jin, R.: BoostCluster: Boosting Clustering by Pairwise Constraints. In
KDD’07: The 13th International Conference on Knowledge Discovery and Data Mining (2007) 450–459
26. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a
dataset via the gap statistic. Technical Report 208, Dept. of Statistics, Stanford
University (2000)
27. Milligan, G. W., Cooper, M. C.: An examination of procedures for determining the
number of clusters in a data set. Psychometrika 50 (1985) 159–179
28. Shireesh, S., Jin, R., Chan, C.: A novel unsupervised classification and feature
selection method which incorporates ontology information. Technical Report MSU-CSE-07-192, Department of Computer Science and Engineering, Michigan State
University.
29. Jin, R., Si, L., Srivastava, S., et al.: A Knowledge Driven Regression Model for
Gene Expression and Microarray Analysis. Proceedings of the 28th IEEE Eng. in
Medicine and Biology, August (2006)
30. Chung, F. R. K.: Spectral Graph Theory. AMS, Providence, RI (1997)
31. http://www.cse.msu.edu/cgi-user/web/tech/document?ID=677.
32. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press
(2004)
33. http://genome-www.stanford.edu/cellcycle/data/rawdata/.
34. Srivastava, S., Chan, C.: Hydrogen peroxide and hydroxyl radicals mediate
palmitate-induced cytotoxicity to hepatoma cells: relation to MPT. Free Radical
Research 41(1) (2007) 38–49
35. Li, Z., Srivastava, S., Mittal, S., et al.: A Three Stage Integrative Pathway Search
(TIPS) framework to identify toxicity relevant genes and pathways. BMC Bioinformatics 8 (2007) 202
36. Kumar, M. J., Nicholls, D. G., Anderson, J. K.: Oxidative alpha-ketoglutarate
dehydrogenase inhibition via subtle elevations in monoamine oxidase B levels results
in loss of spare respiratory capacity: implications for Parkinson’s disease. J. Biol.
Chem. 278(47) (2003) 46432–46439
37. Mailloux, R. J., Beriault, R., Lemire, J., Singh, R., Chenier, D. R., Hamel, R. D.,
et al.: The tricarboxylic acid cycle, an ancient metabolic network with a novel twist.
PLoS ONE 2 (2007) e690
38. Srivastava, S., Chan, C.: Hydrogen peroxide and hydroxyl radicals mediate
palmitate-induced cytotoxicity to hepatoma cells: Relation to mitochondrial permeability transition. Free Radic. Res. 41(1) (2007) 38–49