Learning Phenotype Specific Gene Network by Knowledge Driven Matrix Factorization

Xuerui Yang1, Yang Zhou2, Zheng Li1, Shireesh Srivastava1, Rong Jin2, and Christina Chan1,2

1 Chemical Engineering and Materials Science Department
2 Computer Science and Engineering Department
Michigan State University, East Lansing, MI 48824 USA
{yangxuer, zhouyang, lizheng1, srivas14, rongjin, krischan}@msu.edu

Abstract. A popular method for reconstructing gene networks from microarray data is Bayesian structure learning. However, most Bayesian structure learning algorithms suffer from three major shortcomings: high computational cost, inefficiency in exploiting qualitative knowledge, and the inability to reconstruct phenotype-specific gene networks. We address these three shortcomings by presenting a new framework, which first identifies the genes relevant to the given phenotype using a mixture regression model, and then reconstructs the network for the selected genes with a Knowledge driven Matrix Factorization (KMF) algorithm. We applied the proposed framework to gene expression and phenotypic data and identified highly enriched gene clusters with distinct cellular functions and processes, together with the interactions between the clusters. Most of the interactions predicted by KMF were indeed biologically relevant. In summary, we have developed a framework that can correctly reconstruct the gene network that is specific to a given phenotype.

Key words: microarray data, Knowledge driven Matrix Factorization, phenotype specific genes, network reconstruction, numerical optimization

1 Introduction

Cellular processes are regulated by genes and their products within a complex network. These networks are usually organized in modules, such as the pathways in the citric acid (TCA) cycle of a metabolic network [1]. Disease states may ensue when biological functions are abnormally regulated; for example, cancer arises from abnormal regulation of apoptosis.
Identifying the gene modules and their interactions may provide an understanding of how a biological function or process is regulated and may help provide insights into the disease mechanism. With the advent of high-throughput technologies, one can obtain a comprehensive gene expression profile for a cellular or tissue state. Uncovering the gene modules and module network from the microarray data could provide insight into the underlying mechanisms involved.

A number of clustering methods, such as hierarchical clustering, K-means and self-organizing maps [2], have been applied to identify gene modules. The main disadvantage of clustering methods is that they are unable to identify the interactions among different modules, which are crucial to the understanding of disease mechanisms. This problem is addressed by several studies that integrate clustering methods with structure learning. In [3], a clustering method is combined with the Graphical Gaussian Model (GGM) for module network reconstruction. In [4], the authors presented a Bayesian framework that incorporates the clustering method into Bayesian network learning. Despite their limited success, these approaches are mainly data driven and therefore could be sensitive to the noise within the expression data. In addition, as revealed by several studies [5, 6], structure learning methods tend to suffer from the sparse data problem when the number of experimental conditions is limited. Finally, these approaches are unable to construct phenotype-specific gene networks, which is important to our study.

The aim of this work is to reconstruct a phenotype-specific gene network. This is particularly important when the biological system is comprised of a large number of genes and only a subset of the genes is relevant to the target phenotype. To this end, we divide the process into two phases.
In the first phase, we select the subset of genes that are relevant to the target biological process using a mixture of regression models. We refer to the first phase as “gene selection phase”. In the second phase, we apply the proposed matrix factorization algorithm to the selected genes to reconstruct the gene network. We refer to the second phase as “network reconstruction phase”. The goal of the gene selection phase is to efficiently identify the genes that regulate the desired metabolic/phenotypic response of the cells and identify better targets to regulate cellular processes. This problem is related to feature selection, which is an important problem that has been extensively studied in machine learning. The example algorithms for feature selection include Wilcoxon’s rank sum test [7], Fisher’s Discriminant Analysis (FDA) [8], partial least squares (PLS) [9] or genetic algorithm (GA)-based [10] classification and clustering [11], minimum redundancy and maximum relevance (mRMR) [12], the approaches based on Support Vector Machine (SVM) [13] and LASSO regression (LASSO) [14], kernel Fisher discriminant analysis (KFDA) [15], multi-layer perceptrons. Since the above approaches are data driven, they strongly depend on the quality of the microarray data. Furthermore, they are unable to incorporate the vast amount of information already available on the functions of the genes. To circumvent these issues, alternative analysis methods are being developed which incorporate prior information of the genes [16]. In these knowledge-based methods, the association of gene ontology (GO) categories to the target biological process is evaluated, and the relevant GO categories with high association are used to identify the relevant genes. The problem with these approaches is that they are unable to integrate both the prior knowledge and gene expression data into a unified framework. 
In this work, these problems will be addressed by a Bayesian mixture regression model that incorporates the prior knowledge of gene functions.

The second phase of this work aims to reconstruct the gene network using the selected genes. Previous studies have recognized the importance of exploiting prior knowledge for network reconstruction when expression data are sparse and noisy [17–23]. Often, a Bayesian prior is constructed for the directed edges of the gene network to reflect the known regulator-regulatee relationships that are derived from data such as protein interaction data. Unfortunately, it is very difficult to extend this approach to the co-regulation relationships that can be easily derived from the GO database [24]. Exploiting the GO database for network reconstruction is especially important for mammalian systems, where interaction data are not as readily available as GO information. One aim of this work is to develop a framework of knowledge-driven analysis using high-throughput data that effectively exploits the prior knowledge of co-regulation relationships that may be obtained from GO.

The key challenge with using GO for network reconstruction is that the co-regulation relationships derived from GO may be noisy and inaccurate. We address this problem by developing a knowledge driven matrix factorization, or KMF, algorithm. The key features of the proposed matrix factorization framework are: (1) it derives both the gene modules and their interactions, (2) it incorporates the noisy prior knowledge into network reconstruction via a regularization scheme, and (3) it presents an efficient learning algorithm based on non-negative matrix factorization and semi-definite programming. We emphasize that the difference between our work and the previous work on matrix factorization methods for gene clustering is that the proposed framework is able to derive both the gene modules and their interactions simultaneously.
2 Materials and Methods

In this section, we describe in detail the proposed framework for phenotype specific gene network reconstruction. It consists of two phases, i.e., the gene selection phase and the network reconstruction phase.

2.1 Microarray Gene Expression Data

The gene expression data were obtained for HepG2 cells exposed to free fatty acids (FFAs) and tumor necrosis factor (TNF-α) [34]. Gene expression data were obtained for 15 different conditions. For each condition, 2 microarray replicates were obtained with a color swap for each replicate. The data consisted of 19458 genes. The analysis of variance (ANOVA) was applied to the entire list of genes with P < 0.01 to compare the effect of treatment (e.g. FFA or TNF-α) and to determine whether a treatment had a significant effect. The expression levels of 830 genes were found to be significantly affected by either TNF-α or FFA [35]. The analysis was performed using Matlab 6.3.

2.2 Gene Selection by Knowledge-driven Bayesian Mixture Regression Model

Here, we present a method for identifying a subset of genes that are relevant to a target phenotype. The main idea is to integrate the ontology information of the genes with their expression data (X) to perform unsupervised classification and to identify the genes that regulate the cellular responses (Y). For a given cellular response, regression models are constructed to approximate the cellular response by a linear combination of the gene expression data. The genes with the largest absolute weights are deemed to be important to the target phenotype and therefore will be selected. The phenotype of interest in this study is cytotoxicity. As extensively discussed in the statistics literature, a simple regression model tends to suffer from the sparse data problem and is also sensitive to the noise within the expression data. To overcome these problems, we incorporate the prior knowledge of GO into the regression model via a Bayesian prior.
A central assumption behind this method is that the genes within a GO category would have a similar function or effect on a cellular process. Thus, genes belonging to the same GO category were constrained to have similar regression weights.

The second problem with using the standard linear regression model for gene selection is that the genes with the largest regression weights may not be specific to the target cellular process, since they may also be important to a number of biological processes other than the target process. We address this problem by extending the linear regression model into a mixture of regression models. The main idea behind the mixture model is to cluster the experimental conditions into two subgroups: the subgroup of conditions that are related to the target phenotype, i.e. cytotoxicity, and the subgroup of conditions that are not. A different regression model is built for each subgroup of conditions, and the genes with the largest difference in their regression weights between these two subgroups of conditions are deemed to be specific to the target phenotype. It is important to note that the clustering of the experimental conditions is based on the regression weights of the genes. Since the regression weights of each subgroup of experimental conditions in turn depend on the clustering results, we are left with a circular cause and consequence problem. We resolve this dilemma by using the Expectation Maximization (EM) algorithm. A full description of the Bayesian mixture regression model can be found in our technical report [28]. We omit the mathematical details due to space limitations.

Figure 1 shows the genes ordered by the absolute difference in their regression coefficients (denoted by DRC) between the two regression models. We observe that the DRC values decrease exponentially for the top ranked genes, followed by a slow linear reduction.
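The DRC ranking described above can be sketched in a few lines of Python. This is only an illustration: the function name `rank_by_drc`, the gene labels, and the weight values are hypothetical, and the two weight vectors stand in for the outputs of the mixture regression model, whose fitting procedure is given in [28] and is not reproduced here.

```python
def rank_by_drc(weights_related, weights_unrelated, gene_names):
    """Rank genes by the absolute difference in regression coefficients (DRC).

    weights_related / weights_unrelated: one regression weight per gene from
    the two subgroup models (assumed to be already fitted elsewhere).
    Returns a list of (gene, DRC) pairs, largest DRC first.
    """
    drc = [abs(a - b) for a, b in zip(weights_related, weights_unrelated)]
    order = sorted(range(len(drc)), key=lambda i: drc[i], reverse=True)
    return [(gene_names[i], drc[i]) for i in order]


# Hypothetical weights for four genes (illustrative values only).
w_related = [0.90, 0.10, -0.40, 0.05]
w_unrelated = [0.10, 0.12, 0.30, 0.04]
genes = ["g1", "g2", "g3", "g4"]

ranking = rank_by_drc(w_related, w_unrelated, genes)
for gene, value in ranking:
    print(f"{gene}: DRC = {value:.2f}")
```

Genes whose weights differ most between the two subgroup models come first, which is the ordering plotted in Figure 1.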
To decide the subset of genes with large DRC values, we extract only the genes that belong to the exponential component of the DRC curve. This is done by identifying the point of the DRC curve where the second-order derivative first becomes zero or negative. As a result, 250 genes are selected by this process.

[Fig. 1. Sorted DRC for all the genes. Horizontal axis: Gene Number (0 to 800); vertical axis: abs(DRC) (0 to 0.2).]

2.3 The Knowledge driven matrix factorization (KMF)

We denote the gene expression data by $X = (x_1, x_2, \ldots, x_n)$, where $n$ is the number of genes, and each $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,m}) \in \mathbb{R}^m$ contains the expression levels of the $i$th gene measured under $m$ conditions. We can compute the pairwise correlation between any two genes using a number of statistical correlation metrics, such as Pearson correlation, mutual information and chi-square statistics. This computation results in a symmetric matrix $W = [w_{i,j}]_{n \times n}$, where $w_{i,j}$ measures the correlation between genes $x_i$ and $x_j$. This estimated correlation matrix $W$ provides valuable information about the structure of the gene network, since a high correlation $w_{i,j}$ between two genes $x_i$ and $x_j$ could suggest that: 1) genes $x_i$ and $x_j$ belong to the same module, or 2) gene $x_i$ regulates the expression levels of gene $x_j$ or vice versa. To derive these two types of interactions simultaneously, we factorize $W$ as follows:

$$W \approx M C M^\top.$$

In the above equation, $M$ is a matrix of size $n \times r$ and $C$ is a matrix of size $r \times r$, where $r \ll n$ is the number of modules, which can be determined empirically as we will discuss later. Matrix $M = [m_{i,j}]_{n \times r}$ represents the memberships of the $n$ genes in the $r$ modules, and every $m_{i,j} \ge 0$ indicates the confidence of assigning the $i$th gene to the $j$th module.
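To make the shapes in the factorization concrete, the following NumPy sketch builds a Pearson correlation matrix $W$ from synthetic expression data and forms a candidate reconstruction $M C M^\top$. The data, the sizes, and the random $M$ are illustrative assumptions, not values from this study. Note also that Pearson correlations can be negative while $M$ and $C$ are non-negative; how negative entries are handled is not specified in the text, so the sketch leaves $W$ as computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for expression data: n genes (rows) x m conditions.
n_genes, m_conditions, r_modules = 6, 15, 2
X = rng.normal(size=(n_genes, m_conditions))

# Pearson correlation between genes; np.corrcoef treats each row as a variable.
W = np.corrcoef(X)  # shape (n, n), symmetric, ones on the diagonal

# A candidate factorization: non-negative memberships M (n x r) and a
# symmetric module-interaction matrix C (r x r) with unit diagonal.
M = rng.random((n_genes, r_modules))
C = np.array([[1.0, 0.3],
              [0.3, 1.0]])
W_hat = M @ C @ M.T  # reconstruction, same shape as W

print(W.shape, W_hat.shape)
```

Because $C$ is symmetric, the reconstruction $M C M^\top$ is symmetric as well, matching the symmetry of $W$.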
Matrix $C = [c_{i,j}]_{r \times r}$ represents the relationships among the $r$ modules, and each $c_{i,j} \ge 0$ indicates the confidence that the $i$th and $j$th gene modules interact with (regulate) each other. Note that in this study, we focus on the undirected network, since the gene module regulation matrix $C$ is symmetric.

To determine the appropriate factorization of matrix $W$, we first define a loss function $l_d(W, M C M^\top)$ that measures the difference between $W$ and its factorization $M C M^\top$ as follows:

$$l_d(W, M C M^\top) = \|W - M C M^\top\|_F^2 = \sum_{i,j=1}^{n} \left(W_{i,j} - [M C M^\top]_{i,j}\right)^2$$

Second, we regularize the solution of $M$ using the prior knowledge from GO. We encode the information within GO by a similarity matrix $S$, where $S_{i,j} \ge 0$ represents the similarity between two genes in their biological functions. The discussion of gene similarity by GO can be found in [29]. To ensure that the modules are consistent with the prior knowledge within GO, we introduce another loss function $l_m(M, S)$ that measures the inconsistency between $M$ and $S$ as follows:

$$l_m(M, S) = \sum_{k=1}^{r} m_k^\top L(S)\, m_k = \mathrm{tr}\left(M^\top L(S) M\right)$$

where $m_k$ is the $k$th column of the matrix $M$ and $L(S)$ is the combinatorial Laplacian of the matrix $S$. The definition of the combinatorial Laplacian and its application to regularizing numerical solutions can be found in [30]. Furthermore, we regularize the solution for $C$ by the regularizer $l_c(C) = \|C\|_F^2$. This regularizer is intended to encourage sparse regulation among the gene modules, which is consistent with the scale-free structure of the gene module network. By combining the above factors, we obtain the following optimization problem:

$$
\begin{aligned}
\arg\min_{M \in \mathbb{R}^{n \times r},\; C \in \mathbb{R}^{r \times r}} \quad & l_d(W, Z) + \alpha\, l_m(M, S) + \beta\, l_c(C) \\
\text{s.t.} \quad & C \succeq 0, \quad C_{i,i} = 1,\ i = 1, 2, \ldots, r, \quad C_{i,j} \ge 0,\ i, j = 1, 2, \ldots, r, \\
& M_{i,j} \ge 0,\ i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, r, \quad Z = M C M^\top
\end{aligned}
$$

We solve the above optimization problem through alternating optimization.
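As a sanity check on the three terms of the objective, the sketch below evaluates $l_d + \alpha l_m + \beta l_c$ with NumPy. It assumes the usual definition of the combinatorial Laplacian, $L(S) = D - S$ with $D$ the diagonal matrix of row sums of $S$; the function names and the two-gene toy inputs are hypothetical.

```python
import numpy as np


def laplacian(S):
    """Combinatorial Laplacian L(S) = D - S, with D = diag(row sums of S)."""
    return np.diag(S.sum(axis=1)) - S


def kmf_objective(W, M, C, S, alpha, beta):
    """Evaluate l_d(W, M C M^T) + alpha * l_m(M, S) + beta * l_c(C)."""
    Z = M @ C @ M.T
    l_d = np.sum((W - Z) ** 2)               # squared Frobenius norm of W - Z
    l_m = np.trace(M.T @ laplacian(S) @ M)   # GO-consistency penalty on M
    l_c = np.sum(C ** 2)                     # squared Frobenius norm of C
    return l_d + alpha * l_m + beta * l_c


# Toy example: two genes that GO marks as similar (S[0, 1] = 1).
S = np.array([[0.0, 1.0],
              [1.0, 0.0]])
C = np.eye(2)

M_same = np.array([[1.0, 0.0], [1.0, 0.0]])    # both genes in module 1
M_split = np.array([[1.0, 0.0], [0.0, 1.0]])   # genes in different modules

obj_same = kmf_objective(M_same @ C @ M_same.T, M_same, C, S, alpha=1.0, beta=0.0)
obj_split = kmf_objective(M_split @ C @ M_split.T, M_split, C, S, alpha=1.0, beta=0.0)
print(obj_same, obj_split)
```

With a perfect data fit in both cases, only the GO term differs: placing the two similar genes in the same module costs nothing, while splitting them is penalized. For symmetric $S$, $\mathrm{tr}(M^\top L(S) M) = \tfrac{1}{2}\sum_{i,j} S_{i,j}\,\|M_{i,\cdot} - M_{j,\cdot}\|^2$, which is why similar genes are pulled toward similar membership rows.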
It alternates between optimizing $M$ with $C$ fixed and optimizing $C$ with $M$ fixed, iterating until the solution converges to a local optimum. We describe these two processes as follows.

Optimize $M$ by fixing $C$: The related optimization problem is:

$$
\begin{aligned}
\arg\min_{M \in \mathbb{R}^{n \times r}} \quad & F_m(M) = \|W - M C M^\top\|_F^2 + \alpha\, \mathrm{tr}\left(M^\top L(S) M\right) \\
\text{s.t.} \quad & M_{i,j} \ge 0,\ i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, r
\end{aligned}
$$

To find an optimal solution for $M$, we propose an iterative bound optimization algorithm, the detailed derivation of which can be found in our technical report [31]. The key idea is to upper bound the objective function using the properties of convex functions, and to iteratively update the solution based on the solution of the previous iteration. The new solution for $M$ in each iteration is computed as:

$$M_{i,k} = \tilde{M}_{i,k} \left( \frac{2 c_{i,k}}{b_{i,k} + \sqrt{b_{i,k}^2 + 4 a_{i,k} c_{i,k}}} \right)^{1/2}$$

where $\tilde{M}$ is the solution of the previous iteration and

$$a_{i,k} = [\tilde{M} C \tilde{M}^\top \tilde{M} C]_{i,k}, \quad b_{i,k} = \alpha \tilde{M}_{i,k} D_i, \quad c_{i,k} = \alpha [S \tilde{M}]_{i,k} + [W \tilde{M} C]_{i,k}$$

Optimize $C$ by fixing $M$: The related optimization problem is:

$$
\begin{aligned}
\arg\min_{C \in \mathbb{R}^{r \times r}} \quad & \eta + \beta \xi - 2\, \mathrm{tr}\left(M^\top W M C\right) \\
\text{s.t.} \quad & C_{i,i} = 1,\ i = 1, 2, \ldots, r, \quad C_{i,j} \ge 0,\ i, j = 1, 2, \ldots, r, \quad C \succeq 0, \\
& B = M^\top M C, \quad \eta \ge \sum_{i,j=1}^{r} B_{i,j} B_{j,i}, \quad \xi \ge \sum_{i,j=1}^{r} C_{i,j}^2
\end{aligned}
$$

The above problem can be solved effectively using semi-definite programming techniques [32].

2.4 Tuning the Parameters

According to our experience, two key parameters can significantly affect the outcome of the proposed algorithm: a) $\alpha$, i.e., the weight of the regularization term on $M$, and b) the number of modules (clusters). In this section, we present the evaluation metric that is used to measure the accuracy and stability of the proposed algorithm, followed by a description of the procedures for automatically determining the parameter $\alpha$ and the number of clusters.
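Before turning to the metric, the bound-optimization update for $M$ from Section 2.3 can be illustrated with a direct NumPy transcription. This is a sketch under stated assumptions, not the implementation from the technical report [31]: $D_i$ is taken to be the $i$th row sum of $S$ (the diagonal of $D$ in $L(S) = D - S$), $W$ is assumed non-negative so that $c_{i,k} \ge 0$, the small constant `eps` is added only to avoid division by zero, and the function name and toy data are hypothetical.

```python
import numpy as np


def update_M(M_prev, C, W, S, alpha, eps=1e-12):
    """One update of M with C fixed, following the formula in Section 2.3.

    Assumptions (not stated explicitly in the text): D_i = sum_j S[i, j],
    and W, S, C, M_prev are all element-wise non-negative.
    """
    D = S.sum(axis=1)                                # D_i for each gene
    a = M_prev @ C @ M_prev.T @ M_prev @ C           # a_{i,k}
    b = alpha * M_prev * D[:, None]                  # b_{i,k}
    c = alpha * (S @ M_prev) + W @ M_prev @ C        # c_{i,k}
    ratio = 2.0 * c / (b + np.sqrt(b ** 2 + 4.0 * a * c) + eps)
    return M_prev * np.sqrt(ratio)


# Toy problem: W generated from a known non-negative factorization.
rng = np.random.default_rng(0)
n, r = 8, 2
M_true = rng.random((n, r))
C = np.array([[1.0, 0.2],
              [0.2, 1.0]])
W = M_true @ C @ M_true.T

S = rng.random((n, n))
S = (S + S.T) / 2.0          # symmetric similarity
np.fill_diagonal(S, 0.0)

M = rng.random((n, r)) + 0.1  # strictly positive starting point
for _ in range(50):
    M = update_M(M, C, W, S, alpha=0.1)

print(M.round(3))
```

The update is multiplicative, so entries of $M$ that start non-negative stay non-negative without an explicit projection step. The update for $C$ is a semi-definite program and would need a dedicated solver; it is not sketched here.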
The evaluation metric: If we already know the modules of the genes, namely the ground truth, we can quantitatively evaluate the performance of KMF against the ground truth using the Pairwise F-measure (PWF1), defined by the following equations [25]:

$$\text{precision} = \frac{\#\text{ of pairs correctly predicted to be in the same module}}{\text{total } \#\text{ of pairs predicted to be in the same module}}$$

$$\text{recall} = \frac{\#\text{ of pairs correctly predicted to be in the same module}}{\text{total } \#\text{ of pairs actually in the same module}}$$

$$\text{PWF1} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The precision measures the accuracy in identifying the co-regulated genes, and the recall measures the percentage of co-regulated genes that are correctly identified. PWF1 combines these two factors by their harmonic mean. Note that neither the precision nor the recall alone is appropriate for evaluation, since a high precision can be achieved by making almost no predictions and a high recall can be achieved by predicting all the genes to be in the same cluster.

Tuning the parameter $\alpha$: In the knowledge driven matrix factorization algorithm, the parameter $\alpha$ is used to balance the prior knowledge from GO against the information from the gene expression data. To tune this parameter to achieve the best result, we apply a supervised learning technique. First, we collect a number of gene pairs that should belong to the same module based on well-established biological knowledge about the functions of the genes. Second, we gradually change the parameter within a predefined range (i.e., [0.1, 10] in our experiment), evaluate the results using the given gene relationships, and find the parameter value that gives the best performance in terms of PWF1.

Determining the number of clusters: Another important parameter in the algorithm is the number of modules. Estimating the optimal number of clusters is a major challenge in clustering analysis [27].
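The PWF1 measure defined above can be computed as in the following Python sketch. It assumes that every gene has a single hard module label (for example, the module with the largest membership in $M$, which is an assumption of this example) and that a full ground-truth labelling is available; the function names and the toy labels are hypothetical.

```python
from itertools import combinations


def same_module_pairs(labels):
    """Set of index pairs (i, j), i < j, whose items share a module label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(predicted, truth):
    """Pairwise F-measure (PWF1) between two hard clusterings."""
    pred_pairs = same_module_pairs(predicted)
    true_pairs = same_module_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    if correct == 0:
        return 0.0  # also covers the case where either set of pairs is empty
    precision = correct / len(pred_pairs)
    recall = correct / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


truth = [0, 0, 0, 1, 1]        # hypothetical ground-truth modules
predicted = [0, 0, 1, 1, 1]    # hypothetical KMF assignment

print(pairwise_f1(predicted, truth))
print(pairwise_f1([7, 7, 7, 3, 3], truth))  # same partition, different label names
```

Because the measure is defined on pairs, it does not depend on how the modules are numbered, so it can also be computed between the clusterings of two runs with different random initializations.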
Some algorithms, such as the gap statistic [26], have been proposed to address this problem. In this work, we determine the optimal number of clusters using stability analysis based on the PWF1 measure. The basic assumption of stability analysis is that if the estimated number of clusters is close to the true number of clusters in the data, clustering runs with different random initializations should produce similar results. This can be evaluated through the stability analysis, in which a PWF1 measurement is computed between the clustering results of two runs to reflect the stability. Furthermore, we apply the above procedure in a hierarchical fashion. More specifically, in our experiment, the application of the above procedure results in a split into two major modules. After applying the procedure to the two major modules, we further split them into 4 and 5 smaller modules, respectively. Hence, the final number of clusters is 9 in our experiment.

3 Experimental Results and Discussion

3.1 Application of KMF to identify gene clusters in liver cells

In this section, we applied the matrix factorization algorithm, KMF, to gene expression data obtained from liver cells, where the objective was to identify the gene clusters and the interactions between them. In particular, we want to uncover which clusters of genes are involved in palmitate-induced cytotoxicity and how the clusters interact with each other to produce the toxicity. In our experiment, 9 clusters were identified by KMF from the 250 genes that were selected in phase 1, the gene selection phase. We found that genes with similar functions were highly enriched in their own separate clusters/modules. For example, 30 out of 31 genes in cluster 1 encode the enzymes involved in “lipid metabolism processing”. Cluster 2 has a variety of genes involved in different cellular signaling activities.
These genes encode proteins in G protein-coupled receptor signaling, ion channel-related signaling pathways, and chemokine/cytokine receptor signaling pathways, rendering the major function of cluster 2 to be “signaling”. Five of the 7 genes in cluster 3 belong to glycolysis, and one of the other 2 genes, phosphogluconate dehydrogenase, is involved in the pentose phosphate pathway (PPP), which is primarily an anabolic pathway that utilizes the 6 carbons of glucose to generate 5-carbon sugars and reducing equivalents. Thus, we assigned the function of “glucose metabolism” (glycolysis and PPP) to cluster 3. Most of the genes in cluster 4 are involved in the “post-translational modification of proteins”, including the ubiquitin-proteasome pathway, protein folding, protein transportation, and phosphorylation or dephosphorylation, while cluster 7 consisted primarily of “transcription factors and translational initiation factors” that regulate the synthesis of proteins. Eight of the 10 genes of cluster 6 encode enzymes that catalyze “ATP metabolism”, and similarly, in cluster 8, 16 of 17 genes encode enzymes involved in “amino acid metabolism and the urea cycle”. Most of the genes in cluster 9 encode proteins involved in “apoptosis”, including both the intrinsic and extrinsic apoptosis pathways.

A majority of the genes in cluster 5 are involved in lipid peroxidation, the electron transport chain (ETC), reactive oxygen species (ROS) homeostasis, oxidative stress responses, and the TCA cycle. It is well established that lipid peroxidation, ETC, ROS homeostasis and oxidative stress responses are highly connected with each other in the redox signaling system and therefore regulate some very important cellular activities such as free radical formation, detoxification, immune reactions, and cell death.
The TCA cycle is essential in all oxidative organisms and provides precursors for anabolic processes and reducing factors (NADH and FADH2) that drive the generation of energy. It also plays a role in the oxidative defense machinery: alpha-ketoglutarate, one of the products of the TCA cycle, is a key participant in the detoxification of ROS [36]. Another study that connected the TCA cycle and redox signaling through alpha-ketoglutarate found that three of the enzymes in the TCA cycle were diminished upon oxidative stress [37]. Thus, the TCA cycle plays a crucial role in modulating the cellular redox environment. Disorder of this system is responsible for the development of atherosclerosis, degenerative diseases such as Parkinson’s disease and Alzheimer’s disease, and ageing. Therefore, the major function of cluster 5 is “ROS homeostasis, redox system regulation and TCA cycle”.

In summary, most of the gene groups could be assigned a particular function/process based upon the list of genes enriched in them. The full list of gene clusters is available online at http://www.chems.msu.edu/groups/chan/GO_KMF_genecluster.xls.

3.2 Application of KMF to identify the interactions between gene clusters

Next, we examined whether KMF is able to correctly uncover the interactions among the different clusters of cellular functions. KMF uncovered how these modules interacted based upon the C matrix (see Table 1), whose coefficients indicate the interactions between the modules, analogous to a correlation matrix. Rows 1–9 indicate the interaction values between the clusters, and the bottom row (sum − 1) is the summation of each column minus 1. A higher sum − 1 value indicates that the cluster is more closely connected to the others, and thereby takes a more central position in the network. As the molecular currency of intracellular energy transfer, ATP is either produced or consumed by most cellular activities, such as metabolism and signaling pathways, respectively.
Indeed, cluster 6 (ATP metabolism) has the highest interaction value in every row, and the summation of column 6 is also the largest. Meanwhile, cluster 6 has very high interaction values with clusters 5 and 3, which reflects the fact that glucose metabolism and the TCA cycle are the major metabolic pathways that produce ATP and that ETC and ROS homeostasis are also related to the synthesis and consumption of ATP, respectively. In glucose metabolism, glycolysis is followed by the TCA cycle, and amino acid metabolism and the urea cycle are connected to the TCA cycle, which together produce, as well as use, ATP. Furthermore, most of the amino acids give rise to a net production of pyruvate or TCA cycle intermediates, such as alpha-ketoglutarate or oxaloacetate, all of which are precursors to glucose via gluconeogenesis. High interaction values between clusters 3 (glucose metabolism), 5 (ROS homeostasis, redox system regulation and TCA cycle), 8 (amino acid metabolism and the urea cycle) and 6 (ATP metabolism) were identified by KMF. Therefore, the connections identified by KMF demonstrate that the algorithm was able to capture the interactions involved in ATP generation.

Table 1. C matrix of the clusters. Rows 1–9 contain the interaction values between pairs of clusters, and the bottom row (sum − 1) is the summation of each column minus 1.

Clusters   1      2      3      4      5      6      7      8      9
1          1      0.152  0.234  0.195  0.191  0.275  0.101  0.236  0.176
2          0.152  1      0.177  0.155  0.152  0.214  0.092  0.183  0.140
3          0.234  0.177  1      0.236  0.215  0.305  0.107  0.284  0.209
4          0.195  0.155  0.236  1      0.204  0.295  0.120  0.249  0.188
5          0.191  0.152  0.215  0.204  1      0.302  0.122  0.253  0.186
6          0.275  0.214  0.305  0.295  0.302  1      0.170  0.360  0.267
7          0.101  0.092  0.107  0.120  0.122  0.170  1      0.138  0.108
8          0.236  0.183  0.284  0.249  0.253  0.360  0.138  1      0.227
9          0.176  0.140  0.209  0.188  0.186  0.267  0.108  0.227  1
sum − 1    1.560  1.265  1.767  1.642  1.625  2.188  0.958  1.930  1.501
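The sum − 1 centrality described above is simple to recompute from the published C matrix. The short Python sketch below uses the values of Table 1 as given; the variable names are illustrative only.

```python
# C matrix from Table 1 (clusters 1-9); the matrix is symmetric.
C = [
    [1, 0.152, 0.234, 0.195, 0.191, 0.275, 0.101, 0.236, 0.176],
    [0.152, 1, 0.177, 0.155, 0.152, 0.214, 0.092, 0.183, 0.140],
    [0.234, 0.177, 1, 0.236, 0.215, 0.305, 0.107, 0.284, 0.209],
    [0.195, 0.155, 0.236, 1, 0.204, 0.295, 0.120, 0.249, 0.188],
    [0.191, 0.152, 0.215, 0.204, 1, 0.302, 0.122, 0.253, 0.186],
    [0.275, 0.214, 0.305, 0.295, 0.302, 1, 0.170, 0.360, 0.267],
    [0.101, 0.092, 0.107, 0.120, 0.122, 0.170, 1, 0.138, 0.108],
    [0.236, 0.183, 0.284, 0.249, 0.253, 0.360, 0.138, 1, 0.227],
    [0.176, 0.140, 0.209, 0.188, 0.186, 0.267, 0.108, 0.227, 1],
]

# Column sum minus the unit diagonal entry, i.e. total off-diagonal interaction.
sum_minus_1 = [round(sum(column) - 1, 3) for column in zip(*C)]

# Clusters ordered from most to least central (1-based cluster numbers).
by_centrality = sorted(range(1, 10), key=lambda k: sum_minus_1[k - 1], reverse=True)

print(sum_minus_1)
print(by_centrality)
```

The recomputed values agree with the bottom row of Table 1, and cluster 6 (ATP metabolism) comes out as the most central cluster, followed by clusters 8 and 3.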
Taken together, KMF is able to identify highly enriched gene clusters with distinct cellular functions and the interactions between the clusters. We are currently investigating how clusters 5 and 9 interact with each other in order to uncover potential pathways and mechanisms that may be involved in producing the phenotype specific response of cell death or cytotoxicity in liver cells exposed to saturated FFAs. For example, inhibiting complex I has been found to be effective in preventing palmitate-induced toxicity [38], suggesting the involvement of energy production pathways in the toxicity. In summary, KMF is an approach that can be applied to uncover pathways specific to a phenotype, such as palmitate-induced toxicity, and may be used to elucidate mechanisms involved in diseases by integrating gene expression data and prior knowledge.

Acknowledgements

This work was supported in part by the National Science Foundation (BES 0425821 and IIS-0643494), the National Institutes of Health (1R01GM079688-01 and 1R21CA126136-01), the MSU Foundation and the Center for Systems Biology.

References

1. Segal, E., Friedman, N., Koller, D., Regev, A.: A Module Map Showing Conditional Activity of Expression Modules in Cancer. Nature Genetics 36(10) (2004) 1090–1098
2. Eisen, M. B., Spellman, P. T., Brown, P. O., Botstein, D.: Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc. Natl. Acad. Sci. U.S.A. 95 (1998) 14863–14868
3. Toh, H., Horimoto, K.: Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics 18(2) (2002) 287–297
4. Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., Friedman, N.: Module Networks: Identifying Regulatory Modules and their Condition Specific Regulators from Gene Expression Data. Nature Genetics 34(2) (2003) 166–176
5. Husmeier, D.: Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics 19(17) (2003) 2271–2282
6. Yu, J., Smith, V. A., Wang, P. P., Hartemink, A. J., Jarvis, E. D.: Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20(18) (2004) 3594–3603
7. Troyanskaya, O. G., Garber, M. E., Brown, P. O., et al.: Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics 18(11) (2002) 1454–1461
8. Chan, C., Hwang, D., Stephanopoulos, G. N., et al.: Application of Multivariate Analysis to Optimize Function of Cultured Hepatocytes. Biotechnol. Prog. 19(2) (2003) 580–598
9. Tan, Y., Shi, L., Tong, W., et al.: Multi-class tumor classification by discriminant partial least squares using microarray gene expression data and assessment of classification models. Computational Biology 28(3) (2004) 235–243
10. Liu, J. J., Cutler, G., Li, W., et al.: Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics 21(11) (2005) 2691–2697
11. Guo, L., Ma, Y., Ward, R., et al.: Constructing Molecular Classifiers for the Accurate Prognosis of Lung Adenocarcinoma. Clinical Cancer Research 12 (2006) 3344–3354
12. Ding, C., Peng, H.: Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 3 (2005) 185–205
13. Brown, M. P., Grundy, W. N., Lin, D., et al.: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. U.S.A. 97 (2000) 262–267
14. Roth, V.: The generalized LASSO. IEEE Trans. Neural Networks 15(1) (2004) 16–18
15. Cho, J., Lee, D., Park, J., Lee, I.: Gene selection and classification from microarray data using kernel machine. FEBS Letters 571 (2004) 93–98
16. Le, P. P., Bahl, A., Ungar, L. H.: Using Prior Knowledge to Improve Genetic Network Reconstruction from Microarray Data. In Silico Biology 4(3) (2004) 335–353
17. Bar-Joseph, Z., Gerber, G. K., Lee, T. I., et al.: Computational discovery of gene modules and regulatory networks. Nature Biotechnology 21 (2003) 1337–1342
18. Berman, B. P., Nibu, Y., Pfeiffer, B. D., et al.: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proceedings of the National Academy of Sciences 99(2) (2002) 757–762
19. Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., Young, R. A.: Combining location and expression data for principled discovery of genetic regulatory networks. Pacific Symposium on Biocomputing (2002) 437–449
20. Ideker, T., Thorsson, V., Ranish, J. A., et al.: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292 (2001) 929–34
21. Ihmels, J., Friedlander, G., Bergmann, S., et al.: Revealing modular organization in the yeast transcriptional network. Nature Genetics 31 (2002) 370–377
22. Pilpel, Y., Sudarsanam, P., Church, G. M.: Identifying regulatory networks by combinatorial analysis of promoter elements. Nature Genetics 29 (2001) 151–159
23. Li, F., Yang, Y.: Recovering Genetic Regulatory Networks from Micro-Array Data and Location Analysis Data. Genome Informatics 15(2) (2004) 131–140
24. Heckerman, D.: A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research (1996)
25. Liu, Y., Jin, R.: BoostCluster: Boosting Clustering by Pairwise Constraints. In KDD’07: The 13th International Conference on Knowledge Discovery and Data Mining (2007) 450–459
26. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Dept. of Statistics, Stanford University (2000)
27. Milligan, G. W., Cooper, M. C.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50 (1985) 159–179
28. Shireesh, S., Jin, R., Chan, C.: A novel unsupervised classification and feature selection method which incorporates ontology information. Technical Report (MSU-CSE-07-192), Department of Computer Science and Engineering, Michigan State University.
29. Jin, R., Si, L., Srivastava, S., et al.: A Knowledge Driven Regression Model for Gene Expression and Microarray Analysis. Proceedings of the 28th IEEE Eng. in Medicine and Biology, August (2006)
30. Chung, F. R. K.: Spectral Graph Theory. AMS, Providence, RI (1997)
31. http://www.cse.msu.edu/cgi-user/web/tech/document?ID=677
32. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)
33. http://genome-www.stanford.edu/cellcycle/data/rawdata/
34. Srivastava, S., Chan, C.: Hydrogen peroxide and hydroxyl radicals mediate palmitate-induced cytotoxicity to hepatoma cells: relation to MPT. Free Radical Research 41(1) (2007) 38–49
35. Li, Z., Srivastava, S., Mittal, S., et al.: A Three Stage Integrative Pathway Search (TIPS) framework to identify toxicity relevant genes and pathways. BMC Bioinformatics 8(202) (2007)
36. Kumar, M. J., Nicholls, D. G., Anderson, J. K.: Oxidative alpha-ketoglutarate dehydrogenase inhibition via subtle elevations in monoamine oxidase B levels results in loss of spare respiratory capacity: implications for Parkinson’s disease. J. Biol. Chem. 278(47) (2003) 46432–46439
37. Mailloux, R. J., Beriault, R., Lemire, J., Singh, R., Chenier, D. R., Hamel, R. D., et al.: The tricarboxylic acid cycle, an ancient metabolic network with a novel twist. PLoS ONE (2007) 2:e690
38. Srivastava, S., Chan, C.: Hydrogen peroxide and hydroxyl radicals mediate palmitate-induced cytotoxicity to hepatoma cells: Relation to mitochondrial permeability transition. Free Radic. Res. 41(1) (2006) 38–49