
Chinese Journal of Electronics, Vol.21, No.4, Oct. 2012

Missing Value Estimation for Gene Expression Profile Data∗

WANG Xuesong, LIU Qingfeng and CHENG Yuhu
(School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116, China)

Abstract - A new missing value (MV) estimation method for gene expression profile data is proposed that considers both the internal and external conditions of gene expression profiles. The internal condition emphasizes the time-series characteristic of gene expression profile data, so a cubic spline fitting method is used to construct a gene expression curve from which MVs are estimated. The main idea of MV estimation based on the external condition is to reconstruct MVs from the expression values of candidate genes. First, an initial subset of candidate genes is determined by defining a trace matrix. Then a final subset of candidate genes is constructed by selecting genes from the initial subset according to an improved Pearson correlation coefficient. Finally, the K genes most correlated with the target gene are selected from the final subset, and the weighted sum of their K expression values is taken as the estimated value of the target gene based on the external condition. Experimental results indicate that, compared with the commonly used MV estimation methods KNNimpute, SKNNimpute and IKNNimpute, the proposed method has higher estimation accuracy and is robust to the choice of K.

Key words - Gene expression profile, Missing value, Correlation coefficient, Curve fitting, Trace matrix.

I. Introduction

With the arrival of the post-genomic era, more and more researchers have turned their attention to constructing gene regulatory networks (GRNs) in recent years. Gene expression profile data obtained from microarray technology are an important source material for constructing GRNs.
Generally speaking, GRNs can be constructed using a data analytical method[1] or a biological method[2]. However, due to the imperfections of microarray experiments, most gene expression profile datasets contain about 5% missing values (MVs) on average[3]. Because of MVs, many data analytical methods cannot perform well[4]. Therefore, it is necessary to design suitable MV estimation methods.

So far, the simple ways of dealing with MVs include removing the genes with MVs directly (case deletion), or replacing the MVs of a gene with zero or with the average of the observed values of that gene[5]. Case deletion may bias the results if the remaining cases are unrepresentative of the entire sample. Because the same value is used to replace all MVs in a given gene, both zero and mean substitution reduce the variance of the variable in question[6].

Scholz et al. pointed out that the key to MV estimation is to find a relationship between genes[7]. Based on this idea, many methods have been developed, which can be classified into two categories: global strategies and local strategies[8]. The global strategy assumes that all genes in a dataset share a common covariance structure, so it is only suitable for datasets with strong global correlation. The local strategy exploits the local similarity structure of genes, and can therefore deal with noise and with time-series gene expression data. Typical methods using the local strategy are the weighted K-nearest neighbor imputation (KNNimpute)[9] and its improved variants, including the sequential KNN imputation method (SKNNimpute)[10] and the iterative KNN imputation method (IKNNimpute)[11]. These methods estimate MVs by constructing weights between the target gene and each candidate gene. They take only the external condition, i.e., the expression profiles of candidate genes, into consideration.
Therefore, good estimation results can be obtained only when there is a strong correlation between the target gene and each candidate gene. If the number of genes is large, the probability of finding candidate genes that are strongly correlated with the target gene is high, which is helpful for MV estimation. In practice, however, the number of genes is often small. An insufficient number of genes may result in weak correlation and, in turn, a large MV estimation error when only the external condition is considered.

A gene expression profile is usually denoted by a large matrix. A row of the matrix represents the expression of one gene under different experimental conditions (time points), which is time-series data. Therefore, there exists an internal condition among the time-series data of a gene, which is related only to the gene itself. Like the external condition, the internal condition is also applicable to MV estimation. In our study, a novel MV estimation method is proposed by considering both the internal and external conditions. The estimated value of an MV is composed of two components. The first is obtained from a curve fitting result based on the internal condition, and the second is the weighted sum of observed values over candidate genes based on the external condition.

∗ Manuscript Received Mar. 2011; Accepted May 2011. This work is supported by the National Natural Science Foundation of China (No.60804022, No.60974050, No.61072094), Program for New Century Excellent Talents in University (No.NCET-08-0836), Fok Ying-Tung Education Foundation for Young Teachers (No.121066), Natural Science Foundation of Jiangsu Province (No.BK2008126).

II. Methods and Materials

1. Notation

The dataset of a gene expression profile can be denoted as a matrix $v = (x_{ij})_{N \times M}$, where $x_{ij}$ represents the value of gene $i$ at the $j$th time point, and $N$ and $M$ are the numbers of genes and time points respectively. In our study, a gene with MVs is called a target gene, and the genes with available information for estimating its missing entries constitute the set of candidate genes. If the value of target gene $y$ at time point $z$ is missing, the estimated value of the MV is denoted as $\hat{x}_{yz}$.

2. KNNimpute, SKNNimpute and IKNNimpute

KNNimpute is a classical method. It uses the principle of minimum Euclidean distance to select the K nearest neighbors of the target gene, and then reconstructs the MVs of the target gene by a weighted average of the K neighbors[9]. To compute the Euclidean distance $d_{yi}$ between the target gene $y$ and each candidate gene $i$, a matrix $r$ is introduced. If the value of gene $i$ at the $j$th time point is missing, the $ij$th element of $r$, $r_{ij}$, is equal to 0; otherwise, it is 1. The Euclidean distance is defined as:

$$d_{yi} = \sqrt{\frac{\sum_{j=1}^{M} r_{yj} r_{ij} (x_{yj} - x_{ij})^2}{\sum_{j=1}^{M} r_{yj} r_{ij}}} \tag{1}$$

The weight between the selected gene $k$ and the target gene $y$ is defined as:

$$w_{yk} = \frac{1/d_{yk}}{\sum_{k=1}^{K} 1/d_{yk}} \tag{2}$$

The estimated value $\hat{x}_{yz}$ takes the following form:

$$\hat{x}_{yz} = \sum_{k=1}^{K} w_{yk} x_{kz} \tag{3}$$

As an improved method, SKNNimpute differs from KNNimpute on two main points: (1) MVs are reconstructed sequentially, starting from the genes with the smallest missing rate. (2) Once all the MVs of a target gene have been reconstructed, that gene can be reused as a candidate gene in the subsequent estimation. In SKNNimpute, a dataset is divided into two parts, $x_{\mathrm{incomplete}}$ and $x_{\mathrm{complete}}$, where $x_{\mathrm{incomplete}}$ is formed by all target genes and $x_{\mathrm{complete}}$ is formed by all candidate genes. Generally, the K nearest neighbors are selected from $x_{\mathrm{complete}}$.

Compared with KNNimpute, IKNNimpute is based on an iterative procedure. It follows three steps:
(1) Replace all MVs with the average values of their corresponding genes.
(2) Estimate all MVs through the SKNNimpute procedure.
(3) Compute the sum of squared differences between the last two estimations. If the sum is larger than a predefined threshold, return to (2) and continue until it is smaller than the threshold.

3. Our method

Generally speaking, interpolation is good at estimating MVs in time-series data, and cubic spline fitting is particularly suitable for processing gene expression profile data. A gene expression curve obtained by cubic spline fitting not only reflects the internal growth and change pattern of a gene but can also be used to reconstruct MVs. Therefore, the cubic spline fitting method is used here to create the gene expression curve, and the estimated value of $x_{yz}$ obtained from it is denoted as $x^{f}_{yz}$.

For each gene, a threshold value is defined according to Eq.(4).

$$\mathrm{Threshold}_i = |\mu_i| + \lambda |\sigma_i| \tag{4}$$

where $\mu_i$ and $\sigma_i$ are the mean and variance of the expression value of gene $i$ respectively, and $\lambda$ is a predefined constant with $\lambda > 0$. Let $T = (t_{ij})_{N \times M}$ be a trace matrix, where $t_{ij}$ is defined as:

$$t_{ij} = \begin{cases} 0, & x_{ij} = \mathrm{NaN} \\ 1, & |x_{ij}| < \mathrm{Threshold}_i \\ 2, & |x_{ij}| \ge \mathrm{Threshold}_i \end{cases} \tag{5}$$

where NaN represents an MV. Then we can get an initial subset of candidate genes:

$$v^{i}_{\mathrm{candidate}} = (x_{oj})_{O \times M}, \quad \text{if } t_{yz} = 0 \text{ and } t_{oz} \ne 0 \tag{6}$$

Gene expression values reflect the level of activity of a gene under different experimental conditions, and large values represent a strong level of activity. The larger the gene expression value, the smaller the measurement error. In other words, large gene expression values are helpful for improving the estimation accuracy of MVs. The traditional Pearson correlation coefficient measures the overall similarity between any two genes, but it neglects the influence of large gene expression values. In order to highlight the effect of large gene expression values, an improved Pearson correlation coefficient $r_{yi}$ between the target gene $y$ and each candidate gene $i$ is proposed:

$$r_{yi} = \frac{\sum_{j=1}^{M} \big((t_{yj} \cap t_{ij}) x_{yj} - \bar{x}_y\big)\big((t_{yj} \cap t_{ij}) x_{ij} - \bar{x}_i\big)}{\sqrt{\sum_{j=1}^{M} \big((t_{yj} \cup t_{ij}) x_{yj} - \bar{x}_y\big)^2 \sum_{j=1}^{M} \big((t_{yj} \cup t_{ij}) x_{ij} - \bar{x}_i\big)^2}} \tag{7}$$

where '∩' and '∪' denote the minimizing and maximizing operations respectively. Note that if either of the two variables $t_{yj}$ and $t_{ij}$ is zero, the result is zero regardless of whether the '∩' or the '∪' operation is applied. Let $\psi$ be the threshold of the improved correlation coefficient defined in Eq.(7). We can then obtain the final subset of candidate genes, denoted by

$$v^{f}_{\mathrm{candidate}} = (x_{lj})_{L \times M}, \quad \text{if } r_{yl} > \psi \tag{8}$$

After getting the final subset $v^{f}_{\mathrm{candidate}}$, we select the K genes with the largest improved correlation coefficients from the final subset of candidate genes, and construct a weight:

$$w_{yk} = \frac{r_{yk}}{\sum_{k=1}^{K} r_{yk}} \times \frac{L}{O} \tag{9}$$

where $O$ and $L$ represent the numbers of genes in the initial and final subsets respectively. The K weights therefore sum to $L/O$, and the remaining weight $1 - L/O$ is assigned to the spline estimate. According to Eq.(10), the estimated value $\hat{x}_{yz}$ can be obtained.

$$\hat{x}_{yz} = \left(1 - \frac{L}{O}\right) x^{f}_{yz} + \sum_{k=1}^{K} w_{yk} x_{kz} \tag{10}$$

The steps of constructing MVs through our method can be summarized as follows.
Step 1 Sort target genes in ascending order according to their missing rates;
Step 2 Execute the cubic spline curve fitting operation for target gene $y$;
Step 3 For the MV $x_{yz}$ in gene $y$:
(a) Obtain the estimated value $x^{f}_{yz}$ from the cubic spline curve fitting;
(b) Calculate the trace matrix $T$ by Eq.(5) and obtain the initial subset of candidate genes according to Eq.(6);
(c) Compute the improved Pearson correlation coefficient between target gene $y$ and each initial candidate gene according to Eq.(7), and obtain the final subset of candidate genes through Eq.(8);
(d) Select K genes according to the magnitude of the correlation coefficients between target gene $y$ and the final candidate genes, then construct the weights according to Eq.(9);
(e) Obtain the estimated value $\hat{x}_{yz}$ according to Eq.(10) and replace the MV with $\hat{x}_{yz}$;
(f) Compute the difference $\delta$ between the former and the current estimated values. If $|\delta| \le \tau$, go to Step 4; otherwise, return to Step 3(b) and iterate until the convergence criterion $\tau$ is reached;
Step 4 Reconstruct the next MV in target gene $y$ until all MVs in target gene $y$ have been replaced;
Step 5 Go to the next target gene until all target genes have been estimated.

4. Dataset

In our study, we use the Saccharomyces microarray dataset published by Spellman et al. to validate our method. The dataset can be described as a matrix with rows corresponding to genes and columns to experimental conditions[12]. Three datasets, TD1, TD2 and TD3, obtained under three different experimental conditions, i.e., cdc28, cdc15 and alpha respectively, are used (http://cellcycle-www.stanford.edu/). Here, TD1 is used for feasibility analysis, while TD2 and TD3 are used for comparative analysis. The original datasets TD1 and TD2 are pre-processed for the evaluation by removing rows and columns containing missing expression values, yielding 'complete' datasets. Table 1 shows the attributes of these datasets.

Table 1. Attributes of datasets

                                TD1       TD2       TD3
Experimental condition          cdc28     cdc15     alpha
Dimension of original dataset   6179×17   6179×24   6179×18
Dimension of complete dataset   1383×17   4380×24   -

III. Results and Analysis

1. Parameter sets

In our method, three parameters need to be set: $\lambda$, $\tau$ and $\psi$. Here, we set $\lambda$ and $\tau$ to 5 and $10^{-3}$ respectively. Theoretically, two genes have strong similarity if the absolute value of their Pearson correlation coefficient is larger than 0.75[9]. For the improved correlation coefficient defined in Eq.(7), when the minimizing operation yields 1 or 0 and the maximizing operation yields 2, we get the smallest value, which is a quarter of the Pearson correlation coefficient (0.75/4 ≈ 0.19). Therefore, $\psi$ is set to 0.2.

2. Feasibility analysis

For the complete dataset TD1, some true values were deleted at random to create a test dataset. The Proportion of genes containing MVs (PGMV) in the test dataset TD1 was set to 1%, 5%, 10% and 15% respectively. We then used our method to recover the introduced missing values and used the Normalized root mean squared error (NRMSE) as the evaluation index.

$$\mathrm{NRMSE} = \frac{1}{\sigma_{\mathrm{true}}} \sqrt{\frac{\sum (V_{\mathrm{true}} - V_{\mathrm{est}})^2}{n}} \tag{11}$$

where $n$ is the number of MVs and $\sigma_{\mathrm{true}}$ is the standard deviation of the $n$ true values. $V_{\mathrm{true}}$ and $V_{\mathrm{est}}$ represent the true and estimated values of the MVs respectively. The NRMSE of the estimation is shown in Fig.1.

Fig. 1. NRMSE obtained by our method under different PGMVs

From Fig.1, it can be seen that the NRMSE first decreases and then increases with the number of nearest neighbors. For all values of PGMV, the NRMSE remains basically unchanged when K varies between 10 and 40. In addition, for a fixed K, the NRMSE increases as the PGMV increases. Brás and Menezes did the same experiment on TD1[11], and their results showed that the NRMSEs obtained by KNNimpute, SKNNimpute and IKNNimpute were larger than 0.6. Fig.1 shows that all NRMSEs obtained by our method are smaller than 0.5.
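The evaluation protocol of the feasibility analysis, with the NRMSE of Eq.(11), can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names `nrmse` and `mask_random_genes` are hypothetical, the data are synthetic rather than TD1, exactly one value is deleted per selected gene (the number of deletions per gene is not stated in the text), the population standard deviation is used for σ_true, and a simple row-mean substitution stands in for the estimator being evaluated.

```python
import numpy as np


def nrmse(v_true, v_est):
    """NRMSE of Eq.(11): RMSE over the n deleted entries divided by the
    standard deviation of the n true values (population std, an assumption)."""
    v_true = np.asarray(v_true, dtype=float)
    v_est = np.asarray(v_est, dtype=float)
    rmse = np.sqrt(np.mean((v_true - v_est) ** 2))
    return rmse / np.std(v_true)


def mask_random_genes(data, pgmv, rng):
    """Delete one value at random in a fraction `pgmv` of the genes (rows).

    Assumption: one deleted entry per selected gene.
    Returns the masked copy and the row/column indices of the deleted entries.
    """
    masked = np.array(data, dtype=float, copy=True)
    n_genes, n_points = masked.shape
    n_target = int(round(pgmv * n_genes))
    rows = rng.choice(n_genes, size=n_target, replace=False)
    cols = rng.integers(0, n_points, size=n_target)
    masked[rows, cols] = np.nan
    return masked, rows, cols


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    complete = rng.normal(size=(200, 17))  # synthetic stand-in for a complete dataset
    for pgmv in (0.01, 0.05, 0.10, 0.15):
        masked, rows, cols = mask_random_genes(complete, pgmv, rng)
        # Stand-in estimator: mean of the observed values of each target gene.
        estimates = np.nanmean(masked, axis=1)[rows]
        print(pgmv, nrmse(complete[rows, cols], estimates))
```

Any estimator that returns one value per deleted entry can replace the row-mean line, so the same loop can be used to compare methods at each PGMV.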
These results indicate that our method has higher estimation accuracy and is feasible for MV estimation.

3. Comparative analysis

In our study, we select 170 genes at random from the complete dataset TD2 to compare four MV estimation methods, i.e., KNNimpute, SKNNimpute, IKNNimpute and our method. Following Troyanskaya et al.[9] and Brás and Menezes[11], we select four values of K, namely 5, 10, 15 and 20, to show the results of the comparative analysis. The evaluation indexes used here are the Average effective error (AEE) and the Error rate (ER). AEE is the average of the estimation errors that are smaller than 1, and ER is the rate of estimation errors that are larger than 1.

$$\mathrm{AEE} = \frac{1}{n'} \sum (V_{\mathrm{true}} - V_{\mathrm{est}}) \tag{12}$$

$$\mathrm{ER} = (n - n')/n \tag{13}$$

where $n'$ is the number of errors that are smaller than 1.

Fig. 2. AEE obtained by different methods over different values of K

From Fig.2, we can see that the AEE of our method is lower than that of the other three methods. Moreover, the AEE of our method follows the same trend over the different values of K, which means that our method has the highest stability. Table 2 shows that the ER of our method is the lowest among all methods at the same PGMV. Therefore, it can be concluded that our method has better estimation accuracy and stability than the KNNimpute, SKNNimpute and IKNNimpute methods.

Table 2. ER obtained by different methods over different values of PGMV

PGMV   KNNimpute   SKNNimpute   IKNNimpute   Our method
5%     31.3%       31.3%        25.0%        12.5%
10%    42.2%       37.5%        30.4%        25.0%
15%    47.0%       36.2%        32.4%        16.7%
20%    52.4%       42.0%        39.2%        15.6%

In the third experiment, we select 104 genes from TD3 and use them for cluster analysis. From the research of Spellman et al., we know that the 104 genes can be classified into 5 classes containing 21, 8, 9, 15 and 51 genes respectively[12]. In the 5 classes, 7, 0, 1, 5 and 9 genes respectively contain MVs.
First we reconstruct all MVs in the input data with each of the four MV estimation methods, and then use the Fuzzy C-means clustering (FCM) algorithm to classify these genes. The number of genes correctly classified is shown in Table 3. It can be seen from Table 3 that more genes are classified correctly using our method. The reason is that our method has higher MV estimation accuracy; therefore, the gene expression data obtained after the reconstruction of MVs are more suitable for clustering analysis.

Table 3. Number of genes correctly classified

MV estimation method   Class 1   Class 2   Class 3   Class 4   Class 5   Total
KNNimpute              13        8         6         13        28        68
SKNNimpute             14        8         6         13        28        69
IKNNimpute             13        8         6         13        29        69
Our method             15        8         6         13        30        72

IV. Conclusions

DNA microarray is a high-throughput technology that allows the expression levels of thousands of genes to be recorded simultaneously, giving a global view of gene expression. The data generated in a set of microarray experiments are usually gathered in a matrix with genes in rows and experimental conditions in columns. Frequently, these matrices contain MVs due to imperfections in the microarray experiment. MVs may degrade the precision or the stability of data analytical methods for gene expression data. In this work, we propose a new method for estimating MVs in gene expression data from the viewpoint of both the internal and external conditions, i.e., the estimated value of a target gene is composed of two components. The first is the result of cubic spline fitting, and the second is the weighted sum of expression values over the K candidate genes that are most similar to the target gene. In order to highlight the effect of large gene expression values, an improved Pearson correlation coefficient is proposed to measure the similarity between the target gene and each candidate gene.
Experimental results on the Saccharomyces microarray dataset verify the feasibility and validity of the proposed MV estimation method.

References

[1] X.S. Wang, Y.Y. Gu, Y.H. Cheng et al., "Construction of delay gene regulatory network based on complex network", Acta Electronica Sinica, Vol.38, No.11, pp.2518–2522, 2010. (in Chinese)
[2] M. Choi, O.H. Lee, S. Jeon et al., "The oocyte-specific transcription factor, Nobox, regulates the expression of Pad6, a peptidylarginine deiminase in the oocyte", FEBS Letters, Vol.584, No.16, pp.3629–3634, 2010.
[3] A.G. De Brevern, S. Hazout, A. Malpertuy, "Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering", BMC Bioinformatics, Vol.5, pp.114–125, 2004.
[4] X.S. Wang, Y.Y. Gu, Y.H. Cheng et al., "An ensemble classifier based on selective independent component analysis of DNA microarray data", Chinese Journal of Electronics, Vol.18, No.4, pp.645–649, 2009.
[5] J.L. Schafer, J.W. Graham, "Missing data: our view of the state of the art", Psychological Methods, Vol.7, No.2, pp.147–177, 2002.
[6] M.T. Swain, J.J. Mandel, W. Dubitzky, "Comparative study of three commonly used continuous deterministic methods for modeling gene regulation networks", BMC Bioinformatics, Vol.11, pp.459–484, 2010.
[7] M. Scholz, F. Kaplan, C.L. Guy et al., "Non-linear PCA: a missing data approach", Bioinformatics, Vol.21, No.20, pp.3887–3895, 2005.
[8] R. Jörnsten, H.Y. Wang, J.W. William et al., "DNA microarray data imputation and significance analysis of differential expression", Bioinformatics, Vol.21, No.22, pp.4155–4161, 2005.
[9] O. Troyanskaya, M. Cantor, G. Sherlock et al., "Missing value estimation methods for DNA microarrays", Bioinformatics, Vol.17, No.6, pp.520–525, 2001.
[10] K.Y. Kim, B.J. Kim, G.S. Yi, "Reuse of imputed data in microarray analysis increases imputation efficiency", BMC Bioinformatics, Vol.5, pp.160–169, 2004.
[11] L.P. Brás, J.C. Menezes, "Improving cluster-based missing value estimation of DNA microarray data", Biomolecular Engineering, Vol.24, No.2, pp.273–282, 2007.
[12] P.T. Spellman, G. Sherlock, M.Q. Zhang et al., "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization", Molecular Biology of the Cell, Vol.9, No.12, pp.3273–3297, 1998.

WANG Xuesong received the Ph.D. degree from China University of Mining and Technology in 2002. She is currently a professor in the School of Information and Electrical Engineering, China University of Mining and Technology. Her main research interests include machine learning, bioinformatics and artificial intelligence.

LIU Qingfeng received the B.S. degree from China University of Mining and Technology in 2009. He is currently an M.S. candidate in the School of Information and Electrical Engineering, China University of Mining and Technology. His main research interest is bioinformatics.

CHENG Yuhu received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2005. He is currently a professor in the School of Information and Electrical Engineering, China University of Mining and Technology. His main research interests include machine learning and intelligent systems.
