Chinese Journal of Electronics
Vol.21, No.4, Oct. 2012
Missing Value Estimation for Gene Expression
Profile Data∗
WANG Xuesong, LIU Qingfeng and CHENG Yuhu
(School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116, China)
Abstract — A new Missing value (MV) estimation
method for gene expression profile data is proposed by considering both the internal and external conditions of gene
expression profiles. The internal condition emphasizes the
time-series characteristic of gene expression profile data.
Therefore, we can use the cubic spline fitting method to
construct a gene expression curve so as to estimate MVs.
The main idea of MV estimation based on the external condition is to reconstruct MVs according to the expression
values of candidate genes. First, an initial subset of candidate genes is determined by defining a trace matrix. Then a final subset of candidate genes is constructed by selecting genes from the initial subset according to an improved Pearson correlation coefficient. Finally, we select the K genes that are most correlated with the target gene from the final subset and compute the weighted sum of their K expression values; this weighted sum is the estimated value of the target gene based on the external condition. Experimental results indicate that, compared with the commonly used MV estimation methods KNNimpute, SKNNimpute and IKNNimpute, the proposed method has higher estimation accuracy and is robust to the choice of K.
Key words — Gene expression profile, Missing value,
Correlation coefficient, Curve fitting, Trace matrix.
I. Introduction
With the arrival of the post-genomic era, more and more researchers have turned their attention to constructing Gene regulatory networks (GRNs) in recent years. Gene expression profile data obtained from microarray technology are an important source for constructing GRNs. Generally speaking, GRNs can be constructed using a data analytical method[1] or a biological method[2]. However, due to the imperfections of microarray experiments, most gene expression profile datasets contain, on average, about 5% Missing values (MVs)[3]. Because of MVs, many data analytical methods cannot perform well[4]. Therefore, it is necessary to design suitable MV estimation methods.
So far, the simple approaches usually applied to MVs include removing the genes with MVs directly (case deletion), or replacing the MVs of a gene with zero or with the average of the observed values of that gene[5]. Case deletion may bias the results if the remaining cases are unrepresentative of the entire sample. Because the same value is used to replace all MVs in a given gene, both zero and mean substitution reduce the variance of the variable in question[6]. Scholz et al. pointed out that the key to MV estimation is to find a relationship between genes[7]. Based on this idea, many methods have been developed, which can be classified into two categories: the global strategy and the local strategy[8]. The global strategy assumes that all genes in a dataset share a covariance structure; it is therefore only suitable for datasets with strong global correlation. The local strategy exploits the local similarity structure of genes, and can therefore deal with noise and with time-series gene expression data.
Typical methods using the local strategy are the weighted K-nearest neighbor imputation (KNNimpute)[9] and its improved variants, including the sequential KNN imputation method (SKNNimpute)[10] and the iterative KNN imputation method (IKNNimpute)[11]. These methods estimate MVs by constructing weights between the target gene and each candidate gene. They take only the external condition, i.e., the expression profiles of candidate genes, into consideration. Therefore, good estimation results can be obtained only when there is a strong correlation between the target gene and each candidate gene. If the number of genes is large, the probability of finding candidate genes that are strongly correlated with the target gene is high, which is helpful for MV estimation. In practice, however, the number of genes is often small. An insufficient number of genes may then result in weak correlation and, consequently, in a large MV estimation error when only the external condition is considered.
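As a rough illustration of how a local, external-condition method of this kind weights candidate genes, the following Python sketch estimates one missing entry in the KNNimpute style: the distance is the root mean squared difference over the time points observed in both genes, and the K nearest candidates are combined with inverse-distance weights. The function name, the NaN encoding of MVs and the handling of zero distances are illustrative assumptions, not details taken from [9].

```python
import math


def knn_impute_entry(data, y, z, k):
    """Estimate data[y][z] from the k nearest candidate genes (rows of data)."""
    target = data[y]
    scored = []
    for i, row in enumerate(data):
        if i == y or math.isnan(row[z]):
            continue  # a candidate must be observed at time point z
        # Squared differences over the time points observed in both genes.
        diffs = [(a - b) ** 2 for a, b in zip(target, row)
                 if not math.isnan(a) and not math.isnan(b)]
        if diffs:
            scored.append((math.sqrt(sum(diffs) / len(diffs)), row[z]))
    neighbors = sorted(scored)[:k]
    if not neighbors:
        raise ValueError("no candidate gene is observed at this time point")
    # Assumption: an exact match (distance 0) makes 1/d undefined, so the
    # exact matches are averaged directly instead of being weighted.
    exact = [value for dist, value in neighbors if dist == 0.0]
    if exact:
        return sum(exact) / len(exact)
    inverse = [1.0 / dist for dist, _ in neighbors]
    weighted = sum(w * value for w, (_, value) in zip(inverse, neighbors))
    return weighted / sum(inverse)
```

With a small number of genes, the nearest rows may still be far from the target gene, which is the weakness of purely external methods discussed above.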
It is well known that a gene expression profile is usually represented by a large matrix. A row of the matrix represents the expression of a gene under different experimental conditions (time points), which is time-series data. Therefore, there exists an internal condition among the time-series data of a gene which is related only to the gene itself. Like the external condition, the internal condition is also applicable to MV estimation. In our study, a novel MV estimation method is proposed by considering both the internal and external conditions. The estimated values of MVs are composed of two components. The first one is obtained from a curve fitting result based on the internal condition, and the second is the weighted sum of observed values over candidate genes based on the external condition.
∗ Manuscript Received Mar. 2011; Accepted May 2011. This work is supported by the National Natural Science Foundation of China (No.60804022, No.60974050, No.61072094), Program for New Century Excellent Talents in University (No.NCET-08-0836), Fok Ying-Tung Education Foundation for Young Teachers (No.121066), Natural Science Foundation of Jiangsu Province (No.BK2008126).
II. Methods and Materials
1. Notation
The dataset of a gene expression profile can be denoted as a matrix v = (xij)N×M, where xij represents the value of gene i at the jth time point, and N and M are the numbers of genes and time points respectively. In our study, a gene with MVs is called a target gene, and the genes with available information for estimating its missing entries constitute the set of candidate genes. If the value of target gene y at time point z is missing, the estimated value of the MV is denoted as x̂yz.
2. KNNimpute, SKNNimpute and IKNNimpute
KNNimpute is a classical method. It takes advantage of the principle of minimum Euclidean distance to select the K nearest neighbors of the target gene, and then reconstructs the MVs of the target gene by a weighted average of the K neighbors[9]. To compute the Euclidean distance $d_{yi}$ between the target gene y and each candidate gene i, a matrix r is proposed. If the value of gene i at the jth time point is missing, the ijth element of r, $r_{ij}$, is equal to 0; otherwise, it is 1. The Euclidean distance is defined as:

$$ d_{yi} = \sqrt{\frac{\sum_{j=1}^{M} r_{yj} r_{ij} (x_{yj} - x_{ij})^2}{\sum_{j=1}^{M} r_{yj} r_{ij}}} \tag{1} $$

The weight between the selected gene k and the target gene y is defined as:

$$ w_{yk} = \frac{1/d_{yk}}{\sum_{k=1}^{K} 1/d_{yk}} \tag{2} $$

The estimated value $\hat{x}_{yz}$ takes the following form:

$$ \hat{x}_{yz} = \sum_{k=1}^{K} w_{yk} x_{kz} \tag{3} $$

As an improved method, SKNNimpute differs from KNNimpute in two main points: (1) MVs are reconstructed sequentially, starting from the genes with the smallest missing rate. (2) For a target gene, once all the MVs in it have been reconstructed, it can be reused as a candidate gene in the following work. In SKNNimpute, a dataset is divided into two parts, $x_{\mathrm{incomplete}}$ and $x_{\mathrm{complete}}$. $x_{\mathrm{incomplete}}$ is formed by all target genes while $x_{\mathrm{complete}}$ is formed by all candidate genes. Generally, the K nearest neighbors are selected from $x_{\mathrm{complete}}$.
Compared with KNNimpute, IKNNimpute is based on an iterative procedure. It follows three steps: (1) Replace all MVs with the average values of their corresponding genes. (2) Estimate all MVs through the SKNNimpute procedure. (3) Compute the sum of squared differences between the last two estimations. If the sum is larger than a predefined threshold, return to (2) and continue until it is smaller than the threshold.
3. Our method
Generally speaking, the interpolation method is good at estimating MVs for time-series data, and cubic spline fitting is well suited to processing gene expression profile data. A gene expression curve obtained by cubic spline fitting not only reflects the internal growth and change regulation of a gene but can also be used to reconstruct MVs. Therefore, the cubic spline fitting method is used here to create the gene expression curve, and the estimated value of $x_{yz}$ obtained from it is denoted as $x^{f}_{yz}$.
For each gene, a threshold value is defined according to Eq.(4).

$$ \mathrm{Threshold}_i = |\mu_i| + \lambda |\sigma_i| \tag{4} $$

where $\mu_i$ and $\sigma_i$ are the mean and variance of the expression value of gene i respectively, and λ is a predefined constant with λ > 0.
Let $T = (t_{ij})_{N \times M}$ be a trace matrix where $t_{ij}$ is defined as:

$$ t_{ij} = \begin{cases} 0, & x_{ij} = \mathrm{NaN} \\ 1, & |x_{ij}| < \mathrm{Threshold}_i \\ 2, & |x_{ij}| \ge \mathrm{Threshold}_i \end{cases} \tag{5} $$

where NaN represents an MV.
Then we can get an initial subset of candidate genes:

$$ v^{i}_{\mathrm{candidate}} = (x_{oj})_{O \times M}, \quad \text{if } t_{yz} = 0 \text{ and } t_{oz} \neq 0 \tag{6} $$

that is, gene o is observed at the time point z where the value of the target gene y is missing.
Gene expression values reflect the level of activity of a gene under different experimental conditions, and large values always represent a strong level of activity. The larger the gene expression values, the smaller the measurement error. In other words, large gene expression values are helpful for improving the estimation accuracy of MVs. The traditional Pearson correlation coefficient measures the similarity between any two genes from the overall situation, while it neglects the influence of large gene expression values. In order to highlight the effect of large gene expression values, an improved Pearson correlation coefficient $r_{yi}$ between the target gene y and each candidate gene i is proposed.

$$ r_{yi} = \frac{\sum_{j=1}^{M} \big((t_{yj} \cap t_{ij}) x_{yj} - \bar{x}_y\big)\big((t_{yj} \cap t_{ij}) x_{ij} - \bar{x}_i\big)}{\sqrt{\sum_{j=1}^{M} \big((t_{yj} \cup t_{ij}) x_{yj} - \bar{x}_y\big)^2 \sum_{j=1}^{M} \big((t_{yj} \cup t_{ij}) x_{ij} - \bar{x}_i\big)^2}} \tag{7} $$

where ‘∩’ and ‘∪’ denote the minimizing and maximizing operations respectively. It should be noted in particular that if each of the two variables $t_{yj}$ and $t_{ij}$ is zero, the result is zero regardless of whether the ‘∩’ or the ‘∪’ operation is applied.
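The trace matrix and the improved correlation coefficient can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: MVs are encoded as NaN, σi is taken as the population variance, the means are computed over observed values only, the denominator is taken under a square root as in the ordinary Pearson coefficient, and a missing entry in either gene is assumed to zero both the ‘∩’ and ‘∪’ operators at that time point. All function names are hypothetical.

```python
import math
import statistics


def trace_matrix(data, lam):
    """Build T of Eq.(5): 0 = missing, 1 = below threshold, 2 = at or above it."""
    trace = []
    for row in data:
        observed = [x for x in row if not math.isnan(x)]
        mu = statistics.fmean(observed)
        sigma = statistics.pvariance(observed)  # assumption: population variance
        threshold = abs(mu) + lam * abs(sigma)  # Eq.(4)
        trace.append([0 if math.isnan(x) else 1 if abs(x) < threshold else 2
                      for x in row])
    return trace


def initial_candidates(trace, y, z):
    """Eq.(6): genes other than y that are observed at time point z."""
    return [o for o, t_row in enumerate(trace) if o != y and t_row[z] != 0]


def improved_pearson(row_y, row_i, t_y, t_i):
    """Eq.(7): min of the trace values scales the numerator, max the denominator."""
    mean_y = statistics.fmean([x for x in row_y if not math.isnan(x)])
    mean_i = statistics.fmean([x for x in row_i if not math.isnan(x)])
    num = den_y = den_i = 0.0
    for xy, xi, ty, ti in zip(row_y, row_i, t_y, t_i):
        if ty == 0 or ti == 0:
            # Assumed reading: a missing entry zeroes both operators, so the
            # scaled values vanish and only the mean terms remain.
            y_lo = y_hi = i_lo = i_hi = 0.0
        else:
            lo, hi = min(ty, ti), max(ty, ti)
            y_lo, y_hi = lo * xy, hi * xy
            i_lo, i_hi = lo * xi, hi * xi
        num += (y_lo - mean_y) * (i_lo - mean_i)
        den_y += (y_hi - mean_y) ** 2
        den_i += (i_hi - mean_i) ** 2
    den = math.sqrt(den_y * den_i)
    return num / den if den > 0 else 0.0
```

When both genes have only small values (all trace entries equal to 1), the function reduces to the ordinary Pearson correlation coefficient.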
Suppose ψ is the threshold of the improved correlation coefficient defined in Eq.(7); we can then obtain the final subset of candidate genes, denoted by

$$ v^{f}_{\mathrm{candidate}} = (x_{lj})_{L \times M}, \quad \text{if } r_{yl} > \psi \tag{8} $$

After getting the final subset $v^{f}_{\mathrm{candidate}}$, we select the K genes with the largest improved correlation coefficients from the final subset of candidate genes, and construct a weight:

$$ w_{yk} = \frac{r_{yk}}{\sum_{k=1}^{K} r_{yk}} \times \frac{L}{O} \tag{9} $$

where O and L represent the numbers of genes in the initial and final subsets respectively.
According to Eq.(10), the estimated value $\hat{x}_{yz}$ can be obtained.

$$ \hat{x}_{yz} = \left(1 - \frac{L}{O}\right) x^{f}_{yz} + \sum_{k=1}^{K} w_{yk} x_{kz} \tag{10} $$
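The way Eq.(8) to Eq.(10) blend the two components can be illustrated with a short Python sketch. It assumes that the improved correlation coefficients of all O initial candidates and their values at time point z are already available; the function name, its arguments and the fallback to the spline value when no gene passes the threshold ψ are our own illustrative choices.

```python
def combine_estimates(spline_value, correlations, values_at_z, k, psi):
    """Blend the internal (spline) and external (candidate) estimates, Eq.(10).

    correlations[i] and values_at_z[i] belong to the i-th initial candidate,
    so len(correlations) plays the role of O.
    """
    n_initial = len(correlations)                      # O
    final = [(r, x) for r, x in zip(correlations, values_at_z) if r > psi]
    n_final = len(final)                               # L, Eq.(8)
    # Keep the K candidates with the largest improved correlation.
    top = sorted(final, key=lambda pair: pair[0], reverse=True)[:k]
    if not top:
        # Assumption: with L = 0 only the internal component remains.
        return spline_value
    ratio = n_final / n_initial                        # L / O
    r_sum = sum(r for r, _ in top)
    # Eq.(9): the weights sum to L / O, the share given to the external part.
    external = sum((r / r_sum) * ratio * x for r, x in top)
    return (1.0 - ratio) * spline_value + external
```

Because the weights of Eq.(9) sum to L/O, the result is a convex combination: the more initial candidates pass the threshold ψ, the more the estimate relies on the external condition.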
The steps for estimating MVs with our method can be summarized as follows.
Step 1 Sort target genes in an ascending order according
to their missing rates;
Step 2 Execute the cubic spline curve fitting operation
for target gene y;
Step 3 For the MV xyz in gene y:
(a) Obtain the estimated value of xfyz from the cubic spline
curve fitting;
(b) Calculate the trace matrix T by Eq.(5) and obtain the
initial subset of candidate genes according to Eq.(6);
(c) Compute the improved Pearson correlation coefficient
between target gene y and each initial candidate gene according to Eq.(7), and obtain the final subset of candidate genes
through Eq.(8);
(d) Select K genes according to the magnitude of correlation coefficients between target gene y and the final candidate
genes, then construct the weight matrix according to Eq.(9);
(e) Obtain the estimated value x̂yz according to Eq.(10)
and replace the MV with x̂yz ;
(f ) Compute the difference between the former and the
current estimated values δ. If |δ| ≤ τ , go to Step 4; otherwise,
return to Step 3(b) and iterate until the convergence criterion
τ is reached;
Step 4 Reconstruct the next MV in target gene y until
all MVs in target gene y have been replaced;
Step 5 Go to the next target gene until all target genes
have been estimated.
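Step 2 and Step 3(a) rely on a cubic spline fitted to the observed values of the target gene. The sketch below is one possible stdlib-only realization in Python, under assumptions the text does not state: the spline is a natural cubic spline (zero second derivative at both ends), the time points are taken to be the column indices 0, 1, ..., M-1, MVs are encoded as NaN, and a missing value outside the observed range is extrapolated with the end polynomial. The function names are hypothetical.

```python
import bisect
import math


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    if n < 1:
        raise ValueError("at least two observed points are required")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # second derivatives; m[0] = m[n] = 0 (natural ends)
    if n > 1:
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        # Thomas algorithm: forward elimination on the tridiagonal system.
        for e in range(1, n - 1):
            factor = h[e] / diag[e - 1]
            diag[e] -= factor * h[e]
            rhs[e] -= factor * rhs[e - 1]
        # Back substitution.
        m[n - 1] = rhs[-1] / diag[-1]
        for e in range(n - 3, -1, -1):
            m[e + 1] = (rhs[e] - h[e + 1] * m[e + 2]) / diag[e]

    def evaluate(x):
        # Locate the interval; points outside the range reuse the end polynomial.
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


def spline_estimate(row, z):
    """Estimate the missing entry row[z] from the gene's own observed values."""
    xs = [float(j) for j, v in enumerate(row) if not math.isnan(v)]
    ys = [v for v in row if not math.isnan(v)]
    return natural_cubic_spline(xs, ys)(float(z))
```

For example, `spline_estimate([1.0, 3.0, float("nan"), 7.0, 9.0], 2)` fills the gap in a linearly increasing profile with the value on that line.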
4. Dataset
In our study, we use the Saccharomyces microarray dataset published by Spellman et al. to validate our method. The dataset can be described as a matrix with rows corresponding to genes and columns to experimental conditions[12]. Three datasets, TD1, TD2 and TD3, obtained under three different experimental conditions, i.e., cdc28, cdc15 and alpha respectively, are used (http://cellcycle-www.stanford.edu/). Here, TD1 is used for feasibility analysis, while TD2 and TD3 are used for comparative analysis. The original datasets TD1 and TD2 are pre-processed for the evaluation by removing rows and columns containing missing expression values, yielding ‘complete’ datasets. Table 1 shows the attributes of these datasets.
Table 1. Attributes of datasets
                                TD1       TD2       TD3
Experimental condition          cdc28     cdc15     alpha
Dimension of original dataset   6179×17   6179×24   6179×18
Dimension of complete dataset   1383×17   4380×24
III. Results and Analysis
1. Parameter sets
Our method has three parameters that need to be set: λ, τ and ψ. Here, we set λ and τ to 5 and $10^{-3}$ respectively. Theoretically, two genes have strong similarity if the absolute value of their Pearson correlation coefficient is larger than 0.75[9]. As for the improved correlation coefficient defined in Eq.(7), when the minimizing operation is 1 or 0 and the maximizing operation is 2, we get its smallest value, which is a quarter of the Pearson correlation coefficient. Therefore, ψ is set to 0.2.
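The factor of a quarter can be checked with a short calculation. The sketch below assumes, for illustration only, that the mean terms in Eq.(7) are negligible, that the denominator is taken under a square root as in the ordinary Pearson coefficient, and that the minimizing operation equals 1 and the maximizing operation equals 2 at every time point.

```latex
r_{yi} \approx
\frac{\sum_{j} (1 \cdot x_{yj})(1 \cdot x_{ij})}
     {\sqrt{\sum_{j} (2 x_{yj})^2 \, \sum_{j} (2 x_{ij})^2}}
= \frac{1}{4} \cdot
\frac{\sum_{j} x_{yj} x_{ij}}
     {\sqrt{\sum_{j} x_{yj}^2 \, \sum_{j} x_{ij}^2}},
\qquad
\frac{0.75}{4} = 0.1875 \approx 0.2
```

Under these assumptions, the strong-similarity level of 0.75 for the ordinary coefficient corresponds to about 0.19 for the improved one, which is consistent with rounding ψ to 0.2.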
2. Feasibility analysis
For the complete dataset TD1, some true values were deleted at random to create a test dataset. Here, the Proportion of genes containing MVs (PGMV) in the test dataset TD1 was 1%, 5%, 10% and 15% respectively. We then used our method to recover the introduced missing values and used the Normalized root mean squared error (NRMSE) as an evaluating index.

$$ \mathrm{NRMSE} = \frac{1}{\sigma_{\mathrm{true}}} \sqrt{\frac{\sum (V_{\mathrm{true}} - V_{\mathrm{est}})^2}{n}} \tag{11} $$

where n is the number of MVs and $\sigma_{\mathrm{true}}$ is the standard deviation of the n true values. $V_{\mathrm{true}}$ and $V_{\mathrm{est}}$ represent the true and estimated values of MVs respectively. The NRMSE of the estimation is shown in Fig.1.
Fig. 1. NRMSE obtained by our method under different PGMVs
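For reference, the NRMSE of Eq.(11) can be computed with a few lines of Python. This is a minimal sketch; the function name is ours, and the population standard deviation is an assumption, since the text does not say which estimator of σtrue is used.

```python
import math
import statistics


def nrmse(true_values, estimated_values):
    """Root mean squared error divided by the std of the true values, Eq.(11)."""
    n = len(true_values)
    mse = sum((t - e) ** 2 for t, e in zip(true_values, estimated_values)) / n
    # Assumption: population standard deviation of the n true values.
    return math.sqrt(mse) / statistics.pstdev(true_values)
```

With this convention, replacing every MV by the mean of the true values gives an NRMSE of exactly 1, so values below 1 indicate an estimator that does better than that constant substitute.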
From Fig.1, it can be seen that the NRMSE first decreases and then increases with the number of nearest neighbors. For all values of PGMV, the NRMSE remains essentially unchanged when K varies between 10 and 40. In addition, for a fixed K, the NRMSE increases as the PGMV increases.
Brás and Menezes did the same experiment on TD1[11], and their results showed that the NRMSEs obtained by KNNimpute, SKNNimpute and IKNNimpute were larger than 0.6. Fig.1 shows that all NRMSEs obtained by our method are smaller than 0.5. Accordingly, our method has higher estimation accuracy and is feasible for MV estimation.
3. Comparative analysis
In our study, we select 170 genes at random from the complete dataset TD2 to carry out a comparative analysis of four MV estimation methods, i.e., KNNimpute, SKNNimpute, IKNNimpute and our method. Similar to the research of Troyanskaya et al.[9] and Brás and Menezes[11], we select 4 different values of K, namely 5, 10, 15 and 20, to show the results of the comparative analysis. The evaluating indexes used here are the Average effective error (AEE) and the Error rate (ER). AEE is the average of the estimation errors that are smaller than 1, and ER is the rate of estimation errors that are larger than 1.

$$ \mathrm{AEE} = \frac{1}{n'} \sum (V_{\mathrm{true}} - V_{\mathrm{est}}) \tag{12} $$

$$ \mathrm{ER} = \frac{n - n'}{n} \tag{13} $$

where n′ is the number of errors that are smaller than 1.
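The two indexes of Eq.(12) and Eq.(13) can be sketched in Python as follows. The sketch assumes that an estimation error means the absolute difference between the true and estimated value, which the text does not state explicitly; the function name and the `limit` parameter are illustrative.

```python
def aee_and_er(true_values, estimated_values, limit=1.0):
    """Return (AEE, ER): mean of the errors below `limit`, share of the rest."""
    # Assumption: errors are absolute differences.
    errors = [abs(t - e) for t, e in zip(true_values, estimated_values)]
    small = [err for err in errors if err < limit]   # the n' effective errors
    aee = sum(small) / len(small) if small else float("nan")
    er = (len(errors) - len(small)) / len(errors)    # (n - n') / n
    return aee, er
```

Reporting the two values together avoids letting a few gross errors dominate the average while still recording how often they occur.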
From Fig.2, we can see that the AEE of our method is lower than that of the other three methods. Moreover, the trend of the AEE of our method is the same over different values of K, which means that our method has the highest stability. Table 2 shows that the ER of our method is the lowest among all methods at the same PGMV. Therefore, it can be concluded that our method has better estimation accuracy and stability than the KNNimpute, SKNNimpute and IKNNimpute methods.
Fig. 2. AEE obtained by different methods over different values of K
Table 2. ER obtained by different methods over different values of PGMV
PGMV   KNNimpute   SKNNimpute   IKNNimpute   Our method
5%     31.3%       31.3%        25.0%        12.5%
10%    42.2%       37.5%        30.4%        25.0%
15%    47.0%       36.2%        32.4%        16.7%
20%    52.4%       42.0%        39.2%        15.6%
In the third experiment, we select 104 genes from TD3 and use them for cluster analysis. From the research of Spellman et al., we know that the 104 genes can be classified into 5 classes containing 21, 8, 9, 15 and 51 genes respectively[12]. In the 5 classes, 7, 0, 1, 5 and 9 genes respectively contain MVs. First we reconstruct all MVs in the input data with each of the four MV estimation methods, and then use the Fuzzy C-means clustering (FCM) algorithm to classify these genes. The number of genes correctly classified is shown in Table 3. It can be seen from Table 3 that more genes are classified correctly using our method. The reason is that our method has higher MV estimation accuracy; therefore, the gene expression data after the reconstruction of MVs are more suitable for clustering analysis.
Table 3. Number of genes correctly classified
MV estimation method   Class 1   Class 2   Class 3   Class 4   Class 5   Total
KNNimpute              13        8         6         13        28        68
SKNNimpute             14        8         6         13        28        69
IKNNimpute             13        8         6         13        29        69
Our method             15        8         6         13        30        72
IV. Conclusions
DNA microarray is a high-throughput technology that allows the recording of expression levels of thousands of genes simultaneously, giving a global view of gene expression. The data generated in a set of microarray experiments are usually gathered in a matrix with genes in rows and experimental conditions in columns. Frequently, these matrices contain MVs due to imperfections in the microarray experiment. MVs may degrade the precision or the stability of data analytical methods for gene expression data. In this work, we propose a new method for estimating MVs in gene expression data from the viewpoint of both the internal and external conditions, i.e., the estimated value of a target gene is composed of two components. The first is the result of the cubic spline fitting, and the second is the weighted sum of expression values over the K candidate genes that are most similar to the target gene. In order to highlight the effect of large gene expression values, an improved Pearson correlation coefficient is proposed to measure the similarity between the target gene and each candidate gene. Experimental results on the Saccharomyces microarray dataset verify the feasibility and validity of the proposed MV estimation method.
References
[1] X.S. Wang, Y.Y. Gu, Y.H. Cheng et al., “Construction of delay gene regulatory network based on complex network”, Acta Electronica Sinica, Vol.38, No.11, pp.2518–2522, 2010. (in Chinese)
[2] M. Choi, O.H. Lee, S. Jeon et al., “The oocyte-specific transcription factor, Nobox, regulates the expression of Pad6, a peptidylarginine deiminase in the oocyte”, FEBS Letters, Vol.584, No.16, pp.3629–3634, 2010.
[3] A.G. De Brevern, S. Hazout, A. Malpertuy, “Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering”, BMC Bioinformatics, Vol.5, pp.114–125, 2004.
[4] X.S. Wang, Y.Y. Gu, Y.H. Cheng et al., “An ensemble classifier based on selective independent component analysis of DNA microarray data”, Chinese Journal of Electronics, Vol.18, No.4, pp.645–649, 2009.
[5] J.L. Schafer, J.W. Graham, “Missing data: our view of the state of the art”, Psychological Methods, Vol.7, No.2, pp.147–177, 2002.
[6] M.T. Swain, J.J. Mandel, W. Dubitzky, “Comparative study of three commonly used continuous deterministic methods for modeling gene regulation networks”, BMC Bioinformatics, Vol.11, pp.459–484, 2010.
[7] M. Scholz, F. Kaplan, C.L. Guy et al., “Non-linear PCA: a missing data approach”, Bioinformatics, Vol.21, No.20, pp.3887–3895, 2005.
[8] R. Jörnsten, H.Y. Wang, J.W. William et al., “DNA microarray data imputation and significance analysis of differential expression”, Bioinformatics, Vol.21, No.22, pp.4155–4161, 2005.
[9] O. Troyanskaya, M. Cantor, G. Sherlock et al., “Missing value estimation methods for DNA microarrays”, Bioinformatics, Vol.17, No.6, pp.520–525, 2001.
[10] K.Y. Kim, B.J. Kim, G.S. Yi, “Reuse of imputed data in microarray analysis increases imputation efficiency”, BMC Bioinformatics, Vol.5, pp.160–169, 2004.
[11] L.P. Brás, J.C. Menezes, “Improving cluster-based missing value estimation of DNA microarray data”, Biomolecular Engineering, Vol.24, No.2, pp.273–282, 2007.
[12] P.T. Spellman, G. Sherlock, M.Q. Zhang et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization”, Molecular Biology of the Cell, Vol.9, No.12, pp.3273–3297, 1998.
WANG Xuesong received the Ph.D. degree from China University of Mining and Technology in 2002. She is currently a professor in the School of Information and Electrical Engineering, China University of Mining and Technology. Her main research interests include machine learning, bioinformatics and artificial intelligence. (Email: [email protected])
LIU Qingfeng received the B.S. degree from China University of Mining and Technology in 2009. He is currently an M.S. candidate in the School of Information and Electrical Engineering, China University of Mining and Technology. His main research interest is bioinformatics. (Email: [email protected])
CHENG Yuhu received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2005. He is currently a professor in the School of Information and Electrical Engineering, China University of Mining and Technology. His main research interests include machine learning and intelligent system. (Email: [email protected])