ISSN 2278-3083
International Journal of Science and Applied Information Technology (IJSAIT), Vol.5 , No.1, Pages : 01-04 (2016)
Special Issue of ICECT 2016 - Held on February 27, 2016 in Hyderabad Marriot Hotel & Convention Centre, Hyderabad
http://warse.org/IJSAIT/static/pdf/Issue/icect2016sp01.pdf
Analysis of Missing Data and Imputation on Agriculture Data
With Predictive Mean Matching Method
Jinubala, V
Research Scholar, Department of Computer Science,
Karpagam Academy of Higher Education,
Coimbatore – 641 021
[email protected]

Lawrance, R
Director, Department of Computer Applications,
Ayya Nadar Janaki Ammal College,
Sivakasi – 626124, Tamil Nadu, India,
[email protected]
Abstract - Data mining can be defined as the process of selecting, exploring and modeling large amounts of data to uncover previously unknown patterns. Data mining is an emerging research field in agricultural crop yield analysis, and it has become a prominent methodology for extracting information from large data sets. Various methods and algorithms have been proposed for different data sets to obtain better accuracy. In this paper the Predictive Mean Matching method is implemented and analyzed for identifying and replacing missing values in crop pest data. Predictive mean matching is an imputation method for continuous variables. It is similar to the regression method except that, for each missing value, it imputes a value drawn randomly from a set of observed values whose predicted values are closest to the predicted value for the missing value from the simulated regression model.

Keywords - Data mining, Imputation, Missing Values, Predictive Mean Matching Method, Agriculture Data.

INTRODUCTION
Data mining is the process of discovering interesting patterns or information from the data in large databases. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically. Data mining is also referred to as Knowledge Discovery in Databases, knowledge extraction, pattern analysis, data archeology, and business intelligence [1,2,3].

Data mining is the science of finding new, interesting patterns and relationships in huge amounts of data. It is defined as “the process of discovering meaningful new correlations, patterns, and trends by digging into large amounts of data stored in warehouses”. Data mining is also sometimes called Knowledge Discovery in Databases [1].

Many data mining algorithms are used for preprocessing data: they remove noise and redundancy from data sets, which makes the data sets useful for extracting knowledge [4]. This paper applies the technique of statistical reconstruction of incomplete data sets, using the predictive mean matching method to predict the missing values and so produce complete data sets.

The rest of the paper is organized as follows: a review of the related works in this field, the methodology of the multiple predictive mean matching method for agriculture data, the experimental results and discussion, and finally the conclusion and the references used.

RELATED WORKS
The term ‘review’ means organizing the knowledge of a specific area of research to build an edifice of knowledge and to show that the present study would be an addition to this field. The task of reviewing the literature is highly creative and tedious, because the researcher has to synthesize the available knowledge of the field in a unique way to provide the rationale for the study [5].

A number of studies have applied data mining techniques to extract meaning from agriculture data sets. Collecting such data is challenging, and most of the data sets are incomplete because of the difficulty and the methods of data collection. Data sets with missing values can be problematic and may limit the analysis and the extraction of new knowledge. Most statistical theory focuses on data modeling, prediction and statistical inference, and usually assumes that the data are in the correct state for analysis. In practice, a data analyst spends much, if not most, of the time preparing the data before doing any statistical operation. It is very rare that the raw data one works with are in the correct format, without errors, complete, and carrying all the correct labels and codes needed for analysis, so data cleaning needs to be considered a statistical operation. Imputation is the process of estimating or deriving values for fields where data is missing [6].

Mean imputation (and, analogously, median imputation) is the basic imputation model, $\hat{x}_i = \bar{x}$, where $\hat{x}_i$ is the imputation value and the mean $\bar{x}$ is taken over the observed values. This model is limited since it causes a bias in measures of spread estimated after imputation [7, 8, 9].

D.T. Larose has examined methods for imputing missing values for continuous and categorical variables. Missing data may arise from any of several different causes. One method is simply to construct a flag variable; another method of dealing with missing data is to reduce the weight that the case wields in the analysis [10].

Shelke, M. B., has discussed the technique of statistical reconstruction of incomplete data sets using multiple linear regression, which uses the correlation among the attributes of a data set to predict the missing values and so produce complete data sets. That paper gives a theoretical, generalized algorithmic approach to predicting missing values with a multiple regression model in the Weka tool [11].
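The bias in measures of spread that mean imputation introduces, as noted in the related works, can be seen on a small example. The following Python sketch is illustrative only: the values and the helper name `mean_impute` are hypothetical and are not taken from the agriculture data set.

```python
import statistics


def mean_impute(values):
    """Replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical measurements with three missing entries.
values = [0.6, 0.2, 0.1, None, 0.5, None, 0.3, None]
observed = [v for v in values if v is not None]
completed = mean_impute(values)

var_observed = statistics.variance(observed)    # sample variance, observed cases only
var_completed = statistics.variance(completed)  # sample variance after mean imputation

print(var_observed, var_completed)
```

The imputed entries sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the number of cases. The sample variance after imputation is therefore smaller than the variance of the observed values.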
METHODOLOGY
Data cleaning is the process of transforming raw data into consistent data that can be analyzed. It is aimed at improving the content of statistical statements based on the data as well as their reliability. Data cleaning may profoundly influence the statistical statements based on the data: typical actions like imputation or outlier handling obviously influence the results of statistical analyses [1, 6]. For this reason, data cleaning should be considered a statistical operation, to be performed in a reproducible manner. Data mining provides good techniques for reproducible data cleaning, since all cleaning actions can be scripted and therefore reproduced. The proposed approach to the analysis of missing data and imputation on agriculture data using the multiple mean matching method is shown in fig. 1.

Fig. 1: Methodology

A. Data formats
The agriculture dataset can be represented as an M x N matrix D of values, where each row represents a field type and each column represents an attribute. An illustration of agriculture data is shown in Table 1. The data set is usually large and contains missing values; data mining techniques are therefore used to impute the missing values so that useful knowledge can be extracted from the imputed data.

B. Linear Regression Models
Linear regression models [12] are used to estimate and impute the missing values in the agriculture data set:

$$\hat{x}_i = \hat{\beta}_0 + \hat{\beta}_1 y_{1,i} + \dots + \hat{\beta}_k y_{k,i}$$

where $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ are the estimated linear regression coefficients for the auxiliary variables $y_1, y_2, \dots, y_k$. Linear models of this form are easy to estimate and to use for prediction.

C. Predictive Mean Matching Method
The predictive mean matching method [12] is also an imputation method available for continuous variables. It is similar to the regression method except that, for each missing value, it imputes a value drawn randomly from a set of observed values whose predicted values are closest to the predicted value for the missing value from the simulated regression model.

New parameters $\beta_* = (\beta_{*0}, \beta_{*1}, \dots, \beta_{*(k)})$ and $\sigma_*^2$ are drawn from the posterior predictive distribution of the parameters. That is, they are simulated from $\hat{\beta}$, $\hat{\sigma}^2$ and $V$. The variance is drawn as

$$\sigma_*^2 = \hat{\sigma}^2 (n - k - 1) / g$$

where $g$ is a $\chi^2_{n-k-1}$ random variate and $n$ is the number of non-missing observations of the variable $y$ being imputed.

The regression coefficients are drawn as

$$\beta_* = \hat{\beta} + \sigma_* V_h^{\top} Z$$

where $V_h$ is the upper triangular matrix in the Cholesky decomposition $V = V_h^{\top} V_h$, and $Z$ is a vector of $k + 1$ independent random normal variates.

For each missing value, a predicted value

$$y_i^* = \beta_{*0} + \beta_{*1} x_1 + \beta_{*2} x_2 + \dots + \beta_{*(k)} x_k$$

is computed from the covariate values $x_1, x_2, \dots, x_k$.

A set of $k_0$ observations whose corresponding predicted values are closest to $y_i^*$ is generated (the number of closest observations is written $k_0$ here to distinguish it from the number of covariates $k$). The missing value is then replaced by a value drawn randomly from these $k_0$ observed values. The predictive mean matching method requires the number of closest observations to be specified. A smaller $k_0$ tends to increase the correlation among the multiple imputations for the missing observation and results in a higher variability of point estimators in repeated sampling. Finally, the reconstructed and completed data are extracted from the method.
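The predictive mean matching steps described above can be sketched in Python for the simplest case of a single covariate (k = 1). This is a minimal illustration under stated assumptions, not the implementation used in the experiments: V is taken to be the inverse of X'X for a design matrix with an intercept column, donors are predicted with the fitted coefficients while missing cases are predicted with the drawn coefficients, and the names `pmm_impute` and `k0` are hypothetical. The two input columns are Avg_SpoSolLar and Avg_SpoEggMass from Table 2.

```python
import math
import random


def pmm_impute(x, y, k0=3, rng=None):
    """Impute None entries of y by predictive mean matching on one covariate x."""
    rng = rng or random.Random()
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]
    n = len(obs)
    # Least-squares fit y = b0 + b1 * x on the observed cases.
    sx = sum(x[i] for i in obs)
    sy = sum(y[i] for i in obs)
    sxx = sum(x[i] ** 2 for i in obs)
    sxy = sum(x[i] * y[i] for i in obs)
    det = n * sxx - sx * sx
    if n < 3 or det == 0:
        raise ValueError("need at least 3 observed cases and a non-constant x")
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    df = n - 2  # n - k - 1 with k = 1
    sigma2_hat = sum((y[i] - b0 - b1 * x[i]) ** 2 for i in obs) / df
    # Assumption: V = (X'X)^-1 for the design matrix with columns [1, x].
    v00, v01, v11 = sxx / det, -sx / det, n / det
    # Lower Cholesky factor L (the transpose of V_h), so that V = L L'.
    l00 = math.sqrt(v00)
    l10 = v01 / l00
    l11 = math.sqrt(v11 - l10 * l10)
    # Draw sigma*^2 = sigma_hat^2 * (n - k - 1) / g with g ~ chi-square(df).
    g = rng.gammavariate(df / 2.0, 2.0)
    sigma_star = math.sqrt(sigma2_hat * df / g)
    # Draw beta* = beta_hat + sigma* L Z with Z standard normal.
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    b0_star = b0 + sigma_star * l00 * z0
    b1_star = b1 + sigma_star * (l10 * z0 + l11 * z1)
    # Assumption: donors use the fitted coefficients for their predictions.
    donor_pred = {i: b0 + b1 * x[i] for i in obs}
    completed = list(y)
    for j in mis:
        target = b0_star + b1_star * x[j]
        nearest = sorted(obs, key=lambda i: abs(donor_pred[i] - target))[:k0]
        completed[j] = y[rng.choice(nearest)]  # copy an observed value
    return completed


sol_lar = [0.4, 0.4, 0.4, 0.2, 1.4, 0.2, 0.2, 0.2, 0.4]
egg_mass = [0.4, 0.2, 0.2, None, None, None, None, 0.2, 0.2]
completed = pmm_impute(sol_lar, egg_mass, k0=3, rng=random.Random(1))
print(completed)
```

Because each imputed entry is copied from a donor, the completed column contains only values that were actually observed, which is the main practical difference from plain regression imputation.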
TABLE 1 : Sample Data

Field Type   Attributes
             Avg_SpoEggMass   Avg_SpoGregLar   Avg_SpoSolLar   …
Fixed        0.6              0                0.65            ..
Random       0.2              0.1              0               ..
Fixed        0.1              0.3              0               ..
..           …                ..               ..              ..

D. Algorithm 1 : Predictive Mean Matching Method
Input:
D, vector of real-valued data
$\beta_* = (\beta_{*0}, \beta_{*1}, \dots, \beta_{*(k)})$
Goal:
Predict the missing values of the data
Output:
Impute the missing values of the agriculture data

E. Predictive Mean Matching Method Procedure:
Begin
Step1. Read the Agriculture data
Step2. Calculate variance
Step3. Calculate regression coefficients
Step4. For each missing value, calculate a predicted value $y^*$
Step5. Compute the covariate values
Step6. Generate predicted values
Step7. Replace the missing value with a value drawn randomly from the observed values whose predicted values are closest to $y^*$
Step8. Generate reconstructed and completed data
End

EXPERIMENTAL RESULTS
In this research the method discussed above has been implemented with a real-time agriculture data set. The experimental data set holds a large volume of data about pests and other relevant information. The algorithm for predicting missing values with the predictive mean matching method was applied in the R language and compared with the linear regression model [13, 14]. The results of the experiments with the predictive mean matching method are presented below.

Using the predictive mean matching method, the missing value pattern is identified for the missing values in table 2, which represents the agriculture data set. The missing features are identified and sorted by values, as shown in table 4.

Table 4: Sorted Missing Features by values
Variables sorted by number of missing values

Variables        Count
Avg_SpoEggMass   0.444444
Avg_SpoGregLar   0
Avg_SpoSolLar    0
Avg_semilooper   0
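The count reported for Avg_SpoEggMass in Table 4 is consistent with the fraction of rows in which the variable is missing in Table 2 (4 of 9 rows, about 0.444444). The following Python sketch shows that calculation; the function name `missing_fraction` is hypothetical, the rows are those of Table 2 with NA written as `None`, and reading the "Count" column as a fraction is an assumption. On the rows of Table 2 as printed, the same calculation also yields 4/9 for Avg_semilooper, whereas Table 4 lists 0 for that variable; the sketch does not try to account for that difference.

```python
COLUMNS = ["Avg_SpoEggMass", "Avg_SpoGregLar", "Avg_SpoSolLar", "Avg_semilooper"]

# Rows of Table 2 (field type omitted); None stands for NA.
ROWS = [
    [0.4, 0.2, 0.4, 0.6],
    [0.2, 0.2, 0.4, 0.4],
    [0.2, 0.2, 0.4, 0.2],
    [None, 0, 0.2, None],
    [None, 0.4, 1.4, None],
    [None, 0, 0.2, None],
    [None, 0, 0.2, None],
    [0.2, 0.2, 0.2, 0.4],
    [0.2, 0, 0.4, 0],
]


def missing_fraction(rows, columns):
    """Return {column name: fraction of rows in which that column is missing}."""
    n = len(rows)
    return {
        name: sum(row[j] is None for row in rows) / n
        for j, name in enumerate(columns)
    }


fractions = missing_fraction(ROWS, COLUMNS)

# List the variables from most to least missing (ties keep column order).
for name, frac in sorted(fractions.items(), key=lambda item: -item[1]):
    print(f"{name:16s} {frac:.6f}")
```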
Table 2: Missing data patterns

Field type   Avg_SpoEggMass   Avg_SpoGregLar   Avg_SpoSolLar   Avg_semilooper
Random1      0.4              0.2              0.4             0.6
Random1      0.2              0.2              0.4             0.4
Fixed1       0.2              0.2              0.4             0.2
Random1      NA               0                0.2             NA
Random1      NA               0.4              1.4             NA
Fixed1       NA               0                0.2             NA
Fixed1       NA               0                0.2             NA
Fixed1       0.2              0.2              0.2             0.4
Fixed1       0.2              0                0.4             0
Fig. 2: Plotted image of Missing values
The missing values are plotted in fig. 2. The missing values are calculated and replaced by the predictive mean matching method; the imputed values are shown in fig. 3. Finally, the method reconstructs the incomplete data set and produces the complete data set, which is presented in table 5.
Table 2 represents the agriculture data set with missing value patterns, where the missing values are represented as NA. The missing features are identified and imputed by the linear regression model; the result is shown in table 3.
Table 3: Imputation Using Linear Regression Model

Field type   Avg_SpoEggMass   Avg_SpoGregLar   Avg_SpoSolLar   Avg_semilooper
Random1      0.4              0.2              0.4             0.6
Random1      0.2              0.2              0.4             0.4
Fixed1       0.2              0.2              0.4             0.2
Random1      0.233333         0                0.2             0.2
Random1      0.416667         0.4              1.4             0.3
Fixed1       0.383333         0                0.2             0.5
Fixed1       0.233333         0                0.2             0.2
Fixed1       0.2              0.2              0.2             0.4
Fixed1       0.2              0                0.4             0
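The idea behind regression imputation can be sketched in Python with a single auxiliary variable. This is a simplified illustration, not the model behind Table 3: it regresses Avg_SpoEggMass on Avg_SpoSolLar only (both columns taken from Table 2), so it is not expected to reproduce the imputed values in Table 3, and the function name `regression_impute` is hypothetical.

```python
def regression_impute(aux, target):
    """Fill None entries of target with predictions from a least-squares line on aux."""
    obs = [i for i, v in enumerate(target) if v is not None]
    n = len(obs)
    sx = sum(aux[i] for i in obs)
    sy = sum(target[i] for i in obs)
    sxx = sum(aux[i] ** 2 for i in obs)
    sxy = sum(aux[i] * target[i] for i in obs)
    det = n * sxx - sx * sx  # zero when aux is constant over the observed cases
    if det == 0:
        raise ValueError("auxiliary variable is constant on the observed cases")
    b1 = (n * sxy - sx * sy) / det  # estimated slope
    b0 = (sy - b1 * sx) / n         # estimated intercept
    # Observed entries are kept; missing entries get the fitted value.
    return [b0 + b1 * aux[i] if v is None else v for i, v in enumerate(target)]


sol_lar = [0.4, 0.4, 0.4, 0.2, 1.4, 0.2, 0.2, 0.2, 0.4]
egg_mass = [0.4, 0.2, 0.2, None, None, None, None, 0.2, 0.2]
imputed = regression_impute(sol_lar, egg_mass)
print(imputed)
```

Unlike predictive mean matching, the imputed entries here are fitted values on the regression line, so they need not coincide with any value that was actually observed.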
Fig. 3: Imputed Missing values
Table 5: Final Imputed Data

Field type   Avg_SpoEggMass   Avg_SpoGregLar   Avg_SpoSolLar   Avg_semilooper
Random1      0.4              0.2              0.4             0.6
Random1      0.2              0.2              0.4             0.4
Fixed1       0.2              0.2              0.4             0.2
Random1      0.4              0                0.2             0.2
Random1      0.2              0.4              1.4             0.3
Fixed1       0.2              0                0.2             0.5
Fixed1       0.2              0                0.2             0.2
Fixed1       0.2              0.2              0.2             0.4
Fixed1       0.2              0                0.4             0

CONCLUSION
In this paper the predictive mean matching method is applied and tested on a real-time agriculture data set. The method has been compared with mean, median and linear regression model imputation, and the proposed method has improved the predictive performance. Compared to the other methods it provides better accuracy and efficient imputations. The method helps to reconstruct an incomplete data set and produce a complete data set. In future the imputed data will be used with data mining techniques to extract knowledge from the data.

REFERENCES
[1] Han J. and Kamber M., “Data Mining: Concepts and Techniques”, Second Edition, ELSEVIER Publications, ISBN: 978-81-312-0535-81, 2005.
[2] Alagukumar, S., and R. Lawrance. "A Selective Analysis of Microarray Data Using Association Rule Mining." Procedia Computer Science 47 (2015): 3-12.
[3] Alagukumar, S., and R. Lawrance. “Algorithm for Microarray Cancer Data Analysis using Frequent Pattern Mining and Gene Intervals”, International Journal of Computer Applications, pp. 1-6, 2015.
[4] Srinivasan Parthasarathy and Charu C. Aggarwal, On the use of conceptual Reconstruction for Mining Massively Incomplete Data Sets, IEEE, pp. 1512-1521, 2003.
[5] Singh Y. K., “Fundamental of research methodology and statistics”, New Age International, 2006.
[6] de Jonge, Edwin, and Mark van der Loo. An introduction to data cleaning with R. Technical Report 201313, Statistics Netherlands, 2013. URL http://www.cbs.nl/nl-L/menu/methoden/onderzoekmethoden/discussionpapers/archief/2013/default.htm, 2013.
[7] B. van den Broek. Imputation in R, 2012. Statistics Netherlands, internal report.
[8] T. De Waal, J. Pannekoek, and S. Scholtus. Handbook of statistical data editing and imputation. Wiley handbooks in survey methodology. John Wiley & Sons, 2011.
[9] I. P. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71:17-35, 1976.
[10] D.T. Larose and C.D. Larose, “Imputation of Missing Data”, Wiley Online Library, 2014. DOI: 10.1002/9781118874059.ch13.
[11] Shelke, M. B., & Badade, K. B. "Processing Of Incomplete Data Set." IJCER 2, no. 5 (2013): 658-660.
[12] Horton, N. J., & Lipsitz, S. R. (2001). Multiple imputation in practice: comparison of software packages for regression models with missing variables. The American Statistician, 55(3), 244-254.
[13] Buuren, Stef, and Karin Groothuis-Oudshoorn. "mice: Multivariate imputation by chained equations in R." Journal of statistical software 45, no. 3 (2011).
[14] http://cran.r-project.org/