ISSN 2278-3083
International Journal of Science and Applied Information Technology (IJSAIT), Vol. 5, No. 1, Pages: 01-04 (2016)
Special Issue of ICECT 2016 - Held on February 27, 2016 in Hyderabad Marriot Hotel & Convention Centre, Hyderabad
http://warse.org/IJSAIT/static/pdf/Issue/icect2016sp01.pdf

Analysis of Missing Data and Imputation on Agriculture Data With Predictive Mean Matching Method

Jinubala, V
Research Scholar, Department of Computer Science, Karpagam Academy of Higher Education, Coimbatore – 641 021
[email protected]

Lawrance, R
Director, Department of Computer Applications, Ayya Nadar Janaki Ammal College, Sivakasi – 626124, Tamil Nadu, India
[email protected]

Abstract - Data mining can be defined as the process of selecting, exploring and modeling large amounts of data to uncover previously unknown patterns. Data mining is an emerging research field in agricultural crop yield analysis. At present, data mining has become a prominent methodology for extracting information from large data sets. Various methods and algorithms have been proposed for different data sets to obtain better accuracy. In this paper, the Predictive Mean Matching Method is implemented and analyzed for identifying and replacing missing values in crop pest data. Predictive mean matching is an imputation method for continuous variables. It is similar to the regression method except that, for each missing value, it imputes a value drawn randomly from a set of observed values whose predicted values are closest to the predicted value for the missing value from the simulated regression model.

Keywords - Data mining, Imputation, Missing Values, Predictive Mean Matching Method, Agriculture Data.

INTRODUCTION

Data mining is the process of discovering interesting patterns or information from data in large databases. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically. Data mining is also known as knowledge discovery in databases, knowledge extraction, pattern analysis, data archaeology and business intelligence [1,2,3]. It is the science of finding new, interesting patterns and relationships in huge amounts of data, and has been defined as “the process of discovering meaningful new correlations, patterns, and trends by digging into large amounts of data stored in warehouses”.

Many data mining algorithms are used for preprocessing, which removes noise and redundancy from data sets and makes them useful for knowledge extraction [4]. In this paper, incomplete data sets are statistically reconstructed using the predictive mean matching method, which predicts the missing values and so helps to produce complete data sets.

The rest of the paper reviews related work in this field, describes the methodology of the predictive mean matching method for agriculture data, and presents the experimental results and discussion, followed by the conclusion and references.

RELATED WORKS

A review organizes the knowledge of a specific area of research into an edifice of knowledge and shows how the researcher's own study adds to the field. Reviewing the literature is highly creative and tedious, because the researcher has to synthesize the available knowledge of the field in a unique way to provide the rationale for the study [5].

A number of studies have applied data mining techniques to extract meaning from agriculture data sets. Collecting agriculture data is challenging, and most data sets are incomplete because of the difficulty and the methods of data collection. Missing data can be problematic and may limit the analysis and the extraction of new knowledge. Most statistical theory focuses on data modeling, prediction and statistical inference, and it is usually assumed that the data are in the correct state for analysis. In practice, a data analyst spends much if not most of the time preparing the data before doing any statistical operation. It is very rare that raw data are in the correct format, without errors, complete, and with all the correct labels and codes needed for analysis. Data cleaning therefore needs to be considered a statistical operation. Imputation is the process of estimating or deriving values for fields where data is missing [6].

The mean and median imputation methods are basic imputation models, $\hat{x}_i = \bar{x}$, where $\hat{x}_i$ is the imputation value and the mean is taken over the observed values. This model is limited since it causes a bias in measures of spread estimated after imputation [7, 8, 9]. D.T. Larose has examined methods for imputing missing values for continuous and categorical variables. Missing data may arise from any of several different causes. One method is simply to construct a flag variable; another method of dealing with missing data is to reduce the weight that the case wields in the analysis [10].

Shelke, M. B., has discussed the statistical reconstruction of incomplete data sets using multiple linear regression, which uses the correlation between data set attributes to predict the missing values and so helps to produce complete data sets. That paper presents a theoretical, generalized algorithmic approach to predicting missing values using a multiple regression model in the Weka tool [11].

METHODOLOGY

Data cleaning is the process of transforming raw data into consistent data that can be analyzed. It is aimed at improving the content of statistical statements based on the data as well as their reliability. Data cleaning may profoundly influence the statistical statements based on the data. Typical actions like imputation or outlier handling obviously influence the results of statistical analyses [1, 6]. For this reason, data cleaning should be considered a statistical operation, to be performed in a reproducible manner. Data mining provides good techniques for reproducible data cleaning, since all cleaning actions can be scripted and therefore reproduced. The proposed approach to the analysis of missing data and imputation on agriculture data using the predictive mean matching method is shown in Fig. 1. Finally, the reconstructed and completed data are extracted from the method.

Fig. 1: Methodology

A. Data formats

The agriculture data set can be in the form of an M x N matrix D of values, where each row represents a field type and each column represents an attribute. An illustration of agriculture data is shown in Table 1. The data set is usually large and contains missing values; data mining techniques are therefore used to impute the missing values so that useful knowledge can be extracted from the imputed data.

TABLE 1: Sample Data

                          Attributes
Field Type   Avg_SpoEggMass   Avg_SpoGregLar   Avg_SpoSolLar   ...
Fixed        0.6              0                0.65            ...
Random       0.2              0.1              0               ...
Fixed        0.1              0.3              0               ...
..           ...              ..               ..              ...

B. Linear Regression Models

Linear regression models [12] are used to find the missing values and impute them in the agriculture data set:

$$\hat{x}_i = \hat{\beta}_0 + \hat{\beta}_1 y_{1,i} + \ldots + \hat{\beta}_k y_{k,i},$$

where $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are the estimated linear regression coefficients for each of the auxiliary variables $y_1, y_2, \ldots, y_k$. Linear models are straightforward to estimate.

C. Predictive Mean Matching Method

The predictive mean matching method [12] is also an imputation method available for continuous variables. It is similar to the regression method except that, for each missing value, it imputes a value drawn randomly from a set of observed values whose predicted values are the closest to the predicted value for the missing value from the simulated regression model.

New parameters $\beta^* = (\beta^*_0, \beta^*_1, \ldots, \beta^*_{(k)})$ and $\sigma^{*2}$ are drawn from the posterior predictive distribution of the parameters. That is, they are simulated from the estimates $\hat{\beta}$, $\hat{\sigma}^2$ and $V$, where $V$ is the usual $(X^{\top}X)^{-1}$ matrix of the fitted regression. The variance is drawn as

$$\sigma^{*2} = \hat{\sigma}^2 (n - k - 1)/g,$$

where $g$ is a $\chi^2_{n-k-1}$ random variate and $n$ is the number of non-missing observations for $y$. The regression coefficients are drawn as

$$\beta^* = \hat{\beta} + \sigma^* V_h^{\top} Z,$$

where $V_h$ is the upper triangular matrix in the Cholesky decomposition $V = V_h^{\top} V_h$, and $Z$ is a vector of $k + 1$ independent random normal variates.

For each missing value, a predicted value

$$y^* = \beta^*_0 + \beta^*_1 x_1 + \beta^*_2 x_2 + \cdots + \beta^*_{(k)} x_k$$

is computed with the covariate values $x_1, x_2, \ldots, x_k$. A set of $k_0$ observations whose corresponding predicted values are closest to $y^*$ is generated. The missing value is then replaced by a value drawn randomly from these $k_0$ observed values. The predictive mean matching method requires the number of closest observations, $k_0$, to be specified. A smaller $k_0$ tends to increase the correlation among the multiple imputations for the missing observation and results in a higher variability of point estimators in repeated sampling.

D. Algorithm 1: Predictive Mean Matching Method

Input: D, vector of real-valued data; $\beta^* = (\beta^*_0, \beta^*_1, \ldots, \beta^*_{(k)})$
Goal: To predict the missing values of the data
Output: Imputed missing values of the agriculture data

E. Predictive Mean Matching Method Procedure

Begin
Step 1. Read the agriculture data
Step 2. Calculate the variance
Step 3. Calculate the regression coefficients
Step 4. For each missing value, calculate a predicted value $y^*$
Step 5. Compute the covariate values
Step 6. Generate the predicted values
Step 7. Replace the missing value with a value drawn randomly from the $k_0$ observed values whose predicted values are closest to $y^*$
Step 8. Generate the reconstructed and completed data
End

EXPERIMENTAL RESULTS

In this research, the method discussed above has been implemented on a real-time agriculture data set. The data set contains a large volume of data on pests and other relevant information. The algorithmic approach to predicting missing values with the predictive mean matching method was applied in the R language and compared with the linear regression model [13, 14]. The results of the experiments with the predictive mean matching method are presented below.

Table 2: Missing data patterns

Field type   Avg_SpoEggMass   Avg_SpoGregLar   Avg_SpoSolLar   Avg_semilooper
Random1      0.4              0.2              0.4             0.6
Random1      0.2              0.2              0.4             0.4
Fixed1       0.2              0.2              0.4             0.2
Random1      NA               0                0.2             NA
Random1      NA               0.4              1.4             NA
Fixed1       NA               0                0.2             NA
Fixed1       NA               0                0.2             NA
Fixed1       0.2              0.2              0.2             0.4
Fixed1       0.2              0                0.4             0

Using the predictive mean matching method, the missing value pattern is identified for the missing values in Table 2, which represents the agriculture data set. The missing features are identified and sorted by number of missing values, as shown in Table 4.

Table 4: Sorted Missing Features by values

Variables sorted by number of missing values

Variables        Count
Avg_SpoEggMass   0.444444
Avg_SpoGregLar   0
Avg_SpoSolLar    0
Avg_semilooper   0

The missing values are plotted in Fig. 2.

Fig. 2: Plotted image of Missing values
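The procedure of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the R implementation used in the paper, and several choices are assumptions made only to keep the example short: a single covariate (Avg_SpoGregLar) predicts Avg_SpoEggMass, the number of closest observations is set to k0 = 3, donors are matched on predictions from the estimated coefficients while each missing case uses the drawn coefficients, and the names `missing_fraction` and `pmm_impute` are hypothetical. The two data columns are copied from Table 2.

```python
import math
import random

# Two columns of Table 2 as printed (None marks NA).
egg_mass = [0.4, 0.2, 0.2, None, None, None, None, 0.2, 0.2]  # Avg_SpoEggMass
greg_lar = [0.2, 0.2, 0.2, 0.0, 0.4, 0.0, 0.0, 0.2, 0.0]      # Avg_SpoGregLar


def missing_fraction(column):
    """Share of missing entries in a column (cf. the Count column of Table 4)."""
    return sum(v is None for v in column) / len(column)


def pmm_impute(y, x, k0=3, rng=None):
    """Impute the None entries of y from one covariate x by predictive mean matching."""
    rng = rng or random.Random()
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]
    n, k = len(obs), 1  # n observed cases, k covariates (assumed k = 1 here)

    # Least-squares estimates for y = b0 + b1 * x on the observed cases.
    mx = sum(x[i] for i in obs) / n
    my = sum(y[i] for i in obs) / n
    sxx = sum((x[i] - mx) ** 2 for i in obs)
    b1 = sum((x[i] - mx) * (y[i] - my) for i in obs) / sxx
    b0 = my - b1 * mx
    sse = sum((y[i] - b0 - b1 * x[i]) ** 2 for i in obs)
    sigma2 = sse / (n - k - 1)

    # Variance draw: sigma*^2 = sigma^2 (n - k - 1) / g, g ~ chi-square(n - k - 1).
    # A chi-square variate with d degrees of freedom is Gamma(shape d/2, scale 2).
    g = rng.gammavariate((n - k - 1) / 2, 2.0)
    sigma_star = math.sqrt(sigma2 * (n - k - 1) / g)

    # V = (X'X)^-1 for the design [1, x]; l00, l10, l11 form the lower-triangular
    # factor, i.e. the transpose of the upper Cholesky factor V_h.
    v00, v01, v11 = 1 / n + mx * mx / sxx, -mx / sxx, 1 / sxx
    l00 = math.sqrt(v00)
    l10 = v01 / l00
    l11 = math.sqrt(v11 - l10 * l10)

    # Coefficient draw: beta* = beta_hat + sigma* V_h' Z with Z standard normal.
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    b0_star = b0 + sigma_star * l00 * z0
    b1_star = b1 + sigma_star * (l10 * z0 + l11 * z1)

    completed = list(y)
    for i in mis:
        y_star = b0_star + b1_star * x[i]  # predicted value for the missing case
        # The k0 observed cases whose predictions are closest to y_star.
        donors = sorted(obs, key=lambda j: abs((b0 + b1 * x[j]) - y_star))[:k0]
        completed[i] = y[rng.choice(donors)]  # impute an observed donor value
    return completed


completed = pmm_impute(egg_mass, greg_lar, k0=3, rng=random.Random(2016))
print(round(missing_fraction(egg_mass), 6))  # 0.444444, as in Table 4
print(completed)
```

Because every imputed entry is copied from an observed donor, the completed column can only contain values that already occur in the data, which is the main practical difference from imputing the regression prediction itself.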
Table 2 represents the agriculture data set with missing value patterns, where the missing values are represented as NA. The missing features are identified and imputed by the linear regression model. The result is shown in Table 3.

Table 3: Imputation Using Linear Regression Model

Field type   Avg_SpoEggMass   Avg_SpoGregLar   Avg_SpoSolLar   Avg_semilooper
Random1      0.4              0.2              0.4             0.6
Random1      0.2              0.2              0.4             0.4
Fixed1       0.2              0.2              0.4             0.2
Random1      0.233333         0                0.2             0.2
Random1      0.416667         0.4              1.4             0.3
Fixed1       0.383333         0                0.2             0.5
Fixed1       0.233333         0                0.2             0.2
Fixed1       0.2              0.2              0.2             0.4
Fixed1       0.2              0                0.4             0

The missing values are then calculated and replaced by the predictive mean matching method; the imputed values are shown in Fig. 3. Finally, the method reconstructs the incomplete data set and produces a complete data set, which is represented in Table 5.

Fig. 3: Imputed Missing values

Table 5: Final Imputed Data

Field type   Avg_SpoEggMass   Avg_SpoGregLar   Avg_SpoSolLar   Avg_semilooper
Random1      0.4              0.2              0.4             0.6
Random1      0.2              0.2              0.4             0.4
Fixed1       0.2              0.2              0.4             0.2
Random1      0.4              0                0.2             0.2
Random1      0.2              0.4              1.4             0.3
Fixed1       0.2              0                0.2             0.5
Fixed1       0.2              0                0.2             0.2
Fixed1       0.2              0.2              0.2             0.4
Fixed1       0.2              0                0.4             0

CONCLUSION

In this paper, the predictive mean matching method is applied to and tested on a real-time agriculture data set. The method has been compared with mean, median and linear regression model imputation, and the proposed method improved the predictive performance. Compared with the other methods, it provides better accuracy and efficient imputations. The method helps to reconstruct an incomplete data set and produce a complete data set. In future, data mining techniques will be applied to the imputed data to extract knowledge from it.

REFERENCES

[1] Han J. and Kamber M., “Data Mining: Concepts and Techniques”, Second Edition, ELSEVIER Publications, ISBN: 978-81-312-0535-81, 2005.
[2] Alagukumar, S., and R. Lawrance. "A Selective Analysis of Microarray Data Using Association Rule Mining." Procedia Computer Science 47 (2015): 312.
[3] Alagukumar, S., and R. Lawrance. “Algorithm for Microarray Cancer Data Analysis using Frequent Pattern Mining and Gene Intervals”, International Journal of Computer Applications, pp. 1-6, 2015.
[4] Srinivasan Parthasarathy and Charu C. Aggarwal, On the use of conceptual Reconstruction for Mining Massively Incomplete Data Sets, IEEE, pp. 1512-1521, 2003.
[5] Singh Y. K., “Fundamental of research methodology and statistics”, New Age International, 2006.
[6] de Jonge, Edwin, and Mark van der Loo. An introduction to data cleaning with R. Technical Report 201313, Statistics Netherlands, 2013. URL http://www.cbs.nl/nl-L/menu/methoden/onderzoekmethoden/discussionpapers/archief/2013/default.htm, 2013.
[7] B. van den Broek. Imputation in R, 2012. Statistics Netherlands, internal report.
[8] T. De Waal, J. Pannekoek, and S. Scholtus. Handbook of statistical data editing and imputation. Wiley handbooks in survey methodology. John Wiley & Sons, 2011.
[9] I. P. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71:17-35, 1976.
[10] D.T. Larose and C.D. Larose, “Imputation of Missing Data”, Wiley Online Library, 2014. DOI: 10.1002/9781118874059.ch13.
[11] Shelke, M. B., & Badade, K. B. "Processing Of Incomplete Data Set." IJCER 2, no. 5 (2013): 658-660.
[12] Horton, N. J., & Lipsitz, S. R. (2001). Multiple imputation in practice: comparison of software packages for regression models with missing variables. The American Statistician, 55(3), 244-254.
[13] Buuren, Stef, and Karin Groothuis-Oudshoorn. "mice: Multivariate imputation by chained equations in R." Journal of statistical software 45, no. 3 (2011).
[14] http://cran.r-project.org/