Reconstruction of a Complete Dataset from an Incomplete Dataset by LineReg (Linear Regression Analysis): Some Results

Sameer S. Prabhune
[email protected]
Assistant Professor & Head, Dept. of I.T., S.S.G.M. College of Engineering, Shegaon 444203, Maharashtra, India

Dr. S.R. Sathe
[email protected]
Professor, Dept. of E & CS, V.N.I.T., Nagpur, Maharashtra, India

Abstract

Preprocessing is a crucial step in a variety of data warehousing and mining tasks. Real-world data is noisy and often suffers from corruption or incomplete values that may impact the models created from the data. The accuracy of any mining algorithm depends greatly on the input data sets. Incomplete data sets have become almost ubiquitous in a wide variety of application domains; common examples can be found in climate, image, sensor, and medical data sets. The incompleteness in these data sets may arise from a number of factors: in some cases it may simply be a reflection of certain measurements not being available at the time; in others the information may be lost due to partial system failure; or it may simply be a result of users being unwilling to specify attributes due to privacy concerns. When a significant fraction of the entries are missing in all of the attributes, it becomes very difficult to perform any kind of reasonable extrapolation on the original data. For such cases, we introduce the idea of linear regression, in which we build a regression line to predict a complete data set from an incomplete one, so that data mining algorithms can be applied directly. We demonstrate the effectiveness of the approach on a variety of real data sets. This paper describes the theory and implementation of a new filter, LineReg (Linear Regression Analysis), for the WEKA workbench, which reconstructs a complete dataset from an incomplete dataset.

Keywords: Data mining, data preprocessing, missing data.
1. INTRODUCTION

Many data analysis applications such as data mining, web mining, and information retrieval require various forms of data preparation. Most of them work on the assumption that the data is complete in nature, but that is often not true. In data preparation, one takes the data in its raw form, removes as much noise, redundancy, and incompleteness as possible, and brings out the core data for further processing. Common solutions to the missing data problem include the use of imputation and statistical or regression-based procedures [11]. We note that such missing-data mechanisms rely on the fact that the attributes in a data set are not independent of one another; there is some predictive value from one attribute to another [1].

Linear regression analyzes the relationship between two variables, X and Y. For each subject (or experimental unit), you know both X and Y, and you want to find the best straight line through the data. In some situations, the slope and/or intercept have a scientific meaning. In other cases, you use the linear regression line as a standard curve to find new values of X from Y, or Y from X. The term "regression", like many statistical terms, is used in statistics quite differently than in other contexts. The method was first used to examine the relationship between the heights of fathers and sons. The two were related, of course, but the slope was less than 1.0: a tall father tended to have sons shorter than himself, and a short father tended to have sons taller than himself; the height of sons regressed to the mean. The term "regression" is now used for many sorts of curve fitting. Prism determines and graphs the best-fit linear regression line, optionally including 95% confidence interval or 95% prediction interval bands.
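The best-fit line described above can be computed in closed form. The following is a minimal Java sketch (class and variable names are our own, chosen for illustration; this is not Prism's or WEKA's code) using the standard least-squares formulas: slope = sum((x - xMean)(y - yMean)) / sum((x - xMean)^2) and intercept = yMean - slope * xMean.

```java
// Minimal least-squares fit of a straight line y = slope * x + intercept.
// Illustrative sketch only; names are not taken from any particular library.
public class SimpleLinearFit {
    final double slope, intercept;

    SimpleLinearFit(double[] x, double[] y) {
        double xMean = 0, yMean = 0;
        for (int i = 0; i < x.length; i++) { xMean += x[i]; yMean += y[i]; }
        xMean /= x.length;
        yMean /= y.length;
        // Numerator and denominator of the least-squares slope.
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - xMean) * (y[i] - yMean);
            den += (x[i] - xMean) * (x[i] - xMean);
        }
        slope = num / den;
        intercept = yMean - slope * xMean;
    }

    // Predict Y from X using the fitted line.
    double predict(double x) { return slope * x + intercept; }
}
```

For example, fitting the points (0, 1), (1, 4), (2, 7), which lie exactly on y = 3x + 1, recovers slope 3 and intercept 1.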
You may also force the line through a particular point (usually the origin), calculate residuals, calculate a runs test, or compare the slopes and intercepts of two or more regression lines. In general, the goal of linear regression is to find the line that best predicts Y from X. Linear regression does this by finding the line that minimizes the sum of the squares of the vertical distances of the points from the line. Note that linear regression does not test whether your data are linear (except via the runs test). It assumes that your data are linear, and finds the slope and intercept that make a straight line best fit your data.

1.1 How Linear Regression Works

1.1.1 Minimizing the Sum-of-Squares

The goal of linear regression is to adjust the values of slope and intercept to find the line that best predicts Y from X. More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line. Why minimize the sum of the squares of the distances, rather than simply the sum of the actual distances? If the random scatter follows a Gaussian distribution, it is far more likely to have two medium-size deviations (say 5 units each) than one small deviation (1 unit) and one large deviation (9 units). A procedure that minimized the sum of the absolute values of the distances would have no preference between a line that was 5 units away from two points and one that was 1 unit away from one point and 9 units from another: the sum of the absolute distances is 10 units in each case. A procedure that minimizes the sum of the squares of the distances prefers to be 5 units away from two points (sum-of-squares = 50) rather than 1 unit away from one point and 9 units away from another (sum-of-squares = 82). If the scatter is Gaussian (or nearly so), the line determined by minimizing the sum-of-squares is most likely to be correct.
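The 5/5 versus 1/9 comparison above can be checked directly. A small illustrative Java snippet (names are ours, for demonstration only):

```java
// Why least squares prefers two medium deviations over one small and one
// large: absolute distances cannot distinguish them, squared distances can.
public class DeviationDemo {
    static double sumAbs(double[] devs) {
        double s = 0;
        for (double d : devs) s += Math.abs(d);
        return s;
    }
    static double sumSquares(double[] devs) {
        double s = 0;
        for (double d : devs) s += d * d;
        return s;
    }
    public static void main(String[] args) {
        double[] twoMedium = {5, 5};   // two medium-size deviations
        double[] smallLarge = {1, 9};  // one small, one large deviation
        // Both lines are 10 units away in total absolute distance.
        System.out.println(sumAbs(twoMedium) + " vs " + sumAbs(smallLarge));       // 10.0 vs 10.0
        // But least squares prefers the first: 50 < 82.
        System.out.println(sumSquares(twoMedium) + " vs " + sumSquares(smallLarge)); // 50.0 vs 82.0
    }
}
```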
Fig. 1.1: Linear relationship between X and Y

1.1.2 Slope and Intercept

Prism reports the best-fit values of the slope and intercept, along with their standard errors and confidence intervals. The slope quantifies the steepness of the line. It equals the change in Y for each unit change in X, and is expressed in the units of the Y-axis divided by the units of the X-axis. If the slope is positive, Y increases as X increases; if the slope is negative, Y decreases as X increases. The Y intercept is the Y value of the line when X equals zero; it defines the elevation of the line. The standard error values of the slope and intercept can be hard to interpret, but their main purpose is to compute the 95% confidence intervals. If you accept the assumptions of linear regression, there is a 95% chance that the 95% confidence interval of the slope contains the true value of the slope, and that the 95% confidence interval of the intercept contains the true value of the intercept.

1.1.3 Contribution of This Paper

This paper gives the theory and implementation details of the LineReg filter addition to the WEKA workbench. It also gives precise results on real datasets [12].

2. PRELIMINARY TOOLS KNOWLEDGE

To achieve our main objective, i.e. to develop the LineReg filter for the WEKA workbench, we used the following technologies.

2.1 WEKA 3-5-4

Weka is an excellent workbench [4] for learning about machine learning techniques. We used this tool because it is completely written in Java and its packages gave us the ability to use ARFF datasets in our filter. The weka package contains many useful classes, which were required to code our filter. Some of the classes and packages are as follows [4]:
weka.core
weka.core.Instances
weka.filters
weka.filters.unsupervised.attribute
weka.core.matrix.Matrix
weka.core.matrix.EigenvalueDecomposition
We have also studied the working of a simple filter by referring to the filter classes available in WEKA [9,10].

2.2 Java

We used Java as our coding language for two reasons:
1. The WEKA workbench is completely written in Java and supports Java packages, so it is natural to use Java as the coding language.
2. We could use classes from both the standard Java packages and the weka package to create the filter.

3. ALGORITHM

This algorithm is designed to give the user an understanding of how LineReg works. LineReg is a simple yet robust technique for predicting the missing patterns in data [2,14].

Algorithm for Linear Regression Missing Value Prediction

Variables:
result_data, test_inst: objects of Instances
m_attribute: attribute value
m_slope: slope of the line
m_intercept: intercept of the line
m_attributeIndex: index of the chosen attribute
m_suppressErrorMessage: suppress error messages
rows: number of instances
cols: number of attributes
mvals[][]: string input array
rows_attrs[][]: array of instance values
usable[]: usable attributes

Step 1) Retrieve the data from the dataset.
    1) The data must be in ARFF format.
    2) The dataset may be multidimensional. The Instances class is used for handling an ordered set of instances.
    Construct an empty set of instances for copying all instances, and check whether the dataset is empty.

Step 2) Find the missing values in the dataset and replace each with a placeholder (say X).
    Repeat until (i < test_inst.numInstances())
        Repeat until (j < test_inst.numAttributes())
            If (!inst1.isMissing(j)) then
                rows_attrs[i][j] = inst1.value(j);
                mvals[i][j] = String.valueOf(inst1.value(j));
            else
                rows_attrs[i][j] = 0;
                mvals[i][j] = "X.X";

Step 3) Set the class values and delete instances with a missing class value.
    Repeat until (k < test_inst.numAttributes())
        insts.setClassIndex(k);
        insts.deleteWithMissingClass();
        yMean = insts.meanOrMode(insts.classIndex());

Step 4) Choose the best attribute and compute xMean, the slope and the intercept.
    Repeat until (i < insts.numAttributes())
        If (i != insts.classIndex())
            xMean = insts.meanOrMode(i);
            Repeat until (j < insts.numInstances())
                If (!inst.isMissing(i) && !inst.classIsMissing())
                    Calculate xDiff, yDiff, weightedXDiff and weightedYDiff.
                    m_slope += weightedXDiff * yDiff;
                    sumWeightedXDiffSquared += weightedXDiff * xDiff;
                    sumWeightedYDiffSquared += weightedYDiff * yDiff;
            m_slope /= sumWeightedXDiffSquared;
            m_intercept = yMean - m_slope * xMean;

Step 5) Compare the sums of squared errors and check whether this is the best attribute.
    msq = sumWeightedYDiffSquared - m_slope * numerator;
    If (msq < minMsq)
        minMsq = msq;
        chosenSlope = m_slope;
        chosenIntercept = m_intercept;

Step 6) Set the slope, intercept and best attribute for the particular class attribute.
    If (chosen == -1)
        If (!m_suppressErrorMessage)
            Print "no useful attribute found";
    else
        m_attribute = test_inst.attribute(chosen);
        m_attributeIndex = chosen;
        slope = chosenSlope;
        intercept = chosenIntercept;

Step 7) Fill in the missing values using the fitted line.
    Repeat until (i < test_inst.numAttributes())
        Repeat until (j < test_inst.numInstances())
            If (mvals[j][i].equals("X.X"))
                rows_attrs[j][i] = slope * rows_attrs[j][usable] + intercept;

Step 8) Write the predicted values back into the instances.
    Repeat until (i < test_inst.numInstances())
        Repeat until (j < test_inst.numAttributes())
            If (inst1.isMissing(j))
                inst1.setValue(j, rows_attrs[i][j]);
        result_data.add(inst1);
    Return result_data;

Step 9) Create a batch filter to test the filter's ability to process multiple test batches; this procedure is repeated for any dataset.

Figure 1: The LineReg algorithm for prediction of the missing values.
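Steps 3 to 7 above can be condensed into plain Java without the WEKA classes. The following is a simplified sketch of the idea only; the names, and the use of NaN to encode missing values, are our own simplifications and do not reproduce the filter's actual code:

```java
// Simplified LineReg-style imputation: for a target column, fit a line from
// each other column using only rows where both values are present, keep the
// attribute with the smallest residual sum of squared errors, then fill the
// missing (NaN) entries of the target column from that line.
public class LineRegSketch {

    // Impute missing (NaN) entries of column `target` in place.
    static void impute(double[][] data, int target) {
        int nAttrs = data[0].length;
        double bestSse = Double.POSITIVE_INFINITY;
        int bestAttr = -1;
        double bestSlope = 0, bestIntercept = 0;
        for (int a = 0; a < nAttrs; a++) {
            if (a == target) continue;
            // Means over rows where both attribute and target are known.
            double xMean = 0, yMean = 0;
            int n = 0;
            for (double[] row : data) {
                if (!Double.isNaN(row[a]) && !Double.isNaN(row[target])) {
                    xMean += row[a]; yMean += row[target]; n++;
                }
            }
            if (n < 2) continue;
            xMean /= n; yMean /= n;
            double num = 0, den = 0, yss = 0;
            for (double[] row : data) {
                if (!Double.isNaN(row[a]) && !Double.isNaN(row[target])) {
                    double xd = row[a] - xMean, yd = row[target] - yMean;
                    num += xd * yd;
                    den += xd * xd;
                    yss += yd * yd;
                }
            }
            if (den == 0) continue;
            double slope = num / den;
            double intercept = yMean - slope * xMean;
            // Residual sum of squares of this fit (the Step 5 comparison).
            double sse = yss - slope * num;
            if (sse < bestSse) {
                bestSse = sse; bestAttr = a;
                bestSlope = slope; bestIntercept = intercept;
            }
        }
        if (bestAttr < 0) return; // no useful attribute found (Step 6)
        for (double[] row : data) {
            if (Double.isNaN(row[target]) && !Double.isNaN(row[bestAttr])) {
                row[target] = bestSlope * row[bestAttr] + bestIntercept;
            }
        }
    }
}
```

For instance, given rows {1, 3}, {2, 5}, {3, NaN}, {4, 9}, where the second column follows y = 2x + 1, imputing the second column fills the NaN with a value of 7.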
4. IMPLEMENTATION

We use datasets written in ARFF format as input to this algorithm and the LineReg filter [2,7,8]. The filter takes an ARFF dataset as input and finds the missing values in it. After finding the missing values, it applies the LineReg algorithm to predict them and reconstructs the whole dataset by putting the predicted values into the given dataset. We have created a LineReg filter class, which extends the abstract SimpleBatchFilter class. The LineReg filter works by fitting a straight line that minimizes the sum of squared errors and using its slope and intercept to predict the missing data.

5. EXPERIMENTAL SETUP

5.1 Approach

The objective of our experiment is to build the filter as a preprocessing step in the WEKA workbench that completes data sets with missing values. We intentionally did not select UCI [12] data sets that originally come with missing values, because for such data sets the true values are unknown and the accuracy of our approach could not be measured. Instead, for the experimental setup we take complete datasets from the UCI repository [12] and artificially add missing values to simulate MCAR (missing completely at random) missingness. To introduce m% missing values for an attribute xi in a dataset of size n, we randomly select (m/100)·n instances and replace their xi value with unknown (in WEKA, missing values are denoted by "?"). We use 10%, 20% and 30% missingness for every dataset.

5.2 Results

After the preprocessing step, we use WEKA's M5Rules classifier for the error analysis. Classification was carried out with 10-fold cross-validation. In Table 1, we report the standard errors along with the correlation coefficient on the UCI [12] datasets. When observing the correlation coefficient of all the datasets, i.e.
CPU, Glass and Wine, with missingness parameters 10%, 20% and 30%, it is clear that as missingness increases, the accuracy of the classifier decreases.

SN  Error analysis               CPU                            Glass                            Wine
                                 10%M     20%M     30%M        10%M     20%M     30%M         10%M      20%M      30%M
1   Correlation coefficient      0.9678   0.9795   0.9883      0.9918   0.9838   0.98         0.8547    0.8341    0.7894
2   Mean absolute error          20.0325  16.9157  9.1454      0.1855   0.2189   0.2425       131.7643  136.077   140.51
3   Root mean squared error      39.0065  30.3346  21.0669     0.2676   0.3802   0.4179       162.2657  170.5029  182.1391
4   Relative absolute error      23.1372% 18.8482% 10.8232%    10.7876% 12.6937% 14.0545%     51.5862%  53.7478%  59.6804%
5   Root relative squared error  25.1575% 20.0951% 15.226%     12.7254% 18.0095% 19.7775%     52.2063%  55.1499%  61.5428%
6   Total number of instances    209      209      209         214      214      214          178       178       178

TABLE 1: Error analysis after applying the LineReg filter with the M5Rules classifier in the WEKA workbench on UCI [12] datasets (M = missingness).

6. CONCLUSION

As seen from Table 1, as the percentage of missing values increases, the correlation coefficient decreases, i.e. the classifier accuracy decreases gradually. In this paper, we provided the exact implementation details of adding a new filter, viz. LineReg, to the WEKA workbench. As seen from the results, the LineReg filter works by fitting a straight line that minimizes the sum of squared errors and using its slope and intercept to predict the missing data. We demonstrated the complete procedure of building the filter with the available technologies, and the addition of this filter as an extension to the WEKA workbench.

ACKNOWLEDGMENTS

Our special thanks to Mr. Peter Reutemann of the University of Waikato, [email protected], for providing support as and when required.

REFERENCES

1. S. Parthasarathy and C.C. Aggarwal, "On the Use of Conceptual Reconstruction for Mining Massively Incomplete Data Sets", IEEE Trans. Knowledge and Data Eng., pp. 1512-1521, 2003.
2. J.
Quinlan, "C4.5: Programs for Machine Learning", San Mateo, Calif.: Morgan Kaufmann, 1993.
3. http://weka.sourceforge.net/wiki/index.php/Writing_your_own_Filter
4. WekaWiki: http://weka.sourceforge.net/wiki/index.php/Main_Page
5. S. Mehta, S. Parthasarathy and H. Yang, "Toward Unsupervised Correlation Preserving Discretization", IEEE Trans. Knowledge and Data Eng., pp. 1174-1185, 2005.
6. Ian H. Witten and Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques", Second Edition, Morgan Kaufmann Publishers. ISBN: 81-312-0050-7.
7. http://weka.sourceforge.net/wiki/index.php/CVS
8. http://weka.sourceforge.net/wiki/index.php/Eclipse_3.0.x
9. weka.filters.SimpleBatchFilter
10. weka.filters.SimpleStreamFilter
11. R.J.A. Little and D. Rubin, "Statistical Analysis with Missing Data", pp. 41-53, Wiley Series in Prob. and Stat., Second Ed., 2002.
12. UCI Machine Learning Repository, http://www.ics.uci.edu/umlearn/MLsummary.html
13. J.L. Schafer, "Analysis of Incomplete Multivariate Data", Monographs on Statistics and Applied Probability 72, Chapman and Hall/CRC.
14. J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Second Ed., Morgan Kaufmann Publishers, 2006.