Reconstruction of a Complete Dataset from an Incomplete Dataset by
LineReg (Linear Regression Analysis): Some Results
Sameer S. Prabhune 1
[email protected]
Assistant Prof. & HOD, I.T.,
S.S.G.M. College of Engineering,
Shegaon, 444203, Maharashtra, India
Dr. S.R. Sathe 2
[email protected]
Professor,
Dept. of E & CS, V.N.I.T.,
Nagpur, Maharashtra, India
Abstract
Preprocessing is a crucial step in a variety of data warehousing and mining tasks. Real-world data is noisy and can often suffer from corruptions or incomplete values that may impact the models created from the data. The accuracy of any mining algorithm greatly depends on the input data sets. Incomplete data sets have become almost ubiquitous in a wide variety of application domains. Common examples can be found in climate and image data sets, sensor data sets, and medical data sets. The incompleteness in these data sets may arise from a number of factors: in some cases it may simply be a reflection of certain measurements not being available at the time; in others the information may be lost due to partial system failure; or it may simply be a result of users being unwilling to specify attributes due to privacy concerns. When a significant fraction of the entries are missing in all of the attributes, it becomes very difficult to perform any kind of reasonable extrapolation on the original data. For such cases, we introduce a linear-regression-based approach: we build up a regression line to predict the complete data set from the incomplete data set, on which data mining algorithms can then be directly applied. We demonstrate the effectiveness of the approach on a variety of real data sets. This paper describes the theory and implementation of a new filter, LineReg (Linear Regression Analysis), for the WEKA workbench, for reconstructing a complete dataset from an incomplete dataset.
Keywords: Data mining, data preprocessing, missing data.
1. INTRODUCTION
Many data analysis applications, such as data mining, web mining, and information retrieval systems, require various forms of data preparation. Most of them work on the assumption that the data they process is complete in nature, but that is often not true. In data preparation, one takes the data in its raw form, removes as much noise, redundancy, and incompleteness as possible, and extracts the core data for further processing. Common solutions to the missing data problem include the use of imputation and statistical or regression-based procedures [11]. We note that such a missing-data mechanism relies on the fact that the attributes in a data set are not independent of one another: there is some predictive value from one attribute to another [1].
Linear regression analyses the relationship between two variables, X and Y. For each subject (or
experimental unit), you know both X and Y and you want to find the best straight line through the data. In
some situations, the slope and/or intercept have a scientific meaning. In other cases, you use the linear
regression line as a standard curve to find new values of X from Y, or Y from X. The term "regression",
like many statistical terms, is used in statistics quite differently than it is used in other contexts. The
method was first used to examine the relationship between the heights of fathers and sons. The two were related, of course, but the slope was less than 1.0. A tall father tended to have sons shorter than himself; a short father tended to have sons taller than himself. The heights of sons regressed to the mean. The term
"regression" is now used for many sorts of curve fitting. Prism determines and graphs the best-fit linear
regression line, optionally including a 95% confidence interval or 95% prediction interval bands. You may
also force the line through a particular point (usually the origin), calculate residuals, calculate a runs test,
or compare the slopes and intercepts of two or more regression lines. In general, the goal of linear
regression is to find the line that best predicts Y from X. Linear regression does this by finding the line
that minimizes the sum of the squares of the vertical distances of the points from the line.
Note that linear regression does not test whether your data are linear (except via the runs test). It
assumes that your data are linear, and finds the slope and intercept that make a straight line best fit your
data.
1.1 How linear regression works
1.1.1 Minimizing sum-of-squares
The goal of linear regression is to adjust the values of slope and intercept to find the line that best
predicts Y from X. More precisely, the goal of regression is to minimize the sum of the squares of the
vertical distances of the points from the line. Why minimize the sum of the squares of the distances? Why
not simply minimize the sum of the actual distances? If the random scatter follows a Gaussian
distribution, it is far more likely to have two medium size deviations (say 5 units each) than to have one
small deviation (1 unit) and one large (9 units). A procedure that minimized the sum of the absolute values of the distances would have no preference between a line that was 5 units away from two points and one that was 1 unit away from one point and 9 units away from another. The sum of the distances (more precisely, the
sum of the absolute value of the distances) is 10 units in each case. A procedure that minimizes the sum
of the squares of the distances prefers to be 5 units away from two points (sum-of-squares = 50) rather
than 1 unit away from one point and 9 units away from another (sum-of-squares = 82). If the scatter is
Gaussian (or nearly so), the line determined by minimizing the sum-of-squares is most likely to be correct.
Fig. 1.1: Linear relationship between X and Y.
1.1.2 Slope and intercept
Prism reports the best-fit values of the slope and intercept, along with their standard errors and
confidence intervals. The slope quantifies the steepness of the line. It equals the change in Y for each
unit change in X. It is expressed in the units of the Y-axis divided by the units of the X-axis. If the slope is
positive, Y increases as X increases. If the slope is negative, Y decreases as X increases. The Y
intercept is the Y value of the line when X equals zero. It defines the elevation of the line. The standard
error values of the slope and intercept can be hard to interpret, but their main purpose is to compute the
95% confidence intervals. If you accept the assumptions of linear regression, there is a 95% chance that
the 95% confidence interval of the slope contains the true value of the slope, and that the 95% confidence
interval for the intercept contains the true value of the intercept.
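As a concrete illustration of the computation described in Sections 1.1.1 and 1.1.2, the following Java sketch (a hypothetical helper, not part of the paper's filter) computes the least-squares slope and intercept from paired X and Y arrays:

    // Minimal least-squares fit: the slope and intercept that minimize
    // the sum of squared vertical distances of the points from the line.
    public class LeastSquares {
        public static double[] fit(double[] x, double[] y) {
            int n = x.length;
            double xMean = 0, yMean = 0;
            for (int i = 0; i < n; i++) { xMean += x[i]; yMean += y[i]; }
            xMean /= n;
            yMean /= n;
            double sumXYDiff = 0, sumXDiffSq = 0;
            for (int i = 0; i < n; i++) {
                sumXYDiff += (x[i] - xMean) * (y[i] - yMean);
                sumXDiffSq += (x[i] - xMean) * (x[i] - xMean);
            }
            double slope = sumXYDiff / sumXDiffSq;    // change in Y per unit change in X
            double intercept = yMean - slope * xMean; // Y value of the line at X = 0
            return new double[] { slope, intercept };
        }
    }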
1.1.3 Contribution of this paper
This paper gives the theory and implementation details of the LineReg filter addition to the WEKA workbench. It also gives precise results on real datasets [12].
2 PRELIMINARY TOOLS KNOWLEDGE
To achieve our main objective, i.e. to develop the LineReg filter for the WEKA workbench, we used the following technologies:
2.1 WEKA 3-5-4
Weka is an excellent workbench [4] for learning about machine learning techniques. We used this tool and its package because it is completely written in Java and gave us the ability to use ARFF datasets in our filter. The weka package contains many useful classes, which were required to code our filter. Some of the classes from the weka package are as follows [4]:
weka.core
weka.core.Instances
weka.filters
weka.core.matrix
weka.filters.unsupervised.attribute
weka.core.matrix.Matrix
weka.core.matrix.EigenvalueDecomposition, etc.
We have also studied the working of a simple filter by referring to the filters available in WEKA [9, 10], as sketched below.
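For orientation, a minimal skeleton of such a filter follows; globalInfo, determineOutputFormat, and process are the methods a SimpleBatchFilter subclass provides. The pass-through body here is only a sketch of the filter API, not the LineReg implementation:

    import weka.core.Instances;
    import weka.filters.SimpleBatchFilter;

    public class PassThroughFilter extends SimpleBatchFilter {
        // Short description shown in the WEKA GUI.
        public String globalInfo() {
            return "A pass-through batch filter skeleton.";
        }
        // Output format: here identical to the input format.
        protected Instances determineOutputFormat(Instances inputFormat) throws Exception {
            return new Instances(inputFormat, 0);
        }
        // Process a whole batch of instances; a real filter transforms them here.
        protected Instances process(Instances instances) throws Exception {
            Instances result = new Instances(determineOutputFormat(instances), 0);
            for (int i = 0; i < instances.numInstances(); i++) {
                result.add(instances.instance(i));
            }
            return result;
        }
    }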
2.2 JAVA
We used Java as our coding language for two reasons:
1. As the WEKA workbench is completely written in Java and supports Java packages, it is natural to use Java as the coding language.
2. We could use some classes from the Java packages and some from the weka package to create the filter.
3 ALGORITHM
This algorithm is designed to give the user an understanding of how LineReg works. LineReg is a simple yet robust technique for predicting the missing patterns in data [2, 14].
Algorithm for Linear Regression (Missing Value Prediction)

Variables:
result_data, test_inst – objects of Instances
m_attribute – attribute value
m_slope – slope of the line
m_intercept – intercept of the line
m_attributeIndex – index of the chosen attribute
m_suppressErrorMessage – suppress error messages
rows – number of instances
cols – number of attributes
mvals[][] – string input array
rows_attrs[][] – array of instance values
usable[] – useful attributes
Step 1) Retrieve the data from the dataset.
1) Data must be in ARFF format.
2) The dataset may be multidimensional.
Here the Instances class is used for handling an ordered set of instances; construct an empty set of instances for copying all instances. Check whether the dataset is empty or not.
Step 2) Find missing values in the dataset and replace them with a marker (here "X.X").
Repeat until (i < test_inst.numInstances())
  Repeat until (j < test_inst.numAttributes())
    If (!inst1.isMissing(j))
      rows_attrs[i][j] = inst1.value(j);
      mvals[i][j] = String.valueOf(inst1.value(j));
    else
      rows_attrs[i][j] = 0;
      mvals[i][j] = "X.X";
Step 3) Set the class index and delete instances with missing class values.
Repeat until (k < test_inst.numAttributes())
  insts.setClassIndex(k);
  insts.deleteWithMissingClass();
Then compute the class mean:
  yMean = insts.meanOrMode(insts.classIndex());
Step 4) Choose the best attribute and compute xMean, the slope, and the intercept.
Repeat until (i < insts.numAttributes())
  If (i != insts.classIndex())
    xMean = insts.meanOrMode(i);
    Repeat until (j < insts.numInstances())
      If (!inst.isMissing(i) && !inst.classIsMissing())
        Calculate xDiff, yDiff, weightedXDiff, and weightedYDiff.
        numerator += weightedXDiff * yDiff;
        sumWeightedXDiffSquared += weightedXDiff * xDiff;
        sumWeightedYDiffSquared += weightedYDiff * yDiff;
    m_slope = numerator / sumWeightedXDiffSquared;
    m_intercept = yMean - m_slope * xMean;
Step 5) Compare the sums of squared errors and keep the best attribute so far.
  msq = sumWeightedYDiffSquared - m_slope * numerator;
  If (msq < minMsq)
    minMsq = msq;
    chosen = i;
    chosenSlope = m_slope;
    chosenIntercept = m_intercept;
Step 6) Set the slope, intercept, and best attribute for the particular class attribute.
If (chosen == -1)
  If (!m_suppressErrorMessage)
    Print "no useful attribute found";
else
  m_attribute = test_inst.attribute(chosen);
  m_attributeIndex = chosen;
  slope = chosenSlope;
  intercept = chosenIntercept;
Step 7) Predict each missing value from the fitted line.
Repeat until (i < test_inst.numAttributes())
  Repeat until (j < test_inst.numInstances())
    If (mvals[j][i].equals("X.X"))
      rows_attrs[j][i] = slope * rows_attrs[j][usable[i]] + intercept;
Step 8) Write the predicted values back into the instances.
Repeat until (i < test_inst.numInstances())
  Repeat until (j < test_inst.numAttributes())
    If (inst1.isMissing(j))
      inst1.setValue(j, rows_attrs[i][j]);
  result_data.add(inst1);
Return result_data;
Step 9) Create a batch filter to test the filter's ability to process multiple test batches; this procedure is repeated for any dataset.

Figure 1 shows the LineReg algorithm for predicting the missing values.
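To make Steps 4-7 concrete, the following self-contained Java sketch fits y = slope*x + intercept over the rows where both values are present and then fills in the missing y values. It is our own simplified, array-based rendering (with Double.NaN standing in for the ARFF "?"), not the filter's actual code, which works on WEKA Instances objects and scans all attributes for the best predictor:

    // Sketch of Steps 4-7: fit y = slope*x + intercept on the rows where
    // both values are present, then predict the missing y values.
    // Double.NaN marks a missing entry (the ARFF "?").
    // Assumes at least one complete (x, y) pair and non-constant x.
    public class LineRegSketch {
        public static void impute(double[] x, double[] y) {
            double xMean = 0, yMean = 0;
            int n = 0;
            for (int i = 0; i < x.length; i++) {
                if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                    xMean += x[i]; yMean += y[i]; n++;
                }
            }
            xMean /= n; yMean /= n;
            double numerator = 0, sumXDiffSq = 0;
            for (int i = 0; i < x.length; i++) {
                if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                    numerator += (x[i] - xMean) * (y[i] - yMean);
                    sumXDiffSq += (x[i] - xMean) * (x[i] - xMean);
                }
            }
            double slope = numerator / sumXDiffSq;
            double intercept = yMean - slope * xMean;
            // Step 7: fill each missing y from the fitted line.
            for (int i = 0; i < y.length; i++) {
                if (Double.isNaN(y[i]) && !Double.isNaN(x[i])) {
                    y[i] = slope * x[i] + intercept;
                }
            }
        }
    }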
4 IMPLEMENTATION
We use datasets written in ARFF format as input to this algorithm and the LineReg filter [2, 7, 8]. The filter takes an ARFF dataset as input and finds the missing values in it. After finding the missing values in the given dataset, it applies the LineReg algorithm to predict the missing values and reconstructs the whole dataset by putting the predicted values back into the given dataset.
We have created a LineReg filter class, which is an extension of the abstract SimpleBatchFilter class. The LineReg filter works by fitting the data using the sum of squared errors and the slope of the straight line, and predicts the missing data.
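Assuming the finished filter class is named LineReg (our name for illustration), it would be applied to an ARFF dataset in the usual WEKA fashion; DataSource and Filter.useFilter are standard WEKA API calls:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;

    public class ApplyLineReg {
        public static void main(String[] args) throws Exception {
            // Load the incomplete ARFF dataset (file name is illustrative).
            Instances data = DataSource.read("cpu_with_missing.arff");
            // Configure and run the filter; useFilter pushes every
            // instance through and returns the completed dataset.
            LineReg filter = new LineReg(); // hypothetical class name
            filter.setInputFormat(data);
            Instances completed = Filter.useFilter(data, filter);
            System.out.println(completed);
        }
    }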
5 EXPERIMENTAL SET UP
5.1 Approach
The objective of our experiment is to build the filter as a preprocessing step in the WEKA workbench, which reconstructs complete data sets from incomplete ones. We intentionally did not select UCI data sets [12] that originally come with missing values, because for those we cannot measure the accuracy of our approach. For the experimental set up, we take complete datasets from the UCI repository [12], and missing values are then artificially added to the data sets to simulate MCAR (missing completely at random) missingness. To introduce m% missing values per attribute xi in a dataset of size n, we randomly selected m% of the n instances and replaced their xi values with unknown, i.e. ? (in WEKA, missing values are denoted as "?"). We use 10%, 20%, and 30% missingness for every dataset.
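As an illustration of this corruption step, the helper below (our own sketch, not code from the paper) randomly marks m% of an attribute's values as missing using WEKA's Instance.setMissing call:

    import java.util.Random;
    import weka.core.Instances;

    public class InjectMissing {
        // Randomly set pct percent of the values of attribute attIndex
        // to missing ("?"), simulating MCAR missingness.
        // Assumes pct leaves enough non-missing values to corrupt.
        public static void corrupt(Instances data, int attIndex, double pct, long seed) {
            Random rnd = new Random(seed);
            int target = (int) Math.round(data.numInstances() * pct / 100.0);
            int done = 0;
            while (done < target) {
                int row = rnd.nextInt(data.numInstances());
                if (!data.instance(row).isMissing(attIndex)) {
                    data.instance(row).setMissing(attIndex);
                    done++;
                }
            }
        }
    }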
5.2 Results
After the preprocessing step, we use WEKA's M5Rules classifier for the error analysis. The classification was carried out using 10-fold cross-validation. In Table 1, we report the standard errors along with the correlation coefficient on the UCI [12] database repository. Observing the correlation coefficients of all the datasets, i.e. CPU, Glass, and Wine, with missingness parameters of 10%, 20%, and 30%, it is clear that as missingness increases, the accuracy of the classifier decreases.
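For reference, the error measures reported in Table 1 are WEKA's standard regression evaluation statistics; the sketch below shows how such numbers are obtained with M5Rules under 10-fold cross-validation (M5Rules and Evaluation are standard WEKA classes; the dataset file name is illustrative):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.M5Rules;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EvaluateM5Rules {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("cpu_completed.arff");
            data.setClassIndex(data.numAttributes() - 1); // class = last attribute
            // 10-fold cross-validation of M5Rules on the completed dataset.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new M5Rules(), data, 10, new Random(1));
            System.out.println("Correlation coefficient: " + eval.correlationCoefficient());
            System.out.println("Mean absolute error:     " + eval.meanAbsoluteError());
            System.out.println("Root mean squared error: " + eval.rootMeanSquaredError());
            System.out.println("Relative absolute error: " + eval.relativeAbsoluteError() + " %");
            System.out.println("Root rel. squared error: " + eval.rootRelativeSquaredError() + " %");
        }
    }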
SN | Error Analysis              | CPU 10%M | CPU 20%M | CPU 30%M | Glass 10%M | Glass 20%M | Glass 30%M | Wine 10%M | Wine 20%M | Wine 30%M
1  | Correlation coefficient     | 0.9678   | 0.9795   | 0.9883   | 0.9918     | 0.9838     | 0.98       | 0.8547    | 0.8341    | 0.7894
2  | Mean absolute error         | 20.0325  | 16.9157  | 9.1454   | 0.1855     | 0.2189     | 0.2425     | 131.7643  | 136.077   | 140.51
3  | Root mean squared error     | 39.0065  | 30.3346  | 21.0669  | 0.2676     | 0.3802     | 0.4179     | 162.2657  | 170.5029  | 182.1391
4  | Relative absolute error     | 23.1372% | 18.8482% | 10.8232% | 10.7876%   | 12.6937%   | 14.0545%   | 51.5862%  | 53.7478%  | 59.6804%
5  | Root relative squared error | 25.1575% | 20.0951% | 15.226%  | 12.7254%   | 18.0095%   | 19.7775%   | 52.2063%  | 55.1499%  | 61.5428%
6  | Total number of instances   | 209      | 209      | 209      | 214        | 214        | 214        | 178       | 178       | 178

TABLE 1: Error analysis after applying the LineReg filter with the M5Rules classifier in the WEKA workbench on UCI [12] datasets (m%M denotes m% missingness).
6 CONCLUSION
As seen from Table 1, as the percentage of missing values increases, the correlation coefficient decreases, i.e. the classifier accuracy decreases gradually. In this paper, we provided the exact implementation details of adding a new filter, viz. LineReg, to the WEKA workbench. As seen from the results, the LineReg filter works by fitting the data using the sum of squared errors and the slope of the straight line, and predicts the missing data.
We demonstrated the complete procedure of building the filter by means of the available technologies, and also the addition of this filter as an extension to the WEKA workbench.
ACKNOWLEDGMENTS
Our special thanks to Mr. Peter Reutemann of the University of Waikato, [email protected], for providing us with support as and when required.
REFERENCES
1. S. Parthasarathy and C.C. Aggarwal, "On the Use of Conceptual Reconstruction for Mining Massively Incomplete Data Sets", IEEE Trans. Knowledge and Data Eng., pp. 1512-1521, 2003.
2. J. Quinlan, "C4.5: Programs for Machine Learning", San Mateo, Calif.: Morgan Kaufmann, 1993.
3. http://weka.sourceforge.net/wiki/index.php/Writing_your_own_Filter
4. WekaWiki: http://weka.sourceforge.net/wiki/index.php/Main_Page
5. S. Mehta, S. Parthasarathy and H. Yang, "Toward Unsupervised Correlation Preserving Discretization", IEEE Trans. Knowledge and Data Eng., pp. 1174-1185, 2005.
6. Ian H. Witten and Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques", Second Edition, Morgan Kaufmann Publishers. ISBN: 81-312-0050-7.
7. http://weka.sourceforge.net/wiki/index.php/CVS
8. http://weka.sourceforge.net/wiki/index.php/Eclipse_3.0.x
9. weka.filters.SimpleBatchFilter
10. weka.filters.SimpleStreamFilter
11. R.J.A. Little and D.B. Rubin, "Statistical Analysis with Missing Data", pp. 41-53, Wiley Series in Prob. and Stat., Second Ed., 2002.
12. UCI Machine Learning Repository: http://www.ics.uci.edu/umlearn/MLsummary.html
13. J.L. Schafer, "Analysis of Incomplete Multivariate Data", Monographs in Prob. and Stat. 72, Chapman and Hall, CRC.
14. J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Second Ed., Morgan Kaufmann Publishers, 2006.