International Journal of Engineering Trends and Technology (IJETT) – Volume 5 number 2- Nov 2013
Research Paper
Uncertain Data Classification Using Decision Tree
Classification Tool With Probability Density
Function Modeling Technique
C. Sudarsana Reddy (1), S. Aquter Babu (2), Dr. V. Vasu (3)
(1) Department of Computer Science and Engineering, S.V. University College of Engineering, S.V. University, Tirupati, Andhra Pradesh, India
(2) Assistant Professor of Computer Science, Department of Computer Science, Dravidian University, Kuppam - 517425, Chittoor District, Andhra Pradesh, India
(3) Department of Mathematics, S.V. University, Tirupati, Andhra Pradesh, India
Abstract – Classical decision tree classifiers are constructed using certain (point) data only, but in many real life applications data is inherently uncertain. Attribute or value uncertainty is introduced into data values during the data collection process. Attributes in training data sets are of two types – numerical (continuous) and categorical (discrete) – and data uncertainty exists in both. Data uncertainty in a numerical attribute is represented by a range of values, and data uncertainty in a categorical attribute by a set or collection of values. In this paper we propose a method for handling data uncertainty in numerical attributes. One of the simplest methods of handling data uncertainty in a numerical attribute is to replace the set of originally collected values behind each attribute value with their mean (a single representative value). Under data uncertainty, however, the value of an attribute is better represented by a set of values, and decision tree classification accuracy is much improved when attribute values are represented by sets of values rather than by one single representative value. A probability density function with equal probabilities is an effective data uncertainty modelling technique for representing each value of an attribute as a set of values. The main assumption here is that the values provided in the training data sets are averaged or representative values of the values originally collected during the data collection process.
For each representative value of each numerical
attribute in the training data set, approximated
values corresponding to the originally collected
values are generated by using probability density
function with equal probabilities and these newly
generated sets of values are used in constructing a
new decision tree classifier.
Keywords — probability density function with
equal probabilities; uncertain data; decision tree;
classification; data mining; machine learning
1. INTRODUCTION
Classification is a data analysis technique.
The decision tree is a powerful and popular tool for classification and prediction, though decision trees are mainly used for classification [1]. The main advantage of a decision tree is its interpretability – the decision tree can easily be converted to a set of IF-THEN rules that are easily understandable [2].
Example sources of data uncertainty include measurement and quantization errors, data staleness, and multiple repeated measurements [3]. Data mining applications for uncertain data include classification of uncertain data, clustering of uncertain data, frequent pattern mining, outlier detection, etc. Examples of certain data are locations of universities, buildings, schools, colleges, restaurants, railway stations and bus
stands etc. Data uncertainty naturally arises in a
large number of real life applications including
scientific data, web data integration, machine
learning, and information retrieval.
Data uncertainty in databases is broadly
classified into three types:
1. Attribute or value uncertainty
2. Correlated uncertainty and
3. Tuple or existential uncertainty
In attribute or value uncertainty, the value of each
attribute is represented by an independent
probability distribution. In correlated uncertainty,
values of multiple attributes may be described by a
joint probability distribution. Existential or tuple
uncertainty exists when it is uncertain whether a
data tuple is present or not in the relational database.
For example, a data tuple in a relational database
could be associated with a probability that
represents the confidence of its presence [3]. In the
case of existential uncertainty, a data tuple may or
may not exist in the relational database. For example, assume we have the following tuple from a probabilistic database: [(a, 0.4), (b, 0.5)]. Since the existence probabilities of its alternatives sum to 0.9, the tuple has a 10% chance of not existing in the relational database.
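As a small illustrative sketch (not from the paper; the class and method names are ours), a tuple with existential uncertainty can be stored as its alternative values with existence probabilities, and the probability that the tuple is absent is one minus their sum:

// Illustrative sketch: a probabilistic tuple stored as alternative values with
// existence probabilities, e.g. [(a, 0.4), (b, 0.5)].
import java.util.LinkedHashMap;
import java.util.Map;

public class ProbabilisticTuple {
    // alternative value -> probability that this alternative is the true value
    private final Map<String, Double> alternatives = new LinkedHashMap<>();

    public void addAlternative(String value, double probability) {
        alternatives.put(value, probability);
    }

    // Probability that the tuple exists at all = sum of the alternative probabilities.
    public double existenceProbability() {
        return alternatives.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        ProbabilisticTuple t = new ProbabilisticTuple();
        t.addAlternative("a", 0.4);
        t.addAlternative("b", 0.5);
        // 1 - (0.4 + 0.5) = 0.1, i.e. a 10% chance that the tuple does not exist.
        System.out.println("P(tuple absent) = " + (1.0 - t.existenceProbability()));
    }
}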
Constructing a decision tree classifier using the given representative (certain) values as they are, without any modification, is called the certain decision tree (CDT) classifier construction approach. Constructing a decision tree classifier by modelling uncertain data with a probability density function with equal probabilities is called the decision tree classifier construction on uncertain data (DTU) approach.
Decision tree classifiers constructed using representative values are less accurate than decision tree classifiers constructed using approximated probability density function values generated for each representative value of each numerical attribute in the training data set. The originally collected values are approximately regenerated by using the probability density function modelling technique with equal probabilities. Hence, the probability density function modelling technique models data uncertainty appropriately.
When data mining is performed on uncertain data, different types of uncertain data modelling techniques have to be considered in order to obtain high quality data mining results. One of the current challenges in the field of data mining is therefore to develop good data mining techniques that can analyse uncertain data, that is, techniques that can manage data uncertainty.
2. INTRODUCTION TO UNCERTAIN DATA
[Fig 1.1 Taxonomy of Uncertain Data Mining: data mining is divided into mining on precise data and mining on uncertain data; uncertain data mining covers association rule mining, data classification, data clustering (hard clustering and fuzzy clustering), and other data mining methods.]
Many real life applications contain uncertain
data. With data uncertainty data values are no
longer atomic or certain. Data is often associated
with uncertainty because of measurement errors,
sampling errors, repeated measurements, and
outdated data sources [3]. When data mining
techniques are applied on uncertain data it is called
Uncertain Data Mining (UDM).
Sometimes, for preserving privacy, certain data values are explicitly transformed into ranges of values. For example, for preserving privacy the certain value of the true age of a person is represented as a range [16, 26] or 16 – 26.
Tuple No. | Marks (Numerical) | Result (Categorical)   | Class Label (Categorical)
1         | 550 – 600         | (0.8, 0.1, 0.1, 0.0)   | (0.8, 0.2)
2         | 222 – 444         | (0.6, 0.2, 0.1, 0.1)   | (0.5, 0.5)
3         | 470 – 580         | (0.7, 0.2, 0.1, 0.0)   | (0.9, 0.1)
4         | 123 – 290         | (0.4, 0.2, 0.3, 0.1)   | (0.7, 0.3)
5         | 345 – 456         | (0.6, 0.2, 0.1, 0.1)   | (0.8, 0.2)
6         | 111 – 333         | (0.3, 0.3, 0.2, 0.2)   | (0.9, 0.1)
7         | 200 – 280         | (0.3, 0.3, 0.2, 0.2)   | (0.7, 0.3)
8         | 500 – 580         | (0.7, 0.2, 0.1, 0.0)   | (0.5, 0.5)
9         | 530 – 590         | (0.7, 0.3, 0.0, 0.0)   | (0.6, 0.4)
10        | 450 – 550         | (0.7, 0.2, 0.1, 0.0)   | (0.4, 0.6)
11        | 150 – 250         | (0.3, 0.3, 0.2, 0.2)   | (0.2, 0.8)
12        | 180 – 260         | (0.4, 0.2, 0.2, 0.2)   | (0.1, 0.9)
Table 1.1 Example of numerical uncertain and categorical uncertain attributes.
The Marks attribute is a numerical uncertain attribute (NUA) and the Result attribute is a categorical uncertain attribute (CUA). The class label can also be either numerical or categorical.
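As a minimal sketch of how such a training tuple could be represented in code (the class and field names below are ours, not the paper's), the numerical uncertain attribute is kept as a [min, max] range and the categorical uncertain attribute and class label as probability vectors:

// Illustrative sketch of one training tuple from Table 1.1.
public class UncertainTuple {
    final double marksLow, marksHigh;   // numerical uncertain attribute: range of marks
    final double[] resultDistribution;  // categorical uncertain attribute: probabilities over categories
    final double[] classDistribution;   // uncertain class label: probabilities over class labels

    UncertainTuple(double marksLow, double marksHigh,
                   double[] resultDistribution, double[] classDistribution) {
        this.marksLow = marksLow;
        this.marksHigh = marksHigh;
        this.resultDistribution = resultDistribution;
        this.classDistribution = classDistribution;
    }

    public static void main(String[] args) {
        // Tuple 1 of Table 1.1: Marks 550 - 600, Result (0.8, 0.1, 0.1, 0.0), Class (0.8, 0.2)
        UncertainTuple t1 = new UncertainTuple(550, 600,
                new double[]{0.8, 0.1, 0.1, 0.0},
                new double[]{0.8, 0.2});
        System.out.println("Marks range: [" + t1.marksLow + ", " + t1.marksHigh + "]");
    }
}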
3. PROBLEM DEFINITION
In many real life applications, information cannot be ideally represented by point data only. Classical decision tree algorithms were developed for certain data values in the numerical attributes of the training data sets, and these values are only representatives of the originally collected data values. Data uncertainty was not considered during the development of many data mining algorithms, including the decision tree classification technique. Thus, traditional classification techniques do not directly handle uncertain data. This is the problem with the existing certain (traditional or classical) decision tree classifiers.
Data uncertainty is usually modelled by a probability density function, and a probability density function is represented by a set of values rather than one single representative, average, or aggregate value. Thus, in uncertain data management, training data tuples are typically represented by probability distributions rather than deterministic values.
Currently existing decision tree classifiers assume that attribute values in the tuples are known, precise point values. In real life, data values inherently suffer from value uncertainty (or attribute uncertainty). Hence, certain (traditional or classical) decision tree classifiers produce incorrect or less accurate data mining results.
Decision tree classifiers constructed using representative (average) values are less accurate than decision tree classifiers constructed using approximated probability density function values with equal probabilities for each representative value of each numerical attribute. The originally collected values are approximately regenerated by using the probability density function with equal probabilities. Hence, the probability density function models the data uncertainty appropriately.
A training data set can have both Uncertain Numerical Attributes (UNAs) and Uncertain Categorical Attributes (UCAs), and both training tuples and test tuples may contain uncertain data. As data uncertainty widely exists in real life, it is important to develop accurate and more efficient data mining techniques for uncertain data management.
The present study proposes an algorithm
called Decision Tree classifier construction on
Uncertain Data (DTU) to improve performance of
Certain Decision Tree (CDT). DTU uses probability
density function with equal probabilities in
modelling data uncertainty in the values of
numerical attributes of training data sets. The
performance of these two algorithms is compared experimentally through simulation, and the performance of DTU proves to be better.
4. EXISTING ALGORITHM
4.1 Certain Decision Tree (CDT) Algorithm
Description
In the existing decision tree classifier construction, each tuple ti is associated with a set of attribute values of the training data set, and the ith tuple is represented as

ti = (ti,1, ti,2, ti,3, ..., ti,k, class label)

where
- 'i' is the tuple number and 'k' is the number of attributes in the training data set,
- ti,1 is the value of the first attribute of the ith tuple,
- ti,2 is the value of the second attribute of the ith tuple, and so on.

To find the class label of an unseen (new) test tuple, the decision tree is traversed from the root node to a specific leaf node:

ttest = (a1, a2, a3, ..., ak, ?)
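Classification of a test tuple thus follows crisp tests along one root-to-leaf path. A minimal illustrative sketch of that traversal (the node fields and method names below are ours, not defined in the paper):

// Classify a test tuple by walking from the root to a leaf along crisp tests.
public class TreeNode {
    int splitAttribute;      // index of the split attribute (internal nodes)
    double splitPoint;       // split point value z (internal nodes)
    TreeNode left, right;    // children; both null for a leaf
    String classLabel;       // class label stored at a leaf

    boolean isLeaf() { return left == null && right == null; }

    // Classify a test tuple (a1, a2, ..., ak) by following the test at each internal node.
    static String classify(TreeNode root, double[] testTuple) {
        TreeNode node = root;
        while (!node.isLeaf()) {
            node = (testTuple[node.splitAttribute] <= node.splitPoint) ? node.left : node.right;
        }
        return node.classLabel;
    }
}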
The certain decision tree (CDT) algorithm
constructs a decision tree classifier by splitting each
node into left and right nodes. Initially, the root
node contains all the training tuples. The process of
partitioning the training data tuples in a node into
two nodes based on the best split point value zT of
the best split attribute and storing the resulting
tuples in its left and right nodes is referred to as
splitting. Whenever a further split of a node is not required, the node becomes a leaf node, also referred to as an external node. The splitting process at each internal node is carried out recursively until no further split is required. Continuous valued attributes must be discretized prior to attribute selection [7]. Further splitting of an internal node is stopped if all the tuples in the internal node have the same class label or if splitting does not result in nonempty left and right nodes.
During decision tree construction within
each internal node only crisp and deterministic tests
are applied. Entropy is a metric or function that is
used to find the degree of dispersion of training data
tuples in a node. In decision tree construction the
goodness of a split is quantified by an impurity
measure [2]. One possible function to measure
impurity is entropy [2]. Entropy is an information-based measure; it depends only on the proportions of tuples of each class in the training data set. Entropy is taken as the dispersion measure because it is predominantly used for constructing decision trees. In most cases, entropy finds the best split and balances node sizes so that both the left and right nodes after the split are as pure as possible.
The accuracy and execution time of the certain decision tree (CDT) algorithm for 9 data sets are shown in Table 6.2.
Entropy is calculated using the formula

Entropy(z, Aj) = (L/S) * [ - Σc (Lc/L) log2 (Lc/L) ] + (R/S) * [ - Σc (Rc/R) log2 (Rc/R) ]

where
- Aj is the splitting attribute and z is the candidate split point,
- L is the total number of tuples to the left side of the split point z,
- R is the total number of tuples to the right side of the split point z,
- Lc is the number of tuples belonging to class label c to the left side of the split point z,
- Rc is the number of tuples belonging to class label c to the right side of the split point z,
- S is the total number of tuples in the node.

The entropy of a single node is Entropy(T) = - Σi pi log2 pi, where pi is the proportion of tuples belonging to the ith class.
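The following is a minimal sketch of this computation: given the per-class tuple counts on the left and right of a candidate split point, it returns the weighted entropy of the split. The class and method names are ours, not the paper's.

// Weighted entropy of a binary split, matching the formula above.
public class SplitEntropy {

    // Entropy of one side from its per-class counts: -sum_c (n_c/N) log2 (n_c/N).
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : classCounts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // Weighted split entropy: (L/S)*H(left) + (R/S)*H(right).
    static double splitEntropy(int[] leftCounts, int[] rightCounts) {
        int left = 0, right = 0;
        for (int c : leftCounts) left += c;
        for (int c : rightCounts) right += c;
        int total = left + right;
        return (double) left / total * entropy(leftCounts)
             + (double) right / total * entropy(rightCounts);
    }

    public static void main(String[] args) {
        // e.g. 8 tuples of class 1 and 2 of class 2 go left; 1 and 9 go right
        System.out.println(splitEntropy(new int[]{8, 2}, new int[]{1, 9}));
    }
}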
4.2 Pseudo code for Certain Decision Tree (CDT)
Algorithm
CERTAIN_DECISION_TREE(T)
  If all the training tuples in the node T have the same class label then
    set T as a leaf node
    return(T)
  If tuples in the node T have more than one class then
    Find_Best_Split(T)
    For i ← 1 to datasize[T] do
      If split_attribute_value[ti] <= split_point[T] then
        Add tuple ti to left[T]
      Else
        Add tuple ti to right[T]
    If left[T] = NIL or right[T] = NIL then
      Create empirical probability distribution of the node T
      return(T)
    If left[T] != NIL and right[T] != NIL then
      CERTAIN_DECISION_TREE(left[T])
      CERTAIN_DECISION_TREE(right[T])
    return(T)
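The routine Find_Best_Split(T) is not spelled out in the paper. Below is one possible sketch under the usual entropy-based approach, assuming candidate split points are taken midway between consecutive sorted attribute values and reusing the SplitEntropy helper sketched earlier; all names and the data layout are our assumptions.

// A possible Find_Best_Split for CDT: for every attribute, evaluate candidate split
// points between consecutive sorted values and keep the pair with minimum split entropy.
import java.util.Arrays;

public class FindBestSplit {

    // data[i][j] = value of attribute j in tuple i; labels[i] in 0..numClasses-1.
    // Returns {best attribute index, best split point, best split entropy}.
    static double[] findBestSplit(double[][] data, int[] labels, int numClasses) {
        int m = data.length, k = data[0].length;
        double bestEntropy = Double.MAX_VALUE;
        int bestAttribute = -1;
        double bestPoint = 0.0;
        for (int j = 0; j < k; j++) {
            final int attr = j;
            double[] values = Arrays.stream(data).mapToDouble(t -> t[attr]).sorted().toArray();
            for (int i = 0; i < m - 1; i++) {
                double z = (values[i] + values[i + 1]) / 2.0;   // candidate split point
                int[] leftCounts = new int[numClasses];
                int[] rightCounts = new int[numClasses];
                for (int t = 0; t < m; t++) {
                    if (data[t][attr] <= z) leftCounts[labels[t]]++;
                    else rightCounts[labels[t]]++;
                }
                double h = SplitEntropy.splitEntropy(leftCounts, rightCounts);
                if (h < bestEntropy) {
                    bestEntropy = h;
                    bestAttribute = attr;
                    bestPoint = z;
                }
            }
        }
        return new double[]{bestAttribute, bestPoint, bestEntropy};
    }
}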
5. PROPOSED ALGORITHM
5.1 Proposed Decision Tree Classification on
Uncertain Data (DTU) Algorithm Description
The procedure for creating the Decision Tree classifier on Uncertain Data (DTU) is the same as that of the Certain Decision Tree (CDT) classifier construction, except that DTU calculates entropies
for all the modelled data values of the numerical
attributes of the training data sets using probability
density function with equal probabilities.
For each value of each numerical attribute, an interval is constructed. Within the interval, a set of ‘n’ sample values is generated using a probability density function based on the Gaussian distribution, with the attribute value as the mean and a standard deviation equal to the length of the interval divided by 6. Entropies are then computed for all ‘n’ sample points within the interval, and the point with minimum entropy is selected.
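A minimal sketch of this sampling step follows: for one representative attribute value, it draws ‘n’ sample values from a Gaussian whose mean is the value and whose standard deviation is the interval length divided by 6, each sample carrying equal probability 1/n. The interval length passed in and the names used are our assumptions.

// Generate n equally probable samples around one representative attribute value.
import java.util.Random;

public class PdfSampling {

    static double[] samplePdfValues(double representativeValue, double intervalLength,
                                    int n, Random rng) {
        double sigma = intervalLength / 6.0;   // std. dev. = interval length / 6, as described above
        double[] samples = new double[n];
        for (int i = 0; i < n; i++) {
            // With sigma = length/6, about 99.7% of samples fall inside the interval.
            samples[i] = representativeValue + sigma * rng.nextGaussian();
        }
        return samples;                        // each sample carries probability 1/n
    }

    public static void main(String[] args) {
        double[] s = samplePdfValues(575.0, 50.0, 10, new Random(42));
        for (double v : s) System.out.printf("%.2f%n", v);
    }
}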
If the training data set contains ‘m’ tuples, then each attribute of the training data set has ‘m’ values. For each attribute, ‘m - 1’ intervals are generated, and within each interval ‘n’ probability density function values with equal probabilities are generated using the Gaussian distribution. The entropy is calculated for all these values, and one best split point is selected for each interval. One optimal best split point is then selected from all the best points of all the intervals of one particular attribute. The same process is repeated for all attributes of the training data set. Finally, one optimal split attribute and one optimal split point are selected from all the possible ‘k’ attributes and k(nm – 1) potential split points. The optimal split attribute and the optimal split point together constitute the optimal split pair.
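As an illustration of this search for a single attribute, the following hedged sketch expands each representative value into ‘n’ equally probable samples (reusing the PdfSampling helper sketched earlier) and evaluates the weighted split entropy at every sampled candidate point. The 1/n weighting of the class counts and all names are our assumptions, not the paper's.

// Search the best split point of one attribute over its sampled (uncertain) values.
import java.util.Arrays;
import java.util.Random;

public class DtuBestSplit {

    // values[i] = representative value of one attribute for tuple i; labels[i] = its class.
    // Returns {best split point, best split entropy}.
    static double[] bestSplitForAttribute(double[] values, int[] labels, int numClasses,
                                          double intervalLength, int n, Random rng) {
        int m = values.length;
        double[][] samples = new double[m][];
        for (int i = 0; i < m; i++) {
            samples[i] = PdfSampling.samplePdfValues(values[i], intervalLength, n, rng);
        }
        // candidate split points: all sampled values, sorted
        double[] candidates = Arrays.stream(samples).flatMapToDouble(Arrays::stream).sorted().toArray();
        double bestEntropy = Double.MAX_VALUE, bestPoint = 0.0;
        for (double z : candidates) {
            double[] leftCounts = new double[numClasses], rightCounts = new double[numClasses];
            for (int i = 0; i < m; i++) {
                for (double v : samples[i]) {          // each sample carries weight 1/n
                    if (v <= z) leftCounts[labels[i]] += 1.0 / n;
                    else rightCounts[labels[i]] += 1.0 / n;
                }
            }
            double h = weightedSplitEntropy(leftCounts, rightCounts);
            if (h < bestEntropy) { bestEntropy = h; bestPoint = z; }
        }
        return new double[]{bestPoint, bestEntropy};
    }

    // Same formula as SplitEntropy.splitEntropy, but with fractional (weighted) counts.
    static double weightedSplitEntropy(double[] left, double[] right) {
        double l = Arrays.stream(left).sum(), r = Arrays.stream(right).sum(), s = l + r;
        return (l / s) * weightedEntropy(left) + (r / s) * weightedEntropy(right);
    }

    static double weightedEntropy(double[] counts) {
        double total = Arrays.stream(counts).sum(), h = 0.0;
        if (total == 0) return 0.0;
        for (double c : counts) {
            if (c > 0) { double p = c / total; h -= p * (Math.log(p) / Math.log(2)); }
        }
        return h;
    }
}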
The decision tree for uncertain data (DTU) algorithm constructs a decision tree classifier by splitting each node into left and right nodes. Initially, the root node contains all the training data tuples. A set of ‘n’ sample values is generated using the probability density function model for each value of each numerical attribute in the training data set and then stored in the root node. Entropy values are computed for k(mn – 1) split points, where k is the number of attributes of the training data set, ‘m’ is the number of training data tuples at the current node T, and ‘n’ is the number of probability density function values generated for each numerical attribute value. The process of partitioning the training data tuples in a node into two subsets based on the best split point value zT of the best split attribute, and storing the resulting tuples in its left and right nodes, is referred to as splitting.
After splitting the root node into left and right sub-nodes, the same process is applied to both the left and right nodes. The recursive process stops when all the tuples in a node have the same class or when a split does not produce two nonempty sub-nodes.
5.2 Pseudo code for Decision Tree Classification
on Uncertain Data (DTU) Algorithm
UNCERTAIN_DATA_DECISION_TREE(T)
  If all the training tuples in the node T have the same class label then
    set T as a leaf node
    return(T)
  If tuples in the node T have more than one class then
    For each value of each numerical attribute in the training data set, construct an interval, find the entropy at the ‘n’ probability density function generated values with equal probabilities in the interval, and select one optimal point (the point with the minimum entropy) in the interval. If the training data set contains ‘m’ tuples then for each numerical attribute ‘m - 1’ probability density function intervals are generated, and finally one best optimal split point is selected from the kn(m – 1) possible potential split points
    Find_Best_Split(T)
    For i ← 1 to datasize[T] do
      If split_attribute_value[ti] <= split_point[T] then
        Add tuple ti to left[T]
      Else
        Add tuple ti to right[T]
    If left[T] = NIL or right[T] = NIL then
      Create empirical probability distribution of the node T
      return(T)
    If left[T] != NIL and right[T] != NIL then
      UNCERTAIN_DATA_DECISION_TREE(left[T])
      UNCERTAIN_DATA_DECISION_TREE(right[T])
    return(T)
DTU can build more accurate decision tree classifiers, but the computational complexity of DTU is about ‘n’ times that of CDT. Hence, DTU is not as efficient as CDT. In terms of accuracy the DTU classifier is better, but in terms of efficiency the certain decision tree (CDT) classifier is better.
To reduce the computational complexity of the DTU classifier, we have proposed a new pruning technique in which entropy is calculated only at one best point for each interval. Hence, the new approach, a pruned version of DTU called PDTU, is more accurate while having approximately the same computational complexity as CDT.
The accuracy and execution time of the certain decision tree (CDT) classifier algorithm for 9 training data sets are shown in Table 6.2, and the accuracy and execution time of the decision tree classification on uncertain data (DTU) classifier algorithm for the same 9 training data sets are shown in Table 6.3. A comparison of execution time and accuracy for the CDT and DTU algorithms over the 9 training data sets is given in Table 6.4 and charted in Figure 6.1 and Figure 6.2 respectively.
6. EXPERIMENTAL RESULTS
A simulation model is developed for evaluating the performance of the two algorithms – the Certain Decision Tree (CDT) classifier and the Decision Tree classification on Uncertain Data (DTU) classifier – experimentally. The training data sets shown in Table 6.1, from the University of California, Irvine (UCI) Machine Learning Repository, are employed for evaluating the performance and accuracy of these algorithms.
No | Data Set Name | Training Tuples | No. of Attributes | No. of Classes
1  | Iris          | 150             | 4                 | 3
2  | Glass         | 214             | 9                 | 6
3  | Ionosphere    | 351             | 32                | 2
4  | Breast        | 569             | 30                | 2
5  | Vehicle       | 846             | 18                | 4
6  | Segment       | 2310            | 14                | 7
7  | Satellite     | 4435            | 36                | 6
8  | Page          | 5473            | 10                | 5
9  | Pen Digits    | 7494            | 16                | 10
Table 6.1 Data Sets from the UCI Machine Learning Repository
In all our experiments we have used training
data sets from the UCI Machine Learning
Repository [6]. The simulation model is
implemented in Java 1.7 on a Personal Computer
with 3.22 GHz Pentium Dual Core processor (CPU),
and 2GB of main memory (RAM). The
performance measures, accuracy and execution time,
for the above said algorithms are presented in Table
6.2 to Table 6.4 and Figure 6.1 to Figure 6.2.
No | Data Set Name | Total Tuples | Accuracy (%) | Execution Time
1  | Iris          | 150          | 97.4422      | 1.1
2  | Glass         | 214          | 88.4215      | 1.3
3  | Ionosphere    | 351          | 84.4529      | 1.47
4  | Breast        | 569          | 96.9614      | 2.5678
5  | Vehicle       | 846          | 78.9476      | 6.9
6  | Segment       | 2310         | 97.0121      | 29.4567
7  | Satellite     | 4435         | 83.94        | 153.234
8  | Page          | 5473         | 97.8762      | 36.4526
9  | Pen Digits    | 7494         | 90.2496      | 656.164
Table 6.2 Certain Decision Tree (CDT) Accuracy and Execution Time
No | Data Set Name | Total Tuples | Accuracy (%) | Execution Time
1  | Iris          | 150          | 98.5666      | 1.1
2  | Glass         | 214          | 95.96        | 1.2
3  | Ionosphere    | 351          | 98.128       | 16.504
4  | Breast        | 569          | 97.345       | 24.223
5  | Vehicle       | 846          | 96.01281     | 35.365
6  | Segment       | 2310         | 98.122       | 212.879
7  | Satellite     | 4435         | 85.891       | 294.96
8  | Page          | 5473         | 98.8765      | 289.232
9  | Pen Digits    | 7494         | 91.996       | 899.3491
Table 6.3 Uncertain Decision Tree (DTU) Accuracy and Execution Time
No | Data Set Name | CDT Accuracy (%) | DTU Accuracy (%) | CDT Execution Time | DTU Execution Time
1  | Iris          | 97.4422          | 98.566           | 1.1                | 1.1
2  | Glass         | 88.4215          | 95.96            | 1.3                | 1.2
3  | Ionosphere    | 84.4529          | 98.128           | 1.47               | 16.504
4  | Breast        | 96.9614          | 97.345           | 2.5678             | 24.223
5  | Vehicle       | 78.9476          | 96.281           | 6.9                | 35.365
6  | Segment       | 97.0121          | 98.122           | 29.4567            | 212.879
7  | Satellite     | 83.94            | 85.891           | 153.234            | 294.96
8  | Page          | 97.8762          | 98.847           | 36.4526            | 289.232
9  | Pen Digits    | 90.2496          | 91.996           | 656.164            | 899.434
Table 6.4 Comparison of accuracy and execution times of CDT and DTU

[Figure 6.1 Comparison of execution times of CDT and DTU]

[Figure 6.2 Comparison of classification accuracies of CDT and DTU]
7. CONCLUSIONS
7.1 Contributions
The performance of the existing traditional (classical or certain) decision tree (CDT) classifier is verified experimentally through simulation. A new decision tree classifier construction algorithm, called Decision Tree Classification on Uncertain Data (DTU), is proposed and compared with the existing Certain Decision Tree (CDT) classifier. It is found experimentally that the classification accuracy of the proposed DTU algorithm is much better than that of the CDT algorithm.
7.2 Limitations
The proposed algorithm, Decision Tree Classification on Uncertain Data (DTU) classifier construction, handles only the data uncertainty present in the values of numerical (continuous) attributes of the training data sets, not the data uncertainty present in categorical (discrete) attributes. Also, the computational complexity of DTU is very high, and the execution time of Decision Tree Classification on Uncertain Data (DTU) is high for many of the training data sets.
7.3 Suggestions for future work
Special techniques are needed to handle the different types of data uncertainty present in the training data sets. Special methods are also needed to handle data uncertainty in categorical attributes. Special pruning techniques are needed to reduce the execution time of Decision Tree Classification on Uncertain Data (DTU). Also, special techniques are needed to find and correct random noise and other errors in categorical attributes.
REFERENCES
[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, second edition, Morgan Kaufmann, 2006, pp. 285–292.
[2] Ethem Alpaydin, Introduction to Machine Learning, second edition, PHI / MIT Press, pp. 185–188.
[3] Smith Tsang, Ben Kao, Kevin Y. Yip, Wai-Shing Ho, and Sau Dan Lee, "Decision Trees for Uncertain Data", IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 1, January 2011.
[4] Hsiao-Wei Hu, Yen-Liang Chen, and Kwei Tang, "A Dynamic Discretization Approach for Constructing Decision Trees with a Continuous Label", IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, November 2009.
[5] R.E. Walpole and R.H. Myers, Probability and Statistics for Engineers and Scientists, Macmillan Publishing Company, 1993.
[6] A. Asuncion and D. Newman, UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html, 2007.
[8] U.M. Fayyad and K.B. Irani, "On the Handling of Continuous-Valued Attributes in Decision Tree Generation", Machine Learning, vol. 8, pp. 87–102, 1996.