Handling Uncertain Data By Using Decision Tree Algorithms

1 Aarati R. Patil, 2 Nilesh V. Patil, 3 Kamini S. Patil
1 BE Student of NMU, Jalgaon, AP-India, [email protected]
2 BE Student of NMU, Jalgaon, AP-India, [email protected]
3 BE Student of NMU, Jalgaon, AP-India, [email protected]
Abstract-- Traditional decision tree classifiers work with data whose values are known and precise. The proposed approach extends such classifiers to handle data with uncertain information. Sources of uncertain data include measurement and analysis errors, data staleness, and repeated measurements. The approach extends classical decision tree building algorithms to handle data tuples with uncertain values. Since processing probability density functions (PDFs) is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU demanding than on certain data. To tackle this problem, a series of pruning techniques is proposed that can greatly improve construction efficiency.
Index Terms-- Data Mining, Uncertain Data, Classification,
Distribution, Averaging, Decision Tree, Pruning Techniques.
I. INTRODUCTION

Data uncertainty is common in emerging applications. It can be caused by multiple factors, including measurement precision limitations and application-level issues such as outdated sources and imprecise measurements. Nowadays, handling huge amounts of historical data is very important, and data mining is a very effective tool for extracting knowledge from stored historical information. Classification is one of the most effective data mining techniques for handling uncertain data. Classification means the separation or ordering of objects into classes, and it commonly involves three data mining techniques: Naive Bayes, Neural Networks, and Decision Tree algorithms.

One of the important classification methods is the decision tree. A decision tree is a structured tree in which each internal node represents a test on an attribute value (e.g., age), each branch represents an outcome of that test, and each leaf represents a class. Decision tree models can be characterized in two ways: descriptive and predictive. In data mining, decision trees come in two types, classification trees and regression trees; a classification tree is used when the predicted output is the class to which the data belongs.

The rest of the paper is organized as follows. This introduction covers data mining, classification, decision trees, and background; the motivation section states the objectives; the literature survey reviews work related to the proposed approach; and the proposed system, the implemented algorithms, and the results are described in the subsequent sections.
II. MOTIVATION

Certain data can be defined as data whose values never change; certain data is always fixed. The values associated with certain (fixed) data sets are fixed, and their attribute values never change. Why analyze the performance of such certain data sets? Because these data sets contain historical data, they must be stored properly so that they remain useful to their users. The performance analysis of certain data is a fairly easy task.

The problem arises when uncertainty enters the application. Uncertain data is data that changes randomly from day to day, from user to user, and even from application to application. The values associated with an uncertain data set are continuously updated. Because they are not fixed, analyzing the performance of an uncertain data set becomes difficult due to the changing nature of the data, and efficient analysis is needed for better performance.
A. OBJECTIVES

The need to handle uncertain data is increasing day by day. The performance analysis of uncertain data is done to achieve the following objectives:

1. To implement an algorithm that constructs decision trees from uncertain data using the averaging method.
2. To construct a decision tree model using the distribution-based technique.
3. To obtain satisfactory experimental results by using pruning techniques, even when the data is highly uncertain.
4. To greatly improve the efficiency of building a decision tree model through pruning.
5. To establish a theoretical basis on which pruning techniques can significantly improve the computational efficiency of distribution-based decision tree construction.
III. LITERATURE SURVEY

According to Sunit S. Dongare et al. [1], data carries uncertain information rather than fixed values, and data uncertainty arises from different sources in different applications. In their work, the decision tree is built by the averaging method, which is CPU demanding [1].

Bhosale J. D. and Patil B. M. [2] discussed that data uncertainty is very common in applications such as sensor networks and is caused by various factors like measurement errors and outdated sources. Classification is the most popular data mining technique for this setting and includes the averaging and distribution methods [2].

Richard J. Roiger and Michael W. Geatz [3] studied data mining as the discovery of patterns in data that help explain current behavior or predict future outcomes. They describe the data mining process by explaining how it can be used to solve real problems, and then let readers do their own data mining [3].

Wei Dai and Wei Ji [4] suggested an implementation of the C4.5 decision tree algorithm using MapReduce techniques. A decision tree is a directed (structured) tree with edges and nodes; C4.5 uses information gain values for tree construction [4].

IV. PROPOSED SYSTEM

Constructing decision tree classifiers on data with uncertain numerical attributes is a hard problem, and the proposed system addresses it. The model is designed to achieve the goals described in Section II.A. The block diagram of the proposed system is shown in Figure 1, and the architecture involves the modules explained below.

Figure 1. Complete methodology for the proposed system (block diagram).

A. DATA INSERTION

In many applications, data uncertainty is common. Under uncertainty, the value of a data item is often represented not by a single value but by multiple values forming a probability distribution. In this module, the uncertain data is inserted by the user.

B. GENERATE TREE

Constructing a decision tree on tuples with numerical (decimal point) values is computationally demanding. Given a set of n training tuples with a numerical attribute, there are many binary split points, i.e., ways to partition the tuples, and the computational expense comes from finding the best split point, as sketched below. Using classification, the decision tree is generated.
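To illustrate why finding the best split point is costly, the following sketch (an illustration, not the authors' implementation; all names are ours) computes the information gain of every candidate binary split on a numerical attribute and returns the best one. With n tuples there are up to n-1 candidate split points, each requiring an entropy computation over the partition.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split(values, labels):
    """Try every midpoint between consecutive distinct values as a
    binary split point; return the one with the highest information
    gain. O(n) candidate splits, each costing O(n) to evaluate."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_gain, best_point = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal values
        point = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= point]
        right = [lab for v, lab in pairs if v > point]
        gain = base - ((len(left) / len(pairs)) * entropy(left)
                       + (len(right) / len(pairs)) * entropy(right))
        if gain > best_gain:
            best_gain, best_point = gain, point
    return best_point, best_gain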
C. AVERAGING

A simple way to handle data uncertainty is to abstract the probability distributions by summary statistics such as means and variances; this approach is called Averaging. Averaging uses a greedy algorithm that builds the tree top-down, in the manner of C4.5: it starts at the root node with S being the set of all training tuples, and when processing a node it first checks whether all the tuples in S share the same class label.
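The averaging step can be made concrete with a small sketch. The tuple format below, attributes given as lists of (value, probability) pairs, is our assumption for illustration, not the paper's data format: each attribute's distribution is collapsed to its expected value, after which any classical tree builder (such as the split search sketched above) can be applied.

def average_tuples(uncertain_tuples):
    """Collapse each uncertain attribute, given as a list of
    (value, probability) pairs, to its expected value (mean)."""
    averaged = []
    for attrs, label in uncertain_tuples:
        means = [sum(v * p for v, p in pdf) for pdf in attrs]
        averaged.append((means, label))
    return averaged

# Hypothetical example: one numerical attribute per tuple.
data = [
    ([[(10.0, 0.5), (14.0, 0.5)]], "buy"),   # mean 12.0
    ([[(20.0, 0.2), (25.0, 0.8)]], "sell"),  # mean 24.0
]
print(average_tuples(data))  # [([12.0], 'buy'), ([24.0], 'sell')]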
D. DISTRIBUTION-BASED

An alternative approach is to consider the complete information carried by the probability distributions when building the decision tree. This approach is called Distribution-based.
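In the distribution-based approach, a tuple whose distribution straddles a split point is conceptually sent down both branches with fractional weights. A minimal sketch of that partitioning, assuming discrete PDFs represented as (value, probability) lists with illustrative names, is:

def split_pdf(pdf, split_point):
    """Partition a discrete pdf at split_point, returning
    (left_pdf, left_weight, right_pdf, right_weight), where each
    side keeps its share of the probability mass."""
    left = [(v, p) for v, p in pdf if v <= split_point]
    right = [(v, p) for v, p in pdf if v > split_point]
    lw = sum(p for _, p in left)
    rw = sum(p for _, p in right)
    return left, lw, right, rw

# A tuple with mass on both sides of the split contributes weight
# lw to the left child and rw to the right child.
pdf = [(10.0, 0.5), (14.0, 0.5)]
print(split_pdf(pdf, 12.0))  # ([(10.0, 0.5)], 0.5, [(14.0, 0.5)], 0.5)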
V. ALGORITHMS

The steps of the C4.5 algorithm are as follows [4].

Step 1- Let T be the set of training instances.
Step 2- Select the attribute that best differentiates the instances in T.
Step 3- Create a tree node whose value is the selected attribute. Create child links from this node, where each link represents a unique value of the chosen attribute, and use the link values to further subdivide the instances into subclasses.
Step 4- For each subclass created in step 3:
(a) If the instances in the subclass satisfy predefined criteria, or the set of remaining attribute choices for this path of the tree is empty, specify the classification for new instances following this decision path.
(b) If the subclass does not satisfy the predefined criteria and there is at least one attribute left to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2 [4].
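A compact sketch of these steps for categorical attributes follows. It is a simplified illustration of the recursion above, not Quinlan's full C4.5, which also handles numeric attributes, gain ratio, and missing values.

import math
from collections import Counter

def entropy(rows):
    total = len(rows)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(label for *_, label in rows).values())

def build_tree(rows, attributes):
    """rows: list of (attr_value, ..., label) tuples.
    attributes: indices of attributes still available."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1 or not attributes:       # step 4(a)
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    # Step 2: pick the attribute with the highest information gain.
    def gain(a):
        parts = Counter(row[a] for row in rows)
        rem = sum((n / len(rows)) * entropy([r for r in rows if r[a] == v])
                  for v, n in parts.items())
        return entropy(rows) - rem
    best = max(attributes, key=gain)
    rest = [a for a in attributes if a != best]
    # Steps 3 and 4(b): one branch per value, recurse on each subclass.
    return {(best, v): build_tree([r for r in rows if r[best] == v], rest)
            for v in set(row[best] for row in rows)}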
Next is the pre-pruning algorithm, DPSA (Decision tree Pre-pruning Self-learning Algorithm), which eliminates unwanted nodes from the tree, reduces execution time, and improves efficiency [1].

Input- A decision table S = <U, R, V, f>, where R = C ∪ D.
Output- A decision tree.
Step 1- Calculate the global certainties of the decision table influenced by each of its condition attributes. Then select the maximum one, μc(a), and denote the associated condition attribute as a.
Step 2- Create a node from the whole set of instances of the current decision table and use condition attribute a to divide the present decision table. Each class determines one branch of the node. Calculate the certainties of each condition attribute class and denote them as K(Eai), where i = 1, 2, ..., m and m = |U/IND(a)|.
Step 3- For each class: if K(Eai) >= μc(a), or there is no other condition attribute left, create a leaf node for this class; else
{
Generate a sub decision table by selecting the instances corresponding to this class and deleting attribute a from the current decision table.
Select the maximum global certainty μc(b) of the sub decision table influenced by each condition attribute, and denote the associated attribute as b.
If μc(b) = K(Eai), then create a leaf node for this class;
Else let a = b, μc(a) = μc(b), take the sub decision table generated above as the current decision table, and go to step 2.
}
Step 4- Return the decision tree [1].
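The paper does not define the certainty measures μc(a) and K(Eai) in full, so the sketch below shows only the generic pre-pruning pattern that DPSA follows: stop expanding a node as soon as its dominant class reaches a certainty threshold. The purity function and threshold are stand-ins for the rough-set certainty measures, and the attribute choice is simplified.

from collections import Counter

def certainty(labels):
    """Fraction of the dominant class - a stand-in for K(Eai)."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def build_pruned(rows, attributes, threshold):
    """Pre-pruning: make a leaf as soon as the node's certainty
    reaches the threshold, instead of splitting further."""
    labels = [row[-1] for row in rows]
    if certainty(labels) >= threshold or not attributes:
        return Counter(labels).most_common(1)[0][0]
    a, *rest = attributes  # illustrative; DPSA picks a by maximum certainty
    return {(a, v): build_pruned([r for r in rows if r[a] == v], rest, threshold)
            for v in set(row[a] for row in rows)}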
VI. RESULTS OF EXPERIMENT

After executing the C4.5 algorithm, the developed system generates the decision tree of the training dataset, and after executing the pruning method, the unwanted nodes are eliminated from the resulting tree. Figure 2 shows a bar graph of the share market analysis: for each day it compares the real share values with the values predicted by the C4.5 algorithm, and the resulting prediction helps the customer decide whether or not to invest.

Figure 2. Analysis of share market for 7 days using a bar graph ("Prediction Of Shares For Seven Days": real vs. predicted share values, roughly 1000-1500, over Days 1-6).

VII. CONCLUSION

Two decision tree techniques are used, averaging and distribution-based, both based on the C4.5 algorithm, in which measures such as information entropy and information gain are calculated. The approach therefore advocates that data be collected and stored with its complete PDF information intact. Performance is an important issue because of the huge amount of information to be processed and the more complicated entropy computations involved, so a series of pruning techniques is devised to improve tree construction efficiency. These algorithms have been experimentally verified to be highly effective at reducing execution time. Some of the pruning techniques are generalizations of analogous techniques for handling decimal point valued data; others, such as pruning by bounding and end-point sampling, also help reduce execution time. Although the proposed techniques are primarily designed to handle uncertain data, they are also useful for constructing decision trees with classification algorithms when there are tremendous numbers of data tuples.
VIII. REFERENCES

[1] Sunit S. Dongare et al., "Analysis on Uncertain Data of Share Market using Decision Tree and Pruning Algorithm", IJCEA, ISSN 2321-3469, Vol. VI, Issue III, June 2014.
[2] Bhosale J. D. and Patil B. M., "Performance Analysis on Uncertain Data using Decision Tree", IJCA (0975-8887), Vol. 96, No. 7, June 2014.
[3] Richard J. Roiger and Michael W. Geatz, "Data Mining: A Tutorial Based Primer", Pearson Education.
[4] Wei Dai and Wei Ji, "A Map Reduce Implementation of C4.5 Decision Tree Algorithm", IJDTA, Vol. 7, No. 1 (2014), pp. 49-60.
[5] J. R. Quinlan, "Induction of Decision Trees", Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[6] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993, ISBN 1-55860-238-0.