International Journal of Conceptions on Electrical and Electronics Engineering
Vol. 1, Issue 1, Oct 2013; ISSN: 2345 – 9603
Review on Data Stream Classification Algorithm
(Hoeffding Tree, VFDT, CVFDT)
Kirankumar Patel
Information Technology Department,
Ganpat University
Mehsana, Gujarat, India
[email protected]
Abstract— The data stream model has recently emerged in response to the problem of continuously arriving data. An algorithm processing a stream has no control over the order of the examples it sees and must update its model incrementally as each example is inspected. The performance of data stream classification is measured in terms of processing speed, memory use and accuracy. A classification algorithm must also meet several requirements in order to work under these assumptions and be suitable for learning from data streams: it must process one example at a time and inspect it only once; use a limited amount of memory; work in a limited amount of time; and be ready to predict at any point. Studying the theoretical properties of such algorithms is therefore useful and enables new developments. In this review paper we study different classification algorithms, namely the Hoeffding tree, VFDT and CVFDT, and compare these algorithms.
Keywords— Decision trees, Hoeffding bounds, concept drift, incremental learning
I. INTRODUCTION
There are a number of algorithms for mining data that do not fit in main memory, but these algorithms have only been tested on a few million examples. Today, however, telecommunications companies connect millions of calls every day, large banks process millions of ATM and credit card operations, and popular Web sites log millions of hits, creating massive amounts of data. Current data mining systems are not equipped to cope with data volumes of this size. In many cases, these large volumes of data can be mined for interesting and relevant information in a wide variety of applications. When the volume of the underlying data is very large, it leads to a number of computational and mining challenges: (1) with increasing volume of the data, it is no longer possible to process the data efficiently using multiple passes; rather, one can process a data item at most once, which places constraints on the implementation of the underlying algorithms. Stream mining algorithms therefore typically need to be designed so that they work with one pass over the data. (2) In most cases, there is an inherent temporal component to the stream mining process, because the data may evolve over time; this behaviour of data streams is referred to as temporal locality. A straightforward adaptation of one-pass mining algorithms may therefore not be an effective solution, and stream mining algorithms need to be carefully designed with a clear focus on the evolution of the underlying data.
Golab and Özsu (2003) define a data stream as follows: "A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety."

Stream data also exhibits a number of characteristic application properties: massive volumes of data (several terabytes); records arriving at a rapid rate; most of the data never being seen by a human; and the need for near-real-time analysis of data feeds.

II. ISSUES IN STREAM MINING

The first issue is limited computational resources: in many applications, the computational power and memory at hand do not measure up to the massive amount of data in the input stream. For example, in a single day Google serviced more than 150 million searches, Walmart executed 20 million sales transactions, and Telstra generated 15 million call records. Traditional data mining algorithms, however, assume that the available resources will always match the amount of data they process. This assumption does not hold in data stream mining: stream mining algorithms must learn fast and consume few memory resources. Another characteristic of data streams is that the data is no longer a snapshot but a continuous stream, which means that the concept underlying the data may change over time. For effective decision making, stream mining must adapt to concept change. For example, when customer purchasing patterns change, marketing strategies based on outdated transaction data must be modified to reflect current customer needs. Table I contrasts the assumptions of traditional data mining with the challenges of stream mining.

Problems in Mining Data Streams

TABLE I PROBLEMS IN DATA STREAMS

Traditional data mining techniques usually require | Challenges of stream mining
Entire data set to be present | Impractical to store the whole data
Random access (or multiple passes) to the data | Random access is expensive
Much time per data item | Only simple calculation per data item, due to time and space constraints

The continuous stream of information has thus challenged our storage, computation and communication capabilities in computing systems, and effective processing of stream data requires new data structures, techniques and algorithms. Because we have only a finite amount of space to store stream data, we often trade off between accuracy and storage. These challenges can be classified into five categories, as shown in Table II. [4]
TABLE II STREAM MINING CHALLENGES CLASSIFIED INTO FIVE CATEGORIES

No. | Issue | Challenge | Approach
1 | Memory management | Fluctuating and irregular data arrival rate, varying over time | Summarization techniques
2 | Data preprocessing | Quality of the mining result and automation of preprocessing techniques | Light-weight preprocessing techniques
3 | Compact data structure | Limited memory size and large volume of the data stream | Incremental maintenance of data structures; novel indexing, storage and querying techniques
4 | Resource awareness | Limited resources such as storage and computation capabilities | AOG (algorithm output granularity)
5 | Visualization of results | Difficulty of data analysis and quick decision making by the user | Still a research issue (one proposed approach: intelligent monitoring)

III. CLASSIFICATION ALGORITHMS

There are a number of algorithms for classification. Listed below are some algorithms and approaches for the classification of data streams, based on their data mining tasks:

 The Hoeffding tree algorithm, which is based on decision trees.
 The GEMM and FOCUS algorithms, whose mining tasks are decision trees and frequent itemsets.
 The OLIN algorithm, which uses info-fuzzy techniques for building a tree-like classification model.
 The VFDT (Very Fast Decision Tree) and CVFDT (Concept-Adapting Very Fast Decision Tree) algorithms, which work on the decision tree task.
 A classifier ensemble approach.
 On-demand stream classification, which uses micro-clusters, where each micro-cluster is associated with a specific class label that defines the class label of the points in it.

We discuss the Hoeffding tree algorithm in section IV. Section V discusses the very fast decision tree (VFDT) algorithm together with its advantages and disadvantages, and section VI discusses the concept-adapting very fast decision tree (CVFDT) algorithm.
IV. HOEFFDING TREE

Before presenting the Hoeffding tree algorithm, we first define the classification problem: given a set of training examples of the form (x, y), where x is a vector of d attributes and y is a discrete class label, our goal is to produce from the examples a model y = f(x) that predicts the class y of future examples x with high accuracy. Decision tree learning is one of the most effective classification methods. A decision tree is learned by recursively replacing leaves with test nodes, starting at the root. Each node contains a test on the attributes, each branch from a node corresponds to a possible outcome of the test, and each leaf contains a class prediction. Traditional learners store all training data in main memory, and it is expensive to repeatedly read it from disk when learning complex trees, so our goal is to design decision tree learners that read each example at most once and use a small constant time to process it.
The key observation is that, to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node. Thus the root attribute is chosen from the first examples; subsequent examples are passed down to the corresponding leaves and used to choose the attributes there, and so on recursively. The Hoeffding bound is used to decide how many examples are enough at each node. Consider a real-valued random variable a whose range is R, and suppose we have n independent observations of a with computed mean ā. The Hoeffding bound states that, with probability 1 − δ, the true mean of a is at least ā − ε, where

ε = √(R² ln(1/δ) / (2n))
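To make the bound concrete, the following short Python sketch (an illustration of our own; the function name and the numeric values are assumptions, not from the paper) computes ε and uses it to decide whether the observed best attribute is reliably better than the second best:

import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for a variable with range R."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Information gain ranges over [0, log2(#classes)], so R = 1 for two classes.
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=5000)

g_best, g_second = 0.25, 0.17  # hypothetical estimated gains at a leaf
if g_best - g_second > eps:
    print(f"split is safe: gap {g_best - g_second:.3f} > eps {eps:.3f}")
else:
    print(f"need more examples: gap {g_best - g_second:.3f} <= eps {eps:.3f}")

With these numbers ε ≈ 0.040, so the observed gap of 0.08 justifies splitting.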
TABLE III HOEFFDING TREE INDUCTION ALGORITHM
----------------------------------------------------------------------
Algorithm: Hoeffding tree induction.
1: Let HT be a tree with a single leaf (the root)
2: for all training examples do
3:   Sort example into leaf l using HT
4:   Update sufficient statistics in l
5:   Increment nl, the number of examples seen at l
6:   if nl mod Nmin = 0 and examples seen at l are not all of the same class then
7:     Compute Gl(Xi) for each attribute Xi
8:     Let Xa be the attribute with highest Gl
9:     Let Xb be the attribute with second-highest Gl
10:    Compute Hoeffding bound ε = √(R² ln(1/δ) / (2n))
11:    if Xa ≠ X∅ and (Gl(Xa) − Gl(Xb) > ε or ε < τ) then
12:      Replace l with an internal node that splits on Xa
13:      for all branches of the split do
14:        Add a new leaf with initialized sufficient statistics
15:      end for
16:    end if
17:  end if
18: end for
----------------------------------------------------------------------
Line 1 initializes the tree data structure, which starts out as a single root node. Lines 2-18 form a loop that is performed for every training example. Every example is filtered down the tree to an appropriate leaf, depending on the tests present in the decision tree built to that point (line 3). This leaf is then updated (line 4); each leaf in the tree holds the sufficient statistics needed to make decisions about further growth, and the statistics updated are those that make it possible to estimate the information gain of splitting on each attribute. Line 5 simply points out that nl, the example count at the leaf, is also updated; technically nl can be computed from the sufficient statistics. Lines 7-11 perform the test described in the previous section, using the Hoeffding bound to decide when a particular attribute has won against all of the others. G is the splitting criterion function (information gain) and Gl is its estimated value at leaf l. In line 11 the test for X∅, the null attribute, is used for pre-pruning, and the test involving τ is used for tie-breaking. If an attribute has been selected as the best choice, lines 12-15 split the node, causing the tree to grow. [1]
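As an illustration, the following compact Python sketch mirrors the structure of the listing above. It is our own simplification (discrete attributes, two classes so that R = 1 for information gain, and no null-attribute pre-pruning test), not the paper's implementation:

import math
from collections import Counter, defaultdict

class HNode:
    """A node that acts as a leaf until it is split."""
    def __init__(self):
        self.split_attr = None                 # None while this node is a leaf
        self.children = {}                     # attribute value -> HNode
        self.class_counts = Counter()          # label -> count
        # sufficient statistics: attr -> value -> Counter of labels
        self.stats = defaultdict(lambda: defaultdict(Counter))

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def info_gain(leaf, attr):
    n = sum(leaf.class_counts.values())
    remainder = sum(sum(c.values()) / n * entropy(c)
                    for c in leaf.stats[attr].values())
    return entropy(leaf.class_counts) - remainder

def hoeffding_bound(R, delta, n):
    return math.sqrt(R * R * math.log(1 / delta) / (2 * n))

def sort_to_leaf(node, x):                     # line 3 of the listing
    while node.split_attr is not None:
        node = node.children.setdefault(x[node.split_attr], HNode())
    return node

def train(root, stream, n_attrs, n_min=200, delta=1e-7, tau=0.05):
    for x, y in stream:                        # line 2
        leaf = sort_to_leaf(root, x)
        leaf.class_counts[y] += 1              # line 4: update statistics
        for a in range(n_attrs):
            leaf.stats[a][x[a]][y] += 1
        n = sum(leaf.class_counts.values())    # line 5: n_l
        if n % n_min == 0 and len(leaf.class_counts) > 1:      # line 6
            gains = sorted((info_gain(leaf, a), a) for a in range(n_attrs))
            (g2, _), (g1, a1) = gains[-2], gains[-1]           # lines 7-9
            eps = hoeffding_bound(1.0, delta, n)               # line 10
            if g1 - g2 > eps or eps < tau:                     # line 11
                leaf.split_attr = a1           # lines 12-15: grow the tree

Calling train(HNode(), examples, n_attrs) on a stream of (x, y) pairs (with at least two attributes) grows the tree in a single pass, splitting a leaf only when the Hoeffding bound or the tie threshold τ allows it.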
A. Strengths and weaknesses of the Hoeffding tree algorithm

The algorithm has the following strengths: first, it scales better than traditional methods, being sublinear thanks to sampling and using very little memory; second, it can make class predictions in parallel, and new examples are added as they arrive.

The Hoeffding tree algorithm also has weaknesses: first, it can spend a lot of time resolving ties; second, its memory use grows with tree expansion and with a large number of candidate attributes.
V. VERY FAST DECISION TREE (VFDT) ALGORITHM

VFDT is a decision-tree learning system based on the Hoeffding tree algorithm. VFDT splits on the current best attribute when the difference in merit between the top candidates is less than a user-specified threshold, which is an improvement over the basic Hoeffding tree algorithm, since spending many examples deciding between attributes of nearly identical merit is wasteful. VFDT also computes G only periodically rather than for every example, deactivates or drops less promising leaves when needed, and rescans old data when time is available.

Some refinements of the algorithm are defined below; a sketch of both follows after this passage.

Tie-breaking: if two or more attributes have nearly identical values of G, many examples would be required to decide between them with high confidence. VFDT can instead declare an effective tie and split on the current best attribute whenever the difference between the best split candidate and all others is less than ε and ε < τ, where τ is a user-defined threshold. [3]

G computation: it is inefficient to recompute G for every new example, because it is unlikely that the decision to split will be made at that specific point. VFDT therefore allows the user to specify a minimum number of new examples Nmin that must be accumulated at a leaf before G is recomputed. This reduces the global time spent on G computations by a factor of Nmin, making learning with VFDT nearly as fast as simply classifying the training examples. [3]
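The two refinements can be captured in a single decision routine. The sketch below is our own illustration of the logic (names and default values are assumptions, not VFDT's actual interface):

import math

def hoeffding_eps(R: float, delta: float, n: int) -> float:
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, n, R=1.0, delta=1e-7, tau=0.05, n_min=200):
    """Decide whether a leaf should split, with VFDT's two refinements:
    - G is only re-evaluated once every n_min examples (cheap G computation);
    - once eps < tau, a tie is declared and the current best attribute wins."""
    if n % n_min != 0:
        return False
    eps = hoeffding_eps(R, delta, n)
    return (g_best - g_second > eps) or (eps < tau)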
TABLE IV VERY FAST DECISION TREE (VFDT) ALGORITHM
----------------------------------------------------------------------
Algorithm VFDT.
Input: δ, the desired probability level.
Output: T, a decision tree.
Init: T ← empty leaf (root)
1. While (TRUE)
2.   Read next example
3.   Propagate the example through the tree from the root to a leaf
4.   Update sufficient statistics at the leaf
5.   If leaf(number of examples) > Nmin
6.     Evaluate the merit of each attribute
7.     Let A1 be the best attribute and A2 the second best
8.     Let ε = √(R² ln(1/δ) / (2n))
9.     If G(A1) − G(A2) > ε
10.      Install a splitting test based on A1
11.      Expand the tree with two descendant leaves
----------------------------------------------------------------------
VFDT's memory use is dominated by the memory required to keep counts for all growing leaves. If d is the number of attributes, v is the maximum number of values per attribute, and c is the number of classes, VFDT requires O(dvc) memory to store the necessary counts at each leaf. If l is the number of leaves in the tree, the total memory required is O(ldvc). This is independent of the number of examples seen, provided that the size of the tree depends only on the "true" concept and is independent of the size of the training set. [2]
VFDT supports all three pruning options: no-pruning, pre-pruning, and post-pruning. When using the no-pruning option, VFDT refines the tree it is learning indefinitely. When using pre-pruning, VFDT takes an additional pre-prune parameter, τ′, and stops growing any leaf where the difference between the best split candidate and all others is less than ε and ε < τ′. If there is ever a point at which all of the leaves in the tree are pre-pruned, the VFDT procedure terminates and returns the tree. VFDT is incremental and anytime, meaning that new examples can be quickly incorporated as they arrive, so a usable model is available after the first few examples and is progressively refined thereafter.

Compared to the Hoeffding tree, VFDT is better in terms of time and memory, with similar accuracy. VFDT was also compared with C4.5 in an experiment using the same memory limit for both (40 MB), VFDT settings Nmin = 200 and τ = 5%, a domain with 2 classes and 100 binary attributes, and fifteen synthetic trees with 2.2k-500k leaves. C4.5 takes 35 seconds to read and process 100k examples, while VFDT takes 47 seconds; in a second experiment, VFDT takes 6377 seconds for 20 million examples. VFDT can thus take advantage of examples beyond the first 100k to greatly improve accuracy.
Still, VFDT does not handle concept drift: as time goes by, different elements may belong to a mental category (for example, what counts as "interesting literature" changes from novice to expert, and "reasonably priced" from student to manager). Concept drift means that the concept about which data is obtained may shift from time to time, each time after some minimum permanence.
VI. CONCEPT-ADAPTING VERY FAST DECISION TREE (CVFDT) ALGORITHM

Most KDD systems, including VFDT, assume that the training data is a sample drawn from a stationary distribution, but most large databases and data streams violate this assumption because of concept drift, as discussed in section V. Our goal is therefore to mine continuously changing data streams.
CVFDT is an extension to VFDT which maintains VFDT's speed and accuracy advantages but adds the ability to detect and respond to changes in the example-generating process. Like other systems with this capability (Widmer and Kubat, 1996; Ganti et al., 2000), CVFDT works by keeping its model consistent with a sliding window of examples. However, unlike most of these other systems, it does not need to learn a new model from scratch every time a new example arrives. Instead, it monitors the quality of its old decisions on the new data and adjusts those that are no longer correct. In particular, when new data arrives, CVFDT updates the sufficient statistics at its nodes by incrementing the counts corresponding to the new examples and decrementing the counts corresponding to the oldest examples in the window (which need to be forgotten). This will statistically have no effect if the underlying concept is stationary. If the concept is changing, however, some splits that were previously selected will no longer appear best, because other attributes will have higher gain on the new data. When this happens, CVFDT begins to grow an alternate subtree with the new best attribute at its root. The old subtree is replaced by the alternate one when the alternate becomes more accurate on new data. A sketch of the window mechanics follows below.
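This is a minimal Python illustration of the sliding-window update, under our own simplifying assumption that the sufficient statistics reduce to per-class counts at a single node:

from collections import Counter, deque

class Node:
    """A tree node keeping per-class sufficient statistics (simplified)."""
    def __init__(self):
        self.class_counts = Counter()

def add_example(node: Node, example) -> None:
    _, label = example
    node.class_counts[label] += 1       # increment counts for the new example

def forget_example(node: Node, example) -> None:
    _, label = example
    node.class_counts[label] -= 1       # decrement counts for the old example

WINDOW_SIZE = 100_000                   # examples kept consistent with the model
window: deque = deque()

def process(example, tree: Node) -> None:
    """Keep the model consistent with a sliding window of examples."""
    add_example(tree, example)
    window.append(example)
    if len(window) > WINDOW_SIZE:       # the oldest example leaves the window
        forget_example(tree, window.popleft())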
TABLE V CVFDT ALGORITHM
----------------------------------------------------------------------
1. Alternate trees for each node in HT start as empty.
2. Process examples from the stream indefinitely.
3. For each example (x, y):
4.   Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.
5.   Add (x, y) to the sliding window of examples.
6.   Remove and forget the effect of the oldest examples if the sliding window overflows.
7.   CVFDTGrow.
8.   Run CheckSplitValidity if f examples have been seen since the last check of alternate trees.
9. Return HT.
----------------------------------------------------------------------
Fig.1 CVFDT algorithm: process each example
A. CVFDTGrow

In CVFDTGrow, for each node reached by the example in the Hoeffding tree, the corresponding statistics at the node are incremented. If enough examples have been seen at the leaf in HT which the example reaches, the attribute with the highest average value of the attribute evaluation measure (information gain or Gini index) is chosen; if the best attribute is not the "null" attribute, a node is created for each possible value of this attribute.

B. Forget Old Example

CVFDT is similar to the VFDT algorithm, but it monitors the validity of its old decisions by maintaining sufficient statistics at every node of the decision tree (DT), whereas VFDT maintains such statistics only at the leaves. Forgetting an old example is slightly complicated by the fact that the tree may have grown or changed since the example was initially incorporated. To avoid forgetting an example from a node that has never seen it, nodes are assigned a unique, monotonically increasing ID as they are created. When an example is added to the window W, the maximum ID of the leaves it reaches in DT and all alternate trees is recorded with it. The example's effects are then forgotten, via ForgetExample(HT, example, maxID), by decrementing the counts in the sufficient statistics of every node the example reaches in DT whose ID is no larger than the stored maximum ID, and, for each alternate tree Talt of such a node, recursively calling Forget(Talt, example, maxID). A sketch of this ID-gated forgetting follows below.
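A minimal Python sketch of the mechanism, under assumptions of our own (a dict-based node and per-class counts standing in for the full sufficient statistics):

from collections import Counter

_next_id = 0

class Node:
    def __init__(self):
        global _next_id
        self.node_id = _next_id          # unique, monotonically increasing ID
        _next_id += 1
        self.class_counts = Counter()
        self.split_attr = None
        self.children = {}               # attribute value -> Node
        self.alternates = []             # alternate subtrees being grown

def forget_example(node, x, y, max_id):
    """Decrement statistics along the example's path, but only at nodes that
    already existed when the example was added (node_id <= max_id)."""
    while node is not None:
        if node.node_id > max_id:        # created after the example arrived:
            return                       # this node never saw the example
        node.class_counts[y] -= 1
        for alt in node.alternates:      # alternate trees forget it as well
            forget_example(alt, x, y, max_id)
        if node.split_attr is None:
            return
        node = node.children.get(x[node.split_attr])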
C. CheckSplitValidity

CheckSplitValidity periodically scans the internal nodes of HT and starts a new alternate tree when a new winning attribute is found at a node. Two safeguards apply: a tighter criterion is used to avoid excessive alternate tree creation, and the total number of alternate trees is limited. A sketch of this scan follows below.
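The shape of this periodic scan is sketched below in Python. It reuses the Node class from the previous sketch; the score function and the cap on alternate trees are placeholders of our own, not CVFDT's actual parameters:

def check_split_validity(node, score, tighter_eps, max_alternates=5):
    """Rescan internal nodes; start an alternate subtree where a different
    attribute now wins. score(node) -> (best_attr, gain_gap) re-evaluates
    the attributes on the current window (caller-supplied stand-in for G)."""
    if node is None or node.split_attr is None:
        return                                   # leaves need no revalidation
    best_attr, gain_gap = score(node)
    if (best_attr != node.split_attr
            and gain_gap > tighter_eps           # tighter criterion than splits
            and len(node.alternates) < max_alternates):
        alt = Node()                             # new alternate rooted here
        alt.split_attr = best_attr
        node.alternates.append(alt)
    for child in node.children.values():
        check_split_validity(child, score, tighter_eps, max_alternates)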
A series of experiments compared CVFDT to VFDT. The goals were to evaluate CVFDT's ability to scale up, to evaluate its ability to deal with varying levels of drift, and to identify and characterize the situations where CVFDT outperforms the other systems. The experiments used synthetic data generated from a changing concept based on a rotating hyperplane. The setup used 5 million training examples, with drift inserted by periodically rotating the hyperplane, about 8% of test points changing label at each drift, 100,000 examples in the window, and 5% noise; results were sampled every 10,000 examples throughout the run and averaged.

Figure 3 compares the accuracy of the algorithms as a function of d, the dimensionality of the space. The reported values were obtained by testing the accuracy of the learned models every 10,000 examples throughout the run and averaging these results. The drift level, reported on the minor axis, is the average percentage of the test set that changes label at each point the concept changes. CVFDT is substantially more accurate than VFDT, by approximately 10% on average, and CVFDT's performance improves slightly with increasing d (from approximately 13.5% when d = 10 to approximately 10% when d = 150). [6]

Fig.3 Error Rate vs. number of Attributes

Figure 4 compares the average size of the models induced during the run shown in Figure 3 (the reported values were generated by averaging after every 10,000 examples, as before). CVFDT's trees are substantially smaller than VFDT's, and the advantage is consistent across all the values of d tried. This simultaneous accuracy and size advantage derives from the fact that CVFDT's tree is built on the 100,000 most relevant examples, while VFDT's is built on millions of outdated examples.

Fig.4 Tree size vs. number of Attributes
VII. CONCLUSIONS

In this paper we studied the issues of stream mining and different stream mining classification algorithms. Among classification algorithms, Hoeffding trees allow learning in small constant time per example and give strong guarantees of high asymptotic similarity to the corresponding batch trees. Real-world use, however, additionally requires working within limited RAM, adapting to time-changing data, and producing high-quality results very quickly. VFDT is a high-performance data mining system based on Hoeffding trees for learning decision trees from high-speed data streams. For time-changing data, VFDT is extended into the CVFDT system, which keeps trees up-to-date with time-changing data streams.
REFERENCES

[1] Albert Bifet, Geoff Holmes, Richard Kirkby and Bernhard Pfahringer (May 2011). Data Stream Mining: A Practical Approach.
[2] Alexey Tsymbal (2004). The problem of concept drift: definitions and related work. Department of Computer Science, Trinity College Dublin, Ireland.
[3] Dariusz Brzezinski (2010). Mining Data Streams with Concept Drift. Poznan University of Technology.
[4] Aggarwal, C., Han, J., Wang, J., and Yu, P.S. (2004). On Demand Classification of Data Streams. In Proceedings of the 2004 International Conference on Knowledge Discovery and Data Mining (KDD '04), Seattle, WA.
[5] Aggarwal, C.C. (2007). Data Streams: Models and Algorithms. Springer.
[6] Domingos, P. and Hulten, G. (2000). Mining High-Speed Data Streams. In Proceedings of the Sixth ACM International Conference on Knowledge Discovery and Data Mining.
[7] Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. (2002). Models and issues in data stream systems. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), Madison, Wisconsin, pp. 1-16.
[8] R. Agrawal and G. Psaila (1995). Active data mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 3-8, Montréal, Canada. AAAI Press.
[9] Khaled Alsabti, Sanjay Ranka, and Vineet Singh (1998). CLOUDS: A decision tree classifier for large datasets. In Knowledge Discovery and Data Mining, pages 2-8.
[10] P. L. Bartlett, S. Ben-David, and S. R. Kulkarni (2000). Learning changing concepts by exploiting the structure of change. Machine Learning, 41:153-174.