International Journal of Conceptions on Electrical and Electronics Engineering, Vol. 1, Issue 1, Oct 2013; ISSN: 2345-9603

Review on Data Stream Classification Algorithms (Hoeffding Tree, VFDT, CVFDT)

Kirankumar Patel
Information Technology Department, Ganpat University, Mehsana, Gujarat, India
[email protected]

Abstract— The data stream model has emerged in response to the problem of continuously arriving data. An algorithm processing a stream has no control over the order in which examples arrive and must update its model incrementally as each example is inspected. The performance of data stream classification is measured in terms of processing speed, memory consumption and accuracy. A classification algorithm must meet several requirements in order to work under these assumptions and be suitable for learning from data streams: it must process an example at a time and inspect it only once, use a limited amount of memory, work in a limited amount of time, and be ready to predict at any point. Studying the theoretical advantages of such algorithms is therefore useful and enables new developments. In this review paper we study three classification algorithms, the Hoeffding tree, VFDT and CVFDT, and compare them.

Keywords— Decision trees, Hoeffding bounds, concept drift, incremental learning

I. INTRODUCTION

There are a number of algorithms for mining data that does not fit in main memory, but these algorithms have only been tested on a few million examples. Today, retailers record millions of transactions every day, telecommunications companies connect millions of calls, large banks process millions of ATM and credit card operations, and popular web sites log millions of hits, producing massive amounts of data that current data mining systems are not equipped to cope with. In many cases, these large volumes of data can be mined for interesting and relevant information in a wide variety of applications.

Golab and Özsu (2003) characterize the setting as follows: "A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety."

Stream applications share several characteristics: massive volumes of data (several terabytes); records arriving at a rapid rate; most of the data never being seen by a human; and the need for near-real-time analysis of the data feeds.

II. ISSUES IN STREAM MINING

The first issue is limited computational resources: in many applications, the computation power and memory at hand do not measure up to the massive amount of data in the input stream. For example, in a single day Google services more than 150 million searches, Walmart executes 20 million sales transactions, and Telstra generates 15 million call records. Traditional data mining algorithms assume that the available resources will always match the amount of data they process; this assumption does not hold in data stream mining, so stream mining algorithms must learn fast and consume few memory resources.

A second characteristic of data streams is that the data is no longer a snapshot but a continuous stream, which means that the concept underlying the data may change over time. For effective decision making, stream mining must be adaptive to concept change. For example, when customer purchasing patterns change, marketing strategies based on outdated transaction data must be modified to reflect current customer needs.
When the volume of the underlying data is very large, it leads to a number of computational and mining challenges. (1) With an increasing volume of data, it is no longer possible to process the data efficiently in multiple passes; rather, each data item can be processed at most once. This places constraints on the implementation of the underlying algorithms, so stream mining algorithms typically need to be designed to work with a single pass over the data. (2) In most cases there is an inherent temporal component to the stream mining process, because the data may evolve over time; this behaviour of data streams is referred to as temporal locality. A straightforward adaptation of one-pass mining algorithms is therefore not an effective solution, and stream mining algorithms need to be carefully designed with a clear focus on the evolution of the underlying data.

TABLE I
PROBLEMS IN DATA STREAMS

Traditional data mining techniques usually require | Challenges of stream mining
Entire data set to be present                      | Impractical to store the whole data
Random access (or multiple passes) to the data     | Random access is expensive
Much time per data item                            | Simple calculation per item due to time and space constraints

The continuous stream of information has thus challenged the storage, computation and communication capabilities of computing systems, and effective processing of stream data requires new data structures, techniques and algorithms. Because we do not have an unbounded amount of space in which to store stream data, we often trade off accuracy against storage. The resulting challenges can be classified into the five categories shown in Table II. [4]

TABLE II
STREAM MINING CHALLENGES IN FIVE CATEGORIES

No. | Issue                    | Challenge                                                        | Approaches
1   | Memory management        | Fluctuating and irregular data arrival rates that vary over time | Summarization techniques
2   | Data preprocessing       | Quality of the mining result and automation of the preprocessing | Light-weight preprocessing techniques
3   | Compact data structures  | Limited memory size versus the large volume of the data stream   | Incremental maintenance of data structures; novel indexing, storage and querying techniques
4   | Resource awareness       | Limited resources such as storage and computation capability     | AOG (algorithm output granularity)
5   | Visualization of results | Difficulty of data analysis and quick decision making by the user| Still a research issue (one proposed approach is intelligent monitoring)

III. CLASSIFICATION ALGORITHMS

A number of algorithms and approaches exist for classifying data streams; the following are grouped by their data mining task:

* The Hoeffding tree algorithm, which is based on decision trees.
* The GEMM and FOCUS algorithms, whose mining tasks are decision trees and frequent itemsets.
* The OLIN algorithm, which uses info-fuzzy techniques to build a tree-like classification model.
* The VFDT (Very Fast Decision Tree) and CVFDT (Concept-Adapting Very Fast Decision Tree) algorithms, which work on the decision tree task.
* Classifier ensemble approaches.
* On-demand stream classification, which uses micro-cluster ideas: each micro-cluster is associated with a specific class label, which defines the class label of the points in it.

We discuss the Hoeffding tree algorithm in Section IV, the VFDT algorithm together with its advantages and disadvantages in Section V, and the CVFDT algorithm in Section VI.

IV. HOEFFDING TREE

Before presenting the Hoeffding tree algorithm we first define the classification problem: given a set of training examples of the form (x, y), where x is a vector of d attributes and y is a discrete class label, the goal is to produce from the examples a model y = f(x) that predicts the class y of future examples x with high accuracy. Decision tree learning is one of the most effective classification methods. A decision tree is learned by recursively replacing leaves with test nodes, starting at the root. Each internal node contains a test on an attribute, each branch from a node corresponds to a possible outcome of the test, and each leaf contains a class prediction; a minimal sketch of this structure is given below.
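To make the tree structure concrete, the following minimal Python sketch (our illustration; the class and function names are ours, not from any of the surveyed systems) represents internal nodes as attribute tests with one child per outcome, and leaves as stored class predictions:

----------------------------------------------------------------------
# Minimal sketch of the decision-tree structure described above.
class Node:
    def __init__(self, attribute=None, prediction=None):
        self.attribute = attribute    # index of the tested attribute (None at a leaf)
        self.children = {}            # attribute value -> child Node
        self.prediction = prediction  # class label stored at a leaf

    def is_leaf(self):
        return self.attribute is None

def predict(node, x):
    """Filter example x (a sequence of attribute values) down to a leaf."""
    while not node.is_leaf():
        node = node.children[x[node.attribute]]
    return node.prediction

# Example: a one-level tree (a stump) testing attribute 0.
root = Node(attribute=0)
root.children = {"sunny": Node(prediction="play"),
                 "rainy": Node(prediction="stay in")}
print(predict(root, ["sunny"]))   # -> play
----------------------------------------------------------------------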
Traditional decision tree learners assume that all training data can be stored in main memory, or else must repeatedly read it from disk, which is expensive when learning complex trees. Our goal is therefore to design decision tree learners that read each example at most once and use a small constant time to process it.

The key observation is that finding the best attribute at a node requires only a small subset of the training examples that pass through that node. The first examples are used to choose the root attribute; once the root attribute is chosen, subsequent examples are passed down to the corresponding leaves and used to choose the attributes there, and so on recursively. The Hoeffding bound is used to decide how many examples are enough at each node. Consider a real-valued random variable a whose range is R, and suppose we have made n independent observations of a and computed their mean ā. The Hoeffding bound states that, with probability 1 − δ, the true mean of a is at least ā − ε, where

ε = √( R² ln(1/δ) / 2n )
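As a quick numeric illustration of this bound (a sketch under the definition above; hoeffding_bound is our own helper name), the snippet below shows how ε shrinks as the number of observations n grows. For information gain with c classes the range is R = log2(c), so R = 1 for two classes.

----------------------------------------------------------------------
import math

def hoeffding_bound(R, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2 * n)): with probability
    # 1 - delta, the true mean lies within epsilon of the observed mean.
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# The bound tightens as more examples are seen at a leaf:
for n in (100, 1000, 10000):
    print(n, round(hoeffding_bound(R=1.0, delta=1e-7, n=n), 4))
----------------------------------------------------------------------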
Given this bound, the Hoeffding tree induction algorithm is shown in Table III.

TABLE III
HOEFFDING TREE INDUCTION ALGORITHM
----------------------------------------------------------------------
Algorithm: Hoeffding tree induction
1:  Let HT be a tree with a single leaf (the root)
2:  for all training examples do
3:    Sort the example into leaf l using HT
4:    Update the sufficient statistics in l
5:    Increment nl, the number of examples seen at l
6:    if nl mod Nmin = 0 and the examples seen at l are not all of the same class then
7:      Compute Ḡl(Xi) for each attribute Xi
8:      Let Xa be the attribute with the highest Ḡl
9:      Let Xb be the attribute with the second-highest Ḡl
10:     Compute the Hoeffding bound ε
11:     if Xa ≠ X∅ and (Ḡl(Xa) − Ḡl(Xb) > ε or ε < τ) then
12:       Replace l with an internal node that splits on Xa
13:       for all branches of the split do
14:         Add a new leaf with initialized sufficient statistics
15:       end for
16:     end if
17:   end if
18: end for
----------------------------------------------------------------------

Line 1 initializes the tree data structure, which starts out as a single root node. Lines 2-18 form a loop that is performed for every training example. Every example is filtered down the tree to an appropriate leaf, depending on the tests present in the decision tree built to that point (line 3). This leaf is then updated (line 4): each leaf in the tree holds the sufficient statistics needed to make decisions about further growth, namely those needed to estimate the information gain of splitting on each attribute. Line 5 points out that nl, the example count at the leaf, is also updated (technically nl can be computed from the sufficient statistics). Lines 7-11 perform the test described above, using the Hoeffding bound to decide when a particular attribute has won against all of the others; G is the splitting criterion function (information gain) and Ḡ is its estimated value. In line 11, the test against X∅, the null attribute, is used for pre-pruning, and the test involving τ is used for tie-breaking. If an attribute has been selected as the best choice, lines 12-15 split the node, causing the tree to grow. [1] A condensed sketch of this induction loop is given at the end of this section.

A. Strengths and weaknesses of the Hoeffding tree algorithm

The algorithm has two main strengths. First, it scales better than traditional methods: its cost is sublinear in the number of examples thanks to sampling, and it uses very little memory. Second, it makes class predictions in parallel with learning, and new examples are incorporated as they arrive. It also has weaknesses: it can spend a lot of time resolving ties between attributes, and its memory use grows with tree expansion and with a large number of candidate attributes.
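The following Python sketch condenses the induction loop of Table III for categorical attributes. It is our illustration, not the authors' code: the pre-pruning test against the null attribute X∅ is omitted, at least two candidate attributes are assumed, and all names are ours.

----------------------------------------------------------------------
import math
from collections import defaultdict

R = 1.0        # range of the gain estimate (log2(c) for c classes)
DELTA = 1e-7   # each split decision holds with probability 1 - DELTA
TAU = 0.05     # tie-breaking threshold
NMIN = 200     # grace period between split attempts

class Leaf:
    def __init__(self, attributes):
        self.attributes = list(attributes)    # candidate attribute indices
        self.n = 0                            # examples seen at this leaf
        self.class_counts = defaultdict(int)  # y -> count
        self.stats = defaultdict(int)         # (attr, value, y) -> count

    def update(self, x, y):                   # Table III, lines 4-5
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attributes:
            self.stats[(a, x[a], y)] += 1

def entropy(counts):
    total = sum(counts)
    return -sum((k / total) * math.log2(k / total) for k in counts if k)

def info_gain(leaf, attr):
    base = entropy(list(leaf.class_counts.values()))
    values = {v for (a, v, _) in leaf.stats if a == attr}
    remainder = 0.0
    for v in values:
        sub = [leaf.stats[(attr, v, y)] for y in leaf.class_counts]
        remainder += (sum(sub) / leaf.n) * entropy(sub)
    return base - remainder

def try_split(leaf):
    """Return the attribute to split on, or None (Table III, lines 6-11)."""
    if leaf.n % NMIN != 0 or len(leaf.class_counts) < 2:
        return None
    gains = sorted(((info_gain(leaf, a), a) for a in leaf.attributes),
                   reverse=True)
    eps = math.sqrt(R * R * math.log(1.0 / DELTA) / (2.0 * leaf.n))
    (g1, a1), (g2, _) = gains[0], gains[1]
    if g1 - g2 > eps or eps < TAU:   # clear winner, or tie broken by TAU
        return a1
    return None

# Usage: attribute 0 predicts the class perfectly, attribute 1 is noise.
data = [([0, 0], "a"), ([1, 1], "b"), ([0, 1], "a"), ([1, 0], "b")]
leaf = Leaf(attributes=[0, 1])
for i in range(1000):
    x, y = data[i % 4]
    leaf.update(x, y)
    if (attr := try_split(leaf)) is not None:
        print("split on attribute", attr, "after", leaf.n, "examples")
        break
----------------------------------------------------------------------

In a full implementation the chosen split replaces the leaf with an internal node and fresh child leaves (Table III, lines 12-15); this sketch stops at the split decision.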
V. VERY FAST DECISION TREE (VFDT) ALGORITHM

VFDT is a decision-tree learning system based on the Hoeffding tree algorithm, with several refinements. Its pseudocode is shown in Table IV.

TABLE IV
VERY FAST DECISION TREE (VFDT) ALGORITHM
----------------------------------------------------------------------
Algorithm: VFDT
Input: δ, the desired probability level
Output: T, a decision tree
Init: T ← empty leaf (the root)
1.  while (TRUE)
2.    Read the next example
3.    Propagate the example through the tree from the root to a leaf
4.    Update the sufficient statistics at the leaf
5.    if the number of examples seen at the leaf > Nmin
6.      Evaluate the merit G of each attribute
7.      Let A1 be the best attribute and A2 the second best
8.      Let ε be the Hoeffding bound
9.      if G(A1) − G(A2) > ε
10.       Install a splitting test based on A1
11.       Expand the tree with two descendant leaves
----------------------------------------------------------------------

The refinements over the basic Hoeffding tree algorithm are as follows. Tie-breaking: when two or more attributes have nearly identical values of G, many examples may be required to decide between them with high confidence, even though the choice matters little; waiting to decide between nearly identical attributes is wasteful. VFDT can therefore declare an effective tie and split on the current best attribute whenever the difference in G between the best split candidate and the others is less than ε and ε < τ, where τ is a user-defined threshold. [3] G recomputation: it is inefficient to recompute G for every new example, because it is unlikely that the decision to split will be made at that specific point. VFDT therefore allows the user to specify a minimum number of new examples, Nmin, that must accumulate at a leaf before G is recomputed; G is simply computed and checked for a split periodically. This effectively reduces the global time spent on G computations by a factor of Nmin, making learning with VFDT nearly as fast as simply classifying the training examples. [3] VFDT can also deactivate or drop the less promising leaves when memory is needed, and rescan old data when time is available.

VFDT's memory use is dominated by the memory required to keep counts for all growing leaves. If d is the number of attributes, v is the maximum number of values per attribute, and c is the number of classes, VFDT requires O(dvc) memory to store the necessary counts at each leaf. If l is the number of leaves in the tree, the total memory required is O(ldvc). This is independent of the number of examples seen, since the size of the tree depends only on the "true" concept and not on the size of the training set. [2]

VFDT supports three pruning options: no-pruning, pre-pruning and post-pruning. With no-pruning, VFDT refines the tree it is learning indefinitely. With pre-pruning, VFDT takes an additional pre-prune parameter τ′ and stops growing any leaf where the difference in G between the best split candidate and the others is less than ε and G(·) < τ′. If at any point all of the leaves in the tree are pre-pruned, the VFDT procedure terminates and returns the tree.

VFDT is incremental and anytime: new examples are quickly incorporated as they arrive, a usable model is available after the first few examples, and it is then progressively refined.

Compared to the plain Hoeffding tree, VFDT is better in terms of time and memory, with similar accuracy. VFDT has also been compared with C4.5 under the same memory limit for both (40 MB), with VFDT set to Nmin = 200 and τ = 5%, on a domain with 2 classes and 100 binary attributes, using fifteen synthetic trees with 2.2k-500k leaves. C4.5 takes 35 seconds to read and process 100k examples, while VFDT takes 47 seconds; in a second experiment, VFDT takes 6377 seconds for 20 million examples. VFDT is thus able to take advantage of the examples beyond the first 100k to greatly improve accuracy.
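As a worked illustration of the O(ldvc) bound in the experimental setting just described (our arithmetic, assuming 4-byte counters), the per-leaf count storage for d = 100 binary attributes and c = 2 classes is small, but across hundreds of thousands of growing leaves it quickly exceeds a 40 MB budget, which is why VFDT deactivates the least promising leaves when memory runs short:

----------------------------------------------------------------------
def leaf_count_memory(d, v, c, bytes_per_count=4):
    # One counter per (attribute, value, class) triple: O(d*v*c) per leaf.
    return d * v * c * bytes_per_count

per_leaf = leaf_count_memory(d=100, v=2, c=2)  # 1600 bytes per growing leaf
print(per_leaf)                                # 1600
print(500_000 * per_leaf / 2**20)              # ~762.9 MiB if all 500k leaves grow
----------------------------------------------------------------------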
VI. CONCEPT-ADAPTING VERY FAST DECISION TREE (CVFDT) ALGORITHM

VFDT still does not handle concept drift. As time goes by, different elements may fall under the same mental category: what counts as "interesting literature" changes as a reader moves from novice to expert, and what is "reasonably priced" changes as a student becomes a manager. Concept drift thus means that the concept about which data is obtained may shift from time to time, each time after some minimum permanence. Most KDD systems, including VFDT, assume that the training data is a sample drawn from a stationary distribution; but, as discussed in Section II, most large databases and data streams violate this assumption because of concept drift. Our goal, then, is to mine continuously changing data streams.

CVFDT is an extension of VFDT which maintains VFDT's speed and accuracy advantages but adds the ability to detect and respond to changes in the example-generating process. Like other systems with this capability (Widmer and Kubat, 1996; Ganti et al., 2000), CVFDT works by keeping its model consistent with a sliding window of examples. However, unlike most of these other systems, it does not need to learn a new model from scratch every time a new example arrives. Instead, it monitors the quality of its old decisions on the new data and adjusts those that are no longer correct. In particular, when new data arrives, CVFDT updates the sufficient statistics at its nodes by incrementing the counts corresponding to the new examples and decrementing the counts corresponding to the oldest examples in the window (which need to be forgotten). This has no statistical effect if the underlying concept is stationary. If the concept is changing, however, some splits that were previously selected will no longer appear best, because other attributes will have higher gain on the new data. When this happens, CVFDT begins to grow an alternate subtree with the new best attribute at its root. The old subtree is replaced by the alternate one when the alternate becomes more accurate on new data. The overall procedure is shown in Table V and Fig. 1.

TABLE V
CVFDT ALGORITHM
----------------------------------------------------------------------
1. The alternate trees for each node in HT start as empty.
2. Process examples from the stream indefinitely.
3. For each example (x, y):
4.   Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.
5.   Add (x, y) to the sliding window of examples.
6.   Remove and forget the effect of the oldest examples if the sliding window overflows.
7.   CVFDTGrow.
8.   CheckSplitValidity, if f examples have been seen since the last check of the alternate trees.
9. Return HT.
----------------------------------------------------------------------

Fig. 1 CVFDT algorithm: process each example

A. CVFDTGrow

In CVFDTGrow, for each node reached by the example in the Hoeffding tree, the corresponding statistics at the node are incremented. If enough examples have been seen at the leaf in HT which the example reaches, the attribute with the highest value of the attribute evaluation measure (information gain or Gini index) is chosen; if the best attribute is not the "null" attribute, a node is created for each possible value of this attribute.

B. Forget Old Example

CVFDT maintains sufficient statistics at every node in the Hoeffding tree in order to monitor the validity of its old decisions, whereas VFDT maintains such statistics only at the leaves. Forgetting an old example is slightly complicated by the fact that HT may have grown or changed since the example was initially incorporated. To avoid forgetting an example at a node that has never seen it, nodes are assigned a unique, monotonically increasing ID as they are created. When an example is added to the window W, the maximum ID of the leaves it reaches in HT and all alternate trees is recorded with it. The example's effects are later forgotten by ForgetExample(HT, example, maxID): the counts in the sufficient statistics of every node the example reaches in HT whose ID is no larger than the stored maximum ID are decremented, and for each alternate tree Talt of such a node, ForgetExample(Talt, example, maxID) is applied recursively. A minimal sketch of this bookkeeping follows.
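The following Python sketch is ours (alternate trees and the tree-growing logic are omitted, and routing is simplified to a plain attribute-test walk); it shows the window bookkeeping described above: every node receives a monotonically increasing ID at creation, each windowed example stores the maximum node ID it reached, and forgetting decrements statistics only at nodes old enough to have seen the example.

----------------------------------------------------------------------
from collections import deque

WINDOW_SIZE = 100_000   # window length used in the CVFDT experiments

class CNode:
    _next_id = 0
    def __init__(self, attribute=None):
        self.id = CNode._next_id        # unique, monotonically increasing ID
        CNode._next_id += 1
        self.attribute = attribute      # tested attribute (None at a leaf)
        self.children = {}              # attribute value -> CNode
        self.class_counts = {}          # per-node sufficient statistics

def update_path(root, x, y, delta, id_limit=None):
    """Add (delta=+1) or forget (delta=-1) example (x, y) along its path.
    Nodes with id > id_limit were created after the example arrived and
    never saw it, so they are skipped when forgetting. Returns the largest
    node id on the path, to be stored with the example."""
    node, max_id = root, root.id
    while node is not None:
        if id_limit is None or node.id <= id_limit:
            node.class_counts[y] = node.class_counts.get(y, 0) + delta
        max_id = max(max_id, node.id)
        if node.attribute is None:
            break
        node = node.children.get(x[node.attribute])
    return max_id

window = deque()   # sliding window of (example, max_id_when_added)

def process(root, x, y):
    window.append(((x, y), update_path(root, x, y, +1)))
    if len(window) > WINDOW_SIZE:              # oldest example falls out
        (ox, oy), old_max = window.popleft()
        update_path(root, ox, oy, -1, id_limit=old_max)
----------------------------------------------------------------------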
C. CheckSplitValidity

CheckSplitValidity periodically scans the internal nodes of HT and starts a new alternate tree whenever a new winning attribute is found at a node. Two safeguards keep this manageable: a tighter criterion is used to avoid excessive alternate tree creation, and the total number of alternate trees is limited.

A series of experiments compared CVFDT to VFDT. The goals were to evaluate CVFDT's ability to scale up, to evaluate its ability to deal with varying levels of drift, and to identify and characterize the situations where CVFDT outperforms the other systems. The experiments used synthetic data generated from a changing concept based on a rotating hyperplane: 5 million training examples, with drift inserted by periodically rotating the hyperplane so that about 8% of the test points change label at each drift, a window of 100,000 examples, 5% noise, and results sampled every 10k examples throughout the run and averaged.

Figure 3 compares the accuracy of the algorithms as a function of d, the dimensionality of the space. The reported values were obtained by testing the accuracy of the learned models every 10,000 examples throughout the run and averaging the results; the drift level, reported on the minor axis, is the average percentage of the test set that changes label at each point the concept changes. CVFDT is substantially more accurate than VFDT, by approximately 10% on average, and this advantage changes only slightly with d (approximately 13.5% when d = 10 and approximately 10% when d = 150). [6]

Fig. 3 Error rate vs. number of attributes

Figure 4 compares the average size of the models induced during the same run (the reported values were again generated by averaging after every 10,000 examples). CVFDT's trees are substantially smaller than VFDT's, and the advantage is consistent across all the values of d tried. This simultaneous accuracy and size advantage derives from the fact that CVFDT's tree is built on the 100,000 most relevant examples, while VFDT's is built on millions of outdated examples.

Fig. 4 Tree size vs. number of attributes

VII. CONCLUSIONS

In this paper we studied the issues of stream mining and several stream classification algorithms. Hoeffding trees allow learning in small constant time per example and come with strong guarantees of high asymptotic similarity to the corresponding batch trees. Real-world use, however, also demands working within limited RAM, adapting to time-changing data, and producing high-quality results very quickly. VFDT is a high-performance data mining system, based on Hoeffding trees, for learning decision trees from high-speed data streams. For time-changing data, VFDT has been extended into the CVFDT system, which keeps trees up to date with time-changing data streams.

REFERENCES

[1] Albert Bifet, Geoff Holmes, Richard Kirkby and Bernhard Pfahringer (May 2011). Data Stream Mining: A Practical Approach.
[2] Alexey Tsymbal (2004). The problem of concept drift: definitions and related work. Department of Computer Science, Trinity College Dublin, Ireland.
[3] Dariusz Brzezinski (2010). Mining Data Streams with Concept Drift. Poznan University of Technology.
[4] Aggarwal, C., Han, J., Wang, J. and Yu, P. S. (2004). On Demand Classification of Data Streams. In Proceedings of the 2004 International Conference on Knowledge Discovery and Data Mining (KDD '04), Seattle, WA.
[5] Aggarwal, C. C. (2007). Data Streams: Models and Algorithms. Springer.
[6] Domingos, P. and Hulten, G. (2000). Mining High-Speed Data Streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[7] Babcock, B., Babu, S., Datar, M., Motwani, R. and Widom, J. (2002). Models and issues in data stream systems. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), Madison, Wisconsin, pp. 1-16.
[8] R. Agrawal and G. Psaila (1995). Active data mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 3-8, Montréal, Canada. AAAI Press.
[9] Khaled Alsabti, Sanjay Ranka and Vineet Singh (1998). CLOUDS: A decision tree classifier for large datasets. In Knowledge Discovery and Data Mining, pages 2-8.
[10] P. L. Bartlett, S. Ben-David and S. R. Kulkarni (2000). Learning changing concepts by exploiting the structure of change. Machine Learning, 41:153-174.