Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Mining Time-Changing Data Streams Advisor: Dr. Hsu Graduate: Yung-Chu Lin 2002/3/6 IDS Lab Seminar Outline Motivation Objective Hoeffding Bounds The VFDT Algorithm The CVFDT Algorithm Window Size Time and Space Complexity Empirical Study Conclusion Opinion 2002/3/6 IDS Lab Seminar Motivation The volume and time span of accumulated data for future use Real large database is not random sample drawn from a stationary 2002/3/6 IDS Lab Seminar Objective Solving the classification problem: concept drift 2002/3/6 IDS Lab Seminar Hoeffding Bounds n examples confidence 1-δ variable r, range R 2002/3/6 probability: R=1 information gain: R=log C IDS Lab Seminar Concept of VFDT Algorithm Initialize the HT Repeat 2002/3/6 Scan nmin examples (a window size) Compute Gain(x) of every attribute Split the leaf node or not IDS Lab Seminar The VFDT Algorithm 2002/3/6 IDS Lab Seminar Example for Algorithms Day Outlook Temperature Humidity Wind PlayTennis D1 Sunny Hot High Weak No D2 Sunny Hot High Strong No D3 Overcast Hot High Weak Yes D4 Rain Mild High Weak Yes D5 Rain Cool Normal Weak Yes D6 Rain Cool Normal Strong No D7 Overcast Cool Normal Strong Yes D8 Sunny Mild High Weak No D9 Sunny Cool Normal Weak Yes D10 Rain Mild Normal Weak Yes D11 Sunny Mild Normal Strong Yes D12 Overcast Mild High Strong Yes D13 Overcast Hot Normal Weak Yes Rain Mild Strong No D14 2002/3/6 High IDS Lab Seminar Concept of CVFDT Algorithm An extension to VFDT, which adds the ability to detect and respond to changes in examplegenerating Not need to learn a new model from scratch every time a new example arrives Scan HT and alternate trees periodically look for internal nodes whose sufficient statistics indicate better attribute When alternate subtree becomes more accurate, the old subtree is replaced by the new one 2002/3/6 IDS Lab Seminar The CVFDT Algorithm 2002/3/6 IDS Lab Seminar The CVFDTGrow Procedure 2002/3/6 IDS Lab Seminar The ForgetExample and CheckSplitValidity procedure 2002/3/6 IDS Lab Seminar Window Size One windows size w will not be appropriate for every concept and every type of drift Shrink w when many of the nodes in HT become questionable at once or in response to a rapid change in data rate Increase w when 2002/3/6 Few questionable node concept is stable IDS Lab Seminar Time and Space Complexity VFDT CVFDT time O(lvdvc) O(lcdvc) Space O(ndvc) n: #nodes in CVFDT’s main tree and alternate trees d: #attributes v: max number of values per attribute c: #class lv,lc: longest path 2002/3/6 IDS Lab Seminar Empirical Study Synthetic data Web data 2002/3/6 IDS Lab Seminar Synthetic Data A hyperplane in d-dimensional space is the set of points x that satisfy d w x i 1 i i w0 where xi is the ith coordinate of x. d d positive: wi xi w0 negative: wi xi w0 i 1 2002/3/6 i 1 IDS Lab Seminar Synthetic Data (cont’d) Initialize weight to .2 except for w0 which is .25d Substitute its coordinates into the left hand side of Equation to obtain a sum s |s|<=.1*w0 positive |s|<=.2*w0 negative xi=[0,1] 2002/3/6 IDS Lab Seminar Synthetic Data (cont’d) 5 million training examples δ=0.0001 f=20,000 Nmin=300;┬=0.05;w=100,000 2002/3/6 IDS Lab Seminar Synthetic Data (cont’d) 2002/3/6 IDS Lab Seminar Synthetic Data (cont’d) CVFDT took 4.3 times longer than VFDT VFDT’s average memory allocation over the course of the run was 23MB while CVFDT’s was 16.5MB The average number of nodes in VFDT’s tree was 2696 and in CVFDT’s tree was 677(132: alternate tree, 545: main tree) 2002/3/6 IDS Lab Seminar Conclusion Introducing CVFDT, learning accurate models from the most demanding highspeed, concept-drifting data streams Maintain a decision-tree up-to-date with a windows of examples 2002/3/6 IDS Lab Seminar Opinion Many techniques have the problem, concept drift. So, maybe we could apply the concept in this paper to our other techniques, like the association rule mining, clustering algorithms, and so on. 2002/3/6 IDS Lab Seminar