COMP5331: Other Data Stream Models
Prepared and presented by Raymond Wong (raywong@cse)

Data mining over entire data streams
- Association
- Clustering
- Classification

Clustering over data streams
An example is BIRCH, whose advantages are that it is incremental and scalable. The rest of these notes focus on classification.

Data mining over static data
With static data, the complete dataset is available up front, and the mining algorithm (association, clustering, or classification) produces its output (the data mining results) from it.

Example: building a decision tree over a static table. The target attribute is Insurance.

Race   Income  Child  Insurance
black  high    no     yes
white  high    yes    yes
white  low     yes    yes
white  low     yes    yes
black  low     no     no
black  low     no     no
black  low     no     no
white  low     no     no

At the root, the information gains are:
Gain(Race, T)   = 0.1887
Gain(Income, T) = 0.3113
Gain(Child, T)  = 0.5488
Child has the greatest gain, so the root splits on Child:
- child = yes: 100% Yes, 0% No
- child = no:  20% Yes, 80% No

Recomputing the gains within the child = no branch (the five remaining rows):
Gain(Race, T)   = 0.0729
Gain(Income, T) = 0.7219
Income now has the greatest gain, so this node splits on Income:
- Income = high: 100% Yes, 0% No
- Income = low:  0% Yes, 100% No

Data mining over data streams
With a data stream, the input is unbounded and the output must be produced in real time.

Storage for each leaf node
Instead of keeping the raw data points, each leaf node stores a count n(attribute, value, class) for every attribute-value-class combination. For the first 8 data points of the table above:
n(Race, black, yes) = 1    n(Race, black, no) = 3
n(Race, white, yes) = 3    n(Race, white, no) = 1
n(Income, high, yes) = 2   n(Income, high, no) = 0
n(Income, low, yes) = 2    n(Income, low, no) = 4
n(Child, yes, yes) = 3     n(Child, yes, no) = 0
n(Child, no, yes) = 1      n(Child, no, no) = 4
These counts are exactly what is needed to compute the information gains above. Whenever we read a new data point, we just increment the corresponding counts.

The problem
It is impossible to build a decision tree according to all the data points, because we would have to "wait" for the entire (unbounded) stream. Instead, we read a certain number of data points and build the decision tree from those. However, this creates a problem: the information gain computed from a prefix of the stream may be incorrect. Comparing the first 8 data points with the entire stream:

                 first 8 points   entire stream
Gain(Race, T)    0.1887           0.2624
Gain(Income, T)  0.3113           0.3513
Gain(Child, T)   0.5488           0.4180
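The counting scheme above is easy to make concrete. Below is a minimal Python sketch (the names LeafCounts, entropy, and gain are illustrative, not from the slides) that keeps the n(attribute, value, class) counts for one leaf, increments them per data point, and computes information gain from them alone. Run on the 8-row table, it reproduces the gains above.

import math
from collections import defaultdict

class LeafCounts:
    """Storage for one leaf node: the counts n(attribute, value, class)."""
    def __init__(self):
        self.n = defaultdict(int)             # (attribute, value, class) -> count
        self.class_total = defaultdict(int)   # class -> count

    def update(self, point, label):
        # Whenever a data point arrives, just increment the corresponding counts.
        self.class_total[label] += 1
        for attribute, value in point.items():
            self.n[(attribute, value, label)] += 1

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(leaf, attribute):
    """Information gain of splitting this leaf on the given attribute."""
    total = sum(leaf.class_total.values())
    base = entropy(list(leaf.class_total.values()))
    by_value = defaultdict(lambda: defaultdict(int))   # value -> class -> count
    for (a, v, cls), c in leaf.n.items():
        if a == attribute:
            by_value[v][cls] += c
    split = sum((sum(cc.values()) / total) * entropy(list(cc.values()))
                for cc in by_value.values())
    return base - split

# Feeding in the 8 rows of the table above as a stream:
rows = [
    ({"Race": "black", "Income": "high", "Child": "no"},  "yes"),
    ({"Race": "white", "Income": "high", "Child": "yes"}, "yes"),
    ({"Race": "white", "Income": "low",  "Child": "yes"}, "yes"),
    ({"Race": "white", "Income": "low",  "Child": "yes"}, "yes"),
    ({"Race": "black", "Income": "low",  "Child": "no"},  "no"),
    ({"Race": "black", "Income": "low",  "Child": "no"},  "no"),
    ({"Race": "black", "Income": "low",  "Child": "no"},  "no"),
    ({"Race": "white", "Income": "low",  "Child": "no"},  "no"),
]
leaf = LeafCounts()
for point, label in rows:
    leaf.update(point, label)
for a in ("Race", "Income", "Child"):
    print(a, round(gain(leaf, a), 4))   # Race 0.1887, Income 0.3113, Child 0.5488

The point of storing counts rather than rows is that the storage per leaf is bounded by the number of attribute-value-class combinations, not by the (unbounded) length of the stream.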
With a different ("other") first 8 data points, the situation is worse:

                 other 8 points   entire stream
Gain(Race, T)    0.1287           0.2624
Gain(Income, T)  0.4113           0.3513
Gain(Child, T)   0.3488           0.4180

Here the prefix ranks Income highest, yet over the entire stream Child has the greatest gain, so splitting after only these 8 points would choose the wrong attribute.

Idea
Read "enough" data points before making the decision to split. Let N be the number of data points read so far. The gain estimated from those N points approaches the true Gain(A, T) as N becomes large.

Hoeffding Tree Algorithm
Given:
- Gain(A, T), the greatest estimated gain,
- Gain(B, T), the second greatest estimated gain,
- δ, a user parameter (a real number between 0 and 1),
if Gain(A, T) - Gain(B, T) > ε, then with probability at least 1 - δ the true difference Gain(A, T) - Gain(B, T) is greater than 0. The threshold ε depends on δ and on the number of data points read so far:

    ε = sqrt( R² ln(1/δ) / (2N) )

where N is the total number of data points read so far and R is the range of Gain(A, T); for information gain, R = log c, where c is the total number of possible classes of the target attribute. If N is larger, ε is smaller, so the more data we have read, the smaller the observed gap needs to be before we are willing to split.

Algorithm
1. Create a storage for the root node.
2. Whenever a data point comes, for the leaf node it reaches:
   a. Update the storage, i.e. increment the corresponding counts.
   b. Use the storage to calculate the information gain of each (remaining) attribute.
   c. Find Gain(A, T), the greatest value, and Gain(B, T), the second greatest value.
   d. If Gain(A, T) - Gain(B, T) > ε, perform the splitting according to attribute A, and create a storage for each new leaf node.

Illustration: at the start, the tree is a single root node with its storage. Each arriving data point updates the storage, and we then see whether we can split the node.
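The split test itself is short. Here is a sketch under the ε formula above, assuming the gains have already been computed from the leaf's counts (the names hoeffding_bound and should_split are mine, not the slides'):

import math

def hoeffding_bound(R, delta, N):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2N)); shrinks as N grows.
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * N))

def should_split(gains, delta, N, num_classes):
    """Return the attribute to split on, or None to keep reading the stream."""
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None
    (best_attr, g_a), (_, g_b) = ranked[0], ranked[1]
    eps = hoeffding_bound(math.log2(num_classes), delta, N)   # R = log2(c)
    # If the observed gap exceeds eps, then with probability >= 1 - delta
    # the true Gain(A, T) - Gain(B, T) is also positive.
    return best_attr if g_a - g_b > eps else None

print(should_split({"Race": 0.1887, "Income": 0.3113, "Child": 0.5488},
                   delta=0.05, N=8, num_classes=2))   # -> None

At N = 8 with δ = 0.05 and two classes (R = log2(2) = 1), ε = sqrt(ln 20 / 16) ≈ 0.43, while the observed gap is 0.5488 - 0.3113 = 0.2375, so the test refuses to split: 8 points are not yet "enough".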
Once the split condition holds at the root (here, on attribute Child), the root splits into child = yes (100% Yes, 0% No) and child = no (20% Yes, 80% No), and a fresh storage is created for each new leaf. Subsequent data points are routed to the leaf they reach, where we again update that leaf's storage and see whether we can split that node.
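Putting the pieces together, a minimal end-to-end sketch of the update loop in the diagrams, reusing LeafCounts, gain, and should_split from the earlier sketches (Node, route, and observe are illustrative names, not from the slides):

class Node:
    def __init__(self, attributes):
        self.counts = LeafCounts()      # storage created for this node
        self.attributes = attributes    # attributes still available here
        self.split_attr = None
        self.children = {}              # attribute value -> child Node

    def route(self, point):
        # Walk from this node down to the leaf that the data point reaches.
        node = self
        while node.split_attr is not None:
            node = node.children[point[node.split_attr]]
        return node

def observe(root, point, label, delta, num_classes):
    leaf = root.route(point)
    leaf.counts.update(point, label)    # increment the corresponding counts
    N = sum(leaf.counts.class_total.values())
    gains = {a: gain(leaf.counts, a) for a in leaf.attributes}
    attr = should_split(gains, delta, N, num_classes)
    if attr is not None:
        # Split on attribute A and create a storage for each new leaf node.
        leaf.split_attr = attr
        remaining = [a for a in leaf.attributes if a != attr]
        seen = {v for (a, v, _), c in leaf.counts.n.items() if a == attr}
        leaf.children = {v: Node(remaining) for v in seen}

This sketch creates children only for attribute values seen before the split; a real implementation would also handle values that first appear afterwards. Note that the tree never revisits old data points: every decision is made from the counts alone, which is what makes the algorithm suitable for an unbounded stream.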