Synthesizing High-Frequency Rules
from
Different Data Sources
Xindong Wu and Shichao Zhang
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA
ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003
1
Background
Knowledge management
Knowledge discovery
Data mining
Data warehouse
2
Knowledge Management
Building a data warehouse by
knowledge management
3
Knowledge Discovery and Data
Mining
Data mining is a tool for knowledge discovery
4
Why data mining
If a supermarket manager, Simon, wants to arrange the commodities in his supermarket, how should he arrange them to gain more revenue and convenience?
If one customer buys milk, then he is likely to buy bread, so ...
5
Why data mining
Before long, when Simon wants to send advertisement letters to customers, accounting for individual differences becomes an important task.
Mary always buys diapers and milk powder, so she may have a baby, so ...
6
The role of Data mining
Preprocessed data → useful patterns → knowledge and strategy
7
Mining association rules
Milk
Bread
IF bread is bought then
milk is bought
8
Mining steps
Step 1 : define minsup and minconf (ex : minsup=50%, minconf=50%)
Step 2 : find large itemsets
Step 3 : generate association rules
9
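The three mining steps above can be sketched in Python. The toy transactions and item names below are made up for illustration, and the frequent-itemset search is brute force rather than Apriori's level-wise pruning:

```python
from itertools import combinations

# Hypothetical toy transactions for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"milk", "diaper"},
    {"bread", "diaper"},
]
# Step 1: define minsup and minconf.
minsup, minconf = 0.5, 0.5
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions) / n

# Step 2: find large (frequent) itemsets, here by brute force.
items = set().union(*transactions)
large = {}
for k in range(1, len(items) + 1):
    for c in combinations(sorted(items), k):
        s = support(set(c))
        if s >= minsup:
            large[frozenset(c)] = s

# Step 3: generate association rules X→Y from each large itemset.
rules = []
for itemset, s in large.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            conf = s / large[frozenset(lhs)]
            if conf >= minconf:
                rules.append((set(lhs), set(itemset - frozenset(lhs)), s, conf))

for lhs, rhs, s, conf in rules:
    print(sorted(lhs), "→", sorted(rhs), f"supp={s:.2f} conf={conf:.2f}")
```

On this data only {milk, bread} is a large 2-itemset, yielding the two rules milk→bread and bread→milk.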
Example
Large itemsets (the example transaction table is shown on the slide)
10
Outline
Introduction
Weights of Data Sources
Rule Selection
Synthesizing High-Frequency Rules Algorithm
Relative Synthesizing Model
Experiments
Conclusion
11
Introduction
Framework
Each data source DBi is mined locally, producing a rule set RDi (e.g. AB→C, A→D, B→E); the rule sets RD1, RD2, ..., RDn are then synthesized into a global rule base GRB.
Synthesizing High-Frequency Rules
• Weighting
• Ranking
12
Weights of Data Sources
Definition
Di : data sources
Si : set of association rules from Di
Ri : association rule
3 Steps
Step 1 : union of all Si
Step 2 : assigning each Ri a weight
Step 3 : assigning each Di a weight & normalization
13
Example
3 Data Sources (minsupp=0.2, minconf=0.3)
S1
AB→C with supp=0.4, conf=0.72
A→D with supp=0.3, conf=0.64
B→E with supp=0.34, conf=0.7
S2
B→C with supp=0.45, conf=0.87
A→D with supp=0.36, conf=0.7
B→E with supp=0.4, conf=0.6
S3
AB→C with supp=0.5, conf=0.82
A→D with supp=0.25, conf=0.62
14
Step 1
Union of all Si
S' = S1 ∪ S2 ∪ S3 = {R1, R2, R3, R4}
R1 : AB→C (S1, S3 → 2 times)
R2 : A→D (S1, S2, S3 → 3 times)
R3 : B→E (S1, S2 → 2 times)
R4 : B→C (S2 → 1 time)
S1
1. AB→C with supp=0.4, conf=0.72
2. A→D with supp=0.3, conf=0.64
3. B→E with supp=0.34, conf=0.7
S2
1. B→C with supp=0.45, conf=0.87
2. A→D with supp=0.36, conf=0.7
3. B→E with supp=0.4, conf=0.6
S3
1. AB→C with supp=0.5, conf=0.82
2. A→D with supp=0.25, conf=0.62
15
Step 2
Assigning each Ri a weight
WR1 = 2 / (2+3+2+1) = 0.25
WR2 = 3 / (2+3+2+1) = 0.375
WR3 = 2 / (2+3+2+1) = 0.25
WR4 = 1 / (2+3+2+1) = 0.125
16
Step 3
Assigning each Di a weight
Ri | WRi | Times | Si
R1 : AB→C | 0.25 | 2 | S1, S3
R2 : A→D | 0.375 | 3 | S1, S2, S3
R3 : B→E | 0.25 | 2 | S1, S2
R4 : B→C | 0.125 | 1 | S2
WD1 → 2*0.25 + 3*0.375 + 2*0.25 = 2.125
WD2 → 1*0.125 + 3*0.375 + 2*0.25 = 1.75
WD3 → 2*0.25 + 3*0.375 = 1.625
Normalization
WD1 → 2.125/(2.125+1.75+1.625) = 0.386
WD2 → 1.75/(2.125+1.75+1.625) = 0.318
WD3 → 1.625/(2.125+1.75+1.625) = 0.295
17
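The three weighting steps can be sketched on the example's rule sets; only which sources report each rule matters for the weights:

```python
from collections import Counter

# Rules reported by each data source, from the example above.
sources = {
    "S1": ["AB→C", "A→D", "B→E"],
    "S2": ["B→C", "A→D", "B→E"],
    "S3": ["AB→C", "A→D"],
}

# Step 1: union of all Si, counting how many sources report each rule.
num = Counter(r for rules in sources.values() for r in rules)

# Step 2: weight of rule Ri = Num(Ri) / sum of all frequencies.
total = sum(num.values())
w_rule = {r: c / total for r, c in num.items()}

# Step 3: raw weight of Di = sum of Num(R) * w_R over its rules,
# then normalize so the data-source weights sum to 1.
raw = {d: sum(num[r] * w_rule[r] for r in rules) for d, rules in sources.items()}
z = sum(raw.values())
w_db = {d: v / z for d, v in raw.items()}

print(w_rule)  # AB→C: 0.25, A→D: 0.375, B→E: 0.25, B→C: 0.125
print(raw)     # S1: 2.125, S2: 1.75, S3: 1.625
print(w_db)
```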
Why Rule Selection ?
Goal
Extracting High-Frequency Rules
Low-Frequency Rules → Noise
Solution
If Num(Ri) / n < λ
(n : number of data sources, Num(Ri) : frequency of Ri)
Then rule Ri is wiped out
18
Rule Selection
Example : 10 Data Sources
D1~D9 : {R1 : X→Y}
D10 : {R1 : X→Y, R2: X1→Y1, …, R11: X10→Y10 }
Let =0.8
Num(R1) / 10 = 10/10 = 1
>   keep
Num(R2~11) / 10 = 1/10 = 0.1
<   be wiped out
WR1
D1~D10 : {R1 : X→Y}
WR1 : 10/10=1  WD1~10 : 10*1 / 10*10*1 = 0.1
n
Num(R1)
19
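The selection test is a one-line filter; this sketch reproduces the 10-source example:

```python
from collections import Counter

# D1..D9 contain only R1; D10 contains R1 plus ten low-frequency rules.
sources = {f"D{i}": ["R1"] for i in range(1, 10)}
sources["D10"] = ["R1"] + [f"R{i}" for i in range(2, 12)]

lam = 0.8                 # the frequency threshold λ
n = len(sources)          # number of data sources
num = Counter(r for rules in sources.values() for r in rules)

# Wipe out any rule whose frequency ratio Num(R)/n falls below λ.
kept = {r for r, c in num.items() if c / n >= lam}
print(kept)  # only R1 survives: 10/10 = 1 >= 0.8; the rest have 1/10 = 0.1
```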
Comparison
Without Rule Selection
WD1~9 → 0.099
WD10 → 0.109
With Rule Selection
WD1~10 → 0.1
From the High-Frequency Rules point of view, the weight errors are
D1~9 → |0.1-0.099| = 0.001
D10 → |0.1-0.109| = 0.009
Total Error = 0.01
20
Synthesizing High-Frequency
Rules Algorithm
5 Steps
Step 1 : Rules Selection
Step 2 : Weights of Data Sources
Step 2.1 : union of all Si
Step 2.2 : assigning each Ri a weight
Step 2.3 : assigning each Di a weight & normalization
Step 3 : computing supp & conf of each Ri
Step 4 : ranking all rules by support
Step 5 : output the High-Frequency Rules
21
An Example
3 Data Sources
λ=0.4, minsupp=0.2, minconf=0.3
S1
1. AB→C with supp=0.4, conf=0.72
2. A→D with supp=0.3, conf=0.64
3. B→E with supp=0.34, conf=0.7
S2
1. B→C with supp=0.45, conf=0.87
2. A→D with supp=0.36, conf=0.7
3. B→E with supp=0.4, conf=0.6
S3
1. AB→C with supp=0.5, conf=0.82
2. A→D with supp=0.25, conf=0.62
22
Step 1
Rules Selection
R1 : AB→C (S1, S3 → 2 times) : Num(R1) / 3 = 0.66 ≥ λ → keep
R2 : A→D (S1, S2, S3 → 3 times) : Num(R2) / 3 = 1 ≥ λ → keep
R3 : B→E (S1, S2 → 2 times) : Num(R3) / 3 = 0.66 ≥ λ → keep
R4 : B→C (S2 → 1 time) : Num(R4) / 3 = 0.33 < λ → wiped out
23
Step 2 : Weights of Data Sources
Weights of Ri
WR1 = 2 / (2+3+2) = 0.29
WR2 = 3 / (2+3+2) = 0.42
WR3 = 2 / (2+3+2) = 0.29
Ri | WRi | Times | Si
R1 : AB→C | 0.29 | 2 | S1, S3
R2 : A→D | 0.42 | 3 | S1, S2, S3
R3 : B→E | 0.29 | 2 | S1, S2
Weight of Di
WD1 → 2*0.29 + 3*0.42 + 2*0.29 = 2.42
WD2 → 3*0.42 + 2*0.29 = 1.84
WD3 → 2*0.29 + 3*0.42 = 1.84
Normalization
WD1 → 2.42/(2.42+1.84+1.84) = 0.396
WD2 → 1.84/(2.42+1.84+1.84) = 0.302
WD3 → 1.84/(2.42+1.84+1.84) = 0.302
24
Step 3
Computing supp & conf of each Ri (WD1 = 0.396, WD2 = 0.302, WD3 = 0.302)
Support
AB→C → 0.396*0.4 + 0.302*0.5 = 0.3094
A→D → 0.396*0.3 + 0.302*0.36 + 0.302*0.25 = 0.303
B→E → 0.396*0.34 + 0.302*0.4 = 0.255
Confidence
AB→C → 0.396*0.72 + 0.302*0.82 = 0.532
A→D → 0.396*0.64 + 0.302*0.7 + 0.302*0.62 = 0.652
B→E → 0.396*0.7 + 0.302*0.6 = 0.458
25
Step 4 & Step 5
Ranking all rules by support & output
minsupp=0.2, minconf=0.3 → all three rules qualify
Ranking
1. AB→C (0.3094)
2. A→D (0.303)
3. B→E (0.255)
Output : 3 rules
AB→C (supp=0.3094, conf=0.532)
A→D (supp=0.303, conf=0.652)
B→E (supp=0.255, conf=0.458)
26
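Steps 3-5 amount to a weighted average per rule followed by a sort. This sketch uses the normalized weights from Step 2 and the per-source (supp, conf) pairs from the example; note that A→D is reported by all three sources, so its synthesized values include all three terms:

```python
# Normalized data-source weights from Step 2.
w_db = {"S1": 0.396, "S2": 0.302, "S3": 0.302}

# (supp, conf) of each surviving rule in each source reporting it.
reports = {
    "AB→C": {"S1": (0.4, 0.72), "S3": (0.5, 0.82)},
    "A→D":  {"S1": (0.3, 0.64), "S2": (0.36, 0.7), "S3": (0.25, 0.62)},
    "B→E":  {"S1": (0.34, 0.7), "S2": (0.4, 0.6)},
}

# Step 3: weighted supp and conf per rule.
synth = {}
for rule, per_src in reports.items():
    supp = sum(w_db[s] * sc[0] for s, sc in per_src.items())
    conf = sum(w_db[s] * sc[1] for s, sc in per_src.items())
    synth[rule] = (round(supp, 4), round(conf, 4))

# Steps 4-5: rank by synthesized support and output.
for rule, (supp, conf) in sorted(synth.items(), key=lambda kv: -kv[1][0]):
    print(f"{rule}: supp={supp}, conf={conf}")
```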
Relative Synthesizing Model
Framework
Unknown Di
Internet
Web
X→Y
conf=0.7
books
X→Y
conf=0.72
X→Y
conf=?
journals
X→Y
conf=0.68
 Synthesizing
• clustering method
• roughly method
27
Synthesizing Methods
Physical Meaning
If the confidences are irregularly distributed:
Maximum synthesizing operator
Minimum synthesizing operator
Average synthesizing operator
If the confidences (X) follow a normal distribution:
clustering → an interval [a, b] that must satisfy
1. P{ a ≤ X ≤ b } (m/n) ≥ μ
2. | b - a | ≤ ε
3. a, b > minconf
28
Clustering Method
5 Steps
Step 1 : closeness = 1 - | confi - confj |
→ the distance relation table
Step 2 : closeness degree measure
→ the confidence-confidence matrix
Step 3 : are two confidences close enough?
→ the confidence relationship matrix
Step 4 : classes creating
→ [a, b] : interval of the confidence of rule X→Y
Step 5 : interval verifying
→ do the intervals satisfy the constraints?
29
An Example
Assume
rule  X→Y
conf1=0.7, conf2=0.72, conf3=0.68, conf4=0.5
conf5=0.71, conf6=0.69, conf7=0.7, conf8=0.91
3 parameters
μ = 0.7
ε = 0.08
δ = 0.69
30
Step 1 : Closeness
Example
conf1=0.7, conf2=0.72
c1, 2= 1 - | conf1 - conf2 | = 1 - |0.70-0.72|=0.98
31
Step 2 : Closeness Degree Measure
Example
(the confidence-confidence matrix of pairwise closeness values is shown on the slide)
32
Step 3 : Close Enough ?
Example
=6.9
> 6.9
< 6.9
33
Step 4 : Classes Creating
Example
Class 1 : conf1~3, conf5~7
Class 2 : conf4
Class 3 : conf8
34
Step 5 : Interval Verifying
Example
Class 1
conf1=0.7, conf2=0.72, conf3=0.68, conf5=0.71, conf6=0.69, conf7=0.7
[min, max] = [conf3, conf2] = [0.68, 0.72]
constraint 1 → P{ 0.68 ≤ X ≤ 0.72 } (6/8) ≥ μ (0.7)
constraint 2 → |0.72-0.68| (0.04) < ε (0.08)
constraint 3 → 0.68, 0.72 > minconf (0.65)
In the same way, Class 2 & Class 3 → wiped out
Result → X→Y : conf = [0.68, 0.72]
Support? → synthesized in the same way, giving an interval
35
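A sketch of the clustering method, with one simplifying assumption: instead of the slide's closeness-degree matrix and its 6.9 threshold, confidences are grouped whenever their pairwise closeness 1 - |ci - cj| stays at least 1 - ε. On this data it produces the same three classes:

```python
# Reported confidences of rule X→Y from eight unknown sources.
confs = [0.7, 0.72, 0.68, 0.5, 0.71, 0.69, 0.7, 0.91]
mu, eps, minconf = 0.7, 0.08, 0.65
n = len(confs)

# Steps 1-4: greedy class creation; a confidence joins a class only if
# it is close enough (closeness >= 1 - eps) to every member.
classes = []
for c in confs:
    for cls in classes:
        if all(1 - abs(c - other) >= 1 - eps for other in cls):
            cls.append(c)
            break
    else:
        classes.append([c])

# Step 5: verify each class's interval [a, b] against the constraints.
intervals = []
for cls in classes:
    a, b = min(cls), max(cls)
    if len(cls) / n >= mu and b - a <= eps and a > minconf and b > minconf:
        intervals.append((a, b))

print(classes)    # [[0.7, 0.72, 0.68, 0.71, 0.69, 0.7], [0.5], [0.91]]
print(intervals)  # [(0.68, 0.72)] -> X→Y : conf = [0.68, 0.72]
```

Only Class 1 passes all three constraints; the singleton classes {0.5} and {0.91} fail the frequency constraint and are wiped out.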
Roughly Method
Example
R : AB→C
supp1=0.4, conf1=0.72
supp2=0.5, conf2=0.82
Maximum
max ( supp (R) )=max (0.4, 0.5)=0.5
max ( conf (R) )=max (0.72, 0.82)=0.82
Minimum & Average
min ( supp, conf ) → 0.4, 0.72
avg ( supp, conf ) → 0.45, 0.77
36
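The roughly method reduces to an elementwise max, min, or average over the reported pairs:

```python
# (supp, conf) pairs for rule AB→C as reported by two sources.
reports = [(0.4, 0.72), (0.5, 0.82)]

supps, confs = zip(*reports)
print("max:", max(supps), max(confs))  # 0.5 0.82
print("min:", min(supps), min(confs))  # 0.4 0.72
print("avg:", sum(supps) / len(supps), sum(confs) / len(confs))  # ≈ 0.45 0.77
```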
Experiments
Time
SWNBS (without rule selection)
SWBRS (with rule selection)
running time : SWNBS > SWBRS
Error
first 20 frequent itemsets
Max = 0.000065
Avg = 0.00003165
37
Conclusion
Synthesizing Model
Data Sources  known
weighting
Data Sources  unknown
clustering method
roughly method
38
Future works
Sequential patterns
Combining GA with other techniques
39