Synthesizing High-Frequency Rules
from
Different Data Sources
Xindong Wu and Shichao Zhang
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA
ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003
1
Background
Knowledge management
Knowledge discovery
Data mining
Data warehouse
2
Knowledge Management
Building a data warehouse by
knowledge management
3
Knowledge Discovery and Data
Mining
Data mining is a tool for knowledge discovery
4
Why data mining
If a supermarket manager, Simon, wants to arrange the commodities in his supermarket, how should he arrange them to gain more revenue and convenience?
If one customer buys milk, then he is likely to buy bread, so ...
5
Why data mining
Before long, when Simon wants to send advertisement letters to customers, accounting for individual differences becomes an important task.
Mary always buys diapers and milk powder, so she may have a baby, so ...
6
The role of Data mining
Preprocessed data → useful patterns → knowledge and strategy
7
Mining association rules
Milk
Bread
IF bread is bought then
milk is bought
8
Mining steps
Step 1 : define minsup and minconf (ex : minsup=50%, minconf=50%)
Step 2 : find large itemsets
Step 3 : generate association rules
9
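The three mining steps above can be sketched in Python. The toy transactions and item names below are made up for illustration, and the frequent-itemset search is brute force rather than Apriori's level-wise pruning:

```python
from itertools import combinations

# Hypothetical toy transactions for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"milk", "diaper"},
    {"bread", "diaper"},
]
# Step 1: define minsup and minconf.
minsup, minconf = 0.5, 0.5
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions) / n

# Step 2: find large (frequent) itemsets, here by brute force.
items = set().union(*transactions)
large = {}
for k in range(1, len(items) + 1):
    for c in combinations(sorted(items), k):
        s = support(set(c))
        if s >= minsup:
            large[frozenset(c)] = s

# Step 3: generate association rules X→Y from each large itemset.
rules = []
for itemset, s in large.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            conf = s / large[frozenset(lhs)]
            if conf >= minconf:
                rules.append((set(lhs), set(itemset - frozenset(lhs)), s, conf))

for lhs, rhs, s, conf in rules:
    print(sorted(lhs), "→", sorted(rhs), f"supp={s:.2f} conf={conf:.2f}")
```

On this data only {milk, bread} is a large 2-itemset, yielding the two rules milk→bread and bread→milk.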
Example
Large itemsets (the example transaction table is shown on the slide)
10
Outline
Introduction
Weights of Data Sources
Rule Selection
Synthesizing High-Frequency Rules Algorithm
Relative Synthesizing Model
Experiments
Conclusion
11
Introduction
Framework
Each data source DBi is mined locally, producing a rule set RDi (e.g. AB→C, A→D, B→E); the rule sets RD1, RD2, ..., RDn are then synthesized into a global rule base GRB.
Synthesizing High-Frequency Rules
• Weighting
• Ranking
12
Weights of Data Sources
Definition
Di : data sources
Si : set of association rules from Di
Ri : association rule
3 Steps
Step 1 : union of all Si
Step 2 : assigning each Ri a weight
Step 3 : assigning each Di a weight & normalization
13
Example
3 Data Sources (minsupp=0.2, minconf=0.3)
S1
AB→C with supp=0.4, conf=0.72
A→D with supp=0.3, conf=0.64
B→E with supp=0.34, conf=0.7
S2
B→C with supp=0.45, conf=0.87
A→D with supp=0.36, conf=0.7
B→E with supp=0.4, conf=0.6
S3
AB→C with supp=0.5, conf=0.82
A→D with supp=0.25, conf=0.62
14
Step 1
Union of all Si
S' = S1 ∪ S2 ∪ S3 = {R1, R2, R3, R4}
R1 : AB→C (S1, S3 → 2 times)
R2 : A→D (S1, S2, S3 → 3 times)
R3 : B→E (S1, S2 → 2 times)
R4 : B→C (S2 → 1 time)
S1
1. AB→C with supp=0.4, conf=0.72
2. A→D with supp=0.3, conf=0.64
3. B→E with supp=0.34, conf=0.7
S2
1. B→C with supp=0.45, conf=0.87
2. A→D with supp=0.36, conf=0.7
3. B→E with supp=0.4, conf=0.6
S3
1. AB→C with supp=0.5, conf=0.82
2. A→D with supp=0.25, conf=0.62
15
Step 2
Assigning each Ri a weight
WR1 = 2 / (2+3+2+1) = 0.25
WR2 = 3 / (2+3+2+1) = 0.375
WR3 = 2 / (2+3+2+1) = 0.25
WR4 = 1 / (2+3+2+1) = 0.125
16
Step 3
Assigning each Di a weight
Ri | WRi | Times | Si
R1 : AB→C | 0.25 | 2 | S1, S3
R2 : A→D | 0.375 | 3 | S1, S2, S3
R3 : B→E | 0.25 | 2 | S1, S2
R4 : B→C | 0.125 | 1 | S2
WD1 → 2*0.25 + 3*0.375 + 2*0.25 = 2.125
WD2 → 1*0.125 + 3*0.375 + 2*0.25 = 1.75
WD3 → 2*0.25 + 3*0.375 = 1.625
Normalization
WD1 → 2.125/(2.125+1.75+1.625) = 0.386
WD2 → 1.75/(2.125+1.75+1.625) = 0.318
WD3 → 1.625/(2.125+1.75+1.625) = 0.295
17
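The three weighting steps can be sketched on the example's rule sets; only which sources report each rule matters for the weights:

```python
from collections import Counter

# Rules reported by each data source, from the example above.
sources = {
    "S1": ["AB→C", "A→D", "B→E"],
    "S2": ["B→C", "A→D", "B→E"],
    "S3": ["AB→C", "A→D"],
}

# Step 1: union of all Si, counting how many sources report each rule.
num = Counter(r for rules in sources.values() for r in rules)

# Step 2: weight of rule Ri = Num(Ri) / sum of all frequencies.
total = sum(num.values())
w_rule = {r: c / total for r, c in num.items()}

# Step 3: raw weight of Di = sum of Num(R) * w_R over its rules,
# then normalize so the data-source weights sum to 1.
raw = {d: sum(num[r] * w_rule[r] for r in rules) for d, rules in sources.items()}
z = sum(raw.values())
w_db = {d: v / z for d, v in raw.items()}

print(w_rule)  # AB→C: 0.25, A→D: 0.375, B→E: 0.25, B→C: 0.125
print(raw)     # S1: 2.125, S2: 1.75, S3: 1.625
print(w_db)
```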
Why Rule Selection ?
Goal
Extracting High-Frequency Rules
Low-Frequency Rules → Noise
Solution
If Num(Ri) / n < λ
(n : number of data sources, Num(Ri) : frequency of Ri)
Then rule Ri is wiped out
18
Rule Selection
Example : 10 Data Sources
D1~D9 : {R1 : X→Y}
D10 : {R1 : X→Y, R2: X1→Y1, …, R11: X10→Y10 }
Let =0.8
Num(R1) / 10 = 10/10 = 1
>   keep
Num(R2~11) / 10 = 1/10 = 0.1
<   be wiped out
WR1
D1~D10 : {R1 : X→Y}
WR1 : 10/10=1  WD1~10 : 10*1 / 10*10*1 = 0.1
n
Num(R1)
19
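The selection test is a one-line filter; this sketch reproduces the 10-source example:

```python
from collections import Counter

# D1..D9 contain only R1; D10 contains R1 plus ten low-frequency rules.
sources = {f"D{i}": ["R1"] for i in range(1, 10)}
sources["D10"] = ["R1"] + [f"R{i}" for i in range(2, 12)]

lam = 0.8                 # the frequency threshold λ
n = len(sources)          # number of data sources
num = Counter(r for rules in sources.values() for r in rules)

# Wipe out any rule whose frequency ratio Num(R)/n falls below λ.
kept = {r for r, c in num.items() if c / n >= lam}
print(kept)  # only R1 survives: 10/10 = 1 >= 0.8; the rest have 1/10 = 0.1
```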
Comparison
Without Rule Selection
WD1~9 → 0.099
WD10 → 0.109
With Rule Selection
WD1~10 → 0.1
From the High-Frequency Rules point of view, the weight errors are
D1~9 → |0.1-0.099| = 0.001
D10 → |0.1-0.109| = 0.009
Total Error = 0.01
20
Synthesizing High-Frequency
Rules Algorithm
5 Steps
Step 1 : Rules Selection
Step 2 : Weights of Data Sources
Step 2.1 : union of all Si
Step 2.2 : assigning each Ri a weight
Step 2.3 : assigning each Di a weight & normalization
Step 3 : computing supp & conf of each Ri
Step 4 : ranking all rules by support
Step 5 : output the High-Frequency Rules
21
An Example
3 Data Sources
λ=0.4, minsupp=0.2, minconf=0.3
S1
1. AB→C with supp=0.4, conf=0.72
2. A→D with supp=0.3, conf=0.64
3. B→E with supp=0.34, conf=0.7
S2
1. B→C with supp=0.45, conf=0.87
2. A→D with supp=0.36, conf=0.7
3. B→E with supp=0.4, conf=0.6
S3
1. AB→C with supp=0.5, conf=0.82
2. A→D with supp=0.25, conf=0.62
22
Step 1
Rules Selection
R1 : AB→C (S1, S3 → 2 times) : Num(R1) / 3 = 0.66 ≥ λ → keep
R2 : A→D (S1, S2, S3 → 3 times) : Num(R2) / 3 = 1 ≥ λ → keep
R3 : B→E (S1, S2 → 2 times) : Num(R3) / 3 = 0.66 ≥ λ → keep
R4 : B→C (S2 → 1 time) : Num(R4) / 3 = 0.33 < λ → wiped out
23
Step 2 : Weights of Data Sources
Weights of Ri
WR1 = 2 / (2+3+2) = 0.29
WR2 = 3 / (2+3+2) = 0.42
WR3 = 2 / (2+3+2) = 0.29
Ri | WRi | Times | Si
R1 : AB→C | 0.29 | 2 | S1, S3
R2 : A→D | 0.42 | 3 | S1, S2, S3
R3 : B→E | 0.29 | 2 | S1, S2
Weight of Di
WD1 → 2*0.29 + 3*0.42 + 2*0.29 = 2.42
WD2 → 3*0.42 + 2*0.29 = 1.84
WD3 → 2*0.29 + 3*0.42 = 1.84
Normalization
WD1 → 2.42/(2.42+1.84+1.84) = 0.396
WD2 → 1.84/(2.42+1.84+1.84) = 0.302
WD3 → 1.84/(2.42+1.84+1.84) = 0.302
24
Step 3
Computing supp & conf of each Ri (WD1 = 0.396, WD2 = 0.302, WD3 = 0.302)
Support
AB→C → 0.396*0.4 + 0.302*0.5 = 0.3094
A→D → 0.396*0.3 + 0.302*0.36 + 0.302*0.25 = 0.303
B→E → 0.396*0.34 + 0.302*0.4 = 0.255
Confidence
AB→C → 0.396*0.72 + 0.302*0.82 = 0.532
A→D → 0.396*0.64 + 0.302*0.7 + 0.302*0.62 = 0.652
B→E → 0.396*0.7 + 0.302*0.6 = 0.458
25
Step 4 & Step 5
Ranking all rules by support & output
minsupp=0.2, minconf=0.3 → all three rules qualify
Ranking
1. AB→C (0.3094)
2. A→D (0.303)
3. B→E (0.255)
Output : 3 rules
AB→C (supp=0.3094, conf=0.532)
A→D (supp=0.303, conf=0.652)
B→E (supp=0.255, conf=0.458)
26
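Steps 3-5 amount to a weighted average per rule followed by a sort. This sketch uses the normalized weights from Step 2 and the per-source (supp, conf) pairs from the example; note that A→D is reported by all three sources, so its synthesized values include all three terms:

```python
# Normalized data-source weights from Step 2.
w_db = {"S1": 0.396, "S2": 0.302, "S3": 0.302}

# (supp, conf) of each surviving rule in each source reporting it.
reports = {
    "AB→C": {"S1": (0.4, 0.72), "S3": (0.5, 0.82)},
    "A→D":  {"S1": (0.3, 0.64), "S2": (0.36, 0.7), "S3": (0.25, 0.62)},
    "B→E":  {"S1": (0.34, 0.7), "S2": (0.4, 0.6)},
}

# Step 3: weighted supp and conf per rule.
synth = {}
for rule, per_src in reports.items():
    supp = sum(w_db[s] * sc[0] for s, sc in per_src.items())
    conf = sum(w_db[s] * sc[1] for s, sc in per_src.items())
    synth[rule] = (round(supp, 4), round(conf, 4))

# Steps 4-5: rank by synthesized support and output.
for rule, (supp, conf) in sorted(synth.items(), key=lambda kv: -kv[1][0]):
    print(f"{rule}: supp={supp}, conf={conf}")
```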
Relative Synthesizing Model
Framework
Unknown Di
Internet
Web
X→Y
conf=0.7
books
X→Y
conf=0.72
X→Y
conf=?
journals
X→Y
conf=0.68
 Synthesizing
• clustering method
• roughly method
27
Synthesizing Methods
Physical Meaning
If the confidences are irregularly distributed:
Maximum synthesizing operator
Minimum synthesizing operator
Average synthesizing operator
If the confidences (X) follow a normal distribution:
clustering → an interval [a, b] that must satisfy
1. P{ a ≤ X ≤ b } (m/n) ≥ μ
2. | b - a | ≤ ε
3. a, b > minconf
28
Clustering Method
5 Steps
Step 1 : closeness = 1 - | confi - confj |
→ the distance relation table
Step 2 : closeness degree measure
→ the confidence-confidence matrix
Step 3 : are two confidences close enough?
→ the confidence relationship matrix
Step 4 : classes creating
→ [a, b] : interval of the confidence of rule X→Y
Step 5 : interval verifying
→ do the intervals satisfy the constraints?
29
An Example
Assume
rule  X→Y
conf1=0.7, conf2=0.72, conf3=0.68, conf4=0.5
conf5=0.71, conf6=0.69, conf7=0.7, conf8=0.91
3 parameters
μ = 0.7
ε = 0.08
δ = 0.69
30
Step 1 : Closeness
Example
conf1=0.7, conf2=0.72
c1, 2= 1 - | conf1 - conf2 | = 1 - |0.70-0.72|=0.98
31
Step 2 : Closeness Degree Measure
Example
(the confidence-confidence matrix of pairwise closeness values is shown on the slide)
32
Step 3 : Close Enough ?
Example
=6.9
> 6.9
< 6.9
33
Step 4 : Classes Creating
Example
Class 1 : conf1~3, conf5~7
Class 2 : conf4
Class 3 : conf8
34
Step 5 : Interval Verifying
Example
Class 1
conf1=0.7, conf2=0.72, conf3=0.68, conf5=0.71, conf6=0.69, conf7=0.7
[min, max] = [conf3, conf2] = [0.68, 0.72]
constraint 1 → P{ 0.68 ≤ X ≤ 0.72 } (6/8) ≥ μ (0.7)
constraint 2 → |0.72-0.68| (0.04) < ε (0.08)
constraint 3 → 0.68, 0.72 > minconf (0.65)
In the same way, Class 2 & Class 3 → wiped out
Result → X→Y : conf = [0.68, 0.72]
Support? → synthesized in the same way, giving an interval
35
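A sketch of the clustering method, with one simplifying assumption: instead of the slide's closeness-degree matrix and its 6.9 threshold, confidences are grouped whenever their pairwise closeness 1 - |ci - cj| stays at least 1 - ε. On this data it produces the same three classes:

```python
# Reported confidences of rule X→Y from eight unknown sources.
confs = [0.7, 0.72, 0.68, 0.5, 0.71, 0.69, 0.7, 0.91]
mu, eps, minconf = 0.7, 0.08, 0.65
n = len(confs)

# Steps 1-4: greedy class creation; a confidence joins a class only if
# it is close enough (closeness >= 1 - eps) to every member.
classes = []
for c in confs:
    for cls in classes:
        if all(1 - abs(c - other) >= 1 - eps for other in cls):
            cls.append(c)
            break
    else:
        classes.append([c])

# Step 5: verify each class's interval [a, b] against the constraints.
intervals = []
for cls in classes:
    a, b = min(cls), max(cls)
    if len(cls) / n >= mu and b - a <= eps and a > minconf and b > minconf:
        intervals.append((a, b))

print(classes)    # [[0.7, 0.72, 0.68, 0.71, 0.69, 0.7], [0.5], [0.91]]
print(intervals)  # [(0.68, 0.72)] -> X→Y : conf = [0.68, 0.72]
```

Only Class 1 passes all three constraints; the singleton classes {0.5} and {0.91} fail the frequency constraint and are wiped out.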
Roughly Method
Example
R : AB→C
supp1=0.4, conf1=0.72
supp2=0.5, conf2=0.82
Maximum
max ( supp (R) )=max (0.4, 0.5)=0.5
max ( conf (R) )=max (0.72, 0.82)=0.82
Minimum & Average
min ( supp, conf ) → 0.4, 0.72
avg ( supp, conf ) → 0.45, 0.77
36
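The roughly method reduces to an elementwise max, min, or average over the reported pairs:

```python
# (supp, conf) pairs for rule AB→C as reported by two sources.
reports = [(0.4, 0.72), (0.5, 0.82)]

supps, confs = zip(*reports)
print("max:", max(supps), max(confs))  # 0.5 0.82
print("min:", min(supps), min(confs))  # 0.4 0.72
print("avg:", sum(supps) / len(supps), sum(confs) / len(confs))  # ≈ 0.45 0.77
```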
Experiments
Time
SWNBS (without rule selection)
SWBRS (with rule selection)
running time : SWNBS > SWBRS
Error
first 20 frequent itemsets
Max = 0.000065
Avg = 0.00003165
37
Conclusion
Synthesizing Model
Data Sources  known
weighting
Data Sources  unknown
clustering method
roughly method
38
Future works
Sequential patterns
Combining GA with other techniques
39