Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
How to find good features from semi-structured raw data for classification
Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure
Feature Construction

- Most data mining and machine learning models assume structured data of the form:
  (x1, x2, ..., xk) -> y
  where the xi's are independent variables and y is the dependent variable.
  - y drawn from a discrete set: classification
  - y drawn from a continuous range: regression
- When the feature vectors are good, the differences in accuracy among learners are small.
- Question: where do good features come from? (A small illustration of this setting follows below.)
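As a concrete illustration of the (x1, ..., xk) -> y setting (my own assumed example, not from the slides), the sketch below fits a classifier on a discrete y and a regressor on a continuous y over the same feature matrix; the data values and the decision-tree models are arbitrary choices for illustration.

```python
# Minimal sketch (assumed example): the same feature matrix (x1, ..., xk)
# paired with a discrete y gives classification; with a continuous y,
# regression.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]])  # rows of (x1, x2)
y_class = np.array([0, 0, 1, 1])          # discrete labels -> classification
y_reg = np.array([1.4, 1.3, 4.7, 5.0])    # continuous target -> regression

clf = DecisionTreeClassifier().fit(X, y_class)
reg = DecisionTreeRegressor().fit(X, y_reg)
print(clf.predict([[6.0, 3.0]]), reg.predict([[6.0, 3.0]]))
```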
Frequent Pattern-Based Feature Extraction

- Data that does not come in pre-defined feature vectors:
  - Transactions
  - Biological sequences
  - Graph databases
- Frequent patterns are good candidates for discriminative features.
- So, how do we mine them?
[Figure: a discovered frequent pattern (FP: sub-graph) shared by anti-cancer compounds NSC 4960, NSC 699181, NSC 40773, NSC 191370, and NSC 164863; chemical structure drawings omitted. Example borrowed from a George Karypis presentation.]
Frequent Pattern Feature Vector Representation

         P1 P2 P3
  Data1   1  1  0
  Data2   1  0  1
  Data3   1  1  0
  Data4   0  0  1
  ...

- Mining these predictive features is an NP-hard problem.
- 100 examples can give up to 10^10 patterns; most are useless.
- Once the data is in this feature space, any classifier you can name (DT, SVM, LR, ...) can be applied. (See the sketch below.)

[Figure: an example downstream decision tree on the iris features: Petal.Length < 2.45 -> setosa; else Petal.Width < 1.75 -> versicolor, otherwise virginica.]
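As a small illustration of this representation (an assumed toy example, not the slide's data), each mined pattern becomes one binary column indicating whether an example contains it:

```python
# Frequent-pattern feature vectors (assumed toy example): one binary column
# per mined pattern, 1 if the example contains the pattern, else 0.
import numpy as np

examples = [{"a", "b", "c"}, {"a", "c"}, {"a", "b"}, {"c"}]              # Data1..Data4
patterns = [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"c"})]   # P1, P2, P3

X = np.array([[int(p <= e) for p in patterns] for e in examples])
print(X)
# [[1 1 1]
#  [1 0 1]
#  [1 1 0]
#  [0 0 1]]
```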
Example

- 192 examples.
- At 12% support (at least 12% of examples contain the pattern), itemset mining returns 8,600 patterns.
  - 192 examples vs. 8,600 patterns?
- At 4% support: 92,000 patterns.
  - 192 examples vs. 92,000 patterns??
- Most patterns have no predictive power and cannot be used to construct features.
- Our algorithm finds only 20 highly predictive patterns, from which a decision tree with about 90% accuracy can be constructed.
Data in a "bad" feature space

- Discriminative patterns:
  - A non-linear combination of single features.
  - Increase the expressive and discriminative power of the feature space.
- An example:

    X    Y    C
    0    0    0
    1    1    1
   -1    1    1
    1   -1    1
   -1   -1    1

[Figure: the five points plotted in the (x, y) plane; the single class-0 point at the origin lies inside the four class-1 points.]

- The data is not linearly separable in (x, y).
New Feature Space

- Solving the problem with an itemset feature (see the sketch below):
  - ItemSet F: x=0, y=0
  - Association rule F: x=0 -> y=0

    X    Y    F (x=0, y=0)    C
    0    0    1               0
    1    1    0               1
   -1    1    0               1
    1   -1    0               1
   -1   -1    0               1

- The data is linearly separable in (x, y, F).
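To make the effect of the added pattern feature concrete, here is a minimal sketch (my own illustration, not from the slides) that trains an essentially unregularized logistic regression on the original (x, y) space and on the augmented (x, y, F) space; the scikit-learn model and the C=1e6 setting are assumptions made only so that separability shows up directly in the training accuracy.

```python
# Sketch (assumed illustration): a linear model cannot separate the five
# points in (x, y), but adding the pattern feature F = [x=0 AND y=0]
# makes them linearly separable.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [1, 1], [-1, 1], [1, -1], [-1, -1]], dtype=float)
y = np.array([0, 1, 1, 1, 1])

F = ((X[:, 0] == 0) & (X[:, 1] == 0)).astype(float).reshape(-1, 1)
X_aug = np.hstack([X, F])  # feature space (x, y, F)

# Very large C ~ almost no regularization, so separability is visible.
for name, data in [("(x, y)", X), ("(x, y, F)", X_aug)]:
    clf = LogisticRegression(C=1e6, max_iter=10000).fit(data, y)
    print(f"training accuracy in {name}: {clf.score(data, y):.2f}")
# Expected: 0.80 in (x, y), 1.00 in (x, y, F).
```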
Computational Issues

- A pattern is measured by its "frequency" or support.
  - E.g., frequent subgraphs with sup >= 10%: at least 10% of examples contain the pattern.
- "Ordered" enumeration: one cannot enumerate the patterns at sup = 10% without first enumerating all patterns with support above 10%.
- NP-hard problem: easily up to 10^10 patterns for a realistic problem.
- Most patterns are non-discriminative, yet low-support patterns can have high discriminative power. Bad!
- Random sampling does not work, since it is not exhaustive.
  - Most patterns are useless, so randomly sampling patterns (or blindly enumerating without considering frequency) is useless.
- Using a small number of examples does not help either.
  - If they cover only a subset of the vocabulary: incomplete search.
  - If they cover the complete vocabulary: it does not help much and introduces a sample selection bias problem, in particular missing low-support but high-information-gain patterns.
Conventional Procedure: Two-Step Batch Method

[Figure: DataSet --mine--> Frequent Patterns (1, 2, 3, 4, 5, 6, 7, ...) --select--> Mined Discriminative Patterns (1, 2, 4)]

1. Mine frequent patterns (support > sup).
2. Select the most discriminative patterns.
3. Represent the data in the feature space defined by those patterns (feature construction and selection):

           F1 F2 F4
    Data1   1  1  0
    Data2   1  0  1
    Data3   1  1  0
    Data4   0  0  1
    ...

4. Build classification models: any classifier you can name (DT, SVM, LR, ...). (A sketch of the full two-step procedure follows below.)

[Figure: example downstream decision tree: Petal.Length < 2.45 -> setosa; else Petal.Width < 1.75 -> versicolor, otherwise virginica.]
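The following sketch is an assumed, simplified implementation of the two-step batch method on a toy transaction dataset (the data, the helper names entropy/info_gain/presence, and the min_sup/top_k settings are all mine): exhaustively mine frequent itemsets above a minimum support, rank them by information gain, keep the top-k as binary features, then train a decision tree.

```python
# Two-step batch method, sketched on toy transactions (assumed example).
from itertools import combinations
from math import log2
import numpy as np
from sklearn.tree import DecisionTreeClassifier

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"},
                {"a", "c"}, {"b", "d"}, {"c", "d"}]
labels = np.array([1, 1, 0, 1, 0, 0])
min_sup, top_k = 2 / len(transactions), 3

def entropy(y):
    p = np.bincount(y) / len(y)
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def info_gain(y, present):
    gain = entropy(y)
    for mask in (present, ~present):
        if mask.any():
            gain -= mask.mean() * entropy(y[mask])
    return gain

def presence(p):
    return np.array([p <= t for t in transactions])

# Step 1: mine frequent itemsets (brute-force enumeration up to size 2).
items = sorted(set().union(*transactions))
candidates = [frozenset(c) for r in (1, 2) for c in combinations(items, r)]
frequent = [p for p in candidates
            if sum(p <= t for t in transactions) / len(transactions) >= min_sup]

# Step 2: select the most discriminative patterns by information gain.
selected = sorted(frequent, key=lambda p: info_gain(labels, presence(p)),
                  reverse=True)[:top_k]

# Steps 3-4: binary feature vectors, then any classifier (here a decision tree).
X = np.column_stack([presence(p) for p in selected]).astype(int)
clf = DecisionTreeClassifier().fit(X, labels)
print("selected patterns:", [set(p) for p in selected])
print("training accuracy:", clf.score(X, labels))
```

Note how the two problems named on the next slides show up even in this toy: the candidate set grows combinatorially with itemset size, and information gain is always computed against the complete dataset.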
Two Problems

- Mine step: combinatorial explosion.
  1. Exponential explosion of frequent patterns.
  2. Patterns are not considered at all if the minimum support is not small enough.

[Figure: DataSet --mine--> Frequent Patterns (1, 2, 3, 4, 5, 6, 7, ...)]
Two Problems

- Select step: issue of discriminative power.
  3. Information gain is computed against the complete dataset, NOT on subsets of examples.
  4. Correlation among patterns is not directly evaluated on their joint predictability.

[Figure: Frequent Patterns (1, 2, 3, 4, 5, 6, 7, ...) --select--> Mined Discriminative Patterns (1, 2, 4)]
Direct Mining & Selection via Model-based Search Tree

- The classifier itself is the feature miner.
- Basic flow (a sketch of this loop follows below):
  1. At the root node, "Mine & Select" on the full dataset with a node-local support, e.g. P = 20%.
  2. Take the most discriminative feature F based on information gain (IG).
  3. Divide the examples by whether they contain F (Y / N branches).
  4. Recurse: "Mine & Select" with P = 20% on each child node (nodes 1, 2, 3, 4, 5, 6, 7, ...).
  5. Stop and create a class leaf (+ / -) when a node has few data.
- The result is a compact set of highly discriminative patterns.

[Figure: the model-based search tree, with "Mine & Select, P: 20%" at each internal node and class leaves at the bottom.]
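A minimal, self-contained sketch of this recursive mine-and-select loop (my own simplification for itemset patterns, not the authors' implementation; the data, helper names, and the node_sup/min_node parameters are assumptions): each node mines frequent itemsets only over the examples that reach it, keeps the single pattern with the highest information gain, splits on it, and recurses until nodes are small or pure.

```python
# Model-based search tree, sketched for itemset patterns (assumed simplification).
from itertools import combinations
from math import log2
import numpy as np

def entropy(y):
    p = np.bincount(y, minlength=2) / len(y)
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def info_gain(y, mask):
    gain = entropy(y)
    for part in (mask, ~mask):
        if part.any():
            gain -= part.mean() * entropy(y[part])
    return gain

def mine_and_select(transactions, y, node_sup):
    """Mine frequent itemsets on THIS node's examples only; return the most
    discriminative one (by information gain) with its presence mask."""
    items = sorted(set().union(*transactions)) if transactions else []
    best = None
    for r in (1, 2):
        for cand in combinations(items, r):
            p = frozenset(cand)
            mask = np.array([p <= t for t in transactions])
            if mask.mean() >= node_sup and 0 < mask.sum() < len(mask):
                ig = info_gain(y, mask)
                if best is None or ig > best[0]:
                    best = (ig, p, mask)
    return best

def build_mbt(transactions, y, node_sup=0.2, min_node=2, patterns=None):
    """Recursively grow the tree; collect one selected pattern per internal node."""
    patterns = [] if patterns is None else patterns
    if len(y) < min_node or len(set(y)) == 1:
        return patterns                        # leaf: few data or pure node
    best = mine_and_select(transactions, y, node_sup)
    if best is None or best[0] <= 0:
        return patterns
    _, pattern, mask = best
    patterns.append(pattern)
    for branch in (mask, ~mask):               # Y / N children
        child_t = [t for t, keep in zip(transactions, branch) if keep]
        build_mbt(child_t, y[branch], node_sup, min_node, patterns)
    return patterns

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"},
                {"a", "c"}, {"b", "d"}, {"c", "d"}]
labels = np.array([1, 1, 0, 1, 0, 0])
print("compact pattern set:", [set(p) for p in build_mbt(transactions, labels)])
```

A real implementation would plug a scalable pattern miner (for itemsets, sequences, or graphs) into `mine_and_select` in place of the brute-force enumeration used here.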
Divide-and-Conquer Based Frequent Pattern Mining

- Each node (1, 2, 3, 4, 5, 6, 7, ...) mines patterns only on the examples that reach it, so a fixed node-local support corresponds to a much lower global support (worked out below).
- Example: a 20% node-local support at a node holding 10 examples, out of 10,000 examples in total, corresponds to a global support of 10 * 20% / 10,000 = 0.02%.
- Output: the mined discriminative patterns.
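The local-to-global support conversion the slide illustrates can be written out explicitly (the symbols n_v, N, and theta_v below are my notation, not from the slides):

```latex
% Node-local support translated to global support (notation assumed here):
% n_v = number of examples reaching node v, N = total number of examples,
% \theta_v = node-local minimum support.
\[
  \mathrm{sup}_{\mathrm{global}}(\alpha) \;\ge\; \frac{n_v\,\theta_v}{N}
  \qquad\text{e.g.}\qquad \frac{10 \times 20\%}{10{,}000} = 0.02\% .
\]
```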
Analyses (I)

1. Scalability (Theorem 1)
   - Upper bound on the number of patterns enumerated.
   - "Scale-down" ratio to obtain extremely low-support patterns.
2. Bound on the number of returned features (Theorem 2)
Analyses (II)

3. Subspace is important for discriminative patterns.
   - Notation: C1 and C0 are the numbers of examples belonging to class 1 and class 0; P1 is the number of examples in C1 that contain a pattern α; P0 is the number of examples in C0 that contain the same pattern α.
   - On the original (complete) set, a pattern can have no information gain, yet subsets of the examples can still have information gain for it (see the note below).
4. Non-overfitting
5. Optimality under exhaustive search
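The exact condition in point 3 is not reproduced in this transcript; one standard way to state it, using the slide's own C1, C0, P1, P0 (my reconstruction, hedged), is that a pattern carries zero information gain on the whole dataset exactly when its relative frequency is the same in both classes:

```latex
% Zero information gain on the complete dataset (reconstructed statement):
% the split on "contains \alpha" is independent of the class label iff
\[
  IG(\alpha) = 0
  \quad\Longleftrightarrow\quad
  \frac{P_1}{C_1} = \frac{P_0}{C_0},
\]
% while on a subset of the examples the two ratios can differ, so the same
% pattern may have positive information gain there.
```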
Experimental Studies: Itemset Mining (I)

- Scalability comparison on five datasets: Adult, Chess, Hypo, Sick, Sonar.

[Figure: bar charts comparing Log(DT #Pat) vs. Log(MbT #Pat) and Log(DT Abs Support) vs. Log(MbT Abs Support) per dataset.]

[Table: "#Pat using MbT sup" (the number of patterns that would have to be enumerated at the MbT minimum support) vs. the ratio MbT #Pat / #Pat using MbT sup, per dataset. Counts appearing in the slide: 423,439; +∞ (enumeration did not finish); 4,818,391; 95,507; 252,809. Corresponding ratios appearing in the slide: 0.41%, ~0%, 0.0035%, 0.00032%, 0.00775%.]
Experimental Studies: Itemset Mining (II)

- Accuracy of mined itemsets: MbT vs. DT accuracy on Adult, Chess, Hypo, Sick, Sonar: 4 wins, 1 loss for MbT (accuracies in the 70%-100% range).
- MbT achieves this with a much smaller number of patterns.

[Figure: bar charts of DT Accuracy vs. MbT Accuracy and Log(DT #Pat) vs. Log(MbT #Pat) per dataset.]
Experimental Studies: Itemset Mining (III)

- Convergence
Experimental Studies: Graph Mining (I)

- 9 NCI anti-cancer screen datasets
  - URL: http://dtp.nci.nih.gov
  - Active (positive) class: around 1% - 8.3%
- 2 AIDS anti-viral screen datasets
  - The PubChem Project. URL: pubchem.ncbi.nlm.nih.gov
  - H1: CM+CA, 3.5% positive; H2: CA, 1% positive

[Figure: example chemical structure from the screen data.]
Experimental Studies: Graph Mining (II)

- Scalability: number of patterns (DT #Pat vs. MbT #Pat) and absolute support (Log(DT Abs Support) vs. Log(MbT Abs Support)) on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, and H2.

[Figure: per-dataset bar charts of DT #Pat vs. MbT #Pat (y-axis 0 to 1,800 patterns) and of Log(DT Abs Support) vs. Log(MbT Abs Support).]
Experimental Studies: Graph Mining (III)

- AUC and accuracy on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2:
  - AUC (DT vs. MbT): 11 wins for MbT (AUC roughly in the 0.5-0.8 range).
  - Accuracy (DT vs. MbT): 10 wins, 1 loss for MbT (accuracy roughly in the 0.88-1.0 range).

[Figure: per-dataset bar charts of AUC and accuracy for DT and MbT.]
Experimental Studies: Graph Mining (IV)

- AUC of MbT and DT MbT vs. benchmarks: 7 wins, 4 losses.
Summary

- Model-based Search Tree
  - Integrated feature mining and construction.
  - Dynamic support.
  - Can mine patterns with extremely small support.
  - Is both a feature constructor and a classifier.
  - Not limited to one type of frequent pattern: plug-and-play.
- Experimental results
  - Itemset mining
  - Graph mining
- Software and datasets available from: www.cs.columbia.edu/~wfan