Selected Research Results & Applications
of WSU's Data Mining Research Lab
Guozhu Dong
PhD, Professor
Data Mining Research Lab
Wright State University
Outline












Contrast data mining
Contrast pattern based classifiers
Contrast pattern mining on sequence data
Real-time mining/analysis of sensor network data
Multi-dimensional multi-level data mining in data cubes
Mining large collections of time series
Microarray concordance analysis
Summarizing clusterings of abstracts/articles
Alternative clustering
Conversion of undesirable objects
Data mining for knowledge transfer
Comparative summary of search results
(Focus on the “bold” topics)
Data Mining Results and Applications
Guozhu Dong
Contrast data mining: What & Why?
Contrast: “To compare or appraise in respect to differences” (Merriam-Webster Dictionary)
Contrast data mining: the mining of patterns and models contrasting two or more classes, conditions, or datasets.
Why:
“Sometimes it’s good to contrast what you like with something else. It makes you appreciate it even more.”
Darby Conley, Get Fuzzy, 2001
Useful for understanding, prediction/classification, outlier detection, …
What can be contrasted?

Objects at different time periods
“Compare ICDM papers published in 2006-2007 versus those in 2004-2005 to find emerging research directions”

Objects for different spatial locations
“Find the distinguishing patterns of cars sold in the south, versus those sold in the north”

Objects across different classes
“Find the key differences between normal colon tissues and cancerous colon tissues”
How do we contrast two datasets, without
advanced mining tools?

Let D1 and D2 be the two datasets.

We usually find a prototypical case p1 for D1, and a
prototypical case p2 for D2. Then we compare p1 against
p2.

We may also compare the distribution of D1 against that
of D2.

Such simplifications often miss the interesting contrast
patterns.
Alternative names for contrast data mining/patterns

Contrast data mining is related to change mining, difference mining, discriminator mining, classification rule mining, …
Contrast patterns are related to these patterns: change patterns, class based association rules, contrast sets, concept drift, difference patterns, discriminative patterns, (dis)similarity patterns, emerging patterns, gradient patterns, high confidence patterns, (in)frequent patterns, …
How is contrast data mining used?

Domain understanding
“Young children with diabetes have a greater risk of hospital admission, compared to the rest of the population”

Used for building classifiers
Many different techniques - to be covered later
Also used for weighting and ranking instances

Used for monitoring
“Tell me when something unusual (unlike others in this class) arrives”

Understanding can help us do prevention; prediction can help us do treatment. An ounce of prevention is worth a pound of cure!
Emerging Patterns
(Support = frequency)

Emerging Patterns (EPs) are contrast patterns between two classes of data whose support changes significantly between the two classes. “Significant change” can be defined by:
  big support ratio: supp2(X)/supp1(X) >= minRatio
    (similar to RiskRatio; +: allowing patterns with small overall support)
  big support difference: |supp2(X) - supp1(X)| >= minDiff
    (as defined by Bay+Pazzani 99)

If supp2(X)/supp1(X) = infinity, then X is a jumping EP.
A jumping EP occurs in some members of one class but never occurs in the other class.

Here, X is the AND of a set of simple conditions.
Extension to OR was also studied.
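The two definitions of “significant change” can be sketched in a few lines of Python. This is only an illustration: the function names (`support`, `growth_rate`, `is_emerging`) and the two tiny datasets are hypothetical, and each record is assumed to be a set of items so that a pattern X (an AND of conditions) matches a record when X is a subset of it.

```python
import math

def support(pattern, dataset):
    """Fraction of records in `dataset` that contain every item of `pattern`."""
    if not dataset:
        return 0.0
    return sum(1 for record in dataset if pattern <= record) / len(dataset)

def growth_rate(pattern, d1, d2):
    """supp2(X)/supp1(X); math.inf marks a jumping EP, 0.0 if X occurs in neither."""
    s1, s2 = support(pattern, d1), support(pattern, d2)
    if s1 == 0:
        return math.inf if s2 > 0 else 0.0
    return s2 / s1

def is_emerging(pattern, d1, d2, min_ratio=None, min_diff=None):
    """Apply whichever thresholds are given (support ratio and/or support difference)."""
    ok = True
    if min_ratio is not None:
        ok = ok and growth_rate(pattern, d1, d2) >= min_ratio
    if min_diff is not None:
        ok = ok and abs(support(pattern, d2) - support(pattern, d1)) >= min_diff
    return ok

# Hypothetical toy data: class 1 and class 2, four records each.
D1 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
D2 = [{"a", "b", "d"}, {"a", "b"}, {"a", "b", "c"}, {"c", "d"}]

print(growth_rate({"a", "b"}, D1, D2))   # support goes from 0.5 to 0.75
print(growth_rate({"d"}, D1, D2))        # never occurs in D1: a jumping EP
```

With these toy sets, {a,b} passes a ratio threshold of 1.5 but fails a difference threshold of 0.3, which shows how the two definitions can disagree on the same pattern.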
Example EP in microarray data for cancer
(binned data; rows are tissues, columns are genes)

Normal Tissues        Cancer Tissues
g1  g2  g3  g4        g1  g2  g3  g4
L   H   L   H         H   H   L   H
L   H   L   L         L   H   H   H
H   L   L   H         L   L   L   H
L   H   H   L         H   H   H   L

EP example: X={g1=L,g2=H,g3=L}; suppN(X)=50%, suppC(X)=0
Use minimality to reduce number of mined EPs
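The example can be checked by brute force. The sketch below assumes the binned values on this slide read row by row across the two tables (four normal and four cancer tissues over g1..g4); the helper names are hypothetical, and the exhaustive enumeration is only practical for a toy table like this one, not for real microarray data.

```python
from itertools import combinations

GENES = ("g1", "g2", "g3", "g4")
# One string per tissue, one letter per gene (assumed reading of the slide's table).
NORMAL = ["LHLH", "LHLL", "HLLH", "LHHL"]
CANCER = ["HHLH", "LHHH", "LLLH", "HHHL"]

def as_items(row):
    """Turn 'LHLH' into {('g1','L'), ('g2','H'), ('g3','L'), ('g4','H')}."""
    return frozenset(zip(GENES, row))

normal = [as_items(r) for r in NORMAL]
cancer = [as_items(r) for r in CANCER]

def support(pattern, tissues):
    return sum(pattern <= t for t in tissues) / len(tissues)

def minimal_jumping_eps(home, other):
    """Patterns that occur in `home`, never in `other`, and have no proper
    subset with the same property."""
    candidates = {frozenset(c)
                  for t in home
                  for k in range(1, len(t) + 1)
                  for c in combinations(sorted(t), k)}
    jumping = {p for p in candidates if support(p, other) == 0}
    return {p for p in jumping if not any(q < p for q in jumping)}

X = frozenset({("g1", "L"), ("g2", "H"), ("g3", "L")})
print(support(X, normal), support(X, cancer))
print(X in minimal_jumping_eps(normal, cancer))
```

Every proper subset of X occurs in at least one cancer tissue, so X is not only a jumping EP for the normal class but also a minimal one.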
Top support minimal jumping EPs for colon cancer

Colon Cancer EPs
{1+ 4- 112+ 113+} 100%
{1+ 4- 113+ 116+} 100%
{1+ 4- 113+ 221+} 100%
{1+ 4- 113+ 696+} 100%
{1+ 108- 112+ 113+} 100%
{1+ 108- 113+ 116+} 100%
{4- 108- 112+ 113+} 100%
{4- 109+ 113+ 700+} 100%
{4- 110+ 112+ 113+} 100%
{4- 112+ 113+ 700+} 100%
{4- 113+ 117+ 700+} 100%
{1+ 6+ 8- 700+} 97.5%

Colon Normal EPs
{12- 21- 35+ 40+ 137+ 254+} 100%
{12- 35+ 40+ 71- 137+ 254+} 100%
{20- 21- 35+ 137+ 254+} 100%
{20- 35+ 71- 137+ 254+} 100%
{5- 35+ 137+ 177+} 95.5%
{5- 35+ 137+ 254+} 95.5%
{5- 35+ 137+ 419-} 95.5%
{5- 137+ 177+ 309+} 95.5%
{5- 137+ 254+ 309+} 95.5%
{7- 21- 33+ 35+ 69+} 95.5%
{7- 21- 33+ 69+ 309+} 95.5%
{7- 21- 33+ 69+ 1261+} 95.5%

Very few 100% support EPs.
These EPs have 95%-100% support in one class but 0% support in the other class.
Minimal: Each proper subset occurs in both classes.
EPs from Mao+Dong 05 (gene club + border-diff).
There are ~1000 items with supp >= 80%.
Colon cancer dataset (Alon et al, 1999 (PNAS)): 40 cancer tissues, 22 normal tissues, 2000 genes.
Besides uses discussed earlier, another potential use of minimal jumping EPs:

Minimal jumping EPs for normal tissues
=> Properly expressed gene groups important for normal cell functioning, but destroyed in all colon cancer tissues
=> Restore these => cure colon cancer?

Minimal jumping EPs for cancer tissues
=> Bad gene expression groups that occur in some cancer tissues but never occur in normal tissues
=> Disrupt these => cure colon cancer?

Possible targets for drug design?
(Li+Wong 02 proposed “gene therapy using EP” idea)
Paper using EP published in Cancer Cell (cover, 3/02).
EPs have been applied in medical applications for diagnosing acute lymphoblastic leukemia etc.
EP Mining Algorithms and Studies

Complexity result (Wang et al 05)
Border-differential algorithm (Dong+Li 99)
Gene club + border differential (Mao+Dong 05)
Constraint-based approach (Zhang et al 00)
Tree-based approach (Bailey et al 02, Fan+Kotagiri 02)
Projection based algorithm (Bailey et al 03)
ZBDD based method (Loekito+Bailey 06)
Equivalence class based (Li et al 07)
(Callout: Can handle 200+ dimensions)
Contrast pattern based classification
-- history

Contrast pattern based classification: Methods to build or improve
classifiers, using contrast patterns














CBA (Liu et al 98)
CAEP (Dong et al 99)
Instance based method: DeEPs (Li et al 00, 04)
Jumping EP based (Li et al 00), Information based (Zhang et al 00), Bayesian
based (Fan+Kotagiri 03), improving scoring for >=3 classes (Bailey et al 03)
CMAR (Li et al 01)
Top-ranked EP based PCL (Li+Wong 02)
CPAR (Yin+Han 03)
Weighted decision tree (Alhammady+Kotagiri 06)
Rare class classification (Alhammady+Kotagiri 04)
Constructing supplementary training instances (Alhammady+Kotagiri 05)
Noise tolerant classification (Fan+Kotagiri 04)
One-class classification/detection of outlier cases (Chen+Dong 06)
…
Most follow the aggregating approach of CAEP.
EP-based classifiers: rationale




Consider a typical EP in the Mushroom dataset, {odor = none,
stalk-surface-below-ring = smooth, ring-number = one}; its support
increases from 0.2% in “poisonous” to 57.6% in “edible” (support
ratio = 288).
Strong differentiating power: if a test case T contains this EP, we
can predict T as edible with high confidence 99.6% =
57.6/(57.6+0.2)
A single EP is usually sharp in telling the class of a small fraction
(e.g. 3%) of all instances. Need to aggregate the power of many
EPs to make the classification.
EP based classification methods often outperform state-of-the-art
classifiers, including C4.5 and SVM. They are also noise tolerant.
CAEP (Classification by Aggregating Emerging Patterns)
Given a test case T, obtain T’s score for each class by aggregating the discriminating power of the EPs contained in T; assign the class with the maximal score as T’s class.
The discriminating power of EPs is expressed in terms of supports and growth rates. Prefer large supRatio, large support.
The contribution of one EP X (support weighted confidence):
  strength(X) = sup(X) * supRatio(X) / (supRatio(X)+1)
Given a test case T and a set E(Ci) of EPs for class Ci, the aggregate score of T for Ci is
  score(T, Ci) = Σ strength(X)   (over X in E(Ci) matching T)
(For comparison: CMAR aggregates “Chi2 weighted Chi2”.)
For each class, may use the median (or 85%) aggregated value to normalize, to avoid bias towards the class with more EPs.
How does CAEP work? An example

Given a test case T={a,d,e}, how to classify T?
T contains EPs of class 1: {a,e} (50%:25%) and {d,e} (50%:25%), so
  Score(T, class 1) = 0.5*[0.5/(0.5+0.25)] + 0.5*[0.5/(0.5+0.25)] = 0.67
T contains EPs of class 2: {a,d} (25%:50%), so
  Score(T, class 2) = 0.33
T will be classified as class 1 since Score1 > Score2.
[Table: transactions of Class 1 (D1) and Class 2 (D2) over items a, b, c, d, e]
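The arithmetic of the worked example with T={a,d,e} can be reproduced with a short sketch. The function names `strength` and `caep_score` are hypothetical, only the supports quoted in the example are used (the underlying transactions are not needed), and the treatment of jumping EPs (zero support in the other class) is an assumption based on supRatio/(supRatio+1) tending to 1 as the ratio grows.

```python
def strength(sup_home, sup_other):
    """sup(X) * supRatio(X) / (supRatio(X) + 1), with supRatio = sup_home / sup_other."""
    if sup_other == 0:
        # Assumption: for a jumping EP the ratio term is taken as its limit, 1.
        return sup_home
    ratio = sup_home / sup_other
    return sup_home * ratio / (ratio + 1)

def caep_score(test_case, eps):
    """Sum the strengths of the EPs in `eps` that are contained in `test_case`."""
    return sum(strength(sup_home, sup_other)
               for pattern, sup_home, sup_other in eps
               if pattern <= test_case)

# (pattern, support in the EP's own class, support in the other class),
# taken from the worked example.
EPS_CLASS1 = [(frozenset("ae"), 0.50, 0.25), (frozenset("de"), 0.50, 0.25)]
EPS_CLASS2 = [(frozenset("ad"), 0.50, 0.25)]

T = frozenset("ade")
scores = {"class 1": caep_score(T, EPS_CLASS1),
          "class 2": caep_score(T, EPS_CLASS2)}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Note that sup * ratio / (ratio + 1) equals sup * sup_home / (sup_home + sup_other), which is the form used in the example's 0.5*[0.5/(0.5+0.25)] terms. The per-class normalization by a median or percentile score is not included here.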
DeEPs (Decision-making by Emerging Patterns)


An instance based (lazy) learning method, like k-NN; but does not
use the normal distance measure.
For a test instance T, DeEPs





First project all training instances to contain only items in T
Discover EPs from the projected data
Use these EPs to get the training data that match some discovered EPs
Finally, use the proportional size of matching data in a class C as T’s
score for C
Advantage: disallow similar EPs to give duplicate votes!
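The four steps can be sketched as follows. This is a heavily simplified illustration, not the published DeEPs procedure: it assumes only jumping EPs are used, it enumerates them by brute force (feasible only for tiny test instances), and the function name `deeps_scores` and the training data are hypothetical.

```python
from itertools import combinations

def deeps_scores(test, training):
    """training: dict mapping class label -> list of item sets. Returns label -> score."""
    # Step 1: project every training instance onto the items of the test instance.
    projected = {c: [inst & test for inst in insts] for c, insts in training.items()}
    scores = {}
    for c, own in projected.items():
        others = [p for o, ps in projected.items() if o != c for p in ps]
        # Step 2: discover EPs in the projected data. Simplification: keep only
        # jumping EPs, i.e. itemsets seen in class c and in no other class.
        eps = {frozenset(x)
               for p in own
               for k in range(1, len(p) + 1)
               for x in combinations(sorted(p), k)
               if not any(set(x) <= q for q in others)}
        # Step 3: training instances of c that match at least one discovered EP.
        matched = sum(1 for p in own if any(e <= p for e in eps))
        # Step 4: the proportion of matching instances is the score for c.
        scores[c] = matched / len(own) if own else 0.0
    return scores

# Hypothetical training data with two classes.
TRAINING = {
    "C1": [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}],
    "C2": [{"b", "c"}, {"c", "d"}, {"a", "d"}],
}
scores = deeps_scores(frozenset("abc"), TRAINING)
print(scores)
```

Each training instance is counted at most once, however many EPs it matches, which is how similar EPs are kept from casting duplicate votes.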
Why EP-based classifiers are good



Use the discriminating power of low support EPs (with high
supRatio), in addition to the high support ones
Use multi-feature conditions, not just single-feature conditions
Select from larger pools of discriminative conditions



Compare: Search space of patterns for decision trees is limited by
early greedy choices.
Aggregate/combine the discriminating power of a diversified
committee of “experts” (EPs)
Decision of such classifiers is highly explainable
Also Studied Contrast Pattern Mining for








Sequence family A vs sequence family B
Graph collection A vs graph collection B
Build contrast pattern based clustering quality index
Constructing synthetic training data for classes with few training
instances
…
More than 6 PhD dissertations
About 50 research papers
A tutorial given at IEEE ICDM 2007
Multi-dimensional multi-level (MDML) data mining in data cubes

A data cube is used for discovering patterns captured in consolidated historical data for a company/organization: rules, anomalies, unusual factor combinations

A data cube is focused on modeling & analysis of data for decision makers, not daily operations.

Data is organized around major subjects or factors, such as customer, product, time, sales.

The cube “contains” a huge number of MDML summaries for “segments” or “sectors” at different levels of detail

Basic OLAP operations: Drill down, roll up, slice and dice, pivot
Data Cubes: Base Table & Hierarchies
Base table stores sales volume (measure), a function of product, time, & location (dimensions)
Hierarchical summarization paths:
  Product: Industry > Category > Product
  Location: Region > Country > City > Office
  Time: Year > Quarter > Month, Week > Day
*: all (as top of each dimension)
[Figure: data cube over product, location, and time, with a base cell marked]
Data Cubes: Derived Cells
Measures: sum, count, avg, max, min, std, …
[Figure: sales cube with dimensions Time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Location (U.S.A, Canada, Mexico, sum); the derived cell (TV,*,Mexico) is marked]
Derived cells, different levels of details
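A derived cell such as (TV,*,Mexico) is an aggregate of the base cells it covers. The sketch below computes every such cell for the sum measure; the base table values are hypothetical, the function name `derived_cells` is an assumption, and only the top level “*” of each dimension is modeled (intermediate hierarchy levels such as Year or Region are left out).

```python
from collections import defaultdict
from itertools import product

# Hypothetical base table: (product, quarter, country) -> sales volume.
BASE = {
    ("TV", "1Qtr", "Mexico"): 10,
    ("TV", "2Qtr", "Mexico"): 20,
    ("TV", "1Qtr", "Canada"): 5,
    ("PC", "1Qtr", "Mexico"): 7,
}

def derived_cells(base):
    """Sum the measure into every cell obtained by replacing any subset of
    dimension values with '*' (the base cells themselves are included)."""
    cube = defaultdict(int)
    for cell, value in base.items():
        # Each dimension is either kept as is or generalized to '*'.
        for mask in product((False, True), repeat=len(cell)):
            key = tuple("*" if star else v for v, star in zip(cell, mask))
            cube[key] += value
    return dict(cube)

cube = derived_cells(BASE)
print(cube[("TV", "*", "Mexico")])   # TV sales in Mexico over all quarters
print(cube[("*", "*", "*")])         # grand total
```

Materializing all 2^d generalizations per base cell grows quickly with the number of dimensions d, which is one reason selective approaches such as iceberg cubes exist.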
Gradient mining in data cubes


Find syntactically similar cells with significantly different
measure values
EG:

(house,California,May,2008), total-sale=100M
vs (house,Iowa,May,2008), total-sale = 200M

*** This is made up to show the point ***

Other people studied: iceberg cubes, cells
significantly different from neighbors, …
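A naive version of this search can be written as a pairwise comparison. The sketch assumes that “syntactically similar” means the two cells differ in exactly one dimension and that “significantly different” means a measure ratio of at least `min_ratio`; both are illustrative choices, as is the function name `gradient_pairs`. The first two cells are the made-up example above, and the third is an additional hypothetical cell.

```python
from itertools import combinations

def gradient_pairs(cells, min_ratio):
    """Pairs of cells that differ in exactly one dimension and whose measures
    differ by a factor of at least `min_ratio`."""
    result = []
    for (c1, m1), (c2, m2) in combinations(cells.items(), 2):
        differing = sum(a != b for a, b in zip(c1, c2))
        if differing != 1:
            continue
        low, high = sorted((m1, m2))
        if low > 0 and high / low >= min_ratio:
            result.append((c1, c2, high / low))
    return result

# Total sale in millions.
CELLS = {
    ("house", "California", "May", 2008): 100,
    ("house", "Iowa", "May", 2008): 200,
    ("house", "Ohio", "May", 2008): 110,   # hypothetical extra cell
}

pairs = gradient_pairs(CELLS, min_ratio=2.0)
for c1, c2, ratio in pairs:
    print(c1, "vs", c2, "ratio", ratio)
```

The quadratic pairwise scan is only meant to make the definition concrete; it would not scale to a real cube.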
Multi-Dimensional Trends Analysis of
Sets of Time-Series in Data Cubes




Consider applications having many time series
  ECG curves, stocks, power grids, sensor networks, internet, gene expressions for toxicology study, …
Need MDML trends analysis
  Mining/monitoring unusual patterns/events, in MDML manner
  E.g. find good sets of stocks with desired total risk/reward ratios
Regression cube for time series
  Store regression base cube
  Support MDML OLAP of regressions
Results also useful for MDML data stream monitoring
Example: Aggregating Set of Time Series
Two component cells
Aggregated cell
Deriving regression of
aggregated cell from
regression of component
cells
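One setting in which the aggregated regression follows directly from the component regressions is sketched below. It assumes the aggregated cell is the pointwise sum of two component series observed at the same time points and that “regression” means an ordinary least-squares line; under those assumptions the fit is linear in the observed values, so intercepts and slopes simply add. The series values and the function name `linear_fit` are hypothetical.

```python
def linear_fit(ys):
    """Least-squares line y = intercept + slope * t over t = 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope

# Two hypothetical component cells and their aggregate (pointwise sum).
series_a = [1.0, 2.0, 2.5, 4.0]
series_b = [10.0, 9.0, 9.5, 8.0]
aggregated = [a + b for a, b in zip(series_a, series_b)]

ia, sa = linear_fit(series_a)
ib, sb = linear_fit(series_b)
iagg, sagg = linear_fit(aggregated)

print("sum of component fits:", ia + ib, sa + sb)
print("fit of aggregated series:", iagg, sagg)
```

The practical point is that only the regression parameters of the component cells need to be stored; the raw series are not required to answer the aggregated query in this setting.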
In-Network Detection of Shapes of Region-Based Events in Sensor Networks
[Figure: a network of sensor nodes, several of them sensing an event]
Each sensor can sense events, and talk with neighbors
Research Problems Studied
Detection of Region-Based Events: given a sensor
network, when a region-based event occurs, report the
spatial geometric information, which may include



the boundaries and the shape of the region;
positions of important points;
important metrics: length, area, density…
Tracking of Region-Based Events: after initial detection
of a region-based event, determine its spatial dynamic
parameters (moving direction, speed, expansion rate of
area, etc).
Computation is done in the sensor network, which is
organized into an R-tree.
Multiple platforms/labs dataset
concordance/consistency evaluation

Microarrays (supplied by different manufacturers) are used
to measure gene expressions in tissues, by different labs.

Without knowing the concordance between platform/lab
conditions, it is hard to transfer knowledge
(patterns/classifiers) from one lab to another

We provide measures and techniques to address this
problem, based on “discriminating gene/classifier
transferability”
Summarizing clusterings of documents




We often need to process large collections of
documents (abstracts, articles, Google search results, …)
We need methods to help us quickly get a sense
of the main themes of the documents
We gave methods to find “summary word sets”
(cluster description sets) to describe clusterings
of documents
Words in a summary set for a cluster should be
typical in the cluster, and be rare in other clusters
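The “typical inside, rare outside” criterion can be illustrated with a simple score. The sketch below is not the method from the talk: it assumes a word's score is its document frequency inside the cluster minus its document frequency in all other clusters, and the function name `summary_words` and the tiny clusters are hypothetical.

```python
from collections import Counter

def summary_words(clusters, k=2):
    """clusters: dict name -> list of documents (each a list of words).
    Returns the k highest-scoring words per cluster."""
    # Document frequency of each word inside each cluster.
    df = {name: Counter(w for doc in docs for w in set(doc))
          for name, docs in clusters.items()}
    summaries = {}
    for name, docs in clusters.items():
        outside_docs = sum(len(d) for n, d in clusters.items() if n != name)
        scored = []
        for word, count in df[name].items():
            inside = count / len(docs)
            outside_hits = sum(df[n][word] for n in clusters if n != name)
            outside = outside_hits / outside_docs if outside_docs else 0.0
            scored.append((inside - outside, word))
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda s: (-s[0], s[1]))
        summaries[name] = [w for _, w in scored[:k]]
    return summaries

CLUSTERS = {
    "mining": [["pattern", "mining", "data"], ["pattern", "classifier", "data"]],
    "sensors": [["sensor", "network", "data"], ["sensor", "event", "data"]],
}
summaries = summary_words(CLUSTERS)
print(summaries)
```

The word “data” is typical in both clusters, so it scores zero and is left out of both summaries, which is the behavior the criterion asks for.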
Alternative Clustering




Clustering is usually performed on poorly
understood datasets
Multiple clusterings (ways to group the data) may
exist
Need methods to discover alternative clusterings
We gave algorithms to solve this problem, and
introduced a new similarity measure between
clusterings
Undesirable object converter mining
We have a class of desirable objects and a class of undesirable objects.
The goal is to mine “small sets of attribute changes, which when applied to undesirable objects, may change those objects’ class from undesirable to desirable.”
We considered two types of converter sets: personalized and universal.
We gave algorithms to mine them.
Data mining for knowledge transfer
We have two application domains: a well understood one and a less understood one.
The goal is to mine knowledge that can be transferred from the well understood domain to the less understood domain, to solve problems in the less understood domain.
Comparative summary of search results




We often perform multiple searches on the web or on a
document collection.
There is an information overload, when we process the
search results.
We developed tools to compare and summarize the search
results to reduce the information overload.
Compare two searches, examples:
  Same key words searched at two time points
  Same key words searched over two locations, etc.
Outline of Some Recent Works,
Review












Contrast data mining
Contrast pattern based classifiers
Contrast pattern mining on sequence data
Real-time mining/analysis of sensor network data
Multi-dimensional multi-level data mining in data cubes
Mining large collections of time series
Microarray concordance analysis using contrast patterns
Summarizing clusterings of abstracts/articles
Alternative clustering
Conversion of undesirable objects
Data mining for knowledge transfer
Comparative summary of search results
Thank you

List of papers available at

http://www.cs.wright.edu/~gdong/
Email: [email protected]

Collaboration opportunities to work on your
problems are welcome