Extending Association Analysis
Michael Steinbach
Ph.D. Defense
© 2005 M. Steinbach
Ph.D. Defense
Outline
Introduction
Extending association analysis to non-binary
data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work
Traditional Association Analysis

Association analysis: analyzes relationships among items (attributes) in binary transaction data
– Example data: market basket data
– Data can be represented as a binary matrix
– Applications in business and science

Set-Based Representation of Data

TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke

Binary Matrix Representation of Data

TID | Bread | Milk | Diaper | Beer | Eggs | Coke
 1  |   1   |  1   |   0    |  0   |  0   |  0
 2  |   1   |  0   |   1    |  1   |  1   |  0
 3  |   0   |  1   |   1    |  1   |  0   |  1
 4  |   1   |  1   |   1    |  1   |  0   |  0
 5  |   1   |  1   |   1    |  0   |  0   |  1

Two types of patterns
– Itemsets: a collection of items. Example: {Milk, Diaper}
– Association Rules: X → Y, where X and Y are itemsets. Example: Milk → Diaper
Traditional Association Analysis …
Association measures evaluate the strength of an association pattern
– Support and confidence are the most commonly used
– The support, σ(X), of an itemset X is the number of transactions that contain all the items of the itemset
  Frequent itemsets have support > a specified threshold
  Different types of itemset patterns are distinguished by a measure and a threshold
– The confidence of an association rule is given by conf(X → Y) = σ(X ∪ Y) / σ(X)
  Estimate of the conditional probability of Y given X
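As a concrete illustration of these two measures, here is a minimal sketch (not from the thesis) that computes support and confidence on the market basket data shown earlier; the function names are mine.

```python
# Binary matrix for the five market basket transactions.
data = {
    "Bread":  [1, 1, 0, 1, 1],
    "Milk":   [1, 0, 1, 1, 1],
    "Diaper": [0, 1, 1, 1, 1],
    "Beer":   [0, 1, 1, 1, 0],
    "Eggs":   [0, 1, 0, 0, 0],
    "Coke":   [0, 0, 1, 0, 1],
}

def support(itemset):
    """sigma(X): number of transactions containing all items of X."""
    cols = [data[i] for i in itemset]
    return sum(all(vals) for vals in zip(*cols))

def confidence(X, Y):
    """conf(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support(X | Y) / support(X)

print(support({"Milk", "Diaper"}))       # 3
print(confidence({"Milk"}, {"Diaper"}))  # 0.75
```

The confidence value 3/4 is exactly the conditional-probability estimate described above: of the four transactions containing Milk, three also contain Diaper.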
Traditional Association Analysis …
Process of finding interesting patterns:
1. Find frequent itemsets using a support threshold
2. Find association rules for frequent itemsets
3. Sort association rules according to confidence
Support filtering is necessary
– To eliminate spurious patterns
– For efficiency, we need the anti-monotone property: X ⊆ Y implies σ(Y) ≤ σ(X)
Confidence is used because of its interpretation as conditional probability.

[Figure: itemset lattice for items {A, B, C, D, E}, from the null itemset up to ABCDE]
Given d items, there are 2^d possible candidate itemsets
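The level-wise process above can be sketched in a few lines. This is an illustrative Apriori-style search (not the thesis's own code): level-(k+1) candidates are generated only from frequent level-k itemsets, which is sound precisely because of the anti-monotone property.

```python
from itertools import combinations

# The five market basket transactions from the earlier slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def apriori(transactions, minsup):
    items = sorted(set().union(*transactions))
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        # Count support of each candidate in one pass over the data.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(survivors)
        # Build next-level candidates; prune any with an infrequent subset.
        prev = set(survivors)
        candidates = set()
        for a in prev:
            for b in prev:
                u = a | b
                if len(u) == len(a) + 1 and all(
                    frozenset(s) in prev for s in combinations(u, len(a))
                ):
                    candidates.add(u)
        level = list(candidates)
    return frequent

fi = apriori(transactions, minsup=3)
print(len(fi))  # 8 frequent itemsets (4 items + 4 pairs)
```

With a support threshold of 3, {Bread, Milk, Diaper} is counted but rejected (support 2), so no level-3 pattern survives; this is the "hit the wall" behavior in miniature.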
Extending Association Analysis
Why extend association analysis?
– To address limitations of existing schemes for
association analysis
– To create new kinds of useful patterns
– To better understand the structure of the
association patterns in a data set
Limitations of Association Analysis
Traditional association analysis does not apply to
– Non-binary data
  Must transform data into binary transaction data to apply traditional association analysis techniques.
  Order and magnitude information can be lost.
  Can often “make it work” by coding combinations of values, but this adds complexity and explodes the number of items.
  Limited solutions exist, e.g., Min-Apriori for document data (Han, Karypis, Kumar 1997)
– Non-traditional association patterns
  Error-Tolerant Itemsets (ETIs) (Yang, Fayyad, and Bradley 2001)
  General Boolean formulas (Bollman-Sdorra, et al. 01; Srikant et al. 97)
Limitations of Association Analysis …
Support and confidence are not appropriate for all
applications
Example involving coffee and tea:
– Every customer in a grocery store purchases coffee
– Only 1/4 of the customers purchase tea
– conf(tea → coffee) = 1
– But this is misleading because any item implies coffee
– This problem is common when the frequency of items has a
skewed support distribution
– This cross-support problem can be addressed by using other
measures, such as h-confidence (hyperclique pattern)
Limitations of Association Analysis …
Lack of knowledge of structure of association patterns
– Support threshold is critical
If too high, no patterns
If too low, too many patterns
– At some support threshold,
algorithms to find association
patterns “hit the wall”
– Particular difficulty in finding
patterns with low support
LPMiner (Seno, Karypis 2001)
[Figure: from Summary of Results, Frequent Itemset Mining Implementations 2003]
Overview and Contributions
Presentation and contributions fall into three
categories
1. A mathematical framework to extend
association analysis to non-binary data and
non-traditional patterns
Generalizing the notion of support
– Extend the hyperclique pattern (Xiong, et al 2003)
to continuous data
Generalizing the notion of confidence
– Define notion of confidence for Error-Tolerant
Itemsets
Overview and Contributions
2. A framework for creating new types of
association measures (and their accompanying
itemset patterns)
Can use any pairwise association or proximity
measure as the basis for defining a measure of
itemset strength
– Examples: cosine, confidence, correlation
All measures have the anti-monotone property
3. Analyzing the structure of association patterns
Introduce the notion of support envelopes
Can visualize the structure of association
patterns
Publications Related to Thesis
Steinbach, M., Tan, P., Xiong, H., and Kumar, V.,
Generalizing the Notion of Support. KDD '04, pp. 689-694,
Seattle, WA, August 22 - 25, 2004.
Steinbach, M. and Kumar, V.,
Generalizing the Notion of Confidence. ICDM’05, to appear,
Houston, TX, November 27 - 30, 2005.
Steinbach, M., Tan, P., and Kumar, V.,
Support Envelopes: A Technique for Exploring the Structure
of Association Patterns. KDD '04, pp. 689-694, Seattle, WA,
August 22 - 25, 2004.
Additional Publications
Books:
P.-N. Tan, M. Steinbach, and V. Kumar,
Introduction to Data Mining, Pearson Addison-Wesley, May, 2005.
Book Chapters:
V. Kumar, P.-N. Tan, and M. Steinbach, Data Mining, in Handbook of Data
Structures and Applications, CRC Press, 2004.
M. Steinbach, L. Ertoz, and V. Kumar, Challenges of Clustering High
Dimensional Data. in New Vistas in Statistical Physics - Applications in
Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag,
2004.
L. Ertoz, M. Steinbach, and Vipin Kumar, Finding Topics in Collections of
Documents: A Shared Nearest Neighbor Approach, in Clustering and
Information Retrieval, 2003, Kluwer Academic Publishers.
P. Zhang, M. Steinbach, V. Kumar, S. Shekhar, P.-N. Tan, S. Klooster, and C.
Potter, Discovery of Patterns of Earth Science Data Using Data Mining, in
Next Generation of Data Mining Applications, IEEE Press, 2005.
Additional Publications …
Journal Articles:
H. Xiong, G. Pandey, M. Steinbach, and V. Kumar, Enhancing Data Analysis with Noise Removal, IEEE
Transactions on Knowledge and Data Engineering (TKDE), 2006, accepted for publication as a regular paper.
C. Potter, P.-N.Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, and V. Genovese, Major Disturbance Events in
Terrestrial Ecosystems Detected using Global Satellite Data Sets, Global Change Biology, 2003.
C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, S. Shekhar, R. Nemani, and R. Myneni, Global
Teleconnections of Ocean Climate to Terrestrial Carbon Flux, J. of Geophysical Research, Vol. 108, No. D17,
4556, 2003.
C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, S. Shekhar, and C. Carvalho, Understanding Global
Teleconnections of Climate to Regional Model Estimates of Amazon Ecosystem Carbon Fluxes, Global Change
Biology, 2003
C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, R. Myneni, V. Genovese, Variability in Terrestrial Carbon
Sinks Over Two Decades: Part 1-North America, Earth Interactions, 2003.
Conferences:
H. Xiong, M. Steinbach, and V. Kumar, Privacy Leakage in Multi-relational Databases via Pattern-based Semi-supervised Learning, in Proc. of the ACM Conference on Information and Knowledge Management (CIKM 2005), Bremen, Germany, 2005.
H. Xiong, M. Steinbach, P.-N. Tan, and V. Kumar, HICAP: Hierarchical Clustering with Pattern Preservation, in Proc.
2004 SIAM International Conf. on Data Mining (SDM 2004), pp. 279 - 290, Florida, 2004
M. Steinbach, P.N Tan, V. Kumar, S. Klooster, C. Potter: Discovery of climate indices using clustering. KDD 2003:
446-455
L. Ertöz, M. Steinbach, and V. Kumar: Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High
Dimensional Data. SDM 2003.
Additional Publications …
Workshops:
M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, and S. Klooster, Temporal Data Mining for the Discovery and
Analysis of Ocean Climate Indices, KDD Workshop on Temporal Data Mining, 2002.
M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, and S. Klooster, Data Mining for the Discovery of Ocean
Climate Indices, The Fifth Workshop on Scientific Data Mining, 2nd SIAM International Conference on
Data Mining, 2002.
V. Kumar, M. Steinbach, P.-N. Tan, S. Klooster, C. Potter, A. Torregrosa, Mining Scientific Data: Discovery
of Patterns in the Global Climate System, Joint Statistical Meeting, 2001.
M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, S. Klooster, A. Torregrosa, Clustering Earth Science Data:
Goals, Issues and Results, KDD Workshop on Mining Scientific Datasets, 2001.
P.-N. Tan, M. Steinbach, V. Kumar, C. Potter, S. Klooster, A. Torregrosa, Finding Spatio-Temporal Patterns
in Earth Science Data, KDD Workshop on Temporal Data Mining, 2001.
M. Steinbach, G. Karypis, and V. Kumar, Efficient Algorithms for Creating Product Catalogs, Web Mining
Workshop, 1st SIAM International Conference on Data Mining, Chicago, IL, 2001.
L. Ertoz, M. Steinbach, and V. Kumar, Finding Topics in Collections of Documents: A Shared Nearest
Neighbor Approach, Text Mine'01, Workshop on Text Mining, 1st SIAM International Conference on Data
Mining, Chicago, IL, April, 2001.
M. Steinbach, G. Karypis, and V. Kumar, A Comparison of Document Clustering Techniques, Text Mining
Workshop, KDD 2000, Boston, MA, August, 2000.
Outline
Introduction
Extending association analysis to non-binary data and non-traditional
patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work
Generalizing Support: Problem Statement
Challenge: Create a framework for generalizing
support that
– Handles non-binary data (ordinal, continuous)
– Handles new types of patterns
– Allows people to more easily express, explore, and communicate new types of association patterns
Motivating examples for continuous data
Document Data
Microarray data
www.biology.ucsc.edu/mcd/research.html
Proposed Approach
Proposed Approach: Support (σ) can be viewed as being composed of two steps (functions):
– Evaluate the strength of a pattern in each object (transaction)
  Evaluation vector is given by v = eval(X)
– Summarize all these evaluations with a single number
  Summarization (norm) function measures the strength of the pattern in all transactions

Example
– eval = ∧ (logical and)
– X = { Milk, Diapers }
– norm = sum
σ(X) = (norm ∘ eval)(X) = norm(eval(X)) = norm(v)

TID | Milk | Diapers | v = Milk ∧ Diapers
 1  |  1   |   0     |  0
 2  |  0   |   1     |  0
 3  |  1   |   1     |  1
 4  |  1   |   1     |  1
 5  |  1   |   1     |  1
norm(v) = 3
Evaluation and Summarization Functions
Evaluation functions
– Boolean functions constructed from and (∧), or (∨), and not (¬)
– min, max, range
– product
– Special purpose: Error-Tolerant Itemsets

Summarization functions
– Vector norms: L1, L2, and L2 squared
– Sums: average, weighted average
– Weighted vector norms
Usefulness of Support Framework
Traditional support results from a number of choices
– eval = { , min, }
– norm = { L1 , L2 squared, sum }
– Any of these nine combinations give the traditional support for
binary data
– But for continuous data, these support measures are different
Can extend a recently developed association pattern,
the hyperclique pattern (Xiong, et al. 2003), to
continuous data
– eval = min
– norm = L2 squared
Has led to the creation of a new kind of pattern defined
by range support
– eval = range
– norm = L2 squared
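To show how these eval/norm choices diverge on continuous data, here is an illustrative sketch (the small data values are made up): eval = min with norm = L2 squared gives the continuous hyperclique support, while eval = range gives the range support.

```python
# Two continuous attributes over four transactions (illustrative values).
cols = [
    [0.5, 0.0, 0.4, 0.3],  # attribute 1
    [0.6, 0.1, 0.2, 0.3],  # attribute 2
]

def support(cols, eval_fn, norm_fn):
    """Generalized support: summarize per-transaction pattern strengths."""
    v = [eval_fn(vals) for vals in zip(*cols)]
    return norm_fn(v)

l2_squared = lambda v: sum(x * x for x in v)
rng = lambda vals: max(vals) - min(vals)

min_l2sq = support(cols, min, l2_squared)    # mins: 0.5, 0.0, 0.2, 0.3
range_l2sq = support(cols, rng, l2_squared)  # ranges: 0.1, 0.1, 0.2, 0.0
print(min_l2sq, range_l2sq)
```

The min-based support rewards transactions where all attributes are strong, while the range-based support rewards transactions where the attributes agree; the same data yields 0.38 and 0.06 respectively.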
Outline
Introduction
Extending association analysis to non-binary data and non-traditional
patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work
Generalizing Confidence: Problem Statement
Challenge: Create a framework for
generalizing confidence that
– Handles non-binary data (ordinal, continuous)
– Handles new types of patterns
– Allows people to more easily express, explore, and communicate new types of association patterns
Example: Error-Tolerant Itemsets
A (strong) error-tolerant itemset (ETI) can have a fraction
of the items missing in each transaction.
Example: see the data in the table
[Table: example ETI data over items i1–i8, shown as a figure in the original slide]
– Let ε = 5/8. In other words, each transaction only needs to have 3/8 (37.5%) of the items.
– X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4.
Standard confidence: [formula shown as a figure]
A Framework for Generalizing Confidence
Proposed Approach: Confidence can be viewed as being
composed of two steps (functions):
– Evaluate the strength of a pattern in each object (transaction) for the two sets of attributes (items), X and Y (X ∩ Y = ∅)
  Evaluation functions can be the same as previously mentioned, e.g., min, max, range, Boolean functions, etc.
– Measure the strength of the relationship between the resulting
pair of pattern evaluation vectors, vX and vY
Confidence functions can be a measure of prediction or
proximity.
– Measure the extent to which the strength of one association pattern
can be used to predict another, such as confidence, or
– Capture the proximity (similarity or dissimilarity) between the two
association patterns.
Euclidean distance, correlation, cosine, Bregman divergence
Confidence for Boolean Support Functions
A Boolean support function
– Has an evaluation function that returns a binary evaluation
vector indicating the presence or absence of a pattern in each
transaction.
– Uses the sum, L1, or L2 squared summarization function
Goal is to define confidence for Boolean support functions so that conf(X → Y) can be interpreted as an estimate of the conditional probability of Y given X.
Key observation is that you have to work with the evaluation vectors and the basic definition of conditional probability.
Thus, conf(X → Y) = prob(vY | vX) = prob(vX ∧ vY) / prob(vX)
Another way to express this is as conf(X → Y) = traditional confidence(vX, vY)
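The key point, that confidence is computed from the evaluation vectors rather than directly from the raw items, can be sketched as follows (names and data are illustrative):

```python
def eval_and(cols):
    """Boolean evaluation vector: 1 where the pattern holds in a transaction."""
    return [int(all(v)) for v in zip(*cols)]

def conf(vX, vY):
    """conf(X -> Y) = prob(vX and vY) / prob(vX), computed on evaluation vectors."""
    both = sum(a and b for a, b in zip(vX, vY))
    return both / sum(vX)

milk    = [1, 0, 1, 1, 1]
diapers = [0, 1, 1, 1, 1]
vX = eval_and([milk])     # X = {Milk}
vY = eval_and([diapers])  # Y = {Diapers}
print(conf(vX, vY))       # 0.75
```

For traditional itemsets this reduces to the usual formula, but the same `conf` applies unchanged when vX and vY come from ETI-style or other non-traditional evaluation functions.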
Example: Error-Tolerant Itemsets …
Returning to the ETI example, we get the following:
X = {i1, i2, i3, i4}, Y = {i5, i6, i7, i8}
[Figure: evaluation vectors vX and vY]
conf(X → Y) = prob(vY | vX) = prob(vX ∧ vY) / prob(vX)
= support(vX ∧ vY) / support(vX) = 0 / 4 = 0
Confidence for Continuous Data
One approach is to define a confidence measure for continuous data that agrees with traditional confidence for binary data.
– Normalize attributes to have an L1 norm of 1
– eval function is min
– norm function is L1
– Confidence is defined as [formula shown as a figure]
Another approach is to drop the requirement of being consistent with the case of binary data (Min-Apriori (Han, Karypis, Kumar 1997))
– Normalize attributes to have an L1 norm of 1
– eval function is min
– norm function is L1
– Traditional definition of confidence: conf(X → Y) = σ(X ∪ Y) / σ(X)
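The second (Min-Apriori style) approach can be sketched directly from the pieces above: L1-normalize each attribute, use eval = min with norm = L1, and plug the generalized supports into the traditional confidence formula. The small document-term counts below are made up.

```python
def l1_normalize(col):
    """Scale a column so its entries sum to 1 (L1 norm of 1)."""
    s = sum(col)
    return [x / s for x in col]

def sigma_min_l1(cols):
    """Generalized support: sum over transactions of the min across attributes."""
    return sum(min(vals) for vals in zip(*cols))

t1 = l1_normalize([2, 0, 1, 1])  # term 1 counts in four documents
t2 = l1_normalize([1, 1, 1, 1])  # term 2 counts in four documents

sup_x  = sigma_min_l1([t1])          # 1.0 after normalization
sup_xy = sigma_min_l1([t1, t2])
print(sup_xy / sup_x)                # conf({term1} -> {term1, term2})
```

Here the confidence comes out to 0.75, even though term 2 occurs in every document; this is the kind of inconsistency with traditional binary confidence that the next slide's example illustrates.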
Example: Min Apriori
This approach is inconsistent with traditional confidence
[Figure: original data, normalized data, and evaluation vectors, with the standard and Min-Apriori confidence values]
Outline
Introduction
Extending association analysis to non-binary
data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work
New Association Patterns: Motivation
There are many pairwise measures of association or
proximity among items (attributes)
– Each measure has specific properties and applications
– E.g., cosine measure is good for sparse data, while
correlation is more appropriate for dense data
[Table: interestingness measures (Tan and Kumar 02)]
Proposed Approach
Proposed Approach: Using pairwise measures of
association or proximity
– Find values for all pairs of attributes (or sets of
attributes)
– Apply the min function to obtain a single value
Example: If X = {i1, i2, i3} and our pairwise measure is cosine, then we can define a measure of itemset strength, μ:
μ(X) = min( cosine(i1, i2), cosine(i1, i3), cosine(i2, i3) )
A set of attributes, X, is a clique association pattern with respect to a threshold γ and a pairwise association measure μ if
μ(i, j) ≥ γ, ∀ i, j ∈ X (μ can be cosine, corr, conf, …)
Proposed Approach …
Actually three approaches ( is a pairwise measure)
Subset-Subset
– min{ ( X, Y), for all itemsets X and Y}
– All-confidence ( = confidence) is an example (Omiecinski 2003)
– All-subsets patterns: all-subsets cosine, all-subsets correlation, allsubsets confidence
Item-Subset
– min{ ( X, Y), for all itemsets X and Y, where X is a single item}
– H-confidence ( = confidence) is an example (Xiong, 2003)
– Hyperclique patterns: h-cosine, h-correlation, h-confidence
Item-Item
– min{ ( X, Y), for all itemsets X and Y, X and Y are single items}
– Clique patterns: cosine clique, correlation clique, confidence clique
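The item-item case is the simplest to make concrete. This illustrative sketch (vectors and threshold are made up) tests whether a set of items forms a cosine clique pattern: every pair must have cosine similarity at least the threshold.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def is_clique_pattern(items, gamma, measure=cosine):
    """True if every pair of item vectors satisfies measure >= gamma."""
    vecs = list(items.values())
    return all(
        measure(vecs[i], vecs[j]) >= gamma
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
    )

X = {
    "i1": [1, 1, 0, 1],
    "i2": [1, 1, 1, 1],
    "i3": [0, 1, 0, 1],
}
print(is_clique_pattern(X, gamma=0.7))
```

Swapping `measure` for correlation or confidence yields the other clique patterns, and because the itemset measure is a min over pairs, adding an item can never increase it, which is exactly the anti-monotone property claimed earlier.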
Proposed Approach …
When one or both of the itemsets are not single items
(attributes), it is not possible to directly apply most
pairwise measures
– Confidence is an exception
Can use the approach proposed for generalizing
confidence
– Compute the evaluation vector of the itemset
– Then apply the pairwise measure to the two vectors:
the evaluation vector and the original attribute vector
An Experiment
We compared the performance of h-confidence, cosine
clique, and confidence clique patterns.
The h-confidence hyperclique pattern is important because the hyperclique pattern has many applications
– Clustering, classification, data cleaning
– Typically applied to objects instead of items
Purity of patterns is excellent, but often the h-confidence patterns don’t cover many objects
– Better coverage may mean better application performance
– Cosine and confidence are related to h-confidence
Experimental Results
We used several document data sets with class labels for
the documents
Patterns were found on documents and goodness was
measured by the entropy of the patterns
Three quantities are reported
– Number of patterns
– Average entropy of the patterns
– Coverage of documents
Also evaluated the cosine cliques for original data
Experimental Results – LA1 and FBIS
[Figures: number of patterns (top) and average entropy (bottom) vs. number of attributes in the pattern, for la1 (level=50) and fbis (level=70). Each plot compares h-confidence, cosine (original data), cosine (binary data), and confidence.]
Experimental Results – CranMed and tr45
[Figures: number of patterns (top) and average entropy (bottom) vs. number of attributes in the pattern, for cranmed (level=30) and tr45 (level=50). The plots compare h-confidence, cosine (original and binary data), and confidence; the tr45 plots omit cosine on the original data.]
Experimental Results – Percent Coverage
[Figures: percent coverage vs. number of attributes in the pattern, for la1 (level=50), fbis (level=70), cranmed (level=30), and tr45 (level=50), comparing h-confidence, cosine (original and binary data), and confidence.]
Outline
Introduction
Extending association analysis to non-binary
data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work
Describing Association Patterns: Support Envelopes
The support envelope for a binary transaction data set and a pair of positive integers (m, n)
– Is a subset of all items and transactions

[Figure: example binary transaction matrix]

The support envelope contains all association patterns involving m or more transactions and n or more items.
– m is the support
– n is the length of the itemset
– Itemsets and variants (frequent, maximal, closed)
– Error-Tolerant Itemsets (ETIs)
Simple Example
Idea: instead of finding all association patterns
containing at least m transactions and n items, find the
items and transactions containing all such patterns.
– For an example using the data set below, find the set of items and
transactions that contain all patterns with at least 3 transactions and at
least 3 items.
trans/item   A   B   C   D   E   row sum
 1           1   0   1   1   1   4
 2           0   1   0   1   0   2
 3           0   1   1   1   1   4
 4           0   0   1   0   1   2
 5           0   1   0   1   0   2
 6           0   1   0   1   0   2
 7           1   0   1   1   1   4
 8           1   0   1   1   1   4
 9           1   0   0   1   0   2
 10          1   0   1   1   1   4
 11          0   1   1   1   0   3
 12          1   0   0   0   1   2
col sum      6   5   7  10   7
Support Envelope Algorithm (SEA)
The algorithm to find a support envelope is simple.
1: input: A data matrix and a pair of positive integers (m, n)
2: repeat
3: Eliminate all rows whose sum is less than n
4: Eliminate all columns whose sum is less than m
5: until there is no change
6: return the set of remaining rows and columns
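The SEA pseudocode above translates almost line for line into Python. This sketch runs it on a 12-transaction, 5-item matrix matching the running example (rows and columns are 0-indexed here).

```python
def support_envelope(matrix, m, n):
    """Alternately drop rows with fewer than n ones and columns with
    fewer than m ones until nothing changes (the SEA algorithm)."""
    rows = set(range(len(matrix)))
    cols = set(range(len(matrix[0])))
    changed = True
    while changed:
        changed = False
        for r in list(rows):
            if sum(matrix[r][c] for c in cols) < n:
                rows.discard(r)
                changed = True
        for c in list(cols):
            if sum(matrix[r][c] for r in rows) < m:
                cols.discard(c)
                changed = True
    return sorted(rows), sorted(cols)

# 12 transactions x 5 items (A-E), as in the simple example.
M = [
    [1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
]
rows, cols = support_envelope(M, 3, 3)
print(rows, cols)  # transactions {1,3,7,8,10}, items {A,C,D,E} in 1-based terms
```

Each pass only shrinks the row and column sets, so the loop always terminates, and the result is the unique largest submatrix in which every row has at least n ones and every column at least m.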
Support Envelopes Form a Lattice
Each box represents a support envelope. Format is (m, n), followed by the transactions and the items. The entire lattice of envelopes is called the support lattice.

(5, 2): {1-12}, {A-E}
(6, 1): {1-12}, {A,C,D,E}
(7, 1): {1-12}, {C,D,E}
(10, 1): {1-3,5-11}, {D}
(2, 3): {1,3,7,8,10,11}, {A,B,C,D,E}
(6, 2): {1,3,4,7-12}, {A,C,D,E}
(4, 3): {1,3,7,8,10}, {A,C,D,E}
(1, 4): {1,3,7,8,10}, {A,B,C,D,E}
(5, 3): {1,3,7,8,10}, {C,D,E}
(4, 4): {1,7,8,10}, {A,C,D,E}

Envelopes drawn with a dotted border are on the lattice boundary, which we call the support boundary. At most min(M, N) such envelopes.
Visualizing Support Envelopes for Mushroom
One of the support envelopes (576, 23) is denser than its
surrounding neighbors.
An Interesting Dense Envelope for Mushroom
One of the columns was column 48, ‘gill-color:buff’
– There are exactly 1728 instances of item 48, every one of which
occurs with 13 other items (one of which is ‘poisonous’).
– The co-occurrence of 14 items is larger than is typical for this
data set.
Support Envelope (576,23)
Outline
Introduction
Generalizing Support
Generalizing Confidence
Generalizing Association Patterns
Support Envelopes
Conclusions and Future Work
Conclusions and Future Work: Generalizing Support
We described a framework for generalizing support that is
based on the simple, but useful observation that support
can be viewed as the composition of two functions:
– A function that evaluates the strength or presence of a pattern in
each object, and
– A function that summarizes these evaluations with a single
number.
Future work
– Efficient implementations
– Exploring applications of the continuous hyperclique and range
patterns
– New types of support for non-binary data and nontraditional
association patterns
Conclusions and Future Work: Generalizing Confidence
We described a framework for generalizing confidence
that is based on the simple, but useful observation that
confidence can be defined in terms of two functions:
– A function that evaluates the strength or presence of a pattern in
each object, and
– A function that summarizes the relationship between the two
evaluation vectors with a single number.
Future work
– Exploring applications of the different measures of confidence
– Creating new types of confidence based on interestingness and
proximity measures
Conclusions and Future Work: New Patterns
We described a framework for creating a wide variety of
new association measures from any pairwise association
or proximity measure
– These measures are guaranteed to have the anti-monotone
property
– Specific instances of these measures, the cosine and
confidence cliques, were proposed and found to be strictly
superior to the hyperclique pattern
Future work
– Research is needed to determine which measures (out of the
large number possible) are useful for association analysis and
what additional properties they might have
– A more detailed study using more and different types of data
sets is needed for cosine and confidence clique patterns
– More efficient algorithms needed
Conclusions and Future Work: Support Envelopes
Support envelopes are a new tool for exploring association
structure.
– Support envelopes form a lattice - at most M * N envelopes
– Envelopes on the boundary are especially interesting.
Bound the maximum sizes of association patterns
At most min( M, N ) boundary envelopes
– Can visualize association structure by plotting support envelopes
– Efficient algorithms
Future work
– Parallel/distributed implementations of the support envelope code
– Investigation of the basic approach and its variations for binary data
– Application of support envelopes to other kinds of data or patterns
– Support envelopes for a cube
– Continuous data
Thank You!
Questions?
Hyperclique Pattern for Binary Data
Definition: The h-confidence of an itemset X = {i1, i2, …, im} is the minimum confidence with which one item implies the others, or
hconf(X) = σ(X) / max{ σ(i1), σ(i2), …, σ(im) }
If X = {A, B, C}, σ(X) = 0.06, σ(A) = σ(B) = 0.1, and σ(C) = 0.6,
then hconf(X) = 0.06 / max{0.1, 0.1, 0.6} = 0.06/0.6 = 0.1
H-confidence is
– Non-increasing in the size of the itemset (anti-monotone)
– Two items belong to the same hyperclique only if their support is similar (cross-support)
– The items in an itemset with h-confidence h have a pairwise cosine similarity of at least h (high affinity)
Used for clustering, noise removal, classification, etc.
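The definition can be sketched directly from a binary matrix, using relative supports; the small data set here is illustrative, not from the thesis.

```python
def support(cols):
    """Relative support: fraction of transactions containing all items."""
    return sum(all(v) for v in zip(*cols)) / len(cols[0])

def hconf(item_cols):
    """hconf(X) = sigma(X) / max over the single-item supports."""
    return support(item_cols) / max(support([c]) for c in item_cols)

# Three items over four transactions.
A = [1, 1, 0, 1]
B = [1, 1, 1, 0]
C = [1, 1, 1, 1]
print(hconf([A, B, C]))  # sigma(X) = 2/4, max single support = 1.0 -> 0.5
```

Since every transaction containing the most frequent item of X is counted in the denominator, hconf can only stay the same or drop as items are added, which is the anti-monotone property listed above.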
Hyperclique Pattern for Continuous Data
To extend hyperclique pattern to continuous data
– Choose eval function to be min
– Choose norm function to be L2 squared
– We write this support function as σ_min,L2sq
This support is between 0 and 1 and is anti-monotone
If we normalize attributes to have an L2 norm of 1, then
hconf(X) = σ_min,L2sq(X)
This is true whether the attributes are binary or
continuous.
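A short sketch of this support function (made-up binary data; the same code accepts continuous columns unchanged):

```python
from math import sqrt

def l2_normalize(col):
    """Scale a column to have an L2 norm of 1."""
    nrm = sqrt(sum(x * x for x in col))
    return [x / nrm for x in col]

def sigma_min_l2sq(cols):
    """eval = min across attributes, norm = L2 squared."""
    return sum(min(vals) ** 2 for vals in zip(*cols))

a = l2_normalize([1, 1, 0, 1])
b = l2_normalize([1, 1, 1, 1])
print(sigma_min_l2sq([a, b]))  # 0.75
```

For this binary pair the value 0.75 equals hconf({a, b}) = (3/4) / 1, illustrating the claim that after L2 normalization the σ_min,L2sq support coincides with h-confidence on binary data.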
Example: Continuous Hyperclique
Compute the support of {term1, term2, term3} using the σ_min,L2sq support function.
– Attributes normalized to have an L2 norm of 1
– Compute support by taking the min across rows and the sum
of squares of that result
[Figure: original data, normalized data, and pairwise cosine values]