Extending Association Analysis
Michael Steinbach
Ph.D. Defense
© 2005 M. Steinbach
Ph.D. Defense
Outline

Introduction

Extending association analysis to non-binary
data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence

Creating new types of association patterns

Analyzing the structure of association patterns

Conclusions and future work
Traditional Association Analysis

Association analysis: Analyzes relationships among items (attributes) in binary transaction data
– Example data: market basket data
– Data can be represented as a binary matrix
– Applications in business and science

Two types of patterns
– Itemsets: Collection of items
  • Example: {Milk, Diaper}
– Association Rules: X → Y, where X and Y are itemsets.
  • Example: Milk → Diaper

Set-Based Representation of Data
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Binary Matrix Representation of Data
TID  Bread  Milk  Diapers  Beer  Eggs  Coke
1    1      1     0        0     0     0
2    1      0     1        1     1     0
3    0      1     1        1     0     1
4    1      1     1        1     0     0
5    1      1     1        0     0     1
Traditional Association Analysis …

Association measures evaluate the strength of an association pattern
– Support and confidence are the most commonly used
– The support, σ(X), of an itemset X is the number of transactions that contain all the items of the itemset
  • Frequent itemsets have support > specified threshold
  • Different types of itemset patterns are distinguished by a measure and a threshold
– The confidence of an association rule is given by conf(X → Y) = σ(X ∪ Y) / σ(X)
  • Estimate of the conditional probability of Y given X
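As a minimal sketch of the two measures, the following computes support and confidence on the five market basket transactions shown earlier. The function names `support` and `confidence` are illustrative choices, not part of the original slides.

```python
# Market basket transactions from the earlier slide (TID -> set of items).
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)

def confidence(x, y):
    """conf(X -> Y) = support(X union Y) / support(X)."""
    return support(set(x) | set(y)) / support(x)

print(support({"Milk", "Diaper"}))       # 3
print(confidence({"Milk"}, {"Diaper"}))  # 0.75
```

Here {Milk, Diaper} occurs in transactions 3, 4, and 5, and Milk occurs in four transactions, so conf(Milk → Diaper) = 3/4.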
Traditional Association Analysis …

Process of finding interesting patterns:
1. Find frequent itemsets using a support threshold
2. Find association rules for frequent itemsets
3. Sort association rules according to confidence

Support filtering is necessary
– To eliminate spurious patterns
– For efficiency, we need the anti-monotone property: X ⊆ Y implies σ(Y) ≤ σ(X)

Confidence is used because of its interpretation as conditional probability

[Figure: lattice of all itemsets over the items A, B, C, D, E, from null to ABCDE]
Given d items, there are 2^d possible candidate itemsets
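The three-step process and the role of the anti-monotone property can be sketched as a small level-wise search. This is an illustrative sketch on the market basket data, not the implementation used in the thesis; the helper names and the use of "support ≥ min_support" as the frequency test are assumptions.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(min_support):
    """Step 1: level-wise search. A k-itemset is counted only if all of
    its (k-1)-subsets are frequent (anti-monotone pruning)."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]
    result, k = {}, 1
    while level:
        for s in level:
            result[s] = support(s)
        frequent = set(level)
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, k - 1))
                 and support(c) >= min_support]
    return result

def rules(freq):
    """Steps 2 and 3: rules X -> Y from each frequent itemset,
    sorted by decreasing confidence."""
    out = []
    for itemset, sup in freq.items():
        for r in range(1, len(itemset)):
            for x in combinations(sorted(itemset), r):
                x = frozenset(x)
                out.append((sup / freq[x], set(x), set(itemset - x)))
    return sorted(out, key=lambda rule: -rule[0])

freq = frequent_itemsets(min_support=3)
ranked = rules(freq)
print(len(freq), ranked[0])
```

With a threshold of 3, the candidate {Bread, Diaper, Beer} is never counted because its subset {Bread, Beer} is infrequent, which is the saving the anti-monotone property provides.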
Extending Association Analysis

Why extend association analysis?
– To address limitations of existing schemes for
association analysis
– To create new kinds of useful patterns
– To better understand the structure of the
association patterns in a data set
Limitations of Association Analysis

Traditional association analysis does not apply to
– Non-binary data
  • Must transform data into binary transaction data to apply traditional association analysis techniques.
  • Order and magnitude information can be lost
  • Can often “make it work” by coding combinations of values, but this adds complexity and explodes the number of items
  • Limited solutions exist
    – Min-Apriori (Han, Karypis, Kumar 1997)
– Non-traditional association patterns.
  • Error Tolerant Itemsets (ETIs) (Yang, Fayyad, and Bradley 2001)
  • General Boolean formulas (Bollman-Sdorra, et al. 01, Srikant et al. 97)
[Figure: Document Data]
Limitations of Association Analysis …

Support and confidence are not appropriate for all
applications
Example involving coffee and tea:
– Every customer in a grocery store purchases coffee
– Only 1/4 of the customers purchase tea
– conf(tea → coffee) = 1
– But this is misleading because any item implies coffee
– This problem is common when item supports have a skewed distribution
– This cross-support problem can be addressed by using other measures, such as h-confidence (hyperclique pattern)
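The coffee and tea example can be checked numerically. The sketch below assumes a hypothetical store with 8 customers (the slide gives only the proportions) and uses the h-confidence definition hconf(X) = σ(X) / max σ(i) from the hyperclique slide near the end of the deck.

```python
# Hypothetical baskets: every customer buys coffee, 1/4 of them also buy tea.
baskets = [{"coffee", "tea"}] * 2 + [{"coffee"}] * 6

def support(itemset):
    return sum(1 for b in baskets if itemset <= b)

pair = {"tea", "coffee"}
conf_tea_coffee = support(pair) / support({"tea"})
conf_coffee_tea = support(pair) / support({"coffee"})
# h-confidence: the minimum confidence with which one item implies the other.
hconf = support(pair) / max(support({"tea"}), support({"coffee"}))

print(conf_tea_coffee)  # 1.0
print(conf_coffee_tea)  # 0.25
print(hconf)            # 0.25
```

The rule tea → coffee has confidence 1 only because coffee is in every basket; h-confidence takes the weaker direction and so reports 0.25.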
Limitations of Association Analysis …

Lack of knowledge of structure of association patterns
– Support threshold is critical

If too high, no patterns

If too low, too many patterns
– At some support threshold, algorithms to find association patterns “hit the wall”
– Particular difficulty in finding patterns with low support
  • LPMiner (Seno, Karypis 2001)
[Figure: from Summary of Results, Frequent Itemset Mining Implementations 2003]
Overview and Contributions

Presentation and contributions fall into three
categories
1. A mathematical framework to extend
association analysis to non-binary data and
non-traditional patterns

Generalizing the notion of support
– Extend the hyperclique pattern (Xiong, et al 2003)
to continuous data

Generalizing the notion of confidence
– Define notion of confidence for Error-Tolerant
Itemsets
Overview and Contributions
2. A framework for creating new types of
association measures (and their accompanying
itemset patterns)

Can use any pairwise association or proximity
measure as the basis for defining a measure of
itemset strength
– Examples: cosine, confidence, correlation

All measures have the anti-monotone property
3. Analyzing the structure of association patterns


Introduce the notion of support envelopes
Can visualize the structure of association
patterns
Publications Related to Thesis
Steinbach, M., Tan, P., Xiong, H., and Kumar, V.,
Generalizing the Notion of Support. KDD '04, pp. 689-694,
Seattle, WA, August 22 - 25, 2004.
Steinbach, M. and Kumar, V.,
Generalizing the Notion of Confidence. ICDM’05, to appear,
Houston, TX, November 27 - 30, 2005.
Steinbach, M., Tan, P., and Kumar, V.,
Support Envelopes: A Technique for Exploring the Structure
of Association Patterns. KDD '04, pp. 689-694, Seattle, WA,
August 22 - 25, 2004.
Additional Publications
Books:
P.-N. Tan, M. Steinbach, and V. Kumar,
Introduction to Data Mining, Pearson Addison-Wesley, May, 2005.
Book Chapters:
V. Kumar, P.-N. Tan, and M. Steinbach, Data Mining, in Handbook of Data
Structures and Applications, CRC Press, 2004.
M. Steinbach, L. Ertoz, and V. Kumar, Challenges of Clustering High
Dimensional Data. in New Vistas in Statistical Physics - Applications in
Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag,
2004.
L. Ertoz, M. Steinbach, and Vipin Kumar, Finding Topics in Collections of
Documents: A Shared Nearest Neighbor Approach, in Clustering and
Information Retrieval, 2003, Kluwer Academic Publishers.
P. Zhang, M. Steinbach, V. Kumar, S. Shekhar, P.-N. Tan, S. Klooster, and C.
Potter, Discovery of Patterns of Earth Science Data Using Data Mining, in
Next Generation of Data Mining Applications, IEEE Press, 2005.
Additional Publications …
Journal Articles:
H. Xiong, G. Pandey, M. Steinbach, and V. Kumar, Enhancing Data Analysis with Noise Removal, IEEE
Transactions on Knowledge and Data Engineering (TKDE), 2006, accepted for publication as a regular paper.
C. Potter, P.-N.Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, and V. Genovese, Major Disturbance Events in
Terrestrial Ecosystems Detected using Global Satellite Data Sets, Global Change Biology, 2003.
C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, S. Shekhar, R. Nemani, and R. Myneni, Global
Teleconnections of Ocean Climate to Terrestrial Carbon Flux, J. of Geophysical Research, Vol. 108, No. D17,
4556, 2003.
C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, S. Shekhar, and C. Carvalho, Understanding Global
Teleconnections of Climate to Regional Model Estimates of Amazon Ecosystem Carbon Fluxes, Global Change
Biology, 2003
C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, R. Myneni, V. Genovese, Variability in Terrestrial Carbon
Sinks Over Two Decades: Part 1-North America, Earth Interactions, 2003.
Conferences:
H. Xiong, M. Steinbach, and V. Kumar, Privacy Leakage in Multi-relational Databases via Pattern based Semi-supervised Learning, in Proc. of the ACM Conference on Information and Knowledge Management (CIKM
2005), Bremen, Germany, 2005.
H. Xiong, M. Steinbach, P.-N. Tan, and V. Kumar, HICAP: Hierarchical Clustering with Pattern Preservation, in Proc.
2004 SIAM International Conf. on Data Mining (SDM 2004), pp. 279 - 290, Florida, 2004
M. Steinbach, P.N Tan, V. Kumar, S. Klooster, C. Potter: Discovery of climate indices using clustering. KDD 2003:
446-455
L. Ertöz, M. Steinbach, and V. Kumar: Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High
Dimensional Data. SDM 2003.
Additional Publications …
Workshops:
M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, and S. Klooster, Temporal Data Mining for the Discovery and
Analysis of Ocean Climate Indices, KDD Workshop on Temporal Data Mining, 2002.
M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, and S. Klooster, Data Mining for the Discovery of Ocean
Climate Indices, The Fifth Workshop on Scientific Data Mining, 2nd SIAM International Conference on
Data Mining, 2002.
V. Kumar, M. Steinbach, P.-N. Tan, S. Klooster, C. Potter, A. Torregrosa, Mining Scientific Data: Discovery
of Patterns in the Global Climate System, Joint Statistical Meeting, 2001.
M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, S. Klooster, A. Torregrosa, Clustering Earth Science Data:
Goals, Issues and Results, KDD Workshop on Mining Scientific Datasets, 2001.
P.-N. Tan, M. Steinbach, V. Kumar, C. Potter, S. Klooster, A. Torregrosa, Finding Spatio-Temporal Patterns
in Earth Science Data, KDD Workshop on Temporal Data Mining, 2001.
M. Steinbach, G. Karypis, and V. Kumar, Efficient Algorithms for Creating Product Catalogs, Web Mining
Workshop, 1st SIAM International Conference on Data Mining, Chicago, IL, 2001.
L. Ertoz, M. Steinbach, and V. Kumar, Finding Topics in Collections of Documents: A Shared Nearest
Neighbor Approach,Text Mine'01, Workshop on Text Mining, 1st SIAM International Conference on Data
Mining, Chicago, IL, April, 2001.
M. Steinbach, G. Karypis, and V. Kumar, A Comparison of Document Clustering Techniques, Text Mining
Workshop, KDD 2000, Boston, MA, August, 2000.
Outline

Introduction

Extending association analysis to non-binary data and non-traditional
patterns
– Generalizing the notion of support
– Generalizing the notion of confidence

Creating new types of association patterns

Analyzing the structure of association patterns

Conclusions and future work
Generalizing Support: Problem Statement

Challenge: Create a framework for generalizing
support that
– Handles non-binary data (ordinal, continuous)
– Handles new types of patterns
– Allows people to more easily express, explore, and communicate new types of association patterns

Motivating examples for continuous data
– Document data
– Microarray data (www.biology.ucsc.edu/mcd/research.html)
Proposed Approach

Proposed Approach: Support (σ) can be viewed as being composed of two steps (functions):
– Evaluate the strength of a pattern in each object (transaction)
  • Evaluation vector is given by v = eval(X)
– Summarize all these evaluations with a single number
  • Summarization (norm) function measures strength of the pattern in all transactions

σ(X) = (norm ∘ eval)(X) = norm(eval(X)) = norm(v)

Example
– eval = ∧ (logical and)
– X = { Milk, Diapers }
– norm = sum

TID  Milk  Diapers  v
1    1     0        0
2    0     1        0
3    1     1        1
4    1     1        1
5    1     1        1
norm(v) = 3
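A minimal sketch of support as the composition norm ∘ eval, using the Milk and Diapers columns from the example. The helper names (`eval_and`, `norm_sum`, `make_support`) are illustrative and not taken from the slides.

```python
# Binary columns for the five transactions in the example.
milk    = [1, 0, 1, 1, 1]
diapers = [0, 1, 1, 1, 1]

def eval_and(*columns):
    """Evaluation function: logical and of the items in each transaction."""
    return [int(all(row)) for row in zip(*columns)]

def norm_sum(v):
    """Summarization function: sum of the evaluation vector."""
    return sum(v)

def make_support(eval_fn, norm_fn):
    """Support is the composition norm(eval(X))."""
    return lambda *columns: norm_fn(eval_fn(*columns))

support = make_support(eval_and, norm_sum)

v = eval_and(milk, diapers)
print(v)                       # [0, 0, 1, 1, 1]
print(support(milk, diapers))  # 3
```

Swapping in a different `eval_fn` or `norm_fn` gives a different support measure without changing the rest of the code, which is the point of the decomposition.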
Evaluation and Summarization Functions

Evaluation functions
– Boolean functions constructed from and (∧), or (∨), and not (¬)
– min, max, range
– product
– Special purpose: Error-Tolerant Itemsets

Summarization functions
– Vector norms
  • L1, L2, and L2 squared
– Sums
  • Average
  • Weighted average
  • Weighted vector norms
Usefulness of Support Framework

Traditional support results from a number of choices
– eval = { ∧, min, × }
– norm = { L1, L2 squared, sum }
– Any of these nine combinations gives the traditional support for binary data
– But for continuous data, these support measures are different

Can extend a recently developed association pattern,
the hyperclique pattern (Xiong, et al. 2003), to
continuous data
– eval = min
– norm = L2 squared

Has led to the creation of a new kind of pattern defined
by range support
– eval = range
– norm = L2 squared
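The claim about the nine eval/norm combinations can be checked with a short sketch. It assumes the three evaluation functions are logical and, min, and product, and it uses a hypothetical continuous data set; for continuous values, "and" is taken here to mean "all values nonzero", which is also an assumption.

```python
from math import prod

EVALS = {
    "and": lambda row: float(all(x != 0 for x in row)),
    "min": min,
    "product": prod,
}
NORMS = {
    "L1": lambda v: sum(abs(x) for x in v),
    "L2 squared": lambda v: sum(x * x for x in v),
    "sum": sum,
}

def supports(columns):
    """Support of the itemset made of `columns` under all nine combinations."""
    rows = list(zip(*columns))
    return {(e, n): norm([ev(row) for row in rows])
            for e, ev in EVALS.items() for n, norm in NORMS.items()}

# Binary Milk and Diapers columns, and a hypothetical continuous variant.
binary     = [[1, 0, 1, 1, 1], [0, 1, 1, 1, 1]]
continuous = [[0.5, 0, 2, 1, 1], [0, 1, 1, 3, 0.5]]

print(set(supports(binary).values()))  # a single value: traditional support
print(supports(continuous))            # the nine measures no longer agree
```

On binary data, and, min, and product coincide row by row, and the three norms coincide on 0/1 vectors, so all nine values equal the traditional support of 3.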
Outline

Introduction

Extending association analysis to non-binary data and non-traditional
patterns
– Generalizing the notion of support
– Generalizing the notion of confidence

Creating new types of association patterns

Analyzing the structure of association patterns

Conclusions and future work
Generalizing Confidence: Problem Statement

Challenge: Create a framework for
generalizing confidence that
– Handles non-binary data (ordinal, continuous)
– Handles new types of patterns
– Allows people to more easily express, explore, and communicate new types of association patterns
Example: Error-Tolerant Itemsets
A (strong) error-tolerant itemset (ETI) can have a fraction ε of the items missing in each transaction.
Example: see the data in the table
– Let ε = 5/8. In other words, each transaction only needs to have 3/8 (37.5%) of the items.
– X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4.
Standard confidence:
A Framework for Generalizing Confidence

Proposed Approach: Confidence can be viewed as being composed of two steps (functions):
– Evaluate the strength of a pattern in each object (transaction) for the two sets of attributes (items), X and Y (X ∩ Y = ∅)
  • Evaluation functions can be the same as previously mentioned, e.g., min, max, range, Boolean functions, etc.
– Measure the strength of the relationship between the resulting pair of pattern evaluation vectors, vX and vY

Confidence functions can be a measure of prediction or proximity.
– Measure the extent to which the strength of one association pattern can be used to predict another, such as confidence, or
– Capture the proximity (similarity or dissimilarity) between the two association patterns.
  • Euclidean distance, correlation, cosine, Bregman divergence
Confidence for Boolean Support Functions

A Boolean support function
– Has an evaluation function that returns a binary evaluation
vector indicating the presence or absence of a pattern in each
transaction.
– Uses the sum, L1, or L2 squared summarization function

Goal is to define confidence for Boolean support functions so that conf(X → Y) can be interpreted as an estimate of the conditional probability of Y given X.

Key observation is that you have to work with the evaluation vectors and the basic definition of conditional probability

Thus, conf(X → Y) = prob(vY | vX) = prob(vX ∧ vY) / prob(vX)

Another way to express this is as
conf(X → Y) = traditional confidence(vX, vY)
Example: Error-Tolerant Itemsets …

Returning to the ETI example, we get the following:
X = {i1, i2, i3, i4}, Y = {i5, i6, i7, i8}
[Figure: evaluation vectors vX and vY]
conf(X → Y) = prob(vY | vX) = prob(vX ∧ vY) / prob(vX)
= support(vX ∧ vY) / support(vX) = 0 / 4 = 0
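The data table for the ETI example is not reproduced in this transcript, so the following sketch uses hypothetical data and a hypothetical ε = 1/4 (not the slide's 5/8) purely to illustrate the mechanics: evaluate each itemset per transaction, then apply traditional confidence to the two evaluation vectors.

```python
# Hypothetical 8 x 8 binary data: columns 0-3 are i1..i4, columns 4-7 are i5..i8.
rows = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
]
X, Y = [0, 1, 2, 3], [4, 5, 6, 7]

def eti_eval(itemset, eps):
    """1 if a transaction has at least a (1 - eps) fraction of the items."""
    need = (1 - eps) * len(itemset)
    return [int(sum(row[i] for i in itemset) >= need) for row in rows]

def vector_confidence(vx, vy):
    """Traditional confidence applied to evaluation vectors:
    support(vX and vY) / support(vX)."""
    return sum(a & b for a, b in zip(vx, vy)) / sum(vx)

vX = eti_eval(X, eps=0.25)
vY = eti_eval(Y, eps=0.25)
print(sum(vX), sum(vY))           # 4 4
print(vector_confidence(vX, vY))  # 0.0
```

As on the slide, both itemsets have ETI support 4, yet no transaction supports both patterns, so the confidence computed from the evaluation vectors is 0 / 4 = 0.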
Confidence for Continuous Data

One approach is to define a confidence measure for continuous data that agrees with traditional confidence for binary data.
– Normalize attributes to have an L1 norm of 1
– eval function is min
– norm function is L1
– Confidence is defined as

Another approach is to drop the requirement of being consistent with the case of binary data (Min-Apriori (Han, Karypis, Kumar 1997))
– Normalize attributes to have an L1 norm of 1
– eval function is min
– norm function is L1
– Traditional definition of confidence: conf(X → Y) = σ(X ∪ Y) / σ(X)
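A small sketch of the second (Min-Apriori style) recipe, following the four steps listed above. The two binary columns are hypothetical and non-negative data is assumed; the example is chosen so that the result can be compared with traditional confidence on the same data.

```python
def l1_normalize(col):
    """Scale a column so that its L1 norm is 1 (assumes non-negative data)."""
    total = sum(abs(x) for x in col)
    return [x / total for x in col]

def support_min_l1(*cols):
    """eval = min across the attributes of each row, norm = L1 (a plain sum
    here, because the values are non-negative)."""
    return sum(min(row) for row in zip(*cols))

def min_apriori_conf(x, y):
    xn, yn = l1_normalize(x), l1_normalize(y)
    return support_min_l1(xn, yn) / support_min_l1(xn)

def traditional_conf(x, y):
    both = sum(1 for a, b in zip(x, y) if a and b)
    return both / sum(1 for a in x if a)

# Hypothetical binary attributes.
x = [1, 1, 0, 0]
y = [1, 0, 1, 1]
print(traditional_conf(x, y))  # 0.5
print(min_apriori_conf(x, y))  # about 0.333
```

After normalization x becomes [1/2, 1/2, 0, 0] and y becomes [1/3, 0, 1/3, 1/3], so the row-wise min sums to 1/3 while σ(X) is 1. The result differs from the traditional confidence of 1/2 even though the data is binary.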
Example: Min Apriori

This approach is inconsistent with traditional confidence
Original Data
Normalized Data
Evaluation Vectors
Standard confidence:
Min-Apriori confidence:
Outline

Introduction

Extending association analysis to non-binary
data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence

Creating new types of association patterns

Analyzing the structure of association patterns

Conclusions and future work
New Association Patterns: Motivation

There are many pairwise measures of association or
proximity among items (attributes)
– Each measure has specific properties and applications
– E.g., cosine measure is good for sparse data, while
correlation is more appropriate for dense data

[Table: Interestingness measures (Tan and Kumar 02)]
Proposed Approach

Proposed Approach: Using pairwise measures of association or proximity
– Find values for all pairs of attributes (or sets of attributes)
– Apply the min function to obtain a single value
Example: If X = {i1, i2, i3} and our pairwise measure is cosine, then we can define φ, a measure of itemset strength
φ(X) = min( cosine(i1, i2), cosine(i1, i3), cosine(i2, i3) )

A set of attributes, X, is a clique association pattern with respect to a threshold θ and a pairwise association measure μ if
μ(i, j) ≥ θ, ∀ i, j ∈ X (μ can be cosine, corr, conf, …)
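A minimal sketch of the min-of-pairwise construction with cosine as the pairwise measure, using columns from the market basket matrix. The function names and the threshold values are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def itemset_strength(columns, pairwise=cosine):
    """Minimum of the pairwise measure over all pairs of attributes."""
    return min(pairwise(u, v) for u, v in combinations(columns, 2))

def is_clique_pattern(columns, threshold, pairwise=cosine):
    return itemset_strength(columns, pairwise) >= threshold

# Columns of the market basket binary matrix.
bread   = [1, 1, 0, 1, 1]
milk    = [1, 0, 1, 1, 1]
diapers = [0, 1, 1, 1, 1]
beer    = [0, 1, 1, 1, 0]

s3 = itemset_strength([bread, milk, diapers])
s4 = itemset_strength([bread, milk, diapers, beer])
print(s3, s4)
print(is_clique_pattern([bread, milk, diapers], threshold=0.7))
```

Adding an attribute can only add pairs to the min, so the strength never increases as the itemset grows. This is why measures built this way have the anti-monotone property.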
Proposed Approach …
Actually three approaches (μ is a pairwise measure)

Subset-Subset
– min{ μ(X, Y), for all itemsets X and Y }
– All-confidence (μ = confidence) is an example (Omiecinski 2003)
– All-subsets patterns: all-subsets cosine, all-subsets correlation, all-subsets confidence

Item-Subset
– min{ μ(X, Y), for all itemsets X and Y, where X is a single item }
– H-confidence (μ = confidence) is an example (Xiong, 2003)
– Hyperclique patterns: h-cosine, h-correlation, h-confidence

Item-Item
– min{ μ(X, Y), for all itemsets X and Y, where X and Y are single items }
– Clique patterns: cosine clique, correlation clique, confidence clique
Proposed Approach …

When one or both of the itemsets are not single items
(attributes), it is not possible to directly apply most
pairwise measures
– Confidence is an exception

Can use the approach proposed for generalizing
confidence
– Compute the evaluation vector of the itemset
– Then apply the pairwise measure to the two vectors:
the evaluation vector and the original attribute vector
An Experiment





We compared the performance of h-confidence, cosine clique, and confidence clique patterns.

The h-confidence hyperclique pattern is important because the hyperclique pattern has many applications
– Clustering, classification, data cleaning
– Typically applied to objects instead of items
– Purity of patterns is excellent

Often the h-confidence patterns don’t cover many objects

Better coverage may mean better application performance

Cosine and confidence clique patterns are related to h-confidence
Experimental Results

We used several document data sets with class labels for
the documents

Patterns were found on documents and goodness was
measured by the entropy of the patterns

Three quantities are reported
– Number of patterns
– Average entropy of the patterns
– Coverage of documents

Also evaluated the cosine cliques for original data
Experimental Results – LA1 and FBIS
[Figures: number of patterns and average entropy versus number of attributes in the pattern, for la1 (level=50) and fbis (level=70). Series: h-confidence, cosine (orig data), cosine (binary data), confidence]
Experimental Results – CranMed and tr45
[Figures: number of patterns and average entropy versus number of attributes in the pattern, for cranmed (level=30) and tr45 (level=50). Series include h-confidence, cosine (orig data), cosine (binary data), confidence]
Experimental Results – Percent Coverage
[Figures: percent coverage versus number of attributes in the pattern, for la1 (level=50), fbis (level=70), cranmed (level=30), and tr45 (level=50). Series include h-confidence, cosine (orig data), cosine (binary data), confidence]
Outline

Introduction

Extending association analysis to non-binary
data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence

Creating new types of association patterns

Analyzing the structure of association patterns

Conclusions and future work
Describing Association Patterns: Support Envelopes

The support envelope for a binary transaction data set and a pair of positive integers (m, n)
– Is a subset of all items and transactions

The support envelope contains all association patterns involving m or more transactions and n or more items.
– m is support
– n is the length of the itemset
– Itemsets and variants (frequent, maximal, closed)
– Error Tolerant Itemsets (ETIs)
[Figure: binary data matrix illustrating a support envelope]
Simple Example

Idea: instead of finding all association patterns
containing at least m transactions and n items, find the
items and transactions containing all such patterns.
– For an example using the data set below, find the set of items and
transactions that contain all patterns with at least 3 transactions and at
least 3 items.
trans/item  A  B  C  D   E  row sum
1           1  0  1  1   1  4
2           0  1  0  1   0  2
3           0  1  1  1   1  4
4           0  0  1  0   1  2
5           0  1  0  1   0  2
6           0  1  0  1   0  2
7           1  0  1  1   1  4
8           1  0  1  1   1  4
9           1  0  0  1   0  2
10          1  0  1  1   1  4
11          0  1  1  1   0  3
12          1  0  0  0   1  2
col sum     6  5  7  10  7
Support Envelope Algorithm (SEA)
The algorithm to find a support envelope is simple.
input: A data matrix and a pair of positive integers (m, n)
repeat
    Eliminate all rows whose sum is less than n
    Eliminate all columns whose sum is less than m
until there is no change
return the set of remaining rows and columns
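The pseudocode above translates almost line for line into Python. The sketch below is one possible rendering, run on the 12 × 5 data set from the Simple Example slide; the function name `support_envelope` and the row-string encoding of the matrix are illustrative choices.

```python
# Rows are transactions 1..12, columns are items A..E (Simple Example data).
ROWS = ["10111", "01010", "01111", "00101", "01010", "01010",
        "10111", "10111", "10010", "10111", "01110", "10001"]
ITEMS = "ABCDE"
matrix = [[int(ch) for ch in row] for row in ROWS]

def support_envelope(matrix, m, n):
    """Repeatedly drop rows with fewer than n ones and columns with fewer
    than m ones (counting only surviving rows and columns) until stable."""
    rows = set(range(len(matrix)))
    cols = set(range(len(matrix[0])))
    changed = True
    while changed:
        keep_rows = {r for r in rows
                     if sum(matrix[r][c] for c in cols) >= n}
        keep_cols = {c for c in cols
                     if sum(matrix[r][c] for r in keep_rows) >= m}
        changed = keep_rows != rows or keep_cols != cols
        rows, cols = keep_rows, keep_cols
    # Report 1-based transaction ids and item labels.
    return sorted(r + 1 for r in rows), sorted(ITEMS[c] for c in cols)

print(support_envelope(matrix, 3, 3))
# ([1, 3, 7, 8, 10], ['A', 'C', 'D', 'E'])
```

For (m, n) = (3, 3) the first pass removes the six rows with sum 2 and then column B; the second pass removes transaction 11, after which nothing changes.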
Support Envelopes Form a Lattice
Each box represents a support envelope. Format is the following: (m, n), Transactions, Items.
The entire lattice of envelopes is called the support lattice.
Envelopes drawn with a dotted border are on the lattice boundary, which we call the support boundary. At most min(M, N) such envelopes.

(m, n)   Transactions       Items
(5, 2)   {1-12}             {A-E}
(6, 1)   {1-12}             {A,C,D,E}
(7, 1)   {1-12}             {C,D,E}
(10, 1)  {1-3,5-11}         {D}
(2, 3)   {1,3,7,8,10,11}    {A,B,C,D,E}
(6, 2)   {1,3,4,7-12}       {A,C,D,E}
(4, 3)   {1,3,7,8,10}       {A,C,D,E}
(5, 3)   {1,3,7,8,10}       {C,D,E}
(1, 4)   {1,3,7,8,10}       {A,B,C,D,E}
(4, 4)   {1,7,8,10}         {A,C,D,E}
Visualizing Support Envelopes for Mushroom

One of the support envelopes (576, 23) is denser than its
surrounding neighbors.
An Interesting Dense Envelope for Mushroom

One of the columns was column 48, ‘gill-color:buff’
– There are exactly 1728 instances of item 48, every one of which occurs with 13 other items (one of which is ‘poisonous’).
– The co-occurrence of 14 items is larger than is typical for this data set.
[Figure: Support Envelope (576, 23)]
Outline

Introduction

Generalizing Support

Generalizing Confidence

Generalizing Association Patterns

Support Envelopes

Conclusions and Future Work
Conclusions and Future Work: Generalizing Support

We described a framework for generalizing support that is
based on the simple, but useful observation that support
can be viewed as the composition of two functions:
– A function that evaluates the strength or presence of a pattern in
each object, and
– A function that summarizes these evaluations with a single
number.

Future work
– Efficient implementations
– Exploring applications of the continuous hyperclique and range
patterns
– New types of support for non-binary data and nontraditional
association patterns
Conclusions and Future Work: Generalizing Confidence

We described a framework for generalizing confidence
that is based on the simple, but useful observation that
confidence can be defined in terms of two functions:
– A function that evaluates the strength or presence of a pattern in
each object, and
– A function that summarizes the relationship between the two
evaluation vectors with a single number.

Future work
– Exploring applications of the different measures of confidence
– Creating new types of confidence based on interestingness and
proximity measures
Conclusions and Future Work: New Patterns

We described a framework for creating a wide variety of
new association measures from any pairwise association
or proximity measure
– These measures are guaranteed to have the anti-monotone
property
– Specific instances of these measures, the cosine and
confidence cliques, were proposed and found to be strictly
superior to the hyperclique pattern

Future work
– Research is needed to determine which measures (out of the
large number possible) are useful for association analysis and
what additional properties they might have
– A more detailed study using more and different types of data
sets is needed for cosine and confidence clique patterns
– More efficient algorithms needed
Conclusions and Future Work: Support Envelopes

Support envelopes are a new tool for exploring association
structure.
– Support envelopes form a lattice with at most M * N envelopes
– Envelopes on the boundary are especially interesting.
  • Bound the maximum sizes of association patterns
  • At most min(M, N) boundary envelopes
– Can visualize association structure by plotting support envelopes
– Efficient algorithms

Future work
– Parallel/distributed implementations of the support envelope code
– Investigation of the basic approach and its variations for binary data
– Application of support envelopes to other kinds of data or patterns
– Support envelopes for a cube
– Continuous data
Thank You!
Questions?
Hyperclique Pattern for Binary Data

Definition: The h-confidence of an itemset X = {i1, i2, …, im} is the minimum confidence with which one item implies the others, or
hconf(X) = σ(X) / max{σ(i1), σ(i2), …, σ(im)}

If X = {A,B,C}, σ(X) = 0.06, σ(A) = σ(B) = 0.1, and σ(C) = 0.06, then hconf(X) = 0.06/0.1 = 0.6

H-confidence is
– Non-increasing in the size of the itemset (anti-monotone)
– Two items belong to the same hyperclique only if their support is similar (cross-support)
– The items in an itemset with h-confidence h have a pairwise cosine similarity of at least h (high affinity)

Used for clustering, noise removal, classification, etc.
Hyperclique Pattern for Continuous Data

To extend hyperclique pattern to continuous data
– Choose eval function to be min
– Choose norm function to be L2 squared
– We write this support function as σ_{min, L2 squared}

This support is between 0 and 1 and is anti-monotone

If we normalize attributes to have an L2 norm of 1, then
hconf(X) = σ_{min, L2 squared}(X)

This is true whether the attributes are binary or continuous.
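The equality between h-confidence and the min / L2 squared support on normalized attributes can be checked with a short sketch. The binary columns are Bread and Beer from the market basket matrix; the continuous term columns are hypothetical, and the function names are illustrative.

```python
from math import sqrt

def l2_normalize(col):
    norm = sqrt(sum(x * x for x in col))
    return [x / norm for x in col]

def support_min_l2sq(*cols):
    """Normalize each attribute to unit L2 norm, take the min across each
    row, then sum the squares of those minima."""
    normalized = [l2_normalize(c) for c in cols]
    return sum(min(row) ** 2 for row in zip(*normalized))

def hconf_binary(*cols):
    """hconf(X) = support(X) / max item support, for binary columns."""
    joint = sum(int(all(row)) for row in zip(*cols))
    return joint / max(sum(c) for c in cols)

bread = [1, 1, 0, 1, 1]
beer  = [0, 1, 1, 1, 0]
print(hconf_binary(bread, beer), support_min_l2sq(bread, beer))

# Hypothetical continuous data (e.g., term frequencies in four documents).
term1 = [2, 0, 1, 3]
term2 = [1, 1, 0, 2]
print(support_min_l2sq(term1, term2))
```

For binary data, each normalized entry is 1/sqrt(σ(i)), so in every transaction containing all items the squared min is 1 / max σ(i); summing over the σ(X) such transactions gives σ(X) / max σ(i), which is h-confidence.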
Example: Continuous Hyperclique

Compute the support of {term1, term2, term3} using the σ_{min, L2 squared} support function.
– Attributes normalized to have an L2 norm of 1
– Compute support by taking the min across rows and the sum of squares of that result
[Tables: Original Data, Normalized Data, Pairwise Cosine]