Data Mining Tutorial
Tomasz Imielinski
Rutgers University
What is data mining?
• Finding interesting, useful, unexpected
• Finding patterns, clusters, associations,
classifications
• Answering inductive queries
• Aggregations and their changes on
multidimensional cubes
Table of Contents
• Association Rules
• Interesting Rules
• OLAP
• Cubegrades – unification of association rules and OLAP
• Classification and Clustering methods – not included in this tutorial
Association Rules
• [AIS 1993] – Agrawal, Imielinski, Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD 1993
• [AS 1994] – Agrawal, Srikant, “Fast Algorithms for Mining Association Rules in Large Databases”, VLDB 1994
• [B 1998] – Bayardo, “Efficiently Mining Long Patterns from Databases”, SIGMOD 1998
• [SA 1996] – Srikant, Agrawal, “Mining Quantitative Association Rules in Large Relational Tables”, SIGMOD 1996
• [T 1996] – Toivonen, “Sampling Large Databases for Association Rules”, VLDB 1996
• [BMS 1997] – Brin, Motwani, Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD 1997
• [IV 1999] – Imielinski, Virmani, “MSQL: A Query Language for Database Mining”, DMKD 1999
Baskets
• I = {I1, …, Im} is a set of (binary) attributes called items
• T is a database of transactions
• t[k] = 1 if transaction t bought item k
• Association rule X => I with support s and confidence c
• Support – the fraction of transactions in T that satisfy X ∪ {I}
• Confidence – the fraction of transactions satisfying X that also satisfy I (a minimal sketch follows below)
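The following minimal Python sketch (not part of the original slides) illustrates these definitions. Transactions are modeled as sets of item names; the data is made up for illustration:

# Support and confidence of a rule X => i over a small basket database.
# Transactions are modeled as Python sets of item names (illustrative data only).
transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread"},
]

def support(itemset, db):
    # Fraction of transactions that contain every item of `itemset`.
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(body, head_item, db):
    # support(body + head) / support(body)
    return support(body | {head_item}, db) / support(body, db)

print(support({"milk", "cereal"}, transactions))               # 0.5
print(confidence({"milk", "cereal"}, "bread", transactions))   # 0.5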
Baskets
• Minsup, minconf – user-specified thresholds
• Frequent sets – sets of items X such that their support sup(X) > minsup
• If X is frequent, all its subsets are frequent too (downward closure)
Examples
• 20% of transactions which bought cereal and
milk also bought bread (support 2%)
• Worst case – an exponential number of such rules (in the size of the set of items).
• What set of transactions leads to an exponential blow-up of the rule set? For example, a database in which every transaction contains the same large itemset makes all of its subsets frequent.
• Fortunately, worst cases are unlikely and not typical. Support provides excellent pruning ability.
General Strategy
• Generate frequent sets
• Get association rules X => I with their support and confidence as s = support(X ∪ {I}) and c = support(X ∪ {I}) / support(X)
• Key property: downward closure of the frequent sets – we don’t have to consider supersets of X if X is not frequent
General strategies
• Make repetitive passes through the database
of transactions
• In each pass count support of CANDIDATE
frequent sets
• In the next pass, continue with the frequent sets obtained so far by “expanding” them. Do not expand sets which were determined NOT to be frequent (a sketch of this level-wise strategy follows below)
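A hedged Python sketch of this level-wise strategy follows. It is a simplified illustration, not the exact AIS or Apriori candidate generation from the cited papers; itemsets are frozensets, and a candidate of size k is only counted if all of its (k-1)-subsets survived the previous pass:

from itertools import combinations

def frequent_itemsets(transactions, minsup):
    # Level-wise search: candidates of size k are built only from frequent sets
    # of size k-1 (downward closure), and their support is counted level by level.
    n = len(transactions)
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update({c: counts[c] / n for c in survivors})
        k = len(next(iter(level))) + 1
        # Expand only surviving sets; drop candidates having an infrequent (k-1)-subset.
        level = list({a | b for a in survivors for b in survivors
                      if len(a | b) == k and
                      all(frozenset(s) in survivors for s in combinations(a | b, k - 1))})
    return frequent

# Example: frequent sets at minsup = 0.5 over four baskets.
baskets = [{"milk", "bread"}, {"milk", "cereal"}, {"milk", "bread", "cereal"}, {"bread"}]
print(frequent_itemsets(baskets, minsup=0.5))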
AIS Algorithm
(R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD’93)
AIS – generating association rules
(R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD’93)
AIS – estimation part
(R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD’93)
Apriori
(R. Agrawal, R Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)
Apriori algorithm
(R. Agrawal, R Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)
Pruning in apriori through self-join
(R. Agrawal, R Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)
Performance improvement due to Apriori
pruning
(R. Agrawal, R Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)
Other pruning techniques
• Key question: At any point of time how to
determine which extensions of a given candidate
set are “worth” counting
• Apriori – only these for which all subsets are
frequent
• Only these for which the estimated upper bound
of the count is above minsup
• Take a risk – count a large superset of the given
candidate set. If it is frequent than all its subsets
are also – large saving. If not, at least we have
pruned all its supersets.
Jump ahead schemes: Bayardo’s Max-Miner
(R. Bayardo, “Efficiently Mining Long Patterns from Databases”, SIGMOD’98)
Jump ahead scheme
• h(g) and t(g): the head and tail of an item group g. The tail is the maximal set of items with which g can possibly be extended (sketched below)
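A rough Python sketch of the jump-ahead idea (a simplification of mine, not Bayardo’s full Max-miner): for a candidate group g, count the support of h(g) ∪ t(g) first; if it is frequent, every subset is frequent by downward closure and the long pattern is settled in one counting step, otherwise the tail is trimmed to the items that still extend the head into a frequent set:

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def expand_group(head, tail, transactions, minsup_count):
    # Jump ahead: if h(g) U t(g) is frequent, report it at once as a long pattern.
    if support_count(head | tail, transactions) >= minsup_count:
        return head | tail, set()
    # Otherwise keep only the tail items that still give a frequent extension of the head.
    viable_tail = {i for i in tail
                   if support_count(head | {i}, transactions) >= minsup_count}
    return None, viable_tail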
Max-miner
(R. Bayardo, “Efficiently Mining Long Patterns from Databases”, SIGMOD’98)
Max-miner vs Apriori vs Apriori LB
• Max-miner is over two orders of magnitude
faster than apriori in identifying maximal
frequent patterns on data sets with long
max patterns
• Considers fewer candidate sets
• Indexes only on head items
• Dynamic item reordering
Quantitative Rules
• Rules which involve continuous/quantitative attributes
• Standard approach: discretize into intervals (sketched below)
• Problem: the choice of intervals is arbitrary, so we may miss rules
• MinSup problem: if the number of intervals is large, their support will be low
• MinConf problem: if intervals are large, rules may not meet minimum confidence
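A small Python sketch (illustrative only) of interval discretization for a quantitative attribute, showing the trade-off named above: a few wide bins keep support high but blur rules, while many narrow bins sharpen rules but drop each bin’s support:

def equal_width_bins(values, k):
    # Map each value of a continuous attribute to one of k equal-width intervals.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def interval(v):
        idx = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bin
        return (lo + idx * width, lo + (idx + 1) * width)
    return [interval(v) for v in values]

ages = [23, 25, 31, 38, 44, 52, 61]
print(equal_width_bins(ages, 2))   # coarse: high support per interval, vaguer rules
print(equal_width_bins(ages, 6))   # fine: sharper rules, low support per interval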
Correlation Rules [BMS 1997]
• Suppose the conditional probability that a customer buys coffee given that he buys tea is 80%; is this an important/interesting rule?
• It depends… if the a priori probability of a customer buying coffee is 90%, then it is not
• Need 2x2 contingency tables rather than just pure association rules. Chi-square test for correlation rather than just the support/confidence framework, which can be misleading
Correlation Rules
• Events A and B are independent if p(AB) = p(A) x p(B)
• If any of AB, A(not B), (not A)B, (not A)(not B) are dependent, then A and B are correlated; likewise for three items, if any of the eight combinations of A, B and C are dependent then A, B, C are correlated (see the chi-square sketch below)
• I = {i1, …, in} is a correlation rule iff the occurrences of i1, …, in are correlated
• Correlation is upward closed; if S is correlated, so is any superset of S
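A minimal Python sketch of the 2x2 chi-squared test for a pair of items, computed directly from its definition (sum over cells of (observed - expected)^2 / expected, with expected counts assuming independence); the basket counts below are made up to match the coffee/tea example:

def chi_square_2x2(n_ab, n_a_notb, n_nota_b, n_nota_notb):
    # Chi-squared statistic for the 2x2 contingency table of items A and B.
    table = [[n_ab, n_a_notb], [n_nota_b, n_nota_notb]]
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / n   # expected count under independence
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# 1000 baskets: 25% buy tea, 90% buy coffee, P(coffee | tea) = 80%.
print(chi_square_2x2(n_ab=200, n_a_notb=50, n_nota_b=700, n_nota_notb=50))  # ~37
# Compare against the chi-squared cutoff for the chosen significance level
# (3.84 for one degree of freedom at the 95% level).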
Downward vs upward closure
• Downward closure (frequent sets) is a
pruning property
• Upward closure – look for minimal correlated itemsets, such that no subsets of them are correlated. Finding a correlation is then a pruning step – prune all supersets of a correlated itemset because they cannot be minimal.
• Border of correlation
Pruning based on support and correlation
• Correlation can be an additional pruning criterion next to support
• Unlike confidence in the support/confidence framework, which is not upward closed
Chi-square
(S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)
Correlation Rules
(S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)
Algorithms for Correlation Rules
• The border can be large – exponential in the size of the item set – so we need better pruning functions
• A support function needs to be defined, but one that also covers negative dependencies
• A set of items S has support s at the p% level if at least p% of the cells in the contingency table for S have value s
• Problem: if p < 50%, all items have support at the level one
• For p > 25%, at least two cells in the contingency table will have support s
Pruning…
• Antisupport (for rare events)
• Prune itemsets with very high chi-square to
eliminate obvious correlations
• Combine chi-squared correlation rules with
pruning via support
• Itemset is significant iff it is supported and
minimally correlated
Algorithm: χ²-support
INPUT: A chi-squared significance level α, support s, support fraction p > 0.25. Basket data B.
OUTPUT: A set of minimal correlated itemsets, from B.
1. For each item i, do count O(i). We can use these values to calculate any necessary expected value.
2. Initialize CAND ← {}, SIG ← {}, NOTSIG ← {}.
3. For each pair of items i, j such that O(i) > s and O(j) > s, do add {i, j} to CAND.
4. Set NOTSIG ← {}.
5. If CAND is empty, then return SIG and terminate.
6. For each itemset in CAND, do construct the contingency table for the itemset. If less than p percent of the cells have count s, then goto Step 8.
7. If the χ² value for the contingency table is at least the cutoff for significance level α, then add the itemset to SIG, else add the itemset to NOTSIG.
8. Continue with the next itemset in CAND. If there are no more itemsets in CAND, then set CAND to be the set of all sets S such that every subset of S of size |S| - 1 is not significant. Goto Step 4.
(S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)
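A hedged Python sketch of the per-itemset test in Steps 6-7 above; the construction of the contingency table and its expected counts is assumed given, p is taken as a fraction (e.g. 0.25), and the cutoff is whatever value corresponds to the chosen significance level:

def significant(cell_counts, expected_counts, s, p, chi2_cutoff):
    # Step 6: support condition - at least a fraction p of the cells must have count >= s.
    if sum(1 for c in cell_counts if c >= s) < p * len(cell_counts):
        return False
    # Step 7: chi-squared condition against the cutoff for the significance level.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(cell_counts, expected_counts))
    return chi2 >= chi2_cutoff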
Sampling Large Databases for Association Rules [T 1996]
• Pick a random sample
• Find all association rules which hold in that
sample
• Verify the results with the rest of the database
• Missing rules can be found in the second pass
Key idea – more detail
• Find a collection of frequent sets in the sample using a lowered support threshold. This collection is likely to be a superset of the frequent sets in the entire database
• Concept of the negative border: the minimal sets which are not in a set collection S (a sketch follows below)
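A small Python sketch of the negative border (my own illustration, assuming the collection S is downward closed, as a collection of frequent sets is): the minimal itemsets not in S are exactly those whose (k-1)-subsets are all in S:

from itertools import combinations

def negative_border(collection, items):
    # Minimal itemsets over `items` that are not in `collection`:
    # the set itself is missing, but every subset of size k-1 is present.
    S = {frozenset(x) for x in collection} | {frozenset()}   # the empty set is trivially frequent
    border = set()
    max_k = max((len(x) for x in S), default=0) + 1
    for k in range(1, max_k + 1):
        for cand in map(frozenset, combinations(sorted(items), k)):
            if cand not in S and all(frozenset(sub) in S for sub in combinations(cand, k - 1)):
                border.add(cand)
    return border

# Frequent sets found in the sample over items {a, b, c, d}:
sample_frequent = [{"a"}, {"b"}, {"c"}, {"a", "b"}]
print(negative_border(sample_frequent, items={"a", "b", "c", "d"}))
# -> {d}, {a, c}, {b, c}: these must also be counted in the full-database pass.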
Algorithm
(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)
Second pass
• Negative border consists of the “closest”
itemsets which can be frequent too
• These have to be tried (measured)
(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)
Probability that a sample s has exactly c rows that contain X
(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)
Bounding error
(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)
Approximate mining
(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)
Summary
• Discover all frequent sets in one pass in a fraction 1 − δ of the cases, where δ is given by the user; missing sets may be found in a second pass
Rules and what’s next?
• Querying rules
• Embedding rules in applications (API)
MSQL
(T. Imielinski, A. Virmani, “MSQL: A Query Language for Database Mining”, Data Mining and Knowledge Discovery 3, 99)
Applications with embedded rules
(what are rules good for)
• Typicality
• Characteristic of
• Changing patterns
• Best N
• What if
• Prediction
• Classification
OLAP
• Multidimensional queries
• Dimensions
• Measures
• Cubes
Data Cube
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)
Measure Properties
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)
Monotonicity
• Iceberg Queries
• COUNT, MAX, SUM etc. allow pruning
• AVG does not – the AVG of a cube extension can be larger or smaller than the AVG over the original cube (e.g., records {10, 20, 30} have AVG 20, while the subsets {10, 20} and {20, 30} have AVG 15 and 25): thus no pruning in the apriori sense
Examples of Monotonic
Conditions
• MAX, MIN
• TOP-k AVG
Cubegrades: combining OLAP
and association rules
• Consider the rule: milk, butter => bread [s: 100, c: 75%].
• Consider it as a gradient or derivative of a cube.
• Body: a 2d-cube in multidimensional space representing transactions where milk and butter are bought together.
• Consequent: represents the specialization of the “body” cube by bread. “Body + consequent” represents the subcube where milk, butter and bread are bought together.
• Support: COUNT of records in the body cube.
• Confidence: measures how COUNT is affected when we specialize the “body” cube by the “consequent”.
Cubegrades: Generalization of
Association Rules
• We can generalize this in two ways:
– Allow additional operators for cube transformation, including specializations, generalizations and mutations.
– Allow additional measures such as MIN, MAX, SUM, etc.
• Result: cubegrades – entities that describe how transforming a source cube X to a target cube Y affects a set of measure values.
Mathematical Similarity
• Similar to a function gradient, which measures how changes in the function argument affect the function value.
• A cubegrade measures how changes in a cube affect measure (function) values.
Using cubegrades: Examples
• Data description: Monthly summaries of item sales
per customer + customer demographics.
• Examples:
– How is the average amount of milk bought affected by different age categories among buyers of cereals?
– What factors cause the average amount of milk bought to increase by more than 25% among suburban buyers?
– How do buyers in rural cubes compare with buyers in suburban cubes in terms of the average amount spent on bread, milk and cereal?
Cubegrade lingo
• Consider the following cube:
areaType=‘urban’, Age=[25,35] (Avg(salesMilk)=25)
• Descriptor: an attribute-value pair.
• K-conjunct: a conjunction of k descriptors.
• Cube: the set of objects in a database that satisfy the k-conjunct.
• Dimensions: the attributes used in the descriptors.
• Measures: Attributes that are aggregated over
objects.
Cubegrade Definition
• Mathematically, a cubegrade is a 5-tuple <Source,
Target, Measures, Values, Delta-Value>:
– Source: The source or initial cube.
– Target: Target cube obtained by applying factor F
on source. Target= Source + Factor.
– Measures: set of measures evaluated.
– Values: function evaluating a measure in source.
– Delta-Value: function evaluating the ratio of
measure value in target cube versus measure value
in source cube.
Cubegrade Example
• Source cube: areaType=‘urban’
• Target cube: areaType=‘urban’, Age=[25,35]
• Measure: Avg(salesMilk); Value: Avg(salesMilk)=25; Delta Value: DeltaAvg(salesMilk)=125%
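A small Python sketch reproducing the example above on made-up records: evaluate Avg(salesMilk) on the source cube (areaType='urban'), specialize by Age=[25,35] to obtain the target cube, and report the delta value as the ratio of the two averages. Field names and data are illustrative assumptions:

def avg(values):
    return sum(values) / len(values)

def cube(records, **descriptors):
    # Select the records satisfying every attribute = value (or attribute = (lo, hi)) descriptor.
    def matches(r):
        for attr, want in descriptors.items():
            if isinstance(want, tuple):
                lo, hi = want
                if not (lo <= r[attr] <= hi):
                    return False
            elif r[attr] != want:
                return False
        return True
    return [r for r in records if matches(r)]

records = [                                        # made-up monthly summaries
    {"areaType": "urban", "age": 28, "salesMilk": 30},
    {"areaType": "urban", "age": 31, "salesMilk": 20},
    {"areaType": "urban", "age": 47, "salesMilk": 10},
    {"areaType": "rural", "age": 29, "salesMilk": 40},
]

source = cube(records, areaType="urban")                 # source cube
target = cube(records, areaType="urban", age=(25, 35))   # specialization by Age=[25,35]
src_avg = avg([r["salesMilk"] for r in source])          # 20
tgt_avg = avg([r["salesMilk"] for r in target])          # 25
print(f"Avg(salesMilk): source={src_avg}, target={tgt_avg}, delta={tgt_avg / src_avg:.0%}")  # 125%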
Types of cubegrades
• Source cube: A=a1, B=b1, C=c1
• Generalize on C: A=a1, B=b1
• Mutate C to c2: A=a1, B=b1, C=c2
• Specialize by D: A=a1, B=b1, C=c1, D=d1
Querying cubegrades
• CubeQL (for querying cubes) and CubegradeQL (for querying cubegrades).
• Features:
– SQL-like, declarative style.
– Conditions on the source cube and the target cube.
– Conditions on measure values and delta values.
– Join conditions between source and target.
How, which and what
(A. Abdulgani, Ph.D. Thesis, Rutgers University 2000)
The Challenge
• Pruning was what made association rules practical.
• Computation was bottom-up. If a cube doesn’t satisfy the support threshold, no subcube can satisfy it either.
• COUNT is no longer the sole constraint; there are new, additional constraints.
Assumptions
• Dealing with the SQL aggregate measures MIN,MAX,
SUM, AVG.
• Each constraint is of the form AGG(X)[>,<,=] c,
where c is a constant.
Monotonicity
• Consider a query Q, a database D and a cube X in D.
• Query Q is monotonic if the condition holds: Q(X) is FALSE in D ⇒ Q(X’) is FALSE in D, for every subcube X’ ⊆ X
View Monotonicity
• Alternatively, define a cube’s view as the projection of the measure and dimension values holding on the cube.
• A view is not tied to a particular cube or database.
• Q is monotonic for view V if the condition holds: for any cube X in any D such that V is a view for X, Q(X) is FALSE ⇒ Q(X’) is FALSE, for every X’ ⊆ X
GBP Sketch
• Grid construction for the input query:
– Axes defined on the dimension/measure attributes used in the query.
– Axis intervals based on the constants used in the query.
– The Cartesian product of intervals defines individual cells.
– Query evaluation for each cell.
[Figure: example grid with axes MAX(X) (intervals bounded by 0, 50, 150) and AVG(X) (intervals bounded by 0, 25, 50); each cell is marked TRUE or FALSE by evaluating the query]
Checking for satisfiability
.•
Cell C defined by
• Reduce to the system:
– mL  MIN(A)  mH
– (N-1)mL+ML  S (N1)MH+mH
– ML MAX(A)  MH
– SL  S  SH
– AL  AVG(A)  AH
– ALN  S  AHN
– SL SUM(A)  SH
– CL  N  CH
– CL  COUNT()  CH
Solve for N and check the interval returned for N.
 For measures on multiple attributes solve independently for
distinct attributes. Check for a common shared interval for N.

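A simplified Python sketch of this satisfiability check (an illustration, not the exact GBP procedure): for each integer count N in [CL, CH], intersect the bounds that the system above places on S = SUM(A) and keep the N values for which the intersection is non-empty:

def satisfiable_counts(mL, mH, ML, MH, AL, AH, SL, SH, CL, CH):
    # For which integer counts N in [CL, CH] can a set of N reals exist with
    # MIN in [mL, mH], MAX in [ML, MH], AVG in [AL, AH] and SUM in [SL, SH]?
    feasible = []
    for n in range(max(int(CL), 1), int(CH) + 1):
        s_lo = max(SL, AL * n, (n - 1) * mL + ML)   # lower bounds on S
        s_hi = min(SH, AH * n, (n - 1) * MH + mH)   # upper bounds on S
        if s_lo <= s_hi and mL <= MH:               # non-empty S interval, compatible MIN/MAX ranges
            feasible.append(n)
    return feasible

# Illustrative cell: COUNT in [1, 19], MIN in [0, 10], MAX in [0, 50],
# AVG in [46.5, 50], SUM in [0, 1000].
print(satisfiable_counts(mL=0, mH=10, ML=0, MH=50, AL=46.5, AH=50,
                         SL=0, SH=1000, CL=1, CH=19))   # counts 12 through 19 are feasible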
View Reachability
• Question: Is there a cube X with view V such that X has a subcube which falls in a TRUE cell?
• Is a TRUE cell C reachable from V?
[Figure: the same MAX(X)/AVG(X) grid, with the view V lying in a FALSE cell and TRUE cells elsewhere]
Defining View Reachability
• A view V is defined by:
– MIN(A) = m
– MAX(A) = M
– AVG(A) = a
– SUM(A) = s
– COUNT(A) = c
• A cell C is defined by:
– mL ≤ MIN(A) ≤ mH
– ML ≤ MAX(A) ≤ MH
– AL ≤ AVG(A) ≤ AH
– SL ≤ SUM(A) ≤ SH
– CL ≤ COUNT() ≤ CH
• Cell C is reachable from view V if there is a set X = {X1, X2, …, XN, …, XC} of real elements which satisfies the view constraints and a subset X’ = {X1, X2, …, XN} which satisfies the cell constraints.
Checking for View Reachability
• View reachability for measures of a single attribute can be reduced to at most 4 systems with a constant number of linear constraints on N.
• For measures on multiple distinct attributes, obtain a set of intervals on N for every attribute separately. C is reachable from V if there is a shared interval on N containing an integral point.
Example
• Consider view of 19 records X={X1, …, X19} with:
– MIN(X)=0, MAX(X)=75, SUM(X)=1000.
• Let C be defined by
– [CL, CH]=[1, 19], [mL, mH]=[0,10], [ML, MH]=[0,50], [AL,
AH]=[46.5, 50].
• C is reachable from V either with N=12 or with N=15.
Complexity Analysis
• Let Q be a query in disjunctive normal form
consisting of m conjuncts in J dimensions and K
distinct measure attributes.
• The monotonicity of Q for a given view can be tested
in O(m(J+KlogK)) time.
Computing cubegrades
Algorithm Cubegrade Gen Basic:
• Evaluate Q[source];
• For each S in Q[source]
– Evaluate Q[S];
– For each T in Q[S]
• Form the cubegrade <S, T, Measure, Values, Delta
Values> where Delta Values have to be calculated as
ratios of the Measure evaluated on the target and on
the source cubes respectively.
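A hedged Python sketch of Cubegrade Gen Basic: the three callables stand in for evaluating Q[source], Q[S] and the measures (their implementations are assumed, not given here), and delta values are computed as target/source ratios as described above:

def generate_cubegrades(eval_source_query, eval_target_query, eval_measures):
    # eval_source_query() -> iterable of source cubes satisfying Q[source]
    # eval_target_query(S) -> iterable of target cubes satisfying Q[S]
    # eval_measures(cube) -> dict mapping measure name to its value on that cube
    cubegrades = []
    for source in eval_source_query():                 # Evaluate Q[source]
        source_values = eval_measures(source)
        for target in eval_target_query(source):       # Evaluate Q[S]
            target_values = eval_measures(target)
            for m, value in source_values.items():
                delta = target_values[m] / value       # delta value: target / source ratio
                cubegrades.append((source, target, m, value, delta))
    return cubegrades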
Cube and Cubegrade query classes
• Cube Query classification:
– Queries with strong monotonicity.
– Queries with weak monotonicity.
– Hopeless queries.
• Cubegrade query classification, based on source cube
query classification and target cube classification:
– Focused.
– Weakly focused.
– Hopeless.
Cubegrade Application
Development
• Cubegrades are not end products; rather, they are an investment to drive a set of applications.
• Definition of an application framework for
cubegrades. Features include:
– Extension of Dmajor datamining platform.
– Generation, storage and retrieval of cubegrades.
– Accessing internal components of cubegrades for
browsing, comparisons and modifications.
– Traversals through a set of cubegrades.
– Primitives for correlating cubegrades with underlying
data and vice versa.
Application Example: Effective
Factors
• Find factors which are effective in changing a measure
value m for a collection of cubes by a significant ratio.
• Factor F is effective for C iff for all G = <C’, C’+F, m, V, Delta> where C’ ∈ C it holds that Delta(m) > (1 + x) or Delta(m) < (1 − x).
Cubegrades and OLAP
• Scope – Traditional OLAP: a static multidimensional object [GBLP96]. Cubegrades: a dynamic multidimensional object.
• Query Type – Traditional OLAP: query cubes [CT98, GL98], mostly structural querying. Cubegrades: query cubegrades, both structural and value querying.
• Query Evaluation – Traditional OLAP: static top-down precomputation [AAD96, RS97]. Cubegrades: dynamic bottom-up computations with a novel pruning method.
Future work
• Extending GBP to cover additional constraint types.
• Monotonicity threshold of a query.
• Domain Specific Application: Gene Expression Mining.
Summary
• The cubegrade concept as a generalization of association rules and cubes.
• The concept of querying cubes and cubegrades.
• Description of the GBP method for efficient pruning of queries with constraints of the type Agg(a) {<, >, =} c, where Agg() can be MIN(), MAX(), SUM(), AVG().
• Experimental demonstration, through a cubegrade engine prototype, of the viability of GBP and of the cubegrade generation process.
• Classification of a hierarchy of query classes based on theoretical pruning characteristics.
• Presentation of a framework for developing cubegrade applications.
Conclusions
• OLAP and association rules – really one approach
• Key problem – the set of rules or cubegrades can be orders of magnitude larger than the source data set
• Hence, the key issue is how we present and use the obtained rules in applications which provide real value for the user
• Discovery as querying